Development of a Novel Tool for the Retrieval and Analysis of Hormone Receptor Expression Characteristics in Metastatic Breast Cancer via Data Mining on Pathology Reports

Information about the expression status of hormone receptors such as estrogen receptor (ER), progesterone receptor (PR), and Her-2 is crucial in the management and prognosis of breast cancer. Therefore, the retrieval and analysis of hormone receptor expression characteristics in metastatic breast cancer may be valuable in breast cancer research. Herein, we report a text mining tool based on word/phrase matching that retrieves hormone receptor expression data of regional or distant metastatic breast cancer from pathology reports. It was tested on pathology reports issued at the China Medical University Hospital from 2013 to 2018. The tool showed specificities of 91.6% and 63.3% for the detection of regional lymph node metastasis and distant metastasis, respectively. Sensitivity in immunohistochemical study result extraction in these cases was 98.6% for regional lymph node metastasis and 78.3% for distant metastasis. Statistical analysis of the retrieved data showed significant differences in PR and Her-2 expression between regional and distant metastatic breast cancer, which is compatible with previous studies. In conclusion, our study shows that metastatic breast cancer hormone receptor expression characteristics can be retrieved by text mining. The algorithm designed in this study may be useful in future studies about text mining in pathology reports.


Introduction
Breast cancer is the second most lethal cancer worldwide, accounting for 626,679 deaths in 2018 [1]. These fatalities are primarily due to its potential to metastasize, with 28.8% of patients experiencing axillary lymph node metastases [2] and 20-30% of patients experiencing subsequent distant metastasis even if the cancer is found at an early stage [3]. Therefore, the study of the behavior of metastatic breast cancer is of particular importance in breast cancer treatment and public health. During the previous two decades of medical advancement, numerous novel molecular targets, such as LIFR [4], PI3K [5], and aldehyde dehydrogenase-1 [6], have been studied for prognosis prediction and targeted therapy for metastatic breast cancer, but none of them has been proven to be more valuable than the long-standing markers estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (erbb-2 or Her-2).
According to recent studies, molecular subtypes luminal A, luminal B, Her-2, and triple-negative, which are determined by these markers, are still relevant to the treatment and prognosis of metastatic breast cancer [7][8][9][10].
As markers of special value, ER, PR, and Her-2 expression are routinely examined by immunohistochemical study [11][12][13] on all invasive breast cancer slides and are documented in pathology reports. Combined with the fact that lymph node or distant metastatic breast cancers are frequently sampled for pathologic examination [14], a pathology report database may be an important resource for the hormone receptor expression status of metastatic breast cancer. However, extraction of these data can be a tedious task. Unlike surgical pathology reports for primary breast cancer, in which pathologists are required to report in certain forms [15] or a synoptic report system [16][17][18], there are no required forms for reporting metastatic carcinoma in most institutions, and most of these reports remain in free text form. Retrieving these data requires text mining approaches to avoid tedious manual work. As we have discussed in a previous article [19], most general medical text mining utilities do not process immunohistochemical study results [20,21], while those that do process immunohistochemical data use advanced natural language processing (NLP) methods [22,23] and therefore will not be available in general hospital information systems (HIS).
This difficulty can be solved by using simpler methods such as word/phrase matching, concept-match scrubbing [24], and semantic grammar-based concept finding [25] with clinical knowledge. We have shown in a previous publication [19] that regular expression-based word/phrase matching can be used to mine hormone receptor data for primary and recurrent breast cancer. In this article, we show that the text mining algorithm described in the previous publication can also be applied to metastatic breast cancer.

Materials and Methods
2.1. Data Retrieval and Preprocessing. All pathology reports issued at the China Medical University Hospital (CMUH) from 2013 to 2018, an estimated 200,000 reports, were first exported into plain text form. The patient data within the text files were then automatically deidentified using the method described by Neamatullah et al. [26] to avoid privacy violations and ethical concerns. A Python script [27] was designed to extract the pathology diagnosis and description columns from the text files and build a client-side database using SQLite3 [28]. The data retrieval and preprocessing steps are shown in Figure 1.
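As an illustration of the database-building step, the following is a minimal sketch using Python's standard `sqlite3` module. The export format of the reports is not described here, so the record layout `(case_id, diagnosis, description)`, the table name `reports`, and the function name `build_database` are all assumptions made for this example and not the script's actual design.

```python
import sqlite3


def build_database(reports, path=":memory:"):
    """Store the diagnosis and description columns of each report in SQLite.

    `reports` is assumed (hypothetically) to be an iterable of
    (case_id, diagnosis, description) tuples taken from deidentified text.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports ("
        "case_id TEXT PRIMARY KEY, diagnosis TEXT, description TEXT)"
    )
    # Parameterized inserts keep free-text report content from breaking the SQL.
    conn.executemany("INSERT OR REPLACE INTO reports VALUES (?, ?, ?)", reports)
    conn.commit()
    return conn


# Invented sample record, for illustration only.
sample = [
    (
        "A001",
        "Lymph node, axillary, left, dissection, carcinoma, metastatic, breast origin",
        "Immunohistochemical study: ER (positive, 90%), PR (negative).",
    ),
]
conn = build_database(sample)
rows = conn.execute("SELECT case_id, diagnosis FROM reports").fetchall()
```

Keeping the two columns separate in the table lets later steps search the diagnosis column for metastatic cases and the description column for immunohistochemical results independently.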

2.2. Retrieval of Metastatic Breast Cancer Cases. The authors first manually reviewed 50 pathology reports documenting regional lymph node metastatic breast cancer and 50 pathology reports documenting distant metastatic breast cancer. From these reports, it was seen that most pathology reports documenting a metastatic carcinoma had either "carcinoma, metastatic" or "carcinoma, involved" in the diagnosis. Those of breast origin were described as "breast origin" or "breast primary". Regional lymph node metastatic tumors were described as "soft tissue, axillary" or "lymph node, axillary", while distant metastatic tumors were described in the pattern "(any organ name other than axillary tissue), (procedure), carcinoma, metastatic/involved, and breast origin".
Based on these results, we designed our metastatic breast cancer finding algorithm according to the following strategy:
(1) Each line from the diagnostic column is matched against the phrase "carcinoma, metastatic," "carcinoma, involved," or any phrase indicating metastatic carcinoma by a regular expression engine. If any of the lines matches one of the patterns, the report is passed to the next step for further processing.
(2) When one of the lines in the diagnosis indicates metastatic carcinoma, that line is checked for the presence of phrases that indicate breast origin, such as "breast primary" or "breast origin". Any report that shows a match for these phrases is passed to the next step for examination.
(3) For reports that show evidence of metastatic carcinoma of breast origin, the whole diagnostic column is checked for signs of primary breast cancer. If any of the lines from the diagnostic column contains a phrase that represents primary breast cancer, the report is excluded from further analysis.
(4) Metastatic sites are parsed and recorded by another regular expression engine.
In total, 490 reports documenting metastatic disease (359 regional metastases, 131 distant metastases) were retrieved in this step. The search protocol is shown in Figure 2.
2.3. Identification of Paragraphs Containing Immunohistochemical Study Results. A two-step regular expression matching engine for immunohistochemical study extraction, as described in our previous study on extracting immunohistochemical results of primary and recurrent breast cancer [19], was utilized. In the first step, the program attempted to match common forms in which pathologists express immunohistochemical study results. There is, however, a significant difference between the identification of immunohistochemical studies in primary/recurrent breast cancer and in metastatic breast cancer.
When reporting metastatic carcinoma, pathologists in our institution usually document immunohistochemical study results in the description rather than the diagnostic column; therefore, searching immunohistochemistry-containing paragraphs in the current study only involved parsing the description column (Figure 3) but not the diagnosis column (Figures 4 and 5). This approach can optimize the searching process without sacrificing sensitivity.
Paragraphs extracted from this step will then undergo the following steps for immunohistochemical study result extraction.
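The four-step case-finding strategy described earlier for the diagnostic column could be sketched as follows. This is an illustration under stated assumptions, not the deposited code: the regular expressions are built only from the phrases quoted in the text, the `PRIMARY_BREAST` pattern is a hypothetical stand-in for "any phrase that represents primary breast cancer", and the function name `classify_report` is invented for this example.

```python
import re

# All patterns are illustrative assumptions based on the quoted phrases.
METASTATIC = re.compile(r"carcinoma,\s*(metastatic|involved)", re.I)
BREAST_ORIGIN = re.compile(r"breast\s+(origin|primary)", re.I)
# Hypothetical marker of a primary breast tumor: a line whose site is "Breast".
PRIMARY_BREAST = re.compile(r"^\s*breast,.*carcinoma", re.I)
REGIONAL_SITE = re.compile(r"^\s*(lymph node|soft tissue),\s*axillary", re.I)


def classify_report(diagnosis):
    """Return (label, site) for a diagnosis column, or None if not retained."""
    lines = [ln for ln in diagnosis.splitlines() if ln.strip()]
    # Steps 1 and 2: one line must indicate both metastatic carcinoma
    # and breast origin.
    hits = [ln for ln in lines if METASTATIC.search(ln) and BREAST_ORIGIN.search(ln)]
    if not hits:
        return None
    # Step 3: exclude the report if any line suggests primary breast cancer.
    if any(PRIMARY_BREAST.search(ln) for ln in lines):
        return None
    # Step 4: record the site (here simply the first comma-separated field)
    # and label axillary lymph node or soft tissue sites as regional.
    line = hits[0]
    site = line.split(",")[0].strip()
    label = "regional" if REGIONAL_SITE.search(line) else "distant"
    return label, site
```

Because step 4 treats every non-axillary site as distant, a sketch like this would also label chest wall soft tissue or skin as distant metastasis unless a further filter is added.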

2.4. Extraction of Immunohistochemical Study Results. In institutes that are routinely accredited by the College of American Pathologists (CAP), such as ours, the reporting format of ER, PR, and Her-2 results is regulated by guidelines [29,30]. Therefore, in our method, the ER, PR, and Her-2 results are matched and extracted according to those guidelines. Because our laboratory adopted the new 2018 CAP recommendations in 2019, the ER, PR, and Her-2 results included in this study were issued under the 2013 recommendations.
For ER and PR, the positivity status must be reported, and if the result is positive, the expression percentage should also be reported. Therefore, three patterns are expected: "ER/PR (positive, %)", "ER/PR: positive, %", and "ER (positive)".
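A single regular expression can cover the three ER/PR reporting patterns listed above. The sketch below is an assumption-laden illustration: the pattern `ER_PR`, the function `extract_er_pr`, and the decision to keep only the first mention of each marker are choices made for this example and are not taken from the authors' program.

```python
import re

# Illustrative pattern (an assumption): marker name, then "(" or ":",
# then the status, then an optional ", NN%" expression percentage.
ER_PR = re.compile(
    r"\b(?P<marker>ER|PR)\s*[(:]\s*(?P<status>positive|negative)"
    r"(?:\s*,\s*(?P<percent>\d{1,3})\s*%)?",
    re.I,
)


def extract_er_pr(paragraph):
    """Return {'ER': (status, percent) or None, 'PR': (status, percent) or None}."""
    results = {"ER": None, "PR": None}
    for match in ER_PR.finditer(paragraph):
        marker = match.group("marker").upper()
        if results[marker] is None:  # keep the first mention of each marker
            percent = match.group("percent")
            results[marker] = (
                match.group("status").lower(),
                int(percent) if percent else None,
            )
    return results
```

For example, `extract_er_pr("ER (positive, 90%), PR: negative")` yields a positive ER result with a percentage of 90 and a negative PR result with no percentage; a marker that is never mentioned stays `None`.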

2.5. Recording of Results. The results are exported into a CSV file by the program, recording each case in the form: "case ID, metastatic site, ER result, PR result, and Her-2 result". If there is a failed extraction, the result is recorded as "None".
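The recording step can be sketched with Python's standard `csv` module. The column names follow the form quoted above; the function name `write_results`, the dictionary-based case representation, and the sample case are assumptions made for illustration only.

```python
import csv
import io

FIELDS = ["case ID", "metastatic site", "ER result", "PR result", "Her-2 result"]


def write_results(cases, handle):
    """Write one row per case; a failed extraction is recorded as 'None'."""
    writer = csv.writer(handle)
    writer.writerow(FIELDS)
    for case in cases:
        # A missing key or a None value both count as a failed extraction.
        writer.writerow(
            ["None" if case.get(field) is None else case[field] for field in FIELDS]
        )


# Invented sample case; the PR extraction is assumed to have failed.
buffer = io.StringIO()
write_results(
    [
        {
            "case ID": "A001",
            "metastatic site": "Lung",
            "ER result": "positive, 90%",
            "PR result": None,
            "Her-2 result": "2+",
        }
    ],
    buffer,
)
output = buffer.getvalue()
```

Using `csv.writer` rather than manual string joining means that results containing commas, such as "positive, 90%", are quoted automatically and do not shift the columns.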

Detection of Metastatic Breast Cancer Cases.
Our program labeled 131 pathology reports as describing distant metastatic breast cancer, of which 83 were correctly labeled, resulting in a specificity of 63.3%. There were 359 pathology reports labeled as describing regional lymph node metastatic breast cancer, of which 329 were correctly labeled, resulting in a specificity of 91.6%.
Sensitivity could not be determined, since there is no cancer registry data for metastatic carcinoma. The results are summarized in Table 1.
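The percentages above follow directly from the reported counts. The short sketch below reproduces the arithmetic, taking each figure as the share of program-labeled reports that manual review confirmed as correct (a ratio that is also commonly called positive predictive value). The function name `percent_correct` is an assumption for this example.

```python
def percent_correct(correct, labeled):
    """Share of program-labeled reports confirmed as correct, in percent."""
    return 100.0 * correct / labeled


# Counts taken from the text: 83 of 131 distant and 329 of 359 regional labels.
distant = percent_correct(83, 131)
regional = percent_correct(329, 359)
```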

Immunohistochemical Study Result Detection and Extraction.
In the 83 cases documenting distant metastatic disease, the program detected immunohistochemical study results in 65 cases, with an error in documentation of the immunohistochemical study result in 1 case, resulting in a sensitivity of 78.3% and a specificity of 98.4%. In 329 cases documenting regional lymph node metastatic disease, the program correctly detected immunohistochemical study results in 316 cases, resulting in a sensitivity of 98.1% and a specificity of 100%. The results are documented in Table 3.
Among the 322 regional lymph node metastatic cases with correctly detected immunohistochemical study results, 308 were tested for ER, 91 were tested for PR, and 303 were tested for Her-2. Of the cases tested for ER, 198 were positive, and 110 were negative. Of the cases tested for PR, 52 were positive, and 29 were negative. Of the cases tested for Her-2, 103 were positive (score 3+), 95 were equivocal (score 2+), and 112 were negative (score 1+ or 0). The results are shown in Table 5.

Comparison of Hormone Receptor Expression between Lymph Node Metastatic Breast Cancers.
After applying chi-squared tests to the above results, it was concluded that distant metastatic tumors had a significantly higher probability of being Her-2-positive and PR-negative than did regional metastatic tumors, while there was no significant difference in ER expression between regional and distant metastatic disease. For details, please see Tables 6-8. Our observation that distant metastatic tumors are more prone to be Her-2-positive and PR-negative may be consistent with previous studies showing that Her-2-positive and PR-negative tumors have a higher incidence of distant metastasis.
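For readers who wish to repeat this kind of comparison, a Pearson chi-squared test on a 2x2 table (for example, regional versus distant by receptor-positive versus receptor-negative) can be sketched in plain Python. This is an illustration under assumptions: it omits any continuity correction, which may differ from the procedure actually used, the function name `chi_squared_2x2` is invented, and the counts in the example call are made up for demonstration and are not taken from Tables 6-8.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected count of each cell under independence: row total * column total / n.
    expected = [
        row1 * col1 / n,
        row1 * col2 / n,
        row2 * col1 / n,
        row2 * col2 / n,
    ]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts, for illustration only (not study data).
statistic, p_value = chi_squared_2x2(10, 20, 20, 10)
```

With small expected cell counts, as may occur when comparing individual metastatic sites, an exact test is often preferred over the chi-squared approximation.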

Comparison of Hormone Receptor Expression between Major Metastatic Sites.
According to our data, compared with bone and brain metastatic diseases, lung metastatic disease has a tendency to be more ER-positive and Her-2-positive, which is consistent with previous studies [31,32]. However, there is no statistically significant difference in the chi-squared analysis, which is probably due to the small sample size. Details are shown in Tables 9-11.

Specificity Issue of Distant Metastatic Case Detection.
The most significant flaw in our approach to metastatic breast cancer mining is its low specificity in distant metastatic cases. Of the 47 reports that the program wrongly marked as metastatic carcinoma, most (35) documented soft tissue or skin of the chest wall involved by recurrent breast cancer, in which case the report should have been labeled as recurrent disease, not metastatic disease. Of the remaining wrongly marked cases, 11 of the 12 were due to a particular habit of some pathologists when reporting negative sentinel lymph nodes, in which the phrase "s/p breast cancer" is inserted into the diagnosis to specify that the patient has undergone previous surgery for breast cancer. The last case was an endometrial curettage report, in which the pathologist noted in the diagnosis that the patient was under tamoxifen treatment for breast cancer.
Chest wall recurrent cases misinterpreted as metastatic carcinoma occurred most often, but they may be the most easily handled. In our previous publication [19], we developed an algorithm that detects recurrent carcinoma at either the breast or chest wall. If combined with that algorithm, chest wall recurrent cases can be easily filtered out. The cases in which the pathologist mentioned breast cancer in otherwise nonmalignant reports are a more difficult issue, since interpretation of that phrase will require semantic understanding of the pathology report.
To solve this problem, rule-based approaches, such as the one described by Hur et al. [33] for mining biomedical literature and another described by Yang et al. [34] for mining hospital records, may be developed. However, since pathology reports are written quite liberally, it is questionable whether specific rules can be built to fit the theoretically infinite number of possible wordings in a pathology report. A more recent text-mining method is distributional semantic modeling [35]. In this method, corpora of text are first given, and the relationships between all words, including similarity and relatedness, are measured by vector-assisted analysis of co-occurrence in the corpus. This approach may be more feasible, since it would recognize the semantics of pathology reports. Subgraph mining that deconstructs the whole pathology report into higher order elements (subgraphs) [36] may be helpful as well. With recent advancements in text mining technology, new methods will emerge, and the problem encountered in our study may be overcome.

Further Research Directions.
This study confirmed the concern raised in our previous publication that nonstandardized pathology reports may pose a difficulty in text mining, but as discussed in the previous section, this difficulty can be addressed. By altering regular expression patterns, multiple forms of pathology report writing can be parsed and mined. Another issue mentioned in our previous publication, variation in the reporting of immunohistochemical study results, nevertheless remains unsolved. Since we only have reports from one institution, it is unknown whether our program works on pathology reports from elsewhere. Therefore, for researchers in text mining, exploring the various forms in which hormone receptors such as ER, PR, and Her-2 are expressed may be an interesting and realistic research target. As we have stated above, the detection of metastatic disease, because of its difficulty, is also a potential research project.

Conclusions
In conclusion, our program showed that in metastatic breast cancer, the ER, PR, and Her-2 immunohistochemical study data can be mined using simple word/phrase matching assisted by regular expression. The algorithm designed in this study may be useful in future studies about text mining in pathology reports.

Data Availability
The data generated by the script are available as supplementary data S1. Sample code was deposited in GitHub at https://github.com/medchem/breastmeta/.

Conflicts of Interest
The authors report no conflict of interest.