Association Patterns of Ontological Features Signify Electronic Health Records in Liver Cancer

The electronic health record (EHR) system enables clinical decision support. In this study, a set of 112 abdominal computed tomography imaging examination reports, consisting of 59 cases of hepatocellular carcinoma (HCC) or liver metastases (called the HCC group for simplicity) and 53 cases with no abnormality detected (NAD group), was collected from four hospitals in Hong Kong. We extracted terms related to liver cancer from the reports and mapped them to ontological features using Systematized Nomenclature of Medicine (SNOMED) Clinical Terms (CT). The primary predictor panel was formed by these ontological features. Association levels between every two features in the HCC and NAD groups were quantified using Pearson's correlation coefficient. The HCC group reveals a distinct association pattern that signifies liver cancer and provides clinical decision support for suspected cases, motivating the inclusion of new features to form the augmented predictor panel. Logistic regression analysis with a stepwise forward procedure was applied to the primary and augmented predictor sets, respectively. The model obtained with the new features attained 84.7% sensitivity and 88.4% overall accuracy in distinguishing HCC from NAD cases, both significantly higher than those of the model without the new features.


Introduction
The sheer amount of clinical data hosted by the electronic health record (EHR) system facilitates the exploration of disease signatures and strengthens the relevant clinical decision support functions [1,2].
As a real-time, digital, patient-centered record, the EHR contains a large amount of patient information together with laboratory and test results. It provides opportunities to enhance patient care, to embed performance measures in clinical practice, and to make information available instantly and securely to authorized users [3]. These voluminous, complex data offer abundant input for precision medicine and big data analytics, from which knowledge can be extracted to improve the quality of healthcare [4]. Integrated exploitation of multiple heterogeneous sources also supports multidisciplinary innovation in fields such as biomedical engineering. Extrapolating human lexical judgments about EHRs from computational models of semantics is one approach that can minimize human intervention and save considerable human effort.
The rapid development of the EHR provides a good opportunity to utilize the data for risk modeling and clinical decisions. Besides the well-structured demographics and laboratory information, clinical reports in the EHR give machine learning and data mining methods detailed clinical information with which to improve risk modeling and prediction. For example, machine learning approaches could be developed based on admission notes and progress notes to improve the prediction of major adverse cardiac events (MACE) in acute coronary syndrome (ACS) [5,6]. Extraction of key information from reports is a foundational step that enables these data mining applications.
As a simplifying representation in natural language processing and information retrieval, the bag-of-words model has long been applied to text clustering tasks, in which documents are represented by single terms treated independently [7]. Without a reference terminology, a bag of words is extracted from each document to form an array of unique features, whose weights are determined by the term frequencies and form the feature vector. However, the length of the feature vector grows with the number of documents in the dataset of interest, which limits the practicality of the bag-of-words model.
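The growth of the feature vector can be illustrated with a minimal Python sketch. The three report snippets and the whitespace tokenizer are illustrative assumptions, not data or methods from this study.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors): one term-frequency vector per document."""
    # Assumption: lowercase whitespace tokenization, with no reference terminology.
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors

# Hypothetical report snippets, used only to show how the vocabulary grows.
reports = [
    "liver mass with cirrhosis",
    "no focal liver lesion",
    "splenomegaly and ascites noted",
]

for n in range(1, len(reports) + 1):
    vocab, _ = bag_of_words(reports[:n])
    print(n, "documents ->", len(vocab), "features")
```

Each added document contributes its unseen terms to the vocabulary, so every feature vector lengthens as the dataset grows. A fixed reference ontology avoids this.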
Recently, some researchers have focused on the application of ontology for extracting conceptual features from documents. With a reference ontology, the feature vectors consist of common fixed elements that are defined before the feature extraction. Such an ontological feature vector model could improve the performance of text retrieval and classification [8,9]. In some studies, a feature vector model has been developed for converting the clinical texts and image patterns of an EHR into an array of numerical values [10-13].
The support of a medical ontology is required to map textual information, such as image findings in a diagnostic report, to a feature vector [12,13]. Systematized Nomenclature of Medicine (SNOMED) Clinical Terms (CT) is an ontological standard of clinical terms, which are organized as concepts and linked with "is-a" or inverse "is-a" relationships [14-17]. In such a hierarchical structure, concepts at a particular level can be chosen as the feature concepts.
Some studies have compared SNOMED-CT with other standards, such as the International Classification of Diseases (ICD) and MEDCIN [18,19]. In one comparison, clinical conditions that trigger laboratory test orders were extracted from laboratory guidelines and mapped to ICD10 and SNOMED-CT. ICD10 covered only 43.1% of the clinical conditions, whereas 80.1% of these conditions were mapped by SNOMED-CT. For representing traumatic brain injury (TBI) concepts, SNOMED-CT yielded a sensitivity of 90%, outperforming MEDCIN, whose sensitivity was 49%. Thus, SNOMED-CT was selected as the reference ontology in this study.
The semantic distance between a clinical term in an EHR and a feature concept can be quantified by counting the edges along the path connecting them in the "is-a" hierarchy [10-12, 20, 21]. Aggregating all the semantic distances to the feature concepts generates an ontological feature vector that characterizes an EHR with its disease context. One study evaluated and compared the information content and edge-counting approaches proposed in various published works against benchmarks [11]. It found that features built with edge counting outperformed most of the information content approaches. Therefore, edge counting was adopted for weighting the features in this study. We hypothesize that the feature association patterns derived from the EHRs can uniquely distinguish a disease group from the nondisease group. If such distinguishable association patterns exist, new features could be derived from the patterns and incorporated into the existing ontological feature vector to strengthen the ontological characterization of EHRs and thus the classification performance using a similarity algorithm, as illustrated in Figure 1.
The identified ontological patterns can be used to develop clinical decision support functions. For new cases, similar cases retrieved from the EHR database using the patterns provide clinicians with evidence of the feasible diagnostic and therapeutic options. The similarity search algorithm based on the ontological vector model has been successfully applied to the retrieval of similar radiological image reports and similar radiotherapy treatment plans [22-24].
In addition to the clinical evidence, the association between concepts in the patterns can be used to remind a clinician to check whether a concept should be included when its associated concept has already been mentioned in an EHR.

Data Collection.
We retrospectively collected 112 image reports of abdominal computed tomography examinations from the radiology departments of four local hospitals in Hong Kong. HCC or liver metastases were found in 59 cases (called the HCC group for simplicity), and the other 53 cases had no abnormality detected (NAD group). These 112 cases were randomly selected from the pool of image reports where HCC or liver metastases were reported in the diagnoses of HCC cases and not reported in the diagnoses of NAD cases. Before the data collection, third-party clinical personnel removed the patient name, identity card number, telephone number, and address from the reports and assigned a randomly generated unique ID to each case. We obtained Human Subject Ethics Approval from the Hong Kong Polytechnic University (HSEARS20140710002).

Ontological Feature Extraction.
The HCC-related clinical terms were extracted manually from the image reports according to SNOMED-CT curated in the Unified Medical Language System (UMLS; license code: NLM-0315126310). During the extraction process, the whole image reports were read and interpreted. Negation or uncertainty of a disease, disorder, or image finding was regarded as "not detected," and the corresponding term was not considered in the ontological feature mapping. Modifiers for clinical terms were not found in the image reports. In future studies on a bigger dataset, the extraction can be automated if the terms in image reports have already been tagged with SNOMED-CT or are extracted automatically by text-mining methods. UMLS organizes clinical terms in concepts, and SNOMED-CT defines the relationship between concepts using the "is-a" hierarchical tree. The extracted terms were projected to the feature concepts at a particular level to ensure consistent comparison between reports.
In our previous study, a set of EHRs was collected from 47 patients with type II diabetes in Hong Kong [21]. Levels 1-4 of the SNOMED-CT hierarchy were considered as the individual candidate sets of feature concepts. For each level, ontological feature vectors were generated by alignment with the SNOMED-CT hierarchy, and the similarity score between every possible pair of EHRs was calculated. With SNOMED-CT level 4, the accuracy was highest for ranking the agreement of carotid plaque identification in EHR pairs. Level 4 already has 6964 feature terms, providing sufficient granularity for characterizing EHRs, whereas level 5 is infeasible because of its much larger number of features. Because level 4 gave the best classification granularity, level 4 concepts were used as feature concepts in this work.
The edge-counting approach is illustrated in Figure 2. For each report, the ontological features, $a_1, a_2, \ldots, a_m$, were generated using the edge-counting approach based on the following formula:
where $p_i$ is the conditional probability of the $i$th feature concept given the occurrence of liver cancer and $s_{ij}$ represents the edge count between the $i$th feature concept and the $j$th clinical term extracted from a report. A smaller edge count means that the feature concept is conceptually closer to the clinical term. Therefore, the minimum of the edge counts should be taken to determine the degree of activation of a feature concept. PubMed document clustering has been successfully demonstrated using the edge-counting method [25].
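As a minimal sketch of the edge-counting step only, the following Python code walks a toy "is-a" map upward from a clinical term to a feature concept. The map reproduces the edge counts quoted in the Figure 2 caption. The intermediate nodes named `hypothetical_*` are placeholders rather than real SNOMED-CT concepts, and the function names are illustrative. The sketch stops at the minimum edge count and does not compute $a_i$.

```python
import math
from collections import deque

# Toy "is-a" map: child -> list of parents.
# Assumption: nodes named "hypothetical_*" are placeholders, not SNOMED-CT concepts.
IS_A = {
    "cirrhosis": ["hypothetical_a"],
    "hypothetical_a": ["hypothetical_b"],
    "hepatic fibrosis": ["hypothetical_b"],
    "hypothetical_b": ["liver finding"],
    "splenomegaly": ["hypothetical_c"],
    "hypothetical_c": ["abdominal organ finding"],
}

def edge_count(term, feature):
    """Fewest is-a edges from term up to feature; math.inf if feature is not an ancestor."""
    queue, seen = deque([(term, 0)]), {term}
    while queue:
        node, dist = queue.popleft()
        if node == feature:
            return dist
        for parent in IS_A.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    return math.inf

def min_edge_count(terms, feature):
    """Minimum edge count between a feature concept and all terms of one report."""
    return min((edge_count(t, feature) for t in terms), default=math.inf)

print(edge_count("cirrhosis", "liver finding"))
print(edge_count("splenomegaly", "liver finding"))
print(min_edge_count(["cirrhosis", "hepatic fibrosis"], "liver finding"))
```

A breadth-first search is used so that the shortest path is returned when a concept has several parents, as can happen in a polyhierarchy.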
With a value between 0 and 1, $a_i$ indicates the relevance between the $i$th feature concept and a clinical term in a report. Such relevance can be modulated by the conditional probability, $p_i$, which is estimated by the specific term-weighting approach [22]. Indeed, a similarity measure derived from the direction cosine represents the sum of the products of ontological features. Each product of corresponding features eliminates the square root, and the value $p_i$ becomes the weight associated with the product of the degrees of feature concept activation between two EHRs.
Figure 2: Edge counting based on level 4 concepts: "liver finding," "abdominal organ finding," and "fatty liver." (a) "Cirrhosis" at level 7 is a descendant of "liver finding"; the edge count is 3. (b) The edge count between "hepatic fibrosis" and "liver finding" is 2. (c) "Splenomegaly" is a descendant of "abdominal organ finding" but not of "liver finding." Thus, the edge count of "splenomegaly" with "abdominal organ finding" is 2 and that with "liver finding" is infinity. "Fatty liver" is a feature concept, and thus, its edge count with itself is 0. Diagram was extracted from [22].
The values of $a_i$ are not normally distributed in the HCC and NAD populations, which violates the assumption of statistical analysis using the Pearson correlation coefficient. Rank-based inverse normalization is a popular approach that converts the feature values to values that are normally distributed across individuals [26]. Features with zero values have no effect on the characterization of image reports or on the association patterns between features. Thus, zero-valued features were excluded from the inverse normalization process and remained unchanged. For each feature concept, the nonzero values of $a_i$ were ranked by $R_i \in [1, N - N_0]$ among the reports of the group, where $N$ and $N_0$ are, respectively, the total number of reports and the number of reports in which the feature is zero-valued. The activation value of the $i$th feature concept is given by
where $\Phi^{-1}$ represents the standard normal quantile function and $\xi$ denotes a constant, which is set to zero as suggested by van der Waerden [27]. The activation values of a feature concept across a group form the following vector: $u_i = (z_i^1, z_i^2, \ldots, z_i^N)$. Note that the nonzero $z_i^k$ follow the standard normal distribution, $N(0, 1)$, after inverse normalization.
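The normalization step can be sketched in Python as follows. The sketch assumes the commonly used form of the rank-based inverse normal transformation, $\Phi^{-1}\big((R_i - \xi)/(N - N_0 - 2\xi + 1)\big)$, which is consistent with the definitions above. It is an assumption rather than a formula taken from this study, as are the tie-breaking rule and the function name.

```python
from statistics import NormalDist

def inverse_normalize(values, xi=0.0):
    """Rank-based inverse normal transform of the nonzero entries; zeros stay 0.

    Assumed form: z = Phi^{-1}((R - xi) / (n - 2*xi + 1)), where n is the number
    of nonzero values and xi = 0 gives the van der Waerden scores.
    """
    nonzero = [(v, k) for k, v in enumerate(values) if v != 0]
    n = len(nonzero)
    out = [0.0] * len(values)
    # Assumption: ties are broken by report order (the second tuple element).
    for rank, (_, k) in enumerate(sorted(nonzero), start=1):
        out[k] = NormalDist().inv_cdf((rank - xi) / (n - 2 * xi + 1))
    return out

# Hypothetical feature values of one feature concept across four reports.
print(inverse_normalize([0.0, 0.2, 0.9, 0.5]))
```

With three nonzero values and $\xi = 0$, the ranks 1, 2, and 3 map to the standard normal quantiles at 0.25, 0.5, and 0.75, and the zero entry is left untouched.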

Ontological Association Patterns.
The association level between two feature concepts was denoted by $C_d(i, j)$ for the HCC group and $C_n(i, j)$ for the NAD group, as given by the following formulas: $C_d(i, j) = r(u_{di}, u_{dj})$ and $C_n(i, j) = r(u_{ni}, u_{nj})$, where $u_{di}$ and $u_{dj}$ represent the vectors weighting the $i$th and $j$th feature concepts across the HCC group; $u_{ni}$ and $u_{nj}$ represent the vectors weighting the $i$th and $j$th feature concepts across the NAD group; and $r(u_i, u_j)$ is the Pearson correlation coefficient between two arrays. The two sets of correlation coefficients, $C_d$ and $C_n$, in the HCC and NAD groups formed two cumulative distributions, $F_d$ and $F_n$, which were compared using the two-sample Kolmogorov-Smirnov (KS) test. To test for a significant difference, the maximum deviation between the two cumulative distributions, the $D$ value, was compared with its critical value, $D_\alpha$, which is derived by our previously developed method [28] and given by the following equations,
where $\alpha$ is the significance level, that is, 0.05, $\gamma_{0.05} = 3.1$, and $k = 30$ in this study. The critical value of $D$ is 0.2102, which has been verified by exhaustive computer simulations [28]. A correlation threshold, at which $F_d$ and $F_n$ deviate the most, can be identified and used to characterize the perturbed ontological association pattern.
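The pairwise association levels and the maximum deviation between the two cumulative distributions can be sketched in Python as follows. The function names and the handling of constant features are illustrative assumptions. The critical value $D_\alpha$ derived in [28] is not recomputed here.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long arrays."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant feature is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def association_levels(features):
    """features: one vector per feature concept across the reports of a group."""
    return [pearson(u, v) for u, v in combinations(features, 2)]

def ks_statistic(sample_a, sample_b):
    """Maximum gap D between two empirical CDFs and the value C where it occurs."""
    best_d, best_c = 0.0, None
    for c in sorted(set(sample_a) | set(sample_b)):
        fa = sum(v <= c for v in sample_a) / len(sample_a)
        fb = sum(v <= c for v in sample_b) / len(sample_b)
        if abs(fa - fb) > best_d:
            best_d, best_c = abs(fa - fb), c
    return best_d, best_c

# Hypothetical correlation sets for two groups.
d_value, threshold = ks_statistic([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
print(d_value, threshold)
```

For $k$ feature concepts, `association_levels` returns $k(k-1)/2$ values per group, and the returned `d_value` would then be compared with the critical value.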

New Features Derived from Association Patterns.
It is of interest to explore new features, based on the above-mentioned ontological association patterns, that signify the image reports of HCC cases. The first new feature, $z'^{k}_{1}$, is the square of the sum of the activation values characterizing the image report of the $k$th case in a group: $z'^{k}_{1} = \left(\sum_{i=1}^{m} z_i^k\right)^2$.
The expected value of this new feature can be estimated by its average over the group.
The expected value of this new feature forms the lower bound of the sum of association levels over all possible pairs of features in the group. The second new feature, $z'^{k}_{2}$, is the square of the sum of the absolute values of the activation values characterizing the image report of the $k$th case in a group: $z'^{k}_{2} = \left(\sum_{i=1}^{m} |z_i^k|\right)^2$.
The expected value of the second new feature, again, can be estimated by its average over the group.
The expected value of the second new feature defines the upper bound of the sum of association levels over all possible pairs of features in the group. When the KS test indicates that the ontological association patterns of two groups are significantly different, we expect that the sum of association levels of one group is distinguishable from that of the other group. Therefore, the new features could signify the difference between the two groups.
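The two new features follow directly from their verbal definitions, as in the minimal Python sketch below. The activation values are hypothetical and the function names are illustrative.

```python
def squared_sum(z):
    """First new feature: square of the sum of activation values of one report."""
    return sum(z) ** 2

def squared_abs_sum(z):
    """Second new feature: square of the sum of absolute activation values."""
    return sum(abs(v) for v in z) ** 2

def group_mean(feature, reports):
    """Estimate the expected value of a feature by its average over a group."""
    return sum(feature(z) for z in reports) / len(reports)

# Hypothetical activation values for two reports with three feature concepts each.
reports = [[0.5, -0.5, 1.0], [1.0, 1.0, 0.0]]

for z in reports:
    print(squared_sum(z), squared_abs_sum(z))
print(group_mean(squared_sum, reports), group_mean(squared_abs_sum, reports))
```

By the triangle inequality, the first feature never exceeds the second for the same report, which is consistent with their roles as lower and upper bounds.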

Logistic Regression.
The statistical analysis was performed with SPSS (IBM SPSS Statistics 22; Armonk, NY).
Binary logistic regression selects and estimates the optimal subset of independent variables for predicting a categorical outcome $Y$ coded by 1 or 0, which represent HCC and NAD in this work. A stepwise forward procedure was used to obtain the logistic regression model, in which the potential predictors were prioritized and entered into the model one by one until the predictive power was optimized. The procedure results in the following model with $M$ predictors: $\text{logit} = \ln\frac{P}{1 - P} = \beta_0 + \sum_{i=1}^{M} \beta_i X_i$, where logit is the estimated log odds of $Y = 1$, $P$ is the estimated probability of $Y = 1$, $X_i$ is the $i$th predictor entered into the model, and $\beta_i$ is the coefficient associated with the $i$th predictor for $i = 1, \ldots, M$. The statistical significance of the association between the outcome and each predictor is indicated by p < 0.05. For a well-balanced sample, we assume that 50% of the cases will be classified as $Y = 1$, and the cut-off of the logit is set at 0. A sample is imbalanced when the number of cases in one outcome category is about 2-5 times that in the other category. For an imbalanced sample, the constant $\beta_0$ is corrected by deducting the log odds of $Y = 1$ observed in the sample. The omnibus test of model coefficients indicates the overall performance of an identified model. Two sets of candidate predictors, the primary set and the augmented set, are considered for identifying the logistic regression models. The primary set consists of the activation values of the feature concepts: $z_1^k, z_2^k, \ldots, z_m^k$. The augmented set is composed of the activation values of the feature concepts and three new features derived from the association patterns.
Experimental Settings.
Figure 3 illustrates the flow chart of the experimental steps performed in the study.

Performance Evaluation.
Sensitivity, specificity, and overall accuracy were used to evaluate the performance of the two logistic regression models based on the primary and augmented predictor sets. To examine the agreement between the primary predictor model (PPM) and the augmented predictor model (APM), 2 × 2 contingency tables for HCC, NAD, and all cases were constructed. The McNemar test was used to compare the sensitivities, specificities, and overall accuracies of the two models. The difference in performance is considered significant if the P value is less than 0.05.
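A minimal Python sketch of the McNemar comparison is given below. It assumes the exact binomial form of the test, since the variant used is not stated, and the discordant counts in the example are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar P value from the two discordant counts.

    b: cases classified correctly by the first model only.
    c: cases classified correctly by the second model only.
    Assumption: exact binomial test with success probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts from a 2 x 2 contingency table.
print(mcnemar_exact(3, 0))
print(mcnemar_exact(0, 16))
```

Only the discordant cells of the 2 × 2 contingency table enter the test. Cases on which both models agree do not affect the P value.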

Extracted Features.
From the 59 and 53 image reports of the HCC and NAD groups, respectively, 38 clinical terms were extracted and mapped to 38 unique concepts in UMLS. Based on the approach illustrated in Figure 2, these terms were then projected to 30 feature concepts at level 4 of the SNOMED-CT "is-a" hierarchy (Table 1). After counting the edges and estimating the conditional probabilities of these concepts, their weightings were calculated, forming 30 × 59 and 30 × 53 matrices for the HCC and NAD groups, respectively.

Ontological Association Patterns.
The association level between every two feature concepts was calculated. With 30 feature concepts, this generated 30 × 29 / 2 = 435 association levels for each of the HCC and NAD groups. Figure 4 shows the cumulative distributions of association levels for the two groups and their difference. The maximum deviation, D = 0.333, was found at C = 0.03 and was greater than its critical value of 0.2102. Therefore, the two ontological association patterns are significantly different.

Primary Predictor Model.
The stepwise forward procedure stops at step 2, where the prediction accuracy is optimal, yielding the following regression:
where $z_{11}$ represents the activation value of "radiologic finding" and $z_{27}$ that of "abnormal radiologic density, nodular." The predictor $z_{11}$, "radiologic finding," is significantly associated with the log-odds of HCC (p = 0.016). The constant has been adjusted to compensate for the imbalance between NAD and HCC cases. The omnibus test shows that the variance of the log-odds explained by the model is significantly greater than the unexplained variance (χ² = 11.989, df = 2, p = 0.002). For logit ≤ 0, the case is more likely NAD than HCC. For logit > 0, the case is more likely HCC than NAD. The classifier based on this model is illustrated in Figure 5(a). The y-axis represents the linear combination of $z_{11}$ and $z_{27}$ in the above equation. The horizontal dotted line indicates the threshold level in the equation, 0.114, above which a lesion is classified as HCC and otherwise as NAD.
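The decision rule can be sketched in Python as follows. The coefficient values are hypothetical placeholders, since the fitted regression is not reproduced here, and the function names are illustrative. Only the group sizes, 59 HCC and 53 NAD cases, come from this study.

```python
import math

def corrected_intercept(beta0, n_pos, n_neg):
    """Deduct the sample log odds of Y = 1 from the fitted constant."""
    return beta0 - math.log(n_pos / n_neg)

def logit(intercept, coefs, x):
    """Linear predictor: intercept plus the weighted sum of predictor values."""
    return intercept + sum(b * v for b, v in zip(coefs, x))

def probability(logit_value):
    """Estimated probability of Y = 1 for a given logit."""
    return 1 / (1 + math.exp(-logit_value))

def classify(logit_value):
    """Cut-off at 0: logit > 0 is classified as HCC, otherwise as NAD."""
    return "HCC" if logit_value > 0 else "NAD"

# Hypothetical constant and coefficients for two predictors (placeholders only).
beta0, coefs = 0.5, (1.2, 0.8)
intercept = corrected_intercept(beta0, n_pos=59, n_neg=53)
value = logit(intercept, coefs, x=(0.6, -0.2))
print(round(value, 3), classify(value), round(probability(value), 3))
```

Deducting the sample log odds shifts every logit by the same amount, so the correction changes where the zero cut-off falls without changing the ranking of cases.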

For the augmented predictor set, the stepwise forward procedure proceeds to step 5. In the resulting regression, $z'_1$ and $z'_2$ are the squares of the sum of the ontological features and of the sum of their absolute values, which were incorporated into the model in the first two steps; $z_{11}$, $z_{13}$, and $z_{25}$ represent the activation values of "radiologic finding," "mass of body region," and "imaging result abnormal," respectively, which were included in steps 3-5. The augmented predictors, $z'_1$ and $z'_2$, and the primary predictors, $z_{11}$ ("radiologic finding") and $z_{25}$ ("imaging result abnormal"), are significantly associated with the log-odds of HCC (p = 0.014, 0.006, 0.003, and 0.04, respectively). The constant has also been adjusted to compensate for the imbalance between NAD and HCC cases. The omnibus test shows that the variance of the log-odds explained by the model is significantly greater than the unexplained variance (χ² = 70.619, df = 5, p < 0.001). For logit ≤ 0, the case is more likely NAD than HCC. For logit > 0, the case is more likely HCC than NAD. The classifier based on this model is illustrated in Figure 5(b). In step 5, the linear combination of the augmented predictors, $z'_1$ and $z'_2$, forms the y-axis and that of the primary predictors, $z_{11}$, $z_{13}$, and $z_{25}$, the x-axis. The classifier is represented by the dotted line.

Performance Comparison of Models.
Using the PPM, 98.1% of the NAD cases (52 out of 53) and 57.6% of the HCC cases (34 out of 59) are correctly classified. The overall accuracy is 76.8%. Using the APM, the proportion of correctly classified HCC cases increases significantly to 84.7% (p < 0.0001), that is, 50 out of 59 HCC cases. Although the proportion of correctly classified NAD cases is reduced slightly to 92.5% (p = 0.250), the APM raises the overall accuracy to 88.4%.
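The reported percentages can be checked from the case counts with a short Python sketch. The count of 49 correctly classified NAD cases under the APM is not stated explicitly. It is inferred here from the reported 92.5% of 53 cases.

```python
def performance(tp, fn, tn, fp):
    """Sensitivity, specificity, and overall accuracy from a confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# PPM: 34 of 59 HCC and 52 of 53 NAD cases correctly classified.
ppm = performance(tp=34, fn=25, tn=52, fp=1)
# APM: 50 of 59 HCC cases; 49 of 53 NAD cases is inferred from the reported 92.5%.
apm = performance(tp=50, fn=9, tn=49, fp=4)

for name, result in (("PPM", ppm), ("APM", apm)):
    print(name, {key: round(100 * value, 1) for key, value in result.items()})
```

Under this inference, the APM accuracy of (50 + 49) / 112 agrees with the 88.4% stated in the abstract.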

Discussion
This study illustrated an approach for characterizing textual image reports by numerical values that weight the alignment of report contents with the ontological standard. Such an approach was demonstrated in our previous study, where all the level 4 feature concepts of SNOMED-CT were considered to characterize the same set of image reports used in this study [22]. Using the specific term weighting, the highest overall accuracy, 74.3%, was attained for mapping report pairs based on the similarity measure of modified direction cosine. In this study, the features were further converted to standardized values, following $N(0, 1)$, by inverse normalization [26]. Such conversion could help reduce the noise or outliers introduced into the features by the edge-counting approach. The converted features were considered as primary predictors. A binary logistic regression model, identified using the primary predictors as the candidates in the stepwise forward procedure, was used to classify the reports. The overall accuracy increased to 76.8%.
It was shown that the interfeature association levels in the HCC and NAD groups exhibited significantly different distributions, with the feature concepts having particularly strong associations in HCC [29]. This observation led to the derivation of two new features, which are the squared sums of the existing features and of their absolute values. We proved that the expected values of these two new features, which are estimated by their averages, represent the lower and upper limits of the sum of association levels over the group. The new features were combined with the existing features to provide the augmented predictor set for the stepwise forward procedure. It was found that the overall accuracy was significantly increased to 88.4% (p < 0.0001). The sensitivity, an important diagnostic performance indicator, was also significantly increased from 57.6% to 84.7% (p < 0.0001).
Figure 5: Classifiers based on the two logistic regression models. (a) Primary predictor model (PPM): the identification procedure incorporated $z_{11}$, "radiologic finding," and $z_{27}$, "abnormal radiologic density, nodular," in the first two steps. The process stopped at step 2 because no new predictor could improve the classification. (b) Augmented predictor model (APM): two new features, $z'_1$ and $z'_2$, were included in the model in the first two steps of the procedure. The predictors, $z'_1$ and $z'_2$, represent the squared sums of the ontological features and of their absolute values, respectively. The identification proceeds to step 5, which extends the feature space into three predictor dimensions, $z_{11}$, $z_{13}$, and $z_{25}$, representing "radiologic finding," "mass of body region," and "imaging result abnormal."
Besides the first two new features, we identified the feature concepts "radiologic finding," "mass of body region," and "imaging result abnormal." For new suspected cases, this panel of predictors representing a disease signature can be used to assist the clinical decision when associations of those pairs are observed. In future work, the discovered signature should be validated with independent data before its clinical application. The detailed underlying meanings of the signature in patient management should be further explored using big data analytics.
An alternative application of the identified association patterns is the detection of inaccurate medical coding. When a disease is diagnosed, the "coactivated" feature concepts can be obtained and checked against the pairs in the disease-specific patterns. Potentially inaccurate coding can be detected, and the clinicians will be alerted. On a public health level, systematic failure in appropriate medical coding may result in under- or overadjustment to case-mix measurements when assessing quality of care [30]. In some healthcare models, this will also affect billing, reimbursement, and insurance claims [31].
Some observed image patterns mentioned in the image reports cannot be mapped to concepts in SNOMED-CT. For example, intravenous contrast injection induces changes of pixel optical density in different phases of a CT scan. Contrast enhancement in particular phases is critically important for HCC diagnosis. However, SNOMED-CT has not defined concepts that closely represent "contrast enhancement," "arterial enhancement," and "hyperdensity in arterial phase." This is a limitation of this study that hindered the precision of the proposed predictor model.

Conclusions
This study demonstrated the extraction of ontological features from image report contents based on the ontological standard. Combining new features, derived from the differential association patterns, with the ontological features forms a panel of augmented predictors that signifies the HCC image reports.

Conflicts of Interest
The authors declare that they have no conflicts of interest.