Background. Hepatocellular carcinoma (HCC) is a highly aggressive malignancy. Traditional Chinese Medicine (TCM), with the characteristics of syndrome differentiation, plays an important role in the comprehensive treatment of HCC. This study aims to develop a nonnegative matrix factorization- (NMF-) based feature selection approach (NMFBFS) to identify potential clinical symptoms for HCC patient stratification.
Methods. The NMFBFS approach consists of three major steps. Firstly, a statistics-based preliminary feature screening was designed to detect and remove irrelevant symptoms. Secondly, NMF was employed to infer redundant symptoms; based on the NMF-derived basis matrix, we defined a novel intersymptom similarity measurement. Finally, we converted each group of redundant symptoms into a new single feature so that the dimension was further reduced.
Results. Based on a clinical dataset of 407 HCC patient samples with 57 symptoms, the NMFBFS approach detected 8 irrelevant symptoms and then identified 16 redundant symptoms within 6 groups. Finally, an optimal feature subset with 39 clinical features was generated after compressing the redundant symptoms by group. Validation of classification performance shows that these 39 features markedly improve the prediction accuracy for HCC patients. Conclusions. Compared with other methods, NMFBFS has obvious advantages in identifying important clinical features of HCC.
1. Introduction
Hepatocellular carcinoma (HCC) is the third most common cause of cancer-related death worldwide and the leading cause of death in patients with cirrhosis [1, 2]. The onset of HCC is insidious and lacks specific symptoms [3, 4]. Its diagnosis depends on biopsy, imaging examinations such as Doppler ultrasound, computed tomography, and magnetic resonance imaging, and blood tests [5, 6]. By the time patients with HCC seek medical attention, the disease has often reached a late stage and the chance of resection has been lost. Hence, seeking simple methods to predict HCC and its clinical stage is meaningful and would help improve the diagnosis of HCC.
As one of the most popular complementary and alternative medicine modalities, Traditional Chinese Medicine (TCM) plays an active role in the treatment of malignant tumors, including HCC, in China and some other East Asian countries [7, 8]. Unlike modern medicine, TCM diagnosis and treatment depend on the analysis of symptoms and signs collected by inspection, auscultation and olfaction, inquiry, and pulse taking and palpation [8]. TCM regards a specific combination of symptoms and signs as a TCM syndrome, which is the main basis for treatment and can also be used to guide the clinical diagnosis of HCC. Our previous work proposed a hierarchical feature selection (PSOHFS) model to quickly identify potential HCC syndromes from a TCM clinical dataset [9], in which all the original symptoms were classified into several groups according to the categories of clinical observations, and each symptom group was then converted into a syndrome signature to reduce the search space of feature selection. However, this method ignored interactions among symptoms belonging to different categories (aspects). Therefore, the current challenge is to design an efficient feature selection approach for high-dimensional TCM data that takes clinical significance into account.
In this study, a nonnegative matrix factorization- (NMF- [10]) based feature selection (NMFBFS) method is proposed to select pivotal clinical symptoms for HCC diagnosis. A TCM clinical dataset was used in this work, consisting of 407 HCC patients with 57 clinical symptoms. Each patient sample is labeled with a clinical-staging symbol that indicates the severity of that patient's disease. Firstly, a preliminary statistical screening was designed to detect irrelevant symptoms in the full symptom set. Secondly, NMF was applied after eliminating the irrelevant symptoms. Based on the NMF-derived basis matrix, we defined a similarity measure to infer redundant symptoms by calculating the distance and correlation among symptoms. Finally, a secondary dimension reduction was implemented on the inferred groups of redundant symptoms: we converted each symptom group to a new feature (named a “mixed feature”) if its symptoms showed similar distribution patterns over the sample space. The experimental results show that the 39 features inferred by NMFBFS markedly improve the accuracy of diagnosis on HCC clinical samples. Moreover, the 39 optimal clinical features derived by NMFBFS include some well-known common symptoms of HCC patients. Compared with three representative feature selection methods (ReliefF [11], mRMR [12], and Elastic Net [13]), our proposed approach showed the best performance in identifying optimal clinical features for HCC patients.
2. Materials and Methods
2.1. Experimental Data
2.1.1. Description
In this work, the questionnaire survey dataset of HCC includes 407 samples collected within two years, and each patient was observed on 57 clinical symptoms (Table 1). Each patient sample is labeled with a clinical-stage symbol, which is related to the TCM pattern of syndrome and indicates the severity of HCC. According to the international staging system [14], the dataset covers three stages, each with two substages. The aim of our work is to identify symptom signatures related to the three clinical stages: phases I, II, and III. Within our dataset, all the original symptoms are described by two types of data: binary (0 or 1) or integer (0, 1, 2, 3, …). For example, the symptom “tinnitus” is binary (0 or 1), denoting two possible states: occurrence (positive) or nonoccurrence (nonpositive). Another example is “sleeplessness,” whose value can be 0, 1, 2, or 3; the larger the value, the stronger the positive state. A symptom is not positive if its value equals zero.
The description of the original clinical data of HCC patients. Phase totals: Phase I (82), Phase II (195), Phase III (130).

| Sex | Phase IA | Phase IB | Phase IIA | Phase IIB | Phase IIIA | Phase IIIB |
|---|---|---|---|---|---|---|
| Male | 33 | 27 | 50 | 115 | 95 | 10 |
| Female | 12 | 10 | 10 | 20 | 16 | 9 |
2.1.2. Data Preprocessing
Refinement of Feature Set. Our original dataset consists of 407 HCC patient samples (Table 1). The first step of preprocessing is to remove useless features, because they provide no information for the subsequent classification. A feature that is constant across all observed samples can be considered useless. In our dataset, some symptoms, such as “pale tongue” and “slow pulse,” were removed because no observed patient was positive on them. After removing such features, a refined clinical dataset with 407 samples and 57 symptoms (V1, …, V57) was obtained.
Simplification of Clinical Staging. The clinical staging of HCC patients in our original dataset was marked with the labels “IA,” “IB,” “IIA,” “IIB,” “IIIA,” and “IIIB.” To identify symptom signatures related to three clinical stages, all the samples were relabeled into three classes: class label “1” for samples labeled “IA” and “IB,” “2” for “IIA” and “IIB,” and “3” for “IIIA” and “IIIB.” Finally, all 407 clinical samples are distributed in three categories: 82 samples in phase I, 195 in phase II, and 130 in phase III. The details of the refined dataset are described in Table 1.
2.2. Feature Selection
Feature selection methods can be organized into three categories, depending on how they interact with model construction. Filter methods employ a criterion to evaluate each feature individually and are independent of the model [15]. Among them, feature ranking is a common approach, which ranks all the features by a certain measure and selects a subset of highly ranked features [16]. However, one drawback of ranking methods is that the selected subset might not be optimal, since it can contain redundant features. Wrapper methods search through combinations of features, guided by the predictive performance of a model [17]. Heuristic search is widely used in wrapper methods as the search strategy; it can produce good results and is computationally feasible but often yields local optima. In embedded methods, the feature search is embedded into the classification algorithm, so that the learning process and the feature selection process cannot be separated [18].
2.3. Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) aims to obtain a linear representation of multivariate data under nonnegativity constraints. These constraints lead to a parts-based representation because only additive, not subtractive, combinations of the original data are allowed [19]. In general, NMF can describe hundreds to thousands of features in a dataset in terms of a small number of metafeatures, particularly in gene expression profile analysis [20–22].
Let X be an n×p nonnegative matrix; that is, each element xij ≥ 0. Nonnegative matrix factorization (NMF) consists in finding an approximation
(1) X ≈ WH,
where the basis matrix W and the mixture coefficient matrix H are n×r and r×p nonnegative matrices, respectively, with r > 0 and r ≪ min(n, p). The objective behind the small value of r is to summarize and split the information contained in X into r factors (also called “bases” or “metafeatures”). The matrix H has the same number of samples as X but a much smaller number of features. Therefore, the metafeature expression patterns in H usually provide a robust clustering of samples [22].
The main approach to NMF is to estimate the matrices W and H as a local minimum:
(2) min over W, H ≥ 0 of [D(X, WH) + R(W, H)],
where D is a loss function that measures the quality of the approximation and is usually based on either the Frobenius distance or the Kullback-Leibler divergence [19]. R is an optional regularization function, defined to enforce desirable properties on W and H, such as smoothness or sparsity [23, 24].
In our study, the loss function in NMF is based on the Kullback-Leibler divergence [25]. The regularization function R was defined as
(3) R(W, H) = F1(W) + F2(H),
where F1(W) and F2(H) are regularization functions for W and H, respectively. Here, we applied Tikhonov smoothness regularization [26] for W:
(4) F1(W) = (1/2) Σi,j (Wij − c)²,
where c is a nonnegative constant. In addition, we applied sparsity-enforcing regularization [26] for H:
(5) F2(H) = (1/2) Σj (‖H·j‖₂² − α² ‖H·j‖₁²),
where H·j is the jth column of H, and ‖H·j‖₂ and ‖H·j‖₁ denote its l2-norm and l1-norm. The algorithm proposed by Lee is a well-established method to solve the NMF optimization [27].
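As a concrete sketch, a KL-divergence NMF of the kind described above can be run with scikit-learn, which supports this loss through its multiplicative-update solver. The regularization terms (3)–(5) are not reproduced here, and the matrix is random stand-in data with the same 49×120 shape as the dataset DR used later; all names are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((49, 120))  # stand-in for a symptoms-by-samples matrix

# The Kullback-Leibler loss requires the multiplicative-update solver
model = NMF(n_components=3, solver="mu", beta_loss="kullback-leibler",
            init="random", max_iter=500, random_state=0)
W = model.fit_transform(X)   # basis matrix, 49 x 3
H = model.components_        # mixture coefficient matrix, 3 x 120
```

Both factors stay elementwise nonnegative, so each row of X is approximated by an additive combination of the r = 3 metafeatures.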
2.4. NMF-Based Feature Selection
In this study, our proposed NMF-based feature selection (NMFBFS) approach can be seen as a two-stage filter method. In the first stage, a preliminary screening detects irrelevant symptoms and removes them from the whole feature set. In the second stage, NMF clusters redundant symptoms with potentially similar patterns into groups, and each group is then transformed into a new single feature to reduce the dimension. The process of NMFBFS is therefore independent of any classifier and can quickly infer the optimal feature subset even for high-dimensional datasets. The flowchart of NMFBFS is shown in Figure 1.
The flowchart of the proposed approach.
2.4.1. Removing the Irrelevant Symptoms
In our questionnaire, all the symptoms were defined by clinical doctors and covered many aspects of the patients. However, the relevance of each feature for distinguishing samples among the clinical stages had not been quantitatively studied. In machine learning, irrelevant features provide no useful information in any context and contribute little to patient stratification [28]. If the sample size is large, the irrelevant symptoms can be quickly detected by calculating positive frequencies. Here, we calculated the ratio (frequency) of presence (positive) of each symptom over the samples of every clinical stage. If the frequencies of a symptom are very low in all the clinical stages, the symptom hardly appears positive in most patients and is therefore considered irrelevant. After removing the irrelevant symptoms from the original dataset, the remaining symptoms are considered relevant features, potentially related to at least one class of patients (one clinical stage).
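This frequency-based screening can be sketched in a few lines of Python; the function name is illustrative, and the 10% cutoff matches the threshold reported later in the paper.

```python
import numpy as np

def irrelevant_symptoms(X, y, threshold=0.10):
    """Flag symptoms (columns of X) whose positive frequency stays
    below `threshold` in every clinical stage given by the labels y."""
    flags = []
    for j in range(X.shape[1]):
        freqs = [np.mean(X[y == c, j] > 0) for c in np.unique(y)]
        flags.append(max(freqs) < threshold)
    return np.array(flags)

# Toy data: the first symptom is never positive, the second often is
X = np.array([[0, 1], [0, 1], [0, 0], [0, 1], [0, 1], [0, 1]])
y = np.array([1, 1, 1, 2, 2, 2])
mask = irrelevant_symptoms(X, y)
```

A symptom is kept as soon as it exceeds the threshold in even one clinical stage, matching the rule that a relevant feature relates to at least one class.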
2.4.2. Identifying Redundant Symptoms Based on NMF
After the irrelevant symptoms had been removed, nonnegative matrix factorization was applied to the dataset X (n×p). For a given rank r, the matrix X can be decomposed into a basis matrix W and a coefficient matrix H. Usually, the value of r is much smaller than both the number of features (n) and the number of samples (p), so at least one dimension of W and of H is very small. The widespread applications of NMF in biclustering further indicate that the basis matrix W can be used for feature clustering and the coefficient matrix H for sample clustering [20, 21]. In our study, the number of samples is much larger than the number of metafeatures; hence, directly calculating distance or correlation between original features (symptoms) over all the samples can be biased, because some features may show locally similar patterns on only part of the samples. Fortunately, the basis matrix W represents the compressed sample space of matrix X, which facilitates uncovering differences between features. Here, we take two features (vi and vj) in the original dataset X as an example to clarify the basic idea of this step. By the definition of NMF,
(6) xi = wi × H, xj = wj × H,
where xi and xj are the ith and jth rows of X, and wi and wj are the ith and jth rows of W. It is easy to see that (1) if wi ≈ wj, then xi ≈ xj; and (2) if wi = k·wj, then xi = k·xj, where k is a constant. Therefore, if the row wi of W is very close to wj, the feature vi is likely to have a similar pattern to vj over all the samples.
Therefore, we defined a novel similarity measurement in formula (7) to approximately evaluate the redundancy between two original symptoms via matrix W:
(7) sim(vi, vj) ≈ sim(wi, wj) = [sim_dist(wi, wj) + sim_corr(wi, wj)] / 2,
where
(8) sim_dist(wi, wj) = 1 − (wi − wj)(wi − wj)T / MaxD,
(9) sim_corr(wi, wj) = (wi − w̄)(wj − w̄)T / [√((wi − w̄)(wi − w̄)T) · √((wj − w̄)(wj − w̄)T)].
Formula (8) is a distance-based similarity, which indicates how close two corresponding features are; formula (9) is a correlation-based similarity, which captures similar patterns of two original features. Hence, our similarity measurement considers distance and correlation between features at the same time. MaxD in formula (8) is the maximal distance value over all pairs (wi, wj). Based on this definition, we calculated the similarity matrix SMX over all the basis rows of W (SMX(i, j) = sim(vi, vj)), where element SMX(i, j) denotes the similarity between original features i and j. Given a threshold θ (0 < θ < 1), all the redundant features can be screened into groups with SMX(i, j) > θ.
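A minimal sketch of this similarity measurement follows. Two details are assumptions on my part: MaxD is taken as the largest pairwise squared distance (matching the (wi − wj)(wi − wj)T form of formula (8)), and the correlation term (9) is computed as the Pearson correlation of the rows of W.

```python
import numpy as np

def similarity_matrix(W):
    """Similarity between symptoms computed on rows of the basis
    matrix W, averaging the distance term (8) and the correlation
    term (9). MaxD is the largest pairwise squared distance."""
    n = W.shape[0]
    diff2 = np.array([[np.dot(W[i] - W[j], W[i] - W[j]) for j in range(n)]
                      for i in range(n)])
    sim_dist = 1.0 - diff2 / diff2.max()
    sim_corr = np.corrcoef(W)          # Pearson correlation between rows
    return (sim_dist + sim_corr) / 2.0

W = np.random.default_rng(1).random((6, 3))   # toy basis matrix
S = similarity_matrix(W)
```

By construction S is symmetric with ones on the diagonal, so thresholding its off-diagonal entries at θ yields the redundant groups.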
2.4.3. Transformation of Redundant Symptoms by Group
In the above section, all the redundant symptoms were screened out and organized into groups. For each symptom group, a new mixed feature was extracted as the representation of the whole group and replaced all the original features within the group. Therefore, the NMFBFS-inferred optimal feature subset includes two parts: nonredundant original features and newly generated mixed features (see Figure 1). Two strategies can be used to transform the redundant symptom groups into mixed features.
(1) Calculate the mean vector over all the redundant symptoms:
(10) xNF = mean(xr1, xr2, …, xrn),
where xr1, xr2, …, xrn are feature vectors of the original dataset X that were determined to be redundant symptoms in a group, and n denotes the number of redundant symptoms in the group. The vector xNF of the new single feature vNF is the average over that group.
(2) Randomly select the vector of one of the redundant symptoms:
(11) xNF ∈ {xr1, xr2, …, xrn}.
In our study, we transformed the groups of redundant symptoms into new mixed features using formula (10). After this step, the feature space of the clinical dataset was further reduced, so that the optimal feature subset rarely includes redundant features.
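The averaging strategy of formula (10) amounts to replacing each redundant group of columns by its mean vector while keeping all other columns. A small sketch (function name illustrative):

```python
import numpy as np

def compress_groups(X, groups):
    """Replace each group of redundant symptom columns by their mean
    vector (formula (10)); all other columns are kept unchanged.
    X: samples x symptoms; groups: list of column-index lists."""
    grouped = {i for g in groups for i in g}
    keep = [i for i in range(X.shape[1]) if i not in grouped]
    cols = [X[:, i] for i in keep] + [X[:, g].mean(axis=1) for g in groups]
    return np.column_stack(cols)

X = np.arange(12.0).reshape(3, 4)   # 3 samples, 4 symptoms
Xc = compress_groups(X, [[1, 2]])   # merge symptoms 1 and 2
```

Each group of size n thus contributes one mixed feature, reducing the dimension by n − 1 per group.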
3. Simulation Design
Firstly, we calculated the frequencies of each original symptom appearing positive at each clinical stage and then removed the irrelevant symptoms if their frequency values were very low.
Secondly, a representative sample set was screened out for the NMF analysis. In our dataset, the numbers of samples in the three phases of HCC vary considerably (82, 195, and 130). If the whole dataset were used, a class imbalance problem would arise [29–31]. In addition, the sex ratio of patients is also seriously unbalanced in the original dataset (Table 1). To avoid bias caused by sample imbalance, we selected 40 samples from each clinical phase with an equal proportion of males and females (20 : 20) to construct a representative clinical dataset DR (120 samples in total) for the following NMF analysis. Since each original sample has a class label corresponding to the clinical stage of that patient, the 407 original samples already form a partition into three clusters, which can be regarded as a trained KNN model [32]. We then defined the center of each cluster as the mean vector of all the samples in that cluster. Given a large value of K, we input each cluster center into the above KNN model; its output is consistent with the class label of the center. Based on the K nearest neighbors, we finally screened out 40 representative samples (20 males and 20 females) for each clinical stage according to Euclidean distance.
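In essence, this step keeps the samples closest to each class center. A simplified sketch of that idea follows; the 20 : 20 male/female balancing described in the text is omitted here, and the function name is illustrative.

```python
import numpy as np

def representatives(X, y, k):
    """Pick, for each class, the k samples closest (Euclidean) to the
    class mean vector -- a simplified version of the selection step
    (the sex-balancing constraint from the text is not modeled)."""
    picked = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        center = X[idx].mean(axis=0)
        dist = np.linalg.norm(X[idx] - center, axis=1)
        picked.extend(idx[np.argsort(dist)[:k]])
    return np.array(sorted(picked))

X = np.random.default_rng(0).random((10, 5))
y = np.array([0] * 5 + [1] * 5)
sel = representatives(X, y, k=2)   # 2 representatives per class
</n```

Selecting the same number k per class yields a balanced subset, which is the point of building DR.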
Finally, several redundant symptom groups were identified, and each was transformed into a new mixed feature. Combining all the nonredundant original features with the newly generated mixed features, we obtained an optimal clinical symptom subset of HCC. The classification performance of this feature subset was then validated with least squares support vector machines (LSSVM) [33, 34].
Experimental Parameters. First, we set a frequency threshold to identify the irrelevant symptoms. The NMF R package [35] was then employed as the computational framework for nonnegative matrix factorization in R. For this method, the optimal rank r should be determined first; several approaches have been proposed for this purpose [36, 37]. In our study, two methods, the cophenetic coefficient [36] and the RSS curve [37], were adopted to determine the optimal rank r in the range 2 to 7. After obtaining the NMF result with the optimal r, we calculated the similarity matrix SMX over all the basis rows and inferred the redundant symptoms with a threshold θ = 0.95, which meet the conditions sim_corr(wi, wj) ≥ 0.95 and sim_dist(wi, wj) ≥ 0.95 in formulas (7)–(9). Finally, an LSSVM classifier was implemented to validate the classification performance of the inferred optimal symptom subset. In the LSSVM multiclass model, a Gaussian RBF kernel was employed, and the kernel parameters σ² and γ were determined by grid search [38]. In our grid search, we set σ² = 10^a and γ = 10^b, with a varying from −1 to 5 in steps of 0.25 and b varying from −1 to 4 in steps of 0.2. This gives the range [0.1, 100000] for σ² and the range [0.1, 10000] for γ, with 25 levels for σ² and 26 levels for γ; in other words, 650 pairs (σ², γ) are tested when training an LSSVM classifier. To find the optimal (σ², γ), we used 5-fold cross-validation to evaluate the classification accuracy of the LSSVM model.
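A cross-validated grid search of this kind can be sketched with scikit-learn. Two caveats: LSSVM is not available there, so a standard RBF-kernel SVC stands in (its `gamma` corresponds to 1/(2σ²), and its `C` plays the role of the regularization parameter γ in the text), and the exponent steps are coarsened from 0.25/0.2 to 1.0 to keep the demo small; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Coarse stand-in for the paper's grid: sigma^2 = 10^a, a in [-1, 5];
# regularization = 10^b, b in [-1, 4]  (paper steps: 0.25 and 0.2)
sigma2 = 10.0 ** np.arange(-1.0, 5.1, 1.0)
C_vals = 10.0 ** np.arange(-1.0, 4.1, 1.0)

X, y = make_classification(n_samples=120, n_features=39, n_informative=10,
                           n_classes=3, random_state=0)
search = GridSearchCV(SVC(kernel="rbf"),
                      {"gamma": 1.0 / (2.0 * sigma2), "C": C_vals},
                      cv=5, scoring="accuracy").fit(X, y)
```

`search.best_params_` then holds the pair selected by 5-fold cross-validation, mirroring how (σ², γ) was chosen for the LSSVM.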
4. Results and Discussion
First, we calculated the positive frequencies of all 57 original symptoms at each clinical stage (see Supplementary Table S1 available online at http://dx.doi.org/10.1155/2015/846942). Eight symptoms were judged to be irrelevant features (threshold: 10%). From Table 2, we can clearly see that these symptoms appeared in few patients (less than 10% at each clinical stage) in the clinical observation, and they were therefore considered noisy features in the process of diagnosis. Because the total number of samples is large (407), we consider the eight irrelevant symptoms identified by statistical analysis to be very reliable. Some of the symptoms in Table 2 are supported by previous studies. For example, Lai et al. concluded that no association was detected between “emotional depression” and the risk of hepatocellular carcinoma in older people in Taiwan [39, 40]. In addition, Peng et al. studied 169 Chinese patients with HCC; only three patients presented with hydrothorax, which also indicates that this symptom is not a key symptom in liver cancer development [41, 42]. Moreover, “edema in lower extremities” is undoubtedly a well-known symptom of HCC patients in the clinic [43]; however, it was considered an irrelevant symptom in this study because it rarely appeared in any of the three stages of our data. Increasing the number of observed samples or reducing the threshold could restore it as a candidate symptom.
Eight irrelevant symptoms were screened with threshold 10%. Each of them is rarely positive in each phase.

| Symptoms | Phase IA | Phase IB | Phase IIA | Phase IIB | Phase IIIA | Phase IIIB |
|---|---|---|---|---|---|---|
| Pale white lip [V1] | 0 | 5.41% | 6.67% | 5.19% | 4.5% | 0 |
| Edema in lower extremities [V16] | 2.22% | 8.1% | 1.67% | 5.19% | 3.6% | 0 |
| Lack of urine output [V41] | 0 | 2.7% | 0 | 0 | 5.41% | 0 |
| Emotional depression [V43] | 4.44% | 0 | 5% | 8.89% | 6.31% | 5.26% |
| Head body trapped heavy [V47] | 0 | 2.7% | 3.33% | 2.22% | 2.7% | 0 |
| Hydrothorax [V51] | 6.67% | 2.7% | 1.67% | 3.7% | 2.7% | 0 |
| Rapid pulse [V55] | 4.44% | 2.7% | 1.67% | 0.74% | 5.41% | 5.26% |
| Uneven pulse [V56] | 4.44% | 5.41% | 8.33% | 3.7% | 3.6% | 0 |
Secondly, NMF was computed after removing all the detected irrelevant symptoms. As described in “Simulation Design,” NMF was applied to the representative matrix DR with 120 HCC samples, which uniformly cover the three clinical phases. Figure 2(a) shows that DR is a sparse matrix in which a large proportion of elements are zero (not positive), as for symptom V6 in Figure 2(b). However, some symptoms were positive in many patients, such as symptom V25 in Figure 2(c). Matrix DR does not show obvious subtypes or patterns; hence, it is hard to compare symptom similarity directly from the row vectors of DR, since the number of samples is still large. In this study, we used NMF to compress the representative matrix DR and to reveal the distribution patterns of features (symptoms) over fewer dimensions. Before computing NMF, a critical parameter must be determined: the factorization rank r. According to Brunet’s method, the first value of r for which the cophenetic coefficient starts decreasing is the optimal one [36]. Frigyesi and Höglund suggested choosing the first value at which the RSS curve presents an inflection point [37]. Based on these two methods, we determined that 3 is a reasonable rank r for the clinical data matrix DR; the curves in Figure 3 confirm this conclusion. Nonnegative matrix factorization was then applied to the matrix DR (49×120) with rank 3, meaning that the number of metafeatures (bases) equals 3.
The heatmap of the representative clinical dataset DR. (a) The heatmap of DR with 49 symptoms and 120 samples. (b) The distribution patterns of symptoms V6, V8, V28, V37, and V53 indicate that the frequencies of positive are low. (c) The distribution patterns of symptoms V46, V42, and V25 indicate that the frequencies of positive are high.
Estimation of the optimal rank r.
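The RSS side of this rank estimation can be sketched as follows: fit NMF for each candidate rank and record the residual sum of squares, then look for the first inflection point of the curve on real data (Frigyesi and Höglund's criterion). The matrix here is random stand-in data of the same shape as DR; Brunet's cophenetic coefficient would additionally require repeated runs with random restarts and is not shown.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((49, 120))   # placeholder for the real D_R matrix

# Residual sum of squares for candidate ranks r = 2..7
rss = {}
for r in range(2, 8):
    model = NMF(n_components=r, init="nndsvda", max_iter=400,
                random_state=0)
    W = model.fit_transform(X)
    rss[r] = float(((X - W @ model.components_) ** 2).sum())
```

The RSS necessarily shrinks as r grows, so the criterion looks for the rank at which the curve's slope changes rather than its minimum.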
Figure 4 shows the final results of NMF, which include the basis matrix W (49×3) and the mixture coefficient matrix H (3×120). Each row of W uses a compressed pattern to approximately represent the distribution of a symptom over all the original samples. Compared with matrix DR in Figure 2, the obvious difference in matrix W is that several groups of features reveal similar patterns in the compressed sample space, such as V40 and V36 in Figure 4. From Figure 2(a), the distance between the vectors of symptoms V40 and V36 in DR is also small; furthermore, the compressed patterns of V40 and V36 in matrix W (w40 and w36) in Figure 4 make it easier to identify redundant features with very similar distribution patterns.
The result of NMF on the dataset DR. The left side shows the visualization of matrix W (49×3), and the right side shows matrix H (3×120).
The matrix H has the same number of samples as the original matrix X but a much smaller number of metafeatures (bases) [36]. Therefore, the metafeature expression patterns in H usually provide a robust clustering of samples. Given the jth column of H as Hj = [hj1, hj2, hj3]T, the jth clinical sample is placed into the kth cluster if max Hj = Hj(k), where k ∈ {1, 2, 3}. Hence, we used matrix H to group all the samples into 3 clusters, corresponding to the 3 bases (metafeatures). Figure 5 shows great overlap between the clinical-staging markers (a priori class labels) and the indexes of the basis components (metafeatures) for the 120 original clinical samples in dataset DR.
The relationships between NMF-derived basis components and clinical stages of samples.
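Both assignment rules used here and in the next paragraph are simple argmax operations on H and W; a minimal sketch with toy matrices (function names illustrative):

```python
import numpy as np

def assign_clusters(H):
    """Sample j goes to cluster k when H[k, j] is the largest entry
    of column j (max H_j = H_j(k))."""
    return np.argmax(H, axis=0)

def assign_bases(W):
    """Feature i relates to basis k when W[i, k] is the largest entry
    of row i -- the rule behind the basis partition of the symptoms."""
    return np.argmax(W, axis=1)

H = np.array([[0.9, 0.1], [0.1, 0.8], [0.0, 0.1]])   # 3 bases, 2 samples
W = np.array([[0.2, 0.7, 0.1], [0.6, 0.1, 0.3]])     # 2 features, 3 bases
```

With these toy values, the two samples fall into clusters 0 and 1, and the two features attach to bases 1 and 0, respectively.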
In matrix W, each column also corresponds to a metafeature or basis (see Figure 4). The entry wij of W is the coefficient of original feature i in metafeature (basis) j [36]. Therefore, an original feature i relates to a basis j if wij is the largest entry in row i of W. From Figure 4, we can clearly see that the original symptom features participating in the same basis have more similar expression patterns than those in other bases. Table 3 lists the symptoms related to each basis component. Combining Figure 5 and Table 3 further indicates that the symptoms related to “basis 1” are strongly associated with the clinical samples of phase II, while the symptoms related to “basis 2” and “basis 3” are strongly associated with phase I and phase III, respectively. This finding contributes to identifying clinical phase-specific important symptoms via NMF. Moreover, the partition of the 49 clinical symptoms shown in Table 3 is well supported by related studies. For example, nausea is observed as a common adverse effect in HCC patients in phase I [44]. The symptoms ascites, anorexia, fever, and jaundice often occur in phase II [43, 45–48]; the symptoms “yellow complexion” and “yellow skin and eye” shown in Table 3 are obvious appearances of jaundice. For phase III, pain is the most obvious characteristic in HCC patients [49]; three pain-related symptoms appear in Table 3: “pain in shoulder and back,” “chest pain,” and “distending pain in hypochondrium.” Moreover, fatigue and weakness are also common in HCC patients [43]. Together, these findings suggest that NMF with an optimal rank can reveal latent associations between potential symptom features and clinical phases.
The NMF-derived partition of the symptoms into the corresponding basis components.

| Basis components | Number of symptoms | Names of symptoms |
|---|---|---|
| Basis 1 | 16 | Varicose veins [V7]; yellow complexion [V11]; yellow skin and eye [V13]; stomach pain [V31]; dry stool [V38]; feeling thirsty [V27]; hot flash [V20]; doing belly full bilge [V33]; fullness in stomach [V32]; block under the rib [V49]; chills [V18]; fever [V19]; spider telangiectasia in liver palm [V15]; ascites [V50]; yellow greasiness [V9]; anorexia [V34] |
| Basis 2 | 17 | Nausea [V35]; pulse slip [V54]; petechial and ecchymosis tongue [V6]; white slip [V8]; chest distress [V28]; semiliquid stool [V37]; weak pulse [V53]; night sweat [V22]; dirty mouth [V17]; red tongue [V3]; thready pulse [V57]; sticky greasy coating [V10]; purple tongue [V4]; stringy pulse [V52]; pale white lip [V2]; large and teeth-printed tongue [V5]; gloomy complexion [V14] |
| Basis 3 | 16 | Tinnitus [V24]; dizziness [V23]; pain in shoulder and back [V48]; chest pain [V29]; distending pain in hypochondrium [V30]; bitter taste [V26]; insomnia [V42]; appearance with stained yellow [V12]; yellow urine [V40]; hiccup [V36]; soreness and weakness of waist and knees [V44]; dry throat [V25]; feverishness in palms and soles [V45]; spontaneous perspiration [V21]; night urination much [V39]; physically and mentally fatigued [V46] |
As mentioned in “Simulation Design,” several groups of redundant features were screened out according to the given threshold θ = 0.95 (Table 4). We obtained two redundant symptom groups from each basis component, which indicates that the redundant symptoms in the same group may also have similar patterns in the original sample space. Here, we take Figures 2(b)-2(c) as examples to corroborate the effectiveness of our method. Figure 2(b) shows the distribution of positives for five symptoms in the dataset DR. These five symptoms (V6, V8, V28, V37, and V53) were identified as basis 2 related features and most probably belong to phase I (Table 4). Although the row vectors in Figure 2(b) are not exactly equal, they all show a relatively low frequency of positives (15.17 ± 3.25%), and their local distribution patterns are similar to some extent. Comparing the corresponding rows of these five symptoms in matrix W in Figure 4, we found that their compressed patterns are very similar. Similarly, the symptoms V46, V42, and V25 are potentially related to basis 3; the frequency of positives for each is over 50%, and the mean positive value of these three symptoms is 1.77, which further indicates that they might be related to patients in very serious condition. Although V46, V42, and V25 were not identified as redundant symptoms at the given threshold (0.95), their compressed patterns in matrix W in Figure 4 also suggest that their patterns are very close. In summary, matrix W facilitates evaluating the differences among symptoms, and matrix H can validate the high degree of correlation between sample class labels and basis indexes. After inferring the redundant symptoms at the given threshold, we merged each symptom group and converted it into a new feature (named a mixed feature).
Finally, we obtained 39 clinical features (FS1) of HCC as the optimal feature subset, consisting of two parts: 33 original symptom features (FS2) and 6 new mixed features (FS3) (Table 5). Based on the analysis of the NMF results, the feature space of the original dataset was thus further reduced.
The mean similarity values for the pairs of redundant symptoms within the same groups.

| Basis components | Screened redundant symptoms | Distance-based similarity sim_dist(wi, wj) | Correlation-based similarity sim_corr(wi, wj) |
|---|---|---|---|
| Basis 1 | V38, V27, V20 | 0.9672 | 1.0 |
| Basis 1 | V19, V15 | 0.9507 | 1.0 |
| Basis 2 | V35, V54 | 0.9685 | 0.9960 |
| Basis 2 | V6, V8, V53, V37, V28 | 0.9628 | 1.0 |
| Basis 3 | V48, V29 | 0.9686 | 1.0 |
| Basis 3 | V44, V45 | 0.9520 | 0.9926 |
The NMF-driven potential clinical features of HCC (threshold: 0.95).
To evaluate the potential of the NMFBFS-inferred optimal feature subset, we first tested the classification accuracy of three candidate feature subsets, FS1, FS2, and OFS, on the training set (120 representative samples). FS1 and FS2 were generated by feature selection with the threshold θ = 0.95. OFS denotes the 49 original symptom features in the dataset DR. Table 6 indicates that the 39 optimal features, which cover 33 original symptom features and 6 new mixed features, yield the best classification accuracy on the training samples. The performance of FS2 was much better than that of OFS but still worse than FS1, because the new mixed features also make important contributions to classification.
Table 6: Classification accuracy of three feature subsets on the training set (120 representative samples). FS1 was obtained by the proposed approach with the given threshold (θ = 0.95) and includes 33 original symptom features and 6 new mixed features. FS2 denotes those 33 original symptom features alone (FS2 ⊂ FS1). OFS denotes all 49 symptoms before the NMF calculation.

Feature subset | Dimension | Classification accuracy in LSSVM (%)
FS1 | 39 | 80.00 ± 9.95
FS2 | 33 | 77.50 ± 12.36
OFS | 49 | 72.50 ± 11.64
We then compared the performance of NMFBFS with three well-known feature selection methods: ReliefF [11], mRMR [12], and Elastic Net [13]. ReliefF was implemented with MATLAB's built-in function, and the “mRMRe” and “elasticnet” R packages were used for mRMR- and Elastic Net-based feature selection, respectively. Supplementary Figure S1 shows the ReliefF-based feature ranking, and Supplementary Figure S2 shows the Elastic Net (λ = 0.5) solution paths for feature selection. For each method, we selected the Top 20 and Top 40 features as two candidate feature subsets to evaluate classification performance: FS_RF20 and FS_RF40 from ReliefF, FS_MR20 and FS_MR40 from mRMR, and FS_EN20 and FS_EN40 from Elastic Net. Table 7 reports the classification performance of these six candidate feature subsets and the NMFBFS-derived optimal feature subset FS1 on the training set (120 representative samples). The results indicate that the NMFBFS-inferred feature subset achieves the best classification accuracy on the training samples.
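The Elastic Net branch of the comparison can be sketched with scikit-learn; ElasticNet with l1_ratio=0.5 plays the role of the R "elasticnet" run with λ = 0.5, the penalty strength alpha = 0.05 is an arbitrary illustrative choice, and the data are synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 49))
# Synthetic outcome driven mainly by the first five features.
y = X[:, :5] @ np.array([2.0, -1.5, 1.0, 0.8, -0.5]) + 0.1 * rng.normal(size=120)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally, analogous to the
# lambda = 0.5 setting reported for the R "elasticnet" package.
enet = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)

# Rank features by absolute coefficient and keep the Top-20 subset.
ranking = np.argsort(-np.abs(enet.coef_))
top20 = ranking[:20]
```

The same Top-k cut applied to the ReliefF and mRMR rankings yields the six candidate subsets compared in Table 7.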
Table 7: Classification accuracy of the optimal feature subsets inferred via NMFBFS, ReliefF, mRMR, and Elastic Net on the training set.

Method | Feature subset | Dimension | Classification accuracy in LSSVM (%)
NMFBFS | FS1 | 39 | 80.00 ± 9.95
ReliefF | FS_RF20 | 20 | 65.00 ± 10.03
ReliefF | FS_RF40 | 40 | 73.33 ± 15.76
mRMR | FS_MR20 | 20 | 70.83 ± 12.50
mRMR | FS_MR40 | 40 | 74.17 ± 9.03
Elastic Net | FS_EN20 | 20 | 70.00 ± 11.56
Elastic Net | FS_EN40 | 40 | 76.67 ± 10.46
Apart from the 120 representative training samples that were screened out for the NMF analysis, the remaining samples can be used to test the classification accuracy of the optimal feature subsets. We randomly selected 40 samples (10 : 20 : 10 across the three clinical stages) from the remaining samples and evaluated the classification accuracy of the feature subset inferred by each method (NMFBFS, ReliefF, mRMR, and Elastic Net). Table 8 shows the differences among the methods; the optimal feature subset inferred by our proposed method has the best generalization performance.
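The 10:20:10 draw of test samples amounts to simple stratified sampling, which can be sketched as follows; the function name and dictionary interface are ours, not from the paper:

```python
import numpy as np

def stratified_sample(labels, per_stage, rng):
    """Draw a fixed number of sample indices per clinical stage without
    replacement, e.g. {1: 10, 2: 20, 3: 10} for the 10:20:10 test split."""
    labels = np.asarray(labels)
    picked = []
    for stage, n in per_stage.items():
        idx = np.flatnonzero(labels == stage)
        picked.extend(rng.choice(idx, size=n, replace=False))
    return np.array(picked)
```

For example, `stratified_sample(stages, {1: 10, 2: 20, 3: 10}, np.random.default_rng(0))` returns 40 distinct indices with exactly the requested count per stage.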
Table 8: Classification accuracy of the optimal feature subsets inferred via NMFBFS, ReliefF, mRMR, and Elastic Net on the testing set.

Method | Feature subset | Dimension | Classification accuracy in LSSVM (%)
NMFBFS | FS1 | 39 | 79.65 ± 6.48
ReliefF | FS_RF20 | 20 | 50.71 ± 1.22
ReliefF | FS_RF40 | 40 | 76.43 ± 8.27
mRMR | FS_MR20 | 20 | 63.79 ± 1.22
mRMR | FS_MR40 | 40 | 77.14 ± 9.18
Elastic Net | FS_EN20 | 20 | 67.57 ± 4.09
Elastic Net | FS_EN40 | 40 | 78.38 ± 9.62
Finally, and more importantly, the choice of the threshold θ determines how many groups of redundant symptoms are screened out. We therefore examined the effect of θ on the optimal feature subsets and their classification performance. Table 9 compares three optimal feature subsets inferred by the proposed approach with different values of θ. A larger θ screens redundant symptoms more strictly, so fewer similar symptoms are grouped together; with a smaller θ, many more symptoms are categorized into the same groups, and the original feature space is sharply reduced by our approach. Table 9 shows that, as θ decreases, the size of the optimal feature subset shrinks but the classification accuracy also drops. These results suggest that a larger θ yields fewer redundant symptoms and hence a larger optimal feature subset, whereas a smaller θ yields more redundant symptoms and sharply reduces the feature dimension. In the extreme case of θ = 0, all symptoms associated with a basis collapse into one mixed feature, and the size of the optimal feature subset equals the number of bases. In short, the value of θ should be chosen by balancing the size of the optimal feature subset against its classification performance.
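This trade-off can be made concrete with a small sketch that counts the size of the optimal feature subset as a function of θ: every redundant group of g symptoms collapses into one mixed feature and therefore removes g − 1 features. Pearson correlation of the rows of W again stands in for the paper's similarity measure:

```python
import numpy as np

def optimal_subset_size(W, theta):
    """Number of features left after merging: each group of symptoms
    whose row-correlation in the basis matrix W reaches theta becomes
    one mixed feature, so the total drops by (g - 1) per group."""
    C = np.corrcoef(W)
    size, assigned = 0, set()
    for i in range(W.shape[0]):
        if i in assigned:
            continue
        group = [j for j in range(i, W.shape[0])
                 if j not in assigned and C[i, j] >= theta]
        assigned.update(group)
        size += 1   # one surviving original symptom or one mixed feature
    return size
```

Lowering θ lets more rows merge, so the subset size typically shrinks as θ decreases, matching the trend reported in Table 9.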
Table 9: Classification performance of the optimal feature subsets inferred with different values of the threshold θ.

Threshold θ | Original symptom features | New mixed features | Total number of features | Classification accuracy (%)
0.95 | 33 | 6 | 39 | 80.00 ± 9.95
0.90 | 21 | 9 | 30 | 70.83 ± 6.59
0.85 | 10 | 8 | 18 | 70.00 ± 4.56
5. Conclusions
In this study, we developed the NMFBFS approach to efficiently extract important clinical symptoms of HCC from clinical observation data. NMFBFS is a two-stage filter method for feature selection: (1) in the first stage, a statistics-based preliminary screening detects and removes irrelevant features; (2) in the second stage, NMF identifies groups of redundant features that may share similar distribution patterns. Each redundant symptom group is then transformed into a new mixed feature so that the dimension of the dataset is further reduced.
The application of NMFBFS to a clinical dataset of HCC demonstrated the effectiveness of the approach: the optimal clinical features derived from NMFBFS contain many well-recognized symptoms of HCC patients. Moreover, this study provides a general computational framework for a novel feature selection approach that efficiently extracts an optimal feature subset from a high-dimensional dataset.
Abbreviations
HCC: Hepatocellular carcinoma
TCM: Traditional Chinese Medicine
NMF: Nonnegative matrix factorization
LSSVM: Least squares support vector machines
KNN: K-nearest neighbor.
Conflict of Interests
The authors declare that they have no competing interests.
Authors’ Contribution
Zhiwei Ji and Guanmin Meng contributed equally to this work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 61472282 and 61133010). The data in this work were collected by Changhai Hospital, Shanghai, China.
References
1. Bosch F. X., Ribes J., Cléries R., Díaz M. Epidemiology of hepatocellular carcinoma.
2. Center M. M., Jemal A., Smith R. A., Ward E. Worldwide variations in colorectal cancer.
3. El-Serag H. B. Hepatocellular carcinoma.
4. A new prognostic system for hepatocellular carcinoma: a retrospective study of 435 patients: the Cancer of the Liver Italian Program (CLIP) investigators.
5. Miller G., Schwartz L. H., D'Angelica M. The use of imaging in the diagnosis and staging of hepatobiliary malignancies.
6. Forner A., Bruix J. Diagnosis of hepatic nodules 20 mm or smaller in cirrhosis: prospective validation of the noninvasive diagnostic criteria for hepatocellular carcinoma—reply.
7. Liao Y.-H., Lin C.-C., Li T.-C., Lin J.-G. Utilization pattern of traditional Chinese medicine for liver cancer patients in Taiwan.
8. Mourad R., Sinoquet C., Leray P. Probabilistic graphical models for genetic association studies.
9. Ji Z., Wang B. Identifying potential clinical syndromes of hepatocellular carcinoma using PSO-based hierarchical feature selection algorithm.
10. Du J.-X., Zhai C.-M., Ye Y.-Q. Face aging simulation and recognition based on NMF algorithm with sparseness constraints.
11. Liang J. N., Yang S., Winstanley A. Invariant optimal feature selection: a distance discriminant and feature ranking based solution.
12. Peng H. C., Long F. H., Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy.
13. Zou H., Hastie T. Regularization and variable selection via the elastic net.
14. Wildi S., Pestalozzi B. C., McCormack L., Clavien P.-A. Critical evaluation of the different staging systems for hepatocellular carcinoma.
15. Sharma A., Imoto S., Miyano S. A filter based feature selection algorithm using null space of covariance matrix for DNA microarray gene expression data.
16. Bellal F., Elghazel H., Aussem A. A semi-supervised feature ranking method with ensemble learning.
17. Chang H.-W., Chiu Y.-H., Kao H.-Y., Yang C.-H., Ho W.-H. Comparison of classification algorithms with wrapper-based feature selection for predicting osteoporosis outcome based on genetic factors in a Taiwanese women population.
18. Imani M. B., Keyvanpour M. R., Azmi R. A novel embedded feature selection method: a comparative study in the application of text categorization.
19. Zdunek R., Cichocki A. Nonnegative matrix factorization with constrained second-order optimization.
20. Chang Z., Wang Z., Ashby C., Zhou C., Li G., Zhang S., Huang X. eMBI: boosting gene expression-based clustering for cancer subtypes.
21. Zheng C.-H., Huang D.-S., Zhang L., Kong X.-Z. Tumor clustering using nonnegative matrix factorization with gene selection.
22. Zheng C.-H., Ng T.-Y., Zhang L., Shiu C.-K., Wang H.-Q. Tumor classification based on non-negative matrix factorization using gene expression data.
23. Cichocki A., Lee H., Kim Y.-D., Choi S. Non-negative matrix factorization with α-divergence.
24. Zdunek R., Cichocki A. Nonnegative matrix factorization with quadratic programming.
25. Lee D. D., Seung H. S. Algorithms for non-negative matrix factorization. In: Proceedings of the Advances in Neural Information Processing Systems (NIPS '01), 2001.
26. Theys C., Lantéri H., Richard C. SGM to solve NMF—application to hyperspectral data.
27. Casalino G., del Buono N., Mencar C. Subtractive clustering for seeding non-negative matrix factorizations.
28. Vignolo L. D., Milone D. H., Scharcanski J. Feature selection for face recognition based on multi-objective evolutionary wrappers.
29. Anand A., Pugalenthi G., Fogel G. B., Suganthan P. N. An approach for classification of highly imbalanced data using weighting and undersampling.
30. Bria A., Karssemeijer N., Tortorella F. Learning from unbalanced data: a cascade-based approach for detecting clustered microcalcifications.
31. Cao P., Zhao D. Z., Zaiane O. Hybrid probabilistic sampling with random subspace for imbalanced data learning.
32. Shubair A., Ramadass S., Altyeb A. A. KENFIS: kNN-based evolving neuro-fuzzy inference system for computer worms detection.
33. Wang H.-Q., Sun F.-C., Cai Y.-N., Ding L.-G., Chen N. An unbiased LSSVM model for classification and regression.
34. Mustaffa Z., Yusof Y. LSSVM parameters tuning with enhanced artificial bee colony.
35. Li Y., Ngom A. The non-negative matrix factorization toolbox for biological data mining.
36. Brunet J.-P., Tamayo P., Golub T. R., Mesirov J. P. Metagenes and molecular pattern discovery using matrix factorization.
37. Frigyesi A., Höglund M. Non-negative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes.
38. Bo L. F., Wang L., Jiao L. C. Multiple parameter selection for LS-SVM using smooth leave-one-out error.
39. Lai S.-W., Chen H.-J., Lin C.-L., Liao K.-F. No correlation between Alzheimer's disease and risk of hepatocellular carcinoma in older people: an observation in Taiwan.
40. Ou S.-M., Lee Y.-J., Hu Y.-W., Liu C.-J., Chen T.-J., Fuh J.-L., Wang S.-J. Does Alzheimer's disease protect against cancers? A nationwide population-based study.
41. Peng S.-Y., Feng X.-D., Liu Y.-B., Qian H.-R., Li J.-T., Wang J.-W., Xu B., Fang H.-Q., Cao L.-P., Shen H.-W., Du J.-J., Cai X.-J., Mu Y.-P. Surgical treatment of hepatocellular carcinoma originating from caudate lobe.
42. Peng S. Y., Li J. T., Liu Y. B., Cai X. J., Mou Y. P., Feng X. D., Wang J. W., Xu B., Qian H. R., Hong D. F., Wang X. B., Fang H. Q., Cao L. P., Chen L., Peng C. H., Liu F. B., Xue J. F. Surgical treatment of hepatocellular carcinoma originating from caudate lobe—a report of 39 cases.
43. Lin M.-H., Wu P.-Y., Tsai S.-T., Lin C.-L., Chen T.-W., Hwang S.-J. Hospice palliative care for patients with hepatocellular carcinoma in Taiwan.
44. Fujiyama S., Shibata J., Maeda S., Tanaka M., Noumaru S., Sato K., Tomita K. Phase I clinical study of a novel lipophilic platinum complex (SM-11355) in patients with hepatocellular carcinoma refractory to cisplatin/lipiodol.
45. Yu X., Zhao H., Liu L., Cao S., Ren B., Zhang N., An X., Yu J., Li H., Ren X. A randomized phase II study of autologous cytokine-induced killer cells in treatment of hepatocellular carcinoma.
46. Ciombor K. K., Feng Y., Benson A. B. III, Su Y., Horton L., Short S. P., Kauh J. S. W., Staley C., Mulcahy M., Powell M., Amiri K. I., Richmond A., Berlin J. Phase II trial of bortezomib plus doxorubicin in hepatocellular carcinoma (E6202): a trial of the Eastern Cooperative Oncology Group.
47. Wu J., Henderson C., Feun L., Van Veldhuizen P., Gold P., Zheng H., Ryan T., Blaszkowsky L. S., Chen H., Costa M., Rosenzweig B., Nierodzik M., Hochster H., Muggia F., Abbadessa G., Lewis J., Zhu A. X. Phase II study of darinaparsin in patients with advanced hepatocellular carcinoma.
48. Lin J.-J., Jin C.-N., Zheng M.-L., Ouyang X.-N., Zeng J.-X., Dai X.-H. Clinical study on treatment of primary hepatocellular carcinoma by Shenqi mixture combined with microwave coagulation.
49. Doffoël M., Bonnetain F., Bouché O., Vetter D., Abergel A., Fratté S., Grangé J. D., Stremsdoerfer N., Blanchi A., Bronowicki J. P., Caroli-Bosc F. X., Causse X., Masskouri F., Rougier P., Bedenne L. Multicentre randomised phase III trial comparing Tamoxifen alone or with Transarterial Lipiodol Chemoembolisation for unresectable hepatocellular carcinoma in cirrhotic patients (Federation Francophone de Cancerologie Digestive 9402).