Identifying Potential Clinical Syndromes of Hepatocellular Carcinoma Using PSO-Based Hierarchical Feature Selection Algorithm

Hepatocellular carcinoma (HCC) is one of the most common malignant tumors. Clinical symptoms attributable to HCC are usually absent, thus often miss the best therapeutic opportunities. Traditional Chinese Medicine (TCM) plays an active role in diagnosis and treatment of HCC. In this paper, we proposed a particle swarm optimization-based hierarchical feature selection (PSOHFS) model to infer potential syndromes for diagnosis of HCC. Firstly, the hierarchical feature representation is developed by a three-layer tree. The clinical symptoms and positive score of patient are leaf nodes and root in the tree, respectively, while each syndrome feature on the middle layer is extracted from a group of symptoms. Secondly, an improved PSO-based algorithm is applied in a new reduced feature space to search an optimal syndrome subset. Based on the result of feature selection, the causal relationships of symptoms and syndromes are inferred via Bayesian networks. In our experiment, 147 symptoms were aggregated into 27 groups and 27 syndrome features were extracted. The proposed approach discovered 24 syndromes which obviously improved the diagnosis accuracy. Finally, the Bayesian approach was applied to represent the causal relationships both at symptom and syndrome levels. The results show that our computational model can facilitate the clinical diagnosis of HCC.


Introduction
Hepatocellular carcinoma (HCC) is the third most common cause of cancer-related death worldwide and the leading cause of death in patients with cirrhosis [1,2]. In clinical practice, symptoms attributable to HCC are usually absent, so the majority of patients are diagnosed with advanced disease, often precluding potentially curative therapies. This has resulted, in part, in a 5-year overall survival rate of 12% and a median survival following diagnosis ranging from 6 to 20 months [3,4]. Therefore, timely and accurate diagnosis is very important for treatment of HCC. Currently, the modalities employed in the diagnosis of HCC mainly include crosssectional imaging, biopsy, and serum AFP, which depend on both the size of the lesion and underlying liver function, and some of them are controversial [5,6].
Traditional Chinese Medicine (TCM) is one of the most popular complementary and alternative medicine modalities. It plays an active role in diagnosis and treatment of HCC in Chinese and East some Asian countries [7,8]. Different from other diagnostic methods, it is possible to accurately diagnose HCC using inspection, auscultation and olfaction, inquiry, and pulse taking and palpation [8]. In this study, we will work on a TCM clinical dataset, which is observed from 120 HCC patients. Each patient is observed on 147 clinical symptoms and a positive score is evaluated to indicate total positive strength of symptoms. Based on this TCM dataset, we could achieve two aims: (1) screening the potential clinical syndromes for this cancer and (2) inferring the relationships among the potential clinical features via Bayesian network analysis. However, the computational cost will be exceedingly high if the dimensions of the raw dataset 2 BioMed Research International are large. Furthermore, the causal relationships between all the features are difficult to infer because high dimensional data sharply increases the complexity of Bayesian network structure learning [9].
In this study, a particle swarm optimization-based hierarchical feature selection (PSOHFS) model was proposed to select potential clinical syndromes for HCC diagnoses. Firstly, all the 147 original symptoms were arranged into 27 groups according to the categories of clinical observations, and 27 new syndrome features were generated from these groups. Then, the hierarchical feature representation was built with a tree structure, in which different layers indicate different scales of clinical information ( Figure 1). Secondly, an improved PSO algorithm was employed at the syndrome level to search an optimal syndrome subset for diagnoses. The experiment shows that 24 novel syndromes searched by PSOHFS could improve accuracy of diagnosis. In addition, Bayesian networks were further constructed at two levels: (1) a global network on the middle-layer features revealed the relationships among 24 potential syndromes; (2) the local networks were used to represent the connections of symptoms in the same groups.
The rest of the paper is organized as follows. Section 2 introduces the details about the experimental data and the feature selection approach. Sections 1 and 2 present the experiment design and results, respectively. Some important conclusions drawn are presented in Section 5.

Experimental Data.
In this study, the raw data was observed from 120 HCC patients. The clinical dataset includes 300 samples and 147 clinical symptoms. The levels of positive of each symptom are quantified with nonnegative integers. The larger value indicates stronger positive symptom occurred. There are two types of data range for all the original symptoms: binary or integer. For example, the symptom "lip color is white" is binary (0 or 1); that means there are two possible states for this symptom: occurrence or nonoccurrence. Another example is "abdominal pain"; its data range is 0, 1, 2, and 3. The symptom is not positive if its value equals zero; otherwise, the larger the value is, the stronger positive symptom will be. In addition, each patient is marked with a score (nonnegative value) to represent the total evaluation of positive symptoms on this patient. It is obvious that if the HCC patients have larger positive scores than normal people, it is because some clinical symptoms appeared.

Feature Selection.
Feature selection for classification or regression can be widely organized into three categories, depending on how they interact with the construction of model. Filter methods employ a criterion to evaluate each feature individually and is independent of the model [10]. Among them, feature ranking is a common method which involves ranking all the features based on a certain measurement and selecting a feature subset which contains highranked features [11]. Wrapper methods involve combinatorial Positive score 27 syndromes · · · · · · · · · · · · 147 symptoms searches through the feature space, guided by the predicting performance of a classification or regression model [12]. Embedded methods perform feature selection in the process of training a model [13].

Hierarchical Feature Selection.
When the raw dataset is high dimensional, the complexity of feature selection may be extremely high: (a) the computational cost will sharply increase, particularly for the wrapper and embedded methods; (b) the potential optimal feature subset may include many irrelevant or redundant features. Therefore, it is necessary to preliminarily reduce the dimension of original feature set before feature selection. As a common preselecting strategy, feature ranking-based approach could quickly reduce the feature space by picking up high-ranked features [14]. However, this type of approach always leads to inclusion of some redundant features. In addition, the optimal feature subset which covers high-ranked features may not provide the best performance in the classification (or regression) model. Ruvolo et al. proposed a novel hierarchical feature selection approach for the audio classification by converting the raw data to three-layer feature representation with a tree structure [15]. All the low-layer features are aggregated into several groups in a "bag of features" manner, and then a higher-layer feature is extracted based on the lower-layer features in the same group. Obviously, the highlayer feature set constitutes a reduced feature space with little redundancy and might provide lower computational cost for classification or regression model.
In this study, our raw TCM data is high dimensional and there are some redundant clinical symptom features included. For example, there are four redundant observed features to describe lip color of patients, such as "lip color is pale, " "lip color is red, " "lip color is pink, " and "lip color is dark purple. " Therefore, we aggregate several features into a group if they describe the same category of clinical symptoms or the same part of body and define a new syndrome feature for each symptom group. After extracting all the syndrome features, we build a tree structure to achieve the hierarchical feature representation (Figure 1). In this hierarchical structure, the Elitist strategy is used in the later stage of iteration bottom-layer nodes (leaf nodes) are the original clinical symptom features which are directly collected from the original TCM clinical dataset. And a middle-layer syndrome feature is defined on a group of symptoms which are related to the same part of the body. If the symptoms in the same group are not mutually exclusive (concurrent), the corresponding syndrome is defined as the sum of all these symptoms; otherwise, the level of positivity of the syndrome is based on the frequency of each symptom in all the patients (see Section 2). The top-layer node is the root of the tree, which denotes the positive score of a patient. It is obvious that each syndrome roughly represents the positive strength of one specific aspect or part of body, while symptom provides much more detailed information. Particularly, our study focuses on how to reasonably extract the syndrome features to generate a reduced feature set for feature selection and infer the causal relationships among these two-layer features.

Particle Swarm Optimization-Based Hierarchical Feature Selection (PSOHFS).
Based on the hierarchical feature representation, the dimension of the processed TCM dataset is sharply reduced on the syndrome level. We designed a chaotic binary particle swarm optimization (CBPSO) algorithm to search potential syndromes for diagnosing efficiently. The flow chart of proposed CBPSO-based feature selection is shown in Figure 2. Particle swarm optimization (PSO) is a population-based random optimization algorithm [16]. A swarm consists of particles moving around in a -dimensional search space. The position of the th particle is represented as = ( 1 , 2 , . . . , ), and the velocity = (V 1 , V 2 , . . . , V ), where 1 ≤ ≤ . The positions and velocities of particles are confined within [ min , max ] and [ min , max ] , respectively. Each particle coexists and evolves simultaneously based on knowledge shared with neighboring particles; it makes use of its own memory and knowledge gained by the swarm as a whole to find the best solution. The best previously encountered position of the th particle is considered as its individual best position best = ( 1 , 2 , . . . , ). The best position of all the best is considered as the global best position best = ( 1 , 2 , . . . , ). The limitation of the standard PSO algorithm is applied to optimize the problems in continuous space. However, many optimization problems occur in a discrete feature space; thus binary PSO (BPSO) was proposed to combinatorial optimization [17]. In BPSO, each particle is presented as a binary vector, thus, the overall velocity of particle may be described by the number of bits changed per iteration. Generally, each particle is updated as the following equations: Equation (1) will be used to update the velocities and positions of each particle in each generation. The inertia weight controls the impact of the previous velocity of a particle on its current one. 1 and 2 are random numbers between [0, 1]; 1 and 2 are acceleration constants that control how far a particle moves in a single generation. Velocities V new and V old denote the th velocities of the th particle in the current and the last generations, respectively. new and old indicate corresponding positions on the th dimension, respectively. In our case, max = 6, min = −6.
Generally, the speed of convergence of BPSO is fast; however, it has high risk of converging to local optimum. Because chaos is a complex behavior of a nonlinear deterministic system which has ergodic and stochastic properties, we combine chaos theory with BPSO to design chaotic binary particle swarm optimization (CBPSO), which potentially promotes the convergence performance of BPSO [18].
CBPSO-based feature selection is introduced in the following steps ( Figure 2).
(1) Chaotic Initialization of Particle Swarm. When CBPSO is used for feature selection, each particle indicates a candidate feature subset. Given an original feature set = { 1 , 2 , . . . , }, each particle is denoted by where is the number of features. It is obvious that each particle represents a candidate feature subset. If equals 1 indicates the th feature is selected; otherwise, is not selected. The performance of convergence about BPSO largely depends on initial particle swarm. The chaotic initialization via globally searching combined the ergodic and stochastic property of chaotic system is often has a better quality than random initialization. The common chaotic model is logistic model; it can be shown as follows: Equation (2) indicates a dynamical system, where is a control parameter. Given the value of , a time series 1 , 2 , . . . , is generated from a random initial value 0 , which ranges from 0 to 1. When equals 4, there is no stable solution for the dynamic system. It appears as a complete chaotic state. Now, an initial random vector 0 = { 01 , 02 , . . . , 0 } is generated. We substitute each element of 0 into (2) orderly and iterate times, respectively, and then obtain chaotic variables = [ 1 , 2 , . . . , ], which have different locus. When is substituted into (3), we get binary vectors [ 1 ; 2 ; . . . ; ], where the binary vector = [ ( 1 ), ( 2 ), . . . , ( )] represents a particle (1 ≤ ≤ ): At last, we select top binary vectors to constitute initial particle swarm based on the fitness values. For fully traversal of chaotic variable, the iteration of chaotic series is always large (here, = 500, < ).
(2) Fitness Calculation Based on LSSVR. Support vector machine (SVM) has excellent capabilities in classification (SVC) or regression (SVR), even for small sample [19]. It minimizes an upper bound of the generalization error based on the principle of structure risk minimize. However, SVM training process will be time consuming if dataset is huge. Therefore, least squares support vector machine (LSSVM) is proposed to overcome the shortcoming of high computational cost [20]. Generally, LSSVM can be categorized into LSSVR which is used for regression and LSSVC for classification. Because the problem-solving process of the SVR is a QP problem, which will inevitably cause a high computational complexity especially for large-scale QP problem, LSSVR can overcome these shortcomings by a set of linear equations and squared loss function which lead to important reduction in computational complexity [21].
In this study, we use LSSVR as a regression model to evaluate the predicting performance of each candidate feature subset. We assume that an optimal feature subset not only has excellent performance of prediction but also contains more relevant features and less irrelevant features. The fitness function is defined in fitvalue ( ) = pdterror ( ) + * mfr ( ) .
(4) denotes a particle-coding binary vector which indicates a candidate feature subset. The function pdterror( ) calculates the predicting error of LSSVR model based on the selected features in . The parameter is a weight between 0 and 1. Function mfr( ) indicates the correlation measure between a feature subset and the target variable. In (5), the function fr( ) measures the relevance between feature (included in ) and target value via a feature-ranking strategy. In our experiment, the more predictive features have smaller values of fr( * ) (see experiment in Section 3.2). Therefore, the smaller fitness value corresponds to the better candidate feature subset: mfr ( ) = mean (fr ( 1 ) , fr ( 2 ) , . . . , fr ( )) .
(3) Update the Velocity and Position for Each Particle. The velocity and position of each particle are updated according to (1). Considering the searching performance of CBPSO is affected largely by inertia weight ( ), the value of is dynamically updated in our CBPSO by using nonlinear decreasing strategy. Its calculation is as follows: In (6), max is the number of iterations, is the current iteration, and 3 is a constant (set 3 = 10). and , respectively, are the values of on the initial and last generation ( > ). In our case, = 1.2, = 0.4. The performance of global search of CBPSO is increased using larger at the beginning of iteration, and the local search will be enhanced using smaller at the later stage.
(4) Reinitialization of Particle Swarm with Probability. The trajectory of particle is largely affected by best and all the best . At the beginning of iteration, the convergence rate of swarm is fast, but it is slow at the later stage which has high risk of converging to local optimum. For overcoming this shortcoming, each particle in each generation is reinitialized with small probability (Figure 3). In Figure 3, is the probability of reinitialization for current particle swarm, with its calculation based on (7). At the early stage of iteration, there are many chances for particles to approximate the optimal solution, so that the probability of reinitialization for whole swarm is small. In the later stage, the probability of reinitialization is increased, it can largely avoid the particles fall into the local optimum. The parameter currun denotes the current iteration, and is a small random probability (in our case, = 0.3). When the better particle is found after reinitialization, update the current best and best : (5) Mutation of the Potential Global Optimal Solution. If the global optimal particle best is not constantly improved for a long time, it is necessary to make variation for it to jump out from the local optimal point. In our case, when best is invariant in 10 iterations, its binary coding vector will be mutated with a random probability. If a better particle is found, best is updated again.  Figure 4(a) shows an example of several symptoms in the same group being mutually exclusive. That means if the lip color of a patient is red, the rest of the three colors will not appear with him/her. We name a new feature with five possible discrete values ( = 0, 1, 2, 3, 4) to simplistically represent the combined meaning of four original symptoms. According to Figure 4(a), the states of lip color for a patient are presented with a binary vector (length is four) in original TCM data, while we can represent it with a single value , where ∈ {0, 1, 2, 3, 4}. If equals zero, that means all four symptoms are not positive. Otherwise, one of the symptoms appears positive. As for the mapping between four symptoms and four discrete values (1, 2, 3, and 4), we follow a simple rule to assign each candidate value to a possible level of this symptom: the larger discrete value of indicates that much more patients are positive on this clinical symptom. We count the statistic distributions of all the samples on these four symptoms, respectively, and map each discrete value to a symptom of lip color according to the mean value of positive scores on each symptom.

Example 2.
The symptoms in the same group are not mutually exclusive. Figure 4(b) shows three clinical symptoms of emotion: irritability, depression, and sigh. These symptoms could be positive simultaneously on a patient. For example, the clinical symptoms of emotion for a patient are denoted by a vector = [2, 0, 1] in original data, which means two emotion-related positive symptoms appeared with him/her. In this case, a new syndrome feature is extracted from , where = sum( ) = 3. Therefore, if a patient has several positive symptoms which belong to the same syndrome, cumulative summation is a feasible strategy to get a total positive strength on this syndrome.

Experiment Design.
First, we proposed a feature-ranking strategy for association analysis between individual syndrome and positive score (target value) with function fr( * ): mcc ( , ps) = 1 − corr ( , ps) Combining ((4)-(5), (8)- (9)), we can determine the fitness function in the proposed PSOHFS model for feature subset optimizing. The function corr( , ps) is the correlation coefficient between feature and target value (ps). Function pe( ) denotes the predicting error of LSSVR model with all the features except . If the predicting error is obviously increased after moving out from the whole feature set, it indicates the feature is high predictive. The smaller value of fr( ), the higher-ranked feature will be. The result of feature ranking can provide a reference about the importance of each syndrome to positive score. Secondly, our developed CBPSO algorithm was applied at the syndrome level for feature selection. Different swarm size and the number of iterations were chosen to test the searching performance of the proposed CBPSO. And then, the predicting performance of the optimal syndrome subset (OPS) by proposed model was further validated. On the one hand, we employed two well-established feature selection methods to compare them with our proposed PSOHFS model: (1) correlation-based filter method (CFM) [14,23] and (2) PSO-based wrapper method (PWM) [14]. These standard approaches were applied on original symptom features. On the other hand, we further validated the performance of OPS by feature ranking on the syndrome feature level. Two types of syndrome subsets were selected to compare: (1) full collection with all the 27 syndromes (FCS) and (2) filterbased syndrome set by feature ranking via (8). Here, we set threshold 0.8 and 0.9 to get two potential syndrome subsets: FRS1 and FRS2.
Finally, based on the optimal potential syndrome subset inferred by our PSOHFS model, Bayesian networks were constructed, respectively, at the symptom and syndrome levels. On the one hand, the global Bayesian network on potential syndromes was inferred using GES algorithm [24]. Such coarser-grained network can roughly reveal the causal relationships among these potential syndromes of this cancer. Before structure learning of global network, the processed TCM dataset (TD) in Section 3.1 should be firstly discretized according to  ) is used to estimate the optimal intervals of discretization for the sample of th syndrome. If the number of positive levels for a syndrome is larger than four, the discretization is necessary on this syndrome. On the other hand, we chose three syndromes as examples to construct local networks using GES algorithm (Table 4). When a network structure is learned, Maximum Likelihood Estimation (MLE) is utilized to compute all the conditional probability tables. Then, the probability inference could be achieved using inference algorithm, such as junction tree method [25,26].

Experimental Parameters.
The simulating experiments were implemented under the environment of MATLAB2011a with Intel Core i5-2410 CPU @ 2.3GHZ, 4 GB RAM. In the LSSVR regression model, Gaussian RBF kernel is employed, and the kernel parameters 2 and should be determined firstly. Currently, many approaches have been applied in parameter optimization of LSSVR, such as grid search [27], cross-validation [28,29], genetic algorithm (GA) [30], and simulated annealing algorithm [31]. In our study, grid search was selected to determine the parameters in the range To evaluate the accuracy of prediction, three statistical metrics are widely employed: (1) mean square error (MSE), (2) root mean square error (RMSE), and (3) mean relative percentage error (MRPE). In (11), where and are the observed value and predicted value, the smaller MSE, RMSE, and MRPE are, the better the LSSVR model will be: In our experiment, we used MSE to calculate the values of function pdterror( * ) and pe( * ). Moreover, the Matlab Bayes Net Toolbox FullBNT-1.0.7 [32] and BNT Structure Learning Package BNT SLP 1.5 were, respectively, used in the Bayesian network structure learning, parameters learning, and probability inference. The probability distribution between nodes in a Bayesian network could be computed according to the inferred network structure and conditional probability tables. Table 1 shows the results of association analysis between individual syndromes and positive score. mcc( , ps) reflects the predicting performance of feature to ps (positive score). The smaller the value of mcc is, the more important the feature will be. The value of pe( ) indicates predicting error of LSSVR model based on all the features except ; it is measured by MSE. Here, it is obvious that the higherranked features have lower values of fr( ). We clearly see some important syndromes are high predictive, such as "facial features, " "skin of the limbs, " "diet, " "sternocostal and abdominal pain, " and so forth.

Results and Discussion
Our developed CBPSO algorithm was applied to search the optimal syndrome subset on the processed TCM dataset. Assigning different swarm size and the number of iterations, this CBPSO algorithm shows excellent convergence performance ( Figure 5). Different assignments of parameters for CBPSO finally got the same optimal solution:  001101111111111111111111111. It means the potential syndrome subset containing 24 syndromes is a steady solution for this NP-hard problem ( Table 2). These 24 syndromes reflect many cancer-related parts of body or aspects of observation, which are helpful to clinically diagnose HCC. Now, two well-established feature selection methods were introduced to be compared with our proposed PSOHFS model: (1) correlation-based filter method (CFM) [14,23] and (2) PSO-based wrapper method (PWM) [14]. The first one uses correlation-based feature ranking as the principle criteria for feature selection by ordering. The second one uses standard BPSO algorithm to search an optimal feature subset. These two methods were all applied on the original symptom features. For CFM, we used 15% and 30% topranked features to validate its performance, while, for PWM, we set population size equal to 100 and iterations equal to 100 and 200. Table 3 shows the error of prediction of the LSSVR model based on candidate optimal feature subsets.
Five candidate feature subsets were searched by the above two methods and PSOHFS model, respectively. In Table 3, the values of MSE, MRSE, and MRPE were calculated based on LSSVR by 5-fold cross-validation.
Comparing the values of MSE, RMSE, and MRPE in Table 3, we can see that the optimal syndrome set (OPS) searched by our PSOHFS model has the obvious superiority in the predicting performance. The dimension of the PSOHFS-based optimal syndrome subset equals 24, which is significantly smaller relatively to the dimension of the original symptoms (147). Because CFM and PWM work directly on the original high dimensional feature space, it is hard for them to achieve an optimized prediction performance and the dimension of potential feature subset, simultaneously. PWM searches for the optimal solution depending on the evaluation of regression model, so the optimal feature subset from PWM is more predictive than CFM's. However, standard wrapper-based methods do not optimize the size of optimal feature subset. CFM got the worst result is reasonable because the correlation measurement can only detect linear dependencies between variable and target. Next, we further validate the performance of OPS on the syndrome level. Two types of syndrome subsets were selected to compare: (1) full collection with all the 27 syndromes (FCS) and (2) filter-based syndrome subset by feature ranking via (8). Here, we chose threshold 0.8 and 0.9 to get two potential syndrome subsets: FRS1 and FRS2 (Table 1). In Table 4, we obviously find OPS can get good balance between the dimension and predicting performance. The verification on FRS1 and FRS2 proves the fact that, although feature-ranking methods run quickly, they still easily lead to worse results because feature-ranking filter ignores the possible interactions and dependences among the features [29]. The difference between Tables 3 and 4 indicates the feature selection on a reduced feature space of original dataset potentially obtains a better solution. 24 potential syndrome features could quickly diagnose the positive level of HCC patients with high accuracy. Our result suggested that "lip color, " "tongue color, " and "coated tongue color" could be ignored during the process of prediction because they are weak predictive features for discriminating these HCC samples.
Finally, based on the hierarchical feature representation and the result of feature selection on syndromes, Bayesian network on two layers was constructed and the conditional probability tables were inferred. Here, we picked up three cases to explain what we can obtain from the Bayesian network analysis in the symptom and syndrome feature space (Table 5). Figure 6(a) shows the Bayesian network structure of "emotion" syndrome. We can clearly see that there is a causal relationship between "depression" and "sigh. " When a patient is depressive, sigh is a usual symptom with him/her. While "irritability" seems to reflect inversely comparing to "depression"; therefore it is an independent node in this inferred network structure. The conditional probability tables of "emotion" are shown as in Supplementary Table S1A-S1C. For example, P("irritability" = 0, "depression" = 1, "sigh" = 1) = 0.027 suggests the probability of the clinical symptoms "depression" and "sigh" is positive on a patient. Figure 6(b) shows the network structure of "cardiothoracic condition" syndrome. From Figure 6(b), "tightness in the chest" might lead to three other clinical symptoms: "shortness of breath, " "palpitations, " and "pain in chest. " The conditional probability tables of "cardiothoracic condition" are shown in Supplementary Table S2A-S2D. For example, P("tightness in the chest" = 1, "shortness of breath" = 1, "palpitations" = 1, "pain in chest" = 0) = 0.01143. Similarly, Figure 6(c) shows the network structure of "diet" syndrome. The conditional probability tables of "diet" are shown in Supplementary Table  S3A-S3G. At last, Figure 7 represents the global network on 24 potential syndromes. There are three subnetwork modules and six independent nodes in Figure 7. All the relationships among these syndromes were represented. Their conditional probability tables were listed in Supplementary Table SS1-SS24. Based on the hierarchical feature representation, the Bayesian networks potentially provided us with useful knowledge with multi-granularity. From Table 6, we can clearly see that the computational cost of network structure learning is sharply increased when the number of nodes in the network is increasing. It further proves that if we construct Bayesian network on 147 original clinical symptoms directly, it will    meet unimaginable computational complex; therefore, our method proposed in this paper provided a good solution.

Conclusions
In this paper, a particle swarm optimization-based hierarchical feature selection (PSOHFS) model was proposed to infer potential clinical features of HCC on a Traditional Chinese Medicine dataset which was collected from 120 patients. The PSOHFS model firstly arranged all the 147 original symptoms into 27 groups according to the categories of clinical symptoms and extracted a new syndrome feature from each group. The raw TCM clinical dataset was represented in a reduced feature space so that we can build a hierarchical feature representation pattern with a tree structure. Based on such hierarchical feature graph, we reached two aims: (1) based on a significant reduced feature space, the feature selection can be easily realized, and the optimal feature subset could diagnose patient samples efficiently; (2) we constructed Bayesian network on symptom and syndrome levels. A global Bayesian network for all the potential syndromes roughly described the relationships among the main important aspects of HCC. While each local network was constructed for the symptom features in the same group, the causal relationships among them could be inferred. In our simulating experiment, our CBPSO algorithm in PSOHFS model discovered an optimal syndrome subset of HCC, which included 24 syndromes. With a LSSVR regression model built by these 24 potential syndromes, the diagnosis accuracy of HCC is high and computational cost is sharply reduced. The significance of the proposed model is as follows: (1) feature selection is implemented on a reduced feature space, so that the dimension of optimal feature subset is smaller; (2) the fitness function in CBPSO algorithm optimizes the predicting performance and the correlation between features and target variable. Based on the results of feature selection, we further achieved the Bayesian network construction at both syndrome and symptom levels to explain the relationships among all the nodes and the probability inference could be computed based on learned network structure and conditional probability tables. However, our model also has some shortcomings: (1) most of syndrome groups were aggregated from the clinical symptoms observed from the same parts of body, while much more evidence proved that there are significant relationships between symptoms which describe different parts (aspects) of body; (2) we did not study the relationships of clinical symptom features which belong to different groups. In the future, we will collect more clinical samples of HCC to deeply analyze the correlation between any clinical features. Also, some high-predictive clinical features inferred in this study need to be validated further in other TCM clinical datasets. If we can discover and validate some high-predictive clinical features in the next step of research, that might be the significant phenotypes of this cancer.