The classification of different cancer types is of great significance in the medical field. However, the great majority of existing cancer classification methods are clinical-based and have relatively weak diagnostic ability. With the rapid development of gene expression technology, it has become possible to classify different kinds of cancers using DNA microarrays. Our main idea is to approach the problem of cancer classification using gene expression data from a graph-based view. Based on a new node influence model that we propose, this paper presents a novel high-accuracy method for cancer classification, which is composed of four parts: the first is to calculate the similarity matrix of all samples, the second is to compute the node influence of the training samples, the third is to obtain the similarity between every test sample and each class using a weighted sum of node influence and the similarity matrix, and the last is to classify each test sample based on its similarity to every class. The data sets used in our experiments are breast cancer, central nervous system, colon tumor, prostate cancer, acute lymphoblastic leukemia, and lung cancer. Experimental results showed that our node influence based method (NIM) is more efficient and robust than the support vector machine,

Cancer research is one of the major research areas in the medical field. In cancer, cells divide and grow uncontrollably, forming malignant tumors and invading adjacent parts of the body. The cancer may also spread to more distant parts of the body through the lymphatic system or bloodstream. Many factors are thought to increase the risk of cancer, including tobacco use, dietary factors, certain infections, exposure to radiation, lack of physical activity, obesity, and environmental pollutants. Apple cofounder Steve Jobs, for example, died of pancreatic cancer. Any method that benefits cancer treatment deserves attention.

The biggest challenge in cancer treatment is developing individualized treatment programs for specific tumor types. Traditional cancer diagnosis depends on the tissue of origin of the tumor cells, cell morphology, protein markers, and biological behavior, which do not adequately reflect the real state of the tumor; as a result, it is sometimes difficult to make a correct diagnosis or prognosis.

In order to gain a better insight into the problem of cancer classification, systematic approaches based on global gene expression analysis have been proposed [

Let

The centrality of nodes, or the identification of the importance of nodes, is a key issue in network analysis. Degree is the simplest node centrality measure, using only the local structure around a node. In an undirected network, the degree is equal to the number of edges a node has. In a directed network, a node may have a different number of outgoing and incoming edges, and therefore degree is split into out-degree and in-degree, respectively. The degree centrality of a vertex

Closeness is defined as the inverse of farness, which in turn, is the sum of distances to all other nodes [

Another famous node centrality is betweenness [

K-shell [

A schematic representation of the K-shell.

Of the four node centralities above, degree is the most intuitive and simple, but it considers only local information. Both betweenness and closeness use the shortest paths between every pair of nodes in the network as the primary factor. The K-shell approach is based on node degree, but it takes a global perspective.
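To make the contrast between local and global measures concrete, the following is a minimal sketch of degree, closeness, and K-shell on a small undirected graph. It is an illustration only: the function names and the example graph are hypothetical, the graph is assumed to be connected and unweighted, and betweenness is omitted for brevity.

```python
from collections import deque

def degree(adj):
    """Degree of every node: the number of edges incident to it."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(adj):
    """Inverse of farness (sum of distances to all other nodes).
    Assumes a connected graph."""
    result = {}
    for v in adj:
        farness = sum(bfs_distances(adj, v).values())
        result[v] = 1.0 / farness if farness > 0 else 0.0
    return result

def k_shell(adj):
    """K-shell index by iterative pruning: repeatedly remove nodes whose
    remaining degree is at most k, then increase k."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    shell = {}
    k = 0
    while remaining:
        peel = [v for v, nbrs in remaining.items() if len(nbrs) <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            for w in remaining[v]:
                remaining[w].discard(v)
            del remaining[v]
            shell[v] = k
    return shell

# Hypothetical example: a triangle a-b-c with a tail c-d-e.
example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
```

In this example, nodes `a`, `b`, and `d` all have degree 2, yet `d` falls in the 1-shell while `a` and `b` fall in the 2-shell, which illustrates how K-shell uses degree from a global rather than a local perspective.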

In our opinion, the evaluation of node centrality can start from the influence of a node on another node. Now consider the influence of node

When

To facilitate the proof, we introduce a useful property of the adjacency matrix. That is, the

As we consider the undirected network,

There is a special case that the largest absolute eigenvalues of matrix

When

From the proof of Theorem

From Theorem

For example, the network shown in Figure

Node influence centrality example network.

According to (

Influence of node 4 on each node.

We also calculated the influence of each node on node 4, curves as shown in Figure

Influence of each node on node 4.

Let

For example, the distance matrix of prostate cancer [

Description for cancer gene data sets.

Dataset | Number of samples | Number of genes | Number of classes | Test method |
---|---|---|---|---|
Breast cancer | 97 | 24481 | 2 | 78 train, 19 test |
Central nervous system | 60 | 7129 | 2 | LOOCV |
Colon tumor | 62 | 2000 | 2 | LOOCV |
Prostate cancer | 136 | 12600 | 2 | 102 train, 34 test |
Acute lymphoblastic leukemia | 327 | 12558 | 7 | 215 train, 112 test |
Lung cancer | 181 | 12533 | 2 | 32 train, 149 test |

Distance matrix of prostate cancer.

Similarity matrix of prostate cancer.

Node influence centrality plays a significant role in our graph-based method for cancer classification. Let

The similarity matrix is used twice in the seven main steps of NIM1: first in Step 4, to obtain the adjacency matrix, and again in Step 6, to calculate the similarity between every test sample and each class. We believe that these two steps can use different similarity matrices, which results in node influence based method 2 (NIM2). Only two main steps of NIM2 differ from NIM1, as shown below.

We use six data sets to validate NIM1 and NIM2. All six are publicly available gene expression data sets from DNA microarrays that are widely used by researchers for cancer classification experiments. All the data sets are used to predict various kinds of cancers by measuring gene sequences and are outlined in Table

The first data set is breast cancer [

The second data set is central nervous system [

The third data set is colon tumor [

The fourth data set is prostate cancer [

The fifth data set is acute lymphoblastic leukemia [

The sixth data set is lung cancer [

If the data set has not been divided into a training set and a testing set, we adopt leave-one-out cross validation (LOOCV) to validate NIM1 and NIM2. LOOCV uses a single observation from the original sample as the validation data and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. This is the same as a K-fold cross validation with K equal to the number of observations in the original sample.
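The leave-one-out procedure can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the function names are hypothetical, and the nearest-neighbor classifier is only a stand-in so that the loop has something to call.

```python
def loocv_splits(n):
    """Yield (train_indices, test_index) pairs, holding out each sample once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

def loocv_accuracy(samples, labels, classify):
    """Fraction of held-out samples that classify() labels correctly.
    classify(train_x, train_y, x) must return a predicted label."""
    correct = 0
    for train_idx, test_idx in loocv_splits(len(samples)):
        train_x = [samples[j] for j in train_idx]
        train_y = [labels[j] for j in train_idx]
        predicted = classify(train_x, train_y, samples[test_idx])
        correct += predicted == labels[test_idx]
    return correct / len(samples)

def nearest_neighbor(train_x, train_y, x):
    """Stand-in classifier (an assumption for this sketch): 1-nearest
    neighbor under squared Euclidean distance."""
    def sq_dist(j):
        return sum((a - b) ** 2 for a, b in zip(train_x[j], x))
    return train_y[min(range(len(train_x)), key=sq_dist)]

# Hypothetical toy data: two well-separated one-dimensional classes.
toy_samples = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_accuracy = loocv_accuracy(toy_samples, toy_labels, nearest_neighbor)
```

Any classifier with the same three-argument signature can be passed in place of `nearest_neighbor`, so the same loop could in principle wrap NIM1 or NIM2.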

Most proposed cancer classification methods come from the statistical and machine learning areas, ranging from the older nearest neighbor analysis to the newer support vector machines. No single classifier is superior to the rest. Some of the methods only work well on binary-class problems and are not extensible to multiclass problems, while others are more general and flexible. The methods we choose for comparison are all among the top 10 algorithms in data mining, mentioned in [

Experimental results for cancer gene datasets.

Due to high dimension, small sample size, and nonbalanced distribution, traditional classification algorithms do not obtain high accuracy in these data sets. From Figure

NIM2 is an improved version of NIM1 and has one more parameter. NIM1 can be viewed as a special case of NIM2 when

Traditional classification methods usually have many parameters that need to be set before application, and these parameters are closely related to performance. However, there is little information on how to set them, and they are usually chosen based on experience. We therefore try to propose an algorithm with as few parameters as possible. NIM1 has only one parameter

The parameter setting for

Parameter setting for

Dataset | Minimum value | Maximum value | Change interval | Number of experiments |
---|---|---|---|---|
Colon tumor | 0.501 | 1.5 | 0.001 | 1000 |
Acute lymphoblastic leukemia | 0.01 | 10 | 0.01 | 1000 |
Lung cancer | 0.01 | 10 | 0.01 | 1000 |

Parameters setting for

Dataset | Minimum values | Maximum values | Change intervals | Number of experiments |
---|---|---|---|---|
Colon tumor | 0.501, 0.501 | 1.5, 1.5 | 0.001, 0.001 | 1000000 |
Acute lymphoblastic leukemia | 0.501, 0.501 | 10.5, 10.5 | 0.01, 0.01 | 1000000 |
Lung cancer | 0.501, 0.501 | 10.5, 10.5 | 0.01, 0.01 | 1000000 |
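As a check on how the experiment counts in the two parameter tables relate to the ranges and change intervals, the sweep grids can be enumerated as below. This is a sketch under an assumed convention: the grid starts at the minimum and steps upward without exceeding the maximum. The `grid` helper is hypothetical, and `Decimal` is used only to avoid floating-point drift when counting steps.

```python
from decimal import Decimal

def grid(minimum, maximum, step):
    """Evenly spaced parameter values from minimum up to (at most) maximum.
    Arguments are decimal strings so that the step count is exact."""
    lo, hi, st = Decimal(minimum), Decimal(maximum), Decimal(step)
    count = int((hi - lo) // st) + 1
    return [float(lo + i * st) for i in range(count)]

# Single-parameter sweeps.
colon_grid = grid("0.501", "1.5", "0.001")
leukemia_grid = grid("0.01", "10", "0.01")

# Two-parameter sweeps: every pair of values from the same per-parameter grid.
two_param_grid = grid("0.501", "10.5", "0.01")
two_param_experiments = len(two_param_grid) ** 2
```

Under this convention each single-parameter sweep has 1000 points, and each two-parameter sweep has 1000 x 1000 = 1000000 combinations, matching the tables. For the 0.501 to 10.5 range with interval 0.01, the last grid point is 10.491, so the stated maximum is not itself reached.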

NIM1 results in colon tumor with the variation of

NIM1 results in ALL with the variation of

NIM1 results in lung cancer with the variation of

NIM2 results in colon tumor with the variation of

NIM2 results in ALL with the variation of

NIM2 results in lung cancer with the variation of

Graphs are a powerful representation formalism that has been widely employed in machine learning and data mining. In order to gain deep insight into the cancer classification problem, we analyze the problem from a graph-based view. Let

In the method NIM1, after an appropriate distance metric is selected, the graph (or network) of all samples is created by computing the similarity matrix. Then the node influence of the training samples is calculated. Treating node influence as a weight, the similarity between every test sample and each class is obtained. Finally, every test sample is classified according to its similarity to each class.
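The final two steps (weighting by node influence and assigning each test sample to its most similar class) can be sketched as follows. This is an illustration under stated assumptions rather than the exact NIM1 procedure: the similarity values and node influences are taken as given inputs, the class score is assumed to be a plain unnormalized weighted sum, ties are assumed to go to the first class in sorted order, and all names and numbers are hypothetical.

```python
def classify_by_influence(similarity, influence, train_labels):
    """Assign each test sample to the class with the largest weighted sum.

    similarity[i][j]: similarity between test sample i and training sample j
    influence[j]:     node influence of training sample j (used as a weight)
    train_labels[j]:  class label of training sample j
    """
    classes = sorted(set(train_labels))
    predictions = []
    for row in similarity:
        scores = {c: 0.0 for c in classes}
        for s, w, label in zip(row, influence, train_labels):
            scores[label] += w * s
        # max() returns the first maximal class, so ties go to the
        # first class in sorted order.
        predictions.append(max(classes, key=lambda c: scores[c]))
    return predictions

# Hypothetical toy input: two test samples, three training samples.
toy_similarity = [
    [0.9, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
toy_influence = [1.0, 0.5, 2.0]
toy_train_labels = ["A", "A", "B"]
toy_predictions = classify_by_influence(toy_similarity, toy_influence, toy_train_labels)
```

Here the first test sample scores 1.3 for class A against 0.2 for class B, and the second scores 0.25 against 1.4, so the predictions are A and B respectively. Whether class scores should be normalized by class size is left open in this sketch.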

Furthermore, we also propose NIM2, which is an improved version of NIM1. NIM1 can be viewed as a special case of NIM2 when

Due to high dimension, small sample size, and imbalanced distribution, SVM, KNN, C4.5, Naive Bayes, and CART do not obtain high accuracy on these cancer gene data sets. The experimental results on the six cancer gene data sets show that NIM1 and NIM2 are more efficient than these traditional algorithms. Finally, we also discuss the parameters of NIM1 and NIM2, which play an important role in their performance.

The authors declare that there is no conflict of interests regarding the publication of this paper.

The research work was the partial achievement of Project 2013CB329504 supported by National Key Basic Research and Development Program (973 program) and STD of Zhejiang (2012C21002).