DriverFinder: A Gene Length-Based Network Method to Identify Cancer Driver Genes

Integration of multi-omics data of cancer can help people to explore cancers comprehensively. However, with a large volume of different omics and functional data being generated, there is a major challenge to distinguish functional driver genes from a sea of inconsequential passenger genes that accrue stochastically but do not contribute to cancer development. In this paper, we present a gene length-based network method, named DriverFinder, to identify driver genes by integrating somatic mutations, copy number variations, gene-gene interaction network, tumor expression, and normal expression data. To illustrate the performance of DriverFinder, it is applied to four cancer types fromThe Cancer Genome Atlas including breast cancer, head and neck squamous cell carcinoma, thyroid carcinoma, and kidney renal clear cell carcinoma. Compared with some conventional methods, the results demonstrate that the proposedmethod is effective.Moreover, it can decrease the influence of gene length in identifying driver genes and identify some rare mutated driver genes.


Background
At present, understanding the mechanisms of cancer development and uncovering actionable target genes for cancer treatment are still difficult challenges.With rapid advances in high-throughput sequencing technologies, some large-scale cancer genomics projects, such as The Cancer Genome Atlas (TCGA) [1] and International Cancer Genome Consortium (ICGC) [2], have produced different omics data including a rich dataset of whole-exome and RNA sequence data [3,4], which provides chances to allow us to accurately infer tumorspecific alterations [5] and help in precision medicine in cancers treatment [6,7].However, many of genetic changes represent neutral variations that do not contribute to cancer development which are called passenger mutations [6,8].Only a few alterations are causally implicated in the process of oncogenesis and provide a selection growth advantage which are referred to as driver mutations [8,9].Hence, it is a major challenge to distinguish pathogenic driver mutations from the so-called random mutated passenger mutations [10].
Previously, there were multiple computational methods to identify driver genes based on gene mutational frequency (termed as frequency-based method) in a large cohort of cancer patients [11][12][13].However, the infrequently mutated drivers are inclined to be ignored by frequency-based methods.Also mutational heterogeneity in cancer genomes is an important factor affecting the performance of frequencybased methods [14].In addition, further studies have realized that driver mutations or genes disrupt some cellular signaling or regulatory pathways which promote the progression of cancer [15,16].In fact, genes affect various biological processes by related complex networks instead of acting in isolation in cancer [17].In addition, the cancer is a result of interplay of various types of genetic changes which form complex and dynamic networks [18].Thus, many networkbased and pathway-based approaches have been proposed to prioritize driver mutations and genes.For instance, Dendrix was a pathway-based algorithm for discovery of mutated driver pathways in cancer using somatic mutation data [19].After that, Multi-Dendrix algorithm was proposed to extend 2 Complexity Dendrix method in order to guarantee yielding the optimal set of pathways [20].MDPFinder was also a pathway-based method to solve the so-called maximum weight submatrix problem proposed in Dendrix method [19] which was aimed at identifying mutated driver pathways from mutation data in cancer [21].And Zhang et al. proposed CoMDP method which focused on cooccurring driver pathways rather than single pathway [22].In addition, iMCMC was a networkbased method by integrating somatic mutation, CNVs, and gene expressions without any prior information [6].Another method, DawnRank, was also a network-based algorithm to discover personalized causal driver mutations by ranking mutated genes according to their potential to be drivers based on PageRank algorithm [23].Bashashati et al. developed a method called DriverNet which comprehensively analyzed genomes and transcriptomes datasets to identify likely driver genes in population-level by virtue of their effect on mRNA expression networks and also reveal the infrequent but important genes and patterns of pathway [10].VarWalker was a personalized network-assisted approach to prioritize wellknown, infrequently mutated genes and interpret mutation data in NGS studies [24].
Although some proposed methods can determine potential drivers, most of them do not consider the influence of gene length to the results; in other words, they may identify some likely false positive driver genes according to known driver genes datasets.And it has been indicated that driver genes are related to not only mutation frequency, but also mutation context or gene length [25] and variants tend to arise more frequently in long genes [26].For example, TTN, the longest gene in human genome, accumulates many variants just due to its length [24,26].TTN may be selected in many computational methods; however, it usually serves as passenger gene [27].This phenomenon indicates that many current methods have a strong preference towards identifying long genes [24].So it is essential to filter those frequently mutated genes due to long length.VarWalker takes into account the gene length; however, it does not consider the influence of mutation to expression.In addition, some genomic variations in a gene may lead to extreme changes in some outlying genes expression level which are associated with the mutated gene through genegene interaction network or pathways and these outlying genes are often called outliers [10].And, it has been proved that cancer-associated genes are more effectively detected by interindividual variation analysis rather than only calculating differences in the mean expression across different samples [28].That is, the outliers are related to not only tumor expression distribution but also the corresponding normal expression distribution.Moreover, various cellular processes are often affected by genes in complex networks rather than genes acting in isolation [17] and cancer is also related to a set of genes interacting together in a molecular network [29].So networks usually provide a convenient way to explore the context within which single gene operates [30].It should be noted that prior knowledge such as proteinprotein interaction (PPI) network can provide some useful information; however, prior knowledge is limited and may lead to discarding some important information in some instances [22].In our previous work [31], we only consider the prior information of gene-gene interaction network.So it is essential to extend gene-gene interaction network.
In this study, we proposed an integrated framework named DriverFinder to identify driver genes by integrating somatic mutation data, copy number variations (CNVs), tumor and normal expression data, and gene-gene interaction network.Firstly, the gene length is taken into consideration to filter some frequently genes because of long length.Moreover, in this method, we integrated tumor expression and normal expression to construct outlier matrix rather than only using tumor expression.Furthermore, to increase accuracy of identifying drivers, we calculated Pearson correlation coefficients (PCC) of genes and combined them with PPI network to construct a new dynamic interaction network for each cancer type.In order to estimate the performance of DriverFinder method, we applied it to four different large-scale TCGA datasets, including breast cancer (BRCA), head and neck squamous cell carcinoma (HNSC), thyroid carcinoma (THCA), and kidney renal clear cell carcinoma (KIRC), and compared it with MUFFINN [32], DriverNet [10], and frequency-based method.The results demonstrated that DriverFinder can identify drivers effectively and decrease false positive, that is, filtering some long and frequently mutated but functionally neutral genes.

Materials and Methods
We proposed DriverFinder to identify cancer driver genes by integrating multiomics data (Figure 1).The detailed description of it is shown in Figure 1.
As is shown in Figure 1, the first step is to estimate the occurrence of mutation events in the genome by fitting them into a generalized additive model [24].Then a weighted resample-based test is used to filter long passenger genes according to approximated probabilities based on benchmark of coding gene length.Secondly, for gene expression, we compare expression of tumors with normal samples to determine the outlying genes.Then a gene-gene interaction network combined with prior knowledge and PCC among expression data is used to relate mutations to their consequent effect on gene expression.The associations between mutated and outlying genes are formulated using a bipartite graph where the left nodes indicate the genes mutation status and the right nodes indicate the outlying expression status in each patient.For each patient, there is an edge between   and   if the left partition gene   is mutated and the right gene   is an outlier in some samples and they also have high correlations in gene-gene interaction network.Secondly, greedy algorithm is used to prioritize mutated genes based on the coverage.In each iteration of the greedy algorithm, the mutated gene on the left partition of the bipartite graph which relates to the most outlying expression genes is chosen.Until all the outlying expression genes are covered by the least mutated genes on the left, iteration is stopped.Then the mutated genes are ranked based on their coverage.So genes with the most outlying expression are appointed as candidate driver genes.Finally, the statistical significance test based on null distribution is applied to these putative driver genes.
Signi cant genes A generalized additive model is performed on somatic mutation data to filter mutated genes which occurred at random due to long length.After that, the residual significant genes are combined with CNV to construct mutation data.The gene-gene interaction network is constructed by integrating prior gene-gene interaction network and Pearson correlated coefficient network.And the outlying matrix is constructed by analyzing interindividual variation in tumor and normal expression.(b) Given mutation data, outlying data, and gene-gene interaction network, the bipartite graph is obtained.The black nodes on the left indicate mutated genes and the blue nodes on the right represent outlying expression genes.(c) Candidate genes are obtained by greedy algorithm.The more outlying expression events the gene overlaps, the higher the gene ranks are.(d) Statistical test is performed on candidate genes to select important putative drivers by  value < 0.05.

Construction of Mutation Matrix 𝑀.
In terms of somatic mutation, we downloaded it from TCGA data portal (https:// cancergenome.nih.gov/) and only considered the data of level 2. As a matter of fact, Jia et al. explored long genes in two datasets and examined gene length effects by plotting the proportion of mutated genes versus their complementary DNA (cDNA) length [24].They discovered that two sets of mutated genes were positively correlated with cDNA length, and longer genes were more likely to be mutated genes [24].Hence, frequency-based methods may be inclined to select long genes as drivers.So, it is necessary to perform gene length-based filtration.In this study, in order to accurately estimate the mutation rate for each gene, generalized additive model was adopted to compute a probability weight vector (PWV) for the mutation genes of each sample [24].
Here, only somatic mutated genes mapped onto the benchmark of consensus coding sequences (CCDS) dataset [33] which contains a core set of consistently annotated and high quality human and mouse protein coding regions are reserved.And those mapped genes in this study have been allocated cDNA length based on their coding sequences [24].Assuming the vector  as the cDNA gene length and the following model is used to assess the probability of a gene to be mutated, where (⋅) represents an unspecified smooth function and  = num of mutant genes/total num of CCDS genes represents the proportion of mutant genes in the specific samples [24].Each gene then would be assigned a PWV value.Afterwards, a resampling test based on the probability of each gene is performed 1000 times in each sample and the null distribution is that in genes mutations occur at random.Then we define mutation frequency as  = num of selecting the genes in 1000 times 1000 , where  represents mutation frequency.Next we filter out genes with frequencies ≥ 5% in random datasets unless they are Cancer Gene Census (CGC) genes.Then a list of significant mutant genes  is obtained.As for CNVs, which have been processed by GISTIC 2.0, they are acquired from http://gdac.broadinstitute.org/runs/(v2014 10 17).There are five types of copy number including amplification, gain, diploid, heterozygous deletion, and homozygous deletion in this dataset.Here, we only screen out amplifications and homozygous deletions to construct CNV matrix .Finally, the significant mutated gene list  and CNV matrix  are combined to generate the patientmutation binary matrix , in which   = 1 indicates there is genetic alteration, that is, mutation, amplification, or homozygous deletion, in the jth gene of the th sample.Otherwise,   = 0.

Construction of Expression Outlier Matrix 𝐸.
Gene expression data (level 3) including tumor and normal expression data is also downloaded from TCGA data portal.Moreover, some studies have shown that assessment of interindividual variation of gene expression performs well in predicting cancer-associated genes [28].So in this study, the outlying matrix is determined based on analysis of interindividual variation in tumor and normal expression rather than differences in mean expression levels or only tumor expression distribution [28].For each type of cancers, there are two expression datasets (, ) and (, ) which indicate the real-valued expression measure of gene  in sample  of tumor and normal datasets, respectively.For each gene, the outliers in this study are defined as tumors whose expression levels are outside the four-standard deviation range of the expression values of the gene across all the normal samples [28].It is formularized as in which () is the mean expression and sd() indicates the standard deviation of gene expression in normal samples.
Then the binary patient-outlier matrix  is constructed and the value of (, ) indicated whether gene  in patient  is an outlier among the population-level distribution for that gene.

Gene-Gene Interaction Network.
It is noteworthy that most prior knowledge such as PPI network or pathways is incomplete and a great deal of knowledge about biological pathways remains unclear [22].In our previous study [31], we relied on prior knowledge about gene influence graph integrated from known gene networks [10] which often leaved out some likely important nodes.So in this study, we constructed a new dynamic gene-gene interaction network by incorporating gene-gene correlation coefficients with prior knowledge.That is, firstly, PCCs between pairwise genes are obtained by normalized tumor expression.Acceptable correlations with PCC > 0.75 are often considered high correlated and selected [34].Here, we choose 0.8 as threshold to ensure selected pairwise genes being higher correlated and increase reliability.The edges with PCC > 0.8 are selected and set to 1, otherwise 0. In order to retrieve some important prior knowledge simultaneously, the known gene network (termed the influence graph in DriverNet [10]) is mapped onto the binary matrix obtained from correlation coefficients matrix.So a new and dynamic gene-gene interaction network (termed as  after) included prior knowledge and deduced knowledge is established.When there is a correlation, that is, PCC > 0.8 or 1 in influence graph between gene  and gene ,   = 1; otherwise,   = 0.

Significance Estimation.
With the aim of testing the statistical significance of the driver candidates, we apply a randomization framework.The algorithm is run on the random  permuted original datasets (mutation data and outlier data).Then we assess the significance by seeing if the results on real data are significantly different from the results on random datasets and obtain the  value of each candidate drivers.The statistical significance of  is defined as follows [10]: where  is permutation times and   is the number of candidate drivers in the th run of the approach.COV  is the coverage of  calculated from our method.Here we choose  = 50.The statistical significance of  means that the times of the observed driver genes with coverage are more than COV  .Finally, genes with  value less than 0.05 are nominated as candidate drivers.

Performance Evaluation.
To evaluate the performance of our method on identifying known driver genes, we used annotated cancer-related genes datasets CGC database (15/7/2015) [35] and 20/20 rule [25] as approximate benchmarks.CGC is a database which catalogues 571 genes whose mutations have been causally involved in cancer [35].20/20 rule contains 138 driver genes in which 125 genes are affected by subtle mutations and 13 are affected by amplification or homozygous deletion [25].We compared our method with frequency-based method, DriverNet [10], and MUFFINN [32] based on these two benchmarks.
In this work, we first counted the number of known drivers according to CGC genes of these four methods.For comparison, three measures based on top  genes including precision, recall, and 1 score are used which are defined as follows: precision = TP TP + FP = (# genes found in DriverFinder) ∩ (# mutated genes in CGC) (# genes found in DriverFinder) , recall = TP TP + FN = (# genes found in DriverFinder) ∩ (# mutated genes in CGC) where TP indicates the number of overlapping genes found in our method and annotated genes associated with cancers in CGC.FP means the number of genes identified in our method, however not cataloged by CGC.FN is the number of genes in CGC but not contained in our method.
In general, DriverFinder almost outperforms other three methods in the top ranking genes of all the four cancer datasets (Figure 2; results of DriverFinder are shown in Supplementary File 1, in Supplementary Material available online at https://doi.org/10.1155/2017/4826206).Although, after approximately ranking top 30 genes in KIRC, MUFFINN performs comparably with DriverFinder, it has poorer performance in the top 30 genes.And the same phenomenon arises after top 60 genes in THCA.Analogous to it, the cumulative number of retrieved cancer genes annotated by 20/20 rule in KIRC by MUFFINN (Figure 3(c)) is also more than DriverFinder.A potential explanation of them may be that the total numbers of mutations in KIRC (10359 genes with 21089 mutations) and THCA (8899 genes with 16497 mutations) are remarkably less than in BRCA (16717 genes with 118098 mutations) and HNSC (14830 genes with 57164 mutations).So the gene-gene interaction network in KIRC and THCA may be simpler than in BRCA and HNSC; that is, genes may easily correlate directly with each other.And MUFFINN consider mutations only in direct neighbors [32].On the one hand, this difference may help MUFFINN retrieve more genes; on the other hand, the number of mutations indicates that there may be more passengers (i.e., noise) in BRCA and HNSC and DriverFinder is more stable with noises.
Moreover, DriverFinder outperforms other three methods in BRCA, HNSC, and THCA by the cumulative numbers with 20/20 rule (Figure 3).In KIRC, it also has a better performance than DriverNet and frequency-based method within the top 100 genes.

DriverFinder Decreases the Effect of Gene Length.
It is worth noting that, from the results of the DriverFinder, we can find that it can decrease the false positive because it has the good performance on filtering randomly mutated genes due to long length.The longest gene across human genome is TTN and it has been proven that higher mutation rate in it is likely to be artifacts [23,36].For example, in BRCA, TTN ranked 4 and 6 in frequency-based method and in MUFFINN, respectively, due to high mutation rate.Also it ranked 51 ( value = 0.031) as a candidate driver in DriverNet algorithm; however, it was filtered out with DriverFinder.In KIRC and HNSC, it ranked 4 and 3, respectively, based on mutation frequency and 22 and 18 in DriverNet, respectively.In addition, it is also at top 4 and 3 with MUFFINN in KIRC and HNSC, respectively, but it does not rank among the top 160 or 790 separately according to DriverFinder.Furthermore, for THCA, frequency-based method and MUFFINN ranked TTN as top 4 and 5, respectively.However, it is not identified by DriverFinder.These results proved the better performance of DriverFinder on filtering randomly mutated genes with long length than DriverNet, MUFFINN, and frequency-based method.

Pathway Enrichment Analysis.
In order to investigate cancer-related pathways among the significant candidate drivers, Kyoto Encyclopedia of Genes and Genomes (KEGG) [37] pathway enrichment analysis was performed by statistics significantly candidate driver genes with  values less than 0.05 (see Supplementary File 2).The top 20 significant pathways are shown in Figure 4. We observed that the most enriched terms are cancer-related pathways in four cancer datasets.Moreover, ErbB signaling pathway, which is significantly enriched in BRCA ( value = 4.79E − 07) and KIRC ( value = 6.23E − 07), has been reported to play important roles in many tumors and ErbB2/ErbB3 heterodimer functioned as an oncogenic unit in breast cancer [38].Also, the significantly enriched VEGF signaling pathway (4.94E − 05) in HNSC plays a pivotal role in tumor angiogenesis [39].

Discovering Rare Driver Genes.
In this subsection, we exhibited that DriverFinder can identify rarely mutated but significant candidate driver genes which are defined as genes whose mutation frequency < 2% across all patients cohort.Here, we only selected highly ranked (top 30) rare genes for further analysis.In BRCA, 3 rare genes (PIK3R1, CREBBP, and PRKACB) are ranked in top 30.In them, for PIK3R1 (ranked 23) which is not ranked top 30 in the other three methods, underexpression might possibly lead to PI3K pathway activation and confer tumor development and progression in humans and it is a clinically useful independent prognostic marker in breast cancer [40].Due to its low frequency mutations, any further statistical analyses concerning a possible association between PIK3R1 mutations and clinical parameters are not allowed [40] and it is easily ignored by frequency-based methods.Moreover, for CREBBP (ranked 25) which also was not ranked top 30 in the other three methods, it has been occasionally reported in breast cancer [41].PRKACB downregulates in non-small cell lung cancer and the effect of its upregulation on cell proliferation, apoptosis, and invasion also has been investigated [42].
In HNSC, also 3 rare genes are selected and one of them (UGT2B4, ranked 30) has shown potential to be a novel driver.UGT2B4 genotypes associated with decreased enzyme activities are found to increase the risk of esophageal squamous cell carcinoma [43].
For KIRC and THCA, there are some important rare genes not identified by other methods.For example, CLTC in KIRC, which is contained in CGC and ranked 11 with low mutated frequency 1.68% in DriverFinder, encodes a major subunit of clathrin and is a fusion partner of TFE3.And CLTC-TFE3 is the fifth gene fusion involving TFE3 in pediatric renal cell carcinomas [44].Another example is AKT1 (0.69% of cases) in THCA, which is identified by DriverFinder (ranked 30) and contained in CGC; it is a serine/threonine protein kinase and its downstream proteins have been reported to be frequently activated in human cancers [45].

Discussion
Cancer is a complex disease and difficult to treat, and distinguishing driver genes from a mass of neutral passenger genes is extremely important to understand the mechanism of the cancer and design targeted treatments.In this study, we introduced a comprehensive framework DriverFinder to identify driver genes by incorporating genomes, transcriptomes, and gene-gene interaction information.We implemented genebased filtering model to exclude genes that were mutated largely due to random events.The method was applied to four independent cancer datasets from TCGA and the Complexity 9 results demonstrated that the power of it across multiple tumor types was mainly better than DriverNet, MUFFINN, and frequency-based methods.In summary, this method has advantages in both filtering random mutated genes and identifying driver genes regardless of their mutation frequencies.We expect that it can also be applied to other complex cancer types.However, in this work, we only explained the changes of the expression by somatic mutations, although other molecular or genetic changes such as transcription factors, methylations, and microRNAs also affect expression of other genes and play important roles in the development of cancer [46].Therefore, it is necessary to extend the method so that driver genes can be determined by not only somatic alterations but also other different types of molecular changes.Also, we can extend our aim of identifying drivers by some methods such as machine learning methods [47][48][49][50][51].

Figure 1 :
Figure 1:The flowchart of DriverFinder method.(a) Input datasets consist of somatic mutation, CNV, normal and tumor expression data, and gene-gene interaction network.A generalized additive model is performed on somatic mutation data to filter mutated genes which occurred at random due to long length.After that, the residual significant genes are combined with CNV to construct mutation data.The gene-gene interaction network is constructed by integrating prior gene-gene interaction network and Pearson correlated coefficient network.And the outlying matrix is constructed by analyzing interindividual variation in tumor and normal expression.(b) Given mutation data, outlying data, and gene-gene interaction network, the bipartite graph is obtained.The black nodes on the left indicate mutated genes and the blue nodes on the right represent outlying expression genes.(c) Candidate genes are obtained by greedy algorithm.The more outlying expression events the gene overlaps, the higher the gene ranks are.(d) Statistical test is performed on candidate genes to select important putative drivers by  value < 0.05.