Computational Analysis of Specific MicroRNA Biomarkers for Noninvasive Early Cancer Detection

Cancer is a complex disease that arises in various tissues of the human body and is accompanied by many abnormalities and mutations in the genome, transcriptome, and epigenome. Early detection plays a crucial role in extending survival time for all major cancer types. Recent advances in microarray and sequencing techniques have made it easier to identify effective biomarkers for early detection of cancer. MicroRNAs (miRNAs) are increasingly used as candidate biomarkers in cancer-related studies because they regulate target gene expression. In this paper, comparative analysis is used to discover miRNA expression patterns in cancer versus normal samples at the early stage of eight prevalent cancer types. Our work focuses on the identification and functional analysis of specific miRNA biomarkers. Several miRNA biomarkers identified in this paper match those reported in existing studies, and most of them could serve as candidate indicators for clinical early diagnosis.


Introduction
Cancer is a highly complex disease that involves many abnormalities and mutations in the genome, transcriptome, and epigenome. These abnormalities play important roles in cancer cell growth [1]. Cancer is one of the leading causes of death worldwide. According to the World Cancer Report 2014, there were about 14 million new cases and more than 8 million cancer-related deaths in 2012. The number will continue to rise without effective prediction and treatment of cancer: new cases are expected to rise to 22 million over the next two decades [2].
Early detection of cancer is undoubtedly important. During the development of cancer, some genes or proteins associated with genetic mutations, transcription, or epigenetic alterations can be detected in cancerous or inflamed tissue when compared with normal tissue. These genes or proteins can provide a quantitative measurement of the severity of a cancer. Many genes and proteins are used as biomarkers in the clinical detection of cancer, including alpha-fetoprotein (AFP) for liver cancer [3], BRCA1 and BRCA2 for breast and ovarian cancer [4], prostate specific antigen (PSA) for prostate cancer [5], and epidermal growth factor receptor (EGFR) for non-small-cell lung carcinoma [6].
In recent years, a growing number of studies on microRNA (miRNA) biomarkers have been published. miRNAs are small noncoding RNA molecules that contain 21-24 nucleotides. They play important roles in the posttranscriptional regulation of target gene expression [7]. High-throughput microarray and sequencing techniques are widely used for transcriptome analysis, and a large amount of gene expression data for various kinds of cancers is available from public databases such as Gene Expression Omnibus (GEO) [8], Stanford Microarray Database (SMD) [9], Oncomine [10], and The Cancer Genome Atlas (TCGA) [5]. Dysregulation of miRNA expression is important for cancer development through various mechanisms including deletions, amplifications, or epigenetic silencing [11]. Circulating miRNAs are suggested to be effective indicators of disease. They are important for clinical applications such as disease diagnostics, monitoring of therapeutic effect, and prediction of recurrence in cancer patients [12]. Circulating miRNAs are widely used as biomarkers for various human cancers, such as prostate cancer and breast cancer [13][14][15][16]. Some gene or miRNA biomarkers are identified by applying statistical methods to high-throughput omics data. Resnick et al. obtained 21 differentially expressed miRNAs by comparing microarray data between 28 patients with ovarian cancer and 15 normal samples. Five overexpressed and three underexpressed miRNAs in patients were evaluated by real-time PCR [17]. Xie et al. studied aberrant miRNAs as biomarkers for the diagnosis of non-small-cell lung cancer (NSCLC). Their results indicated that miR-21 was highly expressed in the sputum specimens of patients [18]. Du Rieu et al. studied whether abnormal miRNA production in the noninvasive precursor pancreatic intraepithelial neoplasia (PanIN) can be used as a potential early biomarker of pancreatic ductal adenocarcinoma (PDAC). They indicated that miR-21 was aberrantly expressed in the early development of PanIN and was worth further study as a biomarker for early detection of PDAC [19]. Habbe et al. studied aberrantly expressed miRNAs in intraductal papillary mucinous neoplasms (IPMNs). They showed that aberrant miRNA expression was an early event in pancreatic cancer and that miR-155 was worth further study as a biomarker for IPMNs in clinical samples [20].
Although several miRNA biomarkers have been reported in the above studies, each experiment involves a single cancer type and identifies only one or very few miRNAs. The accuracy and specificity of some of these miRNAs are also limited. In this paper, we comparatively analyze miRNA expression patterns in cancer versus normal samples at the early stage of eight prevalent cancer types. The datasets used in our paper are all RNA-Seq data from the TCGA database, which allows a comprehensive identification of miRNA biomarkers for various cancers. We focus on identifying specific miRNA biomarkers in four steps: (a) detecting differentially expressed miRNAs for each cancer type; (b) detecting specific differentially expressed miRNAs; (c) detecting specific miRNA biomarkers; and (d) analyzing the functions and pathways of these miRNAs. Several of the identified miRNA biomarkers have been reported in existing studies, and most of them could serve as candidate biomarkers for clinical early diagnosis.

miRNA Expression Datasets.
The original miRNA expression data based on miRNA-Seq for eight prevalent cancer types are downloaded from the TCGA database [21]. The eight cancer types are prostate, thyroid, breast, head and neck, kidney, stomach, lung, and liver cancer. For each cancer type, we select the cancer samples that have corresponding samples from normal tissue of the same patient, and we use these paired samples to identify cancer-related miRNA biomarkers of the different cancer types. Using paired samples helps ensure that the identified biomarkers generalize well to clinical research and use. Detailed information on the datasets used in this paper is listed in Table 1. We use only cancer samples of pathologic stage I in order to detect miRNA biomarkers for early-stage cancers. The pathologic stage is taken from the clinical patient information in the TCGA database. The value of "reads per million miRNA mapped" is used as the expression value of each miRNA.

Identification of Differentially Expressed miRNAs.
For each cancer type, the Wilcoxon signed-rank test is first used to identify differentially expressed miRNAs between cancer samples and normal samples. An FDR-controlling method is then applied to the results of the Wilcoxon signed-rank test to control the false discovery rate. Finally, an improved fold change method is applied to identify the differentially expressed miRNAs between cancer samples and normal samples.
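The FDR step can be illustrated with a minimal sketch. The text does not name the FDR-controlling method, so the Benjamini-Hochberg procedure is assumed here. The miRNA names and raw p-values are hypothetical; in practice the raw p-values would come from a paired test such as `scipy.stats.wilcoxon` applied to each miRNA's cancer and normal expression values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the paper only says "FDR controlling method"; BH is one
    common choice and is used here purely for illustration.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per miRNA, from a paired signed-rank test.
raw_p = {"miR-a": 0.001, "miR-b": 0.008, "miR-c": 0.039, "miR-d": 0.041, "miR-e": 0.60}
names = list(raw_p)
adj = benjamini_hochberg([raw_p[n] for n in names])

# Keep the miRNAs whose adjusted p-value is below the 0.05 cutoff used in the paper.
significant = [n for n, a in zip(names, adj) if a < 0.05]
print(significant)
```

Note that miR-c and miR-d have raw p-values below 0.05 but are no longer selected after adjustment, which is the purpose of the FDR step.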
In this paper, we use quantiles to calculate the fold change of each miRNA. A quantile is a cut point that divides a set of observations into equal-sized groups, so the $q$-quantiles are the values that partition a finite set of data into $q$ groups of equal size. There are $q - 1$ such $q$-quantiles. For a variable $x$, the $k$th $q$-quantile ($0 < k < q$) can be estimated using the following formula:

$$Q_x\left(\frac{k}{q}\right) = x_{\lfloor h \rfloor} + \left(h - \lfloor h \rfloor\right)\left(x_{\lfloor h \rfloor + 1} - x_{\lfloor h \rfloor}\right), \tag{1}$$

where $h = (N - 1)(k/q) + 1$, $N$ is the sample size, and $x_j$ is the $j$th smallest observation of $x$. The set of $q$-quantiles is defined as follows:

$$Q_x = \left\{ Q_x(k/q) \mid 0 < k < q \right\}. \tag{2}$$

Let $X \in \mathbb{R}^{m \times n}$ be the transcript measurements in a miRNA expression matrix with $m$ miRNAs and $n$ samples. Here, the variable $x$ is $X[i, :]$ ($0 < i \le m$), and the set of $q$-quantiles of miRNA $i$ can be calculated using formulae (1) and (2). For miRNA $i$, we thus obtain two sets of $q$-quantiles, $Q_i^{\text{normal}}$ for the normal samples and $Q_i^{\text{cancer}}$ for the cancer samples. The fold change $\mathrm{FC}_i^k$ of miRNA $i$ at the $k$th $q$-quantile is calculated by formula (3) from $Q_{i,k}^{\text{cancer}}$ and $Q_{i,k}^{\text{normal}}$, the expression values of miRNA $i$ at the $k$th $q$-quantile in cancer and normal samples, respectively. The original fold change $\mathrm{OFC}_i$ of miRNA $i$, calculated as the average of $\mathrm{FC}_i^k$ over the quantiles, is defined by formula (4), where $q$ is the selected number of quantiles.
In this paper, the standard deviation is also incorporated into the original fold change. The improved fold change of miRNA $i$, denoted by $\mathrm{IFC}_i$, is defined by formula (5), in which $\sigma$ is the standard deviation of the sets of $q$-quantiles. Formula (5) thus considers not only the mean fold change but also the variance across samples, which makes the statistic more robust.
In this paper, $q$ is set to 100 to calculate the quantiles. We then select differentially expressed miRNAs for each individual cancer type using the following rules: (1) the improved fold change (IFC) is less than −0.5 or greater than 0.5; and (2) the FDR-adjusted $p$-value of the Wilcoxon signed-rank test is less than 0.05.
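As a rough illustration of the quantile-based fold change, the sketch below estimates the q-quantiles by linear interpolation at position h = (N − 1)(k/q) + 1 and averages a per-quantile fold change. The symbol names (N for the sample size, k for the quantile index, q for the number of groups), the function names, the log2 ratio used as the per-quantile fold change, and the pseudocount are all assumptions made for this example. The standard-deviation adjustment of the improved fold change is not reproduced.

```python
import math


def q_quantiles(values, q=100):
    """Return the q-1 linearly interpolated q-quantiles of a list of values."""
    xs = sorted(values)
    n = len(xs)
    out = []
    for k in range(1, q):
        h = (n - 1) * (k / q) + 1      # 1-based fractional position
        lo = math.floor(h)
        frac = h - lo
        lower = xs[lo - 1]             # the lo-th smallest value
        upper = xs[lo] if lo < n else xs[n - 1]
        out.append(lower + frac * (upper - lower))
    return out


def original_fold_change(cancer, normal, q=100, pseudo=1.0):
    """Average per-quantile fold change between cancer and normal values.

    ASSUMPTION: the per-quantile fold change is taken as a log2 ratio with a
    pseudocount; the paper's own formula is not reproduced here.
    """
    qc = q_quantiles(cancer, q)
    qn = q_quantiles(normal, q)
    fcs = [math.log2((c + pseudo) / (v + pseudo)) for c, v in zip(qc, qn)]
    return sum(fcs) / len(fcs)


# Hypothetical expression values: every cancer value is twice its normal value.
normal = [10.0, 20.0, 30.0, 40.0, 50.0]
cancer = [2 * v for v in normal]
ofc = original_fold_change(cancer, normal, q=100, pseudo=0.0)
print(round(ofc, 6))
```

Working on quantiles rather than on individual samples lets the two groups be compared across the whole expression distribution, so a few extreme samples influence only a few of the q − 1 terms.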

Identification of Specific Differentially Expressed miRNAs.
Next, we select the specific miRNA biomarkers for each cancer, which can be used to discriminate it from other cancers. The specific differentially expressed miRNAs are chosen from the differentially expressed miRNAs obtained for each cancer in the previous section. Let the improved fold change of a miRNA in cancer $c_1$ be $\mathrm{IFC}_1$ and in the other cancers be $\mathrm{IFC}_2, \ldots, \mathrm{IFC}_n$. The miRNA is regarded as a specific miRNA biomarker for cancer $c_1$ if and only if it satisfies the two inequalities of formula (6), where $n$ is the number of cancer types and $n = 8$ in this paper.
Formula (6) uses two thresholds and an expression factor. The first threshold is on the ratio of cancer types (relative to the number of cancer types being considered) and is set to 0.75. The second threshold is on the improved fold change (IFC value) and is set to 0.5. The expression factor filters out miRNAs with low expression values; it is defined by formula (7) from the vector of expression values of the miRNA in the cancer and the average of that vector. For example, if a miRNA is upexpressed in cancer $c_1$ (the first inequality in (6) is satisfied, that is, $|\mathrm{IFC}_1|$ exceeds the second threshold) and is downexpressed or changes very little in the other cancer types (the second inequality in (6) is satisfied), then this miRNA is a specific differentially expressed miRNA for cancer $c_1$. It is a good discriminator for distinguishing $c_1$ from the other cancer types $c_j$ ($j = 2, 3, \ldots, 8$).
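The worked example above can be turned into a small sketch. This is only one possible reading of the rule, not the paper's formula: a miRNA is taken to be specific for a target cancer if its absolute IFC there exceeds 0.5 and, in at least 75% of the other cancer types, its IFC is either small or of the opposite sign. The function name, the IFC values, and the omission of the low-expression filter are all assumptions made for illustration.

```python
def is_specific(ifc, target, ratio_threshold=0.75, ifc_threshold=0.5):
    """Check one illustrative reading of the specificity rule.

    ifc maps each cancer type to the miRNA's improved fold change there.
    ASSUMPTION: this is an interpretation of the textual example, and the
    expression factor that filters low-expressed miRNAs is left out.
    """
    own = ifc[target]
    if abs(own) <= ifc_threshold:
        return False  # not differentially expressed in the target cancer
    others = [v for cancer, v in ifc.items() if cancer != target]
    # An "other" cancer counts as different if the change there is small
    # or points in the opposite direction.
    different = sum(1 for v in others if abs(v) <= ifc_threshold or v * own < 0)
    return different / len(others) >= ratio_threshold


# Hypothetical IFC values of one miRNA across the eight cancer types.
ifc_a = {
    "breast": 0.92, "prostate": 0.10, "thyroid": -0.20, "head and neck": 0.05,
    "kidney": -0.70, "stomach": 0.30, "lung": 0.60, "liver": 0.00,
}
# A second hypothetical miRNA that is upregulated in several cancers at once.
ifc_b = dict(ifc_a, prostate=0.80, thyroid=0.70)

print(is_specific(ifc_a, "breast"))
print(is_specific(ifc_b, "breast"))
```

In the first case six of the seven other cancers differ from breast cancer, so the miRNA passes the 0.75 ratio; in the second case only four of seven do, so it is rejected as a breast-specific candidate.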

Identification of Specific miRNA Biomarkers.
After obtaining the specific differentially expressed miRNAs for each single cancer type, we further select the circulating and upregulated miRNAs as biomarkers. The information of extracellular circulating miRNAs is downloaded from miRandola database [22]. Based on the source of extracellular miRNAs, the miRNAs in miRandola database are divided into four categories: Ago2, exosome, HDL, and circulating.
In this paper, we select only the circulating miRNAs whose source is plasma or serum in the miRandola database as candidate miRNA biomarkers. To improve sensitivity and specificity, we identify combined miRNA biomarkers based on the specific single miRNA biomarkers. The rules for selecting combined biomarkers are designed to identify $k$-miRNA discriminators. These $k$-miRNA discriminators are used as combined biomarkers for multiple cancer types, as specific biomarkers for cancer types with similar survival rates (high, medium, and low survival rate), and as biomarkers for a specific cancer type.
A computational process for finding $k$-miRNA ($k = 1, 2, 3, 4, 5$) combination biomarkers is proposed in this paper to give the best distinction among the different cancer groups using a linear classifier [23]. Linear Discriminant Analysis (LDA) [23] is employed to evaluate the performance of $k$-miRNA combination biomarkers for multiple cancer types, cancers with similar survival rates, and a single cancer type. The overall accuracy (OA) is defined as the ratio of the number of true positives and true negatives to the number of all samples:

$$\mathrm{OA} = \frac{\mathrm{TP} + \mathrm{TN}}{N}, \tag{8}$$

where TP is the number of true positives, TN is the number of true negatives, and $N$ is the total number of samples. The performance of all the $k$-miRNA combinations is evaluated using the Leave-One-Out Cross Validation (LOOCV) method [24]. In each LOOCV step, a single sample is held out as the validation data, and the remaining samples are used as the training data to build the classifier using LDA. The OA value is calculated as the accuracy in this LOOCV step. This process is iterated until every sample has served as the validation data. Because of the computational complexity, the number of miRNAs in a combination is limited to 1, 2, 3, 4, and 5 in this paper.
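The evaluation loop can be sketched as follows. To keep the example dependency-free, a nearest-centroid rule is swapped in for LDA; it is a stand-in only, and in practice an LDA implementation such as scikit-learn's `LinearDiscriminantAnalysis` would take its place. The function names, miRNA names, and expression values are hypothetical.

```python
from itertools import combinations


def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier (NOT LDA): return the label of the closest class mean."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(x, y, predict=nearest_centroid_predict):
    """Overall accuracy (TP + TN) / N under leave-one-out cross validation."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]   # every sample except the i-th
        train_y = y[:i] + y[i + 1:]
        if predict(train_x, train_y, x[i]) == y[i]:
            correct += 1
    return correct / len(x)


def best_combinations(x, y, names, k):
    """Score every k-miRNA subset; return (accuracy, names) pairs, best first."""
    results = []
    for cols in combinations(range(len(names)), k):
        sub = [[row[c] for c in cols] for row in x]
        results.append((loocv_accuracy(sub, y), tuple(names[c] for c in cols)))
    return sorted(results, key=lambda r: -r[0])


# Hypothetical expression matrix: rows are samples, columns are miRNAs.
names = ["miR-a", "miR-b", "miR-c"]
x = [[9.0, 1.0, 5.0], [8.5, 2.0, 1.0], [9.5, 1.5, 4.0],
     [2.0, 1.2, 4.5], [1.5, 1.8, 1.5], [2.5, 1.4, 5.5]]
y = ["cancer"] * 3 + ["normal"] * 3

top_single = best_combinations(x, y, names, k=1)[0]
print(top_single)
```

The number of subsets grows combinatorially with k, which is why an exhaustive search of this kind is practical only for small k, in line with the limit of five miRNAs per combination.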

Function and Pathway Analysis.
The functional annotation and pathway analysis are conducted using miEAA, a miRNA Enrichment Analysis and Annotation tool [25]. miEAA is a web-based system that offers miRNA set enrichment analysis similar to Gene Set Enrichment Analysis (GSEA) [26]. The tool also provides rich functionality in terms of miRNA categories and contains over 14,000 miRNA sets, including pathways, diseases, organs, and target genes. In this paper, we perform the functional annotation and pathway analysis separately on the specific upexpressed miRNAs of each cancer type.

Results and Discussion
The differentially expressed miRNAs in each cancer type are identified in order to study the specific roles of miRNAs in different cancer types. Because various alterations accumulate during oncogenesis, different cancers may have their own differentially expressed miRNAs. These miRNAs may be involved in biological processes underlying the formation and progression of cancers. As with genes, several specific miRNAs are identified and used as targets for the prevention and diagnosis of cancers.

Differentially Expressed miRNAs.
In this part, we use the strategy described in the methods to identify the differentially expressed miRNAs in eight cancer types. We detect the upregulated and downregulated miRNAs separately; most of them may be involved in important biological processes or pathways. The differentially expressed miRNAs of eight cancer types are summarized in Table 2. An interesting observation is that the ratio of upregulated to downregulated miRNAs differs considerably between cancers. In Table 2, there are more upregulated than downregulated miRNAs in prostate and stomach cancers. In contrast, downregulated miRNAs account for the major portion of the differentially expressed miRNAs in thyroid and liver cancers. This may reflect characteristics that are unique to, or shared among, different cancer types.

Specific Differentially Expressed miRNAs.
A detailed statistical analysis of the specific differentially expressed miRNAs for each single cancer type is conducted, and the results are summarized in Table 3. There are 51 specific differentially expressed miRNAs in breast cancer, of which 21 are upregulated and 30 are downregulated. There are 180 specific differentially expressed miRNAs in colorectal cancer, of which 62 are upregulated and 118 are downregulated. In lung cancer, 46 miRNAs are found to be specific differentially expressed miRNAs, of which 33 are upregulated and 13 are downregulated. Similarly, while the majority of differentially expressed circulating miRNAs in prostate and stomach cancers are upregulated, the majority of such miRNAs in kidney and liver cancers are downregulated. The details of the specific differentially expressed miRNAs in eight cancers are given in Table 3.

Specific miRNAs Biomarkers.
The identification of specific miRNA biomarkers for each cancer is also performed in this part. For each cancer type, we use the strategy described in Materials and Methods to find the miRNAs that are strongly upregulated and highly expressed in that cancer type while showing different changes or expression values in other cancer types. Such miRNAs can better distinguish one cancer from other cancer types. The detailed information is illustrated in Figure 1. MiRNA MIMAT0000753 is taken as an example (the first one in Figure 1), which shows the expression distribution of MIMAT0000753 in breast cancer, normal breast, other cancers, and other normal tissues. MIMAT0000753 has a fold change value of nearly 0.92 in breast cancer, which clearly indicates upregulation. Its approximate mean expression values are 470 and 160 in breast cancer and normal breast samples, respectively, and 125 and 107 in other cancer and other normal samples, respectively. So MIMAT0000753 performs better in breast cancer than in other cancers and could be used to distinguish breast cancer from other cancer types.
In some cases, a single gene or miRNA is not enough to distinguish a specific cancer. Identifying combinations of miRNAs is therefore an important step of differentially expressed miRNA analysis. To find better combinations of miRNAs as biomarkers for distinguishing a specific cancer, $k$-miRNA ($k = 1, 2, 3, 4, 5$) combinations of differentially expressed miRNAs are selected in this section. By identifying miRNA combinations that have similar expression patterns in a single cancer type but different expression patterns in other cancers, we explore information about the various mechanisms of carcinogenesis in a single cancer type. For each $k$-miRNA combination, we calculate the classification capability in two ways. The first is measured on a single cancer dataset and reflects the classification capability between cancer samples and the paired normal samples of the same cancer type. The second is measured on the datasets of multiple cancer types and reflects the classification capability between one cancer type and the other cancer types, using the fold change values of the cancer samples and the normal samples. Table 4 gives the distinguishing ability of the top five-miRNA specific biomarker combinations in each of the eight cancers. In Table 4, accuracy 1 is the result obtained by the classifier between cancer samples and their corresponding paired normal samples, and accuracy 2 is the result obtained by the classifier between fold change values from each cancer type and fold change values from the other cancer types. These miRNAs could be used as biomarkers for the corresponding cancer, and they are also effective discriminators for distinguishing each cancer from other cancer types.
For example, in breast cancer the top five-miRNA combination MIMAT(0000076 + 0000259 + 0000434 + 0000753 + 0003218) reaches an accuracy of 77.02% in the single dataset evaluation and an accuracy of 79.54% in the multiple cancer datasets evaluation, which are the best results among all combinations. The mean of the two values is 78.28%, which indicates that this five-miRNA combination could be a good biomarker for breast cancer.
As noted, several miRNAs in each combination have been reported to be related to different cancers. For BRCA, MIMAT0000076 (hsa-miR-21) is reported to be overexpressed in human breast cancer and associated with clinical stage, metastasis, and prognosis [27]. MIMAT0000259 (hsa-miR-182) and MIMAT0003218 (hsa-miR-92b) are found to be overexpressed in human breast tumor [28,29]. MIMAT0000434 (hsa-miR-142) is reported to relate to the regulation of tumorigenicity of human breast cancer through the canonical WNT signaling pathway [30], and MIMAT0000753 (hsa-miR-342) regulates BRCA1 expression through modulation of ID4 in breast cancer [31].

Function and Pathway Analysis.
In this section, the functional annotation and pathway enrichment are performed using miEAA. First, we analyze the enriched pathways separately for the specific upexpressed miRNAs of each cancer type. Then, we select the pathways that are enriched in most cancer types and sort them by the hint ratio. Table 5 lists the top-ranked enriched pathways according to their hint ratio from miEAA [25]. The category is the source pathway database, the term is the enriched pathway, and the hint ratio is the enrichment score of the miRNAs. From Table 5, we can see that the top enriched pathways are derived from Kyoto Encyclopedia of Genes and Genomes (KEGG) [60] and Wiki Pathways [61].
The first enriched pathway is apoptosis, which is described by its morphological characteristics and contributes to the high rate of cell loss in malignant tumors [62]. Two enriched pathways, miRNAs involved in DDR and DNA damage response, are related to DNA damage. There is an incontrovertible link between DNA damage and the neoplastic phenotype in cancer [63]. Another interesting pathway is endocytosis, which entails selective packaging of cell-surface proteins and can be modified in cancer [64]. There are six signaling pathways in the enrichment results. Cellular signaling pathways are interconnected to form complex signaling networks, and understanding how they are altered in cancer cells is a major intellectual challenge [65].

Conclusions
Early detection of cancer is an important and necessary part of cancer prevention. Biomarkers, as effective indicators for distinguishing cancer from normal samples or different groups of cancer samples from one another, are increasingly used in studies of cancer mechanisms and in clinical detection. Research on miRNA biomarkers is attracting growing attention. However, some biomarkers lack specificity, and there are no effective common biomarkers for multiple cancer types. In this paper, we mainly focus on the identification of specific miRNAs and common miRNA biomarkers. The miRNA expression data from RNA-Seq for eight cancer types are obtained from the TCGA database, and the circulating miRNA information is collected from the miRandola database.
For each cancer type, we apply the Wilcoxon signed-rank test, with the false discovery rate controlled by an FDR-controlling method, together with the improved fold change (IFC) to identify miRNAs that are differentially expressed between cancer samples and their corresponding normal samples. Then, the specific miRNA biomarkers for each cancer are selected to discriminate it from other cancers. After obtaining the specific differentially expressed miRNAs for each single cancer type, we further select the circulating and upregulated miRNAs as biomarkers. Several miRNA biomarkers identified in this paper match those reported in the existing literature, and most of them could be taken as candidate indicators for clinical early diagnosis.