Detecting Differential Variable microRNAs via Model-Based Clustering

Identifying genomic probes (e.g., DNA methylation marks) is becoming a new approach to detect novel genomic risk factors for complex human diseases. The F test is the standard equal-variance test in Statistics. For high-throughput genomic data, the probe-wise F test has been successfully used to detect biologically relevant DNA methylation marks that have different variances between two groups of subjects (e.g., cases vs. controls). In addition to DNA methylation, microRNA is another mechanism of epigenetics. However, to the best of our knowledge, no studies have identified differentially variable (DV) microRNAs. In this article, we proposed a novel model-based clustering to improve the power of the probe-wise F test to detect DV microRNAs. We imposed special structures on covariance matrices for each cluster of microRNAs based on the prior information about the relationship between variance in cases and variance in controls and about the independence among cases and controls. To the best of our knowledge, the proposed method is the first clustering algorithm that aims to detect DV genomic probes. Simulation studies showed that the proposed method outperformed the probe-wise F test and had certain robustness to the violation of the normality assumption. Based on two real datasets about human hepatocellular carcinoma (HCC), we identified 7 DV-only microRNAs (hsa-miR-1826, hsa-miR-191, hsa-miR-194-star, hsa-miR-222, hsa-miR-502-3p, hsa-miR-93, and hsa-miR-99b) using the proposed method, one (hsa-miR-1826) of which has not yet been reported to relate to HCC in the literature.


INTRODUCTION
Investigating the relationship between genomics and complex human diseases has greatly improved our understanding of the molecular mechanisms of, and the interplay of environmental factors and genomic factors to, the complex human diseases. High-throughput data from cuttingedge technologies have substantially facilitated the unbiased discovery of the genetic risk factors for many diseases. The standard approach to identify disease-associated genomic probes is to test if the mean expression (e.g., DNA methylation) between cases and controls is significantly different. In Statistics, variance is another important measurement, in addition to mean. The larger the variance is, the more information the data could provide. However, the information about variance has not been directly used to detect disease-associated genomic probes until recent years.
Several groups of researchers have recently identified DNA methylation marks that have different variances between cases and controls [1][2][3]. They observed that (1) for differentially variable DNA methylation marks the variability in cases is usually higher than that in controls; and (2) differentially variable DNA methylation marks are biologically relevant. DNA methylation is a type of epigenetics, which studies the heritable changes in organisms caused by regulating gene expression without changing genetic code. DNA methylation inhibits gene expression by adding a methyl group to the cytosine or adenine DNA nucleotides. Another type of epigenetics is microRNAs that are short noncoding 18-25 nucleotide long RNA and negatively regulate mRNA translation [4,5]. However, to the best of our knowledge, no studies have investigated differential variability for microRNAs. The main objective of this article is to develop statistical methods to detect microRNAs differentially variable between cases and controls.
The F test is the classical method to test for equal variance between two groups of subjects, which evaluates whether the ratio of sample variances between two groups is significantly different from one. For high-throughput genomic data, such as DNA methylation data, the probewise F test could be used. That is, we first perform the F test for each probe to test for equal variance between cases and controls. We then calculate FDR-adjusted p-value to control for multiple testing, where FDR stands for false discovery rate. If the FDR-adjusted p-value < 0.05 for a DNA methylation mark, we then claim that this DNA methylation mark is differentially variable between cases and controls. The advantages of this probe-wise approach include flexibility (one model per probe) and easy-implementation. However, differentially variable probes might be governed by the same underlying mechanism. Statistically speaking, the variances of differentially variable probes might follow the same distribution. Similarly, the variances of non-differentially variable probes might also follow the same distribution. We hypothesize that these underlying distributions of variances could help us improve the power of the F test to detect differentially variable probes.
In this article, we propose a mixture of 3-component multivariate normal distributions to fit the expression levels of microRNAs to identify microRNAs differentially variable (i.e., having different variances) between cases and controls.

METHOD
Model. We assume that microRNAs belong to one and only one of the following three clusters: (1) microRNAs having higher variances in cases than in controls (denoted the cluster as OV), (2) microRNAs having equal variances between cases and controls (denoted the cluster as EV), and (3) microRNAs having smaller variances in cases (denoted the cluster as UV). We follow Qiu et al. (2008) [6] to directly model the marginal distributions of microRNAs in the 3 clusters. In this article, we modified Qiu et al.'s (2008) marginal model [6] to allow the detection of DV probes.
We assume that (1) data have been normalized to remove the effects of confounding factors, such as chip effect, and batch effect, etc., and (2) data have been transformed so that the distributions of microRNA expressions are close to normal distributions.
For a given microRNA, we denote X i as the pre-processed expression for the i-th subject, i=1, …, m, where m=m c +m n , m c is the number of cases and m n is the number of controls. For the k-th cluster (k=1, 2, or 3), we assume that the expressions of the m c cases are identically distributed with mean !" and variance !" ! . We assume that the expressions of the m n controls are identically distributed with mean !" and variance !" ! . According to Qiu et al. (2008) [6], X i 's are marginally correlated with correlation !" for cases and !" for controls. We also assume that (1) cases and controls are independent, and (2) the ×1 random vector (X 1 ,…, X m ) T follows a multivariate normal distribution. For the OV cluster, we require that !! ! > !! ! . For UV cluster, we require that !! ! < !! ! . For the EV cluster, we require that !! ! = !! ! . We allow the means and correlations are different between cases and controls in the EV cluster.
We used the EM algorithm [7] to estimate the model parameters !" , !" ! , !" , and !" ! . The is used to assign the -th microRNA to one of the 3 clusters, where ! is the density function of the multivariate normal distribution for the -th cluster. If !! is the largest posterior probability among !! , !! , and !! , then the -th microRNA will be assigned to the 1 st cluster (i.e., OV cluster We checked the data quality by visualizing the plot (Fig. A1) of percentiles across arrays and the plot (Fig. A2) of the first and second principal components. Both plots indicate the two datasets have been cleaned and have good quality (i.e., no apparent outlying microRNAs, outlying arrays, or technical batch effects). Hence, we directly use the two datasets in the further analyses. We regarded GSE67138 as the discovery set and GSE67139 as the validation set.

Simulation.
We conducted 4 sets of simulation studies. In the first set (denoted it as SimI), we generated microRNA data from the proposed marginal mixture model, where estimated model parameters for GSE67138 (i.e., the discovery set) are used as the true values of the model . We generated 100 datasets, each of which has 1,000 microRNAs for 50 cases and 50 controls. Ten percent (100) of the 1,000 microRNAs are in the OV cluster. Six percent (60) microRNAs are in the UV cluster. The remaining 84% (840) microRNAs are in the EV cluster.
In the second set (denoted it as SimII), we generated microRNA data from a mixture of 3 multivariate t distribution with the same mean vectors and covariance matrices as those in SimI and with degrees of freedom 3. SimII is used to evaluate the performance of the proposed method when the normality assumption is violated.
In the third set (denoted it as SimIII) of the simulation studies, we generated microRNA data from the same model as that in SimI, except that the marginal correlations within subject-groups were set to zero ( !" = 0 and !" = 0). SimIII is used to evaluate the performance of the proposed method when there are no marginal correlations.
In the fourth set (denoted it as SimIV) of the simulation studies, we generated microRNA data from the same model as that in SimII, except that the marginal correlations within subject-groups were set to zero ( !" = 0 and !" = 0). SimIV is used to evaluate the performance of the proposed method when there are no marginal correlations and when the normality assumption is violated.
We first applied the 11 methods (the gs method and the 10 existing methods) to the discovery set (GSE67138) to detect microRNAs differentially variable between invasive tumors and noninvasive tumors. For the 10 existing methods, we obtained FDR-adjusted p-values. If a microRNA has FDR-adjusted p-value < 0.05, we claim that this microRNA has significantly different variances between invasive tumors and non-invasive tumors. We then applied the same procedure to the validation set (GSE67139). We claim that a microRNA is a validated DV microRNA (1) if the microRNA is DV in both discovery and validation sets, and (2)  We applied miRSystem [15] to predict the target genes of microRNAs in each of the 3 sets: !"#$%& , !"#$%& , and !"#! . miRSystem also provides the enriched KEGG pathways for the predicted target genes.
For simulated datasets, we calculated the magnitude of agreement between the true cluster memberships of microRNAs and the detected cluster memberships by each of the 11 methods by using the Jaccard index [6,16]. The maximum value of the Jaccard index is one, indicating perfect agreement. The minimum value of the Jaccard index is zero, indicating that the agreement is by chance.

Results.
For the real data analyses, the numbers of the DV microRNAs in the discovery set, and the numbers and proportions of the validated DV microRNAs are shown in Table 1 (Table A1), the parallel boxplots of which are shown in Fig. A4. !"#! contains 60 microRNAs (Table A2), the parallel boxplots of which are shown in Fig. A5.
For the first simulation study (SimI), almost all of the 100 values of the Jaccard index by the gs method are equal to one (the perfect agreement). The gs method still had much higher values of the Jaccard index than the 10 existing equalvariance tests even for data generated from multivariate t distributions, although the average values of Jaccard index were much smaller (<0.5) in SimII ( Figure 2) and SimIV (Figure 4).

Discussion.
In this article, we proposed a novel method to detect microRNAs having different variances between cases and controls. The proposed method is different from probe-wised equal-variance tests in that it does not involve hypothesis testing since it is a model-based clustering. To the best of our knowledge, the proposed method is the first clustering algorithm to detect differential variable genomic probes.
In the 4 simulation studies, the proposed method outperformed 10 probe-wised tests, including the classic F test that has been reported to outperform other equal-variance tests when the normality assumption is held [17,18]. The reason why the gs method performed better than the F test in SimI and SimIII, where the normality assumption is held, is that the gs method could borrow information across microRNAs (i.e., the estimation of the model parameters uses the information provided by al microRNAs).
The gs method and the F test outperformed the other 9 tests in SimII and SimIV, where the data were generated from multivariate t distribution (i.e., the normality assumption is violated). This indicates that the gs method and the F test could work well relative to other equal-variance tests for data with uni-modal and symmetric distributions. Fig. A6 shows the histograms and qqplots for a simulated dataset in each of the 4 simulation scenarios. In future research, we will evaluate the performance of the gs method in scenarios where the violation of the normality assumption is caused by skewed distributions. The robustness of the gs method warrants further investigation.
In the real data analysis, the gs method detected 67 validated DV microRNAs (66 OV and 1 UV), seven of which are DV only. The 7 DV-only microRNAs (hsa-miR-1826, hsa-miR-191, hsa-miR-194-star, hsa-miR-222, hsa-miR-502-3p, hsa-miR-93, and hsa-miR-99b) targeted to 1,639 genes based on the miRSystem analysis. Except for hsa-miR-1826, all DV-only microRNAs have been associated with HCC. Elyakim et al. (2010) [19] showed that miR-191 is a candidate oncogene target for hepatocellular carcinoma therapy. Law and Wong (2011) [20] reported the association of miR-194 with metastatic behavior of HCC. Murakami et al (2006) [21] reported that miR-222 is increased in poorly versus moderately versus well-differentiated hepatomas. Jin et al. (2016) [22] reported that miR-502-3p suppressed cell proliferation, migration, and invasion in HCC by targeting SET. Li et al. (2009) [23] confirmed that the miR-106b-25 cluster, which miR-93 belongs to, is over-expressed in HCC. Morishita et al. (2016) [24] found that miR-99b is up-regulated in HBV-infected HCC cells.  [26] reported that in a mice study, DNA methylation marks that are differentially methylated between livers with HCC and livers without HCC are enriched in the SALIVARY SECRETION pathway.  [6] data preprocessing. That is, we first performed the Box-Cox transformation, and then for each microRNA, we performed mean-centering and scaling operations so that the mean expression level is 0 and the variance is 1. Fig A7 showed that even after the Box-Cox transformation and scaling, the distributions of the data are still quite different from normal distributions. In the real data analysis, we applied the original data downloaded from GEO for all the 10 existing equal-variance tests. We also tried to apply the F test to the Box-Cox transformed data. The F test detected 159 DV microRNAs from the discovery set (GSE67318), but 472 DV microRNAs from the validation set (GSE67319), which is more than half of the 847 microRNAs. Further investigation is warranted.
In summary, the proposed gs method outperformed existing equal-variance tests in the simulation studies and detected biologically relevant microRNAs in a real data analysis of HCC data. The gs method is based on a mixture of multivariate normal distributions. Although the gs method showed some robustness to the violation of the normality assumption, in future we will study how to improve the gs method to make it robust to the violation of the normality assumption.

Acknowledgement:
The research is partially supported by the NSERC Discovery Grants and by the National Institute of Health of the United States (NIH/NHLBI 1 P01 HL 132825-01, NIH/NHLBI 5 R01 HL 125734)