Cancer Outlier Analysis Based on Mixture Modeling of Gene Expression Data

Molecular heterogeneity of cancer, partially caused by various chromosomal aberrations or gene mutations, can yield substantial heterogeneity in gene expression profile in cancer samples. To detect cancer-related genes which are active only in a subset of cancer samples or cancer outliers, several methods have been proposed in the context of multiple testing. Such cancer outlier analyses will generally suffer from a serious lack of power, compared with the standard multiple testing setting where common activation of genes across all cancer samples is supposed. In this paper, we consider information sharing across genes and cancer samples, via a parametric normal mixture modeling of gene expression levels of cancer samples across genes after a standardization using the reference, normal sample data. A gene-based statistic for gene selection is developed on the basis of a posterior probability of cancer outlier for each cancer sample. Some efficiency improvement by using our method was demonstrated, even under settings with misspecified, heavy-tailed t-distributions. An application to a real dataset from hematologic malignancies is provided.


Introduction
Heterogeneity of the expression of oncogenes within the same histological cancers is considered to have significant implications for understanding disease biology, identifying risk groups, and optimizing patient treatment [1,2]. Recently, Tomlins et al. [3] argued that traditional analytical methods, for example, a two-sample -statistic, which search for common activation of genes across a class of cancer samples, will fail to detect cancer genes which show differential expression in a subset of cancer samples or cancer outliers. They developed the "cancer outlier profile analysis" (COPA) method to detect cancer genes with such heterogeneous expression profiles within cancer samples and revealed subtypes of prostate cancer patients defined by recurrent chromosomal aberration.
Inspired by the COPA statistic, some authors have proposed other methods for detecting cancer-related genes with cancer outlier profiles in the framework of multiple testing [4][5][6]. However, such cancer outlier analyses will generally suffer from a serious lack of power because the analysis attempts to detect relatively small fractions of cancer outliers; the signal contained in the data is relatively limited, compared with that in the standard multiple testing setting where common activation of cancer-related genes for all cancer samples is supposed. As information sharing across units in the data generally improves efficiency of the analysis, we propose a simple efficient method via information sharing across both genes and cancer samples. Specifically, we propose a parametric normal mixture modeling of gene expression levels of cancer samples across genes after a standardization using the reference, normal sample data. Then, a gene-based statistic for gene selection is proposed on the basis of a posterior probability of cancer outlier for each cancer sample. This posterior probability itself is to provide a useful index to aid identifying cancer outliers for a selected gene. This paper is organized as follows. After providing a brief summary of the existing multiple testing methods for the cancer outlier analysis in Section 2, we provide the proposed method in Section 3. We assess performance of our methods via simulations in Section 4. An application to a real dataset from hematologic malignancies is given in Section 5. Finally, concluding remarks appear in Section 6.

Existing Multiple Testing Methods for Cancer Outlier Analysis
We suppose a microarray study to detect cancer-related genes from a large pool of genes based on their gene expression levels measured for samples, comprised of 0 samples from a normal class and 1 samples from a cancer class. The gene expression data considered here comprise normalized log ratios from two-color cDNA arrays or normalized log signals from oligonucleotide arrays (e.g., Affymetrix GeneChip). For gene ( = 1, . . . , ), let be the expression value for sample ( = 1, . . . , 0 ) in the normal class and let be that for sample ( = 1, . . . , 1 ) in the cancer class. The most multiple testing methods developed for analyzing cancer outliers intend to a one-sided testing. Without loss of generality, we are interested in detection of activated genes that are overexpressed or upregulated in a subset of cancer samples, that is, cancer outliers. For detecting cancer-related genes with over-or underexpressions, one may perform two one-sided tests separately, one for detecting cancer-related genes with overexpressions and the other for detecting those with underexpressions.
The traditional two-sample -statistic for gene is defined as where is the mean expression value in the cancer samples, is the mean expression value in the normal samples, and is the usual pooled standard error estimate for gene ( = 1, . . . , ). The -statistic is efficient in detecting cancer-related genes on which most cancer samples are activated, but may not be efficient for those with cancer outlier profiles.
Tomlins et al. [3] defines the COPA statistic as where (⋅) is the th percentile of the expression level, med is the median of expression values, and mad is the median absolute deviation of expression values in all of the samples: med = median ( , ; = 1, . . . , 0 , = 1, . . . , 1 ) , The value of in (⋅), which represents a threshold in determining cancer outlier, is specified by the user, such as = 75, 90, or 95. Instead of using a fixed percentile value, approximately equivalent to using the information from only one sample, the use of additional outlier samples can be more efficient. Specifically, the OS statistic [4] is defined as Here the set of cancer outliers, , is heuristically identified by Wu [5] proposed the ORT statistic through identifying cancer outliers relative to the normal sample, rather than the pooled sample. Specifically, the ORT statistic is defined as As the COPA, OS, and ORT statistics are criticized because the outliers are arbitrarily defined, Lian [6] considers all possible values of the outlier threshold. Specifically, for the ordered gene expressions for the cancer samples,̃1 ≥̃2 ≥ ⋅ ⋅ ⋅ ≥̃1, the MOST statistic is defined as where = ⌊∑ 1≤ ≤ ⌋ and 2 = Var⌊∑ 1≤ ≤ ⌋ for 1 > 2 > ⋅⋅⋅ > 1 , the order statistics of 1 samples from the standard normal distribution. The standardization in the parenthesis is to make different values of the statistic comparable for different values of the outlier threshold, ( = 1, . . . , 1 ).

Mixture Modeling of Gene Expression Data.
In order for information sharing across both genes and cancer samples, we propose a simple parametric normal mixture modeling of gene expression data of cancer samples. As the existing multiple testing methods, for each gene, we consider standardized gene expressions of the cancer samples based on the reference, normal sample data, where , is the usual standard error estimate within the normal samples for gene ( = 1, . . . , ; = 1, . . . , 1 ). Again, the standardization intends to make all gene expression data from the cancer samples comparable across genes. We then assume the finite normal mixture model with the three components, The density function 0 corresponds to the null component with no differential expressions for the reference, normal sample data. The densities 1 and 2 correspond to the nonnull components (i.e., cancer outliers) of underexpression and overexpression, respectively, for the normal sample data. We specify normal distributions, (0, 1 2 ), ( 1 , 1 2 ), and ( 2 , 1 2 ), for 0 , 1 , and 2 , respectively. represents the mixing proportion ( = 0, 1, 2), and 0 + 1 + 2 = 1. We denote , as unobservable indicator random variables, such that , = 1 if the (standardized) expression level, , of cancer sample on gene belongs to the th component, and , = 0 otherwise ( = 1, . . . , ; = 1, . . . , 1 ). We estimate the parameters, 1 , 2 , and 's, via applying the EM algorithm to cope with the unobservable indicator variable , in the mixture model (e.g., [7]).

Statistics for Gene Selection.
The posterior probability, , , that , = 1, that is, the expression level belongs to the th component, provides a basis for gene selection, For detecting overexpressed genes, possibly with a cancer outlier profile (as a one-sided testing), we propose to use the following gene-based statistic for gene selection: This statistic may correspond to one minus the posterior probability that none of samples are cancer outliers with overexpressions. We will select genes with greatest values of . Gene-based statistics for detecting underexpressed cancer-related genes can be similarly developed.
In our framework, we can also derive a similar gene-based statistic for detecting under-or overexpressed genes (as a two-sided testing). One has It is important to note that the posterior probabilities, , , themselves can serve as a helpful index to aid identifying cancer outlier samples for a particular (selected) gene. In contrast, the existing cancer outlier methods do not provide such an expression-level statistic for identifying cancer outlier samples.
Unlike the existing statistics for cancer outlier analysis, the statistic, , does not involve any particular cancer outlier threshold, so that cancer-related genes with various proportions of cancer outliers ( in Section 4), even those with common activation across all cancer samples, could be detected. However, as is a composite of the posterior probabilities from all of the cancer samples, cancer-related genes with smaller proportions of cancer outlier will be more difficult to be detected because the statistic will be more dominated by the posterior probabilities from the cancer samples other than cancer outliers. The impact of the proportion of cancer outlier will be investigated in Section 4.

Simulation Study
We conducted a simulation study to assess the performance of our method in detecting cancer-related genes with cancer outlier profiles. We considered a microarray study with = 10000 genes for = 40, 80, or 200 samples, where the first half of samples were from the normal class and the latter half from the cancer class, that is, 0 = 1 = /2. Of note, for a given 1 , the power of the analysis will improve as 0 increases because more precise estimates of the mean and variance of the normal sample data become available in the standardization before fitting the mixture model (9) to detect cancer-related genes. We generated the gene expression levels for each gene from the standard normal distribution (0, 1 2 ) or the central -distribution with 20 degrees of freedom to assess the impact of deviation from the normality assumption. No interaction across genes was assumed. We supposed that genes were divided into the three gene components according to the mixture model (9), that is, the null, underexpression, and overexpression component. The mixing proportions were set to as 0 = 0.6,  We assessed the false discovery rate (FDR) and true positive rate (TPR), defined as the proportion of false positives in the set of significant genes and the proportion of selected true positives in the set of all of the overexpressed genes (= 2 ), respectively. Note that the TPR corresponds to average power in multiple testing (e.g., [8,9]). We conducted 200 simulations to obtain average TPR for a given value of FDR for each method, as the estimates of TPR were highly stable for = 10000 values of each statistic obtained in a single simulation run. Figures 1 and 2 show ROC curves that plot the TPR and FDR for various numbers of significant genes in multiple testing for normally distributed and -distributed gene expressions, respectively.
For normally distributed gene expressions, the gene selection based on the proposed statistic, , generally provided the greatest values of TPR (for a given value of FDR). As is expected, the proposed gene selection based on provided greater TPR as increased. The gene selection based on the -statistic provided the smallest values of TPR, especially when the proportion of cancer outliers is small, such as = 0.1, but the TPR improved for greater values of the proportion, such as = 0.5, as is expected. The COPA and OS methods performed worst among the methods except thetest, especially for greater values of , such as = 0.5. In particular, the performance of the OS method was largely deteriorated for = 0.5. The ORT and MOST methods generally provided comparable TPR values, but less than those of the proposed method based on .
For -distributed gene expressions, similar trends were observed. Again, the proposed method based on provided greatest TPR in general, except for the scenario with = 40

Application
We illustrate how the proposed method can capture the heterogeneity of cancer samples through its application to a microarray gene expression data from the myelodysplastic syndromes (MDSs) [10]. The MDSs are complex hematologic malignancies with heterogeneous clinicopathological features with various chromosomal aberrations. In order to discover the heterogeneous clinicopathological features of MDSs, possibly including those related to prognosis, we adopted the proposal using mixture distributions method for 139 MDSs and 69 nonleukemias (samples from bone marrow mononuclear cells from nonleukemic conditions), which were regarded as cancer and normal samples, respectively. Here, following Mills et al. [10], we removed samples of the chronic myelomonocytic leukemia disease type from MDS samples. We first adopted the RMA normalization [11] to the raw data (Raw data (CEL files) downloaded from Gene Expression Omnibus database (GEO, http://www.ncbi .nlm.nih.gov/geo/, accession number GSE15061)). We make Table 1: The number of overlaps in selected genes between the gene selection methods in the example of hematologic malignancies. Top 200 genes were selected by each method.
- statistic  COPA  OS  ORT  MOST  Proposed  -statistic  -13  14  50  56  56  COPA  13  -150  0  99  51  OS  14  150  -139  108  86  ORT  50  0  139  -151  89  MOST  56  99  108  151  -75  Proposed  56  51  86  89 75 statistics using the log scales expression intensities of each gene. As an initial screening of genes related to cancer outliers from a pool of = 54, 675 candidate genes, we adopted the existing and proposal methods. For each method, we selected 200 top genes with the greatest values of the statistic. The estimates of the parameters in the mixture model (9) obtained under an EM algorithm with a convergence criterion that are relative changes of the parameters <10 −4 were as follows:̂1 = 0.018,̂2 = 0.018,̂1 = −1.22, and̂2 = 3.54. Table 1 summarizes the overlap in a number of selected genes between the gene selection methods. Generally, the OS, ORT, and MOST methods had substantial overlaps in the selected genes. The degree of overlap can be explained by the affinity among the methods in terms of the used standardization and outlier thresholds (see Section 2). On the other hand, it is interesting that the proposed method based on the gene-based statistic, , had intermediate overlaps with all of the other methods. This would indicate that the proposed method could detect cancer-related genes with various cancer outlier profiles. Figure 3 shows histograms of the standardized expression levels within each class (normal and cancer) for three genes that were selected by our method, but not by the other methods. The proportion of cancer outliers was relatively small for the first two genes (Figures 3(a) and 3(b)), but large for the third gene (Figure 3(c)), which again indicates that our methods can detect cancer-related genes with various proportions of cancer outliers.

Discussion
In this paper, we have attempted to improve the efficiency of the cancer outlier analysis through information sharing across genes and cancer samples. In our simulations, the proposed gene selection based on a parametric normal mixture modeling of gene expression data demonstrated some improvement in efficiency for detecting cancer-related genes with moderate to large proportions of cancer outlier ( ≥ 0.3), even under settings with heavy-tailed -distributions. The proposed statistic would therefore be effective for selecting cancer-related genes that are involved in relatively major activation among cancer samples. Modification of the statistic for selecting cancer-related genes with more minor activation (i.e., small ) is a subject for future research. Another important subject would be the addition of a gene-level mixture structure, that is, nonnull and null genes in terms of the association with cancer, to provide a more formal basis for evaluating false positives and true positives in gene selection.
We have assumed the mixture structure (9) with the three components, 0 , 1 , and 2 , that is common across genes. In some cases, the use of only one nonnull component for a particular direction of differential gene expression may be rather restrictive for plausible, large heterogeneity among cancer samples. Our method can be extended to involve multiple nonnull components, possibly with selection of the number of nonnull components based on model-selection criteria, such as AIC and BIC [7]. Another restriction of our model is that no interaction or correlation is assumed among genes. According to an investigation in the context of mixture modeling of a gene-level statistic (e.g., [12]), the impact of correlation is generally small for moderate correlation. In our modeling of the standardized gene expression levels across both genes and samples, the proportion of correlated 's is expected to be relatively small because of independence across samples, but further investigation is needed.
As to the existing methods of cancer outlier analysis, our simulations suggested superiority of the standardization based on the reference, normal sample data, not the pooled data from both cancer and normal samples. The poor performance of the OS method for greater proportions of cancer outliers, such as = 0.5, can be explained by the use of the IQR based on the pooled data. In such situations with relatively large numbers of cancer outliers, the IQR may cover some of cancer outliers, resulting in a very large outlier threshold, so that a substantial fraction of cancer outliers might be missed by using the statistic. In contrast, the performance of the ORT method, which is based on the IQR based only on the normal sample data, was not deteriorated as increased in our simulations.
After screening cancer-related genes with cancer outlier profiles, researchers will need clustering of genes to identify coregulated genes that belong to the same molecular pathway related to disease biology and aggressiveness. At the same time, clustering of cancer samples based on the identified gene clusters can help discovering new taxonomy of cancer based on gene expression profiles of cancer outliers, possibly related to patients' clinical courses such as prognosis and response to therapeutics. A two-way model-based clustering of genes and samples in the context of cancer outlier analysis, as an extension of the proposed model-based method in this paper, would be an important topic, and one of such clustering methods will be reported elsewhere.