As modern biotechnologies advance, it has become increasingly common that different modalities of high-dimensional molecular data (termed “omics” data in this paper), such as gene expression, methylation, and copy number, are collected from the same patient cohort to predict the clinical outcome. While prediction based on omics data has been widely studied in the last fifteen years, little has been done in the statistical literature on the integration of multiple omics modalities to select a subset of variables for prediction, which is a critical task in personalized medicine. In this paper, we propose a simple penalized regression method to address this problem by assigning different penalty factors to different data modalities for feature selection and prediction. The penalty factors can be chosen in a fully data-driven fashion by cross-validation or by taking practical considerations into account. In simulation studies, we compare the prediction performance of our approach, called IPF-LASSO (Integrative LASSO with Penalty Factors) and implemented in the R package
Most drugs cannot treat all patients with a given disease. It is thus crucial to identify biomarkers (genetic, genomic, proteomic, or any measurable biological entities) that can predict the patient’s response to a given therapy. Ultimately, the biomarkers are to be built into companion diagnostic kits. Ideally, the number of biomarkers should be small to reduce the labor and cost.
High-throughput molecular data, termed “omics data” in this paper, have been used for developing prediction models for more than fifteen years. As a well-known example, gene expression data have often been found to be useful for predicting survival response to therapy of cancer patients; the overwhelming enthusiasm in the initial years has meanwhile been tempered by more critical studies [
For example, methylation data, copy number data, and mRNA expression may be available for the same patient cohort. Other data types include microRNA expression, proteomic data, metabolomic data, and single nucleotide polymorphisms (SNPs). In this paper, we denote each group of variables of the same type as a “modality” and the whole dataset as a “multiomics” dataset. For example, in this paper, we consider as illustration a breast cancer dataset with a clinical modality and a gene expression modality [
As multiple modalities of biomarker measurements become available for the same patients, the research interest starts to focus on the integration of data modalities to identify biomarkers and build prediction models with good accuracy [
The case of variables from one low-dimensional modality (typically, a few clinical variables relevant to the outcome to be predicted) and one high-dimensional modality (e.g., a microarray gene expression dataset) has been extensively investigated by De Bin et al. [
There has been a large amount of statistical and bioinformatic literature on the integration of multiple omics datasets investigating their correlation structure [
In simulation studies, we show that IPF-LASSO performs better than the standard LASSO when the proportions of relevant variables differ between modalities and that it generates more parsimonious prediction rules than sparse group LASSO. An R package called
This paper is structured as follows. After a short introduction into
We denote the standardized predictor variable
This framework can be generalized to logistic regression (in the case of a binary outcome) and to Cox proportional hazards regression (in the case of a censored time to event). The term
We propose the use of a weighted sum of the
Similar to the standard LASSO, our proposed framework can be applied to
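For concreteness, the weighted penalty can be written out for the linear case. The notation below is an assumption made for illustration (M modalities, p_m variables in modality m, coefficients \beta_j^{(m)}, and one penalty parameter \lambda_m per modality); it is a sketch of the idea, not a reproduction of the numbered equations of this paper.

```latex
\hat{\boldsymbol\beta}
  = \arg\min_{\boldsymbol\beta}
    \left\{
      \sum_{i=1}^{n}\Bigl(y_i - \sum_{m=1}^{M}\sum_{j=1}^{p_m} x_{ij}^{(m)}\beta_j^{(m)}\Bigr)^{2}
      + \sum_{m=1}^{M}\lambda_m \bigl\lVert \boldsymbol\beta^{(m)} \bigr\rVert_1
    \right\},
\qquad
\bigl\lVert \boldsymbol\beta^{(m)} \bigr\rVert_1 = \sum_{j=1}^{p_m}\bigl\lvert \beta_j^{(m)} \bigr\rvert .
```

Setting \lambda_1 = \dots = \lambda_M recovers the standard LASSO. Writing \lambda_m = \lambda \cdot \mathrm{pf}_m separates an overall penalty \lambda from modality-specific penalty factors \mathrm{pf}_m, so that only the ratios between the factors matter once \lambda is tuned.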
The Bayesian interpretation of the LASSO is useful to outline the motivation of the different penalty parameters. Park and Casella [
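One compact way to state this interpretation, in assumed notation and ignoring the exact way the prior is scaled by the error variance: if the coefficients have independent Laplace (double-exponential) priors whose rate \tau_m depends on the modality, then under a Gaussian likelihood the posterior mode is a modality-weighted L1-penalized estimate.

```latex
\pi\bigl(\beta_j^{(m)}\bigr) = \frac{\tau_m}{2}\exp\bigl(-\tau_m \lvert \beta_j^{(m)}\rvert\bigr),
\qquad
-\log \pi(\boldsymbol\beta \mid \mathbf{y})
  = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^{\top}\boldsymbol\beta\bigr)^2
    + \sum_{m=1}^{M}\tau_m \bigl\lVert\boldsymbol\beta^{(m)}\bigr\rVert_1 + \text{const}.
```

Multiplying through by 2\sigma^2 gives a residual sum of squares plus a penalty with \lambda_m = 2\sigma^2\tau_m. A larger rate \tau_m concentrates the prior for modality m more tightly around zero, which expresses a prior belief that fewer variables of that modality are relevant.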
Note that our approach may also be seen as connected with the adaptive LASSO [
From a computational point of view, IPF-LASSO with fixed penalty factors is no more complex than the corresponding form of LASSO (linear, logistic, or Cox) with the same penalty for all variables: the estimates can be obtained with any standard LASSO algorithm after first scaling the variables by their respective penalty factors. More precisely, the standard estimation algorithm is run with the same penalty parameter
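The scaling argument can be checked numerically. The sketch below is a minimal pure-Python illustration for the linear case, not the implementation used in the paper: `lasso_cd` is a hypothetical helper that minimizes (1/2n)||y - Xb||^2 + lam * sum_j w_j |b_j| by coordinate descent, and the data and penalty factors are made up. It fits the weighted problem directly and then again by dividing each column by its penalty factor, running the unweighted LASSO, and dividing the resulting coefficients by the same factors.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, weights=None, n_iter=500):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = len(X), len(X[0])
    w = weights if weights is not None else [1.0] * p
    b = [0.0] * p
    r = list(y)  # residuals y - Xb; b starts at zero
    z = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + z[j] * b[j]
            new = soft_threshold(rho, lam * w[j]) / z[j]
            delta = new - b[j]
            if delta != 0.0:
                for i in range(n):
                    r[i] -= delta * X[i][j]
                b[j] = new
    return b


# Made-up data: columns 0-1 play the role of modality 1, columns 2-3 of modality 2.
X = [[1.0, 0.5, -0.3, 0.8],
     [-0.7, 1.2, 0.4, -0.5],
     [0.3, -0.9, 1.1, 0.2],
     [-1.2, 0.1, -0.8, -1.0],
     [0.6, -0.4, 0.9, 1.3],
     [-0.2, 0.8, -1.4, 0.4],
     [1.4, -1.1, 0.2, -0.6],
     [-0.9, -0.3, 0.5, -0.9]]
y = [1.9, -0.4, -0.2, -2.1, 0.9, 0.1, 1.5, -1.6]
pf = [1.0, 1.0, 2.0, 2.0]  # one penalty factor per variable, constant within a modality
lam = 0.1

# (a) weighted penalty handled directly
direct = lasso_cd(X, y, lam, weights=pf)

# (b) scale the columns, run the unweighted LASSO, then rescale the coefficients
X_scaled = [[x / f for x, f in zip(row, pf)] for row in X]
b_scaled = lasso_cd(X_scaled, y, lam)
via_scaling = [b / f for b, f in zip(b_scaled, pf)]

print(direct)
print(via_scaling)
```

The two coefficient vectors agree up to floating-point error, because each coordinate update on the scaled problem is the corresponding update on the weighted problem multiplied by the penalty factor.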
There have been LASSO variations for single and multiple data modalities proposed by several groups. In this section, we discuss the connections of IPFLASSO to these methods. In the scenario investigated by De Bin et al. [
Another two-step approach for prediction is proposed by Zhao et al. [
Group LASSO [
Sparse group LASSO [
Another recently proposed approach handling two modalities in the framework of penalized regression is collaborative regression [
For the sake of exhaustiveness, let us also mention an applied paper on plant breeding [
In summary, the IPF-LASSO proposed here is aimed at using multiple high-dimensional data modalities in a flexible way by weighting them differently in feature selection and prediction modelling, which is a critical yet unsolved problem in biomedical research.
In this section, we discuss the choice of the parameters
IPF-LASSO is implemented in our new R package
As an example, the following simple code performs 5-fold cross-validation repeated 10 times to choose the best penalty factors out of
The criteria used for cross-validation currently implemented in
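The selection procedure itself, repeated k-fold cross-validation over a list of candidate penalty factors, can be sketched independently of any package. The Python code below is an illustration under stated assumptions and does not reproduce the interface of the R package: `weighted_lasso` and `choose_penalty_factors` are hypothetical helpers, the outcome is continuous with squared prediction error as the criterion, and the candidate factors, lambda grid, and data are made up.

```python
import random


def soft_threshold(z, t):
    return z - t if z > t else z + t if z < -t else 0.0


def weighted_lasso(X, y, lam, w, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j w[j] * |b_j|."""
    n, p = len(X), len(X[0])
    b, r = [0.0] * p, list(y)
    z = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            rho = sum(X[i][j] * r[i] for i in range(n)) / n + z[j] * b[j]
            new = soft_threshold(rho, lam * w[j]) / z[j]
            delta = new - b[j]
            if delta:
                r = [r[i] - delta * X[i][j] for i in range(n)]
                b[j] = new
    return b


def choose_penalty_factors(X, y, sizes, candidates, lambdas, k=5, repeats=10, seed=1):
    """Return the (penalty factors, lambda) pair with the smallest repeated k-fold CV error."""
    rng = random.Random(seed)
    n = len(X)
    err = {(c, lam): 0.0 for c in candidates for lam in lambdas}
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # a fresh random partition for every repetition
        for f in range(k):
            test = idx[f::k]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
            for c in candidates:
                # Expand one factor per modality into one weight per variable.
                w = [c[m] for m, size in enumerate(sizes) for _ in range(size)]
                for lam in lambdas:
                    b = weighted_lasso(X_tr, y_tr, lam, w)
                    err[(c, lam)] += sum(
                        (y[i] - sum(x * bj for x, bj in zip(X[i], b))) ** 2 for i in test
                    )
    return min(err, key=err.get)


# Made-up data: modality 1 (2 variables) carries the signal, modality 2 (3 variables) is noise.
rng = random.Random(0)
n, sizes = 30, (2, 3)
X = [[rng.gauss(0, 1) for _ in range(sum(sizes))] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

candidates = [(1, 1), (1, 2), (2, 1)]  # hypothetical candidate penalty factors
lambdas = [0.05, 0.2]                  # hypothetical lambda grid
best_pf, best_lam = choose_penalty_factors(X, y, sizes, candidates, lambdas, k=5, repeats=2)
print(best_pf, best_lam)
```

The defaults `k=5` and `repeats=10` mirror the setting described above; the demonstration call uses `repeats=2` only to keep the pure-Python run short. Accumulating the error over repetitions before taking the minimum is what dampens the variability of a single cross-validation run.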
The goal of the simulation studies is to investigate the performance of IPF-LASSO and compare it with other methods. We consider a binary dependent variable and two high-dimensional data modalities. The two modalities of variables vary in (i) their total numbers of variables
In the main design, we consider the settings (i.e., combinations of
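Combinations of
Setting    Variables,   Variables,   Relevant,    Relevant,    Effect size,  Effect size,
           modality 1   modality 2   modality 1   modality 2   modality 1    modality 2
Setting A  1000         1000         10           10           0.5           0.5
Setting B  100          1000         3            30           0.5           0.5
Setting C  100          1000         10           10           0.5           0.5
Setting D  100          1000         20           0            0.3           -
Setting E  20           1000         3            10           1             0.3
Setting F  20           1000         15           3            0.5           0.5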
For all
In each simulation setting, prediction performance of all fitted models is evaluated through an independently drawn test dataset of size
Note that simulation results are strongly dependent on the parameters and many other parameter settings are conceivable. To gain a better idea of our method’s behavior, we additionally consider a total of 33 other simulation scenarios, results from which are presented in a more compact form. These additional parameter settings are displayed in Supplementary Table 1 (in Supplementary Material available online at
In real life, variables may be correlated both within and across modalities due to biological relationship. To investigate whether correlation structure affects the method’s behavior, we additionally consider settings, denoted as A′ to F′, based on settings A to F where a nondiagonal covariance matrix
More specifically, we assume that each modality contains
Figure
Results for settings A to F: misclassification rate on test set (a), AUC on test set (b), number of selected variables (c), and penalty factors selected by IPF (d).
Sparse group LASSO (SGL) performs better in terms of misclassification rate and AUC than IPF-LASSO in setting A where the two modalities are identical, in setting B where the proportions of truly relevant variables are the same, and in setting C where the numbers of truly relevant variables are the same. This observation indicates that when the two modalities are very similar, SGL tends to produce models with higher prediction performance.
Importantly, we notice that the improved prediction performance of SGL over IPF-LASSO in this case comes at the price of selecting substantially more variables into the final model, as shown in Figure
In settings A, B, and C, the performance of the standard LASSO is slightly superior to that of IPF-LASSO. This makes sense: when two data modalities are equally informative, giving them the same penalty is expected to yield better results than penalizing them differently. Due to the variability of cross-validation, however, IPF-LASSO does not always recognize that the best penalty factors are
In settings D, E, and F where the two modalities are very different in the proportions of truly relevant variables, IPF-LASSO yields a better performance than the standard LASSO and SGL. When there is a belief that one modality is more relevant to the outcome than the other, IPF-LASSO might thus be considered for prediction model building. This is a common scenario in clinical biomarker development: for example, we may have a small panel of protein markers identified based on strong prior biological knowledge and a profiling panel of whole-genome mRNA expression. Figure
To further understand the method performance with respect to the two modalities in the simulations, we perform a large number of simulations using further parameter settings as summarized in Figure
Panels (a), (b), and (c): difference
Panel (a) in Figure
The results of settings A′ to F′ (with correlation) are very similar to the results of settings A to F, as can be seen from Figure
Results for settings A′ to F′ (with correlation): misclassification rate on test set (a), AUC on test set (b), number of selected variables (c), and penalty factors selected by IPF (d).
We use publicly available data on acute myeloid leukemia (AML) from The Cancer Genome Atlas [
Clinical variables are the age, the percentage of blast cells in bone marrow, the white blood cell count per mm³ (continuous variables), and the sex. Preliminary analyses (not shown) show that, for these variables, the proportional hazards assumption is acceptable. One of the two molecular modalities consists of 19,798 microarray gene expression measurements from Affymetrix U133 Plus 2. In the TCGA repository, they are available at different processing stages. Here we use the preprocessed data (level 3). As a second modality, we consider the copy number alterations obtained using Affymetrix SNP array 6.0. We download the data from the repository following the procedure of Zhao et al. [
The clinical, gene expression, and copy number modalities have 200, 173, and 191 patients, respectively, which results in a total of 163 subjects with data for all three modalities. Since in the original study the data are not separated into training and validation sets, we generate this split randomly. More precisely, we use around 2/3 of the observations (109) for training our models (training set) and the rest (54) to compute their prediction ability (validation set). In our analysis, we consider 100 such random splits and present the average results.
We compare the prediction abilities of Cox proportional hazards models obtained with the four different approaches (IPF-LASSO, standard LASSO, SGL, and S) for the AML data. We also include the results from the nonparametric Kaplan-Meier method (the null model). Figure
AML data. Prediction error curves computed up to 5 years for the models obtained by standard LASSO (red line), S (green line), SGL (blue line), and IPF-LASSO (purple line). The black line represents the prediction error obtained with the null model (no variables).
In this example, we note that IPF-LASSO (purple line) performs better than the standard LASSO and SGL (red and blue lines, resp.). Interestingly, if we apply LASSO separately to the different modalities (green line), the results are comparable to IPF-LASSO. The comparison in terms of prediction ability can also be performed numerically by evaluating the integrated Brier score (IBS), which summarizes the aforementioned curves into a single index. In this example, the standard LASSO has the worst performance (average IBS = 0.211), not much better than that of the null model (average IBS = 0.217). SGL performs a bit better (average IBS = 0.203) but worse than IPF-LASSO and S, which both have an average IBS of 0.196. In terms of sparsity, although IPF-LASSO and S have similar performance in terms of Brier score, IPF-LASSO produces much sparser models than S. On average, the numbers of variables in IPF-LASSO models and in S models are 7.3 and 13.7, respectively, with the standard LASSO between these two values (10.2). Not surprisingly, SGL (using the default value for the tuning parameter,
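To illustrate how a prediction error curve is collapsed into a single index, the following Python sketch integrates a Brier score curve with the trapezoidal rule. It rests on two assumptions that are not taken from the text: the integral is divided by the length of the time interval so that the index stays on the scale of the Brier score, and the Brier score at each time point (including any adjustment for censoring) has already been computed elsewhere. The function name and the curve values are made up.

```python
def integrated_brier_score(times, brier):
    """Trapezoidal integral of a Brier score curve, divided by the time span."""
    if len(times) != len(brier) or len(times) < 2:
        raise ValueError("need at least two matching time points and scores")
    area = 0.0
    for k in range(1, len(times)):
        width = times[k] - times[k - 1]
        if width <= 0:
            raise ValueError("times must be strictly increasing")
        area += 0.5 * width * (brier[k] + brier[k - 1])
    return area / (times[-1] - times[0])


# Made-up prediction error curve on a yearly grid up to 5 years.
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
curve = [0.0, 0.10, 0.18, 0.22, 0.24, 0.25]
ibs = integrated_brier_score(times, curve)
print(round(ibs, 3))
```

With this normalization, a curve that is constant at 0.25 (the error of an uninformative prediction of 0.5) yields an index of 0.25, so lower values indicate better prediction over the whole time range.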
Hatzis et al. [
Among the available clinical variables, we select age (continuous), nodal status (4 categories), tumor size (4 categories), grade (3 categories), estrogen receptor (binary), and progesterone receptor (binary) as described in De Bin et al. [
The dataset consists of a training set used for training the genomic signature with 310 patients and a validation set with 198 patients. They include 66 and 45 patients who died (events), respectively. After removing subjects with missing data, there are 283 (58 events) and 182 (41 events) subjects in the training and validation datasets, respectively.
Similar to the previous real dataset analysis, here we compare the Brier scores generated from the Cox proportional hazards models obtained with the four methods, that is, IPF-LASSO, SGL, S, and the standard LASSO, together with the null model from the nonparametric Kaplan-Meier method. Figure
Breast cancer data. Prediction error curves computed up to 6 years for the models obtained by LASSO (red line), LASSO applied separately to the three modalities (green line), sparse group LASSO (blue line), and IPF-LASSO (purple line). The black line represents the results obtained with the null model (no variables).
One advantage of IPF-LASSO is the possibility of flexibly choosing different weights for the different modalities. In this example, we observe that the cross-validation procedure selects the penalty factors
The best model from IPF-LASSO (with penalty factor
Breast cancer data. (a) Integrated Brier score obtained with IPF-LASSO for different choices of penalty factors. The numbers associated with the points are the numbers of selected clinical and molecular variables, respectively. For example, “(318)” indicates that for the penalty factors
Figure
In addition to modelling the distant relapse-free survival time, a secondary goal of this study is to distinguish the patients with a pathological complete response (RCB-I) from those with a significant residual disease (RCB-II/RCB-III). Here, the pathological response is a binary outcome. We now apply the four approaches considered previously with logistic regression and use the area under the ROC curve (AUC) as a performance metric for the methods. In contrast to the Brier score, a larger value of AUC corresponds to better prediction performance. The AUC values for IPF-LASSO, S, SGL, and the standard LASSO are 0.663, 0.712, 0.722, and 0.653, respectively. Regarding the model sparsity, IPF-LASSO and S select a comparable number of variables (50 and 46, resp.), while the standard LASSO leads to the sparsest model (38 variables). Again, SGL provides a much larger model with 1128 variables. Please note that this unfavorable result of our method does not per se contradict the simulation results, since a real dataset is but a point in the space of all possible datasets, and the performance of methods is highly variable across datasets [
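For readers less familiar with the metric, the AUC equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The Python sketch below computes it by direct pairwise comparison; the function name, scores, and labels are made up for illustration and are unrelated to the breast cancer data.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Made-up predicted probabilities and true binary outcomes.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
value = auc(scores, labels)
print(value)
```

The pairwise form costs on the order of the number of positives times the number of negatives, which is unproblematic for validation sets of a few hundred patients. A value of 0.5 corresponds to random ranking and 1 to perfect separation.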
In this paper, we addressed an important question in biomedical research, namely, how to integrate multiple (possibly correlated) data modalities with different sizes and different relevancies to the outcome, with the aim of generating a sparse prediction model. We proposed an
Simulation studies have demonstrated that IPF-LASSO has better prediction performance than its competitors (standard LASSO, separate LASSO models, and sparse group LASSO) when the two data modalities differ in their relevance for prediction and performs slightly worse if the modalities are similar. More importantly, in both simulations and real case studies, IPF-LASSO is shown to generate much more parsimonious models than sparse group LASSO, which is a highly desirable property from a practical perspective.
In principle, IPF-LASSO is designed for any number
One common issue for all variations of LASSO, including IPF-LASSO, is instability. Small changes in the dataset may lead to large changes in the selected model. Stability can be investigated using resampling methods, as suggested under the name “stability selection” [
The authors declare no conflicts of interest regarding the publication of this paper.
The authors thank Sarah Tegenfeldt for her helpful comments. Mathias Fuchs was financed by a grant from Novartis Biomarker to Anne-Laure Boulesteix. Riccardo De Bin and Anne-Laure Boulesteix were financed by the German Research Foundation (DFG), Grant nos. BO3139/4-1 and BO3139/4-2.