Numerous studies have demonstrated sex differences in reactions to the same drug treatment, moving away from the traditional view of one-size-fits-all medicine. A premise of this study is that a patient's sex influences disease characteristics and risk factors. In this study, we intend to exploit these differences to obtain better sex-specific biomarkers from gene-expression data. We propose a procedure to isolate a set of important genes as sex-specific genomic biomarkers, which may enable more effective patient treatment. A set of sex-specific genes is obtained by a variable importance ranking using a combination of cross-validation methods. The proposed procedure is applied to three gene-expression datasets.
Personalized medicine is defined by the use of genomic signatures of patients to assign effective therapies in order to achieve the best medical outcomes for individual patients, thus improving public health. Despite the variety of clinical, morphological, and molecular parameters used to classify human malignancies, patients receiving the same diagnosis can have markedly different clinical courses and treatment responses. Since there is no simple way to determine who will have an adverse reaction, the current system of “one-size-fits-all” diagnoses is simply not good enough.
An increasing number of studies have demonstrated sex differences in drug reactions to the same drug treatment. Migeon [
Recent advancements in biotechnology have accelerated the search for molecular biomarkers useful in the diagnosis and treatment of disease. Molecular biomarkers of disease risk and status are critical to accurate treatment because they identify patients most likely to benefit from particular drugs or to experience adverse reactions. Because medicine is always practiced on individuals rather than populations, the goal is to change the assignment of therapies from a population-based approach to an individualized approach.
Gene-expression data can be used to identify patients with a good disease prognosis, thereby sparing some patients unnecessary therapies and toxicity. For example, gene-expression profiling was used to predict clinical outcomes in pediatric patients with acute myeloid leukemia and to find genes whose aberrant expression leads to a poor prognosis [
Classification algorithms are needed for biomedical decision making in clinical assignment of patients to treatment therapies based on individual risk factors and disease characteristics. Since many of the genes measured are not relevant to the outcome, feature selection is a commonly addressed problem in classification [
Development of a biomarker classifier involves two distinct components:
For model building, highly accurate classification algorithms can be used to find sex-specific biomarkers. Given that differences in the biology of lung cancer and other diseases exist between men and women [
In the development of a biomarker classifier, the predictive accuracy of the classifier must be evaluated on a separate set of data. To derive an unbiased accuracy estimate, a nested cross-validation procedure, 20 trials of 10-fold CV within LOOCV, is used in this paper. In other words, in each LOOCV, 90% of the
Three publicly available data sets of interest in this paper are pediatric acute myeloid leukemia (AML) [
Sex differences in disease rates or in rates of adverse reactions to treatment are common, which we intend to exploit to obtain sex-specific biomarkers from gene-expression data. We hypothesize that genomic biomarkers developed from the sex-specific application of classification algorithms will further improve class prediction accuracy.
The summary of our algorithm to find sex-specific predictive/prognostic genomic biomarkers for efficacy or toxicity in individualized treatment of patients for serious diseases is as follows. In each LOOCV trial, firstly, the data is partitioned into a test data set with one observation and the remaining data as the learning data set. The learning data set is further separated into male and female patients’ learning data sets. Thus, this process will be applied
Within each trial of LOOCV, the sets of top-ranked genes for male and female patients are obtained in the second and third steps as follows. First, in order to build genomic biomarkers, 20 trials of 10-fold CV are applied to each learning data set for males and females, where 90% of the samples are randomly selected without replacement as the learning set for each trial of CV. Next, for each trial of 10-fold CV, the BW ratio is applied to this 90% learning set, and the top 25 ranked genes are selected in each process with respect to the target endpoint of the data set. The BW ratio for a gene
Next, to avoid selection bias arising from a particular partition of the learning samples, we repeat the entire process 20 times, shuffling the samples before every 10-fold CV. To obtain the variable importance ranking, the resulting 200 sets of top 25 ranked genes are combined, so that the maximum possible rank score of a gene is 200. The genes selected at least once (rank score > 0) are collected separately for males and females; these most influential genes in the classification process are identified in order to extract a feasible set of sex-specific genes. The final product of the 20 trials of 10-fold CV described in each LOOCV is the sets of genes selected for male and female patients in the learning data set
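The ranking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the BW ratio is the usual ratio of between-class to within-class sums of squares computed per gene, and the function names `bw_ratio` and `rank_scores`, the random seed, and the small guard against a zero denominator are all hypothetical choices made for the example.

```python
import numpy as np

def bw_ratio(X, y):
    """Between-class / within-class sum of squares for each gene (column of X).

    Assumes the standard BW definition; the source text's own formula
    is not reproduced here.
    """
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        between += Xk.shape[0] * (mk - overall) ** 2
        within += ((Xk - mk) ** 2).sum(axis=0)
    # Guard (an assumption) against genes with zero within-class variation.
    return between / np.maximum(within, 1e-12)

def rank_scores(X, y, n_top=25, n_trials=20, n_folds=10, seed=0):
    """Count how often each gene is in the top n_top over repeated k-fold CV.

    With 20 trials of 10-fold CV, the maximum possible score is 200.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    scores = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_trials):
        # Reshuffle the samples before every 10-fold split.
        folds = np.array_split(rng.permutation(n), n_folds)
        for held_out in folds:
            keep = np.setdiff1d(np.arange(n), held_out)  # ~90% learning set
            bw = bw_ratio(X[keep], y[keep])
            top = np.argsort(bw)[::-1][:n_top]
            scores[top] += 1
    return scores
```

In this sketch, `rank_scores` would be called once on the male learning set and once on the female learning set within each LOOCV trial, and the genes with a score greater than 0 would form the candidate set for that sex.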
To verify sex-specific biomarkers, we consider the following four different cases: we classify the outcome of
In order to validly evaluate the performance of a gene set selected by the proposed method, twenty trials of 10-fold CV within LOOCV are used. CV utilizes resampling without replacement of the entire data set to repeatedly develop classifiers on a training set and to evaluate these classifiers on a separate test set and then averages the results over the resamples.
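To make the evaluation loop concrete, the sketch below shows LOOCV in which gene selection is repeated inside every trial, so the held-out patient never influences which genes are chosen. It is an illustration under stated assumptions: the classifier is a plain diagonal linear discriminant analysis (DLDA) with equal priors and pooled per-gene variances, and `select_genes` is a hypothetical callback standing in for whatever selection procedure is applied to the learning set.

```python
import numpy as np

def dlda_fit(X, y):
    """Fit DLDA: class means plus one pooled variance per gene."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = X - means[np.searchsorted(classes, y)]
    var = (resid ** 2).sum(axis=0) / max(len(y) - len(classes), 1)
    return classes, means, np.maximum(var, 1e-12)

def dlda_predict(model, x):
    """Assign x to the class with the smallest variance-scaled distance."""
    classes, means, var = model
    dist = (((x - means) ** 2) / var).sum(axis=1)
    return classes[np.argmin(dist)]

def loocv_accuracy(X, y, select_genes):
    """Leave-one-out accuracy with gene selection redone in every trial.

    select_genes(X_learn, y_learn) is a hypothetical callback that
    returns an array of column indices chosen from the learning set only.
    """
    n = len(y)
    hits = 0
    for i in range(n):
        learn = np.arange(n) != i            # all patients except patient i
        genes = select_genes(X[learn], y[learn])
        model = dlda_fit(X[learn][:, genes], y[learn])
        hits += int(dlda_predict(model, X[i, genes]) == y[i])
    return hits / n
```

Keeping the selection inside the loop is the design point: selecting genes once on the full data and then cross-validating only the classifier would tend to give an optimistic accuracy estimate.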
In this section, we apply the proposed algorithm to the following genomic data sets to find the most meaningful sex-specific predictive/prognostic genomic biomarkers for improving individualized treatment of patients and for evaluating the biomarkers from the proposed algorithm.
Current chemotherapy enables a high percentage of pediatric patients with AML to enter complete remission (CR), but a large number of them experience relapse (R) with resistant disease [
This gene-expression data set consists of 54 AML pediatric patients (<15 years old) with an oligonucleotide microarray containing 12,566 probe sets and it is also available at
With the prognostic endpoint (R/CR), an average accuracy of 66% (sd 3.0%) for pediatric patient classification was obtained when no gene selection was introduced. When a set of 200 genes was selected in the learning phase of each CV, an average accuracy of 68.0% (sd 3.0%) was achieved. When a set of 20 genes was selected in the learning phase of each CV, an average accuracy of 71.0% (sd 5.0%) was obtained. Since accuracy did not appear to differ substantially with the number of genes selected, a feasible set of 25 genes in the learning phase of each CV was collectively ranked by the BW ratio to find sex-specific biomarkers.
At the end of LOOCV trials, a set of male-specific genes that were selected at least 75% of the time in the entire LOOCV were 1882_g_at (MECOM: MDS1 and EVI1 complex locus), 37902_at (CRYZ: crystallin, zeta (quinone reductase)), 38789_at (TKT: transketolase (Wernicke-Korsakoff syndrome)), 39105_at (VASP: vasodilator-stimulated phosphoprotein), 40844_at (CTR9: Ctr9, Paf1/RNA polymerase II complex component, homolog (
In order to select a feasible set of sex-specific biomarkers, a cut-off criterion of 75 percent is used. Since every dataset may have a different sample size, the cut-off is expressed as a percentage: a gene is selected if its rank score is greater than 75% of the sample size. For example, if the sample size is 100, then genes with rank scores greater than 75 are selected. The LOOCV depends on the sample size
Similarly, nine top-ranked genes for classifying female patients into R/CR were 40601_at (TM2D1: TM2 domain containing 1), 36330_at (CCBL1: cysteine conjugate beta- lyase; cytoplasmic (glutamine transaminase K, kynurenine aminotransferase)), 40586_at (EEF1E1: eukaryotic translation elongation factor 1 epsilon 1), 36648_at (CRSP9: cofactor required for Sp1 transcriptional activation, subunit 9, 33 kDa), 32351_at (GPR20: G protein-coupled receptor 20), 1718_at (ARPC2: actin-related protein 2/3 complex, subunit 2, 34 kDa), 38622_at (MTG1: mitochondrial GTPase 1 homolog (
To verify the sex-specific genes, we applied the DLDA classification algorithm. The results were as expected. Male-specific genes classified male patients with higher prediction accuracy (71.9%) than female-specific genes did (43.8%). Similarly, female-specific genes classified female patients with higher prediction accuracy (76.2%) than male-specific genes did (61.9%). As shown in Table
Performance (%) of the sex-specific genes in pediatric AML data set. (ACC: accuracy; SEN: sensitivity; SPC: specificity; PPV: positive predictive value; NPV: negative predictive value).
Case | Patients | ACC | SEN | SPC | PPV | NPV |
---|---|---|---|---|---|---|
Male data with male genes | 32 | 71.9 | 86.7 | 58.8 | 65.0 | 83.3 |
Male data with female genes | 32 | 43.8 | 40.0 | 47.1 | 40.0 | 47.1 |
Female data with male genes | 21 | 61.9 | 70.0 | 54.5 | 58.3 | 66.7 |
Female data with female genes | 21 | 76.2 | 70.0 | 81.8 | 77.8 | 75.0 |
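The five performance measures reported in the table can be computed from the four cells of a two-class confusion matrix, as in the short sketch below. The function name `performance` is hypothetical, and the example counts (13 true positives, 2 false negatives, 10 true negatives, 7 false positives) are not given in the text: they were back-calculated here as one set of counts consistent with the "Male data with male genes" row, without knowing which class the authors treated as positive.

```python
def performance(tp, fn, tn, fp):
    """Return ACC, SEN, SPC, PPV and NPV in percent.

    A measure whose denominator is zero is returned as NaN.
    """
    def pct(num, den):
        return 100.0 * num / den if den else float("nan")

    return {
        "ACC": pct(tp + tn, tp + fn + tn + fp),  # overall accuracy
        "SEN": pct(tp, tp + fn),                 # sensitivity
        "SPC": pct(tn, tn + fp),                 # specificity
        "PPV": pct(tp, tp + fp),                 # positive predictive value
        "NPV": pct(tn, tn + fn),                 # negative predictive value
    }

# Back-calculated (assumed) counts for 32 male patients.
male_with_male_genes = performance(tp=13, fn=2, tn=10, fp=7)
for name, value in male_with_male_genes.items():
    print(f"{name}: {value:.1f}")
```

Returning NaN rather than raising an error is a choice made for this sketch; it matters for small validation sets in which a classifier never predicts one of the two classes.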
Genomic aberrations and mutational status of the immunoglobulin variable heavy chain (VH) gene have been shown to be among the most important predictors for outcome in patients with B-CLL [
The gene-expression data consists of 100 B-CLL patients with an oligonucleotide microarray containing around 12,000 probe sets, and it is available at
In Step 1, for each data set with male only and female only patients using the target endpoint (i.e., VH-mutated (M) versus unmutated (NM)), we separately selected and ranked 25 potential prognostic genes for males and for females in every CV trial and separately combined ranks of these genes for males and females during the learning phase of 20 trials of 10-fold CV within each LOOCV trial. In every LOOCV trial, we prioritized and combined the final top-ranked 200 genes from the male patients result (
After deletion of overlapped genes in both males and females, eleven potential male-specific genes were obtained to classify male patients into M/NM. They were 41209_at (LPL: lipoprotein lipase), 41755_at (COBLL1: COBL-like 1), 39878_at (PCDH9: protocadherin 9), 38211_at (ZBTB20: zinc finger and BTB domain containing 20), 39488_at (PCDH9: protocadherin 9), 36886_f_at (KIR2DL3: killer cell immunoglobulin-like receptor, two domains, long cytoplasmic tail, 3), 32140_at (SORL1: sortilin-related receptor, L(DLR class) A repeats-containing), 33535_at (P2RX1: purinergic receptor P2X, ligand-gated ion channel, 1), 39967_at (LDOC1: leucine zipper, downregulated in cancer 1), 32842_at (BCL7A: B-cell CLL/lymphoma 7A), and 36899_at (SATB1: special AT-rich sequence binding protein 1 (binds to nuclear matrix/scaffold-associating DNA’s)). For female patients, only five genes were selected at least 75% of the time in the entire LOOCV trial as potential female-specific genomic biomarkers to classify female patients into M/NM. They were 33745_at (PHKG2: phosphorylase kinase, gamma 2 (testis)), 38152_at (LOH11CR2A: loss of heterozygosity, 11, chromosomal region 2, gene A), 34142_at (PDE8A: phosphodiesterase 8A), 39593_at (FGL2: fibrinogen-like 2), and 217_at (KLK2: kallikrein 2, prostatic).
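The selection rule used here (keep genes chosen in at least 75% of LOOCV trials, then delete genes common to both sexes) can be written compactly. The sketch below is illustrative only: the function name `sex_specific_genes`, the gene labels, and the counts are hypothetical, and "at least 75%" is implemented as `>=`, which is an assumption where the text elsewhere says "greater than".

```python
def sex_specific_genes(male_counts, female_counts, n_male, n_female, cutoff=0.75):
    """Apply the cut-off per sex, then drop genes selected for both sexes.

    male_counts / female_counts map a gene label to the number of LOOCV
    trials in which that gene was selected; n_male / n_female are the
    numbers of LOOCV trials (sample sizes) for each sex.
    """
    male = {g for g, c in male_counts.items() if c >= cutoff * n_male}
    female = {g for g, c in female_counts.items() if c >= cutoff * n_female}
    overlap = male & female
    return male - overlap, female - overlap, overlap

# Hypothetical counts, for illustration only (not taken from the study).
male_counts = {"gene_a": 60, "gene_b": 50, "gene_c": 20}
female_counts = {"gene_b": 35, "gene_d": 30, "gene_e": 10}
male_only, female_only, shared = sex_specific_genes(
    male_counts, female_counts, n_male=62, n_female=38
)
print(sorted(male_only), sorted(female_only), sorted(shared))
```

With these made-up counts the thresholds are 46.5 and 28.5 trials, so `gene_b` passes for both sexes and is removed as an overlapped gene.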
To verify the sex-specific genomic biomarkers, we considered the performance of four different cases in the LOOCV trials as described in Section
Performance (%) of overlapped genes in the B-CLL data set. (ACC: accuracy; SEN: sensitivity; SPC: specificity; PPV: positive predictive value; NPV: negative predictive value).
Case | Patients | ACC | SEN | SPC | PPV | NPV |
---|---|---|---|---|---|---|
Male data with male genes | 62 | 67.7 | 72.7 | 62.1 | 68.6 | 66.7 |
Male data with female genes | 62 | 40.3 | 36.4 | 44.8 | 42.9 | 38.2 |
Female data with male genes | 38 | 50.0 | 50.0 | 50.0 | 47.4 | 52.6 |
Female data with female genes | 38 | 47.4 | 61.1 | 35.0 | 45.8 | 50.0 |
Skin cancer is one of the most common malignancies in the United States. Although cutaneous melanoma represents a small subset, it is the most life-threatening neoplasm of the skin, and its incidence and mortality have been increasing worldwide. The key underlying molecular events have not been clearly elucidated, which may explain why no target has been developed and why almost no clinical benefits from new therapies have been clearly demonstrated in patients with melanoma since the late 1970s [
This gene-expression data set was collected from 83 patients corresponding to the training data set and 17 patients corresponding to the validation data set. The data consists of approximately 37,000 probe sets with dual-channel oligonucleotide microarrays. The probes are from tumor tissue and from reference tissue that are differentially labeled by the incorporation of cyanine 3 (Cy3) and cyanine 5 (Cy5), respectively. The data is available at:
In this data set, the endpoint was patient prognosis and survival along with tumor stage, defined as follows. In Stage I, cure rates with surgical removal are excellent, since these melanomas are the least likely to spread. In Stage II, melanomas can be cured, but the success rate lags behind that of Stage I because a small number of cancer cells may have spread to distant sites. In Stage III, since the tumor has started to metastasize (that is, to spread from one organ or part to another nonadjacent organ or part), the survival rate is lower than in the earlier stages. Stage IV is associated with metastasis beyond the regional lymph nodes to distant sites in the body, such as the lung, liver, or brain, or to distant areas of the skin. Based on the tumor size, descriptions, and number of lymph nodes, the stages are grouped into two classes. The first class, high survival and small tumor size (HS/ST), is composed of Stages I and II. The second, low survival and non-small tumor size (LS/NST), is composed of Stages III and IV.
Tables
Melanoma training dataset distribution.
Gender (class) | Clinical endpoint: high survival and small tumor size | Clinical endpoint: low survival and nonsmall tumor size | Total patients | Final gene |
---|---|---|---|---|
Male (“0”) | 12 | 15 | 27 | 4641 |
Female (“1”) | 30 | 26 | 56 | |
Total | 42 | 41 | 83 | |
Melanoma validation dataset distribution.
Gender | Clinical endpoint: high survival and small tumor size | Clinical endpoint: low survival and nonsmall tumor size | Total |
---|---|---|---|
Male (“0”) | 1 | 7 | 8 |
Female (“1”) | 1 | 8 | 9 |
Total | 2 | 15 | 17 |
Since the validation set was separately provided, the sex-specific genes were selected via 20 trials of 10-fold CV in the learning set. The following ten male-specific genes were selected at least 75% of the time during 20 trials of 10-fold CV: A_23_P128263 (PRB1: proline-rich protein BstNI subfamily 1), A_23_P83838 (CA8: carbonic anhydrase VIII), A_24_P212990 (MGC70863: similar to RPL23AP7 protein), A_23_P333650 (RAD9B: RAD9 homolog B (
Using a given validation set, the sex-specific genes were verified. As shown in the confusion matrix in Table
Confusion matrix from female data with female genes and from female data with male genes using the validation set.
True class | Predicted (HS/ST, “0”) | Predicted (LS/NST, “1”) |
---|---|---|
(HS/ST, “0”) | 0 | 1 |
(LS/NST, “1”) | 0 | 8 |
Confusion matrix from male data with male genes using the validation set.
True class | Predicted (HS/ST, “0”) | Predicted (LS/NST, “1”) |
---|---|---|
(HS/ST, “0”) | 0 | 1 |
(LS/NST, “1”) | 0 | 7 |
Confusion matrix from male data with female genes using the validation set.
True class | Predicted (HS/ST, “0”) | Predicted (LS/NST, “1”) |
---|---|---|
(HS/ST, “0”) | 0 | 1 |
(LS/NST, “1”) | 3 | 4 |
Large interindividual differences in benefit from chemotherapy highlight the need to develop predictive genomic biomarkers for selecting the right treatment for the right patient. Inappropriate chemotherapy can result in the selection of more resistant and aggressive tumor cells. To date, no reliable genomic biomarkers have been developed to provide the physician with prechemotherapy information to accurately predict the efficacy of a specific therapy.
We proposed a procedure to find sex-specific prognostic and predictive genomic biomarkers in order to assign individualized treatments in a personalized paradigm using variable importance ranking via combination of 20 trials of 10-fold CV and LOOCV. The proposed procedure was applied to data sets obtained from the BRB ArrayTools Data Human Cancer Archive [
In one application, pediatric patients with AML were classified by the algorithms as having either a good or poor prognosis, in terms of the likelihood of induction failure or relapse within one year of the first complete remission, based on gene-expression profiles. If this were brought into clinical application, a patient with a confidently predicted good prognosis might want to opt out of adjuvant chemotherapy and its associated debilitating side effects. With current rule-based decisions, almost all patients are subjected to chemotherapy. The overall average accuracy for this data set with variable selection from pooled patients (males and females) was about 71.0%. However, using male-specific genes found by the proposed procedure, the accuracy improved to about 72%, as we found in the model validation studies. Similarly, using female-specific genes found by the proposed procedure, the average accuracy improved to about 76% (see Table
In the B-cell chronic lymphocytic leukemia (B-CLL) data set, we found male-specific prognostic genomic biomarkers, and the average classification accuracy improved to about 68%. There was no substantial evidence of female-specific prognostic genomic biomarkers in this data set. Similarly, male-specific predictive genomic biomarkers associated with a classification of primary cutaneous melanoma were found, with a classification accuracy of about 88%.
The scope of our paper was to find sex-specific genomic biomarkers, if any, embedded in the data, rather than genomic biomarkers in general. If commonly identified genes were kept in the proposed procedure, the result would no longer be a sex-specific genomic biomarker classifier involving two populations (males and females) but rather a genomic biomarker classifier involving one combined population. In fact, keeping commonly identified genes did not necessarily improve the classification accuracy. As a counterexample, for the pediatric AML data of Yagi et al. [
It is not an easy task to find sex-specific genes, let alone to verify and prove that they are indeed sex-specific. We have presented a procedure for finding sex-specific prognostic and predictive genomic biomarkers in order to assign individualized treatments in a personalized paradigm. The procedure is shown to have good “sensitivity” and “specificity” in the sense that the sex-specific genes obtained can improve prediction accuracy in classifying an individual patient’s prognosis. The proposed procedure to discover predictive and prognostic sex-specific genomic biomarkers for individualized treatment of diseases can play a critical role in developing safer and more effective therapies that replace one-size-fits-all drugs with treatments that focus on specific patient needs.
Hojin Moon’s research was partially supported by the Research, Scholarship, and Creative Activity (RSCA) Award from California State University, Long Beach, and was partially supported by the Faculty Research Participation Program at the NCTR administered by the Oak Ridge Institute for Science and Education through an interagency agreement between USDOE and USFDA. The views presented in this paper are those of the authors and do not necessarily represent those of the U.S. Food and Drug Administration.