Deciphering the Correlation between Breast Tumor Samples and Cell Lines by Integrating Copy Number Changes and Gene Expression Profiles

Breast cancer is one of the most common cancers worldwide, with high incidence and mortality. Although breast cancer cell lines are widely used in laboratory investigations, evidence accumulated over the past decades indicates that genomic differences exist between cancer cell lines and tissue samples. The abundant molecular profiles of cancer cell lines and tumor samples deposited in the Cancer Cell Line Encyclopedia and The Cancer Genome Atlas now allow a systematic comparison of breast cancer cell lines with breast tumors. We characterized the genomic features of primary breast tumors based on copy number variation and gene expression profiles and compared the breast cancer cell lines with different subgroups of breast tumors. Some of the breast cancer cell lines show high correlation with the tumor group expected from previous knowledge, while many do not, including the most commonly used MCF7, MDA-MB-231, and T-47D. We present a computational framework to identify the cell lines that most closely resemble a given tumor group. Our investigation provides a useful guide to bridge the gap between cell lines and tumors and helps to select the most suitable cell line models for personalized cancer studies.


Introduction
Breast cancer is one of the most frequently diagnosed life-threatening cancers in women, with about 235,000 new cases expected in the United States in 2014. It is a complex and heterogeneous disease: tumors may differ in prognosis and respond differently to therapy despite similarities in histological type, grade, and stage. In the laboratory, breast cancer is often modelled using established breast cancer cell lines because they are easy to acquire and use [1].
However, evidence accumulated over the past decades has pointed to genomic differences between cancer cell lines and tissue samples [2][3][4]. In their review, Holliday and Speirs [1] showed that cell lines are prone to genotypic and phenotypic drift during continual culture. This is particularly common in the more frequently used cell lines, especially those that have been deposited in cell banks for many years [5]. Subpopulations may arise and cause phenotypic changes over time through the selection of specific, more rapidly growing clones within a population. Given these findings, it is essential for researchers to choose appropriate cell line models when designing experiments and interpreting results, especially if such cell lines are regarded as valid models for evaluating the pathobiology of breast cancer and/or the likely response to novel drug therapies [1].
With the rapid development of whole genome sequencing and other "-omics" techniques, it is now possible to systematically explore the relationship between tumor tissues and cancer cell lines and to identify the cell lines that most closely resemble particular tumor subtypes. In The Cancer Genome Atlas (TCGA), the genome and expression profiles of at least 500 tissue samples per tumor type are being comprehensively characterized [6]. The Broad-Novartis Cancer Cell Line Encyclopedia (CCLE) contains a compilation of gene expression, chromosomal copy number, and massively parallel sequencing data from 947 human cancer cell lines that are used as models for various tumor types [7]. These large data sets on tumor samples and cell lines offer great potential for mining their associations and characterizing cancer mechanisms.
Traditionally, breast cancer has been classified into luminal A, luminal B, HER2+/ER−, basal-like, and normal-like subtypes based on gene expression profiling or immunohistochemical (IHC) characteristics [6]. However, classification criteria defined using only this information may be insufficient and overly general. In this study, we focus on primary breast tumors and characterize their genomic features based on their gene expression profiles. In addition, previous studies have suggested that DNA copy number variations (CNVs) are important factors influencing altered gene expression levels in cancer [8][9][10]. In a lung cancer study, approximately 78% of genes showed a positive correlation between CNV and gene expression level [11]. Considering the potentially key contribution of CNVs to gene expression variation in breast tumors, copy number profiles were also incorporated in this study.
Using this genomic information, we explored the relationship between the primary breast tumors and the breast cancer cell lines. Furthermore, as intrinsic differences exist among breast tumors, we also examined the correlation between the cell lines and different breast tumor groups and designed a computational framework that helps to select the most suitable cell line models for a specified tumor type.

Data Collection and Tumor Sample Classification.
In our study, we retained only breast tumor samples and cancer cell lines for which both genome-wide DNA copy number information and mRNA expression profiles were available. After curation, 543 breast samples profiled by TCGA [6] (491 primary tumors and 52 normal samples) and 59 breast cancer cell lines from the CCLE [7] were obtained.
Generally, breast cancer may be categorized into luminal A, luminal B, HER2+/ER−, basal-like, and normal-like subtypes based on gene expression profiling or immunohistochemical (IHC) characteristics [12,13]. However, large-scale genomics projects have revealed that heterogeneity exists within the same class of breast cancer patients defined by the classic grouping [6]. Here, to obtain a relatively consistent molecular background for the tumor samples in the same group, we subdivided the 491 breast tumors into 8 groups according to the combined presence or absence of expression of the estrogen receptor (ER), the human epidermal growth factor receptor 2 (ERBB2/HER2), and the progesterone receptor (PR). These include the ER group (ER+, PR−, and HER−;

Copy Number Data Analysis.
Level 3 copy number data were downloaded for breast tumor samples from TCGA (platform: Affymetrix SNP6) [6]. Because CNV sizes differ considerably across tumor samples, the CNV profiles were summarized at the gene level. To enable gene-based analysis, the Bioconductor package CNTools was used to map the segmented copy number data of TCGA samples to genes [14], so that each gene corresponds to only one CNV segment. The mean copy number profile of each group of TCGA samples was obtained by averaging the signal of each gene across all tumor samples in the group. Gene-level copy number data for cancer cell lines were obtained from CCLE (platform: Affymetrix SNP6) [7]. As reported by TCGA and CCLE, significant focal copy number alterations in individual tumor samples and cancer cell lines were identified from segmented data using GISTIC [15].
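The gene-level summarization and group averaging described above can be illustrated with a minimal Python sketch. It is not the CNTools implementation (an R/Bioconductor package); the function names, the tuple layout, and the rule of assigning each gene the segment with the largest overlap are assumptions made for illustration only.

```python
from statistics import mean


def map_segments_to_genes(segments, genes):
    """Assign each gene the value of the single segment that overlaps it most.

    segments: list of (chrom, start, end, segment_mean) tuples (hypothetical layout).
    genes: dict mapping gene symbol -> (chrom, start, end).
    The largest-overlap rule is an assumption, not necessarily what CNTools does.
    """
    gene_values = {}
    for gene, (g_chrom, g_start, g_end) in genes.items():
        best_overlap, best_value = 0, None
        for s_chrom, s_start, s_end, seg_mean in segments:
            if s_chrom != g_chrom:
                continue
            overlap = min(g_end, s_end) - max(g_start, s_start)
            if overlap > best_overlap:
                best_overlap, best_value = overlap, seg_mean
        if best_value is not None:  # genes covered by no segment are skipped
            gene_values[gene] = best_value
    return gene_values


def group_mean_profile(sample_profiles):
    """Average the gene-level signal over all samples of one tumor group.

    Only genes present in every sample are kept (an assumption).
    """
    shared = set.intersection(*(set(p) for p in sample_profiles))
    return {g: mean(p[g] for p in sample_profiles) for g in shared}
```

Applying `map_segments_to_genes` to each sample and passing the resulting dictionaries for one group to `group_mean_profile` yields one mean profile per tumor group.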

Gene Expression Data Analysis.
We used data from the Agilent G4502A 07 platform for TCGA, with measurements of 17,814 genes. Differentially expressed genes were selected based on the fold change of gene expression between each group of tumor samples and the control (normal group), with a cutoff of |log2 fold change| > 1 [17]. The overexpression/underexpression frequency was calculated for each gene in each tumor group. For example, if gene A was overexpressed in the ER group compared with the normal group, the proportion of tumor samples in the ER group with an expression value of gene A higher than the mean expression value of gene A in the normal group was defined as the overexpression frequency of gene A in the ER group.
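As a hedged illustration of the two quantities defined above, the following Python sketch computes a log2 fold change and an overexpression frequency for one gene. It assumes the expression values are already on a log2 scale, so that the fold change is a difference of group means; the function names are hypothetical and the paper's exact computation may differ.

```python
from statistics import mean


def log2_fold_change(tumor_values, normal_values):
    """Difference of group means; assumes values are already log2-transformed."""
    return mean(tumor_values) - mean(normal_values)


def is_differentially_expressed(tumor_values, normal_values, cutoff=1.0):
    """Apply the |log2 fold change| > 1 cutoff described in the text."""
    return abs(log2_fold_change(tumor_values, normal_values)) > cutoff


def overexpression_frequency(tumor_values, normal_values):
    """Fraction of tumor samples whose value exceeds the mean of the normal group."""
    normal_mean = mean(normal_values)
    return sum(v > normal_mean for v in tumor_values) / len(tumor_values)


def underexpression_frequency(tumor_values, normal_values):
    """Fraction of tumor samples whose value is below the mean of the normal group."""
    normal_mean = mean(normal_values)
    return sum(v < normal_mean for v in tumor_values) / len(tumor_values)
```

For a gene measured at 3.0, 2.5, 0.5, and 3.0 in four tumor samples and at 1.0 in each normal sample, the log2 fold change is 1.25 and three of the four tumor samples lie above the normal mean.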
CCLE expression data were obtained using Affymetrix U133 Plus 2.0 Arrays, with measurements of 18,926 genes. Differentially expressed genes were selected based on the fold change of gene expression between each cell line and the average expression value across all the cell lines [17].
For the comparison between gene expression data from TCGA and CCLE, robust z-scores (median-centered expression values divided by the median absolute deviation) were derived separately for the two data sets, and only the genes common to both were retained.
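The robust score defined above (median-centered value divided by the median absolute deviation) can be written as a short Python sketch. Whether the scores were computed per gene across samples or per sample across genes is not stated in the text, so the input here is simply a list of values; the function names are hypothetical, and no scaling constant is applied to the median absolute deviation because the text does not mention one.

```python
from statistics import median


def robust_z_scores(values):
    """Median-center the values and divide by the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("median absolute deviation is zero; scores are undefined")
    return [(v - med) / mad for v in values]


def restrict_to_common_genes(profile_a, profile_b):
    """Keep only the genes measured in both data sets (dicts of gene -> value)."""
    common = sorted(set(profile_a) & set(profile_b))
    return {g: profile_a[g] for g in common}, {g: profile_b[g] for g in common}
```

Because the median and the median absolute deviation are insensitive to a few extreme values, a single outlier does not compress the scores of the remaining measurements.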

Gene Set Functional Enrichment Analysis.
Gene set enrichment analyses were performed for the functional annotation of the differentially expressed genes, using the Functional Annotation Tools in DAVID Bioinformatics Resources [18]. Gene ontology biological process terms with a P value less than 0.05 and more than two genes were considered significantly enriched functions for further analysis.

Figure 1: (a) DNA copy number change profiles in each group of breast tumor samples. The CNV frequency across the whole genome was calculated; copy number gains are marked in red and losses in green. The axis in each subgraph represents the frequency of copy number gain/loss of the corresponding gene. (b) Clustering of the CNV data. The CNVs on each chromosome in each sample group were clustered separately. Copy number gains are marked in red and losses in blue.

The Construction of "Pathway of the 384 Genes in Breast Tumors".
First, pathways closely related to breast cancer were collected from the NCI website (http://www.cancer.gov/) and a literature review: the Estrogen Signaling pathway, ERBB pathway, PI3K/Akt/mTOR Signaling pathway, p53 Signaling pathway, Ras Signaling pathway [19], Notch Signaling pathway [20], Wnt Signaling pathway [21], and NFkB pathway [22]. These pathways were retrieved from the KEGG pathway database [23] and merged into one large pathway via their overlapping elements.

Rank Aggregation.
Two ranking lists derived from copy number profiles and gene expression profiles were fused into one ranking list using the R package RankAggreg [24]. The Cross Entropy Monte Carlo (CE) algorithm together with the Spearman distance was used to perform the rank fusion. The maximum number of iterations was set to 1000.
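To make the idea of rank fusion concrete, the following Python sketch fuses two ranked lists. It does not reproduce RankAggreg or its Cross Entropy Monte Carlo search: as a stand-in, it uses a simple mean-rank (Borda-style) heuristic and a footrule-type distance between rankings, and all function names are hypothetical.

```python
def footrule_distance(ranking_a, ranking_b):
    """Sum of absolute position differences between two rankings of the same items."""
    position_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - position_b[item]) for i, item in enumerate(ranking_a))


def mean_rank_aggregate(rankings):
    """Fuse rankings by ordering items by their mean position (ties broken by name).

    This Borda-style heuristic is a swapped-in simplification, not the
    Cross Entropy Monte Carlo optimization used by RankAggreg.
    """
    items = rankings[0]
    mean_rank = {
        item: sum(r.index(item) for r in rankings) / len(rankings) for item in items
    }
    return sorted(items, key=lambda item: (mean_rank[item], item))


# Hypothetical rankings of four cell lines from the two data types.
by_copy_number = ["A", "B", "C", "D"]
by_expression = ["C", "A", "B", "D"]
fused = mean_rank_aggregate([by_copy_number, by_expression])
```

An optimization-based method such as the one used in the paper instead searches for the list that minimizes the summed distance to all input rankings; `footrule_distance` shows one way such an objective could be evaluated for a candidate list.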

Copy Number Variations in Breast Tumors.
The TCGA and other groups have made great efforts to explore the genomic landscape of breast cancer [6,25]. After classifying the tumor samples from TCGA, we found that, compared with the normal samples, the tumor groups show similar copy number variation (CNV) patterns (supplementary Figure 2). We then obtained 2,426 genes with CNVs across all groups (supplementary Table 1). Notably, in all groups, the majority of these genes undergo frequent copy number gain (Figure 1). Chromosomes 1, 8, 17, and 20 contain most of the genes with CNVs. Consistent with previous studies [6,26], many genes on chromosomes 8 and 17 show copy number gain, such as MYC on chromosome 8q24 and HER2 as well as TOP2A on chromosome 17q21.1.

[Figure caption: Copy number gains are marked in red and losses in green. The 7 outer circles represent the expression values of the 384 genes in the 7 tumor groups, with overexpressed genes in red and underexpressed genes in blue. Genes are arranged in chromosomal order starting from chr1, and the circular rings denote the tumor groups (from outside to inside: ER, PR, HER, ERPR, ERHER, TP, and TN).]

Differentially Expressed Genes in Breast Tumors.
In total, there were 4,843 differentially expressed genes (DEGs) across all groups of tumor samples from TCGA (supplementary Table 2). Of these, 399 were overexpressed in all the tumor groups, while 588 were underexpressed in all groups (supplementary Figures 3 and 4). Only 5 overexpressed genes and 14 underexpressed genes were unique to the ER group, whereas 254 overexpressed genes and 219 underexpressed genes were unique to the TN group (supplementary Figures 3 and 4). The overexpression/underexpression frequency was then calculated for each of the 4,843 genes in each group. Notably, 413 of the genes differ greatly among the tumor sample groups (the deviation between the highest and the lowest frequency of the gene across the groups is greater than 1), and they are significantly enriched in the regulation of hormone levels and cell adhesion.

Genes with Correlations between Copy Number and Expression.
We found that, in total, 384 individual genes show copy number changes associated with altered expression for all tumor sample groups (Figure 2 and supplementary Table 3). The majority of these genes are located on chromosomes 1, 8, and 17, which is not surprising, as most of the genes with CNVs are concentrated on these chromosomes. Genes with large copy number changes also show large expression changes, such as ERBB2, PSMD3, and TCAP. Altogether, these genes are significantly enriched in biological processes related to the cell cycle. Amplified and overexpressed genes are prime therapeutic targets. For example, the drug trastuzumab, which targets ERBB2, has been shown to improve breast cancer survival rates alone or in combination with other treatments [27][28][29]. The amplified and overexpressed genes in each tumor sample group might be potential therapeutic targets for that tumor type, such as CCND1 and CCNE2 for the ER group and E2F5 and EIF2C2 for the PR group. Of these genes, 23 are distributed in pathways closely related to breast cancer, including the ERBB pathway, the PI3K/Akt Signaling pathway, and the NFkB pathway (Figure 3); whether these genes are druggable needs further exploration.

The presence or absence of expression of ER, HER2, and PR in the breast cancer cell lines is shown in Figure 4, and the cell lines were accordingly clustered into three parts. These cell lines were also classified into 7 groups, as for the breast tumors. The cell lines within the same group show quite different copy number patterns (supplementary Figure 5). The number of overexpressed/underexpressed genes and the number of genes with copy number changes in each cell line are also shown in Figure 4. In general, most of the cell lines have more genes with CNVs than DEGs, while CAL51, HS343T, HS606T, HS281T, HMEL, HS274T, HS739T, and HS742T have more DEGs than genes with CNVs.

BioMed Research
Additionally, we examined the overlap ratio of genes showing copy number changes associated with altered expression between each breast cancer cell line and each tumor sample group (Figure 7). This ratio also indicates the correlation between cancer cell lines and the different tumor samples, as it is highly consistent with the correlations obtained from copy number profiles or gene expression profiles alone.

Ranking of the Breast Cancer Cell Lines as Candidate Models for Certain Tumor Group Study.
Breast cancer is a complex disease that arises from coordinated alterations at the genomic, epigenomic, and proteomic levels. It is therefore important to consider multiple data sets together to make the most of the biological information available across assays relevant to breast cancer. With the accumulated copy number profiles and gene expression profiles for different cancer cell lines and tumor samples, we can evaluate whether a given breast cancer cell line is a good model for a specific tumor group by integrating these two types of information. We designed a rank aggregation model that orders the cell lines according to their correlation with each tumor group, based on the integration of copy number profiles and gene expression profiles. First, the breast cancer cell lines were ranked in descending order of their similarity to each tumor group using copy number profiles and gene expression profiles separately. Then, for each tumor group, the two derived ranking lists of the breast cancer cell lines were fused into one ranking list using the R package RankAggreg [24]. In this way, good cell line models for each tumor group were selected from the 59 breast cancer cell lines.
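The first step of this procedure, ranking cell lines by their similarity to one tumor group, can be sketched in Python as follows. The text does not specify the similarity measure, so the Pearson correlation used here is an assumption, as are the function names and the requirement that every cell line profile contains all genes of the group profile.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rank_cell_lines(cell_line_profiles, group_profile):
    """Order cell lines by descending correlation with a tumor group's mean profile.

    cell_line_profiles: dict of cell line name -> dict of gene -> value.
    group_profile: dict of gene -> mean value for one tumor group.
    """
    genes = sorted(group_profile)
    target = [group_profile[g] for g in genes]
    scores = {
        name: pearson([profile[g] for g in genes], target)
        for name, profile in cell_line_profiles.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Running `rank_cell_lines` once on copy number profiles and once on expression profiles gives the two ranked lists per tumor group that are subsequently fused.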

Comparing All the Cancer Cell Lines with Breast Tumor Samples.
Similarly, we evaluated the correlation between all the cancer cell lines in CCLE and the breast tumor sample groups, using the copy number and gene expression profiles (supplementary Tables 4 and 5). From the perspective of either copy number or gene expression profiles, some breast cancer cell lines ranked highly in correlation with any of the breast tumor groups; interestingly, some lung cancer and ovarian cancer cell lines also show high correlation with at least one of the breast tumor groups.

Breast Tumor Sample Groups Differ Greatly in the Regulation of Hormone Levels and Cell Adhesion.
Of the DEGs, 413 differ greatly in the frequency of overexpression/underexpression across the tumor sample groups. Gene set functional enrichment analysis showed that these genes are significantly enriched in biological processes including the regulation of hormone levels and cell adhesion. The enrichment in the regulation of hormone levels is expected. As cell adhesion is related to cancer metastasis, we examined the literature and found that different breast cancer subtypes differ in their tendency to metastasize to different sites [31,32]. However, the classification of breast cancer into subtypes does not typically inform about metastatic behavior. The genes enriched in cell adhesion (COL9A1, ITGB8, ITGB6, TTYH1, RET, etc.) may serve as important indicators of different types of breast cancer. Because information in this field is limited, the roles of these genes in shaping the tendency of breast cancer to metastasize to different sites need further study.

[Figure axis labels (cell lines): BT20, BT474, BT483, BT549, CAL120, CAL148, CAL51, CAL851, CAMA1, DU4475, EFM19, EFM192A, EVSAT, HCC1143, HCC1187, HCC1395, HCC1419, HCC1428, HCC1500, HCC1569, HCC1599, HCC1806, HCC1937, HCC1954, HCC202, HCC2157, HCC2218, HCC38, HCC70, HDQP1, HMC18, HMEL, HS274T, HS281T, HS343T, HS578T, HS606T, HS739T, HS742T, JIMT1, KPL1, MCF7, MDAMB134VI, MDAMB157, MDAMB175VII, MDAMB231, MDAMB361, MDAMB415, MDAMB436, MDAMB453, MDAMB468, SKBR3, T47D, UACC812, UACC893, YMB1, ZR751]

investigation of the two perspectives also provides a guide to reveal the relationship between breast cancer cell lines and the tumor samples, as well as to select suitable cell lines for the corresponding breast tumor group. In general, the correlations between the cancer cell lines and the tumor sample groups indicated by the two aspects were consistent with each other. The association between copy number variations and gene expression has been investigated by several research groups [33,34].
As DNA copy number variations (CNVs) are important factors influencing altered gene expression levels in cancer, the observed high consistency was expected. Some of the cancer cell lines show high correlation with the tumor group to which they were preassigned based on the presence or absence of expression of ER, HER2, and PR, while many do not show this tendency, including the most commonly used MCF7, MDA-MB-231, and T-47D. According to ATCC (http://www.atcc.org/), one of the largest bioresource centers in the world, which offers investigators a wide array of human, animal, insect, fish, and stem cell lines, these three cell lines are not derived from primary breast cancer but are metastatic breast cancer cell lines derived from pleural effusion. Some of the cell lines (HS343T, HS606T, HS739T, and HS742T) show low correlation with every primary tumor group, whether calculated using copy number profiles or gene expression profiles. The low correlation is probably because these cell lines did not originate from primary tumors, or because they were contaminated during continual culture.

Correlation between Breast Cancer Cell
Based on the fused rank derived from the similarity of copy number profiles and gene expression profiles, the breast cancer cell lines that most closely resemble each tumor group were selected as good models for that group. Further evidence might be obtained by investigating mutation profiles, proteomics data, and so forth.

Lung Cancer Cell Lines and Ovary Cancer Cell Lines Show High Correlation with the Breast Tumor Samples.
By evaluating the correlation between the other cancer cell lines in CCLE and the breast tumor groups, we found that some of the lung cancer and ovarian cancer cell lines show high correlation with the breast tumors. In the systematic analysis of the genomic characteristics of breast tumors, similarities to ovarian and lung tumors were observed [6]. The high correlation between some of the ovarian/lung cancer cell lines and the breast tumors is therefore understandable. In addition to the similar CNV profiles (e.g., common gains on chromosomes 1, 8, 17, and 20) and gene expression profiles (e.g., overexpression of AKT3 and MYC) between the breast tumor samples and the ovarian/lung cancer cell lines, there are other commonalities between them. For example, breast tumors and ovarian tumors share risk factors including hormone therapy, obesity, and inherited genetic risk such as BRCA1 and BRCA2 mutations [35,36]. Breast tumors and lung tumors both have a high frequency of TP53 mutations, EGFR mutations, and so forth [37].

Conclusion
In this paper, we investigated the correlation between different groups of primary breast tumors and breast cancer cell lines using copy number profiles and gene expression profiles. Although the correlation between tumors and cancer cell lines does not appear to be very high, established cell lines will, given their ease of use, undoubtedly continue to be used as models for breast cancer. Our study is expected to provide a useful guide for researchers to understand the limitations of cell lines and to select suitable cell lines as tumor models for better investigation of cancer mechanisms.