A Canonical Correlation Analysis of AIDS Restriction Genes and Metabolic Pathways Identifies Purine Metabolism as a Key Cooperator

Human immunodeficiency virus causes a severe disease in humans, referred to as acquired immune deficiency syndrome (AIDS). Studies on the interaction between host genetic factors and the virus have revealed dozens of genes that impact diverse processes in AIDS. To resolve more genetic factors related to AIDS, a canonical correlation analysis was used to determine the correlation between AIDS restriction gene and metabolic pathway gene expression. The results show that HIV-1 postentry cellular viral cofactors from AIDS restriction genes are coexpressed in human transcriptome microarray datasets. Further, the purine metabolism pathway comprises novel host factors that are coexpressed with AIDS restriction genes. Using a canonical correlation analysis of expression data is a reliable approach to exploring the mechanism underlying AIDS.


Introduction
Human immunodeficiency virus (HIV) is the causative agent of acquired immune deficiency syndrome (AIDS); its prodigious replication destroys the lymphoid system, which reduces a patient's ability to survive. Since HIV was identified in the 1980s, this pathogen has taken more than 10 million lives throughout the world. Over the past few decades, researchers have accumulated considerable information on HIV immunology, virology, host genetics, and treatment.
Human genetics research on HIV infection has progressed considerably since the initiation of the human genome project (HGP), which aimed to sequence and map the entire human genome, both physically and functionally [1]. Many host genetic factors that influence AIDS epidemiological heterogeneity have been characterized [2][3][4]. From the HIV entry receptor on lymphoid cells to oncogenes in human glioblastomas, AIDS restriction genes (ARGs) are widely involved in biological pathways, and nearly 40 ARGs have been studied in depth through functional analyses [5][6][7][8][9][10][11][12]. Host genomic analysis is a key approach to studying AIDS epidemiology [13].
Further, genome, transcriptome, proteome, and metabolome biodatasets related to HIV have grown exponentially due to advanced sequencing technology. However, integrative studies of these datasets remain limited, which hinders understanding of the complicated biological network.
Recent studies have revealed that metabolic pathways exert certain effects on the control of AIDS disease progression [14]. For example, the oxygen concentration can modulate T-cell differentiation by controlling metabolic status [15]. Metabolizing ATP to adenosine inhibits HIV-specific effector cells. Further, HIV infection is affected by dNTP hydrolysis. Efficient HIV-1 infection of CD4(+) lymphocytes requires sufficient glucose uptake via the Glut1 glucose transporter [16]. Tryptophan and phenylalanine metabolism also play an important role in HIV because HIV pathophysiology is associated with inflammatory stress due to dysregulated amino acid metabolism [17]. The HIV protein NEF impacts lipid-related metabolism by impairing cholesterol metabolism in both infected and bystander cells [18,19]. This evidence suggests that cross talk between AIDS and host metabolism is an important research topic for resolving the disease mechanism and aiding therapy. Integrating biodatasets with an in-depth analysis of host AIDS restriction genes and metabolic pathways is therefore imperative.
In the transcriptome, gene coexpression is a model for understanding how individual genes are correlated under certain conditions [20,21]. Based on advances in this field, researchers hypothesize that the coexpression of genes in two pathways indicates an integrative correlation between those pathways. Complete gene sets for the metabolic pathways are available for the human genome. Identifying correlations between a group of metabolic pathway genes and the ARGs is a more comprehensive means of understanding integrative biodatasets. However, traditional methods based on a Pearson or partial correlation are only suitable for comparing individual genes. A canonical correlation analysis (CCA) is an efficient and powerful approach for measuring coexpression between two sets of genes. A Childhood Asthma Management Program (CAMP) study using a CCA successfully detected genetic regulatory variants [22]. Using the CCA, the glioblastoma transcriptomes of 45 patients were thoroughly analyzed to identify the glioma pathway genes [23].
In this paper, we used a CCA to analyze coexpression between ARGs and metabolic pathways from KEGG. We discuss the most important metabolic pathways coexpressed with the ARGs, which may imply strategies for AIDS diagnosis and therapy.

Datasets.
Human genome expression datasets were downloaded from COXPRESdb (http://coxpresdb.jp/), which contains approximately 4000 experiments and expression data on 20,000 human genes. Metabolic pathway genes were downloaded from KEGG (http://www.kegg.jp/); this dataset includes 129 typical metabolic pathways with predicted genes. The ARGs were collected from the published literature. Two expression datasets were generated, one containing the metabolic pathway gene expression data and the other the ARG expression data (Tables 1 and 2).
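As an illustration of how the two expression datasets could be assembled, the following minimal sketch splits one gene-by-experiment table into an ARG matrix and a metabolic pathway matrix. The gene identifiers, values, and the helper name `subset_matrix` are hypothetical and chosen only for the example; the actual COXPRESdb and KEGG file formats are not reproduced here.

```python
import numpy as np

# Hypothetical toy table: rows are genes, columns are experiments.
# Identifiers and values are invented for illustration only.
gene_ids = ["G1", "G2", "G3", "G4", "G5"]
expression = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [0.5, 0.7, 0.9, 1.1],
    [3.0, 3.5, 2.5, 2.0],
    [1.5, 1.2, 1.8, 2.1],
])

arg_ids = ["G1", "G4"]             # assumed ARG list (from the literature)
pathway_ids = ["G2", "G3", "G5"]   # assumed metabolic pathway gene list


def subset_matrix(expression, gene_ids, wanted):
    """Return the wanted genes that are present and a samples-by-genes matrix."""
    index = {gene: row for row, gene in enumerate(gene_ids)}
    kept = [gene for gene in wanted if gene in index]
    rows = [index[gene] for gene in kept]
    # Transpose so that each row is an experiment and each column a gene.
    return kept, expression[rows, :].T


arg_genes, X = subset_matrix(expression, gene_ids, arg_ids)
pathway_genes, Y = subset_matrix(expression, gene_ids, pathway_ids)
print(X.shape, Y.shape)
```

Both matrices share the same rows (experiments), which is what a canonical correlation analysis between two gene sets requires.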

Canonical Correlation Analysis.
To analyze the correlations between ARG and metabolic pathway gene expression, we used a CCA, which integrates multiple correlations into a few significant correlations. This statistical method calculates the correlation between two sets of variables and generates statistically independent pairs of new variables, which are referred to as canonical variables. The linear combination of the variables creates a component of the canonical variable pair in each group of the original variables.
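In symbols, the first canonical pair can be written as the solution of a correlation maximization. The notation below is assumed here for illustration: $X$ and $Y$ are the two centered gene sets, $a$ and $b$ are coefficient vectors, and $\Sigma_{XX}$, $\Sigma_{YY}$, and $\Sigma_{XY}$ are the within-set and between-set covariance matrices.

```latex
\[
\rho_1
  = \max_{a,\,b}\ \operatorname{corr}\!\left(a^{\top}X,\; b^{\top}Y\right)
  = \max_{a,\,b}\
    \frac{a^{\top}\Sigma_{XY}\,b}
         {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}}
\]
```

Each later pair maximizes the same quantity subject to being uncorrelated with all earlier pairs, which is why the resulting canonical variables are statistically independent of one another within each set.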
In this study, these variables were defined as follows: ARG expression is described by $p$ genes in the vector $X = (x_1, x_2, \ldots, x_p)$, and metabolic pathway gene expression is described by $q$ genes in the vector $Y = (y_1, y_2, \ldots, y_q)$. The respective sets of canonical variables $C = (C_1, C_2, \ldots, C_k)$ and $P = (P_1, P_2, \ldots, P_k)$ result from linear combinations of ARG and metabolic pathway gene expression. The ARG expression canonical variables are included in the vector $C$, which is the linear combination of the vector $X$ (original ARG expression) and the canonical coefficient vector $a$, as $C = aX$. The vector $P$ contains the canonical variables for metabolic pathway gene expression, which result from the linear combination of the vector $Y$ (original metabolic pathway gene expression) and the canonical coefficient vector $b$, as $P = bY$. The magnitude of the correlation between each pair of canonical variables is described by the eigenvalues. The canonical coefficients are contained in the eigenvectors and can be used to estimate the canonical variables. The variance-covariance matrices contain the variances and covariances within the groups for the ARGs and metabolic pathway genes, respectively. The covariances between variables were calculated from the variance-covariance matrices.
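The computation described above can be sketched numerically. The example below is a minimal illustration in Python with NumPy rather than the R code used in the study; it whitens each gene set with its own covariance matrix and takes a singular value decomposition of the cross-covariance, whose squared singular values correspond to the eigenvalues mentioned above. The function name `cca`, the small `ridge` term, and the simulated data are assumptions made for the example.

```python
import numpy as np


def cca(X, Y, ridge=1e-8):
    """Canonical correlation analysis via whitening and SVD.

    X is an (n, p) samples-by-genes matrix and Y is (n, q).
    Returns the canonical correlations and the coefficient matrices
    A (p, k) and B (q, k), where k = min(p, q).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Within-set and between-set covariance matrices.
    # The ridge term (an assumption) keeps the inverses numerically stable.
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy, full_matrices=False)
    A = Wx @ U       # canonical coefficients for the first set
    B = Wy @ Vt.T    # canonical coefficients for the second set
    return np.clip(s, 0.0, 1.0), A, B


# Simulated data: two gene sets that share one latent signal.
rng = np.random.default_rng(0)
n = 200
z = rng.normal(size=(n, 1))
X = z @ np.ones((1, 3)) + rng.normal(size=(n, 3))
Y = z @ np.ones((1, 4)) + rng.normal(size=(n, 4))

rho, A, B = cca(X, Y)
C = (X - X.mean(axis=0)) @ A   # canonical variables for the first set
P = (Y - Y.mean(axis=0)) @ B   # canonical variables for the second set
print(rho)
```

The correlation between the first columns of `C` and `P` matches the first canonical correlation, which is a quick sanity check on any CCA implementation.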

The Study Design and Software Tools.
The canonical correlation analysis was performed using the R platform (http://www.r-project.org/). After the canonical variables were generated from the expression datasets composed of ARGs and metabolic pathway genes, we set an absolute correlation of 0.15 as the threshold for selecting ARGs correlated with the canonical variables. To select metabolic pathway genes correlated with the canonical variables, we sorted the genes by the absolute value of their correlations, and the top 50 were selected for further enrichment analyses. Functional annotations were generated and enrichment analyses were performed for the metabolic pathway genes using the web-based DAVID tool (http://david.abcc.ncifcrf.gov/). For the pathway enrichment analyses, the "KEGG PATHWAY" category was selected. Pathways with a P value < 0.01 were considered significant.
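The two selection rules can be sketched as follows. This is an illustrative Python example, not the R code used in the study; the function names, the toy gene labels, and the choice of an inclusive threshold (absolute correlation of at least 0.15) are assumptions.

```python
import numpy as np


def loadings(data, canonical):
    """Pearson correlation of each gene (column of data) with each canonical variable."""
    d = (data - data.mean(axis=0)) / data.std(axis=0)
    c = (canonical - canonical.mean(axis=0)) / canonical.std(axis=0)
    return d.T @ c / data.shape[0]


def select_by_threshold(genes, corr, threshold=0.15):
    """Keep genes whose absolute correlation reaches the threshold (inclusive, assumed)."""
    return [g for g, r in zip(genes, corr) if abs(r) >= threshold]


def select_top_n(genes, corr, n=50):
    """Keep the n genes with the largest absolute correlation, strongest first."""
    order = np.argsort(-np.abs(corr), kind="stable")[:n]
    return [genes[i] for i in order]


# Toy correlations between four hypothetical genes and one canonical variable.
genes = ["A", "B", "C", "D"]
corr = np.array([0.40, -0.10, -0.80, 0.15])

print(select_by_threshold(genes, corr))   # genes passing the 0.15 cutoff
print(select_top_n(genes, corr, n=2))     # two strongest genes
```

Ranking by absolute value treats strong negative and strong positive correlations alike, which matches the way both signs are reported in the results.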

The ARGs and Metabolic Pathway Genes
The General CCA Results. Eight significant (P < 0.01, Wilks' lambda; canonical correlation > 0.95) canonical correlations were discerned between the ARG and metabolic pathway gene transcriptomes using the CCA. The ARG canonical variables explained 60% of the total ARG expression variance, and the significant metabolic pathway canonical variables explained 38% of the metabolic gene transcriptome variation. Thus, ARG-metabolic pathway associations accounted for a substantial proportion of the total variance. The first pair of canonical variables had a correlation of 0.99, while the second pair had a correlation of 0.98.
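For reference, the Wilks' lambda statistic for testing the $k$-th and all later canonical correlations is usually written as below. The symbols are notation assumed here: $\rho_i$ are the canonical correlations, $n$ is the number of experiments, and $p$ and $q$ are the numbers of ARGs and metabolic pathway genes. The chi-square form is Bartlett's approximation, which may differ from the exact test computed by the software used in the study.

```latex
\[
\begin{aligned}
\Lambda_k &= \prod_{i=k}^{m} \left(1 - \rho_i^{2}\right), \qquad m = \min(p, q), \\
\chi^{2} &\approx -\left[\, n - 1 - \tfrac{1}{2}\,(p + q + 1) \right] \ln \Lambda_k, \\
\mathrm{df} &= (p - k + 1)(q - k + 1).
\end{aligned}
\]
```

A small $\Lambda_k$ (and hence a large $\chi^{2}$) indicates that at least one of the remaining canonical correlations differs from zero.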

Relationships between the Canonical Variables and Original Genes
Pair 1 (C1, P1). The correlations between the ARGs and the canonical variable C1 are shown in Table 3. As shown in Table 4, the canonical variable P1 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that correlated with P1 were enriched for purine metabolism; these genes include phosphodiesterase 4C (5143), polymerase (RNA) III (DNA directed) polypeptide K (51728), and primase (5558).

Pair 2 (C2, P2). The correlations between the ARGs and C2 are shown in Table 3. Among the ARGs with negative correlations, CXCR1 and IL4 are related to cytokines. DC-SIGN is related to chemokines, which play an important role in HIV entry through chemokine receptors. As shown in Table 4, the canonical variable P2 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that highly correlated with P2 are not enriched in any particular pathway.

Pair 3 (C3, P3). The correlations between the ARGs and C3 are shown in Table 3. As shown in Table 4, the canonical variable P3 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that highly correlated with P3 are enriched in glycolysis and pyrimidine metabolism. The glycolysis genes include phosphoglycerate mutase 1 (5223), glyceraldehyde-3-phosphate dehydrogenase (2597), and glucose-6-phosphatase (57818). The pyrimidine metabolism genes include polymerase (DNA directed), delta 2 (5425), cytidine monophosphate (UMP-CMP) kinase 1 (51727), and uridine monophosphate synthetase (7372).

Pair 4 (C4, P4). The correlations between the ARGs and C4 are shown in Table 3. As shown in Table 4, the canonical variable P4 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that highly correlated with P4 are enriched in purine metabolism. These genes include deoxyguanosine kinase (1716), polymerase (RNA) III (DNA directed) polypeptide K (51728), polymerase (RNA) III (DNA directed) polypeptide B (55703), pyruvate kinase (5313), adenylate cyclase 10 (55811), phosphodiesterase 6D (5147), polymerase (DNA directed), delta 2 (5425), polymerase (RNA) II (DNA directed) polypeptide C (5432), and phosphodiesterase 5A (8654).

Pair 5 (C5, P5). As shown in Table 3, the greatest positive correlation was observed between C5 and IDH1, whereas the greatest negative correlation was observed between C5 and NCOR2. Among the ARGs that highly correlated with C5, PPIA, TSG101, APOBEC3B, TRIM5a, and CUL5 are postentry cellular viral cofactors. HLA-C and HLA-B are members of the HLA system. DC-SIGN and SDF1 are related to chemokines. CXCR1 is related to the cytokine pathway. As shown in Table 4, the canonical variable P5 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that highly correlated with P5 are enriched in inositol phosphate metabolism; these genes include synaptojanin 2 (8871), phospholipase C beta 2 (5330), and inositol-trisphosphate 3-kinase B (3707).

Pair 6 (C6, P6). As shown in Table 3, the greatest positive correlation was observed between C6 and PPIA, whereas the greatest negative correlation was observed between C6 and ZNRD1. Among the ARGs that highly correlated with C6, PPIA, TSG101, TRIM5a, and CUL5 are postentry cellular viral cofactors. HLA-A, HLA-C, and HLA-B are members of the HLA system. CXCR6 is related to chemokine receptors. IRF1 and CXCR1 are related to cytokines. As shown in Table 4, the canonical variable P6 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that highly correlated with P6 are enriched in pyrimidine metabolism and terpenoid backbone biosynthesis. These genes include polymerase (RNA) II (DNA directed) polypeptide F (5435), cytidine monophosphate (UMP-CMP) kinase 1 (51727), polymerase (RNA) I polypeptide B (84172), farnesyl diphosphate synthase (2224), and acetyl-CoA acetyltransferase 1 (38).

Pair 7 (C7, P7). As shown in Table 3, the correlations between the ARGs and C7 include −0.54 and, for CUL5, −0.80. The greatest positive correlation was observed between C7 and IDH1, whereas the greatest negative correlation was observed between C7 and CUL5. Among the ARGs that highly correlated with C7, TSG101, APOBEC3G, TRIM5a, and CUL5 are postentry cellular viral cofactors. KIR and HLA-C are in the HLA system. DC-SIGN and CCL11 are related to chemokine receptors. IL4 is related to cytokines. As shown in Table 4, the canonical variable P7 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that highly correlated with P7 are enriched in pyrimidine metabolism and methane metabolism. These genes include uridine-cytidine kinase 1-like 1 (54963), polymerase (RNA) II (DNA directed) polypeptide F (5435), polymerase (RNA) II (DNA directed) polypeptide A (5430), alcohol dehydrogenase 5 (class III) (128), and methylenetetrahydrofolate reductase (4524).

Pair 8 (C8, P8). As shown in Table 3, the greatest positive correlation was observed between C8 and PPIA, whereas the greatest negative correlation was observed between C8 and TSG101. Among the ARGs that highly correlated with C8, TSG101, APOBEC3G, TRIM5a, and PPIA are postentry cellular viral cofactors. KIR, HLA-B, and HLA-C are in the HLA system. DC-SIGN and SDF1 are related to chemokine receptors. IL4 and IRF1 are related to cytokines. As shown in Table 4, the canonical variable P8 accounts for the variability in the original metabolic pathway gene expression data. The metabolic pathway genes that highly correlated with P8 are not enriched in any metabolic pathway.
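Whether a set of top-ranked genes is enriched for a pathway is usually judged with an over-representation test. The sketch below computes a plain hypergeometric tail probability in Python; it is an illustration of the idea only, because DAVID reports a modified Fisher exact score (the EASE score), so its P values will differ somewhat. The function name and the counts in the example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(background, in_pathway, selected, hits):
    """Probability of at least `hits` pathway genes among `selected` genes.

    background: number of genes in the reference set
    in_pathway: number of background genes annotated to the pathway
    selected:   number of genes in the selected list (e.g., the top 50)
    hits:       number of selected genes annotated to the pathway
    """
    upper = min(in_pathway, selected)
    total = comb(background, selected)
    tail = sum(
        comb(in_pathway, i) * comb(background - in_pathway, selected - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 background genes, 100 of them in one pathway,
# 50 selected genes, 12 of which fall in that pathway.
p_value = hypergeom_pvalue(1000, 100, 50, 12)
print(p_value, p_value < 0.01)
```

With about five pathway genes expected by chance among 50 selected genes, observing 12 falls below the 0.01 cutoff used for significance in this study.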

Discussion
Researchers have used numerous approaches to identify host genes related to AIDS [5][6][7][8][9][10][11][12][13]. Most studies use genomic information alone and do not integrate the genome and transcriptome. However, most SNPs at ARGs impact AIDS by changing host gene transcription [7][8][9][10]. This study focuses on ARG cooperation at the transcription level and extends the analysis to correlations between ARGs and metabolic pathway genes to discover novel host genes related to AIDS.
For each variable in the canonical correlation analysis, HIV-1 postentry cellular viral cofactors highly cooperated at the transcription level. PPIA, TSG101, TRIM5a, APOBEC3G, and CUL5 frequently appeared together among the genes correlated with the canonical variables. PPIA encodes a member of the peptidyl-prolyl cis-trans isomerase (PPIase) family and functions in cyclosporin A-mediated immunosuppression [24]. Formation of HIV virions requires an interaction between PPIA and HIV viral proteins. TSG101 negatively regulates cell growth and differentiation by producing a protein that interacts with stathmin [25]. TRIM5a is an E3 ubiquitin ligase, and its ubiquitination function is involved in retroviral restriction [26]. These genes encode HIV-1 postentry cellular viral cofactors involved in different biological processes. Thus, the high correlation between these genes and the canonical variables suggests that these genes are coordinated at the transcriptional level. These data suggest that a potential transcriptional regulator of these genes may be a key host factor related to AIDS.
The high-frequency ARGs that correlated with the canonical variables include PPIA, TSG101, CUL5, NCOR2, IDH1, and MYH9. PPIA, TSG101, and CUL5 are discussed above. NCOR2 is a nuclear receptor corepressor that acts together with histone deacetylases [27]. IDH1 encodes an isocitrate dehydrogenase involved in cytoplasmic NADPH production and pyruvate metabolism [28]. MYH9 is a conventional nonmuscle myosin that aids in maintaining cell shape, cell motility, and cytokinesis [29]. These ARGs are not enriched in any particular biological process. However, many host genetic factors have not been studied.
The low-frequency ARGs that correlated with the canonical variables include DEFB1 with C4, KIR with C7, HLA-A with C5, CCL11 with C7, LY6D with C4, APOBEC3B with C5, and CXCR6 with C6. DEFB1 is a defensin and is implicated in cystic fibrosis pathogenesis [30]. HLA-A is a major histocompatibility complex class I heavy chain paralogue; these paralogues are expressed in nearly all cells [31]. CCL11 is chemokine (C-C motif) ligand 11 and is implicated in immunoregulatory and inflammatory processes [32]. CXCR6 is chemokine (C-X-C motif) receptor 6 [33]. LY6D is a member of the lymphocyte antigen 6 complex [34]. APOBEC3B is a member of the cytidine deaminase gene family. Recent studies have revealed that members of this family may be RNA-editing enzymes that control the cell cycle [35]. Further, each of these genes correlated with only one canonical variable, which suggests that the specificity of the correlation may determine which metabolic pathway a canonical variable is associated with.
The most significant metabolic pathway in our analysis is purine metabolism, which correlated with two canonical variables and had the lowest P values. Recent studies analyzed purine codon patterns in variable and constant regions of HIV-1 and showed that HIV-1 RNA exhibits extreme enrichment in the purine A compared with most organisms [36]. These data suggest that a potential therapeutic agent against HIV-1 may involve novel purine derivatives [37]. Studies have elucidated twenty-four purine derivatives that act as HIV-1 Tat-TAR interaction inhibitors [38]. More recently, research revealed that host cells with a modified purine biosynthesis pathway exhibit increased tenofovir activity against sensitive and drug-resistant HIV-1 [39]. In this study, we show a high correlation between ARG and purine metabolism gene expression. These data imply that purine metabolism genes are significant candidates for studying the host genomic or transcriptomic influence on AIDS.

Conclusions
In this study, we used a CCA to analyze the correlations between ARG and metabolic pathway gene expression. The results show that HIV-1 postentry cellular viral cofactors are highly coexpressed, which suggests that the regulation of this group of host genes may be a key factor in understanding the AIDS-host interaction mechanism. Furthermore, we show that purine metabolism pathway genes are coordinated with ARGs; this novel finding supports future studies on AIDS therapy using purine derivatives. The coexpressed ARGs and metabolic pathway genes may also serve as new markers for AIDS diagnosis.