Gene Ontology and KEGG Enrichment Analyses of Genes Related to Age-Related Macular Degeneration

Identifying disease genes is one of the most important topics in biomedicine and may facilitate studies on the mechanisms underlying disease. Age-related macular degeneration (AMD) is a serious eye disease; it typically affects older adults and results in a loss of vision due to retina damage. In this study, we attempt to develop an effective method for distinguishing AMD-related genes. Gene ontology and KEGG enrichment analyses of known AMD-related genes were performed, and a classification system was established. In detail, each gene was encoded into a vector by extracting enrichment scores of the gene set, including it and its direct neighbors in STRING, and gene ontology terms or KEGG pathways. Then certain feature-selection methods, including minimum redundancy maximum relevance and incremental feature selection, were adopted to extract key features for the classification system. As a result, 720 GO terms and 11 KEGG pathways were deemed the most important factors for predicting AMD-related genes.


Introduction
Age-related macular degeneration (AMD or ARMD) is a chronic, progressive eye disorder that primarily occurs in elders (>50 years) and has become a major cause of blindness and visual impairment in developed countries as well as the third major cause globally [1,2]. In an Asian population aged 40-79 years, the morbidities of early and late AMD were 6.8% and 0.56%, respectively [3]. Further, AMD is likely to increase with a longer life expectancy. Due to retina damage, AMD typically results in vision loss, which can render daily activities difficult, such as reading, watching TV, and recognizing faces [4]. There are two typical types of AMD: dry AMD and wet AMD. Dry AMD is the major type of AMD and accounts for approximately 80% of cases; no efficient surgical or medical treatments are available. It typically causes mild vision loss, which develops slowly. However, it can cause vision loss through retinal pigment epithelial layer atrophy, which results in photoreceptor loss (rods and cones) in the central portion of the eye. Wet AMD is caused by choroidal neovascularization (CNV), wherein new blood vessels grow in choriocapillaries through the Bruch's membrane. Leaking and bleeding of these vessels can damage the rods and cones, which lead to rapidly deteriorating vision. Thus, wet AMD accounts for 90% of AMD cases with severe visual impairment.
The AMD etiology is complex. AMD results from both genetic and environmental factors; however, the underlying

Materials and Methods
2.1. Dataset. The known AMD-related genes were retrieved from the Retina International website (http://www.retinainternational.org/files/sci-news/remacdy.htm, recent update from March 24, 2010) and the literature. Specifically, 16 genes are from Retina International; three genes for the complement system proteins factor H (CFH), factor 3 (C3), and factor B (CFB), which are strongly related with a person's risk for developing AMD, are employed; HTRA1 is from [26,27]; ABCR is from [28]; 2 genes are from [29,30]; and 23 genes are from [18]. Finally, 39 known AMDrelated genes were collected; these genes are referred to as "positive genes" and compose the gene set . To analyze the differences between the positive genes and other genes, we randomly selected 1,950 genes (50 times the number of positive genes) from Ensemble that were not in ; these 1,950 genes are referred to as "negative genes" and compose the set . The Ensemble IDs for the positive and negative genes are in Supplementary Material I available online at http://dx.doi.org/10.1155/2014/450386.
The negative genes outnumbered the positive genes; thus, we confronted an imbalanced dataset. Encouraged by certain studies that have managed this type of data [31,32], the following strategy was adopted. The negative genes were equally and randomly split into 10 portions 1 , 2 , . . . , 10 (i.e., = 1 ∪ 2 ∪ ⋅ ⋅ ⋅ ∪ 10 and ∩ = for ̸ = ). For each , we combined the genes in and to comprise the th datasets (i.e., = ∪ ).

Feature Construction.
To analyze the differences between the positive and negative genes, each gene must be represented by certain features that can then be processed by certain computer programs. Here, we adopted gene ontology (GO) and KEGG enrichment to compute numerical values that represent each gene. GO enrichment indicates the relationship between genes and GO terms. For each gene and each GO term GO , a score is generated, which is typically referred to as the gene ontology enrichment score and defined as the −log 10 of the hypergeometric test value [33][34][35] for a gene set consisting of 's direct neighbors in STRING and the GO term GO j that can be computed as follows: where denotes the overall number of proteins in humans, denotes the number of proteins annotated in the gene ontology term GO , denotes the number of proteins in , and denotes the number of proteins in that are annotated in the gene ontology term GO . If the score is large for one gene and one GO term, the gene and GO term likely have a strong relationship; there were 12,877 gene ontology enrichment scores.
Similarly, for each gene and each KEGG pathway , the KEGG enrichment score is defined as the −log 10 of the hypergeometric test value [35,36] for a gene set that consists of 's direct neighbors in STRING and the KEGG pathway , which can be calculated as follows: where denotes the overall number of proteins in humans, denotes the number of proteins annotated in the KEGG pathway , denotes the number of proteins in , and denotes the number of proteins in that are annotated in the KEGG pathway . Additionally, a higher KEGG enrichment score between and indicates a stronger relationship; 239 features were KEGG enrichment scores.
Accordingly, each gene can be represented by 12,877 gene ontology enrichment scores and 239 KEGG enrichment scores, which can be formulated as follows:

Prediction Method and Accuracy Measurement.
Weka [37] is a collection of many state-of-the-art machine-learning algorithms and has been used to solve various biological problems [38][39][40][41][42]. One classifier, which is referred to as SMO, was adopted herein as the classification method; it implements John Platt's sequential minimal optimization algorithm to solve the optimization problem that should be settled during training of a support vector classifier. The kernel function can be polynomial or Gaussian [43,44]. The predicted results for a two-class classification problem can be represented by a confusion matrix consisting of four entries: a true positive (TP), a true negative (TN), false positives (FP), and a false negative (FN) [45,46]. Accordingly, the prediction accuracy (ACC), specificity (SP), and sensitivity (SN) can be computed as follows: However, in each dataset , the number of negative genes was 5 times as many as the number of positive genes, which is still imbalanced. Thus, an additional measurement, Matthews's correlation coefficient (MCC) [47], was employed to solve the problem; the coefficient can be computed as follows: 2.4. 10-Fold Cross Validation. Ten-fold cross validation is often used to examine the performance of various classification models [48]. In 10-fold cross validation, the dataset is equally and randomly divided into ten portions. Each portion is used as testing data, and the samples in the remaining nine portions compose the training dataset. Each sample is tested once because each portion is tested once. Compared with the Jackknife test [49,50], a 10-fold cross-validation test is more efficient and provides similar results for a given dataset. Thus, it was adopted herein to examine the classification model.

Feature Selection.
As described in Section 2.2, each gene is represented by 12,877 + 239 = 13,116 enrichment scores. To analyze these features and extract key features that contribute the most to the positive and negative gene classification, certain feature-selection methods were employed. This procedure included two stages: (1) using Cramer's coefficient [51,52] to exclude nonsignificant features and (2) using the minimum redundancy maximum relevance (mRMR) method as well as incremental feature selection (IFS) [53] for additional selection.
Cramer's coefficient [51,52] is a statistical measure of two variables that was derived from the Pearson Chi-square test [54]; it ranges from 0 to 1. A high Cramer's coefficient for two variables indicates a strong association. Here, for each feature and samples' class labels, Cramer's coefficient was calculated, and features with a Cramer's coefficient lower than 0.1 were excluded.
The remaining features were further refined using the minimum redundancy maximum relevance (mRMR) method and incremental feature selection (IFS), which are feature selection methods that have been widely used in recent years [34,[55][56][57][58]. By evaluating a classification model, key features can be extracted from a complicated biological system. The mRMR method has two criteria: max-relevance and min-redundancy. Accordingly, two feature lists can be generated using this method: (1) the MaxRel feature list and (2) the mRMR feature list. Specifically, the former list sorts features according to their contributions to the classification (i.e., only considering the criterion of maxrelevance), while the latter list sorts features by considering both the max-relevance and min-redundancy criteria. The MaxRel and mRMR features lists were formulated as follows: MaxRel features list : where denotes the total number of features. A detailed description of the mRMR method can be found in Peng et al.'s paper [53].
Only the mRMR features list was used to extract key features. The extraction procedure is described as follows.
(2) The classifier SMO was evaluated through 10-fold cross validation using features in . As described in Section 2.3, ACC, SP, SN and MCC can be obtained.
(3) The feature set with the maximum MCC is deemed the optimal feature set. For ease in observation, an IFS-curve can be plotted with MCC values as theaxis and the superscript of as the -axis.  Material III. For clarity, we plotted an IFS-curve for each dataset , which is referred to as IFS-curve-. The five IFScurves for 1 , 2 , 3 , 4 , and 5 are shown in Figure 1(a), while the other five IFS-curves for 6 , 7 , 8 , 9 , and 10 are shown in Figure 1(b); the ten IFS-curves that are plotted in separate coordinates are available in Supplementary Material IV. Generating the maximum MCC for each dataset from Supplementary Material III and IV (listed in column 3 of Table 2) was a straightforward process. Clearly, most MCCs are in the range 0.7 to 0.8, and the mean value was 0.76139. As mentioned in Section 2.5, the features used to obtain the maximum MCC compose the optimal feature set. The number of features in the optimal feature set for each dataset is listed in column 2 of Table 2. The results for dataset 1 are described as follows. The maximum MCC for the dataset 1 is 0.712699 (listed in row 2 and column 3 of Table 2) using the first 344 (listed in row 2 and column 2 of Table 2) features in the mRMR features list of dataset 1 (see Supplementary Material II).

Analysis of the Optimal Feature Set.
As mentioned in Section 3.2, we generated an optimal feature set for each dataset, thereby obtaining 10 optimal feature sets. We combined these optimal feature sets to compose the final optimal feature set, which includes 720 GO terms and 11 KEGG pathways that are available in Supplementary Material V. To discern the distribution of these 731 optimal features, we counted the number of optimal feature sets containing each of 731 features. Figure 2 shows the number of features against the number of optimal feature sets, from which we can see that 400 features were exactly contained in one optimal feature set, 131 features were exactly contained in two optimal feature sets, while others were contained in at least three optimal feature sets. Accordingly, 45.28% (331/731) features were contained in at least two optimal feature sets, indicating that different datasets may induce some common features. It also suggested that some important features for distinguishing AMD-related genes were contained in the final optimal feature set. In   the following sections, features in the final optimal feature set were discussed.

GO Number and Percentage.
It is known that GO terms can be divided into the following three types: (1) biological process (BP) GO term, (2) cellular component (CC) GO term, and (3) molecular function (MF) GO term. To efficiently discern the biological meanings and characterize the functional essentiality of the GO terms in the final optimal feature set, we considered the children terms of the aforementioned three types. For clarity, let be the 720 GO terms in the final optimal feature set and be the children terms of any children term of BP GO term, CC GO term, or MF GO term. To display the distribution of the GO terms in , we calculated the frequency and percentage for each children term of BP GO term, CC GO term, or MF GO term which were defined as | ∩ | and | ∩ |/| |, respectively.     and may lead to AMD due to aberrant behavior in relevant cells. "Response to stimulus" refers to any process that results from a stimulus, which leads to a change in a state or activity, such as movement and secretion.   For the BP term percentages, as shown in Figure 4, the top five biological process terms are (I) GO: 0001906: cell killing (7.25%); (II) GO: 0040011: locomotion (4.00%); (III) GO: 0002376: immune system process (3.99%); (IV) GO: 0022610: biological adhesion (3.88%), and (V) GO: 0048518: positive regulation of a biological process (2.72%).
Biological adhesion between substrate and cells modulates several critical cellular processes, such as cell locomotion and gene expression [59]. Biological adhesionand locomotion-related gene dysfunction may result in AMD. Previous research has shown that the immune system, particularly the complement system, is relevant to AMD. Genetic studies also indicate that several complement-related genes, including CFH, complement component 2, complement component 3, CFHR1, and CFHR3, are highly associated with AMD [60]. Further, complement can enhance the generation of VEGF (vascular endothelial growth factor), which may strongly facilitate AMD development [61]. Histological studies show the presence of macrophages, lymphocytes, mast cells, and fibroblasts in both atrophic lesions and with retinal neovascularization [61].
(2) CC GO Terms. In Figure 5 From the distribution of CC terms, except for the cell term (GO: 0005623), the top four CC terms are associated with the extracellular matrix. Moreover, the extracellular region is relevant to cell adhesion and locomotion, which were mentioned in the biological process GO terms.
The results are also consistent with a recent GWAS study, which identified several new loci with enrichment for genes involved in the extracellular matrix and other activities [18]. Structural damage of extracellular matrix in retinal cells may lead to break point of AMD [62]. Matrix metalloproteinases result in extracellular matrix degradation and are highly related to AMD pathogenesis [63]. Therefore, taken together, these facts suggest that the extracellular matrix plays an important role in AMD.
MF terms related to catalytic activity and binding were highlighted partly due to the large base numbers of these terms. However, this finding may suggest that genes assigned to these two terms are essential to maintain normal function. For example, matrix metalloproteinases, which can degrade extracellular matrix proteins, play an important role in AMD [63]. In addition, highlighting receptor activity and molecular transducer activity indicates that abnormal cellular signal pathway behaviors are involved in AMD patients. For example, the Aryl hydrocarbon receptor, which is responsible for clearing cellular debris and for toxin metabolism, is essential to maintaining normal function in RPE cells, and deficiency of this receptor causes AMD in mice [64].
To our surprise, receptor activity was highlighted in both the frequency and percentage of molecular function terms, which is further evidence of the important role that receptor activity plays in AMD. Antioxidant activity is also highlighted, and oxidative stress [6] is a risk factor correlated with AMD. Channel regulator activity and structural molecule activity may also be involved in AMD.
Valine, leucine, and isoleucine biosynthesis (hsa00290) and selenocompound metabolism (hsa00450) are related to amino acid metabolism. Mucin-type O-glycan biosynthesis is associated with modifications of serine or threonine residues of certain proteins. RNA transport from nucleus to cytoplasm is also essential for gene expression. These terms may not be the key factors in AMD, but they may give us suggestions about the AMD development. Phagosome (hsa04145) is also associated with AMD. There are various forms of cell death and phagocytosis in the retina [65]. But failure of retinal pigment epithelial cells and macrophages to phagocytize dying retinal pigment epithelial cells may result in drusen formation and development of AMD [66]. The underlying mechanism of AMD is still unclear, but many studies have highlighted the essential role of the immune system in the development and progression of AMD [67]. Previous studies have revealed a strong association between complement pathway and AMD [20]. Several complement genes including complement 2 (C2) and complement 3 (C3) have been strongly associated with AMD [12,68]. Except vasopressin-regulated water reabsorption, viral myocarditis (hsa05146) and Staphylococcus aureus infection (hsa05150) are all correlated with immunity, which further emphasizes the effect of immunity in AMD.

Conclusions
In this study, we performed GO and KEGG enrichment analyses of AMD-related genes. The results suggest that 720 GO terms and 11 KEGG pathways are important factors that contribute to identifying AMD-related genes.