Combination of Genetic Markers and Age Effectively Facilitates the Identification of People with High Risk of Preeclampsia in the Han Chinese Population

Objective This study aimed to analyze the possible association between known genetic risks and preeclampsia in a Han Chinese population. Methods A total of 156 patients with preeclampsia and 286 healthy Han Chinese women were enrolled and genotyped for 27 genetic alleles associated with preeclampsia in different populations. The association between the genotypes of the individual alleles and preeclampsia and the possible interaction among the alleles were analyzed. Finally logistic models were trained with the genotypes of possible alleles contributing to preeclampsia. Results Seven alleles were significantly or marginally significantly associated with preeclampsia, which involved six genes (rs4762 in AGT, rs1800896 in IL-10, rs1800629 and rs1799724 in TNFα, rs2070744 in NOS3, rs7412 in APOE, and rs2549782 in ERAP2). A multilocus interaction analysis further disclosed an interaction among seven alleles. A logistic model showing individual or synergetic contribution to preeclampsia could reach ~0.67 preeclampsia prediction accuracy in the Han Chinese population, while integration of age information could improve the performance to ~0.75 accuracy using a fivefold training-testing evaluation strategy. Conclusions The genetic factors were closely associated with preeclampsia in the Han Chinese population despite large ethnicity heterogeneity. The genotypes of different alleles also had synergetic interactions.

The early risk prediction could improve the morbidity, mortality, and clinical outcomes of patients with PE [15][16][17]. The environmental factors, disease history, and concurrent clinical manifestations have been identified as risk factors and used for the clinical guidance of earlier detection of PE [12,13] despite the critical arguments on the weak specificity and sensitivity [18]. Recently, a couple of genetic factors have been demonstrated to have an association with PE onsets independently or interactively with each other or with environmental factors [9-11, 19, 20]. Inclusion of the high-risk or low-risk genetic factors could improve the prediction sensitivity and specificity of PE, facilitating earlier antenatal screening and intervention of population with higher PE susceptibility [21]. However, studies also disclosed the large variance in genetic components and corresponding PE risk odds among ethnic populations [20,22,23]. Therefore, population-specific identification of genetic-risk factors appears necessary and significant for more effective prediction of population with a high or low PE susceptibility.
A previous study explored the genotypes of a Chinese population and showed the different PE risk profile of the putative risk alleles between Chinese and other ethnic groups [23]. Both the sample size and the putative risk allele species were enlarged in this study to generate a more robust and broader PE genetic-risk profile in Chinese women. The confounding factors, such as age, were also investigated. Furthermore, logistic regression models were built to effectively distinguish Chinese people with high or low risks of PE based on the genetic and clinical features.

Participants.
A total of 442 unrelated women with first pregnancies (156 with PE and 286 controls) were enrolled at the High-Risk Pregnancy Outpatient Service in Maternal and Children's Hospital of Shenzhen City between June 2014 and May 2015. Diagnosis of PE was defined as new-onset hypertension, combined with proteinuria. Hypertension was defined as systolic blood pressure (BP) ≥140 mm Hg or diastolic BP ≥90 mm Hg. Proteinuria was defined as urinary protein excretion ≥300 mg/24 h or a positive urine dipstick result of at least 1+ without urinary infection [24]. Women in the control group had neither PE in the current pregnancy nor history of previous pregnancies with PE. Participants with multiple gestation, any malformation, any form of hypertension, diabetes mellitus, chronic infectious diseases, autoimmune diseases, thyroid disease, chronic renal disease, rheumatoid arthritis, or systemic lupus erythematosus were excluded from the study. This study was approved by the institutional review board of the Maternal and Children's Hospital of Shenzhen City. Written informed consent was obtained from all participants. All included participants were of Han Chinese origin and lived in the same region at the time of the study.

Single-Nucleotide Polymorphism Selection and Genotyping.
In this study, 27 polymorphisms in 19 genes were selected [9,11], which showed a significant association with PE in populations with different ethnic background and function in coagulation and fibrinolysis, renin-angiotensin system, oxidative stress, inflammation, or lipid metabolism (Table 1).
Genomic DNA was extracted from peripheral blood samples. Except VNTR (Variable Number of Tandem Repeats) in eNos Intron 4 and ACE (angiotensin converting enzyme, ACE) rs4646994 for which genotyping was performed according to the protocols described in previous studies [25,26] and in Supplementary Materials (available here), the remaining 25 polymorphisms were determined using a multiplex polymerase chain reaction (PCR) reaction and SNaPshot method. Briefly, single-nucleotide polymorphisms (SNPs) were amplified using the KAPA HotStartTaqDNA polymerase (KAPA Biosystems Inc., MA, USA) and further analyzed with single-base extension (SBE) reactions using the SNaPshot Multiplex kit (Applied Biosystems, CA, USA). PCR and SBE primers are listed in Supplementary Materials. The products were purified and analyzed using an ABI3730XL (Applied Biosystems). Twenty individuals were selected randomly for bidirectional sequencing to confirm the accuracy of the genotyping. No genotyping error was observed. The general success rate of genotyping reached 99.7%.

Statistics and Interaction Analysis of Multiple Genetic
Features. The chi-square and EBT (exact binomial test) tests were performed in R (https://www.r-project.org/) for rate comparisons as indicated in the context [27]. Multifactor dimensionality reduction (MDR) was used to analyze the interaction among genetic factors [28]. Significance levels were defined as follows: P < 0.05, significant; 0.05 ≤ P < 0.1, marginally significant.

Logistic Regression Modeling of PE Risk with Genetic
Markers and Other Features. Genetic markers were reencoded with "1" and "0" for the genotypes with recessive models of major alleles. Each participant in positive (PE) or negative (control) group was represented by a vector of binary digits indicating the genotype composition of the series of SNP markers. The genetic feature matrix of all the participants was imported into R, and logistic regressions were performed with the glm function. For the models with both genetic and age features, feature representation for genotype composition was the same as described earlier, while (pregnancy) age was encoded by a 1-bit binary digit: "1" if ">32 years" or "<21 years" and "0" if "≥21 years and ≤32 years." Each participant in positive (PE) or negative (control) group was represented by a vector of binary digits indicating the genotype composition of both the series of SNP markers and age stratification. When indicated, age could be encoded with the scheme of "1" representing ">40 years" or "<18 years" and "0" representing "≤40 years and ≥18 years." Software tools were developed with GO programming language to implement the logistic regression models predicting PE risks (http://www.szu-bioinf.org/PERPer Go). As described in the document, the genotypes of targeted SNP loci and/or age were used as input, and the output was the probability of PE. Presently, the software is implemented in Linux or Mac operating system.

Performance Assessment of Computational Models on PE Risk Prediction.
A fivefold training-testing evaluation strategy was adopted, which was proposed in a previous study to make a fair evaluation on the performance of the computational models with limited samples and genetic data [27]. Briefly, both the PE and control groups were randomly divided into five subgroups. Four of the subgroups were combined for each group and together comprised the training dataset, and the remaining subgroups for patients with PE and controls were combined as the testing dataset. In this way, the original dataset was split into two independent parts, The minor allele at the locus without polymorphism is represented with '-'; not significant allele or genotype composition difference between patients with PE and controls is indicated with "not sig." (chi-square test, P ≥ 0.1). For significant (P < 0.05) or marginally significant (P ≥ 0.05 and < 0.1) ones, the PEcontributing minor alleles are shown.
with the training dataset to optimize model parameters and the testing dataset to assess the performance of the model. The splitting, model training, and performance evaluation were repeated five times, and the average performance was calculated over the repeats. Sensitivity (Sn), specificity (Sp), accuracy (Acc), receiver operating characteristic (ROC) curve, and the area under the ROC curve (AUC) were used to assess the performance of models. In the following formulas, Sn (true positive rate) and Sp (true negative rate), respectively, represented the percentage of positive instances (PE) and the percentage of negative instances (control) correctly predicted. Acc denoted the percentage of both PE and control instances correctly

Different Age Distribution between PE and Control Population.
The clinical factors that could possibly contribute to PE risk (e.g., history of PE, smoking, multiple gestation, and concurrent diseases) were controlled strictly during participant recruitment. The only two exceptions included age and BMI. For age, a previous stratification criterion (≤ or >40 years) was adopted, with no difference between groups in the age composition [23,29]. For all the PE cases recruited in the study, the disease happened not before 34 weeks after pregnancy.
The percentages of patients with PE and controls were plotted versus age to further observe the possible relationship between age and PE in Chinese women ( Figure 1).
Interestingly, the age of controls showed a clock-shape distribution with a mean of 28-29 years; however, the PE group showed a strikingly different distribution, with an apparent percentage increase on both ends, >32 years and <21 years, and a decrease between 21 and 32 years compared with CT PE ≤20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 (Figure 1; chi-square test, P = 2.34E-12; EBT test, P = 4.54E-05). This suggested that age also influenced PE in Chinese population, and the risk stratification levels for age should be adjusted [29].  (Figure 2). The difference between groups was significant, with the higher BMI in PE patients (P = 3.56e-9, Mann-Whiney U test).

Ethnicity-Specific PE-Associated Genetic Risks in the Han
Chinese Population. The 27 genetic loci with known polymorphisms were genotyped in patients with PE and healthy pregnant women of Han Chinese ethnicity (Table 1). Different from the previous reports on other populations, seven loci showed homozygous allele composition without any polymorphism detected in the Han Chinese patients with PE or controls (rs5742620, rs1799963, rs1799889, rs268, rs4986790, rs4986791, and rs1800590; Table 1). All the remaining 20 loci showed polymorphisms, and the genotype composition followed the Hardy-Weinberg equilibrium (chi-square test, P > 0.05). However, 65% (13/20) did not show different allele or genotype composition between PE and control groups (chisquare test, P > 0.1, for both allele and genotype comparison; Table 1), further reflecting the ethnic specificity of genetic risks on PE.
Only five loci (rs2070744, rs1800896, rs1800629, rs1799724, and rs4762) showed a significant difference in genotype composition between patients with PE and controls (chi-square test, P < 0.05) and additional two (rs2549782 and rs7412) showed marginal significance (P < 0.1 and P ≥ 0.05) (Tables 1 and 2). For these loci, the minor alleles were all consistent with previous reports on different populations ( Table 1). The detailed genotype composition and the PE risk odds ratio were calculated for each significant or marginally significant locus based on a recessive model of major alleles ( Table 2). In the Han Chinese population, the genotypes composed of heterozygous or homozygous minor alleles contributed a significantly higher risk to PE in rs2070744 (TC/CC), rs1800896 (AG/GG), and rs1799724 (CT/TT); a marginally higher risk to PE in rs2549782 (TG/GG) and rs7412 (CT/TT); and a protective effect to rs1800629 (GA/AA) and rs4762 (CT/TT) ( Table 2).
The genetic polymorphisms appeared to contribute to PE independent of the age factor because the association analysis after the stratification of PE and control groups according to the age distribution shown in Figure 1 yielded similar results (data not shown).

Complex Interactions among Genetic Polymorphisms
Associated with PE. The MDR interaction analysis was performed to observe possible interactions among the 20 polymorphic genotypes showing the contribution to PE. Among different combinations, the best model showed a significant and stable interaction among seven polymorphic loci from six genes ( Table 3).
Among these loci, rs1800896, rs1799724, rs2070744, rs4762, and rs7412 showed significant contributions in the model to PE risks (Table 4). Consequently, another logistic model was also trained only with the genotype features of the five significant contributing SNPs (Supplementary Table S1).
A training-testing strategy was adopted to make a fair assessment on the predictive performance of the logistic models based on genetic features in the Han Chinese  1 Only the best interaction models with no larger than eight features are shown. The significant and best model is shown in bold.
The performance was apparently better than that of the neutral random model, which showed an AUC of 0.500 (Figure 4, neutral). The models based on five significant contributing SNPs showed slightly lower performance, with average AUC, accuracy, specificity, and sensitivity of 0.603, 0.658, 0.852, and 0.303, respectively (Table 5; Figure 4, SNP5).

Combination of Genotype and Age Information Could Improve the Prediction Power of PE Risks in the Han Chinese
Population. At present, age is one of the most important factors for a clinical doctor to evaluate the general risk of PE. Therefore, the genetics-based models were further compared with the simplest age model for predictive power in identifying PEs.
The levels of risk stratification by age are different among different countries or areas, for example, ">32 years" being considered as a high-risk factor in some countries but ">40 years" in others such as China [12,29]. This study showed the high PE risks of ">32 years" and "<21 years" in the Han Chinese population (Figure 1). Therefore, the age model followed the new age stratification schemes (refer to Materials and Methods). As a result, the pure age model could reach an AUC, accuracy, specificity, and sensitivity of 0.598, 0.633, 0.677, and 0.591, respectively, which were quite close to or slightly worse than the performance of the models based on pure genetic features (SNP5 or SNP8) ( Table 5). The age model was also evaluated based on previous stratification levels, that is, "<18 years," "18-40 years," and ">40 years"; however, it worked far worse.
The genetic features were further combined with age stratification, and a new logistic model was trained. In the new model, the "age" contributed most significantly while the contribution of different genetic markers varied largely from pure genetic models, indicating the significance of age and the complex interactions between age and genetics in PE (Supplementary Table S2). The training-testing evaluation further demonstrated the strikingly improved prediction power of the model based on both genetic markers and age information (Table 5; Figure 4, SNP age). The average AUC,    optimized accuracy, specificity, and sensitivity reached 0.687, 0.749, 0.856, and 0.504, respectively (Table 5).
Other machine learning techniques were also adopted to build prediction models based on the eight or five SNPs, for example, support vector machine with different kernel types. However, based on the training-testing evaluation results, none of them outperformed the logistic models (data not shown).
Besides the age and genetic features, the other clinical feature, BMI, was also evaluated for the performance as a PE predictor. It could predict PE with an average accuracy of 0.69, not as good as the SNP8Age model. Due to the missing of height information for a substantial portion of the PE cases (∼15%), BMI was not integrated into the combined model as an individual feature in this research.

Discussion
Accumulating evidence supported the association between the occurrence and progression of PE and genetics [9,11]. On the contrary, the ethnicity heterogeneity for the association was observed repeatedly [20,22,23]. This study further observed the ethnicity heterogeneity in the Han Chinese population for the genetic risks of PE. All the investigated 27 alleles were reported with polymorphisms previously, which were associated with PE in different populations. However, seven of them showed homogeneity without polymorphism among all the 442 Han Chinese participants (Table 1). For the other 20 alleles, 13 did not show any association between PE and allele composition or genotype (Table 1). Even for the remaining alleles with significant or marginally significant risk for PE in the Han Chinese population, the ethnicity difference was still observed. For example, rs4762 was previously shown with a polymorphism in a Korean cohort, but different genotypes did not show a significant bias between patients with PE and controls [30]. For rs1800896, an active debate continues on the association and the risk PE allele composition (A or G) in different populations [31][32][33][34][35]. No association was frequently observed in multiple populations for rs1800629 [36,37], while the association and risk PE genotype (AA or TT) remain contradictory for rs1799724 [38,39]. Despite the ethnicity difference disclosed in this study, the project is still ongoing with an enlarged sample size. Moreover, the possibility could not be excluded that some of the alleles shown without polymorphism or no association with PE could show association with PE with the increase in size.
Seven alleles were found in the Han Chinese population significantly or marginally significantly associated with PE (Tables 1 and 2). The genes involved included AGT, IL-10, TNF-alpha, NOS3, APOE (marginally), and ERAP2 (marginally) ( Table 2). The AGT gene product is an important component of the rein-angiotensin system (RAS), a key regulatory system of blood pressure, which could be closely related to PE occurrence [40]. The PE risk allele in rs4762 of AGT gene in the Han Chinese population could cause amino acid change (T174 M). IL-10 and TNFalpha are inflammation-related genes that participate in anti-inflammatory responses observed in PE [41]. In this research, one allele in IL-10 promoter (rs1800896) and two alleles in TNF-alpha promoter (rs1800629 and rs1799724) were found to be associated with PE. The allele "G" and genotypes "GG"/"AG" were associated with a high risk for PE in rs1800896. The composition of allele "A" and genotypes "AA"/"AG" significantly decreased while "T" and genotypes "TT"/"CT" significantly increased in PE for rs1800629. The other genes were also reported to be associated with the occurrence or progression of PE. However, the underlying molecular mechanisms remain unclear and require further investigation. Besides the individual contribution to PE, the alleles (or genes) also showed a complex interaction ( Figure 2). More samples are needed to definitively confirm the synergetic action among the alleles before the experimental exploration of the molecular mechanism. It should also be noted that all the PEs in the study occurred in the later stage of pregnancy (>34 weeks), and therefore the maternal factors likely had more significant roles. However, it would still be interesting to further examine the genotypes of the paternal side and their possible contribution.
This study developed regression models based on the association between genotype features and PE in the Chinese population, which could predict the risk of PE. Notably, the models simply based on genetic features identified in the present study only showed limited prediction power. However, the performance was still close to or slightly better than the age-based prediction (Table 5). Therefore, genetic features could be considered as important factors as age or other clinical factors when women were screened for PE risk because strategies were urgently desired but still lacked PE screening at an earlier time before or during pregnancy [12,29]. The genetic features were also combined with age in a new model, which exhibited much better PE prediction power. Therefore, both the genetic features and age should contribute to PE in different ways. The contributions of individual genetic loci in the combined model appeared different from those in the sole genetic models, also suggesting complicated interactions between the genetic features and age in PE (Table 4; Table  S2). Also, a software tool, PERPer Go, was developed, which provided an easy way to implement the genetic (SNP8 or SNP5) and combined (SNP8age) models and predict the PE risk of participants (http://www.szu-bioinf.org/PERPer Go). This probably is the first application of PE prediction with combined genetic and age features in specific populations. In practice, a high specificity (e.g., 95%) should be controlled, leading to a relatively low PE recalling rate (38.2% for SNP8 age at 95% specificity). Efforts continue to improve the prediction performance of the models in different ways. An enlarged cohort of patients with PE and controls has been recruited and sampled for de novo risk genotype detection with whole-genome SNP arrays. Factors other than genetics or age have also been combined for consideration. For example, the association of multiple fetuses and fetus number with PE was observed, revealing that the model with integrated features of genetics, age, and fetus could reach ∼44% sensitivity when the specificity was controlled at 95%. Clinical factors, such as PE history and concurrent disorders during pregnancy, are also important and can improve the PE prediction power strikingly. In the current study, we also noticed that PE patients showed significant higher BMI than control. Maternal overweight and obesity have been considered as risk factors for preeclampsia. Despite the single BMI predictor could only reached a ∼ 60% accuracy in our dataset, it would still be interesting to integrate the feature in new prediction models in the future.

Conclusion
PE has a strong genetic factor in its causes and shows a complex process of the interactions of various factors. The MDR model may be an effective method for estimating risks of PE.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Ethical Approval
This study was approved by the institutional review board of the Maternal and Children's Hospital of Shenzhen City.

Consent
Written informed consent was obtained from all participants.

Conflicts of Interest
The authors declared that they have no conflicts of interest.