Predictive Biomarkers for Postmyocardial Infarction Heart Failure Using Machine Learning: A Secondary Analysis of a Cohort Study

Background There are few biomarkers with an excellent predictive value for postacute myocardial infarction (MI) patients who developed heart failure (HF). This study aimed to screen candidate biomarkers to predict post-MI HF. Methods This is a secondary analysis of a single-center cohort study including nine post-MI HF patients and eight post-MI patients who remained HF-free over a 6-month follow-up. Transcriptional profiling was analyzed using the whole blood samples collected at admission, discharge, and 1-month follow-up. We screened differentially expressed genes and identified key modules using weighted gene coexpression network analysis. We confirmed the candidate biomarkers using the developed external datasets on post-MI HF. The receiver operating characteristic curves were created to evaluate the predictive value of these candidate biomarkers. Results A total of 6,778, 1,136, and 1,974 genes (dataset 1) were differently expressed at admission, discharge, and 1-month follow-up, respectively. The white and royal blue modules were most significantly correlated with post-MI HF (dataset 2). After overlapping dataset 1, dataset 2, and external datasets (dataset 3), we identified five candidate biomarkers, including FCGR2A, GSDMB, MIR330, MED1, and SQSTM1. When GSDMB and SQSTM1 were combined, the area under the curve achieved 1.00, 0.85, and 0.89 in admission, discharge, and 1-month follow-up, respectively. Conclusions This study demonstrates that FCGR2A, GSDMB, MIR330, MED1, and SQSTM1 are the candidate predictive biomarker genes for post-MI HF, and the combination of GSDMB and SQSTM1 has a high predictive value.


Introduction
Heart failure (HF) is one of the primary long-term complications of acute myocardial infarction (MI). Meanwhile, post-MI HF has been identified as a timedependent variable significantly related to mortality with a hazard ratio of 3.31 [1][2][3]. Screening the post-MI HF genes served as novel candidate biomarkers facilitates exactly diagnosis and timely intervention. However, despite many proposed biomarkers involving post-MI HF, few of them have gained widespread acceptance and application in clinical practice [4].
We analyzed the gene expression profile of post-MI HF patients and those who remained HF-free over a 6-month follow-up using plasma samples collected at admission, discharge, and 1-month follow-up. Differential expression analysis and weighted gene coexpression network analysis (WGCNA) were combined to screen the top-ranked circulating candidates. In addition, we performed enrichment analysis to illustrate the potential influence on progression from MI to HF using functional annotation algorithms. Moreover, we confirmed the differentially expressed genes (DEGs) and key modules using external datasets from 2 different acute MI patient cohorts, 4 single-cardiac cell transcriptomic studies [5], and 12 ischemic cardiomyopathy patients' expression profiles [6]. is study aimed to identify circulating biomarkers to predict post-MI HF using machine learning methods.

Data Acquisition.
A total of 64 samples from patients with ST-elevation MI were enrolled from the First Chair and Department of Cardiology of the Medical University of Warsaw, with the approval of the Ethics Committee of the Ain Shams the Faculty of Medicine [7]. All 17 patients were indicated for direct percutaneous coronary intervention. Coronary angiography, angioplasty of the infarct-related artery, and pharmacological therapy were performed following the 2008 European Society of Cardiology guidelines for acute myocardial infarction [8]. Whole blood samples were collected at the time point of admission (first day of MI), discharge (4 to 6 days after MI), and 1-month followup, respectively. According to the manufacturer's instructions, the transcriptional profiling was analyzed using Human Gene 1.0 ST Array (Affymetrix, Santa Clara, CA, USA; Platform GPL6244). e involvement of this study did not influence treatment. All participants were provided written informed consent following the Declaration of Helsinki. is study was a secondary data analysis on publicly available data, and the raw data were acquired from the Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/ geo/) [7]. Figure 1 shows a flow diagram summarizing the entire study design.

Data Processing and Probe Reannotation.
Log 2 -transformation, background correction, and quantile normalization were performed on the raw gene expression profiles using the linear models for the microarray data (limma) algorithm. en, the probe serial numbers were converted into gene symbols according to the annotation file provided by the manufacturer. When a single gene was mapped by more than one probe, the average expression level of this gene was calculated. Finally, the expression profile containing 23,307 genes was further processed.

Clustering Analysis and Visualization.
Clustering analysis is a powerful tool to perform molecular classification among samples and identify subtype characterization [9][10][11]. Among the many clustering algorithms, hierarchical cluster analysis and k-means clustering are the two prominent representatives, whereas t-distributed stochastic neighbor embedding analysis and principal component analysis are widely used unsupervised methods to reduce dimensions of expression data. e processed expression data were first analyzed by the k-means cluster method and visualized using a heatmap. en, we performed an unsupervised hierarchical cluster analysis with a scale-free network and topological overlaps. Meanwhile, hierarchical cluster analysis is a cluster analysis method to create a hierarchy of clusters and thus group patients with similar gene expressions into the same clusters [12,13]. Additionally, we ran the discriminant analysis using t-distributed stochastic neighbor embedding analysis, a nonlinear dimensionality reduction algorithm well-suited for visualizing high-dimensional data [14,15]. In this study, hierarchical cluster and t-distributed stochastic neighbor embedding analysis were performed on the full set (all the four time points) of detected genes, which aimed to illustrate the general difference in expression patterner between the post-MI HF and non-HF groups.
Moreover, we performed principal component analysis on the expression data of admission, discharge, and 1-month follow-up, respectively. e principal component analysis is a widely used distance-based statistical algorithm that reduces the dimensionality of complex datasets, increases interpretability, and minimizes information loss [14][15][16][17][18][19]. An appropriate time point with good distinguishing ability will be selected based on the expression parameter revealed by principal component analysis.

Screening Differentially Expressed Genes (DEGs).
Fold change is a univariate filter method to compare the absolute expression value change between two groups, and it has been widely used as a threshold for screening possible biomarkers. We analyzed the gene expression profile acquired at three time points (admission, discharge, and 1-month follow-up) and screened DEGs between the post-MI HF and non-HF groups based on log 2 fold change expression using the limma method [20]. We assumed that the difference in blood samples might be smaller compared with tissue samples (like heart tissue). erefore, to avoid eliminating excessive candidate biomarkers, we set a lower threshold of fold change >1.1 and P value < 0.05. e DEGs were visualized as a volcano plot and heatmap using the "ggplot2" and "pheatmap" package in R.

Construction of WGCNA.
WGCNA is a bioinformatics algorithm to explore the transcriptome expression patterns across genes, identify gene modules associated with complex disease features, and reveal the biologically functional interpretations of network modules [21][22][23][24]. Based on the time-series gene expression profiles, we used the one-step network construction function of the "WGCNA" package (version 1.60) for constructing the coexpression network and identifying key modules. Scale independence and mean connectivity were calculated using a gradient method with a range of 1 to 20, and the power value was selected with a threshold of independence degree >0.8. e minimal module size and the merge cut height were set as 30 and 0.3, respectively. After module construction, we summarized the module eigengene according to the first module principal component to evaluate the significance of each module, and the module-trait relationships were assessed based on the correlation between module eigengenes and clinical traits. Furthermore, we calculated all genes' average absolute gene significance within one module and evaluated the correlation strength accordingly. In addition, the gene significance value was defined by log10-transformed P value in the linear regression between expression and clinical traits. e 2 Evidence-Based Complementary and Alternative Medicine modules with the highest MS values were considered as the key modules [21]. After constructing coexpression networks, we further evaluated the preservation levels of key modules using module preservation analysis, which summarizes different preservation statistics into one single overall measure of preservation (i.e., Zsummary value). Zsummary is a statistic value composed of multiple statistics related to density connectivity [25]. Generally, a higher Zsummary value suggested the more substantial evidence that a module should be preserved: Zsummary value less than 2 indicated "no evidence," Zsummary value between 2 and 10 indicated "weak evidence," and Zsummary value higher than 10 indicated strong evidence. However, the Zsummary value tends to increase with the rise of module size, and therefore, it is unappropriated to use the Zsummary value to perform preservation analysis on modules with distinct sizes. In that case, medianRank, which is calculated based on the observed preservation statistics and not affected by module size, should also be applied [26]. A module with a lower medianRank value is more preserved than those with a higher medianRank.

Enrichment Analysis and Protein-Protein Interaction (PPI) Network.
To reveal the roles of key modules in the progression of post-MI HF, we ran gene ontology (GO) enrichment analysis using the "clusterProfiler" package. Moreover, we performed association and enrichment analysis based on DisGeNET [27] database and visualized using Metascape, which was a tool to systematically analyze and interpret OMICs-based research [28]. DisGeNET, a gene-disease associations database, contains publicly available collections of genes and human disease-associated variants [27]. In addition, we ran a PPI enrichment analysis on genes from key modules and created an interaction network. e molecular complex detection algorithm was also applied to detect densely connected network components [29].

Identification of Potential Biomarkers and Expression
Analysis. To identify potential biomarkers for post-MI HF, DEGs (dataset 1) were cross-referenced with genes from key modules (dataset 2). We considered the biomarkers would be reproducible if they were identified by both expression analysis and coexpression network analysis at the same time.
In parallel, we included other candidate biomarkers from external datasets (dataset 3): (1) a recently published research combining aptamer-based proteomics from 2 different acute MI patient cohorts and 4 single-cardiac cell transcriptomic studies, which identified 36 potential circulating biomarkers [5]; (2) an open-access gene expression profile of human heart evaluating the influence of heart failure on human nucleocytoplasmic transport-related genes: 12 samples from ischemic cardiomyopathy and 5 samples from control hearts [6]. en, we ranked biomarkers according to 3 priorities: lower priority (observed in 1 of 3 datasets), intermediate priority (observed in 2 of 3 datasets), and high priority (observed in all three datasets).

Statistical Analysis.
Continuous variables were represented as mean ± standard deviation (normal distribution) or median + interquartile range (skewed distribution). Categorical variables were presented as percentages. e one-way ANOVA test, Kruskal-Wallis test, and chi-square test were used to determine statistical differences, as appropriate. e receiver operating characteristic (ROC) curves were created, and the area under the curve (AUC) was calculated to assess the predictive value of these possible biomarkers. All statistical analysis was performed by R software version 3.6.1 (R Foundation for Statistical Computing, Vienna). P < 0.05 was considered as statistical significance.

Clinical Characteristics of the Study Population.
is study included 17 patients with myocardial infarction who volunteered for a six-month visit. All these patients Evidence-Based Complementary and Alternative Medicine were diagnosed with STEMI and received coronary angiography, angioplasty, and pharmacological treatment following current guidelines [8]. After six months, 9 patients were diagnosed with HF (post-MI HF group), and the other 8 individuals were grouped into the post-MI non-HF group. No significant difference was observed in age, sex, body mass index, hypertension, diabetes, smoking, hypercholesterolemia, anterior myocardial infarction, and medications (beta-blockers, aspirin, clopidogrel, statins, and angiotensin-converting enzyme inhibitors) at baseline (all P > 0.05). However, the post-MI HF group showed higher NT-proBNP (918.3 ± 848.5 vs. 62 ± 14.1 pg/mL, P < 0.001), lower LVEF (39.3 ± 8.4 vs. 66.8 ± 1.9%, P � 0.001), and more administration of diuretics (7 vs. 1, P � 0.015) compared with the non-HF group. Baseline demographic and clinical characteristics have been summarized in the parent study [7].

Clustering Analysis and Visualization.
K-means cluster analysis indicated a distinct expression patterner between the HF and non-HF groups, although only limited expression similarity was observed in samples from the same time point (Figure 2(a)). Consistently, hierarchical cluster and t-distributed stochastic neighbor embedding analysis suggested that post-MI HF patients showed a different expression patterner compared with non-HF patients (Figures 2(b) and 2(c)). Principal component analysis on the expression data of three time points showed that the expression at admission and discharge might be appropriate time points with a good distinguishing ability (Figures 2(d)-2(f )).

Weighted Coexpression Network Construction and Key Modules Identification.
e soft-thresholding power of 8 was selected according to the scale-free topology criterion (scale-free R 2 � 0.81, Figures 4(a) and 4(b)), and 28 modules were created (Figure 4(c)). All the genes that could not be put into any other modules were included in the grey module, and the grey module was excluded from the following research. Next, we analyzed the association between modules and clinical traits, including the diagnosis of HF and follow-up time (Figures 4(d) and 4(e)). e white and royal blue modules were most significantly positively or negatively correlated with post-MI HF, respectively. Accordingly, white and royal blue modules were identified as the key modules. A total of 40 and 105 genes were included in the white and royal blue modules, respectively. In Figures 4(f ) and 4(g), we illustrated the correlation between module membership and gene significance in white (correlation coefficient � 0.91, P < 1e-200) and royal blue module (correlation coefficient � 0.74, P � 1.8e-111). Figure 4(h) shows the module preservation statistics, and the Z summary values of both white and royal blue modules were more than 10. Additionally, Figure 4(i) illustrates the medianRank score analysis of different modules.

Enrichment Analysis of Key Modules and Interaction
Network. We ran enrichment analysis on the key modules using the Gene Ontology database. As shown in Figure 5(a), enriched biological processes were mainly involved in autophagy, a process utilizing autophagic mechanism, negative regulation of ubiquitin-dependent protein catabolic process, negative regulation of proteolysis involved in cellular protein catabolic process, and positive regulation of RNA splicing. e cellular components were mainly enriched in nuclear chromatin, inclusion body, mediator complex, clathrin-coated endocytic vesicle membrane, and nuclear pore nuclear basket. Enriched molecular functions mainly involved nuclear hormone receptor binding, hormone receptor binding, histone binding, vitamin D receptor binding, and thyroid hormone receptor binding. Additionally, Figure 5(b) shows the gene network of GO analysis, and the network of enriched terms is shown in Figure 5(c). Moreover, the enrichment analysis in DisGeNET revealed that genes in the key modules were associated with sleep disturbances, multiple congenital anomalies, delayed speech and language development, bulbous nose, and neurodevelopmental disorders ( Figure 5(d)). Furthermore, the PPI network was illustrated in Figure 5(e), and 2 cluster subnetworks (including SMARCC1, NR3C1, RNF2, NCOR1, MED1, MED14, TNRC6A, APP, PTBP3, and BCL7C) were created using the molecular complex detection algorithm.

Identification of Potential Biomarkers and Expression
Analysis. Dataset 1 included 200 DEGs from 3 time points, and dataset 2 included 145 genes from key modules. After overlapping dataset 1 and dataset 2, a total of 5 genes were acquired, including OR7E14P, GSDMB, TAX1BP3, SQSTM1, and KAT6B. e detailed cross-reference information was provided in Figure 6(a). Moreover, 9 genes were found in all three datasets and considered high-priority candidates, including FCGR2A, RQCD1, IRF8, RELL1, GPR21, PTBP3, CYB5R1, ICA1, and CPNE8 (Figure 6(b)). Based on a literature search, we identified 5 genes that might most effectively differentiate the post-MI HF patients from those without HF: FCGR2A, GSDMB, MIR330, MED1, and SQSTM1. Figures 6(c)-6(e) shows the expression level of these genes at admission, discharge, and 1 month after discharge, respectively.

Discussions
In this study, we performed a secondary analysis of a cohort study using machine learning including nine post-MI HF patients and eight post-MI patients who remained HF-free over a 6-month follow-up. e main findings are as follows.
(1) Five candidate biomarkers (including FCGR2A, GSDMB, MIR330, MED1, and SQSTM1) were identified, which might most effectively differentiate the post-MI HF patients from those without HF. (2) When combining GSDMB and SQSTM1, the AUC achieved as high as 1.00, 0.85, and 0.89 in admission, discharge, and 1-month follow-up, respectively, indicating a high predictive value for post-MI HF. FCGR2A, also named FccRIIa, is a low-affinity receptor for the constant fragment of immunoglobulin G, mainly expressed on platelets' surface. Calverley et al. [30] reported an increased level of FCGR2A in patients with myocardial infarction, unstable angina, and ischemic stroke. Schneider et al. [31] analyzed the expression level of FCGR2A in post-MI patients and found a 4-fold greater risk of subsequent MI, stroke, and death in those with higher platelet FCGR2A expression. In our study, we revealed that FCGR2A was significantly upregulated in post-MI HF patients. Engagement of FCGR2A on platelets by immune complexes will trigger intracellular signaling events and lead to platelet activation and aggregation. Multiple studies have revealed that HF was significantly associated with abnormal platelet morphology and function [32,33]. In addition, HF patients have higher mean platelet volume [34], increased whole blood aggregation [35], and elevated platelet-derived adhesion molecules [36]. Potential mechanisms include hemodynamic and vascular factors, secretion of cytokines like C-C chemokines, and renin-angiotensin system activation [32].
Gasdermins (GSDMs) are a family of functionally diverse proteins expressed in various cell types and tissues, and

Evidence-Based Complementary and Alternative Medicine
GSDMs have been well demonstrated to be involved in pyroptosis, a proinflammatory type of regular cell death [37]. It has been reported that GSDMB promotes noncanonical pyroptosis by enhancing caspase-4 activity and GSDMD cleavage [38]. With the deepening understanding of HF and chronic inflammation, pyroptosis has been revealed as having an important role in HF [39]. e pyroptosis of myocardial cells leads to the irreversible loss of cardiomyocytes, whereas pyroptosis of cardiac fibroblasts results in myocardial fibrosis and cardiac hypertrophy, which leads to the adverse change in cardiac structure and function and will eventually result in HF. Moreover, accumulating studies revealed that sleep disturbances significantly increased cellular stress, inflammation, and myoblast pyroptosis, leading to the development of HF [40][41][42]. Interestingly, our results revealed that GSDMB was differently expressed in all three time points and included in key modules showing high similarity with sleep disturbancesassociated genes, suggesting its important role in the development of post-MI HF.
SQSTM1 (also known as p62), a multifunctional protein consisting of a series of domains, acts in concert with binding partners to regulate the cellular process, especially autophagy [43]. As an autophagy receptor, SQSTM1 has been recognized as an autophagy marker [44]. Autophagy is a self-degradative process for delivering aggregating proteins and damaged organelles to lysosomes for degradation, protecting cells from intracellular stress, and providing essential energy for starving cells [45]. However, the exact mechanisms between autophagy and HF remain largely vague despite the many studies. Current evidence indicates the key role of autophagy in protecting myocardial cells against HF, while overactivation of autophagy will contribute to the progress of HF [46,47]. In the early stage of HF, activated autophagy increases protein degradation, reduces myocardial hypertrophy, and antagonizes ventricular hypertrophy. On the contrary, autophagy promotes cardiomyocyte death and accelerates the deteriorating progression of HF. In our study, the expression of SQSTM1 was significantly increased in post-MI HF, which suggested that excessive autophagy with MI might contribute to the development of HF. In addition, our results showed that the combination of GSDMB and SQSTM1 had a high predictive value for post-MI HF, indicating that pyroptosis and autophagy played a jointly promoting role in the development of post-MI HF.
Mediator, a multisubunit nuclear complex, is a major component of eukaryotic transcription machinery that   [49,50]. Hall et al. [51] revealed that deletion of Med1 in cardiomyocytes deregulated more than 5000 genes and promoted the development of acute HF. Underlying mechanisms may be involved in the deregulated expression of genes in calcium signaling, cardiac muscle contraction, and mitochondrial metabolic functions, accompanied by the downregulated expression of Med1 [52]. Interestingly, Bai et al. [53] had reported that Med1 in macrophages has an antiatherosclerotic role by suppressing the expression of proinflammatory genes via PPARc-regulated transactivation [54,55], suggesting its protective role in the cardiovascular system. Similarly, our study showed that Med1 was significantly downregulated in patients with post-MI HF, indicating it may be served as an effective biomarker for HF. MiRNAs are a class of small noncoding RNAs, which function as regulators of gene expression at the posttranscriptional level [56]. Ren et al. [57] demonstrated that overexpression of MIR330 in acute coronary syndrome alleviated acute coronary syndrome by suppressing atherosclerotic plaque formation and enhancing vascular endothelial cell proliferation through the WNT signaling pathway. Moreover, Wei et al. [58] reported that upregulated MIR330 might lead to stable carotid plaques by targeting Talin-1 in symptomatic carotid stenosis patients. However, in another research [59], overexpression of MIR-330 was reported to promote left ventricular remodeling, increase myocardial infarction sizes, and aggravate myocardial ischemia-reperfusion injury during coronary recanalization. Different downstream pathways exert distinct biological effects, and the role of MIR330 in post-MI HF remains to be further studied.

Limitation
Several limitations should be highlighted in our study. First, our study belongs to a secondary analysis of a cohort study. In the parent study [7], the study group and validation group were created, and microarrays were used to identify a set of genes associated with post-MI development HF in the early phase of MI, especially on admission. Differently, in our study, we focused on the study group of the parent study and performed a more indepth analysis of the whole blood samples collected at admission, discharge, and 1-month follow-up to screen candidate biomarkers to predict post-MI HF in the early stage using differential expression analysis and WGCNA. More importantly, the developed external datasets on post-MI HF were introduced to confirm the candidate biomarkers, which facilitated increasing sample sizes and improving the reliability of results. Second, the datasets contained different cardiac models and pathologies and failed to include a strictly post-MI HF disease [5,6], which posed a risk of introducing false positives and false negatives. However, these pathologies shared many underlying features and were likely to exhibit similar biomarker profiles [5]. Moreover, despite included external  datasets, our study still had relatively smaller sample sizes, which may have an effect on the stability of results, reduce the test efficiency, and cause possible bias to the research results; accordingly, external validation with a larger cohort is still required to demonstrate their reliability, and meanwhile, further studies should be required to elucidate the underlying mechanisms. In addition, our screening tools have their limitations; therefore, candidate biomarkers need further validation in clinical and experimental studies.

Conclusions
is study demonstrates that FCGR2A, GSDMB, MIR330, MED1, and SQSTM1 are the candidate biomarkers for the progression of HF after MI, and the combination of GSDMB and SQSTM1 has the highest predictive value. Following studies are required to further validate the predictive accuracy and clarify the underlying mechanisms.

Data Availability
All the data generated or analyzed during this study include the available Gene Expression Omnibus (GEO) database (Accession nos. GSE59867 and GSE42955) and a published article (DOI: 10.1161/CIRCULATIONAHA.119.045158).

Conflicts of Interest
e authors declare no conflicts of interest.