Combination of Enrichment Using Gene Ontology and Transcriptomic Analysis Revealed Contribution of Interferon Signaling to Severity of COVID-19

Introduction The severity of coronavirus disease 2019 (COVID-19) was known to be affected by hyperinflammation. Identification of important proteins associated with hyperinflammation is critical. These proteins can be a potential target either as biomarkers or targets in drug discovery. Therefore, we combined enrichment analysis of these proteins to identify biological knowledge related to hyperinflammation. Moreover, we conducted transcriptomic data analysis to reveal genes contributing to disease severity. Methods We performed large-scale gene function analyses using gene ontology to identify significantly enriched biological processes, molecular functions, and cellular components associated with our proteins. One of the appropriate methods to functionally group large-scale protein-protein interaction (PPI) data into small-scale clusters is fuzzy K-partite clustering. We collected the transcriptomics data from GEO Database (GSE 164805 and GPL26963 platform). Moreover, we created a data set and analyzed gene expression using Orange Data-mining version 3.30. PPI analysis was performed using the STRING database with a confidence score >0.9. Results This study indicated that four proteins were associated with 25 molecular functions, three were associated with 22 cellular components, and one was associated with ten biological processes. All GOs of molecular function, cellular components, and 9 of 14 biological processes were associated with important cytokines related to the COVID-19 cytokine storm present in the resulting cluster. The expression analysis showed the interferon-related genes IFNAR1, IFI6, IFIT1, and IFIT3 were significant genes, whereas PPIs showed their interactions were closely related. Conclusion A combination of enrichment using GOs and transcriptomic analysis showed that hyperinflammation and severity of COVID-19 may be caused by interferon signaling.


Introduction
A coronavirus is a group of viruses from the subfamily Orthocoronavirinae in the Coronaviridae family and the order Nidovirales. is group of viruses can cause disease in birds and mammals, including humans [1]. In humans, coronaviruses cause respiratory infections that are generally light, such as colds, to some severe infections of the respiratory, digestive, and systemic systems [2]. Several significant health problems such as fever, dry cough, difficulty breathing, pneumonia, multi-organ failure, and even death [3]. In addition, the SARS-CoV-2 virus becomes a parasite on host cells and will cause an excessive inflammatory reaction or hyperinflammation [4]. e meaning of hyperinflammation is an excessive immune response that can cause high levels of inflammation. e excessive inflammation caused by this virus occurs due to a storm of cytokines that can damage the human lungs [5], even forcing immune cells to destroy healthy cells. erefore, COVID-19 patients must receive intensive care [6]. In inflammation, protein can be a biomarker of organ damage [4].
In recent decades, the development of high-throughput experimental research and the availability of a wide variety of databases have led to the development of methods to explain systems biology [7]. Systems biology is an integrated field that links molecular components within the same biological and across different scales (e.g. cells, tissues, and organ systems) to physiological function and phenotype of organisms through quantitative reasoning, computational modelling, and high-performance experimental technology [7]. Systems biology methodology can be applied either as a bottom-up approach that gathers smaller functional units to create a system or as a top-down approach that starts from the overall view of the system and then tries to study smaller subsystems [7]. Systems biology uses both experimental and computational frameworks to answer biological questions. e computational task includes knowledge extraction using bioinformatics, statistical methods, and network analysis [7]. Network nodes are cellular components in systems biology, and edges are the reactions or interactions between these nodes. Considering cellular systems as networks is a valuable and practical way to understand the functional organization of cells by analyzing network topology [8,9].
In a previous study, we implemented a bottom-up strategy using network analysis to investigate the important proteins from protein-protein interaction (PPI). We employed a clustering technique and topological measures, such as degree centrality, betweenness centrality, and closeness centrality [8], to identify several proteins that exacerbate COVID-19 from the effects of hyperinflammation [10].
As in the bottom-up approach, in this study, we tried to know the role of those proteins in hyperinflammation by conducting enrichment analysis using gene ontology (GO). Enrichment analysis is a method for identifying the class of genes or proteins that are overexpressed in a large set of genes or proteins and may be associated with the phenotype of the disease. e screening of important proteins requires analysis using gene ontology (GO) due to the impact of hyperinflammation caused by SARS-CoV-2. GO is one of the data sources for functional genomic research [8], which consists of three distinct aspects describing the function of proteins. Ontology is a tool for seeking biological knowledge by associating data (genes or gene products) with biological processes, molecular functions, and cellular components [11]. We also could get the advantage of reducing the false positives value by using GO data [12]. e use of biological information could get better results in identifying important proteins [13] because it can improve the interpretation of results [8]. However, research on this topic is still limited. Most studies ignore the biological meaning of proteins in the context of PPI [14].
Finding the association of proteins and GO can be conducted using clustering. One of the clustering technique, the fuzzy k-partite clustering method, can cluster large-scale PPI data into functionally small-scale clusters. In a previous study, fuzzy k-partite clustering was developed and used to perform tripartite clustering on disease-gene-protein [15].
us, in this case, we could use fuzzy k-partite clustering to group protein/gene and three GO components. As known, each biological network was multifunctional so that proteins could fit into more than one cluster [15]. In this study, we employed the fuzzy k-partite clustering approach to perform enrichment analysis of protein and GO. We also continued this enrichment results by conducting transcriptomic analysis from the GEO database. We expected to reveal particular genes contributing to the disease severity.

Data.
is study used protein data that affect hyperinflammation in COVID-19. All proteins in this study are proteins of humans. ese proteins were obtained by using computational biology methods from important protein candidate data and PPI data [10]. e important protein candidates' data were obtained from OMIM (https://www. omim.org/), UniProt (https://www.uniprot.org/), and previous research about the COVID-19's protein [16][17][18]. From these various sources, there are 57 proteins obtained. Moreover, the PPI data were obtained from STRING (https://stringdb.org/). ese 57 proteins interact with other proteins so the PPI data contain 357 proteins and 1686 interactions.
is PPI data must go through the preprocessing stage, such as cleaning edge duplication and eliminating PPIs in small subgraphs that are not connected to the main protein interaction network. is step shows that the PPI data contain 222 proteins and 1239 interactions. e screening of important proteins was carried out in two stages by calculating the overall centrality value using PCA and clustering with ClusterONE. e overall centrality value was calculated from the seven centrality measures (degree centrality, betweenness centrality, closeness centrality, subgraph centrality, eigenvector centrality, information centrality, and network centrality); the weights of each centrality were obtained from the eigenvalue resulting from PCA. Next, we reduced the graph to obtain the subgraph using the induced graph method. e amount of protein taken for subgraph formation is 10% of the total protein in the main graph. ese proteins are linked to most of the proteins in the graph compared to the remaining 90%. Moreover, the top 10% proteins with the highest score of overall centrality contain 124 interactions formed by 22 proteins. By using ClusterONE, there were two clusters of this subgraph. e first cluster is the best because it had a higher density, quality, and average overall value and a lower p-value. ere were 20 important proteins in the first cluster, namely STAT3 , TYK2, IL6, STAT1, JAK1, STAT2, TBK1,  RSAD2, OAS2, OAS1, MX2, MX1, ISG15, IRF7, IRF3,  IFNAR1, IFIT3, IFIT1, IFI6, and DDX58. Moreover, we used GO data such as molecular function, cellular component, and biological process obtained from the UniProt database site. ree GOs represent GO terms describing a gene that encodes a gene product. ese gene products carry out molecular-level activities (molecular functions) at specific locations relative to the cell (cellular components). ese molecular-level processes contribute to a larger biological goal (biological processes) [11]. Figure 1 illustrates DNA replication in yeast modelled using three aspects of GOs. All GOs in this study are reviewed GOs in humans.

K-Partite Graph.
In graph theory, a k-partite graph is a graph in which nodes can be divided into k independent sets. An independent set means that the nodes in a set are not connected to other nodes by an edge. In other words, no node is adjacent to another node in an independent set. In a k-partite graph, a node is adjacent to another node of a different set.
is study used four data, including protein and three GOs data (molecular function, cellular component, and biological process). Because the three GOs data are not related to each other, we could build three bipartite graphs, namely protein-molecular function, protein-cellular component, and protein-biological process. We used Cytoscape to visualize three bipartite graphs. Figure 2 shows the illustration of bipartite graphs.

Fuzzy K-Partite
Clustering. Fuzzy K-partite clustering applies the graph-based fuzzy clustering algorithm [15]. Fuzzy clustering is one of the methods to determine the optimal cluster in a vector space based on the Euclidean distance. It is a soft clustering method that can group an object into more than one cluster. In other words, fuzzy clustering can perform overlap clustering. e term fuzzy in the context of clustering is that each object has a degree of membership value to determine the cluster position of the object [19]. e purpose of fuzzy clustering is to minimize the objective function with the main parameter is the degree of membership [20]. So, the initial stage in this method is to determine the number of initial clusters. e determination of the maximum number of clusters for each GO data and protein can be seen in equations (1) and (2), respectively [15].
where C go is the maximum number of clusters for each GO and N go is the number of nodes in each GO data.
where C p is the maximum number of protein clusters and N p is the number of nodes in the protein data.
e fuzzy K-partite clustering algorithm inputs are matrix A, matrix B, and matrix C. Matrix A is the adjacency matrix between protein and GO. Matrix A was obtained from transforming three bipartite graphs into three adjacency matrices for each protein-molecular function graph, protein-cellular component graph, and protein-biological process graph. Element of matrix A has value one if an edge connects a protein node and a GO node and zero otherwise.
Matrix B is a matrix of interconnection value between protein clusters and GO clusters, while matrix C is a matrix of protein membership degree value in the protein cluster and GO in the GO cluster. In the fuzzy K-partite clustering algorithm, matrix B and matrix C are non-negative matrices whose initial value for each element is a random value. e dimensions of matrix B and matrix C depend on the number of proteins, the number of GO, the maximum number of protein clusters, and the maximum number of GO clusters. Figure 3 illustrates matrix A, matrix B, and matrix C. Figure 3 shows two sets, including the proteins and GO sets. e clustering process is carried out in three stages, as many as GO types. We conducted three clustering processes: clustering of important proteins with molecular function, cellular component, and biological process because there is no information about the relationship between each type of GO.
e output of this algorithm is the value of protein and GO membership degree in each cluster and the interconnection value between protein clusters and GO clusters. e interconnection value between clusters was high if the percentage of cluster members was low and vice versa. e fuzzy K-partite clustering algorithm will stop if the cost function value has converged. e cost function equation is calculated by equation (3) [15].
where ‖ · ‖ 2 F is the Frobenius norm of squares, i.e., the sum squares of the matrix elements. e cost function value shows how easily data are grouped into several clusters, the easier the data are grouped into a cluster, the lower the cost function value will be. U sing the fuzzy K-partite algorithm, the value of the cost function will find the lowest value because the algorithm's structure is similar to non-negative matrix factorization (NMF), with the difference that it can handle the factorization problem of three matrices. In addition, each iteration will not increase the cost function value. Another advantage is that fuzzy K-partite clustering produces a lower cost function value of 10% than usual and can predict the cluster structure better than the previous method since it is a soft clustering. e fuzzy K-partite clustering algorithm can be seen in Figure 4.

MicroArray Dataset Analysis.
Microarray data were collected from data that referred to [21]. e data showed whole peripheral blood mononuclear cell (PBMC) genomic transcriptomes from severe (severe) and mild (mild) COVID-19 patients, as well as healthy controls (HC) Interdisciplinary Perspectives on Infectious Diseases retrieved from the GEO database (GSE 164805 and GPL26963 platform) [21]. Data set creation and gene expression analysis were performed using Orange Datamining version 3.30.

PPI Analysis Using STRING.
Protein-protein interaction (PPI) analysis of the altered expression of IRF7, IFNAR1, IFIT3, IFIT1, IFI6 in severe COVID-19 patients was performed using the STRING database. PPI analysis between the expression of these genes and the genes resulting from enrichment was carried out with a confidence score >0.9.
e type of interaction, the confidence score, and the type of change in expression (upregulation or downregulation) were recorded and arranged in tabular form [22].

Gene Ontology.
We searched three GO data (cellular component, biological processes, and molecular function) from 20 proteins obtained from PPI analysis. ere were 65 types of molecular function, 55 types of cellular component, and 274 types of biological processes associated with important proteins that have been obtained in the previous study.
Each GO data is associated with one important protein and has a relationship with more than one important protein.
e GO data of molecular functions, cellular components, and biological processes that have the highest relationship or degree with the important proteins can be seen in Figures 5-7, respectively.

Bipartite Graph.
Each GO is formed into a bipartite graph associated with important proteins using Cytoscape.
ere were 113 interactions between important proteins-molecular functions, 145 interactions between important proteins-cellular components, and 459 interactions between important proteins-biological processes. e illustrations of three bipartite graphs can be seen in Figures 8-10. e blue nodes are GO nodes, while the green nodes are important protein nodes.

Fuzzy K-Partite Clustering.
After being formed into a bipartite graph, convert each graph into an adjacency matrix, or matrix A. en calculate the maximum clusters formed for protein clusters in cellular components, molecular functions, and biological processes and each GO clusters of cellular components, molecular functions, and biological processes. With equations (1) and (2), the calculation results of the maximum clusters of each protein in cellular components, molecular functions, and biological processes and GO clusters of cellular components, molecular functions, and biological processes can be seen in Table 1.
After obtaining information on the maximum number of clusters in each protein and GO, matrix B and matrix C can be built with dimensions according to the maximum number of clusters in Table 1. is study did not search for the optimum cluster because the maximum number of clusters was relatively small. A protein or GO is assigned to a cluster if its membership degree exceeds the threshold of 0.2 [23].

Clustering Results
Analysis. From the three results of bipartite graph clustering between important proteins with molecular function, cellular components, and biological processes, this study shows that four important proteins are associated with 25 molecular functions, three important proteins are associated with 22 cellular components, and one

Significant Changes of Interferon-Associated Genes and Network Analysis.
e interferon was found to be upregulated, including IFNAR1 (https://en.wikipedia.org/wiki/ Interferon-alpha/beta_receptor), IFI6 (Interferon Alpha Inducible Protein 6), IFIT1 (Interferon Induced Protein With Tetratricopeptide Repeats 1), IFIT 3 (Interferon Induced Protein With Tetratricopeptide Repeats 3, IFNA6 (Interferon Alpha 6), and IFNB1 (Interferon beta precursor). On the other hand, the IRF7 (Interferon regulatory factor 7) and IFNG (Interferon Gamma) were found to be downregulated in severe COVID-19 patients. In Figure 11, we show the PPI interactions of these genes, IFI6, IFIT1, IFIT3, and IRF7, were strongly connected (confidence level � 0.9), while IFNAR was not connected. is shows that severity may be affected by changes of transcripts of interferon-related signaling.  From these data, we also observed the IFNG value in severe and mild-HC groups. e data are shown in Figure 11. e results showed that IFNG expression was downregulated significantly (p ≤ 0.01). erefore, the severity of COVID-19 may be caused by the down expression of the IFNG. e results of gene expression analysis showed that there was a decrease (downregulation) of the expression of IFNG and IRF7, which was thought to affect the antiviral response by interferon in SEVERE patient's expressions.
In addition, to understand the IFNG downregulation, we observed the TBK1 (TANK-binding kinase 1-interferon) value, which is an inducer of the IFNG in Figure 12. e TBK expression was not changed in HC, mild, and severe COVID-19 patients, respectively. We also investigated the PPI from interferons-related proteins to understand connections. In Figure 12, we show that all of the genes were connected. From this, we assumed that down or upregulated genes will affect the network. PPI analysis can be seen in Figure 13

Discussion
Previous studies reported two molecular functions, two cellular components, and 14 biological processes associated with significant cytokines in the COVID-19 cytokine storm [24]. ese GOs were retained from GeneAnalytics (https:// geneanalytics.genecards.org/), which can identify gene ontology terms associated with such cytokines. e names of each type of GO can be seen in Table 2. e results of each GO cluster member that are not associated with significant cytokines in the COVID-19 cytokine storm are GOs suspected to be associated with other hyperinflammation in human organs cause of disease complications when exposed to SARS-CoV-2.
Clustering important proteins can develop this research with three types of GOs using a quadripartite graph. Information about the relationship between the three types of GO is needed so that it is hoped that there will be a more indepth analysis of the relationship between important proteins and three types of GO. As comparison besides GO, we also added enriched data from Wikipathway, which is shown in Table 3. e severity of COVID-19 was triggered by hyperinflammation of the host. Several factors may drive this, including oxidative stress driven by xanthine oxidase, cytokine productions such as interleukin family, and neutrophil recruitment, triggering microthrombus formations [25]. We followed up by combining gene expression analysis from microarray data collected referred to [21]. e data suggest that IFNAR1 (Interferon-alpha/beta receptor alpha chain), https://www.uniprot.org/uniprot/O14879)IFIT3 (Interferon-induced protein with tetratricopeptide 3, IFIT1 (https://www.uniprot.org/uniprot/O14879)Interferon-induced protein with tetratricopeptide 1, and IFI6 (Interferon alpha inducible protein 6) were upregulated inpatient with severity, while the IRF7 was downregulated (interferon regulatory factor 7).
IFNAR was necessary to activate interferon stimulatory gene (ISG) to suppress the virus to enter the cells. An increase in IFNAR may be associated with the host response to viral infections. Bastard et al. showed that one cause that affected severity was auto antibody IFN type 1 [26]. Moreover, it was found that increasing IFIT1 and IFIT3 has been reported previously in CD16+ monocytes of mild and severe COVID-19 patients, while the IRF7 was not differentially expressed. Interestingly, our analysis showed that the IRF7 was downregulated, while the IFNAR1 was increased. is may reflect to that autoantibody of IFN type 1 may occur and cause IRF7 down expression. On the other hand, besides type 1 IFN, COVID-19 infections were modulated by interferon inflammation that triggered the SARS-CoV-2 entering the cells. It was also reported that interferon response genes were also found to increase as a response to viral infections while the epithelial was infected, including IFI6 similar to our finding. [26][27][28].  Contribution of interferon type I and type II was previously reported by [29] and related to our results. e analysis of gene expression increased significantly in IFNA6, IFNB1 (p < 0.05), and IFNG. An increase of IFNA6 and IFNB1 may reflect activation of hyperinflammation through the type 1 interferon pathway [28]. On the other hand, our data suggested that IFNG was significantly down between mild and healthy control to severe patients. erefore, downregulation of IFNG may show low antiviral response in the patients and may relate to the severity of COVID-19 patients as shown in Table 4 and Figure 11, respectively (p ≤ 0.01). e increase in IFNA6 and IFNB1 is a natural thing because as due to the decrease in IFNG antiviral response, there is no antiviral response that harms the host. Decreased IFNG expression could be decreased IRF7 via TBK1 [29].  Figure 11: e mean IFNG expression in the SEVERE and MILD-HC groups was significantly different (p ≤ 0.01). It can be seen in the image that IFNG is downregulated that inhibits IFNG synthesis.

Conclusion
e network analysis, as one of the system biology approach, can help us to reveal the contributing genes to the diseases severity. A combination of enrichment using GOs and transcriptomic analysis showed that hyperinflammation and severity of COVID-19 may be caused by interferon signaling.

Data Availability
All the data were collected from database such as GO and GEO Browser.

Conflicts of Interest
e authors declare that they have no conflicts of interest.