Data Mining Mycobacterium tuberculosis Pathogenic Gene Transcription Factors and Their Regulatory Network Nodes

Tuberculosis (TB) is one of the deadliest infectious diseases worldwide. In Mycobacterium tuberculosis, changes in gene expression are highly variable and involve many genes, so traditional single-gene screening of M. tuberculosis targets has been unable to meet the needs of clinical diagnosis. In this study, using the National Center for Biotechnology Information (NCBI) GEO Datasets, whole blood gene expression profile data were obtained in patients with active pulmonary tuberculosis. Linear model-experience Bayesian statistics using the Limma package in R combined with t-tests were applied for nonspecific filtration of the expression profile data, and the differentially expressed human genes were determined. Using DAVID and KEGG, the functional analysis of differentially expressed genes (GO analysis) and the analysis of signaling pathways were performed. Based on the differentially expressed gene, the transcriptional regulatory element databases (TRED) were integrated to construct the M. tuberculosis pathogenic gene regulatory network, and the correlation of the network genes with disease was analyzed with the DAVID online annotation tool. It was predicted that IL-6, JUN, and TP53, along with transcription factors SRC, TNF, and MAPK14, could regulate the immune response, with their function being extracellular region activity and protein binding during infection with M. tuberculosis.


Introduction
Tuberculosis (TB) is one of the deadliest infectious diseases in the world [1,2]. One-third of the global population is estimated to be infected with Mycobacterium tuberculosis, and 5-10% of these patients have active disease [3]. Since the World Health Organization (WHO) announced TB as a global emergency, incidence and mortality of tuberculosis remain high, despite the large cost invested during this period [4]. A dominant reason behind this is the long-term latency of M. tuberculosis and lack of development of new, effective antituberculosis drugs. M. tuberculosis is an intracellular pathogen, and it is difficult for human antibodies to enter the cell from serum, and the cell-mediated immune response plays a very important role in either development of infection to active disease or containment of the pathogen. Therefore, the study of M. tuberculosis genes related to immunity and their regulatory network has become a focus of the current research.
Traditional single-gene screening of M. tuberculosis targets has been insufficient towards the development of new therapies [5]. Tuberculosis infectious disease is a complex disease involving the body's immune system, and traditional single-gene screening patterns do not fully and accurately reflect the occurrence, development, and pathogenesis of infectious diseases [6]. Therefore, integrating multidisciplinary approaches to construct an active pulmonary TB immune response gene regulatory network and then utilizing highthroughput screening to look for diagnostic targets in body fluids would add significant understanding of the regulatory network involved in the immune response to M. tuberculosis [7]. This strategy may lay a foundation for the early clinical diagnosis of TB and the development of new effective drugs for TB treatment [8].
In this study, using the National Center for Biotechnology Information (NCBI) GEO Datasets [9], the fluid (whole blood) gene expression profile data were obtained in patients with active pulmonary tuberculosis. Linear model-experience Bayesian statistics using the Limma package in R combined with t-tests were applied for nonspecific filtration of the expression profile data [10], and the differentially expressed genes were screened out. Using DAVID (Database for Annotation Visualization and Integrated Discovery) [11] and KEGG (Kyoto Encyclopedia of Genes and Genomes) [12], the functional analysis of differentially expressed genes (GO analysis) [13] and the analysis of signaling pathways (pathway analysis) [14] were performed. Based on the differentially expressed gene set, the transcriptional regulatory element databases (TRED) [15] were integrated to construct the M. tuberculosis pathogenic gene regulatory network, and the correlation of the network genes with disease was analyzed with the DAVID online annotation tool.

Method
2.1. Gene Chip Array Data. The gene chip array data were retrieved from GEO Datasets of the National Center for Biotechnology Information (NCBI, USA) database using keywords "tuberculosis" and "Homo sapiens" [porgn:_t-xid9606] [9]. Data from Affymetrix gene chips from human subjects infected with M. tuberculosis and with the original CEL documents available were selected.
We chose gene expression profile of GSE16250 from GEO database, which was a public and freely available database. GSE20050 was based on agilent GPL1352 platform. The GSE52819 dataset included 12 samples, containing 6 TB samples and 6 healthy samples. We also downloaded the Series Matrix File of GSE54992 from GEO database (Table 1).

Chip Data
Processing. The Affymetrix Expression Console software tool was applied for the background correction of the collected chip data and the conversion of probe fluorescence values into gene expression values [16]. The Affymetrix Transcriptome Analysis Console was used for the logarithmic calculation and standardization of the chip data. The difference in mRNA expression between the chips in healthy people and those infected with M. tuberculosis was compared by SAM, and the genes with fold change > 2.0 or <−2.0 and p value < 0.05 were selected as differentially expressed genes. The differentially expressed genes were further screened by a Venn diagram; considering the differences between the chip platforms, the genes that overlapped with three or more than three platforms were selected.

Transcription
Factor and Target Gene Screening. Transcription factor and target gene screening analysis was carried out at the TRED website http://rulai.cshl.edu/TRED [15].

Regulatory Network Diagram.
Eighteen transcription factors (TF) and 887 target genes were predicted to a combined total of 887 TF-to-target pairs. The relation obtained from the analysis on the differential coexpression was mapped to the human transcription factors and target gene pairs to obtain transcription regulation pairs. Finally, Cytoscape software was used for plotting [17].

Gene Function Annotation
Analysis. This analysis was carried out using DAVID database at website. 1075 genes were included in the analysis, and the human genome was used as the source of background genes [11].

Analysis of Differentially Expressed Gene Functions.
Based on the NCBI Gene Ontology database, the GO annotation of these genes was performed to obtain all GO terms which genes were involved [13]. Fisher's exact test and χ 2 test were used to calculate the significance level and misdiagnosis rate of each GO, and the p value was calibrated with the misdiagnosis rate, thereby screening out the GO significance reflected by the differentially expressed genes (p < 0 05). Using the European Bioinformatics Institute (EBI) database, the experimental results were manually analyzed.
2.7. Analysis of KEGG Gene Pathways. All pathways used in this study were downloaded from KEGG databases. In total, 78 relevant KEGG pathways were downloaded from http://www.genome.jp/kegg/ in September 2016 [12].

Gene Chip Data.
All the chip data in this study were taken from GEO Datasets of the National Center for Biotechnology Information (NCBI) database. The gene expression data of humans infected with M. tuberculosis based on Affymetrix gene chips with the original CEL document were selected, and 4 groups of gene chips were determined, as shown in Table 1.
These 4 groups of chips included those whose chip number of chips was 6, 7, 12, and 21, and each chip group was divided into two subgroups, a healthy subgroup and TB patient subgroup.

Chip Data
Processing. The logarithm and standardization of chip data were conducted using Affymetrix's Transcriptome Analysis Console, and the differentially expressed mRNA between the chips of healthy people (healthy group) and those infected with M. tuberculosis (TB patient group) was compared by the SAM method to determine differentially expressed genes. The number of differentially expressed genes of each platforms was 5569, 2442, 616, and 578, respectively. These genes were cross-screened again, and the results showed that there were 130 differentially expressed genes on the 4 platforms, 224, 63, 504, and 154 on three platforms, respectively, with a total of 1075 genes ( Figure 1).

TB Transcription Factor Regulatory Network Diagram.
Using TRED (Transcriptional Regulatory Element Database) database (http://rulai.cshl.edu/TRED), transcription factors interacting among the 1075 expressed genes were predicted and analyzed, and 887 genes were found to correspond to 18 transcription factors ( Table 2).
The regulatory network diagram of M. tuberculosis pathogenic gene transcription factors of the 18 transcription factors and their corresponding target genes was mapped with Cytoscape software (Figure 2).
Sixteen target genes were regulated by more than 1 transcription factor (Table 3), of which the target genes regulated 3.4. Network Nodes. As shown in Figure 2, statistical analysis of network nodes in the transcription factor regulatory network diagram found more than 10 nodes that had 44 genes, of which there were 20 nodes that had more than 4 genes, including IL-6, ETS2, TNF, and OPN1MW (Table 4). Furthermore, IL-6 was regulated by the most transcription factors, and genes such as JUN and IL-5 were also closely related to the regulation of M. tuberculosis pathogenic gene transcription factors.
3.5. GO Function Annotation Analysis. GO function annotation for 1075 different genes was carried out and sorted based on the p values, and 10 pathways were determined (Table 5). These 10 pathways were closely related to the disease, and the TB pathway was ranked in the top five of the pathways found.

Discussion
Tuberculosis caused by the infection of M. tuberculosis remains a threat to human health throughout the world. Currently, about 2 billion (1/3) people worldwide are infected with M. tuberculosis, but only 10% will develop active tuberculosis and the remaining 90% will make the disease to curb due to the body's own well-developed immune system [18]. Our previous study found that the immune response mechanism of host cells plays a crucial role in the process of microbial infection [19]. The disease process of active tuberculosis is transformed into the active stage and depends not only on the adaptive changes of mycelium caused by genes regulation but also on the transcription factor regulation network of the host's immune system [20]. To understand the defense mechanism of the host immune system against M. tuberculosis infection will provide a theoretical basis for the progress and prevention of active tuberculosis. In this study, utilizing GEO Datasets from the NCBI database and using the Affymetrix Transcriptome Analysis Console, the logarithm transformation, and standardization of chip data were carried out, and 1075 differentially expressed M. tuberculosis pathogenic genes were found. The TRED database was used to predict and analyze these 1075 genes, and as a result, 18 transcription factors and 887 genes were found, and an M. tuberculosis pathogenic gene transcription regulation network diagram was constructed (Figure 2). Through a series of statistical analysis based on Figure 2, we found that many of these key factors are closely related to the human immune system.
As shown in Figure 2, there were 2 transcription factors that corresponded highly to the target genes, namely, JUN (236 target genes) and NFKB1 (221 target genes), whereas the other transcription factors corresponded to less than 50 target genes. JUN is closely linked with SLE (systemic lupus erythematosus), and SLE is a typical autoimmune disease involving multiple organs and systems. Doniz-Padilla et al. [21] found that JUN expression level in peripheral blood mononuclear cells (PBMCs) of patients with SLE was significantly higher than in individuals in the control group. JUN might play an important role in the transcriptional regulation of FCGR2B promoter activity, while FCGR2B has been shown to be closely associated with the pathogenesis of SLE [22]. These results suggest that c-JUN may be involved in the pathogenesis of SLE. NFKB1 gene is one of members in  Figure 1: Screening results of genes overlapping ≥3 platforms. The blue part is GSE16250, the yellow part is GSE20050, the green part is GSE52819, and the pink part is GSE54992.  NF-κB family of widespread transcription factors. It plays an important role in the immune response, inflammation, and cell growth and development.
The statistics of genes regulated by the transcription factors in Figure 2 also showed that the target genes regulated by more than 3 transcription factors were IL-6, MAPK14, RELA, and FOS; JUN was also regulated by the 2 transcription factors. Studies have shown that 4 genes, including IL-6, MAPK14, RELA, and FOS, are closely related to the immune system. IL-6 is a B cell stimulating factor and can also activate macrophages. It is produced not only by T cells but also by macrophages, fibroblasts, epithelial cells, and other cells outside the lymphatic system. IL-6 can provide a non-T cell-dependent response to acute and chronic infection with M. tuberculosis. IL-6 can act on multiple cells, such as B cells, liver cells, hybridoma cells, and plasma cells, and enhance immune resistance through its proinflammatory activity and the effect on the production of the other cytokines. Different from other type-2 cytokines, IL-6 not only is essential for the activation of T cells to secret IFN-γ but also presents as the main molecule to induce and protect T cells during M. tuberculosis infection and strengthen activities of IFN-γ [23]. Ladel et al. [24] found that after infection with M. tuberculosis, the production of IFN-γ in IL-6deficient mice was reduced and their lifetime was shortened compared with wild-type mice. Although many studies have shown that IL-6 could induce a protective immune response against M. tuberculosis infection, some reports have demonstrated that IL-6 also has an adverse effect to inhibit the responsiveness of macrophages to IFN-γ [25]. MAPK14 gene is a member of mitogen-activated protein kinase (MAPK) family, and the MAPK cascade is a ubiquitous intracellular serine/threonine protein kinase super family. Thus, an important substance that can transmit the cytoplasmic signaling into the nucleus and cause the changes in the nucleus closely linked with the cell apoptosis and proliferation, tumor genesis, and oxidative stress injury of intestinal epithelial cells [26].  M. tuberculosis can exist in a nonproliferative state in the body of its host, thus resulting in a delayed course of the disease and a persistent recurrence of infection. The persistence of M. tuberculosis in macrophages may exert a huge survival pressure from bactericidal components in the cells. During its long evolutionary process and coevolution with humans, M. tuberculosis has developed a unique signal transmission system to ensure that it can start different response elements in different infection cycles to regulate the expression of genes for the adaptation to the environment. Among the numerous response (stringent response) regulatory elements, RELA plays a key role [27][28][29]. Studies have showed that the resistance and tolerance of bacteria to environmental pressure are dependent on RELA and the resulting ppGpp (P) signal can mediate bacterial resistance to antibiotics, UVresistant survival, and DNA damage repair. M. tuberculosis with inactivated RELA cannot survive for a long time in an environment with the abovementioned types of pressure [30]. There are some differences in the morphology and chromaticity between long-term persisters and M. tuberculosis active propagules; persisters isolated from the lungs show extremely close manifestations to those under conditions of pressure, especially lack of nutrition, which is completely consistent with the starting condition of RELA. Another study showed that RELA is closely related to the bacterial morphology and colony formation caused by changes in the cell wall of mycobacterium, which is important for the formation of M. tuberculosis L-forms, namely, cell walldeficient M. tuberculosis. RELA is thus a good target against persistent tuberculosis infection [31,32].
FOS gene is a member of FOS family, along with JUN family members and other activated transcription factor protein family members including activator protein 1 (AP-1). AP-1 is an important transcription regulation factor in the  Figure 3: Network diagram of the relationship between network nodes. The blue part is the drug, the two purple parts are Gene Ontology, the yellow parts are the transcription factors, and the brown parts are the network nodes.
nucleus, playing an important role in the transduction process of many signals in the body, and is the intersection of a series of nuclear cellular signal transduction pathways [8]. FOS can transmit extracellular stimulatory signals through the upstream signaling pathway to AP-1 and then activate the AP-1 to regulate the expression of a series of downstream target genes with AP-1 DNA recognition sites, that is, binding the specific AP-1 DNA recognition site 5′-TGA (G/C TCA-3 ′ ), also known as the TPA response element (TRE), to allow the body to make an adaptive response to external stimuli. In addition, AP-1 plays an important regulation role in SLE. The statistics of network nodes is shown in Figure 2. The results showed that there were 4 genes with more than 20 network nodes, respectively, IL-6, ETS2, TNF, and OPN1MW ( Table 4). As discussed previously, IL-6 is a gene regulated by transcription factors and JUN and IL-5 are also the genes regulated by transcription factors and the network node. Based on analysis of previous publications and analysis of all the network nodes, we suggest a certain portion between most of the networks (Figure 3), also closely related to the transcription factor SRC. Many of these nodes were targets of intravenous immunoglobulin drugs (Figure 3), and it is well known that intravenous immunoglobulin can rapidly raise the level of the IgG in the blood of recipients to enhance the body's ability to fight infections and regulate immune function. We also found that the major GO terms of these network nodes were in the extracellular region and were protein binding. Therefore, we presume that the network nodes IL-6, JUN, and TP53, along with transcription factors SRC, TNF, and MAPK14, can regulate the body's immune function, and their gene function should be mainly acting in the extracellular region and be protein binding.

Conclusions
In summary, 1075 genes were screened out through GEO database and a transcription factor regulation network diagram was constructed. Through the statistical analysis of the transcription factors and network nodes of their regulatory network diagram, we speculate that IL-6, JUN, and TP53, along with transcription factors SRC, TNF, and MAPK14, could regulate the human immune function, evading the attack on the host immune system and ultimately leading to the existence of M. tuberculosis in a nonproliferative state in the body of the host, thereby resulting in a delayed course of the disease and a recurrent persistent infection.