Identifying the Immunological Gene Signatures of Immune Cell Subtypes

The immune system is a complicated defensive system that comprises multiple functional cells and molecules acting against endogenous and exogenous pathogenic factors. Identifying immune cell subtypes and recognizing their unique immunological functions are di ﬃ cult because of the complicated cellular components and immunological functions of the immune system. With the development of transcriptomics and high-throughput sequencing, the gene expression pro ﬁ ling of immune cells can provide a new strategy to explore the immune cell subtyping. On the basis of the new pro ﬁ ling data of mouse immune cell gene expression from the Immunological Genome Project (ImmGen), a novel computational pipeline was applied to identify di ﬀ erent immune cell subtypes, including αβ T cells, B cells, γδ T cells, and innate lymphocytes. First, the pro ﬁ ling data was analyzed by a powerful feature selection method, Monte-Carlo Feature Selection, resulting in a feature list and some informative features. For the list, the two-stage incremental feature selection method, incorporating random forest as the classi ﬁ cation algorithm, was applied to extract essential gene signatures and build an e ﬃ cient classi ﬁ er. On the other hand, a rule learning scheme was applied on the informative features to construct quantitative expression rules. A group of gene signatures was found as qualitatively related to the biological processes of four immune cell subtypes. The quantitative expression rules can e ﬃ ciently cluster immune cells. This work provides a novel computational tool for immune cell quantitative subtyping and biomarker recognition.


Introduction
The immune system is a complicated defensive system that comprises multiple functional cells and molecules acting against endogenous and exogenous pathogenic factors [1][2][3]. With organism evolution, the immune system gradually becomes complicated and finally forms layered defensive mechanisms, including innate and adaptive immune systems, in advanced creatures such as mammals [4,5]. Both these immune systems are complicated and constitute cellular (immune cells) and noncellular components (immuno-regulatory molecules) [6]. The cellular component is diverse on structural and functional levels [6]. In particular, each single immune response could involve multiple subtypes of immune cells, and each immune cell subtype may play various important roles in multiple immune responses, thereby constituting a complicated regulatory network on the cellular level [6].
Identifying the subtypes of immune cells and recognizing their unique immunological functions are difficult due to the complicated cellular components and immunological functions of the immune system. The only standards are the typical molecular markers recognized by cytobiology [7][8][9]. However, such biomarkers cannot accurately reflect the components of immune cells and reveal their immunoregulatory mechanisms in vivo. With the development of transcriptomics and high-throughput sequencing [10,11], the gene expression profiling of immune cells can provide a new strategy to explore the complicated immunoregulatory mechanisms, e.g., detailed immune cell subtyping. Gene expression profiling with transcriptomic analysis can reflect the typical gene expression pattern of each cell subgroup. In accordance with the central dogma of molecular biology, cells with different gene expression patterns may have varying proteomic features and biological functions and therefore must be clustered into different cell subtypes [12,13]. Differentially expressed genes or transcripts may be potential biomarkers for the identification of a given cell subgroup/subtype. Therefore, transcriptomic sequencing is a novel technique for immune cell subtyping through the recognition of cell subgroups and their respective biomarkers and functions.
The Immunological Genome Project (ImmGen) [14,15] is aimed at establishing a systematic panorama of gene expression and regulatory networks of all immune cells by using a mouse model. Initiated in 2008, this collaborative study [15] has analyzed the differentiation, maturation, active responses, effector stages, tissue localizations, and genetic variations of more than 250 subtypes of immune cells in mouse models. A systematic immunoregulatory network in mice, which encompasses the innate and adaptive immune systems, has been established by this project through systematic quality check control and standardized analyzed conditions. The analyzed data and constructed networks can be accessed by using a dedicated data browser, and the raw sequencing and microarray data can be accessed in a public database [16]. This project provides reliable resources for immune cell gene expression profiling for further exploration and research.
As reported by various previous publications, the immune system is composed by multiple immune cell subtypes which are impossible for us to study one by one. For further analyses on the mouse immune cell gene expression profiling, we focused on four basic subtypes of immune cells: αβ T cells [17], B cells [18], γδ T cells [19], and innate lymphocytes [20]. Among them, αβ T cells are the majority of T cells with αβ TCR, contributing to immune-mediated cell death as adaptive immune responders [21]. As for γδ T cells, as quite a functioning minority of T cells, which are different from αβ T cells mainly functioning for adaptive immune responses, there are three major functions for γδ T cells: (1) regulating other immune cells for central and peripheral immune responses [22,23], (2) contributing to thermogenesis to maintain body temperature [23], and (3) regulating autoimmune responses [24]. B cells are the main participator for antibody medicated adaptive humoral immune responses with complicated activation processes relying either on T cells or not [25]. Innate lymphocytes are a group of immune cells that participate in the innate immune responses with cytotoxic natural killer (NK) cells in the circulating system and innate lymphoid cells (ILCs) in the tissue-resident microenvironment [26]. Therefore, as introduced above, the four major subgroups of immune cells have quite different biological functions, controlling the basic biological functions of the immune system: innate and adaptive immune responses. Therefore, considering the significance of such four subgroups of cells in the immune system, we selected such immune cell subtypes for detailed classification studies on the murine transcriptomic level in this study.
A new batch of profiling data for mouse immune cell gene expression has been released by ImmGen on the Gene Expression Omnibus (GEO) database provided by NCBI [16]. On the basis of these data, a novel computational pipeline was applied to distinguish different immune cell subtypes including αβ T cells, B cells, γδ T cells, and innate lymphocytes. These four subtypes of immune cells are the major effective immune cells in mouse immune systems, contribute to different immune responses, and constitute a complicated immunoregulatory system by playing unique roles and interacting with each other. The powerful feature selection method, the Monte-Carlo Feature Selection (MCFS) [27], was first applied to the profiling data. We obtained a feature list and some informative features. Of the feature list, the two-stage incremental feature selection (IFS) [28] method with random forest (RF) [29] as the classification algorithm was executed to extract essential gene signatures and build an efficient RF classifier. Furthermore, some quantitative expression rules were constructed on the informative features via a rule learning scheme. Extensive analysis on gene signatures and rules were performed by literature review. All in all, we recognized the typical gene signatures and rules of each key immune cell subtype, which were helpful to explore the complicated regulatory mechanisms of immune systems.

Materials and Methods
2.1. Dataset. The system-wide mouse RNA-Seq data released by the Immunological Genome Project (ImmGen) were downloaded from GEO (Gene Expression Omnibus) under accession number of GSE109125 [30]. The reads were mapped to the mouse reference genome (mm10), and then the uniquely mapped reads were assigned to genes according to GENCODE annotation (vM12). The genes were quantified as counts per million (CPM) using edgeR [31]. A total of 49,480 genes and 112 samples were obtained from the four types of cells: 46 αβ T cells, 33 B cells, 13 γδ T cells, and 20 innate lymphocytes. The original dataset included more samples and cell types (46 αβT cells, 33 B cells, 20 innate lymphocytes, 13 γδ T cells, 12 stromal cells, 11 stem cells, 8 dendritic cells, 7 macrophages, 6 granulocytes, and 1 mast cell). Since the sample sizes of many cell types was extremely small, we only kept the top four cell types with enough samples (46 αβ T cells, 33 B cells, 20 innate lymphocytes, and 13 γδ T cells). The gene expression signatures were identified for such four major types of immunological cells.

Feature
Selection. The MCFS [27] was first used to identify interpretable information about gene discrimination among the different groups of immune cells. Then, the twostage IFS [28] method was applied to obtain genes with 2 BioMed Research International strong classification ability to improve the component recognition for the immune system.

Monte-Carlo Feature Selection.
MCFS is a classic feature selection method for distinguishable features and ranks the features through guided sampling. For specific steps, multiple feature subsets with m features were arbitrarily chosen from the original M features (m < <M), the bootstrap dataset was trained for each specific feature subset, and the generated p decision trees were evaluated. The p × t decision trees were obtained by repeating the above steps t times. The accuracy weights of generated multiple decision trees provide a relative importance (RI) score for each feature, which was calculated as below.
where wAcc is the weighted accuracy and n f ðτÞ is a node of feature f in decision tree τ. The information gain of n f ðτÞ is expressed as IGðn f ðτÞÞ, no:in n f ðτÞ is the number of training samples in n f ðτÞ, and u and v are the two weighting factors. Here, we adopted the MCFS program obtained from http://www.ipipan.eu/staff/m.draminski/mcfs.html. For convenience, default parameters were used. After obtaining the RI scores of all features, we ranked all features in a list with the decreasing order of their RI scores. In addition to the feature list, the MCFS method also yields some most important features, called informative features in this study, which are some top features in the list. These features are accessed by a permutation test on class labels and one-sided Student's t-test.

Two-Stage Incremental Feature
Selection. IFS is a feature selection method that accurately distinguishes samples from different classes by screening a set of optimal features. The features in the ranked list from MCFS can be sorted in a descending order according to their RI scores as mentioned above. Such feature list can help the classification algorithm in producing optimal performance. The original IFS must test all possible feature subsets, which are constructed from the feature list, to filter out the optimal feature subset that can identify samples' classes with best performance. Here, due to the large number of features (~50000), inducing lots of time to test all feature subsets, we designed a two-stage IFS method. In the first stage, we constructed candidate feature subsets with a large step size. Taking 10-step size as an example, in N feature subsets Classifiers on samples with each feature subset were learned, which were further tested with 10-fold cross-validation [32][33][34][35][36][37]. Then, the feature interval containing the feature subset with the highest performance was determined and denoted as [min, max]. In the second stage, a series of feature subsets which contained top min, min + 1,…, max-1, max features were constructed. Likewise, a classifier for each feature subset was built and evaluated by 10-fold cross-validation. Accordingly, the classifier with the best performance can be found and termed as the optimal classifier. The corresponding feature subset is defined as the optimal feature subset. In this study, we selected RF to construct classifiers. [29] is a classic machine learning algorithm, which contains a large number of decision tree classifiers, that is RF is an assemble classification algorithm. It is widely used in computational biology as one of the most common machine learning methods [38][39][40][41][42][43]. The output sample class/category of RF is determined by these tree classifiers (i.e., decision trees) in an aggregating vote manner. A RF consists of multiple decision trees with subtle differences. Thus, the mean of the predictions of all decision trees is usually taken as the final consensus results. Although this approach can lead to interpretability loss and slight increase in the model bias, it can avoid overfitting and improve the performance robustness. To quickly implement RF, the tool "RandomForest" in Weka [44,45] was employed. Such a tool was executed with its default parameters.

Random Forest. RF
2.4. Rule Learning Scheme. The IFS method with RF is helpful to construct a powerful classifier. However, such a classifier is absolutely a black-box classifier. It is very hard to capture the classification principle from such a classifier. Thus, we further employed a rule learning scheme to extract classification rules from the cell expression data.
To save time, we directly used the informative features yielded by the MCFS method. These features were first processed by the Johnson reducer algorithm [46,47]. Some nonessential features were discarded, and the remaining features had the similar classification ability to the original informative features. Then, the repeated incremental pruning to produce the error reduction (RIPPER) algorithm [48] was applied on the remaining features to construct rules. The RIPPER algorithm is a specific method for constructing rule-based classifiers. The main frame of the RIPPER algorithm is based on IF-ELSE rules and consists of two parts: rule generation and rule optimization. The rule generation is a two-layer loop: the outer loop generates a rule each time after pruning and adds it to the rule pool, and the inner loop adds a predecessor to the rule each time. The rule optimization constructs alternatives based on the rules in the pool and finally selects the optimal rule to update the rule pool.
The above procedures are implemented and integrated in the MCFS program downloaded from http://www.ipipan.eu/ staff/m.draminski/mcfs.html. We directly used it to produce rules.

Performance Measurement. The Matthew Correlation
Coefficient (MCC) [49][50][51][52][53][54][55][56][57] is a common method used to evaluate the performance of dissimilar classifiers. This variable correlation coefficient calculates the correlation between the target and prediction classes with return value between -1 and +1. MCC considers true and false positives and negatives and is generally considered as a balanced measurement, even when the sample categories have different sizes. In this study, MCC within 10-fold cross-validation was used to evaluate classification performance.
Besides, we also employed accuracy on each cell type and overall accuracy (ACC) to fully evaluate the performance of different classifiers.

Results
In this study, we adopted several computational methods to analyze the RNA-Seq data of mouse immunological cells. The entire procedures are illustrated in Figure 1. The purpose was to extract essential gene signatures and rules for different immunological cell types. This section gives detailed results of each step of the procedures.
3.1. Results of the MCFS Method. The RNA-Seq data was first analyzed by the MCFS method. Accordingly, each feature was assigned a RI score. Then, a feature list was constructed with the decreasing order of their RI scores, which are provided in Table S1. Moreover, some informative features were also yielded by the MCFS method, which were the top 84 features in the list provided in Table S1.

Results of the IFS with Random Forest.
A two-stage IFS method, incorporating RF as the classification algorithm, was applied to the feature list. In the first stage, we ran IFS with a step size of 10 on the feature list from MCFS. A RF classifier was built based on each constructed feature subset. Then, all classifiers were assessed by 10-fold cross-validation. The predicted results were counted as MCCs, accuracies on four types, and ACCs, which are available in Table S2. For an easy observation, we plotted the obtained MCCs on a coordinate system with the number of used features as the x-axis, as shown in Figure 2(a). It can be seen that when the top ten features were adopted, the RF classifier gave a perfection prediction with MCC = 1. Thus, we determined the min = 1 and max = 50 to do the second IFS stage. Feature subsets containing the top 1-50 features were built, on each of which a RF classifier was set up. Each classifier was evaluated by 10-fold cross-validation. The predicted results are provided in Table S3. Figure 2(b) listed the performance of the RF classifier based on the top ten feature subsets. The RF with the top six features yielded the perfect classification. Thus, such RF classifier was called the optimal classifier, and the corresponding feature subset was termed as the optimal feature subset.

Results of the Rule Learning.
Besides the RF black-box classifier, we also used a rule learning scheme to give a clearer description on the classification procedure, thereby evidently elaborating the differences on four immunological cell types.
According to the MCFS results, 84 informative features were obtained (see the first 84 features in Table S1). Then, the Johnson reducer algorithm was applied on these features to further select the most essential features. The RIPPER algorithm followed to extract rules with the  remaining features, resulting in four rules, which are listed in Table 1. To indicate the utility of these rules, two measurements, support and accuracy, were calculated for each rule, which are also listed in Table 1. It can be seen that each rule can cover several immunological cells, and the efficiency of each rule was quite high. Furthermore, to elaborate the utility of the procedures for constructing the rules, we did the 10-fold cross-validation three times. The accuracies on four cell types are shown in Figure 3. Except the accuracy on the γδ T cell (82.05%), other accuracies were all no less than 90%. The ACC was 93.15% and MCC was 0.903. All these indicated that such a rule learning scheme was quite effective to extract efficient rules, also indicating the reliability of the rules in Table 1.

Discussion
We analyzed the following four typical cell subtypes in the mouse immune system: αβ T cells, B cells, γδ T cells, and innate lymphocytes, to screen detailed immune genes and establish standards for cell subgrouping. Basing on the gene expression profiling of individual cells, we performed

BioMed Research
International qualitative prediction on cell subtypes, identification of candidate immune cell-associated genes (noted as ImmGenassociated genes), and quantitative screening for the detailed recognition criteria of each cell subtype in a rule manner. According to recent publications, all identified ImmGenassociated genes and quantitative rules can be supported and confirmed by existing experiments and analysis, thus validating the efficacy and accuracy of our prediction. The detailed analysis of high-ranked ImmGen-associated genes and corresponding quantitative rules can be seen below.

Cell Type-Specific Function of ImmGen-Associated Genes.
With the MCFS method, we ranked features (genes) in a list (Table S1). Here, we selected the top ten genes, which are listed in Table 2, for detailed analysis.
The top gene in the ranked feature list is Ighv1-72, which encodes the variable region in the heavy chain of immunoglobulin [58]. According to recent publications, this gene participates in antigen-responding antibody synthesis [58]. All biological processes involving antibody synthesis mostly occur in one of our candidate cell subtypes, i.e., B cells, but not in the other cell subtypes [59][60][61]. Therefore, the expression pattern of Ighv1-72 in B cells may be different from those in the other three cell subtypes. This finding validates the potential distinguishing role of Ighv1-72.
The next high-ranked gene is Cd5, which encodes a famous cluster of differentiation. In the mouse immune system, Cd5 is a T-cell surface glycoprotein that regulates T cell inhibition [76,77]. According to recent publications, Cd5 is expressed in αβ T cells and γδ T cells and is a potential biomarker for T cell subgroups [62,63]. Although Cd5 has a spe-cific role of encoding protein T cell surface glycoprotein, its protein products are found on the surface of a specific subgroup of B cells [64,65]. This finding implies that this biomarker may distinguish innate lymphocytes from the other three immune cell subtypes.
The next predicted gene is Klrb1b, which encodes a specific lectin-like receptor on the surface of natural killer cells. Our predicted gene Nlrb1b (Klrb1b) encodes a functional subunit of a receptor-ligand system in NK cells and may regulate an MHC-independent immune surveillance mechanism [66]. For its cell subtype specific expression pattern, Nlrb1b plays an irreplaceable role in natural killer cells and T cells [66], i.e., a key subtype of innate lymphocyte. Hence, Nlrb1b may be a potential marker for innate lymphocytes.
Phka1 exhibits a differential expression pattern among the different cell subtypes. Encoding an alpha chain of the phosphorylase kinase, this gene has a differential expression pattern in different T cells, B cells, and innate lymphocyte subtypes and even under different cell activation status [67][68][69]. Therefore, Phka1 is definitely a potential biomarker for the distinction of four immune cell subtypes due to its substantially biological functions and alternative expression patterns in different immune cell subtypes under various immune conditions.
Trdv5, Trbj1-2, Trbj1-3, Tcrg-V4, and Trbj1-7 are the feature genes encoding different regions of the T cell receptor (TCR) [78,79]. The differential expression pattern of these genes in four cell subtypes indicates or reflects that of T cell receptors in different cell subtypes. αβ T and γδ T cells have a high expression of T cell receptors [80,81]. These two cell subtypes can be further distinguished according to feature genes due to the differential expression pattern of Tcrg-V4, which encodes a unique region of the gamma chain [82]. A high expression pattern of Tcrg-V4 can be found in γδ T cells but not in αβ T cells, thus confirming the distinguishing capacity of our predicted gene signatures [72,73]. For the remaining cell subtypes, B cells and innate lymphocytes, the former does not have the expression of all TCR-associated genes. A specific subtype of innate lymphocyte, namely, natural killer T cells, has a unique expression pattern of T cell receptors [70,71]. All identified natural killer T cells with T cell receptor expression would also have the specific αβ T cell receptors but not the γδ T cell receptors [71]. Therefore, some subgroups of innate lymphocytes may also have alternative expression patterns of Trdv5, Trbj1-2, Trbj1-3, and Trbj1-7 but not Tcrg-V4. This finding reflects the distinguishing effects of our predicted ImmGen-associated genes involved in T cell receptors.  Various T cell receptor-associated genes can still be found in the top-ranked genes of our feature gene list, thereby implying the unique differential capacity of the T cell receptor expression pattern and validating the efficacy and accuracy of our prediction approach. In addition to T cell receptor coding genes, we also identified a unique B cell recognizing gene named EBF1. Acting as a transcription factor, this gene contributes to the maintenance of B cell identity and prevention of alternative fates in committed cells, such as transferring to the T cell lineage [74,75]. Therefore, the high expression pattern of EBF1 can only be identified in B cells, implying its potential as a biomarker for this cell type.
Owing to the limitation of the article length, we cannot individually analyze the discriminative genes. However, the above-mentioned high-ranked genes have cell type-specific expression patterns in immune cell subtypes, thus validating the efficacy and accuracy of our prediction and analysis.

Cell Type-Specific Expression
Pattern of ImmGen-Associated Rules. Besides the gene signatures, we also obtained some classification rules via a rule learning scheme (Table 1). They were analyzed as follows. The analysis was based on the expression level measured by Fragments Per Kilobase of transcript per Million mapped reads (FPKM).
The first identified quantitative parameter is Tcrg-V4. As a specific encoding gene for the γδ T cell receptors, this gene may have high expression in γδ T cells. In accordance with our predicted expression rules, the expression abundance of Tcrg-V4 is higher than 62.755978 (FPKM) [72,73]. According to the Mouse Genome Informatics database [83], the expression level and relative expression quantity of Tcrg-V4 in the T cell subgroup, γδ T cells, basically conforms to our predicted rules [84,85], thereby validating the efficacy and accuracy of our prediction.
The next classification rule of quantitative parameter involves a specific gene named Aifm2, which contributes to the identification of innate lymphocytes. According to the Mouse Genome Informatics database [83], this gene may have a unique higher expression pattern in mucosal tissues, which are full of innate immune cells, than in the blood system and lymphoid node, which are full of T cells and B cells [86,87]. Therefore, Aifm2 may be highly expressed in innate lymphocytes with a threshold of approximately 15 FPKM.
Another quantitative parameter is Abcb9, whose low expression (lower than 10 FPKM) is indicative of B cells rather than T cells or innate lymphocytes. Mucosal tissues are full of innate immune cells, and thymus tissues are full of T cells. By contrast, the anatomic area, the spleen, is full of mature B cells and thus has low expression level of Abcb9 (<5 FPKM) [83].
The cells that do not follow the three rules mentioned above may be αβ T cells. Thus, all typical expression patterns can be set up for corresponding immune cell subtypes and have been confirmed by recent studies, thereby validating the efficacy and accuracy of our analysis.

Conclusions
By using our newly presented computational approach, we identified a group of signature genes that are qualitatively related to the biological processes of four immune cell subtypes. We also set up a set of quantitative expression rules for the detailed clustering of immune cells based on the absolute expression levels measured by FPKM. This work provides a novel computational tool for the quantitative subtyping of immune cells and biomarker recognition.

Data Availability
The data used to support the findings of this study have been deposited in the Gene Expression Omnibus repository (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc= GSE109125).

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.

Authors' Contributions
Yu-Hang Zhang and Zhandong Li contributed equally to this work.