Rheumatoid arthritis (RA) is a severe chronic inflammatory disease that damages small joints. Comprehensive diagnosis and treatment procedures for RA have been established because of its severe symptoms and relatively high morbidity. Medication and surgery are the two major therapeutic approaches. Infliximab (IFX) is a novel biological agent applied for the treatment of RA. IFX improves physical function and helps patients achieve clinical remission even under discontinuous medication. However, not all patients respond to IFX, and distinguishing IFX-sensitive from IFX-resistant patients is difficult. Thus, predicting the therapeutic effects of IFX on patients with RA is an urgent translational medicine problem in the clinical treatment of RA. In this study, we present a novel computational method for identifying blood gene signatures of IFX sensitivity by liquid biopsy, which may assist in establishing a clinical drug sensitivity test standard for RA and help reveal IFX-associated pharmacological mechanisms.
Rheumatoid arthritis (RA) is a severe chronic pathogenic inflammatory abnormality that damages small joints [
Comprehensive diagnosis and treatment procedures for RA have been established because of its severe symptoms and relatively high morbidity [
Medication and surgery are the two major therapeutic approaches for RA [
With the development of liquid biopsy and high-throughput sequencing technologies, a recent study [
The blood gene expression profiles of 140 patients with RA before IFX treatment were downloaded from the Gene Expression Omnibus under accession number GSE78068 (
In this work, the MCFS method, which is a decision tree- (DT-) based feature selection method [
The contribution of each feature in these DTs can be evaluated by the relative importance (RI) score, which is calculated as follows:
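For orientation, the form of the RI score usually given in the MCFS literature is sketched below for a feature $g$. The symbols are assumptions taken from that general formulation rather than from this text: $T$ is the total number of trees built, $wAcc_{\tau}$ is the weighted accuracy of tree $\tau$, $n_g(\tau)$ is a node of $\tau$ that splits on $g$, $IG$ is the information gain of that node, $\mathrm{no.in}$ denotes a number of samples, and $u$ and $v$ are weighting exponents.

```latex
RI_g = \sum_{\tau=1}^{T} \left( wAcc_{\tau} \right)^{u}
       \sum_{n_g(\tau)} IG\bigl(n_g(\tau)\bigr)
       \left( \frac{\mathrm{no.in}\; n_g(\tau)}{\mathrm{no.in}\; \tau} \right)^{v}
```

Under this form, a feature scores highly when it splits large nodes with high information gain in trees that classify well.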
After each feature was assigned an RI value, we ranked all features in decreasing order of their RI values. In addition to this feature list, the MCFS method also outputs the most important features, called informative features, which are the top features in the list. These features are obtained by determining a threshold of RI values via a permutation test on class labels and one-sided Student’s
This study used the MCFS program retrieved from
IFS is widely used to determine the optimal number of features for constructing a classification model with an integrated supervised classifier [
SVM is a classification algorithm suitable for linear and nonlinear data [
RF [
In addition to “black-box” classification algorithms, interpretable rules can also be extracted from a classification model to explain the feature differences between groups of patients with a particular response to drug treatment. To accelerate this procedure, we directly used the informative features extracted by the MCFS method. These features were further filtered by the Johnson Reducer algorithm [
The MCC [
We also employed five other measurements to evaluate the classification models fully. They were sensitivity (SN), specificity (SP), accuracy (ACC), precision, and
In this study, we performed a computational investigation of the blood gene expression profiles of patients with RA before IFX treatment. The entire procedure is illustrated in Figure
Entire procedure for investigating the blood gene expression profiles of patients with rheumatoid arthritis. Profiles are retrieved from the Gene Expression Omnibus and analyzed by the Monte Carlo feature selection method. The resulting feature list is fed into the incremental feature selection method to construct efficient classifiers and extract essential genes. In parallel, the informative features, which are the top features in the list, are used to construct classification rules via the Johnson Reducer and RIPPER algorithms.
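The feature-reduction step in the rule branch of this procedure can be illustrated with a minimal sketch of a Johnson-style greedy reduct. This is an assumption-laden illustration, not the implementation used in the study: it assumes the expression values have already been discretized, and the function name `johnson_reduct` and the toy table are hypothetical.

```python
from collections import Counter
from itertools import combinations


def johnson_reduct(rows, labels):
    """Greedy (Johnson-style) approximation of a minimal attribute reduct.

    rows   -- list of tuples of discretized attribute values (assumed input)
    labels -- class label of each row
    Returns the indices of the selected attributes, in order of selection.
    """
    n_attrs = len(rows[0])
    # One clause per pair of samples with different labels: the set of
    # attributes whose values differ between the two samples.
    clauses = []
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] != labels[j]:
            diff = frozenset(a for a in range(n_attrs) if rows[i][a] != rows[j][a])
            if diff:
                clauses.append(diff)

    reduct = []
    while clauses:
        counts = Counter(a for clause in clauses for a in clause)
        # Pick the attribute that occurs in the most remaining clauses
        # (ties broken by the lowest attribute index).
        best = min(counts, key=lambda a: (-counts[a], a))
        reduct.append(best)
        # Clauses containing the chosen attribute are now satisfied.
        clauses = [c for c in clauses if best not in c]
    return reduct


# Hypothetical toy table: attribute 0 alone separates the two classes.
toy_rows = [(0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1)]
toy_labels = ["sensitive", "sensitive", "resistant", "resistant"]
selected = johnson_reduct(toy_rows, toy_labels)
```

On the toy table, a single attribute is enough to discern every pair of samples with different labels, so the greedy loop stops after one selection.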
The blood gene expression profiles were first analyzed by the MCFS method. As a result, each feature was assigned an RI value, which indicated its importance. The RI values of all features are listed in Table
Based on the feature list obtained in the above section, the IFS method was applied. It first constructed several feature subsets. Then, on each feature subset, a classifier was built using SVM or RF as the classification algorithm. Each classifier was evaluated by 10-fold cross-validation. Six measurements (see equations (
IFS curves with different classification algorithms on different numbers of features (genes). The support vector machine yields its highest MCC of 0.760 when the top 1260 features are used, whereas the random forest yields its highest MCC of 0.611 when the top 10 features are used.
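The search behind such an IFS curve can be sketched as follows. This is a minimal illustration under stated assumptions: the step size of 10, the function name `incremental_feature_selection`, and the callback `score_subset` (standing in for training a classifier and returning its cross-validated MCC) are all hypothetical, and the toy scoring function is synthetic rather than derived from the study's data.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    score_subset: Callable[[Sequence[str]], float],
    step: int = 10,
) -> Tuple[int, float, List[Tuple[int, float]]]:
    """Score nested top-k subsets of a ranked feature list.

    Returns (best_k, best_score, curve), where curve holds one
    (k, score) point per evaluated subset size.
    """
    curve = []
    for k in range(step, len(ranked_features) + 1, step):
        subset = ranked_features[:k]  # top-k features of the ranking
        curve.append((k, score_subset(subset)))
    # max() keeps the first maximum, i.e. the smallest k among ties.
    best_k, best_score = max(curve, key=lambda point: point[1])
    return best_k, best_score, curve


# Hypothetical ranking of 50 features and a synthetic score that peaks at k = 30.
toy_ranking = [f"g{i}" for i in range(1, 51)]


def toy_score(subset: Sequence[str]) -> float:
    return 1.0 - abs(len(subset) - 30) / 50


best_k, best_score, curve = incremental_feature_selection(toy_ranking, toy_score)
```

In practice `score_subset` would wrap the classifier and the cross-validation; keeping it as a callback lets the same loop serve both SVM and RF.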
Performance of some key support vector machine (SVM) and random forest (RF) classifiers.
Classification algorithm | Number of features | Sensitivity | Specificity | Accuracy | Precision | F1-measure |
---|---|---|---|---|---|---|
SVM | 1260 | 0.690 | 0.990 | 0.900 | 0.967 | 0.806 |
SVM | 60 | 0.619 | 0.980 | 0.871 | 0.929 | 0.743 |
RF | 10 | 0.643 | 0.929 | 0.843 | 0.794 | 0.711 |
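The measurements in this table all follow from a confusion matrix, which the sketch below makes explicit. The counts used in the example are an assumption: they are not stated in the text but are one set of integers (42 IFX-sensitive and 98 IFX-resistant samples out of 140) that reproduces the SVM row with 1260 features. The function name `classification_metrics` is hypothetical.

```python
import math


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute SN, SP, ACC, precision, F1-measure, and MCC from raw counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall)
    sp = tn / (tn + fp)                      # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    precision = tp / (tp + fp)
    f1 = 2 * precision * sn / (precision + sn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc,
            "precision": precision, "F1": f1, "MCC": mcc}


# Assumed counts (inferred from the reported rates, not given in the text).
svm_1260 = classification_metrics(tp=29, fn=13, tn=97, fp=1)
rounded = {name: round(value, 3) for name, value in svm_1260.items()}
```

Rounded to three decimals, these counts give the values in the SVM row with 1260 features, including the last column as the harmonic mean of precision and sensitivity, and the MCC of 0.760 quoted for that classifier.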
As mentioned above, the optimum SVM classifier needed many more features than the optimum RF classifier. In fact, the SVM classifier can yield good performance when far fewer features are used. As shown in Figure
IFS curves with different classification algorithms on the top 10-200 features. The support vector machine (SVM) yields an MCC of 0.686 when only the top 60 features are used, which is still higher than the highest MCC yielded by the optimum random forest (RF) classifier.
The optimum SVM and RF classifiers gave good performance. However, both are black-box models, from which few medical insights can be captured. In view of this, we further employed a rule learning procedure. The informative features yielded by MCFS were processed by the Johnson Reducer and RIPPER algorithms in turn. As a result, three rules were constructed, as shown in Table
Classification rules yielded by RIPPER.
Index | Condition | Result | Support# | Accuracy$ |
---|---|---|---|---|
1 | | IFX-sensitive patient | 7.86% | 90.91% |
2 | | IFX-sensitive patient | 7.14% | 90.00% |
3 | Others | IFX-resistant patient | 85.00% | 80.67% |
#Support is defined as the proportion of samples satisfying the rule to all samples. $Accuracy is defined as the proportion of correctly predicted samples to the samples satisfying the rule.
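The two definitions above can be sketched for an ordered rule list in which each sample is assigned to the first rule it satisfies; the three support values in the table sum to 100%, which is consistent with that reading. Everything concrete in the sketch is hypothetical: the gene names `GENE_A` and `GENE_B`, the thresholds, the toy samples, and the function name `evaluate_rule_list` are placeholders, not the rules learned in the study.

```python
from typing import Callable, Dict, List, Optional, Tuple

Sample = Dict[str, float]
# A rule is (condition, predicted class); a condition of None is the default rule.
Rule = Tuple[Optional[Callable[[Sample], bool]], str]


def evaluate_rule_list(
    rules: List[Rule], samples: List[Sample], labels: List[str]
) -> List[Tuple[float, float]]:
    """Return (support, accuracy) per rule under first-match semantics."""
    covered = [0] * len(rules)
    correct = [0] * len(rules)
    for sample, label in zip(samples, labels):
        for idx, (condition, predicted) in enumerate(rules):
            if condition is None or condition(sample):
                covered[idx] += 1
                correct[idx] += int(predicted == label)
                break  # a sample counts only for the first rule it satisfies
    n = len(samples)
    return [
        (covered[i] / n, correct[i] / covered[i] if covered[i] else float("nan"))
        for i in range(len(rules))
    ]


# Hypothetical rules and data, for illustration only.
toy_rules: List[Rule] = [
    (lambda s: s["GENE_A"] >= 8.0, "sensitive"),
    (lambda s: s["GENE_B"] >= 7.0, "sensitive"),
    (None, "resistant"),
]
toy_samples = [
    {"GENE_A": 9.1, "GENE_B": 2.0},
    {"GENE_A": 8.7, "GENE_B": 6.5},
    {"GENE_A": 3.2, "GENE_B": 7.4},
    {"GENE_A": 2.9, "GENE_B": 1.1},
    {"GENE_A": 4.0, "GENE_B": 0.5},
]
toy_labels = ["sensitive", "sensitive", "sensitive", "resistant", "sensitive"]
stats = evaluate_rule_list(toy_rules, toy_samples, toy_labels)
```

In the toy data, the default rule covers two samples and predicts one of them correctly, so its support is 40% and its accuracy 50%.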
Furthermore, to test the effectiveness of the above rule learning procedure, we performed 10-fold cross-validation three times. Six measurements calculated by equations (
IFX is one of the major clinically applied drugs for RA. However, the sensitivity to and effectiveness of this drug vary among patients. Recent publications confirmed that a patient's sensitivity to this drug can be predicted from the expression profile of pretherapeutic blood. However, the core signatures/biomarkers for predicting and understanding IFX sensitivity are difficult to identify. Using a novel computational approach on the expression profiles of pretherapeutic blood, we identified gene signatures for evaluating the therapeutic effect of the drug and established a series of quantitative rules for recognizing patients with different IFX sensitivity. All the identified signatures have been confirmed by recent publications, and the detailed analysis of the representative genes and rules is discussed below.
In this study, several genes associated with IFX response were identified by computational methods. Here, we selected some of them for detailed analysis; they are listed in Table
Some top genes associated with IFX response.
Gene symbol | Description | RI score |
---|---|---|
DISC1 | DISC1 scaffold protein | 0.0506 |
SAMD11 | Sterile alpha motif domain containing 11 | 0.0410 |
EID2B | EP300 interacting inhibitor of differentiation 2B | 0.0345 |
NTS | Neurotensin | 0.0257 |
STAT2 | Signal transducer and activator of transcription 2 | 0.0233 |
HELZ | Helicase with zinc finger | 0.0198 |
SUMO2 | Small ubiquitin-like modifier 2 | 0.0190 |
The
Apart from unannotated RNA transcripts with no validated protein products, all the predicted genes have been confirmed to be functionally related to TNF-
Apart from the qualitative analysis of each top-ranked gene signature in our prediction list, we also set up a series of quantitative rules for the detailed and accurate recognition of IFX-sensitive and IFX-resistant patients. The first rule involves only the
The second rule involves the
The identified blood gene signatures participate in IFX-sensitive pharmacological processes in patients with RA. Thus, these genes may be potential biomarkers for the distinction of IFX-sensitive and IFX-resistant patients at the transcriptomic level. Several quantitative signature rules for the distinction of patients have also been verified by other recent publications. Therefore, our newly presented method provides comprehensive qualitative and quantitative prediction standards for prognosis guidance on the clinical application of IFX on patients with RA.
The data used to support the findings of this study have been deposited in the Gene Expression Omnibus repository (
The authors declare that there is no conflict of interest regarding the publication of this paper.
ShiJian Ding and ZhanDong Li contributed equally to this work.
This research was funded by the Strategic Priority Research Program of Chinese Academy of Sciences (XDB38050200), the National Key R&D Program of China (2017YFC1201200, 2018YFC0910403), the Shanghai Municipal Science and Technology Major Project (2017SHZDZX01), the National Natural Science Foundation of China (31701151), the Shanghai Sailing Program (16YF1413800), the Youth Innovation Promotion Association of Chinese Academy of Sciences (CAS) (2016245), and the Fund of the Key Laboratory of Tissue Microenvironment and Tumor of Chinese Academy of Sciences (202002).
Table S1: list of genes ranked based on the RI score from MCFS. Table S2: performance of IFS with SVM. Table S3: performance of IFS with RF.