GEPSI: A Gene Expression Profile Similarity-Based Identification Method of Bioactive Components in Traditional Chinese Medicine Formula

The identification of bioactive components in traditional Chinese medicine (TCM) is an important part of research on the material basis of TCM. Recently, molecular docking has been used extensively to identify TCM bioactive components. However, the target proteins used in molecular docking may not be the actual targets of the TCM, so bioactive components are likely to be missed or misidentified. To address this problem, this study proposed the GEPSI method, which identifies the target proteins of a TCM based on the similarity of gene expression profiles. The similarity between the gene expression profiles induced by the TCM and by small molecule drugs was calculated. The pharmacological action of a TCM may be similar to that of small molecule drugs with high similarity scores, so the target proteins of those drugs could be considered TCM targets. We then identified the bioactive components of a TCM by molecular docking against these targets and verified the reliability of the method through a literature investigation. Using the target proteins that the TCM actually affected as docking targets made the identification of the bioactive components more accurate. This study provides a fast and effective method for the identification of TCM bioactive components.


Introduction
Identifying the bioactive components of a traditional Chinese medicine (TCM) within its complex mixture is a critical challenge in TCM research. Because it is intuitive and efficient, molecular docking has become an important means of identifying TCM bioactive components. Identification by molecular docking requires one or more target proteins and the components to be screened; the components that act specifically on a target protein are then identified as the TCM bioactive components. In the screening process, a single target or multiple targets are chosen, usually targets associated with a specific disease. Targets are generally chosen from a database of disease-associated targets, from key targets in a signal transduction network, or from the literature [1][2][3]. Because diseases are complex, multiple targets may be associated with a single disease. As a result, the selected target proteins may not be the actual targets affected by the TCM, or it may not be possible to screen against all of the associated targets. The bioactive components obtained by molecular docking may therefore not be the components that actually act on the disease, and genuinely active components may be left out.
The development of chemical informatics and bioinformatics has led to the accumulation of data on TCM components, target proteins, and gene expression profiles.
To select target proteins for molecular docking in a way that is guided by the ideas of systems pharmacology, this study proposed a method that first determines the target proteins of a TCM and then identifies the bioactive components of the TCM by molecular docking. This method is designated the gene expression profile similarity-based identification (GEPSI) method. The basic concept is to select the gene expression profiles induced by small molecule drugs in Cmap, on the principle that they are comparable with the gene expression profile of a TCM, and to calculate the gene expression profile similarity between the TCM and each small molecule drug. The target proteins of the small molecule drugs with higher similarity scores could be considered TCM targets. Virtual screening of the TCM components against these target proteins then identifies the bioactive components. Because it treats the entirety of the TCM components and all of the affected genes as the object of study, this method embodies the holistic thinking of TCM research more concretely. This method provides an effective means for the identification of TCM bioactive components and could serve as a basis for drug repositioning, quality control, and TCM drug design.

Principle of GEPSI.
Both TCMs and small molecule drugs can affect gene expression. By comparing the gene expression profiles before and after treatment with a TCM or a small molecule drug, the up- and downregulated differentially expressed genes can be identified. The up- and downregulated genes affected by a TCM can then be compared with those affected by each small molecule drug to obtain a similarity score. If the similarity score is high, the TCM and the small molecule drug may have similar pharmacological actions, and the target proteins of the small molecule drugs with higher similarity scores can be considered targets of the TCM. Using these proteins as the targets, we can finally identify the bioactive components of a TCM by molecular docking. Each step of the GEPSI method (Figure 1) is discussed in detail below.

The Determination of Up- and Downregulated Genes.
The differentially expressed genes were determined using the Bioinformatics Toolbox of Matlab [35]. A t-test and a false discovery rate (FDR) correction for multiple hypothesis testing were applied to each gene. Significantly differentially expressed genes were detected by random sample replacement (p < 0.05, FDR ≤ 0.1). Up- and downregulated genes were distinguished by the magnitude of the fold change (FC): significantly differentially expressed genes with FC ≥ 2 were classified as upregulated, and those with FC ≤ 0.5 were classified as downregulated.
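The thresholds above can be illustrated with a minimal Python sketch. It is not the Matlab workflow used in this study: the p-values are assumed to have been computed already by a per-gene t-test, and the Benjamini-Hochberg procedure is substituted here as one common way to estimate the FDR. The function names and the example values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: BH stands in for the FDR estimate; the study itself used
    the Matlab Bioinformatics Toolbox, whose procedure may differ.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify_genes(genes, p_values, fold_changes,
                   p_cut=0.05, fdr_cut=0.1, fc_up=2.0, fc_down=0.5):
    """Split significant genes into up- and downregulated lists."""
    q_values = benjamini_hochberg(p_values)
    up, down = [], []
    for gene, p, q, fc in zip(genes, p_values, q_values, fold_changes):
        if p < p_cut and q <= fdr_cut:  # significance filter
            if fc >= fc_up:
                up.append(gene)
            elif fc <= fc_down:
                down.append(gene)
    return up, down


# Hypothetical example: four genes with made-up p-values and fold changes.
up, down = classify_genes(
    ["g1", "g2", "g3", "g4"],
    [0.001, 0.002, 0.5, 0.04],
    [3.0, 0.4, 5.0, 1.2],
)
```

In this example g3 has a large fold change but is not significant, and g4 is significant but its fold change lies between the two cutoffs, so neither is assigned to a list.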

The Similarity Computation of the Gene Expression Profile.
The up- and downregulated genes were used as the input data to calculate the gene expression profile similarity, which was computed automatically in Cmap by the K-S algorithm [34,36]. The comparison yielded a similarity score between the gene expression profile of the TCM and that of each small molecule drug. Similarity scores fell between −1 and 1. If 0 < similarity score ≤ 1, the pharmacological actions of the small molecule drug and the TCM were similar, and a higher score indicated a greater similarity; if −1 ≤ similarity score < 0, the pharmacological actions of the small molecule drug and the TCM were opposite, and a higher absolute value indicated a stronger opposition.
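The score computed by Cmap can be sketched as follows. This is a minimal Python illustration based on the published description of the Connectivity Map connectivity score, not the code that Cmap runs. It assumes that each drug profile is available as a list of genes ranked from most upregulated to most downregulated; all function names and gene labels are hypothetical.

```python
def ks_statistic(tag_ranks, n_genes):
    """Signed K-S-type enrichment of a tag list in a ranked list of n_genes.

    tag_ranks holds the 1-based positions of the query genes in the
    drug's ranked gene list.
    """
    t = len(tag_ranks)
    if t == 0:
        return 0.0
    v = sorted(tag_ranks)
    a = max((j + 1) / t - v[j] / n_genes for j in range(t))
    b = max(v[j] / n_genes - j / t for j in range(t))
    return a if a > b else -b


def connectivity_score(ranked_genes, up_genes, down_genes):
    """Raw score: positive if the drug profile resembles the query."""
    position = {gene: idx + 1 for idx, gene in enumerate(ranked_genes)}
    n = len(ranked_genes)
    ks_up = ks_statistic([position[g] for g in up_genes if g in position], n)
    ks_down = ks_statistic([position[g] for g in down_genes if g in position], n)
    if ks_up * ks_down > 0:  # same sign: no consistent relationship
        return 0.0
    return ks_up - ks_down


def normalize(raw_scores):
    """Scale raw scores across all drug profiles into the range [-1, 1]."""
    p, q = max(raw_scores), min(raw_scores)
    scaled = []
    for s in raw_scores:
        if s > 0:
            scaled.append(s / p)
        elif s < 0:
            scaled.append(-(s / q))
        else:
            scaled.append(0.0)
    return scaled


# Hypothetical drug profile of ten genes, most upregulated first.
ranked = ["g%d" % i for i in range(1, 11)]
similar = connectivity_score(ranked, ["g1", "g2"], ["g9", "g10"])
opposite = connectivity_score(ranked, ["g9", "g10"], ["g1", "g2"])
```

Here the query whose upregulated genes sit at the top of the drug profile receives a positive raw score, and the reversed query receives the same score with a negative sign.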

Determination of the TCM Target Proteins.
If the similarity score between the gene expression profiles of a small molecule drug and a TCM was high, their pharmacological actions were considered similar, and the target proteins of the small molecule drug were considered TCM targets. This study considered only the top 10 small molecule drugs that had a definite pharmacological action and whose target proteins were recorded in DrugBank version 4.3.
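The selection rule can be sketched in Python as follows. The sketch assumes that the similarity scores and the DrugBank target annotations have already been exported into plain dictionaries, and it treats a drug as eligible only if it has an entry in the target mapping. The function name, drug labels, and target labels are hypothetical placeholders.

```python
def select_tcm_targets(similarity_scores, drug_targets, top_n=10):
    """Pick the top_n highest-scoring drugs that have annotated targets.

    similarity_scores: {drug name: similarity score}
    drug_targets: {drug name: list of target proteins}; assumed to contain
        only drugs with a definite pharmacological action.
    Returns the selected drugs with their targets and the pooled targets.
    """
    ranked = sorted(similarity_scores.items(), key=lambda kv: kv[1], reverse=True)
    selected = {}
    for drug, _score in ranked:
        targets = drug_targets.get(drug)
        if not targets:  # skip drugs without recorded targets
            continue
        selected[drug] = targets
        if len(selected) == top_n:
            break
    pooled = sorted({t for targets in selected.values() for t in targets})
    return selected, pooled


# Hypothetical example: drug_b scores well but has no recorded targets.
scores = {"drug_a": 0.9, "drug_b": 0.8, "drug_c": 0.7}
targets = {"drug_a": ["P1", "P2"], "drug_c": ["P2", "P3"]}
selected, pooled = select_tcm_targets(scores, targets, top_n=2)
```

Pooling the targets into a set means that a protein shared by several drugs is docked only once.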

Data for the TCM Components.
The components of a TCM formula were collected from TCMD [37] and TCMSP [38] and were supplemented and refined using the literature in CNKI and PubMed (1979-2017). The name, structure, and SMILES string of each component were recorded. For components with synonyms, duplicate entries were removed with the "full structure" algorithm in "ChemBioFinder for Office 12.0".

Determination of Bioactive Components of a TCM.
The three-dimensional structure of each target protein was downloaded from the PDB (https://www.rcsb.org/pdb/home/home.do), and structures that contained active ligands and had a higher resolution were preferentially selected. The preprocessing of the target protein included the deletion of ligands, water, and redundant protein conformations; the completion of missing or incomplete residues; the addition of hydrogens; and the assignment of charges. The amino acids in the target protein that interact with the ligand were selected and defined as the active pocket. Each component was converted into a three-dimensional structure, assigned a CHARMM force field, and protonated in accordance with the corresponding pH. Molecular docking was carried out with LibDock [39], and the parameter settings were as follows: the "Conformation Method" was "BEST," the "Docking Preferences" was "High Quality," and the other parameters were set to the default. With the "LibDock Score" as the reference, the components that had a score higher than that of the ligand and ranked in the top 10 were considered the bioactive components of the TCM.
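The final selection criterion can be sketched in Python as follows. The sketch assumes that the LibDock scores have been exported from the docking software into dictionaries keyed by target and component; it does not run any docking itself. The function names, component labels, and scores are hypothetical.

```python
def select_bioactive(component_scores, ligand_score, top_n=10):
    """Components in the top_n by score that also beat the reference ligand."""
    ranked = sorted(component_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:top_n] if score > ligand_score]


def bioactive_by_target(docking_scores, ligand_scores, top_n=10):
    """Map each bioactive component to the targets on which it qualified.

    docking_scores: {target: {component: LibDock score}}
    ligand_scores: {target: LibDock score of the co-crystallized ligand}
    """
    hits = {}
    for target, scores in docking_scores.items():
        for name in select_bioactive(scores, ligand_scores[target], top_n):
            hits.setdefault(name, []).append(target)
    return hits


# Hypothetical scores for two targets and three components.
docking = {
    "target_1": {"comp_a": 120.0, "comp_b": 95.0, "comp_c": 80.0},
    "target_2": {"comp_a": 110.0, "comp_b": 60.0, "comp_c": 130.0},
}
ligands = {"target_1": 90.0, "target_2": 100.0}
hits = bioactive_by_target(docking, ligands)
```

Keeping the list of targets per component makes it easy to see which components act on multiple targets.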

Up- and Downregulated Genes.
At SWT concentrations of 0.0256 mg/mL and 0.256 mg/mL, gene expression did not change appreciably, but at an SWT concentration of 2.56 mg/mL, gene expression changed markedly. Therefore, the gene expression profile elicited by SWT at 2.56 mg/mL was chosen for follow-up research.
A t-test and a false discovery rate (FDR) multiple hypothesis test were applied to each gene. A large number of genes were found to be differentially expressed; 442 genes were upregulated and 189 were downregulated (Supplemental Information 2).

The Small Molecule Drugs with High Similarity Scores.
After the similarity was computed, the similarity scores between the gene expression profiles of 1294 small molecule drugs and that of SWT were obtained. The top ten small molecule drugs that had an explicit pharmacological action and whose target proteins were contained in DrugBank were retained. The results are shown in Table 1.

The Primary Pharmacological Actions of the Top Ten Small Molecule Drugs.
The primary pharmacological actions of the top ten small molecule drugs in Table 1 were investigated in the literature. The results are shown in Table 2. Table 2 shows that the pharmacological actions of the ten small molecule drugs all involve diseases caused by an unbalanced estrogen level. Except for phenoxybenzamine and equilin, the primary pharmacological action of the remaining eight small molecule drugs was closely related to the treatment of breast cancer. Most of the eight drugs have an estrogenic effect. For example, resveratrol and genistein are phytoestrogens; estradiol is a natural estrogen that is secreted by mature ovarian follicles; and diethylstilbestrol is an estrogen that is a common endocrine medication for breast cancer. The occurrence of breast cancer is generally thought to be related to an excessive or imbalanced level of estrogen in the female body [46], and the regulation of immunity is an important method for the treatment of cancer. To summarize, SWT may have an anti-breast cancer effect because its gene expression profile has a high similarity score with those of the top ten small molecule drugs.

The Target Proteins That Were Used in Molecular Docking.
Of the top ten small molecule drugs, only the target proteins of four drugs have a three-dimensional structure in the PDB with a high resolution and corresponding bioactive ligands. Therefore, the target proteins of these four drugs were used for molecular docking studies (Table 3).

Bioactive Components of SWT.
After molecular docking, the components whose LibDock scores were higher than that of the ligand were identified as bioactive components (the LibDock scores of the ligands are shown in Supplemental Information 3). This study identified 46 bioactive components, including 12 components in Paeoniae Radix Alba, 4 components in Chuanxiong Rhizoma, 6 components in Angelicae Sinensis Radix, and 24 components in Rehmanniae Radix Praeparata (Table 4). Table 4 shows that SWT has anti-breast cancer activity through these 46 bioactive components, which acted on 9 target proteins. Most of the bioactive components, such as catalpol, verbascoside, and paeoniflorin, acted on multiple targets. The types and numbers of targets that the bioactive components acted on were diverse. If we only use one or a few proteins as targets, the bioactive components retrieved [...] Table 4 were the targets of resveratrol, diethylstilbestrol, estradiol, and genistein. According to the literature, these four drugs are estrogens or have estrogen-like effects, and resveratrol and genistein have potent anti-breast cancer activity. Hence, the 46 bioactive components of SWT may also have anti-breast cancer and estrogen-like effects. The naturally occurring iridoid catalpol is a Taq DNA polymerase inhibitor. Catalpol analogs bearing one to three silyl ether groups were antiproliferative against a panel of six human solid tumor cell lines, with GI50 values in the range of 1.8-4.8 μM, and cell cycle studies revealed an arrest in the G0/G1 phase that was consistent with DNA polymerase inhibition [47].
Orientin suppressed the proliferation of MCF-7 cells in a dose-dependent manner [48,49]. Paeoniflorin suppressed the proliferation and spread of breast cancer cells through the Notch-1 pathway [50]. Trigalacturonic acid inhibited the proliferation of Bcap-3 breast cancer cells and may have anti-breast cancer potential [51]. To summarize, the literature shows that some of the identified bioactive components have the same effects as the small molecule drugs, which supports the reliability of identifying bioactive components based on the similarity of gene expression profiles.

Conclusion
Using the proteins that the TCM actually acted on as docking targets made this method more accurate, and the results were more comprehensive than those obtained when target proteins are chosen from disease-related target databases or signal transduction networks. For example, there are 74 breast cancer-related targets in the Therapeutic Target Database (TTD), including the estrogen receptor (ER), the vascular endothelial growth factor receptor 1 (VEGFR1), and the epidermal growth factor receptor (EGFR). However, no evidence was available to support the selection of these proteins as targets of SWT. This study identified the target proteins that SWT actually acted on by a gene expression similarity comparison, identified the bioactive components of SWT by molecular docking, and then verified the reliability of this method through a literature investigation. GEPSI could serve as a rapid and effective method for the identification of TCM bioactive components. Although some time is needed to complete the related databases, such as those of TCM components and of the target protein structures of small molecule drugs, we believe that the data used in GEPSI will become more complete and the results more accurate with the development of chemical informatics and bioinformatics.
This study also suggested that SWT has anti-breast cancer efficacy, although no studies of this effect have been reported. The Tao-hong Si-wu Decoction, a derivative formula, has been reported to affect upper limb swelling after breast cancer surgery and the quality of life of patients receiving chemotherapy [52,53]. Research has shown that Paeoniae Radix Alba, Chuanxiong Rhizoma, Rehmanniae Radix Praeparata, and SWT have plant estrogen-like effects, but the bioactive components have not been identified [54]. These studies indirectly support the plausibility of an anti-breast cancer effect of SWT. In other words, GEPSI can also be used for drug repositioning. Now that the bioactive components have been identified, the quality of the individual herbs can be controlled on the basis of these components, and an anti-breast cancer drug combination can be designed from the bioactive components in SWT.

Conflicts of Interest
The authors declare that there are no conflicts of interest.