Application of Data Mining in WITMED: Identification of Prognostic Genes in Oral Cancer

In recent years, the rapid development of big data, cloud computing, the Internet of Things, and other technologies has provided the conditions for the popularization and application of the smart city. The combination of big data and medical information has produced the emerging field of WITMED (Wise Information Technology of Med). WITMED is essential for the prosperous growth of smart cities, for which high-quality medical service is assumed to be the most challenging goal of the city government. This paper focuses on targeted gene therapy, which provides a new approach to the treatment of oral cancer by inhibiting the growth, differentiation, invasion, and metastasis of oral cancer cells; as a result, the adverse physical and psychological effects of surgery and chemotherapy on patients are reduced and the survival and prognosis of patients are improved. Targeted gene therapy requires the selection of appropriate genes; that is, data mining methods are used to analyze the large volume of complex genetic data from smart cities to obtain appropriate genetic markers. These markers can improve the effect of targeted gene therapy, provide a reference for genetic research on oral cancer, and provide a basis for clinical treatment.


Introduction
A smart city uses advanced information technology to realize the intelligent management and operation of the city, create a better life for its people, and promote its harmonious and sustainable growth. WITMED is an important part of the smart city. Through information technology, medical infrastructure is integrated with IT infrastructure. WITMED takes the medical cloud data center as its core, crosses the spatial and temporal limitations of the original medical systems, and makes intelligent decisions on this basis to realize a medical system with optimized medical services. For example, machine learning and other technologies enable precise treatment, which helps doctors diagnose and treat more efficiently and improves the quality of medical services [1].
Oral cancer is one of the most common cancers in the world and a major global health problem, with high morbidity and mortality. Oral squamous cell carcinoma is a common malignant tumor of the head and neck, accounting for more than 90% of oral cancer cases. There are more than 300,000 new cases in the world every year. Recent statistics show that the incidence of oral squamous cell carcinoma is increasing [2].
In 2012, there were more than 440,000 new cases of oral and oropharyngeal cancer in the world and more than 240,000 deaths from oral and oropharyngeal cancer, accounting for 3.1% and 3.0% of new cancer cases and cancer deaths worldwide, respectively [3]. According to GLOBOCAN estimates, the incidence of oral and oropharyngeal cancer is the highest in Melanesia, followed by Central and South Asia and Western Europe. The incidence rate was more than 10/100,000 every year in these two regions. Annual incidence rates are the lowest in East Asia and West Africa, at about 2/100,000. Asia is one of the regions most seriously affected by oral and oropharyngeal cancer. In South Asia, the incidence and mortality rates of oral and oropharyngeal cancer are the highest in Bangladesh. Nearly two-thirds of oral and oropharyngeal cancer cases occur in underdeveloped countries. According to statistics, from 2005 to 2013, there were more than 280,000 new cases and more than 130,000 deaths related to oral and oropharyngeal cancer in China [2]. In the next 20 years, the worldwide incidence rate of oral cancer is expected to increase from 2.26/100,000 to 3.21/100,000 people [4,5].
With the development of digital technology and smart cities, a great amount of data is produced in the real world, including daily life data, academic data produced in schools, and scientific data produced in experiments [6]. How to extract useful information from so much data and use it to benefit the construction of smart cities is a hot topic in current technology research; it also promotes the rapid development of machine learning. In traditional machine learning, dimensionality reduction is mainly used to learn low-dimensional feature representations from high-dimensional data [7]. Deep learning is mainly applied to images, including target detection and anomaly detection [8,9].
In WITMED, a big data storage and processing platform is used to collect data extensively and utilize it in depth, with patient data as the core. Historical medical data are modeled and analyzed using data mining to detect diseases early and predict health risks and, at the same time, to provide medical staff with a reference for diagnosis and treatment.
Predictive analytics relies on historical data and uses advanced statistical or machine learning techniques to model behaviors or patterns so that likely future trends or patterns in the data can be predicted. In short, it predicts what will happen in the future by learning the relevance of historical patterns and available data. Predictive analytics has been widely used in different applications, including predictive maintenance, price prediction, supply-demand trend prediction, and prediction of the likelihood of any outcome. State-of-the-art predictive modeling techniques include models based on statistical regression, decision trees, and neural network or deep neural network-based models [10].
With the development of molecular biology, people have gained a more in-depth understanding of genetic sequencing and genetic markers. If scientists can analyze oral cancer from the perspective of molecular biology using massive genetic data, the corresponding research results will be helpful for the early diagnosis, treatment, and prognosis of oral cancer and will facilitate patients' medical decision-making [11][12][13]. Since human genetic data is high-dimensional, the first problem to be solved in the data mining of genes is dimension reduction. First, in this paper, Cox univariate regression analysis and Least Absolute Shrinkage and Selection Operator (LASSO) regression analysis were used to conduct data mining analysis on the genetic expression of oral cancer patients, and risk genes that have an impact on the prognosis of oral cancer were screened out. Second, the obtained risk genes were used to build the prognostic model, and survival analysis and the ROC curve (AUC value) were used to verify whether the risk genes screened by the two methods have reference value for the prognosis of oral cancer. Third, the 23 genes closely related to the prognosis of oral cancer were obtained by the LASSO method; the validation set was used to verify the reference value of the obtained genes for the prognosis of oral cancer, and independent external data sets were used for double-blind verification of the screened genes. Finally, STRING genetic function analysis and a literature review were used to further verify that the screened genes are closely related to the prognosis of oral cancer, which can provide a certain reference and theoretical basis for future clinical research, treatment, diagnosis, and prognosis of oral cancer based on molecular biology.
The rest of this paper is organized as follows. In Section 2, we describe the system model and system architecture of the prognostic analysis of oral cancer based on the LASSO algorithm. In Section 3, we evaluate our model on the validation set. We perform a functional analysis of the genes selected by the model in Section 4 and conclude in Section 5.

Data Analysis Process
In this section, we describe the establishment of the prognostic model and the data analysis process, including data collection and processing, differential expression of genes, Cox regression analysis, and LASSO regression analysis. The framework of the model is shown in Figure 1.

Data Collection and Processing.
In this paper, head and neck cancer samples were downloaded from three open-source websites: the Xena Functional Genomics Explorer (xenabrowser, xenabrowser.net/datapages/), cBioportal (http://www.cbioportal.org/), and the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/). A total of 566 standardized TCGA head and neck cancer samples were obtained from xenabrowser, from which 485 oral cancer samples were sorted out according to the requirements. From cBioportal, 514 head and neck cancer samples were downloaded and 385 oral cancer samples were sorted out. From NCBI, 103 oral cancer samples were downloaded and 74 samples containing tumor tissues were sorted out. The data obtained from xenabrowser were used to establish the data mining model, and the data obtained from cBioportal and NCBI were used as independent data sets for result verification. The raw data obtained from these open-source databases are mainly structured data.
The original data were processed as follows. (1) Data cleaning: to preserve the authenticity of the data, we deleted the data with many missing features and filled the data with few missing features using the revised average method. Through data cleaning, the 141-dimensional clinical data were organized into 80-dimensional data. (2) Data transformation: compared with the original 141-dimensional clinical feature data, the 32-dimensional clinical features obtained by data transformation and discretization were better suited to analysis.
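The cleaning step can be illustrated with a minimal Python sketch. It assumes that "deleting data with many missing features" means dropping feature columns whose missing fraction exceeds a threshold, and it uses a plain arithmetic mean as a stand-in for the "revised average method", whose exact definition is not given in the text. The function name `clean_records`, the `max_missing_fraction` parameter, and its default value are hypothetical.

```python
from statistics import mean


def clean_records(records, max_missing_fraction=0.5):
    """Drop sparsely observed features, then mean-fill the remaining gaps.

    records: list of dicts mapping feature name -> numeric value or None.
    The 0.5 threshold is an illustrative assumption, not a value from the paper.
    """
    features = sorted({name for record in records for name in record})
    n = len(records)

    # Keep a feature only if it is observed at least once and is not
    # missing in more than the allowed fraction of records.
    kept = []
    for name in features:
        missing = sum(1 for record in records if record.get(name) is None)
        if missing < n and missing / n <= max_missing_fraction:
            kept.append(name)

    cleaned = [{name: record.get(name) for name in kept} for record in records]

    # Fill the few remaining gaps with the mean of the observed values
    # (a plain mean stands in for the paper's "revised average method").
    for name in kept:
        fill = mean(r[name] for r in cleaned if r[name] is not None)
        for r in cleaned:
            if r[name] is None:
                r[name] = fill
    return cleaned
```

Categorical clinical features would need a different fill rule (for example, the most frequent value); the sketch covers numeric features only.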

Model Building.
First, the clinical information data and gene data of these samples were transformed into data frames for data mining analysis. Second, data preprocessing was performed: the patient samples with missing survival attribute values were removed and 453 patient samples were retained; the tumor-containing tissue samples were then separated from the normal solid tissue samples, yielding 412 patient samples with tumor tissue and 41 normal solid tissue samples. Finally, the 412 patient samples with tumor tissue were randomly divided into training samples and validation samples at a ratio of 1 : 1. After data preprocessing, the data was divided into 3 groups: normal samples, training samples, and validation samples.
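The preprocessing and splitting steps can be sketched as follows. This is a minimal illustration under assumed field names (`id`, `tissue`, `survival_time`) and an assumed random seed, none of which come from the paper. It splits the tumor samples exactly in half, whereas the sample counts reported later (200 training and 212 validation samples) show that the authors' split was only approximately 1 : 1.

```python
import random


def split_samples(samples, seed=0):
    """Return (normal, training, validation) sample lists.

    samples: list of dicts with hypothetical keys 'id', 'tissue'
    ('tumor' or 'normal'), and 'survival_time' (None when missing).
    """
    # Remove samples with a missing survival attribute.
    usable = [s for s in samples if s["survival_time"] is not None]

    # Separate normal solid tissue from tumor tissue.
    normal = [s for s in usable if s["tissue"] == "normal"]
    tumor = [s for s in usable if s["tissue"] == "tumor"]

    # Shuffle a copy reproducibly and cut it in half for a 1 : 1 split.
    shuffled = tumor[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return normal, shuffled[:half], shuffled[half:]
```

Fixing the seed makes the split reproducible, which matters because every later result depends on which patients land in the training set.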

Gene Expression Differences.
Differential expression analysis screens for genes whose expression differs significantly between groups. To analyze the differences in gene expression between the 200 training samples and the 41 normal samples, the "Limma" package version 3.42.2 (39) in R was used. With an adjusted P value (adj. P. val) threshold of less than 0.001, 2146 genes were considered to be differentially expressed. Some of the selected genes are shown in Table 1.
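The thresholding on adjusted P values can be sketched in Python. The sketch assumes the Benjamini-Hochberg procedure, a common choice for adjusted P values; the text does not state which adjustment was used, and the moderated statistics computed by Limma itself are not reproduced here. The function names and the input dictionary of raw per-gene P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_differential_genes(raw_p_by_gene, threshold=0.001):
    """Keep genes whose adjusted P value is below the threshold."""
    genes = list(raw_p_by_gene)
    adjusted = benjamini_hochberg([raw_p_by_gene[g] for g in genes])
    return [g for g, adj in zip(genes, adjusted) if adj < threshold]
```

The default threshold of 0.001 matches the cutoff stated in the text; everything else is illustrative.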

Cox Regression Analysis.
After the differential expression analysis, 2146 differentially expressed genes were obtained. Then, these 2146 genes were combined with the survival data from the clinical characteristics of the 200 training samples, and Cox regression was used to analyze the degree of correlation between each differentially expressed gene and the survival of oral cancer patients [14]. The Wald test P value and hazard ratio (HR) were calculated for each gene to filter the genes that are significantly associated with the survival of oral cancer patients. The threshold was set to P < 0.05, and 314 genes highly related to the survival of oral cancer patients were retained. Some of the selected genes and related parameters are shown in Table 2.
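For readers who want to see what a univariate Cox fit computes, here is a minimal pure-Python sketch. It maximizes the Cox partial likelihood for a single covariate by Newton-Raphson, using the Breslow convention for tied event times, and reports the hazard ratio and Wald test P value. It is an illustration only, not the authors' implementation; in practice an established survival analysis package would be used. The function names and the iteration settings are assumptions.

```python
import math


def _score_and_information(beta, times, events, x):
    """First derivative and observed information of the partial log-likelihood."""
    score = info = 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for t_j, x_j in zip(times, x):
            if t_j >= t_i:  # subject j is still at risk at time t_i
                w = math.exp(beta * x_j)
                s0 += w
                s1 += w * x_j
                s2 += w * x_j * x_j
        score += x_i - s1 / s0
        info += s2 / s0 - (s1 / s0) ** 2
    return score, info


def cox_univariate(times, events, x, max_iter=50, tol=1e-9):
    """Fit h(t | x) = h0(t) * exp(beta * x); return beta, HR, SE, Wald P."""
    beta = 0.0
    for _ in range(max_iter):
        score, info = _score_and_information(beta, times, events, x)
        if info <= 0.0:
            raise ValueError("covariate does not vary within the risk sets")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    _, info = _score_and_information(beta, times, events, x)
    se = 1.0 / math.sqrt(info)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return {"beta": beta, "hr": math.exp(beta), "se": se, "p": p}


def screen_genes(expression_by_gene, times, events, alpha=0.05):
    """Keep genes whose Wald P value is below alpha (0.05 in the text)."""
    results = {g: cox_univariate(times, events, v)
               for g, v in expression_by_gene.items()}
    return {g: r for g, r in results.items() if r["p"] < alpha}
```

A hazard ratio above 1 means higher expression goes with shorter survival, and a ratio below 1 means the opposite. The sketch does not handle the case where the estimate diverges because one group always fails first.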

LASSO Regression Algorithm.
To achieve more accurate genetic screening, the LASSO regression method was used for dimension reduction and regression analysis of the gene data of the training samples [15].
Gene screening based on LASSO regression: LASSO regression is a linear model with an L1-norm penalty term, that is, a penalty on the sum of the absolute values of the regression coefficients. K-fold cross-validation with K = 10 was used to select the penalty parameter λ [14,16]. The process of selecting λ is shown in Figure 2. The 10-fold cross-validation gave λ = 0.04098355; this value was then substituted into the LASSO regression equation, and the 23 genes most closely related to the survival of oral cancer patients were finally obtained. The results are shown in Table 3.
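The selection of λ by K-fold cross-validation can be sketched as follows. This sketch swaps in the simplest LASSO setting, a squared-error loss fitted by coordinate descent without an intercept; the text does not state which loss the authors used, and for survival outcomes the loss is typically the negative Cox partial log-likelihood. The function names, the candidate λ grid, and the seed are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; this is what sets coefficients exactly to 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_fit(X, y, lam, n_sweeps=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    residual = list(y)
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            norm = sum(v * v for v in col) / n
            if norm == 0.0:
                continue
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new
    return beta


def cv_select_lambda(X, y, lambdas, k=10, seed=0):
    """Return the candidate lambda with the lowest K-fold squared error."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = 0.0
        for fold in folds:
            held_out = set(fold)
            train = [i for i in indices if i not in held_out]
            beta = lasso_fit([X[i] for i in train], [y[i] for i in train], lam)
            for i in fold:
                pred = sum(b * v for b, v in zip(beta, X[i]))
                err += (y[i] - pred) ** 2
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

Genes whose coefficient is shrunk exactly to zero at the chosen λ drop out of the model, which is how the penalty performs both dimension reduction and gene selection in one step.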
Establishment of the prognostic index based on LASSO regression: as an important indicator that integrates the risk genes, a prognostic index (PI) value can be determined for each patient with oral cancer. The PI was obtained as the sum, over all selected genes, of the product of the expression of each gene and its LASSO-corrected coefficient [14]. The formula of the prognostic index is as follows:
$$\mathrm{PI} = \sum_{i=1}^{n} \beta_i X_i \tag{1}$$
In the formula, $X_i$ is the expression of the i-th gene, $\beta_i$ is the regression coefficient of the i-th gene, and $n$ is the number of selected genes (here, 23). Using this formula with the expression values and regression coefficients of the 23 genes in each sample, the PI of each patient was calculated, and the patients were sorted from lowest to highest according to their PI value. Based on the median PI value, the patients were divided into high-risk and low-risk groups.
Based on LASSO regression analysis, 23 genes were finally selected. These 23 genes and the corresponding LASSO regression coefficients were used to construct a multivariate linear model. Because the expression of each gene differs between samples, a prognostic index can be generated for each sample. The distribution of PI values is shown in Figure 3.
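The prognostic index, the sum over selected genes of coefficient times expression, and the median split can be sketched in a few lines of Python. The gene names, coefficients, and patient identifiers used with this sketch would be illustrative placeholders, not values from Table 3. Patients whose PI equals the median are assigned to the low-risk group here, which is an assumption, since the text does not say how such ties are handled.

```python
from statistics import median


def prognostic_index(expression, coefficients):
    """PI = sum of beta_i * X_i over the selected genes."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


def risk_groups(patients, coefficients):
    """Label each patient 'high' or 'low' relative to the median PI.

    patients: dict mapping patient id -> {gene: expression value}.
    Returns (labels, cutoff).
    """
    pi = {pid: prognostic_index(expr, coefficients)
          for pid, expr in patients.items()}
    cutoff = median(pi.values())
    labels = {pid: ("high" if value > cutoff else "low")
              for pid, value in pi.items()}
    return labels, cutoff
```

Because the cutoff is the median of the cohort being scored, each cohort is split into two groups of roughly equal size regardless of its overall risk level.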
Next, the prognostic model constructed by the LASSO regression method was tested on the training samples to observe whether the samples of high-risk patients could be distinguished from the samples of low-risk patients. The Kaplan-Meier method, combined with the high-risk and low-risk division of the patient samples and the survival status and survival time from the clinical characteristics of the samples, was used to draw the survival curves of the training samples [14]. The ROC curve was used to further verify the validity and feasibility of the prognostic model constructed by LASSO regression. A survival time of 4 years was selected for the ROC curve analysis. An AUC value greater than 0.5 indicates that the prognostic model obtained with the LASSO regression method has predictive value for the mining and analysis of prognostic risk genes of oral cancer. The analysis results are shown in Figure 4. The high-risk group was clearly distinguished from the low-risk group, the log-rank P value was less than 0.001, and the AUC value of LASSO regression was 0.963, which indicates that the model constructed by the LASSO method performs well. The prognostic values of all samples were sorted from lowest to highest, and the median of the prognostic values was taken as a reference. Patients with values larger than the median were considered high-risk patients, and those with values less than the median were considered low-risk patients [14]. The genetic expression profile of the patient samples is shown in Figure 5.
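The two evaluation tools named above can be sketched in pure Python: the Kaplan-Meier product-limit estimate of the survival curve, and the AUC computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The AUC sketch assumes a plain binary label, such as death within 4 years, and ignores censoring; a time-dependent ROC analysis would treat censored patients differently, and the text does not say which variant was used. The function names are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up time per patient; events: 1 if death observed, 0 if censored.
    """
    curve = []
    survival = 1.0
    event_times = sorted({t for t, d in zip(times, events) if d})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, events) if u == t and d)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count one half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Calling `kaplan_meier` once for the high-risk group and once for the low-risk group gives the two curves that a log-rank test compares. In the AUC sketch, the scores would be the PI values.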

Verification on Validation Set.
The Kaplan-Meier method was used to verify whether the 23 genes screened by the LASSO regression model could distinguish high-risk patients from low-risk patients in the 212 validation samples. The ROC curve was also used to further verify the validity and feasibility of the LASSO model. The analysis results are shown in Figure 6. They show that these genetic biomarkers could still classify patients with oral cancer in the validation samples into high-risk and low-risk categories.

Validation Based on Clinical Information.
Through the screening and analysis of clinical data, the patient's drinking history, gender, tumor status, age, smoking history, and cancer status were found to be closely associated with this research [17]. Then, these 6 clinical factors in the clinical information of the 485 samples were each taken as a single variable, and Cox univariate regression analysis was applied to the 6 selected clinical features in turn. The Log-rank P value and HR value of each clinical feature were calculated, and the final results are shown in Table 4. The results of the Cox univariate regression analysis showed that the Log-rank test P values of the six clinical factors, namely, drinking history, sex, tumor status, age, smoking history, and cancer stage, were less than 0.05. Therefore, these six clinical factors are significantly related to the survival of oral cancer patients. Each of the 6 clinical factors (such as drinking history, gender, tumor status, and age) was used as a variable to divide the patient samples into two groups; then, the 23 genetic markers screened by the LASSO method in the training samples were used to analyze the survival curve for each clinical factor based on the Kaplan-Meier method. The analysis results are shown in Figure 7. The figure shows that the six selected clinical features are significantly associated with the survival of the oral cancer patients studied in this paper.

Comparative Verification Based on Other Data Sets.
Results obtained from a single data set are often not convincing enough, so other data sets were used to verify the results. The first validation set was from cBioportal, from which the data of 385 oral cancer samples were sorted out. The survival curve, ROC curve, and AUC value were used to verify the results of the LASSO regression algorithm, and the results are shown in Figure 8. The second validation set was from NCBI, from which the data of 74 oral cancer samples were sorted out. The survival curve, ROC curve, and AUC value were also used to verify the results of the LASSO regression algorithm, and the results are shown in Figure 9. Therefore, the genes screened by LASSO regression analysis still perform well on other independent data sets and can also distinguish high-risk from low-risk patients with oral cancer.

Genetic Function Analysis Based on String.
To further study the relationship between the 23 genes obtained by LASSO regression and oral cancer, we explored the biological relationships among these genes and how they affect the survival and prognosis of oral cancer patients. We used STRING to analyze genetic function and obtained the genetic function network pathway diagram shown in Figure 10. The diagram shows that most of the 23 genes are involved in cell metabolism, the synthesis of biological enzymes, and some life activities associated with cell apoptosis. These cellular activities are closely associated with the generation, proliferation, and metastasis of cancer cells. Some of the results are shown in Table 5.
Further analysis of cell components showed that some genes are involved in cell metabolism, cell apoptosis, and other processes; some genes are involved in the synthesis and metabolism of biological enzymes and of nucleotides, which affect some life activities of cells; and other genes are involved in mitochondrial activities, which are the energy source of cellular life activities. Details are shown in Table 6.
According to the data obtained by STRING genetic function analysis, the genetic function network pathway diagram contains two pathways among these genes. The first pathway is composed of 7 genes, namely, PDHA2, DLAT, PDHB, HS3ST1, PDHA1, PDHX, and PDHAX. The second pathway is composed of 3 genes, namely, TNFRSF25, CASP8, and BID. PDHA1 in the first pathway has an extremely important effect on the proliferation of tumor cells. In addition, HS3ST1 is related to the onset of inflammation. BID in the second pathway is related to cell apoptosis and the DNA damage response. These genes play a vital role in cell mutation, proliferation, and the DNA damage response. Therefore, it is very likely that they have an important impact on the formation and metastasis of cancer cells. In particular, PDHA1 directly and independently affects the prognosis and survival of oral cancer patients.
In the available literature, we found that prolyl 4-hydroxylase subunit alpha 1 (P4HA1), one of these 23 genes, is strongly correlated with the poor prognosis of oral cancer. P4HA1 encodes a protein that is involved in the hydroxylation of proline residues during posttranslational collagen synthesis and accounts for part of the prognostic information carried by multigene hypoxia signatures. A high level of P4HA1 mRNA, as a single-gene surrogate marker of hypoxia, is an independent prognostic indicator of overall survival and local recurrence in oral cancer patients [18]. LRG1 (leucine rich alpha-2-glycoprotein 1) is a pleiotropic protein that plays a pathogenic role in a variety of human diseases. Studies have shown that TGF-β is expressed in oral squamous cell carcinoma, and it is significant that LRG1 can regulate the TGF-β pathway in oral squamous cell carcinoma [19].

Risk Genes Related to Other Cancers.
FMNL3 belongs to the vertebrate-specific actin polymerization factor superfamily and has a wide range of biological functions in cell and tissue development. In human cancer research, FMNL3 has been identified as overexpressed in lymphoid malignancies and melanoma and is associated with oncogenic signaling pathways that regulate cancer cell invasion and migration. The literature suggests that increased expression of FMNL3 is associated with the development, metastasis, and poor prognosis of colorectal cancer (CRC) [20]. β-1,4-N-acetyl-galactosaminyltransferase 1 (B4GALNT1) is a member of the glycosyltransferase family and a key enzyme in the synthesis of the gangliosides GM2 and GD2 and the glycolipid GA2. Research has shown that B4GALNT1 is a key gene in the metastasis of clear cell renal cell carcinoma (ccRCC) and may become a new diagnostic marker and therapeutic target for ccRCC [21]. LRG1 is activated by HIF-1α to regulate angiogenesis and epithelial-mesenchymal transition (EMT) in colon cancer. According to reports, LRG1 is a potential noninvasive diagnostic and prognostic biomarker in colon cancer [22]. It has been reported that MIAT is partly involved in the development of AML (acute myeloid leukemia) through the negative regulation of miR-495, which makes it a promising target for the treatment of AML [23].

Risk Genes Associated with Other Diseases.
SLC25A4 (A1, member of solute carrier family 4) is an important type of transmembrane glycoprotein, which plays an important role in maintaining the stability of the erythrocyte membrane structure and regulating energy metabolism [24].
GRAP is a low-abundance signaling protein that is enhanced in the samples of diabetic renal tubules and is predicted to be a new component of the TGF-β signaling pathway from biological information analysis [25].
According to the literature, the HS3ST1 gene regulates the anti-inflammatory activity of antithrombin and is related to atherosclerosis. HS3ST1 produces a heparan sulfate with a specific pentasaccharide motif that can bind to the anticoagulant protein antithrombin (AT) [26].
TNFRSF25 (tumor necrosis factor receptor superfamily member 25) is the receptor of TNFSF12, APO3L, and TWEAK. It is pointed out in the literature that the methylation level of the TNFRSF25 promoter can be used as an epigenetic biomarker for patients with rheumatoid arthritis (RA) [27].
CELSR3 is an atypical 7-pass cadherin receptor and also an epithelial marker that is downregulated in noncystic fibrosis primary human bronchial epithelial cells [28][29][30]. NAA38 is a component of the NatC N-terminal acetylation complex. It is reported in the literature that the disruption of NAA38 affects the stability of NRF2 and the expression of glutathione biosynthesis genes, thereby changing the sensitivity of hypertrophy [31].
An autosomal dominant mutation of the ANT1 gene (SLC25A4) leads to autosomal dominant progressive external ophthalmoplegia. According to the literature, a recessive mutation has been described in patients with rare hypertrophic cardiomyopathy, lactic acidosis, and exercise intolerance [24].
Apolipoprotein L1 (APOL1) is a substructure of the cell. The literature indicates that APOL1 may have an impact on human kidney diseases by participating in the fusion or fission of mitochondria. The literature also indicates that the mitochondrial fusion/fission pathway may be a therapeutic target for APOL1 nephropathy [32][33][34].

Risk Genes Related to Other Life Activities.
OSR2 controls the production of tooth organs through the antagonism of secreted Wnt antagonists. e absence of Osr2 can prevent the growth and development of molar organs, including normal continuous bud-shaped to cap-shaped and then to bell-shaped teeth [35].
The genes obtained by the mining analysis are mostly involved in the synthesis of proteins and enzymes related to cell metabolism, and some genes are related to mitochondrial activity. Among them, P4HA1 has been determined to be related to the poor prognosis of oral squamous cell carcinoma, OSR2 is related to the formation of dental organs, and MIAT is overexpressed in the cell lines of patients with acute myeloid leukemia. The other genes mentioned in the literature have a certain connection with the poor prognosis of other cancers, and the remaining genes may be related to human life activities or other diseases.

Conclusions
In the whole process of data mining, data analysis is the key. Sound data analysis methods improve the effectiveness of data mining and help ensure the accuracy of the conclusions. WITMED based on big data has shown great advantages in medical testing, medical image analysis, clinical diagnosis, and other fields. Big data analysis provides new methods for medical testing and clinical diagnosis and promotes the development of the medical industry. In the future, data analysis will continue to be integrated into the overall context of smart cities, and the related technologies will continue to improve.
In summary, the 23 genes obtained by data mining in this paper are closely associated with the prognosis of oral cancer, which can provide a certain reference and theoretical basis for clinical research, treatment, diagnosis, and prognosis of oral cancer based on molecular biology.

Conflicts of Interest
The authors declare that they have no known conflicts of interest or personal relationships that could have appeared to influence the work reported in this paper.