Application of Data Mining Technology on Surveillance Report Data of HIV / AIDS High-Risk Group in Urumqi from 2009 to 2015

College of Public Health, Xinjiang Medical University, Urumqi 830011, China
Department of Information Engineering, Xinjiang Institute of Engineering, Urumqi 830000, China
College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi 830011, China
Department of Medical Engineering, The Affiliated Tumor Hospital, Xinjiang Medical University, Urumqi 830011, China
Department of AIDS/STD Control and Prevention, Urumqi Center for Disease Control and Prevention, Urumqi, Xinjiang 830026, China


Introduction
Acquired immunodeficiency syndrome (AIDS) is a severe infectious disease with a very high fatality rate caused by the human immunodeficiency virus (HIV) [1]. It damages the immune system, making people much more vulnerable to infections and diseases [2]. The HIV/AIDS epidemic remains one of the most important public health problems facing both developed and developing nations. Since the first case of HIV infection in China was discovered in 1985, the number of infected patients has increased year by year. The spread of AIDS in China has not been fundamentally controlled, and the prevention and control situation in Xinjiang is even more severe. Xinjiang Uygur Autonomous Region is one of the regions hardest hit by AIDS in China. The first HIV/AIDS case in Xinjiang was reported in 1995. At the end of 2011, the cumulative number of HIV/AIDS cases reported in Xinjiang accounted for 7.7% of the cumulative national total, ranking fifth in China [3]. From 2004 to 2015, a cumulative total of 14,696 HIV/AIDS cases was reported, accounting for 5.56% of the AIDS patients reported in China. In addition, 3830 people died of HIV, accounting for 4.56% of all deaths caused by AIDS. In the past decade, reported AIDS cases increased from 20 to 1868, with an average annual growth rate of 28.74, and reported deaths increased from 5 to 680, with an average annual growth rate of 28.74; both rates were higher than the national average [4]. Urumqi, the capital of Xinjiang Uygur Autonomous Region, is one of the main areas of AIDS infection in Xinjiang, and its AIDS epidemic has remained consistently severe. Injecting drug users have been the largest group of HIV-infected people in Urumqi. By late 2011, however, sexual transmission had overtaken syringe sharing among intravenous drug users and become the leading route of infection. More and more people with multiple sexual partners and men who have sex with men are joining the groups at high risk of spreading AIDS [5,6]. Stemming the spread of HIV among persons at high risk of exposure, and preventing the epidemic from moving from high-risk groups to the general population, remain very difficult. HIV infection therefore continues to be a major global public health issue.
Data mining is a developing technology based on machine learning in artificial intelligence and on databases, and its methods can be classified into two categories: unsupervised learning and supervised learning [7]. Data mining is the process of selecting, exploring, and modeling large amounts of data, with the aim of discovering unknown patterns or relationships and inferring prediction rules from the data [8]. In recent years, great progress has been made in applying data mining to medical research. Studies have applied data mining to analyze large volumes of data, explore unknown factors of disease, develop predictive models, and produce meaningful reports in different medical research fields [9][10][11]. The study of the prevention, diagnosis, and treatment of HIV disease has likewise entered a new phase. Many domestic and foreign researchers have used data mining technology to discover relationships between the potential factors of AIDS patients and the results of treatment, based on HIV surveillance data or comprehensive clinical data [12]. Oliveira et al. built multilayer artificial neural networks (MLP), naive Bayesian classifiers (NB), support vector machines (SVM), and the k-nearest neighbor algorithm (KNN) in order to identify the main factors influencing reporting delays of HIV-AIDS cases within the Portuguese surveillance system. The results suggested that MLP performed best, with higher classification accuracy (approximately 63%), precision (approximately 76%), and recall (approximately 60%) [13]. Wang et al. developed three computational modeling methods to predict virological response to therapy from HIV genotype and other clinical information. The comparison showed that artificial neural network (ANN) models were significantly inferior to random forests (RF) and support vector machines (SVM) [14]. Hai-Lei et al. constructed an SVM-based forecasting model for 133 HIV carriers who were found at the port of a province in China during 2004-2009. The overall accuracy of the forecasting model was 90.60%, and its sensitivity and specificity were 90.29% and 90.90%, respectively [15]. Hailu compared the predictive performance of different data mining techniques used to develop an HIV testing prediction model. Four popular data mining algorithms (decision tree, naive Bayes, neural network, and logistic regression) were used to build a model that predicted whether an individual had been tested for HIV. The final experimental results indicated that the decision tree (random tree algorithm) performed best, with an accuracy of 96% [16].
However, few previous studies have used data mining methods to construct predictive models for AIDS high-risk groups based on several potential risk factors in surveillance report data. This paper aims to use data mining technology to identify the main factors influencing the infection status of AIDS high-risk groups (injecting drug users (IDU), female sex workers (FSW), and men who have sex with men (MSM)) in surveillance report data from Urumqi, and to compare the predictive power of different forecasting models based on data mining technology. To accomplish this objective, several data mining classification models were considered, namely, random forests (RF), support vector machine (SVM), k-nearest neighbors (KNN), and decision tree (DT), using a 10-fold cross-validation technique. The classification performance was evaluated in terms of a confusion matrix, accuracy, sensitivity, specificity, precision, recall, and the AUC values of the receiver operating characteristic (ROC) curves.

Study Population.
The target populations that met the inclusion criteria were selected from the data reported to the China CDC Information System between 2009 and 2015 by the sentinel surveillance sites of CDCs at all levels in Urumqi. Three populations at higher risk of HIV exposure were considered. FSWs were defined as women who engaged in the commercial sex trade during the investigation. IDUs were defined as people who take heroin, cocaine, opium, morphine, marijuana, k-powder, methamphetamine, ecstasy, leprosy, etc. orally, by inhalation, or by injection. MSM were defined as men who have had intercourse or oral sex with men in the past years. The attributes are described in Tables 1, 2, and 3. Table 1 covers a total of 5304 MSM respondents tested for HIV; among them, 377 (7.11%) were HIV positive and 4927 (92.9%) were HIV negative. Table 2 covers a total of 9090 FSW respondents who had received an HIV test; 9041 (99.5%) were HIV negative, while only 49 (0.5%) were HIV positive. Table 3 covers 7337 IDU respondents who had accepted an HIV test; 6087 (83%) were HIV negative and 1250 (17%) were HIV positive. These results indicate that the two classes need to be balanced in each of the three datasets. In this article, we employed the Synthetic Minority Over-sampling Technique (SMOTE) [18] to handle the unbalanced samples. In the SMOTE algorithm, the minority class is oversampled by generating synthetic samples, and the majority class is undersampled. It potentially performs better than simple oversampling and is widely used [19,20].
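The balancing step can be illustrated with a minimal Python sketch. It is not the implementation used in this study (which relied on the SMOTE function of the R package DMwR); the function names are hypothetical, only numeric attributes are handled, and the rounding of sample counts is an assumption. The sketch computes the imbalance ratio of a dataset, the class sizes produced by the perc.over and perc.under settings reported for the MSM dataset in the Experimental Results, and one synthetic minority sample obtained by interpolating between a minority sample and one of its k nearest minority neighbors.

```python
import math
import random


def imbalance_ratio(n_majority, n_minority):
    """Ratio of majority-class size to minority-class size."""
    return n_majority / n_minority


def smote_sizes(n_minority, perc_over, perc_under):
    """Class sizes after SMOTE-style resampling.

    Assumption: perc_over/100 synthetic cases are generated per minority
    case, and perc_under/100 majority cases are drawn per synthetic case
    (fractions are truncated here).
    """
    n_synthetic = int(perc_over / 100 * n_minority)
    n_majority_kept = int(perc_under / 100 * n_synthetic)
    return n_minority + n_synthetic, n_majority_kept


def synthetic_sample(x, minority, k=5, rng=random):
    """Interpolate between x and one of its k nearest minority neighbors."""
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: math.dist(x, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from x to the neighbor
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]


# MSM dataset: 4927 HIV-negative (majority) and 377 HIV-positive (minority).
msm_ratio = imbalance_ratio(4927, 377)
msm_minority, msm_majority = smote_sizes(377, perc_over=1200, perc_under=110)
```

With these settings the two classes end up at comparable sizes, which is the purpose of the balancing step.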

Attribute Selection.
In a data mining task, the selection of input attributes is usually a highly important step: it improves the classification ability of the models, reduces classifier complexity, saves computational time, and simplifies the results. Filtering and wrapping are the two main approaches to selecting a subset of attributes in machine learning. A filter makes an independent assessment based on the general characteristics of the data. A wrapper selects a feature subset using an evaluation function based on a machine learning algorithm [21]. In this paper, a wrapper method based on random forests (RF) was used to select the attributes used as inputs of the classification models. The RF algorithm is an ensemble learning method based on the aggregation of a large number of decision trees and has proved to be very powerful in many different applications [22][23][24]. Feature selection based on the random forest classifier provides multivariate feature importance scores, which are relatively easy to obtain and have been successfully applied to high-dimensional data [25,26]. The procedure can be described as follows: compute the variable importance scores and permutation scores, select the features that contribute most to the classification model, and build models using the feature evaluation criteria of the random forest algorithm. The Gini importance, which was the feature importance criterion used in this study, considers conditional higher-order interactions among the variables and may be a preferable ranking criterion to a univariate measure [27,28].
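To make the Gini criterion concrete, the following Python sketch computes the Gini impurity of a set of class labels and the decrease in impurity produced by one split. In a random forest, the Gini importance of a variable is obtained by accumulating such decreases over all splits that use the variable and averaging over the trees. The function names and the toy labels are illustrative assumptions, not part of the study.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_c^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy example: a split that separates the two classes perfectly.
parent_labels = [1, 1, 0, 0]
decrease = gini_decrease(parent_labels, left=[1, 1], right=[0, 0])
```

A variable whose splits produce larger decreases on average receives a higher importance score and is ranked earlier.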

Random Forests (RF).
The first algorithm for random decision forests was created by Ho (1995) [29], and an extended version was developed by Breiman [30]. RF is an ensemble learning method based on decision trees and has been successfully used in several types of classification and regression problems, especially for accurate disease diagnosis [31][32][33]. RF builds a large number of decision trees, each from a bootstrap sample drawn with replacement from the training set; each tree predicts the class of the samples in the test set, and the final RF prediction is the class that receives the majority of the votes [34]. It has been shown to give excellent performance on numerical and categorical data.
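The two ingredients described above, bootstrap sampling and majority voting, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the trees are represented as arbitrary callables (here hypothetical one-rule stumps), and tree induction itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement from the training set."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, x):
    """Each tree votes on the class of x; the majority class wins."""
    return majority_vote([tree(x) for tree in trees])


# Hypothetical stumps standing in for trees grown on bootstrap samples.
stumps = [
    lambda x: "positive" if x[0] > 0.5 else "negative",
    lambda x: "positive" if x[1] > 0.5 else "negative",
    lambda x: "positive" if x[0] + x[1] > 1.0 else "negative",
]
prediction = forest_predict(stumps, [0.9, 0.8])
```

In a full implementation each tree would be trained on its own call to bootstrap_sample, which is what makes the individual trees differ.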

Support Vector Machine (SVM).
The support vector machine, a type of learning machine derived from statistical learning theory, constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks such as outlier detection, function estimation, and high-dimensional pattern recognition [35][36][37][38]. The SVM mainly deals with binary classification problems. In addition to performing linear classification, an SVM can efficiently perform nonlinear classification through kernel techniques [39], implicitly mapping its inputs into high-dimensional feature spaces. An SVM classification model is constructed in two steps: (1) converting the input space into a higher-dimensional feature space by a nonlinear mapping function and (2) building the separating hyperplane that maximizes the distance from the closest points of the training set [40].

K-Nearest Neighbors (KNN).
The KNN classifier is an instance-based, or lazy, learning algorithm [41]. It has been widely used in many fields, such as text classification, pattern recognition, and disease detection and diagnosis, because it is simple, efficient, and easy to implement [42,43]. The KNN algorithm depends on three choices: the value of k, the distance measure, and the classification decision rule. The value of k, a user-defined constant, directly affects classification performance. Commonly used distance metrics are the Euclidean, Manhattan, and Minkowski distances. The decision rule is majority voting.
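The three components of KNN named above (the value of k, the distance measure, and the majority-vote decision rule) can be illustrated with a short Python sketch. It is a simplified stand-in, not the kknn implementation used in the experiments; the function names and the toy training points are assumptions. The Minkowski distance with p = 2 is the Euclidean distance and with p = 1 the Manhattan distance.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train, query, k=3, p=2):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: minkowski(item[0], query, p))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set with two well-separated classes.
toy_train = [
    ([0, 0], "negative"),
    ([0, 1], "negative"),
    ([5, 5], "positive"),
    ([6, 5], "positive"),
    ([5, 6], "positive"),
]
label = knn_predict(toy_train, [5, 4], k=3)
```

Because no model is fitted in advance, all of the computation happens at prediction time, which is why the method is described as lazy learning.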

Decision Trees (DT).
A decision tree is a commonly used data mining method with many advantages: it is easy to understand, readable, and quick to classify [44]. A decision tree organizes decision-making nodes in a tree structure consisting of decision nodes, branches, and leaf nodes. Each decision node represents a data category or attribute to be classified, and each leaf node represents a result [45]. The decision-making process starts from the root decision node and proceeds from top to bottom until the classification result is given. Three typical decision tree algorithms are commonly used in data mining at present: the ID3, C4.5, and CART algorithms [46].
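The top-down decision process can be sketched in Python as a walk from the root node to a leaf. The tree below is a hypothetical toy structure with made-up attribute names and class labels; it is not a tree learned from the surveillance data, and tree induction (ID3, C4.5, or CART) is not shown.

```python
def classify(node, record):
    """Follow the branches matching the record until a leaf is reached."""
    while "label" not in node:
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


# Hypothetical tree: decision nodes test an attribute, leaf nodes hold a class.
TOY_TREE = {
    "attribute": "attr_a",
    "branches": {
        "low": {"label": "class_0"},
        "high": {
            "attribute": "attr_b",
            "branches": {
                "yes": {"label": "class_1"},
                "no": {"label": "class_0"},
            },
        },
    },
}

result = classify(TOY_TREE, {"attr_a": "high", "attr_b": "yes"})
```

The path taken through the tree can be read as a rule, which is the source of the readability mentioned above.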
Performance Evaluation.
In this paper, a confusion matrix and several indicators, including accuracy, sensitivity, specificity, precision, recall, and the receiver operating characteristic (ROC) curve, were used to appraise the performance of the four classification models. A 10-fold cross-validation was applied to validate RF, SVM, KNN, and DT. A confusion matrix consists of the parts shown in Table 4. In Table 4, TP (true positive) is the number of positive records correctly classified, TN (true negative) is the number of negative records correctly classified, FP (false positive) is the number of negative records incorrectly classified as positive, and FN (false negative) is the number of positive records incorrectly classified as negative. Several important measures, such as accuracy, sensitivity, specificity, precision, and recall, can be calculated from the confusion matrix. The accuracy is the proportion of samples correctly classified. The sensitivity is the proportion of positive samples that are correctly classified. The specificity is the proportion of negative samples that are correctly classified. The precision is the proportion of true positive samples among all samples predicted as positive. The recall is the ratio of correctly classified positive samples to all positive samples. The ROC curve originally derives from statistical decision theory and comprehensively describes the classification performance of a classifier across different discriminant thresholds [47]. The vertical axis of the ROC curve is the TP rate, and the horizontal axis is the FP rate. In practical applications, the AUC (the area under the ROC curve) is often used to evaluate the performance of the classifier.
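The measures above follow directly from the four cells of the confusion matrix. The Python sketch below uses the standard definitions; the counts in the example are made up for illustration and are not results of this study. The AUC is computed here through its rank interpretation, that is, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which is an assumption about the computation rather than the procedure used in the experiments.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard measures derived from a two-class confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),       # identical to sensitivity
    }


def auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Illustrative counts only.
example = confusion_metrics(tp=90, tn=80, fp=20, fn=10)
example_auc = auc([0.9, 0.8], [0.1, 0.85])
```

Note that recall and sensitivity are the same quantity under these definitions, so they always coincide for a given confusion matrix.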

Experimental Results
R is an open source programming language and software environment for statistical computing and graphics. All algorithms in this experiment were implemented in the R environment, using the SMOTE (DMwR), randomForest (randomForest), ksvm (kernlab), kknn (kknn), and rpart (rpart) functions and packages. All experiments were validated with a 10-fold cross-validation technique in order to obtain a more stable accuracy estimate for the four classification models. Several evaluation indexes were used to compare the classification performance of the four data mining algorithms.
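The 10-fold cross-validation scheme can be sketched as follows in Python. The experiments themselves were run in R, so this is only an illustration; the function names, the shuffling seed, and the fit_and_score callback are hypothetical. The samples are shuffled once and divided into k folds, each fold serves once as the test set while the remaining folds form the training set, and the k scores are averaged.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_validate(n, fit_and_score, k=10):
    """Average the score returned by fit_and_score(train, test) over k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(25, k=10))
```

Averaging over the folds gives a more stable estimate than a single train/test split, because every sample is used for testing exactly once.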
Table 5 shows the three original datasets and the three balanced datasets obtained using the SMOTE algorithm. The original datasets are clearly imbalanced; their imbalance ratios are 13.0689, 184.5102, and 4.8696, respectively. To balance the data and avoid biased results, we used the SMOTE algorithm, which combines oversampling of the minority class with undersampling of the majority class. We applied the function SMOTE in the DMwR package in R. The three main parameters of the function are perc.over, perc.under, and k. The parameters perc.over and perc.under control the amount of oversampling of the minority class and undersampling of the majority class, respectively. The parameter k is the number of nearest neighbors used to generate the new examples; its value was set to 5. For the initial MSM dataset, with 377 minority samples and 4927 majority samples, we set perc.over = 1200 and perc.under = 110. First, the number of minority samples was increased: a total of 1200 × 377/100 new minority samples were generated, and the original and new minority samples together were placed in the new dataset. Second, the majority class was sampled to obtain (110/100) × 1200 × 377/100 majority samples, which were added to the new dataset.

Figures 1, 2, and 3 describe the importance of the sorted variables of the three datasets (MSM dataset, FSW dataset, and IDU dataset) according to the Gini index criterion from RF.
From Figure 1, for the MSM dataset, the most important variables are B01, B06, A01B, A06, and B05, and the least important are I02, G01, H01, and D01. From Figure 2, for the FSW dataset, the most important variables are B01B, T05C, A06, B05, and B06, and the least important are F01, G02, C08, E01, and D01. From Figure 3, for the IDU dataset, the most important variables are B02, A01, T05C, B06, and B05, and the least important are C08, B04, T04C, F01, and D01. Finally, applying the rank + MeanDecreaseGini attribute selection method, variables were ranked by their importance in classifying the HIV patients. We also consulted CDC doctors about the importance of the lower-ranking attributes. Combining the two approaches, we selected B01, B06, A01B, A06, B05, B04, B02, D03, I03, J01, I01, B03, F01, T04C, and E01 as the main subset of attributes for predicting HIV patients in the MSM population; B01B, T05C, A06, B05, B06, B04, B02, A01B, D02, H01, T04C, G03, B03, and G01 as the main subset for the female sex worker population; and B02, A01, T05C, B06, B05, B03, A06, D02, H01, G02, G03, E01, G01, and B01 as the main subset for the drug user population. The detailed descriptions of the selected attributes are shown in Tables 6, 7, and 8.
Figures 4, 5, and 6 show the ROC curves obtained for the three datasets with the four classifiers. The AUC scores for RF, SVM, KNN, and DT are 0.9802, 0.9401, 0.9747, and 0.7917 on the MSM dataset; 0.9981, 0.9803, 0.9967, and 0.8702 on the FSW dataset; and 0.9874, 0.9135, 0.9802, and 0.7438 on the IDU dataset. RF achieved the highest AUC on all three datasets (0.9802, 0.9981, and 0.9874 for the MSM, FSW, and IDU datasets, respectively), followed closely by KNN. The maximum AUC (0.9981) was obtained for the FSW dataset with the RF algorithm, and the minimum (0.7438) for the IDU dataset with the DT algorithm.
Figures 7, 8, and 9 depict the classification performance of the four classifiers on the MSM, FSW, and IDU datasets, respectively. The accuracy, precision, and recall of RF, SVM, KNN, and DT on the three datasets were compared. For the MSM dataset (Figure 7), the SVM model achieved a classification accuracy of 87.8404%, with a precision of 89.5130% and a recall of 85.5132%. The KNN model had a classification accuracy of 91.5258%, with a precision of 89.5130% and a recall of 85.5132%. For the decision tree, the accuracy, precision, and recall were 76.7440%, 77.6199%, and 74.6582%, respectively. The random forest algorithm performed best among the four models, with an accuracy of 94.4821%, a precision of 98.5511%, and a recall of 90.2061%. For the FSW dataset (Figure 8), the random forest algorithm again performed best, with an accuracy of 97.5136%, a precision of 97.4638%, and a recall of 91.6160%. The KNN model came second, with an accuracy of 96.3083%, a precision of 97.4210%, and a recall of 95.1163%, followed by the SVM model, with an accuracy of 93.3560%, a precision of 94.1554%, and a recall of 92.4155%. The decision tree had the lowest classification accuracy, 85.0408%, with a precision of 86.9467% and a recall of 82.3739%.
For the IDU dataset (Figure 9), the RF classifier showed the best predictive performance, with an accuracy, precision, and recall of 94.6375%, 97.4638%, and 91.6160%, respectively. For the SVM model, they were 83.4821%, 84.8141%, and 81.4080%, respectively. As shown in the confusion matrix in Table 10, the KNN algorithm achieved an accuracy of 90.8287%, with a precision of 94.7831% and a recall of 86.3360%. The decision tree had the lowest overall performance, with an accuracy of 71.2271%. Other performance metrics derived from the confusion matrices, such as sensitivity and specificity, were also used to measure the performance of the classifiers on the three datasets. Overall, the RF classifier performed best compared with the other three methods, with accuracies of 94.4821%, 97.5136%, and 94.6375% on the MSM, FSW, and IDU datasets, respectively. The decision tree had the lowest classification accuracies: 76.7440%, 85.0408%, and 71.2271% on the MSM, FSW, and IDU datasets, respectively. The detailed classification outcomes of each model for the three datasets are shown in Tables 9, 10, and 11.

Discussion
The AIDS epidemic in Urumqi is still very serious. The increasing number of people in high-risk groups, such as female sex workers, male sex workers, and the floating population, has made AIDS prevention and treatment more difficult. Data mining has been widely used in diagnosis, evaluation, and other medical fields [48]. This study aimed to use four mature data mining algorithms (random forests, support vector machine, k-nearest neighbors, and decision tree) to build identification models for AIDS patients based on the sentinel monitoring data of HIV high-risk populations (MSM, FSWs, and IDUs) in Urumqi and to compare the predictive power of the different models. However, considering ... 91.3571%). The DT algorithm was the poorest of the four algorithms, with a diagnostic accuracy of 79.1761% on the MSM dataset, 87.0283% on the FSW dataset, and 74.3879% on the IDU dataset. These results suggest that the four data mining models can predict whether a person is infected with HIV. Compared with SVM, decision tree, and KNN, however, the random forest model balances sampling error by drawing a large number of random samples, so that its classification result is a comprehensive assessment over many different test samples, whereas the other three models are fitted to a single sample; its results are therefore more reliable [50].
Based on the importance scores of the independent variables in the random forest model, this study identified the most important factors influencing HIV infection in the three high-risk populations in Urumqi. For the MSM dataset, these variables are age, educational level, monitoring site, sample source, length of residence, ethnicity, marital status, etc. Variables such as age show that the MSM population in Urumqi consists mainly of young and middle-aged people aged 18 to 40 years, who account for 91.3%; this is similar to the monitoring results in Chengdu [51] and shows that sexually active people remain the focus of AIDS prevention and treatment. The majority (82.5%) of the participants had never been married. More than half (56.2%) came from the Sayibak District, 68% of the participants were recruited through the network, and 72.1% had some college or higher education. Therefore, given the epidemic characteristics of the MSM population in Urumqi, personal characteristics and social factors should be considered comprehensively when educational interventions are carried out for this population. For the FSW dataset, the results showed that most of the female sex workers (FSWs) in Urumqi were young women under 30 years old, 58.2% were unmarried, 65% had worked in a local workplace for less than a year, more than half had a primary or junior middle school education, and most came from nightclubs, karaoke venues, ballrooms, and bars. Therefore, education and intervention measures should be tailored to the actual epidemic characteristics of FSWs. For the IDU dataset, the age of the 7337 participants ranged from 11 to 71 years, and most of them (94.5%) were aged 18-48 years. Among them, 2586 (35.2%) were single, 2147 (32.9%) came from Sayibak District, and 5169 (66.4%) had a junior high school education or below. Among the participants, 89.3% were male and 69% were from the community. These results can provide evidence for preventing HIV infection among drug users through the promotion of education, especially for adolescents, people with a low educational level, the floating population, and those affected by drug abuse, sexual disorder, etc.

As shown above, data mining models can accurately identify diseases based on certain important attributes. These predictive models are valuable tools in the medical field. However, there are areas of concern in the development of predictive models: (1) the model should include all clinically relevant data, (2) the model should be tested on an independent sample, and (3) the model must make sense to the medical personnel who are supposed to use it. It has been shown that not all predictive models constructed using data mining techniques satisfy all of these requirements [52].
This study has some limitations. First, all individuals were recruited in Urumqi, so the study is limited by geographical and population characteristics, and information bias may exist. If the study population could be expanded to more than one province or to the whole country, the recognition performance of the model might improve. Second, in the epidemiological investigation of HIV-infected persons, respondents may, for subjective, objective, or other reasons, provide inaccurate information, which may influence the analysis results. In the future, more feature selection methods, class imbalance processing methods, and data mining algorithms are expected to be tested.

Conclusion
In this study, four prediction models were established and compared for predicting whether a person is infected with HIV. The results showed that the random forest model achieved the best classification accuracy. This study may provide medical staff with an effective way to quickly screen for and diagnose AIDS from a large amount of information.

F01: Have you ever been diagnosed with an STD in the last year
G02: Have you ever received a community medication to maintain or providing or exchanging cleaning needles to prevent HIV
C08: Knowledge and awareness of HIV
E01: Did you take drugs
D01: Did you use condoms with your guests the last time
G01: Have you ever received a condom promotion or HIV counselling and testing to prevent HIV
B03: The location of household register
G03: Have you ever received a companion education to prevent HIV
T04C: Syphilis test results
H01: Has HIV been tested in the last year
D02: How often did you use condoms when you have sex with a guest last month

Figure 2: The importance of variables of FSW dataset.

Figure 4: ROC curve of different classifiers for MSM dataset.

Figure 7: Performance of different classification models for MSM dataset.
Data Source.
The data applied in this paper consisted of three datasets from the populations at higher risk of HIV/AIDS exposure, collected between 2009 and 2015 by the Urumqi CDC. The three datasets are the FSW dataset, which included 9090 FSWs and 53 attributes; the MSM dataset, which included 5304 MSM and 57 attributes; and the IDU dataset, which included 7337 IDUs and 56 attributes. The collected data came from three core survey questionnaires: the FSW questionnaire, the MSM questionnaire, and the IDU questionnaire. The survey items included demographic characteristics (age-at-birth, gender,

Table 1: Details of the attributes of the MSM dataset.
D01: Have you ever had anal sex with a person of the same sex in the last six months
2.5.3. K-Nearest Neighbors (KNN). The k-nearest neighbors algorithm (KNN) is the simplest but more powerful nonparametric classification method of all data mining methods, since it is a type of instance-based or lazy learning algorithm

Table 2: Details of the attributes of the FSW dataset.

Table 3: Details of the attributes of the IDU dataset.

Table 4: Confusion matrix for the two-class problem.

Table 5: Description of original data and balanced data.

Table 6: Selection attributes used in models of MSM dataset.

Table 7: Selection attributes used in models of FSW dataset.

Table 8: Selection attributes used in models of IDU dataset.

Table 9: Performance measures of the classifiers for MSM dataset.

Table 10: Performance measures of the classifiers for IDU dataset.

Table 11: Performance measures of the classifiers for FSW dataset.