The Comparative Experimental Study of Multilabel Classification for Diagnosis Assistant Based on Chinese Obstetric EMRs

Obstetric electronic medical records (EMRs) contain massive amounts of medical data and health information. Information extraction from obstetric EMRs and diagnosis assistants built on them are of great significance in improving the fertility level of the population. The admitting diagnosis in the first course record of the EMR is reasoned from various sources, such as chief complaints, auxiliary examinations, and physical examinations. Based on an analysis of obstetric EMRs, this paper treats the diagnosis assistant as a multilabel classification task. Latent Dirichlet allocation (LDA) topics and word vectors are used as features, and four multilabel classification methods, BP-MLL (backpropagation multilabel learning), RAkEL (RAndom k labELsets), MLkNN (multilabel k-nearest neighbor), and CC (classifier chain), are utilized to build the diagnosis assistant models. Experimental results on real cases show that BP-MLL achieves the best performance, with an average precision of up to 0.7413 ± 0.0100 when the label set size and the word vector dimension are 71 and 100, respectively. The result of the diagnosis assistant can be introduced as a supplementary learning method for medical students. Additionally, the method can be used not only for obstetric EMRs but also for other medical records.


Introduction
Since family planning was issued as one of the fundamental state policies in China, late marriage and late childbirth have indeed benefited the country. However, it has also led to an increasing proportion of older pregnant women, especially those over 35 years old. The problem has been exacerbated by the implementation of the Universal Two-child Policy in 2016. Later pregnancies are associated with higher risks of fetal abnormality and other complications, which are challenges for obstetricians [1]. Since the National Health and Family Planning Medical Affairs Commission issued the Basic Norms of Electronic Medical Records (Trial) [2] in 2010, medical institutions have accumulated many obstetric EMRs (electronic medical records). EMR data are big data in the medical field. They contain medical data and a large amount of patients' health information. Currently, one urgent task is to use these resources for clinical decision support in order to improve clinical treatments.
EMRs are the detailed records of medical activities written by the medical staff, in which free text (semistructured or unstructured) is one of the most important forms [3]. Using natural language processing technology to structure EMRs and extract information is a crucial step in making the information contained in EMRs usable. As artificial intelligence develops, automatic medical diagnosis becomes possible. In EMRs, the first course record is stored in a textual format and includes the chief complaints, physical examinations, auxiliary examinations, and other information, which provide the foundation for the admitting diagnosis. Generally, the admitting diagnosis in an obstetric EMR is not a single diagnosis but includes normal obstetric diagnoses, medical diagnoses, and complications.
The problem can be transformed into a multilabel classification task in machine learning, in which the different diagnoses can be regarded as the variable labels.
Based on the analysis of the structure and content of Chinese obstetric EMRs, the first course records are cleaned and structured in this paper. The collected Chinese obstetric EMRs are divided into chief complaints, physical examinations, obstetrical examinations, and auxiliary examinations. Then, two kinds of features are extracted: topics from the latent Dirichlet allocation (LDA) topic model and word vectors trained by the Skip-gram model. Several multilabel classification methods are employed to diagnose the obstetric EMRs, which is an initial attempt at a diagnosis assistant based on Chinese obstetric EMRs.

Related Works
Each instance belongs to only one label in both the conventional binary classification task and the multiclass task, while each instance can belong to more than one label in multilabel classification. For example, the diagnosis from a doctor for one patient is usually a variety of mixed results rather than a single one. Multilabel classification has often been applied in the fields of text classification [4][5][6], emotional classification [7,8], image and video classification [9][10][11], bioinformatics [12][13][14][15], and medical classification [16][17][18][19][20]. Recent research on multilabel learning (MLL) has three main focuses. The first improves existing classification or ranking models or proposes new ones. Zhang et al. [21] changed the original error function and proposed the BP-MLL (backpropagation multilabel learning) method on the basis of the traditional multilayer feed-forward neural networks. Li et al. [22] improved the classifier chain (CC) method and named it the ordered classifier chain (OCC). It can effectively utilize the dependency relationship among different labels. The second focus improves existing feature selection models or proposes new ones. Duan et al. [23] defined the lower approximation and dependency and designed a feature selection algorithm based on the neighborhood rough set for multilabel classification. The third focus applies MLL to new areas. Liu et al. [24] applied MLL to choose symptoms from a Chinese coronary heart disease dataset.
In the field of medical research [16][17][18][19][20], Shao et al. [16] proposed an algorithm called hybrid optimization-based multilabel (HOML) to select features. HOML combined the relatively strong global optimization ability of the simulated annealing algorithm and the genetic algorithm with the strong local optimization capability of the greedy algorithm. They adopted the multilabel classifier to model coronary heart disease in traditional Chinese medicine (TCM), which significantly improved the performance. Zhang et al. [18] subsequently applied multilabel learning by exploiting label dependency (LEAD) to tongue image classification in TCM. Xu et al. [19] combined the random forest algorithm and the MLL algorithm. They then used it to select symptoms of excess chronic gastritis and establish classification models. Goldstein et al. [25], using data from I2B2 of 2008, trained one specialist classifier per class and classified obesity and its comorbidities using the MLL method. The previous research was mainly conducted on normalized public datasets or real records that included a relatively small number of labels.
In the field of diagnosis assistants, Jiang et al. [26] presented a novel computational model for the aided diagnosis of subhealth. The dataset was divided into the training set and the test set. Based on the rough set and fuzzy mathematics, the training set was used to extract important features and generate fuzzy weight matrices. Then, the features and fuzzy weight matrices were used to assist the diagnosis of subhealth. Tiwari et al. [27] presented the LTEM-PCA-ANN approach (Laws' texture energy measures (LTEM), principal component analysis (PCA), and artificial neural network (ANN)), which achieved an overall accuracy of 93.34%. Then, the computational model was used to design a computer-aided diagnosis (CAD) system for the classification of brain tumors to assist inexperienced radiologists in the diagnosis process. Jiang et al. [28] proposed a three-layer knowledge-based model (disease-symptom-property) to diagnose a disease, which significantly reduces the dependencies between attributes and improves the accuracy of predictions.
However, very few studies have been conducted so far on diagnosis assistants for the complicated Chinese obstetric EMRs. Chinese is a logographic language and the Chinese EMRs are free narrative texts, which brings challenges to a diagnosis assistant. Furthermore, the obstetrical diagnosis types are complicated, and some of their features are not easy to extract directly, which makes research on a diagnosis assistant for obstetric EMRs even more difficult. In this paper, the LDA topic model and the Skip-gram model are used to obtain features. The multilabel classification methods BP-MLL [21], RAkEL (RAndom k labELsets) [29], MLkNN (multilabel k-nearest neighbor) [30], and CC [31] are employed to study the automatic diagnosis of obstetric EMRs.

Materials and Data Preprocessing
3.1. Materials. This paper takes more than 10,000 Chinese obstetric EMRs as a research dataset. These data were randomly selected from 15 hospitals. Under the guidance of the Basic Norms of Electronic Medical Records (Trial) [2], the written forms of EMRs in different hospitals vary slightly according to the actual situations in China. Charts and free text are the major forms of EMRs, and the unstructured free text is one of the main objects of information extraction research. The obstetric EMR mainly includes two parts, the course records and the discharge summary. In addition, there will be preoperative summaries, operation records, and postoperative course records if a surgery is performed, and there will be newborn case records if a baby was born. In general, one course record includes one first course record, one or more daily course records (also known as ward-round records), superior doctors' ward-round records, and one discharge summary. We focus on analyzing the content and characteristics of the first course records. The first course record usually includes the recorded time, chief complaints, admitting physical examinations, obstetric practice, auxiliary examinations, admitting diagnosis, diagnostic basis, differential diagnosis, and treatment plan. An example of the first course record is shown in Figure 1.
In the first course record, the admitting diagnosis is made by the obstetricians, who comprehensively analyze the patient's conditions. As shown in Figure 1, the admitting diagnosis "宫内孕 28+2 周 (intrauterine pregnancy 28+2 weeks)" can be calculated from the date of the last menstrual period in the chief complaints or obtained directly from the result of auxiliary examinations, and the diagnosis "孕 3 产 1 (pregnancy 3, production 1)" can be extracted from the chief complaints in the admitting records. The remaining four diagnoses can be inferred from the features contained in the chief complaints or the previous examinations. Therefore, producing the admitting diagnosis in the first course record can be regarded as a multilabel classification task based on the explicit or implicit features contained in the complaints or examinations.

3.2. Data Preprocessing.
Since the collected EMRs are real cases, it is necessary to protect patients' privacy, and the records inevitably contain some noisy data. Deidentification and data cleansing are therefore necessary steps in the processing of EMRs. In the process of analyzing the extracted records, private information, such as mentions of patients, hospitals, doctors, patient IDs, locations, and phone numbers, has been removed from the records. Then, the essential preprocessing of the EMR data is conducted, including data cleansing, data structuration, word segmentation, and data standardization, which are described below.

3.2.1. Data Cleansing.
There are problems such as redundancy, missing information, and disordering due to deficiencies in the existing HIS (hospital information system). Redundant records are filtered out through automatic string matching. In particular, when more than one first course record is detected in one EMR, the correct one is chosen according to the integrity of information and the record time, and the others are removed. If the first course record is missing, the EMR is deleted from the dataset. For temporal disordering, an algorithm is designed to detect records with temporal errors according to the temporal logic of the obstetric treatment, and the records that include temporal errors are also removed from the dataset. Finally, the dataset contains 11,303 first course records.

3.2.2. Data Structuration.
All content in one original EMR text is mixed together. To facilitate data analysis, the first course records are formatted in accordance with the chief complaints, admitting physical examinations, obstetric practice, auxiliary examinations, admitting diagnosis, diagnostic basis, differential diagnosis, and treatment plan, which form the experimental dataset in this paper. The record in Figure 1 is arranged according to the section of content after structuring.

3.2.3. Word Segmentation.
In this paper, chief complaints, physical examinations, obstetric examinations, and auxiliary examinations are used to predict the admitting diagnosis. The admitting diagnosis and the other parts extracted from the EMRs, cleaned and structured by the aforementioned methods, form the experimental dataset. We regard the first four parts as features and the admitting diagnosis as labels. The word segmentation tool ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) (https://codeload.github.com/NLPIR-team/NLPIR-ICTCLAS/zip/master) is used to segment the words in the dataset. Medical terminology and drug names obtained from the Internet and literature [32] are added to the ICTCLAS dictionary in order to improve the segmentation accuracy.

3.2.4. Data Standardization.
Diagnoses such as pregnancy X + Y weeks and pregnancy Z production U are the results of a calculation or are taken from the complaints, so they are not accepted as class labels. The rest of the diagnoses are accepted as class labels in the multilabel classification and form label set $L_1$, which includes 737 labels. An analysis of the class label set shows that the same category often has more than one written form, since the EMRs are extracted from different medical institutes and the doctors have personalized writing habits. For example, in set $L_1$, "胎盘前置状态 (state of placenta previa)" and "前置胎盘 (placenta previa)" are different written forms of the same diagnosis. In this case, based on the naming rules of ICD10 (International Classification of Diseases 10) diseases, the diagnosis results are segmented and the similarity of labels is calculated with a semantic method (https://my.oschina.net/twosnail/blog/370744#comment-list). The similarity $S_s$ is computed from $S_1$ and $S_2$, the semantic vector representations of the two diagnosis labels. Depending on the similarity calculation result, medical professionals standardize the class labels and merge the labels that have the same diagnostic results but different expressions. Finally, we get the label set $L_2$, which contains 233 class labels. The frequency statistics are shown in Figure 2.
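The label-similarity step can be illustrated with a minimal sketch. It assumes that the similarity of two label vectors is their cosine similarity; that choice, the toy vectors, the third example label, the threshold of 0.9, and the function names are all assumptions made for illustration and are not details given in the text. Pairs above the threshold are only candidates, since the final merging is decided by medical professionals.

```python
import math


def cosine_similarity(s1, s2):
    """Cosine similarity of two equal-length vectors (an assumed form of the similarity)."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm1 = math.sqrt(sum(a * a for a in s1))
    norm2 = math.sqrt(sum(b * b for b in s2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)


def candidate_merges(label_vectors, threshold=0.9):
    """Return (label_a, label_b, similarity) for every pair at or above the threshold."""
    labels = sorted(label_vectors)
    pairs = []
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            sim = cosine_similarity(label_vectors[a], label_vectors[b])
            if sim >= threshold:
                pairs.append((a, b, sim))
    return pairs


# Hypothetical 3-dimensional semantic vectors, for illustration only.
toy_vectors = {
    "胎盘前置状态": [0.9, 0.1, 0.0],
    "前置胎盘": [1.0, 0.0, 0.0],
    "妊娠期糖尿病": [0.0, 0.2, 0.9],
}

for a, b, sim in candidate_merges(toy_vectors):
    print(a, b, round(sim, 3))
```

With these toy vectors, only the two placenta previa variants are proposed as a merge candidate.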
The number of diagnosis labels that appear once is 80, which accounts for 34% of the total. The number of diagnosis labels that appear 2 to 10 times is 82, which accounts for 35% of the total. The total frequency of diagnosis labels is 26,772 in the dataset. The minimum number of diagnosis labels in one instance is 1, while the maximum is 8. The average number of labels in one instance is 2.67. Figure 3 shows the workflow of the diagnosis assistant process. Data processing has been described in Section 3.2. Feature extraction and the multilabel classification are described below.

Method
4.1. Feature Extracting. The most important stage in MLL, as in any classification problem, is feature extraction, in which the data are represented in a low-dimensional space by the most descriptive features, those that best characterize the interclass differences. From Figure 1, we see that there are many numerical data in EMRs, but the main written form is still free narrative text. In this paper, we utilize two methods, the LDA and Skip-gram models, to obtain features. The three-layer structure of the LDA can effectively extract the textual features of narrative texts, and Skip-gram is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships.
4.1.1. LDA. The LDA was proposed by Blei et al. [33]. It is a three-layer Bayesian model, which has been widely applied to feature extraction. The input of the LDA model is a segmented document set D, and the output is the probability distribution for each document d under each topic k.
Each document d can be seen as a composition of N words and of k topics, and the word is the basic unit of a topic. For document d, we choose a topic k from the document-topic distribution θ and then select a word w from the corresponding word distribution φ of topic k. Repeating these steps generates a document containing N words, as follows:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \tag{2}$$

The document-topic distribution $p(k \mid d)$ and the word-topic distribution $p(w \mid k)$ can be obtained by LDA from the counts $C_{wk}$ and $C_{dk}$, where $C_{wk}$ is the number of times the word w is assigned to topic k, and $C_{dk}$ is the number of times a word in document d is assigned to topic k.
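One common way to turn the topic-assignment counts into the two distributions is smoothed normalization. The sketch below assumes symmetric smoothing with hypothetical hyperparameter values alpha and beta; it illustrates the general idea and is not necessarily the exact estimator used in this paper. The function names and toy counts are also assumptions.

```python
def doc_topic_distribution(c_dk, alpha=0.1):
    """Estimate p(k|d) from the per-topic counts of one document.

    c_dk: list with one count per topic (how often a word of the
          document was assigned to that topic).
    alpha: assumed symmetric smoothing value.
    """
    n_topics = len(c_dk)
    total = sum(c_dk) + n_topics * alpha
    return [(count + alpha) / total for count in c_dk]


def topic_word_distribution(c_wk, beta=0.01):
    """Estimate p(w|k) from the per-word counts of one topic.

    c_wk: list with one count per vocabulary word (how often that
          word was assigned to the topic).
    beta: assumed symmetric smoothing value.
    """
    vocab_size = len(c_wk)
    total = sum(c_wk) + vocab_size * beta
    return [(count + beta) / total for count in c_wk]


# Toy counts for one document over three topics (illustrative only).
theta_d = doc_topic_distribution([3, 0, 1])
print(theta_d)
```

Under this reading, the vector returned by `doc_topic_distribution`, with one entry per topic, is what serves as the LDA feature vector of a document.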

4.1.2. Word Vector. Distributed representations of words in a vector space help learning algorithms achieve better performance in natural language processing tasks by grouping similar words. Word2vec is an implementation of the model proposed by Mikolov et al. [34] that can be used to quickly and effectively express words as word vectors. It contains two kinds of training models, the CBOW (continuous bag-of-words) model and the Skip-gram model [35]. Both have three layers: the input layer, the projection layer, and the output layer. The CBOW model generates word vectors by using the contextual information to predict the current word. The Skip-gram model works in the opposite way: it uses the current word to predict the words in its possible context. In this paper, we choose the Skip-gram model to train the word vectors. The training goal of the Skip-gram model is to maximize the value:

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t) \tag{5}$$

where c is the size of the training context, and T is the size of the training text. The basic Skip-gram model calculates the conditional probability:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)} \tag{6}$$

where $v_w$ and $v'_w$ are the input and the output vector representations of w, respectively, and W is the number of words in the vocabulary. After the word vectors are obtained through the Skip-gram model, the document vector is calculated by averaging the vectors of the words contained in the document.
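The averaging step that turns word vectors into a document vector can be sketched as follows. The sketch assumes that the trained vectors are available as a plain dictionary and that words missing from the vocabulary are skipped; the out-of-vocabulary handling, the function name, and the toy two-dimensional vectors are assumptions, not details given in the text.

```python
def document_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens found in word_vectors.

    tokens: list of segmented words of one document.
    word_vectors: dict mapping a word to a list of `dim` floats.
    Returns a list of `dim` floats (all zeros if no token is known).
    """
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # assumption: out-of-vocabulary words are ignored
        for i, value in enumerate(vector):
            total[i] += value
        count += 1
    if count == 0:
        return total
    return [value / count for value in total]


# Toy 2-dimensional vectors, for illustration only.
toy_word_vectors = {"a": [1.0, 0.0], "b": [3.0, 2.0]}
print(document_vector(["a", "b", "unknown"], toy_word_vectors, dim=2))
```

In the experiments of this paper, `dim` would correspond to the word vector dimension (for example, 100), and the resulting document vector is the feature vector passed to the multilabel classifiers.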

4.2. Multilabel Classification.
In the training set $\{(x_1, Y_1), (x_2, Y_2), \ldots, (x_m, Y_m)\}$, each instance $x_i$ is a d-dimensional feature vector and $Y_i \subseteq \mathcal{Y}$ is the set of labels associated with this instance. The original error function of the traditional multilayer feed-forward neural networks is defined as follows:

$$E = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} \sum_{j=1}^{Q} \left(c_j^i - d_j^i\right)^2 \tag{7}$$

where $E_i$ is the error of the network on $x_i$, $c_j^i = c_j(x_i)$ is the actual output of the network on $x_i$ for the jth class, $d_j^i$ is the desired output of $x_i$ for the jth class, and $Q = |\mathcal{Y}|$ is the number of class labels. In (7), it is assumed that each class label is independent, and the relationships between labels are not considered. Zhang et al. [21] changed the original error function and turned the traditional multilayer feed-forward neural network into BP-MLL. The new error function is as follows:

$$E = \sum_{i=1}^{m} E_i = \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|} \sum_{(k,l) \in Y_i \times \bar{Y}_i} \exp\left(-\left(c_k^i - c_l^i\right)\right) \tag{8}$$

where $\bar{Y}_i$ is the complementary set of $Y_i$ in $\mathcal{Y}$ and |·| measures the cardinality of a set. Specifically, $c_k^i - c_l^i$ measures the difference between the outputs of the network on one label belonging to $x_i$ ($k \in Y_i$) and one label not belonging to it ($l \in \bar{Y}_i$) [21]. Therefore, the minimization of (8) leads the system to output larger values for labels belonging to the training instance and smaller values for those not belonging to it.

The experiments are designed around three factors. First, as shown in Figure 2, the frequency of diagnostic labels has an uneven distribution, and the proportion of low-frequency labels is high. Therefore, the experiments are performed on label sets filtered by different frequencies. Second, the LDA is used to extract features, and the number of features has an impact on the experimental results. Therefore, LDA models with different numbers of topics are investigated. Third, the number of word vector dimensions in the Skip-gram model also influences the experimental results. Therefore, experiments with different word vector dimensions are also conducted.
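The pairwise error that BP-MLL [21] minimizes can be computed directly for given network outputs. The sketch below evaluates, for each instance, the average of exp(-(c_k - c_l)) over all pairs of one relevant label k and one irrelevant label l, and sums these values over instances. Only the error value is computed, not the backpropagation updates; the function names and the handling of instances with no relevant or no irrelevant labels are assumptions.

```python
import math


def bpmll_instance_error(outputs, relevant):
    """Pairwise exponential error E_i of one instance.

    outputs: list of network outputs, one per label.
    relevant: set of label indices that belong to the instance (Y_i).
    """
    irrelevant = [j for j in range(len(outputs)) if j not in relevant]
    if not relevant or not irrelevant:
        return 0.0  # assumption: instances without any label pair contribute nothing
    total = sum(
        math.exp(-(outputs[k] - outputs[l]))
        for k in relevant
        for l in irrelevant
    )
    return total / (len(relevant) * len(irrelevant))


def bpmll_error(batch):
    """Sum of E_i over (outputs, relevant) pairs."""
    return sum(bpmll_instance_error(outputs, relevant) for outputs, relevant in batch)


# Labels 0 and 2 are relevant. The first output vector ranks them on top,
# the second ranks them at the bottom, so its error is larger.
good = bpmll_instance_error([0.9, -0.8, 0.7], {0, 2})
bad = bpmll_instance_error([-0.9, 0.8, -0.7], {0, 2})
print(good < bad)
```

Because every term shrinks as a relevant label's output rises above an irrelevant one's, minimizing this quantity pushes relevant labels above irrelevant ones in the ranking.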

Experiments
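There are three groups of experiments in this section. In the first group, the topic number of the LDA is set to 120, and the word vector dimension is set to 100. The experiments are conducted to compare the classification performance for different sizes of the label set. In the second and the third groups, the size of the diagnostic label set remains 71. The second group of experiments compares the results of different numbers of topics in the LDA method, and the third compares the results of various numbers of vector dimensions in the Skip-gram model.
Hamming loss, one-error, coverage, ranking loss, and average precision are used as evaluation indicators. They are given here in their standard form. Let the test set be $\{(x_i, Y_i)\}_{i=1}^{p}$, let Q be the number of class labels in $\mathcal{Y}$, let $h(x_i)$ be the label set predicted for $x_i$, let $f(x_i, y)$ be the real-valued output of the system for label y, and let $\mathrm{rank}_f(x_i, y)$ be the rank of y when all labels are sorted by descending $f(x_i, \cdot)$. Hamming loss (HL) is defined as follows:

$$\mathrm{HL} = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{Q} \left| h(x_i) \,\Delta\, Y_i \right| \tag{9}$$

where Δ is the symmetric difference between two sets. It evaluates the error rate between the true labels of the instance and the labels predicted by the system, that is, how often a label that belongs to the instance is not predicted or a label that does not belong to it is predicted. A smaller HL indicates a better classification effect.
One-error (OE) is defined as follows:

$$\mathrm{OE} = \frac{1}{p} \sum_{i=1}^{p} \left[\!\left[ \arg\max_{y \in \mathcal{Y}} f(x_i, y) \notin Y_i \right]\!\right] \tag{10}$$

where $[\![\cdot]\!]$ equals 1 if the condition holds and 0 otherwise. It evaluates how often the top-ranked label in the label ranking of an instance is not one of its true labels. In single-label learning, it reduces to the ordinary classification error rate. A smaller OE indicates a better classification effect.
Coverage (C) is defined as follows:

$$\mathrm{C} = \frac{1}{p} \sum_{i=1}^{p} \max_{y \in Y_i} \mathrm{rank}_f(x_i, y) - 1 \tag{11}$$

It evaluates how far, on average, one has to go down the label ranking of an instance to cover all of its true labels. A smaller C indicates a better classification effect.
Ranking loss (RL) is defined as follows:

$$\mathrm{RL} = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|\,|\bar{Y}_i|} \left| \left\{ (y_1, y_2) \in Y_i \times \bar{Y}_i \;:\; f(x_i, y_1) \le f(x_i, y_2) \right\} \right| \tag{12}$$

where $\bar{Y}_i$ is the complementary set of $Y_i$ in $\mathcal{Y}$. It evaluates the fraction of label pairs that are ranked in the wrong order, that is, pairs in which a label that belongs to the instance is ranked below a label that does not. A smaller RL indicates a better classification effect. Average precision (AP) is defined as follows:

$$\mathrm{AP} = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{\left| \left\{ y' \in Y_i \;:\; \mathrm{rank}_f(x_i, y') \le \mathrm{rank}_f(x_i, y) \right\} \right|}{\mathrm{rank}_f(x_i, y)} \tag{13}$$

It evaluates, for each true label of an instance, the fraction of labels ranked at or above it that are also true labels. It reflects the average accuracy of the predicted label ranking. A higher AP indicates a better classification effect.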
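The five indicators can be computed with a few lines of plain Python. The sketch below follows the standard definitions of Hamming loss, one-error, coverage, ranking loss, and average precision; it assumes that labels are integer indices, that each instance has at least one true label (as is the case in this dataset), and that ties in the scores are broken by label index. All function names are hypothetical.

```python
def _ranks(scores):
    """Map each label index to its rank (1 = highest score, ties by index)."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return {label: rank for rank, label in enumerate(order, start=1)}


def hamming_loss(pred_sets, true_sets, n_labels):
    """Average fraction of labels that are wrongly predicted or missed."""
    wrong = sum(len(pred ^ true) for pred, true in zip(pred_sets, true_sets))
    return wrong / (len(true_sets) * n_labels)


def one_error(score_lists, true_sets):
    """Fraction of instances whose top-ranked label is not a true label."""
    errors = 0
    for scores, true in zip(score_lists, true_sets):
        top = max(range(len(scores)), key=lambda j: scores[j])
        errors += int(top not in true)
    return errors / len(true_sets)


def coverage(score_lists, true_sets):
    """Average depth in the ranking needed to cover all true labels, minus 1."""
    total = 0
    for scores, true in zip(score_lists, true_sets):
        ranks = _ranks(scores)
        total += max(ranks[y] for y in true) - 1
    return total / len(true_sets)


def ranking_loss(score_lists, true_sets):
    """Average fraction of (true, other) label pairs ranked in the wrong order."""
    total = 0.0
    for scores, true in zip(score_lists, true_sets):
        others = [j for j in range(len(scores)) if j not in true]
        if not true or not others:
            continue  # no pairs to compare for this instance
        wrong = sum(1 for k in true for l in others if scores[k] <= scores[l])
        total += wrong / (len(true) * len(others))
    return total / len(true_sets)


def average_precision(score_lists, true_sets):
    """Average, over true labels, of the precision at that label's rank."""
    total = 0.0
    for scores, true in zip(score_lists, true_sets):
        ranks = _ranks(scores)
        per_label = sum(
            sum(1 for other in true if ranks[other] <= ranks[y]) / ranks[y]
            for y in true
        )
        total += per_label / len(true)
    return total / len(true_sets)


# One toy instance with four labels; labels 0 and 2 are the true ones.
scores = [[0.9, 0.2, 0.6, 0.1]]
truth = [{0, 2}]
print(one_error(scores, truth), coverage(scores, truth), average_precision(scores, truth))
```

In the toy instance the two true labels occupy the top two ranks, so one-error and ranking loss are zero and average precision is one. Hamming loss is the only indicator that needs a predicted label set rather than scores.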

5.1. Experimental Results on the Different Sizes of the Label Set. In this group of experiments, the LDA topic number K is set to 120 and the word vector dimension T is set to 100. First, the full label set $L_2$, which contains all 233 class labels in the dataset, is used. The results are shown in Table 1. In the table, for each criterion, "↓" indicates "the smaller the better," while "↑" indicates "the bigger the better." It can be seen that for all indicators, the experiments using the word vector feature obtain the best results. MLkNN achieves the best results for HL, OE, and AP. BP-MLL achieves the best results for RL and C. Moreover, BP-MLL also ranks second in terms of the other three indicators.
As seen from Table 1, MLkNN using the word vector feature obtains the best result, but its AP is only 0.7272 ± 0.0081. According to the results shown in Figure 2, there are 80 diagnostic labels whose frequency is only 1, and 82 diagnostic labels whose frequency is between 2 and 10. This adds up to a total of 162. The analysis of these labels reveals three different situations. First, as the EMRs have not been classified, the labels are taken from all obstetric inpatients, and some of the diagnoses are atypical for obstetrics, such as obesity, allergic dermatitis, and others. Second, because of different writing habits, some doctors write down a diagnosis, such as "single pregnancy," that most doctors rarely write in a normal record. Third, some of the diagnoses are relatively rare, such as "fetal nasal bone loss." Labels that appear only once in a dataset of 11,303 instances cause data sparseness to a certain extent. Therefore, these labels are deleted, and the remaining labels form label set $L_3$, which contains 153 class labels. The experimental results are shown in Table 2. It can be seen that MLkNN and BP-MLL still have the best performance in each of the evaluation indicators, and the AP of BP-MLL has increased by nearly 3 percent.
To further reduce the sparseness of the data, the labels whose frequencies are no more than 10 are also deleted from the label set. The remaining labels form the label set $L_4$, which contains 71 class labels. The experimental results are shown in Table 3. It can be seen that MLkNN and BP-MLL still have the best performance in each of the evaluation indicators, and the AP of BP-MLL is as high as 0.7413 ± 0.0100 when using the word vector feature. In general, the results keep improving as the label set size decreases. Whether the size of the label set is 233, 153, or 71, the experimental results using the word vector as a feature are all better than those using LDA topics. A possible reason lies in how the two models work. The word representations computed using the Skip-gram model explicitly encode many linguistic regularities and patterns, while the LDA topic model is a bag-of-words model that may ignore the relationships between words.
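The construction of the smaller label sets by frequency filtering can be sketched as follows. The sketch assumes that each record's diagnoses are stored as a set of label strings; the function name and toy data are hypothetical, and how to treat a record that loses all of its labels is left open because the text does not specify it.

```python
from collections import Counter


def prune_rare_labels(label_sets, min_count):
    """Keep only labels that occur in at least `min_count` records.

    label_sets: list of sets of labels, one set per record.
    Returns (kept_labels, pruned_label_sets); the pruned list stays
    aligned with the input, so a record may end up with an empty set.
    """
    frequency = Counter(label for labels in label_sets for label in labels)
    kept = {label for label, count in frequency.items() if count >= min_count}
    pruned = [set(labels) & kept for labels in label_sets]
    return kept, pruned


# Toy records, for illustration only: label "c" occurs once and is removed.
toy_records = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
kept, pruned = prune_rare_labels(toy_records, min_count=2)
print(sorted(kept), pruned)
```

With `min_count=2`, labels that appear only once are removed, which mirrors the step from 233 to 153 labels; a larger threshold mirrors the further reduction to 71 labels.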

5.2. Experiment Results on Different Number of Topics.
As described in Section 4.1.1, the number of topics K must be given before the LDA model is trained. Since the number of topics selected in the above experiments is 120, values of K around 120 are examined. Thus, 100, 110, 130, and 140 are selected, and each is compared with K = 120. The purpose of this experiment is to study the effect of the topic number on classification with LDA features. The results for AP are shown in Figure 4, where the abscissa is the number of topics and the ordinate is the average precision of each method for that number of topics. It can be seen from Figure 4 that as the number of topics in the LDA grows, the average precision of RAkEL, MLkNN, and BP-MLL remains roughly the same. The exception is CC: its average precision drops, with its highest point at approximately 120 topics. The overall effects of MLkNN and BP-MLL are better than the other two algorithms. MLkNN is better than BP-MLL at both ends of the curve, but in the middle part, BP-MLL is better than MLkNN.

5.3. Experiment Results on Different Number of Word Vector Dimensions. The word vector dimension also affects the results. Vector dimensions T of 10, 100, 200, 300, 400, and 500 are selected. The results for AP are shown in Figure 5, where the abscissa is the word vector dimension and the ordinate is the average precision of each method for that dimension. It can be seen from Figure 5 that as the vector dimension grows, the AP of RAkEL, MLkNN, and BP-MLL tends to increase and the AP of CC drops. When the dimension is more than 100, the curve flattens, but the time consumption increases greatly. The overall effects of MLkNN and BP-MLL are better than the other two algorithms. MLkNN is better than BP-MLL at both ends of the curve, but in the middle part, BP-MLL is better than MLkNN. Taking both effectiveness and efficiency into consideration, a vector dimension of 100 is the best choice.

Conclusion
In this paper, on the basis of the analysis of obstetric EMRs, the diagnosis assistant is regarded as a multilabel classification task. The LDA topics and the word vectors trained by the Skip-gram model are adopted as the features, and four methods, BP-MLL, RAkEL, MLkNN, and CC, are utilized for multilabel classification. The paper also discusses the influence of the size of the label set, the number of LDA topics, the word vector dimension, and the choice of classifier on the experimental results. In general, the results using word vectors as features are slightly better than those using LDA topics. The best result is achieved by BP-MLL with the word vector feature. Its AP is up to 0.7413 ± 0.0100 when the label set size is 71 and the dimension of the word vector is 100. The result of the diagnosis assistant can be introduced as a supplementary learning method for medical students. In this paper, the experiments are conducted on real cases of Chinese obstetric EMRs, but the methods can also be applied to other kinds of medical records. Furthermore, by treating the diagnosis assistant as multilabel classification, the method proposed in this paper can be applied to English EMRs.
As the discussion in this paper shows, the different features and classification methods affect the experimental results to varying extents. In future work, we will focus more on combining the extracted indicators with the help of clinicians to improve model performance. As for the multilabel classification, we will carry out a theoretical analysis of the performance differences between classifiers and then propose pertinent methods to get better results. It is expected that the diagnosis assistant can provide efficient assistance to clinicians.

Conflicts of Interest
The authors declare that they have no conflicts of interest.