Deep Learning Multi-label Tongue Image Analysis and Its Application in a Population Undergoing Routine Medical Checkup

Background Research on intelligent tongue diagnosis is a main direction in the modernization of tongue diagnosis technology. Identification of tongue shape and texture features is a difficult task in traditional Chinese medicine (TCM) tongue diagnosis. This study aimed to explore the application of deep learning techniques in tongue image analysis. Methods A total of 8676 tongue images were annotated by clinical experts into seven categories: fissured tongue, tooth-marked tongue, stasis tongue, spotted tongue, greasy coating, peeled coating, and rotten coating. Based on the labeled tongue images, a deep learning model, the faster region-based convolutional neural network (Faster R-CNN), was utilized to classify tongue images. Four performance indices, i.e., accuracy, recall, precision, and F1-score, were selected to evaluate the model. We then applied the model to the tongue images of 3601 medical checkup participants in order to explore gender and age factors and the correlations between tongue features and diseases through complex networks. Results The average accuracy, recall, precision, and F1-score of our model reached 90.67%, 91.25%, 99.28%, and 95.00%, respectively. Over the tongue images from the medical checkup population, the Faster R-CNN model detected 41.49% fissured tongue images, 37.16% tooth-marked tongue images, 29.66% greasy coating images, 18.66% spotted tongue images, 9.97% stasis tongue images, 3.97% peeled coating images, and 1.22% rotten coating images. There were significant differences in the incidence of fissured tongue, tooth-marked tongue, spotted tongue, and greasy coating across age and gender groups. Complex networks revealed that fissured tongue and tooth-marked tongue were closely related to hypertension, dyslipidemia, overweight, and nonalcoholic fatty liver disease (NAFLD), and that a greasy coating was associated with hypertension and overweight.
Conclusion The Faster R-CNN model shows good performance in tongue image classification. We have also preliminarily revealed the relationship between tongue features and gender, age, and metabolic diseases in a medical checkup population.


Introduction
Tongue inspection is the most common, intuitive, and effective diagnostic method of traditional Chinese medicine (TCM) [1]. Recent TCM research has realized measurable, digitized color features of tongue images by means of color space parameters such as RGB, Lab, and HSI [2][3][4].
However, the quantification of the shape and texture of tongue images remains a difficult point in tongue diagnosis, and much attention has focused on automatic recognition methods for these features. Obafemi-Ajayi et al. [5] proposed a feature extraction method for automated tongue shape classification based on geometric features and polynomial equations. Yang et al. [6] extracted cracks by applying the G component of the false-color image in RGB color space, with a detection accuracy of 82.00%. The Douglas-Peucker algorithm was implemented to extract features of the tooth-marked tongue and achieved an accuracy of 80% [7]. Xu et al. [8] used an RGB color range and a gray mean value of acantha and ecchymosis in tongue patterns, with an overall recognition accuracy of 77.10%. Wang et al. [9] realized prickle extraction on the green channel of the tongue image, with an accuracy of 88.47%. Yet, owing to the complex and diverse tongue features, classical image processing methods suffer from time- and space-consuming algorithms, difficulties in automated high-throughput processing, and weak transferability in correlation research [10][11][12], which make comprehensive analysis of tongue images unavailable.
Intelligent diagnosis based on images is a main direction of the modernization of tongue diagnosis technology [13]. As the current mainstream technology, the convolutional neural network (CNN) has a powerful capability of feature extraction and representation [14,15], which greatly improves the accuracy and efficiency of tongue image segmentation and classification [16][17][18][19][20]. For example, Chen's team utilized a deep residual neural network (ResNet) to identify the tooth-marked tongue, with an accuracy of over 90% [21]. Xu et al. [22] proposed a CNN model combining a U-shaped network (U-Net) and discriminative filter learning (DFL) for the classification and recognition of different types of tongue coating, achieving an F1-score of 93%. Research on the recognition and classification of the tooth-marked tongue [23] and cracked tongue [24] has significantly improved the accuracy of tongue image identification.
However, tongue images have multi-label attributes (Figure 1(a)). Although the classical CNN shows good recognition performance for single tongue features such as tooth marks or fissures (Figure 1(b)), a multi-CNN fusion model has no apparent superiority in the multi-label classification of tongue images with diverse features (Figure 1(c)). Under nonparallel conditions, multiple CNN models require huge amounts of space and time. The classical CNN model fails to accurately identify, locate, and quantify the complex and diverse fine-grained features of tongue images simultaneously, and it is difficult to achieve efficient detection and recognition of tongue images in parallel.
Object detection is a technique for finding a specific object in an image and determining its position. As one of the mainstream neural networks for object detection, the faster region-based convolutional neural network (Faster R-CNN) [25] can perform multi-label recognition with a single model, thus reducing the cost of training multiple models. Here, we utilized Faster R-CNN and a fine-tuning method to extract local features of tongue images and learn high-level semantic features. Targeting 7 categories of tongue shape and texture in TCM, ResNet [26] was used as the backbone network for feature extraction to construct a deep learning model.
In this research, we constructed a standard database for training, testing, and validation; realized the efficient and accurate classification and recognition of local features of tongue images; and applied the model to a population undergoing medical checkups with Chinese medicine, in order to reveal the association of tongue image features with diseases.

Materials and Methods
We proposed a deep learning multi-label tongue image model based on Faster R-CNN. A total of 8676 tongue images were collected to train and test the proposed model. The collected tongue images, annotated by experts, were divided into seven categories. Furthermore, this approach was applied to a population undergoing medical checkups with Chinese medicine. The specific process of this study is shown in Figure 2. As shown in Figure 3, all tongue images were acquired using the TFDA-1 and TDA-1 tongue diagnosis instruments designed by Xu's team at Shanghai University of TCM. The instruments were equipped with unified CCD equipment and a standard D50 light source with a color temperature of 5003 K and a color rendering index of 97 [27]. Tongue images were obtained from September 2017 to December 2018 at Shuguang Hospital. The raw tongue image size was 5568 × 3711 pixels in JPG format. To reduce the amount of deep learning computation and eliminate interference from regions other than the tongue body, all tongue images were automatically cropped to 400 × 400 pixels by Mask R-CNN.

Tongue Image Labeling and Datasets Construction.
All tongue image labels were evaluated and screened by 10 TCM experts with normal vision and reported normal color vision [28]. To avoid chromatic differences from the monitor, the experts interpreted and screened images under uniform conditions on a 27-inch Apple Cinema HD monitor. With reference to the diagnostic criteria of tongue image features [1,29], tongue images were divided into seven categories. Two experts annotated all 8676 tongue images into seven folders; example samples of each typical tongue image are shown in Figure 2(b). The other eight experts each checked the labeled folders, and a label was retained only when at least 8 of the 10 experts agreed. Images with an inconsistent diagnosis were excluded from this research. The datasets for Faster R-CNN were in the MS COCO format, the most popular standard format in the field of object detection [30]. We used LabelImg (version 1.8.1) to annotate the regions of interest for shape and texture on the tongue image. The annotations were confirmed by the experts, and the process interface is shown in Figure 2(c). Then, the generated "XML" annotation files were converted into "JSON" format files using Python (version 3.6).
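The XML-to-COCO conversion step can be sketched as below. This is a minimal illustration, not the authors' actual script; the function name and label-to-id mapping are hypothetical, and only the COCO fields needed for detection (images, annotations, categories) are emitted.

```python
import json
import xml.etree.ElementTree as ET

def voc_xml_to_coco(xml_paths, category_ids):
    """Convert LabelImg (Pascal VOC style) XML files into one MS COCO dict.

    `category_ids` maps label names (e.g. "fissured") to integer ids.
    """
    coco = {"images": [], "annotations": [],
            "categories": [{"id": cid, "name": name}
                           for name, cid in category_ids.items()]}
    ann_id = 1
    for img_id, path in enumerate(xml_paths, start=1):
        root = ET.parse(path).getroot()
        size = root.find("size")
        coco["images"].append({
            "id": img_id,
            "file_name": root.findtext("filename"),
            "width": int(size.findtext("width")),
            "height": int(size.findtext("height")),
        })
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            x1, y1 = float(box.findtext("xmin")), float(box.findtext("ymin"))
            x2, y2 = float(box.findtext("xmax")), float(box.findtext("ymax"))
            # COCO bounding boxes are [x, y, width, height]
            coco["annotations"].append({
                "id": ann_id, "image_id": img_id,
                "category_id": category_ids[obj.findtext("name")],
                "bbox": [x1, y1, x2 - x1, y2 - y1],
                "area": (x2 - x1) * (y2 - y1), "iscrowd": 0,
            })
            ann_id += 1
    return coco

# The resulting dict can then be written out with json.dump(coco, f).
```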

Dataset Partition.
The constructed dataset was randomly partitioned into an 80% training set, a 10% validation set, and a 10% testing set. The numbers of training images and labels in the 7 categories used to build the Faster R-CNN model are shown in Table 1. In addition, the number of tongue images of each category in the testing set was equal to that in the validation set.
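The 80/10/10 partition can be sketched as follows; this is a minimal illustration (the seed and function name are hypothetical), not the authors' actual partitioning code.

```python
import random

def split_dataset(image_ids, train=0.8, val=0.1, seed=42):
    """Randomly shuffle image ids and partition them into
    training, validation, and testing subsets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(n * train)
    n_val = int(n * val)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

For 8676 images this yields 8676 × 0.8 = 6940 training images (after flooring), with the remainder split between validation and testing.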

Faster R-CNN Model Development for Recognizing Tongue Shape and Texture.
Figure 4(a) shows the network architecture of Faster R-CNN, which mainly consists of four parts: a convolution layer, a region proposal network, a region of interest (ROI) pooling layer, and a layer of classifier and regressor [31]. The backbone convolution layers of ResNet101, shown in Figure 4(b), extract feature maps from the input tongue images; the region proposal network (RPN), centered on each pixel of the feature maps, generates anchor boxes of different scales in the tongue images and filters them with non-maximum suppression; the ROI pooling layer computes feature maps for the region proposals; and the output feature maps of the ROI pooling layer are used for classification. Finally, an average pooling is applied, and the resulting features are used for classification and bounding box regression, respectively.
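The RPN's anchor generation can be sketched in pure Python. The scales and aspect ratios below are the defaults from the original Faster R-CNN paper, assumed here for illustration; the study does not state its anchor configuration.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2 and width/height ratio r,
    i.e. w = scale * sqrt(r) and h = scale / sqrt(r).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * r ** 0.5
            h = s / r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

At every feature-map location this yields 3 × 3 = 9 candidate boxes, which the RPN then scores and prunes with non-maximum suppression.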

Model Training, Validation, and Testing.
The Faster R-CNN model, based on the Caffe framework, was deployed on the Ubuntu operating system using open-source code and was trained in a computing environment with 4 NVIDIA GTX 1080Ti GPUs, a 12-thread Intel Core i7-6850K CPU, and 128 GB of DDR4 RAM. The model training, validation, and testing were conducted as follows. First, the Faster R-CNN network was fine-tuned on the tongue image dataset for 40000 iterations with the stochastic gradient descent (SGD) optimizer, a learning rate of 0.03, a weight decay of 0.0001, a momentum of 0.9, a gamma value of 0.1, and a batch size of 128. Detailed initial parameters are shown in Table 2.
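The gamma value of 0.1 implies a Caffe-style step-decay learning-rate policy, which can be sketched as below. The `step_size` parameter is hypothetical, since the paper does not report the decay interval.

```python
def step_lr(base_lr, iteration, step_size, gamma=0.1):
    """Caffe "step" policy: multiply the learning rate by `gamma`
    once every `step_size` iterations."""
    return base_lr * gamma ** (iteration // step_size)
```

For example, with a base rate of 0.03 and a hypothetical step size of 30000, the rate would drop to 0.003 for the final 10000 of the 40000 iterations.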
Then, the tongue images and the marked position information were fed into the integrated Faster R-CNN model for training. In each training iteration, features were extracted; labels and box positions were predicted; and the losses (i.e., errors) between the predicted box positions and labels and the actual object positions and labels were calculated. The parameters were updated by backward error propagation. At the end of the training, a well-trained object detection model for TCM tongue images was obtained. Validation was performed during the training process: the results of models trained with different hyperparameters were collected and observed over the validation set, the state of the model was checked, and the hyperparameters were adjusted accordingly. When the accuracy on the validation set no longer increased, the training was stopped. The loss function for Faster R-CNN sums the classification loss and the regression loss, as defined in the following equation [24]:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*),    (1)

where N_cls and N_reg are the number of anchors in a minibatch and the number of anchor locations; i and λ denote the selected anchor box index and the balancing parameter; p_i and p_i* represent the predicted probability and the ground truth of the tongue feature; and t_i and t_i* represent the predicted bounding box and the actual tongue feature label box. The accuracy results and the loss changes during training are depicted in Figure 4. Finally, after adjusting the initial learning rate and making a comprehensive comparison, the model trained with a learning rate of 0.001 for 40000 iterations was selected as the final object detection model. The trained model was then applied to the testing set.
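As an illustration of equation (1), here is a minimal pure-Python sketch, assuming cross-entropy for L_cls and smooth L1 for L_reg as in the original Faster R-CNN formulation; this is not the authors' Caffe code.

```python
import math

def smooth_l1(x):
    """Smooth L1 loss on a single coordinate difference."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def faster_rcnn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=10.0):
    """Multi-task loss of equation (1): normalized log loss over anchor
    scores plus a lambda-weighted, normalized smooth-L1 regression loss
    that only counts positive anchors (p_i* = 1).

    p, p_star: predicted object probabilities / ground-truth labels per anchor
    t, t_star: predicted / ground-truth box offsets (4-tuples per anchor)
    """
    l_cls = sum(-(ps * math.log(pi) + (1 - ps) * math.log(1 - pi))
                for pi, ps in zip(p, p_star)) / n_cls
    l_reg = sum(ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
                for ti, ts, ps in zip(t, t_star, p_star)) / n_reg
    return l_cls + lam * l_reg
```

The p_i* factor in the regression term means anchors with no ground-truth object contribute only to the classification loss, matching the equation above.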

Strategies for the Prevention of Overfitting.
In this study, two means, regularization and dropout, were deployed to prevent overfitting. In the training process, L2 regularization was leveraged to constrain the weight estimates and thereby help prevent overfitting [32]. In addition, dropout was applied when training the last several classification layers of the Faster R-CNN network: convolution kernels were randomly deactivated during training [33], the importance of the convolution kernels in the classification layers was dynamically balanced, and the overfitting phenomenon was thereby alleviated.

Evaluation Metrics. Four performance indices, i.e., accuracy (2), recall (3), precision (4), and F1-score (5), were selected as metrics to evaluate the performance of Faster R-CNN in the multiclass classification of tongue images [34][35][36][37]. True positive (TP) means that the expert's conclusion and the result of object detection are the same; false negative (FN) means that an existing tongue feature category was not detected; false positive (FP) means that the detection algorithm assigned a tongue feature to a category it does not belong to; and true negative (TN) means that, for a tongue image not belonging to a certain category, the detection algorithm agreed with the expert conclusion. The indices are defined as

Accuracy = (TP + TN)/(TP + TN + FP + FN),    (2)
Recall = TP/(TP + FN),    (3)
Precision = TP/(TP + FP),    (4)
F1-score = 2 × Precision × Recall/(Precision + Recall).    (5)

Macro-averaged measures of the above indices were calculated for the Faster R-CNN model with respect to the 7-class classification of tongue images.
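Equations (2)–(5) and their macro averaging can be sketched as follows; the function names are illustrative, not from the authors' code.

```python
def per_class_metrics(tp, fp, fn, tn):
    """Accuracy (2), recall (3), precision (4), and F1-score (5)
    from one class's confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

def macro_average(count_sets):
    """Macro average: compute the metrics per class, then take the
    unweighted mean of each metric over all classes."""
    rows = [per_class_metrics(*c) for c in count_sets]
    return tuple(sum(col) / len(rows) for col in zip(*rows))
```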

Application of Tongue Image Detection Model.
The Faster R-CNN model obtained from the above training was applied to a population undergoing routine medical checkups with Chinese medicine to explore the association between tongue features and diseases. All samples were collected from January 2019 to December 2019. A total of 3601 subjects were included at the physical examination center of the Eastern Hospital of Shuguang Hospital affiliated with Shanghai University of TCM. We excluded women who were pregnant or nursing and those who could not cooperate with the researchers. All volunteers signed informed consent; all subjects completed routine medical checkups and simultaneously had their tongue images captured with the TFDA-1 tongue diagnosis instrument.
All tongue images were analyzed by the trained Faster R-CNN model. All analysis and test results were verified a second time by experts, and consistent results were confirmed; when results were inconsistent, a comprehensive analysis result was adopted. The indicators of the shape and texture features of tongue images were classified into two categories. Doctors at the physical examination center of Shuguang Hospital affiliated with Shanghai University of TCM made diagnoses with reference to the corresponding clinical guidelines, targeting the common and frequently occurring diseases in the medical checkup population.

Statistical Analysis Methods.
Excel and Python 3.6 were used for data matching, merging, and sorting. The tongue image features were described by percentage (%) and were compared using the Pearson χ² test. Statistical analysis was performed using IBM SPSS Statistics for Windows, version 25 (IBM Corp., Armonk, N.Y., USA). All results were compared using a two-tailed t-test, and differences were considered statistically significant when P < 0.05.
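The Pearson χ² statistic used for the group comparisons can be computed from a contingency table as below; this is a minimal sketch of the statistic itself, not the SPSS implementation (which also reports the corresponding P value).

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic for an r x c contingency table,
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            chi2 += (obs - expected) ** 2 / expected
    return chi2
```

For a 2 × 2 table (e.g. feature present/absent by gender), the statistic is compared against the χ² distribution with 1 degree of freedom.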
A complex network constructed by the improved node contraction method [38][39][40] is a weighted network in which the degree and the weight of the edges are based on the obtained node importance, and the edge weights of the weighted network were defined accordingly. The visualization tool Python NetworkX [41] was used to store the constructed network in the form of an adjacency matrix and triples, and the complex network diagram was built with diseases and tongue image features as nodes.
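Building such a weighted network from (feature, disease, weight) triples can be sketched with NetworkX as below; the example weights are those reported in Table 6, and the function name is illustrative.

```python
import networkx as nx

def build_feature_disease_network(weighted_edges):
    """Build an undirected weighted graph with tongue features and
    diseases as nodes, from (feature, disease, weight) triples."""
    G = nx.Graph()
    G.add_weighted_edges_from(weighted_edges)
    return G

# Example triples taken from Table 6 of the study.
edges = [("fissured tongue", "hypertension", 0.974),
         ("fissured tongue", "dyslipidemia", 0.812),
         ("tooth-marked tongue", "hypertension", 0.786),
         ("greasy coating", "hypertension", 0.649)]
G = build_feature_disease_network(edges)
```

The graph can then be exported as an adjacency matrix (e.g. with `nx.to_numpy_array`) or drawn with `nx.draw` for the network diagram.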

Tongue Image Detection Over Testing Set.
On our testing set, the average accuracy of the model reached 90.67%, with a precision of 99.28%, a recall of 91.27%, and an F1-score of 95.00%, indicating that the model had a good detection effect and could accomplish the multi-object recognition task well, as shown in Table 3.
Our method detected tongue shape and texture features at different scales and ratios. Figure 5 shows images with three or more different tongue features: in (i), a greasy coating, tooth marks, and stasis were detected simultaneously; in (j), a greasy coating, tooth marks, and spots; in (k), a peeled coating, fissures, and stasis; and in (l), a greasy coating, tooth marks, fissures, and stasis.

Distribution of Tongue Image Features in the Medical Checkup Population.
The tongue images were input into the established optimal Faster R-CNN intelligent tongue diagnosis analysis model, which detected 1494 cases (41.49%) of fissured tongue, 1338 cases (37.16%) of tooth-marked tongue, 1068 cases (29.66%) of greasy coating, 672 cases (18.66%) of spotted tongue, 359 cases (9.97%) of stasis tongue, 143 cases (3.97%) of peeled coating, and 44 cases (1.22%) of rotten coating, as shown in Figure 6.

Statistics of Tongue Image Features on the Gender Factors and the Age Factors.
The proportions of fissured tongue, tooth-marked tongue, and greasy coating in the male group were higher than those in the female group (P < 0.001), whereas the proportions of spotted tongue and stasis tongue in females were significantly higher than those in males (P < 0.001). There was no significant difference between the two groups in the proportions of peeled coating and rotten coating, as shown in Table 4 and Figures 7(a) and 7(b). The results illustrated that there were significant differences in the incidence of fissured tongue, tooth-marked tongue, spotted tongue, greasy coating, and rotten coating among the four age gradients, but no significant difference in the incidence of stasis tongue and peeled coating. Overall, with increasing age, the incidence of fissured tongue and greasy coating increased gradually, while the incidence of spotted tongue and tooth-marked tongue decreased gradually, as shown in Table 5 and Figures 7(c) and 7(d).


Correlation Analysis among Tongue Features and Diseases Based on Complex Networks.
Overall, the tongue features of diseases in the medical checkup population were mainly characterized by increased fissures, tooth marks, and greasy coating. Table 6 shows the ten highest-weighted relationships between tongue features and diseases. Fissured tongue, tooth-marked tongue, and greasy coating were most closely related to glucolipid metabolic diseases. Specifically, the fissured tongue had the highest weight with hypertension, reaching 0.974, and its weights for dyslipidemia, overweight, and NAFLD were 0.812, 0.799, and 0.775, respectively. For the tooth-marked tongue, the weights for hypertension, dyslipidemia, overweight, and NAFLD were 0.786, 0.649, 0.639, and 0.623, respectively. For greasy coating, the weights for hypertension and overweight were 0.649 and 0.540, respectively. As shown in Figure 8, greasy coating, tooth-marked tongue, and fissured tongue were more closely related to hypertension, dyslipidemia, NAFLD, and overweight.

Discussion
Intelligent tongue diagnosis is an important part of clinical TCM diagnosis. Researchers have applied tongue image features extracted by deep learning to diabetes mellitus [4,42,43], NAFLD [44], lung cancer-assisted diagnosis [45], and TCM constitution recognition [46][47][48], with good disease classification performance [13,49]. Professor Yang Junlin's team [50] applied an AI screening system for scoliosis developed with Faster R-CNN and quantified the severity of scoliosis, with accuracy reaching the average level of human experts. Tang et al. [51] proposed a tongue image classification model based on a multitask CNN, and the classification accuracy reached 98.33%. However, due to the small sample sizes, the advantages of deep learning methods could not be brought into full play, and tongue features such as rotten or greasy coating, spots, stasis, dryness, and thickness remained unexplored [52]. Liu et al. [53] applied Faster R-CNN to identify tooth-marked and fissured tongues, with accuracies of 0.960 for fissured tongues and 0.860 for tooth-marked tongues. That research involved only tooth marks and fissures because of the small sample size, so the advantages of the deep learning multi-label object detection model were not fully exerted.
Compared with the tongue classification model constructed by the classical CNN, Faster R-CNN as a highly integrated and end-to-end model is still the mainstream object detection neural network at present [54][55][56].
In our research, we focused on the categories of tongue image features rather than their precise positions, so we applied object detection to the multiclass recognition problem of tongue features. Our Faster R-CNN-based tongue feature detection model had good generalization ability. With the unique advantages of deep learning and transfer learning in identifying the shape and texture features of tongue images, it can realize automatic high-throughput processing, better solve the problems of local tongue image recognition, and integrate the identification and annotation of tongue images, with a good visualization effect. Tongue features are complex and diverse [57], and the correlations between them and the occurrence and progression of diseases are unknown. In this study, applying Faster R-CNN-based tongue feature diagnosis to a population undergoing routine medical checkups was a beneficial attempt to mine the implicit information linking TCM tongue images and diseases through a complex network [40]. The intelligent diagnostic analysis of the 3601 medical checkup participants showed incidences of 41.49% for fissured tongue, 37.16% for tooth-marked tongue, 29.66% for greasy coating, 18.66% for spotted tongue, 9.97% for stasis tongue, 3.97% for peeled coating, and 1.22% for rotten coating. The incidence of fissures, tooth marks, and greasy coating in men was higher than in women, and the incidence of spotted tongue and stasis tongue in women was significantly higher than in men, which may be related to deficiency of spleen qi, essence, and blood in male subjects and excessive blood heat in female subjects. With increasing age, the incidence of fissured tongue and greasy coating increased, while the incidence of spotted tongue and tooth-marked tongue decreased, which may be related to the tendency toward both qi and yin deficiency in the elderly and excess syndrome in the young.
In populations with glucose and lipid metabolic diseases such as fatty liver and metabolic syndrome, fissures and greasy coating increased, which may be related to pathogeneses of glucose and lipid metabolism such as deficiency of qi and yin and dampness. These results are consistent with the clinical practice of TCM [58].
Although the method has some advantages, our model also has limitations.
Firstly, we will conduct further research on the multiclass classification of tongue images in the future. The performance of other neural network models, such as VGGNet, ResNet, and DenseNet, will be explored in the task of tongue image classification.
Secondly, the tongue image object detection model still has to be optimized. Annotating large samples requires substantial labor. Tongue image data acquired with standardized technology have high stability, but their scalability is limited. Even though the user-facing visualization is good, the extracted features remain difficult to interpret [59]. More efficient model algorithms, such as unsupervised deep learning based on flow generation models [60] and a self-attention mechanism based on end-to-end object detection with transformers [61], could be used to further optimize and establish a robust intelligent tongue image diagnosis and analysis model.
Thirdly, our approach to the detection of tongue images is a qualitative model. However, the identification of tongue images in TCM clinics is complicated: it is not only a binary problem but also a quantification of pathological change. The changes in tongue image features are also of great value in the diagnosis of disease symptoms, which will be the focus of our subsequent research.

Conclusions
This study was a cross-sectional study of healthy people undergoing medical checkups. Furthermore, a case-control study will be carried out on patients with major chronic diseases in order to prove the value of tongue features in the diagnosis of disease. In addition, we will optimize the Faster R-CNN model with respect to the precise location of objects in a tongue image.
This paper presents a supervised deep learning method based on a large amount of labeled data. In the future, we will explore a more robust self-supervised deep learning model for the multiclassification of tongue features. The Faster R-CNN model shows good performance in tongue image classification, and we have preliminarily revealed the relationship between tongue features and gender, age, and metabolic diseases in a medical checkup population.
Data Availability

The datasets used and/or analyzed in this study are available upon reasonable request from the corresponding author.

Ethical Approval
This study was reviewed and approved by the Institutional Research Ethics Committee of Shuguang Hospital affiliated to Shanghai University of TCM (No. 2018-626-55-01). The clinical trial has been registered at the Chinese Clinical Trial Registry under registration number ChiCTR1900026008 (https://clinicaltrials.gov/ct2/show/ChiCTR1900026008).

Consent

The patients/participants provided their written informed consent to participate in this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
T.J. and J.X. conceptualized the study; X.H. and X.Y. developed the methodology; C.Z. designed the software; X.H. and X.Y. validated the study; J.H. performed the formal analysis; L.T. performed the investigation; J.C. collected resources; L.T. curated the data; T.J. and Z.L. wrote the original draft; T.J. and Z.L. reviewed and edited the manuscript; X.Y. performed the visualization; C.Z. supervised the study; X.M. and L.Z. administered the project; J.X. and T.J. acquired funding. All authors have read and agreed to the published version of the manuscript. Tao Jiang and Zhou Lu contributed equally to this work.