Deep Learning Based Tongue Prickles Detection in Traditional Chinese Medicine

Tongue diagnosis is a convenient and noninvasive clinical practice of traditional Chinese medicine (TCM), having existed for thousands of years. Prickle, as an essential indicator in TCM, appears as a large number of red thorns protruding from the tongue. The term “prickly tongue” has been used to describe the flow of qi and blood in TCM and assess the conditions of disease as well as the health status of subhealthy people. Different location and density of prickles indicate different symptoms. As proved by modern medical research, the prickles originate in the fungiform papillae, which are enlarged and protrude to form spikes like awn. Prickle recognition, however, is subjective, burdensome, and susceptible to external factors. To solve this issue, an end-to-end prickle detection workflow based on deep learning is proposed. First, raw tongue images are fed into the Swin Transformer to remove interference information. Then, segmented tongues are partitioned into four areas: root, center, tip, and margin. We manually labeled the prickles on 224 tongue images with the assistance of an OpenCV spot detector. After training on the labeled dataset, the super-resolutionfaster-RCNN extracts advanced tongue features and predicts the bounding box of each single prickle. We show the synergy of deep learning and TCM by achieving a 92.42% recall, which is 2.52% higher than the previous work. This work provides a quantitative perspective for symptoms and disease diagnosis according to tongue characteristics. Furthermore, it is convenient to transfer this portable model to detect petechiae or tooth-marks on tongue images.


Introduction
Based on the clinical practice of doctors, traditional Chinese medicine has been developed for thousands of years and has achieved brilliant results both in the past and in modern times. Tongue diagnosis, as a role of vital importance in TCM clinical diagnosis, is a convenient and noninvasive method based on the health status information carried by the appearance of the tongue [1]. e chromatic features and morphological characteristics of the tongue, the number of prickles and the form of tongue coating reveal the pathological changes of internal organs, as shown in Figure 1 [2,3]. ere have been reports that prickles are associated with tumors, kidney disease, gastric disease, and new crowns [4]. However, traditional tongue diagnosis is an empirical procedure that relies heavily on the personal experience and subjective judgment of TCM doctors. With the assistance of artificial intelligence (AI), tongue diagnosis will be objective and people without medical knowledge can give themselves a preliminary diagnosis of a health condition. In recent years, much effort has been spent on AI-based tongue diagnosis, especially in the field of tongue color recognition [5,6], tongue shape analysis [7], cracks segmentation [8], thickness, and moisture of tongue coating classification [9,10].
Prickle, also called red-pointe, appears as a large number of red thorns protruding from the tongue. It indicates blood heat or excess heat in the internal organs. e color and number of the prickles can help estimate the flow of qi and blood. As proved by the study, the prickles originate in the fungiform papillae, which are enlarged and protrude to form spikes like awn [11]. Shang et al. further analyzed the association of prickles with the gastric sinus [12]. On the one hand, prickles mean increased blood flow and thus congestion. On the other hand, prickles represent thermal burns to blood vessels, resulting in blood spillage and mucosal erythema. e automatic detection of prickles can not only release the burden of doctors but also enable patients without medical knowledge to give themselves a brief examination. ough AI-based tongue diagnosis has attracted a lot of attention, there is little literature on prickle detection due to its difficulty. In most tongue images, each prickle only occupies a few pixels and has little difference in tongue color under natural light. Moreover, the similarity between prickles and petechiae in both morphological and chromatic characteristics makes it a challenging task to distinguish them. Xu et al. were the first to introduce template feature matching to detect the prickles and petechiae, and then distinguished them based on RGB value range, gray average, and the position of the detected object [13]. Zhang employed the fuzzy C-means color cluster and noise reduction methods to detect prickles in the tongue edge image. Wang et al. used multistep threshold spot detection to detect prickles and petechiae. After extracting the features of spots (including prickles and petechiae), support vector machines and k-means were introduced to distinguish prickles and petechiae [11]. Wang et al. proposed a prickle detection method based on an auxiliary light source and a LOG operator edge detection method [14]. e last two works are most similar to our work, for they provided a quantitative description of prickles.
By eliminating background areas such as the face and lips, the tongue region segmentation can enhance the performance of downstream tasks, including prickle detection. Practice has proved that the neural network is very effective in the task of tongue segmentation. Zhang et al. introduced a DCNN-based tongue segmentation algorithm [15]. Wang et al. designed a coarse-to-fine segmentor based on RsNet and FsNet [16]. Jiangproposed an HSV enhanced CNN to segment the tongue region [17]. Zhang et al. combined superpixel with CNN to increase decoding performance [18].
ough previous researchers put much efforts into prickles and petechiae detection, the existing methods all rely heavily on manual parameter tuning. is not only adds to the burden of researchers but also causes the model to overfit to specific circumstances and equipment. Moreover, there is no end-to-end prickle detection method, which could provide a quantitative description of prickles without manually segmenting the tongue raw images. Finally, most methods only took gray values of the exact pixels into consideration and lose the color information and the context information around the prickles. e method proposed in this paper solved the question mentioned above from a completely new perspective: Deep Learning. We designed an end-to-end workflow to detect prickle automatically. e entire workflow and intermediate results are depicted in Figure 2.

Dataset Collection.
In this paper, the tongue images and segmentation annotations come from the bio-HIT tongue image dataset [19] (https://github.com/BioHit/TongeImageDataset). e tongue images dataset contains 300 RGB images with 576 × 768 pixels, and the images are obtained by the tongue image acquisition device shown in Figure 3. e device is designed as a semiclosed black box with a camera and illuminated on each side of the camera. e daylight illuminant D50, recommended by CIE (Commission Internationale de lEclairage) [20], was utilized as daylight illumination. According to the guidelines provided by CIE, the angle between the incident and outgoing rays is 45°. We elaborately screened out images with poor quality and got 224 images to train the model. e device has a closed image acquisition environment with an independent stable light source and a head restraint to ensure all the images are sampled to one standard. In addition, the image registration and calibration is not necessary either. Four volunteers in HIT elaborately annotated the image segmentation labels and the best one was chosen [19]. All the images were standardized to meet the standard normal distribution.

Tongue Segmentation.
e AI-assisted tongue diagnosis is based on the information obtained from the tongue image. When concentrated on the tongue, irrelevant elements, including the lips, cheek, and chin, distract I neural network. With interference eliminated and the tongue matted, the contour line of the tongue becomes apparent and the performance of feature extraction is guaranteed. erefore, it is necessary to segment the tongue before the next step, and we introduced the Swin Transformer [21] as the segmentor. e core concept of the Swin Transformer is self-attention, as shown in equations:  where the Q is the query vector, the K is the key vector, and the V is the value vector. W Q , W K , W V are all weight matrices. With the self-attention mechanism, the network can perceive global semantic information. e entire architecture of Swin Transformer is shown in Figure 4, where Z 0 , Z 2 , Z 4 , Z 6 , Z 8 , and Z 10 used multihead self-attention with regular windowing (W-MSA) and the others used multihead selfattention with shifted windowing (SW-MSA). Considering the fact that the dataset only contains 224 images, which is insufficient for training a network from scratch, we adopted a paradigmatic strategy in computer vision: pretraining and fine-tuning with data augmentation. Microsoft has trained the model with over 20,000 images on the ADE20K dataset [22], and we fine-tuned the model on the tongue segmentation dataset. e data augmentation pipeline includes flipping, cropping, and photometric distortion. Photometric distortion applies the following transformations with a probability of 0.5: random brightness, random contrast, color space converting, random saturation, random hue, and randomly swapping channels. With data augmentation, the model will be more robust when the illumination or sampling device varies. ough the neural network is able to classify each pixel as tongue or background, there is no guarantee that the segmented tongue region has structural integrity. To address this issue, we analyzed the connected components of each mask, filled the blank areas in the tongue and eliminated the outliers using two-pass connected component analysis [23]. e algorithm is shown in Algorithm 1 and the result is shown in Figure 5.

Prickles Annotation.
Usually, hundreds of prickles appear on the tongue in groups. erefore, it will be a challenging task if we manually annotate the whole dataset. In this paper, we employed partition spots detection [11] with LAB chromatic aberration filtering to give a primitive annotation of the prickles. e LAB color space is based on the human eye's perception of color and can represent all colors the human eye can perceive. "L" represents lightness, "A" represents red-  Figure 2: Prickle detection workflow. Blue rectangles represent images, and yellow rectangles represent image processing. First, in tongue segmentor, we introduced Swin Transformer, a state-of-the-art computer vision segmentation neural network, to mat the tongue region out of the raw picture. Second, in tongue partition, the tongue is partitioned into four areas: root, margin, tip, and center. ird, in data prelabeling, a spot detector is applied based on tongue areas. Fourth, elaborate manual annotation is conducted with the help of TCM doctors. Fifth, to fully embody the advantages of the neural network, a super-resolutionfaster-RCNN based detector is deployed to detect the prickles from a matted tongue image. Evidence-Based Complementary and Alternative Medicine green difference, and "B" represents blue-yellow difference.
e spot detection algorithm works on gray images and the color information is lost. erefore, we filtered the spots by the chromatic aberration between the spots and the manually picked prickles. Given an RGB value, an approximate LAB chromatic aberration can be calculated as follows: Swin-Transformer Encoder Z 10,11 Swin-Transformer Encoder Z [4][5][6][7][8][9] Swin-Transformer Encoder Z 2,3 Pre-trained Position Embedding Figure 4: Architecture of tongue segmentor. e tongue image is divided into several patches and then added with position embeddings to retain spatial information. e encoder consists of cascaded Swin Transformers, while UperNet decoder is applied to aggregate multilevel features from encoder. Figure 5: Mask morphology processing. e connected component with the largest area in black or white will be marked as "1" and other connected components will be eliminated.
e tongue is partitioned before detection so that we can elaborately set different detection parameters for different areas.
e tongue coating is distributed on the tongue surface, which is usually slightly thicker in the center or root of the tongue, and the prickles covered by the coating have different characteristics from the prickles on the margin and tip. In addition, the cracks on the root and center of the tongue tend to be detected mistakenly by the spot detection algorithm. To solve this problem, we divided the tongue into four areas: root, margin, tip, and center before preliminary annotation. en we set the threshold of chromatic aberration, area, circularity, and convexity tighter in the root and center than in other areas.
is setting avoids the misdetection of cracks while maintaining the detection rate. e algorithm of annotation is shown in Figure 6. First, the parallel-line method is introduced to build a reference line for tongue partition [14]. Second, the tongue is divided into four areas by the relative thickness compared to the overall scale of the tongue. e result is depicted in Figure 7.
ird, a simple blob detector based on OpenCV (Open Source Computer Vision Library) is deployed with different parameters in different tongue regions. Fourth, the detected spots are filtered by LAB chromatic aberration. Finally, a professional TCM doctor revised the roughly annotated bounding boxes with the MIT Labelme annotation tool (https://labelme.csail.mit.edu) and two other TCM doctors checked the result under the same diagnostic criteria on the same monitor [24,25].

Prickle Detection.
In this paper, we take the Faster-RCNN as the prickle detector. In 2016, Ross B. Girshick proposed a new object detection neural network called Faster-RCNN, which is depicted in Figure 8. e Faster-RCNN first uses a set of basic convolution layers, ReLU function, and pooling layers to extract an image feature map, which is subsequently shared by region proposal networks (RPN) and fully connected layers. e RPN network is designed to generate region proposals with the softmax layer to determine whether the anchors are positive or negative, and then it employs bounding box regression to correct the anchors to obtain accurate proposals. e roi-pooling layer collects the input feature maps and proposals, extracts proposal feature maps after synthesizing the information, and sends them to the subsequent fully connected layer to determine the target category [26]. ere are two obstacles that need to be overcome before training the Faster-RCNN. First, insufficient data makes the model hard to train. erefore, we introduced data augmentation, including flip, crop, scale, translation, and rotation, as shown in Figure 9. In addition, we followed the paradigm of computer vision workflow: we pretrained the model on a general detection dataset and used transfer learning to fine-tune the model on the prickle detection dataset. Second, the Faster-RCNN is designed for the detection of the target on a normal scale. When the target is smaller than 32 × 32 pixels, the performance of the model will decrease sharply. To address this issue, we used 4× bilinear interpolation for upsampling to improve the model performance. Since the neural network aims to build an endto-end prickle detection model and it is proven that the color calibration and irrelevant noise filtering may cause a degradation in model performance [27], image registration and filtering are not employed.

Evaluation Metrics.
e segmentation tasks in computer vision field could be considered as a pixel-wise classification, and there are four types of the results: true positive (TP), false positive (FP), true negative (TN), and false negative (FN), as shown in Figure 10. e standard evaluation metrics of segmentation task is intersection over union (IoU), defined as follows: where TP, FP, and FN are about pixel-wise classification results. We also employed precision and accuracy as metrics to fully demonstrate the performance of the proposed method, which are defined as follows: Accuracy � (TP + TN) (TP + TN + FP + FN) ,

Evidence-Based Complementary and Alternative Medicine
A prickle is classified as detected correctly when the bounding box has IoU over the threshold. e metrics for prickles detection are precision defined above and recall defined as follows: Where FP is the number of misdetected prickles, TP and FN is the number of manually annotated prickles that were detected and undetected, respectively.

Experiment Setting.
In our work, the models were trained and tested based on the Python deep-learning     Table 1.

Result of Tongue Segmentation.
e Swin Transformer is a flexible neural network, which means it requires more training compared to the conventional CNN. To address this issue, we introduced the pretraining model provided by Microsoft (https://github.com/SwinTransformer/Swin-Transformer-Semantic-Segmentation).
Microsoft has trained the model with over 20,000 images on ADE20K dataset [16] and we fine-tuned the model on our tongue segmentation dataset. e training set and test set contained 178 and 46 annotated images, respectively. e splitting of the dataset took a cross-validation strategy. After we got the predicted segmentation, we filled the blank areas in the tongue and eliminated the outliers. e result is shown in Table 2 and Figure11(b). e excellent performance of the Swin Transformer makes the segmentation results basically the same as manual segmentation. Compared to the conventional machine vision segmentation methods, neural network is an endto-end progress without the requirement of manual tuning parameters. It is also more robust when the illumination or sampling device varies and the result proves the superiority of the transformer architecture in tongue segmentation.

Prickle Labeling.
We employed the simple blob detector with LAB chromatic aberration filtering to detect the prickles coarsely. e simple lob detector is a multistep threshold spot detection method for processing gray images. We took the parameters of the simple blob detector in [11] as the initial value and introduced the grid search to find the optimal parameters for each area. e searching step length is set to 10% of the initial value and the searching range is set from 50% to 150% of the initial value. e parameters of the partitioned simple blob detector are shown in Table 3.
Petechiae are usually found on the center and roots of the tongue. To reduce the probability of misdetection, the constrains of spots in those areas are more stringent. In addition, to take full advantage of the color information, we sampled the RGB value of prickles in different tongues and areas and filtered the detected spots by the LAB chromatic aberration.
e automatic annotation result with and without partitioning is shown in Figure 12 to give a clear demonstration of how the partitioning helps the prickle labeling. After automatic annotation, three well-trained TCM doctors manually revised the labels. e annotation is shown in the Figure 11(d).

Result of Prickle Detection.
Object detection is a challenging task requiring a lot of training. We downloaded the pretrained model provided by CUHK and SenseTime (https://github.com/open-mmlab) and fine-tuned it on the prickle detection dataset.
e training set and test set contained 178 and 46 annotated images, respectively. e splitting of the dataset took a cross-validation strategy. Compared to the original Faster-RCNN, we employed a 4× bilinear interpolation for upsampling and modified the anchor size to match the prickle detection. To demonstrate the superiority of our modified Faster-RCNN, we took the vanilla Faster-RCNN and other detection algorithms for comparison. e predicted result of the entire dataset is shown in Table 4. e recall of our method outperformed the existing methods without fine-tuning the parameters manually. We did not choose YOLO because it is a one-stage detection algorithm and the resolution is fixed, which limits the performance of detecting tiny targets. It is worth mentioning that we tried other learning-based algorithms, including DCNv2 [29] and SSD [30], but they were unable to predict       any bounding box. As shown in Figure 11(e), the detected prickle patterns are similar to the annotation while it tends to detect prickles on the tip. To illustrate the principle of our neural network, we provided an attention heat map of the detector, as shown in Figure 11(f ). e neural network had more attention on the edge of the tongue, where the prickles are most likely to appear. is attention is learned by the neural network automatically and has the potential to provide TCM doctors an intuition to distinguish a tongue with prickles. In addition, we fed the raw images into the detector as an ablation experiment to validate the necessity of segmentation, and the result is shown in the 3rd line of Table 4.

Discussion
Prickle, as an essential syndrome feature of TCM, can be used to assist in the diagnosis and treatment of subhealthy people. But the ambiguity of prickle recognition and the subjective preferences of doctors limit its further application. Combined with modern AI technology, we proposed an end-to-end computer vision-based prickle detection workflow, which makes prickle detection more precise and objective. is workflow is divided into three main steps. Firstly, Swin Transformer, a state-of-the-art semantic segmentation neural network, was employed to segment the tongue region out of a raw image. e segmented tongues were partitioned into four areas: root, center, tip, and margin to help the parameters setting of follow-up spot detection and prickle detection. Secondly, we manually labeled the prickles on 224 tongue images with the help of a spot detector and fed the result into the Faster-RCNN. Finally, the neural network extracted the features of images at both a texture level and morphological level to detect the prickles on the tongues. e precision of our segmentation is 99.47%, and the recall of our detection is 92.42%, which outperformed the existing methods. e result illustrates that the utilization of transfer learning made it possible to train neural networks on a limited number of images. Compared to previously published studies, it gets rid of the trivial parameters tuning procedure and releases the burden on researchers. Meanwhile, the workflow proposed is more portable, which means you can transfer the model to arbitrary tongue characteristics or image acquisition equipment.
In the context of artificial intelligence and big data, the informatization of TCM diagnosis is an area that urgently needs in-depth research. Our work took full advantage of the deep learning algorithm to implement an intelligent recognition of prickles and provided the possibility of establishing the quantitative association between the tongue image and clinical symptoms. In addition, incorporating a prickle detection model into the smartphone will allow people without medical knowledge to give themselves a simple health status assessment, and a quantitative and objective tongue diagnosis will also benefit the integration of TCM and modern Western medicine. e precise segmentation and feature recognition of the tonguecan also be used for throat swab robot perception. ough our method is state-of-the-art, there are some aspects for further research. Firstly, the model was trained on the images sampled by a standard acquisition device, which limits its generalization and robustness. A larger dataset consisting of images from different devices would help the model to establish a greater degree of accuracy in this matter. Secondly, our model provides a paradigm for tongue feature detection. With petechiae, cracks, toothmarks, and other TCM features of the tongues labeled, the model has the potential to achieve an acceptable result.
irdly, most existing learning-based tongue feature detection methods aim to find bounding boxes. It is possible to use a segmentor to classify every single pixel in the image into a kind of TCM tongue feature.
Data Availability e experimental data and code used to support the findings of this study will be available on https://github.com/zz7379/ PrickleDetection or on contacting the authors with reasonable request (wangxinzhou@tongji.edu.cn).

Conflicts of Interest
e authors declare that there are no conflicts of interest.

Authors' Contributions
Xinzhou Wang and Siyan Luo contributed equally to this work.