An Improved Multitask Learning Model with Matching Network and Its Application in Traditional Chinese Medicine Syndrome Recommendation

Multitask learning (MTL) is an open and challenging problem in various real-world applications, such as recommendation systems, natural language processing, and computer vision. The typical way of conducting multitask learning is to establish a global parameter sharing mechanism among all tasks or to assign each task an individual set of parameters with cross-connections between tasks. However, in most existing approaches, the raw features are abstracted step by step and semantic information is mined from the input space, but matching relation features are not introduced into the model. To solve this problem, we propose a novel MMOE-match network to model the matches between medical cases and syndrome elements and introduce the recommendation algorithm into traditional Chinese medicine (TCM) study. Accurate medical record recommendation is significant for intelligent medical treatment. Ranking algorithms can be introduced in multiple TCM scenarios, such as syndrome element recommendation, symptom recommendation, and drug prescription recommendation. A recommendation system includes two main stages, recalling and ranking, whose core models are the two-tower matching network and multitask learning, respectively. MMOE-match combines the advantages of the recalling and ranking models in a new network. Furthermore, we take the matching network output as an input of multitask learning and compare it with manually designed matching features. The experimental results show that our model brings significant positive benefits.


Introduction
Mixed negative sampling (MNS) [1] applies two-tower neural networks to improve retrieval quality in large-scale recommendation systems. The selection bias of implicit user feedback is addressed by using a mixture of batch and uniformly sampled negatives. The Graph-DR framework [2], which uses a neighbor similarity loss with multichannel matching, is proposed to improve both accuracy and diversity. The YouTube general matching neural network [3] is applied to recalling and ranking in different stages, and joint training is proposed to solve the age exposure bias problem. A dual augmented two-tower model [4] is proposed, which employs an adaptive-mimic mechanism (AMM) and a category alignment loss (CAL) to model the information interaction between the two towers and obtain better embeddings for imbalanced categories. The deep structured semantic model (DSSM) [5] employs clickthrough data to optimize parameters that directly target the goal of ranking. Furthermore, DSSM extends the linear semantic structure to a nonlinear one using a multilayer perceptron, which can capture more sophisticated semantics. The adversarial two-tower neural network (ATNN) [6] model is applied to click-through rate (CTR) prediction; it introduces an adversarial network into a two-tower network and extends it to multitask learning. Sampling-bias-corrected neural modeling (SBCN) [7] is proposed for estimating the item frequency of streaming data; it adapts to the data distribution and reduces the sampling bias of in-batch items. A deep learning recommendation model (DLRM) [8] is proposed, which mitigates memory constraints by applying model parallelism to the embedding tables. An internal and contextual attention network (ICAN) [9] is proposed, which combines feature field interactions among multiple channels with channel-specific contextual attention in the matching module.
An adversarial mixture of experts (ADMOE) [10] is proposed, which introduces adversarial regularization among the expert outputs and uses soft gating constraints based on the category hierarchy. Multitask adversarial active learning (MTAAL) [11] is proposed for medical named entity recognition and normalization; it maintains the performance of multitask learning and active learning via a task discriminator and a diversity discriminator. A hierarchical model with micro- and macro-behaviours (HM$^3$) [12] is proposed, which utilizes multitask learning and applies the abundant supervisory labels from micro- and macro-behaviours to predict the conversion rate (CVR) in a unified framework. The gating-enhanced multitask neural network (Gem-NN) [13] is proposed to predict CTR in a coarse-to-fine manner; it allows parameter sharing from upper-level tasks to lower-level tasks and introduces a gating mechanism between the embedding layers and the MLP. The multiple-level sparse sharing model (MSSM) [14] is proposed to represent features flexibly and share information among tasks efficiently; it includes a field-level sparse connection module (FSCM) and a cell-level sparse sharing module (CSSM). A multitask learning approach that uses affect dimensions to detect emotions [15] is proposed, which jointly trains a multilabel emotion classifier and a multidimensional emotion regressor by exploiting the interrelatedness of the tasks. An adaptive information transfer multitask (AITM) framework [16] is proposed, which uses the adaptive information transfer (AIT) module to model the sequential dependence among audience multistep conversions. The sequential deep matching (SDM) model [17], which considers users' dynamically evolving preferences, is proposed to capture users' interests by combining long-term behaviours and short-term sessions.
In particular, our main contributions in this study can be summarized in the following five aspects: (1) We propose a novel two-tower match network for multitask learning, which can better learn ID embeddings via an auxiliary loss. (2) On the basis of the two-tower network, we integrate the output of the matching network as features into the multitask learning embedding layer, which increases the feature expression ability and helps different tasks extract information more accurately. (3) To the best of our knowledge, this is the first work that introduces mimic interaction into the MMOE-match network. It adds communication between recalling and ranking, which improves combined training. (4) Combined with TCM [18] knowledge, we design statistical features such as the confidence degree feature, the promotion degree feature, and the TF-IDF feature. The experiments show that these features bring a positive effect. (5) We show offline results of our experimental evaluation on laboratory TCM syndrome recommendation data, which demonstrate the scalability of our methods.

Related Work
Our work builds on recommendation system algorithms from artificial intelligence and applies them to TCM syndrome element recommendation. It draws on two lines of work: the first is multitask learning, and the second is the two-tower matching network, which can improve convergence and prediction accuracy.

Recommendation Systems.
Recommendation systems select and rank items from millions of candidates. A common practice is to use a model with two stages: the recalling stage and the ranking stage. The recalling stage reduces the corpus size from millions to thousands, and the ranking stage estimates the CTR of the item candidates and delivers the top-ranked items to users. In the ranking stage, machine learning and deep learning have been widely used. Logistic regression [19] is a classic method that quickly recommends items based on a linear structure. XGBoost [20] is a gradient boosting tree learning algorithm, which is used in many winning solutions of machine learning competitions. With the development of deep learning, neural network models have been successfully applied to CTR prediction in recommendation. Wide-deep [21] jointly exploits linear memorization and nonlinear generalization via its wide and deep architecture. Deep-FM [22] combines a DNN with a factorization machine (FM) to capture high-order interactions automatically.

Multitask Learning.
Multitask learning (MTL) has achieved success in many applications of deep learning, from natural language processing and speech recognition to computer vision and recommender systems. Compared with single-task learning, it can significantly improve learning efficiency and prediction accuracy by using a flexible sharing mechanism across tasks. Mainstream multitask learning follows three patterns: Expert-Gate, Expert-NAS, and Probability-Transfer. The main idea of the Expert-Gate pattern is to control how experts are shared or kept independent among tasks; examples include MMOE [23], DADR [24], MOSE [25], and PLE [26]. The Expert-NAS pattern learns expert or feature information selectively and chooses knowledge across all tasks in a flexible and efficient way; examples include SNR [27] and MSSM [14]. The Probability-Transfer pattern considers relationships in the output layers of different tasks; it transfers information via the scalar product and models sequential dependence well among rich, useful representations. Examples include ESMM [28], ESM2 [29], and AITM [16].

Matching Tower in Recommendation.
In the recalling stage of recommendation systems, the mainstream model is the two-tower match network. YouTube-Net [3] highlights the two-step architecture in a wide range of real-world applications and brings in deep neural networks to build user embeddings for matching. Recently, graph embedding [30] models have been on the rise for learning the representations of items and users. The recalling process is made equivalent to retrieving the nearest neighbors of the users' vectors among all the items. Inspired by the matching idea, our work introduces two-tower models into multitask learning for large-scale recommenders.

Application of Recommendation Algorithm in TCM Syndrome Elements.
TCM diagnosis needs to predict syndrome elements, symptoms, disease names, and prescription names according to the patient's medical record. The traditional approach views this as a multilabel problem and employs natural language processing techniques such as the TextCNN model. When the number of prediction objects is small, the multilabel algorithm is feasible, but when the number of prediction objects is large, it does not perform well. There are more than 60 kinds of TCM syndrome elements, more than 2000 kinds of symptoms, and more than 10,000 kinds of disease names and prescription names, so it is extremely important to explore prediction schemes beyond the multilabel mode. An e-commerce recommendation system mines the commodities that interest a user from massive data, and its candidate objects are far more numerous than the TCM prediction objects. We draw on recommendation ranking algorithms and introduce deep learning into TCM projects. Syndrome element prediction, symptom prediction, disease name prediction, and prescription name prediction can each be trained independently using the recommendation ranking algorithm. They can also be trained jointly as multiple tasks, so that one model produces multiple prediction results at the same time.

The Proposed Methods
We design a general ranking paradigm called MMOE-match, which combines multitask learning with a matching network. We apply it to the laboratory TCM medical case data and achieve good performance, which is significant for the construction of a TCM recommendation system. The core structure of the proposed neural network is shown in Figure 1.
The left side of Figure 1 shows the multitask learning model, which is the main network. From bottom to top, it consists of the input data layer, embedding layer, expert network, gate network, tower layer, and prediction layer.
The right side of Figure 1 shows the two-tower matching network, which is the auxiliary network. From bottom to top, it consists of the input data layer, embedding layer, MLP layer, match tower layer, and match prediction layer.
The components of the main network (left side of Figure 1) are as follows:
Input Data: it includes symptoms and syndrome elements. Symptoms are used to construct features, and syndrome elements are used to construct labels.
Embedding Layer: the symptoms are text features, which are converted into ID features via word segmentation. Embedding transforms the high-dimensional sparse symptom ID features into low-dimensional dense vectors by nonlinear mapping.
Expert Layer: each expert consists of two fully connected layers. We design eight experts, which are used to extract abstract feature information.
Gate Layer: it represents the weights of the experts for each task.
Tower Layer: each tower consists of two fully connected layers, and the number of towers is the same as the number of tasks.
Prediction Layer: it is a nonlinear mapping that represents the probability of the positive sample.
The components of the auxiliary network (right side of Figure 1) are as follows:
Input Data: it is the same as the input of the main multitask learning network.
Embedding Layer: the processing mode is the same as that of the main network.
MLP Layer: it consists of three fully connected layers using linear transformations and activation functions.
Match Layer: it consists of three fully connected layers with L2 normalization.
Prediction Layer: it is a nonlinear mapping, which indicates the probability that the symptoms and syndrome elements match.
In Figure 1, the total loss includes the main network loss and the auxiliary network loss.
The Adam optimizer is used for gradient updating to continuously reduce the total loss until the optimal solution is obtained.
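The architecture described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' TensorFlow implementation: each expert is reduced to a single layer for brevity, the match probability is assumed to be the sigmoid of the inner product of the two L2-normalized tower outputs, both losses are assumed to be binary cross-entropy, and the helper names (`dense`, `init`, `mmoe_forward`, `match_forward`, `total_loss`) and the weight `lam` are hypothetical.

```python
import math
import random

def dense(x, w, b, relu=False):
    """Fully connected layer: W x + b, optionally followed by ReLU."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [max(0.0, v) for v in out] if relu else out

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0  # avoid dividing by zero
    return [a / norm for a in v]

def init(rng, n_out, n_in):
    """Hypothetical helper: small random weights and zero biases."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def mmoe_forward(x, experts, gates, towers):
    """Main network: experts, then a per-task softmax gate, then a per-task tower."""
    expert_out = [dense(x, w, b, relu=True) for w, b in experts]
    probs = []
    for (gw, gb), (tw, tb) in zip(gates, towers):
        weights = softmax(dense(x, gw, gb))  # one weight per expert
        mixed = [sum(g * e[j] for g, e in zip(weights, expert_out))
                 for j in range(len(expert_out[0]))]
        probs.append(sigmoid(dense(mixed, tw, tb)[0]))
    return probs

def match_forward(x_symptom, x_element, tower_s, tower_e):
    """Auxiliary network: two towers, L2 norm, sigmoid of the inner product."""
    u = l2_normalize(dense(x_symptom, *tower_s, relu=True))
    v = l2_normalize(dense(x_element, *tower_e, relu=True))
    return sigmoid(sum(a * b for a, b in zip(u, v)))

def bce(y, p, eps=1e-7):
    """Binary cross-entropy for one sample (assumed loss form)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def total_loss(y, task_probs, match_prob, lam):
    """Main-task loss plus lam times the auxiliary matching loss."""
    return sum(bce(y, p) for p in task_probs) + lam * bce(y, match_prob)

rng = random.Random(0)
x_s, x_e = [0.2, -0.1, 0.4], [0.3, 0.5, -0.2]  # toy embedded inputs
x = x_s + x_e
experts = [init(rng, 4, 6) for _ in range(8)]  # eight experts, as in the text
gates, towers = [init(rng, 8, 6)], [init(rng, 1, 4)]  # a single task
probs = mmoe_forward(x, experts, gates, towers)
match_prob = match_forward(x_s, x_e, init(rng, 4, 3), init(rng, 4, 3))
loss = total_loss(1, probs, match_prob, lam=0.5)
```

In a real training loop, an optimizer such as Adam would update all weights to reduce `loss`; the sketch only shows the forward pass and how the two losses are combined.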

Improved Method 1: Multitask Learning Introducing Matching Network.
We predict the syndrome elements in the TCM project, and each medical case has a real syndrome element. There are more than 60 syndrome elements in the set of 150,000 medical cases. Training samples are constructed as follows: if a medical case is paired with its real syndrome element, the sample is labeled 1; if it is paired with a syndrome element that does not belong to it, the sample is labeled 0. A recommendation system is divided into two main stages, recalling and ranking. In the recalling stage, candidate items are produced from massive data and then provided to the ranking stage. Generally speaking, the recalling stage handles a large amount of data with a simple two-tower model, whereas the ranking stage handles fewer data, so various complex models can be optimized. We regard syndrome recommendation as the ranking stage and use the multitask learning model MMOE. Since the features are symptoms and syndrome elements, and inspired by the user side and item side of the two-tower model in the recalling stage, we design a matching tower to assist ID feature learning. The original inputs are symptoms and syndrome elements, where $X_{zz}$ represents the TCM symptom data and $X_{zs}$ denotes the TCM syndrome element data. The embedding layer $f(\cdot)$ maps the high-dimensional sparse vector to a low-dimensional dense vector space, where $[e_1, e_2, \ldots, e_n]$ represents the concatenation of multiple vectors. The detailed processing steps of $f(\cdot)$ are as follows: word embedding is performed on the words in each field, including the syndrome, tongue, moss, pulse, and syndrome elements, respectively. Then, the average is taken to obtain the vector representation of each field, and the vectors of the fields are concatenated to obtain the vector representation of the input data. In the output of the multitask learning tower, $t_k(\cdot)$ represents the $k$th tower and $f_k(\cdot)$ denotes the linear combination of the gate network and the expert network. $g_k(\cdot)$ represents the gate function, which uses a softmax activation, and $W_{gk}$ represents the weight of the gate network. $\mathrm{FFN}(\cdot)$ represents a two-layer feed-forward network with 256 and 128 nodes, respectively. The input of the matching network is the symptom feature and the syndrome element feature. After multiple layers of forward propagation, the matching probability is calculated. In this computation, $W_1$ and $W_L$ represent the model weights, $b_1$ and $b_L$ denote the biases, and $\mathrm{Relu}(\cdot)$ is a nonlinear activation function. The matching probability is produced by $p(\cdot)$, a nonlinear mapping, where $b$ represents the bias. The matching network, as the auxiliary task, is used to assist the main task of syndrome element prediction. We use a matching hyperparameter $\lambda_1$ to form the weighted sum of the loss functions of the main task and the auxiliary task, which is expressed as follows: $\mathrm{Loss} = \mathrm{loss}_{\mathrm{main}} + \lambda_1\, \mathrm{loss}_{\mathrm{match}}$, where $y_k$ is the true label and $\hat{y}_k$ and $\hat{y}_m$ are the predicted values.
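The formulation described in this subsection can be written out as follows. This is a sketch that assumes the standard MMOE form; the number of experts $E$, the per-tower hidden states $h_l^{zz}$ and $h_l^{zs}$, the inner product inside $p(\cdot)$, and the hats on the predictions are notational assumptions, not necessarily the authors' exact equations.

```latex
\begin{align*}
x &= [X_{zz}, X_{zs}], & e &= f(x) = [e_1, e_2, \ldots, e_n], \\
\hat{y}_k &= t_k\bigl(f_k(e)\bigr), & f_k(e) &= \sum_{i=1}^{E} g_k(e)_i \,\mathrm{FFN}_i(e), \\
g_k(e) &= \mathrm{softmax}(W_{gk}\, e), & h_l &= \mathrm{Relu}(W_l h_{l-1} + b_l), \quad l = 1, \ldots, L, \\
\hat{y}_m &= p\bigl(\langle h_L^{zz}, h_L^{zs} \rangle + b\bigr), & \mathrm{Loss} &= \mathrm{loss}_{\mathrm{main}} + \lambda_1\, \mathrm{loss}_{\mathrm{match}}.
\end{align*}
```

Here the recursion for $h_l$ is applied separately in each tower, starting from the embedded symptom data $X_{zz}$ in one tower and the embedded syndrome element data $X_{zs}$ in the other.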

Improved Method 2: Match Network Output as Multitask Learning Feature.
We take the output of the matching network as a feature of the main syndrome element prediction task and combine it with the original features; the combined features are denoted feature_all. Then, we feed feature_all into equations (2)-(6), which enriches the feature representation available to multitask learning. The model network structure of improved method 2 is shown in Figure 2.
The left side of Figure 2 shows the main network, multitask learning, and the right side shows the auxiliary network, the two-tower matching network. The internal structure of the main and auxiliary networks is the same as in Figure 1. The difference is that the output of the auxiliary network in Figure 2 is taken as a part of the features of the main network, so that the prediction accuracy is improved through richer features.
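A minimal sketch of this feature combination is given below. The text does not state whether the matching output is the scalar match probability or the tower vectors, so the function accepts either as a list; the name `combine_features` and the toy values are hypothetical.

```python
def combine_features(embedding, match_output):
    """Concatenate the embedded input with the matching-network output."""
    return list(embedding) + list(match_output)

embedded_input = [0.12, -0.40, 0.33]  # toy embedding of one sample
match_probability = [0.71]            # scalar match output wrapped in a list
feature_all = combine_features(embedded_input, match_probability)
```

The combined vector then takes the place of the plain embedding as the input of the experts and gates in the main network.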
To explore the relationship between features and labels and to extract abstract features, we design cross-features of symptoms and syndrome elements based on TCM knowledge. The core construction logic of the cross-features is introduced in the experiment section. The sources of the statistical feature calculation are summarized in Figure 3. The details of the experiments are given in Section 4, and they show the following: (1) Both improved method 1 and improved method 2 outperform the base multitask model MMOE. This indicates that the matching network can learn feature cross-information. Feature engineering is abstracted into the model, which saves the cost of manual design. The matching network results are combined as MMOE features, which brings richer feature knowledge. (2) Comparing improved method 2 with our manually designed features shows that there are different ways of adding new features and that the gain brought by well-designed features is also significant. Designing syndrome element and symptom statistical features by hand requires familiarity with business knowledge, and TCM knowledge should be combined to guide model learning.
In conclusion, in the early stage of a business, when little feature experience is available, model-based matching is a direction worth trying. In the later stage, experience guides feature design better.

Improved Method 3: Mimic Interaction Mechanism between Recalling and Ranking.
The cores of recalling and ranking are the two-tower matching network and multitask learning, respectively, and both have long been studied. Our innovation is to combine recalling and ranking in one model. Adding a match network to a model is not a new problem; however, how to make the match network and the ranking model interact better is worth exploring. To provide better interaction between MMOE and the match network, we introduce a mimic interaction mechanism, which can assist ID feature learning. The interaction mechanism is represented by the vectors $a_u$ and $a_v$ and the operators $p_u$ and $p_v$.
The left side of Figure 4 shows the main network, multitask learning, and the right side shows the auxiliary network, the two-tower matching network. The internal structure of the main network and the auxiliary network is the same as in Figures 1 and 2. The innovation is that the vectors $a_u$ and $a_v$ and the operators $p_u$ and $p_v$ are added in Figure 4 to make recalling and ranking interact better.
In the mathematical expression of this improvement, $\|$ is the vector concatenation operation. The vectors $c_u$ and $c_v$ carry not only information about the current symptom and syndrome element but also information about historical positive interactions through $a_u$ and $a_v$.
Here, $c$ stands for $c_u$ and $c_v$, and $p$ stands for $p_u$ and $p_v$; $W_L$ and $b_L$ are the weight matrix and bias vector of the $L$th layer. $p_u$ and $p_v$, the output vectors of the L2 normalization layer, represent the MMOE embedding and the match-net embedding, respectively.
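The interaction can be sketched as follows, under the assumption that it follows the adaptive-mimic mechanism of the dual augmented two-tower model [4]: each side concatenates its embedding with an augmented vector, passes the result through a feed-forward layer with L2 normalization, and, for positive pairs only, the augmented vector of one side is pulled toward the output of the other side. The mimic loss form, the single-layer towers, the toy weights, and all function names are assumptions for illustration.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v)) or 1.0  # avoid dividing by zero
    return [a / norm for a in v]

def concat(e, a):
    """c = e || a: embedding concatenated with the augmented vector."""
    return list(e) + list(a)

def tower(c, w, b):
    """One ReLU layer followed by L2 normalization; returns p."""
    h = [max(0.0, sum(wi * ci for wi, ci in zip(row, c)) + bi)
         for row, bi in zip(w, b)]
    return l2_normalize(h)

def mimic_loss(y, a, p_other):
    """Squared distance between a and the other side's p; zero when y = 0."""
    return y * sum((ai - pi) ** 2 for ai, pi in zip(a, p_other))

# Toy symptom side (u) and syndrome element side (v).
e_u, a_u = [0.5, -0.2], [0.1, 0.1]
e_v, a_v = [0.4, 0.3], [0.0, 0.2]
w_u = [[0.3, 0.1, 0.2, 0.4], [0.2, 0.5, 0.1, 0.3]]
w_v = [[0.1, 0.2, 0.3, 0.1], [0.4, 0.1, 0.2, 0.2]]
p_u = tower(concat(e_u, a_u), w_u, [0.0, 0.0])
p_v = tower(concat(e_v, a_v), w_v, [0.0, 0.0])

before = mimic_loss(1, a_u, p_v)
# One gradient step on a_u only; p_v is treated as a constant target.
lr = 0.1
a_u_new = [ai - lr * 2.0 * (ai - pi) for ai, pi in zip(a_u, p_v)]
after = mimic_loss(1, a_u_new, p_v)
```

After the step, `a_u_new` is closer to `p_v`, which is how the augmented vector comes to summarize the positive interactions seen for that symptom.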

Data Sets.
Our laboratory has been conducting big-data-based research on TCM for decades; it has undertaken and completed dozens of TCM projects and accumulated a large number of medical records of real patients. Up to now, there are a total of 480,000 medical cases, for which we have full scientific research rights. The data set used in this research is derived from them. The medical cases record the diagnoses and treatments of generations of famous veteran TCM doctors, such as Pei-Sheng Li and Ren He. Pei-Sheng Li was among the first group of instructors for the inheritance of veteran TCM doctors' academic experience, and his representative works include "Treatise on febrile diseases" and "Annotation on Koch Febrile Diseases." Ren He was a winner of the first "national physician master" honorary title, and his popular works include "Synopsis of the Golden Chamber." Medical records include five aspects of information: the patient's personal information, such as gender and age; environmental information at the time of medical treatment, such as the date of treatment and the solar term; the patient's disease description, such as engraving and tongue; the patient's medical history, such as genetic diseases and past history; and the doctor's analysis of the patient's disease, diagnosis, and treatment information, such as the syndrome and the treatment principles and methods. Although there is no standard specification for TCM, the descriptions in our medical records have undergone standardized processing based on years of technical experience accumulated in the laboratory. All aspects of the information in a medical case are relatively standardized. Fields that are abstract and difficult for models to predict, such as syndromes, are also processed according to national standards and specifications.
Syndromes are split at a fine granularity according to TCM syndromes, so that the information on etiology, disease location, and disease nature can be displayed more clearly. The purpose of this study is to integrate the diagnostic thinking and treatment approach applied to patients into a deep learning model based on the idea of recommendation. We hope that the model can imitate the diagnosis and treatment methods of famous TCM doctors and thereby carry out diagnosis and treatment better. The first step of diagnosis and treatment is to analyse the location and severity of the disease based on the patient's disease information, namely syndrome element prediction. This is the stage of diagnosis and treatment addressed in our experiment. We screen 150,000 medical cases from the data set and randomly select 130,000 of them as the training set and the remaining 20,000 as the test set. Then, we extract the syndrome, tongue, moss, and pulse information from the medical cases to generate the syndrome element data set. We regard the syndrome as the recommendation target. Table 1 shows examples of medical cases. For privacy reasons, only the fields used in this study (syndrome, tongue, moss, pulse, and syndrome elements) are shown here. It is worth noting that, when generating the data set, a medical case is equivalent to a user in an e-commerce recommendation sample, and the syndrome element is equivalent to the goods to be recommended. Finally, the number of training samples is 8.06 million, and the number of test samples is 1.24 million.
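The sample construction can be sketched as pairing every medical case with every candidate syndrome element and labeling the pair 1 when the element belongs to the case and 0 otherwise. The field names and the example elements below are hypothetical. The reported sample counts are consistent with 62 candidates per case, which is an inference from the numbers and is not stated in the text.

```python
def build_samples(cases, all_elements):
    """Pair each medical case with each candidate syndrome element."""
    samples = []
    for case in cases:
        for element in all_elements:
            label = 1 if element in case["elements"] else 0
            samples.append((case["symptoms"], element, label))
    return samples

# Hypothetical toy data: two cases and three candidate syndrome elements.
all_elements = ["qi deficiency", "blood stasis", "dampness"]
cases = [
    {"symptoms": ["fatigue", "pale tongue"], "elements": {"qi deficiency"}},
    {"symptoms": ["fixed pain"], "elements": {"blood stasis", "dampness"}},
]
samples = build_samples(cases, all_elements)

# Consistency check on the reported sizes, assuming 62 candidates per case.
train_size = 130_000 * 62  # 8.06 million
test_size = 20_000 * 62    # 1.24 million
```

Because every case contributes one row per candidate, negatives heavily outnumber positives, which is one reason ranking metrics such as AUC and Hits@10 are reported instead of plain accuracy.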
In addition, based on previous experimental experience, we design some statistical features, which also play a vital role in the prediction of the syndrome element. The statistical features include the confidence degree feature, the promotion degree feature, and the TF-IDF feature. Taking the statistics of the underlying symptoms and syndrome elements as an example, the features are calculated from the following quantities. $F_1$ and $F_2$ denote the two types of confidence degree features. $N_{zz}$ denotes the number of symptoms that occur, and $N_{zs}$ denotes the number of syndrome elements that occur. $T_{zz}$ is the number of medical cases containing the current symptom, and $T_{zs}$ is the number of medical cases containing the current syndrome element. $L$ denotes the promotion degree feature, and $T_{total}$ is the total number of medical cases. $S_{zz}$ denotes the number of times the current symptom appears together with any syndrome element. $Y_{zz}$ denotes the total number of symptoms, and $H_{zs}$ is the number of times the current syndrome element appears together with other syndrome elements.
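As an illustration of how such co-occurrence statistics can be computed, the sketch below uses the standard association-rule definitions of confidence and lift, reading "promotion degree" as lift. Whether these match the authors' exact formulas is an assumption, the two confidence directions are only a guess at what the two types $F_1$ and $F_2$ mean, and the function names and toy cases are hypothetical. The TF-IDF feature is not covered.

```python
from collections import Counter

def cooccurrence_counts(cases):
    """Count cases per symptom, per syndrome element, and per (symptom, element) pair."""
    t_zz, t_zs, t_pair = Counter(), Counter(), Counter()
    for symptoms, elements in cases:
        for s in set(symptoms):
            t_zz[s] += 1
        for e in set(elements):
            t_zs[e] += 1
        for s in set(symptoms):
            for e in set(elements):
                t_pair[(s, e)] += 1
    return t_zz, t_zs, t_pair, len(cases)

def confidence_given_symptom(s, e, t_zz, t_pair):
    """Share of cases with symptom s that also have element e."""
    return t_pair[(s, e)] / t_zz[s]

def confidence_given_element(s, e, t_zs, t_pair):
    """Share of cases with element e that also have symptom s."""
    return t_pair[(s, e)] / t_zs[e]

def lift(s, e, t_zz, t_zs, t_pair, total):
    """Confidence divided by the base rate of element e."""
    return confidence_given_symptom(s, e, t_zz, t_pair) / (t_zs[e] / total)

# Hypothetical toy cases: (symptoms, syndrome elements).
cases = [
    (["cough", "fever"], ["lung"]),
    (["cough"], ["lung", "qi"]),
    (["fatigue"], ["qi"]),
    (["cough", "fatigue"], ["qi"]),
]
t_zz, t_zs, t_pair, total = cooccurrence_counts(cases)
f1 = confidence_given_symptom("cough", "lung", t_zz, t_pair)
f2 = confidence_given_element("cough", "lung", t_zs, t_pair)
promotion = lift("cough", "lung", t_zz, t_zs, t_pair, total)
```

A lift above 1 means the symptom makes the syndrome element more likely than its base rate; a lift below 1 means the opposite.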
On the basis of the above statistical methods, we compute statistics on the training set for the engraving symptoms, tongue, moss, pulse, and syndrome elements separately. Then, we calculate the statistical information of each medical case from these statistical results. In each sample, the engraving symptoms, tongue, moss, and pulse contain a variable number of words, so further processing is needed. The processing methods include summation, averaging, and padding. The padding rule is defined as follows: the maximum length of the engraving symptoms is 7; if the length is insufficient, the sequence is filled with −1; and if the length is exceeded, the sequence is truncated. The retention length of the tongue, moss, and pulse is 1, which means that the average of the corresponding statistical information of each medical case is taken. Taking one of the samples as an example, the statistical information generation process is as follows.
First, we combine the sample information and the statistical information to obtain the statistical values of the engraving symptoms, tongue, moss, pulse, and syndrome elements. Second, the statistical values are processed by summation, averaging, and padding. The sample information is shown in Table 2. The corresponding statistical values of the samples are shown in Table 3.
Here, $F_1$ and $F_2$ denote the two types of confidence degree features, and $L$ denotes the promotion degree feature. The sample statistical features are shown in Table 4.
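The padding and averaging rules described above can be sketched as follows. The maximum length of 7 and the fill value −1 come from the text; the function names and the fallback value for an empty field are assumptions.

```python
def pad_or_truncate(values, max_len=7, pad_value=-1.0):
    """Keep at most max_len values and fill the remainder with pad_value."""
    clipped = list(values[:max_len])
    return clipped + [pad_value] * (max_len - len(clipped))

def mean_or_pad(values, pad_value=-1.0):
    """Average a variable-length field down to one value (retention length 1)."""
    return sum(values) / len(values) if values else pad_value

symptom_stats = pad_or_truncate([0.2, 0.5])       # padded up to length 7
long_stats = pad_or_truncate(list(range(9)))      # truncated down to length 7
tongue_stat = mean_or_pad([0.2, 0.4])             # single averaged value
```

The fixed-length outputs can then be concatenated into one statistical feature vector per sample.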

Compare Models.
TextCNN [31]: TextCNN is a multilabel framework, which uses a convolutional neural network to construct the model structure.
MLP [32]: the multilayer perceptron is a basic structure of deep learning and is widely applied in recommendation systems; we use it as our baseline.
MMOE [23]: MMOE, which follows the Expert-Bottom pattern, integrates the experts through multiple gates, each of which produces the weights of the experts for one task.
Two-Tower Model [7]: the deep two-tower neural network is based on the recommendation method proposed by YouTube. The vectors of items and users are concatenated and fed into a multilayer feed-forward neural network.
MMOE-Match-Mimic: this is our proposed model, which combines a match network and a multitask learning network.

Parameter Setting.
All the above neural network models are implemented with the TensorFlow framework. We use a GPU server (Tesla V100-PCIE-32 GB) to train and test each model. To make the models comparable, we keep the data sets, hyperparameters, and model structures as consistent as possible. TextCNN is implemented as a multilabel classifier, and its data set needs to be constructed independently, so this model differs the most from the other models. It should be emphasized that the medical case fields used in the training and test sets are the same for all models. TextCNN employs the Adam optimizer with a learning rate of 0.001, and each batch contains 32 medical cases. The model uses the Albert pretrained model to embed the input, and convolution kernels with lengths of 4, 5, and 6 are used for one-dimensional convolution. Finally, the outputs of the convolution layers are concatenated and passed through a fully connected layer for multilabel prediction. The activation function of the fully connected layer is sigmoid. In the other networks, the multilabel problem is converted into N binary classification problems. The Adam algorithm is used to update the model parameters with a learning rate of 0.0005, and each batch contains 5000 samples. The vocabulary of the embedding layer is randomly initialized from a normal distribution and updated during model training. The MLP network is a three-layer fully connected network with 256, 128, and 64 nodes per layer, respectively. MMOE contains 8 expert modules, each of which is a two-layer fully connected network with 256 and 128 nodes, and the tower layer is a one-layer fully connected network with 64 nodes. The structure of the two towers in the match network is consistent with MLP. The MMOE module and the two-tower module in the MMOE-two-tower network are consistent with the settings of the MMOE and two-tower models, respectively.
All the model parameters are randomly initialized based on truncated normal distribution, and there is a certain fluctuation in the evaluation. To prevent the interference of index fluctuation on the results, the model is trained 5 times in each experiment, and the average value of the evaluation index of each model is taken.

Evaluate Metrics.
There are five evaluation indicators in our experiments: area under the ROC curve (AUC), RelaImpr, Hits@10, MeanRank, and MRR [33].

AUC.
AUC measures the performance of the ranking model using the predicted values on the test data set. The higher the AUC, the better. We define it as follows:
$$\mathrm{AUC} = \frac{1}{|T^{+}|\,|T^{-}|} \sum_{x^{+} \in T^{+}} \sum_{x^{-} \in T^{-}} I\bigl(f(x^{+}) > f(x^{-})\bigr)$$
where $T^{+}$ and $T^{-}$ denote the sets of positive and negative samples, $|T^{+}|$ and $|T^{-}|$ are the numbers of positive and negative samples, $f(x)$ is the prediction function, and $I(x)$ is the indicator function.
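The pairwise definition of AUC can be checked on a small example with the sketch below. Counting ties as half a win is a common convention and an assumption here; the function name is hypothetical.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example_auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])  # 3 of 4 pairs ordered correctly
```

The double loop costs time proportional to the number of positive-negative pairs, so for test sets with millions of samples a rank-based computation is the usual choice.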

RelaImpr.
RelaImpr represents the relative improvement of a model over a baseline; the higher, the better. We define it as formula (23).
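RelaImpr is commonly computed relative to the AUC of 0.5 that a random ranker achieves. The sketch below uses that common definition; whether it coincides exactly with the authors' formula is an assumption, and the function name is hypothetical.

```python
def rela_impr(auc_model, auc_base):
    """Relative improvement in percent, measured above the 0.5 AUC of a random ranker."""
    if auc_base == 0.5:
        raise ValueError("baseline AUC of 0.5 leaves no margin to compare against")
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0

improvement = rela_impr(0.8, 0.7)  # margin grows from 0.2 to 0.3
```

Subtracting 0.5 matters because a small absolute AUC gain can be a large share of the margin that a model actually has over random ranking.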

Hits@10.
For the label set of a medical record, Hits@10 denotes the probability that the labels are hit by the first 10 outputs; the higher, the better. We define it as formula (24), where pred_10 denotes the top 10 predicted syndrome elements.
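One reading of this metric is the fraction of a record's true syndrome elements that appear among the top 10 predictions. The sketch below implements that reading; the interpretation, the function name, and the toy scores are assumptions.

```python
def hits_at_k(true_elements, scores, k=10):
    """Share of the true elements found among the k highest-scoring candidates."""
    if not true_elements:
        raise ValueError("the record has no true syndrome elements")
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(true_elements) & set(top_k)) / len(set(true_elements))

# Toy scores for three hypothetical candidates; k is lowered to 2 for illustration.
scores = {"a": 0.9, "b": 0.8, "c": 0.1}
example_hits = hits_at_k({"a", "c"}, scores, k=2)  # only "a" is in the top 2
```

Averaging this value over all test records gives a data-set-level score.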

MeanRank.
MeanRank measures how likely the model is to make ranking errors; the smaller, the better. We define it as formula (25), where $y$ denotes the number of real syndrome elements of the medical record that overlap with the top 10 predicted syndrome elements, $s$ denotes the number of real syndrome elements, and $i$ is the ranking position of a predicted syndrome element.

MRR.
MRR indicates the generalization ability and robustness of the model; the larger, the better. We define it as formula (26).

Offline Results.
In this section, we compare our proposed method with several benchmark models. TextCNN is a common model for multilabel classification. MLP is a classical neural network that is widely used as a baseline. MMOE is a representative multitask learning model and the basis of our proposed model, so it is natural to compare with it. It can be seen from Table 5 that our proposed model has an advantage in all indicators. Hits@10 is the indicator we pay most attention to, and our model shows an obvious improvement in it. It is worth noting that MMOE does not perform well in the experiment, and its AUC is even lower than that of MLP. The reason is that MMOE is a multitask model, whereas, due to the limitation of the data set, our experiment is a single-task scenario, so MMOE cannot give full play to its advantages. Our proposed model is based on MMOE, which better demonstrates the superiority of the match network in this scenario.

Ablation Experiment
(1) Our proposed model extracts the features between symptoms and syndrome elements via the match module and applies the match output to the loss function to assist network updating. To verify these two effects, we separate them and conduct experiments with MMOE and the match model, respectively. The experimental results are shown in Table 6. It can be seen that using the match module to assist loss function updating and employing the match-net for automatic feature extraction are both effective, and the combination of the two achieves a more obvious effect.
(2) We construct a series of features via the method in 4.1 and add each of these features to MMOE for training. We compare them with our model, and the experimental results are shown in Table 7. Most of the carefully designed features perform better than the features automatically extracted by the model. However, there are also some cases, such as the promotion degree feature, in which a designed feature reduces the prediction performance of the model. Our model adopts a match module to extract features automatically, which yields reasonably good results. When a new business needs to be launched quickly or the domain knowledge is insufficient, it is difficult to construct manual features. Under these circumstances, our proposed model does not require any prior knowledge, and the features are learned completely automatically. This not only speeds up bringing a business online but also reduces the dependence of the features on domain knowledge.
Journal of Healthcare Engineering

Conclusions
In this study, we propose the MMOE-match-mimic network to model the relationship between symptoms and syndrome elements. The proposed match network module is combined with the multigate mixture-of-experts in the loss function. In this way, it can learn more information from different ID embeddings to improve the performance of multitask learning. To the best of our knowledge, this is the first work that introduces mimic interaction into the MMOE-match network. The work adds communication between recalling and ranking, which improves combined training. The mimic mechanism aims to model the information interaction between the MMOE and match networks and produces better symptom representations for TCM data. We conduct extensive experiments and obtain a 0.705% increase in AUC, which demonstrates the effectiveness of MMOE-match-mimic for the TCM symptom recommender. Moreover, MMOE-match-mimic has been successfully deployed to serve the laboratory TCM health diagnostic system, where it shows significant improvements compared with state-of-the-art baseline models.
Data Availability
The data set comes from medical records of 480,000 real patients collected in the laboratory.

Conflicts of Interest
No conflicts of interest exist in the submission of this manuscript.