Tongue Image Database Construction Based on the Expert Opinions: Assessment for Individual Agreement and Methods for Expert Selection

This study introduces a method for evaluating individual agreement in order to identify discordant raters within an expert group. We exclude those experts and determine the best expert selection method, so as to improve the reliability of a tongue image database constructed from experts' opinions. Fifty experienced experts from the TCM diagnostic field all over China were invited to rate 300 randomly selected tongue images. Gwet's AC1 (first-order agreement coefficient) was used to calculate the interrater and intrarater agreement. An optimization of the interrater agreement and a disagreement score were proposed to evaluate the external consistency of each individual expert. The proposed method successfully optimized the interrater agreement. Across the three expert selection methods compared, the interrater agreement increased from 0.53 [0.32-0.75] originally to 0.64 [0.39-0.80] with method A (inclusion of experts whose intrarater agreement > 0.6), 0.69 [0.63-0.81] with method B (inclusion of experts whose disagreement score = “0”), and 0.76 [0.67-0.83] with method C (inclusion of experts whose intrarater agreement > 0.6 and disagreement score = “0”). This study provides an estimate of external consistency for individual experts and shows that considering both the internal and the external consistency of each expert is superior to considering either one alone in tongue image database construction based on expert opinions.


Introduction
Recently, traditional Chinese medicine (TCM), a form of complementary and alternative medicine with a history of five thousand years, has been gradually accepted by the Western medical system. In TCM, tongue diagnosis plays an important role in clinical syndrome differentiation and therapeutic evaluation. However, the observation of tongue features is often biased by variation in observers' subjective experience, uncertainty in the classification standard, and the external lighting environment, and this has become a bottleneck for the objectification and internationalization of TCM.
Nevertheless, with the rise of computer image processing techniques over the past 30 years, tongue image diagnosis has gradually been objectified and quantified, which has greatly promoted the development of TCM diagnostic technology. Constructing digital tongue image databases based on experts' opinions has become an inevitable trend in the objectification of tongue diagnosis. High-quality labeled data are the foundation of such a database, so a reliability and agreement evaluation of the data obtained from experts is essential for constructing a more reliable database based on their clinical decision-support. Among the methods for reliability and agreement studies, the Kappa statistic, first proposed by Cohen in 1960, is used as a scientific indicator of the degree of agreement. Since then, Kappa statistics have been widely used in clinical agreement and reliability studies for nominal and ordinal measurements, such as in neurology [1], pathology [2], epidemiology [3], clinical diagnostics, especially for medical images [4,5], and clinical therapeutic evaluation [6]. In addition, different methods of agreement and reliability assessment are recommended according to the data type and the number of raters. For example, Cohen's Kappa can be used for nominal data with two raters, Fleiss' Kappa is suited to nominal data with more than two raters, weighted Kappa can be applied to ordinal data, and intraclass correlation coefficients (ICC) are suitable for continuous data [7]. Moreover, benchmarks for interpreting the values of agreement coefficients are provided by Landis and Koch, with values below 0 as poor, 0-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial, and 0.81-1.0 as almost perfect [8].
Accordingly, studies have been carried out to assess the interrater and intrarater agreement of tongue diagnosis among TCM practitioners [9-12], and widespread inconsistencies are commonly observed. Several problems in these previous studies remain to be discussed. First, the discrepancies among raters indicate that large-scale agreement studies, in which many raters contribute ratings, should be conducted. Second, the statistical methods used in those studies to assess agreement and reliability are limited to the Kappa coefficient. However, in our preliminary study we found low Kappa values for certain items despite a high percentage of agreement, a phenomenon described as the "Kappa paradox" by Feinstein and Cicchetti [13], so a straightforward interpretation of the magnitude of Kappa may not reflect the true condition. Third, the traditional coefficients for interrater agreement actually measure the total agreement of an expert group. For an individual expert, only the intrarater agreement can be obtained, and it reflects internal consistency. How to evaluate the external consistency of each expert remains unreported.
To the best of our knowledge, no study has provided an approach appropriate for constructing a reliable tongue image database based on expert opinions. Thus, through a large-scale agreement study of expert opinions, our study focuses on how to identify the discordant raters in an expert group, whether a method exists to assess the external consistency of an individual expert, and which expert selection method is best for constructing a more reliable database.
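The "Kappa paradox" can be illustrated with a small two-rater example. The sketch below is not taken from the study: the function name and the counts are hypothetical, and it uses the standard two-rater, two-category forms of Cohen's Kappa and Gwet's AC1 to show how a skewed feature prevalence can drive Kappa down while the percentage of agreement stays high.

```python
def two_rater_agreement(both_yes, first_only, second_only, both_no):
    """Return (percent agreement, Cohen's kappa, Gwet's AC1) for a 2x2 table."""
    n = both_yes + first_only + second_only + both_no
    observed = (both_yes + both_no) / n
    first_yes = (both_yes + first_only) / n    # rater 1's "existed" rate
    second_yes = (both_yes + second_only) / n  # rater 2's "existed" rate

    # Cohen's kappa: chance agreement from the product of the raters' marginals.
    chance_kappa = first_yes * second_yes + (1 - first_yes) * (1 - second_yes)
    # Gwet's AC1: chance agreement from the mean prevalence of "existed".
    prevalence = (first_yes + second_yes) / 2
    chance_ac1 = 2 * prevalence * (1 - prevalence)

    # Note: kappa is undefined (division by zero) when chance_kappa equals 1.
    kappa = (observed - chance_kappa) / (1 - chance_kappa)
    ac1 = (observed - chance_ac1) / (1 - chance_ac1)
    return observed, kappa, ac1


# Hypothetical counts: both raters mark the feature "existed" on most images.
observed, kappa, ac1 = two_rater_agreement(90, 5, 5, 0)
print(f"agreement={observed:.2f}  kappa={kappa:.3f}  AC1={ac1:.3f}")
```

With these illustrative counts the raters agree on 90% of the images, yet Kappa is slightly negative (about -0.05), whereas AC1 is about 0.89.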

Subjects.
A total of 300 randomly selected tongue images were collected with the TDA-1 hand-held tongue image acquisition device [14] under the same standardized tongue image acquisition process. All were obtained from patients of the physical examination center or the outpatient department of Shuguang Hospital affiliated with Shanghai University of Traditional Chinese Medicine during 2015. Among them, 35 randomly chosen tongue images were repeated twice to assess the intrarater agreement, because our preliminary research with 10 raters and 60 tongue images indicated that at least this sample size gives a stable evaluation of the intrarater agreement for each rater.

Ethics Statement.
The IRB of Shuguang Hospital affiliated with Shanghai University of TCM approved the study (no. 2015-388-16-01), and written informed consent was obtained from all included subjects in accordance with the Declaration of Helsinki.

Instruments.
The ratings for tongue image diagnosis were conducted on a web interface (Figure 1), which we created to allow remote collection of tongue image ratings from experts in the TCM diagnostic field all over China. All tongue images were converted into 300 × 300 pixel images for display. The features of the tongue images to be classified by the experts consisted of tongue color (pale tongue, light red tongue, red and crimson tongue, and purplish tongue), tongue texture (old, moderate, and tender), tongue shape (enlarged, moderate, and thin), other morphological tongue features (teeth-print, red dot, crack, petechia, and bruise), fur existence (no, yes), fur color (white fur, yellowish fur, and black and gray fur), fur thickness (thin fur, thick fur), fur saliva (moist fur, dry and rough fur, and damp and smooth fur), and other fur features (greasy fur). Each of the above features was classified as "existed" or "not existed." In addition, to minimize the color bias caused by different monitors, reference images for the standardized tongue colors (pale tongue, light red tongue, red and crimson tongue, and purplish tongue) and standard fur colors (white fur, yellowish fur, and black and gray fur) were provided on the web page next to the images to be diagnosed.

Raters.
A total of 50 experts from the TCM diagnostic field all over China were recruited for this study. All are registered doctors of TCM who were tested on their ability to identify different tongue features as part of their certification, and each holds at least a professional level certified as "secondary senior." Experts received a small remuneration for their work. Each expert was assigned an account by an administrator who was blinded to the rating results, and the experts then entered the tongue image diagnosis system to rate the tongue image features independently.

The Proposed Method.
To assess the individual agreement of each expert, two procedures were included: the identification of the discordant experts and the continued optimization of the interrater agreement. For the former, the confidence interval method based on the Spearman-Brown formula was used to analyze whether the information provided by one expert was consistent with that of the whole expert group. This method was proposed by van Ast [15] in a reliability study, based on intraclass correlation coefficients (ICC), of epileptic seizures diagnosed by neurologists/epileptologists. The process for detecting experts with deviating opinions has the following steps. First, the results of all n experts are used to calculate their reliability R_n by Gwet's AC1. Then each expert is excluded in turn, and the reliability R_i (i = 1, 2, 3, ..., n) of the remaining n-1 experts and its corresponding standard deviation (SD) are calculated. Next, by applying the Spearman-Brown formula to R_i, the predicted reliability of n experts R_j (j = 1, 2, 3, ..., n) and its confidence interval (R_j ± 2SD) are calculated. We used R_j ± 2SD as the confidence interval because detecting experts whose opinions deviate from the whole expert group is similar to identifying "outliers" in the initial group. If the actual reliability for n experts, R_n, falls within the confidence interval of the predicted reliability R_j, the excluded expert provided consistent opinions. If R_n > R_j + 2SD, the excluded expert improves the reliability of the whole expert group, whereas if R_n < R_j - 2SD, the excluded expert decreases it.
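The identification steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: all function names are hypothetical, the ratings are assumed complete (every expert rated every image), the multi-rater AC1 follows Gwet's standard definition, the Spearman-Brown length factor is assumed to be n/(n-1), and the SD is passed in by the caller because the text does not specify how it is estimated.

```python
from collections import Counter


def gwet_ac1(ratings, categories):
    """Gwet's AC1 for complete data: ratings[i] lists every rater's label for image i."""
    q = len(categories)
    n_images = len(ratings)
    pa_sum = 0.0
    prevalence = {c: 0.0 for c in categories}
    for image in ratings:
        r = len(image)
        counts = Counter(image)
        # Share of rater pairs that agree on this image.
        pa_sum += sum(k * (k - 1) for k in counts.values()) / (r * (r - 1))
        for c in categories:
            prevalence[c] += counts[c] / r
    pa = pa_sum / n_images
    # Chance agreement from the mean share of ratings in each category.
    pe = sum((p / n_images) * (1 - p / n_images)
             for p in prevalence.values()) / (q - 1)
    return (pa - pe) / (1 - pe)


def leave_one_out(ratings, categories):
    """R_i: AC1 of the remaining n-1 experts after dropping expert i."""
    n = len(ratings[0])
    return [gwet_ac1([img[:i] + img[i + 1:] for img in ratings], categories)
            for i in range(n)]


def spearman_brown(reliability, length_factor):
    """Predicted reliability when the panel size is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def classify_expert(r_n, r_i, sd, n):
    """Compare the actual R_n with the interval R_j +/- 2SD predicted from R_i."""
    r_j = spearman_brown(r_i, n / (n - 1))  # assumed length factor n/(n-1)
    if r_n < r_j - 2 * sd:
        return "decreases reliability"
    if r_n > r_j + 2 * sd:
        return "improves reliability"
    return "consistent"


# Toy data: 4 images, 3 experts, labels 1 = "existed", 0 = "not existed".
toy = [[1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
r_n = gwet_ac1(toy, [0, 1])
r_each = leave_one_out(toy, [0, 1])
```

In the toy data the third expert always contradicts the other two, so the AC1 of the remaining pair is highest when that expert is the one left out.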
Because our aim is to establish a more reliable tongue image database, we only need to find the experts whose results decrease the overall reliability. In other words, let ΔR = R_n - (R_j - 2SD); ΔR is calculated for each exclusion of one expert, and ΔR < 0 indicates that the excluded expert was inconsistent with the rest of the experts. The latter procedure builds on the former, and its cut-point was set to 0.6, above which the reliability reaches the "substantial" level (agreement coefficient > 0.6). Suppose that k experts were detected as discordant raters in the identification procedure; the reliability R_{n-k} of the remaining n-k experts can then be calculated. If R_{n-k} ≤ 0.6, the reliability is below the "substantial" level, and procedure one is applied again to the remaining experts, repeatedly, until the reliability reaches the "substantial" level (R_{n-k} > 0.6) after exclusion of the inconsistent experts or cannot be optimized any further. We call this procedure the optimization of the interrater agreement. The whole process of the proposed method is shown in Figure 2.
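The circular application of procedure one can be sketched as a simple loop. The sketch below is illustrative only: the function and parameter names are hypothetical, the reliability measure and the discordant-expert test are supplied by the caller as functions, and "cannot be optimized any further" is interpreted, as an assumption, as a round in which no discordant expert is found (or too few experts remain).

```python
def optimize_interrater_agreement(experts, reliability, find_discordant,
                                  cut_point=0.6):
    """Repeat the identification step until reliability exceeds cut_point.

    reliability(group) returns the agreement coefficient of a group;
    find_discordant(group) returns the experts with Delta R < 0 in that group.
    Returns the retained experts and the experts excluded in each round.
    """
    remaining = list(experts)
    rounds = []
    while len(remaining) > 2:
        discordant = list(find_discordant(remaining))
        if not discordant:
            break  # assumed meaning of "cannot be optimized any further"
        rounds.append(discordant)
        remaining = [e for e in remaining if e not in discordant]
        if reliability(remaining) > cut_point:
            break  # "substantial" level reached
    return remaining, rounds


# Toy stand-ins (hypothetical): each expert has a fixed "noise" level.
noise = {"e1": 0.1, "e2": 0.1, "e3": 0.2, "e4": 0.9, "e5": 0.7, "e6": 0.1}


def toy_reliability(group):
    return 1 - 2 * sum(noise[e] for e in group) / len(group)


def toy_find_discordant(group):
    worst = max(group, key=lambda e: noise[e])
    return [worst] if noise[worst] > 0.3 else []


kept, rounds = optimize_interrater_agreement(
    list(noise), toy_reliability, toy_find_discordant)
```

In this toy run two rounds are needed: the first excludes e4, the reliability of the remaining five is still at or below 0.6, and the second round excludes e5.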
In the process described above, suppose that the recognition of discordant raters was performed m times and that k experts were identified and excluded before the reliability of the remaining n-k experts reached R_{n-k} > 0.6 (or could not be optimized any further). The discordant raters identified in the first round are then scored "m", those identified in the second round are scored "m-1", and so on, until those identified in the m-th round are scored "1"; the remaining n-k experts are scored "0". The detailed scoring method is shown in Table 1. With this scoring method, we can obtain scores for all participating experts. Whereas the intrarater agreement assesses the internal consistency of an expert, this score reflects the degree to which an expert disagrees with the whole expert group and therefore captures the expert's external consistency. The higher the score, the more the expert disagrees with the remaining experts, and vice versa. We therefore call it the disagreement score.
Using the disagreement score together with the corresponding intrarater agreement, we compared three expert selection methods for the construction of the standard tongue image database: method A (inclusion of experts who had at least a "substantial" internal consistency, with intrarater agreement > 0.6), method B (inclusion of experts who had at least a "substantial" external consistency, with disagreement score = "0"), and method C (inclusion of experts who had both a "substantial" internal consistency and a "substantial" external consistency, with intrarater agreement > 0.6 and disagreement score = "0").
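The scoring rule can be written compactly. In this minimal sketch the function name and the expert identifiers are hypothetical; `rounds` holds the experts excluded in each recognition round, in order.

```python
def disagreement_scores(experts, rounds):
    """Score m for round 1, m-1 for round 2, ..., 1 for round m, else 0."""
    m = len(rounds)
    scores = {expert: 0 for expert in experts}
    for round_index, excluded in enumerate(rounds):
        for expert in excluded:
            scores[expert] = m - round_index
    return scores


# Hypothetical example: 6 experts, 3 recognition rounds.
example = disagreement_scores(
    ["e1", "e2", "e3", "e4", "e5", "e6"],
    [["e4"], ["e2", "e5"], ["e6"]],
)
```

Here m = 3, so e4 receives 3, e2 and e5 receive 2, e6 receives 1, and the retained experts e1 and e3 receive 0.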
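The three inclusion rules amount to simple filters. The sketch below is illustrative: the function name, the expert identifiers, and the values are hypothetical, and the two dictionaries are assumed to hold each expert's intrarater agreement and disagreement score for one tongue feature.

```python
def select_experts(intrarater, disagreement, method, cut_point=0.6):
    """Return the experts retained by selection method "A", "B", or "C"."""
    selected = []
    for expert in intrarater:
        internal_ok = intrarater[expert] > cut_point  # internal consistency
        external_ok = disagreement[expert] == 0       # external consistency
        keep = {
            "A": internal_ok,
            "B": external_ok,
            "C": internal_ok and external_ok,
        }[method]
        if keep:
            selected.append(expert)
    return selected


# Hypothetical values for four experts.
intrarater = {"e1": 0.82, "e2": 0.55, "e3": 0.71, "e4": 0.64}
disagreement = {"e1": 0, "e2": 0, "e3": 2, "e4": 0}
by_method = {m: select_experts(intrarater, disagreement, m) for m in "ABC"}
```

Method C retains only the experts that pass both filters, so its selection is always the intersection of the selections made by methods A and B.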

The Results for the First Identification of Discordant Experts.
In the 50 experts' rating results for the 230 nontesting tongue images, the discordant experts for each of the 25 tongue features were identified. The interrater agreements before and after the removal of those discordant experts are given in Table 2, which shows that the interrater agreement for all 25 tongue features increased after the first exclusion of the identified experts.

The Optimization Results of the Interrater Agreement.
After the first recognition and removal of discordant experts from the 50 experts, the reliability for some tongue features was still below the "substantial" level (AC1 ≤ 0.6). These included tongue body features (light red tongue, moderate texture, tender tongue, enlarged tongue, moderate tongue shape, teeth-print, and crack) and tongue fur features (white fur, thin fur, thick fur, moist fur, and greasy fur). Among these features, the interrater agreement of the experts' classification for "moderate tongue texture" was the lowest (AC1 = 0.0948). Taking it as an example, the results of the optimization process of the interrater agreement are given in Table 3. As the table shows, 8 recognition rounds were conducted and identified 34 discordant experts, which improved the interrater agreement from 0.0948 to 0.6333. Similarly, the optimization results for all of the above tongue features are shown in Table 4: the interrater agreement for every feature was successfully optimized to the "substantial" level (AC1 > 0.6) except for "thin fur," whose largest optimized value was 0.4601.

The Individual Agreement for 50 Experts.
Taking "moderate texture" as an example, 8 recognition rounds identified 34 discordant experts. The disagreement scores for those discordant experts are shown in Table 5, and the remaining experts were scored "0." In this way, the external consistency of each expert for all 25 tongue features can be indicated by the disagreement scores. The disagreement scores and intrarater agreement for the 50 experts are shown in Figure 3, in which the red dashed lines represent the average levels for all 50 experts and divide the scatter diagram into 4 sections (A, B, C, and D). Most experts were distributed in sections A and D, which suggests that experts with a higher internal consistency usually tend to have test results more consistent with those of the remaining experts. Nevertheless, there are still certain experts whose test results disagree (agree) with the rest even though they have a relatively higher (lower) internal consistency.
Among the pairwise comparisons of the expert selection methods, statistical significance was only found between method C and the original one before selection (P<0.05). These results indicate that, among the three expert selection methods, considering both the internal consistency and the external consistency of each expert is superior to considering either one alone.

Discussion
In the process of tongue image database construction based on expert opinions, the main problem is how to assess the agreement or reliability of each expert, so that appropriate experts can be selected to establish a more reliable database based on experts' clinical decision-support. However, for an individual rater, only the intrarater agreement can be evaluated, and the traditional coefficients for interrater agreement, such as the intraclass correlation coefficient (ICC), the Kappa coefficient, and the AC1 value, actually measure total agreement and reflect the overall reliability of a group of raters. Some researchers have made attempts on these issues, but each was limited to a certain data type (continuous, ordinal, or nominal). For example, Barnhart proposed the coefficient of individual agreement (CIA) to assess individual agreement for continuous data [17,18]. Nelson proposed an approach based on the class of generalized linear mixed models to assess the influence of rater and subject characteristics on measures of agreement for ordinal ratings [19]. Ruddat proposed a Kappa-based method to improve the assessment of agreement between several observers [20]. Unlike these studies, we proposed an optimization method for the interrater agreement that builds on the existing confidence interval method based on the Spearman-Brown formula, and we further introduced disagreement scores to evaluate the degree to which each expert was inconsistent with the rest of the experts. Moreover, the method provided in our study is not bound to a data type, because it operates on the value of the agreement coefficient itself. It could therefore be applied to various agreement coefficients for different data types (ordinal, nominal, or continuous) in the construction of other standard knowledge bases based on clinical decision-support from experts.
In this study, we first verified the effectiveness of the existing confidence interval method based on the Spearman-Brown formula for identifying discordant experts (Table 2). The optimization method for the interrater agreement was then applied to the tongue features whose interrater agreement was still less than or equal to 0.6 after the first identification of discordant experts. The optimization results for these tongue features are given in Table 4, which indicates that the interrater agreement for all of these tongue features could be clearly improved by the circular identification of inconsistent experts.
Furthermore, alongside the intrarater agreement, which reflects the internal consistency of an individual expert, the disagreement score was introduced to assess the external consistency of each expert. We then studied the relationship between the internal consistency indicated by the intrarater agreement and the external consistency indicated by the disagreement scores for the 50 participating experts (Figure 3). The results illustrate that experts with a higher internal consistency usually tend to have a higher external consistency. However, there are still some experts whose internal consistency does not correspond to their external consistency. Therefore, we further compared different expert selection methods for the 25 tongue features (Figure 4). The results indicate that, in standard tongue image database construction based on expert opinions, both the experts' internal consistency (indicated by the intrarater agreement) and their external consistency (indicated by the disagreement scores) should be considered, rather than either one alone, to obtain a more reliable database based on clinical decision-support.
Last but not least, this research was conducted on a web interface, which allows remote collection of tongue image ratings and lets as many raters as possible participate simultaneously and independently. Despite these merits, a different result might be obtained if the ratings were not conducted on a web interface, and color difference is the key problem that might lead to it. In this research, we took the following measures to avoid this problem as far as possible. First, all the tongue images were acquired with the TDA-1 hand-held tongue image acquisition device described in our other paper [14]. By imitating a stable illumination environment that is closest to the natural light of traditional visual observation, the standardized tongue image acquisition process helps ensure a more realistic color rendering of the tongue image. Second, to minimize the color bias caused by the different monitors that experts used, reference images for the standardized tongue colors (pale, light red, red and crimson, and purplish) and standard fur colors (white, yellowish, and black and gray) were provided on the web page next to the images to be diagnosed.

Conclusion
In this study, we successfully optimized the interrater agreement and provided an estimate of external consistency for individual experts. In addition, we found that considering both the internal consistency and the external consistency of each expert is superior to considering either one alone in tongue image database construction based on expert opinions.

Data Availability
The datasets generated and analyzed during the current study are not publicly available due to the confidentiality of the data, which is an important component of the National Key Technology R&D Program of the 13th Five-Year Plan (no. 2017YFC1703301) in China, but are available from the corresponding author on reasonable request.

Conflicts of Interest
The authors report no conflicts of interest in this work.