Validity of a Diagnostic Scale for Acupuncture: Application of the Item Response Theory to the Five Viscera Score

In acupuncture therapy, diagnosis, acupoints, and stimulation for patients with the same illness are often inconsistent among Traditional Chinese Medicine (TCM) practitioners. This is in part due to the paucity of evidence-based diagnostic methods in TCM. To solve this problem, the establishment of a validated diagnostic tool is essential. We first applied the Item Response Theory (IRT) model to the Five Viscera Score (FVS) to test its validity by evaluating the ability of the questionnaire items to identify an individual's latent traits. Next, the health-related QOL scale (SF-36), a suitable instrument for evaluating acupuncture therapy, was administered to evaluate whether the FVS can be used to make a health-related diagnosis. All 20 items of the FVS had adequate item discrimination, and 13 items had high item discrimination power. Measurement accuracy was suited for application in a range of individuals, from healthy to symptomatic. When the FVS and SF-36 were administered to a second group of subjects, some of whom had also taken part in the first survey, we found an association between the two scales, and the same findings were obtained when symptomatic and asymptomatic subjects were compared regardless of age and sex. In conclusion, the FVS may be effective in clinical diagnosis.


Introduction
Traditional Chinese Medicine (TCM) originates from Huangdi's Internal Classic, which was written around the 2nd century BC. According to this book, the five elements (e.g., the human organs: liver, heart, spleen, lung, and kidney) are created from yin and yang, which are the foundations of all parts of the world. Qi and blood were described as flowing out from the five viscera, traveling through meridians, and connecting the acupoints on the body surface. The method of examination is also stated in the book [1,2]. Unlike Western medicine, which clearly differentiates health and illness, the concept of health in TCM denotes a state where all body components are in good balance [3]. An imbalance is considered to be caused by the stagnation of qi and blood flow, and acupuncture is performed on the acupoints in order to resolve the stagnation.
Acupuncture was introduced to Japan around the 6th century [4]. In the following 1500 years, Japanese acupuncture has undergone a unique development in an isolated environment. Meridian Therapy (MT) is a therapy unique to Japanese acupuncture with an approach based on the theories described in ancient Chinese literature [5]. Instead of directly controlling the flow of qi and blood traveling through the meridians, MT aims to restore balance of the five viscera that control the amount of qi and blood flowing through the meridian. Hence, five viscera diagnosis is very important in Japanese acupuncture therapy.
Today, acupuncture therapy is practiced widely throughout the world, particularly in East Asia, and it is recognized as a form of complementary and alternative medicine (CAM) [6]. In acupuncture therapy, diagnosis, acupoints, and stimulation methods for patients with the same illness often vary between TCM practitioners [6][7][8]. Scientists are conducting intervention studies for specific diseases to evaluate the effectiveness of acupuncture therapy. However, although the focus of these studies is usually the assessment of therapeutic effectiveness [9][10][11], the acupoints and stimulation methods used in the interventions vary for the same illness, as mentioned earlier, and the measured effectiveness may be affected by this variation. This is in part due to the paucity of evidence-based diagnostic methods in TCM. As a result, TCM has had no choice but to allow various intervention methods to coexist. While a standardized diagnostic method to elucidate the effectiveness of acupuncture therapy could eliminate the advantage of TCM, developing a common diagnostic scale might, on the other hand, promote research and bring acupuncture therapy into the mainstream.
We have been developing the "Five Viscera Score (FVS)" as a diagnostic scale by selecting, on the basis of statistical analysis, symptoms of the five viscera from leading TCM literature of the last 2000 years [12][13][14]. Like most typical scales, the FVS was created by conducting an exploratory factor analysis of the data collected from the target population and generalized based on coefficients such as Cronbach's alpha in classical test theory (CTT). Even a highly accurate scale created using CTT may not be applicable to other populations, since a scale based on CTT strongly reflects the population from which it was created. This dilemma applies not only to the FVS but also to all scales that are validated with CTT. In recent years, Item Response Theory (IRT) has been used to evaluate questionnaire items and specific properties of individuals, which enables researchers to use questionnaires outside the population from which they were created. The advantages of IRT are the interchangeability of questionnaire items, the ability to develop scales item by item, and the standardization of individual traits. Accordingly, many scales are validated with IRT [15][16][17].
Scales derived from factor analysis are affected by latent variables. IRT applies this concept and uses logistic functions to estimate the latent ability (θ) hidden in individual responses and questionnaire items. In contrast, θ cannot be measured directly with CTT. Details of IRT have been described previously by Baker and Kim [18]. It should be noted that higher effectiveness can be expected when scales are developed and their validity is tested with both CTT and IRT than when CTT is used alone [19,20]. There have been some studies examining diagnostic methods in TCM [21][22][23][24], but very few of them to date have used IRT [25].
In our study, we first applied the IRT model to the FVS (Phase 1) to test the validity of the scale by evaluating the latent ability of the questionnaire items and individual traits. Next, the Medical Outcome Study Short-Form 36-Item Health Survey (SF-36) Version 2 [26], which is a suitable instrument for evaluating acupuncture therapy, was administered to evaluate whether the FVS can be used to make a health-related diagnosis in patients (Phase 2).

Study Subjects and Implementation Method
Phase 1.
A total of 781 subjects (560 men and 221 women) took part in the study: 739 were students, and 42 were employees working in a vocational school in Osaka City, Japan, where the admission requirement is a high school diploma. Anonymous questionnaires were distributed to the students and employees. The study was conducted at the end of May 2010, and the collection period was 2 weeks. Additionally, the questionnaire items used in this study were based on survey sheets used to create the FVS.

Phase 2.
This phase was conducted at the end of May 2011, using the same method as Phase 1. Two hundred and ninety-one students and 30 staff members from the same vocational school made up a total of 321 subjects (208 men and 113 women). Of these 321 subjects, 193 had also taken part in Phase 1.

Ethical Considerations.
The study was conducted after obtaining approval from a joint ethics committee from the Kansai Vocational College of Medicine and an external evaluation committee (H22-02, H23-08). The study subjects received a verbal and written explanation of the study objectives. Only those who expressed their willingness to participate were given a questionnaire to be completed and placed in a collection box.

Questionnaire
The Five Viscera Score.
The FVS is a self-administered questionnaire consisting of 20 items related to general wellbeing over the previous month. It is constructed from symptoms related to the five viscera established in TCM. To prevent bias in the distribution of questionnaire items, a total of 773 symptoms related to the five viscera were selected from the TCM literature, ranging from ancient to modern Chinese and Japanese texts [12]. Next, we collapsed them into 111 items by excluding overlapping or unclear symptoms. Furthermore, 83 symptoms were excluded because their standard deviations were too large relative to their mean values (i.e., ceiling or floor effects). Then, exploratory factor analysis with 5 factors (generalized least squares method with varimax rotation) and CTT were applied to the remaining 28 symptoms, leaving 20 that had significant factor loadings (>0.35) [13]. Based on the functions of the five viscera [27] and the advice of acupuncture clinicians, the labels "liver," "heart," "spleen," "lung," and "kidney" were assigned to the factors, each of which comprises 4 symptom items, and the summed scores (frequency of the symptoms) were compared among the subjects. The factor loading values as well as Cronbach's coefficients of the subscale scores are presented in Table 2.
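The questionnaire items were answered using the following five choices: never, rarely, sometimes, most of the time, and always. Each subscale was evaluated using the total score of its questionnaire items (0 to 16 points), with higher scores indicating more severe symptoms.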
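As a rough illustration of how the subscale scores could be computed, the sketch below sums four items per subscale. The numeric coding of the choices (never = 0 through always = 4) and the assignment of Q1 to Q4 to liver, Q5 to Q8 to heart, and so on are assumptions inferred from the 0 to 16 point range and from the item numbers quoted in the results; neither is stated explicitly in the text, and the function and variable names are hypothetical.

```python
# Assumed coding of the five response choices (not stated in the text).
CHOICE_POINTS = {
    "never": 0,
    "rarely": 1,
    "sometimes": 2,
    "most of the time": 3,
    "always": 4,
}

# Assumed item-to-subscale layout: four consecutive items per viscus.
SUBSCALE_ITEMS = {
    "liver": range(1, 5),
    "heart": range(5, 9),
    "spleen": range(9, 13),
    "lung": range(13, 17),
    "kidney": range(17, 21),
}


def fvs_subscale_scores(answers):
    """Return the raw score (0 to 16) of each subscale.

    `answers` maps item numbers 1..20 to one of the five choice labels.
    """
    missing = [q for q in range(1, 21) if q not in answers]
    if missing:
        raise ValueError(f"missing answers for items: {missing}")
    return {
        name: sum(CHOICE_POINTS[answers[q]] for q in items)
        for name, items in SUBSCALE_ITEMS.items()
    }


# Made-up respondent: "rarely" everywhere except one liver item.
example = {q: "rarely" for q in range(1, 21)}
example[4] = "always"
scores = fvs_subscale_scores(example)
```

Rejecting incomplete questionnaires mirrors the exclusion of subjects with missing data in this study; a different handling of missing items would need its own justification.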

SF-36
SF-36 Version 2 is a health-related quality of life (QOL) scale to assess subjective health status and daily life functioning [28,29]. It is used frequently in qualitative evaluation in the field of alternative medicine. SF-36 is a self-administered questionnaire that contains 36 items related to physical and mental health status for the past month. The reliability and validity of the Japanese version have been thoroughly confirmed, and the scale has been standardized [26,30]. SF-36 consists of the following 8 subscales: physical functioning (PF), role physical (RP), bodily pain (BP), general health (GH), vitality (VT), social functioning (SF), role emotional (RE), and mental health (MH). The scores have been standardized for the Japanese population, with the score for an average, healthy individual being 50 points. A higher score indicates a better state of health.

Statistical Analysis
Phase 1: Examination of IRT Applied to the FVS.
Figure 1 is a chart for understanding item discrimination and item difficulty in IRT. The five curves, each estimated by a logistic function, represent the five choices: never (curve 1), rarely (curve 2), sometimes (curve 3), most of the time (curve 4), and always (curve 5). The two fundamental parameters of IRT are item discrimination and item difficulty. The slope of the curves represents item discrimination, and the point of intersection of adjacent curves represents item difficulty. Steeper slopes therefore suggest higher discrimination ability, and higher difficulty values suggest that healthy people are less likely to choose the corresponding response, as described in the following. In an ideal questionnaire item, the graph is symmetrical with varying degrees of difficulty.

Item Discrimination.
Item discrimination refers to the ability of an item to discriminate between individuals. The differentiation between presence and absence of symptoms is easier when discrimination values are higher. In our study, a value of 0.35 was established as the lower limit of acceptability. A value of 1.0 or more was considered excellent. Furthermore, when estimating item discrimination, a standard error (SE) of less than 0.3 was considered excellent.

Item Difficulty.
Item difficulty expresses the estimated degree of difficulty in answering each choice for the questionnaire item. As the FVS is answered on a 5-point scale, the degree of difficulty, which represents the boundaries between adjacent choices, is classified into 4 steps. The degree of difficulty also represents the intensity of the individual respondent's symptoms (θ). Hence, both the degree of difficulty and the individual latent trait are on the same θ axis, which means that the higher the value, the more difficult it is for a normal subject to choose that response and, in other words, the easier it is for a symptomatic subject to choose it. A degree of difficulty between −4.0 and 4.0 with an SE of less than 0.3 at the time of estimation was considered satisfactory.
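The roles of the two parameters can be sketched numerically. The example below assumes Samejima's graded response model, which the name of the estimation software used in this study (EasyEstGRM) suggests but which the text does not state explicitly. It takes one discrimination value a and four ordered difficulty values b1 to b4 and returns the probability of each of the five choices at a given θ. The function names and parameter values are hypothetical, and no scaling constant (such as D = 1.7) is applied.

```python
import math


def boundary_probability(theta, a, b):
    """Probability of responding at or above a boundary with difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def category_probabilities(theta, a, difficulties):
    """Probabilities of the ordered choices under a graded response model.

    `difficulties` holds the boundary difficulties b1..b4 in ascending order;
    the result has one more entry than `difficulties` (never ... always).
    """
    if list(difficulties) != sorted(difficulties):
        raise ValueError("difficulties must be in ascending order")
    # Cumulative curves: everyone is at or above the lowest choice,
    # nobody is above the highest one.
    cumulative = (
        [1.0]
        + [boundary_probability(theta, a, b) for b in difficulties]
        + [0.0]
    )
    return [cumulative[k] - cumulative[k + 1] for k in range(len(difficulties) + 1)]


# Illustrative (made-up) item: a = 1.2, b1..b4 = -1.0, 0.5, 1.5, 3.0.
healthy = category_probabilities(0.0, 1.2, [-1.0, 0.5, 1.5, 3.0])
symptomatic = category_probabilities(2.5, 1.2, [-1.0, 0.5, 1.5, 3.0])
```

With these made-up values, a respondent at θ = 2.5 is far more likely to choose "most of the time" or "always" than a respondent at θ = 0, which is the sense in which a high difficulty value makes a response hard for a healthy subject and easy for a symptomatic one.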

Test Information Curve (TIC).
The test information curve (TIC) is a graphical representation of the measurement accuracy of the scale at each value of the individual's latent trait θ and is comparable to the reliability coefficient in CTT. Test information is shown on the vertical axis and θ on the horizontal axis, which represents the range of participants for whom the FVS could be used. The location where θ is 0 represents an average subject, who corresponds to a healthy individual in the FVS.

Relationship of Individual θ and Raw Subscale Score.
Although the FVS is evaluated using the total item score for each subscale, test subjects with the same raw subscale score may differ in their estimated θ. The validity of the raw subscale score for the FVS was confirmed by the correlation between the raw score and the estimated θ. These analyses were conducted separately by gender and age group (adolescents and young adults in their teens and twenties, and adults in their thirties or older), since individual θ and raw subscale score might be affected by those factors.
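The check described above can be illustrated with a small rank-correlation routine. The sketch assumes that Spearman's rank correlation, which the analysis section names, is the coefficient used for this comparison; the text does not say so explicitly. The raw scores and θ values below are made up and the function names are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up raw subscale scores (0 to 16) and estimated theta for eight subjects.
raw_scores = [2, 5, 5, 7, 9, 10, 12, 15]
theta_hat = [-1.3, -0.4, -0.6, 0.1, 0.5, 0.4, 1.2, 2.0]
rho = spearman_rho(raw_scores, theta_hat)
```

Two subjects with the same raw score (5) have different θ in this toy data, which is exactly the situation the correlation is meant to assess.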

Phase 2: Evaluation of FVS and Health-Related QOL Scale.
Using SF-36 as an external criterion, we examined whether the FVS (i) can be used in health-related diagnosis, and (ii) can differentiate patients with and without symptoms.
Statistical analysis for the outputs of θ, item discrimination, item difficulty, and TIC was performed using Kumagai's EasyEstGRM Version 0.3.6. The Mann-Whitney test and Spearman's rank correlation test were conducted using IBM SPSS 18. Generally accepted values [17,20,31] were used as standards to determine whether each IRT item was satisfactory. For other values, the statistical significance level was set at 5%.

Results
Phase 1.
Subjects were excluded for the following reasons: missing data (89 subjects), entry error (2 subjects), and symptomatic patients who answered "always" or "most of the time" to the questions "I am currently seeing a doctor for an illness" and "I am currently taking a medication" (42 subjects). Thus, a total of 594 healthy subjects (76.1%) took part in the study, 430 men (72.4%) and 164 women (27.6%). The characteristics of the subjects are shown in Table 1. The ages (mean ± SD) of men and women were 27.5 ± 8.0 and 26.0 ± 8.5 years, respectively. Regarding educational background, 36.3% of men and 38.5% of women had graduated from a two-year college or a higher academic institution. As for working hours, 77.7% of men and 51.8% of women worked 4 or more days per week.

Item Discrimination.
As shown in Table 2, all item discrimination values reached the lower cut-off level of 0.35. Furthermore, 13 of 20 items exceeded the 1.0 level for item discrimination, and each subscale included at least two of these items.

Item Difficulty.
The average item difficulty varied greatly among items (Table 2). The questionnaire items with the highest average item difficulty value for each subscale were Q4 "I have migraine headaches (headaches)" for liver, Q7 "I have a lot on my mind and am not able to enjoy anything" for heart, Q12 "I don't have much energy in the morning" for spleen, Q16 "I get the hiccups" for lung, and Q19 "My memory has deteriorated" for kidney. When the subject's response was "always" for a questionnaire item that had a high average item difficulty, the symptom of the viscus related to that question was thought to be of greater severity. The item difficulties for Q4 (b4: 5.32) and Q16 (b3: 5.95, b4: 7.30) were particularly high. The SE for b4 of Q16 exceeded the standard value of 0.3.

Test Information Curve (TIC).
Figure 2 shows the quantity of measurement information from the subscales resulting from the application of IRT. The amount of information at adjacent θ values was compared, and the interval between the point where the measurement information increased most steeply and the point where it decreased most steeply in the TIC was defined as the effective ability range of θ. The subscales are expected to be applicable to those who are in that range. The ranges were as follows: liver (−0.50 to 1.60), heart (−1.00 to 1.80), spleen (−1.50 to 2.00), lung (−1.50 to 2.80), and kidney (−1.00 to 2.60). All the effective ability ranges straddled zero and extended in the positive direction. The greatest measurement information was as follows: 10.28 for liver, 9.24 for heart, 6.59 for spleen, 3.18 for lung, and 5.89 for kidney.

Evidence-Based Complementary and Alternative Medicine

Relationship of Individual θ and Raw Subscale Score (Table 3).
The correlation coefficients for θ and raw scores for each subscale ranged between 0.77 and 0.95 in men and between 0.92 and 0.97 in women, indicating strong associations, and the correlation was significantly stronger for women than for men among all subjects. Fewer gender differences between individual θ and raw subscale score were observed in adults than in adolescents and young adults, while women showed higher values in both age groups. As with all subjects combined, strong correlations between individual θ and raw subscale score were observed in men and women in both age groups.

Phase 2.
A total of 302 subjects (94.1%) responded to the questionnaires. Twenty-six subjects were excluded due to missing data, and the remaining 274 subjects (85.4%) were eligible for the analysis. There were 175 male and 99 female subjects, with a mean age of 28.6 ± 7.8 years and 28.5 ± 8.6 years, respectively. Of the 274 subjects, 120 men and 66 women had participated in Phase 1. The raw subscale scores of the FVS from the Phase 1 results were used in Phase 2.

Gender Difference in FVS and SF-36.
As with Phase 1, a comparison was made to determine the presence of any gender difference in the perception of health in SF-36 compared with the FVS. Average values for women were lower than for men for all SF-36 health-related QOL subscales among all subjects (data not shown). A significant gender difference was apparent in the following: RP (47. 30  .56, P = 0.054). As in Phase 1, the average FVS subscale scores for women were higher, indicating a more severe symptomatic state compared with men. There was a significant gender difference for the liver (6.38 ± 3.64 for men versus 7.40 ± 3.44 for women, P = 0.022), and a similar trend was observed for the kidney (6.07 ± 3.28 versus 6.97 ± 3.31, P = 0.051), although this did not reach statistical significance.

To determine whether the FVS can be used as a health-related diagnostic scale, its correlations with SF-36 scores are represented as validity coefficients in Table 4. For both men and women, all FVS subscales were significantly correlated with more than one SF-36 subscale among all subjects. The heart, spleen, and kidney subscales of the FVS had a strong relationship with SF-36 subscales; specifically, heart and MH in men, heart and VT in women, and spleen with MH and VT in women had correlation coefficients exceeding 0.60, indicating strong associations. Similar results were observed when the subjects were divided into the two age groups, except that no significant correlations were observed between lung and SF-36 subscales in adult women.

Lastly, results for the FVS and the SF-36 were compared between the poor health group (subjects who responded "always" or "most of the time" to the questionnaire items "seeing a doctor for an illness" and "taking medication for an illness") and the healthy group (all other subjects), as shown in Table 5. For both men and women, the poor health group scored high on all subscales of the FVS and low on all subscales of the SF-36. In both the FVS and SF-36, there was a clear difference between the poor health and healthy groups for women but not for men. Similar results were observed when comparing adolescents and young adults with adults (data not shown).

Discussion
To the best of our knowledge, this is the first study to apply IRT to the FVS, a TCM diagnostic scale, evaluating the ability of the questionnaire items to identify individual latent traits.
As mentioned previously, by removing the dependence on the population from which a scale was created, IRT allows standardization of questionnaire items and of the individual's traits, which was not possible with CTT. IRT also allows the evaluation of questionnaire items on an individual basis and allows interchangeability. As there is no consistency in TCM diagnostic methods, different acupoints and methods have been used in clinical intervention studies of acupuncture to date. This has complicated the evaluation of effectiveness and hindered the collection of reliable evidence. In general, TCM uses the following four methods to formulate a comprehensive diagnosis for patients: inquiry, inspection, auscultation and olfaction, and palpation [5,22]. Hence, it is difficult to determine whether it is possible to make a diagnosis based merely on the FVS, which is an inquiry method. However, if application of the FVS to acupuncture studies results in enhanced repeatability, it is possible that the FVS could become a standard part of the inquiry process. The FVS can be used in all fields of TCM and CAM where the state of the five viscera is evaluated in the diagnosis. Furthermore, the FVS may prove to be useful in combination with other TCM diagnostic methods [21][22][23][24] that have been under consideration.
Excellent item discrimination (exceeding 1.0) was seen for 13 of 20 items (65.0%) of the FVS. In a previous study that applied IRT to the Beck Depression Inventory (BDI) [17], which is widely used in the diagnosis of depression, 9 out of 21 items (42.9%) demonstrated item discrimination values exceeding 1.0. We can accordingly say that the FVS has relatively high item discrimination. Using these 13 items, we can set the cut-off point for deciding the presence or absence of symptoms. Moreover, for each subscale, when a subject scores highly on an item with the highest average item difficulty, this indicates severe illness of that "viscus." Greater item difficulties and variability exceeding the standard were observed for Q4 of the liver subscale and Q16 of the lung subscale in particular, and respondents with stronger symptoms found it easier to answer these items compared with other items. The fact that questionnaire items on the FVS showed a variety of item difficulty indicates that the instrument can appropriately evaluate subjects who have various latent symptoms. Regarding the TIC (Figure 2), which indicates the measurement accuracy of the FVS, all the effective ability ranges straddled zero and extended widely in the positive direction. Hence, we demonstrated that the FVS is able to measure healthy and symptomatic individuals as well as those in a suboptimal state of health. In other words, the FVS can be used for screening healthy subjects and as a diagnostic tool for those with suboptimal health and symptomatic subjects. Furthermore, we found that the liver and lung have completely opposite properties. The liver yielded more measurement information than the lung; however, its range of application for subjects was limited. The lung, on the other hand, yielded low measurement information but had a wider range of application. For instance, while there were few symptomatic subjects for the liver compared with the lung, a high score for the liver indicates with certainty the presence of a liver symptom. Similarly with the lung, while anybody with a cold becomes symptomatic, this viscus has no uniform symptoms. Hence, the TIC may be demonstrating differences in the properties of the five viscera.

Table 3: Gender differences and the correlation of a raw scale score and the latent trait θ.
Table note: Mann-Whitney U test. Poor: subjects who responded "always" or "most of the time" to the questionnaire items "seeing a doctor for an illness" and "taking medication for an illness"; good: all others, excluding "poor."
If the responses changed significantly in terms of θ, the FVS scores would have to be converted to θ every time the FVS was used, as θ cannot be observed directly in CTT. This was a possible obstacle to clinical application. However, we found a high correlation between θ and raw subscale scores, indicating that raw subscale scores can be used in the FVS as they are, regardless of age and sex.
In order to evaluate the external validity of the FVS, the SF-36 was administered to a second group of subjects, some of whom had also taken part in the first survey. Although there was a gender imbalance among participants, with fewer women recruited, the results of both scales were consistent in showing that women have a lower subjective perception of health than men.
Typically, when the external validity of a scale is examined, it is unnecessary to develop a new scale if the association between all items of the external standard scale and the scale that needs validation is extremely strong. The FVS and SF-36 were significantly correlated for many items, but not all items were completely consistent. From this perspective, there is significance in developing the FVS. In addition, when we separated and compared the results of symptomatic and healthy subjects, more differences were observed among women in both scales, regardless of age. These findings suggest that the FVS can be used for health-related diagnosis, including the assessment of gender differences.
The FVS is a scale that gives objectivity to TCM diagnosis, which has traditionally relied on the TCM practitioner's subjective observations. TCM aims to treat those with "suboptimal health" to prevent illness [3,32]. In other words, prevention is considered the ultimate form of treatment in TCM. "Suboptimal health" in Western medicine is the susceptibility period prior to becoming ill. Health complaints during that period are predominantly subjective symptoms of indefinite complaints encountered in daily life [32]. Most items that compose the FVS are indefinite complaints. Since the chief complaint is the most important sign of illness both in Western medicine and in TCM, the results of our study are pertinent, as we demonstrated that the symptoms related to the five viscera of TCM can be effectively used in health assessments in the Western medical field. As suggested by Schiff et al. [33], cooperation between Western medicine and CAM is important. The FVS can act as a bridge, not only for TCM practitioners, but also between Eastern and Western medicine, in supplying mutually beneficial information to both sides.
One limitation of this study was the large difference in the numbers of male and female subjects. Moreover, differences in the FVS according to age, especially between young and elderly people, were not examined, since the majority of the subjects were under 50. However, item discrimination and item difficulty are not affected by individual traits such as gender or age. A further study evaluating individual θ and raw subscale score, as well as their correlation with SF-36, has been launched among elderly community dwellers aged 50 or more. The reliability and validity of the scale need constant examination in order to evolve and generalize the scale. In this regard, it should be noted that the limited range of application for liver and the low measurement accuracy for lung may be affecting the results of Q4 (liver) and Q16 (lung). The results of this study will be useful if the need to change these items arises in the future.

Conclusion
In this study, we succeeded in applying IRT to the FVS to evaluate latent traits. All 20 items of the FVS had adequate item discrimination, and 13 items had high item discrimination power. Measurement accuracy was suited for application in a range of individuals, from healthy to symptomatic. There was also a strong correlation between the estimated latent traits and raw subscale scores, which demonstrated that the FVS scores could be used clinically without adjustment. When the FVS and a health-related QOL scale (SF-36) were administered to a second group of subjects, some of whom had also taken part in the first survey, we found a significant association between the two scales, and the same evaluation was obtained when symptomatic and asymptomatic subjects were compared. Thus, the FVS may be effective in clinical diagnosis.