Towards a Standard Psychometric Diagnostic Interview for Subjects at Ultra High Risk of Psychosis: CAARMS versus SIPS

Background. Several psychometric instruments are available for the diagnostic interview of subjects at ultra high risk (UHR) of psychosis. Their diagnostic comparability is unknown. Methods. All referrals to the OASIS (London) or CAMEO (Cambridgeshire) UHR services from May 13 to Dec 14 were interviewed for a UHR state using both the CAARMS 12/2006 and the SIPS 5.0. Percent overall agreement, kappa, the McNemar-Bowker χ² test, equipercentile methods, and residual analyses were used to investigate diagnostic outcomes and symptom severity or frequency. A conversion algorithm (CONVERT) was validated in an independent UHR sample from the Seoul Youth Clinic (Seoul). Results. There was overall substantial CAARMS-versus-SIPS agreement in the identification of UHR subjects (n = 212, percent overall agreement = 86%; kappa = 0.781, 95% CI from 0.684 to 0.878; McNemar-Bowker test = 0.069), with the exception of the brief limited intermittent psychotic symptoms (BLIPS) subgroup. Equipercentile-linking tables linked symptom severity and frequency across the CAARMS and SIPS. The conversion algorithm was validated in 93 UHR subjects, showing excellent diagnostic accuracy (CAARMS to SIPS: ROC area 0.929; SIPS to CAARMS: ROC area 0.903). Conclusions. This study provides initial comparability data between CAARMS and SIPS and will inform ongoing multicentre studies and clinical guidelines for the UHR psychometric diagnostic interview.


Introduction
The development of psychometric tools to prospectively identify subjects at ultra high clinical risk (UHR hereafter) of psychosis has allowed preventative screening [1], diagnosis [2], and interventions [3] to be feasible in psychiatry. In 1991, Jackson and McGorry were the first to initiate reliability studies to psychometrically assess first-episode subjects via a semistructured interview in order to ascertain the presence of prodromal symptoms [4]. On the basis of their results, in 1995 Yung and colleagues set up the first clinical service for UHR individuals and conceived the first comprehensive UHR psychometric instrument [5]. The Comprehensive Assessment of At-Risk Mental States (CAARMS hereafter) was developed at the Personal Assessment and Crisis Evaluation (PACE) Clinic in Melbourne [6] and has been widely used in Australia, Asia, and Europe to interview for "At-Risk Mental State, ARMS," criteria. Their pivotal work resulted in the formulation of three UHR criteria: attenuated psychotic symptoms (APS hereafter), brief limited intermittent psychotic symptoms (BLIPS hereafter), and trait vulnerability plus a marked decline in psychosocial functioning (Genetic Risk and Deterioration syndrome: GRD hereafter). A few years later, in 1999, based on these criteria, Miller et al. (1999) [7] developed a similar psychometric instrument for quantitatively rating symptoms in patients at UHR of psychosis [8], in the Prevention through Risk Identification, Management and Education (PRIME) Clinic in New Haven (USA): the Structured Interview for Psychosis-Risk Syndrome (SIPS hereafter) [8] (for a detailed genealogy of the CAARMS and SIPS see [9,10]).
The CAARMS and the SIPS address the same construct and use similar criteria, and they can deliver comparable positive predictive values over follow-up time [11,12]. However, their operationalization differs [10], with substantial changes over different versions of the instruments [10]. Operationalization differences include disparity in psychopathological definitions of the APS, time and frequency criteria, functional decline criterion, BLIPS criteria, assessment of comorbidities, and substance misuse (see Tables 1 and 2). The resulting overall weight of similarities and differences between the two instruments on UHR identification is unknown. Psychometric diagnostic uncertainty calls into question the validity of the UHR diagnostic interview, creating inconsistencies between clinicians or researchers and misunderstandings in patients [13]. Comparability of current clinical, neurobiological, cognitive, and therapeutic UHR research findings may also be questionable and compromised, with the risk of "a profusion of statistically significant, but minimally differentiating" [14] results of limited clinical utility. Psychometric uncertainty may significantly impact the development of future large-scale UHR multicentre studies, by amplifying heterogeneity across individual sites. These concerns and speculations have never been tested empirically. To resolve the "current confusion" [13], research studies allowing "a thorough evaluation of the comparability of samples" [13] have been urgently advocated [10,15]. We present here the first study addressing the psychometric comparability of the CAARMS 12/2006 [16] versus SIPS 5.0 [8]. Our principal aim was to test if the CAARMS 12/2006 and the SIPS 5.0 can equally identify UHR subjects in a large pool of individuals referred to high-risk services for potential UHR symptoms. Our second aim was to qualitatively investigate potential discrepancies and to link the severity and frequency of symptoms with equipercentile-linking tables.
Our third aim was to develop a pragmatic algorithm to convert individual cases across the two instruments, to implement it in an automated conversion package (CONVERT), and to validate it in an independent UHR sample.  [17] and it is specialized in detecting and treating subjects at UHR of psychosis aged 16-35 [17]. CAMEO was started in 2007 and it is an early intervention in psychosis service which offers management for UHR people aged 17-35 in Cambridgeshire, UK, and provides initial assessments to those under 17. Referrals for both services are accepted from multiple sources including general practitioners, other mental health services, school and college counselors, relatives, and self-referrals [18]. The validation sample for CONVERT included all referrals to the Seoul Youth Clinic assessed for a UHR state with the CAARMS 12/2006 and the SIPS 5.0 as part of the standard clinical practice. The Seoul Youth Clinic was started in 2004 and offers assessment and treatment for UHR people aged  in Seoul, South Korea [19]. Subjects are recruited from Seoul National University Hospital and other psychiatric clinics and public mental health centers or they can contact the clinic by telephone or an Internet homepage.

Procedure.
The study samples were designed to reflect at-risk populations as they are encountered in day-to-day practice: all subjects were drawn from the same pool of people referred to high-risk services because of suspected prodromal signs and symptoms of psychosis. Avoiding the use of external and non-help-seeking control groups, who do not reflect the clinical composition of people actually assessed in high-risk services, is essential to properly compare the diagnostic abilities of the two instruments. Furthermore, we only included subjects who were directly assessed with both psychometric instruments during face-to-face interviews, excluding those who declined the full assessment or who were unable to complete it. Responsible clinicians interviewed the participants with the CAARMS 12/2006 [16] and with the SIPS 5.0 [8]. The training procedure and the interrater reliability of the OASIS and CAMEO clinicians have been fully detailed in Supplementary eMethod 1. Subjects accessing the Seoul Youth Clinic are usually assessed with both CAARMS 12/2006 and SIPS 5.0 instruments as part of standard clinical practice, and further details are provided in an independent publication [19].     [20], we further performed a weighted kappa analysis, weighting the three groups according to their relative baseline functional level, as established in our previous meta-analysis (i.e., UHR− = 1, UHR+ = 0.84, and Psychosis = 0, eFigure 1, adapted from [20]). We additionally estimated the prevalence- and bias-adjusted kappa (PABAK) [21], which adjusts the kappa for imbalances caused by differences in prevalence and bias [22].
Interpretation of the kappa values varies, but some guidelines were provided by Landis and Koch (1977), suggesting that kappa values below 0.00 indicate "poor" agreement; kappa values from 0.00 to 0.20 indicate "slight" agreement; kappa values from 0.21 to 0.40 indicate "fair" agreement; kappa values from 0.41 to 0.60 indicate "moderate" agreement; kappa values from 0.61 to 0.80 indicate "substantial" agreement; and kappa values from 0.81 to 1.00 indicate "almost perfect" agreement [23].
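The agreement statistics described above can be illustrated with a minimal Python sketch. It assumes a square cross-tabulation of CAARMS outcomes (rows) by SIPS outcomes (columns). The counts, the function names, and the way the functional-level scores are turned into agreement weights (one minus the scaled absolute difference between scores) are illustrative assumptions, not the study data or necessarily the authors' exact procedure. PABAK is computed with the common k-category form (k x Po - 1)/(k - 1), which reduces to 2Po - 1 for two categories.

```python
def kappa_stats(table, weights=None):
    """Return (observed agreement, expected agreement, kappa) for a k x k table."""
    k = len(table)
    n = sum(sum(row) for row in table)
    if weights is None:
        # Unweighted kappa: only exact matches count as agreement.
        weights = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    po = sum(weights[i][j] * table[i][j]
             for i in range(k) for j in range(k)) / n
    pe = sum(weights[i][j] * row_tot[i] * col_tot[j]
             for i in range(k) for j in range(k)) / n ** 2
    return po, pe, (po - pe) / (1.0 - pe)


def pabak(table):
    """Prevalence- and bias-adjusted kappa: chance agreement fixed at 1/k."""
    k = len(table)
    n = sum(sum(row) for row in table)
    po = sum(table[i][i] for i in range(k)) / n
    return (k * po - 1.0) / (k - 1.0)


def weights_from_scores(scores):
    """Assumed weighting scheme: 1 - |s_i - s_j| / (max - min)."""
    span = max(scores) - min(scores)
    return [[1.0 - abs(a - b) / span for b in scores] for a in scores]


def landis_koch(kappa):
    """Verbal label following the Landis and Koch (1977) bands."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical counts, ordered UHR-, UHR+, Psychosis (NOT the study data).
example = [[40, 5, 0],
           [5, 30, 5],
           [0, 5, 10]]
po, pe, kappa = kappa_stats(example)
_, _, weighted = kappa_stats(example, weights_from_scores([1.0, 0.84, 0.0]))
print(f"agreement={po:.3f} expected={pe:.3f} kappa={kappa:.3f} "
      f"({landis_koch(kappa)}) weighted={weighted:.3f} PABAK={pabak(example):.3f}")
```

The weighted variant gives partial credit to disagreements between adjacent categories, so its value depends on the chosen weighting scheme and should be reported together with that scheme.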
The secondary aim of the study was investigated using post hoc explorative residual analyses comparing different subgroups (i.e., UHR−, GRD, APS, BLIPS, and Psychosis), Bonferroni-corrected for multiple comparisons. Qualitative analyses of discrepancies across the two instruments were also conducted, to better elucidate the impact of each specific cell on the overall results. We further converted the severity and frequency of symptoms by employing a linking method, using equate 2.0-3 [24] under R 3.1.2 software. This method is detailed in eMethod 2.
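As a rough illustration of a Bonferroni-corrected residual analysis, the sketch below computes adjusted standardized residuals for a contingency table and compares them with a two-sided critical value corrected for the number of cells. The choice of adjusted standardized residuals, the function names, and the counts are assumptions made for the example; the source does not specify the exact residual definition used.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_residuals(table):
    """Adjusted standardized residual for every cell of an r x c table.

    Assumes every row and column total is greater than 0 and smaller
    than the grand total, otherwise the variance term is zero.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            variance = expected * (1 - row_tot[i] / n) * (1 - col_tot[j] / n)
            res_row.append((observed - expected) / sqrt(variance))
        residuals.append(res_row)
    return residuals


def bonferroni_critical_z(n_tests, alpha=0.05):
    """Two-sided critical z after dividing alpha by the number of tests."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_tests))


# Hypothetical counts (NOT the study data): rows are instruments,
# columns are UHR-, GRD, APS, BLIPS, Psychosis.
counts = [[60, 5, 100, 25, 22],
          [62, 4, 104, 8, 34]]
critical = bonferroni_critical_z(n_tests=len(counts) * len(counts[0]))
for label, row in zip(["CAARMS", "SIPS"], adjusted_residuals(counts)):
    flags = ["*" if abs(r) > critical else " " for r in row]
    print(label, " ".join(f"{r:+.2f}{f}" for r, f in zip(row, flags)))
```

Cells flagged with an asterisk deviate from the counts expected under independence by more than the corrected threshold.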
The third aim of the study was investigated with qualitative comparisons of a priori operationalization differences between the CAARMS 12/2006 and SIPS 5.0, as detailed in Tables 1 and 2. The equipercentile linking, percent overall agreement, and kappa estimated above in the OASIS and CAMEO services were not used for the development of this pragmatic algorithm. A software engineer (JL) then implemented the conversion algorithm in an automated package. CONVERT is a Python application which implements the conversions of individual outcomes (UHR−, UHR+ [GRD, APS, and BLIPS], and Psychosis) between the CAARMS 12/2006 and SIPS 5.0 as proposed by the conversion algorithm. CONVERT takes as input a *.xls file and produces as output a *.xls file. To freely download the tool and the template *.xls input file and get further details please visit https://bitbucket.org/ioppn/convert. We first piloted CONVERT in the OASIS and CAMEO dataset and then validated it in an independent UHR sample recruited at the Seoul Youth Clinic. Diagnostic accuracy measures of CONVERT were analysed with respect to both CAARMS 12/2006-to-SIPS 5.0 and SIPS 5.0-to-CAARMS 12/2006 conversions and included percent overall agreement, kappa (with its 95% CI), PABAK, sensitivity, specificity, and nonparametric Receiver Operating Characteristic (ROC) analyses. The validation sample from the Seoul Youth Clinic included 93 UHR subjects, with a mean age of 20.24 years (SD = 4.00, range = 15-33), mostly males (29% females), of Asian ethnicity [19].  Table 3. When the analysis was weighted for the relative functional impairment of the three groups the results were very similar: the percent overall agreement was 88.43% (expected agreement 45.44%) and the kappa was 0.788 (z = 16.22, p < 0.001, 95% CI from 0.693 to 0.883). To better elucidate these differences we have conducted a qualitative analysis of psychopathological characteristics of these 14 patients, which is appended in eTable 4.
The equipercentile linking between severity and frequency scores of the two instruments is detailed in Table 4.
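As a rough illustration of how an equipercentile-linking table can be derived, the sketch below maps a score on one scale to the score on the other scale that has the same percentile rank. It is a simplified variant (mid-percentile ranks with linear interpolation) and not the algorithm of the equate package; the function names and the score samples are hypothetical.

```python
from bisect import bisect_left


def percentile_rank(scores, x):
    """Mid-percentile rank of x: percent below plus half the percent at x."""
    below = sum(1 for s in scores if s < x)
    at = sum(1 for s in scores if s == x)
    return 100.0 * (below + 0.5 * at) / len(scores)


def equipercentile_link(x_scores, y_scores, x):
    """Return the Y-scale value whose percentile rank matches that of x."""
    p = percentile_rank(x_scores, x)
    ys = sorted(set(y_scores))
    ranks = [percentile_rank(y_scores, y) for y in ys]  # strictly increasing
    # Clamp to the observed Y range when p falls outside the tabulated ranks.
    if p <= ranks[0]:
        return float(ys[0])
    if p >= ranks[-1]:
        return float(ys[-1])
    i = bisect_left(ranks, p)  # ranks[i - 1] < p <= ranks[i]
    frac = (p - ranks[i - 1]) / (ranks[i] - ranks[i - 1])
    return ys[i - 1] + frac * (ys[i] - ys[i - 1])


# Hypothetical severity scores on two 0-6 scales (NOT the study data).
scale_a = [0, 1, 1, 2, 3, 3, 4, 5, 5, 6]
scale_b = [0, 0, 1, 2, 2, 3, 4, 4, 5, 6]
for score in range(7):
    linked = equipercentile_link(scale_a, scale_b, score)
    print(f"scale A {score} -> scale B {linked:.2f}")
```

Running the loop over every possible score on one instrument yields one column of a linking table; swapping the two samples gives the reverse direction.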

Development and Validation of an Automatized Conversion Algorithm (CONVERT).
The pragmatic algorithm to convert individual cases across the SIPS 5.0 and the CAARMS 12/2006 is depicted in Figure 1.
The steps illustrated in Figure 1 have been fully automatized in the CONVERT tool, which is appended online (https://bitbucket.org/ioppn/convert) with further information for users. The tool was first piloted on the OASIS and CAMEO sample and it was able to correctly convert all cases between the two instruments. External validation was performed in an independent sample assessed for suspicion of UHR symptoms at the Seoul Youth Clinic (see eTable 5). For the SIPS 5.0-to-CAARMS 12/2006 CONVERT

Discussion
This is the first pilot study addressing comparability of the two psychometric instruments most frequently used to interview subjects seeking help from high-risk services for psychosis. Strengths of this study include enrolment of a relatively large sample inclusive of UHR+, UHR−, and psychotic patients, the use of kappa analyses and equipercentile-linking methods, the development of automatized algorithms, and the use of external validation samples of different ethnic background. We found an overall substantial agreement between CAARMS 12/2006 and SIPS 5.0 in the identification of UHR subjects, with the exception of BLIPS. These findings, however, may be influenced by the type of recruitment strategies adopted by the high-risk services. Residual and qualitative analyses and equipercentile-linking tables provided additional comparability data. The automated conversion algorithm (CONVERT) to convert individual cases was validated in an independent sample and showed excellent accuracy.
Our first aim was to test the diagnostic comparability of CAARMS 12/2006 versus SIPS 5.0 in a large sample of more than 200 subjects referred for psychometric diagnostic interview to high-risk services. We found an overall substantial agreement (kappa) between the CAARMS 12/2006 and SIPS 5.0. Such a substantial agreement is not completely surprising. First, the two instruments show similar psychometric parameters, such as excellent reliability properties (overall IRR agreement for the SIPS 0.95 [25], for the CAARMS 0.85 [6]). Second, independent authors not involved with the development of the two instruments argue that the development of the SIPS was influenced by the CAARMS. Specifically, they claimed that the SIPS was developed at the PRIME Clinic "to evaluate the severity of ARMS, as defined by [9]."
Third, in our previous meta-analysis we found that the CAARMS and the SIPS can identify a similar proportion of true positives over time (transition risk by 31 months with the CAARMS = 27.4%, 95% CI from 24.6% to 30.4%; transition risk by 31 months with the SIPS = 28.1%, 95% CI from 25.1% to 31.3%; p = 0.73) [12]. Finally, in a recent meta-analysis we specifically confirmed that, in help-seeking samples, the two instruments share similar excellent prognostic accuracy in ruling out psychosis risk, with no significant differences [26]. On the basis of the above substantial agreement, our findings are the first to address the "Babylonian confusion" of UHR diagnostic interview, making it easier to compare the results of UHR research, which had formerly been difficult and risky [9]. Indeed, the definition of case (i.e., whether or not a subject has the condition of interest) is highly problematic across the entirety of clinical psychiatry, where no objectively assessable measures/markers/tests/exams other than clinical impression usually exist to establish the presence or absence of a given condition. Our results are thus highly relevant to permit overall meaningful comparisons of clinical, neurobiological, neurocognitive, and cost-effectiveness UHR studies worldwide, with potential beneficial impact for ongoing large-scale multicentre UHR projects such as the PRONIA (http://www.pronia.eu/), NAPLS (http://napls.commons.yale.edu/), and PSYSCAN (http://www.psyscan.eu/).
Our secondary aim was to qualitatively investigate potential discrepancies across the two instruments and to provide equipercentile-linking comparisons. The equipercentile linking (Table 4) suggests a close relationship between the symptom severity and frequency scores of the CAARMS 12/2006 and SIPS 5.0, as also confirmed by equipercentile-linking analyses of SOFAS and GAF measures [27]. However, we also found some sources of disagreement, in particular, with respect to the diagnosis of BLIPS subjects. Indeed, 14 out of the 25 BLIPS subjects diagnosed by the CAARMS 12/2006 were diagnosed as already psychotic by the SIPS 5.0 (eTable 4). There are significant differential operationalizations of the BLIPS across the two instruments. On the one hand, the psychosis threshold is higher in the SIPS 5.0 than in the CAARMS 12/2006 (psychotic symptoms may last more than 7 days); on the other hand, it is lower since the symptoms should not have urgency features [28] ("urgency is any positive psychotic symptom that is seriously disorganizing or dangerous no matter what the duration" (SIPS 5.0 manual page 15 [8])). To elucidate this difference we conducted a qualitative analysis of the medical records of these 14 subjects. We confirmed that all of them were presenting with disorganized or dangerous symptoms (see definition in Table 2), which were meeting BLIPS criteria under the CAARMS 12/2006 (eTable 3) while at the same time being regarded as over threshold for psychosis with the SIPS 5.0. Since South London has one of the highest rates of psychosis in the world [17], BLIPS subjects alone represent about 9% of OASIS patients, with an additional 9% meeting conjointly APS or GRD criteria (Figure 6 from [17]). Conversely, operationalization differences in BLIPS duration (i.e., less than 7 days on the CAARMS 12/2006) affected only 4 subjects in our database (eTable 3).
This may be partially due to the fact that our referrers are trained to refer only BLIPS subjects defined by the CAARMS 12/2006 criteria. Similarly, differences on psychosis threshold in the Perceptual Abnormalities subscale (i.e., severity of 6 on the SIPS 5.0 and 5 on the CAARMS 12/2006) affected only 6 subjects in our database (eTable 3).
Conversely, operationalization differences of the APS, which includes four domains (P1-P4) in the CAARMS 12/2006 and five in the SIPS 5.0 (P1-P5), did not play a significant role (eTable 3). It is possible to speculate that the additional SIPS 5.0 domain, Grandiose Ideas, with manic or hypomanic features is not particularly frequent or severe in UHR subjects (4 subjects only had higher scores on SIPS 5.0 P3 as compared to SIPS 5.0 P2). Also, operationalization differences in APS onset criteria did not impact the overall consistency of the diagnostic interview for APS across the two instruments. This is somewhat surprising given that CAARMS 12/2006 permits positive symptoms to qualify even if they are no longer present in the past month, whereas SIPS 5.0 requires them to be present in the past month. Also, the functional decline criterion (i.e., SOFAS drop), which is explicitly used in the CAARMS 12/2006 but not in the SIPS 5.0, did not impact the diagnostic agreement (see limitations below). This finding, however, is also supported by our recent meta-analysis showing that both SIPS-defined and CAARMS-defined UHR samples display consistent baseline functional impairments as compared to matched controls [20]. Similarly, there was no effect for the assessment of differential diagnoses associated with comorbidities [29] between the CAARMS 12/2006 (which does not consider comorbidities) and SIPS 5.0 (in which the UHR+ diagnosis should not be made if the symptoms are better explained by other comorbid disorders; this affected only 5 subjects in our database; see eTable 3). However, the notion "better explained" is poorly coded and may therefore be subject to arbitrary clinical judgment: no studies have ever addressed the impact of this criterion on subjects identified by the SIPS. Because of this, comorbid disorders may be highly prevalent in both the CAARMS 12/2006 and the SIPS 5.0.
Indeed in our meta-analysis specifically investigating comorbid affective disorders in UHR subjects, we found a similar prevalence of anxiety or depressive disorders in SIPS [30,31] and CAARMS [32] studies.
Our third aim was to develop an automated algorithm to convert individual cases and to validate it in an external sample. Therefore, on the basis of the CAARMS 12/2006 and SIPS 5.0 differences in operationalizations detailed in Tables 1 and 2, we were finally able to develop and propose a pragmatic algorithm to convert the main clinical outcomes across the two instruments (see Figure 1). This algorithm has been implemented in the CONVERT tool, which has been made freely available for the use of future researchers and clinicians and externally validated in an independent sample. We found that CONVERT performed in the excellent range of accuracy in both directions: the ROC area was 0.903 for the SIPS 5.0-to-CAARMS 12/2006 conversion and 0.929 for the CAARMS 12/2006-to-SIPS 5.0 conversion. The ROC area serves as a global measure of test performance and values in the range of 0.9-1 are considered excellent, between 0.8 and 0.9 very good, between 0.7 and 0.8 good, between 0.6 and 0.7 sufficient, and between 0.5 and 0.6 bad [33]. Of relevance, CONVERT was developed on a priori differences between CAARMS 12/2006 and SIPS 5.0 operationalizations (Tables 1 and 2), and its performance was tested against an external validation sample that was characterized by a different ethnic background. Because of this, we expect that CONVERT could perform well in other UHR samples and we hope that it will facilitate data merging across UHR sites employing different diagnostic instruments. This may support large-scale multicentre UHR analyses across PRONIA, PSYSCAN, or NAPLS or replication studies to consolidate the current UHR findings.
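The accuracy measures discussed above can be sketched in a few lines of Python. The example assumes a binary reference outcome (1 = UHR+ on the target instrument, 0 = otherwise) and a binary converted outcome; the data and function names are hypothetical and do not reproduce the CONVERT code or the validation sample. The ROC area is computed nonparametrically as the proportion of positive-negative pairs ranked correctly (ties count one half), which for a binary prediction equals the mean of sensitivity and specificity. The label for areas below 0.5 is an added assumption, since the quoted bands stop at 0.5.

```python
def sensitivity_specificity(reference, predicted):
    """Sensitivity and specificity of binary predictions against a reference."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)
    tn = sum(1 for r, p in pairs if r == 0 and p == 0)
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_area(reference, scores):
    """Nonparametric ROC area: share of (positive, negative) pairs ranked correctly."""
    pos = [s for r, s in zip(reference, scores) if r == 1]
    neg = [s for r, s in zip(reference, scores) if r == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def roc_label(area):
    """Verbal label following the bands quoted in the text above."""
    for lower, label in [(0.9, "excellent"), (0.8, "very good"),
                         (0.7, "good"), (0.6, "sufficient"), (0.5, "bad")]:
        if area >= lower:
            return label
    return "below chance"  # assumption: not covered by the quoted bands


# Hypothetical outcomes (NOT the validation sample).
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
converted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(reference, converted)
area = roc_area(reference, converted)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"ROC area={area:.3f} ({roc_label(area)})")
```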
This study had limitations. First, we did not perform a follow-up. Second, the type of recruitment adopted in OASIS and CAMEO services may have impacted the observed substantial agreement between CAARMS 12/2006 and SIPS 5.0. For example, the close link of the OASIS or CAMEO with the local first-episode services may have increased the proportion of disorganizing/dangerous BLIPS or the clinical composition of subjects referred for UHR assessment to our services [34]. Also, our referrers underwent a long-standing training to identify and signpost subjects meeting the functional deterioration criterion according to the CAARMS 12/2006 intake criteria. Therefore, it is possible that UHR patients meeting SIPS 5.0 criteria (e.g., with attenuated psychotic symptoms but without functional deterioration) may have been undetected by referrers and not assessed at all by our teams, inflating the observed agreement. Indeed, when the CAARMS 12/2006 was compared with the SIPS 5.0 in other epidemiological samples of non-help-seeking subjects, significant differences between the two instruments were observed [35] (the use of UHR instruments in non-help-seeking samples is not recommended, however [11]). There is also recent meta-analytical evidence indicating that samples referred to high-risk services are highly heterogeneous and that their actual composition may reflect the type of outreach campaigns adopted [36,37]. Future studies should investigate the likely possibility of a lower agreement between CAARMS 12/2006 and SIPS 5.0 in high-risk services employing SIPS-based outreach campaigns. Third, our procedure involving a single rater scoring both instruments in an uncontrolled order may have significantly inflated agreement across instruments.
However, assessing subjects referred for suspicion of UHR symptoms at the time of the first contacts with high-risk services (who may be already psychotic or eventually deemed not at risk of psychosis) with independent raters poses severe logistic difficulties for the patients. It may also paradoxically create additional biases because the most severe patients may be more likely to decline lengthy assessments. To control for this we performed an independent analysis in a subset of patients (n = 21) assessed with independent raters and we confirmed that the magnitude of agreement remained substantial (see results in eMethod 1). Interrater reliability of the original instruments has been investigated in even smaller samples (n = 14) [25]. Fourth, given the differences in the two instruments' development itself [10], our findings and the CONVERT tool may not be applicable to older versions of the instruments. Fifth, psychometric reconciliation of the CAARMS versus SIPS is not sufficient to mitigate the differences between the various UHR research teams. Differences remain between the characteristics of the basic population, the recruitment of patients, the follow-up, and the specific treatments provided [9].

Conclusions
There is overall substantial diagnostic agreement between the CAARMS 12/2006 and SIPS 5.0 towards identification of UHR subjects. Disagreement was mostly due to differential operationalization of BLIPS. However, type of recruitment strategies may have inflated the observed agreement and future studies should repeat these analyses in high-risk services adopting different outreach campaigns. The conversion algorithm, CONVERT, had excellent performance characteristics even in samples of different ethnic background. The results of the current investigation may be highly relevant to the field, as they may inform future multicentre studies as well as international consensus conferences aiming at standardizing the UHR diagnostic interview.