Supporting Risk-Aware Use of Online Translation Tools in Delivering Mental Healthcare Services among Spanish-Speaking Populations

Neural machine translation technologies are finding increasing application in clinical and healthcare settings. In multicultural countries, automatic translation tools provide critical support to medical and health professionals in exchanging health messages with migrant patients who have limited English proficiency. While research has mainly explored the usability and limitations of state-of-the-art machine translation tools in the detection and diagnosis of physical diseases and conditions, there remains a lack of evidence-based studies on the applicability of machine translation tools to the delivery of mental healthcare services for vulnerable populations. Our study developed Bayesian machine learning algorithms based on relevance vector machines to support frontline health workers and medical professionals in making better informed decisions about the risks and convenience of using online translation tools when delivering mental healthcare services to Spanish-speaking minority populations living in English-speaking countries. Major strengths of the machine learning classifier we developed include its scalability, interpretability, and adaptability to diverse mental healthcare settings. In this paper, we report on the development of the Bayesian machine learning classifier through automatic feature optimisation and on the classifier-enabled assessment of the suitability of original English mental health information for automatic online translation. We also elaborate on the interpretation of the assessment results in clinical settings using statistical tools such as positive and negative likelihood ratios.


Introduction
Despite increasing public awareness of the prevalence of mental health issues among populations from low- and middle-income countries, accurate, scientific, non-discriminatory communication about mental disorders remains a real challenge [1][2][3]. Within different societal and cultural systems, conventionalised linguistic constructs have developed over decades to describe and convey the underlying social attitudes towards, and understanding of, different mental disorders such as the varieties of anxiety or depressive disorders. In English-speaking multicultural countries, communicating and interpreting mental disorders and their treatment for non-English-speaking migrant populations poses important challenges to frontline health workers and clinicians [4][5][6].
The rapid development of machine translation technologies has offered the technical means to interact and engage with multicultural vulnerable communities and people who have limited access to mental healthcare services. This matters because mental health issues are prevalent among such populations, who are at higher risk of developing clinical mental disorders or comorbidities such as chronic non-communicable diseases and physical health conditions that are likely to exacerbate their mental health issues. Currently, there is very limited research examining the reliability, safety, or levels of risk in using state-of-the-art online translation tools such as Google Translate in clinical settings for communicating with patients about mental disorders.
Much current research shows that the use of automatic translation tools in primary healthcare settings is driven by a persistent lack of qualified bilingual health professionals [7][8][9][10].
The risk of the unchecked use of translation technologies, developed largely for general cross-lingual communication, in specialised health and medical settings is real and well documented [11][12][13][14]. However, the practical need for cost-effective translation tools in disease diagnosis and medical treatment is increasing. Although providing proper training to bilingual health professionals can help reduce healthcare inequality, real-life scenarios can be far more complex, uncertain, and dynamic for health systems at every level. For example, a recurring issue in clinical settings is the lack of adequately qualified health translators working with under-resourced languages. Even for well-resourced language pairs such as English-Spanish and English-Chinese, it can be challenging to find bilingual health translators with in-depth knowledge of different medical specialities.
That is, the quality of human translation can also be compromised by the complexity and speciality of medical communication. In fact, with simple, direct sentences, online translation tools can fulfil their function of supporting a meaningful exchange of information between doctors and patients. We argue that health communication technologies like neural translation tools are evolving rapidly and have important potential for scaled uptake in health systems, especially those serving vulnerable or disadvantaged communities in under-resourced local health districts. Health technologies can be leveraged to help close gaps in current medical and healthcare structures and to improve the quality and accessibility of healthcare resources for populations and people at risk. Research is needed to develop instruments and aids that improve the safety and reliability of available translation technologies for adoption in health and clinical settings.

Data Collection.
We collected authoritative health information on anxiety disorders from the websites of federal and state health agencies and not-for-profit health promotion organisations in the U.S., Australia, Canada, and the United Kingdom. The total database contains 557 full-length articles, including 270 (48.47%) original English materials whose automatic translations into Spanish contained misleading errors. We labelled these materials as positive or "risk" samples. The remaining 51.53% of the samples were original English texts whose automatic translation into Spanish using Google Translate did not contain any misleading information. The Spanish translations produced by Google Translate were evaluated by comparing the original English texts with their back-translations from Spanish. This method, known as forward and backward translation, is endorsed by international health organisations such as the World Health Organization [15]. We subsequently labelled such English texts as negative or "safe" cases for automatic translation. We divided the entire database into a training dataset (389 texts, roughly 70%) and a testing dataset (168 texts, roughly 30%). The training dataset contained 187 positive "risk" English texts, which were prone to automatic translation errors, and 202 negative "safe" English texts, which had proven suitable and reliable for automatic translation to Spanish. Similarly, the testing dataset contained 83 positive and 85 negative samples for evaluating the classification performance of the classifiers to be developed. We applied 5-fold cross-validation on the training dataset to reduce bias in the development of the algorithms.
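The stratified 5-fold split described above can be sketched in plain Python. This is a minimal illustration of the idea, not the study's actual pipeline (scikit-learn's `StratifiedKFold` would normally be used); the label counts mirror the 187 "risk" / 202 "safe" training composition reported above.

```python
import random

def stratified_kfold(labels, k=5, seed=42):
    """Split sample indices into k folds, preserving the
    positive/negative ratio of `labels` in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        # Shuffle each class separately, then deal indices round-robin.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

# Training set composition reported above: 187 "risk" + 202 "safe" texts.
labels = [1] * 187 + [0] * 202
folds = stratified_kfold(labels)
```

Each fold then serves once as the validation split while the remaining four train the classifier, so every training text contributes to both fitting and validation.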

Annotation of Feature Sets.
Traditionally, health translation mistakes have been attributed to the linguistic difficulty or poor readability of the original English materials, including complex structural, syntactic, and lexical features. However, with the rapid development of machine translation technologies, a growing body of research shows that semantic issues such as polysemy, i.e., the multiple meanings a word can take across domains, can be more challenging for the latest neural machine translation tools. We therefore included four large sets of features to investigate the possible triggers of significant mistakes in machine-translated health materials from English to Spanish.
We annotated both training and testing datasets with four large sets of linguistic features: structural features (24 in total, annotated using Readability Studio software); lexical dispersion rates based on the British National Corpus (20 features); English lexical semantic features (115 features), annotated using the USAS system developed at Lancaster University, UK [16][17][18]; and lexical sentiment features (92 features), annotated using the Linguistic Inquiry and Word Count (LIWC) software. Appendix A details these four feature sets.

Bayesian Machine Learning Classifiers (MLCs).
The Bayesian MLC is a sparse classifier which can effectively counter model overfitting on relatively small datasets like ours. Bayesian classifiers differ from many other supervised machine learning techniques in that they produce the posterior odds of a class, conditioned on the prior odds of an event and the asymmetrical classification errors of the model, whereas frequentist classifiers typically return only a hard binary decision. In solving practical questions like ours, posterior odds are much more informative than a predicted binary outcome, as Bayesian-style prediction using posterior odds helps practitioners and decision makers appreciate the levels of risk of negative and positive cases over a continuous probability scale, and assists in developing more effective intervention strategies to achieve optimal outcomes. This advantage of Bayesian MLCs suits the purpose of our study, as we aimed to identify original English mental health materials which are more likely to cause significant errors if translated using automatic tools without further human evaluation. This can help health agencies developing bilingual health materials to better invest their resources and minimise the risks of using machine translation in healthcare settings.
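The odds-form of Bayes' rule that underlies this style of prediction (posterior odds = prior odds × likelihood ratio) can be illustrated with a short snippet; the prevalence and likelihood ratio below are hypothetical values chosen for illustration.

```python
def posterior_probability(prior, likelihood_ratio):
    """Convert a prior probability and a likelihood ratio into a
    posterior probability via Bayes' rule in odds form:
    posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior / (1.0 - prior)
    post_odds = prior_odds * likelihood_ratio
    # Convert odds back to a probability on the [0, 1] scale.
    return post_odds / (1.0 + post_odds)

# Hypothetical illustration: with a 48% prevalence of "risk" texts,
# a positive classifier result with LR+ = 2.5 raises the estimated
# risk probability to roughly 70%.
p = posterior_probability(0.48, 2.5)
```

Reporting this continuous posterior probability, rather than a hard "risk"/"safe" label, is what lets practitioners weigh the translation risk of a given text against its acceptable threshold.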

Methods
To identify the best subset of features within each annotation category, as well as the best subset of features across categories, we applied separate and combined feature optimisation. The automatic feature selection technique we used was recursive feature elimination (RFE) with cross-validation in Python scikit-learn, to increase the generalisability and accuracy of the Bayesian machine learning classifiers we developed. To identify and rank highly predictive features, recursive feature elimination used a linear-kernel support vector machine (SVM) as the base estimator. An optimal set of features was identified when the cross-validated classification error reached its minimum. Figure 1 shows the results of the automatic optimisation of the different feature sets: in Figure 1(a), the optimised set of lexical dispersion rates contained 4 features, as the cross-validated classification error dropped from 0.425 with the full feature set (20 features) to 0.393 when the set was reduced to 4. In Figure 1(b), the optimised set of English semantic features contained 10 features, as the cross-validated classification error decreased from 0.40 with the full feature set (115 features) to 0.375 when the set was reduced to 10; further elimination, however, led to a spike in the classification error. In Figure 1(c), the optimised set of English sentiment features (annotated using the LIWC software) contained 10 features, as the minimal classification error of 0.416 was reached when the sentiment features were scaled back from 92 to 10. In Figure 1(d), 5 optimal structural features (out of 24) were identified at the minimal classification error of 0.409. Lastly, in Figure 1(e), we conducted combined feature selection by integrating the four feature sets (251 features in total): lexical dispersion rates and semantic, sentiment, and structural features.
The optimal number of features that emerged from the combined optimisation was 33, associated with a minimal classification error of 0.383.
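The core RFE loop can be sketched as follows. This is a toy illustration only: the stand-in weight function below replaces the SVM coefficient magnitudes the actual pipeline obtains from scikit-learn's `RFECV` with a linear-kernel SVM, and the cross-validated stopping criterion is reduced to a fixed target size.

```python
def recursive_feature_elimination(features, weight_fn, n_keep):
    """Toy sketch of RFE: repeatedly score the surviving features
    with `weight_fn` (standing in for |SVM coefficient| rankings)
    and drop the weakest one until `n_keep` features remain."""
    surviving = list(features)
    while len(surviving) > n_keep:
        weights = {f: weight_fn(f) for f in surviving}
        # Remove the lowest-weighted (least predictive) feature.
        surviving.remove(min(surviving, key=weights.get))
    return surviving

# Hypothetical weight function, for illustration only: 20 candidate
# features reduced to 4, mirroring the dispersion-rate optimisation.
kept = recursive_feature_elimination(range(20),
                                     weight_fn=lambda f: f % 7,
                                     n_keep=4)
```

In the full `RFECV` procedure the stopping point is chosen automatically, at the subset size where the cross-validated classification error bottoms out, which is how the optimal counts of 4, 10, 10, 5, and 33 features reported above were found.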

Results and Discussion
4.1. Results. Following automatic feature optimisation to enhance classification accuracy, we evaluated the performance of the Bayesian models (relevance vector machines (RVMs)) with different feature sets on both the training and testing datasets. As discussed earlier, we applied 5-fold cross-validation on the training dataset to minimise bias in the classifiers being developed. First, we compared the original feature sets with their respective optimised feature sets in Tables 1-5. Next, we compared the performance of RVM classifiers with different pairs of optimised feature sets. Table 6 shows the comparison of RVM classifiers with double, triple, and quadruple optimised feature sets. Like feature optimisation, feature normalisation is another useful automatic technique for enhancing the performance of machine learning classifiers. We applied three popular feature normalisation techniques, min-max, L2-norm (L2), and Z-score normalisation, with each RVM classifier to see whether this could help balance asymmetrical classification errors within each model. Table 1 shows the performance of RVM classifiers with lexical dispersion rates as features. After automatic feature selection, the RVM with the reduced, optimised feature set (D4) reached a performance largely comparable to that of the classifier run on the full feature set: on the training dataset, the mean area under the curve (AUC) of RVM (D4) was 0.617 (SD = 0.06), compared to 0.625 (SD = 0.06) for RVM (full 20 features), suggesting that feature reduction can also help counter overfitting in training machine learning classifiers. On the testing dataset, the AUC of RVM (D4) (0.648) was similar to that of RVM (All 20) (0.649). Sensitivity dropped slightly from 0.578 (RVM-All 20) to 0.566 (RVM-D4), and specificity remained unchanged at 0.753. Normalisation did not improve the RVMs with either the full or the optimised feature set of lexical dispersion rates.
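The three normalisation techniques applied above can be sketched as plain-Python implementations of their standard formulas (a library such as scikit-learn's preprocessing module would normally be used):

```python
import math

def min_max(xs):
    """Scale a feature column linearly onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Centre a feature column to mean 0 and unit standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]

def l2_norm(xs):
    """Scale a vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]
```

Because the four feature sets sit on very different scales (readability indices, dispersion rates, category counts), normalisation is one way to stop large-valued features from dominating the classifier, which is why its effect on the asymmetry between sensitivity and specificity was checked for every model.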
Table 2 compares the performance of RVM classifiers run on English semantic features. After automatic feature selection, the performance of the RVMs improved on both the training and testing datasets: on the training data, the mean AUC improved marginally from 0.652 (RVM with the full 115-feature semantic set, USAS115) to 0.659 (RVM with the optimised 10-feature set, U10), with a slightly reduced standard deviation (SD) from 0.052 to 0.045. On the testing dataset, the AUC improved from 0.692 to 0.714. Specificity improved from 0.729 (USAS115) to 0.777 (U10); sensitivity decreased from 0.590 (USAS115) to 0.578 (U10). Normalisation did not improve model performance. Table 3 compares the performance of RVMs with English sentiment features annotated with the Linguistic Inquiry and Word Count (LIWC) software. After automatic feature optimisation, the performance of the classifier improved on the testing dataset: the AUC increased from 0.580 (LIWC all 92) to 0.605 (L10). Model sensitivity increased from 0.518 to 0.651, but specificity decreased from 0.577 to 0.494. The impact of feature normalisation on RVMs with the full and optimised feature sets was similar: classifier specificity improved and sensitivity decreased, but the overall model accuracy on the testing dataset did not improve significantly. Table 4 compares the performance of RVMs with the structural features we annotated with the Readability Studio software. After automatic feature optimisation, the AUC of the classifier decreased from 0.636 (structural all 24) to 0.621, owing to a drop in model sensitivity from 0.518 to 0.446, although model specificity increased from 0.729 to 0.788. Feature normalisation helped balance the asymmetric classification errors of the RVM with both the full and the optimised feature sets: model specificity decreased and sensitivity increased.
However, the overall model accuracy (AUC) did not improve with the different feature normalisation techniques. Table 5 shows the performance of the RVM with the combined feature sets of lexical dispersion rates and semantic, sentiment, and structural features, 251 features in total. Automatic feature optimisation reduced the original set of 251 features to a parsimonious model containing only 33 features. With fewer weakly predictive and noisy features in the model, the performance of the classifier improved significantly on both the training and testing datasets: on the training data, the model AUC increased from 0.642 (SD = 0.038) to 0.658 (SD = 0.034) with the optimised RVM classifier. On the testing data, the AUC improved from 0.647 to 0.718. With automatic feature optimisation, both sensitivity and specificity improved: sensitivity increased from 0.518 to 0.554 and specificity from 0.706 to 0.741. Importantly, feature normalisation played a critical role in balancing the asymmetrical classification errors of the RVMs with combined feature sets. Specifically, min-max normalisation increased the sensitivity of the optimised classifier to 0.651, the highest so far, while retaining its high specificity of 0.741. This sensitivity and specificity pair was the best combination among the classifiers developed so far. Table 6 compares the performance of classifiers with double and multiple optimised feature sets. We compared in total 10 different pairs of optimised feature sets and conducted feature normalisation with each combination; the resulting RVMs are reported in Table 6. Figure 2 shows a visual comparison of the AUCs of the 5 high-performing classifiers from Table 6, the RVM with the entire combined feature set (251 features) with L2 normalisation, and the best-performing classifier we developed (RVM (All33)) with min-max normalisation.
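The positive and negative likelihood ratios mentioned in the abstract follow directly from a classifier's sensitivity and specificity; computing them for the best sensitivity/specificity pair reported above (0.651, 0.741) shows how the classification results translate into the clinical interpretation tools discussed later.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios of a binary classifier:
    LR+ = sensitivity / (1 - specificity),
    LR- = (1 - sensitivity) / specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg

# Sensitivity/specificity of the best-performing RVM reported above.
lr_pos, lr_neg = likelihood_ratios(0.651, 0.741)
```

An LR+ above 1 means a "risk" prediction raises the odds that a text will mistranslate, while an LR- below 1 means a "safe" prediction lowers them; both feed directly into the odds-form Bayes update used for the posterior risk probabilities.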
Tables 7 and 8 show the paired-sample t-tests assessing the significance of differences in sensitivity and specificity between the various competing high-performance classifiers and the best-performing RVM classifier we developed through the combined automatic optimisation of the four feature sets followed by automatic feature normalisation. To reduce false discovery rates in multiple comparisons, we applied the Benjamini-Hochberg correction procedure [19][20][21]. The results show that the sensitivity of our best-performing RVM classifier was significantly higher than that of most other high-performing models, except for F35 (p = 0.0017); the specificity of our best-performing RVM classifier was statistically higher than that of all other competing models, with p values equal to or smaller than 0.004. Table 9 shows the paired-sample t-tests assessing the significance of differences in AUC between the competing high-performance classifiers and the best-performing RVM classifier on testing data using different training dataset sizes (i.e., 100, 150, 200, 250, 300, and all 389 training samples). We again applied the Benjamini-Hochberg correction to reduce false discovery rates in the multiple comparisons. The results show that, across training dataset sizes, the AUC of our best-performing RVM classifier was significantly higher than that of most other high-performing models, except for F13 (p = 0.0752) and F27 (p = 0.1698). Figure 3 shows a visual comparison of the mean AUCs of these 6 competitive classifiers and the best-performing classifier we developed. As Figure 3 shows, our best-performing RVM classifier achieved the highest mean AUC of all competing models.
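The Benjamini-Hochberg step-up procedure applied above can be sketched as follows; the p values in the example are illustrative, echoing those reported in this section.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a True/False rejection decision for each p value under
    the Benjamini-Hochberg false discovery rate procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # ... and reject the k_max smallest p values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Illustrative p values like those reported in this section.
decisions = benjamini_hochberg([0.0017, 0.004, 0.0752, 0.1698])
```

Unlike a Bonferroni correction, this procedure controls the expected proportion of false discoveries rather than the chance of any single one, which keeps power reasonable when many classifier pairs are compared at once.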

Discussion
A few important findings emerged from our extensive computational analyses, especially the search for the best subset of features for developing Bayesian machine learning classifiers to address our core research question: predicting and assessing the risk levels of original English mental healthcare materials in terms of their suitability for automatic neural machine translation targeting Spanish-speaking patients. Our study shows that separate feature optimisation on the four distinct feature sets did not achieve acceptable pairs of model sensitivity and specificity. Let us take a closer look at the features retained in each optimised feature set. Table 10 summarises the separately and jointly optimised features. First, the optimised feature set of lexical dispersion (D4) contained DiSp8: 0.7-0.8, DiSp9: 0.8-0.9, DiSp10: 0.9-1.0, and DiWr10: 0.9-1.0. Lexical dispersion rate is a measure of how familiar language is to the public. We used the existing lexical dispersion rates of the British National Corpus, which have 10 intervals between 0 and 1 for spoken and written materials, respectively. For both spoken and written materials, the higher lexical dispersion rates retained in the optimised feature set (D4) indicate that automatic machine translation mistakes were strongly associated with lexical items of higher familiarity. We used the Mann-Whitney U test, a non-parametric independent-samples test, to compare samples labelled as "risky" and "safe" original English mental health materials for automatic machine translation.
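The U statistic at the heart of the Mann-Whitney test counts, over all cross-group pairs, how often a value from one group exceeds a value from the other. A minimal sketch, without tie correction or the normal approximation used for p values (in practice a library routine such as `scipy.stats.mannwhitneyu` would be used):

```python
def mann_whitney_u(xs, ys):
    """U statistic for two independent samples: the number of
    (x, y) pairs with x > y, counting ties as half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

Because it compares ranks rather than raw values, the test makes no normality assumption about how a feature (e.g. a lexical dispersion rate) is distributed across the "risky" and "safe" texts.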
The list of optimised linguistic features included in the best-performing classifier also provides important opportunities for health professionals to make well-targeted, cost-effective interventions to English health materials to improve their suitability for automatic translation. For example, health professionals could adjust the distribution of the relevant linguistic features, especially those associated with higher risks of causing automatic translation mistakes, and rerun the automatic assessment of the English input using our machine learning classifier, iteratively, until the predicted risk reaches an acceptable level. Importantly, this process does not require English-speaking medical professionals to have any knowledge of the patients' language (in this case, Spanish).

Conclusions
Our paper developed probabilistic machine learning algorithms to assess and predict the levels of risk in using Google Translate to translate and deliver mental health information to Spanish-speaking populations. Our model can inform clinical decision making around the usability of the online translation tool when translating different original English texts on anxiety disorders into Spanish. This was achieved through the probabilistic prediction of the Bayesian machine learning classifiers: if an input English text is assigned a high probability (over 50%) of causing erroneous and misleading automatic translation output, health professionals should be alert to the risk of using Google Translate; by contrast, if an input English text is assigned a low risk probability (below 50%), health professionals can feel reassured that the English information can be translated safely for its intended users using the online automatic translation tool. The smaller the risk probability of an English text, the safer it is to translate the text automatically online. For original English materials labelled as unsuitable for automatic translation, our machine learning classifier offers the opportunity to adjust, modify, and fine-tune the text to improve its suitability for automatic translation.
This was achieved through the feature optimisation technique developed in our study. An important and useful property of our model is that it does not require English-speaking medical professionals to have any knowledge of the patients' language. The classifier can be applied as a practical decision aid to help increase the efficiency and cost-effectiveness of multicultural health communication, translation, and education.
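The 50% decision threshold described in the conclusions amounts to a one-line rule. The helper below is a hypothetical illustration of how the classifier's posterior risk probability could be surfaced to health workers, not part of the study's software.

```python
def translation_decision(risk_probability, threshold=0.5):
    """Map the classifier's predicted risk probability for an English
    text to the decision rule described above: flag high-risk texts
    for human review; clear low-risk texts for automatic translation."""
    if risk_probability > threshold:
        return "flag for human review before machine translation"
    return "safe for automatic translation"
```

In a deployed aid, the threshold need not stay at 0.5: a clinic with scarce bilingual reviewers might raise it, while a setting where mistranslation is especially harmful might lower it, trading sensitivity against specificity exactly as the likelihood ratio analysis suggests.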