Usability Assessments for Augmented Reality Motor Rehabilitation Solutions: A Systematic Review

This article aims to understand which methods and user assessment approaches are most commonly used in motor rehabilitation studies that use Augmented Reality (AR) applications. This was done through a systematic review of the area. First, the different uses of AR in these treatments, and the importance of improving usability studies to evaluate these interfaces and their interactions with their users, were discussed. Then, the systematic review of the literature was structured according to previous studies and covered the period from 2007 to September 2017, using the main scientific journals of the area. Of the 666 results collected in the initial search, 32 articles were selected within the planned requirements and scope. These publications were classified by relevance, using the QualSyst evaluation tool for health technology research, and the type of evaluation, approach, and methods used were catalogued. It was found that most of the studies prioritise data collection methods such as task execution and performance analysis. However, the QualSyst scores showed that the best-evaluated studies combined quantitative and qualitative analysis, applying methods such as user preference questionnaires, interviews, and observations. The results presented in this review may benefit designers who intend to design and conduct usability assessments for AR within this context.


Introduction
The public health industry is highly relevant in today's social context, and a permanent demand for improvement has driven the development of new assistive technologies. Within this sector, motor rehabilitation has given physiotherapists and researchers opportunities to explore new technological resources to improve the motor recovery of patients who have suffered strokes or traumatic accidents, and of elderly people seeking to better maintain their functional abilities.
Among these technologies, Augmented Reality (AR) has been widely studied in the health industry. By including virtual objects in the real world, AR promotes the immersion of the user while maintaining their sense of presence in the environment. Using virtual elements that simulate movement behaviour, such as collision, reaction, and physical simulation, AR allows credible and immersive interactions to be created for the user [1]. Examples of the positive use of this technology include training for health professionals and students, since it supports the understanding of spatial relationships and concepts and provides substantial, contextual, and situational learning experiences [2]; motivating patients during therapies [3]; and reducing risks during the exercise learning process [4].
AR makes games and applications an attractive alternative for healthcare because it provides a variety of applications with immersion, interaction, and involvement characteristics suited to their purpose [5]. Games have been incorporated into health solutions to facilitate users' engagement, using these interactive technologies to move affected body areas (such as arms or hands) and to encourage the repetitive practice of motor tasks necessary to stimulate the neuroplastic changes responsible for recovery. They also foster the motivational aspects and interactions that allow cognitive engagement and challenge [6]. In physiotherapy, the use of motion tracking devices to create gaming applications has improved patients' commitment, motivating them to perform the movement sequences established by the professional during the treatment sessions [7].
International Journal of Computer Games Technology
Some difficulties arise, however, in the use of game-based therapies. Burke et al. (2009) describe some pertinent user problems: patients may have paralysis in one or more areas of their body, have difficulty communicating, and be unfamiliar with the equipment. It is also not uncommon for them to suffer from depression, so they may find it difficult to focus on a therapy programme and to become involved with the games. To encourage these users, it is suggested that games designed for rehabilitation should promote engagement and reward success. When users deal with failure positively, they are more likely to remain engaged and not to feel that failure in the game results from their reduced physical abilities [6].
Thus, the elements of the interface need to be worked on so that they are clear to understand and provide appropriate feedback to patients. Achieving this requires careful investigation and accurate analysis in usability evaluations. However, there is no consensus in the literature on a methodological approach suited to the diversity of patients' conditions. A previous evaluation of mixed reality [8] analysed a number of problems and found no usability methods specifically designed for these systems. At the time of that survey [8], three categories of methods were identified as sufficiently general in their approach to be appropriate for these systems: questionnaires and interviews, inspection methods, and user testing.
Questionnaires and interviews would be useful resources for gathering subjective data and user preferences and for comparing performance data. Inspection methods, on the other hand, would be more limited, since there were not yet consolidated ergonomics and design guidelines specific to mixed reality systems. User testing was the most used method in this area [9].
This review investigates how user tests of AR rehabilitation applications are being planned and conducted. It aims to understand what the most frequent user profile is, what methods have been used, and how the collected data is analysed. The next section describes the reviews used to support the survey and states the objectives of this research (Section 2). The research methodology follows, explaining the structure for item and filter classification (Section 3). Thereafter, the results are presented (Section 4), including study selection through the PRISMA flow diagram, review analysis, and synthesis. Finally, a brief discussion of the review findings is followed by the paper's conclusion (Sections 5 and 6).

Literature
Swan et al. (2005) [10] surveyed 1104 papers published in the most relevant venues between 1992 and 2004, of which only 266 described AR research. Of these, 38 addressed HCI (human-computer interaction) aspects and only 21 described a formal user-based experiment. At the time, they identified and classified user-based experimentation along three lines: "human perception and cognition," "user task performance," and "generic interaction and multi-user communication (collaboration)."
Years later, Bai and Blackwell (2012) [11] analysed 71 articles from 2001 to 2010 that deal specifically with usability assessments. They grouped the evaluation types into 4 categories of usability research: 3 previously used by Swan et al. (2005) [10], namely "performance," "perception and cognition," and "collaboration," plus "user experience," as did Dünser et al. (2008) [9]. Both studies added the fourth category after detecting that some publications analysed usability with a more subjective approach, in order to discover users' individual feelings and experiences while testing the system or prototype.
Dünser et al. (2008) [9] selected 161 articles, from a base of 6071 publications, related to user evaluations and also classified them by evaluation approach into the following categories: "objective measures," "subjective measures," "qualitative analysis," "usability assessment techniques," and "informal assessments." They found that the proportion of formal user evaluations relative to informal evaluations had increased over the years: between 1995 and 2001 formal evaluations averaged 57%, whereas between 2002 and 2007 this percentage rose to 76%. They concluded that there is a growing need to formalise the evaluation process and conduct properly designed user studies.
Dey et al. (2018) [12] selected 291 works from the 10 years from 2005 to 2014, which were reviewed and classified by area of application. Their main contribution was to show how the scenario changed over the years, identifying areas where there were few studies with users and opportunities for future research. The health area was identified as the most relevant.
These user tests were concentrated in the period between 2002 and 2016, in which the majority of the usability tests (83%) were performed with healthy people and the others (17%) involved patients or a combination of users with differing abilities [13]. Demographic data collected from the articles conducting these tests showed that 62% of them tested young participants, mostly university students, with an average age of 30 years [12]. This could lead to deficiencies in the analysis, including biased sampling, a lack of user experience studies, and the non-inclusion of patients in the tests, and it contradicts studies [14] that recommend applying user-centred design to the development of interactive technologies in health. User-centred design could improve the functionality of systems and their usability for patients, increasing the probability of promoting the desired behaviours and health outcomes [14].
These previous reviews [9][10][11][12][13] provided a starting point and a reference for structuring this study. Through the selection and classification of the articles, the information was analysed in order to gain an overview and summary of the research findings. The main objectives of this study are (1) to investigate the methods and approaches commonly used to evaluate the usability of these applications, (2) to identify the sampling profile that is used for tests in motor rehabilitation with AR, and (3) to verify which methodologies were used for data collection.

Methodology
This systematic review aims to investigate how user testing with AR applications for rehabilitation is being planned and conducted. Through the results and the analysis, it will be possible to reach conclusions on the knowledge collected from these publications [15], which may serve to define the methods of evaluation or the profile of the sampling used in the tests.
To achieve this, papers were searched, selected, evaluated, analysed, and synthesized according to the protocol described below. To organise the data, the PRISMA checklist of recommendations for systematic reviews was used [16], and the publications were counted and catalogued at each phase.

Search Question.
To help define the purpose of the review, the PICO approach was used to structure and define the research question [17]. The question was formulated to fit the four components that make up the acronym PICO: population (P); intervention (I); control (C); outcome (O). The details for each are as follows:
(i) Population: users who needed treatment or follow-up treatment for motor rehabilitation, whose conditions stemmed from accidents, neurological problems, surgical recovery, or motor limitations, as well as profiles related to the development of applications, such as physiotherapists, healthy adults, and developers.
(ii) Intervention: the use of AR to assist the treatments.
(iii) Control: studies that have control groups for tests performed with users, such as performance tests, comparisons between different profiles or groups, a comparison between different applications or previous experiments, etc.
(iv) Outcome: studies whose evaluation results were supported by data for greater reliability, without excluding empirical analysis.
In order to identify trends in this area and compare them with evidence found in previous reviews, the following research question was defined: "What and how methodological approaches have been used in usability assessment research for motor rehabilitation applications, with the use of Augmented Reality?"

3.2. Research Protocol.
This review prioritises peer-reviewed journals in the scientific community that are mentioned in other review articles [3,9,18] or are chosen due to their popularity in AR-related research. Four databases were used: IEEE Xplore, SpringerLink, ScienceDirect, and ACM Digital Library. The keywords included in the searches were written in English and concatenated with "AND": "Augmented Reality" AND "rehabilitation". The search was performed by two evaluators, and the inclusion criteria were separated into two phases, "Screening" and "Eligibility", which are detailed below (Sections 3.2.1 and 3.2.2). Articles on which the two evaluators disagreed were read by a third reviewer, who decided whether they proceeded to complete reading and classification.
The final publications were thoroughly examined to provide a score for the relevance of the methodological processes and for the description of the evidence found. These scores were generated with the QualSyst [19] evaluation criteria, which were developed specifically to evaluate health research (Section 3.2.3). The publications were then catalogued according to the classification (Section 3.2.3).

3.2.1. Screening.
The initial screening phase focused on paper characteristics and was performed by reading the titles and abstracts. The inclusion criteria were to investigate an application in Augmented Reality; to perform user tests about the application's use and benefits; to be written in English, Portuguese, Spanish, or French; and to be published in the last 10 years (2007 to 2017). The exclusion criteria were short papers (articles with 4 pages or less) and systematic reviews of the literature.
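The screening criteria above can be read as a simple predicate over candidate records. The following is a minimal sketch, not the reviewers' actual tooling (they screened manually and tagged articles in Mendeley Desktop). The `Record` fields, the function name `passes_screening`, and the three example records are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


# Hypothetical record structure: field names are illustrative assumptions,
# not a format used by the reviewers.
@dataclass
class Record:
    title: str
    language: str        # language the article is written in
    year: int            # publication year
    pages: int           # article length in pages
    is_review: bool      # True for systematic reviews of the literature
    uses_ar: bool        # investigates an Augmented Reality application
    has_user_test: bool  # reports user tests of the application


ALLOWED_LANGUAGES = {"English", "Portuguese", "Spanish", "French"}


def passes_screening(r: Record) -> bool:
    """Apply the inclusion and exclusion criteria of the screening phase."""
    included = (
        r.uses_ar
        and r.has_user_test
        and r.language in ALLOWED_LANGUAGES
        and 2007 <= r.year <= 2017
    )
    # Exclusion: short papers (4 pages or less) and systematic reviews.
    excluded = r.pages <= 4 or r.is_review
    return included and not excluded


# Invented example records, for illustration only.
candidates = [
    Record("AR mirror therapy trial", "English", 2015, 9, False, True, True),
    Record("AR demo abstract", "English", 2016, 2, False, True, True),
    Record("VR gait survey", "English", 2012, 12, True, False, False),
]
selected = [r.title for r in candidates if passes_screening(r)]
print(selected)
```

Keeping inclusion and exclusion as separate expressions mirrors how the criteria are stated in the text and makes it easy to report why a record was dropped.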
Each evaluator downloaded the selected articles and catalogued them in folders using Mendeley Desktop© [20]. The program allowed the evaluators to work individually, checking for duplicate articles and adding "select" and "out" tags to identify the articles that would be selected or excluded for the next phase.

3.2.2. Eligibility.
This phase identified articles by evaluating their research methodology and conclusions. It used the following criteria: to be experimental research, that is, to involve some type of experiment in which the researcher participates actively in conducting the phenomenon, process, or fact evaluated, acting on the cause, modifying it, and evaluating the changes in the outcome [21]; to be quantitative research, which may simultaneously use a qualitative approach; and to have a control group in the sampling or to use criteria for validation of the data found. Studies were excluded if their experiments consisted exclusively of performance tests or technical evaluations of the system, software, robot, or tracking.

3.2.3. Classification.
To catalogue the publications by relevance, QualSyst [19], an evaluation tool for health technology research, was used. It consists of 14 specific questions that together compose a score reflecting the quality of the evidence found. The QualSyst evaluation criteria cover many aspects, for example, (i) research objective description, (ii) adequate study design, (iii) analytical methods which are well described and justified, (iv) sample size, (v) randomization or use of control groups, and (vi) sufficiently detailed results. To compose the QualSyst score, the reviewers award points to each question according to their own evaluation and understanding: "yes" (2 points), "partially" (1 point), "no" (0 points), or "N/A" when the question does not apply to the study analysed. The score is calculated by summing the points obtained across the relevant items and dividing by the total possible score. The final score is given as a percentage and presented in this study as the average of the two evaluators' scores [19]. In the case of any disputed scores, a third reviewer would be called and a new average calculated from the 3 scores given.
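The scoring rule described above can be sketched as follows. This is an illustrative reading of the description, assuming that items marked "N/A" are removed from both the sum and the maximum possible score. The function names and the two answer lists are hypothetical and do not come from any reviewed article.

```python
from typing import Sequence

# Points per answer, as described in the text.
POINTS = {"yes": 2, "partially": 1, "no": 0}


def qualsyst_score(answers: Sequence[str]) -> float:
    """Return one reviewer's summary score as a percentage.

    Assumption: items answered "N/A" are excluded from both the
    points obtained and the total possible score.
    """
    applicable = [a for a in answers if a != "N/A"]
    if not applicable:
        raise ValueError("at least one item must be applicable")
    obtained = sum(POINTS[a] for a in applicable)
    possible = 2 * len(applicable)
    return 100.0 * obtained / possible


def combined_score(*reviewer_answers: Sequence[str]) -> float:
    """Average the percentage scores of two (or three) reviewers."""
    scores = [qualsyst_score(a) for a in reviewer_answers]
    return sum(scores) / len(scores)


# Invented answers to the 14 items, for illustration only.
reviewer_a = ["yes"] * 10 + ["partially"] * 2 + ["no", "N/A"]
reviewer_b = ["yes"] * 9 + ["partially"] * 3 + ["no", "N/A"]

print(round(qualsyst_score(reviewer_a), 1))
print(round(combined_score(reviewer_a, reviewer_b), 1))
```

Because `combined_score` accepts any number of answer lists, the same function covers the disputed case in which a third reviewer's score is averaged with the first two.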
After evaluating research quality with QualSyst, the next step was to classify the publications into the "Classification Table." The following categories of information were recorded: part of the body or movement worked on in the studies; users' disabilities; the number of individuals who participated and their profile (patients, physiotherapists, developers, and healthy people); and the type of usability evaluation and methods used.
(1) Type of Assessment Classification. User-based experimentation progresses along three lines of enquiry: (1) those that aim to understand how human perception and cognition operate in AR contexts; (2) those that examine the performance of the user's task within specific AR applications; and (3) those that examine the generic user interaction and communication between multiple collaborating users [10].
User experience (UX) was included as a category in other systematic reviews [9,11] because some studies did not necessarily measure the performance of user tasks but instead used other ways of identifying problems with system usability. Two groups of documents addressing UX were identified. In the first, the main focus of the evaluation was on the user experience itself, with the purpose of evaluating users' attitudes and acceptance of the system. In the second, the main focus of the evaluation was on perception, performance, or collaboration, using UX as a supplemental evaluation measure [11]. In this way, the publications found were classified into four categories:
(i) User task performance studies: these studies measure the performance of tasks with users in specific AR applications in order to gain an understanding of how technology can affect the underlying tasks;
(ii) Perception and cognition studies: experiments that study low-level tasks, with the aim of understanding how human perception and cognition operate in AR contexts;
(iii) Collaboration studies: experiments that evaluate the generic interactions between the user and the communication among multiple participating users;
(iv) User experience: studies of the subjective feelings and experiences of users. These evaluations can be formal or informal. Formal evaluations involve controlled experiments with a fixed sample of voluntary users and collect participants' experiences with structured surveys or questionnaires. Informal evaluations involve unstructured interviews or observations with a random sample of potential users or domain experts.
(2) Evaluation Approach Classification. The research methods used to classify the evaluation approach were grouped into 5 categories: objective measures, subjective measures, qualitative analysis, usability assessment techniques, and informal assessments [9], as described below:
(i) Objective measurements: studies that include objective measures, the most common being task completion times, accuracy, and error rates. In general, they employ a statistical analysis of the measured variables, although some include only a descriptive analysis of the results.
(ii) Subjective measurements: publications that study user questionnaires, user classification, or subjective judgments. Regarding the analysis, some of these studies may also employ statistical analysis of the results, whereas others only include a descriptive analysis.
(iii) Qualitative analysis: studies with formal user observations, formal interviews, or classification or analysis of user behaviour (e.g., speech or gesture coding).
(iv) Usability evaluation techniques: publications that employ evaluation techniques that are frequently used in interface usability assessments, such as heuristic evaluation, expert-based assessment, task analysis, and Think Aloud.
(v) Informal evaluations: assessments such as informal observations of users or informal collection of user comments.

Results
This section presents the results of the review: first the paper selection, then the synthesis and analysis organised by category, starting with evaluation type, followed by the methods used in the tests and the profile of the users, and lastly the analysis of the QualSyst scores.

Multilevel Filtering of Studies.
Following the protocol described, 666 articles were found, of which 32 remained for systematic review after selection. The number of articles in each step can be seen in Figure 1, based on the PRISMA protocol [16].
In the "Identification" phase (Figure 1), 666 publications were found with the "Augmented Reality" AND "rehabilitation" keywords, and 84% of them (557 articles) were excluded because they did not meet the requirements of the "Screening" phase. The two reviewers made a total of 150 selections, of which 41 were articles selected by both reviewers and counted only once, resulting in 109 publications.
In the "Eligibility" phase (Figure 1), the two reviewers made 76 selections from the 109 articles according to the selection requirements (Section 3.2.2). Of these, 21 articles were chosen by both reviewers, so 55 distinct articles passed to the reading and classification phase: 21 selected by both evaluators and 34 on which they disagreed. To resolve the conflicts, a third reviewer evaluated the conflicting publications and approved 11 of them. Thus, 32 articles were taken forward for reading and classification (Tables 1 and 2).
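The article counts reported for each phase can be checked with simple arithmetic. The sketch below only re-derives the figures stated in the text; the helper name `unique_count` and the variable names are illustrative assumptions.

```python
def unique_count(total_selections: int, picked_by_both: int) -> int:
    """Two reviewers select independently; an article picked by both
    appears twice among the selections and must be counted only once."""
    return total_selections - picked_by_both


identified = 666

# Screening: 150 selections in total, 41 articles chosen by both reviewers.
after_screening = unique_count(150, 41)
excluded_at_screening = identified - after_screening

# Eligibility: 76 selections in total, 21 articles chosen by both reviewers.
after_eligibility = unique_count(76, 21)
agreed = 21
conflicting = after_eligibility - agreed

# Conflict resolution: the third reviewer approved 11 conflicting articles.
approved_by_third = 11
included = agreed + approved_by_third

print(after_screening, excluded_at_screening)
print(after_eligibility, conflicting, included)
```

Under these assumptions the counts reproduce the 109, 557, 55, 34, and 32 articles mentioned above.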

Synthesis and Analyses
4.2.1. Type of Evaluation.
"Task Performance" was the most common evaluation type, present in 22 studies (51% of the 43 category assignments, since a study can fall into more than one category), followed by 15 studies (35%) that undertook "User Experience" (UX) investigations. Figure 2 shows the frequency with which each type of evaluation appears. Of these 15 articles, only 5 evaluated solely UX and its subjective aspects during the use of the solutions, such as ease of use, utility, and attitude [22]; improvement in the frequency of physical activities [23]; levels of effectiveness, efficiency, and satisfaction [24]; motivation [25]; and preference and immersion between different systems and classical therapy [26]. Another 5 studies [27][28][29][30][31] performed data collection and analysis that covered both types of evaluation.
Lastly, 6 studies (14%) set out to investigate the "Perception and cognition" of users in relation to the technologies studied; 5 of them were also associated with UX.None were found that dealt with "Collaboration" solutions.
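The percentages reported for the evaluation types appear to be shares of category assignments rather than of the 32 articles. This is an inference from the counts (22 + 15 + 6 = 43), not something the text states explicitly. A minimal sketch of that reading:

```python
# Counts per evaluation type, as reported in the text.
counts = {
    "Task performance": 22,
    "User experience": 15,
    "Perception and cognition": 6,
    "Collaboration": 0,
}

# Assumption: the denominator is the number of category assignments (43),
# which exceeds the 32 articles because a study can hold several categories.
total_assignments = sum(counts.values())
shares = {
    name: round(100 * n / total_assignments) for name, n in counts.items()
}

for name, share in shares.items():
    print(f"{name}: {counts[name]} studies, {share}%")
```

With this denominator the rounded shares match the 51%, 35%, and 14% given above.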

4.2.2. Evaluation Approach.
An evaluation can use quantitative or qualitative data and often includes both. Both methods provide important information for evaluation, and both can improve community engagement. These methods are rarely used alone; combined, they generally provide the best overview of the project.
The remaining publications were 3 in "Qualitative analysis," 2 in "Usability assessment techniques," and 4 in "Informal evaluations" that predominantly used unstructured interviews and observations.Two studies of "Usability Evaluation Techniques" [32,33] performed the Think Aloud method during the execution of the experiments.

4.2.3. Evaluation Methods.
The evaluation methods are the resources that researchers use to collect data in their experiments. They can be used alone or in combination to provide better evidence for the analysis. Of the selected papers, 91% combined several methods, the most common combination being "Execution of Tasks" and "Performance Analysis," which appeared in 16 articles (Figure 4(a)).
Only 3 studies used questionnaires alone [25,34,35]. Two of these received low scores in the QualSyst analysis [34,35] because they did not sufficiently describe the sampling and their results were not well defined or robust with respect to the research proposal. Of the 6 articles that made "Observations," half (3) also used "Performance analysis" for data collection [23,33,36], and only 1 did not apply a questionnaire [23].

Participants Profile.
Across the 32 selected articles, there were a total of 806 participants with an average age of 39.89 years, of whom 37% were men, 23% were women, and 40% were of unknown gender (Figure 5). Of the 32 articles, 14 did not report the participants' ages, 2 reported only an age range (20-30 years and 27-35 years) [26,37], and 3 reported the mean ages per group [22,28,29]. Three of the included papers presented study protocols [38][39][40] to be completed during the research.
User profiles comprised healthy users (59%), patients (34%), and studies that investigated both profiles (7%) (Figure 6). Of the 10 studies that reported the mean age, 6 used individuals under 30 years old and the remaining 4 used individuals over the age of 58. Studies which included older users investigated healthy seniors to study the negative effects of aging [33], Parkinson's disease patients [41], and those who had had a stroke in the last 3 months and were in a long-term rehabilitation program [42].
As for who conducted the experiments, 44% of the studies reported that they were conducted by the researchers themselves, 37% by physiotherapists, and 3% by both physiotherapists and researchers, while 16% did not specify (Figure 7). Asked whether the solution allowed remote monitoring or configuration by physiotherapists, 20 studies reported "yes," 8 reported "no," and 4 were undefined (Figure 8).

Disabilities and Limbs Studied.
Among the conditions addressed in the studies, stroke was the most frequent, appearing in 17 publications (48%), which included both stroke sufferers and patients with motor or cognitive limitations. A further 8 studies (23%) investigated users who had suffered injuries due to traumatic accidents, such as brain and spinal cord injuries [43] and hip fractures in the elderly caused by falls [40]. Others involved several disabilities and included unique occurrences, such as muscular dystrophy fluctuations in human locomotion [44], motor coordination [33], mastectomy and stroke [29], Alzheimer's disease [22], prosthesis users [37], Parkinson's disease [41], phantom limb pain [45], and different developmental difficulties (mild intellectual disability; cerebral palsy; moderate multiple disabilities, weak legs, and low vision) [23]. Only one study [31] did not indicate the disability at all (Figure 9).
The upper body was the region most commonly used to control applications, appearing in 14 studies (40%), and included shoulders, arm and forearm, wrists, and elbows. Hands were the second most frequent, accounting for 31% (11 studies); of these, 3 addressed both hands and upper limbs [36,46,47]. The lower limbs accounted for 14% of studies (5): 4 of these were concerned with walking [38][39][40][44], and the other focused on the lateral movement of the legs for patients with Parkinson's disease [41]. Three studies (9%) investigated both upper and lower limbs [22,26,48]. The final 2 were classified as "body" because they involved several physical exercises performed for different purposes [49], as well as an interactive game of body movement to improve the strength of children with disabilities [23] (Figure 10).

... and robustness to measurement are all characteristics that reinforce the evidence of the studies, bringing more credibility to the data collected and consequently improving the QualSyst score [19]. All 32 articles were thoroughly read by two reviewers and received QualSyst scores (Section 3.2.3). Each reviewer assigned scores individually, and the two scores were then averaged. The scores ranged from 97 down to 25, with the top 10 articles scoring between 81 and 97 (Table 1). The disparity between the reviewers' scores was never large enough to justify calling a third reviewer. Of the 10 articles with the highest scores, 6 analysed the collected data using Analysis of Variance (ANOVA) for three or more groups based on a continuous response variable. These analyses covered perception of temperature judgment with variations and different subjects [50]; differences between 6 experimental conditions [37]; comparison of score results [30]; comparisons of results within groups at different stages [29]; performance results and questionnaires within three approaches [31]; and analysis of 36 individuals walking on a treadmill under three different conditions [44].
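To illustrate what a one-way ANOVA computes when three or more groups are compared on a continuous response variable, the sketch below derives the F statistic by hand. It is a generic textbook formulation, not the procedure of any of the cited studies; the function name and the three small samples are invented for illustration. Obtaining a p-value additionally requires the F distribution, for which a library routine such as SciPy's `scipy.stats.f_oneway` could be used instead.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    F is the ratio of the between-group mean square to the
    within-group mean square.
    """
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")

    grand_mean = sum(x for g in groups for x in g) / n
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented measurements under three hypothetical conditions.
condition_a = [1.0, 2.0, 3.0]
condition_b = [2.0, 3.0, 4.0]
condition_c = [5.0, 6.0, 7.0]

f_stat, df_b, df_w = one_way_anova_f([condition_a, condition_b, condition_c])
print(f_stat, df_b, df_w)
```

A large F indicates that the differences between group means are large relative to the spread within groups. The sketch assumes the within-group variation is non-zero; otherwise the division fails.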
Another 2 studies used averages and specified their standard deviations. One of them [24] aimed to verify the linear and quadratic effects of amplification and its moderation by comparing the two parts (A vs. B) of the experiment. It also used a Participant Demographic Research Questionnaire and a Post-Study Questionnaire that was divided into 2 sections, with items selected and adapted from the IBM Usability Satisfaction Questionnaire. The other [51] used the mean and standard deviation of task completion time for six games played by patients, in addition to analysing responses to questions related to game usability.
Finally, 2 articles did not make use of statistical analysis [25,42]. Hoermann et al. (2015) [28] structured their experiment on a protocol validated for mirror therapy (Berlin Mirror Therapy Protocol) and used three types of questionnaires as a method of data collection, 2 for therapists and 1 for patients. One of these was based on the meCUE questionnaire, which aims to measure user experience and is divided into four research modules: product perception (which was replaced by "perception of therapy system"), user emotion, consequences of use, and overall assessment.
The grades were assigned from the averages of the scale ratings given for each evaluated item. Correa et al. (2013) [44] used a blind evaluator who applied a semistructured questionnaire, developed by the authors themselves, to each group. The alternatives sought to measure the degree of satisfaction, the motivation, and the perception of the effectiveness of the exercise. The response percentages served to compare the results of the therapies with and without the technology.

Discussion
This review focused on applications with AR for motor rehabilitation and the techniques used to evaluate the usability and found a predominance of structured tests for task execution and performance analysis, which is very common in AR usability studies [9,11,12].In general, studies that planned performance analysis tests were satisfied with the data they acquired to answer their research questions.However, the results of this review still demonstrate the use of few observational and empirical methods to validate or compare with analytical data generated by the tools.These results are reinforced by the QualSyst evaluation, which demonstrates that the best articles (Table 1) performed a combination of methods and provided evidence based on quantitative data collection and subjective measures, like questionnaires.The empirical analysis, such as observation and interviews, is still a little-used practice in the area (Figure 4(a)) and should be further explored in an attempt to understand users' behavior and expectations more deeply.
Studies that evaluated user's preferences and perceptions or UX predominantly used structured and semistructured questionnaires, with a smaller proportion of them also using observations and interviews.Performance surveys, for the most part, did not evaluate subjective aspects of users' interactions, such as the experience of use, satisfaction, and motivation when using the tool.When some subjective analysis was performed, it was done through a questionnaire, squandering the opportunity to conduct in-depth interviews with the user that would bring more intrinsic information about the opinions, values, emotions, and perception of the user experience [52].
The results for "Evaluation methods" showed an absence of heuristic evaluation with specialists (designers, developers, and physiotherapists) (Section 4.2.3), which indicates a lack of research with expert users based on AR-adapted guidelines. This would be a good research opportunity in the area of application requirements and is particularly important because such evaluations focus on enhancing the experience and detecting critical errors before testing with ordinary users [53].
Furthermore, it was a recurrent practice in the selected articles (as many as 59% of the studies) to perform research with users who did not have the disability targeted by the applications. The mean age of non-young adults (39.89 years) reflects the fact that studies with larger samples and younger participants lowered the overall mean age, since the studies were generally conducted with healthy students within the university environment. This may be a problem because the use of nonrepresentative users in the development of assistive and rehabilitative technologies, without considering their clinical needs, may result in products with low applicability [54].
The results demonstrate a need to invest in studies that test applications with the final users for whom they are intended. Some authors [54] recommend that users be involved in the development process of these applications, including prototyping and validation, to increase and direct the applicability of the solution.
The studies with patients had fewer users and a more advanced age, because the sample contained people with the characteristic disabilities that caused the need for motor rehabilitation: stroke (47%) and traumatic accidents (23%). Three of these studies pointed out some difficulties in working with this type of participant. The first indicated the difficulty older people have in using new technologies that they are not used to. In addition to the small sample, the participants' characteristics were intrinsically linked to the specialised neurorehabilitation service where the study was performed, which may have restricted the generalisation of the results [30]. The second, which studied individuals with Parkinson's disease, also evaluated an intervention with a small group and did not include a control group. Its authors emphasised the need for a follow-up evaluation to be able to consider possible residual effects and recommended additional research with more subjects, in order to generalise the results to a population [55]. The third pointed out the difficulty in recruiting participants, either because the intervention was relatively new for the elderly or because the participants had limited time for testing owing to their personal routines. Its authors recommended recruiting and recording larger samples to provide stronger evidence [33].
A usability study for games with gestural interaction [56] also found no evident standardisation in usability assessments regarding who should be evaluated and how the evaluation should be performed. Its authors assume that this is because each application intends to investigate different variables and consequently has different objectives. They recommend that standardisation of usability assessments be implemented; this would be important to guide validation in future studies and to check whether particular hardware or software is appropriate for its intended use [56].
The QualSyst evaluation criteria contributed to classifying the best-structured studies with the most solid and consistent results [19]. The publications with the highest grades used a combination of data collection methods [24,25,29,50,51] and sought to quantify these results by validating them with statistical analysis. For this, they experimented with different configurations, by grouping user profiles [29], testing different versions of the system [24,50], or comparing conditions with and without the use of technology [25], in order to obtain data to make comparisons or detect patterns of change.
Although the technical and practical success of rehabilitation applications is fundamental to improving them, research that captures users' subjective judgments is essential for improving usability. Of the 10 studies that received QualSyst's best scores, 7 used questionnaires (validated or author-developed) to verify aspects related to user experience, such as motivation [29,51], satisfaction [24,25], perception [42,50], and ease of use [31]. This analysis contributed parameters beyond those provided by performance assessments and helped to justify or explain events related to changes in user behaviour patterns.
This review may help developers and researchers to gain a current overview of the scientific community and the types of assessments that are being conducted when using AR technology for motor rehabilitation. This overview can help to guide the formulation of research protocols in this context and can also serve as a methodological reference for future reviews.

Conclusions
This study sought to provide an overview of the studies that had performed tests with users using AR games and applications for motor rehabilitation purposes. It was verified that the great majority still prioritised data collection methods based on task execution and performance analysis, which certainly generate solid quantitative data. However, it was concluded through an external scoring system (QualSyst) that the studies which generated the best evidence chose to use a combination of quantitative and qualitative analysis, through the application of methods such as user preference questionnaires, interviews, and observations. Through user-based studies, it is possible to understand the evolution of interactions with projected space and gain insight into new approaches to solving user interface design problems identified as part of the design/evaluation process [57]. This appears to follow the trend found in another study [9]: there seems to be a growing understanding of the need to formalise the evaluation process and conduct properly designed user studies.
The difficulties of performing tests with users who have suffered a stroke or any other type of injury that limits their movements are reflected directly in the sampling found in this selection. The predominance of healthy users, students, and young people can lead to results that do not reflect the real needs of these users. It is suggested that researchers include in their protocols at least one group of users that represents this profile.
It was important for this review that the publications were scored with QualSyst by two reviewers. Using this system ensured that the best-evaluated publications were the ones that were more methodologically structured and had strong evidence in their experiments. For future work, it is recommended to use this evaluation system not only for health studies but also for other areas such as computer science and design, with minor adjustments to adapt it to the context of this research.
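As an illustration of how a two-reviewer QualSyst score could be computed, the sketch below assumes the usual QualSyst checklist convention, which is not detailed in this review: each item is rated 2 (yes), 1 (partial), or 0 (no), items marked not applicable are excluded, and the summary score is the total obtained divided by the maximum possible over the applicable items. Averaging the two reviewers' summary scores is also an assumption, and the function names and ratings are hypothetical.

```python
from statistics import mean

# Assumed rating convention: 2 = yes, 1 = partial, 0 = no, None = not applicable.


def summary_score(ratings):
    """Return total points divided by the maximum possible over applicable items."""
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("at least one checklist item must be applicable")
    return sum(applicable) / (2 * len(applicable))


def combined_score(reviewer_a, reviewer_b):
    """Average the two reviewers' summary scores (an assumed way to combine them)."""
    return mean([summary_score(reviewer_a), summary_score(reviewer_b)])


# Hypothetical ratings for one article on a 14-item checklist.
reviewer_a = [2, 2, 1, None, None, 2, 1, 2, 2, 1, 2, 0, 2, 2]
reviewer_b = [2, 1, 1, None, None, 2, 2, 2, 2, 1, 2, 1, 2, 2]

print(round(combined_score(reviewer_a, reviewer_b), 2))
```

Excluding non-applicable items from the denominator keeps scores comparable between studies to which different checklist items apply.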
In further reviews of this area, it is recommended to include more specific terms that focus on usage analysis, such as "usability" or "user testing." It might also be useful to extend the searches to health databases, more specifically those where studies in motor physiotherapy are published. In this way, it would be possible to cover areas that may have a direct correlation with the searched theme.

Figure 1: Flow diagram of the study selection.

Figure 2: Type of evaluation in articles numbers.

Figure 4: Method used in articles numbers (a) and percentage (b).

Figure 8: The system has the possibility of remote monitoring.

Table 1: Selected articles, ordered by QualSyst score, from the first to the 16th.

Table 2: Selected articles, ordered by QualSyst score, from the 17th to the 32nd.