A Framework of the Training Module for Untrained Observers in Usability Evaluation Motivated by COVID-19: Enhancing the Validity of Usability Evaluation for Children’s Educational Games

The usability evaluation of educational games is an important task, especially for children. By applying Jakob Nielsen’s ten heuristics, most HCI designs can be evaluated, but when educational games are involved, where the user being observed is a child between the ages of six and eight, many questions arise. Is the observer trained well enough to observe the child’s reactions to the game with regard to its memorability, learnability, ease of use, and enjoyment? Will it be necessary for the observer to have a training session exploring the game before evaluating a child? Our research suggests that a training module designed to train an untrained facilitator (observer) in how to evaluate four usability dimensions (learnability, memorability, ease of use, and enjoyment) would be very useful. The usability evaluation data were collected by two groups of observers, one trained and one untrained, who observed users playing generic educational games, and the two groups' data were compared using the Mann–Whitney U test. A distinct difference was found between the results of evaluations in the two groups, thus validating the importance of training for an observer.


Introduction
The slogan "user friendly" became popular during the 1980s, but since the 1990s, the focus of usability engineering has relied heavily on the elaboration of usability evaluation methods. The usability engineering books by [1,2] set the basis for encompassing the concept of human-computer interaction (HCI). In the first decade of the twenty-first century, developments in usability analysis had software-flavored tactics such as user interface implementation through software tools, standards, and "look and feel" aspects. This move enhanced the awareness of the need to work on evaluating usability through the user interface as a medium. Nielsen defined usability through five elements (i.e., ease of learning, efficiency of interaction, ease of remembering, frequency and seriousness of errors, and user satisfaction). According to [3], usability is the degree of efficiency, effectiveness, and user satisfaction obtained from a product used in a particular environment. Similarly, there is a definition of usability from the International Organization for Standardization (ISO), which is "the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use" [4].
Traditionally, there are certain usability evaluation methods such as behavioral analysis, heuristic evaluation, cognitive walkthrough, interviews, and questionnaires. Besides the traditional methods, a variety of other interface evaluation methods are in use nowadays. However, few of them have considered the importance of the role and relevant education of the evaluator. Reference [5] worked on the evaluator effect and revealed that there was considerable variation in the results reported by different evaluators using the same application in similar conditions. This casts doubt on the effectiveness of these methods. The heuristic evaluation method, conducted by experts, is one of the preferred methods for assessing the usability of games. Most of the modular heuristic evaluations are conducted using Nielsen's proposals that focus on software [6,7]. Recent research by [8], in which a systematic review on heuristics was conducted, has reported that the heuristics and usability of educational games still offer potential for exploration in specific evaluation and validation of the process.
Reference [9] presents a set of heuristics resembling traditional heuristics while emphasizing the context of their use. Many of the recent studies have tried to provide platforms to enhance the outcomes of usability evaluations, but none of the studies so far has attempted to explore the competence level of evaluators in a specific cultural context. Reference [10] reveals that the inappropriateness of feedback on errors and the inadequate interaction with educational games are still little explored. Therefore, the evaluation of game-based learning software needs to be accurate in observation, which makes the role of observers a vital one.
Regarding the usability evaluation method, Cognitive Usability Evaluation (CUE-E) is a new dimension in addition to traditional heuristics [11].
There is continuous optimism about increased access to network-based technologies for encouraging young learners to pursue their interest in technology-based learning [12,13]. The development and improvement of game design are much needed in order to create a learner-centric approach to development which facilitates learning. Feedback on learning through observations provides the input for improvement. In this scenario, the role of the observers is very important in the context of their competencies of observation and their ability to provide reliable and valid input. A framework of heuristics reflecting the specific learning role support for educators is proposed, which is very important for them [14]. Many of the previous studies on the role of the observer have focused mainly on the perspective of learning [15][16][17]. In studying the role of the observers, the active observation is scripted and promoted as potentially serving the learning experience by providing stimulation. To offer learning experiences, observers dwell in learning intentions by maintaining distance and detachment as these depict their part in evaluations [18][19][20][21].
Keeping the evidence from literature in view, the current study infers that the role of the observers under the purview of their competence is still unexplored. Cultural context and symbolism in educational games have a great potential for catching the attention of users. While the effectiveness of an evaluation method may not result in similar outcomes in varying cultural contexts and learning perceptions, the competence of the observers in evaluating the usability is very important in order to ensure the reliability of the input in usability evaluations. Therefore, the current study aims to propose a training module conducted remotely for the observers. The research aims to improve the role of the observers based on learning from the divergence in observations of the learners in a real-life situation. The research oversees the observer's role in user system interaction.

Literature Review
2.1. Usability Testing Methods. Moderated testing is done using phone, video, or interviews with the users in any HCI design. Lab and guerrilla testing methods are also concerned with moderated testing. In this usability testing platform, phone, interviews, and video testing can all be conducted remotely. Lab and guerrilla testing must be carried out in person. Moderated testing conducted remotely has a high success rate in collaborative usability testing of virtual reality systems [22,23].

Lab Usability Testing: In Person vs. Remote.
A user's ability to complete the task or set of tasks within the time frame can be tested by using lab usability testing. This testing can be in person or remote. An empirical comparison between the lab and the remote usability testing of websites was conducted, where 8 participants were tested through the lab and 38 people were tested remotely [24]. The average subjective ratings of usability given by the users are shown in Figure 1.

Affordance Testing.
"Affordance" refers to the features of an object, software program, website, or any other application that provides a default clue about how to use it. Bower conducted an affordance analysis on an online educational site that offers additional affordance testing including "linkability," "highlight-ability," and "permission-ability." The study concluded that interactions between affordance and operations performed make a significant impression on learning experiences. The study further proposed numerous levels of awareness for the online educational program [25,26].

Learnability Testing.
The feature of "learnability" helps the users to familiarize themselves quickly with the tasks allowed by the provided interface. Evaluation of learnability for learning management systems is of great importance. A study was conducted at King Abdul Aziz University, Saudi Arabia, in order to evaluate the usability and learnability of its LMS "Blackboard." The investigation concluded that the LMS is reliable and is well designed but still lacks the ability to guide distance education learners. It also found that Blackboard violates some of the basic usability guidelines [27].

Ease of Use Testing.
A study was conducted for usability testing of a school website developed to provide relevant information to parents and visitors at Kennesaw State University. A qualitative approach was adopted to conduct testing using observation and think-aloud methods.
There were different tasks given to be performed by users with a rating system of zero to ten. In some of the more challenging tasks, it was found that the website was not easy to use at a certain level. It was concluded that Dunwoody School's website contains beneficial resources for school community members but that testing did not result in enhancing its "ease of use" factor [28].

Mobile Usability Testing.
Millions of apps are in use but, at the same time, the number of applications fulfilling the HCI and usability standards is very small. In fact, it is important to know what usability evaluation methods have been applied to different mobile applications and where we stand.

Mobile Usability Evaluation Models.
The first model was introduced in 2002 for the evaluation of a set of usability dimensions. These measures included navigation, input rate and menu visualization, presentation, error prevention, navigation, and contents and architecture of information design, but this study was not able to specify how each dimension is explicitly connected to each usability dimension [29]. In 2006, a study purely related to highlighting the challenges faced in evaluating the usability dimensions was conducted. This discussion was based on one hundred and eighty publications published in core human-computer interaction journals [30].
In [31], a model for usability measurements of mobile applications has been introduced. This model was designed after a review of hundreds of empirical papers. This model proposed guidelines for researchers to adopt the way of usability evaluation of mobile applications in general, but this research still lacks guidance on which usability dimension should be chosen for specific types of mobile application.
The mGQM model was built in 2011 (and revised in 2017) based on ISO 9241-11 usability measurements [32]. This model is comprehensive enough to measure effectiveness, efficiency, and satisfaction but also lacks guidance on setting the dimensions of specific mobile applications [33]. The PACMAD model opens the discussions, where the authors argue that mobile applications require a specific model and there is an extension required in existing models such as Nielsen or ISO models to measure usability dimensions [34].

Usability Evaluation for Children's Mobile Learning Applications.
It is important to understand the user preferences when designing the behavior of any system, whether it is online or mobile-based [35]. A study conducted in 2016 was based on evaluating the quality attributes of mobile learning applications for children. It reviewed the top four usability quality attributes which are efficiency, effectiveness, learnability, and user satisfaction. The purpose of this research is to explore the current literature on the subject matter as well as creating a simulation for further studies that may improve the usability and design for mobile learning apps for children [36]. The literature explains succinctly how usability and HCI are taught. In [38], the authors explained how the use of a case study performed by the students relates to the life cycle of usability. Furthermore, [39] presents an assessment developed by the students to explain the application of heuristic evaluation. Reference [40] finalizes a method that enables students to apply certain techniques in addressing usability conducted through the testing and analysis of results. In [41], the authors have conducted a usability study involving students who used a set of web pages to answer questions about the usability of these pages. An investigation of the usability of "Blackboard" at King Abdul Aziz University, Saudi Arabia, concludes that the LMS is reliable and is well designed but still lacks the ability to guide distance education learners and that it also violates some basic usability guidelines [27]. In [36], the authors carried out a study reviewing the top four usability quality attributes which are efficiency, effectiveness, learnability, and user satisfaction. The study recommends conducting further studies to improve the usability and design of mobile learning apps for children.
The literature review provides systematic information about the usability evaluations of four dimensions which are learnability, memorability, ease of use, and enjoyment for educational online systems as well as mobile applications. It also refers to the fact that usability evaluations are normally conducted by researchers who are directly involved in data gathering and validates the fact that in-person evaluation results are more efficient than remotely done evaluations.

Training Module Framework
This study proposes a training module framework to train a novice observer to conduct usability evaluations of mobile app-based educational games for the age group of 6-8 years.
This section elaborates on the phases of the suggested training module and describes a pilot study conducted to validate the module.

Figure 1: Average subjective ratings given on nine rating scales for both lab and remote tests. Source: [24]. Scale: −3 to +3 where higher ratings are better. Averages: lab = 1.6, remote = 0.7, r = 0.49.

Training Framework Proposition.
Our training framework specifies a fast-track, comprehensive, descriptive training design to train a novice observer who has no background in information technology or HCI and usability. The training is specific to children's educational mobile app games for the 6-8-year age group and is intended to assess learnability, memorability, enjoyment, and ease of use. The proposed framework is described in Figure 2.

3.2. Phase 1: Training Need Assessment. Previous research in which children were observed in order to evaluate the learnability, memorability, ease of use, and enjoyment of an educational game did not anticipate circumstances in which researchers would be unable to conduct direct observation themselves, which is currently the situation because of COVID-19. Section 2.1.1 of the literature review also supports the view that in-person evaluation is better than remote observation, specifically when children are involved. In this situation of COVID-19, where classes and most of the educational procedures have gone online, it is felt to be essential to design a training module to train any untrained observer, who might be one of the parents of the children, their guardian, or a person who can directly observe a child, in order to evaluate the mentioned dimensions to test the usability of educational games.

Environment Analysis.
In the current circumstances of COVID-19, it would seem difficult for a researcher to conduct usability evaluations of any educational game directly, especially when children are to be observed. One of the main barriers is the need to follow a comprehensive ethical procedure to reach someone's child, which may include a negative COVID-19 test certificate obtained within an appropriate number of hours beforehand [42] and the confirmation of not infecting the child under any circumstances. Even then, parents are extremely protective towards their children.

Selection of the Observer.
The training reflects absolutely the knowledge and understanding of usability testing of the said dimensions. The potential observer might be a novice computer user and may not have any idea about how to use the IT tools, and even with an efficient computer expert there is no guarantee that they will be a good usability tester.

Game Selection.
With insufficient availability of educational games in other languages, a highly ranked generic English educational mobile application for children in the 6-8-year age group is appropriate. It should also be ensured that the subjects of the selected game are generic, such as English, Maths, or Science, which are common across the globe.

Trainer's Selection.
A usability testing certification from ISTQB (International Software Testing Qualifications Board) or a B.S. in HCI and usability engineering with sufficient industrial experience of mobile app testing and evaluation procedures is considered to be an appropriate qualification for a trainer delivering the proposed training module.

Training Material.
Training material is generic, where the trainer is required to cover the following learning outcomes:
(1) To provide an overview of usability evaluation for learnability, memorability, ease of use, and level of enjoyment [43]
(2) To evaluate the usability of an educational game for learnability, memorability, ease of use, and level of enjoyment [44]
It is the choice of the trainer to develop the training material to cover the above learning outcomes.

Training Mode.
In the circumstances of COVID-19 referred to above, the best way to deliver the training is by adopting an online training strategy. Any of the recommended online teaching tools can be utilized for this purpose.
The tools adopted must be made available to the trainees, and it should be ensured that they either know how to use them or are provided with an understanding of their usage.

Tools.
The trainer is free to select the tools that can be utilized for all training purposes. Trainers can prepare handouts and use any multimedia tools that facilitate the training sessions.

Training Sessions.
Training sessions can be designed as per the trainer's delivery plan, which should be sufficient to train the trainees. However, at least four training sessions are recommended, with each of the dimensions covered within an hour [45]. The learning outcomes should be covered to a satisfactory extent.

Training Activities and Feedback.
The arrangements of the training session and the expected procedures and outcomes are explained in Table 1.

A Pilot Run of a Training Session.
A group of 20 observers, including parents and teachers, was selected for this research. The selection was made randomly by inviting school teachers and parents from Muscat, a city in the Sultanate of Oman, to the training session. It was ensured that none of the selected trainees were aware of usability evaluation procedures and that they understood English very well. For research purposes, "Math Games" by GunjanApps Studios was selected because of the number of downloads it had received and the star ranking given to the mobile app-based game. The selected game is the highest ranking, having received 4.4 stars and been downloaded 153,000 times. The selected users were children in the 6-8 age group studying in an English medium school in Oman. For this research, an academician with 5 years' experience of teaching HCI and usability was selected as the trainer.
Online training was conducted where the introduction, knowledge, and understanding of learnability, memorability, enjoyment, and ease of use were delivered for 1 hour each, with a break of 15 minutes between each topic. The trainer utilized MS PowerPoint to prepare the presentation as training material, and the observers were requested by the trainer to join online through MS Teams at a specified time.
The trainer developed the training handouts by utilizing the book "Student Usability in Educational Software and Games" [46] in order to cover the dimension of "learnability." To cover the "memorability" and "ease of use" dimensions, he designed the handouts from the book "Usability and User Experience Studies" [47]. In order to design the handouts for the "enjoyment" dimension, "The Mobile Learning Voyage - From Small Ripples to Massive Open Waters" [48] was reviewed by the trainer. A live question-and-answer session was delivered after the training to gather feedback from the trainees and make sure they had gained a thorough understanding of the subject.

Methodology
The current research used a descriptive design based on the experiment. The purpose of the experimental design is to find out how evaluators allocated to different groups provide data based on their observations. To describe a phenomenon systematically, the researchers used field data to investigate and compare the behavior of data related to different variables. Hence, analytical research established the existence of observed facts in order to describe the phenomenon, which was not otherwise possible without applying this method [49]. A total of 40 observers were selected for two groups through two-stage sampling. Firstly, two groups were identified. Secondly, observers for each targeted group were selected. The first group consisted of 20 observers trained using the proposed training framework given in Section 3, as explained in Section 3.5. The second group was drawn from the parents and private mentors of students, from which 20 observers were selected with informed consent. Each of the observers was assigned to record observation data on one learner only. Hence, a sample of 40 observers participated in the investigation in order to provide their observations of learners' experiences on the game's usability. To compare the difference in variables between groups, the Mann-Whitney U test [50], as cited in [51,52], was used to determine whether the distribution of variables was the same for the two groups, that is, whether the samples were likely to have been derived from the same population.
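The group comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis script: the function names and the sample scores are hypothetical, and the p-value uses the normal approximation with a tie correction and no continuity correction, which is an assumption because the paper does not state the software settings used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U, two-sided asymptotic p-value) for two independent samples."""
    pooled = list(group_a) + list(group_b)
    n1, n2 = len(group_a), len(group_b)
    n = n1 + n2
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    # Variance of U under the null hypothesis, corrected for ties.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # every observation is identical
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p


# Hypothetical 5-point Likert scores for one dimension (invented data).
trained = [4, 5, 4, 5, 3, 4]
untrained = [2, 3, 2, 3, 1, 2]
u_stat, p_value = mann_whitney_u(trained, untrained)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

With a significance level of 0.05, a p-value below 0.05 would lead to rejecting the hypothesis that the two groups share the same distribution.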

Measurement Scale.
To measure the responses of trainers regarding usability dimensions (i.e., learnability, memorability, ease of use, enjoyment), an instrument was developed by adapting the measurements of the different items from previous research. A self-administered structured questionnaire scaling the items on the 5-point Likert scale was used to collect the data. Table 2 shows the measurement of the dimensions of the items for the four criteria under observation, which is directly related to the learners' experience of the selected dimensions of usability of the game under investigation.

Procedure of Experiment.
The procedures of the experiment consist of four steps as follows:

Results and Analysis.
To evaluate the difference between trained and untrained observers regarding the reporting for learnability, the data was tested using the Mann-Whitney U test. As shown in Table 3, there is not enough evidence to support the assumption of similarity between observer groups for the selected dimensions of usability design (i.e., learnability, memorability, enjoyment, and difficulty of use). According to Table 4 and Table 5, the null hypothesis for similarity is rejected for all four dimensions of usability design.

Descriptive Analysis.
A descriptive analysis executed in this study evaluates the perceptions of the respondents on the five study variables, as regards the importance levels of the variables. Hence, the success of the model for untrained observers in usability evaluation could be determined. Accordingly, the means, the standard deviations, the minimum level, and the maximum level of the feedback on the research variables obtained from the respondents can be viewed in Table 6. There are five levels of agreement that can be selected by the respondents for the items and variables, and in order to ease interpretation, the levels were split into three categories: low, moderate, and high.
Specifically, low level is concluded when the mean scores are lower than 2.33, high level is concluded when the obtained mean scores are higher than 3.67, and moderate level is concluded when the mean scores are between 2.33 and 3.67.
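The three-way split above can be expressed as a small helper. This is an illustrative sketch only: the function name and the sample responses are hypothetical, and the cutoffs 2.33 and 3.67 are the ones stated in the text (they correspond to dividing the 1-5 scale range into three equal bands).

```python
from statistics import mean

LOW_CUTOFF = 2.33   # means below this are "low"
HIGH_CUTOFF = 3.67  # means above this are "high"


def agreement_level(mean_score):
    """Map a mean Likert score (1-5 scale) to low / moderate / high."""
    if mean_score < LOW_CUTOFF:
        return "low"
    if mean_score > HIGH_CUTOFF:
        return "high"
    return "moderate"


# Hypothetical responses to one variable (invented data).
responses = [4, 5, 4, 3]
print(agreement_level(mean(responses)))
```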

Structure Model.
The evaluation of the measurement model or the outer model involves the measurement of convergent validity and discriminant validity. Convergent validity concerns the ability of different instruments to measure the same construct [57], and it also demonstrates to what extent the instruments are in agreement with one another. Convergent validity also includes reliability of construct measurement using composite reliability and internal consistency. Meanwhile, the internal consistency can be measured using Cronbach's Alpha coefficient, where the alpha value should be greater than 0.7 to be interpreted as reliable. Additionally, Table 7 shows composite reliability (CR) of greater than 0.7 for all constructs.
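As an illustration of the internal consistency check mentioned above, Cronbach's Alpha can be computed from raw item scores with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The sketch below is not taken from the study: the function name and the response matrix are hypothetical, and sample variances are used.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's Alpha for a list of respondent rows of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one column per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical scores: 5 observers x 3 items of one construct (invented data).
scores = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]
alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.3f}, reliable = {alpha > 0.7}")
```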
Further, all constructs in this study had their scales' internal consistency (ICR) verified. Additionally, convergent validity can be determined through the use of the factor loadings of the model's construct items. Accordingly, [58] stated that items with loadings of 0.70 or higher should be retained as they denote convergent validity. Five constructs (learnability (EF), ease of use, memorability, enjoyability, and educating untrained observers in usability evaluation) showed factor loadings greater than 0.7, and those fulfilling the threshold requirement were analyzed further.
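The loading threshold and the composite reliability figure can be related with a short sketch. It assumes standardized factor loadings and uncorrelated item errors, so that CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of (1 - loading^2)); the function names, item labels, and loading values are hypothetical and are not taken from Table 7.

```python
def retained_items(loadings_by_item, threshold=0.70):
    """Keep only the items whose loading meets the retention threshold."""
    return {item: lam for item, lam in loadings_by_item.items() if lam >= threshold}


def composite_reliability(loadings):
    """Composite reliability from standardized loadings (assumed uncorrelated errors)."""
    loading_sum = sum(loadings)
    error_sum = sum(1 - lam ** 2 for lam in loadings)
    return loading_sum ** 2 / (loading_sum ** 2 + error_sum)


# Hypothetical loadings for one construct (invented data).
loadings = {"item1": 0.82, "item2": 0.76, "item3": 0.65, "item4": 0.88}
kept = retained_items(loadings)
cr = composite_reliability(list(kept.values()))
print(sorted(kept), f"CR = {cr:.3f}")
```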

Discussion
The outcomes of the research have revealed that there is a significant difference in the observations on the usability dimensions of the game between the two groups of observers. The research successfully answered the research question posed in the Introduction. The study expected that the observations by the trained observers, who are considered to be familiar with usability and experienced in the HCI environment and its issues, would vary from those by the group of untrained observers, who had no background in technical and cognitive issues regarding HCI. This was indeed revealed by the statistical analysis of ranks and medians, which showed a significant difference in the observation data between the two groups relating to selected dimensions of the usability model of the selected game. The results are consistent with previous research reported in the

Table 1 (training session arrangements):
Observer's ability to observe and report the behavior of users on any given dimension | Explanation and demonstration on understanding and assessing the cognitive aspects of dimensions (i.e., learnability, memorability, enjoyment, ease of use) | 1 hour | Informative feedback from attendees of the session on current training using a scale-based adaptive instrument | Evaluation of trainees' understanding in observing the dimensions of the proposed usability evaluation design

Advances in Human-Computer Interaction

Table 2:
Dimension: Measurement items
Learnability [53]:
(i) Selection of menu to start the game
(ii) The command on navigational controls from one activity to another
(iii) The level of completion of the task given to the user
(iv) The level of completion of the task within the time frame given to the user
(v) The number of errors committed while performing a task
(vi) Fixation and scan path duration
Ease of Use [54]:
(i) The amount of trouble it takes to move from one activity to another
(ii) Time spent on understanding a process (game step)
(iii) Time spent on watching a process (game step)
(iv) Time spent on understanding (how and when) to pause the game
(v) Time spent on understanding (how and when) to move from one level to another
Memorability [55]:
(i) The time spent in accomplishing the first level of the game in the first attempt
(ii) The time spent in accomplishing the second level of the game in the second attempt
(iii) The difference in timings between the first and second attempts
Enjoyment [56]:
(i) The level of concentration of the user when playing the game
(ii) The level of immersion in the game that is experienced by the user
(iii) The ease of distracting the user from the game
(iv) The strength of feeling and sense of control over the game experienced by the user
(v) The strength of motivation of the user to play again

Table 3:
Null hypothesis | Sig. | Decision
The distribution of learnability is the same for both groups | 0.001* | Rejected
The distribution of memorability is also the same for both groups | 0.017* | Rejected
The distribution of enjoyment is the same across categories of group | 0.001* | Rejected
The distribution of difficulty of use is the same across categories of group | 0.033* | Rejected
Asymptotic significance is displayed. The significance level is 0.05. *, p < 0.05.
literature review which concludes that the context and ability of observers may produce different results. The outcomes of the research give merit to the proposal for the training of observers. The uniformity of approach and common understanding of the aims of observation for observers will increase the validity of observations on the usability of games. Observers indicate usability and provide precise and constructive feedback that clarifies the practical aspects of games. An observer's ability to estimate and record the data, to be alert, and to control emotions are key qualities for a good observer. The observers' training will give them a broader reach, enabling them to understand and appreciate the need for the observation of the cognitive skills displayed by the user. Additionally, understanding the importance of neutrality and noninterfering behavior would become a part of knowledge enhancement related to observation.

Implications
The present research provides evidence on understanding issue-based empirical data related to actual conditions. The study guides the readers to consider the contextual environment of the learners, for whom learning games are designed and who, indeed, benefit. The results provide a clear guideline on evaluating the usability of games and how evaluation carried out by heterogeneous groups of observers may produce misleading results. Theoretically, the research provides an insight into understanding the importance of the cognitive learning aspects of learners and the competence of observers in the evaluation of usability. The research also guides the directions of future research into linguistics, exemplification, and esthetic utility of games.

Research Finding
The present study is significant to the domain of educational games evaluation as it sheds light on the subject in the midst of the COVID-19 pandemic. This study addressed the issue of reaching the target audiences in the assessment of certain systems amidst the pandemic, by bringing forth an innovative approach. Accordingly, the needed theoretical knowledge in dealing with the problems is offered by this study. A new model comprising various variables with strict coordination and other variables was proposed.
This study looks into the important considerations in remote evaluation of systems in situations where it is not
Usability evaluation of educational games is regarded as an important task by several bodies including game developers, departments of research and innovation, and, in the context of this study, the ministry of education. Accordingly, usability evaluation can be applied in evaluating the potential of educational games and children's acceptance of these games. The usability evaluation in Oman was clarified succinctly in this study. Moreover, the factors impacting usability evaluation of educational games for children specific to the context of Oman during the COVID-19 pandemic were identified. The measurements and conceptual framework illuminating the relationship between trainers were developed, where learnability (EF), ease of use, enjoyment, and memorability were the independent variables, while the construct of "educating untrained observers in usability evaluation" was the dependent variable.

Research Contributions.
Several issues significant to the situation at hand were discussed in this study. For this purpose, this study established and examined a framework of the training module for untrained observers in usability evaluation during the COVID-19 pandemic, to improve the validity of usability evaluation for children's educational games. In view of that, this study has made several contributions as discussed below:
(i) This study helps authorities in their remote execution of usability evaluation as needs arise
(ii) This study becomes a valued input to the scrutiny on the impact factors affecting usability evaluation of educational games
(iii) This study enriches the knowledge on the benefits of and prerequisites for the improvement of usability evaluation of educational games in the Omani context
(iv) This study presents a model that describes the impact of some factors on usability evaluation of educational games in Oman
More importantly, the research model proposed was applied in the examination of variables analysis of usability evaluation for educating untrained observers in usability evaluation in Oman.

Recommendations
Based on the findings, the researchers recommend the training module (framework) detailed in Section 3 in this paper for untrained observers who want to observe the children for usability evaluation of an educational game to produce more valid and accurate results. The major considerations in developing an observer's deployment would cover the extent of knowledge needed by an independent observer.

Limitations of the Study
Because of the unavailability of games with a local language (Arabic) interface, we decided not to investigate the differences in communicative attributes of games. Due to time constraints and limited access due to COVID-19, a sufficiently large sample size could not be ensured.
Another limitation in assessing the differences between observers involved the issue of exposure of the learners to the game. The learners were not classified as being first-time users or already experienced in using games. The difference in experience itself may have impacted on the comparison between the two groups.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.