A Systematic Review of Modifications and Validation Methods for the Extension of the Keystroke-Level Model

The keystroke-level model (KLM) is the simplest model of the goals, operators, methods, and selection rules (GOMS) family. The KLM computes formative quantitative predictions of task execution time. This paper provides a systematic literature review of KLM extensions across various applications and setups. The objective of this review is to address research questions concerning the development and validation of extensions. A total of 54 KLM extensions have been exhaustively reviewed. The results show that the original keystroke and mental act operators were continuously preserved or adapted and that the drawing operator was used the least. Excluding the original operators, almost 45 operators were collated from the primary studies. Only half of the studies validated their model’s efficiency through experiments. The results also identify several research gaps, such as the shortage of KLM extensions for post-GUI/WIMP interfaces. Based on the results obtained in this work, this review finally provides guidelines for researchers and practitioners.


Introduction
Human-computer interaction (HCI) simplifies reality with models of human behaviour to design and evaluate computer systems [1]. Within HCI, models of motor behaviour lie on a continuum of analogy and mathematical equations. Generally, the models are categorised as either descriptive or predictive. Descriptive models present a framework to describe a phenomenon by identifying its features within a computer system. At the other end of the continuum, predictive models are commonly used to provide analytical a priori estimations of human performance without user participation, thus reducing time and resource consumption.
A family of predictive models (GOMS) was developed to compare and evaluate goals, methods, and selection rules of skilled, error-free user performance [2]. GOMS techniques model goal hierarchies of defined unit tasks rendered as a composition of action and cognitive operators [3,4]. The simplest member of the GOMS family is the keystroke-level model (KLM), which predicts the execution time of specific tasks in a desktop environment using a mouse and keyboard. The KLM has been widely utilised to evaluate expert performance on various desktop interfaces, and its aptitude and usefulness have been well demonstrated.
The challenges of designing and developing computer systems and the emergence of new technologies have revealed a need for updated quality assessments. Revising predictive models for these challenges can help evaluate human performance a priori and reduce the need for time- and resource-intensive human studies. The KLM was developed from and intended for desktop systems but has continually been extended to model systems designed for other computer setups in various domains. These extensions involve adapting the original KLM operators, introducing or inheriting new operators, revising heuristics, and presenting new execution calculations or techniques to satisfy the extension's purpose.
A systematic review of KLM extensions provides an objective procedure for identifying the extent of the research that is available; to the best of the authors' knowledge, no prior systematic review focuses on KLM extensions. This paper extensively reviews KLM extensions published between 1980 and 2016. The goal of this review is to summarise, analyse, and assess the empirical evidence regarding the purpose for each extension, the extension's application domains and setups, [...] The results of this review also outline relevant issues for designers, developers, and researchers who apply or extend the KLM.
The rest of this paper is organised as follows. Section 2 presents the background for the KLM by introducing the topic and its seminal publications. Section 3 describes the methodology and protocol used to systematically review the KLM extensions. Section 5 describes the results of the review. Section 6 discusses the principal findings, limitations, and implications for research and practice. Finally, Section 7 concludes the paper and suggests future directions.

Advances in Human-Computer Interaction

Table 1: KLM operators and predicted execution times in seconds [6].
Type | Operator | Description | Time (s)
Physical | K | Keystroke or button press |
 | | Best typist (135 wpm) | 0.08
 | | Good typist (90 wpm) | 0.12
 | | Average skilled typist (55 wpm) | 0.20
 | | Average non-skilled typist ([...]) | [...]
[...]

Keystroke-Level Model: An overview
KLM [6] is the simplest and most practical GOMS method for evaluating the time performance of user-computer system interaction. Underlying the KLM is the assumption that the user performs a series of small, independent unit tasks. These tasks support the decomposition of larger tasks into manageable units. The sum of the durations of these small units equals the time it takes to complete the task. Each unit task has two phases, task acquisition and task execution, and the total time to complete a unit task is the sum of these two parts:
$$T_{task} = T_{acquire} + T_{execute} \qquad (1)$$
First, in the acquisition phase, the user conceptualises and develops a mental representation of the unit task. Then, during execution, the user invokes the appropriate system commands required to accomplish the unit task. The KLM predicts only the execution time of a unit task because that is the only phase over which a system designer has direct control. Unit tasks in the KLM are described with a set of physical-motor, mental, and response operators (see Table 1). Operators are identified by a letter and include: K keystroking, P pointing, H homing, D drawing, M mentally preparing, and R system response. K is the most frequently used operator and represents a keystroke or a button press. The P operator is the act of pointing to a target on a display with a mouse. P would typically be computed as a function of the distance to a target and its size (Fitts' law, [7]); however, for simplification it is assigned a constant time. In a typical computer setup, H is the action of moving the hand between keyboard and mouse and includes any fine hand adjustments on those devices. The physical operator D is restricted to the mouse and refers to manually drawing a set of straight-line segments within a constrained 0.56 cm grid. Before carrying out a physical action, the user has to mentally prepare for its execution. This preparation is represented by the M operator and a constant value of 1.35 seconds. The final operator, R, refers to the time it takes for the system to respond to a user's actions. Unlike physical and system operators, M is not an observable user behaviour, yet it comprises a substantial fraction of the prediction. The occurrences of M are based on specific knowledge of user skills, and their placements are governed by a set of heuristics that embody psychological assumptions about users. Methods are sequences of system commands that form a compiled segment of a user's behaviour when executing a unit task.
A user cognitively organises a method according to cognitive chunks, and M typically occurs between chunks rather than within them. In Table 2, Rule 0 identifies possible decision points within the methods, while Rules 1 to 4 attempt to identify these method chunks.
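The overview notes that pointing time would normally follow Fitts' law rather than a constant. A minimal sketch of that relationship is given below, assuming the widely used Shannon formulation; the function name and the coefficients `a` and `b` are hypothetical placeholders, not values taken from the KLM or from any reviewed study.

```python
import math


def fitts_pointing_time(distance, width, a=0.0, b=0.1):
    """Estimate pointing time in seconds with Fitts' law (Shannon form).

    MT = a + b * log2(distance / width + 1)

    `a` (intercept, s) and `b` (slope, s/bit) are device-specific and are
    normally fitted from experimental data; the defaults are placeholders.
    """
    if distance < 0 or width <= 0:
        raise ValueError("distance must be >= 0 and width must be > 0")
    index_of_difficulty = math.log2(distance / width + 1)  # in bits
    return a + b * index_of_difficulty


# A farther or smaller target yields a longer predicted pointing time.
near_target = fitts_pointing_time(distance=3, width=1)
far_target = fitts_pointing_time(distance=15, width=1)
```

Replacing this function with a constant is the simplification the KLM makes for P.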
Execution time is predicted by decomposing a unit task into a list of operators and then computing their summation:
$$T_{execute} = T_K + T_P + T_H + T_D + T_M + T_R \qquad (2)$$
where each term is an operator's total time, e.g., $T_K = n_K t_K$, where $n_K$ is the number of keystrokes and $t_K$ is the duration of each K. To illustrate how the KLM's equation and rules can be applied to predict user performance, consider the following example of a user renaming a folder to "klm" on a desktop. The user homes the hand on the mouse, H; points the mouse cursor at the object, P; double-clicks on the folder icon to [...]
KLM was validated against observed values to determine how well the model predicted performance times and was subsequently used to model typical tasks in various systems (text editors, graphic editors, and executive subsystems). K's value can be determined from a typing test prior to the test tasks. After a practice period, each expert user carried out test tasks and their keystroke times were logged. These times were then compared against the modelled predictions. The root-mean-square percentage error (RMSPE) was calculated as 21%. The developers of the KLM reported that this accuracy is the best that can be expected from the KLM and that it is comparable to the 20-30% previously obtained from more elaborate models [2,6].
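The operator summation and the RMSPE comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not the original authors' procedure: the K and M unit times are the values quoted in this paper, the P and H constants are those commonly cited from the original KLM publication, the operator sequence is hypothetical (it is not the renaming example), and the RMSPE definition used here (error relative to each observed value) is one common choice.

```python
import math

# Unit times in seconds. K (average skilled typist) and M are quoted in the
# text; P and H are assumed from the original KLM publication. D is omitted
# because its time depends on the drawn segments.
UNIT_TIMES = {"K": 0.20, "P": 1.10, "H": 0.40, "M": 1.35}


def predict_execution_time(operators, unit_times=UNIT_TIMES, response_time=0.0):
    """Sum operator unit times for a sequence such as 'HPKK'.

    `response_time` stands in for the total system response time R.
    """
    return sum(unit_times[op] for op in operators) + response_time


def rmspe(predicted, observed):
    """Root-mean-square percentage error of predictions (assumed definition)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [((p - o) / o) ** 2 for p, o in zip(predicted, observed)]
    return 100.0 * math.sqrt(sum(squared) / len(squared))


# Hypothetical sequence: home, point, two clicks, mental act, four keystrokes.
example_time = predict_execution_time("HPKKMKKKK")
```

Given logged task times from expert users, `rmspe(predictions, observations)` gives a single figure comparable in kind to the 21% reported above.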
KLM inherits several limitations from GOMS. It assumes the user is an expert and does not account for user errors. This makes the model ill-suited for predicting average or novice system users. The model also assumes that the task is performed linearly; however, users often multi-task and are frequently interrupted. The KLM also does not consider individual differences in performance, such as mental workload and fatigue. In addition, the KLM predictive model is usually not generalizable because it is constructed to fit and evaluate a given interface.

Methodology
This systematic review of KLM extensions was carried out following the procedure given by Kitchenham and Charters [5,8]. The review process consisted of three stages: planning, conducting, and reporting (see Figure 1). The review protocol was established after several meetings and discussions to reduce the risk of research bias. The rest of this section describes the research questions and the subsequent steps undertaken to conduct the review.

Research Questions.
The goal of this review is to examine the current extensions of the KLM from the point of view of the following research question: "What extensions have been applied to the KLM and how have these extensions been developed and evaluated?" The question aims to summarise the current practices around extending the KLM to shed light on gaps in the current research, suggest areas for further investigation, and provide knowledge on the adoption of the KLM and its extensions to measure the performance of prototypes. Table 3 lists all the research questions and their motivations.

Search Strategy and Terms.
The search string consisted of two main parts: the KLM and its extensions (see Table 4). The first part relates to studies utilising the KLM for extension or evaluation, and the second part relates to extensions. The terms were extracted from textbooks and research papers on the KLM. The search string was formed by incorporating alternative terms and synonyms using the Boolean "OR" expression. The two main search terms were then combined using "AND". The search was conducted by applying the search string to collections of article meta-data. The string syntax was adapted for application to each digital library and its restrictions. This review was restricted to the period from July 1980 (the first time KLM was presented in "The Keystroke-level Model for User Performance Time with Interactive Systems," [6]) to December 2016.
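The way the search string is assembled (synonyms joined with "OR" within each part, and the two parts joined with "AND") can be sketched as follows. The term lists are hypothetical placeholders, since the actual terms are those listed in Table 4, and each digital library may require its own syntax.

```python
def build_search_string(term_groups):
    """Join terms with OR inside each group and join groups with AND."""
    clauses = []
    for group in term_groups:
        if not group:
            raise ValueError("each term group needs at least one term")
        alternatives = " OR ".join(f'"{term}"' for term in group)
        clauses.append(f"({alternatives})")
    return " AND ".join(clauses)


# Hypothetical placeholder terms, not the terms used in the review.
KLM_TERMS = ["keystroke-level model", "KLM"]
EXTENSION_TERMS = ["extension", "modification"]

query = build_search_string([KLM_TERMS, EXTENSION_TERMS])
```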
In addition to the primary search strategy, backward and forward searches were conducted. For each selected paper, the references were examined for a backward search, while the "cited by" links provided by some of the digital libraries were analysed for the forward search. Finally, publications citing the original KLM paper were also searched.

Study Selection Criteria.
Each primary study was evaluated for relevance against inclusion and exclusion criteria. A study was selected when it satisfied one of the following inclusion criteria:
(i) Studies explicitly extending the KLM.
[...]
Studies were excluded from the review when they met one of the following exclusion criteria:
[...]
(iv) The extension methodology is adequate and repeatable.
(v) The extension results and findings are clearly stated.
(vi) The study clearly defines the research methods used to validate the extended KLM.
(vii) The extension validation methodology is adequate and repeatable.
(viii) The validation results and findings are clearly stated.
(ix) The study presents a comparative analysis of the extended KLM against the original KLM.
(x) The paper has been cited by other authors and/or contributes to the literature.
Each question is scored 1 (yes), 0.5 (partly), or 0 (no). The final quality score is the sum of these values, so the maximum score is 10 and the minimum score is 0. The quality of each primary study was scored by two researchers. After thorough reviews, discussions were conducted to reach a final decision about the inclusion of each study in the review.
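The scoring rule above amounts to summing ten answers, each worth 1, 0.5, or 0. A minimal sketch is shown below; the function and constant names are hypothetical, and the way the two researchers' scores were reconciled is not modelled because the text only states that it was settled by discussion.

```python
ALLOWED_ANSWERS = {0.0, 0.5, 1.0}  # no, partly, yes
NUM_QUESTIONS = 10


def quality_score(answers):
    """Return the quality score (0 to 10) for one primary study."""
    if len(answers) != NUM_QUESTIONS:
        raise ValueError(f"expected {NUM_QUESTIONS} answers, got {len(answers)}")
    for answer in answers:
        if answer not in ALLOWED_ANSWERS:
            raise ValueError(f"invalid answer value: {answer!r}")
    return float(sum(answers))


# Seven "yes", two "partly", and one "no" answer.
example_score = quality_score([1, 1, 1, 1, 1, 1, 1, 0.5, 0.5, 0])
```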
3.6. Data Extraction.
The data extraction strategy was used to provide answers to the research questions in Table 3. An extraction form was developed to ensure that consistent extraction criteria were used (see Appendix A). The information extracted included:
(1) Title, author, year, and type of publication.
(2) RQ1: the purpose of the extension, device setup, application domain, and the intended users.

Conducting the Review
Applying the review protocol yielded the preliminary results shown in Table 5. In total, 149 studies were selected. Using the defined inclusion and exclusion criteria, 62 primary studies (based on 66 articles) were identified. During this stage several issues were identified: (i) Some studies document different stages of the same research; for this reason, we refer to a smaller number of studies based on a larger number of articles. (ii) Some studies appeared in more than one source; these were considered based on the adopted search order (ACM, Scopus, Springer Link, Science Direct, Web of Science, Taylor and Francis on-line, IEEE Xplore, and Google Scholar).
The forward and backward search of the selected studies yielded only two relevant papers that were included. This low number indicates the thoroughness of the search terms used. The number of papers reviewed totalled 64 primary studies, based on 68 articles.

Systematic Review Results
This section summarises the results obtained after conducting the review and synthesis. First, an overview of the primary studies and their corresponding quality marks is presented. Next, the answers to three of the research questions are addressed in separate subsections (RQ1, RQ2, and RQ4). Because research question RQ3 (see Table 3) is considered the most important, it is addressed in a separate section. Finally, a discussion and interpretation of the results is presented.

5.1. Descriptive Statistics.
Table 6 shows the unique identifier assigned to each study and lists the associated reference. These identifiers will be used throughout the remainder of this review to refer to the primary studies.

Quality Assessment.
Each quality assessment question was assigned a score of 1 (yes), 0.5 (partly), or 0 (no). The maximum score is 10 and the minimum score is zero. The quality scores were divided into categories:
(i) Very High: 8 ≤ quality score ≤ 10
(ii) High: 5.5 ≤ quality score ≤ 7.5
(iii) Medium: 3 ≤ quality score ≤ 5
(iv) Low: 0 ≤ quality score ≤ 2.5
[...] spike in publications coincides with the resurgence of touch interactions and post-GUI configurations, which signalled a need for updated performance assessors.
[...] studies were retrieved from Google Scholar. Fourteen studies (25.93%) were found in the ACM digital library. Springer Link produced 10 studies (18.52%), followed by IEEE Xplore with 3 studies (5.56%), Scopus and Taylor & Francis On-line with 2 studies (3.70%) each, and finally, a single publication from Science Direct (1.85%).
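As an aid to reading the quality results, the four quality categories listed in this subsection can be expressed as a small lookup. This is only a sketch with hypothetical names; it assumes that individual study scores are multiples of 0.5, which is why the bands leave no usable gap between, for example, 7.5 and 8. Aggregates such as medians can fall between bands and are rejected here rather than guessed at.

```python
def quality_band(score):
    """Map a study's quality score (0 to 10, in steps of 0.5) to its band."""
    if not 0 <= score <= 10:
        raise ValueError("score must lie between 0 and 10")
    if not float(score * 2).is_integer():
        raise ValueError("individual scores are multiples of 0.5")
    if score >= 8:
        return "Very High"
    if score >= 5.5:
        return "High"
    if score >= 3:
        return "Medium"
    return "Low"


bands = [quality_band(s) for s in (9.5, 7.5, 5, 2.5)]
```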

Publication Type.
Publications were categorised as journal articles, conference proceedings, technical reports, extended abstracts, or theses. Figure 5 illustrates the distribution of primary studies across the five publication types. Half of the studies (27, 50%) were conference proceedings and 16 (29.63%) were journal articles. The remainder were technical reports (6, 11.11%), theses (4, 7.41%), and one extended abstract (1.85%).
[...] Studies with (P1) as a purpose were conducted to extend the model, and in some instances validation studies were also conducted to confirm the viability of the model. For (P2), research methods were often utilised to extend and validate the extended model. In (P3), the studies typically extended the KLM or one of its enhancements and used experiments to determine the performance of a certain device or application. The fourth purpose, (P4), revised operators and heuristics using research methods; however, validation was infrequent. (P5) extended the model to describe new interactions, and utilised research methods to extend and validate the enhancement. The final purpose, (P6), utilised short studies to extend the KLM or one of its extensions to better incorporate it with a larger model. Table 8 summarises the results of analysing the number of publications for each purpose. The table shows that publications with purposes (P2) and (P5) had the highest median quality scores (7.75 and 7.50, respectively). This is because these studies usually carried out experiments to extend and validate the KLM. Publications that extended the KLM for purpose (P3) obtained medium quality scores (median of 4.24), since validation was often not considered.
Table 9 summarises the device setups collated from the review of 54 primary studies. The largest group of extensions (20, 37.04%) modified the KLM or one of its extensions to model mobile or tablet interactions. These extensions were further categorised as key-based (12, 22.22%) or touch-based (8, 14.81%) mobile devices, smartphones, or tablets. Fourteen studies (25.93%) extended the KLM for traditional configurations. The KLM was also extended for In-Vehicle Information Systems (IVIS), which were categorised as either traditional (with knobs and dials, 11.11%) or touch-based (7.41%). Specialised configurations add features such as a digitized pad (PS8-9), Braille display (PS28), mouth-stick (PS10 and PS49), Leap Motion sensor (PS28), and specialised controls (PS64). Post-GUI configurations addressed extensions for natural user interfaces (PS50, PS54, and PS68) and immersive projection (PS66). The KLM was also extended for web navigation on a television and for remote setup (PS47). Note that the percentages do not add up to 100% because one study (PS47) combined two setups.

Application Domain and Target Users.
Several application domains were identified from the primary studies and grouped into high-level categories. Table 10 summarizes the recurrent domains. The most frequently examined domain relates to mobile or tablet applications (13 studies). Text and/or spreadsheet editing was the domain used to validate KLM [2,6]; these studies were mainly conducted in the late 1980s to the early 1990s. Accessible interfaces were also examined to extend the KLM for interaction by blind users and users with motor disabilities. Navigating the web from various setups was also considered in the literature. IVIS setups were relatively popular (see Table 9) as a domain and considered tasks such as radio tuning, navigating lists, and using a global positioning system (GPS) for map navigation.

RQ2: What Was the Research Method Used to Extend the KLM?
This research question examines the research methods used to modify the KLM or any of its extensions. The question is addressed in two ways: (1) What was the research method used to extend the KLM operators and heuristics?
(2) What was the research method used to modify or compute the KLM operators' unit times?
Figure 6 shows the research methods used to extend the KLM; these include experimentation, previous literature, observations, and several combinations of these methods. Twenty-seven (50%) of the studies did not use research methods to extend operators and modify heuristics. Observations were commonly conducted to identify or examine interactions (9 studies, 16.67%). Operators and heuristics were also extracted from previous literature (7 studies, 12.96%), and some studies used experiments (6 studies, 11.11%). Research methods were also combined; the combinations noted include literature and experimentation, literature and observational studies, and observation and experimentation. Figure 6 also illustrates the research methods used to modify the unit times of KLM operators. Of the 54 studies, only 12 (22.22%) did not utilise research methods. Over half of the primary studies (30 studies, 55.56%) conducted experiments to modify unit times. Eight studies (14.81%) relied on previous literature to adjust unit times. Research methods were also combined to modify unit times.
[...] It was also of interest to consider the operators that have been explicitly preserved in the extensions. The rest of this section discusses the operators, equations, heuristics, and metrics based on their intended device setup (see Section 5.2.2). Figure 7 collates the operators reported in the primary studies to identify their frequencies among the selected studies and device setups.
(1) Preserved Original Operators. K was the most frequently preserved of the original operators. PS6, PS32, PS11, PS65, PS41, and PS17 used the unit times associated with various typing skills. The majority of these used the time for an average skilled typist (0.20 seconds), while others used 0.28 seconds (average non-skilled typist). H was preserved by PS1, PS11, PS32, and PS65, while [...] was used in three studies: PS1, PS17, and PS41. Four studies (PS1, PS32, PS41, and PS68) preserved the value of [...]. PS17 aimed to increase the accuracy of the P operator by utilising Fitts' law.
Operator R is system dependent and was often not utilised in the studies, yet it was still conserved.
(2) Adapted Original Operators. Some of the KLM operators were adapted through unit time adjustments or decomposition into finer tasks. PS7 updated the unit times of H, K, M, and [...]. PS63 dissected H into two actions: homing from the keyboard to the mouse and homing from the mouse to the keyboard. P's unit time was updated in PS4, PS5, and PS7. A specialised P, PM(i), was introduced in PS11 to indicate pointing to the ith menu item. K and M were the two most frequently updated operators. K was updated in PS1, PS3, PS5, PS7, and PS63, while M was revised in PS3, PS5, PS6, PS7, and PS32. In PS5, M was decomposed into three mental actions: retrieval from memory, choosing among options, and executing a mental step. K was decomposed in PS3, where the unit times were 0.36 and 0.23 for two different spreadsheet tasks, respectively.
(3) Inherited Operators. PS11 inherited ten operators from previous extensions and prior literature: pressing a button B [45]; executing a mental step [17]; retrieving from memory, dragging to a menu item, and pointing to a menu item [77]; perceiving an image, reaction time of choosing an image, and eye movement [2]; menu search slope, intercept, and an overall value from an investigation into history tools for user support [52]; pressing a button and performing a button click [45].
[...] clicking the mouse. The symbol [...] was introduced in three studies (PS3, PS41, and PS51) to represent two different operations: mentally scanning/searching the display and pressing a keyboard shortcut. The time it takes to listen to a spoken word was utilised in PS64. PS41 also established a new operator for the time it takes to press a navigation key when navigating websites.
(5) Updated Heuristics. Heuristics are commonly updated when new operators are introduced, in order to revise the placement of M. PS3 argues that commands issued through a series of menu choices involve a single M rather than one for each menu choice, because the command forms a single cognitive unit. Using a history tool, PS6 stated that switching from typing to using the history tool includes an additional long-term memory retrieval. For another history tool studied in PS11, M placement was extended for formula tasks. The study also offered guidance for placing the new mental scanning operator.
(6) New Equations. In the KLM, task time is computed from the summation of the operators' unit times (see (2)). Some studies modified these equations to consider additional elements that may affect execution time. Both PS17 and PS37 introduced new equations in their extensions. PS17's authors formulated equations of various tasks that impact email archiving and retrieving. For word selection tasks, PS37 introduced equations to compute the time it takes to select a word given several variables, including scanning and scrolling time, word length, and the index of the selected word.

Key-Based Mobile.
It was in the new millennium that interest in extending the KLM for mobile interaction and text entry became most evident. Twelve of the 54 selected studies modified the KLM to accommodate key-based mobile interactions (PS16, PS20, PS21, PS24, PS26, PS27, PS30, PS34, PS35, PS39, PS44, and PS46). Figure 9 illustrates these studies and the approaches they proposed to extend the KLM. The following subsections describe the changes made to extend the KLM.
(1) Preserved Original Operators. Several operators were preserved from the KLM: H, K, M, and R. PS16 utilised the original H, K, and [...] for predictive text entry on mobile phones. The KLM was further extended by the same authors in PS30 for five predictive text entry methods that preserved the original [...] and R. R was also used as is by PS21 and PS46. An extended KLM for modelling speech navigation and text entry preserved K.
(2) Adapted Original Operators. PS26 extended the KLM for SMS input by dissecting K into nine operators for various keys and repetitions. K was also decomposed in PS39 to reflect unique interactions with a Pinyin keyboard, an input method for Chinese text using the Pinyin method of romanisation. K's unit time was revised in PS21, PS27, PS30, PS35, and PS44, the majority of which dissected the unit times based on the type of key and repetition. It should be noted that PS27 approached the KLM differently, assigning each key or repetition a score rather than a unit time. P has also been adapted and at times redefined. For instance, PS30 and PS46 modified P to reflect pointing with a device to perform an action. PS44 considered P for pointing to a keypad. While PS21 preserved P's original meaning, M was adapted in both PS21 and PS44 and decomposed in PS34 to represent time delays during text entry and recognition. TPER from PS20 is also an adaptation of M for text entry perception. H was revised in PS30 to consider the time needed to switch between listening/speaking on the phone and reading from the screen.
(3) Inherited Operators. Only four operators from three studies were inherited from the literature or another extension. Two of these replaced the value of K, the third updated the unit time for M, and the last re-used a value from a previous model for complex actions. PS24 enhanced the KLM to evaluate Korean text entry on a mobile phone, where the values for K and M were inherited from Kim, Kim, and Myung [78] and John and Newell [79]. K's unit time was also inherited from Silfverberg, MacKenzie, and Korhonen [80] to extend the KLM for message-text entry with a Greek corpus. Mobile KLM (PS30) was revised in PS46, which inherited the complex action operator to reflect tag-reading interactions.
(4) New Operators. Several new actions were recognised by half of the studies that extended the KLM for key-based mobile phones. PS20 introduced two new operators: waiting for the cursor to process when successive letters are entered from the same key in multi-tap text entry, and the action of moving to another key. Similarly, PS26 utilised a wait operator for multi-tap entry. It also introduced MPHAlphaK (press and hold key), RPHAlphaK (repeat press and hold key), and InsertWord (insert word into corpus dictionary). Mobile KLM (PS30) extended the KLM with several operators: attention shift for various focus shifts, complex actions, gesturing with the phone, finger movement, initial act, and a multiplicative factor for distraction. PS34 extended the KLM for speech text entry and introduced an action that reflected the time needed to consider/recognise a command and utter a syllable.
[...] M is expected to occur both before and after entering a syllable. Moreover, an M should not be placed before the next key since finger movement and the mental activity overlap. PS46 declared that M should appear before cognitive chunks and that an M is unnecessary before pointing at longer distances with respect to shorter ones.
[...] speech text entry compared with multi-tap and predictive text entry, in which several equations were constructed to consider time-out delays, number of words, and word options in predictive entry. Two other studies formed equations in contexts other than text entry. The mobile KLM in PS30 proposed a new equation that took distractions of various severities into account. PS27 approached the KLM differently, presenting unit times as scores used to calculate the relative average efficacy, where the sum of the scores for each task is first divided by the number of tasks and then multiplied by 100 to obtain a percentage.

Touch-Based
[...]
(2) Adapted Original Operators. One KLM extension, developed to model the performance of a new interaction technique, decomposed P, D, and R. P was subdivided as follows: point stylus at segment, point to command, and point to end the mark. Dc and Dm symbolise drawing a circle around a dot and drawing a mark, respectively. R was divided into switching modes and the time it takes the system to respond. K was adapted by PS22 to consider both key repetition and movement between keys. In testing a new keyboard design for Chinese text input, 1 Line (PS45), K was dissected into a key for each finger on both hands. Similarly, M was modified in PS38 to reflect mentally initiating a task, deciding or choosing, retrieving, finding, and verifying. The extended model also adapts H into two actions: homing either a stylus or a finger to some location. PS56 modified [...] to reflect a relatively long movement from one position to another on a touch mobile phone in network gaming.
(3) Inherited Operators. PS53 inherited two operators from mobile KLM (PS30): initial act and distraction. Gesture actions were inherited but adapted to reflect the time needed to physically form specialised gestures with one or more fingers. The same operator was also used by PS38 to represent holding a gesture for a certain application.
[...] to form a touch-level model (TLM) for touchscreen and mobile devices and introduced several new operators: tap, pinch/zoom to zoom in/out, swipe, rotate, drag element, and tilt device. Tapping is a common interaction in touch interfaces that was also introduced in PS38, PS55, and PS56. Swipe, zoom, and drag actions were also identified in PS55 and PS56. PS22 utilised two new operators that consider the decision and recovery times for data entry using a soft keyboard. Flick was established in PS56 to identify quick, short dragging actions. This action was decomposed in PS45 to distinguish between flick down and flick up. New operators introduced for finger/stylus touch mobile devices extended the model to include flipping or sliding a keyboard, continuously holding a key down, pressing a key on the side of the device, and plugging and unplugging other devices.

Traditional In-Vehicle Information Systems.
Traditional In-Vehicle Information Systems (IVIS) typically consist of a screen surrounded by a series of keys, buttons, and knobs intended to perform tasks such as turning the radio on, road navigation, and navigating music lists. Of the 54 primary studies, six were categorised as traditional IVIS (PS13, PS15, PS18, PS31, PS42, and PS43). Figure 11 shows how the operators were extended in the new models. The following subsections elaborate further on these operators, heuristics, equations, and metrics.
(2) Adapted Original Operators. H was modified from its original values in PS31 and PS42 to reflect new homing interactions between the IVIS and the steering wheel. Similarly, PS15 decomposed H into two operators, Rn and Rf, for reach-near (from the steering wheel to other parts of the wheel) and reach-far (from the steering wheel to the IVIS). It also presented age-adjusted unit times for older drivers. The study also dissected and into refined operators and replaced the original value of M with 1.50 seconds and an age-adjusted value of 2.70 seconds. M was also modified by PS43 with two new values based on its placement after and their new turn operator. PS13 adapted K for an enter keystroke along with a down keystroke. K was also modified by PS43
(4) New Operators. Only two new operators were introduced in two studies, PS13 and PS43. A reading/decision operator was identified by PS13 to represent the time needed to read an IVIS menu and decide upon actions based on the menu's depth and breadth. A turn operator was introduced in PS43 for turning a dial (clockwise or counter-clockwise) at various degrees.
(5) Updated Heuristics. The placement of M was revised to incorporate the turn operator introduced in PS43. The new heuristic dictates that M should be placed in two different scenarios with two different values: after and both before and after the user turns a knob. PS31 and PS42's approach to modelling a unit task involved developing the model traditionally using their extended KLM, and then reassessing the sequence of operators by considering the vision/no-vision intervals.

Touch-Based In-Vehicle Information Systems.
Touch-based IVIS feature a touch screen for navigating the system. Of the selected studies, four extended the KLM for touch-based IVIS (PS14, PS52, PS61, and PS62). Figure 12 illustrates the various changes made to extend the KLM. These extensions did not explicitly preserve any of the original operators; thus, the following subsections discuss the adapted original operators, inherited actions, new operators, and new equations.
(1) Adapted Original Operators. PS14 extended the KLM to revise the unit times previously measured in the literature. They argue that the values misrepresented the evaluated IVIS because the original values were based on a QWERTY keyboard. Therefore, in their study, K and M were revised. For K, several values were considered: letters, numbers, cursor keys, enter, shift, and space. The revision also considered key repetitions. M was adapted to 2.22 seconds in their extension method. K was divided in PS61 to represent function key actions and their repetition. PS61 also considered age-adjusted unit times for these operators. R was adapted by PS62 for wait-while-loading and wait-after-loading, each of which was age-adjusted.
(2) Inherited Operators. PS15 was revised in PS61 to model interactions with a touch-based IVIS. PS61 inherited and revised the following actions: cursor key pressed once, cursor key after first press, letter key pressed once, letter key after first press, number key pressed once, and number key after first press. The unit times were also adjusted for age. A flicking operator was inherited by PS62 to represent the act of moving a finger in the flick direction (this operator was inherited and revised from PS52).
(3) New Operators. PS52 considered the new flick operator in the context of navigating lists of contacts or albums, each of which is age-adjusted. Several new operators were introduced by PS61 to re-evaluate the traditional IVIS model, including scrolling through a list, pressing and holding a key, dragging, and first and subsequent slider actions. PS62 developed an extended KLM that overcomes a noted shortcoming of the occlusion methods used by PS31 and PS42. A variety of operators were introduced: flick/scroll return, pressing an on-screen button, quick flick, reach for button, reach for console, read instructions, reposition hand on knob, scroll, search, stop screen, turn knob, and wait for goggles in known and unknown locations to represent the time the user waits for a vision period.
(4) New Equations. PS14 determined the retrieval time of a destination from an IVIS, which involved keying in part of the destination name, scrolling through a list of names, or a combination of these approaches. Destination entry tasks that involved keying in a destination name or a longitude and a latitude were also considered. To aid in modelling the KLM extension, the study created a spreadsheet for both tasks in which predicted times were adjusted for age, lighting conditions, and destination. These spreadsheets were used to construct the formulas that calculate the total predicted times for destination retrieval and entry tasks.
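The role of age-adjusted unit times in these extensions can be illustrated with a minimal sketch. A KLM prediction is the sum of the unit times of the operators in a task sequence, so an age-adjusted prediction can be obtained by swapping in a second unit-time table. The operator labels, the task sequence, and every value below except the 2.22-second M reported for PS14 are hypothetical placeholders, not values taken from the reviewed studies.

```python
# Unit times in seconds. Only M = 2.22 s (PS14) comes from the text;
# every other value is a hypothetical placeholder for illustration.
DEFAULT_TIMES = {
    "K": 0.50,  # keystroke on the IVIS (placeholder)
    "M": 2.22,  # mental act, as adapted by PS14
    "F": 0.30,  # flick (placeholder)
}

# Hypothetical age-adjusted table: each operator gets its own older-driver value.
OLDER_TIMES = {
    "K": 0.90,
    "M": 3.00,
    "F": 0.50,
}


def predict_time(sequence, unit_times):
    """Return the predicted execution time: the sum of the operator unit times."""
    return sum(unit_times[op] for op in sequence)


# Hypothetical unit task: think, press three keys, think, flick through a list.
task = ["M", "K", "K", "K", "M", "F"]
baseline = predict_time(task, DEFAULT_TIMES)
older = predict_time(task, OLDER_TIMES)
print(f"baseline: {baseline:.2f} s, age-adjusted: {older:.2f} s")
```

Keeping the operator sequence separate from the unit-time tables mirrors how the reviewed extensions report one model with several sets of unit times.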

Specialised Setup.
Specialised setups enhance a traditional device with domain-specific controls or involve more than one screen. Seven of the 54 primary studies were categorised as specialised setups (PS8, PS9, PS10, PS28, PS49, PS64, and PS67). Of these, PS8 and PS9 involve the same continuing study. These studies preserved/discarded/adapted original operators, inherited operators from the KLM or its extensions, introduced new operators or equations, updated the heuristics, and identified domain-specific metrics (see Figure 13). The remainder of this section describes the modifications applied to the KLM.

Advances in Human-Computer Interaction
(1) Preserved Original Operators. Several operators were preserved from the KLM: H, K, P, and M. H was utilised by PS8, PS9, and PS28. PS28 also preserved the value of an average non-skilled typist (0.28 seconds). In PS9, M was used as is; however, it was not considered in their earlier work (PS8). P was utilised from the original KLM in both PS8 and PS9.
(2) Adapted Original Operators. The majority of studies in this section adapted operators from the KLM. PS64 developed a GOMS-HRA to dynamically assess the reliability of human operators in nuclear plants. The study introduced two operators, Dp and Dw, that are analogues of M and represent the acts of making a decision based either on an existing procedure or without an existing procedure, respectively. H was adapted by PS67 to consider homing actions in hybrid interfaces, particularly for in-air devices such as the Leap Motion sensor. K was modified in PS8, PS9, PS10, and PS49. The first two of those studies also divided to identify the time needed to select a new function from a command menu and the time it takes the system to close a polygon in a manual map digitising task. PS49 adapted for keyboard navigation by individuals with motor disabilities. An operator was identified by PS8 and PS9 that adapted P to quantify two specialised pointing actions.
(3) Inherited Operators. The button click-and-release BB operator was inherited by PS49 from PS19, a study that was excluded due to its low score in the review's quality assessment phase. This was also the case for the operator used in PS28 and PS49.
(4) New Operators. PS64 extended the KLM to assess the reliability of nuclear plant operators and introduced several new operators: performing a physical action on the control board or in a field, looking up required information on the control board or in a field, obtaining required information on the control board or in a field, producing or receiving verbal or written instructions, and selecting or setting a value on the control board or in a field. A Braille operator was introduced by PS28 to evaluate blind users' interactions during web navigation. A new operator was identified by PS8 and PS9 to represent a button press on a specialised 16-button cursor used for map digitisation.
(5) Updated Heuristics. PS8 and PS9's KLM extensions for map digitisation updated the placement of the M operator to reflect their modifications to the original KLM operators. They suggested placing an M prior to digitising with a snap function, before deciding on the next vertex to digitise, and when deciding whether the digitising task should be ended. For a zooming task, they recommended being careful with the M operator because some users may require extra time.
(6) New Equations. PS49 provided a basis for an early comparison between keyboard navigation systems (including their newly devised KeySurf system), particularly when used for tabbing and ID navigation, for people with motor disabilities. PS49 modelled the navigation system using updated equations that reflected the unique navigation requirements for such systems.
(1) Preserved Original Operators. was preserved from the KLM by PS66 for modelling an immersive interface. The extended KLM for a NUI (PS59 and PS68) utilised the original R; however, during experimentation this value was ignored.
(2) Adapted Original Operators. D, while commonly discarded in other setup categories, was adapted by PS57 and modified to reflect drawing gestures in the air, as a user would in a NUI. M was adapted by PS59 and PS68, where its values were retrieved from earlier extensions [2,45]. PS66 modified depending on various user and mobile tracking devices.
(3) Inherited Operators. A number of operators were adapted in PS59 and PS68 from prior literature. Two operators (Ms and Mp) were inherited from MacKenzie [81]. Both operators represent the mental act of preparing to execute subsequent physical actions in response to a stimulus or physical matching event. PS68 inherited from their previous work in PS57. The value of was inherited from Zeng, Hedge, and Guimbretiere [82] in PS59 and PS68 to denote the act of pointing to a target in a NUI.
(4) New Operators. Both main studies, as expected, introduced several operators to reflect the new interactions associated with their post-GUI interfaces. PS66's immersive interface required several new operators to represent tasks such as asking questions while using the interface and included start and end of task, question, gap between questions and mentally preparing a response, searching for an answer, reading, and physical movement operators. The NUI KLM also introduced several new operators, some of which were shared in two studies (PS59 and PS68), including holding a hand position, tapping by pushing or moving the hand towards the front, swiping and preparing to swipe, grasping, releasing an open hand, preparing to move the hand from a resting position to the position where a drawing stroke begins, and retracting the hand from the position where the stroke finishes. PS68 later introduced two new operators to reflect the act of pulling and a hand-preference factor.
(5) New Equations. The extended model of PS59 and PS68 describes the execution of a NUI task using g-units. G-units are gesture units that identify the time between a hand movement and returning to rest. A single g-unit can contain several gesture phrases (g-phrases) as the hand moves into various positions to achieve a stroke. The execution time of a task in the model is the sum of its g-units, each of which is defined in several new equations. PS57 also introduced a new equation from the same study to represent the act of drawing gestures in the air.
(6) Updated Heuristics. PS68 updated the original heuristic rules for placing M. Rule 0 was updated from the original to consider preparation and operators. Rule 2 was adapted to reflect that when a string of Ms belongs to a g-phrase, all subsequent Ms excluding the first one should be deleted. Their updated heuristics also suggest that when a P follows a preparation action, then M should be deleted (updated from Rule 4). Finally, a new rule was introduced (Rule 5) that stresses that when the model developer is unsure of placement, the number of operators should be emphasised over the placement of Ms.

Television.
A single reviewed study (PS47) involved web navigation and text entry (both traditional and predictive) on a television set using a remote control. This study preserved three of the original KLM operators: K, H, and and considered two different keyboard layouts for text entry. P was adapted to represent the different layouts.
A finger movement and a dynamic mental operator were introduced into the extended KLM; the latter considers the additional cognitive load of using a word prediction system. To formulate these text entry tasks, two equations were introduced for the two text entry methods, traditional and predictive, respectively.

RQ4: What Was the Research Method Used to Validate a KLM Extension?
The purpose of this research question is to identify the research methods used, if any, to validate the performance of an extended model. The original KLM publication conducted a user study to compare observed data with the KLM's predicted results [2,6]. The model's performance was evaluated using root-mean-square percentage error (RMSPE), which was calculated as 21%. Of the primary studies, 51.85% (28 studies) conducted user experiments to validate their extended models. Performance evaluations were commonly statistically analysed using several metrics (excluding PS41 and PS47). Table 11 summarises the statistics used to evaluate the performance of predicted data versus data observed from users. Correlation analyses were applied in 11 studies (39.29%), while RMSPE was adopted by 6 studies (21.43%). Other statistical measures utilised included contrast weights, mean absolute percentage error (MAPE), percentage difference, percentage change, ratio, regression analysis, and t-tests. Some studies combined more than one statistical method to confirm their results.
Performance measured via correlation analysis ranged in value from 0.48 to 0.98 among the eleven primary studies. RMSPE values were generally within the suggested KLM bound of 21%, excluding one instance in PS7 where the RMSPE was 31%. The percentage change ranged from -15% to 11% in studies utilising this measure.
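The three most frequently named measures can be sketched as follows. This is an illustrative example on made-up task times, not a reproduction of any reviewed study's analysis. It assumes that percentage errors are normalised by the observed times; individual studies may normalise differently (for example, by the predicted times), so the definitions below should be read as one common convention.

```python
import math


def rmspe(predicted, observed):
    """Root-mean-square percentage error, normalised by observed times (assumption)."""
    squared = [((p - o) / o) ** 2 for p, o in zip(predicted, observed)]
    return 100.0 * math.sqrt(sum(squared) / len(squared))


def mape(predicted, observed):
    """Mean absolute percentage error, normalised by observed times (assumption)."""
    errors = [abs(p - o) / o for p, o in zip(predicted, observed)]
    return 100.0 * sum(errors) / len(errors)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical task times in seconds: model predictions vs. user observations.
predicted = [10.0, 20.0, 30.0]
observed = [8.0, 25.0, 30.0]

print(f"RMSPE: {rmspe(predicted, observed):.1f}%")
print(f"MAPE:  {mape(predicted, observed):.1f}%")
print(f"r:     {pearson_r(predicted, observed):.2f}")
```

Correlation and error measures answer different questions: a high correlation shows that predictions rank tasks in the right order, whereas RMSPE and MAPE show how far the predicted times are from the observed ones.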

Discussion
This section summarises the principal findings of this systematic review of KLM extensions. It also addresses the limitations of this review that may threaten its validity. Finally, a discussion of the implications of this review for research and practice is presented.

Principal Findings.
The goal of this systematic review was to examine the purposes for extending the KLM, the methods used to extend the model, how the KLM was modified, and the techniques used to validate the extended models. The principal findings of this review are as follows:
(i) This review found diverse studies related to extending the KLM for various domains and device setups. However, the extent to which the KLM was rigorously extended varied based primarily on the purpose of the study.
(ii) Some studies exhaustively applied research methods for the prime purpose of extending the KLM to new domains or setups or to adapt the models to current situations and technologies. Other studies applied the original KLM to evaluate their applications or devices and included new operators to modify the KLM.
(iii) Many of the primary studies used controlled experiments to extend the unit times of the KLM or to create new operators.
(iv) Nearly half of the studies did not include any type of validation for their extended models. Of the studies that did report model validation, controlled experiments were often reported. Performance measures varied; correlation analysis was the most common, and RMSPE (the measure originally used to validate the KLM) was the next most common.
(v) Only a small number of papers compared the performances of their extended models against the original KLM to determine their effectiveness.
(vi) The majority of the primary studies were categorised as mobile or tablet, followed by traditional setups and IVIS systems.
(vii) Several software domains were modelled with extended KLMs; nevertheless, the majority were classified as mobile programmes.
(viii) K and M were two of the most commonly preserved and adapted operators, followed by P. D was almost entirely discarded by most extensions.
(ix) There is a shortage of studies that address the accessibility needs of disabled users and post-GUI/Windows-Icons-Menus-Pointer (WIMP) interfaces.
(x) In the key-based mobile category, half the studies utilised the KLM to calculate text entry with various techniques such as multi-tap or predictive.
(xi) Two of the selected primary studies substituted the unit times with other measures. PS27 replaced them with scores for each operator and PS64 utilised a domain-specific measure, HEP.

Limitations.
As with other systematic reviews, this review was limited by the search terms and digital databases used. The review may also have been affected by selection bias, publication bias, improper or inaccurate data extraction, and data misclassifications. Efforts were taken to alleviate these limitations, including the following:
(i) Casting a wider net with the search terms and digital databases. Database selection was influenced by the inclusivity of the databases, popularity, and recurrences of previous work related to predictive modelling.
(ii) Publication and selection bias was overcome to some extent by including technical reports and MSc/PhD theses among the selected primary studies.
(iii) Data extraction was repeatedly re-evaluated in weekly meetings by the reviewers to reach consensus and mitigate inaccurate data extraction and misclassifications.

Implications.
The findings of this systematic review have implications for researchers who plan on refining current extensions or developing new extensions as well as for designers and developers who are considering using the KLM or one of its extensions to evaluate their computer systems. For researchers, several gaps have been identified in the literature that lend themselves to future revisions and investigations. Despite the spike in KLM extensions in the past two years (see Figure 3), much of the work done previously requires verification and revision for traditional setups and mobile phones. It is unlikely that the unit times measured in the early 2000s would still hold true with current processors and memories. Efforts should be made to re-evaluate useful models with the traditional setups utilised today as well as with mobile phones and tablets that are commonly used.
Tables 9 and 10 summarise the device setups and application domains of the 54 primary studies. While the summaries show a varied selection, several weak areas were identified. Device setups primarily focused on traditional setups, mobile, tablet, and IVIS systems. Despite efforts to develop post-GUI KLM extensions, a shortage still exists in studies that address new setups, including virtual and augmented reality, tangible user interfaces, physical interfaces, tabletops, large touch displays, and malleable interfaces. All of these setups have been in existence for at least a decade and are costly to develop; thus, they would certainly benefit from predictive models to determine performance in early design phases. While a reasonable array of application domains was investigated, the distribution of studies across these domains was uneven. Concentrated efforts were directed toward mobile applications and IVIS systems, leaving considerable room for further research into domains such as medical IT setups.
Extensions to the KLM commonly occur as a result of experiments to extract new actions and unit times. Figure 6 illustrates the research methods utilised by the reviewed studies to extend the KLM operators, heuristics, and unit times. However, when extending operators and heuristics, the majority of studies did not conduct experiments. While this could be expected for setups similar to the one used to extend and validate the original KLM, it is not ideal for new domains or device setups. Operators determined from normative actions could be useful but may fall short of detecting actions (particularly those relating to mental acts M) that are best observed. It is essential for an appropriate research method to be adopted to develop operators to measure human behaviours. When extending operator unit times, the majority of studies conducted experiments to empirically assign values to their adapted or new operators. This approach is also advisable for future researchers because it ensures accurate and up-to-date measurements. It should also be noted that combinations of research methods strengthen the findings by taking full advantage of their combined benefits.
Of the 54 primary studies selected, only half conducted validation studies to confirm the efficacy of their proposed extended models (see Table 11). At times the same experimental results were used for both extending and validating the models, which clearly lends itself to bias. When a new extension is proposed, it is vital that experiments be conducted to provide empirical evidence of the extension's effectiveness. This calls for more controlled experiments to determine how well the proposed extensions perform. For extensions developed for traditional setups, a comparative assessment against the original KLM could be used to determine how well the new models perform against a stable usability model. Such a comparison might even be possible with setups that rely heavily on the original operators in the KLM. A further finding was that the majority of reviewed extensions do not provide guidance or suggestions to help designers and developers apply the altered model to their product or computer system. Despite its simplicity, the application of the KLM or any of its extensions requires skill to ensure correct measurements of execution. Several tools (e.g., CogTool) have been developed to automate this process, but these are typically limited to traditional setups. Another observation from the review was that the expertise level of users was not disclosed in most of the reviewed papers; the users were merely declared to be experts. However, what makes a user an expert? The answer to this question is highly subjective and depends on the perspective of the model developer. This in itself affects the unit times collected for the operators and thus the validity of the validation results. For researchers, we find that this issue could be mitigated by a clear definition of expertise that could be consistently applied across domains and device setups.
For designers/developers, we recommend the use of Table 12 to select an appropriate model given their products' domain and device setup. All the studies listed in the table were ranked as Very High or High during the quality assessment phase and conducted experiments to extend and validate their models. It is also important to compare results from different extensions to determine the one best suited to the target users' actions and perceptions. It should be noted that, at times, the KLM or one of its extensions may be unable to address all the human behaviour anticipated in a product. In this case, combining two or more models is possible but not recommended without thorough investigation.

Conclusion and Future Work
The KLM is widely used in the literature to evaluate system designs early in the development phase and to estimate probable performance times for skilled, error-free tasks. Over the years, several extensions have been created that modify the original KLM to consider revisions of the original operators, varied device setups, and varied domains. This paper presented a systematic review that summarises the existing KLM extensions developed in the literature. From an initial 2,444 studies, 68 unique publications were selected for the review. Information was extracted from the selected studies, which allowed the reviewers to draw conclusions, identify common techniques, find research gaps, and construct guidelines.
In future work, we intend to extend this systematic review and plan for future research in various ways:
(i) Perform a systematic review that addresses the research question "What publications have utilised the KLM or one of its extensions to evaluate the efficiency of their designs and how?" We intend to apply the information obtained from this review.
(ii) Develop a methodology with a formal protocol for extending the KLM that ensures an exhaustively assessed model.
(iii) Offer a guide for applying the KLM and its various extensions to support correct application.
(iv) Review the term "expert" in an attempt to provide an agreed-upon definition for skilled user behaviour in the KLM and its extensions.