The Reliability Analysis of Speaking Test in Computer-Assisted Language Learning (CALL) Environment

Speaking ability is regarded as one of the essential aspects of language development. Oral examinations are challenging to evaluate because they depend on human evaluators, and the reliability of a speaking test is determined by the raters' scores. The current study aims to evaluate the interrater reliability of the speaking test used to measure the speaking performance of Common First Year (CFY) students during remote learning. The data were obtained from 56 EFL learners using a scoring sheet and rubrics. Eight raters were responsible for rating the learners' performance. The speaking test's reliability was estimated using quantitative data analysis. Correlation coefficients and the Bland-Altman test were employed to assess raters' agreement. SPSS was used to analyze the data. The study's findings suggest that the speaking exam used in the CFY program during remote learning showed some reliability on correlations and fell within acceptable norms on the Bland-Altman test.


Introduction
Education has changed rapidly in recent years, with a noticeable increase in e-learning, which involves teaching remotely via online platforms [1][2][3]. Following the closure of all educational facilities in Saudi Arabia due to the COVID-19 outbreak, an unforeseen and rapid transition took place from the typical "traditional" learning method [4,5] to e-learning, namely, online learning. Adedoyin and Soykan [6] argue that "without the pandemic's emergence, our education institutions would not have been so adept at online instruction." This suspension also affected the testing and evaluation system, including the testing of language skills. Speaking is an essential skill in learning any language. It involves interaction and group discussion, both of which were significantly affected by the outbreak of COVID-19. Institutions were obliged to conduct evaluations online during this period.
Language proficiency fosters collaboration and interaction among individuals from varied cultural origins in all spheres of life, education, and business in the globalized twenty-first century (Kukulska-Hulme et al. [7]). Learning languages must consequently be a long-term commitment, implemented in a variety of ways to fulfill societal, professional, and educational objectives, as well as individual desires and needs. English is commonly recognized as the globe's lingua franca and the most widely spoken language [8][9][10]. Due to the importance of and demand for the English language in the modern global era, English as a second language (ESL) students travel from all over the world to study the language [11,12]. Consequently, significant effort has been expended in developing effective methods for learning English, and this requires a concerted and substantial effort on the part of both students and instructors [13]. The growth of communicative pedagogy has resulted in a greater emphasis on the fundamental goal of speaking proficiency in online learning. As a result, testing speaking skills has been a prominent issue in the language testing process [5,14,15]. Numerous limitations apply, given the nature of verbal skills. The essential issue in testing speaking ability is the necessity to detail the activities that constitute a sample of the population of speaking activities while also demonstrating how the results of those tasks truly reflect the examinees' speaking capabilities [16][17][18]. Similarly, various factors affect our perception of how well someone can communicate orally. Since the essence of verbal ability is still in its development, it is impossible to define it accurately. There is a contradiction that allows for assessing the many components of oral skill. Typically, speaking measures are based on pronunciation, vocabulary use, and correct use of grammar.
Likewise, the speaking test also assesses relevancy and fluency [19]. Due to the variety of elements included in the speaking evaluation, judging it accurately is not as straightforward as it is for the other skills [20]. Kang et al. [21] contend that there could be differences in evaluating oral skills because the interactive nature of the test allows the test taker to use language in any way. Additionally, because human examiners primarily conduct the speaking assessment, the speaking test's scoring is susceptible to bias. Kang and Kermad [22] highlight this point as the primary concern in speaking evaluation, as the subjective nature of the scoring process may result in rater inconsistencies or shifts, affecting the test taker's scores and, consequently, rater reliability. Therefore, the grading criteria are vital for the speaking test [23,24]. Speaking assessment also has certain practical constraints that contribute to the inconsistency of the outcomes, particularly in online learning. These include time, a large number of examinees in the same period, operational costs, the attitude of the raters, the raters' training, the duration of the test, the usage of rubrics, and the average test's length [25]. Regardless of these constraints, school systems, universities, colleges, and standardized English testing agencies now assess examinees' oral proficiency. Oral performance is evaluated through various tasks involving presentations, group conversations, and role-playing, which are expected to elicit evidence regarding the examinees' ability to communicate effectively [26].
The assessment process is highly dependent on a variety of factors that can hinder a learner's speaking performance during remote testing in situations like COVID-19, where raters and students have to engage through an online platform. Hence, a well-defined, well-researched, and well-documented description of the trustworthiness of the exam results, derived from logical and empirical evidence, is required [27]. The language assessment process may also be centered on the correctness of the evaluations of learners' replies, which may be supported by the premises of measurement [28][29][30]. The purpose of language assessment is not merely to provide rating scales for awarding certain marks or levels of language ability but to explain the types of evidence that can be offered to justify the precision of the grades as indicators of proficiency [31]. Therefore, a speaking test procedure should be backed up by evidence that the test is serving its intended purpose. This entails presenting data on remote steps in addition to various reliability measures. Nonetheless, the research reveals that only a small portion of the validity question is addressed. No single measure can establish a language test's reliability, specifically that of a speaking competence exam [32].

Literature Review
The literature review is organized under the two distinct variables of the study, i.e., the nature of speaking testing and reliability.
2.1. Oral Testing.
Oral ability testing is a necessary element of English instruction, not only because it provides a valuable source of data on the effectiveness of education [33,34] but also because it can facilitate and expedite instruction, enhance learners' motivation to improve their language proficiency, and strengthen the evaluation process [35,36]. The assessment of speaking ability has been viewed as a prominent issue in language testing, as speaking ability plays an essential part in language development and learning and has assumed a vital role in language education with the onset of, and emphasis on, communicative language teaching in remote or online learning settings. Speaking ability is embedded in culture, and "situation-based activity" is a significant component of daily life scenarios [37]. Assessing speaking in ESL or EFL is commonly considered more difficult than assessing other abilities, skills, or correctness [38][39][40].
Speaking tests cover various language learning areas, including vocabulary, proper grammatical usage, fluency, correctness, interaction, the social side of speaking, and task fulfillment [41,42]. Additionally, assessing speaking is complicated because of its dynamic character, spontaneity, and appropriateness [11][43][44][45]. To accomplish this, instructors, learners, and assessors must have a firm grasp of the features and structure of oral language that set it apart from other modes of language assessment [46,47]. Ockey [48] asserts that Clark and Swinton established a theoretical framework for classifying three types of speech assessments: "direct, semidirect, and indirect exams." The direct and semidirect examinations require learners to present before assessors and discuss the assigned topic, whereas the indirect tests belong to the testing system's "precommunicative" period and do not require learners to engage in communicative skills [49][50][51].
The oral interview is among the most often utilized test types for evaluating speaking ability and substantially impacts language assessment. It is conducted with a single test taker and one or two qualified assessors or raters who assess or record the test taker's speaking ability on a predefined scale. It starts with an introduction of the individual and a warm-up chat to establish rapport, followed by predetermined test tasks such as narrating an experience or an event, role-plays, or a reverse interview. The majority of language assessments are semistructured. The IELTS speaking section is a critical component of this type of speaking assessment and is approved in over 100 countries worldwide [52]. The interview form of assessment enables the assessor or examiner to gain a holistic impression of the learners' speaking ability and to compensate for the inadequacies of other elements of the language assessment process. Furthermore, it is fairly simple to train examiners and achieve good interrater reliability [53]. Another type of speaking assessment is the pair or group assessment. This evaluation method involves one or more assessors assessing the examinees' speaking performance in groups or pairs. The paired test is used to evaluate speaking ability on a large scale. Such evaluations emphasize the interaction between participants and test takers [54]. This enables a more flexible interaction among test takers and assessors and a broader type of discourse than formal interviews [55]. Both forms provide raters with handouts and speaking evaluation criteria. The speaking test is graded holistically or analytically, regardless of the type of communication.

Reliability.
Reliability is an essential factor of every test. The goal of reliability is to assure the precision with which examinees' knowledge and performance are validated. The extent to which a test tool produces steady and consistent results is called its reliability [56]. The term "reliability" is defined as "the consistency of assessment" [57]. Thus, reliability argues that the findings are the most accurate and complete representation of a test participant's competency. This statement asserts that grading should be congruent with the test's or rater's reliability. The reliability of a test is characterized by its capacity to reflect the correctness and consistency of an evaluation. Traditionally, during the testing protocol, two reliability components are considered: interrater and intrarater reliability. Jeyaraman et al. [58] asserted that interrater reliability refers to the precision of grades provided by evaluators.
In contrast, intrarater reliability refers to the consistency of a single rater's ratings on different occasions. In other words, interrater consistency is established by comparing the grades assigned by various examiners, whereas intrarater reliability is found by evaluating the scores given by the same assessors to the same respondents over time. This demonstrates that there is no one-size-fits-all method for determining the reliability of an exam or test. Rater reliability is a concern because it incorporates individual subjectivity, which affects the marks assigned to various learners [59].
When it comes to assessing the productive skills of language learning, the role of raters is always critical. The reliability of an oral examination is difficult to establish and necessitates more than one measurement. Due to the subjectivity of the speaking assessment, some raters may be more lenient than others, affecting the reliability [60]. This may be due to the rater's cultural context or present mood. Familiarity with the examinee's accent may also lead the rater to award higher grades in the pronunciation part of the test [61].
Similarly, when raters are familiar with the examinee's L1, they tend to be more lenient and grant respondents better scores. This demonstrates that speaking exam scores are influenced in various ways [62][63][64]. Additionally, the degree to which raters' judgments contradict one another depends on the assessment scale, rubrics, and marking standards employed in a particular oral test. Depending on how well the grading system is understood, the rating criteria could also have an impact on the intrarater reliability of the results. As a result, raters' knowledge of the grading scale and awareness of the rubric are also important in determining reliability.
Several studies in language assessment have investigated numerous components of speaking evaluation. Kang et al. [21] state that the research outcomes contribute significantly to understanding the speaking assessment concept. Several researchers have examined the speaking test's reliability. Nicholson [65] indicated that the speaking assessment was exceedingly consistent, but the validity argument appeared to be erroneous. The Khan et al. [66] study discovered discrepancies in examiners' scores. Further investigation revealed that the differences in the evaluators' scores were primarily due to the scores that one of the evaluators awarded in the grammar and vocabulary section of the test. The authors added that reliability could be enhanced by training raters before the implementation of the test.
Iwashita and Vasquez [67] and Benyo and Kumar [68] also investigated a speaking competency test format to develop a scale for ESP grading. The study found that specific aspects of the test, including fluency and vocabulary, had a persistent effect on the total scores provided by the assessors. The findings of this study are expected to have a potential influence on the construction of scales. Likewise, Demirel and Baser [50] discovered that assessing the reliability of speaking skills is not an easy process because it is influenced by various factors, including the test's construct, task, and understanding of the learners' background. Numerous studies have been conducted to verify the IELTS speaking test's reliability [24,42,52,69]. According to research, most of the IELTS speaking tests are accurate and consistent. The IELTS speaking test is considered valid regarding the content covered, accessibility, and presentation.
On the other hand, the researchers concentrated on the introduction of two reviewers in the IELTS speaking test. The review of the literature suggests that reliability analysis of the speaking test has been neglected by scholars and that online assessment has only recently emerged. Therefore, the purpose of this study is to measure the reliability of an oral examination in which two raters evaluate a test taker concurrently using the Blackboard platform. The present study tries to answer the following research question: How reliable is a speaking test used for online assessment of Saudi EFL learners?

Methods
This study aimed to evaluate the reliability of oral performance tests. A quantitative research design was used for data collection to answer the research question. The data for measuring reliability were gathered using a speaking test devised by the exam committee. The test consists of six to eight tasks, each with a distinct set of questions.

Education Research International
Without knowledge of the tasks' contents, participants chose speaking themes at random. After a task was selected, it was shared with the learner through the Blackboard (BB) platform. Learners were allotted two minutes to read and think about the given task and were permitted to choose another task if they wanted to. Following some warm-up questions, participants were expected to respond to their individual tasks, and the procedure was interactive.
The study included 56 CFY undergraduates who spend their first two semesters studying English language skills as a condition for admission to their majors. The participants were aged from 16 to 20 years, and all were male. The test was administered before the final exam of the first term. According to their entrance test, all participants had the same level of English proficiency. The test procedure was explained to the participants, and they were given online training for the test, as this was the first time that regular students took an online speaking exam. They received a BB link to join a session; then, one of the evaluators split them into groups on BB. Finally, students were randomly invited from their group to the main room of BB for the speaking test.
Eight raters, all regular faculty members of the CFY program, were engaged in scoring the speaking test. All evaluators have been conducting this type of test since 2014. Additionally, they underwent training courses for the speaking assessment. Each holds a master's degree in English and a CELTA teaching credential. The evaluators were between 34 and 56 years old. The rating was conducted in pairs, with one participant and two raters on an online platform, and participants were assigned marks using the holistic approach.
The data collection instrument was a speaking test and the student's grades. The exam committee developed this test following the Cambridge University A-2 level speaking assessment criteria (English, 2011). The test consists of a variety of tasks selected from the course content. Each task takes between 8 and 16 minutes to complete. The overall score for the assessment was 15 points, with five points assigned to each of the three dimensions of the speaking test: task fulfillment, fluency and accuracy, and vocabulary use. Evaluators were supplied with each student's speaking criteria, rubrics, and rating form.
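The scoring scheme described above can be sketched as a small data structure. This is only an illustrative sketch: the class and field names are hypothetical, and allowing fractional marks is an assumption, since the text states only the three dimensions, the five-point maximum per dimension, and the 15-point total.

```python
from dataclasses import dataclass, fields

MAX_PER_DIMENSION = 5  # five points per dimension, as stated in the text


@dataclass(frozen=True)
class SpeakingScore:
    """One rater's marks for one student (hypothetical field names)."""

    task_fulfillment: float
    fluency_and_accuracy: float
    vocabulary_use: float

    def __post_init__(self) -> None:
        # Reject marks outside the 0-5 range allowed for each dimension.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(
                    f"{f.name} must be between 0 and {MAX_PER_DIMENSION}, got {value}"
                )

    @property
    def total(self) -> float:
        # Overall score out of 15: the sum of the three dimensions.
        return self.task_fulfillment + self.fluency_and_accuracy + self.vocabulary_use


# Made-up marks for illustration only; not data from the study.
example = SpeakingScore(task_fulfillment=4, fluency_and_accuracy=3.5, vocabulary_use=5)
print(example.total)  # 12.5
```

Validating each dimension at construction time keeps out-of-range marks from entering the later reliability calculations.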

Data Analysis
Generally, Kuder-Richardson and statistical correlation measurements are used to evaluate a test's reliability. Test-retest, split-half, and parallel-form techniques are also used to determine a test's reliability. Syahidah and Umasugi [70] assert that conventional methods of reliability calculation have little relevance to oral examination since they are developed for a fixed number of preplanned topics and questions. A practical estimate for speaking test assessment can be obtained by comparing raters' results to those of other raters with special measures. Interrater reliability was therefore used to evaluate the reliability of the speaking test in the current study. The overall interrater reliability was 0.70 for the speaking test. According to Hiser et al. [71], rater reliability can be assessed using correlation, regression, and the Bland-Altman test. To this end, two measures were used to determine reliability: Bland-Altman and correlation. SPSS 22 was used to conduct the analyses for both tests.
The findings are reported in the following stages to estimate the spoken proficiency test's reliability. Each participant was scored out of 15 points by two raters who assessed them concurrently. The first stage evaluated the test's interrater reliability. Due to the human component of the test procedure, two different tests were used to determine the speaking test's reliability. Because this is an assessment of a productive skill, the rater's decision in assigning marks may affect the speaking testing process. Interrater reliability was determined by computing correlation coefficients in SPSS on all evaluators' marks. The eight evaluators were divided into four pairs, and a correlation coefficient was generated for each pair. Table 1 summarizes the interrater reliability of the eight raters in four pairs. The correlation coefficients were 0.710 for the 1st pair, 0.600 for the 2nd pair, 0.610 for the 3rd pair, and 0.640 for the 4th pair. The first pair has an adequate level of reliability. However, the 2nd, 3rd, and 4th pairs have a fairly low level of reliability. Despite the 2nd and 3rd pairs' low reliability, the p values for all pairs were p = 0.01, which is significant.
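The per-pair correlation step can be sketched as follows. The text does not say which coefficient SPSS was asked to compute, so Pearson's r is an assumption here; the function name `pearson_r` and the marks are hypothetical and are not the study's data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a rater gives constant scores")
    return sxy / math.sqrt(sxx * syy)


# Made-up totals (out of 15) from two raters scoring the same students.
rater_a = [12, 10.5, 14, 9, 11, 13.5, 8, 12.5]
rater_b = [11, 11.5, 13, 10, 10, 14, 9.5, 11]
print(round(pearson_r(rater_a, rater_b), 3))
```

Running the function once per rater pair would yield one coefficient per pair, which is the shape of the results summarized in Table 1.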

Bland-Altman Test.
The Bland-Altman test was used to evaluate the degree of agreement among raters. The raters' ratings were combined into three groups to determine interrater reliability. Figure 1 depicts the agreement between raters 1 and 2. As illustrated in Figure 1, most points are located between the average value and zero, indicating that the raters are in agreement. When more than 50% of the points are close to zero, this implies that the raters are in agreement. Additionally, the average value of pair 1 and pair 2 is close to +1.96 SD and -1.96 SD, respectively. SD values for pair 1 and pair 2 are 1.26 and -1.03, respectively, which are within the acceptable norm of data to demonstrate agreement. Figure 2 illustrates the agreement between raters C and D, that is, the raters' scores for pair 3 and pair 4. The chart demonstrates that most dots are located near the average value and zero lines, indicating that the raters agree. Likewise, the mean values of pair 3 and pair 4 are close to +1.96 SD and -1.96 SD, respectively. SD values for pair 3 and pair 4 are 1.60 and -1.31, respectively, which are within the usual norm of data and demonstrate agreement. Figure 3 illustrates the agreement between raters 5 and 6. As shown in Figure 3, most of the dots are close to the average value and zero lines, indicating that the evaluators agree. Also, the mean values of pair 5 and pair 6 are close to +1.96 SD and -1.96 SD, respectively. SD estimates for pairs 5 and 6 are 1.67 and -1.31, respectively, which are well within the acceptable norm of data to demonstrate agreement.
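For readers who want to reproduce this kind of analysis outside SPSS, the quantities behind a Bland-Altman plot can be sketched as below. This follows the standard definition (the mean of the between-rater differences as the bias, and bias ± 1.96 SD of the differences as the limits of agreement); the function name, dictionary keys, and scores are hypothetical and are not the study's data.

```python
import statistics


def bland_altman(rater_a, rater_b):
    """Return bias, SD of differences, and 95% limits of agreement."""
    if len(rater_a) != len(rater_b) or len(rater_a) < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    # Per-student difference (y-axis of the plot) and mean (x-axis of the plot).
    diffs = [a - b for a, b in zip(rater_a, rater_b)]
    means = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    lower = bias - 1.96 * sd
    upper = bias + 1.96 * sd
    # Share of students whose difference falls inside the limits of agreement.
    within = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return {
        "bias": bias,
        "sd": sd,
        "lower_limit": lower,
        "upper_limit": upper,
        "share_within_limits": within,
        "points": list(zip(means, diffs)),
    }


# Made-up totals (out of 15) from two raters scoring the same students.
rater_a = [12, 10.5, 14, 9, 11, 13.5, 8, 12.5]
rater_b = [11, 11.5, 13, 10, 10, 14, 9.5, 11]
result = bland_altman(rater_a, rater_b)
print(result["bias"], result["lower_limit"], result["upper_limit"])
```

A bias near zero with narrow limits indicates close agreement; whether a given width is acceptable depends on how large a scoring difference matters on the 15-point scale.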

Discussion and Recommendation
Our conclusions are based on the oral test data. We used reliability safeguards, such as rubrics and two examiners, to minimize the impact of human involvement in the testing system. When evaluating language learners' productive abilities, the role of raters is always essential. Oral examinations are difficult to score reliably and call for more than one measure. Because the speaking assessment is subjective, some evaluators may be more lenient than others, which affects the extent of reliability. The literature review indicates that speaking evaluation has been examined from various perspectives, emphasizing broad subject areas: speaking ability constructs, rater effects, factors affecting spoken performance, test design, test score generalization, rating scale assessment, and test use. Reliability, a vital aspect that affects evaluation, has been overlooked. The present study adds to the literature by offering some insight into reliability analysis and by presenting language scholars with a way to check the reliability of a speaking test. The speaking test's reliability was tested with two methods. The correlation coefficients suggested that interrater reliability was insufficient to satisfy the intended standard of test reliability. The first pair's reliability was 0.710, which is deemed satisfactory for the online speaking test, but the reliability of the 2nd, 3rd, and 4th pairs was 0.600, 0.610, and 0.640, respectively, which is not satisfactory. The discrepancy in the reliability estimates may be the result of the online assessment, in which an evaluator cannot see the confidence and facial expressions of the participants. Another possible reason for the low reliability is the informal way of testing. Although the pairs' reliabilities were insufficient, the p values for all four pairs were reported as p = 0.00, which is less than 0.05.
This demonstrates the reliability of the speaking exam used at CFY. The gap in the interrater reliability findings could be because the correlation identifies how many identical scores were assigned to respondents, which is not achievable when scores are given in points and are greater than zero.
The analysis concludes with the Bland-Altman test, which determines the agreement between two raters. This could be a valuable way of gauging the reliability of a test, particularly in speaking skill research. Bland-Altman analysis revealed interrater agreement in all four pairs of evaluators. The data points are roughly equidistant from the zero line. When more than 50% of the scores are close to zero, this indicates that the raters are in agreement. This was evident in each of the four pairs. Likewise, the Bland-Altman mean values were close to +1.96 and -1.96 in all three figures. Hence, it may be stated that the CFY speaking test is reliable and can provide a good evaluation. Assessing the reliability of speaking ability is not an easy process, as it is influenced by various factors, including the test's structure, the task, and knowledge of the participants' background. The findings suggest that the Bland-Altman test is more suitable than the correlation coefficient for determining the reliability of an oral examination. The correlation test measures the degree of identical scoring, and in speaking evaluation there is no one-or-zero scoring; this favors the use of the Bland-Altman test in both virtual and face-to-face testing.
The research findings are partly consistent with those of [66], who suggest that these analysis results are instrumental in predicting the test's reliability. The present study is also consistent to some extent with the findings of O'Mahony [57], who investigated the reliability of an oral test. That study's results revealed that the spoken test was highly reliable, whereas the reliability in this study seemed only to meet the established standard of reliability. This could result from various concerns, including remote assessment and raters assigning point values, which resulted in a lesser level of reliability.

The findings are congruent with the investigation of Khan et al. [66], which demonstrated inconsistencies in examiners' assessments. The differences in the raters' ratings were primarily caused by one of the raters awarding high marks for grammar and vocabulary usage. Furthermore, the findings are consistent with Iwashita and Vasquez [67], who analyzed various spoken proficiency tests to develop a rating scale for ESP. Their results indicated that specific aspects of the test, including fluency and vocabulary, had a persistent effect on the total scores provided by the evaluators. The findings also corroborate those of previous studies [24,42,52,69] that established the IELTS speaking test's reliability. According to statistics presented in those studies, most IELTS speaking tests are accurate and reliable.
Online learning and evaluation are always challenging. Learners need motivation to participate in the learning and testing procedure. The study could be expanded in a variety of ways. The number of raters can be increased, and pairs of participants can be swapped for grading purposes. Also, providing rater training before the test administration could lead to a different outcome. The reliability analysis for speaking skills revealed some detrimental differences among the raters. It would also be beneficial if the grading system were made more transparent to the evaluators, which would contribute to the test's reliability. Finally, in online assessment, the rater may ask the participants for a role rehearsal; this will help evaluators gain an accurate picture of the participants' speaking proficiency. Moreover, learners should also be given online training to understand how the speaking test is carried out in remote learning.

Implications and Limitations
The study examined and reviewed the reliability analysis of the speaking test. This study concludes that using the Bland-Altman test can help teachers and scholars determine the test's reliability. As oral examination includes human interaction, it is not feasible to agree on 1 or 0 points. To this end, researchers, examiners, and test developers can use the Bland-Altman test to check the reliability of the speaking test, which determines the degree of agreement between two raters. This could be a valuable way of gauging the reliability of the test, particularly in speaking skill research. The study findings can also help the research scholar in oral or spoken skill development.
Although speaking frameworks have garnered considerable research interest in the field, the findings of this research illustrate that there is still a long way to go before attaining a detailed and thorough understanding of how to determine the reliability of speaking ability. The study had some limitations. First, the study was limited to a single campus and one level of students; future studies should include participants from different institutions to produce more generalizable findings. Moreover, the study included only male participants, and the sample was small. The inclusion of both genders may yield different findings.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.