On Characterization of Norm-Referenced Achievement Grading Schemes toward Explainability and Selectability

Grading is the process of interpreting learning competence to inform learners and instructors of the current learning ability levels and necessary improvement. For norm-referenced grading, the instructors use a conventionally statistical method, z score. It is difficult for such a method to achieve explainable grade discrimination to resolve dispute between learners and instructors. To solve such difficulty, this paper proposes a simple and efficient algorithm for explainable norm-referenced grading. Moreover, the rise of artificial intelligence nowadays makes machine learning techniques attractive to the norm-referenced grading in general. This paper also investigates two popular clustering methods, K-means and partitioning around medoids. The experiment relied on the data sets of various score distributions and a metric, namely, Davies–Bouldin index. The comparative evaluation reveals that our algorithm overall outperforms the other three methods and is appropriate for all kinds of data sets in almost all cases. Our findings however lead to a practically useful guideline for the selection of appropriate grading methods including both clustering methods and z score.


Introduction
In both formal and informal education, grading is the process of interpreting learning competence to inform learners and instructors of current learning ability levels and necessary improvement.
There are basically two types of nonbinary grading systems [1]: criterion-referenced grading and norm-referenced grading. The former normally calculates the percentage of a learning score and maps it to the predefined percent range of a specific grade. This grading system is suitable for an examination that covers all content topics of learning and thus requires long exam-taking as well as answer-checking times. In contrast, large classes and/or large courses widely use the norm-referenced grading system to meet exam-taking time constraints and to save exam answer-checking resources. Such a system compares the score of each individual to relative criteria defined based on all individuals' scores to determine a proper grade. The criteria are set by a conventionally statistical means either without or with conditions (e.g., a class's grade point average (GPA) must be kept below 3.25).
This paper focuses on unconditionally norm-referenced grading. The type of problem that the paper targets is data clustering in which the reasons behind cluster boundaries must be explainable as the first priority. The concrete problem is norm-referenced grading, where the difficulty is how to make learners whose scores are contiguously ranked accept their different grades (i.e., their scores fell on different sides of a cluster boundary) without doubt. In our experience, this classical problem has long made graders seriously reluctant to resolve disputes with learners. Consider the following example: given a simplified series of ranked scores . . ., 84, 80, 78, . . ., performing norm-referenced grading on such a score series by using a traditional method may result in grades . . ., A, B, B, . . ., respectively. The learner who scores 80 can object to receiving B rather than A. It is not only difficult for the grader to explain the entire steps of the traditional method (which is complicated) but also difficult for the learner to understand them. Our algorithm provides a simple and clear-cut justification based on the widest score gaps: "because 80 is closer to 78 than to 84, 80 should be assigned the same performance level as 78 rather than 84." The rise of artificial intelligence nowadays makes machine learning techniques attractive to norm-referenced grading. We therefore investigate four methods from the realm of statistics and machine learning: our novel algorithm, a conventionally statistical method, and two unsupervised machine learning techniques, namely, K-means and partitioning around medoids (PAM) (aka K-medoids). We selected the unsupervised learning techniques since norm-referenced grading cannot have a training data set.
In particular, we selected K-means and PAM as they are the only well-known clustering algorithms that allow us to specify the number of output clusters to represent the desired number of grades (as specified by an employed grading policy). Therefore, both K-means and PAM are naturally applicable to norm-referenced grading. The grading results of each approach will be measured and compared based on practical data sets of various distribution characteristics. The main contributions of this paper are a simple and efficient grading algorithm and a novel insight into the performance of the statistical method, the machine learning methods, and our algorithm in unconditionally norm-referenced grading. To the best of our knowledge, we also demonstrate for the first time the applicability of the K-means and PAM clustering techniques to norm-referenced grading. This paper should help graders worldwide select the right grading method to meet their objectives. The rest of this paper is organized as follows. Section 2 explores previously existing research studies. Section 3 explains the z score grading method. Section 4 reviews machine learning techniques, including K-means and PAM, applicable to norm-referenced grading. Section 5 explains our proposed grading algorithm. Section 6 justifies a grading performance metric in terms of clustering quality. Section 7 evaluates our algorithm, z score, K-means, and PAM on normal and asymmetric distribution data sets. Section 8 discusses the main findings. Section 9 draws the conclusion.

Related Work
As for applying a machine learning clustering technique to learners' achievement, Arora and Badal [2] analyzed the competency of students by using K-means. The competency is characterized by marks in 10 subjects. The centroid of each cluster was mapped to one of the grade symbols A to G. The resulting grade of each cluster was the competency indicator of the students belonging to that cluster. Academic planners could use such an indicator to take appropriate remedial action for the students. Similarly, Borgavakar and Shrivastava [3] clustered GPAs and internal class assessments (e.g., class test marks, lab performance, assignment, quiz, and attendance) separately by using K-means. Therefore, each student's competency was associated with several clusters, which were used to create a set of rules for classifying the student. Weak students were identified before the final exam to reduce the ratio of failing students. Research by Parveen et al. [4] employed K-means to create 9 groups of GPAs: exceptional, excellent, superior, very good, above average, good, high pass, pass, and fail. Students whose GPAs belonged to the exceptional and the fail groups were called gifted and dunce, respectively. The gifted students had their knowledge enhanced, whereas the dunce students were remedied through differentiated instruction. Research by Shankar et al. [5] clustered students from different countries based on their attributes: average grade, the number of participated events, the number of active days, and the number of attended chapters. An optimal k value of K-means was determined by means of the Silhouette index, resulting in k = 3. Among the 3 clusters, the most compact cluster (i.e., the cluster with the least within-cluster sum of squares) was further analyzed for correlation between the average grade and the other attributes.
Xi [6] utilized K-means to cluster students' test scores into 4 classes, excellent, good, moderate, and underachiever, to take the appropriate self-development and teaching strategy for treatment. Research by Iqbal et al. [7] explored several machine learning techniques for early grade prediction to allow instructors to improve students' competency in early stages. In such work, Restricted Boltzmann Machine was found to be most accurate for students' grade prediction. K-means was also used to cluster students based on technical course and nontechnical course performance.
Regarding an automated grading and scoring approach, Ramen and Joachims [8] proposed a peer grading method to enable student evaluation at scale by having students assess each other. Since students are not trained in grading, the method enlisted probabilistic models and ordinal peer feedback to solve a rank aggregation problem. Bai and Chen [9] proposed a method to automatically construct grade membership functions, lenient-type grades, strict-type grades, and normal-type grades, to perform fuzzy reasoning to infer students' scores.
This paper significantly extends our preliminary work [10] with a full-fledged algorithm, a new practical data set, a newly experimented machine learning method, a set of new findings, and a novel guideline for method selection.

Conventionally Statistical Grading
A conventionally statistical grading method relies on z scores and t scores [1]. The z score is a measure of how many standard deviations below or above the population mean a raw score is. The z score (z) is technically defined in (1) as the signed fractional number of standard deviations (σ) by which the value of an observation or a data point x is above the mean value (μ) of what is being observed or measured.
$$z = \frac{x - \mu}{\sigma} \quad (1)$$

Applied Computational Intelligence and Soft Computing
Observed values above the mean have positive z scores; values below the mean have negative z scores. The t score converts individual scores into standard forms and is much like the z score when the sample size is above 30. In psychometrics, the t score (t) is a z score shifted and scaled to have a mean of 50 and a standard deviation of 10 as in (2).
$$t = 10z + 50 \quad (2)$$
The statistical grading method begins by converting raw scores to z scores. The z scores are further converted to t scores to simplify interpretation because t scores normally range from 0 to 100, unlike z scores that can be negative real numbers. The t scores are then sorted, and the range between the maximum and minimum t scores is divided by the desired number of grades to obtain an identical score interval. The interval is used to define the t score ranges of all grades. In this way, raw scores are mapped to z scores, the z scores to t scores, the t scores to t score intervals, and the t score intervals to resulting grades, respectively.
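The mapping chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact procedure used in the experiments: the function name `t_score_grades` is hypothetical, the population standard deviation is assumed for σ, and a t score lying exactly on an interval boundary is assumed to fall into the lower grade.

```python
from statistics import mean, pstdev


def t_score_grades(scores, grades=("A", "B", "C", "D", "F")):
    """Grade scores by cutting the t score range into equal-width intervals."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all scores are identical; z scores are undefined")

    # Raw score -> z score -> t score, following (1) and (2).
    t_scores = [50 + 10 * (x - mu) / sigma for x in scores]

    # One identical interval width for every grade.
    t_max, t_min = max(t_scores), min(t_scores)
    width = (t_max - t_min) / len(grades)

    result = []
    for t in t_scores:
        # Interval 0 holds the best grade; clamp so the minimum t score
        # lands in the last interval instead of one past the end.
        index = min(int((t_max - t) / width), len(grades) - 1)
        result.append(grades[index])
    return result


if __name__ == "__main__":
    print(t_score_grades([90, 80, 70, 60, 50]))
```

Because the z and t transformations are linear, equal-width intervals on the t scores correspond to equal-width intervals on the raw scores, which is why this method ignores where the raw-score gaps actually lie.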

Machine Learning-Based Grading
This section explains how to apply the K-means and PAM clustering algorithms to norm-referenced grading, which naturally suits unsupervised rather than supervised learning. K-means and PAM were selected since both allow specifying the number of clusters in advance to match the number of eligible grades known a priori.

K-Means.
K-means [11] is an unsupervised machine learning technique for partitioning n objects into k clusters. K-means begins by randomizing k centroids, one for each cluster. It then assigns every object to the cluster whose centroid is nearest to the object and recalculates the mean of all assigned objects within each cluster to serve as the k new centroids (aka barycenters) of the clusters. The object assignment and the centroid recalculation are iterated until no object moves between clusters. In other words, the K-means algorithm aims at minimizing the objective function $\sum_{j=1}^{k} \sum_{i=1}^{n_j} |x_i - c_j|$, where $n_j$ is the number of objects in cluster j, $x_i = \langle x_{i1}, x_{i2}, \ldots, x_{im} \rangle$ is an object in cluster j whose centroid is $c_j$, $x_{i1}$ to $x_{im}$ are the features of $x_i$, and $|x_i - c_j|$ is the Euclidean distance. Also, note that the initial centroid randomization can result in different final clusters.
When applying the K-means algorithm to higher educational grading, k is set to the number of eligible grades. Graders must decide such a number in advance.
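As a rough illustration of the loop described above, the following sketch runs K-means on one-dimensional scores and then maps clusters to grades by centroid rank. It is a minimal sketch under assumptions: the names `kmeans_1d` and `grades_from_clusters`, and the `init`, `seed`, and `max_iter` parameters, are hypothetical, and k must not exceed the number of distinct scores.

```python
import random


def kmeans_1d(scores, k, init=None, seed=0, max_iter=100):
    """Cluster 1-D scores; returns (cluster index per score, centroids)."""
    if init is not None:
        centroids = list(init)
    else:
        # Random initial centroids drawn from the distinct scores.
        centroids = random.Random(seed).sample(sorted(set(scores)), k)

    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every score.
        new_assignment = [
            min(range(k), key=lambda j: abs(x - centroids[j])) for x in scores
        ]
        if new_assignment == assignment:
            break  # no score moved between clusters
        assignment = new_assignment

        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(scores, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
    return assignment, centroids


def grades_from_clusters(assignment, centroids, grades):
    """Give the best grade to the cluster with the highest centroid."""
    order = sorted(range(len(centroids)), key=lambda j: centroids[j], reverse=True)
    grade_of = {cluster: grades[rank] for rank, cluster in enumerate(order)}
    return [grade_of[a] for a in assignment]


if __name__ == "__main__":
    scores = [95, 94, 93, 60, 59, 58, 20, 19, 18]
    assignment, centroids = kmeans_1d(scores, 3, init=[90, 60, 20])
    print(grades_from_clusters(assignment, centroids, "ABC"))
```

The explicit `init` argument is there because, as noted above, different initial centroids can converge to different final clusters; fixing them makes a run reproducible.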

Partitioning around Medoids.
Unlike K-means, which represents each cluster with the mean value of the objects within the cluster, PAM [12] represents each cluster by the object nearest to the cluster's center. PAM proceeds in two phases. In the first phase, build, select k objects nearest to the center of all other unselected objects. These k objects, called medoids, are selected one by one. In the second phase, swap, assign all unselected objects to their nearest medoids to obtain k initial clusters. For each cluster, calculate the average dissimilarity (i.e., average distance) between the medoid and the other objects. Then, for that cluster, search for any object that would reduce the average dissimilarity if it became the new medoid. If one exists, select it as the new medoid. Once all clusters have been searched, repeat the second phase if at least one medoid has changed; otherwise, PAM ends.
Similar to applying K-means, PAM requires that k be set to the number of eligible grade symbols beforehand.
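The two phases can be illustrated with a simplified k-medoids sketch for one-dimensional scores. It follows the spirit of build and swap but is not a faithful reproduction of PAM [12]: it is an assumption of this sketch that the build phase greedily adds the object that lowers the total distance most, and that the swap phase accepts any single medoid replacement that lowers the total distance. The names `total_cost` and `pam_1d` are hypothetical, and k must not exceed the number of distinct scores.

```python
def total_cost(scores, medoids):
    """Sum of distances from every score to its nearest medoid."""
    return sum(min(abs(x - m) for m in medoids) for x in scores)


def pam_1d(scores, k):
    """Simplified k-medoids; returns (cluster index per score, medoids)."""
    candidates = sorted(set(scores))

    # Build phase (simplified): pick medoids one by one, each time adding
    # the candidate that gives the lowest total cost so far.
    medoids = []
    while len(medoids) < k:
        best = min(
            (c for c in candidates if c not in medoids),
            key=lambda c: total_cost(scores, medoids + [c]),
        )
        medoids.append(best)

    # Swap phase (simplified): replace a medoid with a non-medoid whenever
    # that lowers the total cost; stop when a full pass changes nothing.
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for c in candidates:
                if c in medoids:
                    continue
                trial = medoids[:i] + [c] + medoids[i + 1:]
                if total_cost(scores, trial) < total_cost(scores, medoids):
                    medoids = trial
                    improved = True

    assignment = [min(range(k), key=lambda j: abs(x - medoids[j])) for x in scores]
    return assignment, medoids


if __name__ == "__main__":
    print(pam_1d([95, 94, 93, 60, 59, 58, 20, 19, 18], 3))
```

Every medoid is an actual score from the data set, which is the main practical difference from the K-means centroids.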

Proposed Grading Algorithm
This section proposes a statistical algorithm for norm-referenced unconditional grading. The algorithm works step by step as defined in Algorithm 1.
The algorithm is explained as follows. In line 1, sort(S) initially ranks the scores of learners within a group from the best down to the worst. In line 2, countEligibleGrades(GS) counts the number of eligible grades. In line 3, calculateAllScoreGaps(S) sequentially goes through the ranked score list to determine the gap between every two contiguous scores (i.e., the score difference). Line 4 sorts the gaps in descending order. In line 5, selectWidestGaps(SG, cnt-1) selects the widest gaps, whose number equals the number of eligible grades minus one. For instance, four eligible grades require four score ranges; thus, the selectWidestGaps(SG, cnt-1) function returns the first three maximum gaps. If some gaps are identical, the function returns the gaps of the scores that are closest to the middle of the score rank. In line 6, defineScoreRangesFromGaps(SG) creates a series of score ranges, each of which is associated with one eligible grade. For instance, the score range of grade B is 76 to 82 points. Finally, grades(S, R) in line 7 assigns proper grades to all scores based on the defined score ranges. In this way, our algorithm is simple, while its performance is evaluated in Section 7.
Note that our algorithm takes only two input parameters, the learners' scores and the eligible grades, while the local variables of the algorithm are used for temporary value assignment rather than as controlling parameters. Also, all of the functions called in our algorithm perform straightforward tasks as implied by their names without any tuning parameters. Therefore, our algorithm frees users from the parameter tuning burden.
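The seven steps described above can be sketched in Python as follows. This is an illustrative reading of the prose, not the authors' reference implementation: the function name `gap_based_grades` is hypothetical, the tie-break measures closeness to the middle by gap position in the ranked list, and each grade's range is represented only by its lowest score.

```python
def gap_based_grades(scores, grades=("A", "B", "C", "D", "F")):
    """Assign grades by cutting the ranked scores at the widest score gaps."""
    ranked = sorted(scores, reverse=True)          # line 1: best to worst
    cnt = len(grades)                              # line 2: eligible grades
    if len(ranked) < cnt:
        raise ValueError("need at least as many scores as eligible grades")

    # Line 3: gap i lies between ranked[i] and ranked[i + 1].
    gaps = [(ranked[i] - ranked[i + 1], i) for i in range(len(ranked) - 1)]

    # Line 4: widest gaps first; identical gaps are ordered by how close
    # they sit to the middle of the rank (assumed tie-break measure).
    middle = (len(ranked) - 2) / 2
    gaps.sort(key=lambda g: (-g[0], abs(g[1] - middle)))

    # Line 5: keep the cnt - 1 widest gaps as grade boundaries.
    cuts = sorted(position for _, position in gaps[: cnt - 1])

    # Line 6: the score just above each cut is the lowest score of a grade.
    lower_bounds = [ranked[position] for position in cuts]

    # Line 7: a score receives the first grade whose lower bound it reaches.
    def grade_of(score):
        for grade, bound in zip(grades, lower_bounds):
            if score >= bound:
                return grade
        return grades[-1]

    return [grade_of(x) for x in scores]


if __name__ == "__main__":
    print(gap_based_grades([84, 80, 78], grades=("A", "B")))
```

With the introductory example of scores 84, 80, and 78 and two grades, the single widest gap is the 4 points between 84 and 80, so 80 shares a grade with 78 rather than with 84.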

Grading Performance Measurement
In this paper, the performance of each grading method is represented by clustering quality. The quality of clustering results can be measured by using a well-known metric, namely, the Davies-Bouldin index (DBI). We employed DBI instead of another related metric, Silhouette, because DBI is computationally much less complex; thus, it is highly readable by practical graders. Let us denote by $\delta_j$ the mean intracluster distance of the $n_j$ points (each of which is expressed as $x_i$) belonging to cluster $C_j$ to their barycenter $c_j$:
$$\delta_j = \frac{1}{n_j} \sum_{i=1}^{n_j} |x_i - c_j|$$
Let us also denote the distance between barycenters $c_{j'}$ and $c_j$ of clusters $C_{j'}$ and $C_j$ by $\Delta_{jj'} = |c_{j'} - c_j|$. DBI is computed by using (3) [13].
$$\mathrm{DBI} = \frac{1}{k} \sum_{j=1}^{k} \max_{j' \neq j} \frac{\delta_j + \delta_{j'}}{\Delta_{jj'}} \quad (3)$$
The lower the DBI, the better the quality of clustering results (i.e., low DBI clusters have low intracluster distances and high intercluster distances).
The underlying reason for using DBI as the grading performance metric in norm-referenced grading is intuitive. Learners with very similar achievement should receive the same grade (i.e., equivalent to low intracluster distances), and different grades must discriminate achievements between the groups of learners as clearly as possible (i.e., equivalent to high intercluster distances). The DBI value will be low (i.e., a better grading performance result) if clusters are compact and far away from one another.
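To make the metric concrete, the sketch below computes a Davies-Bouldin index for one-dimensional scores whose grades are treated as cluster labels. It is a minimal sketch under assumptions: the function name `davies_bouldin_1d` is hypothetical, only grades that actually occur count as clusters, and no two clusters may share the same barycenter (otherwise the ratio is undefined).

```python
def davies_bouldin_1d(scores, labels):
    """Davies-Bouldin index of 1-D scores grouped by their grade labels."""
    clusters = {}
    for score, label in zip(scores, labels):
        clusters.setdefault(label, []).append(score)
    if len(clusters) < 2:
        raise ValueError("DBI needs at least two non-empty clusters")

    # Barycenter of each cluster.
    center = {lab: sum(xs) / len(xs) for lab, xs in clusters.items()}
    # Mean intracluster distance of each cluster to its barycenter.
    spread = {
        lab: sum(abs(x - center[lab]) for x in xs) / len(xs)
        for lab, xs in clusters.items()
    }

    total = 0.0
    for j in clusters:
        # Worst-case similarity ratio of cluster j against any other cluster.
        total += max(
            (spread[j] + spread[other]) / abs(center[j] - center[other])
            for other in clusters
            if other != j
        )
    return total / len(clusters)


if __name__ == "__main__":
    scores = [95, 94, 93, 60, 59, 58, 20, 19, 18]
    print(davies_bouldin_1d(scores, list("AAABBBCCC")))
```

Compact grades that sit far apart give small ratios and therefore a small index, while moving a boundary into the middle of a tight group of scores raises it.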

Evaluation
We evaluated our algorithm, z score method, K-means, and PAM in norm-referenced unconditional grading. Experimental configuration and data sets' characteristics are initially described. Then, grading results along with performance metrics are provided.

Experimental Configuration.
A grading policy that evaluated the scores into 5 eligible grades, A, B, C, D, and F, without any class GPA constraint was engaged. The grading policy was implemented in 4 ways by using our algorithm, z score, K-means, and PAM methods. The number of clusters was predefined to 5 (i.e., the 5 eligible grades) in K-means and PAM. Each method had its performance measured in the DBI metric as if the grades represented distinct clusters. Six data sets of accumulative term scores were used to ensure fair comparison among the grading methods. We characterized the data sets through data distribution in order to verify their coverage of all possible distribution patterns (i.e., the representativeness of various case studies). In particular, the data distribution patterns that were employed included normal distribution (ND data set in Table 1) and positively and negatively skewed distributions (SD+ and SD− data sets in Tables 2 and 3). The algorithm's effectiveness was also double-checked by using two additional data sets with slightly positively and negatively skewed distributions (RD+ and RD− data sets in Tables 4 and 5). Last but not least, another rare data set with an exclusively wide score gap (WD data set in Table 6) was also exploited. The scores relied on a scale of 0.0 to 100.0 points. A one-dimensional vector was used to represent each data set as shown in Tables 1-6 so that readers could dive deep into the scores to judge the effectiveness of each applied method. Every data set is also described in terms of statistics along with its distribution pattern.
The first data set, namely, ND, has a normal distribution. Table 1 shows the raw scores of ND. Mean and median are 63. Mode is unavailable as every score has the same frequency of 1. σ is 13.9. To comprehend the characteristics of ND, Figure 1 projects its normal distribution. The horizontal axis represents z score. The curve was computed with (4) where x represents a score. The area under the curve represents a distribution value [1].
The second and the third data sets, namely, SD+ and SD−, have positively and negatively skewed distributions, respectively. A positively skewed distribution is an asymmetric bell shape skewed to the left, probably caused by overly difficult exam questions from the viewpoint of learners. Table 2 shows the raw scores of SD+. Mode, median, mean, and σ are 52, 60.9, 53, and 14.236, respectively. Figure 2 depicts the normal distribution of SD+. Its skewness is heavy and equals 1.006.
A negatively skewed distribution is an asymmetric bell shape skewed to the right, probably caused by too easy exam questions from the viewpoint of learners. Table 3 shows the raw scores of SD−. Mode, median, mean, and σ equal 87, 82, 73.5, and 16.929, respectively. Figure 3 depicts the normal distribution of SD−. Its skewness is as heavy as −1.078. These 3 data sets contain the same number of raw scores and were realistically synthesized to clarify the extreme behaviors of the four studied methods.
The fourth data set, RD−, was collected from a group of 61 real anonymized learners taking the same undergraduate course in the academic year 2019. Unlike SD+ and SD−, which are heavily skewed, RD− (and RD+) represents imperfectly normal distributions (i.e., slightly skewed). RD− in Table 4 has a slightly negative skew of −0.138 as shown in Figure 4. Mode, median, mean, and σ equal 66.7, 56.6, 57.9, and 12.136, respectively. The fifth data set, RD+, consists of the real term scores of another group of 100 anonymized learners from another anonymized university. Opposite to RD−, RD+ has a slightly positive skew of 0.155. The characteristics of RD+ are shown in Table 5 and Figure 5. Mode, median, mean, and σ equal 82.5, 66.4, 65.7, and 9.662, respectively.
The last data set, WD, consists of a broad range of scores with a relatively wide gap. Such a score pattern exists in a group of learners with a learning competency divide. As a result, some enclosed grade ought to be skipped. The characteristics of WD are shown in Table 6. A significant gap lies between the scores 79 and 30 as depicted in Figure 6. Mode, median, mean, and σ equal 87, 82, 62.3, and 31.975, respectively. WD has a moderately negative skew of −0.450.

Grading Result.
We graded the ND data set by using the proposed algorithm, z score, K-means, and PAM methods and reported their results, respectively, in angle brackets, <our-algorithm grade, z score grade, K-means grade, PAM grade>, in Table 7, resulting in an N×4 matrix where the N rows equal the number of scores. Our algorithm delivered exactly the same results as those of K-means. Both methods' DBIs equaled 0.330. The z score method yielded a DBI of 0.443. From the student viewpoint, it might be questionable why graders using z score gave the learners who scored 78 and 79 the same grade A as the one who scored 84, and the learner who scored 47 the same grade F as the one who scored 42. This is simply because 78 and 79 fell in the same z score interval of A while 47 fell in the z score interval of F. PAM also yielded a DBI of 0.330 despite too many grades A.
We also graded the SD+ data set with our algorithm, z score, K-means, and PAM methods as shown in Table 8. Our algorithm delivered the same results as K-means and PAM. Their DBIs were 0.222. The z score method gave a DBI of 0.575. There were many grades F when using the z score method.
Next, we graded the SD− data set in Table 9. Our algorithm delivered a DBI of 0.299. The DBIs of the z score, K-means, and PAM methods were all 0.233.
In practice, there is no perfectly normal distribution with respect to learners' achievement. Experimental results based on data sets having slightly skewed distributions are now described. We graded the RD− data set with our algorithm, z score, K-means, and PAM methods as shown in Table 10. The gap columns show the differences between every two consecutive scores (i.e., the results of the calculateAllScoreGaps() function in Algorithm 1) to be utilized by our algorithm, where the 4 widest gaps (indicated by the bold numbers) were used as grading steps.
All four methods produced different grading results. Particularly, our algorithm and K-means assigned A to the same group of learners whereas z score and K-means methods gave F to the same group of learners. Our algorithm had a DBI of 0.375 whereas K-means, PAM, and z score method gave DBIs of 0.469, 0.474, and 0.492, respectively. Therefore, our algorithm delivered the best grading results for RD−. Our algorithm accomplished the lowest DBI partly because grade D has only one member score, comparable to the smallest possible cluster, which DBI favors.
We graded RD + data set as shown in Table 11. With this large data set, the grading results of all methods are totally different. Our algorithm, z score method, K-means, and PAM methods yielded DBIs of 0.345, 0.529, 0.486, and 0.487, respectively, meaning that our algorithm defeated the others.
The WD data set was graded as shown in Table 12. Our algorithm, z score method, K-means, and PAM yielded DBIs of 0.403, 0.452, 0.449, and 0.449, respectively. Although our algorithm outperformed the others in terms of DBI, recall that the WD data set had the exceptional pattern of a gap so significant that assigning all 5 grades may not be plausible. As shown in Table 12, only the z score method is capable of automatically skipping grades C and D.
Figure 7 comparatively projects all aforementioned DBIs with respect to each grading method and data set. They can be analyzed as follows. Our algorithm's DBIs have μ = 0.329 and σ = 0.058. The z score method's DBIs have μ = 0.454 and σ = 0.109. K-means' DBIs have μ = 0.365 and σ = 0.109. PAM's DBIs have μ = 0.366 and σ = 0.110. The overall performance of each method is revealed in Figure 8. As a lower DBI means better clustering quality, the heights of the stacks show that our algorithm performs best due to the lowest overall DBI, whereas K-means and PAM perform worse, with 10.90% and 11.21% higher DBIs than ours, respectively. The z score method performs worst, with a 38.03% greater DBI than ours. These relative performance differences show the practical significance of our algorithm.
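The summary statistics above can be recomputed from the per-data-set DBIs quoted in this section. The sketch below is only a cross-check: the dictionary name `DBI` and its layout are illustrative, the values are listed in the order ND, SD+, SD−, RD−, RD+, WD, and the population standard deviation is assumed because it is the variant that matches the reported σ values.

```python
from statistics import mean, pstdev

# DBIs quoted in the text, ordered ND, SD+, SD-, RD-, RD+, WD.
DBI = {
    "our algorithm": [0.330, 0.222, 0.299, 0.375, 0.345, 0.403],
    "z score":       [0.443, 0.575, 0.233, 0.492, 0.529, 0.452],
    "K-means":       [0.330, 0.222, 0.233, 0.469, 0.486, 0.449],
    "PAM":           [0.330, 0.222, 0.233, 0.474, 0.487, 0.449],
}

# Mean and population standard deviation of each method's six DBIs.
summary = {method: (mean(values), pstdev(values)) for method, values in DBI.items()}

if __name__ == "__main__":
    for method, (mu, sigma) in summary.items():
        print(f"{method:14s} mean={mu:.3f} sd={sigma:.3f}")
```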

Result Analysis, Finding, and Discussion
We also conducted a paired (Student's) t-test to evaluate whether the mean DBI of our algorithm is statistically significantly different from those of the other methods. Particularly, the paired t-test was employed to compare the DBI means produced by our algorithm with those of the z score, K-means, and PAM methods for the 6 data sets. We used the standard significance level of 0.05 and a hypothesized mean difference of 0 (i.e., the null hypothesis value indicating no DBI difference between methods) to compute the p value for a one-tailed t-test. A smaller p value means that there is stronger evidence in favor of the alternative hypothesis (i.e., there is a DBI difference between methods). Firstly, the DBI difference between our algorithm and z score had a p value of 0.040, which was less than 0.050. Therefore, our algorithm outperformed z score with statistical significance. Secondly, the DBI difference between our algorithm and K-means had a p value of 0.144. Lastly, the DBI difference between our algorithm and PAM had a p value of 0.141. Therefore, our algorithm outperformed K-means and PAM without statistical significance. Note that, unlike practical significance, statistical significance only provides evidence that performance differences exist, since it is a mathematical definition that does not know anything about our subject area.
Our algorithm and K-means lead to fairly similar grading results on the normal and heavily positively skewed distributions. Furthermore, as Tables 7-12 show, PAM produced the most A and the fewest F grades on average. The behavior of our proposed algorithm can be discussed in terms of the definition of (3) as follows. The algorithm performed clustering effectively in almost all cases of data sets (i.e., ND, SD+, RD−, RD+, and WD) because Algorithm 1 always selects the maximum score gaps to draw cluster boundaries, that is, maximum $\Delta_{jj'}$. Although the algorithm does not deal with the minimization of $\delta_j$, this term usually has less impact on DBI than $\Delta_{jj'}$ since $\delta_j$ takes part in the summation (thus requiring the minimization of the other term $\delta_{j'}$), whereas $\Delta_{jj'}$ is the sole divider in (3). Nevertheless, in an exceptional case, merely maximizing $\Delta_{jj'}$ is not enough, as substantiated by our algorithm performing worst when clustering the SD− data set.
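The paired comparison described above can be illustrated by computing the paired t statistic from the six DBIs of each method. This is a sketch using only the standard library, so it stops at the t statistic rather than the p value; the function name `paired_t_statistic` is hypothetical, and the DBI lists repeat the values quoted in the grading results in the order ND, SD+, SD−, RD−, RD+, WD.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i] against a mean of 0."""
    diffs = [x - y for x, y in zip(a, b)]
    # Sample standard deviation of the differences, n - 1 degrees of freedom.
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# DBIs quoted in the text, ordered ND, SD+, SD-, RD-, RD+, WD.
ours = [0.330, 0.222, 0.299, 0.375, 0.345, 0.403]
z_score = [0.443, 0.575, 0.233, 0.492, 0.529, 0.452]
k_means = [0.330, 0.222, 0.233, 0.469, 0.486, 0.449]

t_z = paired_t_statistic(z_score, ours)
t_k = paired_t_statistic(k_means, ours)

if __name__ == "__main__":
    print(f"z score vs ours: t = {t_z:.3f}")
    print(f"K-means vs ours: t = {t_k:.3f}")
```

With 6 data sets there are 5 degrees of freedom, for which the one-tailed critical value at the 0.05 level is about 2.015. The z score comparison exceeds it and the K-means comparison does not, in line with the reported p values of 0.040 and 0.144.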
Key findings based on the result analysis are provided as follows. In general, Figure 7 reveals that the absolute degree of skewness, rather than its positive or negative polarity, has more impact on the methods' grading performance: the greater the absolute skewness, the lower the grading performance. This is because greater absolute skewness implies more dispersed or dissimilar scores. Considering the nature of each method in conjunction with the above grading results leads to a guideline in Table 13 for appropriate method selection.
As we have not tested our algorithm against data sets from other application domains, we do not claim applications of our algorithm other than norm-referenced grading. However, the potential applications of our algorithm might include real-life resource-consumer clustering problems in which cluster-boundary explainability is the first priority: why two contiguously ranked data points (i.e., consumer profiles) belong to different clusters (i.e., different resource allocation levels) needs to be straightforwardly acceptable to data point owners. Concrete applications could include the nationwide selection of government loan applicants. Otherwise, serious arguments or even protests might occur not only between the data-clustering processor and data owners but also among the discriminated data owners themselves. The main characteristic of our algorithm meets such requirements by providing a simple and clear-cut answer based on the widest gap between cluster boundaries; the other algorithms require that data owners completely understand the complicated algorithms to get answers. Last but not least, to give an unbiased view, we point out the limitations of the proposed algorithm as follows. Although our algorithm can justify grade changes over evaluated scores through obvious score dissimilarity, the score ranges of the grades might differ considerably, unlike those of z score. For instance, our algorithm might yield only a few learners receiving grade B and more receiving grade C. This can be negatively interpreted as unequal chances of receiving the two grades. Furthermore, unlike z score, our algorithm cannot skip any eligible grade if no one deserves such a grade (i.e., criterion based). An example lies in Table 12. However, this drawback holds only if some sense of criterion-referenced grading is introduced instead of pure norm-referenced grading.

Conclusions
This paper provides a comprehensive view of four unconditionally norm-referenced grading methods: our new algorithm, z score, K-means, and PAM. We conducted the experiments with multiple data sets of various distribution characteristics based on the DBI performance metric. Overall, our algorithm outperforms the other methods. The K-means method is ranked second, followed by PAM. The z score method is the worst but is appropriate for some cases. In fact, our algorithm is so simple that it is implementable by using a spreadsheet tool. We plan to conduct more experiments with constraints and apply our algorithm to other domains as well.

Data Availability
The data used to support the findings of the study are included within the article.

Disclosure
The preliminary version of this paper was published under the title "Norm-Referenced Achievement Grading: Methods and Comparison" in the Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2020.

Table 13: Guideline for appropriate method selection.

K-means
(i) This method is suitable when the same grade is always supposed to be held by learners with closely similar abilities.
(ii) As indicated in Figure 7, K-means is also suitable for heavily skewed distributions like the SD+ and SD− data sets.

PAM
PAM produced the most A and the fewest F grades on average, which implies that the group GPA of learners tends to be high when grading with PAM.
PAM is also suitable for heavily skewed distributions like the SD+ and SD− data sets as indicated in Figure 7.

Our algorithm
(i) In contrast with K-means, our algorithm prioritizes intercluster dissimilarity, that is, gaps between scores at the borders of different groups.
(i) This method is a good choice when different grades are supposed to distinguish learning ability divides.
(ii) Our algorithm is generally appropriate for all kinds of data distributions. The reason is that our algorithm's strategy is the determination of score gaps, which draw the clear-cut boundaries of clusters.

z score
(i) The z score method disregards the notion of cluster (dis)similarity by dividing the range between the best and the worst scores within each learner group into even intervals.
(i) This method should be used when all grades are supposed to encompass an equal score range. Consider Table 10. Grade C produced by our algorithm ranges from 42 to 63.5 points, which is relatively wider than the score ranges of the other grades. This situation is avoided in z score's results. In other words, z score tries to equalize score ranges across all grades.
(ii) The z score method is not good at dealing with norm-referenced grading in general, mainly because its operation is blind to inherent raw-score gaps.
(ii) Unlike the other methods, the z score method is recommended for grading a score set that holds some wide divide (i.e., WD) because the z score method allows skippable grades.