Rating Scale Measures in Multiple-Choice Exams: Pilot Studies in Pharmacology

Multiple-choice questions are widely used in clinical education. Usually, students have to mark the one and only correct answer from a set of five alternatives. Here, in a voluntary exam given at the end of an obligatory pharmacology exam, we tested a format where more than one alternative could be correct (N = 544 students from three year groups). Moreover, the students were asked to rate each item. The students were unaware how many correct answers the questions contained. Finally, a questionnaire had to be filled out about the difficulty of the new tests compared to the one-out-of-five tests. In the obligatory final exam, all groups performed similarly. From the results, we conclude that the new rating scales were a better challenge and could be adapted to assess student knowledge and confidence in more depth than previous multiple-choice questions.


Introduction
Written examinations using multiple-choice questions (MCQ) are now widely used all over the world in medical education.
These tests were developed in psychology for research purposes by E. L. Thorndike around 1900 [1]. Frederick J. Kelly was likely the first to use such items as part of a large-scale assessment for educational purposes [1]. Probably, the first massive use of MCQ occurred in the US armed forces during the First World War [2]. MCQ offered the possibility to assess the recruits for combat use with limited involvement of human resources. Later, in the 1950s [3], the National Board of Medical Examiners in the USA, based on military experiences, used these tests to assess medical students and foreign medical doctors.
MCQ-based examinations have the advantages of being fast, cheap, objective, and of making it possible to test a broad sampling of the curriculum.
They can be given in a very standardized form and can have high statistical validity. Ideally, tests should also be sensitive and specific. MCQ-based examinations are now often electronically marked, and the results can then be mathematically processed to give the examiners and the students quantitative feedback on the quality, reliability, and validity of the tests. A persistent concern with this test format has been how valid it really is. In other words, does it test the skills and knowledge used in clinical practice? Is it sensitive enough to differentiate which students are fit for clinical practice and which should be kept out of practice? Hence, the National Board of Medical Examiners in the USA, other testing institutions, and medical faculties have continuously altered, adapted, and tried to improve their multiple-choice formats.
There is controversy whether MCQ only assess lower levels of knowledge like recall of isolated facts, encourage trivialization of knowledge, or lead to rote learning [4]. One main concern when using MCQ-based examinations is whether knowledge is really assessed, or rather other factors such as the structure of the personality, the experience with the test format ("test sophistication" [5]), or the decision to take an educated guess. Others support MCQ and argue that better constructed MCQ can measure and perhaps improve problem-solving skills in examinees [6,7]. Moreover, it has been claimed (in a US introductory science course) that there is a gender bias in answering MCQ [8]. Possibly, females are less likely to guess than males [9] or exhibit less "test wiseness" [10]. In the NBME (National Board of Medical Examiners) part I (basic medical sciences) in the USA, men (number of examinees: 7234) had significantly higher grades than women (number of examinees: 4090). The relative difference in scores was 5.9% in total. Of interest here, the mean score of men in pharmacology was 496 (standard deviation: 113) and that of women was 469 (standard deviation: 107); hence, they were significantly different. These differences were much smaller if the results in the science parts of the admission test (MCAT) were taken into account [11]. However, women scored better than men in part II (clinical sciences).
A-type questions were the first and are the "one correct item out of five items" format. While this simple form is commonly termed a multiple-choice question [12], we will consider it single choice (SC) or single response (SR), because only one alternative is correct and needs to be marked. It is important to note that examinees are well aware that all alternatives but one are false and presumably make use of this information.
The first aim of our study was to examine the usefulness of a new type of question with the following properties: (1) more than one alternative (k; 0 ≤ k ≤ 5) can be correct, requiring the examinee to give multiple responses (MR), and (2) the number of correct alternatives for a given question is unknown to the examinee. We used multiple correct alternatives per question because the SC format, the predominant question type in medical education and examination, involves some methodological problems. The most serious of these is that if examinees choose the correct alternative, it is inferred that they would also have known that all other four are incorrect, a conclusion that does not withstand critical scrutiny.
The second aim of our study was to test different implementations of an answering key for this MR format (Figure 1): the already known key, where true alternatives are checked and false alternatives are left blank, except that more than one checkmark might be needed (MC); a multiple true-false (MTF) format, where the examinee indicates for every alternative whether it is correct or incorrect; and two different rating keys, where every alternative is rated by the examinee on a confidence scale with either four (R4) or five (R5) categories ranging from "definitely false" to "definitely true." A progress report of our studies has been published in abstract form [13]. The overall aim of the study was to test new instruments that give examiners feedback on the deeper knowledge and concepts of pharmacology formed by their students.

Materials and Methods
2.1. General. We conducted three studies in three consecutive years (2012-14); for the overall design, see Table 1. In each year, the study followed the same pattern: in the pharmacology course, which comprises lectures and seminars, we administered a midterm exam and a final exam. The final exam was administered in the final week of the course. Immediately following this final exam, the students were given additional questions in an MR format, which we varied from year to year. We then calculated the correlation between students' scores in the regular exams and different scoring methods for the additional questions.
Students were always seated individually, and educators were present in the lecture hall for the whole duration of the test. For the regular exams, four versions with identical questions in different random sequences were prepared. Test versions were handed out at random in the class, and examinees received only one version throughout the whole exam. The 30 questions in the midterm exam contained only items that should be known from seminars and lectures up to that time. The final exam contained 30 questions pertaining to the whole course. Questions in the midterm and final exam were designed following commonly available guidelines for A-type questions [3], and we allotted 90 seconds for answering each question.
For grading, only the results in the regular midterm and final exams were considered. As each exam consisted of 30 equally weighted questions, we calculated simple sum scores (max. 30 points each). Students passed the course when reaching at least 60% of the combined score of these two exams. This meant 36 correct answers out of the 60 questions given in the midterm and final exams combined. Thus, passing the course required taking part in both exams.
The additional questions were handed out randomly together with the regular exam sheets before the start of the exam. These questions had been used in their A-type format on prior cohorts and were selected as having shown a difficulty from 0.34 to 0.96 and a discrimination from 0.21 to 0.58, following classical test theory (e.g., Kubinger and Gottschall [14]). For the MR format, the question stems were adjusted as slightly as possible to allow for multiple correct alternatives. Answering these questions was always voluntary.
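The item difficulty and discrimination used to select questions can be computed from a matrix of scored responses. The following is only a minimal sketch: it assumes that difficulty is the proportion of examinees answering an item correctly and that discrimination is the corrected item-total correlation. The source does not state which variant was used, and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def item_difficulty(scores, item):
    """Proportion of examinees who answered the item correctly (0 to 1)."""
    return mean(row[item] for row in scores)


def item_discrimination(scores, item):
    """Pearson correlation of the item score with the sum of all OTHER items."""
    x = [row[item] for row in scores]
    y = [sum(row) - row[item] for row in scores]
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when there is no variance
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


# Toy data (hypothetical): rows are examinees, columns are items,
# 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(item_difficulty(scores, 0))       # 0.75
print(item_discrimination(scores, 0))   # a value between 0 and 1 here
```

Under this reading, higher difficulty values denote easier items, which matches the reported range of 0.34 to 0.96.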

Ethics Statement.
This study can be understood as a kind of voluntary survey. No health-related data were collected. Therefore, an ethics statement was not necessary. The students were informed about the project several weeks before the examination. Participants gave their consent by filling out the exam. These additional voluntary exams were written directly after the obligatory exams but were not part of the routine course procedure, and participation had no influence on passing the obligatory exams.

Statistical Methods.
The significance of differences between correlations was tested using Fisher's r-to-z transformation (Cohen and Cohen [15], p. 54).
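As an illustration of this kind of test, the sketch below compares two correlations from independent groups with the standard z-test on Fisher-transformed values. It is assumed, not confirmed by the source, that this corresponds to the cited procedure; the function name is hypothetical.

```python
import math


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided z-test for the difference of two independent Pearson r."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher r-to-z
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-sided normal p-value
    return z, p


# Hypothetical example values, not taken from the study.
z, p = compare_independent_correlations(0.8, 100, 0.2, 100)
print(round(z, 2), p < 0.001)
```

Because the standard error depends only on the two group sizes, small groups make even sizeable differences between correlations hard to detect.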

Year 2012
2.2.1. Aim. As a first step, we wanted to assess how the MR format with a 5-point rating key (R5) performs versus the traditional SC format. Since all the questions were taken from the same pool used for regular exams, we expected that the performance in both formats should be about the same when each alternative is scored in the MR format. However, we expected substantially lower scores in the R5 format when the questions are scored as a whole because only one incorrectly answered alternative would render the whole question incorrect and cost all other correct answers for this question.

Test Instrument.
To test the usefulness of our new R5 format, we asked two groups of students to either mark ten questions in the traditional SC format or mark their ten modified counterparts in the R5 rating format following the regular final exam. An example of a typical additional question is shown in Figure 2. As described above, we used previously used A-type questions (single choice) with slightly adjusted question stems for the additional questions. Further, one question was negated, so that the four remaining alternatives were correct. All other nine questions had one correct and four incorrect alternatives.

Test Administration.
Half of the students received the additional items in their original SC format and half of the students in the new adjusted MR rating format with a 5-point rating scale (R5). Rating scales are a classical instrument ("Likert scales") in education research [16]. Students received no extra credit when answering the additional questions. Students did not know in advance what types of additional questions they were to expect. However, both groups were informed about the required way of giving a correct answer in an enclosed text. Few students (less than 5%) stayed in the examination room till the end of the allotted time.

Sample.
Examinees were N = 245 first-year clinical students in medicine who took part in both the obligatory midterm and final exam in pharmacology. Of these, only the data of N = 188 students were analyzed. The remaining students were excluded from the analysis because they did not return their sheet with the additional questions (1), did not provide their student identity number on the sheets with the additional questions and thus could not be matched with their regular exam performance (14), or, in the R5 group, left blank more than five alternatives (42). These 57 students are hereafter summarized as the dropout group (DO). This resulted in rather unequal group sizes of 129 for the SC group and 59 for the R5 group, despite the random distribution of equal numbers of exam sheets in both formats.

Statistics.
For the additional questions, different scoring methods were applied in the two groups. In the SC group, a sum score was calculated for the additional ten questions in the same way as in the regular exam. The correct choice of the one out of five answers received one point (max. ten points), and reaching at least 60% of the score was considered a hypothetical pass for these questions.
In the R5 group, each of the 50 alternatives was scored individually as follows: alternatives were scored as correct and received one point if (a) a correct alternative was answered with "definitely true" or "probably true" and (b) an incorrect alternative was answered with "definitely false" or "probably false." In all other cases, the alternative was scored as incorrect. Sum scores were then calculated for either all alternatives (max. 50 points) or for all questions (points for a question were allotted only if all alternatives for this question were answered correctly; max. ten points as in the SC group).
[Figure 2: example question stems: "Which of the following antimycotic drugs would be best to treat systemic fungal infections?" and "Which of the following antimycotic drugs can be used to treat systemic fungal infections?"]
Due to the increased guess probability of p = 2/5 when scoring individual alternatives, different percentages were applied for determining whether examinees have hypothetically passed these additional questions. When scoring alternatives, at least 70% of the score was required, and when scoring questions as a whole, 16.8% of the score was required to hypothetically pass these additional questions.
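The R5 scoring rules and the guess probability described above can be sketched as follows. This is an illustration only: the label "undecided" for the middle category is a hypothetical placeholder (the source does not name it), the function names are invented, and the guessing figures assume uniform, independent blind guessing across the five categories.

```python
TRUE_SIDE = {"definitely true", "probably true"}
FALSE_SIDE = {"definitely false", "probably false"}


def alternative_correct(rating, is_true):
    """An alternative counts as correct if the rating is on the key's side."""
    return rating in (TRUE_SIDE if is_true else FALSE_SIDE)


def score_question(ratings, key):
    """Return (points over alternatives, point for the question as a whole)."""
    hits = sum(alternative_correct(r, k) for r, k in zip(ratings, key))
    return hits, int(hits == len(key))


# Chance of a correct score under blind guessing (assumption: uniform over
# the five categories, independent across the five alternatives).
P_ALTERNATIVE = 2 / 5
P_WHOLE_QUESTION = P_ALTERNATIVE ** 5  # roughly 0.01

key = [True, False, False, False, False]
ratings = ["definitely true", "probably false", "undecided",
           "definitely false", "probably false"]
print(score_question(ratings, key))  # (4, 0)
```

The example shows why whole-question scoring is much stricter: four of five alternatives are rated correctly, yet the question as a whole earns no point.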
The ten additional items were not graded and were used only for our research purposes. The students did not receive bonus points for good performance in this test.

Year 2013
2.3.1. Aim. As a second step, we wanted to assess how the three different keys for the MR format (Figure 1) compare with each other. We expected that each key yields about the same performance, but that the keys are judged differently in difficulty and acceptance by the examinees.

Test Instrument.
For the additional questions, we varied the key for the same questions in three groups: multiple choice (MC), multiple true-false (MTF), and a rating format (R4). An example of a typical additional question is shown in Figure 3. As described above, we used previously used A-type questions. While transforming them to the MR format, the question stems were adjusted as slightly as possible to allow for multiple correct alternatives. Further, some alternatives were modified for each question, so that two questions each had one, two, three, four, or five correct alternatives. Moreover, only positive correct alternatives were given in the test instruments.

Test Administration.
One-third of the students received the additional items in the MC implementation, one-third in the MTF implementation, and the last third in the R4 implementation. Marking the additional questions was voluntary, and students received no extra credit when answering the additional questions. Students did not know in advance what types of additional question they were to expect. However, all groups were informed about the required way of giving a correct answer in an enclosed text.
Together with the exam questions, we distributed a questionnaire about the experiences the students had while answering the additional questions. Students were asked to answer these questions after completing the additional questions. Few students (less than 5%) stayed in the examination room till the end of the allotted time.

Evaluation Instrument.
Students' experiences and opinions about the new questions were collected immediately after the test using a questionnaire. The questionnaire contained ten questions about how easy the additional questions had been to answer, whether they are suitable for exams or online preparation tests, whether students wished for a more frequent use of them, and whether one of the other keys, which they had not experienced themselves, might be better.

Sample.
Examinees were first-year clinical students in medicine, of which N = 218 took part in the obligatory midterm exam and N = 211 took part in the final exam in pharmacology. Students who did not take part in both exams (no midterm: ten of the remainder; no final: two) were excluded, resulting in N = 209 examinees with data in both exams.
Of these, only the data of N = 162 students were analyzed. The remaining students were excluded from the analysis because they did not return their sheets with the additional questions (5), did not provide or provided an incorrect student identity number on the sheets with the additional questions and thus could not be matched with their regular exam performance (2), left blank more than one whole question in the MC group (5), or left blank more than five alternatives in the MTF group (14) or in the R4 group (21). These 47 students are hereafter summarized as the dropout group (DO). This resulted in more or less equal group sizes of 58 for the MC group, 56 for the MTF group, and 48 for the R4 group.

Statistics.
In 2013, the 50 alternatives of the ten additional items were again scored individually. In the MC group, alternatives were scored as correct and received one point if (a) a correct alternative was checked and (b) an incorrect alternative was left blank. In the MTF group, alternatives were scored as correct and received one point if (a) a correct alternative was answered with "true" and (b) an incorrect alternative was answered with "false." In the R4 group, alternatives were scored as correct and received one point if (a) a correct alternative was answered with "definitely true" or "probably true" and (b) an incorrect alternative was answered with "definitely false" or "probably false." In all other cases, the alternative was scored as incorrect. Sum scores were then calculated over alternatives (max. 50 points). Due to the increased guess probability of p = 0.5 when scoring alternatives, at least 75% of the score was required to hypothetically pass the additional questions. The ten additional items were not graded and were used only for our research purposes. The students did not receive bonus points for good performance in this test.
[Figure 3: example question stem: "Which of the following drugs can be used for the treatment of myasthenia gravis?"]
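The per-alternative scoring of the three keys can be sketched side by side. The data representations chosen here (a set of checked indices for MC, True/False/None for MTF, rating strings for R4) and the function names are assumptions made for illustration; the source describes only the rules.

```python
def score_mc(checked, key):
    """MC key: a checked box means 'true', a blank box means 'false'."""
    return sum((i in checked) == is_true for i, is_true in enumerate(key))


def score_mtf(answers, key):
    """MTF key: answers are True, False, or None (blank, scored incorrect)."""
    return sum(a is not None and a == k for a, k in zip(answers, key))


def score_r4(ratings, key):
    """R4 key: only the side of the rating counts; blanks are incorrect."""
    true_side = {"definitely true", "probably true"}
    false_side = {"definitely false", "probably false"}
    return sum(r in (true_side if k else false_side)
               for r, k in zip(ratings, key))


key = [True, True, False, False, False]  # two correct alternatives
print(score_mc({0, 2}, key))                              # 3
print(score_mtf([True, None, False, False, True], key))   # 3
print(score_r4(["probably true", "definitely true", "probably true",
                "definitely false", None], key))          # 3
```

Note that in the MC key a blank cannot be distinguished from a deliberate "false," whereas the MTF and R4 keys make omissions visible, which is why dropout criteria differed between the groups.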

Year 2014
2.4.1. Aim. As a last step, we wanted to assess if motivation is the key factor to better performance in the additional questions in the MC format. To achieve this, this year, examinees received additional bonus credit for answering the questions in the MC format. We expected that this measure is capable of reducing dropout and/or eliciting better performance.

Test Instrument.
This year, only the R4 key was used for the additional questions, which were exactly the same as in 2013, but students were informed about three and a half months before the exam about which type of question to expect, which scoring method would be applied, and that they could earn bonus points for good performance. This information was given through a separate instruction sheet enclosed in the announcement of the exam.

Test Administration.
All students received the additional items in their MR format with the R4 key.Marking the additional questions was voluntary; however, for the first time, this year, students knew well in advance what types of additional question they were to expect and received up to two bonus points for the regular exam when performing well in the additional questions.
We told students that in order to calculate the bonus points, the following scoring method would be applied: (a) for a correct alternative, three points were awarded for "definitely true," two points for "probably true," and one point for "probably false"; (b) for an incorrect alternative, three points were awarded for "definitely false," two points for "probably false," and one point for "probably true" (Table 2). In all other cases, the alternative was scored as incorrect and received no points. Sum scores were then calculated for all alternatives (max. 150 points). One bonus point was awarded for reaching at least 60% of the points, and two bonus points were awarded for reaching at least 80% of the score. This information had already been given in the announcement of the exam and was repeated in an enclosed text. Few students (less than 5%) stayed in the examination room till the end of the allotted time.
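The bonus point rules above translate into a short function. This is a minimal sketch with hypothetical function names; it encodes only the point values and thresholds stated in the text and treats a blank or unrecognized rating as zero points.

```python
SCALE = ["definitely false", "probably false",
         "probably true", "definitely true"]


def bp_points(rating, is_true):
    """Graded points (0 to 3) for one alternative under the bonus scoring."""
    if rating not in SCALE:
        return 0                      # blank or invalid rating
    position = SCALE.index(rating)    # 0 = definitely false ... 3 = definitely true
    return position if is_true else 3 - position


def bonus_points(total, maximum=150):
    """Bonus points for the regular exam: 1 at >= 60%, 2 at >= 80%."""
    if total * 100 >= 80 * maximum:
        return 2
    if total * 100 >= 60 * maximum:
        return 1
    return 0


print(bp_points("probably false", True))     # 1
print(bp_points("definitely false", False))  # 3
print(bonus_points(90), bonus_points(120))   # 1 2
```

The thresholds are compared with integer arithmetic so that scores lying exactly on 60% or 80% are classified reliably.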

Sample.
Examinees were first-year clinical students in medicine, of which N = 203 took part in the obligatory midterm exam and N = 201 took part in the final exam in pharmacology. The two students who did not take part in both exams were excluded, resulting in N = 201 examinees with data in both exams.
Of these, only the data of N = 194 students were analyzed. The remaining students were excluded from the analysis only because they left blank more than five alternatives. These seven students are hereafter summarized as the dropout group (DO).

Statistics.
In 2014, only the R4 format was used for all the examinees and again all 50 alternatives were scored individually as before (max.50 points).Further, we calculated for each alternative the score from zero to three points as described above for awarding bonus points.We then also calculated sum scores over all alternatives (max.150 points).
This scoring method may be considered as a new key which we will report for the sake of completeness and term BP for "bonus points scoring" from here on. Again, at least 75% of the score was required to hypothetically pass the additional questions.

Results

Year 2012
While 95.9% of all students passed the two combined regular exams, 78.3% of the SC group passed the additional questions. In the R5 group, only 50.8% passed the additional questions when scoring alternatives, and 74.6% passed when scoring the questions as a whole (see Table 3 for further statistics on performance). We calculated the correlation between the score in the combined regular exams and the different scorings in the additional questions. The R5 group with scoring of alternatives yielded the highest correlation with r = 0.365 and p = 0.004, closely followed by the SC group with r = 0.359 and p < 0.001. However, the two correlations did not differ significantly, p = 0.943, using Fisher's r-to-z transformation [17] and comparing them using formula 2.8.5 from [15]. The correlation in the R5 group with scoring of questions was, as expected, the lowest with r = 0.253 but still significant, p = 0.027. Scatterplots with a fitted regression line of the performance in the combined regular exams versus the performance in the additional questions are shown in Figure 4.

Year 2013
For all further calculations, it is important to note that the overall pass rate in the combined exam was only 69.4%, which is considerably lower than in the 2012 combined exam, where the pass rate was 95.9%. While the pass rate in the midterm exam was already lower than before (78.0% this year versus 93.3% in 2012), the pass rate in the final exam, after which the additional questions were answered, dropped even more (48.8% this year versus 86.9% in 2012). This might have biased all further results. While 69.4% of all students still passed the two combined regular exams, the pass rate for the additional questions was very low in all groups: 5.2% in the MC group, 10.7% in the MTF group, and 10.4% in the R4 group would have passed the additional questions (see Table 4 for further statistics on performance). We calculated the correlation between the score in the combined regular exams and the different scores in the additional questions.
The MC group yielded the highest and only significant correlation with r = 0.343 and p = 0.004. The correlations in both of the other groups were about zero and not significant. Scatterplots with a fitted regression line of the performance in the combined regular exams versus the performance in the additional questions are shown in Figure 5.
Further, we asked the examinees about their experiences with the additional questions. There were two questions directly related to the questions just answered: whether the examinees could deal well with the format and key they had experienced and whether they could easily answer the questions. For both questions, the MC group agreed most, followed by the MTF group and the R4 group (Figure 6(a)). Both effects were significant, F(2, 158) = 9.11, p < 0.001 and F(2, 156) = 5.56, p = 0.005, respectively. In another two questions, we asked whether the examinees deemed the format and key they had experienced useful for either teaching or exams. Again, the MC group agreed most, followed by the MTF and R4 groups in this order (Figure 6(b)). Both effects were again significant, F(2, 157) = 6.09, p = 0.003 and F(2, 156) = 9.81, p < 0.001, respectively. Astonishingly, when asked how easy to answer they thought the two keys they had not experienced themselves would be, both the MC and MTF groups chose the R4 key over the other possibility, while the R4 group chose the MC key over the MTF key (Figure 7). In the MC and R4 groups, this effect was significant, t(108) = 3.33, p = 0.001 and t(88) = 2.11, p = 0.037, respectively, but not in the MTF group, t(104) = 1.96, p = 0.053.

Year 2014
For all further calculations, it is important to note that while the overall pass rate in the combined exams once again increased compared to the year 2013 (84.6% this year versus 78.0% in 2013 and 93.3% in 2012), the pass rate in the final exam was only 44.8%, which is even lower than in 2013 (48.8%) and considerably lower than in 2012 (86.9%). This is likely to have influenced all further results, as the additional questions were answered after the final exam (see Table 5 for a summary of all performances). Still 44.8% of all students passed the two combined regular exams, while the pass rate for the additional questions was low with only 24.2% in the normal R4 scoring. However, this is still an increase of about 14 percentage points compared to the year before, when students did not know in advance about the type of questions and the scoring methods applied. We again calculated the correlation between the score in the combined regular exams and the score in the additional questions, but once again, these were around zero and not significant. Scatterplots with a fitted regression line of the performance in the combined regular exams versus the performance in the additional questions are shown in Figure 8.

Discussion
There is debate in the literature about which kind of written question type is best for medical education. Historically, the first written question items were dichotomous, that is, "true/false." This has been criticized because the language of the question item may be ambiguous (unclear, subject to interpretation, or lacking specifics like the age of the patient) or there may be dissent between graders whether an answer is completely true or completely false [3]. Therefore, questions that require one best answer may be superior [3]. This format offers the theoretical possibility to score answers on a continuum between the least correct and most correct. Therefore, they are thought to mirror the clinical reality more closely. Some also recommend not to use negative A-type questions (i.e., one out of five [3]), especially types like "each of the following is correct except," because options cannot be rank-ordered on a single continuum, and the examinee cannot determine either the "least" or the "best correct answer" [3]. They rather recommend the Pick "N" format, in which the examinees are instructed to select "N" responses [3]. In addition, it has been argued that Pick-N format questions might be better than type-A questions because in clinical practice there is often more than one acceptable solution to a given problem [18,19]. We deliberately did not make the examinees aware how many responses they were supposed to pick, in order to make guessing a less useful strategy.
Others have studied different scoring scales or scoring algorithms for Pick-N-type MCQ [18]. Firstly, dichotomous scoring: one point if all true and no wrong answers were chosen. Secondly, partial credit scoring: one point for 100% true answers, a half point for 50% or more true answers, no point deduction for wrong answers, and zero points for less than 50% true answers. Thirdly, a fraction (1/m) of one point was given for each correct answer given. They concluded that the second and third options had higher statistical reliability and thus recommended rewarding partial knowledge in this type of question (Pick-N) [18]. Others came to similar conclusions by showing that the third scoring method also attained higher validity, but the first method exhibited higher reliability (medical students in the USA [19]). It might also be important that, in contrast to others, we did not detect gender differences in the difficulty of our tests, which might argue for the appropriate construction of our tests. Others compared eight different scoring algorithms in secondary school students: they reported that a penalizing algorithm, which subtracted the proportion of marked distractors, was most sensitive to differences in performance between students [20]. In a progress test for medical students in the Netherlands, the authors compared a number-right scoring system (i.e., incorrect answers in MCQ were not penalized by subtraction from the total score) versus formula scoring (penalizing wrong answers and forcing students to guess the answer). They concluded that number-right scoring exhibited better psychometric properties than formula scoring [21]. In other studies (in a progress test in radiology residents), number-right scores showed lower reliabilities than formula scoring [22].
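The three Pick-N scoring rules summarized above from [18] can be sketched as follows. This is an illustrative reading, not the cited authors' implementation: it assumes that m is the number of true options of a question (at least one) and that "true answers" means the share of true options the examinee selected; the function names are hypothetical.

```python
from fractions import Fraction


def dichotomous(chosen, true_options):
    """One point only if exactly the true options were chosen."""
    return 1 if set(chosen) == set(true_options) else 0


def partial_credit(chosen, true_options):
    """1 for all true options, 1/2 for at least half, else 0; no deduction."""
    share = Fraction(len(set(chosen) & set(true_options)), len(true_options))
    if share == 1:
        return Fraction(1)
    if share >= Fraction(1, 2):
        return Fraction(1, 2)
    return Fraction(0)


def fractional(chosen, true_options):
    """1/m of a point for each true option that was chosen."""
    return Fraction(len(set(chosen) & set(true_options)), len(true_options))


true_options = {"A", "B", "C", "D"}
chosen = {"A", "B", "E"}            # two of four true options plus one distractor
print(dichotomous(chosen, true_options))     # 0
print(partial_credit(chosen, true_options))  # 1/2
print(fractional(chosen, true_options))      # 1/2
```

In this sketch, only the dichotomous rule is affected by a wrongly chosen distractor, which illustrates why the partial-credit rules reward partial knowledge.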
Another study compared different item formats and calculated reliability coefficients for the study items in each format. They found that 425 A-type questions are needed to obtain a reliability of about 0.90 but only 275 Pick-N questions with partial credit scoring for the same reliability [19]. Hence, fewer questions are needed to obtain reliability, thus saving time for faculty. These Pick-N-type questions were, however, also different from our format in that more than five answers were available [3,18]. The number of options to be selected is explicitly indicated in Pick-N-type questions, because otherwise scoring problems can occur. Moreover, sometimes skilled clinicians fared worse with these questions than "test-wise" medical students. However, in the real world, the number of options to select is rarely specified [19]. Hence, it is of merit that we decided not to indicate how many options to select.
An interesting study used a different approach to study confidence in medical students. The authors gave A-type MCQ in a pharmacology course for junior (third year) and senior students (fifth year) and offered a four-point scale for confidence (from "I am very sure" to "I am guessing"). They noticed that higher level medical students were less confident in their answers but exhibited a higher correctness. They concluded that towards graduation, medical students gain more knowledge and also more skepticism, which may be a valid goal in medical education [23].
Even though all three implementations of our new type of questions require the same set of knowledge, a different set of "test wiseness" seems required to successfully answer the questions.While the groups did not differ in their scores in the graded exam, only the MC implementation yielded a significant correlation with these scores.Further, the students answering the R4 implementation scored highest, although rating this implementation was the most demanding.
Though in many institutions and countries the central licensing test given to medical students relies on a clear-cut single-choice format, faculties are often free to choose a format of their own liking for examinations in their courses. We suggest that MC questions with a random number of right answers increase the power of the MC to assess the knowledge of students and might also offer the chance to improve their problem-solving skills, in other words, make it possible to reach higher levels in Bloom's taxonomy. Indeed, at least in US biology teaching for undergraduates, even higher Bloom levels of knowledge were measurable with MCQ [24]. Others complained that, at least in tests given to first- and second-year medical students, only low competence levels were used and called for improving the MCQ by offering faculty programs on MCQ writing skills [25,26].
We noted that performance in questions relevant for the exam was much better than in our additional voluntary tests, which were given for research purposes and had no consequence for final grading. Probably, this is due to student motivation: expecting a better grade increases motivation and hence performance in the test considerably.
When offering graded response categories ranging from "certainly true" to "certainly false", it is apparently better to offer an even number of options. A neutral category ("I don't know") is probably not semantically well defined and thus might corrupt the answer selection by students.
Different question formats are confounded by differing guessing probabilities. To make results comparable with respect to overall performance between tests, appropriate correction formulas should be applied. This only works well if the assumptions on guessing probabilities are correct. In our study 1, we saw both cases: guessing correction worked well in the guessing scoring case but failed with the single scoring case. Studies 2 and 3 demonstrated the importance of familiarity with the question formats. Obviously, our multiple-response formats and in particular the confidence scales were initially unfamiliar, and this fact posed difficulties to students. It therefore seems advisable to introduce any new question format before giving the new examination style (in other words: at the beginning of classes). Thus, students can get acquainted with the new format, can devise optimal answering strategies, and might get familiar with the logic behind the new scoring systems given to them.
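The correction formulas referred to above are not spelled out in this section. As a minimal sketch, the classical chance correction rescales an observed score so that pure guessing maps to zero and a perfect score stays at one. The chance levels used below (one in five for a single-choice question with five alternatives, one in two for a true/false judgment per alternative) and the observed scores are assumptions chosen for illustration, not values from the study.

```python
def correct_for_guessing(observed: float, chance: float) -> float:
    """Rescale a score (fraction of maximum) so that the chance level maps to 0.

    observed: achieved fraction of the maximum score, between 0 and 1.
    chance:   expected fraction obtained by blind guessing (an assumption).
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must lie in [0, 1)")
    return (observed - chance) / (1.0 - chance)


# Assumed chance levels, not taken from the paper.
sc_chance = 1 / 5   # single choice, five alternatives
mtf_chance = 1 / 2  # each alternative judged true or false

# Hypothetical observed scores for two groups.
sc_corrected = correct_for_guessing(0.60, sc_chance)
mtf_corrected = correct_for_guessing(0.75, mtf_chance)

print(f"SC corrected:  {sc_corrected:.2f}")
print(f"MTF corrected: {mtf_corrected:.2f}")
```

In this hypothetical case, raw scores of 60% and 75% correspond to the same corrected performance. If the assumed chance level is wrong for a format, for example because students can exclude some alternatives, the corrected scores are no longer comparable, which is the caveat raised in the text.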
In summary, true multiple-response questions are more difficult to answer than single-choice questions. Multiple-response questions offer the examiner more flexibility to test the knowledge of the student, and in some topics, this format makes it easier to construct questions with clinical relevance. Rating scales ("I am confident" and "I am not confident") offer the examiner the possibility to question the curriculum, detect false concepts, and detect which topics have not been covered in the course work. Hence, rating scale questions might be a useful tool in formative testing to improve teaching curricula.

Figure 2: Examples of SC and R5 question types. Sample questions with their five alternatives and the corresponding key used in 2012. (a) Left panel: the question as it was used in prior exams in the SC format, requiring students to check the one and only true alternative and to leave false alternatives blank. (b) Right panel: the same question with a slightly modified stem to allow for multiple true alternatives. Students were asked to rate each alternative on a 5-point scale (R5).

Figure 3: Examples of MC, MTF, and R4 question types. A sample question with its five alternatives and the three implementations of the key used in 2013 (all) and 2014 (only panel (c)). Students were asked to check true alternatives and leave false alternatives blank (MC, panel (a)), to indicate for each alternative whether it is true or false (MTF, panel (b)), or to rate each alternative on a 4-point scale (R4, panel (c)).

Figure 4: Performance of the year 2012 students. Performance of the 2012 SC group (panel (a)) and the R5 group when scoring alternatives (panel (b)) and whole questions (panel (c)) in the combined regular exams versus their performance in the additional questions. Each dot represents the scores of a single examinee, with dots in the upper right rectangle being the ones that passed both exams. Dotted lines indicate scores required for passing. The straight line represents the linear fit to the data.

Figure 5: Performance of the year 2013 students. Performance of the 2013 MC group (panel (a)), MTF group (panel (b)), and R4 group (panel (c)) in the combined regular exams versus their performance in the additional questions. Each dot represents the scores of a single examinee, with dots in the upper right rectangle being the ones that passed both exams. Dotted lines indicate scores required for passing. The straight line represents the linear fit to the data.

Figure 6: Year 2013 evaluation. Mean agreement in the three groups to the statements shown at the bottom of the panels.

Figure 7: Year 2013 subjective evaluation. Examinees were asked to think about the other two keys they had not experienced themselves and then to indicate how easy it would be for them to answer the same questions with such a key.

Figure 8: Performance of the year 2014 students. Performance of the 2014 R4 group with alternative scoring (panel (a)) and bonus point scoring (panel (b)) in the combined regular exams versus their performance in the additional questions. Each dot represents the scores of a single examinee, with dots in the upper right rectangle being the ones that passed both exams. Dotted lines indicate scores required for passing. The straight line represents the linear fit to the data.

Table 1: Overview of study design, year, mode of questions, and participation.

Table 2: Overview of scoring an alternative in the 2014 additional questions. The same table (in German) was made available to students three and a half months prior to the exam.
Table header: "For one alternative, you receive the points shown here, if ... the alternative is correct and ... / the alternative is not correct and ..."

Table 3: Summary of performance in the 2012 regular exams and additional questions. The values given (except N and the correlations) are the respective percentage of the maximum possible value.
Columns: N, Min, Max, Mean, SD, Pass rate, r with combined exam, r with final exam.
* Significance with p < 0.05 with respective N in each row; ** significance with p < 0.001 with respective N in each row.

Table 4: Summary of performance in the 2013 regular exams and additional questions. The values given (except N and the correlations) are the respective percentage of the maximum possible value.

Table 5: Summary of performance in the 2014 regular exams and additional questions. The values given (except N and the correlations) are the respective percentage of the maximum possible value. Significance with p < 0.001 with respective N in each row.