Multiple-choice questions are widely used in clinical education. Usually, students have to mark the one and only correct answer from a set of five alternatives. Here, in a voluntary test written at the end of an obligatory pharmacology exam, we tested a format in which more than one alternative could be correct (
Written examinations using multiple-choice questions (MCQ) are now widely used all over the world in medical education. These tests were developed in psychology for research purposes by E. L. Thorndike around 1900 [
MCQ-based examinations have the advantages of being fast, cheap, objective, and of making it possible to test a broad sampling of the curriculum. They can be given in a very standardized form and can have high statistical validity. Ideally, tests should also be sensitive and specific. MCQ-based examinations are now often electronically marked, and the results can then be mathematically processed to give examiners and students quantitative feedback on the quality, reliability, and validity of the tests. A long-standing concern with this test format is its validity. In other words, does it test the skill and knowledge used in clinical practice? Is it sensitive enough to differentiate which students are fit for clinical practice and which are not? Hence, the National Board of Medical Examiners in the USA, other testing institutions, and medical faculties have continuously altered, adapted, and tried to improve their multiple-choice formats.
There is controversy whether MCQs assess only lower levels of knowledge such as recall of isolated facts, encourage trivialization of knowledge, or lend themselves to rote learning [
A-type questions were the first to be introduced and follow the “one correct answer out of five alternatives” format. While this simple form is commonly termed a multiple-choice question [
The first aim of our study was to examine the usefulness of a new type of question with the following properties: (1) more than one (
The second aim of our study was to test different implementations of an answering key for this MR format (Figure
Systematization of question types.
We conducted three studies in three consecutive years (2012–2014); for the overall design, see Table
Overview of study design, year, mode of questions, and participation.
| Year | General questions | R5 | MC | MTF | R4 | Bonus |
|---|---|---|---|---|---|---|
| 2012 | 60 | + | − | − | − | − |
| 2013 | 60 | − | + | + | + | − |
| 2014 | 60 | − | − | − | + | + |
Year: year in which the examination was given; R5/MC/MTF/R4: plus and minus indicate whether additional questions with the respective key were given; bonus: plus and minus indicate whether or not bonus points were given to increase student participation/motivation;
Students were always seated individually, and educators were present in the lecture hall for the entire duration of the test. For the regular exams, four versions with identical questions in different random sequences were prepared. Test versions were handed out at random in the class, and each examinee kept the same version throughout the whole exam. The 30 questions in the midterm exam contained only items that should have been known from seminars and lectures up to that point. The final exam contained 30 questions pertaining to the whole course. Questions in the midterm and final exams were designed following commonly available guidelines for A-type questions [
For grading, only the results of the regular midterm and final exams were considered. As each exam consisted of 30 equally weighted questions, we calculated simple sum scores (max. 30 points each). Students passed the course when reaching at least 60% of the combined score of these two exams, that is, at least 36 correct answers across the 30 midterm and 30 final exam questions. Thus, passing the course required taking part in both exams.
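For illustration, the pass rule can be written in a few lines of code. The following is a minimal sketch (not the actual grading software), assuming one point per correctly answered question as described above:

```python
# Minimal sketch (not the actual grading software) of the pass rule described
# above: one point per correctly answered question, 60% of the combined score.
def passed_course(midterm_points: int, final_points: int) -> bool:
    """Pass requires at least 60% of the 60 combined points, i.e., 36 points."""
    return midterm_points + final_points >= 0.60 * 60

print(passed_course(20, 16))  # 36/60 -> True
print(passed_course(20, 15))  # 35/60 -> False
```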
The additional questions were given out randomly together with the regular exam sheets before the start of the exam. These questions had been used in their A-type format with prior cohorts and were selected for having shown difficulty indices from 0.34 to 0.96 and discrimination indices from 0.21 to 0.58, following classical test theory (e.g., Kubinger and Gottschall [
This study can be understood as a kind of voluntary survey; no health-related data were collected, so an ethics statement was not required. Nevertheless, the study complies with all applicable personal data protection regulations. The students were informed about the project several weeks before the examination, and participants gave their consent by filling out the exam. These additional voluntary tests were written directly after the obligatory exams but were not part of the routine course procedure, and participation had no influence on passing the obligatory exams.
Significances of differences between correlations were tested using Fisher’s r-to-z transformation.
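The following sketch implements the standard Fisher r-to-z comparison of two independent correlations; we assume this standard variant was the one applied, and the function name is illustrative only:

```python
# Sketch of the standard Fisher r-to-z test for the difference between two
# independent correlations (we assume this standard variant was used).
import math

def fisher_z_test(r1: float, n1: int, r2: float, n2: int):
    """Return (z statistic, two-sided p value) for H0: rho1 = rho2."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # SE of the difference
    z = (z1 - z2) / se
    # Two-sided p from the standard normal CDF, Phi(x) = (1 + erf(x/sqrt(2)))/2
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p

# Illustrative call with correlations of the size reported in the tables below:
print(fisher_z_test(0.359, 129, 0.365, 59))
```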
As a first step, we wanted to assess how the MR format with a 5-point rating key (R5) performs versus the traditional SC format. Since all questions were taken from the same pool used for the regular exams, we expected performance in both formats to be about the same when each alternative is scored individually in the MR format. However, we expected substantially lower scores in the R5 format when questions are scored as a whole, because a single incorrectly answered alternative renders the whole question incorrect and forfeits the credit for all other correctly answered alternatives of that question.
To test the usefulness of our new R5 format, following the regular final exam we asked two groups of students to mark either ten questions in the traditional SC format or their ten modified counterparts in the R5 rating format. An example of a typical additional question is shown in Figure
Examples of SC and R5 question types. Sample questions with their five alternatives and the corresponding key used in 2012. (a) Left panel: the question as used in prior exams in the SC format, requiring examinees to check the one and only true alternative and to leave the false alternatives blank. (b) Right panel: the same question with a slightly modified stem to allow for multiple true alternatives. Students were asked to rate each alternative on a 5-point scale (R5).
Half of the students received the additional items in their original SC format and half of the students in the new adjusted MR rating format with a 5-point rating scale (R5). Rating scales are a classical instrument (“Likert scales”) in education research [
Examinees were,
For the additional questions, different scoring methods were applied in the two groups. In the SC group, a sum score similar to that of the regular exam was calculated for the additional ten questions. The correct choice of the one out of five answers received one point (max. ten points), and reaching at least 60% of the score was considered a hypothetical pass for these questions.
In the R5 group, each of the 50 alternatives was scored individually as follows: alternatives were scored as correct and received one point, if (a) a correct alternative was answered with “definitely true” or “probably true” and (b) an incorrect alternative was answered with “definitely false” or “probably false.” In all other cases, the alternative was scored as incorrect. Sum scores were then calculated for either all alternatives (max. 50 points) or for all questions (points for a question were allotted only if all alternatives for these questions were answered correctly; max. ten points as in the SC group).
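The R5 scoring rule just described can be summarized in code. The following is our reconstruction (function and label names are hypothetical); note that the middle category of the 5-point scale falls under “all other cases” and scores zero:

```python
# Our reconstruction of the R5 scoring rule described above (hypothetical
# names; the middle category of the 5-point scale always scores zero here).
TRUE_SIDE = {"definitely true", "probably true"}
FALSE_SIDE = {"definitely false", "probably false"}

def score_r5_alternative(alt_is_correct: bool, rating: str) -> int:
    """1 point if the rating lies on the matching side of the scale, else 0."""
    side = TRUE_SIDE if alt_is_correct else FALSE_SIDE
    return int(rating in side)

def score_r5_question(key, ratings) -> int:
    """Whole-question scoring: 1 point only if all five alternatives match."""
    points = sum(score_r5_alternative(k, r) for k, r in zip(key, ratings))
    return int(points == len(key))
```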
Due to the increased guess probability when single alternatives are scored, a raised threshold of at least 75% of the score was required for a hypothetical pass in this case.
The ten additional items were not graded and used only for our research purposes. The students did not receive bonus points for good performance in this test.
As a second step, we wanted to assess how the three different keys for the MR format (Figure
For the additional questions, we varied the key for the same questions in three groups: multiple choice (MC), multiple true-false (MTF), and a rating format (R4). An example of a typical additional question is shown in Figure
Examples of MC, MTF, and R4 question types. A sample question with its five alternatives and the three implementations of the key used in 2013 (all panels) and 2014 (only panel (c)). Students were asked to check true alternatives and leave false alternatives blank (MC, panel (a)), to indicate for each alternative whether it is true or false (MTF, panel (b)), or to rate each alternative on a 4-point scale (R4, panel (c)).
One-third of the students received the additional items in the MC implementation, one-third in the MTF implementation, and the last third in the R4 implementation. Marking the additional questions was voluntary, and students received no extra credit for answering them. Students did not know in advance what type of additional questions to expect. However, all groups were informed in an enclosed text about how a correct answer had to be given.
Together with the exam questions, we distributed a questionnaire about the experiences the students had while answering the additional questions. Students were asked to fill it out after completing the additional questions. Few students (less than 5%) stayed in the examination room until the end of the allotted time.
Students’ experiences and opinions about the new questions were collected immediately after the test using a questionnaire. The questionnaire contained ten questions about how easy the additional questions had been to answer, whether they were suitable for exams or online preparation tests, whether students wished them to be used more frequently, and whether one of the other keys, not experienced first-hand, might be better.
Examinees were first-year clinical students in medicine, of whom
Of these students, only the data of
In 2013, the 50 alternatives of the ten additional items were again scored individually. In the MC group, alternatives were scored as correct and received one point if (a) a correct alternative was checked and (b) an incorrect alternative was left blank. In the MTF group, alternatives were scored as correct and received one point if (a) a correct alternative was answered with “true” and (b) an incorrect alternative was answered with “false.” In the R4 group, alternatives were scored as correct and received one point if (a) a correct alternative was answered with “definitely true” or “probably true” and (b) an incorrect alternative was answered with “definitely false” or “probably false.” In all other cases, the alternative was scored as incorrect. Sum scores were then calculated over alternatives (max. 50 points). Due to the increased guess probability when scoring alternatives in this way, at least 75% of the score was required for a hypothetical pass.
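As an illustration, the sketch below shows how all three keys reduce to the same binary judgment per alternative under this scoring (our reconstruction; the answer labels are hypothetical placeholders):

```python
# Sketch (our reconstruction, hypothetical answer labels) of the 2013
# per-alternative scoring: every key reduces to the same binary judgment.
TRUE_MARKS = {"MC": {"checked"}, "MTF": {"true"},
              "R4": {"definitely true", "probably true"}}
FALSE_MARKS = {"MC": {"blank"}, "MTF": {"false"},
               "R4": {"definitely false", "probably false"}}

def score_alternative(fmt: str, alt_is_correct: bool, answer: str) -> int:
    """1 point if the answer falls on the matching side; all else scores 0."""
    marks = TRUE_MARKS[fmt] if alt_is_correct else FALSE_MARKS[fmt]
    return int(answer in marks)

def hypothetical_pass(points: int, max_points: int = 50) -> bool:
    """Raised 75% threshold compensating for the higher guess probability."""
    return points >= 0.75 * max_points
```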
The ten additional items were not graded and used only for our research purposes. The students did not receive bonus points for good performance in this test.
As a last step, we wanted to assess whether motivation is the key factor for better performance in the additional questions in the MR format. To this end, examinees this year received additional bonus credit for answering these questions. We expected this measure to reduce dropout and/or elicit better performance.
This year, only the R4 key was used for the additional questions, which were exactly the same as in 2013, but students were informed three and a half months before the exam about which type of question to expect, which scoring method would be applied, and that they could earn bonus points for good performance. This information was given through a separate instruction sheet enclosed with the announcement of the exam.
All students received the additional items in the MR format with the R4 key. Marking the additional questions was voluntary; however, for the first time, students knew well in advance what type of additional questions to expect and could receive up to two bonus points for the regular exam when performing well in the additional questions.
We told students that the following scoring method would be applied to calculate the bonus points: (a) a correct alternative answered with “definitely true” was awarded three points, with “probably true” two points, and with “probably false” one point; (b) an incorrect alternative answered with “definitely false” was awarded three points, with “probably false” two points, and with “probably true” one point (Table
Overview of scoring an alternative in the 2014 additional questions. The same table (in German) was made available to students three and a half months prior to the exam.
| For one alternative, you receive the points shown here, if… | the alternative is correct and… | the alternative is not correct and… |
|---|---|---|
| you mark “definitely true” | 3 points | 0 points |
| you mark “probably true” | 2 points | 1 point |
| you mark “probably false” | 1 point | 2 points |
| you mark “definitely false” | 0 points | 3 points |
Examinees were first-year clinical students in medicine, of whom
Of these students, only the data of
In 2014, only the R4 format was used for all examinees, and again all 50 alternatives were scored individually as before (max. 50 points). Further, we calculated for each alternative the score from zero to three points described above for awarding bonus points and then calculated sum scores over all alternatives (max. 150 points). This scoring method may be considered a new key, which we report for the sake of completeness and term BP (“bonus point scoring”) from here on. Again, at least 75% of the score was required to hypothetically pass the additional questions.
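The BP scoring can be expressed compactly. The sketch below is our reading of the table above (names are hypothetical), including a consistency check against the published point values:

```python
# Our reading of the BP (bonus point) scoring shown in the table above:
# 0-3 points per alternative, mirrored for correct and incorrect alternatives.
R4_SCALE = ["definitely true", "probably true", "probably false", "definitely false"]

def bp_score(alt_is_correct: bool, rating: str) -> int:
    """3 points for confident agreement with the key, 0 for confident error."""
    i = R4_SCALE.index(rating)  # 0 = definitely true ... 3 = definitely false
    return (3 - i) if alt_is_correct else i

# Consistency check against the table published above:
assert bp_score(True, "definitely true") == 3 and bp_score(False, "definitely true") == 0
assert bp_score(True, "probably false") == 1 and bp_score(False, "probably false") == 2
```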
There were no differences between the three groups (SC, R5, and DO) regarding age,
While 95.9% of all students passed the two combined regular exams, 78.3% of the SC group passed the additional questions; in the R5 group, only 50.8% passed when scoring alternatives, whereas 74.6% passed when scoring questions as a whole (Table
Summary of performance in the 2012 regular exams and additional questions. The values given (except n and the correlations r) are percentages.
| | n | Min | Max | Mean | SD | Pass rate | r (combined exam) | r (final exam) |
|---|---|---|---|---|---|---|---|---|
| Midterm exam | 245 | 50.0 | 93.3 | 78.9 | 8.5 | 97.6 | 0.774 | 0.402 |
| Final exam | 245 | 16.7 | 96.7 | 70.5 | 11.9 | 86.9 | 0.891 | 1 |
| Combined exam | 245 | 35.0 | 91.7 | 74.7 | 8.6 | 95.9 | 1 | 0.891 |
| Additional questions: SC group | 129 | 0 | 100 | 68.3 | 21.2 | 78.3 | 0.359 | 0.308 |
| Additional questions: R5 group (alternatives) | 59 | 0 | 98.0 | 64.0 | 23.0 | 50.8 | 0.365 | 0.417 |
| Additional questions: R5 group (questions) | 59 | 0 | 90.0 | 34.1 | 23.5 | 74.6 | 0.253 | 0.335 |
Performance of the year 2012 students. Performance of the 2012 SC group (panel (a)) and the R5 group when scoring alternatives (panel (b)) and whole questions (panel (c)) in the combined regular exams versus their performance in the additional questions. Each dot represents the scores of a single examinee with dots in the upper right rectangle being the ones that passed both exams. Dotted lines indicate scores required for passing. The straight line represents the linear fit to the data.
There were no differences between the four groups (MC, MTF, R4, and DO) regarding age,
For all further calculations, it is important to note that the overall pass rate in the combined exam was only 69.4%, considerably lower than the 95.9% in the 2012 combined exam. While the pass rate in the midterm exam was already lower than before (78.0% this year versus 97.6% in 2012), the pass rate in the final exam, after which the additional questions were answered, dropped even more (48.8% this year versus 86.9% in 2012). This might have biased all further results. While still 69.4% of all students passed the two combined regular exams, the pass rate for the additional questions was very low in all groups: 5.2% in the MC group, 10.7% in the MTF group, and 10.4% in the R4 group would have passed the additional questions (Table
Summary of performance in the 2013 regular exams and additional questions. The values given (except n and the correlations r) are percentages.
| | n | Min | Max | Mean | SD | Pass rate | r (combined exam) | r (final exam) |
|---|---|---|---|---|---|---|---|---|
| Midterm exam | 209 | 33.3 | 96.7 | 68.6 | 12.6 | 78.0 | 0.837 | 0.404 |
| Final exam | 209 | 16.7 | 83.3 | 57.2 | 12.6 | 48.8 | 0.839 | 1 |
| Combined exam | 209 | 28.3 | 83.3 | 62.9 | 10.6 | 69.4 | 1 | 0.839 |
| Additional questions: MC group | 58 | 44.0 | 78.0 | 58.5 | 8.1 | 5.2 | 0.343 | 0.316 |
| Additional questions: MTF group | 56 | 38.0 | 88.0 | 61.6 | 10.1 | 10.7 | −0.061 | −0.054 |
| Additional questions: R4 group | 48 | 46.0 | 80.0 | 63.7 | 8.0 | 10.4 | −0.028 | 0.138 |
Performance of the year 2013 students. Performance of the 2013 MC group (panel (a)), MTF group (panel (b)), and R4 group (panel (c)) in the combined regular exams versus their performance in the additional questions. Each dot represents the scores of a single examinee with dots in the upper right rectangle being the ones that passed both exams. Dotted lines indicate scores required for passing. The straight line represents the linear fit to the data.
Year 2013 evaluation. Mean agreement in the three groups with the statements shown at the bottom of the panels.
Year 2013 subjective evaluation. Examinees were asked to consider the two keys they had not experienced themselves and to indicate how easy it would be for them to answer the same questions with such a key.
There were no differences between the R4 and DO groups regarding age,
For all further calculations, it is important to note that while the overall pass rate in the combined exams increased again compared to 2013 (84.6% this year versus 69.4% in 2013 and 95.9% in 2012), the pass rate in the final exam was only 44.8%, which is even lower than in 2013 (48.8%) and considerably lower than in 2012 (86.9%). This is likely to have influenced all further results, as the additional questions were answered after the final exam (Table
Summary of performance in the 2014 regular exams and additional questions. The values given (except n and the correlations r) are percentages.
| | n | Min | Max | Mean | SD | Pass rate | r (combined exam) | r (final exam) |
|---|---|---|---|---|---|---|---|---|
| Midterm exam | 201 | 40.0 | 96.7 | 84.2 | 13.6 | 93.0 | 0.848 | 0.301 |
| Final exam | 201 | 23.3 | 90.0 | 56.1 | 11.1 | 44.8 | 0.761 | 1 |
| Combined exam | 201 | 36.7 | 91.7 | 70.2 | 10.0 | 84.6 | 1 | 0.761 |
| Additional questions: R4 scoring | 194 | 44.0 | 88.0 | 68.4 | 8.9 | 24.2 | 0.031 | 0.087 |
| Additional questions: BP scoring | 194 | 47.3 | 84.0 | 66.6 | 7.8 | 16.5 | 0.023 | 0.053 |
Performance of the year 2014 students. Performance of the 2014 R4 group with alternative scoring (panel (a)) and bonus point scoring (panel (b)) in the combined regular exams versus their performance in the additional questions. Each dot represents the scores of a single examinee with dots in the upper right rectangle being the ones that passed both exams. Dotted lines indicate scores required for passing. The straight line represents the linear fit to the data.
There is debate in the literature about which kind of written question type is best for medical education. Historically, the first written question items were dichotomous, that is, “true/false.” This has been criticized because the language of the question item may be ambiguous (unclear, subject to interpretation, or lacking specifics such as the age of the patient) or there may be dissent between graders whether an answer is completely true or completely false [
Another study compared different item formats and calculated reliability coefficients for the study items in each format. They found that 425 A-type questions are needed to obtain a reliability of about 0.90, whereas only 275 Pick-N questions with partial-credit scoring suffice for the same reliability [
An interesting study used a different approach to investigate confidence in medical students. The authors gave A-type MCQs in a pharmacology course to junior (third-year) and senior (fifth-year) students and offered a four-point confidence scale (from “I am very sure” to “I am guessing”). They noticed that the more senior medical students were less confident in their answers but achieved higher correctness. They concluded that towards graduation, medical students gain more knowledge and also more skepticism, which may be a valid goal in medical education [
Even though all three implementations of our new type of question require the same knowledge, a different kind of “test wiseness” seems to be required to answer the questions successfully. While the groups did not differ in their scores in the graded exam, only the MC implementation yielded a significant correlation with these scores. Further, the students answering the R4 implementation scored highest, although this rating implementation was the most demanding.
Though in many institutions and countries the central licensing test for medical students relies on a clear-cut single-choice format, faculties are often free to choose a format of their own liking for examinations in their courses. We suggest that MC questions with a random number of right answers increase the power of the MC format to assess students’ knowledge and might also offer the chance to improve their problem-solving skills; in other words, they make it possible to reach higher levels in Bloom’s taxonomy. Indeed, at least in US undergraduate biology teaching, higher Bloom levels of knowledge were measurable with MCQs [
We noted that performance in questions relevant for the exam was much better than in our additional voluntary tests, which were given for research purposes and had no consequence for the final grading. Probably, this is due to student motivation: the expectation of better grading increases motivation and hence test performance considerably.
When offering graded response categories ranging from “certainly true” to “certainly false,” it is apparently better to offer an even number of options. A neutral category (“I don’t know”) is probably not semantically well defined and thus might distort the students’ answer selection.
Different question formats are confounded by differing guessing probabilities. To make overall performance comparable between tests, appropriate correction formulas should be applied. This only works well if the assumptions about guessing probabilities are correct. In our study 1, we saw both cases: guessing correction worked well when whole questions were scored but failed when single alternatives were scored.
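For illustration, one standard correction for guessing rescales the raw proportion correct so that pure guessing maps to zero; the following is a generic sketch of this idea, not necessarily the exact formula applied in our studies:

```python
# Generic correction-for-guessing sketch (not necessarily the exact formula
# applied in our studies): rescale so that pure guessing maps to zero.
def corrected_proportion(raw: float, guess_prob: float) -> float:
    """corrected = (raw - g) / (1 - g), with g the per-item guess probability."""
    return (raw - guess_prob) / (1.0 - guess_prob)

# Example: with g = 0.5 (binary judgment per alternative), a raw score of 75%
# corresponds to 50% knowledge beyond guessing.
print(corrected_proportion(0.75, 0.5))  # 0.5
```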
Studies 2 and 3 demonstrated the importance of familiarity with the question formats. Obviously, our multiple-response formats, and in particular the confidence scales, were initially unfamiliar, and this posed difficulties to students. It therefore seems advisable to introduce any new question format before giving the new examination style, in other words, at the beginning of classes. Thus, students can get acquainted with the new format, devise optimal answering strategies, and become familiar with the logic behind the new scoring systems.
In summary, true multiple-response questions are more difficult to answer than single-choice questions. Multiple-response questions offer the examiner more flexibility to test the knowledge of the student, and for some topics, this format makes it easier to construct questions with clinical relevance. Rating scales (“I am confident” / “I am not confident”) offer the examiner the possibility to question the curriculum, detect false concepts, and detect which topics have not been covered in the course work. Hence, rating-scale questions might be a useful tool in formative testing to improve teaching curricula.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.
The authors want to acknowledge the eLearning development group of the Faculty of Medicine “HaMeeL–Hallesches Medizinisches eLearning” and the center for multimedia-enhanced teaching and learning (@LLZ) of the Martin Luther University Halle-Wittenberg. The authors also acknowledge the financial support of the Open Access Publication Fund of the Martin Luther University Halle-Wittenberg.