Effects of COVID-19 Pandemic on Progress Test Performance in German-Speaking Countries

Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Progress Test Medizin, Charitéplatz 1, Berlin 10117, Germany
Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Progress Test Medizin and BIH QUEST Center for Transforming Biomedical Research, Translational Research Unit, Charitéplatz 1, Berlin 10117, Germany


Introduction
The COVID-19 pandemic has impacted almost every area of daily life worldwide; students have also been affected by changes in their studies due to lockdown measures [1]. Undergraduate curricula in medical schools usually include an extensive practical component implying regular contact with patients; students would therefore have been put at risk of infection if practical lessons had been held as planned before the pandemic [2]. The impact of these circumstances on the academic performance of medical students has been the subject of research, with differing outcomes: studies report that student performance worsened [3,4], stagnated [5], or changed only for specific subjects [6] during the pandemic.
However, there might still be a knowledge gap to fill due to methodological limitations in the literature published so far, including small sample sizes, limited research scopes (i.e., only specific subjects or semesters were considered), and incomplete information about possible differences in the difficulty levels of the exams being compared.
To help fill this knowledge gap, we set out to investigate short-term effects on knowledge gain using data from recent issues of the Progress Test Medicine (PT). The PT is a formative test including 200 multiple choice questions at the graduate level, which provides feedback to students on knowledge and knowledge gain during their course of study [7]. It is usually administered around the beginning of each semester, and in the summer of 2020, it was provided to 11,101 students from 15 German and Austrian faculties.
In addition to the large amount of data available and the possibility of observing the development in every semester and subject, another significant strength of the PT lies in the fact that it assesses current levels of knowledge without giving students the chance to prepare for it [7].
In summary, we intend to address the following two main questions throughout this study using data from PT: (i) Is there a substantial change in knowledge between tests that took place prior to the pandemic ("prepandemic") and those conducted after the pandemic began? (ii) Are there differences at the specialty level? A relevant question here is whether the observed changes were similar for all fields of study or if there were remarkable differences in performance depending on the medical discipline considered.

Setting.
The PT is a low-stakes test on medical knowledge assembled every semester by Charité-Universitätsmedizin Berlin. PT exam regulations may differ between participating faculties: for example, participating in the test is mandatory for all students in 12 faculties and voluntary in the remaining two, and the number of compulsory participations demanded from students also varies by faculty. Additionally, three different test modalities are implemented depending on the faculty where the test is carried out: the traditional paper-pencil exam (which, however, has been completely abandoned because of the pandemic); the consortium's own platform (ePT), where students are required to state how confident they are that their answer is correct [8]; and other learning platforms (e.g., ILIAS [9]), where this confidence statement is not included. The "winter semester PT" usually takes place from October to December, while the "summer semester PT" is conducted from April to June. Since October 2019, our PTs share a considerable number of questions with the PT that took place five semesters before.
This leads to a natural pairing of tests administered five semesters (two and a half years) apart from each other. We conducted our analysis based on the shared questions within each of the pairs. Three consecutive PT pairs were included in our study; the first one comprises PT number 36 and PT number 41 (which in the following will be called "PT36" and "PT41"), which share 122 questions. As both tests took place before the pandemic, starting in April 2017 and October 2019, respectively, we included this dataset as a control. PT37 and PT42 share 155 questions; PT37 started before the pandemic in October 2017 and PT42 in April 2020, during the first lockdown in Germany and Austria. Because the pandemic began to spread across Europe just a few weeks before the summer semester of 2020 was scheduled to start, new entrants in medical schools were suddenly confronted with a rather uncertain academic situation, having to adapt themselves to a completely virtual study environment, which had never been implemented on a comparable scale up to that time.
PT38 started in April 2018 and PT43 in November 2020, sharing 134 questions. By November 2020, online lectures had already become the norm, while practical lessons had been reduced or cancelled in line with mandatory social distancing regulations. Examination periods were postponed and prolonged, and sometimes more lenient rating procedures were applied, counting failed exams as a "free shot."
Since both tests belonging to the same pair share most of their questions, students who took both tests in a pair were shown the shared questions twice, while the rest were presented these questions only once. To quantify the effect this might have on test results, we performed a t-test comparing both groups ("seen twice" vs. "seen once"). We estimated the effect size using Cohen's d; the significance level α was set to 0.01. Students in the "seen twice" group did not significantly outperform those in the "seen once" group (pair 36-41: t = −0.32, p = 0.75, d = −0.01; pair 37-42: t = 1.96, p = 0.05, d = 0.07; pair 38-43: t = 2.43, p = 0.02, d = 0.09).
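The comparison between the "seen twice" and "seen once" groups can be illustrated with a minimal sketch. It assumes a Student's t-test with equal variances and Cohen's d based on the pooled standard deviation; the text does not state which t-test variant was used, and the scores below are hypothetical values, not study data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_variance(a, b):
    """Pooled sample variance of two independent groups."""
    n_a, n_b = len(a), len(b)
    return ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)


def cohens_d(a, b):
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    return (mean(a) - mean(b)) / sqrt(pooled_variance(a, b))


def student_t(a, b):
    """Two-sample t statistic under the equal-variance assumption."""
    standard_error = sqrt(pooled_variance(a, b) * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical relative scores (percent correct); not taken from the study.
seen_twice = [52.0, 48.0, 61.0, 55.0, 47.0, 58.0]
seen_once = [50.0, 49.0, 60.0, 53.0, 46.0, 57.0]

d = cohens_d(seen_twice, seen_once)
t = student_t(seen_twice, seen_once)
print(f"t = {t:.2f}, Cohen's d = {d:.2f}")
```

With very large groups, as in the PT data, even a small d can produce a sizeable t statistic, which is one reason to report the effect size alongside the p-value.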

Participants.
A total of nine faculties were included in this study; in addition to consenting to the use of their data, they had to meet two further requirements:
(i) Faculty-specific PT exam regulations must not have changed between the summer semester of 2017 and the winter semester of 2020.
(ii) The faculty must have administered the test every academic semester since the summer semester of 2017.
We used a pseudonymized dataset with the shared questions of PT36 and PT41, PT37 and PT42, and PT38 and PT43. These datasets contained the answers of participants to each question as well as the semester to which they belong, the pseudonymized faculty where they study, and whether their participation in the test is considered "serious" or not; participants classified as nonserious are excluded from the calculation of comparison groups since the validity of results would otherwise be jeopardized [10].
Nonserious participation is presumed when one or more of the following happens:
(i) The amount of time devoted to completing the test is too short [11] (less than 20 minutes).
(ii) Every single question of the test was either answered with "don't know" or not answered at all [12].
(iii) None of the last 120 questions of the test were responded to (suggesting that the test was left incomplete with more than half of it yet to be read).
(iv) The self-monitoring accuracy rate in relation to test answers is lower than 33% upon 20 or more questions, hinting that most answers were guessed or randomly chosen.
Education Research International
In addition to disengaged test taking, exam misconduct (e.g., "cheating") must also be addressed as a construct-irrelevant factor. Two particular design elements of the PT are key to preventing exam misconduct: firstly, the test is purely formative and not linked to any specific course content. Besides, test scores have no effect whatsoever on the final grades of students [13,14]. Secondly, the timeframe for the test is tight (180 minutes for 200 questions) with no possibility of interruption [15].
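The four nonserious-participation criteria listed above can be expressed as a simple filter. The following is a minimal sketch, not the consortium's actual implementation: the record layout, the field names, and the encoding of the "don't know" option are all hypothetical, and the self-monitoring accuracy is assumed to be precomputed elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional

DONT_KNOW = "dont_know"  # hypothetical encoding of the "don't know" option


@dataclass
class Participation:
    """One student's test record (all field names are illustrative)."""
    minutes_spent: float
    answers: List[Optional[str]]  # None = unanswered
    self_monitoring_accuracy: Optional[float] = None  # 0..1; None without confidence data
    rated_answers: int = 0  # number of answers with a confidence statement


def is_nonserious(p: Participation) -> bool:
    """Return True if at least one of the four criteria applies."""
    # (i) less than 20 minutes spent on the test
    too_fast = p.minutes_spent < 20
    # (ii) every question answered "don't know" or left unanswered
    no_content = all(a is None or a == DONT_KNOW for a in p.answers)
    # (iii) no response at all to the last 120 questions
    # (assumption: "don't know" counts as a response here)
    tail_empty = all(a is None for a in p.answers[-120:])
    # (iv) self-monitoring accuracy below 33% on 20 or more rated answers
    low_accuracy = (
        p.self_monitoring_accuracy is not None
        and p.rated_answers >= 20
        and p.self_monitoring_accuracy < 0.33
    )
    return too_fast or no_content or tail_empty or low_accuracy


serious = Participation(minutes_spent=95, answers=["A"] * 200)
abandoned = Participation(minutes_spent=45, answers=["B"] * 80 + [None] * 120)
print(is_nonserious(serious), is_nonserious(abandoned))
```

Keeping each criterion in its own named variable makes it easy to report which rule excluded a given participation, should that be needed for a flow chart of data selection.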

Overall Test Performance.
For each PT pair, we fitted a linear mixed-effect model with random intercept and slope and with the relative PT score (correctly answered questions in percent) as the outcome variable. Four fixed effects predictors were set: test number, semester of study, the interaction between test number and semester of study, and test modality (digital vs. ePT vs. PT with paper-pencil).
The medical school where each test was administered (random intercept) and the interaction between faculty and semester of study (random slope) were chosen as random effect predictors. This choice of predictors is based on the assumption (corroborated by long-term data from Progress Test) that curricular differences might lead to dissimilar semestral variations in average scores among the participating medical schools. Coefficients were fit with the restricted maximum likelihood approach.
We used R for Windows, version 3.6.1 (R Core Team, Vienna, Austria), and the lmer function from the R-package lme4 [16] for fitting the linear mixed models; additionally, the semipartial R² was calculated for each fixed effect by using the "nsj" method from the r2glmm package [17].
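Written out, the model described above might take the following form. This is a sketch under stated assumptions: the symbols, the coding of the test as an indicator variable, the dummy coding of test modality, and the bivariate normal distribution of the faculty-level effects are not taken from the original text.

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1\,\mathrm{test}_{ij} + \beta_2\,\mathrm{sem}_{ij}
        + \beta_3\,(\mathrm{test}_{ij}\times\mathrm{sem}_{ij})
        + \boldsymbol{\beta}_4^{\top}\mathbf{m}_{ij}
        + u_{0j} + u_{1j}\,\mathrm{sem}_{ij} + \varepsilon_{ij},\\
(u_{0j}, u_{1j})^{\top} &\sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}),\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Here y_ij denotes the relative score of participation i in faculty j, m_ij the dummy-coded test modality, u_0j the random intercept of faculty j, and u_1j its random slope for semester. In lme4 formula notation, and with hypothetical column names, this would plausibly correspond to score ~ test * semester + modality + (1 + semester | faculty).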

Performance Development per Subject.
Since the PT also publishes results broken down by medical discipline or subject, all questions included in the test are routinely classified according to a list of 27 predetermined subjects; we used this classification to compute the absolute variations in the percentage share of correct answers for every subject-specific question subset. These question subsets are the same for both tests in each PT pair; therefore, a direct comparison between them is methodologically sound.
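The per-subject comparison described above amounts to computing the share of correct answers per subject in each test of a pair and taking the difference in percentage points. A minimal sketch follows; the input layout (pairs of subject and correctness flag) and the example responses are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import defaultdict


def percent_correct_by_subject(responses):
    """Share of correct answers (in percent) per subject for one test.

    responses: iterable of (subject, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, is_correct in responses:
        totals[subject] += 1
        correct[subject] += int(is_correct)
    return {s: 100.0 * correct[s] / totals[s] for s in totals}


def change_in_points(earlier, later):
    """Absolute change in percentage points, later test minus earlier test."""
    before = percent_correct_by_subject(earlier)
    after = percent_correct_by_subject(later)
    return {s: after[s] - before[s] for s in before if s in after}


# Hypothetical responses to the shared questions of one PT pair.
pt_earlier = [
    ("epidemiology", True), ("epidemiology", True),
    ("epidemiology", False), ("epidemiology", False),
    ("urology", True), ("urology", True),
    ("urology", True), ("urology", False),
]
pt_later = [
    ("epidemiology", True), ("epidemiology", True),
    ("epidemiology", True), ("epidemiology", False),
    ("urology", True), ("urology", True),
    ("urology", False), ("urology", False),
]

changes = change_in_points(pt_earlier, pt_later)
for subject, delta in sorted(changes.items()):
    print(f"{subject}: {delta:+.2f} percentage points")
```

Because the question subsets are identical within a pair, the difference is not confounded by question difficulty, which is the property the comparison relies on.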

Participants.
The final datasets consisted of 13,372 tests for pair PT36-PT41, 13,121 for pair PT37-PT42, and 13,822 for pair PT38-PT43. Only serious test takers from nine faculties were kept (see Figure 1 for the flow chart of data selection).

Overall Test Performance.
The most recent test of each PT pair showed a higher mean score compared to the previous test in the same pair, with this difference becoming more pronounced over time (PT pair 36-41: 2.53 (95% CI: 1.31-3.75); PT pair 37-42: 3.72 (2.57-4.88); PT pair 38-43: 5.66 (4.63-6.69); see Figure 2 and Supplementary Material Tables 1-3); this could signify a sustained knowledge increase throughout the whole period examined. The variable "semester" was found to be the most influential fixed effect regarding student performance (>4.3 difference in mean between tests for every PT pair), implying that the mean score for each semester increases on average by at least 4.3 points with respect to the previous semester regardless of other factors. This result is in line with expectations and reflects the usual knowledge increase of participants as they advance towards the completion of their degrees.
Regarding the interaction between test and semester, the results for each PT pair do not show a uniform picture. The values obtained for PT pairs 36-41 and 38-43 are −0.32 and −0.3; these negative figures imply that the growth of mean scores is stronger in earlier semesters and dwindles somewhat for more advanced students. This is not true in the case of PT pair 37-42, where the corresponding value (0.04) indicates that mean scores increase evenly throughout all semesters.
According to the intraclass correlation of all three models, university-related random effects do not generally add much variance to the obtained scores (PT pair 36-41: 0.14, PT pair 37-42: 0.06, and PT pair 38-43: 0.04), which means that test results from the same university show comparatively low levels of within-cluster correlation. For all three models, the conditional R² lies between 0.48 and 0.56 (i.e., the models explain 48% to 56% of the variance in test scores). This is an expected outcome since variations in individual performance between participants belonging to the same university and semester are not covered by any model parameter; here, we have preferred to explore the evolution of test results for whole student cohorts instead of focusing on performance imbalances between students of the same group. From a methodological point of view, it is also worth mentioning that we have modeled a numerical variable using a mixture of numerical and categorical variables, which may also have a negative effect on the value of R² (for the complete model results, see Supplementary Material Tables 1-3; the distribution of correctly answered questions of each PT pair per semester can be seen in Figure 3).

Development of Performance per Subject.
As can be noted in Figure 4, subjects such as epidemiology, anesthesiology, or gynecology stand out markedly above the rest in terms of performance, while others (e.g., urology, dermatology, [...]) (Table 4). The medical discipline with the most noteworthy evolution is epidemiology (epi), whose share of correct answers in PT43 increased by 22.56 percentage points with respect to [...]

Discussion
COVID-19 pandemic lockdowns triggered sweeping changes in virtually all areas of society. Medical education was no exception to this rule: most faculties switched to online teaching, either reducing practical lectures and patient contact or even cancelling them altogether. These changes took place against a backdrop of fear and concern about various aspects of medical teaching and learning [18]. We used PT data to investigate the impact of these events on knowledge gain.
According to our analysis, both tests conducted during the pandemic (PT42 in April 2020 and PT43 in November 2020) show a relevant increase in mean scores of 3.72 and 5.66, respectively, when compared to previous tests belonging to the same pair (PT37 and PT38, respectively). With a mean score increase of 2.53, this effect is not as strong in the case of the PTs that took place before the pandemic (PT41 and PT36). This is mirrored by the net changes per subject; while the prepandemic pair shows an average linear increase of 1.40%, this effect is much stronger for pairs PT37-PT42 (3.05%) and PT38-PT43 (4.21%). There are a few medical disciplines that even emerge as winners from the current situation; in fact, the outstanding performance improvement in epidemiology-related questions might be understood as a side effect of the pandemic.
There is a wide variety of circumstances that might influence academic performance on both individual and collective levels. However, the PT examination framework remained almost unchanged between PT36 and PT43 save for the implementation of technologically advanced test modalities at some of the participating medical schools. These test modalities were thus included in our models as fixed effects to quantify and delimit their influence on test scores. One effect we could not directly account for was COVID-19, which induced major study environment changes in the participating faculties in Germany and Austria over the period between PT41 and PT43. We, therefore, link the majority of the unaccounted differences in performance results to these changes. The results reported in the literature are not unanimous; some findings describe negative trends [3] or impacts [4,19] of the pandemic on the academic performance of medical students. On the other hand, there are findings where medical students have stressed the benefits of online lectures [6] and also studies concluding that their cognitive performance remained the same [5] or improved [20].
Our results suggest a sustained performance improvement widely spread across comparison groups and participating faculties; this improvement is also noticeable at the subject level, although its distribution among medical disciplines is somewhat uneven. In this context, one must keep in mind that the PT is a formative test that assesses the end objectives of the curriculum in contrast to summative examinations of specific course content [21].
It must be mentioned that first-semester students also performed better in the tests conducted during the pandemic; these students wrote their PT only a few weeks into their studies. This outcome may have been partly driven by new admission criteria introduced by some of the participating medical schools in 2020 [22]; nevertheless, we recommend further research at this point.
In summary, there is good evidence that the shift to distance learning prompted by COVID-19 resulted in an increased knowledge gain. However, we must remark that our study was limited to the domain of theoretical knowledge; further research on how the pandemic conditions may have also affected the acquisition of practical skills would be much needed in order to build a complete view of the broader topic.

Limitations
Scope of the Study.
One must keep in mind that the percentages of the different test pairs are not unconditionally comparable, as the difficulties of the chosen questions may differ.
On another note, this study only covers possible short-term effects of the COVID-19 pandemic on medical students; an investigation of medium-term or even long-term effects would require more prolonged monitoring of results.

Regression.
We treated overall score changes between semesters as if they were linear, but there might be certain semesters where the extent of the knowledge increase differs from the average. Variances in the model were heterogeneous, which might lead to underestimated standard errors.

Conclusion
The shift to distance learning prompted by COVID-19 resulted in an increased knowledge gain compared to Progress Tests administered before the pandemic. These findings could also be relevant in the future since they are descriptive (at least to some extent) of how medical schools in Germany and Austria used digitalization and online learning as tools to cope with the impact of an unforeseen critical event with major consequences. As such, these adjustments and their effects should not be overlooked since they could serve as a "dress rehearsal" [23] for future challenges on a global level. It is important to keep in mind that we are not able to give forecasts regarding effects on the practical skills or mental state of students.
One way or another, the current worldwide push for digital education makes it appropriate to build a corpus of evidence on its effects on student experience, even if now we are only able to discuss short-term developments.

Data Availability
The datasets generated and/or analyzed during the current study are not publicly available for data security reasons but are available from the corresponding author on reasonable request and after approval of the Progress Test cooperation partners and an extended ethical approval.

Ethical Approval
Ethical approval was granted by the Ethics Committee at the Charité (EA4/242/20). All methods were performed according to relevant guidelines and regulations. Regarding the usage of data about student performance in Progress Tests, the authors also refer to the local university law (BerlHG; §6) and the local examination regulations.

Disclosure
The preprint and its history can be accessed at https://www.researchsquare.com/article/rs-786483/v2.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
MM and VS outlined the concept and design of the article. IRA and MS developed the study design and prepared and analyzed the data. All authors discussed and evaluated the results. VS and JS drafted the introduction section. MM, IRA, and MS drafted the methods and results. VS and MM drafted the discussion and conclusion section. All authors contributed to the manuscript's revision. IRA performed the manuscript's language editing. All authors read and approved the final manuscript.