Medical simulation has become established as a safe and effective tool for training and assessing performance and competency in individuals and teams responsible for patient care [
A fundamental requirement common to the interpretation of simulation outcomes is that the assessment instruments selected be validated. A newly conceived instrument, such as one to assess clinical performance, may have latent flaws that must be identified and corrected before implementation [
We describe the development of methodology for assessing subject performance using a screen-based simulator in the setting of a simulated operating room. We designed our screen-based interface to allow subjects to log their observations, proposed interventions, and courses of action during simulated intraoperative emergencies. An important benefit of this interface is that expert raters are able to later review the responses logged by subjects while remaining completely blinded to subject identity, gender, and experimental condition.
Screen-based simulation has had an important role in the field of anesthesiology [
The institutional review boards at the University of Miami and Jackson Health System reviewed and approved this study. Twenty first-year clinical anesthesia (CA-1) residents participated in this study after providing informed consent.
A custom graphical user interface (GUI) was developed in the MATLAB (MathWorks, Natick, MA) environment. The GUI frontend was designed to combine the displays of the GE monitor and Datex-Ohmeda ventilator into a single display (Figure
Graphical user interface (GUI) used in simulation experiments. Parameters were updated at one-second intervals based on values read from an XLS file. The GUI featured a responsive pulse oximeter auditory display and IEC alarms that would annunciate when parameter alarm thresholds were transcended. Subjects entered answers to distractor questions and responses to state changes in the text entry box.
Simulation scripts were conceived and written in XLS format (Figure
Screenshot of a portion of the “symptomatic bradycardia” XLS script. Note cell AA752 which shows the first time the heart rate drops below 60 bpm and surpasses an alarm threshold. Cell AE752 programmatically changes to a value of 1 which instructs the GUI to annunciate the appropriate IEC alarm, in this case the medium priority cardiac alarm “cardmed.”
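The script-driven alarm logic can be illustrated with a short sketch (Python here for illustration; the actual implementation was a MATLAB GUI reading XLS scripts). The 60 bpm threshold follows the bradycardia scenario described above; the field names and flag convention are assumptions.

```python
# Illustrative sketch (not the authors' MATLAB code): a script-driven
# display raises an IEC-style alarm flag when a vital-sign value crosses
# a threshold, as in the "symptomatic bradycardia" script where the alarm
# column switches to 1 once heart rate drops below 60 bpm.
# Field names and the flag convention here are assumptions.

ALARM_THRESHOLDS = {"hr_low": 60}  # bpm; limit taken from the scenario description

def annotate_alarms(rows):
    """Given one row per second with an 'hr' value, set the medium-priority
    cardiac alarm flag to 1 on every row where heart rate is below the limit."""
    out = []
    for row in rows:
        flagged = dict(row)
        flagged["cardmed_alarm"] = 1 if row["hr"] < ALARM_THRESHOLDS["hr_low"] else 0
        out.append(flagged)
    return out

script = [{"t": 750, "hr": 62}, {"t": 751, "hr": 61}, {"t": 752, "hr": 58}]
annotated = annotate_alarms(script)
print([r["cardmed_alarm"] for r in annotated])  # [0, 0, 1]
```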
Plot showing the changes to relevant parameters in the “hypovolemia” scenario. Near the beginning, heart rate gradually increases over 5 minutes but does not surpass the alarm threshold. Later in the scenario, a low blood pressure is measured and the appropriate alarm sound annunciated. All parameters normalize and revert to baseline levels before the end of the scenario.
A set of 100 questions relating to the practice of anesthesiology was created. The questions were menial and tedious, usually requiring simple calculations to arrive at the answer. For example, “calculate the BMI for a 28 yo female who is 5 foot 9 inches and 225 pounds,” “calculate the paO2/FiO2 ratio when paO2 = 107 mm Hg and FiO2 = 80%,” and “during general anesthesia, a mixture of 60% N2O and 40% O2 is being administered to a patient. Assuming the flow rate of O2 is 2 liter/min, what is the flow rate of N2O?” Some questions required reference to a pharmacopeia, for example, “what is the renal dosing for tiagabine? You can use computer/phone (e.g., Epocrates
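For reference, the arithmetic behind three of the example questions can be sketched as follows (Python used purely for illustration; the unit conversions are standard):

```python
# Worked answers to three of the distractor questions, illustrating the
# "simple calculation" difficulty level described above.

def bmi(weight_lb, height_in):
    """Body mass index: weight (kg) / height (m) squared."""
    kg = weight_lb * 0.453592
    m = height_in * 0.0254
    return kg / m ** 2

def pf_ratio(pao2_mmhg, fio2_fraction):
    """paO2/FiO2 ratio, with FiO2 expressed as a fraction."""
    return pao2_mmhg / fio2_fraction

def n2o_flow(o2_flow_lpm, n2o_pct, o2_pct):
    """Gas flows are proportional to the delivered gas fractions."""
    return o2_flow_lpm * n2o_pct / o2_pct

print(round(bmi(225, 69), 1))         # 33.2 (5 ft 9 in = 69 in)
print(round(pf_ratio(107, 0.80), 1))  # 133.8
print(n2o_flow(2.0, 60, 40))          # 3.0 L/min
```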
To allow expert raters to review subject responses during emergencies in a blinded fashion, stem plots were generated that organized subject responses into three categories of information (Figure
Stem plot showing the responses entered by a subject into the GUI during the “hypovolemia” scenario. Note that the
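The logic behind assembling these blinded timelines can be sketched as follows (an illustrative Python sketch, not the authors' MATLAB code; the entry format and example text are assumptions). The three categories follow the interface description: observations, proposed interventions, and courses of action.

```python
# Illustrative sketch: timestamped free-text responses are grouped into
# the three logged categories so a rater sees only what was typed and
# when, with no subject identifiers carried through.
from collections import defaultdict

def build_timeline(log_entries):
    """log_entries: list of (seconds_into_scenario, category, text).
    Returns {category: [(t, text), ...]} sorted by time."""
    timeline = defaultdict(list)
    for t, category, text in log_entries:
        timeline[category].append((t, text))
    for category in timeline:
        timeline[category].sort()
    return dict(timeline)

log = [(95, "observation", "HR trending up"),
       (60, "observation", "BP low"),
       (130, "intervention", "open IV fluids wide"),
       (150, "action", "gave 250 mL bolus")]
tl = build_timeline(log)
print([t for t, _ in tl["observation"]])  # [60, 95]
```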
The Ottawa Crisis Resource Management Global Rating Scale and Simulation Session Crisis Management Skills Checklist [
The Crisis Management Checklist consists of three subscales for the assessment of the ability to detect state changes, to be situationally aware, and to initiate proper therapy or interventions. Each of these categories has individual items related to timeliness, completeness, appropriateness, and prioritization. Raters can score each item trichotomously as “yes” (2 points), “marginal” (1 point), or “no” (0 points). Some items such as “missed detection” are scored with negative points.
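The trichotomous scoring rule can be sketched as follows (illustrative Python; the penalty magnitude and the convention that a "yes" on a penalty item records the lapse are assumptions, since the published checklist defines the actual values):

```python
# Minimal sketch of the checklist scoring described above:
# "yes" = 2, "marginal" = 1, "no" = 0, with penalty items such as
# "missed detection" contributing negative points.
# The penalty value of -1 is an illustrative assumption.

SCORES = {"yes": 2, "marginal": 1, "no": 0}

def checklist_total(ratings, penalty_items=("missed detection",), penalty=-1):
    total = 0
    for item, answer in ratings.items():
        if item in penalty_items:
            # Penalty items subtract points when the lapse occurred.
            total += penalty if answer == "yes" else 0
        else:
            total += SCORES[answer]
    return total

ratings = {"timely/prompt detection": "yes",
           "complete detection": "marginal",
           "missed detection": "no"}
print(checklist_total(ratings))  # 3
```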
Using the Global Rating Scale and Crisis Management Checklist assessment instruments, subject performance was evaluated by two raters with clinical expertise in anesthesiology who were blinded to the experimental condition and subject identity. The reviewers were asked to examine the stem plots of the subject responses logged during simulation experiments (see Figure
All statistical analyses were performed using SPSS software suite (IBM®). Interrater reliability was assessed by calculating the intraclass correlation coefficient (ICC) [
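The consistency form of the ICC used here can be computed directly from a subjects-by-raters score matrix; a minimal sketch follows (equivalent in form to SPSS's "two-way mixed, consistency, single measures" option, often written ICC(3,1); the example data are invented):

```python
# Sketch of the two-way mixed-effects consistency ICC, ICC(3,1),
# from the standard two-way ANOVA decomposition. Example data invented.

def icc_3_1(scores):
    """scores: list of rows, one per subject, each with k rater scores."""
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Two raters in perfect agreement up to a constant offset score ICC = 1.0
# for consistency (an absolute-agreement ICC would penalize the offset).
scores = [[10, 12], [14, 16], [8, 10], [20, 22]]
print(round(icc_3_1(scores), 3))  # 1.0
```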
The two expert raters assessed subject performance using the Global Rating Scale and Crisis Management Checklist. Each subject was rated 6 times per rater, once for each scenario, and the total number of ratings from each rater on 20 subjects was 120. Tables
Internal consistency of Global Rating Scale considering all emergency scenarios.
Item | Corrected item-total correlation | Cronbach’s α if item deleted
---|---|---
Overall performance | 0.909 | 0.726 |
State change detection | 0.663 | 0.800 |
Situational awareness | 0.828 | 0.747 |
Therapy/resource utilization | 0.794 | 0.760 |
Subject perception of crisis resolution | 0.117 | 0.930 |
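Both statistics in the table above can be reproduced from a subjects-by-items score matrix; a minimal sketch follows (the data below are invented for illustration, not taken from the study):

```python
# Sketch of the internal-consistency statistics tabulated above:
# Cronbach's alpha, plus "alpha if item deleted", which flags items
# whose removal raises alpha (as with "subject perception of crisis
# resolution" in the table). Example data invented.

def _var(values):
    """Sample variance (ddof = 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(items):
    """items: list of rows, one per observation, k item scores each."""
    k = len(items[0])
    item_vars = sum(_var([row[j] for row in items]) for j in range(k))
    total_var = _var([sum(row) for row in items])
    return k / (k - 1) * (1 - item_vars / total_var)

def alpha_if_deleted(items):
    """Recompute alpha with each item removed in turn."""
    k = len(items[0])
    return [cronbach_alpha([[row[i] for i in range(k) if i != j] for row in items])
            for j in range(k)]

# Perfectly correlated items give alpha = 1.0.
scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
print(round(cronbach_alpha(scores), 3))  # 1.0
```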
Internal consistency of Crisis Management Checklist considering all emergency scenarios.
Item | Corrected item-total correlation | Cronbach’s α if item deleted
---|---|---
State change detection | ||
Timely/prompt detection | 0.457 | 0.885 |
Complete detection | 0.545 | 0.879 |
Missed detection | 0.533 | 0.881 |
Situational awareness | ||
Complete/correct differential | 0.800 | 0.862 |
Prioritized differential list | 0.774 | 0.864 |
Reassesses situation | 0.759 | 0.867 |
One or more incorrect diagnoses | 0.183 | 0.894 |
Therapy/resource utilization | ||
Timely therapy | 0.696 | 0.870 |
Prioritized actions | 0.771 | 0.864 |
Appropriate therapy/action | 0.772 | 0.864 |
One or more inappropriate actions | 0.177 | 0.893 |
Tables
Interrater agreement for the Global Rating Scale.
Item | Circuit disconnect | Bradycardia | Endobronchial intubation | Hypovolemia | Pulmonary embolism | Light anesthesia | All scenarios |
---|---|---|---|---|---|---|---|
Overall performance | 0.756 | 0.900 | 0.772 | 0.789 | 0.908 | 0.844 | 0.804 |
State change detection | 0.796 | 0.907 | 0.883 | 0.888 | 0.828 | 0.537 | 0.819
Situational awareness | 0.889 | 0.848 | 0.872 | 0.673 | 0.964 | 0.835 | 0.866
Therapy/resource utilization | 0.756 | 0.798 | 0.739 | 0.825 | 0.805 | 0.867 | 0.787
Subject perception of crisis resolution | 0.683 | 0.633 | 0.740 | 0.529 | 0.414 | 0.538 | 0.624
Total | 0.798/0.852 | 0.899/0.906 | 0.812/0.860 | 0.849/0.838 | 0.892/0.929 | 0.760/0.826 | 0.825/0.856
Interrater agreement was assessed by calculating the intraclass correlation coefficient (ICC) using a two-way mixed-effects model for consistency between the two expert raters’ responses.
Interrater agreement for the Crisis Management Checklist.
Item | Circuit disconnect | Bradycardia | Endobronchial intubation | Hypovolemia | Pulmonary embolism | Light anesthesia | All scenarios |
---|---|---|---|---|---|---|---|
State change detection | 0.699/0.732 | 0.521 | 0.807/0.839 | 0.856/0.817 | 0.495 | 0.121 | 0.639/0.674
Timely/prompt detection | 0.730 | 0.463 | 0.791 | 0.733 | 0.333 | 0.301 | 0.593
Complete detection | 0.640 | 0.838 | 0.729 | 0.618 | 0.506 | 0.649 | 0.655
Missed detection | 0.248 | † | 0.518 | 0.000 | 0.487 | −0.366 | 0.088
Situational awareness | 0.835/0.838 | 0.946/0.952 | 0.834/0.856 | 0.773/0.773 | 0.907/0.937 | 0.798/0.803 | 0.844/0.852
Complete/correct differential | 0.753 | 0.885 | 0.820 | 0.724 | 0.889 | 0.790 | 0.821
Prioritized differential list | 0.710 | 0.913 | 0.790 | 0.825 | 0.857 | 0.794 | 0.807
Reassesses situation | 0.733 | 0.387 | 0.739 | 0.533 | 0.647 | 0.708 | 0.620
One or more incorrect diagnoses | 0.910 | 0.654 | 0.158 | † | 0.627 | 0.557 | 0.565
Therapy/resource utilization | 0.917/0.917 | 0.886/0.886 | 0.705/0.711 | 0.888/0.888 | 0.763/0.784 | 0.934/0.934 | 0.842/0.852
Timely therapy | 0.945 | 0.647 | 0.594 | 0.795 | 0.798 | 0.871 | 0.793
Prioritized actions | 0.857 | 0.681 | 0.570 | 0.914 | 0.733 | 0.840 | 0.780
Appropriate therapy/action | 0.770 | 0.432 | 0.609 | 0.681 | 0.515 | 0.851 | 0.658
One or more inappropriate actions | † | † | 0.000 | † | 0.487 | † | 0.314
Total | 0.903/0.898 | 0.954/0.965 | 0.871/0.860 | 0.917/0.931 | 0.870/0.908 | 0.850/0.881 | 0.878/0.890
Interrater agreement was assessed by calculating the intraclass correlation coefficient (ICC) using a two-way mixed-effects model for consistency between the two expert raters’ responses.
Overall subject performance assessment scores from the Global Rating Scale (a) and Crisis Management Checklist (b). Individual rater and average ratings are shown. The bars depict standard deviations.
Interrater agreement when considering all scenarios was good for each subscale in the Crisis Management Checklist (Table
Correlation between the Global Rating Scale and Crisis Management Checklist total scores (averaged across all six emergency scenarios) was high (Spearman rank correlation = 0.948,
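The Spearman rank correlation can be computed without a statistics package; a minimal sketch follows (the per-subject totals below are invented for illustration, not the study data):

```python
# Sketch of the convergent-validity check: Spearman rank correlation
# computed as the Pearson correlation of ranks (average ranks for ties).

def spearman_rho(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average rank for a tied block
            for idx in order[i:j + 1]:
                r[idx] = avg
            i = j + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented totals: any strictly monotone relationship gives rho = 1.0.
grs_totals = [14.2, 15.8, 17.5, 18.1, 20.3]
cmc_totals = [9.5, 10.4, 11.9, 12.2, 13.1]
print(round(spearman_rho(grs_totals, cmc_totals), 3))  # 1.0
```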
Subject performance assessment scores from the Global Rating Scale (a) and Crisis Management Checklist (b) based on emergency scenario. Individual rater and average ratings are shown. The bars depict standard deviations.
Estimation of effect sizes.
 | Mean 1 | Mean 2 | Difference | % difference | Cohen’s d
---|---|---|---|---|---
GRS | | | | |
Median | 15.1 (2.8) | 20.2 (1.9) | 5.1 | 25.2 | 1.5
Quartile | 17.2 | 18.6 | 1.4 | 7.4 | 0.4
Scenario | 15.8 | 19.8 | 4.0 | 25.2 | 0.6
CMC | | | | |
Median | 10.0 (2.2) | 13.0 (0.9) | 3.0 | 23.3 | 1.3
Quartile | 11.6 | 12.3 | 0.6 | 5.0 | 0.3
Scenario | 10.2 | 12.5 | 2.4 | 23.1 | 0.6
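Cohen's d with a pooled standard deviation can be sketched as follows. The group sizes and pooling convention here are assumptions, so this sketch is not guaranteed to reproduce the tabulated values, which may use a different denominator:

```python
# Sketch of a standard Cohen's d effect-size calculation using the
# pooled (within-group) standard deviation. The table above may use a
# different pooling convention, so values need not match exactly.

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d: mean difference divided by the pooled SD."""
    pooled = (((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean2 - mean1) / pooled

# Invented example: two groups of 10 whose means differ by one pooled SD.
print(round(cohens_d(10.0, 2.0, 10, 12.0, 2.0, 10), 2))  # 1.0
```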
Simulation-based experiments offer a viable controlled strategy to test hypotheses and interventions before implementation in actual clinical settings. We are interested in developing and testing techniques for characterizing the impact of intraoperative factors on anesthesiologist performance and patient safety. We have developed a novel screen-based interface and adapted previously validated Global Rating Scale and Crisis Management Checklist instruments for assessing performance in our simulator. The results presented here demonstrate the feasibility of this methodology as a tool for blinded assessment of subjects by expert raters and accomplish the first step in validating our performance assessment instruments.
One of the fundamental features of our screen-based interface is that expert raters assess performance based on subject responses and actions logged through the interface, assuring that raters are blinded to subject identity and experimental condition. Automated timestamping of the logged responses facilitated the assembly of the stem plot timelines (see Figure
Internal consistency was good for both the Global Rating Scale and Crisis Management Checklist instruments, as assessed with Cronbach’s α
The high correlation (0.948) between the Global Rating Scale and Crisis Management Checklist ratings suggests good convergent validity for the two instruments. However, it has been pointed out that caution in interpreting this result is warranted because the same expert raters assessed subject performance with both instruments, and, as a result, scores for each instrument cannot be assumed to be truly independent of the other [
With revised assessment instruments, the next experiments will be guided in part by the effect sizes and variability of responses observed here. Though not optimal, we chose to roughly estimate possible effect sizes by comparing subject scores between two groups that straddle the median total score (averaged across all emergency scenarios) (Table
Screen-based simulation is considered less realistic than mannequin-based simulation; however, using our interface within the context of a fully functional replica of an OR that included a METI mannequin likely helps mitigate this penalty. Mannequin-based simulators may be better suited than screen-based simulators to assessing behavioral outcomes involving leadership, group dynamics, and communication skills, but the outcomes for this pilot study concerned individual performance in managing intraoperative emergencies, and evidence exists that screen-based simulation can be effective for assessing performance in crisis management training [
In addition to limitations discussed above, a fundamental limitation of the current study stems from the fact that, at this pilot stage, there is no way to ascertain that the construct measured by the Global Rating Scale and Crisis Management Checklist actually equated to subject performance. Relative to simulation for education, there are numerous challenges inherent in using simulation as a tool for assessment [
We have previously shown that intraoperative noise increases anesthesia resident perception of fatigue and task load in an OR simulator that approximates environmental conditions in our clinical ORs [
We demonstrate the feasibility of a screen-based simulation experiment for blinded assessment of resident performance while managing intraoperative emergencies. Our modified global assessment and checklist instruments show good internal consistency, interrater reliability, and convergent validity. The next phase of experiments will be to determine discriminant ability of our setup in residents at different levels of training.
See Figures
Global Rating Scale used by expert raters to assess performance of subjects in simulations.
The Crisis Management Checklist used by expert raters to assess performance of subjects in simulations.
This work should be attributed to the Department of Anesthesiology, University of Miami.
The authors declare that there is no conflict of interests regarding the publication of this paper and regarding the funding that they received.
The Anesthesia Patient Safety Foundation is acknowledged for funding this research. The University of Miami—Jackson Memorial Hospital Center for Patient Safety is acknowledged for providing use of its operating room simulator.