Assessing Neonatal Pain with NIPS and COMFORT-B: Evaluation of NICU's Staff Competences

Background. Pain is considered “the 5th vital sign” that should be regularly assessed in the neonatal intensive care setting. Although over 40 pain assessment tools have been developed for neonates, their implementation in everyday practice is challenging. Epidemiological studies demonstrate that pain is still underassessed and undertreated in European NICUs. Purpose. To evaluate the interrater and intrarater reliability of the NIPS and COMFORT-B scales among the staff members of a tertiary NICU 4 years after the scales were implemented in local pain guidelines with no prior dedicated training. Methods. Physicians and nurses were invited to evaluate 5 video recordings of infants hospitalized in the intensive care setting, using the NIPS and COMFORT-B scales. The assessment took place twice at a 3-month interval. Interrater reliability was calculated for both scales using Kendall's W coefficient of concordance and Krippendorff's alpha coefficient. Cohen's kappa was used to assess intrarater reliability. Results. 17 physicians and 19 nurses took part in the study. Interrater agreement for the COMFORT-B scale was above 0.8 for Kendall's W coefficient (p < .01) and above 0.667 for Krippendorff's alpha coefficient. Kendall's W coefficient for the NIPS scores ranged between 0.7 and 0.8 (p < .01). Krippendorff's alpha was above 0.667. Intrarater agreement for the COMFORT-B and NIPS scales was 0.693 and 0.724, respectively. Conclusions. Overall, the agreement between our staff members was moderately good for both scales. This is not enough to avoid inadequate pain assessment. More training is needed to improve the competence of NICU staff in using pain scales.


Introduction
Over 30 years ago, a study published by Anand et al. [1] demonstrated that inadequate analgesia during surgery in preterm babies resulted in a more pronounced metabolic stress response and an unstable clinical course in the postoperative period. Hence, the myth that the immature central nervous system precludes neonates from experiencing pain was rejected. Since then, neonatal pain research has made considerable progress in understanding the developmental aspects of postnatal nociception [2]. In the 1990s, distinct behavioural and physiological responses to painful stimuli were characterized by Craig et al. [3], which led to the development of numerous neonatal pain assessment tools. To date, over 40 scales to assess pain and/or sedation in neonates have been created, yet there is still no gold standard instrument [4]. Clinical guidelines on neonatal pain prevention and management [5][6][7] state that the assessment method should be adapted to the type of pain a neonate is experiencing, namely, acute, prolonged, or postoperative pain. Pain should be evaluated and documented every 4 to 6 hours and after each potentially painful procedure [7].
It has been demonstrated that implementing guidelines in everyday practice is challenging. In the EUROPAIN (EUROpean Pain Audit In Neonates) prospective observational study performed in 243 NICUs from 18 European countries [8], 31.8% of enrolled neonates received an assessment of continuous pain at least once during their NICU stay. Daily pain assessments occurred in only 10.4% of patients. It is notable that practices varied among countries with the common occurrence of pain assessment in French (100%), Dutch (80%), and Belgian (75%) NICUs. As for Polish NICUs, 2 (25%) out of 8 hospitals participating in the study reported performing continuous pain assessment. It was demonstrated that the presence of local NICU pain guidelines and nurses that specialized in pain management increased the odds for pain assessment.
In the Children's Memorial Health Institute in Warsaw, whose NICU also participated in the EUROPAIN study, there is a local guideline document regarding pain management. Medical charts are occasionally audited by the pain management services to verify that pain assessment is performed. Their staff also provides support in pharmacotherapy, if needed. In our standard of care, neonatal pain assessment is performed with the use of the Neonatal Infant Pain Scale (NIPS) in nonventilated patients and the COMFORT Behaviour (COMFORT-B) scale in ventilated patients. Nurses are provided with cards describing each scale at their workstations.
Unlike many reports on the implementation of pain assessment in hospital settings [9,10], pain scales were introduced in our department without prior extensive training or calculation of interrater reliability. To our knowledge, their Polish translations did not undergo crosscultural adaptation and validation. Yet, since their introduction in 2017, they have been meticulously documented in medical records. To improve our pain awareness and pain measurement, we conducted a study with the aim to evaluate the agreement between observers using both scales.

Materials and Methods
This was a prospective study conducted from January to April 2021 in the level 3 NICU of the Children's Memorial Health Institute in Warsaw. The study was approved by the local Institutional Review Board (study ID number: 21/KBE/2018).

Population and Design.
At the time, 77 nurses and 28 doctors were employed at the NICU. They were informed about the aim of the study and their role in it during a staff meeting. Those who did not attend the meeting were personally approached by the authors. Participation in the study was voluntary. The study procedure involved the evaluation of 5 video recordings of infants hospitalized in the intensive care setting, using the NIPS and COMFORT-B scales. Participants had 2 minutes to assess each video. The assessment took place twice at a 3-month interval. On each occasion, the assessment took place after the morning staff meeting in our department's conference room. The approximation of the minimum required sample size was based on Krippendorff's estimations [11]. We assumed that, for the NIPS, each of its 8 values (from 0 to 7) is equally likely to occur. In order to achieve the smallest acceptable reliability value of alpha 0.667 at the 0.05 level of statistical significance, the minimum reliability sample size was 71 units. This means that, with a fixed number of 5 videos to evaluate, we had to enroll at least 13 raters in the study. Additionally, we decided to increase the minimum number of observations made in this study to at least 100 based on the recommendations of the COSMIN Checklist [12], which rates a sample of over 100 as "excellent". This means that at least 20 raters had to be enrolled in the study. This sample should be sufficient to detect a value of Kendall's W coefficient of 0.8 (ρs1) with 80% power at the 0.05 level of significance, assuming that the null value (ρs0) equals 0.6 [13].

Video Recordings.
A convenience sample of 5 videos was selected to be evaluated by the study participants. Our aim was to ensure that participation in the study would not collide with the staff's everyday duties, hence the small number of videos to assess. Four of the videos were retrieved from the COMFORT Behaviour Scale instructional website (https://comfortassessment.nl/) [14]. The website provides video guides on how to evaluate each of the scale's items, as well as training videos for the full assessment. The videos selected for the study included: video 1: COMFORT score of 19/20; video 2: extreme scores (5) for "Calmness," "Alertness," "Respiratory response," and "Physical movement"; video 3: score 5 for "Crying"; video 4: score 3 for "Respiratory response". The 5th video was recorded in our department and presented a full-term neonate undergoing a venepuncture procedure, which is classified as moderately painful [15]. Written parental consent was obtained before the recording, and the venepuncture was clinically necessary.

Pain Assessment.
The Neonatal Infant Pain Scale (NIPS) is a tool developed in the early 1990s [16] to assess six behavioural reactions to painful procedures in preterm and full-term newborns. The scale was demonstrated to have high interrater reliability and internal consistency. It was validated for construct and concurrent validity. Its recommended use is for acute and postoperative pain, although its psychometric properties were mainly validated for acute pain [5]. It contains six items defined in Table 1. In order to provide the total NIPS score, participants in the study had to evaluate all of the items.
The COMFORT scale was developed to assess the levels of distress in PICU patients, as well as postoperative pain in children under 3 years of age. It consists of six behavioural items and two physiologic items: heart rate and mean arterial pressure. As physiological variables were demonstrated to have a weak correlation with pain behaviour, their exclusion from the scale led to the creation of the COMFORT-B scale, which contains only behavioural items. The scale is illustrated in Table 2. It is possible to omit one of the scale's items in the pain assessment. The total score is then computed by multiplying the total score for the other items by 6/5 [14]. The scale was validated for concurrent validity, internal consistency, and interrater reliability [17,19,20].
The data were analysed using IBM SPSS Statistics v. 27. Descriptive statistics were used to calculate median scores and interquartile ranges. Kendall's W and Krippendorff's alpha coefficients were calculated to evaluate interrater reliability (IRR) for COMFORT-B and NIPS total scores, as well as for items of each scale. Both coefficients are suitable for ordinal ratings with more than 2 raters [21].
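To make the interrater statistic more concrete, the following is a minimal sketch of Kendall's W using the textbook formula with a correction for tied ranks. It is not the SPSS procedure used in the study; the function names and the example ratings are hypothetical and serve only as an illustration.

```python
from collections import Counter


def average_ranks(scores):
    """Rank scores from 1..n; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for a list of rows, one row per rater, one score per subject."""
    m = len(ratings)        # number of raters
    n = len(ratings[0])     # number of rated subjects (e.g. videos)
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    # Tie correction: sum of (t^3 - t) over the tie groups of every rater.
    ties = sum(sum(t ** 3 - t for t in Counter(row).values()) for row in ratings)
    denom = m ** 2 * (n ** 3 - n) - m * ties
    if denom == 0:
        raise ValueError("W is undefined when every rater gives identical scores")
    return 12 * s / denom


# Hypothetical total scores from 3 raters for 5 videos (not study data).
example = [
    [19, 25, 12, 14, 27],
    [20, 24, 11, 15, 28],
    [18, 26, 13, 13, 25],
]
print(round(kendalls_w(example), 3))
```

W depends only on how each rater orders the videos, so two raters who use different parts of the scale but rank the videos identically still reach W = 1.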
For interpretation of coefficients, we assumed the labels suggested by Landis and Koch for the use of kappa: values between 0 and 0.20 indicate a slight IRR; values between 0.21 and 0.40 indicate a fair IRR; values between 0.41 and 0.60 indicate a moderate IRR; values between 0.61 and 0.80 indicate a substantial IRR; and values between 0.81 and 1.00 indicate an almost perfect IRR [22]. Additionally, for Krippendorff's alpha, it is accepted that its lowest conceivable limit is 0.667 [11]. Intrarater reliability was assessed using Cohen's kappa coefficient. Where applicable, tests were performed at 0.05 significance level. Missing values were omitted from the analyses.
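The interpretation rules above can be written down as a small lookup. This is an illustrative sketch only: the function names are hypothetical, values falling between two listed bands (for example 0.205) are assigned to the lower band as an assumption, and values below 0 are reported as lying outside the bands listed in the text.

```python
# Upper bounds of the Landis and Koch bands as listed in the text.
LANDIS_KOCH_BANDS = [
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]

# Smallest acceptable value of Krippendorff's alpha stated in the text.
ALPHA_MINIMUM = 0.667


def landis_koch_label(value):
    """Return the agreement label for a coefficient between 0 and 1."""
    if value < 0:
        return "below the listed bands"
    if value > 1:
        raise ValueError("coefficient cannot exceed 1")
    for upper, label in LANDIS_KOCH_BANDS:
        if value <= upper:
            return label


def alpha_is_acceptable(alpha):
    """Check alpha against the 0.667 limit (treated here as inclusive)."""
    return alpha >= ALPHA_MINIMUM


# The range of Kendall's W values reported for total scores.
for w in (0.736, 0.906):
    print(w, landis_koch_label(w))
```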

Results
36 members of our NICU staff took part in our study. The group included 5 doctors and 9 nurses with less than 5 years' experience in a neonatal intensive care unit. The remaining 12 doctors and 10 nurses had more than 5 years of NICU experience.
We obtained 180 and 170 total NIPS scores at the 1st and 2nd measurements, respectively. Total COMFORT-B scores amounted to 175 at both measurements. Total scores for all assessments are displayed as box and whisker plots (Figures 1 and 2). The percentage of observers who assessed 4 videos exactly as in the reference from the COMFORT training website is illustrated in Table 3. As for the 5th video, which showed a procedure considered to be moderately painful, the total scores displayed in Figures 1 and 2 are within the range of severe pain.

Interrater Reliability: Total Scores.
Interobserver agreement for the COMFORT-B and NIPS scales is presented in Tables 4 and 5, respectively. Kendall's W coefficient values (from 0.736 to 0.906) indicate substantial to almost perfect agreement between observers. Krippendorff's alpha coefficients are above the smallest acceptable value of 0.667, but below 0.8, which implies moderate interrater reliability. All reliability coefficients achieved higher values for the COMFORT-B scale and for the 2nd measurement in both scales.

Interrater Reliability: Scales' Items.
Interobserver agreement for the COMFORT-B and NIPS scales' items is presented in Tables 6 and 7. Overall, the values of reliability coefficients seem to be more consistent for the NIPS scores. The items that did not reach the minimum desired level of interrater reliability include "Breathing pattern" (both coefficients) and "Legs" (alpha). The observers showed almost perfect agreement (Kendall's W and Krippendorff's alpha >0.8) while assessing the following items of the NIPS: "Facial expression," "Cry," and "State of Arousal".
As for the COMFORT-B scale, Kendall's W coefficients were shown to be above the substantial agreement threshold for all items, but they rarely reached a value greater than 0.8. However, Krippendorff's alpha coefficients were below the acceptable agreement level for the following items of the COMFORT-B scale: "Alertness," "Respiratory response," "Crying," "Muscle tone," and "Facial tension".

Discussion
In this study, we evaluated agreement between a sample of our staff members in pain assessment using the NIPS and COMFORT-B scales 4 years after their introduction in our department. Contrary to other studies [9,10,14,23,24], we did not undergo intensive training before implementing these tools into our everyday practice. We only received cards describing each scale that are available at nurses' workstations. Our lack of training could explain why such a low percentage of our group assessed the videos in accordance with the COMFORT training website. Nevertheless, in [...] The results of the scales' items analysis are more conflicting. If we take into consideration only Krippendorff's alpha coefficients, we failed to demonstrate interrater agreement in 5 out of 7 items of the COMFORT-B scale and in 2 out of 5 items of the NIPS. It can be speculated that these results are due to our lack of training and also to technical difficulties related to applying some of these items to a video recording (e.g., evaluation of muscle tone or respiratory response). It is worth noting that results were more consistent for the NIPS scores, where we collected the same number of observations for all items, than for the COMFORT-B, where items differed in the number of observations. It is likely that our sample size in the COMFORT-B scale's item analysis was inadequate for the estimation of Krippendorff's alpha coefficients [25].
We used two different reliability coefficients that are suitable for ordinal ratings with more than 2 raters. They are based on different mathematical assumptions, which leads to different numerical values for the same datasets [26]. Krippendorff's alpha is considered to be a conservative measure of reliability that favours a more even distribution of cases across categories [26]. Kendall's W coefficient measures the associations between ratings with no assumptions regarding the nature of the probability distribution [27]. It is worth noting that all of Kendall's W statistics reached a significance of p < .01. Moreover, for the interpretation of Kendall's W coefficient, we employed the Landis and Koch benchmarks [22] that were originally designed for Cohen's kappa and are the most widely used in research. However, it is not certain whether they should be applied with regard to coefficients based on different assumptions than kappa [28]. In the studies related to pain assessment scales in neonates, Cohen's kappa, linearly weighted Cohen's kappa, and the intraclass correlation coefficient were the most widely used [4]. We are convinced that the reliability measures we chose to apply in our study are suitable for the dataset we had to analyse [28]. However, we are aware that selecting them instead of kappa statistics precludes comparison of our results with other studies involving pain assessment in neonates [12].

Table 2: Scoring and interpretation for the COMFORT-B scale. When performing the assessment, the infant is observed for 2 minutes. The healthcare professional must be in a position that permits a full view of the infant's face and body [17,18].

Alertness
1 Deeply asleep (eyes closed, no response to changes in the environment)
2 Lightly asleep (eyes mostly closed, occasional responses)
3 Drowsy (child closes his/her eyes frequently, less responsive to the environment)
4 Awake and alert (child responsive to the environment)
5 Awake and hyperalert (exaggerated responses to environmental stimuli)

Calmness/agitation
1 Calm (child appears serene and tranquil)
2 Slightly anxious (child shows slight anxiety)
3 Anxious (child appears agitated but remains in control)
4 Very anxious (child appears very agitated, just able to control)
5 Panicky (severe distress with loss of control)

Respiratory response (only in mechanically ventilated children)
[...]

Muscle tone
[...]
2 Reduced muscle tone; less resistance than normal
3 Normal muscle tone
4 Increased muscle tone and flexion of fingers and toes
5 Extreme muscle rigidity and flexion of fingers and toes

Facial tension
1 Facial muscles totally relaxed
2 Normal facial tone
3 Tension evident in some facial muscles (not sustained)
4 Tension evident throughout facial muscles (sustained)
5 Facial muscles contorted and grimacing

COMFORT-B score interpretation
Sedation levels: <10 oversedation, >23 undersedation [17]
Pain: >17 along with the numeric rating scale (NRS) > 4 indicate pain [18]
NRS can be substituted for any validated pain tool
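The score interpretation given in Table 2, together with the 6/5 proration for one omitted COMFORT-B item described in the Pain Assessment section, can be sketched as follows. This is an illustration of the arithmetic only, not clinical guidance; the function names and the returned dictionary are hypothetical.

```python
def comfort_b_total(item_scores):
    """Total COMFORT-B score from six item scores, or five if one was omitted.

    With one item omitted, the sum of the remaining five items is
    multiplied by 6/5, as described in the text.
    """
    if any(not 1 <= s <= 5 for s in item_scores):
        raise ValueError("each item is scored from 1 to 5")
    if len(item_scores) == 6:
        return float(sum(item_scores))
    if len(item_scores) == 5:
        return sum(item_scores) * 6 / 5
    raise ValueError("expected six item scores, or five if one item was omitted")


def interpret_comfort_b(total, nrs):
    """Apply the Table 2 cutoffs to a total score and an NRS rating."""
    if total < 10:
        sedation = "oversedation"
    elif total > 23:
        sedation = "undersedation"
    else:
        sedation = None  # no sedation flag under the listed cutoffs
    pain = total > 17 and nrs > 4
    return {"sedation": sedation, "pain": pain}


# Hypothetical assessment with one item omitted (not study data).
total = comfort_b_total([4, 4, 5, 4, 3])
print(total, interpret_comfort_b(total, nrs=6))
```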
To our knowledge, there has been only one study comparing the NIPS and COMFORT-B scales [29]. It demonstrated that while evaluating painful procedures, the NIPS has a significantly higher coefficient of variation (CV, 188% ± 99%) compared to the COMFORT scale (33% ± 8%). We did not identify any studies comparing the interrater reliability of both scales. However, they were used together as endpoints in several randomised controlled trials [30][31][32][33].
The main limitation of our study is that nurses were proportionally much less represented in our group than doctors. Only 19 of the 77 nurses employed at that time took part in our study, whereas the group of doctors included 17 of the 28 employed physicians. In our department, pain assessment is part of nurses' responsibilities. In case of elevated pain scores, the adjustments of pharmacological treatment are discussed with doctors. Therefore, it is essential for physicians to be familiar with the pain scales used in the NICU. In our study, Kendall's W coefficients indicated almost perfect interrater agreement among doctors and substantial agreement among nurses. Given the importance of pain assessment, it should be our aim to achieve agreement above 0.8 between nurses. The results of our study imply there is a need for more training in using pain assessment tools. The strength of our study is that it shows the real-life experience of a tertiary NICU, where the strain of everyday duties and work overload leads at times to omission of training in matters that seem to be intuitive and less vital than life-saving procedures.
There is growing evidence that early life exposure to painful stimuli leads to long-term consequences such as altered pain sensitivity [34][35][36][37], impaired cognitive, behavioural, and motor development [38], and structural changes in the central nervous system detected in MRI studies [39][40][41]. As much as there is no doubt that pain prevention and management are crucial in neonatal care, the introduction of pain assessment tools in everyday practice is a challenge. Newborns hospitalized in NICUs are affected by different types of pain, namely, acute, postoperative, and prolonged pain. Most of the available pain scales were validated for acute pain, while tools for the evaluation of prolonged pain are scarce. Moreover, it is known that the severity of illness may affect the pain expression in neonates.
Given that most behavioural pain scales are based on pain expression indices, it has not been established yet whether the cutoff values used for pain assessment should be different for more severely ill patients [42]. It is evident that the "one-size-fits-all" approach to pain assessment in neonates is unsatisfactory. Staff members should be trained to recognise different types of pain in a given clinical context and apply assessment tools accordingly. However, some scales require the evaluation of so many parameters that it is difficult for a single caregiver to measure them accurately. In other cases, the intensive care setting, with tubes and tapes covering patients' faces, precludes appropriate assessment of facial expressions. Furthermore, the main goal of pain assessment is to intervene with pain-alleviating treatment when needed. A study conducted in New York showed that pain scores documented in medical charts did not influence analgesic medication practices [43]. Some argue that there is no evidence that using standardized pain assessment tools improves patient outcomes [44]. Thus, efforts should be more focused on pain detection in everyday practice, while validated tools should be reserved for research purposes [45]. It is also advisable to engage parents in pain assessment, as they might be more motivated to detect pain than healthcare workers [46,47]. Until better pain assessment tools are available, it is in the best interest of NICU patients that healthcare providers focus on pain detection combined with improvement of their competence in using validated pain scales. The latter may be achieved by regular training with evaluation of interrater reliability among staff members. We believe this study to be a starting point for us to improve our pain assessment with the use of both scales.

Conclusions
Results of our study demonstrate that implementing pain scales without prior training may lead to a moderately good interrater agreement among staff members. Reliability values estimated here are not high enough to avoid inadequate pain assessment. Therefore, the development of a dedicated training programme is essential to improve our daily practice. Education should be focused on items of both scales that we identified to yield the most inconsistent scores among our staff members.

Data Availability
The data used to support the findings of this study are available from the corresponding author (e.sarkaria@ipczd.pl) upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this study.