A Pilot Study of Detecting Individual Sleep Apnea Events Using Noncontact Radar Technology, Pulse Oximetry, and Machine Learning

The gold standard for assessing sleep apnea, polysomnography, is resource intensive and inconvenient. Thus, several simpler alternatives have been proposed. However, validations of these alternatives have focused primarily on estimating the apnea-hypopnea index (apnea events per hour of sleep), which means that information that is clearly important from a physiological point of view, such as apnea type, apnea duration, and temporal distribution of events, is lost. The purpose of the present study was to investigate whether this information could also be provided by combining radar technology and pulse oximetry to classify sleep apnea events on a second-by-second basis. Fourteen patients referred to home sleep apnea testing by their medical doctor were enrolled in the study (6 controls and 8 patients with sleep apnea; 4 mild, 2 moderate, and 2 severe) and monitored by Somnofy (a radar-based sleep monitor) in parallel with respiratory polygraphy. A neural network was trained on data from Somnofy and pulse oximetry against the polygraphy scorings using leave-one-subject-out cross-validation. Cohen’s kappa for second-by-second classifications of no event/event was 0.81, or almost perfect agreement. For classifying no event/hypopnea/apnea and no event/hypopnea/obstructive apnea/central apnea/mixed apnea, Cohen’s kappa was 0.43 (moderate agreement) and 0.36 (fair agreement), respectively. The Bland-Altman 95% limits of agreement for the respiratory event index (apnea events per hour of recording) were -8.25 and 7.47, and all participants were correctly classified in terms of sleep apnea severity. Furthermore, the results showed that the combination of radar and pulse oximetry could be more accurate than either technology alone. Overall, the results indicate that radar technology and pulse oximetry could reliably provide second-by-second information for no event/event, which could be valuable for the management of sleep apnea.
For the algorithm to be clinically useful, a larger study is necessary to validate it on a general population.


Introduction
Sleep apnea is characterized by repetitive reduction or cessation of airflow during sleep resulting in microarousals and is associated with increased risk of daytime sleepiness, coronary artery disease, stroke, and early death [1]. Despite being a serious disease, sleep apnea is underrecognized and underdiagnosed [2]. The gold standard for diagnosing sleep apnea is in-laboratory polysomnography (PSG) [3]. PSG uses a comprehensive set of sensors to measure brain, muscular, respiratory, and cardiovascular activity, and the collected data are manually analyzed by a sleep specialist. While PSG is accurate, it is also resource intensive and can be inconvenient for the patient, who must sleep with several sensors attached to their body. Overnight respiratory polygraphy (RP), which in contrast to PSG does not measure brain activity, is often used as a simpler alternative when diagnosing sleep apnea. However, RP is still resource intensive and inconvenient for the patient. To reduce the amount of manual work required, PSG and RP software have been enhanced with algorithms for automatically scoring apnea events at the cost of slightly reduced precision [4].
More convenient alternatives for assessing sleep apnea have been investigated [5,6], such as using only pulse oximetry [7,8]. Furthermore, recent papers have shown that radar technology could accurately assess sleep apnea without any sensors attached to the patient [9][10][11]. However, the combination of pulse oximetry and radar technology has not been investigated, although the combination of pulse oximetry and respiratory inductance plethysmography (RIP), which also measures respiratory effort, has been studied [12]. Moreover, previous research has focused primarily on measuring the apnea-hypopnea index (AHI = overall number of hypopneas and apneas per hour of sleep). The AHI is today used as the most important metric for categorizing sleep apnea severity. However, this practice is debatable, as the AHI does not take into account individual apnea type (hypopnea, obstructive apnea, central apnea, or mixed apnea), degree of desaturation, apnea duration, temporal distribution of events, or sleep disruption due to respiratory effort-related arousals (RERAs), all of which are clearly important from a physiological point of view [13,14]. An ideal tool for assessing sleep apnea would be less resource intensive and more convenient than PSG/RP, while still providing the same information.
The aim of the present study was to analyze how accurately the sleep assistant Somnofy can classify individual sleep apnea events using a combination of radar technology, pulse oximetry, and machine learning. For this purpose, sleep apnea events from Somnofy were compared to scorings from an RP-based home sleep apnea test (HSAT) on a second-by-second basis. The agreement between Somnofy and HSAT was also analyzed for the respiratory event index (REI = number of apneas and hypopneas per hour of recording).

Methods
2.1. Participants. Fourteen patients (9 males, 5 females) referred to HSAT by their medical doctor were enrolled in this study. The average age was 50.1 years, and the average body mass index (BMI) was 30.3. The inclusion criteria were age between 18 and 70, a medical history that indicated possible sleep apnea, and no history of upper airway surgery or use of nasal decongestants or anti-inflammatory medication in the three months prior to the study. No participants were excluded from the study. The study was approved by the Norwegian Ethical Committee (REK, id number 10445). Written informed consent was obtained from all participants. All methods were performed in accordance with relevant guidelines and regulations.
2.2. Procedure. All participants underwent HSAT for one night while also being monitored by Somnofy. The participants slept in a preorganized bedroom at a university hospital hotel where one Somnofy unit was placed in a nightstand position (by the head) and one in a wall position (above the head), both aimed at the participant's chest from a distance of approximately 1 meter. The setup is visualized in Figure 1. Since both Somnofy units recorded properly on all nights, one unit was randomly picked per patient to reduce bias toward sensor location in the analysis (9 nightstand, 5 wall).

Home Sleep Apnea Test. Nox T3 (Nox Medical, Iceland), a type 3 HSAT monitor, was used in this study [15]. Nox T3 measures respiration using a nasal cannula, a thermistor, thoracic and abdominal respiratory inductance plethysmography, and a pulse oximeter. Sleep apneas were manually scored by a trained specialist in accordance with The AASM Manual for the Scoring of Sleep and Associated Events: Rules, Terminology, and Technical Specifications [3] in the Noxturnal software (version 5.1.3.20388, Nox Medical, Iceland). Hypopneas were scored based on the recommended rules of ≥30% reduction in flow for ≥10 seconds and ≥3% oxygen desaturation from preevent baseline. Apneas, on the other hand, were scored if there was ≥90% reduction in flow from preevent baseline for ≥10 seconds. Apneas were classified as obstructive if there was inspiratory effort throughout the event, as central if there was no inspiratory effort, and as mixed if there was no inspiratory effort at the beginning of the event followed by inspiratory effort in the final part of the event.
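The scoring criteria quoted above can be summarized as a small decision function. This is only an illustrative sketch: in the study the events were scored manually by a specialist, and the function name, the argument names, and the reduction of inspiratory effort to two booleans (effort at the start and at the end of the event) are assumptions made here for clarity.

```python
from typing import Optional


def score_event(flow_reduction: float, duration_s: float, desaturation_pct: float,
                effort_at_start: bool, effort_at_end: bool) -> Optional[str]:
    """Return an event label, or None if none of the quoted criteria is met.

    flow_reduction is a fraction of the preevent baseline (0.9 = 90% reduction).
    All names are hypothetical; this is not the Noxturnal scoring logic.
    """
    if duration_s < 10:
        return None  # both apneas and hypopneas require at least 10 seconds
    if flow_reduction >= 0.90:
        if effort_at_start and effort_at_end:
            return "obstructive apnea"  # inspiratory effort throughout the event
        if not effort_at_start and not effort_at_end:
            return "central apnea"      # no inspiratory effort
        if not effort_at_start and effort_at_end:
            return "mixed apnea"        # effort absent at first, present at the end
        return None  # effort only at the start: not covered by the quoted rules
    if flow_reduction >= 0.30 and desaturation_pct >= 3.0:
        return "hypopnea"
    return None


print(score_event(0.95, 14, 1.0, False, True))
print(score_event(0.50, 20, 4.0, True, True))
```

The order of the checks matters: the apnea criterion is tested first so that an event with a ≥90% flow reduction is not labeled as a hypopnea.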
To simplify comparison of events with Somnofy, the scorings from Nox T3 were transformed into second-by-second classifications where each second could take the   [12]. The amplitude, the respiratory rate, and the SpO2 were resampled to a 1 Hz resolution and fed into a long short-term memory neural network (dense -> LSTM -> dense). The network was trained against the manually scored HSAT events with supervised learning. This way, the algorithm could learn the relationship between the radar-measured respiratory effort, the SpO2, and the different apnea types. Leave-one-subject-out cross-validation was used to both train and validate the algorithm on the same dataset. In other words, the sleep apnea scorings from Somnofy for one patient were based on an algorithm that had been trained only on the other patients.
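The two data-handling steps described above, expanding scored events into one label per second and holding out one subject at a time, can be sketched as follows. The integer class codes, the `(start, end, label)` event format, and the function names are assumptions introduced for illustration; the network itself is not reproduced here.

```python
# Hypothetical integer codes for the five classes used in the paper.
LABELS = {"no event": 0, "hypopnea": 1, "obstructive": 2, "central": 3, "mixed": 4}


def events_to_seconds(events, n_seconds):
    """Expand (start_s, end_s, label) events into one class code per second.

    end_s is exclusive; seconds outside any event keep the "no event" code.
    """
    per_second = [LABELS["no event"]] * n_seconds
    for start_s, end_s, label in events:
        for t in range(max(0, start_s), min(end_s, n_seconds)):
            per_second[t] = LABELS[label]
    return per_second


def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


night = events_to_seconds([(2, 5, "hypopnea")], n_seconds=8)
folds = list(leave_one_subject_out(["a", "a", "b", "c"]))
print(night)
print(folds[0])
```

With fourteen participants, this splitting scheme would give fourteen folds, each training on thirteen nights and testing on the remaining one.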
Somnofy is harmless to human beings and certified according to the Federal Communications Commission (FCC) and "Conformité Européenne" (CE). The frequency of the radar pulses enables them to travel through softer materials like bed sheets while being reflected by denser objects like the human body. If there are multiple persons in the room, the sensor will measure the respiration of only the closest person. For more details on Somnofy, the reader is referred to the validation study of Somnofy for sleep stage classification in healthy adults [16]. Somnofy is currently not an FDA-approved medical device.

2.5. Statistics. In order to analyze the performance of Somnofy for classifying individual sleep apnea events (providing information on event type, event duration, and temporal distribution of events), the agreement between Somnofy and HSAT was investigated for second-by-second classifications for each participant using Cohen's kappa. Cohen's kappa values were interpreted in the following way: values higher than .80 were considered as almost perfect agreement, .80 to .61 as substantial agreement, .60 to .41 as moderate agreement, .40 to .21 as fair agreement, .20 to .11 as slight agreement, and values less than .10 as no agreement [17,18].
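For readers unfamiliar with the statistic, the sketch below computes Cohen's kappa for two label sequences and maps the value onto the interpretation bands listed above. The study itself used scikit-learn; this dependency-free version is only for illustration, and the function names are hypothetical. Values that fall between two stated bands (for example 0.605) are assigned to the higher band here, which is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of each rater's marginal frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # undefined when both raters use a single class
    return (p_observed - p_expected) / (1.0 - p_expected)


def interpret_kappa(kappa):
    """Map kappa to the agreement bands quoted in the text."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa > 0.10:
        return "slight"
    return "no agreement"


reference = [0, 0, 1, 1]
predicted = [0, 0, 1, 0]
print(cohens_kappa(reference, predicted), interpret_kappa(0.43))
```

In the small example, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.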
The agreement on the REI between Somnofy and manual HSAT was analyzed using Bland-Altman analysis [19]. Bland-Altman plots were also generated for the hypopnea index (HI = number of hypopneas per hour of recording), the obstructive apnea index (OAI = number of obstructive apneas per hour of recording), the central apnea index (CAI = number of central apneas per hour of recording), and the mixed apnea index (MAI = number of mixed apneas per hour of recording). A confusion matrix was generated to compare the agreement on sleep apnea severity (control: REI < 5, mild: 5 ≤ REI < 15, moderate: 15 ≤ REI < 30, and severe: REI ≥ 30).
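The index and agreement calculations described above can be sketched in a few lines. The REI definition and the severity thresholds follow the text; the limits of agreement use the conventional Bland-Altman form (mean difference ± 1.96 sample standard deviations), which is assumed here rather than confirmed by the paper. Function names and the example numbers are hypothetical.

```python
from statistics import mean, stdev


def rei(n_events, recording_seconds):
    """Respiratory event index: events per hour of recording."""
    return n_events / (recording_seconds / 3600.0)


def severity(rei_value):
    """Severity category using the thresholds given in the text."""
    if rei_value < 5:
        return "control"
    if rei_value < 15:
        return "mild"
    if rei_value < 30:
        return "moderate"
    return "severe"


def bland_altman(device, reference):
    """Return (bias, lower limit, upper limit) of the 95% limits of agreement."""
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Made-up REI values for three nights, for illustration only.
print(severity(rei(40, 8 * 3600)))
print(bland_altman([6.0, 17.0, 33.0], [5.0, 18.0, 35.0]))
```

The limits are only meaningful when the paired differences are approximately normally distributed, so a check of that assumption would normally accompany this calculation.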
Calculations were performed in Python (v. 3.8.6), and Cohen's kappa was calculated with the scikit-learn package (v. 0.23.2).

Results
Table 1 shows demographic data and sleep parameters for the study participants. In total, the dataset consisted of 1 584 (34 010 seconds) apnea events, of which 628 (14 383 seconds), 823 (16 370 seconds), 76 (1 452 seconds), and 56 (1 805 seconds) were manually scored by HSAT as hypopnea, obstructive apnea, central apnea, and mixed apnea, respectively. Figure 2 displays the development of the respiratory waveform from Somnofy during one example of each event type. In these examples, the amplitude is clearly reduced during apnea events.

Second-by-Second Analysis.
Cohen's kappa for the binary classification no event/event between Somnofy with pulse oximetry (Somnofy+SpO2) and manually scored HSAT on a second-by-second basis for the whole dataset was 0.81, or almost perfect agreement. For classifying the three classes no event/hypopnea/apnea, Somnofy+SpO2 showed moderate agreement (Cohen's kappa = 0.43), while for classifying all five classes no event/hypopnea/obstructive apnea/central apnea/mixed apnea, the agreement was fair (Cohen's kappa = 0.36).
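The three-class and two-class comparisons reported above can be obtained from five-class labels by merging classes before computing agreement. The sketch below shows one way to do this; the integer class codes and the function name are assumptions, and the paper does not state whether the coarser results came from merged five-class output or from separately trained models.

```python
# Hypothetical codes: 0 = no event, 1 = hypopnea, 2 = obstructive apnea,
# 3 = central apnea, 4 = mixed apnea.
TO_THREE_CLASS = {0: 0, 1: 1, 2: 2, 3: 2, 4: 2}  # no event / hypopnea / apnea
TO_TWO_CLASS = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1}    # no event / event


def collapse(labels, n_classes):
    """Merge five-class per-second labels into a 3-class or 2-class scheme."""
    if n_classes == 5:
        return list(labels)
    if n_classes == 3:
        return [TO_THREE_CLASS[x] for x in labels]
    if n_classes == 2:
        return [TO_TWO_CLASS[x] for x in labels]
    raise ValueError("n_classes must be 2, 3, or 5")


five_class = [0, 1, 1, 2, 3, 4, 0]
print(collapse(five_class, 3))
print(collapse(five_class, 2))
```

Merging classes can only remove disagreements between the two scorings, which is consistent with the coarser schemes showing higher agreement.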

Journal of Sensors
The temporal distribution of events is further visualized in Figure 3, which shows classifications by both Somnofy+SpO2 and HSAT for two random nights in each severity group.

Analysis of Sleep Apnea Indexes and Severity.
Bland-Altman plots for the different sleep apnea indexes, calculated from individual nights, are provided in Figure 4. Somnofy+SpO2 tended to overestimate REI for low REI values and underestimate REI for high REI values. This trend was driven mostly by OAI. Furthermore, Somnofy+SpO2 underestimated both CAI and MAI. Consequently, the differences with HSAT were not normally distributed, and the Bland-Altman limits of agreement and bias are not statistically valid for these plots.
The differences in sleep apnea severity between Somnofy+SpO2 and manually scored HSAT are shown in the confusion matrix in Table 2. HSAT and Somnofy agreed on the severity of all participants. Table 3 shows the results for the same algorithm but trained with only radar data or only pulse oximetry. Both technologies alone showed promising results on a second-by-second basis and for overall indexes but were inferior to the results for Somnofy+SpO2.

Discussion
The results in the present paper indicate that the combination of radar technology and pulse oximetry can classify sleep apnea more accurately than either technology alone. Furthermore, the no event/event classification on a second-by-second basis showed almost perfect agreement with HSAT, providing information on temporal distribution of events and event duration. However, Cohen's kappa was lower than between manual PSG scorers on epoch ( [20]. REI per night for radar technology alone and for pulse oximetry alone was comparable to, but less accurate than, previous research [7][8][9][10]. Nikkonen et al. [8] also utilized a neural network on pulse oximetry, but their training dataset contained 1 692 nights compared to our 14. It is likely that the accuracy of the algorithm in the present study would improve with a larger dataset. Nevertheless, the results for Somnofy with pulse oximetry were as accurate as previous research [7][8][9][10], and the agreement was higher than that shown between RP and PSG [15,21]. Furthermore, HSAT and Somnofy with pulse oximetry agreed on the sleep apnea severity of all patients.
The algorithm showed only moderate agreement for distinguishing no event/hypopnea/apnea and only fair agreement for classifying all the different apnea types. Thus, it could not reliably detect the event type. As shown in Figure 2(a), it is not always straightforward to distinguish the apnea types. Here, the radar-measured respiratory effort behaves quite similarly for obstructive apnea and hypopnea. In addition, the HSAT scorings are not necessarily always correct [15,21], and the sensor data could be noisy. This motivates a machine learning approach. It is likely that the machine learning algorithm would improve with more data, especially for central and mixed apnea, where the dataset contained few events. However, this warrants further investigation. Recently, radar technology has been shown to accurately classify sleep in healthy adults [16] and to possibly detect body position during sleep (supine, prone, side) [22,23]. If radar technology could also reliably classify sleep in persons with sleep apnea, the proposed solution could calculate AHI, possibly detect RERAs, and investigate sleep disruption. With classifications on a second-by-second basis, sleep apneas could also be analyzed across the different sleep stages (light, deep, and REM sleep) and body positions. This would probably not be possible using only pulse oximetry. Using only radar technology, on the other hand, would not provide exact information on desaturation levels, unless this could be accurately estimated from the characteristics of the events.
Other sensors could be used to measure respiration instead of radars. Wearables, such as RIP [12], nasal cannula, and thermistor, are generally less convenient for the patient, who has to sleep with sensors attached to the body. This could disrupt sleep and make it more difficult to assess patient groups that do not accept wearing wires, and the data quality could be affected if the sensor is attached suboptimally or disturbed by movements during sleep. Nasal cannula and thermistor cannot measure inspiratory effort, and nasal cannula cannot measure mouth breathing. There are also other nonwearable alternatives, such as sound-based solutions [24] (subject to audio noise from the surroundings), vision-based solutions (affected by bed sheets), infrared solutions [25], and under-the-mattress solutions [26,27]. To the authors' knowledge, none of these alternatives has been shown to reliably classify individual sleep apnea type, apnea duration, or temporal distribution of events.
The ability to classify sleep apnea second by second as well as using exact measurements of oxygen desaturation levels may contribute to a more detailed and profound understanding of sleep apnea. Individual sleep apnea event severity, event duration, oxygen desaturation, temporal distribution of events, and sleep disruption are all clearly important from a physiological point of view [13,14]. Sleep apnea has a multifactorial pathogenesis [28] which has led to a multitude of options in both diagnostic and therapeutic measures. More detailed data could thus enable a more patient-specific tailored sleep apnea management. Furthermore, a more thorough understanding of the underlying pathophysiology will likely be instrumental in understanding comorbidities [29,30]. In the present study, we have shown that this information might not be limited to only PSG/RP.
As the proposed solution does not require manual scoring and the equipment does not need expertise to install (set radar on nightstand and put oximeter on finger), it should be more scalable and cost efficient than PSG/RP. If this solution could also reliably measure and diagnose sleep apnea, more people could receive sleep apnea assessments, diagnosis could be based on several nights of measurements to counteract the night-to-night variability in severity of sleep apnea [31], assessments could be performed in the patient's own bed, patients could be continuously monitored during treatment, and more data could be gathered for research purposes. Optimally, such an alternative should be as accurate and detailed as possible.

Limitations and Future Work. The dataset in the present study includes 1 584 events, which should be more than enough to validate the algorithm. However, the patient population is relatively small, and a larger study is needed to assess the clinical usefulness across age, sex, AHI, respiratory disturbance index (RDI), BMI, and in people with selected comorbidities. Accuracy should also be analyzed across sensor location and sleeping position. In contrast to the pilot study, a larger study should use PSG, since HSAT does not measure sleep and is therefore unable to detect RERAs or determine the AHI.

Conclusion
The present study indicates that the combination of radar technology and pulse oximetry could assess sleep apnea more accurately than either technology alone. Furthermore, the results show that classifications of no event/event could be performed reliably on a second-by-second basis, providing information on apnea duration and temporal distribution of events. This information is clearly important from a physiological point of view but has not been validated for radar technology or pulse oximetry, as the focus has been primarily on the AHI. The AHI is the most important clinical parameter today but does not give the complete picture of the disease. To increase the understanding and improve the management of sleep apnea, more information is needed. PSG/RP provides this information but is not scalable due to high cost and inconvenience. A scalable solution could collect data from a larger population and measure patients for longer periods of time. A larger

κ = Cohen's kappa for the whole dataset at one-second resolution; 5 class = no event/hypopnea/obstructive apnea/central apnea/mixed apnea; 3 class = no event/hypopnea/apnea; 2 class = no event/event; REI 95% LoA = Bland-Altman 95% limits of agreement for REI.