MCFN: A Multichannel Fusion Network for Sleep Apnea Syndrome Detection

Sleep apnea syndrome (SAS) is one of the most common sleep disorders and affects human life and health. Many researchers use deep learning methods to automatically learn features from physiological signals. However, these methods ignore the different effects of the multichannel features obtained from various physiological signals. To solve this problem, we propose a multichannel fusion network (MCFN), which learns multilevel features through a convolutional neural network on different respiratory signals and then reconstructs the relationship between feature channels with an attention mechanism. MCFN effectively fuses the multichannel features to improve SAS detection performance. We conducted experiments on the Multi-Ethnic Study of Atherosclerosis (MESA) dataset, consisting of 2056 subjects. The results show that the proposed network achieves an overall accuracy of 87.3%, outperforming other SAS detection methods, and can better assist sleep experts in diagnosing sleep disorders.


Introduction
Sleep apnea syndrome (SAS) is a common sleep-breathing disorder characterized by repetitive events of complete or partial cessation of breathing during sleep [1]. SAS often occurs in men and women aged 30 to 60 years or older [2]. The main symptoms of SAS are daytime sleepiness, tiredness, and inattention. Most SAS patients are undiagnosed and untreated, which may lead to health problems such as heart and brain diseases [3][4][5][6].
SAS includes two important sleep events: obstructive sleep apnea (OSA) and hypopnea. According to the American Academy of Sleep Medicine (AASM) manual [7], OSA is scored when there is a 90% or greater reduction of the airflow amplitude from the pre-event baseline while respiratory effort continues in the thoracic and abdominal belts. Hypopnea is scored when there is a 30% or greater reduction of the airflow from the pre-event baseline together with an oxygen desaturation of 3% or more from the pre-event baseline. Each OSA or hypopnea event lasts longer than 10 s. Normal sleep is scored when there is no OSA or hypopnea event, or when its duration is less than 10 s.
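These scoring rules can be sketched as a simple decision procedure. The function below is only an illustrative sketch of the AASM-style thresholds quoted above, not the labeling pipeline used in the paper; the name `label_event` and its inputs are hypothetical, and the check for continued respiratory effort in the belts is omitted for brevity.

```python
def label_event(flow_drop, desat, duration_s):
    """Classify a candidate respiratory event per the AASM-style rules above.

    flow_drop:  fractional reduction of airflow amplitude relative to the
                pre-event baseline (e.g., 0.95 means a 95% drop).
    desat:      oxygen desaturation from the pre-event baseline, in percent.
    duration_s: event duration in seconds.
    """
    if duration_s < 10:
        return "normal"        # events shorter than 10 s are not scored
    if flow_drop >= 0.90:
        return "OSA"           # >=90% airflow reduction
    if flow_drop >= 0.30 and desat >= 3.0:
        return "hypopnea"      # >=30% reduction with >=3% desaturation
    return "normal"
```

For example, a 95% airflow drop lasting 15 s is scored as OSA, while the same drop lasting only 5 s is scored as normal sleep.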
SAS is traditionally diagnosed with polysomnography (PSG), the gold standard. PSG measures several signals, such as respiratory, electrocardiography (ECG), blood oxygen saturation, electroencephalography (EEG), and body movement signals. However, it is expensive and inconvenient because patients need a variety of sensors attached to their bodies, and it is time-consuming because the signals are analyzed manually. Therefore, it is necessary to propose alternative methods for automatic SAS detection using fewer physiological signals.
Various physiological signals have been used to detect sleep events [8][9][10]. Among them, respiratory signals directly reflect the breathing situation during sleep [11]. The respiratory signal can be measured directly from the airflow sensor and from thoracic and abdominal belts. Several methods have been applied to SAS detection, such as thresholding, support vector machines (SVM), logistic regression (LR), and k-nearest neighbors (k-NN) [12][13][14][15][16]. These methods extract time-domain, frequency-domain, and other nonlinear features from physiological signals. However, manual feature extraction is difficult to perform on noisy signals and requires domain knowledge.
Deep learning networks are alternatives, as they can learn informative features without prior domain knowledge. Many researchers use long short-term memory (LSTM) and convolutional neural networks (CNNs) to classify physiological signals [17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32]. In particular, the CNN is a popular class of deep learning network that can automatically learn and find features in physiological signals. Haidar et al. [22] demonstrated the efficacy of CNN models in classifying apnea or hypopnea events using airflow respiratory signals, with an accuracy of 77.6%. When a wavelet spectrogram of the airflow respiratory signal was input to the network, the accuracy was 79.8%. When abdominal and thoracic respiratory signals are used simultaneously, the performance reaches 83.5% [23]. Urtnasan et al. [24] proposed a method for automated OSA detection from a single-lead ECG using a CNN. Choi et al. [25] used a CNN and a single-channel nasal pressure signal to detect apnea-hypopnea events in real time; the nasal pressure signals were adaptively normalized and segmented by sliding a 10 s window at 1 s intervals. Many researchers use the LSTM model for SAS detection to learn the temporal features of sleep events. Van Steenkiste et al. [26] used LSTM to detect sleep apnea from raw respiratory signals, obtaining 77.2% accuracy. Elmoaqet et al. [27] used LSTM and bidirectional long short-term memory (Bi-LSTM) to detect three sleep events and achieved an average accuracy of 83.6%. Yu et al. [32] proposed a sleep staging method based on EEG signals combined with sleep apnea-hypopnea syndrome classification, which significantly reduced the rate of false positives during the waking period; the data, preprocessed with a sliding window, were processed by LSTM and CNN to identify the various sleep events. Although these networks can automatically extract and learn deep-level features from physiological signals, some shortcomings remain.
First, they focus only on extracting deep features, ignoring the effect of shallow features, which can provide rich information about sleep events. Our initial conference paper solved this problem with a multilevel feature fusion network [33]. Second, these networks did not consider the impact of the channel features obtained from different respiratory signals. Some channel features clearly distinguish sleep events, while others have little effect on SAS detection. We propose a multichannel fusion network (MCFN) to address this problem. MCFN effectively utilizes the shallow features of respiratory signals and fuses the multichannel features with an attention mechanism. We design a multichannel fusion block to adaptively recalibrate the feature channels of the various respiratory signals. Since the significance of each respiratory signal feature channel differs, this block automatically learns the importance of each feature channel, selectively enhancing the useful channel features and restraining the useless ones. We evaluate the proposed network on a publicly available dataset with 2056 subjects. MCFN achieves an overall accuracy of 87.3%.

Material and Methods
MCFN effectively fuses the features of different levels and channels. The network mainly includes signal preprocessing, multilevel feature concatenation, and multichannel attention fusion; the framework is shown in Figure 1. First, we segment the various respiratory signals into a series of 30 s epochs. The preprocessing block standardizes the respiratory signals, and each epoch is labeled as OSA, hypopnea, or normal sleep according to the AASM guidelines. Second, the multilevel feature concatenation block obtains abundant features from shallow and deep layers through skip connections; shallow features also contain valuable identification information. Third, the multichannel fusion block uses an attention mechanism to learn different weights: channel features that significantly affect SAS detection obtain larger weights, while the others get smaller weights. Finally, the feature vectors are input into two convolution layers and a max-pooling layer, and the sleep classification is performed in the fully connected layer with sigmoid activation functions. The following subsections detail the main blocks of the network.

Dataset.
We conducted our experiments on a large dataset called the Multi-Ethnic Study of Atherosclerosis (MESA) [28,29]. The dataset is retrieved from the National Sleep Research Resource (NSRR), a National Heart, Lung, and Blood Institute resource designed to provide extensive data to the sleep research community. MESA contains PSG recordings of 2056 subjects. The subjects, aged 45 to 84, come from different ethnic groups, including black, white, Hispanic, and Chinese men and women. Each PSG recording includes various physiological signals such as EEG, respiration signals, and ECG. Our network uses only three types of respiratory signals, extracted from nasal thermal sensors and from conductive belts around the thorax and abdomen. The sampling frequency of these signals is 32 Hz. Sleep experts labeled the start time and duration of OSA and hypopnea events.

Data Preprocessing.
In our network, three types of respiratory signals need to be preprocessed. First, we delete the subjects that contain only normal sleep events from the dataset. Second, because of different detection environments and equipment, the amplitude of each respiratory signal varies widely; therefore, each respiratory signal is individually standardized by subtracting its mean and dividing by its standard deviation. Finally, according to the time of each sleep event, each 30 s epoch is labeled as an OSA, hypopnea, or normal sleep event. If the epoch contains only obstructive sleep apnea or only hypopnea lasting more than 10 seconds, it is labeled OSA or hypopnea, respectively. We excluded epochs containing both obstructive sleep apnea and hypopnea events lasting more than 10 seconds. If an epoch contains obstructive sleep apnea or hypopnea events lasting less than 10 seconds, it is labeled as normal sleep.
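The standardization and segmentation steps can be sketched in a few lines. This is a minimal illustration assuming the 32 Hz sampling rate and 30 s epochs described above; the function names (`standardize`, `segment`) are our own, not from the authors' code.

```python
import numpy as np

FS = 32          # sampling frequency (Hz) of the MESA respiratory channels
EPOCH_S = 30     # epoch length in seconds

def standardize(signal):
    """Per-recording z-score: subtract the mean, divide by the standard deviation."""
    return (signal - signal.mean()) / signal.std()

def segment(signal, fs=FS, epoch_s=EPOCH_S):
    """Split a 1-D signal into non-overlapping 30 s epochs (drop the remainder)."""
    step = fs * epoch_s
    n = len(signal) // step
    return signal[: n * step].reshape(n, step)

x = standardize(np.random.randn(FS * 3600))   # one hour of synthetic signal
epochs = segment(x)
print(epochs.shape)   # (120, 960): 120 epochs, each 30 s x 32 Hz = 960 samples
```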
We also need to consider class balance in the preprocessing block. Typically, normal sleep epochs far outnumber OSA or hypopnea epochs. When a detection network is trained with imbalanced classes, it tends to predict the most frequent sleep event. One way to address this issue is balanced sampling: we randomly select from each majority sleep event the same number of epochs as in the minority sleep event and then feed the network with batches that contain equally many epochs from each sleep event.
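The balanced-sampling step described above can be sketched as follows; a minimal sketch with hypothetical names and toy class sizes, not the authors' implementation.

```python
import numpy as np

def balanced_indices(labels, rng=None):
    """Randomly subsample every class down to the size of the rarest class."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    n_min = counts.min()
    idx = [rng.choice(np.flatnonzero(labels == c), n_min, replace=False)
           for c in classes]
    return np.concatenate(idx)

# Toy imbalance: normal sleep dominates, OSA is rarest.
labels = np.array(["normal"] * 1000 + ["hypopnea"] * 120 + ["OSA"] * 40)
idx = balanced_indices(labels)
# Each class now contributes exactly 40 epochs (the size of the rarest class).
```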

Multilevel Feature Concatenation Block.
A simple CNN architecture has been used for SAS detection [23,33]. It is composed of convolution, pooling, and classification layers. The convolution layer extracts a feature map by applying a filter to the input respiratory signal. The pooling layer makes the features more distinct and reduces the amount of data. The convolution layer also filters out some high-frequency information and makes the signal smoother. Figure 2 shows part of the feature map of the airflow respiratory signal after four convolution layers. We find that as the number of convolution layers increases, the receptive field of the features becomes larger and more high-frequency information is filtered out. Although some networks use only deep-level features to detect SAS, some high-frequency features are thereby lost. Multilevel feature concatenation is realized through five skip connections to keep more high-frequency features in the network.
The multilevel feature concatenation block includes four convolution layers, two pooling layers, five skip connections, and one concatenation. The parameters of the different layers are summarized in Table 1. Each convolution layer has 32 filters with a rectified linear unit activation function, and each max-pooling layer has a pool size of (1, 2) with stride 2. The convolutional kernel size is (1, 3) with stride 3 or (1, 2) with stride 2. Following each convolution and pooling layer, the features of that level are down-sampled by average pooling. These features are then concatenated to generate multilevel feature maps. They include both shallow and deep features, provide more basic information, and improve detection performance.
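The concatenation step can be sketched to show how per-level maps of different temporal lengths are brought to a common size. This is an assumption-laden sketch: the six levels of 32 channels each (giving the 192 channels per signal mentioned later) and the common length of 7 follow the text, but the intermediate lengths here are illustrative, not the exact ones from Table 1.

```python
import numpy as np

def avg_pool_to(feat, target_len):
    """Average-pool a (channels, length) feature map down to (channels, target_len)."""
    c, n = feat.shape
    # Split the time axis into target_len roughly equal chunks and average each.
    chunks = np.array_split(np.arange(n), target_len)
    return np.stack([feat[:, ch].mean(axis=1) for ch in chunks], axis=1)

# Six levels (skip connections from the conv/pool layers), each with 32
# channels but progressively shorter time axes -- illustrative shapes.
levels = [np.random.randn(32, n) for n in (320, 160, 80, 40, 20, 7)]
multilevel = np.concatenate([avg_pool_to(f, 7) for f in levels], axis=0)
print(multilevel.shape)   # (192, 7): 6 levels x 32 channels, common length 7
```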

Multichannel Attention Fusion Block.
Different respiratory signals, such as airflow, thoracic, and abdominal, have additional predictive power for SAS detection [27]. To fully use the multilevel features from the three types of respiratory signals, we fuse the multichannel features with an attention mechanism. This block adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. It learns to selectively emphasize informative features and restrain less useful ones.
As shown in Figure 1, the multilevel feature concatenation block produces C × W × H features, where C is the number of channels and each channel contains W × H features. Each respiratory signal has 192 channels, and each channel includes 1 × 7 features. The features of the three respiratory signals are concatenated to obtain 576 channel features, which are the input of the multichannel attention fusion block. We recalibrate the multichannel features as follows.
First, the F_sq(·) operation compresses the features along the spatial dimension, turning each two-dimensional feature channel into a single real number. Global average pooling completes this operation so that the resulting number has a global receptive field; the output dimension equals the number of input channels. F_sq(μ_c) is calculated as follows:

z_c = F_sq(μ_c) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} μ_c(i, j),

where μ_c represents the feature map of the c-th channel and i and j index the rows and columns of the feature map, respectively. Second, the F_ex(·) operation is similar to the gate mechanism in a recurrent neural network (RNN). It can learn a nonlinear interaction between channels and a non-mutually-exclusive relationship. The operation is completed by two fully connected (FC) layers. F_ex(z, W) is calculated as follows:

s = F_ex(z, W) = σ(W_2 δ(W_1 z)),

where δ refers to the ReLU function, σ is the sigmoid function, and multiplying z by the parameter W_1 forms the first FC layer. To limit model complexity and aid generalization, W_1 reduces the channel dimension by a ratio of 16, from C to C/16. A dimensionality-increasing layer with parameter W_2 then returns the output to the original channel dimension.
Finally, the F_re(·) operation regards the output weights of the excitation as the importance of each feature channel. The original features are then recalibrated along the channel dimension by weighting each channel with its learned weight. F_re(μ_c, s_c) is calculated as follows:

x̃_c = F_re(μ_c, s_c) = s_c · μ_c,

where s_c indicates the importance of the c-th feature channel and μ_c represents the feature map of channel c.
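The three operations above form a squeeze-and-excitation style recalibration. Below is a minimal numpy forward pass assuming C = 576 channels, 1 × 7 feature maps, and a reduction ratio of 16, as described; the weight matrices are random placeholders standing in for trained parameters, and the function names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_recalibrate(u, w1, w2):
    """u: (C, H, W) features; w1: (C//16, C) reduce; w2: (C, C//16) expand."""
    z = u.mean(axis=(1, 2))                       # F_sq: global average pooling -> (C,)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))     # F_ex: FC -> ReLU -> FC -> sigmoid
    return s[:, None, None] * u, s                # F_re: channel-wise rescaling

C, r = 576, 16
rng = np.random.default_rng(0)
u = rng.standard_normal((C, 1, 7))                # concatenated multilevel features
w1 = rng.standard_normal((C // r, C)) * 0.01      # placeholder for a trained layer
w2 = rng.standard_normal((C, C // r)) * 0.01      # placeholder for a trained layer
out, s = se_recalibrate(u, w1, w2)
# out keeps the input shape; every channel weight s_c lies strictly in (0, 1)
```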
After recalibration, there are 576 channel feature maps, each of size 1 × 7. The recalibrated features pass through two convolution layers and one pooling operation: the convolution kernel sizes are (1, 3) and (1, 2) with strides 3 and 2, respectively, and the max-pooling size is (1, 2) with stride 2. A flatten operation then yields 576 features. Finally, two fully connected layers and the sigmoid function output the probability of each sleep event, and the block outputs the sleep event according to the probability values.
We use accuracy, sensitivity, and specificity to evaluate the detection performance:

accuracy = (TP + TN) / (TP + TN + FP + FN),
sensitivity = TP / (TP + FN),
specificity = TN / (TN + FP),

where TP, FP, TN, and FN represent the numbers of true positive, false positive, true negative, and false negative epochs. Sensitivity measures the proportion of correctly identified positive epochs, and specificity reflects the detection of negative samples. We also use the confusion matrix, in which each row represents the epochs of an actual label and each column represents the epochs of a predicted label. We standardize the confusion matrix by rows to obtain probabilities and represent them with color shades: the darker the color, the greater the probability, and vice versa.
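These definitions can be checked with a few lines of code; a small sketch with made-up counts, using helper names of our own choosing.

```python
import numpy as np

def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from epoch counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)          # sensitivity (recall)
    spec = tn / (tn + fp)          # specificity
    return acc, sens, spec

def row_normalize(cm):
    """Standardize a confusion matrix by rows to get per-class probabilities."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

acc, sens, spec = metrics(tp=90, fp=10, tn=85, fn=15)
print(round(acc, 3), round(sens, 3), round(spec, 3))   # 0.875 0.857 0.895
```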

Experimental Results
This section presents the experimental setup and several experiments designed to demonstrate the role of each block. First, we show the classification results of the MCFN model, which demonstrate its good performance. Second, we confirm the effect of the different respiratory signals on the detection of different sleep events; they complement each other in SAS detection. Third, we demonstrate the advantages of the multilevel feature concatenation block. Finally, we confirm that the attention mechanism effectively fuses the multichannel features to improve performance.

Experimental Setup.
The proposed network was trained and tested on the MESA dataset. After preprocessing, we selected 1801 subjects from the 2056 subjects. They included 54517 OSA epochs, 209910 hypopnea epochs, and 2019760 normal sleep epochs. The training and test sets contained a balanced number of epochs for each sleep event to prevent the model from overfitting to the majority class. We randomly selected 54517 epochs from each sleep class, mitigating the class imbalance issue. We chose 80% of the sleep events as the training set and 20% as the testing set.
Training and testing were conducted with the TensorFlow framework in Python 3.6, on an NVIDIA RTX 2080 Ti GPU. The proposed network adopted the Adam optimization method and cross-entropy as the loss function. The initial learning rate is 1e-3, reduced to 1e-4 after 40 iterations. The mini-batch size is 400 sleep events, and the network was trained for 100 epochs.
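The learning-rate schedule above can be expressed as a small helper; a hedged sketch of the described setup (Adam, cross-entropy, 1e-3 dropping to 1e-4 after 40 iterations), with names of our own choosing rather than the authors' code.

```python
def learning_rate(epoch, initial=1e-3, reduced=1e-4, drop_after=40):
    """Piecewise-constant schedule: 1e-3 for the first 40 epochs, then 1e-4."""
    return initial if epoch < drop_after else reduced

# With TensorFlow/Keras this could be wired up roughly as (not executed here):
#   model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
#                 loss="categorical_crossentropy")
#   model.fit(x, y, batch_size=400, epochs=100,
#             callbacks=[tf.keras.callbacks.LearningRateScheduler(learning_rate)])
```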

SAS Detection Performance of MCFN.
The MCFN model detects sleep events using the three respiratory signals of the chest, abdomen, and nasal airflow on the MESA dataset. The average accuracy is 87.3%, and the average F1 score is 87.3%. Table 2 presents the detection performance of the model. The performance indexes of OSA detection are the highest: recall reaches 93.7%, the F1 score is 93.5%, and precision is 93.3%, indicating that the MCFN model achieves good performance in detecting OSA events. There is a trade-off between the precision and recall of normal sleep and hypopnea events, which the F1 score can balance. The F1 scores of the two events are very similar, with a difference of only 0.8%, indicating that the model performs comparably on these two events. From the confusion matrix, we found some misclassifications between normal sleep and hypopnea events, mainly because the waveforms of the two events are sometimes very similar and differ only in amplitude. The MCFN model performs well on OSA events mainly because the respiratory waveform of OSA events differs greatly from that of the other events.

The Effects of Three Respiratory Signals.
We used sensitivity and specificity to measure the effect of the different respiratory signals on the various sleep events. Sensitivity measures the proportion of correctly identified positives, such as the percentage of OSA events correctly identified as having the event. Specificity measures the proportion of correctly identified negatives, such as the percentage of non-OSA epochs correctly identified as not having the event.
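In a three-class setting, these per-event values come from treating each event one-versus-rest against the confusion matrix. A minimal sketch with a toy matrix; the counts and names are hypothetical, not the paper's results.

```python
import numpy as np

def one_vs_rest(cm, k):
    """Sensitivity and specificity of class k from a confusion matrix
    whose rows are actual labels and columns are predictions."""
    cm = np.asarray(cm, dtype=float)
    tp = cm[k, k]
    fn = cm[k].sum() - tp            # class-k epochs predicted as other classes
    fp = cm[:, k].sum() - tp         # other epochs predicted as class k
    tn = cm.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

# Toy 3-class matrix: rows/cols ordered as OSA, hypopnea, normal sleep.
cm = [[90, 6, 4],
      [5, 80, 15],
      [3, 12, 85]]
sens, spec = one_vs_rest(cm, 0)
print(round(sens, 3), round(spec, 3))   # 0.9 0.96
```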
Figure 3 shows the sensitivity of the airflow (Flow), thoracic respiratory signal (Thor.), and abdominal respiratory signal (Abdo.). The sensitivity of the abdominal respiration signal in detecting OSA and hypopnea sleep events is 81.39% and 73.05%, respectively. The sensitivity of the airflow respiration signal in detecting normal sleep events is 72.9%, higher than that of the other respiratory signals.
Figure 4 shows the specificity. The specificity of the airflow respiratory signal in detecting OSA was 93.72%, and its specificity in detecting hypopnea sleep events was 87.64%, which is 4.31% higher than that of the abdominal respiratory signal. The specificity of the abdominal respiratory signal in detecting normal sleep events was 47.2%. These results show that the different respiratory signals play different roles in detecting the various sleep events, so we can use the three respiratory signals simultaneously for SAS detection.
To comprehensively evaluate the role of the three respiratory signals in detecting SAS, Figure 5 shows the accuracy when single, two, and three respiratory signals are input into the MCFN model. The SAS detection performance of single respiratory signals is the lowest: the accuracy of the nasal airflow, abdominal, and thoracic respiratory signals was 76.5%, 76.3%, and 74.0%, respectively. When we combine the respiratory signals in pairs, the accuracy improves to varying degrees compared with single respiratory signals; for example, the accuracy of the combined airflow and thoracic respiratory signals reaches 83.1%, 6.6% higher than that of airflow alone. The detection accuracy is highest when the three respiratory signals are combined, reaching 87.3%, an improvement of 9.1%.
We find that the detection performance of the combined respiratory signals is better than that of single respiratory signals. The three kinds of respiratory signals play different roles in detecting sleep events, and combining multiple respiratory signals lets them complement each other and improves SAS detection performance.

Multilevel Feature Concatenation Block Improves Performance.
In this experiment, we investigate the influence of the multilevel feature concatenation block on classification performance. First, to concatenate the features of different levels, the shallow features must be down-sampled to the same dimension. There are two down-sampling methods, average pooling and max pooling; experimentally, the two methods have little effect on the detection performance, so we choose average pooling to reduce the dimension. Then, by inputting the different respiratory signals into the model with only deep-level features or with multilevel features, we obtain the overall accuracy shown in Figure 6.
We find that, for both single and combined respiratory signals, the detection accuracy using multilevel features is higher than that using only deep features. For the airflow respiratory signal, the accuracy improves by only 0.2%, indicating that the features of the other levels provide little additional identification information. For the thoracic respiratory signal, the accuracy with only deep features was 71.1%, and the accuracy with multilevel features was 74.0%, an increase of 2.9%, indicating that the low-level features of the thoracic respiratory signal provide rich identification information and improve detection performance. For the combined respiratory signals, the accuracy also improves.
This result shows that the multilevel features of the various respiratory signals have different effects on SAS detection. Fully learning the features of the thoracic and abdominal respiratory signals can improve detection accuracy, whereas the multilevel features of the airflow respiratory signal have little impact on performance.

Multichannel Feature Fusion Block Improves Performance.
In this experiment, we investigated the influence of the relationships between different channel features on classification performance. We take the combination of the airflow and abdominal signals, and the combination of all three respiratory signals, as examples. Figure 7 shows the SAS detection confusion matrices with and without multichannel feature fusion.
Comparing confusion matrices (a) and (b), we find that the correct classification probability of hypopnea events increased from 0.78 to 0.83, an increase of 0.05, and that of OSA events rose from 0.92 to 0.94, an increase of 0.02. These results show that the combined abdominal and airflow respiratory signals can extract rich features; after attention fusion, useful features are strengthened and useless features are suppressed, improving performance. The correct classification probability of normal sleep events did not increase but instead decreased by 0.02, indicating that the features extracted from the two combined signals are very similar for this event.
Comparing confusion matrices (a) and (c), we find that the correct classification probability of each event in (c) is greater than or equal to that in (a). This confirms that the classification performance with three respiratory signals is better than with two, further verifying that the various respiratory signals provide richer information.
Comparing confusion matrices (c) and (d), we find that the correct classification probability of each event in (d) is greater than that in (c). The correct classification probability of hypopnea events increased from 0.82 to 0.86, an increase of 0.04, and the correct probabilities of OSA and normal events each increased by 0.01. These results confirm that the attention mechanism improves detection performance by fusing the multichannel features of the three respiratory signals.
The experimental results above confirm that the multichannel attention fusion block can improve the correct classification probability of hypopnea and OSA events. Its effect on normal sleep events is not very significant, mainly because the waveform of such events is relatively stable.

Learned Weight for Each Channel Feature.
The attention mechanism can learn different weights for the channel features. The experimental results verify that the channel features of each respiratory signal have different effects on SAS detection. Figure 8 shows the multichannel feature weights of the three respiratory signals. For the first channel of each respiratory signal, the channel weight of the airflow signal is 0.18, that of the thoracic signal is 0.50, and that of the abdominal signal is 0.16. For the 64th channel, the weights are 0.50 for airflow, 0.50 for thoracic, and 0.99 for abdominal. After multilevel feature concatenation, each respiratory signal yields 192 channel features, and the multichannel feature fusion block receives 576 channel features in total. The attention mechanism learns the weight of each feature channel through training.
From Figure 8, we find that the weights of the feature channels of each respiratory signal differ. For example, some channel weights of the airflow respiratory signal are close to 1, while others are close to 0. These weights indicate that features at different levels have different effects on sleep event detection. In addition, the weights of feature channels 0, 32, 64, 96, 128, and 160 are marked with special symbols; the importance of channel features at the same level also differs. Figure 9 shows the weight distributions of the channel features of the different respiratory signals. When the weight is less than 0.25, the distributions of the three respiratory signals are very similar, indicating that the numbers of weakly contributing feature channels are approximately equal. When the weight is in the range 0.25 to 0.75, the airflow respiratory signal has the largest number of feature channels, indicating that its role is moderately important. When the weight is greater than 0.75, the abdominal respiratory signal has the largest number of feature channels, indicating that these features contribute the most to SAS detection and contain the most identifying information. In addition, the Kolmogorov-Smirnov (KS) test was used to determine whether the channel weights of two respiratory signals obey the same distribution; since the p values are less than 0.05, they belong to different distributions. Therefore, the channel features learned from each respiratory signal have different effects on SAS detection, which shows that the fusion of multiple respiratory signals is essential.


Discussion
Van Steenkiste et al. [26] used LSTM to detect normal sleep and sleep apnea on large datasets, and the accuracy was 77.2%. Although the temporal correlation of sleep events was considered, they ignored the relationships between different channel features. Elmoaqet et al. [27] developed an LSTM and Bi-LSTM framework to detect apnea events. They evaluated the framework over three respiration signals: airflow, nasal pressure (NPRE), and abdominal respiratory inductance plethysmography. They used PSG recordings of 17 patients with obstructive, central, and mixed apnea events. The average accuracy was 83.6%.

Barroso et al. [31] extracted 13 bispectral features from airflow. The oxygen desaturation index ≥3% (ODI3) was also obtained to evaluate its complementarity to the bispectral analysis. They used the fast correlation-based filter (FCBF) and a multilayer perceptron (MLP) for feature selection and pattern recognition. The model reached 82.5% accuracy for the typical cut-off of five events per hour. Yu et al. [32] proposed an SAS detection and classification method that uses the C4/A1 single-channel EEG signal, the oronasal flow signal, and the abdominal displacement signal. They utilized an LSTM-CNN to identify four distinct types: normal sleep, hypopnea events, OSA, and CSA + MSA. The overall classification accuracy reaches 83.94%.
A direct comparison is challenging because these studies do not all use the same database or the same sleep event classes. To compare on the same dataset, we implemented the research of Haidar et al., who carried out extensive analysis on the MESA dataset. Initially, in [22], they obtained 77.6% accuracy with a CNN by inputting the airflow respiratory signal; later, in [23], they obtained 83.5% accuracy by inputting three types of respiratory signals. All the research studies mentioned above are summarized in Table 3. By considering the effect of shallow features on sleep classification and the relationships between different channel features in detecting sleep events, our experiment improved the accuracy by 3.9%. Our network can not only detect many types of sleep events but also improve accuracy.

Conclusion
We propose the MCFN model to detect OSA, hypopnea, and normal sleep. The model uses the multilevel feature concatenation block, which extracts richer information and gives full play to the role of shallow features. The model then utilizes an attention mechanism to effectively fuse the different-level features of the airflow, abdominal, and thoracic respiratory signals. The fusion block gives each channel feature of the three respiratory signals a different weight, enhancing the useful channel features and suppressing the useless ones. The experiments verified how multiple respiratory signals, multilevel features, multichannel fusion, and channel features affect SAS detection. The MCFN model improves SAS detection performance by exploiting the complementarity of the various signals and the completeness of the features. The detection accuracy is 87.3% on the MESA dataset, which is better than the other methods. In future research, we will study the effect of sleep apnea on sleep staging.

Data Availability
The MESA sleep dataset was supported by the National Heart, Lung, and Blood Institute (NHLBI) at the National Institutes of Health. It is available through the NHLBI National Sleep Research Resource at https://www.sleepdata.org/datasets/mesa.

Conflicts of Interest
The authors declare that they have no conflicts of interest.