Sleep Stage Classification Using Unsupervised Feature Learning

Most attempts at training computers for the difficult and time-consuming task of sleep stage classification involve a feature extraction step. Due to the complexity of multimodal sleep data, the size of the feature space can grow to the extent that it is also necessary to include a feature selection step. In this paper, we propose the use of an unsupervised feature learning architecture called deep belief nets (DBNs) and show how to apply it to sleep data in order to eliminate the use of handmade features. Using a hidden Markov model (HMM) as a postprocessing step to accurately capture sleep stage switching, we compare our results to a feature-based approach. A study of anomaly detection with application to home environment data collection is also presented. The results using raw data with a deep architecture, such as the DBN, were comparable to a feature-based approach when validated on clinical datasets.


Introduction
One of the main challenges in sleep stage classification is to isolate features in multivariate time-series data which can be used to correctly identify sleep stages and thereby automate the annotation process to generate sleep hypnograms. In the current absence of a set of universally applicable features, a two-stage process is typically required before training a sleep stage algorithm, namely, feature extraction and feature selection [1][2][3][4][5][6][7][8][9]. In other domains which share similar challenges, an alternative to using hand-tailored feature representations derived from expert knowledge is to apply unsupervised feature learning techniques, where the feature representations are learned from unlabeled data. This not only enables the discovery of new useful feature representations that a human expert might not be aware of, which in turn could lead to a better understanding of the sleep process, but also presents a way of exploiting massive amounts of unlabeled data.
Unsupervised feature learning, and in particular deep learning [10][11][12][13][14][15], proposes ways for training the weight matrices in each layer in an unsupervised fashion as a preprocessing step before training the whole network. This has proven to give good results in other areas such as vision tasks [10], object recognition [16], motion capture data [17], speech recognition [18], and bacteria identification [19].
This work presents a new approach to the automatic sleep staging problem. The main focus is to learn meaningful feature representations from unlabeled sleep data. A dataset of 25 subjects consisting of electroencephalography (EEG) of brain activity, electrooculography (EOG) of eye movements, and electromyography (EMG) of skeletal muscle activity is segmented and used to train a deep belief network (DBN), using no prior knowledge. Validation of the learned representations is done by integrating a hidden Markov model (HMM) and comparing classification accuracy with a feature-based approach that uses prior knowledge. The inclusion of an HMM serves to capture more realistic sleep stage switching, for example, by hindering excessive or unlikely sleep stage transitions. It is in this manner that the knowledge from the human experts is infused into the system. Even though the classifier is trained using labeled data, the feature representations are learned from unlabeled data. The architecture of the DBN follows previous work with unsupervised feature learning for EEG event detection [20].
A secondary contribution of the proposed method leverages the information from the DBN in order to perform anomaly detection. In light of an increasing trend to streamline sleep diagnosis and reduce the burden on health care centers by using at-home sleep monitoring technologies, anomaly detection is important in order to rapidly assess the quality of the polysomnograph data and determine if the patient requires an additional night's collection at home. In this paper, we illustrate how the DBN, once trained on datasets for sleep stage classification in the lab, can still be applied to data which has been collected at home to find particular anomalies such as a loose electrode.
Finally, inconsistencies between sleep labs (equipment, electrode placement), experimental setups (number of signals and categories, subject variations), and interscorer variability (80% conformance for healthy patients and even less for patients with sleep disorder [9]) make it challenging to compare sleep stage classification accuracy to previous works. Results in [2] report a best accuracy of around 61% for classification of 5 stages from a single EEG channel using GOHMM and AR coefficients as features. The work in [8] achieved 83.7% accuracy using conditional random fields with six power spectral density features for one EEG signal on four human subjects during a 24-hour recording session and considering six stages. The work in [7] achieved 85.6% accuracy on artifact-free, two-expert-agreement sleep data from 47 mostly healthy subjects using 33 features with SFS feature selection and four separately trained neural networks as classifiers.
The goal of this work is not to replicate the R&K system or improve current state-of-the-art sleep stage classification but rather to explore the advantages of deep learning and the feasibility of using unsupervised feature learning applied to sleep data. Therefore, the main method of evaluation is a comparison with a feature-based shallow model. Matlab code used in this paper is available at http://aass.oru.se/∼mlt.

Deep Belief Networks
A DBN is a probabilistic generative model with a deep architecture that searches the parameter space by unsupervised greedy layerwise training. Each layer consists of a restricted Boltzmann machine (RBM) with visible units, v, and hidden units, h. There are no visible-visible connections and no hidden-hidden connections. The visible and hidden units have bias vectors, c and b, respectively. The visible and hidden units are connected by a weight matrix, W; see Figure 1(a). A DBN is formed by stacking a user-defined number of RBMs on top of each other, where the output from a lower-level RBM is the input to a higher-level RBM; see Figure 1(b). The main difference between a DBN and a multilayer perceptron is the inclusion of a bias vector for the visible units. This bias is used to reconstruct the input signal, which plays an important role in the way DBNs are trained.
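A reconstruction of the input can be obtained from the unsupervised pretrained DBN by encoding the input to the top RBM and then decoding the state of the top RBM back to the lowest level. For a Bernoulli (visible)-Bernoulli (hidden) RBM, the probability that hidden unit $h_j$ is activated given visible vector, v, and the probability that visible unit $v_i$ is activated given hidden vector, h, are given by

$$P(h_j = 1 \mid v) = \frac{1}{1 + \exp\left(-b_j - \sum_i W_{ij} v_i\right)}, \qquad P(v_i = 1 \mid h) = \frac{1}{1 + \exp\left(-c_i - \sum_j W_{ij} h_j\right)}. \tag{1}$$

The energy function and the joint distribution for a given visible and hidden vector are

$$E(v, h) = -\sum_{i,j} v_i W_{ij} h_j - \sum_j b_j h_j - \sum_i c_i v_i, \qquad p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \tag{2}$$

where Z is the normalizing partition function. The parameters W, b, and c are trained to minimize the reconstruction error. An approximation of the gradient of the log likelihood of v using contrastive divergence [21] gives the learning rule for RBM:

$$\Delta W_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}, \qquad \Delta b_j \propto \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{recon}}, \qquad \Delta c_i \propto \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{recon}}, \tag{3}$$

where $\langle \cdot \rangle$ is the average value over all training samples. In this work, training is performed in three steps: (1) unsupervised pretraining of each layer, (2) unsupervised fine-tuning of all layers with backpropagation, and (3) supervised fine-tuning of all layers with backpropagation.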
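As an illustration of the Bernoulli-Bernoulli RBM conditionals and the contrastive-divergence learning rule described above, the following is a minimal sketch of one CD-1 update on a single training vector. It is not the authors' Matlab implementation: the function names, the learning rate `lr`, the per-sample (rather than batch-averaged) update, and the use of probabilities instead of sampled states in the reconstruction phase are all assumptions made for brevity.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j; W[i][j] links v_i to h_j."""
    return [sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, c):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(c[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(c))]


def cd1_update(v0, W, b, c, lr=0.1, rng=random):
    """One CD-1 step on a single sample; updates W, b, c in place.

    Returns the squared reconstruction error, useful for monitoring.
    """
    ph0 = hidden_probs(v0, W, b)
    # Sample binary hidden states from the data-driven probabilities.
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
    v1 = visible_probs(h0, W, c)   # reconstruction (probabilities)
    ph1 = hidden_probs(v1, W, b)   # hidden probabilities given reconstruction
    for i in range(len(v0)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    for i in range(len(c)):
        c[i] += lr * (v0[i] - v1[i])
    return sum((a - r) ** 2 for a, r in zip(v0, v1))
```

In a full training loop these updates would typically be averaged over mini-batches and repeated for many epochs per layer before the next RBM is stacked on top.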

Automatic Sleep Stager.
The five sleep stages in focus are awake, stage 1 (S1), stage 2 (S2), slow wave sleep (SWS), and rapid eye-movement sleep (REM). These stages come from a unified method for classifying an 8 h sleep recording introduced by Rechtschaffen and Kales (R&K) [22]. A graph that shows these five stages over an entire night is called a hypnogram, and each epoch according to the R&K system is either 20 s or 30 s. While the R&K system brings consensus on terminology, among other advantages [23], it has been criticized for a number of issues [24]. Even though the goal in this work is not to replicate the R&K system, its terminology will be used for evaluation of our architecture. Each channel of the data is divided into segments of 1 second with zero overlap, which is a much higher temporal resolution than the one practiced by the R&K system. We compare the performance of three experimental setups as shown in Figure 2.
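The 1-second, zero-overlap segmentation can be sketched as follows. The function name `segment` and its parameters are hypothetical, and the sampling rate `fs` must be supplied by the caller.

```python
def segment(signal, fs, width_s=1.0):
    """Split a 1-D sequence into non-overlapping windows of width_s seconds.

    fs is the sampling rate in Hz. Trailing samples that do not fill a
    whole window are dropped.
    """
    n = int(round(fs * width_s))
    return [signal[k:k + n] for k in range(0, len(signal) - n + 1, n)]
```

Each 20 s or 30 s scored epoch then contributes 20 or 30 such segments, all of which inherit the epoch's label.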

Feat-GOHMM.
A Gaussian observation hidden Markov model (GOHMM) is used on 28 handmade features; see the appendix for a description of the features used. Feature selection is done by sequential backward selection (SBS), which starts with the full set of features and greedily removes a feature after each iteration step. A principal component analysis (PCA) with five principal components is used after feature selection, followed by a Gaussian mixture model (GMM) with five components. The purpose of the PCA is to reduce dimensionality, and the choice of five components was made since it captured most of the variance in the data, while still being tractable for the GMM step. Initial mean and covariance values for each GMM component are set to the mean and covariance of annotated data for each sleep stage. Finally, the output from the GMM is used as input to a hidden Markov model (HMM) [25].
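The greedy removal performed by SBS can be sketched as below. This is a generic illustration, not the authors' code: the `score` callable (for example, validation accuracy of the downstream classifier), the `min_size` parameter, and the tie-breaking behaviour are assumptions.

```python
def sequential_backward_selection(features, score, min_size=1):
    """Greedy SBS: start from all features and drop one per iteration.

    features: list of feature names (or indices).
    score:    callable taking a list of features and returning a number,
              where higher is better.
    Returns the best-scoring subset seen during the search and its score.
    """
    current = list(features)
    best_subset, best_score = list(current), score(current)
    while len(current) > min_size:
        # Evaluate every subset obtained by removing a single feature and
        # keep the removal that gives the highest score.
        candidates = [[f for f in current if f != drop] for drop in current]
        scores = [score(c) for c in candidates]
        k = max(range(len(candidates)), key=scores.__getitem__)
        current = candidates[k]
        if scores[k] >= best_score:
            best_subset, best_score = list(current), scores[k]
    return best_subset, best_score
```

Because each iteration evaluates one candidate subset per remaining feature, a full run over 28 features calls `score` a few hundred times, which is one reason the choice of a cheap scoring model matters.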

Feat-DBN.
A 2-layer DBN with 200 hidden units in both layers and a softmax classifier attached on top is used on 28 handmade features. Both layers are pretrained for 300 epochs, and the top layer is fine-tuned for 50 epochs. Initial biases of hidden units are set empirically to −4 to encourage sparsity [26], which prevents learning trivial or uninteresting feature representations. Scaling to values between 0 and 1 is done by subtracting the mean, dividing by the standard deviation, and finally adding 0.5.
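The feature scaling described above can be sketched as follows. The function name is hypothetical, and the use of the population standard deviation is an assumption. Note that, taken literally, this transformation centers the values at 0.5 but does not by itself bound them to [0, 1]; whether any clipping is applied afterwards is not stated in the text, so none is applied here.

```python
import statistics


def scale_feature(values):
    """Subtract the mean, divide by the standard deviation, then add 0.5."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population std; an assumption
    if sigma == 0:
        # A constant feature carries no information; map it to the center.
        return [0.5] * len(values)
    return [(x - mu) / sigma + 0.5 for x in values]
```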

Raw-DBN.
A DBN with the same parameters as feat-DBN is used on preprocessed raw data. Scaling is done by saturating the signal at a saturation constant, sat_channel, then dividing by 2 * sat_channel, and finally adding 0.5. The saturation constant was set to sat_EEG = sat_EOG = ±60 μV and sat_EMG = ±40 μV. Input consisted of the concatenation of EEG, EOG1, EOG2, and EMG. With window width, w, the visible layer becomes [...]
Previous approaches to artifact rejection in EEG analysis range from simple thresholding on abnormal amplitude and/or frequency to more complex strategies in order to detect individual artifacts [27,28].
[...] (5 nights) was collected at a healthy patient's home using an Embla Titanium PSG. A total of 8 electrodes were used: EEG C3, EEG C4, EOG left, EOG right, 2 electrodes for the EMG channel, reference electrode, and ground electrode. Data was collected with a sampling rate of 256 Hz, which was downsampled to match the sampling rate of the training data. The signals are preprocessed using the same method as the benchmark dataset.
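The saturation scaling of the raw-DBN input and the channel concatenation can be sketched as follows. The saturation constants are those given in the text; the dictionary layout, function names, and the representation of a window as a mapping from channel name to samples are assumptions.

```python
# Saturation constants in microvolts, as stated for the raw-DBN setup.
SAT_UV = {"EEG": 60.0, "EOG1": 60.0, "EOG2": 60.0, "EMG": 40.0}


def scale_raw(samples_uv, sat):
    """Clip to [-sat, sat], divide by 2 * sat, and add 0.5, giving [0, 1]."""
    return [max(-sat, min(sat, x)) / (2.0 * sat) + 0.5 for x in samples_uv]


def visible_vector(window):
    """Concatenate the scaled channels in the order EEG, EOG1, EOG2, EMG.

    window: dict mapping channel name to the samples of one segment.
    """
    v = []
    for channel in ("EEG", "EOG1", "EOG2", "EMG"):
        v.extend(scale_raw(window[channel], SAT_UV[channel]))
    return v
```

Unlike the mean and standard deviation scaling used for features, clipping before rescaling guarantees that every visible unit receives a value in [0, 1], which suits Bernoulli visible units.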

Automatic Sleep Stager.
A full leave-one-out cross-validation of the 25 acquisitions is performed for the three experimental setups. The classification accuracy and confusion matrices for each setup and sleep stage are presented in Tables 1, 2, 3, and 4. Here, the performance of using a DBN-based approach, either with features or using the raw data, is comparable to the feat-GOHMM. While the best accuracy was achieved with feat-DBN, followed by raw-DBN and, lastly, feat-GOHMM, it is important to examine the performances individually. Figure 4 shows classification accuracy for each subject. The raw-DBN setup gives the best, or second best, performance in the majority of the sets, with the exception of subjects 9 and 22. A comparison of the F1-score for individual sleep stages indicates that S1 is the most difficult stage to classify and that awake and slow wave sleep are the easiest.
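For reference, the accuracy, confusion matrix, and per-stage F1-score used in this evaluation can be computed as in the sketch below. The function names and the toy labels in the usage note are illustrative and are not taken from the paper's tables.

```python
def confusion_matrix(y_true, y_pred, labels):
    """cm[r][c] counts samples whose true label is labels[r], predicted labels[c]."""
    idx = {lab: k for k, lab in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        cm[idx[t]][idx[p]] += 1
    return cm


def accuracy(cm):
    total = sum(map(sum, cm))
    return sum(cm[k][k] for k in range(len(cm))) / total if total else 0.0


def f1_per_class(cm):
    """F1 = 2TP / (2TP + FP + FN) for each class, in label order."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(row[k] for row in cm) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores
```

For example, calling `confusion_matrix(true_stages, predicted_stages, ["awake", "S1", "S2", "SWS", "REM"])` and passing the result to `f1_per_class` gives one F1 value per sleep stage, which is how a stage such as S1 can be singled out as hard to classify even when overall accuracy is high.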
For the raw-DBN, it is also possible to analyze the learned features. In Figure 6, the learned features for the first layer are given. Here, it can clearly be seen that both low and high frequency features for the EEG and high and low amplitude features for the EMG are included, which to some degree correspond to the features that are typically selected in handmade feature selection methods. Some conclusions can be drawn from analyzing the features selected by the SBS algorithm used in feat-GOHMM. Fractal exponent for EEG and entropy for EOG were selected for all 25 subjects and thus proved to be valuable features. Correlation between both EOG signals was also among the top selected features, as well as delta, theta, and alpha frequencies for EEG. Frequency features for EOG and EMG were excluded early, which is in accordance with the fact that these signals do not exhibit valuable information in the frequency domain [30]. The kurtosis feature was selected more frequently when it was applied to EMG and less frequently when it was applied to EEG or EOG. Features of spectral mean for all signals, median for EMG, and standard deviation for EOG were not frequently selected. See Figure 5 for error bars for each feature at each sleep stage.
It is worth noting that variations in the number of layers and hidden units were attempted, and it was found that an increase did not significantly improve classification accuracy. Rather, an increase in either the number of layers or hidden units often resulted in a significant increase in simulation time, and therefore, to maintain a reasonable training time, the layers and hidden units were kept to a minimum. With the configuration of the three experimental setups described above and simulations performed on a Windows 7, 64-bit machine with a quad-core Intel i5 3.1 GHz CPU and an NVIDIA GeForce GTX 470 GPU using GPUmat, simulation times for feat-GOHMM, feat-DBN, and raw-DBN were approximately 10 minutes, 1 hour, and 3 hours per dataset, respectively.

Anomaly Detection on Home Sleep Data
Interestingly, attempts at using the feat-GOHMM for sleep stage classification on the home sleep dataset resulted in faulty data being misclassified as awake. This could be explained by the fact that faulty data mostly resembles signals in the awake state.

Discussion
In this work, we have shown that an automatic sleep stager can be applied to multimodal sleep data without using any handmade features. We also compared the reconstructed signal from a trained DBN on data collected in a home environment and saw that the RMSE was large where an obvious error had occurred.
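The reconstruction-based anomaly check can be sketched as follows: compute the RMSE between the input signal and its reconstruction over consecutive windows and flag windows where it is large. The window length and the `threshold` value are hypothetical parameters; the text does not specify a decision threshold, and the reconstruction itself is assumed to come from a trained DBN.

```python
import math


def windowed_rmse(signal, reconstruction, window):
    """RMSE between signal and reconstruction over non-overlapping windows."""
    out = []
    for k in range(0, len(signal) - window + 1, window):
        se = sum((signal[i] - reconstruction[i]) ** 2
                 for i in range(k, k + window))
        out.append(math.sqrt(se / window))
    return out


def flag_anomalies(rmse, threshold):
    """Indices of windows whose RMSE exceeds the (hypothetical) threshold."""
    return [k for k, r in enumerate(rmse) if r > threshold]
```

Since the DBN is trained only on lab data of acceptable quality, segments it cannot reconstruct well, such as those recorded after an electrode has come loose, are the ones expected to stand out in the RMSE trace.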
Regarding the DBN parameter selection, it was noticed that setting initial biases for the hidden units to −4 was important for achieving good accuracy. A better way of encouraging sparsity is to include a sparsity penalty term in the cost objective function [31] instead of making a crude estimation of initial biases for the hidden units. For the raw-DBN setup, it was also crucial to train each layer for a large number of epochs, in particular during the fine-tuning step.
We also noticed a lower performance if sleep stages were not set to equal sizes in the training set. There was also a high variation in the accuracy between patients, even if they came from the same dataset. Since the DBN will find a generalization that best fits all training examples, a testing set that deviates from the average training set might give poor results. Since data might differ greatly between patients, a single DBN trained on general sleep data is not specialized enough. The need for a more dynamic system, especially one including the transition and emission matrices for the HMM, is made clear when comparing the hypnograms of a healthy patient and a patient with sleep disordered breathing. Further, although the HMM provides a simple solution that captures temporal properties of sleep data, it makes two critical assumptions [13]. The first one is that the next hidden state can be approximated by a state depending only on the previous state, and the second one is that observations at different time steps are conditionally independent given a state sequence. Replacing the HMM with conditional random fields (CRFs) could improve accuracy, but a CRF is still a simplistic temporal model that does not exploit the power of DBNs [32].
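To make the two HMM assumptions concrete, the following is a minimal sketch of Viterbi decoding, a standard way to obtain the most likely state sequence from an HMM. The text does not state which decoding procedure was used, so the function names and the log-domain formulation are assumptions. The transition term depends only on the previous state (first assumption), and the emission term depends only on the current state (second assumption).

```python
import math


def logs(rows):
    """Element-wise natural log of a matrix of strictly positive probabilities."""
    return [[math.log(p) for p in row] for row in rows]


def viterbi(log_emis, log_trans, log_init):
    """Most likely state sequence for a first-order HMM.

    log_emis[t][s]:  log-likelihood of observation t under state s.
    log_trans[r][s]: log-probability of moving from state r to state s.
    log_init[s]:     log-probability of starting in state s.
    """
    n_states = len(log_init)
    delta = [log_init[s] + log_emis[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emis)):
        prev, delta, ptr = delta, [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            delta.append(prev[best] + log_trans[best][s] + log_emis[t][s])
        back.append(ptr)
    # Backtrack from the best final state.
    state = max(range(n_states), key=delta.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

With a transition matrix that strongly favors staying in the same stage, a single segment whose emission slightly favors another stage is overruled by its neighbors, which is the smoothing effect that suppresses unlikely stage switches.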
While a clear advantage of using a DBN is the natural way in which it deals with anomalous data, there are some limitations to the DBN. One limitation is that correlations between signals in the input data are not well captured. This gives a feature-based approach an advantage where, for example, the correlation between both EOG channels can easily be represented with a feature. This could be solved by either representing the correlation in the input or extending the DBN to handle such correlations, such as a cRBM [33]. Regarding the implemented feat-GOHMM, we have tried our best to get as high an accuracy with the setup as possible. It is almost certain that another set of features, a different feature selection algorithm, and/or another classifier could outperform our feat-GOHMM. However, we hope that this work illustrates the advantages of unsupervised feature learning, which not only removes the need for domain-specific expert knowledge but also inherently provides tools for anomaly detection and noise redundancy.
It has been suggested for multimodal signals to train a separate DBN for each signal first and then train a top DBN with concatenated data [34]. This could not only improve classification accuracy but also provide the ability to single out which signal contains the anomaly. Further, this work has explored clinical datasets in close cooperation with physicians. Future work will concentrate on the application for at-home monitoring, since sleep data is an area where unsupervised feature learning is highly promising: data is abundant and labels are costly to obtain.
[...] where N is the number of samples in signal y, and n_k is the number of samples from y that belong to the kth bin of a histogram of y.
The kurtosis for a signal y is calculated as [...] where μ and σ are the mean and standard deviation, respectively, for signal y.
The spectral mean for signal y is calculated as [...] where F is the sum of the lengths of the 5 frequency bands.
Fractal exponent [35,36] for the EEG is calculated as the negative slope of the linear fit of spectral density in the double logarithmic graph.
Normalization is performed for some features according to [37] and [30]. The absolute median for EMG is normalized by dividing by the absolute median for the whole EMG signal.
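As a rough illustration of some of these features, the sketch below uses common textbook definitions. These are assumptions and may differ from the exact formulas used for the 28 features: the number of histogram bins (`n_bins=8`), the natural logarithm in the entropy, kurtosis without subtracting 3, and an ordinary least-squares fit for the fractal exponent are all illustrative choices, as are the function names.

```python
import math
import statistics


def histogram_entropy(y, n_bins=8):
    """Shannon entropy of the histogram of y: -sum (n_k/N) ln(n_k/N)."""
    lo, hi = min(y), max(y)
    if hi == lo:
        return 0.0
    counts = [0] * n_bins
    for x in y:
        k = min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)
        counts[k] += 1
    N = len(y)
    return -sum((n / N) * math.log(n / N) for n in counts if n)


def kurtosis(y):
    """Fourth standardized moment E[(y - mu)^4] / sigma^4 (no -3 offset)."""
    mu = statistics.fmean(y)
    sigma = statistics.pstdev(y)
    if sigma == 0:
        return 0.0  # undefined for a constant signal; 0.0 by convention here
    return statistics.fmean([(x - mu) ** 4 for x in y]) / sigma ** 4


def fractal_exponent(freqs, psd):
    """Negative slope of a least-squares line through (log f, log PSD)."""
    xs = [math.log(f) for f in freqs]
    ys = [math.log(p) for p in psd]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (v - my) for x, v in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope


def normalized_abs_median(segment, whole_signal):
    """Median of |segment| divided by the median of |whole signal|."""
    return (statistics.median(abs(x) for x in segment)
            / statistics.median(abs(x) for x in whole_signal))
```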

Figure 2: Overview of three setups for an automatic sleep stager used in this work. The first method, feat-GOHMM, is a shallow method that uses prior knowledge. The second method, feat-DBN, is a deep architecture that also uses prior knowledge. And, lastly, the third method, raw-DBN, is a deep architecture that does not use any prior knowledge. See text for more details.

Figure 4: Classification accuracy for 25 testing sets for three setups.
A total of five acquisitions were recorded at a patient's home during sleep and manually labeled into faulty or nonfaulty signals. A DBN with the raw-DBN setup was trained using the benchmark dataset. The root mean square error (RMSE) between the home sleep data and the reconstructed signal from the trained DBN for the five night runs, and a close-up for night 2, where an electrode falls off after around 380 minutes, can be seen in Figure 7.
Advances in Artificial Neural Systems

Figure 5: Error bars of the 28 features. The gray number in the background represents how many times that feature was part of the best subset from the SBS algorithm (maximum is 25).

Figure 6: Learned features of layer 1 for (a) EEG, (b) EOG1, (c) EOG2, and (d) EMG. It can be observed that the learned features are of various amplitudes and frequencies and some resemble known sleep events such as a K-complex or blink artifacts. Only the first 100 of the 200 features are shown here.

Figure 7: RMSE for five night runs recorded at home (bottom). Color-code of RMSE for night run 2, where redder areas indicate more anomalous areas of the signal. EOG2 falls off at around 380 minutes (top).
Figure 3 shows data that has been collected at a healthy patient's home during sleep. All signals, except EEG2, are nonfaulty prior to a movement artifact at t = 7 s. This movement affected the reference electrode or the ground electrode, resulting in disturbances in all signals for the rest of the night, thereby rendering the signals unusable by a clinician. A poorly attached electrode was the cause for the noise in signal EEG2.

Figure 3: PSG data collected in a home environment. A movement occurs at t = 7 s, resulting in one of the electrodes being misplaced, affecting EOG1 and both EEG channels. EOG2 is not properly attached, resulting in a faulty signal for the entire night.
Two datasets are used in this work. The first consists of 25 acquisitions and is used to train and test the automatic sleep stager. The second consists of 5 acquisitions and is used to validate anomaly detection on sleep data collected at home.
[...] Hz after being prefiltered with a band-pass filter of 0.3 to 32 Hz for EEG and EOG, and 10 to 32 Hz for EMG. Each epoch before and after a sleep stage switch is removed from the training set to avoid possible subsections of mislabeled data within one epoch. This resulted in 20.7% of total training samples being removed. A 25-fold leave-one-out cross-validation is performed. Training samples are randomly picked from 24 acquisitions in order to compensate for any class imbalance. A total of approximately 250000 training samples and 50000 training validation samples are used for each validation.
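The two training-set preparation steps, removing the epochs adjacent to a stage switch and drawing a class-balanced random sample, can be sketched as follows. The function names and the `per_class` parameter are hypothetical, and "each epoch before and after a sleep stage switch" is interpreted here as the single epoch on either side of every transition.

```python
import random


def drop_transition_epochs(labels):
    """Indices of epochs to keep after removing the epoch immediately
    before and immediately after every sleep stage switch."""
    drop = set()
    for k in range(1, len(labels)):
        if labels[k] != labels[k - 1]:
            drop.update((k - 1, k))
    return [k for k in range(len(labels)) if k not in drop]


def balanced_sample(labels, indices, per_class, rng=random):
    """Randomly pick up to per_class indices for every stage among indices."""
    by_class = {}
    for k in indices:
        by_class.setdefault(labels[k], []).append(k)
    picked = []
    for stage in sorted(by_class):
        pool = by_class[stage]
        picked.extend(rng.sample(pool, min(per_class, len(pool))))
    return picked
```

In a leave-one-out run, both steps would be applied to the 24 training acquisitions only, leaving the held-out acquisition untouched for testing.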

Table 1: Classification accuracy and F1-score for the three experimental setups.