A Multilevel Temporal Context Network for Sleep Stage Classification

Sleep stage classification is essential in diagnosing and treating sleep disorders. Many deep learning models have been proposed to classify sleep stages by automatically learning features and temporal context information. These temporal context features come from the intra-epoch temporal features, which represent the overall morphology of an epoch, and the temporal features of adjacent epochs and long epochs, which represent the influence between epochs. However, most existing methods do not fully use the complementarity of the three-level temporal features, resulting in incomplete extracted temporal features. To solve this problem, we propose a multilevel temporal context network (MLTCN) to learn the temporal features from intra-epoch, adjacent epochs, and long epochs, which utilizes the complete temporal features to improve classification accuracy. We evaluate the performance of the proposed model on the Sleep-EDF datasets published in 2013 and 2018. The experimental results show that our MLTCN can achieve an overall accuracy of 84.2% and a kappa coefficient of 0.78 on the Sleep-EDF-2013 dataset. On the larger Sleep-EDF-2018 dataset, the overall accuracy is 81.0%, and the kappa coefficient is 0.74. Our model can better assist sleep experts in diagnosing sleep disorders.


Introduction
Sleep disorders are common diseases, mainly including drowsiness, insomnia, and sleep apnea. According to the World Health Organization, the global sleep disorder rate is 27%. In 2016, the sleep survey results of the China Sleep Research Association showed that the insomnia rate of Chinese adults was as high as 38.2%, and more than 300 million Chinese people had sleep disorders. Sleep disorders can increase the risk of heart disease, hypertension, Alzheimer's disease, depression, anxiety disorders, and other diseases, which seriously affect human health and quality of life [1].
Sleep stage classification is basic research for the diagnosis of sleep disorders. Sleep specialists classify the sleep stages via polysomnography (PSG), the gold standard of sleep scoring. PSG collects physiological signals recorded from various sensors, including electroencephalography (EEG), electrooculography (EOG), electromyography (EMG), pulse oximetry, and respiration. These signals are divided into 30-second epochs, and sleep specialists manually label each epoch according to standard criteria, such as the American Academy of Sleep Medicine (AASM) rules [2] or the Rechtschaffen and Kales rules [3]. According to the AASM rules, each epoch is classified into one of five stages: Wake, REM, N1, N2, and N3. Manual sleep stage classification is time-consuming, tedious, and exhausting. Thus, automatic sleep stage classification methods have developed rapidly in recent years. Many researchers analyze the changes in various physiological signals to classify sleep stages and achieve good performance. However, the multichannel physiological signal acquisition process increases the sleep monitoring cost and affects the subjects' sleep. Therefore, many studies have used single-channel signals to classify sleep stages.
EEG can well reflect the brain wave activity during sleep, which has recently become attractive for sleep stage classification.
At present, many researchers use deep learning methods to classify sleep stages by automatically learning features combined with temporal context information. These temporal features can be learned from three levels: intra-epoch, adjacent epochs, and long epochs. As shown in Figure 1, Level_0 represents the temporal features within an epoch. Level_1 represents the temporal features of adjacent epochs, including the left and right neighbor epochs. Level_2 describes the temporal features of long epochs, whose length is greater than 3. Each level of temporal features provides information at a different granularity, and these temporal features are complementary for sleep stage classification. For example, Level_0 represents the overall morphology of an epoch. It is worth noting that sleep experts often determine the sleep stage according to the morphology of the EEG signals. However, the overall morphology of the EEG signals in some stages is similar and difficult to distinguish, such as Wake and REM. Level_1 utilizes the temporal relationship of adjacent epochs and improves the distinguishability of Wake and REM. Level_1 only considers the context information of short-term epochs, namely, the left and right neighbor epochs. Sleep stage transitions follow certain transition rules and are not a stochastic process. Level_2 is used to learn the sleep transition rules from long epochs. It further complements the temporal context information by fine-tuning abnormal sleep epochs within long epochs.
During sleep stage classification, the functions of Level_0, Level_1, and Level_2 are different, but existing studies only use one or two kinds of temporal features. They do not utilize the complementarity of the different-level temporal features. To solve this problem, we propose a multilevel temporal context network for sleep stage classification, which learns temporal features through three temporal context learning blocks to improve the classification performance. The main contributions of this paper are as follows: (i) We propose a multilevel temporal context network (MLTCN), which learns the temporal features from three levels: intra-epoch, adjacent epochs, and long epochs. MLTCN fully utilizes the complementarity of multilevel temporal features to improve classification performance. (ii) We deploy an intra-epoch temporal context learning block to efficiently capture the morphology features from the raw signals and time-frequency images. Moreover, dilated causal convolution is used to learn the temporal features of an epoch from the raw signals. (iii) We apply weighted fusion of classification and prediction to capture the temporal features of adjacent epochs. Different weights are given according to the different functions of classification and prediction so that the model can better reflect the influence between adjacent epochs. (iv) To further supplement the temporal features of long epochs, we fine-tune the classification results according to the transition probability between epochs.

Related Work
According to the feature acquisition method, automatic sleep stage classification can be divided into handcrafted feature extraction and automatic feature learning. Handcrafted feature extraction methods need to extract time-domain, frequency-domain, and nonlinear features according to prior knowledge. Then, feature selection is carried out to remove redundant features, and a support vector machine (SVM), k-nearest neighbor (k-NN), or random forest (RF) is used to classify [4][5][6]. Although these traditional machine learning methods have achieved reasonable performance, they need the prior knowledge of sleep experts, and the classification performance depends on the extracted features and the selected classifier. To avoid the complexity of feature extraction, researchers have applied deep learning models to sleep stage classification. A deep learning model can automatically learn features without prior knowledge. Several studies have designed convolutional neural networks (CNNs) for learning features from raw EEG signals [7][8][9][10][11][12][13] and time-frequency images [14,15]. Tsinalis et al. [7] used the raw EEG signals to learn features and the relationships between features with two convolution and pooling layers. Sokolovsky et al. [12] designed a deep CNN and showed that the classification performance depends on the network depth rather than the number of channels. Phan et al. [14] transformed the raw EEG signals into two-dimensional time-frequency images using the short-time Fourier transform and learned features through multiple small-scale convolution kernels. Khalili and Mohammadzadeh Asl [9] used a CNN to learn time-domain and frequency-domain features, and then used a temporal convolution model and conditional random field to learn the temporal context information between these features. Eldele et al. [18] used a multiresolution convolutional neural network and adaptive feature recalibration to extract features and then captured the temporal dependencies among the extracted features by using a multihead attention mechanism. Zhu et al. [13] extracted features by convolution on different windows, embedded position information, calibrated features with an attention module, and learned the temporal context information within an epoch.
There are not only temporal features within an epoch, but also certain temporal dependencies between sleep stages [19,20]. For example, on the Sleep-EDF-2013 dataset, adjacent epochs with the same label account for 89.7% of the total, and there are also state transition probabilities between different stages. Many studies use many-to-one or one-to-many models to learn the temporal features of adjacent epochs [7,8,11,21,22,23]. Sors et al. [8] took five successive epochs as input and used 12 convolutional layers and two fully connected layers to learn the features. Seo et al. [22] took 4 or 10 consecutive epochs as input, used ResNet-50 to learn the features, and fed these features into a bidirectional long short-term memory (Bi-LSTM) network to learn the temporal features. Vilamala et al. [23] used the time-frequency images of five consecutive epochs to learn the temporal features between epochs. These many-to-one modes need several epochs as input, the computational complexity of the model is high, and it is easy to produce model ambiguity. Phan et al. [14] used a one-to-many model to learn the temporal context features between adjacent epochs: a single time-frequency image was used as input to obtain the probabilities of classification and prediction, and the classification results were obtained by fusing these probabilities. These studies have achieved good results in learning the temporal features of adjacent epochs, but for long epochs, the performance declines as the number of input epochs increases.
To learn the temporal features of long epochs, many studies used RNNs, which can store all past information of a time series in their hidden units. Michielli et al. [24] proposed a cascaded long short-term memory (LSTM) network to classify sleep stages: the first network performed multiclass classification by merging stages N1 and REM into a single class, while the second one performed the binary classification. Phan et al. [17] adopted a dual RNN to learn the temporal features within an epoch and over long epochs. First, they used a Bi-RNN with attention to learn the temporal features within an epoch and then used a Bi-RNN to obtain the temporal features of long epochs. The combination of CNN and RNN can also be used for sleep stage classification [25][26][27][28]. Supratak et al. [25] utilized convolution kernels of different sizes to learn time-domain and frequency-domain features and then adopted an RNN to learn the transition rules between sleep stages. Mousavi et al. [26] adopted a dual CNN to learn the intra-epoch features and used an encoder-decoder RNN with attention to learn the most relevant parts of these features. Although these sequence-to-sequence models can fully learn the long-term temporal features by using RNNs, they need more epochs as input, which makes the models more complicated and the training time longer. To reduce the number of input epochs and shorten the training time, a simple postprocessing method can be considered.
Most deep learning models only consider one or two kinds of temporal features among intra-epoch, adjacent epochs, and long epochs. They do not fully use the complementarity of different-level temporal context information. Besides, many models learn temporal features with multiple epochs as input or with RNNs, which have high computational overhead and are difficult to train. To improve the accuracy, we propose the MLTCN to learn the temporal context information from three levels. The overall framework of MLTCN is shown in Figure 2. Firstly, a 30 s EEG signal is input into the preprocessing block, which outputs standardized EEG signals and time-frequency images. Secondly, we utilize a temporal convolutional network (TCN) to learn the intra-epoch temporal features from the raw EEG signal and adopt the 1-max pooling CNN to learn the frequency-domain features from the time-frequency image. Thirdly, three outputs are generated: the prediction results for the adjacent epochs and the classification result for the current epoch. According to the accuracy of prediction and classification, the output of each task is given a different weight, and the classification result is obtained by weighted fusion. Finally, the fused classification results are used as the observation sequence of a hidden Markov model (HMM), and the most likely hidden state sequence is obtained by the Viterbi algorithm. The classification results are fine-tuned using the hidden state sequences to learn the temporal features of the long epochs. In the following subsections, we will introduce each block in detail.

Preprocessing Block.
To extract the temporal and frequency-domain features within an epoch, the preprocessing block needs to output standardized EEG signals and time-frequency images. The standardized EEG signals are used to learn the temporal features within the epoch, and the time-frequency images are used to learn the frequency-domain features. Various forms of signals can improve sleep classification performance. The standardized EEG signals are obtained by subtracting the mean value and dividing by the standard deviation:

S̃_i = (S_i − μ) / σ,

where S_i represents the i-th epoch EEG signal, μ represents the mean value of the EEG signals, and σ represents the standard deviation. EEG time-frequency images are obtained by the short-time Fourier transform (STFT), using a Hamming window and a 256-point fast Fourier transform (FFT). The window size is 2 s with 50% overlap. A logarithmic operation is carried out on the time-frequency images to generate the log-power spectrum, whose size is 29 × 129. Frequency-domain filter banks are used to smooth the frequency axis and reduce the dimension [28]; the new size of the spectrum is 29 × 20.
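As a minimal sketch of this preprocessing, assuming the 100 Hz sampling rate of the Sleep-EDF EEG channels (which matches the 3000-sample epochs and the 29 × 129 spectrum size; the filter-bank reduction to 29 × 20 is not shown):

```python
import numpy as np

def preprocess_epoch(signal, fs=100, win_sec=2, n_fft=256):
    """Standardize a 30 s EEG epoch and compute its log-power spectrogram."""
    # Standardization: subtract the mean, divide by the standard deviation.
    standardized = (signal - signal.mean()) / signal.std()

    # STFT with a 2 s Hamming window, 50% overlap, 256-point FFT.
    win = int(win_sec * fs)          # 200 samples
    hop = win // 2                   # 50% overlap -> 100-sample hop
    window = np.hamming(win)
    frames = [standardized[i:i + win] * window
              for i in range(0, len(standardized) - win + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # Log-power spectrum; a small constant avoids log(0).
    log_spec = np.log(spec + 1e-12)
    return standardized, log_spec

epoch = np.random.randn(3000)        # one 30 s epoch at 100 Hz
std_sig, img = preprocess_epoch(epoch)
print(img.shape)                     # (29, 129), as in the paper
```

The 29 time frames fall directly out of the window arithmetic: (3000 − 200) / 100 + 1 = 29, and a 256-point real FFT yields 129 frequency bins.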

Intra-Epoch Temporal Context Learning Block.
To learn the temporal features within an epoch, we design the intra-epoch temporal context learning block. This block includes two sub-modules: TCN and 1-max pooling CNN. The TCN is used to learn the temporal features instead of LSTM to shorten the training time. The 1-max pooling CNN [14] is used to extract frequency-domain features from the time-frequency images. The 1-max pooling CNN consists of three layers: one convolutional layer, one pooling layer, and one multitask softmax layer. Its convolutional layer simultaneously accommodates convolutional kernels of varying sizes: we use 400 filters with kernel sizes (20, 3), (20, 5), and (20, 7). The pooling layer adopts a 1-max pooling strategy to retain the most prominent feature. The multitask softmax layer is adopted to fuse prediction and classification. The module is described as follows:

P^G = (P(y^G_{n−1} | x^G_n), P(y^G_{n+1} | x^G_n), P(y^G_n | x^G_n)),

where x^G_n represents the time-frequency image of the n-th epoch as the input and P^G represents the output probabilities: P(y^G_{n−1} | x^G_n), P(y^G_{n+1} | x^G_n), and P(y^G_n | x^G_n) represent the prediction probability of the forward epoch, the prediction probability of the backward epoch, and the classification probability of the current epoch, respectively.
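The 1-max pooling idea can be illustrated with a toy numpy sketch (random kernels and only 4 filters per width here, purely for illustration; the paper learns 400 filters spanning all 20 frequency bins):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((29, 20))    # time x frequency, after filter banks

def conv_1max(image, n_filters, width, rng):
    """Full-height convolution over the time axis followed by 1-max pooling."""
    t, f = image.shape
    kernels = rng.standard_normal((n_filters, width, f))
    # Each valid time position yields one activation per filter.
    feats = np.array([[np.sum(image[i:i + width] * k)
                       for i in range(t - width + 1)]
                      for k in kernels])
    return feats.max(axis=1)             # 1-max pooling keeps only the peak

# Kernel widths 3, 5, 7 mirror the (20, 3), (20, 5), (20, 7) kernels.
features = np.concatenate([conv_1max(image, 4, w, rng) for w in (3, 5, 7)])
print(features.shape)                    # (12,): 4 filters x 3 widths
```

Because each filter contributes a single maximum activation, the feature length is independent of the number of time frames, which is what makes the variable-size kernels easy to combine.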
For the temporal features within an epoch, we utilize a module described as follows: the n-th epoch x^R_n = (x_{n1}, x_{n2}, . . ., x_{n3000}) represents the input, and the corresponding output is P^R, which includes the probabilities of prediction and classification. y_n ∈ {Wake, REM, N1, N2, N3} indicates the five sleep stages. TCN has been introduced by recent research [29]. The structure of TCN is shown in Figure 3. It is composed of 5 temporal block layers. Each temporal block layer includes two dilated causal convolutions, two dropouts, and a residual connection. The dropouts are used to prevent overfitting, and the residual connection alleviates network degradation. Compared with traditional convolution, dilated causal convolution can extract global temporal features with fewer layers by using a dilation factor. To ensure that the output depends on all input data, we need to consider several parameters and the relationships between them. Firstly, we select a constant b as the dilation factor and use it to calculate the dilation distance of the i-th layer as d = b^i. Secondly, the receptive field width w is calculated as follows:

w = 1 + (k − 1) · (b^n − 1) / (b − 1),

where k is the size of the convolution kernel, n is the number of convolution layers, and b is the dilation factor. To make the receptive field have no holes, the size of the convolution kernel should be at least as large as the dilation factor, that is, k ≥ b. Moreover, zero padding is required for the convolution operation in each layer, which ensures the equal length of the input and output sequences. To cover the 3000 data points within an epoch when extracting the temporal features of each sleep stage, the receptive field must satisfy w ≥ 3000.
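The receptive field check can be sketched as follows (assuming one dilated convolution per layer with dilations b^0 … b^{n−1}; the two convolutions per temporal block in Figure 3 only widen the field further):

```python
def receptive_field(k, n, b):
    """Receptive field width of n stacked dilated causal convolutions
    with kernel size k and dilation d_i = b**i for layer i:
        w = 1 + (k - 1) * (b**n - 1) / (b - 1)
    """
    return 1 + (k - 1) * (b**n - 1) // (b - 1)

# The paper's final setting: kernel size 7, 5 layers, dilation factor 5.
w = receptive_field(k=7, n=5, b=5)
print(w)   # 4687 >= 3000, so every sample of a 30 s epoch is covered
```

With k = 7 ≥ b = 5, the no-holes condition also holds, so every one of the 3000 input samples influences the output.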

Adjacent Epoch Temporal Context Learning Block.
In the proposed one-to-many setting, the network should be penalized for both misclassification and misprediction on a training epoch. The network inputs an epoch, represented as x_n. The ground-truth one-hot encoding vectors are (y_{n−1}, y_n, y_{n+1}), and the corresponding classification labels are (y_{n−1}, y_n, y_{n+1}). The loss is computed as the sum of the cross-entropy errors on the individual subtasks:

E_n(θ) = −[y_{n−1} log P(y_{n−1} | x_n; θ) + y_n log P(y_n | x_n; θ) + y_{n+1} log P(y_{n+1} | x_n; θ)],

where θ denotes the network parameters. The network is trained to minimize the multitask cross-entropy loss over N training samples:

E(θ) = Σ_{n=1}^{N} E_n(θ) + λ ‖θ‖²_2,

where λ denotes the hyper-parameter that trades off the error terms and the L2-norm regularization term.
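The multitask objective can be sketched in numpy (illustrative only; `lam` is a placeholder value for the trade-off hyper-parameter, and the toy distributions below are not model outputs):

```python
import numpy as np

def multitask_loss(probs, onehots, params, lam=1e-3):
    """Sum of the cross-entropy errors on the three subtasks plus
    an L2 regularization term over the network parameters."""
    ce = sum(-np.sum(onehots[t] * np.log(probs[t] + 1e-12))
             for t in ("prev", "cur", "next"))
    l2 = lam * sum(np.sum(p ** 2) for p in params)
    return ce + l2

uniform = np.full(5, 0.2)                  # uniform over the 5 sleep stages
probs = {"prev": uniform, "cur": uniform, "next": uniform}
onehots = {t: np.eye(5)[0] for t in probs}
loss = multitask_loss(probs, onehots, params=[np.zeros(3)])
print(round(float(loss), 4))               # 3 * -log(0.2) ~ 4.8283
```

Each subtask contributes −log(0.2) under a uniform prediction, so the untrained loss is about 3·log 5.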
To improve the classification accuracy, we adopt the one-to-many model with weighted fusion to learn the adjacent epoch temporal features. This process is shown in Figure 4. Both TCN and 1-max pooling CNN output two prediction probabilities and one classification probability. These two groups of probabilities use the same fusion operation; here, we only describe the fusion of one group, and the other is similar. The n-th epoch EEG signal x_n is input to the intra-epoch temporal context learning block, which outputs the forward prediction probability P(y_{n−1} | x_n), the backward prediction probability P(y_{n+1} | x_n), and the classification probability P(y_n | x_n), as shown by the purple line. To determine the sleep stage of x_n, the prediction probabilities P(y_n | x_{n−1}) and P(y_n | x_{n+1}) and the classification probability P(y_n | x_n) need to be considered at the same time.

[Figure 3: The architecture of the temporal convolutional network. TCN is composed of 5 temporal block layers. Each temporal block layer includes two dilated causal convolutions, two dropouts, and a residual connection. In the dilated causal convolutions, the yellow circles represent the zero paddings.]

We utilize weighted fusion to obtain the new probability. The weights are calculated according to the ratio of the prediction accuracy of the adjacent epochs to the classification accuracy of the current epoch. The fused probability is defined as follows:

P(y_n) = α_{n−1} P(y_n | x_{n−1}) + α_n P(y_n | x_n) + α_{n+1} P(y_n | x_{n+1}),

where α_i represents the prediction weight of the i-th epoch, computed from PreACC(x_i), the prediction accuracy of x_i (namely, the accuracy of x_i as input and y_n as output), and ACC(x_n), the classification accuracy of x_n. Eventually, the classification label y_n is determined by likelihood maximization:
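A sketch of the weighted fusion, using the worked accuracies quoted later in the paper; the squared-ratio weight form is an inference from those reported numbers (78.27/80.02 → 0.96 and 75.37/80.02 → 0.89 only match when the ratio is squared) and should be treated as an assumption:

```python
import numpy as np

def fuse(p_prev, p_cur, p_next, pre_acc_prev, acc_cur, pre_acc_next):
    """Weighted fusion of the three probability vectors that vote on epoch n.

    Weight form (squared accuracy ratio) is an assumption inferred from
    the paper's worked example; the classification weight is fixed at 1.
    """
    a_prev = (pre_acc_prev / acc_cur) ** 2
    a_next = (pre_acc_next / acc_cur) ** 2
    fused = a_prev * p_prev + 1.0 * p_cur + a_next * p_next
    return int(np.argmax(fused)), (a_prev, 1.0, a_next)

p_prev = np.array([0.1, 0.6, 0.1, 0.1, 0.1])    # P(y_n | x_{n-1})
p_cur  = np.array([0.2, 0.5, 0.1, 0.1, 0.1])    # P(y_n | x_n)
p_next = np.array([0.1, 0.7, 0.1, 0.05, 0.05])  # P(y_n | x_{n+1})
label, weights = fuse(p_prev, p_cur, p_next, 78.27, 80.02, 75.37)
print(label, [round(w, 2) for w in weights])    # 1 [0.96, 1.0, 0.89]
```

All three vectors agree on stage 1 here, so the fused argmax simply confirms it; the weights matter when the three votes disagree.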

Long Epoch Temporal Context Fine-Tuning.
We utilize an HMM to fine-tune the classification results and modify the abnormal sleep stages of the long epochs. The structure of the long epoch temporal context fine-tuning block is shown in Figure 5; it is composed of the state sequence S_i and the observation sequence O_i. The state sequence is the sleep stage labeled by the sleep expert, and the observation sequence is the fused classification result. The HMM includes two parameters: the state transition probability P_tr and the emission probability P_em. P_tr is obtained from the statistics of the real labels in the training set, and P_em is obtained from the confusion matrix of the training set.
The fusion results are taken as the observation sequence, and the state sequence is unknown. The most likely sleep stage sequence is obtained by the Viterbi algorithm [30]. Using the parameters obtained from the training set to automatically fine-tune the fusion results, abnormal sleep stage sequences can be corrected. The length of the sleep sequence affects the result of fine-tuning. If the sequence length is too short, the long-term temporal features are insufficient, and some sleep stages cannot be corrected. If the sequence length is too long, the temporal dependence between sleep stages decreases. To avoid insufficient or excessive correction, we enumerate the sequence lengths within a specific range and select the best sequence length to fine-tune the sleep stages. Therefore, only the EEG signals at night are used, and we selected only the 30 minutes at the start and the end of the sleep periods. The number of epochs for each sleep stage is shown in Table 1.
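A minimal Viterbi sketch of the fine-tuning step, using a toy two-stage model (the probabilities here are illustrative; in the paper P_tr and P_em come from the training labels and the training-set confusion matrix):

```python
import numpy as np

def viterbi(obs, p_tr, p_em, p0):
    """Most likely hidden state sequence (Viterbi algorithm, log domain).

    p_tr[i, j]: transition probability from stage i to stage j
    p_em[i, o]: probability that true stage i is classified as o
    """
    log_tr, log_em = np.log(p_tr + 1e-12), np.log(p_em + 1e-12)
    delta = np.log(p0 + 1e-12) + log_em[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = delta[:, None] + log_tr       # best score ending in each state
        back.append(scores.argmax(axis=0))     # best predecessor per state
        delta = scores.max(axis=0) + log_em[:, o]
    path = [int(delta.argmax())]
    for bp in reversed(back):                  # backtrack
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy 2-stage model (0 = Wake, 1 = REM): strong self-transitions,
# a mildly noisy classifier as the emission model.
p_tr = np.array([[0.95, 0.05], [0.05, 0.95]])
p_em = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 0, 0, 0, 1, 0, 0, 0]                 # one isolated "REM" output
smoothed = viterbi(obs, p_tr, p_em, np.array([0.5, 0.5]))
print(smoothed)                                # the isolated REM is corrected
```

Because leaving and re-entering Wake costs far more log-probability than one unlikely emission, the isolated REM observation is smoothed back to Wake, which is exactly the fine-tuning behavior described above.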

Experimental Settings.
In our experiment, we perform leave-one-subject-out cross validation. With the Sleep-EDF-2013 dataset, we conduct 20-fold cross validation. Each fold selects the records of one subject as the test set, the records of 4 subjects as the validation set, and the records of the other 15 subjects as the training set. The training, validation, and test sets of each fold do not overlap, ensuring that the test set is independent, and the test data of the 20 folds cover the whole dataset. The performance evaluation of sleep stage classification is calculated based on the predicted results and actual labels of all test sets. With the Sleep-EDF-2018 dataset, we conduct 10-fold cross validation to assess the performance of the network: in each fold, 90% of the subjects are used for training and 10% as an independent test set, and 10% of the training set is used as the validation set. We also conduct cross-dataset experiments with the Sleep-EDF-2018 dataset. The network is implemented using the TensorFlow framework, and the GPU is an NVIDIA GTX 2080 Ti. The network is trained with a batch size of 20, and the learning rate is set to 1e−4. Cross-entropy is used as the loss function, and the Adam optimizer is adopted. The parameters of TCN affect the performance of the model. Through a large number of experiments, we balanced the classification accuracy and model training time and selected the best parameters: the size of the convolution kernel is 7, the number of convolution layers is 5, the dilation factor is 5, and the number of filters is 50. During training, the network that yields the best overall accuracy on the validation set is retained for evaluation.

Evaluation Metrics.
To evaluate the classification performance of the model, we utilized the overall accuracy (ACC), macro-averaged F1-score (MF1), and Cohen's kappa coefficient (kappa). They are defined as follows:

ACC = (Σ_{i=1}^{S} TP_i) / M,
MF1 = (1 / S) Σ_{i=1}^{S} F1_i,
kappa = (p_0 − p_e) / (1 − p_e),

where TP_i and F1_i are the true positives and F1-score of class i, S is the total number of classes, M represents the total number of epochs, p_0 represents the number of correctly classified samples divided by the total number of samples, and p_e represents the chance agreement. For each sleep stage i, its precision (Pre), recall (Rec), and F1-score (F1) are defined as follows:

Pre_i = TP_i / (TP_i + FP_i),
Rec_i = TP_i / (TP_i + FN_i),
F1_i = 2 · Pre_i · Rec_i / (Pre_i + Rec_i),

where FP_i, TN_i, and FN_i are the false positives, true negatives, and false negatives of class i, respectively.
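These metrics can be computed directly from a confusion matrix, as in this short sketch:

```python
import numpy as np

def metrics(y_true, y_pred, n_classes=5):
    """Overall accuracy, macro-averaged F1, and Cohen's kappa from labels."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    m = cm.sum()
    tp = np.diag(cm)
    acc = tp.sum() / m                                   # p_0
    prec = tp / np.maximum(cm.sum(axis=0), 1)            # TP / (TP + FP)
    rec = tp / np.maximum(cm.sum(axis=1), 1)             # TP / (TP + FN)
    f1 = np.where(prec + rec > 0,
                  2 * prec * rec / np.maximum(prec + rec, 1e-12), 0.0)
    mf1 = f1.mean()
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / m**2  # chance agreement
    kappa = (acc - pe) / (1 - pe)
    return acc, mf1, kappa

y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
acc, mf1, kappa = metrics(y_true, y_pred)
print(acc, mf1, kappa)      # perfect agreement gives 1.0 for all three
```

Note that kappa discounts the accuracy that would be obtained by chance (p_e), which matters for imbalanced stages such as N1.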

Performance of MLTCN.
The classification results of each sleep stage of MLTCN on the Sleep-EDF-2013 dataset are shown in Table 2. Among all the sleep stages, the classification performance of Wake and N2 is better: the recall of Wake is 89.6%, and the precision of N2 is 88.2%. The classification performance of N3 and REM is relatively poor; the precision of REM is 79%, and the F1-score of REM is 82.7%, mainly because some REM epochs are mistakenly classified as N2. The main reason is that N2 is adjacent to the REM stage, and their waveforms are similar. In addition, according to the characteristic waves of each sleep stage, we found that the characteristic waves of REM are rich and have frequency bands overlapping with Wake, N1, and N2, which are prone to misclassification. N1 has the lowest classification performance, with an F1-score of 39.4%, because N1 is the transitional sleep stage from Wake to REM or N2, and the signal waveforms of N1 and REM are relatively similar. The overall accuracy of MLTCN is 84.2%, MF1 is 77.0%, and the kappa coefficient is 0.78. The classification results of each sleep stage on the Sleep-EDF-2018 dataset are shown in Table 3; the overall accuracy is 81.0%, and the kappa coefficient is 0.74. To more intuitively observe the classification results of the MLTCN model, Figure 6 shows the hypnogram of the first night of SC400 from the Sleep-EDF-2013 dataset. The overall accuracy of MLTCN is 88.4%, and the kappa coefficient is 0.85. It can be seen from the figure that N1, as a transitional stage, is often misclassified as REM, and there are a few misclassifications between N3 and N2. Most of the other epochs are classified correctly.

Based on Variant 2, balanced fusion and weighted fusion are introduced to form Variant 4 and Variant 5. The accuracy of Variant 4 is 2.1% higher than that of Variant 2, and the kappa coefficient of Variant 5 is 2% higher than that of Variant 2, mainly because the temporal features between adjacent epochs are considered in the fusion strategy.
The performance of Variant 5 is slightly higher than that of Variant 4, because Variant 5 considers the effect of classification and prediction accuracy in the final decision. For example, the accuracy of prediction with the previous epoch is 78.27%, the accuracy of classification with the current epoch is 80.02%, and the accuracy of prediction with the latter epoch is 75.37%. During fusion, different weights are given to the prediction and classification probabilities. According to equation (5), the weights of prediction and classification are 0.96, 1, and 0.89, respectively. The accuracy of weighted fusion is 0.1% higher than that of balanced fusion.
On the basis of Variant 5, MLTCN adds the HMM block, and the accuracy reaches 84.2%, which is 0.3% higher than that of Variant 5. The MF1 score and kappa coefficient are also improved to varying degrees. The main reason is that the HMM fine-tunes the long epochs in the test set by learning the state transition matrix and emission matrix from the training set. For example, one observation sequence output by Variant 5 is 1,1,1,1,1,1,1,1,1,5,1,1,1,1,1,1,1. According to the transition matrix of the training set, the most likely hidden state sequence output by the Viterbi algorithm is 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, where 1 represents the Wake stage and 5 represents the REM stage. The observation sequence is modified by the hidden states, and the REM epoch, an abnormal sleep stage rarely seen within such long runs, is fine-tuned to Wake. The sleep stages after fine-tuning are consistent with the expert-labeled stages, which improves the classification performance. To better analyze the performance of each variant block, Figure 7 shows the confusion matrix of each model. Figure 7(a) shows the confusion matrix of Variant 1. The accuracy of Wake, N2, N3, and REM is 80%-88%, which is the lowest classification accuracy among all variants, and the accuracy of N1 is only 32%. The low classification performance of Variant 1 is mainly because it does not consider any temporal features and only learns features from the time-frequency image. Figure 7(c) shows the confusion matrix of Variant 3. Compared with Variant 2, the accuracy of N2 and REM is improved by 3% and 4%, respectively. This result shows that the HMM fine-tunes the results of N2 and REM. Compared with MLTCN, the accuracy of Wake and REM is reduced by 2%. This comparison shows that the temporal features of adjacent epochs have an impact on the accuracy of these two stages. Figures 7(d) and 7(e) show the confusion matrices for Variant 4 and Variant 5, respectively.
Compared with the accuracy of Variant 2, the accuracy of Wake and N2 is improved by 3%, and the accuracy of REM is improved by 5%, but the accuracy of N1 and N3 is reduced. The accuracy improvement of Wake, N2, and REM is due to considering the temporal context information of adjacent epochs. The reduction of N1 classification accuracy may be due to the small number of N1 samples, resulting in poor prediction and classification. Figure 7(f) represents the confusion matrix of MLTCN. Compared with the confusion matrix of Figure 7(e), the accuracy of N1 and REM is improved by 1%, and the accuracy of the other stages is unchanged. This indicates that the HMM can fine-tune the REM in the long epochs and adjust REM to N1 to improve the accuracy. Compared with the confusion matrix of Figure 7(a), the accuracy of Wake, N2, N3, and REM is improved to different degrees. The improvement of these sleep stages indicates that multilevel temporal features play a role in these classifications. In particular, the correct classification of the sleep stages in which sleep experts are interested can better assist them in diagnosing sleep problems. For example, accurate classification of Wake helps diagnose insomnia, and correct classification of REM helps diagnose sleep behavior disorder.

Efficiency of TCN.
Most studies use LSTM to learn temporal features, but LSTM takes a long time to train. To improve the learning efficiency of temporal features within an epoch, TCN is used to learn the intra-epoch temporal context information. A comparative experiment is designed to show that MLTCN with TCN can not only learn temporal features well, but also has higher efficiency than MLTCN with LSTM. In MLTCN_TCN, the number of filters is 50. The accuracy of MLTCN with TCN is 80.8%, which is 0.7% higher than that of MLTCN with LSTM. The MF1 of MLTCN_TCN and MLTCN_LSTM is 73.3% and 72.2%, respectively. Figure 8(b) shows the training time of the first fold. The training time of MLTCN with TCN is 3741 s and that with LSTM is 36443 s, 9.7 times that of MLTCN_TCN. The experimental results show that using MLTCN_TCN to learn the intra-epoch temporal features not only achieves higher performance, but also shortens the model training time.

Influence of HMM Observation Sequence Length.
The length of the HMM observation sequence affects the classification performance. Following the discussion of sequence length in the literature [32], we tested the classification performance under different lengths within a certain range. Figure 9 shows the accuracy, kappa, and MF1 for different observation sequence lengths. It can be seen that each metric increases slightly as the sequence length increases. The accuracy before fine-tuning is 83.9%. With a sequence length of 3, the accuracy is 83.8%, degraded by 0.1% compared with that before fine-tuning, which shows that the adjacent temporal context block has already learned good temporal features. With a sequence length of 17, the accuracy is 84.2%, MF1 is 77.1%, and the kappa coefficient is 0.783. The accuracy is improved by 0.3% compared with that before fine-tuning, which shows that the HMM is effective for fine-tuning long epochs.

Effectiveness of HMM Fine-Tuning Block.
To evaluate the effectiveness of the HMM fine-tuning block, a group of comparative experiments is performed, with and without the HMM fine-tuning block.
The HMM observation sequence length is 17. Figure 10 shows the hypnograms of subject SC407 from the Sleep-EDF-2013 dataset under different settings. Figure 10(a) shows the hypnogram labeled by the human expert. Figure 10(b) shows the hypnogram labeled by the proposed network without the HMM fine-tuning block; the accuracy is 90.3%, and kappa is 0.86. Figure 10(c) shows the hypnogram labeled by the proposed network with the HMM fine-tuning block; the accuracy is 90.7%, and kappa is 0.87. The accuracy is improved by 0.4% compared with that before HMM fine-tuning. The fine-tuned labels are marked with the ★ symbol. The output hypnogram of the MLTCN with the HMM fine-tuning block aligns very well with the corresponding human expert labels. For the long epochs, the abnormal sleep stages are fine-tuned according to the transition and emission probabilities; for example, some REM epochs are fine-tuned to N1, and some N2 epochs are fine-tuned to REM.
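The fine-tuning step described above can be sketched as Viterbi decoding over the network's predicted labels: the HMM treats the predictions as observations, and its transition and emission probabilities smooth out implausible isolated stages. The numpy sketch below is illustrative only; the prior, transition matrix, and emission matrix are made-up "sticky" values, not the probabilities fitted in the paper.

```python
import numpy as np

# Stage indices: 0 = Wake, 1 = N1, 2 = N2, 3 = N3, 4 = REM.
def viterbi(obs, pi, A, B):
    """Most likely hidden stage sequence given an observed label sequence.
    obs: predicted stage indices from the network (the HMM observations)."""
    n, T = len(pi), len(obs)
    logd = np.log(pi) + np.log(B[:, obs[0]])  # log delta at t = 0
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)    # scores[i, j]: prev i -> cur j
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

pi = np.full(5, 0.2)                              # uniform prior over stages
A = np.full((5, 5), 0.05) + np.eye(5) * 0.75      # sticky transitions (0.8 self)
B = np.full((5, 5), 0.0375) + np.eye(5) * 0.8125  # 0.85 on the diagonal
# An isolated REM (4) inside a run of N2 (2) is smoothed away:
pred = [2, 2, 2, 4, 2, 2, 2]
print(viterbi(pred, pi, A, B))  # -> [2, 2, 2, 2, 2, 2, 2]
```

Because self-transitions dominate the transition matrix, a single outlier prediction costs two unlikely transitions, and the decoder prefers to relabel it, which is the smoothing behavior visible in the hypnogram.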

Discussion
We evaluate the performance of our MLTCN against various existing approaches in terms of overall accuracy, MF1 score, and κ on the Sleep-EDF-2013 and Sleep-EDF-2018 datasets. Table 6 shows the performance comparison between the MLTCN and existing approaches. The table includes the network structure, the form of the input EEG signal, the correspondence between the number of input and output epochs, the number of epochs, and the performance. According to the number of epochs from input to output, these models can be divided into four modes: one-to-one, many-to-one, one-to-many, and many-to-many. The one-to-one mode does not learn the temporal features of any epochs. The many-to-one and one-to-many modes can learn the temporal features between adjacent epochs, and the many-to-many mode can learn the sequential features of long epochs.
On the Sleep-EDF-2013 dataset, Phan et al. [28] used a simple CNN to learn features from time-frequency images without considering any level of temporal features, and the accuracy was 79.1%. Tsinalis et al. [7] and Vilamala et al. [23] adopted CNNs to learn the temporal features from multiple epochs, and the accuracy was 74.8% and 81.3%, respectively. Seo et al. [22] applied a Bi-LSTM to learn the temporal features between adjacent epochs, and the accuracy reached 83.6%; because the other levels of temporal features were not considered, the accuracy was 0.6% lower than that of the MLTCN model. The many-to-one mode needs multiple epochs as input, which can easily cause model ambiguity. The one-to-many mode proposed by Phan et al. [14] learned the temporal features between adjacent epochs, and the accuracy was 81.9%. On this basis, MLTCN adds the temporal features of intra-epoch and long epochs and uses a fusion method when learning the temporal features of adjacent epochs. The accuracy is improved by 0.9%, the MF1 score by 1.2%, and the kappa coefficient by 0.02.
Many-to-many models can learn the temporal features of long epochs. On the Sleep-EDF-2013 dataset, Supratak et al. [25] utilized an RNN to learn the temporal features. Zhang et al. [16] adopted a dual CNN to learn features from raw EEG signals and time-frequency images; these features are used as the input of an RNN to learn the temporal correlation of successive epochs, and the final results are fine-tuned using an HMM. Compared with the MLTCN model, this model learns the temporal features of long epochs but lacks the intra-epoch and adjacent-epoch temporal features. Its accuracy is 83.8%, which is 0.4% lower than that of the MLTCN model. Yang et al. [10] utilized an HMM to learn the temporal features of long epochs and obtained an accuracy of 83.98%, but did not consider the intra-epoch temporal features, and the performance was lower than that of MLTCN. MLTCN learns the three-level temporal features from intra-epoch, adjacent epochs, and long epochs at the same time.
The accuracy was 84.2%, the MF1 score was 77.1%, and the kappa coefficient was 0.78, all higher than those of the other classification models. On the larger Sleep-EDF-2018 dataset, the performance of our MLTCN model is better than the one-to-many network proposed by Phan et al. [14] and the many-to-many network proposed by Supratak et al. [25].
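The three metrics compared throughout this discussion are all derived from the confusion matrix. As a reference for how they relate, the following is a small numpy sketch computing overall accuracy, macro-averaged F1 (MF1), and Cohen's kappa; the 3×3 confusion matrix is a made-up example, not results from the paper.

```python
import numpy as np

def metrics_from_confusion(C):
    """Overall accuracy, macro-F1 (MF1), and Cohen's kappa from a
    confusion matrix C, where C[i, j] counts true class i scored as j."""
    n = C.sum()
    acc = np.trace(C) / n
    # Per-class precision/recall, then F1 averaged over classes (macro-F1).
    tp = np.diag(C)
    prec = tp / np.maximum(C.sum(axis=0), 1)
    rec = tp / np.maximum(C.sum(axis=1), 1)
    f1 = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    mf1 = f1.mean()
    # Chance agreement for kappa: product of row and column marginals.
    pe = (C.sum(axis=0) * C.sum(axis=1)).sum() / n**2
    kappa = (acc - pe) / (1 - pe)
    return acc, mf1, kappa

C = np.array([[90,  5,  5],
              [10, 70, 20],
              [ 5, 15, 80]])
acc, mf1, kappa = metrics_from_confusion(C)
print(round(acc, 3), round(mf1, 3), round(kappa, 3))  # -> 0.8 0.798 0.7
```

Kappa discounts the agreement expected by chance from the class marginals, which is why it is a stricter measure than raw accuracy on imbalanced stages such as N1.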

Conclusion
We propose an MLTCN for sleep stage classification, which can learn the temporal features from three levels: intra-epoch, adjacent epochs, and long epochs. MLTCN utilizes multilevel temporal context learning blocks to obtain complete temporal features and improve the classification performance.
The evaluation of the proposed model was conducted on the Sleep-EDF-2013 and Sleep-EDF-2018 datasets and achieved stable and promising results, outperforming existing approaches. In addition, ablation experiments are performed to verify the effectiveness of each temporal feature learning block. Through the comparative experiment between MLTCN_LSTM and MLTCN_TCN, it is shown that MLTCN_TCN not only improves the classification performance but also shortens the training time. The sensitivity analysis of the HMM observation sequence length shows that the HMM is effective for fine-tuning the long epochs. In future work, we will apply this model to other types of subjects, such as patients with sleep apnea, and analyze the temporal features of sleep stages in special populations.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.