A High-Efficiency Fatigued Speech Feature Selection Method for Air Traffic Controllers Based on Improved Compressed Sensing

Air traffic controller fatigue has recently received considerable attention from researchers because it is one of the main causes of air traffic incidents. Numerous studies have extracted speech features related to fatigue, and their practical use has achieved some positive detection results. However, the applied speech features are usually of high dimension, which leads to computational complexity and inefficient fatigue detection. It is therefore worthwhile to reduce the dimensionality and select only a few efficient features. This paper addresses these problems by proposing a high-efficiency fatigued speech feature selection method based on improved compressed sensing. To adapt the method to the specific field of fatigued speech, we propose an improved compressed sensing construction algorithm that decreases the reconstruction error and achieves superior sparse coding. The proposed feature selection method is then applied to optimize the high-dimension fatigued speech features based on the fractal dimension. Finally, a support vector machine classifier is applied in a series of comparative experiments using the Civil Aviation Administration of China radiotelephony corpus to demonstrate that the proposed method provides a significant improvement in the precision of fatigue detection compared with current state-of-the-art approaches.


Introduction
IATA (the International Air Transport Association) has predicted that China will become the largest civil aviation market in the world by around 2025, with China's civil aviation involving the flow of 1.6 billion passengers by around 2037 [1]. The rapid development of civil aviation represents a great challenge to air traffic control and contributes to increasing shortages of air traffic controllers (ATCs). The resulting high workloads can increase the fatigue experienced by ATCs, thus increasing the probability of human error and the associated dangerous consequences for aviation safety [2]. Research studies have demonstrated that greater fatigue is closely associated with higher risk [3]. This situation has led researchers in the field of civil aviation to pay considerable attention to the accurate detection of fatigue in ATCs.
Fatigue in ATCs can be measured using a multitude of methods and tools, which can be grouped into two categories: subjective and objective methods [4]. Subjective self-rating scales and questionnaires have been the most important sources of data for assessing both ATC and pilot fatigue [5,6]. Two renowned and validated subjective fatigue/sleepiness scales are the Karolinska sleepiness scale [7] and NASA's task load index [8]. Although subjective methods are easy to implement, they perform poorly in detecting a fatigue state rapidly, let alone in real time. Therefore, objective methods have received a considerable amount of research interest.
There are two categories of popular objective methods, distinguished by what they measure: (1) methods based on physiological parameters, including heart rate, blood pressure, breathing rate, electroencephalogram, and skin electricity [9][10][11], and (2) methods that directly record observable body actions, including voice strength, eye movement, blink times, yawning, and nodding frequency [12]. These objective methods are more accurate and can be used to formulate a reliable physiological fatigue index. The main disadvantage of these monitoring techniques is that their intrusiveness usually results in aversion and disturbance to the ATC, which reduces their accuracy. Rapid developments in speech recognition have led to vocal feature-based methods recently emerging as the preferred avenue for research into fatigue in ATCs [13]. Vocal features are convenient to collect and analyse, given that the main job of ATCs involves communicating with pilots via radiotelephony, and regulations specify that all voice records must be preserved for a certain period of time.
There are several analyses in the literature of the connection between vocal features and fatigue [14,15,16]. Krajewski introduced a fatigue eigenvector composed of linear speech features such as the fundamental frequency, resonance peak, and mel-frequency cepstrum coefficient (MFCC) [17]. However, the reported average accuracy when using these features was 76.5%, which is inadequate for the work performed by ATCs.
It has been demonstrated that the detection accuracy of fatigued speech is greatly affected by feature extraction and efficient feature selection [15]. It has recently become convenient to extract common speech features such as pitch, energy, and MFCC using off-the-shelf software (e.g., openSMILE) [18]. In addition, some state-of-the-art approaches utilizing nonlinear features based on wavelet decomposition and the fractal dimension [19] have shown more efficient results in detecting ATC fatigue. However, an excess of features results in a difficult trade-off between computational complexity and accuracy. Furthermore, the duplicated features obtained by different methods will confuse the subsequent recognition network, which consequently leads to inefficient results in detecting fatigue [20]. This situation indicates the need to achieve efficient feature selection and reduce the dimensionality of features.
Compressed sensing (CS) is a sub-Nyquist sampling technique that allows a sparse signal to be reconstructed reliably from a set of measurements to reduce the signal redundancy and reconstruction costs [21]. Many researchers have attempted to utilize this characteristic in exploring the performance of CS in dimension reduction and feature selection. For example, Haneche et al. proposed a novel speech enhancement approach based on the CS framework in 2019 [22], while Langari et al. extracted the best subset of features for speech emotion recognition by combining with CS in 2020 [23]. Although the technique of CS is beneficial for speech recognition, a considerable challenge is determining a well-designed measurement matrix that accurately represents the corresponding specific target speech signal. For this reason, the goal of this paper is to improve the conventional framework of CS to achieve the feature selection of speech, which will lead to a higher fatigue detection rate for ATCs using a popular machine learning training network, such as a support vector machine (SVM). The rest of this paper is organized as follows. Section 2 briefly introduces the basic theory of CS. Section 3 proposes a fatigued speech detection network and describes an improved CS construction algorithm (ICSCA) in detail. Section 4 reports on the series of experiments performed to test our new method, and conclusions are drawn in Section 5. All terminology used in this paper is listed in Table 1.

Compressed Sensing
CS was proposed by Candes and Donoho, who constructed the initial theoretical framework consisting of signal sparse coding, measurement matrix construction, and a reconstruction algorithm. In brief, CS can sample the original signal at a rate much lower than the Nyquist rate and still reconstruct it using only a small proportion of the sampled data. The detailed description is shown in Figure 1.
In Figure 1, $X \in \mathbb{R}^{N}$ denotes the original signal and $Y \in \mathbb{R}^{M}$ is the final compressed signal, where $M$ is usually smaller than $N$. In addition, $\Psi \in \mathbb{R}^{N \times N}$ and $\Phi \in \mathbb{R}^{M \times N}$ indicate the sparse matrix and measurement matrix, respectively.

Sparse Coding. CS theory is based on the assumption that the signal is sparse or highly compressible; in other words, most of the signal values are either zero or small enough to be ignored. Even though the signals under consideration often do not satisfy the sparse condition, it might be possible to find a basis matrix that transforms the original signal linearly so that the coefficient vector is sparse, in which case the original signal is regarded as sparse in that basis. The formula for sparse coding is as follows:
$$X = \Psi S, \tag{1}$$
where $S \in \mathbb{R}^{N}$ represents the coefficient vector, and only $K$ of its $N$ entries are nonzero ($K \ll N$). The selection of the sparse matrix depends on the inherent characteristics of the signal. The common methods used in the sparse representation include the curvelet transform, wavelet transform, barren transform, discrete cosine transform, and discrete Fourier transform.

Selection of Measurement Matrix.
Another major problem in CS is how to choose the measurement matrix $\Phi$. For a sparse one-dimensional signal, a measurement matrix $\Phi$ is constructed to compress the original signal and obtain a measurement signal, which can be expressed as follows:
$$Y = \Phi X = \Phi \Psi S = A S, \tag{2}$$
where $A = \Phi \Psi \in \mathbb{R}^{M \times N}$ is defined as the sensing matrix. Generally, the restricted isometry property (RIP) defined in Definition 1 is the property that the sensing matrix $A$ needs to satisfy.

Definition 1.
For any $k$-sparse signal $x$ and measurement matrix $\Phi$, if there exists $\delta_k \in (0, 1)$ satisfying equation (3), then the minimum such $\delta_k$ is called the RIP constant of order $k$ of $\Phi$:
$$(1 - \delta_k)\,\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_k)\,\|x\|_2^2. \tag{3}$$
The purpose of the RIP is to ensure that the "redundant" information discarded in the process of compression measurement is controlled within an acceptable range and to prevent useful information from being discarded. The RIP has been proved to be a sufficient condition for the existence of a single feasible solution of equation (3) [24].

Reconstruction Algorithm.
The process of signal reconstruction is the reverse solution of equation (1). Since $M$ is less than $N$, the system is underdetermined, and finding its sparsest solution is an NP-hard problem for which exact solutions are difficult to obtain. The signal reconstruction process is expressed as follows:
$$\min_{S} \|S\|_0 \quad \text{s.t.} \quad Y = \Phi \Psi S, \tag{4}$$
where $\|\cdot\|_0$ denotes the number of nonzero elements. In order to reduce the computational complexity, many scholars have proposed replacing the $\ell_0$ norm with the $\ell_1$ norm in order to transform the problem from nonconvex to convex. Some other algorithms have also been proposed by researchers to solve this problem, such as orthogonal matching pursuit (OMP) [25], iterative hard thresholding [26], basis pursuit [27], and compressed sampling matching pursuit [28].
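As an illustration of the greedy reconstruction family mentioned above, the following is a minimal NumPy sketch of OMP on a synthetic sparse vector. It is not the implementation used in this paper: the sizes `N`, `M`, `K`, the Gaussian sensing matrix `A`, and the nonzero values are all assumptions chosen only to make the example self-contained.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: greedily find a k-sparse s with y ~= A @ s."""
    residual = y.copy()
    support = []
    for _ in range(k):
        # Select the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Least-squares fit restricted to the selected columns.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    s_hat = np.zeros(A.shape[1])
    s_hat[support] = coef
    return s_hat

# Hypothetical toy problem (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
N, M, K = 128, 64, 4
s = np.zeros(N)
s[rng.choice(N, K, replace=False)] = [2.0, -1.5, 1.0, 0.5]
A = rng.normal(size=(M, N)) / np.sqrt(M)   # Gaussian random sensing matrix
y = A @ s                                  # noiseless measurements
s_hat = omp(A, y, K)
```

In this noiseless setting the recovered `s_hat` should match `s` whenever OMP selects the correct support, which is the typical outcome when `M` is comfortably larger than `K`.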
In summary, applying CS requires the signal to be sparse, and a number of efficient reconstruction algorithms have been proposed as CS theory has advanced. However, how to construct an efficient sensing or measurement dictionary for a particular type of input signal remains a challenge that needs to be overcome. Therefore, below, we propose an ICSCA that is suited to fatigued speech among ATCs.

Architecture of Fatigued Speech Detection.
With the introduction of CS, a high-efficiency speech detection model based on the Civil Aviation Administration of China radiotelephony corpus is proposed. Signal preprocessing methods such as denoising, filtering, and pre-emphasis are first applied to reduce the impact of noise added during the collection process. Wavelet decomposition is then applied to the speech signal, and the detail coefficients of each signal layer are extracted. Inspired by a recently proposed nonlinear feature [29], the fractal dimension of the detail coefficients of each signal layer is calculated to extract the ATC fatigued speech features. Furthermore, an ICSCA is applied to remove the redundant information and perform the final selection of the ATC fatigued speech feature. The accuracy of fatigue detection is calculated with the help of an SVM. Figure 2 shows the detailed architecture of the proposed model.

Preprocessing and Feature Extraction
Preprocessing. The energy of the speech signal is concentrated in the low frequencies, and the high-frequency parts carry less energy. To compensate for this, signal pre-emphasis is used to boost the high-frequency part of the speech signal and thereby obtain a flatter spectrum over the entire frequency band. Pre-emphasis is generally implemented by a first-order FIR high-pass digital filter, and the original signal $x_n$ (the sample value at time $n$) is processed as follows:
$$y_n = x_n - \mu\, x_{n-1},$$
where $y_n$ is the new signal and $\mu$ is the pre-emphasis coefficient, set to 0.95. The speech signal is a time-varying, nonstationary process whose characteristic parameters change randomly over time, but over a short-term range (generally 10∼30 ms) it has relatively stable characteristics; that is, the speech signal has short-term stability. Therefore, if the speech signal is divided into short-term segments, each segment can be regarded as stable. Taking a 16 kHz sampling frequency as an example, 256 sampling points are used as one chunk, which is about 16 ms. Overlapping segmentation is usually used to ensure a smooth transition between adjacent chunks. Here, the selected stride is 64, so 192 sample points overlap between two adjacent chunks. Each chunk is then windowed to reduce the discontinuity of the signal at the beginning and end of the chunk.
This is achieved by using the Hamming window $w(n)$, and the final processed signal $y_w(n)$ is obtained as follows:
$$y_w(n) = y(n)\, w(n), \qquad w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{L - 1}\right), \quad 0 \le n \le L - 1,$$
where $L$ is the chunk length.
Journal of Healthcare Engineering
Based on this preprocessing, two typical and prevalent speech features (pH [30] and SWFF [31]) were selected to better verify our proposed methods; they are based on linear and nonlinear speech research theory, respectively. The basic signal processing of these two methods is introduced in the following sections.
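The preprocessing chain described above (pre-emphasis, overlapping segmentation, Hamming windowing) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the function names, the choice to leave the first sample unchanged, and the synthetic 200 Hz test tone are assumptions, while the coefficient 0.95, chunk length 256, and stride 64 come from the text.

```python
import numpy as np

def preemphasis(x, mu=0.95):
    """First-order FIR high-pass filter: y[n] = x[n] - mu * x[n-1]."""
    x = np.asarray(x, dtype=float)
    # Assumption: the first sample has no predecessor and is kept unchanged.
    return np.concatenate(([x[0]], x[1:] - mu * x[:-1]))

def frame_and_window(y, frame_len=256, stride=64):
    """Split y into overlapping chunks and apply a Hamming window to each."""
    y = np.asarray(y, dtype=float)
    n_frames = 1 + (len(y) - frame_len) // stride
    # Row i holds the sample indices of chunk i.
    idx = np.arange(frame_len)[None, :] + stride * np.arange(n_frames)[:, None]
    return y[idx] * np.hamming(frame_len)

# Hypothetical input: one second of a 200 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t)
frames = frame_and_window(preemphasis(x))
```

With a stride of 64 and a chunk length of 256, consecutive rows of `frames` share 192 underlying samples, which matches the overlap stated above.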

pH Vocal Source Feature.
The pH is a time-frequency feature used in speaker recognition and verification systems [30]. Research shows that this feature is closely related to the excitation source and consists of a vector of Hurst exponents [32]. The Hurst exponent ($0 < H < 1$) expresses the time correlation or scaling degree of the speech signal. Its autocorrelation coefficient function (ACF) decays gradually in the following form:
$$\rho(k) \sim H(2H - 1)\, k^{2H - 2}, \quad k \to \infty,$$
where the value of $H$ can be associated with the spectral characteristics of $\{X(i)\}_{i=1}^{N}$. The detailed extraction process is shown in Figure 3 [30].
Step 1: the discrete wavelet transform (DWT) is applied to decompose speech signals into approximation coefficients $a(l, k)$ and detail coefficients $d(l, k)$, where $l$ is the decomposition scale ($l = 1, 2, \ldots, J$) and $k$ is the coefficient index of each scale.
Step 2: for each scale l, variance

Speech Wavelet Fractal Feature (SWFF).
The theories of fractal dimension (FD) and wavelet decomposition are applied in extracting the SWFF feature. A fractal is a complex system whose complexity can be described by a noninteger dimension called the fractal dimension (FD). It can be defined from data and calculated approximately and experimentally. It is related to $H$ [33] and is defined as follows:
$$D = \lim_{\varepsilon \to 0} \frac{\ln N(\varepsilon)}{\ln (1/\varepsilon)},$$
where $D$ represents the fractal dimension, $\varepsilon$ is the side length of a small cube, and $N(\varepsilon)$ is the number of small cubes needed to cover the measured geometry.
In the process of wavelet decomposition, inspired by [31], the Daubechies wavelet was chosen as the wavelet basis function because it is highly consistent with our requirements. The frequency distribution of speech signals on each scale after wavelet decomposition is shown in Figure 4, where the high-frequency coefficient is the detail coefficient. The detailed calculation of the FD is as follows:
Step 1: a time series $\{X(i)\}_{i=1}^{N}$ with length $N$ is set up. For a delay $k$, $k$ new time series $X_m^k$ ($m = 1, 2, \ldots, k$) are obtained by reconstructing the time series with a delay method:
$$X_m^k = \left\{ X(m),\, X(m + k),\, X(m + 2k),\, \ldots,\, X\!\left(m + \left\lfloor \tfrac{N - m}{k} \right\rfloor k\right) \right\}.$$
Step 2: the curve length $L_m(k)$ of each $X_m^k$ can be calculated using the following formula:
$$L_m(k) = \frac{1}{k} \left[ \left( \sum_{i=1}^{\lfloor (N - m)/k \rfloor} \left| X(m + ik) - X(m + (i - 1)k) \right| \right) \frac{N - 1}{\lfloor (N - m)/k \rfloor\, k} \right].$$
Step 3: the length of the total sequence, $L(k)$, can be approximated as the average of the curve lengths $L_m(k)$ generated by the $k$ delays. For different values of $k$, a set of curve data relating $k$ and $L(k)$ can be obtained, and the FD is the slope of the line fitted to $\ln L(k)$ against $\ln(1/k)$.
In the end, the detailed SWFF feature can be obtained from the following formula: where FD refers to the FD calculation method and $k_{\max}$ is set as 10. $D(d_i)$ represents the FD of the detail coefficients of the $i$th layer.
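The three FD steps above correspond to what is commonly called the Higuchi method, and they can be sketched as below. This is a minimal illustration under that assumption, not the authors' implementation; the function name `higuchi_fd` and the two test signals are hypothetical, while `k_max = 10` follows the text. In the full pipeline the input would be the detail coefficients of one wavelet layer rather than a raw synthetic series.

```python
import numpy as np

def higuchi_fd(x, k_max=10):
    """Estimate the fractal dimension of a 1-D series from curve lengths L(k)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    ks = np.arange(1, k_max + 1)
    mean_lengths = []
    for k in ks:
        lengths = []
        for m in range(k):                      # 0-based start offset
            idx = np.arange(m, n, k)            # delayed sub-series X_m^k
            n_seg = idx.size - 1                # number of increments
            if n_seg < 1:
                continue
            dist = np.abs(np.diff(x[idx])).sum()
            # Normalised curve length L_m(k).
            lengths.append(dist * (n - 1) / (n_seg * k) / k)
        mean_lengths.append(np.mean(lengths))   # L(k): average over offsets
    # FD is the slope of ln L(k) against ln(1/k).
    slope, _ = np.polyfit(np.log(1.0 / ks), np.log(mean_lengths), 1)
    return slope

# Hypothetical sanity checks: a straight line and white noise.
fd_line = higuchi_fd(np.arange(500.0))
fd_noise = higuchi_fd(np.random.default_rng(0).normal(size=4000))
```

A straight line gives an estimate of 1, and white noise gives an estimate near 2, which brackets the range expected for speech detail coefficients.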

Improved CS Construction Algorithm. The sensing dictionary and measurement matrix are constructed based on the modified t-mean index. The inner product of $\phi_i$ and $\varepsilon_i$ is made equal to 1, such as in equation (6), and the t-mean coherence coefficient is defined as
$$\mu_t(\Phi, \Psi) = \frac{\sum_{i \neq j,\ |G(i,j)| \ge t} |G(i, j)|}{\#\{(i, j) : i \neq j,\ |G(i, j)| \ge t\}},$$
where $G(i, j)$ represents the element in row $i$ and column $j$ of the Gram matrix. In other words, the t-mean coherence coefficient is the average of the absolute values of all nondiagonal elements of the Gram matrix whose absolute values exceed a certain threshold $t$. A greedy algorithm is then used to make the Gram matrix closer to the ideal Gram matrix. Specifically, the nondiagonal elements are gradually reduced to near 0. Finally, $\Phi$ and $\Psi$ can be constructed when $\mu_t(\Phi, \Psi)$ satisfies the threshold. The above process can be described as follows:
$$\arg\min_{\Phi, \Psi} \left\| G' - I \right\|_F^2 .$$
The threshold can be set to $t > 0$ to reduce the number of iterations because matrix $G'$ cannot be completely iterated into $I$, and the nondiagonal elements in $G'$ cannot be made equal to zero. It has been proved that the minimum value of the nondiagonal elements in the ETF (equiangular tight frame) matrix is
$$t_E = \sqrt{\frac{N - M}{M (N - 1)}}. \tag{16}$$
The construction process and characteristics of $G'$ are very similar to those of the ETF matrix. In this case, equation (12) can be modified as
$$\arg\min_{\Phi, \Psi} \left\| G' - H \right\|_F^2 ,$$
where $H \in \mathbb{R}^{N \times N}$, the diagonal elements of $H$ are equal to 1, and the nondiagonal elements are equal to $t_E \cdot \operatorname{sign}(G'(i, j))$. Solving equation (14) yields the measurement matrix and sensing dictionary. Equation (14) can be decomposed into the following two problems that are solved iteratively:
Problem (1): $\Phi = \arg\min_{\Phi} \left\| \Phi^T \Phi - H \right\|_F^2$,
Problem (2): $\Psi = \arg\min_{\Psi} \left\| \Psi^T \Phi - H \right\|_F^2$.
Evaluation and performance assessment are carried out iteratively by using OMP and equation (11). If the difference between the results of successive iterations is less than the threshold or the number of iterations exceeds the set maximum number of iterations, the algorithm is terminated. The gradient method is used to solve Problem (1). Reducing the values of the nondiagonal elements of the matrix reduces the coherence between different columns.
The optimization process is described as follows:
Step 1: define the cost function as $C = \left\| \Phi^T \Phi - H \right\|_F^2$.
Step 2: calculate the gradient of the cost function:
$$\frac{\partial C}{\partial \Phi} = 2\Phi \left( \Phi^T \Phi - H \right) + 2\Phi \left( \Phi^T \Phi - H \right)^T .$$
Because $\Phi^T \Phi - H$ is symmetric, this simplifies to
$$\frac{\partial C}{\partial \Phi} = 4\Phi \left( \Phi^T \Phi - H \right).$$
Step 3: the complete iteration equation is
$$\Phi_{k+1} = \Phi_k - \beta \left. \frac{\partial C}{\partial \Phi} \right|_{\Phi = \Phi_k},$$
where $k$ is the number of iterations and $\beta$ is the step size, which is set as 0.001.
Step 4: use OMP to evaluate the t-mean coherence coefficient and check whether the difference between the results of two successive iterations is less than the threshold.
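The gradient iteration of Steps 1 to 3 can be sketched in NumPy as shown below, together with helper functions for the t-mean coherence coefficient and the ETF bound. This is a simplified illustration and not the paper's ICSCA: the matrix sizes, the random initialisation, the fixed target `H` built once from the initial Gram matrix, the threshold `t = 0.2`, and the fixed count of 50 iterations (in place of the OMP-based stopping rule of Step 4) are all assumptions.

```python
import numpy as np

def t_averaged_coherence(G, t):
    """Mean absolute value of the off-diagonal Gram entries with |G(i,j)| >= t."""
    off = np.abs(G[~np.eye(G.shape[0], dtype=bool)])
    selected = off[off >= t]
    return float(selected.mean()) if selected.size else 0.0

def etf_bound(M, N):
    """Smallest achievable off-diagonal magnitude for N unit atoms in R^M."""
    return np.sqrt((N - M) / (M * (N - 1)))

def target_gram(G, t_e):
    """Target H: ones on the diagonal, t_e * sign(G) elsewhere."""
    H = t_e * np.sign(G)
    np.fill_diagonal(H, 1.0)
    return H

def cost(Phi, H):
    return np.linalg.norm(Phi.T @ Phi - H, "fro") ** 2

def gradient_step(Phi, H, beta=0.001):
    # Gradient of ||Phi^T Phi - H||_F^2 for symmetric H.
    return Phi - beta * 4.0 * Phi @ (Phi.T @ Phi - H)

# Hypothetical small problem (sizes chosen only for illustration).
rng = np.random.default_rng(0)
M, N = 16, 64
Phi = rng.normal(size=(M, N))
Phi /= np.linalg.norm(Phi, axis=0)            # column-normalise
H = target_gram(Phi.T @ Phi, etf_bound(M, N))
mu_before = t_averaged_coherence(Phi.T @ Phi, t=0.2)
costs = [cost(Phi, H)]
for _ in range(50):
    Phi = gradient_step(Phi, H)
    costs.append(cost(Phi, H))
mu_after = t_averaged_coherence(Phi.T @ Phi, t=0.2)
```

Tracking `costs` alongside `mu_before` and `mu_after` gives a quick check that the step size 0.001 is small enough for the cost to fall at every iteration on a problem of this size.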
Two points need to be considered when solving Problem (2): (i) ensuring the correlation between the sensing dictionary and measurement matrix throughout the process and (ii) ensuring the consistency between $\Psi$ and $\Phi$, where $\mu_t(\Psi, \Phi)$ should be as small as possible. To address these difficulties, we propose the following method.
Matrix $G' = \Psi^T \Phi$ is first constructed. Then, a tightening operator is used to shrink the nondiagonal elements of the matrix so that it gradually approaches $H$. Finally, a pair consisting of a sensing dictionary and a measurement matrix can be obtained by singular value decomposition (SVD). The value range of the nondiagonal elements of the matrix is $[-1, 1]$ because matrix $\Psi$ and matrix $\Phi$ are initially column normalized. Applying the tightening operator further narrows this range to $[-c, c]$, where $c < 1$. A simple and easy-to-implement operator is proposed for mapping from $[-1, 1]$ to $[-c, c]$. This tightening operator can adjust the range of the nondiagonal elements of matrix $G'$ across iterations with only one parameter, $c$, which is set as 0.4.
Applying the SVD yields a diagonal matrix $V$ whose diagonal elements are nonnegative and arranged in descending order from the upper-left corner to the lower-right corner. In order to be closer to $H$, only the $M$ largest elements of $V$ are retained, giving $V_M$, from which the pair of matrices is constructed. At the same time, the atoms are rescaled to ensure that the inner product of corresponding atoms is 1. In summary, we construct a pair consisting of a sensing dictionary $\Psi$ and a measurement matrix $\Phi$ with weak cross correlation.

SVM Settings.
An SVM is a classification model whose mathematical strategy involves maximizing the margin between different classes of data. Therefore, an SVM can be formalized as a convex quadratic programming problem. Here, a WLS-SVM (weighted-least-squares SVM) [34] is used for the classification process. In its formulation, $A_{ij}^{t}$ represents the membership grade, $t = 1, 2, \ldots, n$. The WLS-SVM utilizes fuzzy c-means clustering to decide the rule number, which is based on the following objective:
$$J_m(U, Z) = \sum_{i} \sum_{j} \mu_{ij}^{m} \left\| x_j - z_i \right\|^2 ,$$
where $m \in (1, \infty)$ denotes a fuzzy exponent, $\mu_{ij}$ ($\mu_{ij} \in U$) is the degree to which $x_j$ belongs to the $i$th rule, and $z_i$ is the $i$th cluster center. The advantage of a WLS-SVM is that general errors, including noise in the input and output variables, are considered as empirical errors. Furthermore, in terms of kernel selection, we use the Gaussian radial basis function (RBF) due to its superior robustness to noise in data. The RBF kernel in this research is the same as the activation function used by Mu et al. [35]. The mathematical model of the kernel function is as follows:
$$K(x_i, x_j) = \exp\!\left( -\gamma \left\| x_i - x_j \right\|^2 \right),$$
where $\gamma$ is the parameter of the kernel function.

Experimental Results
Experimental results were obtained on a Windows 10 personal computer equipped with a 64 bit Intel Core i5-9300H CPU running at 2.4 GHz and with 8 GB of RAM. All of the proposed methods were implemented using Python (version 3.7) and TensorFlow (version 1.14.0) software.

Datasets and Parameters.
A fatigued speech dataset [31] consisting of 1606 speech samples from ATC radiotelephony was used in the experiment, as summarized in Table 2. Because the dataset contains fewer fatigued speech samples than normal speech samples, we selected 824 speech samples from it (412 fatigued speech samples and 412 normal speech samples) to keep the classes balanced and the experimental results reliable. The SWFF was then extracted as the original signal feature. The dimension of the SWFF was 256, and after CS-based sampling, we set the final feature dimension to 32.
For the SVM configuration, the 824 speech samples were divided evenly into $K = 6$ groups. Each subset in turn was used as a verification set, with the remaining subsets used as the training set, so that $K$ models could be obtained. The average classification accuracy on the verification sets of these $K$ models was used as the performance index of the classifier under this K-fold cross-validation (K-CV). The penalty factor was set to $c = 9.7656 \times 10^{-4}$, and the gamma parameter was $\gamma = 0.5$.
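The cross-validation split and kernel evaluation described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the exponential form $\exp(-\gamma \|x - x'\|^2)$ is assumed for the RBF kernel, the shuffling seed is arbitrary, and random vectors stand in for the 32-dimensional compressed features. The classifier training itself is not shown.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Gaussian RBF kernel matrix, assuming K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negative round-off

def k_fold_indices(n_samples, k=6, seed=0):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# 824 samples split into K = 6 groups, as in the experimental setup.
splits = list(k_fold_indices(824, k=6))

# Hypothetical stand-in for five 32-dimensional compressed feature vectors.
X_demo = np.random.default_rng(1).normal(size=(5, 32))
K_demo = rbf_kernel(X_demo, X_demo, gamma=0.5)
```

Because 824 is not divisible by 6, two folds hold 138 samples and four hold 137; the reported accuracy would then be the mean over the six validation folds.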

Results and Analysis.
In this section, the experiments were conducted using two types of prevalent fatigue features (pH and SWFF). The sparse autoencoder (SAE) [36] was also used in place of the SVM classifier for comparison. Furthermore, the Gaussian random matrix and the uncompressed sample were selected for comparison with the ICSCA. The fatigue state detection results obtained by using these two different measurement matrix construction algorithms for feature sampling are shown in Figures 5-7 and Table 3.
Overall, it was clear that the SWFF feature achieved better detection performance than the pH feature under the same classification method. Regarding the different classifiers, the SAE method consumed less time, but its average accuracy was far lower than that of the SVM.
In terms of the different measurement matrices, compared with the detection results without feature sampling, the accuracy of ATC fatigue state detection with Gaussian random matrix feature sampling was reduced by about 2%, while the detection results with the proposed ICSCA improved to 85.11% (pH) and 94.25% (SWFF), respectively. Finally, it can be seen that the proposed ICSCA method also has the fastest operation time, 1.37 minutes (pH) and 1.21 minutes (SWFF), and is reported with the highest accuracy rate of 97.11%, compared with 93.10% for DDL, 60.36% for pH, and 71.39% for SWFF. These findings demonstrate that the ICSCA proposed in this study improves both detection accuracy and operation time.

Conclusions
In order to detect the fatigue condition of ATCs quantitatively and quickly, we proposed a CS-based framework for detecting fatigue from the speech of ATCs. An improved compressed sensing construction algorithm was proposed to decrease the reconstruction error and achieve superior sparse coding, and it was applied to fatigued speech feature selection so that redundant information in the original feature vector is removed. Finally, pH and SWFF speech features were applied in a series of comparative experiments using the Civil Aviation Administration of China radiotelephony corpus to demonstrate that the proposed method provides a significant improvement in the precision of fatigue detection compared with current state-of-the-art approaches.

Data Availability
The radiotelephony corpus data sampled from the Air Traffic Management Bureau, Civil Aviation Administration of China, used to support the findings of this study, are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.