A Machine Learning-Based Big EEG Data Artifact Detection and Wavelet-Based Removal: An Empirical Approach

IIIT-Bhopal, M.P., Bhopal, India
Department of Electronics and Communication, GGITS, Jabalpur 482002, M.P., India
Department of Computer Science and Engineering, KL University, Vijayawada, Andhra Pradesh, India
Department of Computer Science, College of Computers and Information Technology, Taif University, Taif 21944, Saudi Arabia
Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka 1229, Bangladesh
Computer Science & Engineering Department, University Institute of Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya (Technological University of Madhya Pradesh), Bhopal 462023, India
Department of Computer Science and Engineering, Radharaman Engineering College, Bhopal, M.P., India


Introduction
Effective diagnosis and analysis of neurological diseases are possible when a vital neurological signal is acquired from the patient. However, even in a highly favourable recording environment, this signal is contaminated by nonphysiological signals (artifacts). The most vital neurophysiological signal is the electroencephalogram (EEG), which represents the electrical activity of the human brain. Therefore, mitigation of these artifacts from EEG signals is an active research topic [1].
Experimentation is done on synthesized data which is detailed in [21].
The ground-truth (GT) EEG data are taken from online open resources [22].

Synthesized Artifact Signal Generation.
A more accurate artifact removal method is developed in this research work and applied to remove motion artifacts, as this artifact is the most recurrent and distressing component in the EEG data.
However, the GT signal alone cannot be used to assess the effectiveness of the artifact elimination procedure. The simulated data consist of GT EEG data and an artifact template (which can be manually produced or separately recorded). The simulated or synthesized data are created by adding this artifact template to the GT signal. Thus, the simulated data are more useful because, after artifact removal, the signal can be compared with the pure signal and the effectiveness of the artifact elimination algorithm can be checked. These motion artifacts are generated synthetically by simulation. They are created by adding random noise sequences to the amplitude-modulated EEG signal so that they are clearly visible in the EEG signal as artifacts.
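The contamination step described above (GT signal plus artifact template) can be sketched as follows. This is a minimal illustration, not the authors' generator: the sinusoidal burst shape, the sampling rate `fs`, the helper names `make_artifact_template` and `contaminate`, and the Gaussian stand-in for the GT signal are all assumptions made for the example.

```python
import math
import random


def make_artifact_template(n_samples, start, duration, amplitude, freq_hz, fs):
    """Return a zero-valued signal with one sinusoidal burst in [start, start + duration)."""
    template = [0.0] * n_samples
    stop = min(start + duration, n_samples)
    for n in range(start, stop):
        template[n] = amplitude * math.sin(2.0 * math.pi * freq_hz * (n - start) / fs)
    return template


def contaminate(gt, template):
    """Superimpose the artifact template on the ground-truth signal, sample by sample."""
    return [g + a for g, a in zip(gt, template)]


rng = random.Random(0)
fs = 256.0  # assumed sampling rate in Hz (not stated in the text)
gt = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # hypothetical stand-in for a GT EEG segment
template = make_artifact_template(len(gt), start=300, duration=200,
                                  amplitude=5.0, freq_hz=2.0, fs=fs)
contaminated = contaminate(gt, template)
```

Because the template is known exactly, the cleaned output of any removal method can be compared sample by sample against `gt`.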
These motion artifacts severely degrade the EEG data quality.
Therefore, effective suppression is highly recommended before any diagnosis or analysis of neurological disorders. Various artifact removal algorithms have been applied to suppress these artifacts in state-of-the-art research. In the recommended work, the most suitable and efficient of these algorithms are applied to mitigate the artifacts effectively.
In this recommended work, ensemble empirical mode decomposition (EEMD) [9], blind source separation (BSS) [13], and wavelet transform (WT) [16] are applied in cascade for effective elimination of these motion artifacts. The wavelet transform is the more effective tool for handling the randomness and unpredictability of motion artifacts [7]. Finally, the results are optimized by using the Harris hawks optimization (HHO) algorithm.
A detailed architecture of the recommended methodology based on these algorithms is discussed in Section 2.

Recommended System Model
Figure 1 shows a schematic representation of the system model for the recommended algorithm.
In this recommended work, the synthesized artifactual signal is first preprocessed to eliminate line and external noise [23]. These signals are then decomposed through the EEMD approach. The decomposition is done for both pure EEG data and artifact-contaminated EEG data to generate intrinsic mode functions (IMFs). The generated IMFs are passed to the support vector machine (SVM) classifier for training. Subsequently, this classifier is used for the detection of motion artifacts in the EEG. Once artifacts are identified, the generated IMFs are passed to a cascaded approach of the CCA and SWT algorithms for cleaning [24].
This cascaded algorithm takes somewhat more time to execute. However, in medical diagnosis, this additional time is acceptable.
This approach is applied in an automated system where artifacts are automatically identified and removed with the most efficient algorithm to attain a clean signal [25].

Recommended Algorithm.
The main goal of the recommended procedure is to remove the motion artifacts while preserving the neural information in the EEG signal. The recommended algorithm is divided into five stages. The details of the stages are given below [26]:
(i) Preprocessing: first, EEG data available from an online open-source interface [22] are preprocessed through a third-order Butterworth filter to remove baseline wander, with a passband from 0.5 Hz to 99 Hz.
(ii) Synthetic artifact generation: the preprocessed signals of 120 datasets are considered as the ground-truth signal. Each dataset contains 10000 samples. Synthesized data are then prepared by creating artifact templates of sinusoids, simulating these templates with different amplitudes, durations, and locations, and, finally, superimposing them onto the ground-truth signal. This artifact-contaminated EEG signal set also has 120 datasets with 10000 samples in each dataset [27].
(iii) Motion artifact detection: the motion artifact detection from single-channel artifact-contaminated EEG data is carried out in two stages.

Motion Artifact Detection Using SVM.
The generated IMFs are used to obtain statistical features such as kurtosis, mean, skewness, and variance, as discussed in [29].
These features are applied to a support vector machine (SVM) to detect motion artifacts in EEG data.
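A minimal sketch of the feature extraction step is given here, assuming each IMF is available as a plain list of samples. The helper names `statistical_features` and `feature_vector` are hypothetical, and the use of population (biased) moments and non-excess kurtosis is an assumption, since the text does not state which estimator is used.

```python
import math


def statistical_features(imf):
    """Mean, variance, skewness, and kurtosis of one IMF (population moments)."""
    n = len(imf)
    mean = sum(imf) / n
    centered = [v - mean for v in imf]
    variance = sum(c * c for c in centered) / n
    if variance == 0.0:
        # A constant IMF has no defined skewness or kurtosis; report zeros by convention.
        return {"mean": mean, "variance": 0.0, "skewness": 0.0, "kurtosis": 0.0}
    std = math.sqrt(variance)
    skewness = sum(c ** 3 for c in centered) / (n * std ** 3)
    kurtosis = sum(c ** 4 for c in centered) / (n * variance ** 2)  # non-excess kurtosis
    return {"mean": mean, "variance": variance, "skewness": skewness, "kurtosis": kurtosis}


def feature_vector(imfs):
    """Concatenate the four features of every IMF into one vector for a classifier."""
    vector = []
    for imf in imfs:
        f = statistical_features(imf)
        vector.extend([f["mean"], f["variance"], f["skewness"], f["kurtosis"]])
    return vector


example = statistical_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

One such vector per signal, labelled as pure or artifactual, would form the training set for the classifier.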

Motion Artifact Removal.
Once an artifact is detected in the EEG data, the IMFs of the artifactual EEG data are applied to a cascaded CCA-SWT approach for effective mitigation of the EEG motion artifact. The correlation components (CCs) generated by the CCA algorithm are created based on autocorrelation and mutual uncorrelatedness. Statistically, uncorrelated components have distinct properties. The thresholded outputs are reconstructed through the inverse wavelet transform to obtain CCs free of external noise.

Reconstruction.
Subsequently, the CCs containing motion artifact sources are identified by using Pearson's correlation coefficient as the threshold. The CCs with a value below the threshold are rejected. The artifact-free EEG signal is obtained by reconstructing the EEG signal from the remaining CCs.

Optimization.
The irregularity introduced into the EEG signal by artifacts is removed by using this methodology, and the result is finally optimized by the Harris hawks optimization (HHO) algorithm.

Experiments
In this experiment, a synthesized EEG signal is generated and processed with the recommended cascaded algorithm to remove the degrading effect of motion artifacts on the EEG data. Furthermore, the results of the recommended algorithm have been compared with existing methodologies whose results are available in the state of the art [30]. The data samples are taken from an online open-source interface [22]. The MATLAB code used for EEMD is free to download from [31], and the rest of the required functions are used directly from the MATLAB toolbox. The ground-truth EEG signal and the synthesized motion-artifact-contaminated EEG signal are shown in Figure 2. This simulated EEG signal is created by adding randomly simulated sinusoids to the original EEG signal at different locations with different amplitudes. Figure 2 shows the change in behaviour due to the simulated artifact. This synthesized EEG signal is decomposed by using the EEMD algorithm. The intrinsic mode functions (IMFs) generated by EEMD are presented in Figure 3. The 14 IMFs extracted by EEMD are used to calculate statistical moment-based features such as mean, variance, kurtosis, and skewness. These statistical features are applied to an SVM, a machine learning algorithm, for artifact detection. Artifact detection using neural networks has been applied by the authors of [19, 20].
SVM is a supervised learning classification algorithm and is used here for motion artifact detection from nonlinear EEG data. The SVM has different kernels, which enable nonlinear classification [32]. The features are initially extracted from EEMD-generated IMFs for both pure and artifactual EEG data. These features are used as the training set for the machine learning classifier. Furthermore, a test set is created by using EEMD on artifact-contaminated EEG data and pure EEG data. The test sets are passed through the trained classifier for artifact detection. The SVM with a radial basis function (RBF) kernel attains satisfactory accuracy for motion artifact detection from contaminated EEG data. The artifact detection results are presented in Table 1 as a confusion matrix. The confusion matrix shows that only 6 test sets containing artifact data were incorrectly classified as pure EEG data. Moreover, pure EEG data were misclassified as artifact data in merely 2 instances. The SVM classifier with the RBF kernel attains an accuracy of 98.3% for motion artifact detection from contaminated EEG signals.
Once motion artifacts are detected in the input EEG signal, artifact suppression is executed through a cascaded approach. The IMFs extracted by EEMD from the artifactual EEG data are applied to BSS-CCA [13]. The components separated by the combined application of the EEMD and CCA algorithms are presented in Figure 4. Each CC corresponds to a different source.
The CCs generated by the EEMD-CCA cascaded approach are further processed with the stationary wavelet transform (SWT) for improved artifact mitigation. SWT is preferred because it suppresses the artifact while maintaining the neural information of the EEG signal [33].
In real-time applications, the occurrence of artifacts in the recorded EEG data is not known in advance. The recommended algorithm is effective in this setting and can be implemented in practice. First, the motion artifact is detected through a classification algorithm. Once the artifact is detected, the disturbance it creates in the EEG signal is handled with the improved artifact removal algorithm (CCA-SWT). As CCA is a simple source separation approach, SWT is preferred as an effective artifact elimination algorithm due to its shift-invariance property [34]. A comparison of the synthesized EEG data and the reconstructed EEG data (after artifact elimination) is shown in Figure 5, where the artifactual EEG signal is plotted in blue and the motion-artifact-suppressed EEG signal in red. Visual inspection shows that the EEG signal is contaminated with large, random motion artifacts. The recommended method greatly reduces these artifacts while preserving the neural information initially present in the pure EEG signal. The recommended algorithm is tested on different EEG datasets to check the validity of the procedure in real-time application. The recommended algorithm effectively detects the motion-artifactual EEG dataset, as shown in the red box in Figure 6. In addition, EOG artifacts are reduced significantly, although, in this work, the SVM classifier is trained for motion artifacts only. The detection ability of the classifier will be improved in future work. Thus, the motion artifacts are removed by the recommended method (EEMD-CCA-SWT), while the peak amplitude variations, which carry the required information in the signal, are preserved. Figure 6 therefore shows that the recommended method preserves the meaningful information even after artifact removal. A statistical analysis of the recommended method against existing methods is given in Section 4.

Performance Assessment Factors
Some important performance evaluation parameters for assessment of the recommended algorithm are as follows.

Difference in Signal-to-Noise Ratio (∆SNR).
The ∆SNR is calculated as the change in the SNR of the signal before and after artifact removal [6].
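As a hedged illustration of this metric, the sketch below assumes one common formulation: the SNR is the ratio of the variance of the ground-truth signal to the variance of the residual error, in decibels, and ∆SNR is the SNR after removal minus the SNR before removal. The exact definition used in [6] may differ, and the function names and toy signals are hypothetical.

```python
import math


def variance(x):
    """Population variance of a list of samples."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)


def snr_db(reference, observed):
    """SNR in dB, treating (observed - reference) as the noise. Assumes a nonzero error."""
    error = [o - r for o, r in zip(observed, reference)]
    return 10.0 * math.log10(variance(reference) / variance(error))


def delta_snr(reference, contaminated, cleaned):
    """Change in SNR (dB) from before to after artifact removal."""
    return snr_db(reference, cleaned) - snr_db(reference, contaminated)


# Toy signals: the residual error shrinks by a factor of 10 in amplitude after cleaning.
reference = [0.0, 1.0, 0.0, -1.0] * 25
error_before = [1.0, -1.0] * 50
error_after = [0.1, -0.1] * 50
contaminated = [r + e for r, e in zip(reference, error_before)]
cleaned = [r + e for r, e in zip(reference, error_after)]
gain = delta_snr(reference, contaminated, cleaned)  # about 20 dB
```

A tenfold drop in error amplitude is a hundredfold drop in error power, which is why the toy example yields roughly 20 dB.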

Lambda.
This is a difference in correlation between signals, denoted by λ, which shows the percentage reduction in artifacts [6].

Power Spectral Density (PSD) Improvement.
PSD improvement is calculated by finding the change between PSD of the artifactual and artifact-free data [6].

Correlation Improvement.
The difference in correlation between the synthesized and original signals is used as the performance measure.
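One plausible reading of this measure, assumed here for illustration only, is the gain in Pearson correlation with the ground-truth signal obtained by cleaning. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_improvement(reference, contaminated, cleaned):
    """Correlation with the reference after cleaning minus the correlation before cleaning."""
    return pearson(reference, cleaned) - pearson(reference, contaminated)


reference = [1.0, 2.0, 3.0, 4.0, 5.0]
improvement = correlation_improvement(reference,
                                      contaminated=[5.0, 4.0, 3.0, 2.0, 1.0],
                                      cleaned=[2.0, 4.0, 6.0, 8.0, 10.0])
```

In this extreme toy case the correlation moves from -1 to +1, the largest possible improvement.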

Spectral Distortion (P_dis).
The spectral distortion P_dis is given by the ratio of the PSD of the reconstructed signal, PSD_recon(f), to the PSD of the reference EEG signal, PSD_ref(f) [10].
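The ratio can be sketched as follows. Two assumptions are made that the text does not confirm: the PSDs are summed over all frequency bins before taking the ratio, and a plain DFT periodogram stands in for whatever PSD estimator the authors used. The function names are hypothetical.

```python
import cmath


def periodogram(x):
    """One-sided periodogram via a direct DFT (O(n^2); fine for short illustrative signals)."""
    n = len(x)
    psd = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        psd.append(abs(acc) ** 2 / n)
    return psd


def spectral_distortion(reference, reconstructed):
    """Ratio of total reconstructed PSD to total reference PSD; 1.0 means no distortion."""
    return sum(periodogram(reconstructed)) / sum(periodogram(reference))


reference = [0.0, 1.0, 0.0, -1.0] * 8
doubled = [2.0 * v for v in reference]
p_dis_identity = spectral_distortion(reference, reference)
p_dis_doubled = spectral_distortion(reference, doubled)
```

Under this reading, values close to 1 indicate that the reconstructed spectrum matches the reference; doubling the amplitude quadruples the power and therefore the ratio.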

Coherence Improvement (ΔCoh).
ΔCoh measures the phase consistency between the noisy and ground-truth signals. The percentage coherence improvement, ζ, is defined in terms of two quantities: the coherence between the reference and artifactual signals and the coherence between the reference and reconstructed signals [35]. Thus, a higher value of ζ indicates superior artifact removal.

Information Transfer Rate (ITR).
Brain-computer interfaces (BCIs) use the information transfer rate (ITR), measured in bits per trial, as an evaluation metric. One mental-calculation task and two motor-imagery tasks were performed by two subjects. The tasks included left hand, right hand, foot, and tongue. Hidden Markov models are used to classify the electroencephalogram (EEG) patterns. Information transfer rates are reported for BCI systems with two subsets. The information transfer rates vary widely, ranging from 0.46 bits per trial to 0.82 bits per trial.

Results and Discussion
The simulation is performed on an available online dataset [22] for statistical evaluation. The synthetic artifacts are added to the reference data at random locations and with random durations (ranging from 150 µs to 1 s). The analysis is based on artifact removal and signal distortion. The quantitative evaluations of some important metrics are shown in Table 2. These evaluations are done for synthesized EEG signals generated with different SNRs. Moreover, these results are compared with the existing artifact removal methodology EEMD-CCA [6].
From Table 2, it is evident that the recommended method performs better than the existing method [6], with an improved ΔSNR, which indicates improved signal quality after artifact removal. Moreover, the higher lambda, correlation, and PSD values show improved artifact removal relative to the existing approach. Additionally, the RMSE [36] values are reduced significantly with the recommended artifact removal method. The reduction in the RMSE value indicates effective artifact mitigation from the EEG signal [37]. The improvement in the coherence values after the recommended artifact removal demonstrates the efficacy of the approach. Figure 7 shows the plot of RMSE against different artifact SNRs for the EEMD-CCA [6] and EEMD-CCA-SWT artifact removal methods [38]. The recommended artifact removal method has the lowest RMSE value, which indicates significant motion artifact removal. The recommended method performs much better at high artifact SNR. Figure 8 shows the behaviour of spectral distortion for the recommended method compared with EEMD-CCA. The result shows that the restored signal PSD comes close to the reference signal PSD at high artifact SNR. Figures 9 and 10 present the extent of artifact elimination by plotting the ΔSNR and lambda parameters for different SNRs. It can be concluded that both parameters are improved [39] with respect to the existing methods. The results are further improved by using a recent optimization algorithm, which is discussed in the subsequent section.
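For reference, the RMSE used in this comparison can be computed as in the following sketch, which assumes the usual definition (root of the mean squared sample-wise difference between the ground-truth and reconstructed signals). The function name and toy values are illustrative.

```python
import math


def rmse(reference, estimate):
    """Root-mean-square error between a reference signal and its estimate."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


reference = [0.0, 0.0, 0.0, 0.0]
error_unit = rmse(reference, [1.0, -1.0, 1.0, -1.0])
error_zero = rmse(reference, reference)
```

A lower RMSE means the reconstructed signal is closer to the ground truth, which is why the metric is only meaningful on synthesized data where the ground truth is known.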

Harris Hawks Optimization (HHO)
Heidari et al. [40] introduced an innovative population-based, gradient-free optimization method in 2019 [41]. HHO simulates the Harris hawks' actions of predation, surprise pounce, and attack. In addition, like other metaheuristic algorithms, HHO contains two optimization stages, namely, exploration and exploitation (see Figure 11), which are described in the following sections.

Exploration Stage.
Harris hawks have keen eyes that track and spot prey, but sometimes the prey is not easy to locate. The hawks then perch and wait patiently, sometimes for hours. In HHO, these actions are modelled [41] in the exploration stage as follows:

$$P_i^{t+1} = \begin{cases} P_{rand}^{t} - N_1 \left| P_{rand}^{t} - 2 N_2 P_i^{t} \right|, & x \ge 0.5, \\ \left( P_{rabbit} - P_m^{t} \right) - N_3 \left( LB + N_4 (UB - LB) \right), & x < 0.5, \end{cases}$$

where $P_i^{t+1}$ denotes the location of the $i$th individual in the $(t+1)$th iteration, $P_{rand}^{t}$ is a randomly selected individual of the current population, $P_{rabbit}$ denotes the position of the rabbit (the prey), $x$ is a random number in the range [0, 1], $N_1$, $N_2$, $N_3$, and $N_4$ are also random numbers in [0, 1], $LB$ is the lower bound of the given optimization problem, $UB$ is the upper bound of the given optimization problem, and $P_m^{t}$ denotes the mean position of the population, which is calculated as follows:

$$P_m^{t} = \frac{1}{M} \sum_{s=1}^{M} P_s^{t},$$

where $M$ is the size of the population and $P_s^{t}$ is the location of the $s$th individual in the $t$th iteration.
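A single position update of this stage can be sketched as follows, assuming the exploration rule of the original HHO formulation: with a random draw x of at least 0.5 the hawk moves relative to a randomly chosen individual, otherwise relative to the rabbit and the mean population position. Variable names follow the symbols in the text; the function name, the clamping to [LB, UB], and the toy population are assumptions for the example.

```python
import random


def exploration_step(population, i, rabbit, lb, ub, rng):
    """Return a new position for individual i under the assumed HHO exploration rule."""
    dim = len(rabbit)
    x = rng.random()
    n1, n2, n3, n4 = (rng.random() for _ in range(4))
    current = population[i]
    if x >= 0.5:
        # Perch relative to a randomly selected member of the population.
        rand_ind = population[rng.randrange(len(population))]
        new = [rand_ind[d] - n1 * abs(rand_ind[d] - 2.0 * n2 * current[d])
               for d in range(dim)]
    else:
        # Perch relative to the rabbit and the mean position of the population.
        mean_pos = [sum(ind[d] for ind in population) / len(population)
                    for d in range(dim)]
        new = [(rabbit[d] - mean_pos[d]) - n3 * (lb[d] + n4 * (ub[d] - lb[d]))
               for d in range(dim)]
    # Clamping to the search bounds is a common implementation choice (assumption).
    return [min(max(v, lb[d]), ub[d]) for d, v in enumerate(new)]


rng = random.Random(42)
lb, ub = [-1.0, -1.0], [1.0, 1.0]
population = [[rng.uniform(lb[d], ub[d]) for d in range(2)] for _ in range(5)]
rabbit = population[0]  # hypothetical current best solution
new_position = exploration_step(population, 2, rabbit, lb, ub, rng)
```

In a full optimizer this update would be applied to every individual in each iteration in which the escaping energy indicates exploration.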

Transition from Exploration to Exploitation.
The transition from exploration to exploitation is critical to metaheuristic approaches [41]. In HHO, the escaping energy of the rabbit is denoted by A and is used to switch between these two stages. Its value declines as the number of iterations increases, which can be expressed mathematically as

$$A = 2 A_0 \left( 1 - \frac{t}{t_{max}} \right),$$

where $A_0$ is a random number defined in the interval [−1, 1], $t$ represents the current iteration, and $t_{max}$ is the maximum number of iterations.
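The decay of the escaping energy can be illustrated with the short sketch below. It assumes the linear schedule of the original HHO formulation, A = 2·A0·(1 − t/t_max), and the usual convention that |A| ≥ 1 selects exploration while |A| < 1 selects exploitation; the function names are hypothetical.

```python
def escaping_energy(a0, t, t_max):
    """Escaping energy A at iteration t, for an initial energy a0 drawn from [-1, 1]."""
    return 2.0 * a0 * (1.0 - t / t_max)


def stage(a):
    """Select the search stage from the magnitude of the escaping energy (assumed convention)."""
    return "exploration" if abs(a) >= 1.0 else "exploitation"


early = escaping_energy(1.0, 0, 100)    # full energy at the first iteration
late = escaping_energy(1.0, 100, 100)   # energy exhausted at the last iteration
```

Because the envelope 2·(1 − t/t_max) falls below 1 after half of the iterations, exploration can only occur in the first half of a run under this schedule.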

Exploitation Stage.
Harris hawks attack the prey once they have found it [41]. The real process of predation is always very complex: the prey has an opportunity to escape, and the Harris hawks respond differently according to the prey's behaviour. Four techniques are used in the exploitation stage to better model the actual situation. A random number (N) is used to define whether the prey has escaped successfully. The condition N < 0.5 designates a successful escape, while N ≥ 0.5 indicates a failed escape. The escaping energy of the prey (A) affects the Harris hawks' actions. A soft besiege takes place if |A| ≥ 0.5; otherwise, a hard besiege takes place [41]. When the artifact-suppressed signal is processed through this Harris hawks optimization process, the irregularity due to external and electronic noise is greatly reduced. This improvement can also be seen in Table 3, which suggests that the EEG artifact removal performance improves after the optimization algorithm is applied. EEG noise is suppressed, while EEG signal quality is preserved.
In this research paper, the performance of the recommended artifact removal method is measured by both parametric values and plot comparison. Both approaches demonstrate the success of the recommended method. Moreover, the recommended methodology with HHO optimization outperforms the other methods.

Conclusion
In this research work, an effective EEG motion artifact detection and removal approach is recommended to support precise analysis and diagnosis of neurological diseases. First, the single-channel signal is decomposed by using the EEMD algorithm. The resulting IMFs are applied to an SVM classifier for detection of artifacts in the input EEG signal. Once artifacts are detected, the efficient cascaded artifact removal approach (CCA-SWT) is applied to the IMFs. The correlation components (CCs) are reconstructed after motion artifact detection and removal. The reconstructed signals are further optimized by the HHO algorithm and evaluated qualitatively by visual analysis and quantitatively by parametric evaluation. The results show improved performance compared with the results in [6] for EEG artifact removal. Moreover, it is also concluded that the neural information is preserved even after artifact suppression.
In the future, we will try to improve the performance of the artifact removal method so that it adapts to the detection of various neural artifacts, and an improved version of the optimization algorithm will be applied to optimize the result.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.