Intelligent Analysis of Premature Ventricular Contraction Based on Features and Random Forest

Premature ventricular contraction (PVC) is one of the most common arrhythmias in the clinic. Due to its variability and susceptibility, patients may be at risk at any time. The rapid and accurate classification of PVC is of great significance for the treatment of diseases. Aiming at this problem, this paper proposes a method based on the combination of features and random forest to identify PVC. The RR intervals (pre_RR and post_RR), R amplitude, and QRS area are chosen as the features because they are able to identify PVC better. The experiment was validated on the MIT-BIH arrhythmia database and achieved good results. Compared with other methods, the accuracy of this method has been significantly improved.


Introduction
Electrocardiogram (ECG) is a graph that records, from the body surface, the changes in electrical activity produced during each cardiac cycle. It contains abundant information about the basic function and pathology of the heart. Therefore, it is of great significance for evaluating cardiac safety and various treatment methods. It is also an important means for the examination and diagnosis of various arrhythmias. Premature ventricular contraction (PVC) is the most widespread and common arrhythmia in the clinic, and it appears as abnormal behaviour in the ECG signal. PVC is not immediately life-threatening, but it can cause a deadly heart rhythm [1,2]. At present, doctors diagnose the disease with current medical technology and rely on their personal experience. At the same time, doctors may make a wrong diagnosis after long hours of high-intensity work. Computer-aided detection of PVC can effectively improve the efficiency of diagnosis.
At present, machine learning is widely used in medical diagnosis to help doctors improve the efficiency of diagnosis and treatment, so that diseases can be diagnosed as early as possible. For example, Liu et al. [3] proposed a method for diagnosing PVC based on the Lyapunov exponent. This method distinguishes PVC from other types by analyzing the Lyapunov exponent and training a learning vector quantization model. Gutiérrez-Gnecchi et al. [4] identified each waveform using a quadratic wavelet transform and classified eight heartbeat types with a probabilistic neural network. Although the methods proposed in [3,4] can identify PVC well, their high classification results are based on small datasets or duplicate data and have not been tested on more datasets. Zarei et al. [5] used a "replacement" strategy to examine the effect of each heartbeat on the change of the main direction and proposed a method for detecting newly arrived PVC heartbeats. Li et al. [6] proposed a PVC recognizer that matches correlation coefficients against a template. Zhou et al. [7] proposed a new method for detecting PVC that combines deep neural networks with rule inference. Llamedo and Martínez [8] studied and validated a simple heartbeat classifier that separates supraventricular and ventricular beats based on RR features and features derived from the wavelet transform. Zhang et al. [9] proposed a method that selects effective feature subsets by one-to-one comparison and uses an SVM to perform heartbeat classification. Sahoo et al. [10] proposed a method based on the multiresolution wavelet transform, which was tested on 48 records of the MIT-BIH arrhythmia database to classify four heartbeat types.
The ECG waveform is complex. By observing the ECG waveform, it is easy to obtain a large number of morphological ECG features. However, a single feature alone cannot achieve high-precision classification [5,6,10], and relying on one reduces the accuracy of classification.
In this paper, a method based on the combination of features and random forest is proposed to distinguish non-PVC heartbeats from PVC heartbeats. A single feature may not achieve good results, whereas too many features reduce the efficiency of the classifier. Therefore, this paper selected three characteristic parameters (RR intervals, QRS area, and R amplitude) through experimental verification and used them together to detect PVC. The classifier was trained on 22 records in the MIT-BIH arrhythmia database and verified on 22 nonoverlapping records. This method not only reduces the computation time and complexity of the classifier but also achieves a high recognition rate on the datasets. The outline of this paper is as follows: Section 2 introduces the ECG database and the datasets used in the experiments and discusses in detail the methods of detecting PVC, including signal preprocessing, feature extraction, the random forest classification method, and the processing of unbalanced datasets. Performance evaluation and experimental analysis are explained in Section 3. Section 4 is the summary of this paper.

PVC Identification Method Based on Features
The experimental data used in this paper come from the publicly available MIT-BIH arrhythmia database [11]. The MIT-BIH arrhythmia database is one of the three most recognized and widely used standard ECG databases in the world and has been widely used to verify arrhythmia detection and classification algorithms. The database consists of a total of 48 records, each containing two signals, each with a length of 650,000 samples and a duration of approximately 30 minutes, sampled at 360 Hz. The 48 records comprise 23 records (numbered from 100 to 124 inclusive with some numbers missing) randomly selected from more than 4000 Holter recordings. The remaining 25 records (numbered from 200 to 234 inclusive, with some numbers missing) contain uncommon but clinically significant arrhythmias. Each MIT-BIH record consists of three parts: the header file [.hea], which is stored as ASCII text; the data file [.dat], which is stored in the 212 format; and the annotation file [.atr], which is also stored in binary form. From the MIT-BIH arrhythmia database records, we know that four of the 48 records (102, 104, 107, and 217) contain paced beats. Following the recommendations of the Association for the Advancement of Medical Instrumentation (AAMI), this paper discards these four records and uses the remaining 44 records as experimental data. To allow comparison with other methods, we divided the 44 records into two datasets, DS1 and DS2, with a size ratio close to 1 : 1. DS1 is used for training, DS2 is used for testing, and each dataset contains 22 records from the ECG database. The specific division is shown in Table 1. Based on the AAMI standard, there are five types of heartbeats: N, S, V, F, and Q. Before the data are entered into the classifier, the N, S, F, and Q types are marked as non-PVC so that the dataset contains only the PVC and non-PVC categories, with a focus on the PVC category.
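The record filtering and relabeling described above can be illustrated with a minimal sketch. The function and constant names below are hypothetical and chosen only for illustration; the sketch assumes that each beat already carries one of the five AAMI class letters and that PVC corresponds to the V class.

```python
# Records with paced beats that the paper discards (102, 104, 107, and 217).
PACED_RECORDS = {102, 104, 107, 217}

# The five AAMI heartbeat classes named in the text.
AAMI_CLASSES = ("N", "S", "V", "F", "Q")


def usable_records(record_ids):
    """Drop the paced records and keep the rest (hypothetical helper)."""
    return [r for r in record_ids if r not in PACED_RECORDS]


def to_binary_label(aami_class):
    """Map an AAMI class to 1 (PVC, i.e. V) or 0 (non-PVC: N, S, F, Q)."""
    if aami_class not in AAMI_CLASSES:
        raise ValueError(f"unknown AAMI class: {aami_class!r}")
    return 1 if aami_class == "V" else 0


def non_pvc_to_pvc_ratio(labels):
    """Return the non-PVC : PVC ratio of a list of binary labels."""
    pvc = sum(labels)
    if pvc == 0:
        raise ValueError("no PVC beats in the label list")
    return (len(labels) - pvc) / pvc
```

For example, `to_binary_label("S")` gives 0 and `to_binary_label("V")` gives 1, so a record's annotation sequence reduces to a binary label vector before training.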
Figure 1 shows the various stages of detecting premature ventricular contraction, namely, signal preprocessing, feature extraction, and classification.

Signal Preprocessing.
In this paper, signal preprocessing is divided into four stages: data reading, denoising, QRS identification, and heartbeat segmentation. The MIT-BIH arrhythmia database consists of 48 two-lead records, of which only the MLII lead was read during the data reading stage. The original signal was then denoised by a wavelet filter, and the R wave was located by digital analysis of the slope, amplitude, and width. Finally, a single heartbeat was extracted from the complete ECG signal with the R peak as the center. The contents of each stage are described in detail below.
During the acquisition process, the ECG signal is often interfered by various external noises, which inevitably affects the location of the ECG waveform feature points. In order to correctly identify the waveform and extract more accurate features in later research, the ECG signal must be denoised by the preprocessing process to improve the signal-to-noise ratio.
In recent years, the wavelet transform has been widely used in signal denoising research. It is a method that can perform localized signal analysis in both the time and frequency domains. It supports multiresolution analysis and is suitable for analyzing nonstationary signals and extracting local features of signals. For example, Alyasseri et al. [12] denoised nonstationary ECG signals by using the β-hill climbing metaheuristic algorithm together with the wavelet transform, which achieved good results. Similarly, Wang et al. [13] proposed an adaptive threshold method based on the wavelet transform, which reduces noise by dynamically adjusting the threshold. Therefore, this paper adopted the optimal 8th-order Daubechies mother wavelet basis function method proposed by Singh and Tiwari [14] and performed signal denoising in the wavelet domain. It determines the optimal wavelet filter in three steps: firstly, the base wavelet filter and the low-pass filter are selected from the wavelet filter library; secondly, the correlation coefficient between the ECG signal and the selected wavelet filter is calculated; finally, the filter that maximizes the cross-correlation is taken as the best wavelet filter. This method suppresses the interference of high-frequency noise and effectively distinguishes the signal from the noise.
QRS wave recognition plays an important role in ECG signal analysis. Heart rate, RR interval, and morphology can only be calculated after the QRS complex has been determined, and they are needed to distinguish between PVC heartbeats and non-PVC heartbeats. There are many algorithms for detecting QRS complexes. Since the "Pan and Tompkins" QRS recognition algorithm based on the differential method [15] is easy to understand and implement, we chose this algorithm to locate the R wave. The algorithm first uses a digital band-pass filter to reduce false identification due to interference present in the signal, then squares the data samples and passes the signal through a moving window to obtain waveform characteristics other than the slope of the R wave. Finally, the R wave is identified by digital analysis of slope, amplitude, and width. The algorithm periodically adjusts its thresholds and parameters automatically to accommodate changes in QRS morphology and heart rate, which makes it accurate for ECG signals with diverse signal characteristics and QRS morphologies [15]. After that, this paper compared the R waves detected by the "Pan and Tompkins" method with the R waves marked in the MIT-BIH arrhythmia database. Table 2 shows the detailed test results.
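As a rough illustration of the differentiate, square, and moving-window idea behind this detector, the sketch below locates R peaks in a clean signal. It is a deliberately simplified stand-in and not the full algorithm of [15]: it omits the band-pass filter, and it uses a single fixed threshold (the hypothetical `threshold_ratio` parameter) instead of the adaptive thresholds described above. The window length of 0.15 s is also an assumption.

```python
def detect_r_peaks(signal, fs=360, window_s=0.15, threshold_ratio=0.5):
    """Simplified R-peak detector: derivative -> squaring -> moving-window sum.

    Assumptions (not from the paper): no band-pass stage, a fixed threshold
    at threshold_ratio * max(window sum), and a window of window_s seconds.
    """
    if len(signal) < 2:
        return []
    # First difference emphasizes the steep slopes of the QRS complex.
    diff = [signal[i + 1] - signal[i] for i in range(len(signal) - 1)]
    # Squaring makes all values positive and amplifies large slopes.
    squared = [d * d for d in diff]
    # Trailing moving-window sum over `width` samples.
    width = max(1, int(window_s * fs))
    integrated = []
    running = 0.0
    for i, value in enumerate(squared):
        running += value
        if i >= width:
            running -= squared[i - width]
        integrated.append(running)
    threshold = threshold_ratio * max(integrated)
    peaks = []
    i, n = 0, len(integrated)
    while i < n:
        if integrated[i] > threshold:
            start = i
            while i < n and integrated[i] > threshold:
                i += 1
            # Search the original signal around the active region for its maximum.
            lo = max(0, start - width)
            hi = min(len(signal), i + 1)
            segment = signal[lo:hi]
            peaks.append(lo + segment.index(max(segment)))
        else:
            i += 1
    return peaks
```

On real recordings a fixed threshold is fragile, which is why the adaptive thresholding of the original algorithm matters; the sketch only shows how the three signal transformations expose the QRS complex.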
After the R wave is detected, single heartbeats are extracted by segmenting the ECG signal. In the ECG waveform, the R peak is easier to identify and extract than other features. Therefore, researchers customarily segment a single heartbeat with the R peak as a reference. According to the empirical value, the point 100 samples before each detected R point is used as the starting point of a single heartbeat, that is, the starting point of the P wave. Similarly, the point 150 samples after each detected R point is taken as the end point of a single heartbeat, that is, the end point of the T wave. If the detected R point is denoted R_sample, the entire heartbeat interval is [R_sample − 100, R_sample + 150].
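The segmentation rule can be written as a short sketch. The function name is hypothetical, and skipping beats whose window would run past either end of the recording is an assumption made here; the paper does not say how such edge beats are handled.

```python
def segment_beats(signal, r_peaks, before=100, after=150):
    """Cut one beat per R peak over [R_sample - before, R_sample + after].

    Both interval ends are inclusive, so each beat has before + after + 1
    samples. Beats that would extend outside the signal are skipped
    (an assumption, not stated in the paper).
    """
    beats = []
    for r in r_peaks:
        start, end = r - before, r + after
        if start < 0 or end >= len(signal):
            continue
        beats.append(signal[start:end + 1])
    return beats
```

With the default values every extracted beat has 251 samples, with the R peak at index 100 of the segment.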

Feature Extraction.
In the feature extraction phase, we used the data of the 44 records. According to Table 2, falsely detected heartbeats and missed heartbeats in the 44 records account for only 0.54% and 0.37%, respectively. Therefore, we excluded the wrong and missing heartbeats and used the remaining heartbeats in the following study. In Section 2.1, a single heartbeat and its start and end points are extracted, but these few points are not enough to extract more features. Next, the start and end points of the QRS complex are used to extract the PR interval, QRS interval, and QT interval. In recent years, many methods have identified PVC based on its characteristics, and we used 7 features (R amplitude [16], PR interval, QRS interval, QT interval, QRS area [17], pre_RR interval [18,19], and post_RR interval [20]) for the study. To facilitate the subsequent calculation and understanding, the seven features are recorded as R_amp, PR, QRS, QT, QRS_area, pre_RR, and post_RR, respectively. In addition, the start and end points of the heartbeat and the start and end points of the QRS complex are marked as P_start, T_end, QRS_start, and QRS_end.
The two points QRS_start and QRS_end are located with a sliding window method. For example, when looking for the QRS_start point, we restrict the moving range of the sliding window to [P_start, R_sample]. According to the empirical value, the size m of the sliding window is set to 10. Starting from the 4th point after the P_start point ($X_1$) and sliding toward the R peak, the interval of the sliding window is $[X_i - 4, X_i + 5]$. With each slide of the window, the amplitude variance $S_1$ corresponding to $[X_i - 4, X_i]$ and the amplitude variance $S_2$ corresponding to $[X_i + 1, X_i + 5]$ are dynamically calculated according to equation (1). Then, $S_1$ and $S_2$ are compared. When $S_1 < S_2$ and the value of $S_1$ fluctuates around 0, the segment $[X_i - 4, X_i]$ is judged to be a stationary baseline. Then, it is judged whether the five values after the point $X_i$ are rising or falling. If so, the point $X_i$ is regarded as the QRS_start point; otherwise, the window continues to slide. Similarly, when looking for the QRS_end point, the window moves within the [R_sample, T_end] interval, first finds a smooth baseline, and then finds the QRS_end point:

$$S = \frac{1}{n}\sum_{k=1}^{n}\left(X_k - \bar{X}\right)^2 \tag{1}$$

where $X_k$ represents the amplitude corresponding to the k-th point, $\bar{X}$ represents the average amplitude, and $n$ represents the number of points.
Using the five points obtained above (P_start, QRS_start, R_sample, QRS_end, and T_end), we can get some of the features used in this study. The following is a detailed description of each feature and its annotation in the signal (Figure 2):
(i) R amplitude: the amplitude used in the paper is the amplitude after the noise is filtered out by the filter.
(ii) PR interval: the width of the PR interval is defined as the distance from the P_start point to the QRS_start point, so PR interval (PR) = (QRS_start − P_start)/360.
Note to Table 2. The first column is the name of the record, the second column is the number of R waves marked in MIT-BIH, the third column is the number of correctly detected R waves, the fourth column is the number of falsely detected R waves, the fifth column is the number of missed R waves, the sixth column is the sensitivity, and the seventh column is the positive prediction rate. According to the AAMI EC38 standard, a location detection is successful when the difference between the detected QRS complex and the manual mark is within 150 ms.
(vi) pre_RR interval: taking the R peak of each heartbeat as the reference, the heartbeats are traversed in order to obtain the distance between the current heartbeat and the previous heartbeat, which is then divided by the sampling rate (360) to obtain the pre_RR interval (pre_RR). The traversal starts from the second heartbeat because the first heartbeat has no previous RR interval.
(vii) post_RR interval: in the same way as for pre_RR, each heartbeat is traversed to get the distance between the current heartbeat and the following heartbeat, which is then divided by the sampling rate to get the post_RR interval (post_RR). The last heartbeat is discarded because it has no subsequent RR interval.
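The QRS_start search and the two RR features described in this section can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the tolerance `flat_tol` for "S1 fluctuates around 0" is a hypothetical parameter, the variance uses a 1/n normalization as an assumption, and "rising or falling" is interpreted as five strictly monotonic steps after X_i.

```python
def variance(values):
    """Amplitude variance of a segment (1/n normalization assumed)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def find_qrs_start(beat, p_start, r_sample, flat_tol=1e-4):
    """Return the index of QRS_start within [p_start, r_sample], or None.

    flat_tol is a hypothetical tolerance for a 'stationary baseline'.
    """
    for i in range(p_start + 4, r_sample - 4):
        s1 = variance(beat[i - 4:i + 1])  # segment [X_i - 4, X_i]
        s2 = variance(beat[i + 1:i + 6])  # segment [X_i + 1, X_i + 5]
        if s1 < s2 and s1 <= flat_tol:
            # Steps from X_i to X_i + 5: all rising or all falling.
            steps = [b - a for a, b in zip(beat[i:i + 5], beat[i + 1:i + 6])]
            if all(s > 0 for s in steps) or all(s < 0 for s in steps):
                return i
    return None


def rr_features(r_peaks, fs=360):
    """Return (R index, pre_RR, post_RR) in seconds for each interior beat.

    The first beat has no pre_RR and the last has no post_RR, so both
    are skipped, as described in items (vi) and (vii).
    """
    features = []
    for i in range(1, len(r_peaks) - 1):
        pre_rr = (r_peaks[i] - r_peaks[i - 1]) / fs
        post_rr = (r_peaks[i + 1] - r_peaks[i]) / fs
        features.append((r_peaks[i], pre_rr, post_rr))
    return features
```

A PVC typically arrives early and is followed by a pause, so a short pre_RR paired with a long post_RR is the pattern these two features are meant to expose.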
In the feature extraction stage, seven features are extracted. To improve the efficiency of the classifier, the random forest classifier is used to further analyze the impact of different numbers of features on the classification result. The results of the analysis can be seen in Figure 3. We first analyse the importance of each feature. One of the seven features is selected and studied on the DS1 dataset, and the number of V-type beats that can be correctly identified by using only this feature is obtained. Another feature is then selected for the same experiment, until all seven features have been tested. We then rank the features by their importance for classification, as shown in Figure 3(a). Among them, pre_RR has the greatest impact on the result: using the pre_RR feature alone achieves a recognition rate of 33%. Next in the ranking are post_RR (24%), QRS_area (18%), and R_amp (17%). We also studied the results produced by feature subsets of different sizes on the DS2 dataset. Following the ranking order obtained in Figure 3(a), we enter increasing numbers of features in turn to get the results shown in Figure 3(b). When only one feature is used, the effect is not good, which is within our expectation. As the number of features increases, the classification accuracy continues to increase. When the number of features reaches 4, the classification effect is clearly good and almost the same as with 5, 6, or 7 features. This further illustrates that these four features play an important role in PVC identification. We conclude that these four features are sufficient to achieve a good classification effect on the test set. Therefore, the RR intervals (pre_RR and post_RR), the QRS area (QRS_area), and the R amplitude (R_amp) are used for the study in Section 3.

Random Forest.
As a highly flexible machine learning algorithm, random forest (RF) [21] has broad application prospects. RF is an important Bagging-based ensemble learning method. It contains multiple decision trees and combines several weak models into a strong model. In practical applications, it is widely used in classification and regression problems because of its high accuracy, strong resistance to noise, resistance to overfitting, ability to process unbalanced datasets [22], and lack of any need to standardize datasets [23]. In this paper, decision trees based on the CART algorithm were used to construct the random forest classifier. CART (classification and regression tree) is a well-known decision tree learning algorithm that can be used for both classification and regression tasks. The training procedure of the random forest is shown in Figure 4.
Bagging is the most famous representative of the parallel ensemble learning methods [24]. Its basic process is to draw T sample sets, each containing m training samples, and then train a base learner on each sample set. Finally, these base learners are combined. The algorithm is described in Algorithm 1.
In Algorithm 1, $h_t$ represents the t-th base learner, and $D_{bs}$ is the sample distribution generated by bootstrap sampling.
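The Bagging procedure can be sketched in a few lines. This is a generic illustration under stated assumptions, not a transcription of Algorithm 1: the base learner here is a toy one-dimensional threshold stump (a hypothetical stand-in for the CART trees used in the paper), sampling is with replacement, and the learners are combined by majority vote.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) samples with replacement (one bootstrap sample set)."""
    m = len(dataset)
    return [dataset[rng.randrange(m)] for _ in range(m)]


def train_threshold_stump(samples):
    """Toy base learner: split a 1-D feature at the midpoint of the class means.

    Assumes labels 0/1 and that class 1 has the larger feature values.
    """
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    if not zeros or not ones:
        only = 0 if zeros else 1
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging_fit(dataset, train_base_learner, n_learners, seed=0):
    """Train n_learners base learners, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_base_learner(bootstrap_sample(dataset, rng))
            for _ in range(n_learners)]


def bagging_predict(learners, x):
    """Combine the base learners by majority vote."""
    votes = Counter(h(x) for h in learners)
    return votes.most_common(1)[0][0]
```

Using an odd number of learners avoids tied votes in the binary case. A random forest additionally restricts each split to a random subset of the features, which this sketch does not show.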
Bagging can be applied to binary classification, multiclass classification, and regression tasks, which is one of its advantages. In addition, by recording the training samples used by each base learner, it can use the remaining samples of the initial training set as a validation set to obtain an "out-of-bag estimate" of generalization performance. Let $D_t$ denote the training sample set actually used by $h_t$, and let $H_{oob}(x)$ denote the out-of-bag prediction for sample $x$, that is, the prediction that considers only those base learners that were not trained on $x$:

$$H_{oob}(x) = \arg\max_{y \in \mathcal{Y}} \sum_{t=1}^{T} \mathbb{I}\left(h_t(x) = y\right) \cdot \mathbb{I}\left(x \notin D_t\right) \tag{2}$$

where $\mathbb{I}(\cdot)$ is the indicator function and $\mathcal{Y}$ is the set of class labels. Then, the out-of-bag estimate of the Bagging generalization error is

$$\epsilon_{oob} = \frac{1}{|D|} \sum_{(x, y) \in D} \mathbb{I}\left(H_{oob}(x) \neq y\right) \tag{3}$$

The CART decision tree [25] used in this paper uses the "Gini index" to select the partitioning attribute. Assuming that the proportion of samples of the k-th class in the current sample set $D$ is $p_k$ ($k = 1, 2, \ldots, K$), the purity of the dataset $D$ can be measured by the Gini value:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2 \tag{4}$$

Suppose the discrete attribute $a$ has $V$ possible values $a^1, a^2, \ldots, a^V$. If $a$ is used to divide the sample set $D$, $V$ branch nodes are generated. The v-th branch node contains all samples in $D$ that have the value $a^v$ on attribute $a$, which is denoted as $D^v$. We calculate the Gini value of $D^v$ according to equation (4) and then give the branch node a weight $|D^v|/|D|$, considering that different branch nodes contain different numbers of samples. That is, the more samples a branch node contains, the greater its influence, so the Gini index of the attribute $a$ is defined as

$$\mathrm{Gini\_index}(D, a) = \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Gini}(D^v) \tag{5}$$

Then, in the candidate attribute set $A$, we select the attribute that makes the Gini index after partitioning the smallest as the optimal partition attribute, i.e.,

$$a_* = \arg\min_{a \in A} \mathrm{Gini\_index}(D, a) \tag{6}$$

The detailed flow chart of attribute division is shown in Figure 5.
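The attribute selection rule can be made concrete with a small sketch for discrete attributes. The attribute names `wide_qrs` and `lead` and the four toy samples are hypothetical; they only illustrate how the Gini value, the weighted Gini index, and the arg-min choice fit together. The continuous features used in this paper would additionally require candidate split thresholds, which the sketch omits.

```python
from collections import Counter


def gini(labels):
    """Gini value of a label list: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_index(rows, labels, attribute):
    """Weighted Gini value of the branches created by splitting on `attribute`."""
    n = len(rows)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attribute], []).append(label)
    return sum(len(branch) / n * gini(branch) for branch in branches.values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the smallest Gini index after partitioning."""
    return min(attributes, key=lambda a: gini_index(rows, labels, a))


# Hypothetical toy data: 'wide_qrs' separates the classes, 'lead' does not.
rows = [
    {"wide_qrs": 1, "lead": "A"},
    {"wide_qrs": 1, "lead": "B"},
    {"wide_qrs": 0, "lead": "A"},
    {"wide_qrs": 0, "lead": "B"},
]
labels = ["V", "V", "N", "N"]
```

Here the unsplit set has a Gini value of 0.5, splitting on `wide_qrs` yields two pure branches (index 0), and splitting on `lead` leaves both branches mixed (index 0.5), so `best_attribute(rows, labels, ["wide_qrs", "lead"])` selects `wide_qrs`.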

Synthetic Minority Oversampling Technique Algorithm.
The dataset studied in this paper is seriously imbalanced. It can be seen in Table 1 that the ratio of the non-V type to the V type is about 14 : 1 (44 recordings), which will undoubtedly affect the results of the study. To avoid this, the SMOTE (Synthetic Minority Oversampling Technique) algorithm is chosen to artificially synthesize new samples from the minority class samples. The algorithm is divided into three steps:
(a) Suppose $x$ is a sample in the minority class and $S$ is the minority class sample set. Calculate the distance $d$ from $x$ to all samples in $S$ using the Euclidean distance (equation (7)) and get its k nearest neighbors:

$$d = \sqrt{\sum_{i} \left(a_i - b_i\right)^2} \tag{7}$$

where $a_i$ represents the i-th dimension of $x$ and $b_i$ represents the i-th dimension of a certain sample in $S$.
(b) Determine the sampling magnification $N$ according to the sample imbalance ratio and, for $x$, randomly select several samples from its k nearest neighbors.
(c) Assuming that a randomly selected neighbor is $x'$, each selected $x'$ is combined with $x$ to construct a new sample $x_{new}$ according to the following equation:

$$x_{new} = x + \mathrm{rand}(0, 1) \times \left(x' - x\right)$$
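The three steps can be sketched in plain Python as follows. This is a minimal illustration of the standard SMOTE interpolation, in which a new sample is placed at a random position on the line segment between x and a chosen neighbor; it is not the implementation used in the paper. The function names, the default k = 5, and the fixed per-sample count `n_new_per_sample` (standing in for the sampling magnification N) are assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def smote(minority, n_new_per_sample, k=5, seed=0):
    """Synthesize n_new_per_sample new samples for every minority sample.

    Each new sample is x + gap * (x' - x), where x' is a random one of the
    k nearest minority neighbors of x and gap is uniform in [0, 1).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # Step (a): k nearest neighbors of x within the minority set.
        others = [s for s in minority if s is not x]
        neighbors = sorted(others, key=lambda s: euclidean(x, s))[:k]
        for _ in range(n_new_per_sample):
            # Step (b): pick one neighbor at random.
            x_prime = rng.choice(neighbors)
            # Step (c): interpolate between x and the neighbor.
            gap = rng.random()
            synthetic.append([xi + gap * (pi - xi) for xi, pi in zip(x, x_prime)])
    return synthetic
```

Because every synthetic point lies between two existing minority samples, the new data stay inside the region already occupied by the minority class instead of duplicating samples exactly.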

Evaluation Indicators.
In this paper, accuracy (Acc), positive predictive value (PPV), sensitivity (Se), specificity (Sp), and c (Youden's index [26]) are used as evaluation indicators for the algorithm. Accuracy is the most commonly used metric; it is the ratio of the number of samples correctly classified by the classifier to the total number of samples in a given test dataset. Generally speaking, the higher the accuracy, the better the classification effect. The positive predictive value is the proportion of truly positive cases among all cases that test positive. Sensitivity is the proportion of actual positive cases that are correctly identified. Specificity is the proportion of actual negative cases that are correctly identified. In addition to the above evaluation indicators, this paper also used a combined measure of sensitivity and specificity, namely, c. The above indicator equations and the confusion matrix of the classification (
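The indicators can be computed from the four confusion-matrix counts as sketched below. The paper's own equations are not reproduced in this text, so the sketch uses the standard definitions, with Youden's index taken as Se + Sp − 1; that form is consistent with the figures reported later (97.88% + 97.56% − 100% ≈ 95.45%). The function name and the example counts are hypothetical.

```python
def evaluate(tp, fp, tn, fn):
    """Return Acc, PPV, Se, Sp and Youden's index from confusion-matrix counts.

    tp/fn count PVC beats classified correctly/incorrectly, tn/fp count
    non-PVC beats classified correctly/incorrectly. All denominators are
    assumed to be nonzero.
    """
    acc = (tp + tn) / (tp + fp + tn + fn)
    ppv = tp / (tp + fp)   # positive predictive value
    se = tp / (tp + fn)    # sensitivity
    sp = tn / (tn + fp)    # specificity
    youden = se + sp - 1   # combined measure of Se and Sp
    return {"Acc": acc, "PPV": ppv, "Se": se, "Sp": sp, "Youden": youden}


# Hypothetical counts, for illustration only.
metrics = evaluate(tp=90, fp=10, tn=880, fn=20)
```

On a dataset as imbalanced as this one, accuracy alone can look high even when many PVC beats are missed, which is why sensitivity and the combined index are reported alongside it.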

Analysis of Experimental Results of Different Parameters.
Because random forest involves randomness, different parameter values produce different classification results. For example, n_estimators represents the number of decision trees. In general, if the value is too small, the model tends to underfit, and if it is too large, the model tends to overfit, so choosing a suitable value is crucial. We studied and analyzed the DS1 dataset with different n_estimators values, as shown in Table 4.
As can be seen from Table 4, the DS1 dataset achieves the best RF performance at n_estimators = 120. Its indicators are 98.29%, 97.58%, and 96.58%. This indicates that different parameters have an effect on the experimental results. After that, we adjusted the other parameters on the DS1 dataset and obtained the optimal parameters of the classifier (DS1: n_estimators = 120, min_samples_split = 100, and min_samples_leaf = 30).

Analysis of Experimental Results of Various Classifiers.
In this paper, five evaluation indicators of Acc, PPV, Se, Sp, and c are used to compare the performance differences among K-nearest neighbor (KNN), logistic regression (LR), naive Bayes (NB), multilayer perceptron (MLP), decision tree (DT), and random forest (RF) on the unbalanced binary dataset (DS2). Table 5 presents a performance comparison of the various classifiers.
According to the results in Table 5, the NB algorithm gives the worst results. NB's indicators are the lowest, indicating that the method may not be applicable to this dataset. The best result is obtained by RF. Apart from its PPV (95.46%), which is the lowest, the other four indicators of RF (Acc, Se, Sp, and c) are the best. RF is composed of multiple decision trees, and its final result is determined by the vote of all decision trees. This is the biggest advantage of RF over the other algorithms. Therefore, its classification effect is better than that of the other five methods, which is why we chose random forest instead of the other classifiers.

Analysis of Experimental Results of Unbalanced Datasets.
Because the class sizes in the datasets differ greatly, we conducted experiments on this issue. On the initial unbalanced training set DS1 (13 : 1), new data were synthesized using the SMOTE algorithm so that the ratio of the non-V type to the V type in DS1 reached 8 : 1, 4 : 1, and 1 : 1, respectively. We trained on the DS1 datasets with these different ratios and tested on the unbalanced test set DS2. From Table 6, we can see that the results improve steadily as the ratio between non-V and V shrinks. Every indicator changes, with c changing the most (by 37.13%), rising from 58.32% to 95.45%, while the other indicators rise by 17%-20%.
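The number of synthetic V-type samples needed to reach a target ratio follows from simple arithmetic, sketched below. The function name and the beat counts in the example are hypothetical round numbers chosen to mimic a 13 : 1 starting ratio; they are not the actual counts of DS1.

```python
import math


def synthetic_needed(n_major, n_minor, target_ratio):
    """Synthetic minority samples needed so that majority : minority is
    about target_ratio : 1. Returns 0 if the set is already balanced enough.
    """
    needed = math.ceil(n_major / target_ratio) - n_minor
    return max(0, needed)


# Hypothetical counts with a 13 : 1 imbalance.
n_non_v, n_v = 13000, 1000
for ratio in (8, 4, 1):
    print(ratio, synthetic_needed(n_non_v, n_v, ratio))
```

With these illustrative counts, reaching 8 : 1 needs 625 new samples, 4 : 1 needs 2250, and 1 : 1 needs 12000, so full balance means that most of the V-type training samples are synthetic.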
This undoubtedly shows that an unbalanced dataset can have a serious impact on the experiment.
Table 7 compares the PVC detection performance of the method used in this paper with the methods of other researchers. This paper uses the same test sets as [5-9, 27, 28] and compares its results with those publications on different sets of records. As can be seen from the experimental results on the seven records, the proposed classifier achieved better positive predictive value (99.44%) and specificity (99.45%). Its accuracy (99.32%) is only 0.03% lower than the 99.35% of [5]. Similarly, 6 records are used for comparison with other methods. The results showed that the sensitivity (99.09%), specificity (99.34%), and c (98.43%) of the proposed method were not the best, but its accuracy and positive predictive value were better than all other results. Finally, we applied the parameters obtained in Section 3.2.1 to train on DS1 and then tested the classifier on DS2. The results of the classifier for these records are accuracy (96.38%), positive predictive value (95.46%), sensitivity (97.88%), specificity (97.56%), and c (95.45%). Accuracy (96.38%) and specificity (97.56%) are lower than the other results. However, the other three indicators were among the best results, with c (95.45%) second only to the result of [7] (97.13%). The final experimental results show that the method of this paper is effective in identifying PVC.

Conclusions
Automatic analysis technology for identifying PVC has been established in the field of ECG research for several decades, and many scholars have worked on improving the identification rate of PVC. This paper proposes a method based on a combination of multiple features and a random forest algorithm to distinguish PVC. PVC can be identified by the RR intervals (pre_RR and post_RR), R amplitude, and QRS area. These features were tested on 22 records (DS2) in the MIT-BIH arrhythmia database and achieved good results. The accuracy, positive predictive value, sensitivity, specificity, and c of the method reach 96.38%, 95.46%, 97.88%, 97.56%, and 95.45%, respectively. The results demonstrate that the method has a high recognition rate on ECG data, which is of great significance for clinical application. However, the method has so far only been validated on a two-class problem. Multiclass classification remains a problem that needs more time to explore in the future.

Data Availability
All datasets used to support the findings of this study are included within the article. All datasets used to support the findings of this study were supplied by the publicly available MIT-BIH database from the Massachusetts Institute of Technology.
The URL to access the data is https://www.physionet.org/cgi-bin/atm/ATM. The code used to support the findings of this study has not been made available because the source code is part of a national project and is a trade secret.

Conflicts of Interest
The authors declare that there are no conflicts of interest.
Journal of Healthcare Engineering