Comparison of Different EHG Feature Selection Methods for the Detection of Preterm Labor

Numerous types of linear and nonlinear features have been extracted from the electrohysterogram (EHG) in order to classify labor and pregnancy contractions. As a result, the number of available features is now very large. The goal of this study is to reduce the number of features by selecting only the relevant ones which are useful for solving the classification problem. This paper presents three methods for feature subset selection that can be applied to choose the best subsets for classifying labor and pregnancy contractions: an algorithm using the Jeffrey divergence (JD) distance, a sequential forward selection (SFS) algorithm, and a binary particle swarm optimization (BPSO) algorithm. The two last methods are based on a classifier and were tested with three types of classifiers. These methods have allowed us to identify common features which are relevant for contraction classification.


Introduction
Preterm birth, that is, birth before the 37th week of pregnancy, remains a major problem in obstetrics. Children born before term present a high risk of mortality as well as health and development problems [1]. According to the World Health Organization (WHO), preterm birth rates range between 5% and 12% of births and perinatal mortality occurs in 3% to 47% of these cases in even the most developed parts of the world [2].
Delivery occurs after the onset of regular and effective uterine contractions, which cause dilation of the cervix and expulsion of the fetus. A contraction of the uterine muscle occurs due to the generation of electrical activity in a given uterine cell that spreads to other, neighboring cells. The evolution of uterine contractions, from weak and ineffective during pregnancy to strong and effective during labor, is therefore related to an increase in cellular excitability to an increase in the synchronization of the entire uterus [3].
A primary aim of pregnancy is to maintain the wellbeing of both mother and fetus and to keep the latter in utero as long as needed for a healthy birth. During pregnancy, the monitoring of uterine contractility is crucial in order to differentiate normal contractions, which are ineffective, from those effective contractions which might cause early dilation of the cervix and induce preterm birth. Despite increased knowledge and understanding of the phenomena involved in the onset of preterm labor, the methods currently used in obstetrics are not precise enough for an early detection of preterm birth threats. We need a more reliable method for early detection and prevention of preterm birth threats.
One of the most promising methods for monitoring uterine activity began in the 1950s and was developed in the 1980s. It is based on the study of the electrical activity of the uterus as recorded on the mother's abdominal surface [5]. The electrohysterogram (EHG) consists of the summation of the electrical activity generated by the active uterine muscle cells, plus the noise related to corrupting electrical and mechanical activities. EHG, recorded externally, has been demonstrated to be representative of the uterine electrical activity as recorded internally [6].
Many teams have extracted features from the EHG signals in order to find specific information leading to the detection of preterm birth. Firstly, linear methods in both time and frequency domains were used to extract features from the EHG. In order to improve the results obtained by using linear methods and because the EHG, like other biomedical signals, presents some nonlinear characteristics, several measures have been proposed for detecting nonlinear characteristics in the EHG.
A large number of features have thus far been extracted from the EHG signal by many different researchers using very different population and recording protocols. In general, the complexity of calculations required for diagnostic purposes increases with the number of features in play. The reduction of feature dimensionality through the elimination of irrelevant and noisy features is very important in pattern recognition. The objective of this study is to select the most significant subset, among features extracted from the bibliography, in order to discriminate pregnancy and labor contractions, with these features being computed from the same given population. In this study, we selected from the bibliographic data 20 features (16 linear and 4 nonlinear) extracted from the EHG: mean frequency (MPF) [7], peak frequency (PF) [8][9][10], and deciles ( 1 . . . 9) [11] which contain the median frequency [9,11,12], parameters extracted from wavelet decomposition ( 1 . . . 5) [13], Lyapunov exponent (LE) [14,15], time reversibility (Tr) [15], sample entropy (SE) [12], and variance entropy (VarEn) [16].
In this work, three methods are presented for feature subset selection. The first one, developed in this work, is based on the measurement of the Jeffrey divergence (JD) distance between the parameter histograms computed from the pregnancy and labor EHG classes [17]. The last two methods, developed for data mining, rely on the combination of a classifier and a search procedure: either sequential forward selection (SFS) [18] or binary particle swarm optimization (BPSO) [19]. The goal of these methods is to select, from a given feature set, the features subset that gives the maximum classification accuracy. This paper is organized as follows. In the first part we will describe the experimental protocol. Then we will present the features extracted from the EHG processing bibliography and the three methods for feature selection. Finally, we will present the results of feature selection.

Experimental Protocol
In our study we used signals recorded on 48 women: 32 during pregnancy (33-39 weeks of gestation) and 16 during labor (39-42 weeks of gestation). The measurements were performed at two hospitals in France and Iceland. In Iceland, the measurements were performed at the Landspitali University Hospital, using a protocol approved by the relevant ethical committee (VSN02-0006-V2). In France, the measurements were performed at the Center for Obstetrics and Gynecology (Amiens), using a protocol approved by the relevant ethical committee (ID-RCB 2011-A00500-41)). After the recording, we followed the pregnant women in order to label the signals as either pregnancy or labor. When the woman gave birth within 24 hours, the signal was labeled "labor". If the delivery occurred later, the signal was labeled "pregnancy". In our study not all pregnancies ended by a spontaneous delivery, and in both hospitals different drugs are routinely used for labor induction or progress. In our study, 7 women received oxytocin (79 contractions), 3 women received Prostaglandin (49 contractions), 1 woman received Epidural (3 contractions), and 8 women received no drugs (38 contractions). Our database contains only singleton pregnancies.
The EHGs were recorded using a multielectrode system composed of 18 electrodes: 16 arranged in a 4 × 4 matrix positioned on the woman's abdomen and two reference electrodes placed on each of her hips [20]. The amplifier bandwidth is 0.16-128 Hz. To increase the signal-to-noise ratio, we calculated the vertical bipolar signals (Vb ). Finally, we obtained 12 bipolar signals as shown in Figure 1. The bandwidth of our signal lies between 0.1 and 3 Hz. The sampling frequency used is 200 Hz, downsampled by a factor of 12 to obtain a new signal of 16.67 Hz.
In this study, we used only one bipolar signal, Vb7, because this signal is a reference recording position that has been used for a long time in our research. It is located on the median vertical axis of the uterus. The signal energy in this area remains high throughout the pregnancy as well as during labor. The bursts of uterine electrical activity that correspond to contractions were manually segmented, based on the tocodynamometer signal recorded simultaneously. After this manual segmentation of EHG bursts, we obtained a database containing 133 pregnancy bursts and 133 labor bursts.
Computational and Mathematical Methods in Medicine

Parameters Extraction.
In our study 20 parameters have been extracted from the EHG. These parameters are divided into two categories: linear and nonlinear.

Linear Parameters
Parameters Related to the Power Spectral Density. Several frequency parameters have been extracted from the power spectral density (PSD), ( ). In our work, we use the Welch Periodogram method to calculate the power spectral density of each burst [11]. This Welch Periodogram uses a window of type nfft, with size equal to the length of signal/2, with 50% overlap, for a total of three windows used. Eleven frequency parameters are extracted from this PSD: mean frequency MPF [7], peak frequency (PF) [8][9][10], and deciles 1 . . . 9 [11], which contain the median frequency 5 [9,11,12]. Deciles correspond to the frequencies 1 . . . 9 that divide the power spectral density into parts with each containing 10% of the total energy. Consider the following: Parameters Extracted from Wavelet Decomposition. Some authors have also used time-frequency methods, such as wavelet decomposition, to characterize the nonstationary characteristics of the EHG. In our work, we used the wavelet symlet 5, a choice based on the study referenced in [21]. This study compared several types of wavelets. The results have shown that the symlet 5 appears to be the most appropriate wavelet for the analysis of EHG signals for detection and classification purposes. After decomposition of each EHG burst into detail coefficients, we calculate the variances on the following detail levels: 2, 3, 4, 5, and 6 (named W1, W2, W3, W4, and W5) as previously proposed in [13]. These detail coefficients are as follows:  Figure 2). The choice of the details depends on the sampling frequency of the signal (sample rate equal to 16.67 Hz after downsampling) in order to correspond to the same frequency bands as the one selected in [13]. These selected details represent more than 96% of the signal energy and cover the frequency band of interest.

Nonlinear Parameters
Time Reversibility (Tr). A time series is reversible if the probabilistic properties are unchanged with respect to time reversal. Time irreversibility is a good indicator of nonlinearity. To calculate the time reversibility (Tr) of the signal we have used equation described in [15] as follows: where is the signal length and is the time delay.
Lyapunov Exponent. The Lyapunov exponent (LE) studies the stability and the sensitivity to initial conditions of the system. It measures the rate of trajectory separation between adjacent tracks in the phase space [14,15]. In our study we used the equation of LE described in [15] and represented by where ‖Δ 0 ‖ represents the Euclidean distance between two states of the system at an arbitrary time 0 and ‖Δ ‖ corresponds to the Euclidean distance between the two states of the system at a later time .
Sample Entropy. We used the sample entropy (SE) to identify the regularity of EHG signals. In our work, we used the sample entropy described in [12]. A less predictable time series presents higher sample entropy.
where the four parameters , , , and represent, respectively, the length of the time series, the length of the sequences to be compared, the tolerance for accepting matches, and the number of pattern matches (within a margin for ) that is constructed for each .
In our study, the value of equals 2. This value is determined by the method of the false nearest neighbors (FNN); the value of equals 0.2 according to the literature [12].
Variance Entropy. Recent studies have used the variance entropy (VarEn) to study biological signals but not for the EHG. We are interested in using variance entropy because this method combines the variance with sample entropy via inverse-variance weighting. For a time series , variance entropy is defined as where is the th segment of , is the inverse variance of , and is the number of sliding windows. is not fixed because the length of signals in our database depends on EHG burst durations.
The sliding window, of size equal to 50 (window size), slides over time with a step of 45 (step size), leading to an overlap between the sliding windows which is equal to 5. The choices of window size and step size were made empirically after several trials. therefore depends on window size.
Because variance entropy combines the variance with sample entropy via inverse-variance weighting, the number of windows is very important and can significantly affect the results. must be neither too high nor too small. A too large value induces large computing time and does not give a precise result. A too small value limits detection of variability.

Feature Selection Based on Jeffrey Divergence Distance.
This method consists of calculating, for each feature, the Jeffrey divergence (JD) distance between the two histograms obtained from the pregnancy and labor burst classes. This distance between the two histograms allows us to measure the similarity/dissimilarity of their corresponding statistical properties. A smaller distance means a larger similarity while a larger distance implies a lower similarity [22]. The divergence distance is then used to select the discriminating features. Indeed, the greater the distance between the feature histograms of pregnancy and labor classes is, the more discriminating the feature is [17].
This method has two parts. The first part consists of calculating parameters and their histograms. The second part consists of computing the distance between the histograms.
Calculating Parameters and Their Histograms. For each contraction of each group, we apply the following steps.
(4) Group these values in a matrix calculated on the whole EHG database. Thus, we obtain, for a given contraction, a vector with dimension 20. The 20 columns of the matrix correspond to the 20 parameter values computed from one EHG. As we have 133 contractions in each class, we obtain 133 vectors of dimension 20 by class (one 133 × 20 matrix for each class). We then compute from these 133 vectors of each class the histogram for each parameter, giving us 2 sets of 20 histograms, one for each parameter and for each class.
Distance between Histograms. After obtaining, for a given parameter, the two histograms for the labor and pregnancy classes, we measure the distance between the histogram of the two classes. To measure this distance, we use the Jeffrey divergence method presented in [22]: where and are the two histograms and where bins ( = 10) are defined as = {ℎ } and = { }, with the bin index ∈ {1, 2, . . . , }. After calculating the distances between every two corresponding parameter histograms for the 20 parameters, we obtain a distance vector of dimension 20. We compute the distribution of the distances contained in this distance vector. The goal of our study is to select the most discriminating parameters; therefore, we apply a threshold on the vector of distances in order to select the parameters associated with the larger distances. After verification of the Gaussianity of this distribution by using the Lilliefors test, the threshold is chosen to be equal to mean +1 * standard deviation of the distance distribution.

Sequential Forward Selection (SFS)
. Sequential forward selection (SFS) is a sequential search algorithm for feature selection [23] developed for data mining. SFS begins with an empty subset. The value of the criterion function ( ) is calculated for each feature by using a classifier. The feature presenting the best classification performance is selected ( ) and then added to the subset. The next step consists of adding sequentially the feature + that has the highest criterion function ( + + ) when combined with the features that have already been selected. This cycle is repeated until no criterion improvement is obtained when extending the current subset. The following steps present the algorithm of SFS [18].  For our study, three classical classifiers have been used to compute the criterion function (minimal error). The classifiers are as follows: linear discriminant analysis (LDA) [24], quadratic discriminant analysis (QDA) [25], and -nearest neighbors (KNN) with = 11 (the choice of is based on the number of training sets) [24]. The SFS algorithm searches sequentially for the best feature subset. We then chose only the combination of features that presents this minimal error.
The SFS algorithm is applied to synthetic data to test its efficiency. The synthetic data consists of a matrix 400 * 6 (400 observations corresponding to two classes defined by 6 features). Four features (features 1, 2, 4, and 6) are generated randomly (centered normalized Gaussian). The remaining ones (features 3 and 5) are generated using normalized Gaussian distributions of mean m1 for class 1 and m2 for class 2. After verification of its efficiency on synthetic data, we have applied it to our EHG database. We used 70% of the data set for classifier training and the remaining 30% for testing.

Binary Particle Swarm Optimization (BPSO). Particle swarm optimization (PSO) was developed by Eberhart and
Kennedy in 1995. It is a population-based stochastic optimization technique that was inspired by the social behavior of bird flocking or fish schooling [26]. PSO uses a number of particles (the swarm) moving around in the search space in order to achieve the best solution. We assume that our search space is -dimensional and that each particle is a point in this space. The position of the th particle of the swarm is represented as = ( 1 , . . . , . . . ). Each particle has a best previous position best = ( 1 , . . . , , . . . , ),which corresponds to the best fitness value (in our case best classification given by a classifier fed with the selected features). The global best particle among all the particles in the population is represented by best = ( 1 , . . . , , . . . , ). The velocity of the th particle is denoted by = (V 1 , . . . V , . . . , V ). The particles velocity and position are manipulated according to the following two equations: where is the inertia weight, 1 and 2 are positive constants, and 1 and 2 are two random values in the range [0, 1]. Kennedy and Eberhart also proposed a binary particle swarm optimization (BPSO) in order to solve optimization problems with discrete valued parameters [19]. In BPSO, the position of each particle is represented as binary strings. By comparing PSO and BPSO we found that they have a common velocity equation and a different particle position equation which can be computed as follows: where (V +1 ) is the sigmoid function and 3 is a random number in the range [0, 1]. BPSO has been widely used recently in the literature for feature subset selection [27]. In this case, the length of a binary string of each particle is equal to the length of the total number of features, and each particle presents a candidate for subset selection. If the bit included in the binary strings has a value of "1", the feature is selected; otherwise, the feature is not selected. The following steps present the BPSO algorithm.
(1) Initialize all particles positions and velocities randomly. Set the number of iterations and other BPSO parameters.
(2) Calculate the fitness value ( ) of each particle. Fitness represents the percentages of correct classification.
(6) Go to step 2, and repeat until the number of iterations is reached.
When the limit number of iterations is reached, we obtain an optimal solution (best subset of feature selection). The parameters for the BPOS were chosen classically as 30 particles, with the length of each particle being equal to 20 (maximum number of features), = 100 iterations. The acceleration constants 1 and 2 were set to 2. We also used a linear descending inertia weight passing from 0.6 to 0.1.
In our paper, for BPSO algorithm, three classical classifiers have been used to compute the fitness: linear discriminant analysis (LDA) [24], quadratic discriminant analysis (QDA) [25], and -nearest neighbors (KNN) with = 11 (the choice of is based on the number of training sets) [24]. The best feature subset chosen by the BPSO algorithm is defined as the one giving the maximum percentages of correct classification after 100 iterations (1 run). Then, to evaluate the performances and variability of BPSO, we performed multiple runs (200 runs). This algorithm is applied to the same synthetic and real data as described above.

Results of Parameters Extraction.
For each contraction of each group, we calculate the 20 parameters (linear and nonlinear parameters) from the EHG. Table 1 presents the mean ± standard deviation of each of the parameters in each class. Additionally, in this table we present for each parameter the Lilliefors test result concerning its Gaussianity.

Results of Feature Selection Using JD Distance.
We first present the results obtained with the feature selection method based on the Jeffrey divergence. After calculating the distances between every two corresponding feature histograms for the 20 features, we obtain a distance vector of dimension 20, as presented in Figure 3. The red color represents the maximum distance value of the distribution and the blue color its minimum. Each coordinate of this vector represents a different feature. Figure 4 shows the selection vector obtained after applying the threshold (equal to the mean +1 * standard deviation of the distance distribution) on the distance vector. A white color indicates that the feature has been selected as being discriminating between pregnancy and labor.

Results on Synthetic
Data. The algorithms BPSO and SFS were first applied to the synthetic data described above in order to test their efficiency. Table 2 presents the results obtained after applying SFS and BPSO to these synthetic data. We notice that the two features 3 and 5 marked in bold font in Table 2 are always selected by the two algorithms (BPSO and SFS) whatever the mean m1 and m2 (m1#m2) and whatever the classifier. We can also notice that both methods select larger sets than the minimum set containing the 2 clearly discriminating features, whatever the classifier, with SFS giving smaller sets than BPSO most of the time.  Figure 4: Selection vector representing the best parameters for the discrimination between pregnancy and labor.   Table 3 also presents the best feature subsets as selected by SFS, which corresponds to the minimal error. From Table 3, we can notice that BPSO always selects 3 features regardless of classifier type. These features are VarEn, D1, and D8. For SFS, the common features between the subsets obtained from the three classifiers are SE, VarEn, W2, and D8. Only VarEn and D8 marked in bold font in Table 3 are systematically selected by both methods.

Validation.
The results presented above give seven subsets of features selected by using JD, SFS with QDA, SFS with LDA, SFS with KNN, BPSO with QDA, BPSO with LDA, and BPSO with KNN. We then evaluated the performances of these seven selected subsets by calculating the percentages of correct classification that they give when used as inputs of a classifier. We used for this validation the same classifiers as used for the selection phase: QDA, LDA, and KNN. Table 4 presents the percentages of correct classification for each subset by using the QDA classifier. The subset of features selected by BPSO with QDA presents the highest percentage of classification (88.72%) followed by the one selected by SFS with QDA (87.47%). Table 5 presents the percentage of correct classification for each subset by using the LDA classifier. The result indicates that the subset selected by SFS with QDA and SFS with KNN presents the highest percentage (84.96%) followed by the ones selected by SFS with LDA and BPSO with LDA (83.71%). Table 6 presents the percentage of correct classification for each subset by using the KNN classifier. The result indicates that the subset selected by BPSO with QDA presents the highest percentage of classification (87.47%) followed by the ones selected by BPSO with KNN (84.96%).

Discussions and Conclusions
In this paper, we have extracted several features (linear and nonlinear) from the EHG. Then we have applied three selection techniques (Jeffrey divergence distance, SFS, and BPSO) in order to select the most pertinent features allowing discrimination between labor and pregnancy contractions.
It is clear from Table 2 (results obtained for synthetic data) that the algorithms BPSO and SFS have the ability to select discriminating features. In Table 3, which presents the selection results obtained from EHG signals, we notice that BPSO and SFS selected different subsets of features when using the three different classifiers. It is very important to highlight the most repetitive features selected with the different methods. Indeed, these features are expected to be very pertinent. Five features have been selected when applying the JD algorithm (W3, D6, D7, D8, and D9). Three others have been repeatedly selected by BPSO (VarEn, D1, and D8) and three others have been repeatedly selected by SFS (SE, VarEn, W2, and D8), whatever the type of classifier. Only D8 is common to these 3 subsets. D8 corresponds to the decile of mean value 0.36 ± 0.09 for the pregnancy class and 0.45 ± 0.12 for the labor class (Table 1). This increase in D8 value is in agreement with the most accepted observation made by teams that have worked on EHG frequency content, whatever the species: a clear shift towards higher frequencies of the EHG frequency content when going from pregnancy to labor [3,6]. Comparing the 6 subsets of features obtained by BPSO and SFS, we notice that two features, VarEn and D8, are also common. This confirms the observation made by different teams concerning the interest of taking into account the nonlinear characteristics of EHG for diagnostic purposes [12,14,15]. VarEn also increases from pregnancy to labor (Table 1), indicating an increase in EHG nonlinearity from pregnancy to labor, which is in agreement with the work done by different teams [12,14,15]. VarEn performed better here than the other nonlinear features computed by these teams. This short subset should also be of diagnostic interest.
From the validation study, developed in order to test the performance of the selected subsets, we notice from Table 4 that the feature selected by BPSO with QDA corresponds to the highest percentage of correct classification (88.72%) obtained when using QDA. Table 6 presents the second highest percentage of correct classification (87.47%) obtained with the subset of features selected by BPSO with QDA by using KNN. This allows us to say that BPSO associated with QDA seems to be the most efficient feature selection method in our study.
Two conclusions can be drawn from these results.
(i) The most discriminating subset should be (or at least should contain) the 2 following features: VarEn and D8. Indeed, they are the most pertinent features selected in the 7 subsets of features defined by using JD, SFS, and BPSO, whatever the classifier.
(ii) BPSO with QDA gives larger selected sets than the JD method, which might be demanding in time and in training data. But these sets also give much better results for the validation phase when using nonlinear classifiers. This tends to imply that the classification of EHG should be based on a nonlinear approach, either for feature selection or classification, rather than on linear ones.
As future work, we will classify the EHG by using, as inputs of the classifier, only the selected features in order to compare the obtained results with the ones obtained by using all the features. We will also test the best methods of feature selection on a larger database. Accordingly, we will be able to use more robust and relevant cross-validation techniques than the simple one used in this study. Additionally, we should try other classification methods for the validation phase, such as those based on neural networks, SVM, or KNN. We will also include in this selection process features related to uterine synchronization and activity propagation that have been proven to beof interest for EHG monitoring [4], as soon as they are available from the work currently in progress in our team. With this work we expect to obtain the most pertinent data analysis features for predicting the preterm labor threat as early as possible.