Improving Rolling Bearing Fault Diagnosis by DS Evidence Theory Based Fusion Model

Rolling bearings play an important role in rotating machinery, and their working condition directly affects equipment efficiency. While dozens of methods have been proposed for real-time bearing fault diagnosis and monitoring, the fault classification accuracy of existing algorithms is still not satisfactory. This work presents a novel fusion model based on principal component analysis (PCA) and Dempster-Shafer (DS) evidence theory for rolling bearing fault diagnosis, which combines the advantages of the learning vector quantization (LVQ) neural network model and the decision tree model. Experiments under three different bearing spinning speeds and two different crack sizes show that our fusion model achieves better performance and higher accuracy than either of the base classification models for rolling bearing fault diagnosis, owing to the synergic predictions of both model types.


Introduction
Healthy operation of machinery systems is important in modern manufacturing enterprises, which has led to increasing attention to fault diagnosis technology that detects, identifies, and predicts abnormal states of manufacturing systems. Complex fault states and uncertain fault information create a high demand for real-time, intelligent fault diagnosis. The rolling bearing is a key component of rotating machinery, and any failure may cause equipment malfunction or catastrophic consequences. Almost every machine has at least one of these components, and their faults can be the direct cause of subsequent problems in other parts. Thus, bearing faults should be detected at an early stage [1]. Generally, a rolling bearing consists of a shaft, balls, an inner race, an outer race, a cage, and a housing. In principle, each component may fail. However, the inner race, outer race, and balls are the most vulnerable components due to friction and are thus the most prone to malfunction. Therefore, rolling bearing fault detection and diagnosis is of great significance for ensuring production efficiency and equipment safety. The essence of the fault diagnosis process is signal processing and pattern recognition. Signal processing extracts the features that characterize the nature of the faults from complex original signals, whereas pattern recognition classifies the fault types and identifies specific faults according to the input features, which reduces reliance on technical personnel.
Thus far, several methods have been used for bearing fault diagnosis, and each has its case history of successes and failures. These methods can be classified according to their information source types, such as acoustic measurements, current and temperature monitoring, wear debris detection, and vibration analysis. Vibration analysis is broadly considered the most effective monitoring technique for rotating machinery. Numerous vibration phenomena can be interpreted as an amplitude modulation of the characteristic vibration frequency of a machine. Once the bearing fails, vibration pulses are produced; smooth pulse signals with little fluctuation are produced even when the bearing operates normally. Recently, many fault diagnosis studies for rolling bearings based on vibration data have been reported. Sanz et al. [2] presented a method for detecting the states of rotating machinery with vibration analysis. Zhou and Cheng [3] proposed a fault diagnosis method based on image recognition for rolling bearings to realize fault classification under variable working conditions. In the method of Li et al. [4], initial high-dimensional features are obtained by decomposing the vibration signals with the wavelet transform. However, the redundant information of high-dimensional features may cause dimensionality problems for subsequent pattern analysis. Hence, principal component analysis (PCA) has been introduced to reduce dimensionality and eliminate redundant information, aiming to improve classification speed and accuracy. Taouali et al. [5] proposed a new method for fault detection using a reduced kernel PCA and obtained satisfactory results. Wodecki et al. [6] presented a multichannel processing method for local damage detection in gearboxes using a combination of PCA and time-frequency analysis. Cho et al. [7] suggested a fault identification method that is especially applicable to process monitoring using PCA and achieved high efficiency. Nguyen and Golinval [8] addressed the fault detection problem in mechanical systems using the PCA method, which effectively improved the detection results. These findings indicate that the optimal features extracted with PCA are accurate and efficient.
As a simple and efficient classifier, the decision tree can be used to infer classification rules from a set of training samples, and it has been used extensively in fault diagnosis. Karabadji et al. [9] discussed a new approach for fault diagnosis in rotating machines based on improved decision trees and obtained ideal results. Rutkowski et al. [10] presented a new decision tree algorithm to determine the best attributes and obtained a high classification accuracy with a short processing time. Amarnath et al. [11] selected descriptive features extracted from acoustic signals using a decision tree algorithm to realize bearing fault diagnosis. Krishnakumari et al. [12] selected the best features through a decision tree to train a fuzzy classifier for fault diagnosis.
In recent years, artificial neural networks (ANNs) have been widely used in fault diagnosis because of their capability of learning highly nonlinear relationships. The back propagation (BP) neural network is extensively used, but it can easily fall into a local optimum. The learning vector quantization (LVQ) neural network is a learning algorithm that trains its hidden layer under supervision, which can overcome the shortcomings of the BP network and achieve better prediction performance. Rafiee et al. [13] presented an ANN for fault detection and identification in gearboxes using features extracted from vibration signals. Umer and Khiyal [14] evaluated the LVQ network for classifying text documents, and the results showed that the LVQ required fewer training samples and outperformed other classification methods. Melin et al. [15] described the application of competitive neural networks using the LVQ algorithm to classify electrocardiogram signals and produced the desired results.
Another strategy for improving prediction performance is to use an information fusion approach. For example, the Dempster-Shafer (DS) evidence theory has been adopted to handle information fusion. Kushwah et al. [16] proposed a multisensor fusion methodology using evidence theory for indoor activity recognition and gained ideal identification accuracy. Basir and Yuan [17] investigated the use of DS theory as a tool for modelling and fusing multisensor pieces of evidence pertinent to engine quality. Bhalla et al. [18] used DS evidence theory to integrate the results of a BP neural network and fuzzy logic to overcome conflicts in fault diagnosis.
Previous methods for fault diagnosis of rolling bearings rely on either a single source of information or a single type of model, which leads to biased prediction. To address these issues, we propose an information fusion model for bearing fault diagnosis that combines the LVQ neural network and the decision tree classifier, whose predictions are fused using the DS evidence theory. Our algorithm is inspired by the fact that ensemble machine learning algorithms, which fuse predictions from multiple base models, have been shown to achieve highly competitive performance [19,20], and DS evidence theory based fusion has been successfully applied to the fault diagnosis of hydraulic pumps [21], rolling bearings [22,23], and complex electromechanical systems [24]. However, previous work on DS fusion for bearing fault diagnosis has not studied whether DS based evidence fusion can combine heterogeneous base models into more accurate prediction models. The inputs of our model are statistical characteristics obtained by decomposing the vibration signals using the wavelet transform. We use the PCA technique to reduce the feature dimensions according to the cumulative contribution rate of the eigenvalues. We then use the LVQ neural network and the decision tree to perform the initial fault predictions and calculate the basic probability assignment (BPA) of the two models through normalization. Finally, we fuse the results of the two methods with the DS evidence theory to identify the fault type of the rolling bearing. The structure of our fusion model is shown in Figure 1.
The rest of this paper is organized as follows. Section 2 describes the experimental setup and design. Section 3 presents the methodology. Section 4 reports the results and discussions.


Experimental Setup and Design

The vibration signals are gained from an accelerometer mounted at the base of the rolling bearing. The sampling frequency is 24 kHz, the record length is 10 s, and the sample length is 1024 for all experimental cases. The highest frequency of interest is found to be 12 kHz via experiments. According to the Nyquist sampling theorem, the sampling frequency must be at least twice the highest measured frequency; thus, it is set at 24 kHz. The choice of the sample length is somewhat arbitrary. Statistical measurements are more meaningful when the number of samples is large, but the computing time increases with the number of samples. Generally, a sample length of approximately 1000 is selected to achieve a balance. In many feature extraction techniques, the sample length must be a power of two ($2^n$), and the nearest power of two to 1000 is 1024. Therefore, 1024 is selected as the sample length under normal circumstances [25].
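These choices can be illustrated with a short sketch that segments a 10 s record sampled at 24 kHz into non-overlapping 1024-point samples; the signal here is a synthetic stand-in for the recorded vibration data:

```python
import numpy as np

FS = 24_000          # sampling frequency (Hz), twice the 12 kHz highest frequency
RECORD_SECONDS = 10  # length of each record
WINDOW = 1024        # sample length: nearest power of two to ~1000

# Hypothetical stand-in for one recorded vibration signal.
signal = np.random.default_rng(0).standard_normal(FS * RECORD_SECONDS)

# Split the record into non-overlapping windows of 1024 points each,
# discarding the incomplete tail.
n_windows = len(signal) // WINDOW
windows = signal[: n_windows * WINDOW].reshape(n_windows, WINDOW)
print(windows.shape)  # (234, 1024)
```

Each row of `windows` would then be reduced to a vector of statistical features before classification.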
Ninety experiments are conducted under three different bearing spinning speeds (500, 700, and 900 rpm). First, a rolling bearing without any defects is used for the good case. Ten samples are collected at each spinning speed, so 30 cases are obtained by varying the shaft speed. Second, the outer race fault condition is created in the test rig. The cracks on the outer race are produced via the spark erosion technique: one crack is 0.5 mm wide and 0.7 mm deep, and the other is 0.3 mm wide and 0.6 mm deep. The performance characteristics of the bearing with the outer race fault are studied as in the good case. Vibration signals with the outer race fault are recorded while all other modules are kept in good condition. Each crack is tested under the three spinning speeds, with five samples per speed, yielding 15 samples per crack and 30 cases in total by varying the shaft speed and crack size. Third, the inner race fault is simulated with the same crack sizes as the outer race fault, and 30 more samples are obtained. The cumulative 90 samples, whose dimensions are reduced by PCA, are used as input for the LVQ neural network; similarly, the 90 samples serve as input for the decision tree.

Methodology
3.1. Feature Extraction. Generally, statistical features are good indicators of the state of machine operation. Vibration signals are obtained at different spinning speeds and fault types, and the required statistical characteristics can be extracted using time or frequency domain analysis. The most commonly used statistical feature extraction methods are the fast Fourier transform (FFT) and the wavelet transform. The FFT converts a time domain signal into a frequency domain signal and is widely used in signal detection. However, the FFT method is inherently flawed in handling nonstationary processes: it acquires the frequency components of the signal as a whole but is unaware of the moments at which those components appear. Two time domain signals may differ greatly yet share the same spectrum; thus, the FFT cannot render good performance in such cases [26]. In comparison with the FFT, the wavelet transform is a local transformation in both time and frequency [27]. It has good spatial and frequency domain localization characteristics and can effectively extract information from the signal through dilation, translation, and other operations. The wavelet transform is widely applied in the multiscale refinement analysis of functions or signals. It can focus on the details of the analyzed object by using a fine time domain or spatial step at high frequencies, which solves many problems that the FFT cannot [28].
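The time localization that distinguishes the wavelet transform from the FFT can be seen in a minimal one-level Haar decomposition, a simplified sketch (real analyses would use deeper decompositions and smoother wavelets): the detail band pinpoints when a simulated impulse occurs, which a global magnitude spectrum cannot.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar discrete wavelet transform."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass: local averages
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass: local differences
    return approx, detail

# A short burst (a simulated fault impulse) is localized in time by the
# detail coefficients around its position in the window.
x = np.zeros(1024)
x[501:505] = 1.0
approx, detail = haar_dwt(x)
print(detail.nonzero()[0])  # [250 252]: detail energy only near the burst
```

Statistical features are then computed from such wavelet coefficients rather than from the raw signal.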
In the present study, the wavelet transform is used to extract time domain features of the vibration signals gained from the accelerometer. Wavelet coefficients cannot be used directly as input to the diagnostic model; thus, a feature extraction preprocessing step is required to prepare the data. A large number of features can be extracted from each signal, divided into two categories: dimensional and dimensionless features. Dimensional features such as variance, mean, and peak are more likely to be affected by working conditions, whereas dimensionless features such as kurtosis, crest, and pulse are less sensitive to external factors. Different features reflect different aspects of the fault information of the bearing, and effective feature extraction, selection, and preprocessing are critical for successful classification [29]. Increasing the number of features ensures comprehensive access to the fault information but inevitably leads to redundancy and the curse of dimensionality. To achieve a balance, only 10 statistical characteristics with good sensitivity to and differentiation of the fault types are selected as model inputs in this work, as shown below.
(1) Variance. It measures the degree of dispersion of the signal: a larger variance indicates a greater fluctuation of the data, and a smaller variance a smaller fluctuation. It is computed as

$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2.$

(2) Kurtosis. It indicates the flatness or spikiness of the signal. It is considerably low under normal conditions but increases rapidly when faults occur, which makes it particularly effective for detecting faults:

$K = \frac{(1/N) \sum_{i=1}^{N} (x_i - \bar{x})^4}{\sigma^4}.$

(3) Mean. It represents the central tendency of the amplitude variations of the waveform and describes the signal stability, that is, the static component of the signal:

$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i.$

(4) Standard Deviation (Std). It measures the effective energy of the signal and reflects the degree of discrepancy between the individuals within the group:

$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}.$

(5) Skewness. It measures the direction and extent of skew of the data distribution and is a numerical characterization of the asymmetry of the statistical data:

$S = \frac{(1/N) \sum_{i=1}^{N} (x_i - \bar{x})^3}{\sigma^3}.$

(6) Peak. It refers to the instantaneous maximum value of the fault signal within a given time:

$x_p = \max_i |x_i|.$

(7) Median. It refers to the middle value when all samples are sorted in ascending order; for an even number of samples, it is the average of the two middle values.

The statistical feature matrix of some samples is shown in Table 1. The purpose of PCA is to reduce the dimensionality of the data while preserving as much of the variation of the original dataset as possible. PCA transforms the data into a new coordinate system such that the maximum variance of any projection of the data lies on the first coordinate, the second largest variance on the second coordinate, and so on. The PCA algorithm removes redundant information, simplifies the problem, and improves the resistance to external interference through the processing of the raw data; therefore, PCA is used in this paper. The specific steps of the PCA algorithm are as follows.
Step 1. Input the sample matrix $X = \{x_1, x_2, \ldots, x_n\}^T$. The rows of the matrix represent the samples, and the columns represent the dimensions.
Input the percentage of information to retain after dimension reduction, $e$.
Step 2. Calculate the mean by columns.
Step 3. Obtain the new, centered sample matrix by subtracting the column means.
Step 4. Calculate the eigenvalues and eigenvectors.
Step 5. Determine the final dimension k.
The cumulative contribution rate of the eigenvalues, $e = \sum_{i=1}^{k} \lambda_i / \sum_{i=1}^{n} \lambda_i$, measures how well the newly generated principal components represent the original data. Generally, $e$ should be greater than or equal to 85% when extracting the first $k$ principal components as the sample features.
Step 6. Output the principal components.
Notably, the dataset is divided into training and testing sets before being fed into the model in this study. Therefore, the two sets must be processed separately when PCA is used. When reducing the dimension of the testing samples, it is important to subtract the mean of the training samples and to use the transformation matrix obtained from the training samples, which ensures that the training and testing samples are mapped into the same sample space. In this study, the first four principal components are selected, some of which are shown in Table 2.
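The PCA steps above, including the separate treatment of training and testing sets, can be sketched as follows; the random matrices are stand-ins for the 10-dimensional bearing feature data:

```python
import numpy as np

def pca_fit(X_train, e=0.85):
    """Fit PCA on the training samples only (rows = samples)."""
    mean = X_train.mean(axis=0)                    # Step 2: column means
    Xc = X_train - mean                            # Step 3: centering
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)         # Step 4: eigendecomposition
    order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratios, e) + 1)        # Step 5: cumulative rate >= e
    return mean, eigvecs[:, :k]

def pca_transform(X, mean, components):
    # Testing samples reuse the training mean and transformation matrix,
    # so both sets land in the same sample space.
    return (X - mean) @ components

rng = np.random.default_rng(1)
X_train = rng.standard_normal((60, 10))   # stand-in training feature matrix
X_test = rng.standard_normal((30, 10))    # stand-in testing feature matrix
mean, comps = pca_fit(X_train, e=0.85)
print(pca_transform(X_test, mean, comps).shape)
```

The key design point is that `pca_fit` never sees the testing set, mirroring the separation described above.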

Decision Tree.
Decision tree is a common supervised classification method in data mining. In supervised learning, a set of samples is provided; each sample has a set of attributes and a category label. The categories are determined in advance, and a classifier is then created by a learning algorithm. The topmost node of a decision tree is the root node. A decision tree classifies a sample from the root down to a leaf node. Each nonleaf node represents a test of an attribute value, each branch represents an outcome of the test, and each leaf node represents a category. In short, the decision tree is a tree structure similar to a flow diagram.
A decision tree is built recursively following a top-down approach. Starting from the root node, it compares and tests the attribute values at its internal nodes, determines the corresponding branch according to the attribute value of the given instance, and finally draws a conclusion at a leaf node. The process is repeated on each subtree rooted at a new node. Many decision tree algorithms exist, but the most commonly used is the C4.5 algorithm. The pseudocode of the C4.5 algorithm is shown in Pseudocode 1.
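The splitting criterion at the heart of C4.5 is the information gain ratio, which can be computed as in this sketch; the tiny attribute and label arrays are purely illustrative:

```python
import numpy as np

def entropy(v):
    """Shannon entropy (base 2) of a discrete array."""
    _, counts = np.unique(v, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def info_gain_ratio(x, y):
    """Gain ratio of a discrete attribute x for class labels y, as in C4.5."""
    n = len(y)
    values, counts = np.unique(x, return_counts=True)
    # Conditional entropy of the labels after splitting on x.
    cond = sum(c / n * entropy(y[x == v]) for v, c in zip(values, counts))
    gain = entropy(y) - cond
    split_info = entropy(x)          # intrinsic information of the split
    return gain / split_info if split_info > 0 else 0.0

x = np.array(["low", "low", "high", "high"])  # illustrative attribute values
y = np.array([0, 0, 1, 1])                    # illustrative class labels
print(info_gain_ratio(x, y))  # 1.0: the attribute separates the classes perfectly
```

C4.5 evaluates this ratio for every candidate attribute and places the highest-scoring one at the current node.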
A leafy decision tree may be created due to noise and outliers in the training data, which results in overfitting: many branches reflect anomalies of the data. The solution is pruning, that is, cutting off the most unreliable branches; pre- and post-pruning are widely used. The C4.5 algorithm adopts pessimistic post-pruning: if the error rate can be reduced by replacing a subtree with its leaf node, the subtree is pruned.
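The effect of post-pruning can be illustrated with scikit-learn, which implements CART rather than C4.5 but offers cost-complexity post-pruning through `ccp_alpha`; the iris dataset stands in for the bearing feature matrix, and the alpha value is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unpruned ("leafy") tree versus a cost-complexity post-pruned one.
leafy = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X_tr, y_tr)

print(leafy.get_n_leaves(), pruned.get_n_leaves())   # pruning removes branches
print(leafy.score(X_te, y_te), pruned.score(X_te, y_te))
```

As in the paper's Table 3, pruning trades some training-set fit for a simpler tree that tends to generalize better.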

LVQ Neural Network.
LVQ is a feed-forward neural network that uses supervised learning to train its hidden layer. The LVQ neural network consists of input, hidden, and output layers. The hidden layer automatically learns to classify the input vectors, and the classification results depend only on the distances between the input vectors: if two input vectors are particularly close, the hidden layer assigns them to the same class.
The network is fully connected between the input and hidden layers, whereas the hidden layer is only partially connected to the output layer. Each output layer neuron is connected to a different group of hidden layer neurons, and the number of neurons in the hidden layer is always greater than that of the output layer. Each hidden layer neuron is connected to exactly one output layer neuron with a connection weight fixed at 1, while each output layer neuron can be connected to multiple hidden layer neurons. The values of the hidden and output layer neurons can only be 1 or 0. During training, the weights between the input and hidden layers are gradually adjusted toward the clustering centers. When a sample is fed into the LVQ neural network, the hidden layer neurons determine the winning neuron by the winner-takes-all learning rule, setting its output to 1 and all others to 0. The output layer neuron connected to the winning neuron likewise outputs 1, whereas the others output 0; this output gives the pattern class of the current input sample. The classes learned by the hidden layer are subclasses, and the classes learned by the output layer are the target classes [30]. The architecture of the LVQ neural network is shown in Figure 3.

(a) Tree = {}
(b) if S is pure or other stopping criteria are met then
(c) terminate
(d) end if
(e) for all attributes a in S do
(f) compute the information gain ratio (InGR)
(g) end for
(h) a_best = attribute with the highest InGR
(i) Tree = create a tree with only one node a_best in the root
(j) S_v = generate subsets from S except a_best
(k) for all S_v do
(l) subtree = C4.5(S_v)
(m) set the subtree to the corresponding branch of the Tree according to the InGR
(n) end for

Pseudocode 1: Pseudocode of the C4.5 algorithm.
The training steps of the LVQ algorithm are as follows.
Step 1. Initialize the learning rate $\eta$ ($\eta > 0$) and the weights $w_{ij}$ between the input and hidden layers.

Step 2. The input vector $X = (x_1, x_2, \ldots, x_n)^T$ is fed to the input layer, and the distance $d_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2}$ between each hidden layer neuron and the input vector is calculated.
Step 3. Select the hidden layer neuron with the smallest distance to the input vector. If $d_j$ is the minimum, the output layer neuron connected to it is labeled with class $C_j$.
Step 4. Let $C_x$ denote the class label of the input vector. If $C_j = C_x$, the weights of the winning neuron are adjusted toward the input vector: $W_j^{new} = W_j^{old} + \eta (X - W_j^{old})$. Otherwise, the weights are moved away from it: $W_j^{new} = W_j^{old} - \eta (X - W_j^{old})$.

Step 5. Determine whether the maximum number of iterations has been reached. If so, the algorithm ends; otherwise, return to Step 2 and continue with the next round of learning.
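Steps 1-5 correspond to the classical LVQ1 update rule, which can be sketched in NumPy as follows; the prototypes play the role of the hidden layer neurons, and the two well-separated synthetic classes are a stand-in for the bearing features:

```python
import numpy as np

def lvq1_train(X, y, n_protos_per_class=2, eta=0.1, epochs=50, seed=0):
    """Minimal LVQ1: prototypes act as the hidden layer neurons."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    protos, labels = [], []
    # Step 1: initialize prototypes from random training samples of each class.
    for c in classes:
        idx = rng.choice(np.flatnonzero(y == c), n_protos_per_class, replace=False)
        protos.append(X[idx])
        labels.extend([c] * n_protos_per_class)
    W, Wy = np.vstack(protos), np.array(labels)
    for _ in range(epochs):                       # Step 5: iterate to the limit
        for i in rng.permutation(len(X)):
            d = np.linalg.norm(W - X[i], axis=1)  # Step 2: distances
            j = d.argmin()                        # Step 3: winning neuron
            if Wy[j] == y[i]:                     # Step 4: move toward the sample
                W[j] += eta * (X[i] - W[j])
            else:                                 #         or away from it
                W[j] -= eta * (X[i] - W[j])
    return W, Wy

def lvq1_predict(X, W, Wy):
    return Wy[np.linalg.norm(W[None] - X[:, None], axis=2).argmin(axis=1)]

# Two well-separated synthetic classes as a stand-in for the bearing features.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(3, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
W, Wy = lvq1_train(X, y)
print((lvq1_predict(X, W, Wy) == y).mean())  # close to 1.0 on separable data
```

Because the update is driven by distances to labeled prototypes rather than by gradient descent on an error surface, LVQ avoids the local-minimum trap discussed for the BP network.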

Evidence Theory.
Data fusion is a method of obtaining the best decision from different sources of data, and in recent years it has attracted significant attention for its wide application in fault diagnosis. Generally, data fusion can be conducted at three levels. The first is data-level fusion [31]: raw data from different sources are fused directly to produce more information than the original data. At this level, fusion exhibits small loss and high precision but is time-consuming, unstable, and weak in anti-interference capability. The second is feature-level fusion [32]: statistical features are extracted separately using signal processing techniques, and all features are fused to find an optimal subset, which is then fed to a classifier for better accuracy. This level achieves information compression for transmission but with poorer integration accuracy. The third is decision-level fusion [33], the highest level of integration, which directly influences decision making and is the ultimate result of the three-level integration. Decision-level fusion exhibits strong anti-interference ability and a small amount of communication but suffers from a large amount of data loss and a high cost of pretreatment. In this paper, we focus on decision-level fusion, realized through the DS evidence theory.
The DS evidence theory was originally established in 1967 by Dempster and later developed in 1976 by Shafer, a student of Dempster. Evidence theory is an extension of the Bayesian method. In the Bayesian method, the probabilities must satisfy additivity, which is not required in evidence theory. The DS evidence theory can express uncertainty by leaving the remaining trust to the recognition framework. The theory involves the following mathematical definitions.
Definition 1 (recognition framework). Define $\Omega = \{\theta_1, \theta_2, \ldots, \theta_n\}$ as a finite set of possible conclusions of the model; this set is called the recognition framework. $2^\Omega$ is the power set composed of all subsets of $\Omega$. A recognition framework with four elements and the relationships between its subsets is shown in Figure 4, where $a$, $b$, $c$, and $d$ are the elements of the framework.

Definition 2 (BPA). The basic probability assignment (BPA) is a primitive function in DS evidence theory. Let $\Omega$ be the recognition framework; then $m$ is a mapping from $2^\Omega$ to $[0, 1]$, and $A$ is a subset of $\Omega$. $m$ is called a BPA when it satisfies

$m(\emptyset) = 0, \qquad \sum_{A \subseteq \Omega} m(A) = 1.$

Definition 3 (combination rules). For $\forall A \subseteq \Omega$, let $m_1, m_2, \ldots, m_n$ be a finite number of BPA functions on the recognition framework. The combination rule is

$m(A) = \frac{1}{1 - K} \sum_{A_1 \cap \cdots \cap A_n = A} \prod_{i=1}^{n} m_i(A_i),$

where $K = \sum_{A_1 \cap \cdots \cap A_n = \emptyset} \prod_{i=1}^{n} m_i(A_i)$ reflects the degree of conflict between the pieces of evidence.
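Definition 3 can be implemented directly for two pieces of evidence; the BPA values below are hypothetical outputs of the two classifiers over the three bearing states:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two BPAs given as dicts mapping frozenset subsets to masses."""
    combined, K = {}, 0.0
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        inter = A & B
        if inter:
            combined[inter] = combined.get(inter, 0.0) + a * b
        else:
            K += a * b                       # conflict mass
    # Normalize by 1 - K per Dempster's rule.
    return {A: v / (1.0 - K) for A, v in combined.items()}, K

# Hypothetical BPAs from the two classifiers over {good, outer, inner};
# mass on the full frame expresses residual uncertainty.
m_lvq = {frozenset({"good"}): 0.6, frozenset({"outer"}): 0.3,
         frozenset({"good", "outer", "inner"}): 0.1}
m_tree = {frozenset({"good"}): 0.5, frozenset({"inner"}): 0.2,
          frozenset({"good", "outer", "inner"}): 0.3}
fused, K = dempster_combine(m_lvq, m_tree)
print(max(fused, key=fused.get))  # frozenset({'good'})
```

When both classifiers lean toward the same state, the fused mass concentrates on it, which is the mechanism the fusion model exploits.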

Results and Discussions
The experiments predict the good, outer race fault, and inner race fault conditions of the rolling bearing, as discussed in Section 2. The diagnosis model in this article undergoes three steps, whether it is a neural network or a decision tree. First, the model is created with the training set. Then, the testing set is imported to produce simulated results. Finally, the simulated and actual results are compared to obtain the fault diagnosis accuracy. Hence, each group of experimental data extracted from the vibration signals is separated into two parts: 60 samples are randomly selected for training, and the remaining 30 samples are used for testing.

Results of the Tree-PCA.
Sixty samples covering the different fault severities are fed into the C4.5 algorithm. The algorithm creates a leafy decision tree, and the classification accuracy on the training set is usually high. However, a leafy decision tree is often overfitted or overtrained; such a tree does not guarantee a comparable classification accuracy on the independent testing set, which may be lower. Therefore, pruning is required to obtain a decision tree with a relatively simple structure (i.e., fewer bifurcations and fewer leaf nodes). Pruning the decision tree reduces the classification accuracy on the training set but improves that on the testing set; the re-substitution and cross-validation errors are good evidence of this change. Sixty samples, each comprising the 10 statistical features extracted from the vibration signals, are used as input of the algorithm, and the output is the pruned decision tree, shown in Figure 5.
Figure 5 shows that the decision tree has leaf nodes, which stand for class labels (namely, 1 for good, 2 for outer race fault, and 3 for inner race fault), and decision nodes, which stand for the discriminating attributes (namely, x5 for skewness, x2 for kurtosis, x1 for variance, and x8 for RMS). Not every statistical feature becomes a decision node; this depends on its contribution in terms of entropy and information gain. Attributes that meet certain thresholds appear in the decision tree; otherwise, they are deliberately discarded. The contributions of the 10 features are not the same, and their importance is not consistent: only four features appear in the tree. The importance of the decision nodes decreases from top to bottom, and the top node is the best node for classification. The most dominant features suggested by Figure 5 are thus skewness, kurtosis, variance, and RMS. The re-substitution error is the difference between the actual and predicted classifications obtained by importing the training set into the model again after the decision tree has been created from that training set. The cross-validation error estimates the error of the prediction model in practical applications via cross-validation. Both are used to evaluate the generalization capability of the prediction model. In this study, the re-substitution error is denoted "re-sub-err," the cross-validation error is denoted "cross-val-err," and the average classification accuracy rate is denoted "average accuracy." The experimental results are shown in Table 3.
Table 3 shows that the cross-val-err values are approximately equal (0.08 ≈ 0.09) and the re-sub-err after pruning is greater than before (0.07 > 0.01), but the average accuracy on the testing set after pruning improves significantly (84.09% > 82.98%).
At the same time, the PCA technique is used to reduce the dimension of the statistical features. The first four principal components are extracted to create the decision tree according to the principle that the cumulative contribution rate of the eigenvalues exceeds 85%. The dimension of the statistical features is thus reduced from 10 to 4, and the amount of data is significantly reduced. The decision tree constructed from the first four principal components is shown in Figure 6.
Figure 6 shows that the testing set can be classified using only the first (z1) and second (z2) principal components. The remaining two principal components do not appear in the decision tree because their contributions do not reach the thresholds. Comparing Figure 5 with Figure 6, the decision tree after dimension reduction is simpler and has fewer decision nodes than before. Furthermore, the cross-val-err is unchanged, and the average accuracy is no lower. Table 4 shows the experimental results.
Table 4 shows that the cross-val-err is equal (0.08 = 0.08), the re-sub-err of the decision tree after dimension reduction is lower (0.03 < 0.07), the average accuracy is slightly higher (86.56% > 84.09%), and the running time of the program is considerably lower (1.33 seconds < 2.02 seconds). Therefore, dimension reduction is necessary and effective for constructing the decision tree, especially when there are many statistical attributes.

Results of the LVQ-PCA.

The design of the hidden layer is important in the LVQ neural network; its size is determined by the k-fold cross-validation method. The initial sample is divided into k subsamples; one subsample is retained as validation data, and the other k − 1 subsamples are used for training. The cross-validation process is repeated k times so that each subsample is validated once, and a single estimate is obtained by averaging the k results. This method avoids over- or under-learning, and its results are more convincing. In this study, the optimal number of neurons in the hidden layer is 11, obtained through 10-fold cross-validation, which is the most commonly used setting. The LVQ neural network is created and its error curve is obtained, as shown in Figure 7. To highlight the superiority of LVQ, a BP neural network is also created with the same parameter settings; its error curve is shown in Figure 8.
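The k-fold selection of a capacity hyperparameter can be sketched with scikit-learn's 10-fold cross-validation; since scikit-learn ships no LVQ implementation, a k-nearest neighbour classifier and the iris data stand in for the LVQ network and the bearing features:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 10-fold accuracy for each candidate capacity value; the candidate
# with the highest averaged score is retained, mirroring the selection of
# 11 hidden neurons described above.
scores = {n: cross_val_score(KNeighborsClassifier(n_neighbors=n), X, y, cv=10).mean()
          for n in range(1, 16)}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Averaging over all ten folds is what makes the estimate robust to any single unlucky train/validation split.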
We use the mean squared error (MSE) as the evaluation measure, which is the average squared difference between outputs and targets; a lower value is better, and zero means no error. Comparing Figure 7 with Figure 8, the BP neural network has a shorter training time and a smaller MSE and thus appears superior to LVQ in this case. However, the BP algorithm is essentially a gradient descent method: it performs local optimization and can easily fall into a local optimal solution, as demonstrated by the results in Table 5. Figure 9 shows the ROC curve, which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the threshold varies. A perfect test would show points in the upper-left corner, with 100% sensitivity and 100% specificity. The curve of the LVQ network lies close to the upper-left corner, and the area under it is approximately equal to 1; therefore, the network performs well. The classification accuracy of LVQ-PCA, shown in Table 5, further illustrates this.

Results of the Fusion Model.
The decision tree and the LVQ neural network are widely used in fault diagnosis due to their simplicity and good generalization performance. However, their classification accuracy depends on the dataset and may be unsatisfactory. To solve this problem, the DS evidence theory is introduced in this study. The target recognition frame Θ is established from the three states of the bearing: good (θ1), outer race fault (θ2), and inner race fault (θ3). Each fault sample belongs to exactly one of the three failure modes, and the modes are independent. The outputs of the LVQ neural network are used as evidence 1 and those of the decision tree as evidence 2; data fusion is then performed with the DS method described above. The experiment is run 20 times to reduce the influence of extreme values and obtain reliable results. The classification accuracies on the training and testing sets for each run are recorded, and the final performance comparisons are plotted as a boxplot (Figure 11).
Figure 11 shows that the accuracy on the training sets for the three fault types fluctuates slightly around 98%. The accuracy on ω1-train is the lowest and has an outlier. The accuracy on ω2-train is on the higher side, reaching a maximum of 100%. The accuracy on ω3-train lies between those of ω1-train and ω2-train, with small variation. The small variation of the training-set accuracies indicates that the prediction models are stable. The testing-set accuracies are relatively scattered. The accuracy on ω1-test is concentrated around 90%, with an outlier of up to 97%. The accuracy on ω2-test is concentrated around 97%, while the accuracy on ω3-test is near 94%. The average of the 20 experimental results is taken as the final result of data fusion to reduce error, as shown in Table 5.
Table 5 presents the results of all seven algorithms used in the present study. Each algorithm was run 20 times with different partitions of the training and test datasets, and the average prediction accuracy for the three fault types is recorded. First, the BP neural network falls into local optima, as its average accuracy is only 36.7% (see the second column of Table 5); we therefore regard it as a failing method in our experiments. The average accuracy of the LVQ neural network increased from 78.9% to 88.8% after applying dimension reduction. The performance of the decision tree improves slightly with pruning (from 83.0% to 84.1%) but rises to 86.6% when PCA is combined with the decision tree. These results indicate that dimensionality reduction is an effective means of improving prediction performance for both base classification models. The DS fusion model proposed in this study achieved an average accuracy of 94.0% by fusing the predictions of LVQ-PCA and Tree-PCA, the best result among all seven methods. This demonstrates the capability of DS fusion to exploit the complementary prediction performance of the LVQ and decision tree classifiers. This can be clearly seen in the second and third rows, which show the performance of the algorithms in predicting the outer race fault and the inner race fault. In the former case, Tree-PCA achieves higher performance (97.9%) than LVQ-PCA (78.9%), while in the latter case LVQ-PCA achieved an accuracy of 100.0% compared to 79.4% for Tree-PCA. Through the DS fusion algorithm, the prediction performances are 96.6% and 94.9%, respectively, which avoids the weakness of either base model in predicting specific fault types.

Conclusion
We have developed a DS evidence theory based algorithm fusion model for diagnosing the fault states of rolling bearings. It combines the advantages of the LVQ neural network and the decision tree. The model uses vibration signals collected by an accelerometer to identify bearing failures. Ten statistical features are extracted from the vibration signals as the model input for training and testing. To improve classification accuracy and reduce input redundancy, PCA is used to reduce the 10 statistical features to 4 principal components.
We compared the fault classification performance of the different methods on the same dataset. Experimental results show that PCA can improve the classification accuracy of the LVQ neural network in most cases, but not always for the decision tree. Neither the LVQ neural network nor the decision tree achieves good performance on some classes. The proposed DS evidence theory based fusion model fully utilizes the advantages of the LVQ neural network, the decision tree, PCA, and evidence theory, and obtains the best accuracy compared with the single models. Our results show that DS evidence theory can be used not only for information fusion but also for model fusion in fault diagnosis.
The accuracy of the prediction models is important in bearing fault diagnosis, but the convergence speed and running time of the algorithms also deserve special attention, especially with large numbers of samples. The results in Table 5 show that the fusion model has the highest classification accuracy but the longest running time. Therefore, our future research aims not only to ensure accuracy but also to speed up convergence and reduce running time.

Figure 1 :
Figure 1: Structure of our fusion model. It mainly includes (A) feature extraction based on wavelet transform; (B) dimension reduction using PCA; (C) training of the LVQ neural network and decision tree models; and (D) the DS evidence theory based fusion prediction model.

Figure 3 :
Figure 3: Architecture of the LVQ neural network.

Figure 4 :
Figure 4: Recognition framework of the DS theory.

Figure 5 :
Figure 5: Pruning of the decision tree.
Sixty samples are used for training, and the remaining 30 samples are used for testing in the LVQ neural network. A network structure of 10-11-3 is used in the experiment. The network parameters are set as follows: maximum number of training steps, 1000; minimum target training error, 0.1; learning rate, 0.1.
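The LVQ training described above pulls the winning prototype toward samples of its own class and pushes it away from samples of other classes. As a rough illustration of that update rule (not the paper's 10-11-3 network or its toolbox implementation), the following sketch trains a minimal LVQ1 classifier on synthetic two-class data; the data, prototype count, and epoch count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_lvq1(X, y, n_proto_per_class=2, lr=0.1, epochs=100):
    """LVQ1: move the winning prototype toward same-class samples, away otherwise."""
    classes = np.unique(y)
    protos, labels = [], []
    for c in classes:  # initialize prototypes from random samples of each class
        idx = rng.choice(np.flatnonzero(y == c), n_proto_per_class, replace=False)
        protos.append(X[idx].copy())
        labels.extend([c] * n_proto_per_class)
    protos, labels = np.vstack(protos), np.array(labels)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            w = np.argmin(np.linalg.norm(protos - X[i], axis=1))  # winner
            sign = 1.0 if labels[w] == y[i] else -1.0
            protos[w] += sign * lr * (X[i] - protos[w])
    return protos, labels

def predict(protos, labels, X):
    # Assign each sample the label of its nearest prototype.
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return labels[np.argmin(d, axis=1)]

# Toy, well-separated 4-dimensional data standing in for bearing features.
X = np.vstack([rng.normal(0, 0.3, (30, 4)), rng.normal(2, 0.3, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
protos, labels = train_lvq1(X, y)
acc = (predict(protos, labels, X) == y).mean()
```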
Figure 10 shows the training regression plot. The regression value (R) measures the correlation between outputs and targets; 1 means a close relationship, and 0 means a random relationship. The value of R is 0.96982, which is approximately equal to 1; therefore, the network performs well.
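The regression value R reported above is the Pearson correlation between network outputs and targets, which can be computed as in this minimal sketch (the sample values are illustrative):

```python
import numpy as np

def regression_R(outputs, targets):
    """Pearson correlation between outputs and targets (1 = close fit, 0 = random)."""
    o = np.asarray(outputs, dtype=float)
    t = np.asarray(targets, dtype=float)
    return float(np.corrcoef(o, t)[0, 1])

print(regression_R([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # close to 1
```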

Table 1 :
Feature matrix of some samples.

Dimensionality Reduction. PCA is a statistical method widely used in data reduction. By means of an orthogonal transformation, a group of possibly correlated variables is transformed into a set of linearly uncorrelated variables called the principal components. PCA maintains the primary information of the original features while reducing the complexity of the data, revealing the simple structure behind complex data. It is a simple, nonparametric method of extracting relevant information from intricate data.
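The orthogonal transformation described above can be sketched with an SVD of the centered feature matrix; here random data stands in for the paper's 10-feature matrix, and the 4-component setting mirrors the reduction used in the study:

```python
import numpy as np

def pca_reduce(X, n_components=4):
    """Project X onto its top principal components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T            # principal-component scores
    explained = (S ** 2) / (S ** 2).sum()        # variance ratio per component
    return scores, explained[:n_components]

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 10))                    # 90 samples x 10 statistical features
Z, ratio = pca_reduce(X, 4)
print(Z.shape)  # (90, 4)
```

The resulting component scores are linearly uncorrelated, as the theory requires, and `ratio` reports how much of the total variance each retained component preserves.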

Table 2 :
Principal components of some samples.

Table 3 :
Error values before and after pruning.

Table 4 :
Classification errors before and after dimensionality reduction.

Table 5 :
Comparisons of classification performances of different models.