We propose a two-step preprocessing method to improve the performance of Principal Component Analysis (PCA) on classification problems. In the first step, the weight of each feature is computed with a feature weighting method, and the features whose weights exceed a predefined threshold are selected. The selected relevant features are then passed to the second step, in which the variances of the features are adjusted until they correspond to the features' importance. Because the second step reveals the class structure, we expect the performance of PCA on classification problems to increase. Experimental results confirm the effectiveness of the proposed method.

In many real-world applications, we face datasets with a large set of features. Unfortunately, in high-dimensional spaces data become extremely sparse and far apart from each other. Experiments show that, in this situation, as the number of features grows linearly, the number of examples required for learning grows exponentially. This phenomenon is commonly known as the curse of dimensionality. Dimensionality reduction is an effective solution to the curse of dimensionality [

Feature selection represents the data by selecting a small subset of its features in their original form [

The basis of feature extraction is a mathematical transformation that maps data from a higher-dimensional space into a lower-dimensional one. Feature extraction algorithms are generally effective [

Principal Component Analysis (PCA) is an effective feature extraction approach and has successfully been applied in recognition applications such as face, handprint, and human-made object recognition [

The main objective of this paper is to improve the accuracy of classification using features extracted by PCA. PCA is the best-known unsupervised linear feature extraction algorithm, but it is also used for classification tasks. Since PCA does not pay any particular attention to the underlying class structure, it is not always an optimal dimensionality-reduction procedure for classification, and the projection axes chosen by PCA may not provide good discriminative power. However, the study in [

The rest of this paper is organized as follows. Section

This section reviews ReliefF and PCA briefly and presents the drawbacks of PCA.

Relief [

ReliefF Algorithm

In ReliefF algorithm,
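
The ReliefF weight update can be sketched in pure Python. The sampling scheme, the `diff` normalization, and all function names below are a generic textbook rendering of ReliefF for numeric features, not the paper's implementation:

```python
import random
from collections import Counter

def relieff(X, y, n_iters=100, k=3, seed=0):
    """Generic ReliefF sketch for numeric features (illustrative, not the
    paper's code). Returns one weight per feature."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Per-feature ranges used to normalize diff() into [0, 1].
    lo = [min(row[f] for row in X) for f in range(d)]
    span = [max(row[f] for row in X) - lo[f] or 1.0 for f in range(d)]
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    w = [0.0] * d

    def diff(f, a, b):
        return abs(a[f] - b[f]) / span[f]

    def nearest(idx, cls, k):
        # k nearest neighbours of X[idx] among instances of class cls.
        cand = [j for j in range(n) if j != idx and y[j] == cls]
        cand.sort(key=lambda j: sum(diff(f, X[idx], X[j]) for f in range(d)))
        return cand[:k]

    for _ in range(n_iters):
        i = rng.randrange(n)
        hits = nearest(i, y[i], k)
        for f in range(d):
            # Penalize features that differ among same-class neighbours ...
            for h in hits:
                w[f] -= diff(f, X[i], X[h]) / (n_iters * k)
            # ... and reward features that differ across classes,
            # weighted by the prior probability of the other class.
            for c in prior:
                if c == y[i]:
                    continue
                scale = prior[c] / (1.0 - prior[y[i]])
                for m in nearest(i, c, k):
                    w[f] += scale * diff(f, X[i], X[m]) / (n_iters * k)
    return w
```

On data where the first feature separates two classes and the second is uniform noise, the first feature receives a clearly larger weight.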

PCA is a very effective approach to feature extraction. It has been successfully applied to various pattern recognition applications such as face classification [

PCA employs all of the features and acquires a set of projection vectors to extract global features from the given training samples. The performance of PCA degrades when there are more irrelevant features than relevant ones. Moreover, PCA has no prior knowledge of the classes in the data, so it is not well suited to separating the classes in the subspace of a given dataset.
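
The variance-seeking behaviour behind these drawbacks can be illustrated with a minimal pure-Python sketch that recovers the leading principal component by power iteration on the covariance matrix (a generic rendering, not the paper's code; no deflation, so only the first component is produced):

```python
def pca_first_component(X, n_iters=200):
    """Leading principal component of X via power iteration on the
    sample covariance matrix. Pure-Python sketch for illustration."""
    n, d = len(X), len(X[0])
    # Center the data.
    mean = [sum(row[f] for row in X) / n for f in range(d)]
    Xc = [[row[f] - mean[f] for f in range(d)] for row in X]
    # Sample covariance matrix C = Xc^T Xc / (n - 1).
    C = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
          for b in range(d)] for a in range(d)]
    # Power iteration converges to the dominant eigenvector of C.
    v = [1.0] * d
    for _ in range(n_iters):
        w = [sum(C[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Because the dominant eigenvector simply follows the direction of greatest variance, a high-variance irrelevant feature can dominate the projection regardless of the class labels.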

We present an example to confirm the points above. This example uses a dataset with five variables and 300 records. There are three classes, each with 100 points. The last two variables contain uniformly distributed noise and are irrelevant features. Table

Centroids and standard deviations of classes in different variables.

Class | Class centroids | Standard deviations | No. of points |
---|---|---|---|
1 | (0.547, 0.728, 0.424, 0.492, 0.561) | (0.054, 0.044, 0.071, 0.288, 0.302) | 100 |
2 | (0.299, 0.585, 0.318, 0.555, 0.455) | (0.061, 0.044, 0.069, 0.269, 0.274) | 100 |
3 | (0.422, 0.452, 0.636, 0.520, 0.536) | (0.055, 0.050, 0.075, 0.263, 0.274) | 100 |

The centroids of two noise variables (
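
A dataset of this design can be regenerated as follows. Only the centroids and standard deviations of the first three (informative) variables are taken from the table above; the seed and the individual draws are illustrative:

```python
import random

def make_synthetic(seed=0):
    """Three Gaussian classes (100 points each) in the first three
    variables, plus two uniform-noise variables on [0, 1).
    Parameters follow the table; the exact sample is illustrative."""
    rng = random.Random(seed)
    centroids = [(0.547, 0.728, 0.424),
                 (0.299, 0.585, 0.318),
                 (0.422, 0.452, 0.636)]
    stds = [(0.054, 0.044, 0.071),
            (0.061, 0.044, 0.069),
            (0.055, 0.050, 0.075)]
    X, y = [], []
    for c in range(3):
        for _ in range(100):
            informative = [rng.gauss(centroids[c][f], stds[c][f])
                           for f in range(3)]
            noise = [rng.random(), rng.random()]  # irrelevant features
            X.append(informative + noise)
            y.append(c)
    return X, y
```

The noise variables have roughly equal centroids across the three classes (around 0.5) and much larger spread than the informative variables, matching the table.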

Synthetic dataset with three normally distributed classes in the three-dimensional subspace of

A plot of a new data point by applying the PCA using two significant eigenvectors.

Figure

A new data point by applying the PCA using two significant eigenvectors after removing irrelevant features.

As shown in Figure

Proposed preprocessing steps.

In the relevance analysis step, the weight of each feature is calculated by a feature weighting approach (such as Relief, or its extension for multiclass datasets, ReliefF). Assume that

After removing the irrelevant features, we no longer need to keep all the features. In the variance adjustment step, the variances of the features are changed so that the most important feature also becomes the most important feature for PCA. The key idea of this step is motivated by a characteristic of PCA: the feature with maximum variance is the most important one for PCA. The new variance of
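
One way to realize the variance adjustment is to standardize each retained feature and rescale it so that its standard deviation equals its weight; this is a plausible sketch of the step, and the paper's exact scaling formula may differ:

```python
def adjust_variances(X, weights):
    """Rescale each feature of X so that its standard deviation equals
    the corresponding weight (an assumed realization of the variance
    adjustment step; the paper's exact formula may differ)."""
    n, d = len(X), len(X[0])
    out = [row[:] for row in X]
    for f in range(d):
        mean = sum(row[f] for row in X) / n
        var = sum((row[f] - mean) ** 2 for row in X) / (n - 1)
        std = var ** 0.5 or 1.0  # guard against constant features
        for i in range(n):
            # Center, then scale so the new std equals weights[f].
            out[i][f] = (X[i][f] - mean) * (weights[f] / std)
    return out
```

After this rescaling, the feature ranked most important by the weighting method has the largest variance, so PCA's first projection axes favour it.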

Note that any feature weighting method can be used in the first step. Since the output of the first step is the input of the second step (variance adjustment), a more effective feature weighting method leads to better results; hence, a feature weighting method more effective than ReliefF would yield better results than ReliefF does. Furthermore, the type of feature weighting matters: for example, if we replace ReliefF with an unsupervised feature weighting method such as SUD [

The extracted features are formed only by using relevant features.

The preprocessing steps have low time complexity.

The preprocessing steps reveal the underlying class structure for PCA approximately.

This section presents the experimental results to show the effectiveness of RPCA on four UCI datasets and synthetic data introduced in Section

Summary of four UCI data sets.

Database | Training | Testing | Features |
---|---|---|---|
Twonorm | 400 | 7000 | 20 |
Waveform | 400 | 4600 | 21 |
Ringnorm | 400 | 7000 | 20 |
Breast cancer | 100 | 545 | 9 |

To provide a platform on which PCA and RPCA can be compared, KNN classification errors are used. The number of nearest neighbors is chosen by trial and error. To eliminate statistical variation, each algorithm is run 20 times on each dataset, and in each run the dataset is randomly partitioned into training and testing sets. In addition, 50 irrelevant features with Gaussian distributions are added to the UCI datasets; the mean of the Gaussian distribution is zero and the standard deviation is set per dataset.
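
The evaluation protocol can be sketched as follows. The fixed `sigma`, the default `k=1`, and the helper names are assumptions for illustration, since the paper tunes the number of neighbors by trial and error and sets the noise standard deviation per dataset:

```python
import random

def add_noise_features(X, n_noise=50, sigma=1.0, seed=0):
    """Append Gaussian irrelevant features (zero mean; sigma is fixed
    here for illustration, set per dataset in the paper)."""
    rng = random.Random(seed)
    return [row + [rng.gauss(0.0, sigma) for _ in range(n_noise)]
            for row in X]

def knn_error(train_X, train_y, test_X, test_y, k=1):
    """Classification error of a plain k-NN classifier with
    Euclidean distance (majority vote over the k nearest points)."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    wrong = 0
    for x, true in zip(test_X, test_y):
        idx = sorted(range(len(train_X)),
                     key=lambda i: dist(x, train_X[i]))[:k]
        votes = {}
        for i in idx:
            votes[train_y[i]] = votes.get(train_y[i], 0) + 1
        pred = max(votes, key=votes.get)
        wrong += pred != true
    return wrong / len(test_X)
```

Running this error measure on the PCA projections of the noisy data versus the RPCA-preprocessed data is the comparison reported below.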

Table

The testing errors.

Database | PCA | RPCA |
---|---|---|
Synthetic data | 0.5787 | 0.0083 |
Twonorm | 0.2529 | 0.0349 |
Waveform | 0.6653 | 0.2496 |
Ringnorm | 0.5021 | 0.1797 |
Breast cancer | 0.3581 | 0.0434 |

Classification errors of PCA and RPCA on the four UCI datasets.

We have proposed a new two-step preprocessing method to improve the performance of PCA in classification tasks. After weighting the features and selecting the relevant ones in the first step, the variances of the features are adjusted according to their importance in the second step, until the most important feature has the largest variance. Finally, PCA is applied to the modified data. Since ReliefF is used for feature weighting in the first step, we name the proposed preprocessing technique RPCA. Other feature weighting methods can also be used instead of ReliefF; for example, SUD [

This research is supported by Iran Telecommunication Research Center (ITRC).