We propose a two-step preprocessing method to improve the performance of Principal Component Analysis (PCA) in classification problems. In the first step, the weight of each feature is calculated by a feature weighting method, and the features whose weights exceed a predefined threshold are selected. The selected relevant features are then passed to the second step, in which the variances of the features are adjusted until they correspond to the importance of the features. Because the second step helps reveal the class structure, we expect the performance of PCA in classification problems to increase. Experimental results confirm the effectiveness of the proposed method.
1. Introduction
In many real-world applications, we face databases with a large set of features. Unfortunately, in high-dimensional spaces, data become extremely sparse and far apart from each other. Experiments show that, in this situation, as the number of features increases linearly, the number of examples required for learning increases exponentially. This phenomenon is commonly known as the curse of dimensionality. Dimensionality reduction is an effective solution to the curse of dimensionality [1, 2]. Its goal is to extract or select a subset of features that describes the target concept. Selection finds a relevant subset of the original features, whereas extraction generates a new feature space through a transformation [1, 3]. A properly designed selection or extraction process reduces the complexity and improves the performance of learning algorithms [4].
Feature selection represents the data by a small subset of its features in their original format [5]. The role of feature selection is critical, especially in applications involving many irrelevant features. Given a criterion function, feature selection reduces to a search problem [4, 6]. Exhaustive search is infeasible when the number of features is too large, so heuristic search can be employed instead. Heuristic algorithms such as sequential forward and/or backward selection [7, 8] have shown successful results in practical applications. However, none of them can guarantee optimality. This problem can be alleviated by feature weighting, which assigns a real-valued number to each feature to indicate its relevancy to the learning problem [6]. Among the existing feature weighting algorithms, ReliefF [5] is considered one of the most successful owing to its simplicity and effectiveness [9]. A major shortcoming of feature weighting is its inability to capture the interaction of correlated features [4, 10]. This drawback can be addressed by some feature extraction techniques.
The basis of feature extraction is a mathematical transformation that maps data from a higher dimensional space into a lower dimensional one. Feature extraction algorithms are generally effective [11]. However, their effectiveness degrades when they are used to process large-scale datasets [12]. In addition, the features produced by the transformation usually depend on all of the original features, so the extracted features may contain information that originates from irrelevant features in the original space [3, 13].
Principal Component Analysis (PCA) is an effective feature extraction approach and has been applied successfully in recognition applications such as face, handprint, and human-made object recognition [14–16] and in industrial robotics [17]. Traditional PCA is an orthogonal linear transformation that operates directly on a whole pattern represented as a vector and acquires a set of projections to extract global features from the given training patterns [18]. PCA reduces the dimension such that the representation is as faithful as possible to the original data [2]. PCA uses all features in the original space, regardless of their relevancy, to produce new features. This may result in features that contain information originating from irrelevant features in the original space, and a side effect is misclassification. Some work has been done to improve the performance of PCA via feature weighting. In [19, 20], feature weighting is used either to eliminate irrelevant features or to incorporate the feature weights into the PCA calculation. In [19], ranks are used instead of the original data to cope with outliers and noise. Honda et al. [20] used feature weights in a PCA-guided formulation, whereas our proposed method uses feature weights to modify the dataset itself.
The main objective of this paper is to improve the accuracy of classification using features extracted by PCA. PCA is the best-known unsupervised linear feature extraction algorithm, but it is also used for classification tasks. Since PCA does not pay any particular attention to the underlying class structure, it is not always an optimal dimensionality reduction procedure for classification purposes, and the projection axes chosen by PCA might not provide good discrimination power. However, the study in [21] illustrates that PCA might outperform LDA, one of the best supervised dimensionality reduction methods, when the number of samples per class is small or when the training data nonuniformly sample the underlying distribution. In the present work, we propose a novel preprocessing method composed of two steps. In the first step, the quality of each feature is computed via a feature weighting algorithm. The selected relevant features, that is, the features with weights larger than a predefined threshold, are then passed to the second step. In the second step, the variances of the features are modified until the most relevant features become the most important ones for PCA. Finally, PCA is performed on the modified features to generate uncorrelated features.
The rest of this paper is organized as follows. Section 2 briefly reviews ReliefF, PCA, and the problems associated with PCA. Section 3 describes the proposed algorithm. Section 4 presents our experiments on both synthetic and real data, and Section 5 concludes the paper.
2. Review of the ReliefF and PCA Methods
This section reviews ReliefF and PCA briefly and presents the drawbacks of PCA.
2.1. ReliefF
Relief [5] is one of the most successful algorithms for assessing the quality of features. The main idea of Relief is to iteratively estimate the weights of features according to how well their values distinguish among instances that are near each other. The original Relief is limited to two-class problems and to complete data [22]. In addition, it has no mechanism to eliminate redundant features [23]. This paper uses an extension of Relief called ReliefF [22], which solves the first two problems of Relief. In contrast to Relief, which uses the 1-nearest-neighbor algorithm, ReliefF uses an approach based on the K-nearest-neighbor algorithm. Algorithm 1 presents the pseudocode of this algorithm. It is assumed that $D = \{(x_j, y_j)\}_{j=1}^{N}$ denotes a training dataset with $N$ samples, in which each sample consists of $t$ features $x = (x_1, \dots, x_t)$ and a known class label $y_j$. In each iteration, ReliefF randomly selects a sample (pattern) $x$ and then searches for $k$ of its nearest neighbors from the same class, termed nearest hits $H_j$, and also for the nearest neighbors from each of the other classes, called nearest misses $M_j(y)$. To compute the weight of each feature, ReliefF uses the contributions of all the hits and misses.
Algorithm 1: Pseudocode of ReliefF [2].
ReliefF Algorithm
(1) Initialization: given $D = \{(x_j, y_j)\}_{j=1}^{N}$, where $y_j \in \{1, \dots, c\}$ is the class label and $c$ is the number of classes, set $w_i = 0$ for $1 \le i \le t$ and choose the number of iterations $T$;
(2) for $l = 1$ to $T$
(3) Randomly select a pattern $x_r$ from $D$ with class $y_r$;
In the ReliefF algorithm, $T$ is a user-defined parameter that determines how many times the process is repeated to estimate the weight of each feature, $x_{ri}$ is the $i$th feature of sample $x_r$, and $p(y)$ is the prior probability of class $y$.
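The weight estimation described above can be illustrated with a minimal Python sketch. It follows the general ReliefF scheme (random pattern, $k$ nearest hits, $k$ nearest misses per class weighted by class priors) and is not a transcription of Algorithm 1. The function name `relieff_weights`, the Manhattan distance, the scaling of features to [0, 1], and the averaging over iterations are all assumptions made for illustration.

```python
import numpy as np

def relieff_weights(X, y, n_iter=100, k=10, seed=0):
    """ReliefF-style feature weights (sketch; larger means more relevant)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    # Assumption: scale every feature to [0, 1] so differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / n_samples))  # p(y)
    w = np.zeros(n_features)
    for _ in range(n_iter):
        r = rng.integers(n_samples)        # randomly selected pattern x_r
        diff = np.abs(Xs - Xs[r])          # per-feature differences to x_r
        dist = diff.sum(axis=1)            # Manhattan distance to x_r
        dist[r] = np.inf                   # x_r is never its own neighbor
        for c in classes:
            idx = np.flatnonzero(y == c)
            idx = idx[np.argsort(dist[idx])][:k]
            idx = idx[np.isfinite(dist[idx])]
            if idx.size == 0:
                continue
            mean_diff = diff[idx].mean(axis=0)
            if c == y[r]:
                # Nearest hits: differing values lower the weight.
                w -= mean_diff / n_iter
            else:
                # Nearest misses: differing values raise the weight,
                # scaled by the prior of the miss class.
                w += prior[c] / (1.0 - prior[y[r]]) * mean_diff / n_iter
    return w
```

With this normalization the weights stay within [-1, 1], and irrelevant features tend to receive weights near zero or below.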
2.2. Principal Component Analysis
PCA is a very effective approach to feature extraction. It has been applied successfully to various pattern recognition applications such as face classification [18]. As mentioned above, $N$ and $t$ are the number of samples and the dimension of dataset $D$, respectively. PCA finds a subspace whose basis vectors correspond to the maximum-variance directions of the original space. PCA is a linear transform. Let $W$ represent the linear transformation that maps the original $t$-dimensional space into an $f$-dimensional feature space, where normally $f \ll t$. Equation (1) gives the new feature vectors $z_j \in \mathbb{R}^f$:
$$z_j = W^T x_j, \quad j = 1, 2, \dots, N. \tag{1}$$
The columns of $W$ are the eigenvectors $e_j$ obtained by solving (2):
$$\lambda_j e_j = Q e_j, \quad \text{where } Q = X X^T, \; X = \{x_1, \dots, x_N\}. \tag{2}$$
Here $Q$ is the covariance matrix (up to a constant factor, assuming the samples $x_j$ have been mean-centered) and $\lambda_j$ is the eigenvalue associated with the eigenvector $e_j$. The eigenvectors are sorted in descending order of their corresponding eigenvalues. The eigenvector associated with the largest eigenvalue is the most important one, since it reflects the greatest variance [21].
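Equations (1) and (2) can be sketched in a few lines of NumPy. The sketch below assumes that samples are stored as rows, so the projection $z_j = W^T x_j$ is applied to all samples at once as a matrix product, and that the data are mean-centered before the covariance matrix is formed. The function names `pca_fit` and `pca_transform` are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return (mean, W, eigvals); the columns of W are the leading eigenvectors."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                              # center the samples
    Q = Xc.T @ Xc / (X.shape[0] - 1)           # sample covariance, shape (t, t)
    eigvals, eigvecs = np.linalg.eigh(Q)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    return mean, eigvecs[:, order], eigvals[order]

def pca_transform(X, mean, W):
    """Project centered samples onto the columns of W (row-wise z = W^T x)."""
    return (np.asarray(X, dtype=float) - mean) @ W
```

The variance of each projected coordinate equals the corresponding eigenvalue, which is why the feature with the largest variance tends to dominate the leading components.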
PCA uses all of the features and acquires a set of projection vectors to extract global features from the given training samples. Its performance degrades when the irrelevant features outnumber the relevant ones. Moreover, PCA has no prior knowledge of the class labels in the data, so it is not well suited to separating the classes in the resulting subspace.
We present an example to confirm these points. The example uses a dataset with five variables and 300 records. There are three classes, and each class has 100 points. The last two variables are uniformly distributed noise and act as irrelevant features. Table 1 shows the centroids and the standard deviations of the three classes [24].
Table 1: Centroids and standard deviations of classes in different variables.
| Class | Class centroids | Standard deviations | No. of points |
| --- | --- | --- | --- |
| 1 | (0.547, 0.728, 0.424, 0.492, 0.561) | (0.054, 0.044, 0.071, 0.288, 0.302) | 100 |
| 2 | (0.299, 0.585, 0.318, 0.555, 0.455) | (0.061, 0.044, 0.069, 0.269, 0.274) | 100 |
| 3 | (0.422, 0.452, 0.636, 0.520, 0.536) | (0.055, 0.050, 0.075, 0.263, 0.274) | 100 |
Unlike those of the other three variables, the class centroids of the two noise variables (x3 and x4) are very close to one another, and their standard deviations are larger than those of the other three variables. Figure 1 illustrates the 300 points in different two-dimensional subspaces. No class structure can be found in the subspaces that involve the two noisy features. PCA is now applied to the dataset presented in Table 1. Figure 2 shows the results obtained by using the two most significant eigenvectors extracted by PCA.
Figure 1: Synthetic dataset with three normally distributed classes in the three-dimensional subspace of x0, x1, x2 and two noise variables x3, x4. (a) The subspace of x0, x1. (b) The subspace of x0, x2. (c) The subspace of x1, x2. (d) The subspace of x0, x3. (e) The subspace of x0, x4. (f) The subspace of x1, x3. (g) The subspace of x1, x4. (h) The subspace of x2, x3. (i) The subspace of x2, x4. (j) The subspace of x3, x4 [24].
Figure 2: The data projected by PCA onto the two most significant eigenvectors.
Figure 2 shows that the obtained result is not suitable for classification, because PCA has no mechanism to identify irrelevant features. As mentioned before, PCA finds the projections of the data with maximum variance, and in this example the two irrelevant features have the largest variances. Next, PCA is performed on only the three relevant variables x0, x1, and x2. Figure 3 illustrates the resulting projection. The class structure is visible in Figure 3; because the irrelevant features have been removed, the projection is suitable for classification. The next section presents the proposed algorithm to solve this problem.
Figure 3: The data projected by PCA onto the two most significant eigenvectors after removing the irrelevant features.
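The effect described in this example can be reproduced approximately with the following sketch. The centroids and standard deviations of the three relevant variables are taken from Table 1; drawing the relevant variables from Gaussians and the two noise variables from a uniform distribution on [0, 1] is an assumption, since the exact generation procedure of [24] is not restated here. The helper `top_components` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
centroids = np.array([[0.547, 0.728, 0.424],
                      [0.299, 0.585, 0.318],
                      [0.422, 0.452, 0.636]])
stds = np.array([[0.054, 0.044, 0.071],
                 [0.061, 0.044, 0.069],
                 [0.055, 0.050, 0.075]])

# 100 points per class: Gaussian in x0..x2, uniform noise in x3 and x4.
relevant = np.vstack([rng.normal(c, s, size=(100, 3))
                      for c, s in zip(centroids, stds)])
noise = rng.uniform(0.0, 1.0, size=(300, 2))
X = np.hstack([relevant, noise])

def top_components(data, n_components=2):
    """Leading eigenvectors of the sample covariance matrix, as columns."""
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order]

# PCA on all five variables: share of each leading component's squared
# loadings that falls on the noise variables x3 and x4.
W_all = top_components(X)
noise_share = (W_all[3:, :] ** 2).sum(axis=0)

# PCA restricted to the relevant variables x0, x1, x2.
W_rel = top_components(X[:, :3])
```

Under these assumptions, `noise_share` is close to 1 for both leading components, which means the two-dimensional PCA projection of the full data is built almost entirely from the noise variables.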
3. RPCA Feature Extraction
As shown in Figure 2, the directions found by PCA are not appropriate for classification if the variances of the features do not correspond to their importance. For example, if the variances of irrelevant features are large, then the features extracted via PCA are not suitable for classification. It is therefore expected that, if the variances of the features match their importance, the features extracted by PCA are more likely to be suitable for classification. In this paper, a new preprocessing method is proposed that involves two connected steps, relevance analysis and variance adjustment, as shown in Figure 4.
Figure 4: Proposed preprocessing steps.
In the relevance analysis step, the weights of the features are calculated by a feature weighting approach (such as Relief or its multiclass extension, ReliefF). Let $W = [w_1, w_2, \dots, w_t]$ be the weight vector, estimated by ReliefF, for the $t$ variables in the original space (this weight vector should not be confused with the transformation matrix $W$ of Section 2.2). Since the weights indicate the level of relevancy, the feature with the largest weight has the largest relevancy. The relevancy level is close to zero or negative when the feature is irrelevant [5]. In this work, features whose weights are larger than a user-defined threshold $\gamma$ are passed to the next step. The weight vector is therefore changed as follows:
$$w_i = \begin{cases} w_i, & w_i > \gamma, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
Once the irrelevant features have been removed, only the retained features need to be considered further. In the variance adjustment step, the variances of the features are changed so that the most important feature becomes the most important feature for PCA. The key idea of this step is motivated by a characteristic of PCA: the feature with the maximum variance has the greatest importance for PCA. The new variance of the $i$th feature is calculated as follows:
$$\delta_i^{\text{new}} = m - \left(w_{k(m)} - w_i\right)\left(m - k(i)\right), \tag{4}$$
where $m$ is the number of features whose weights exceed the threshold (the number of relevant features), $w_{k(m)}$ is the weight of the most important feature, and $k(i)$ is the weight rank of the $i$th feature (1 for the least important and $m$ for the most important). Because $w_{k(m)}$ is the largest weight, $w_{k(m)} - w_i \ge 0$, and $m - k(i) \ge 0$, so $\delta_i^{\text{new}} \le m$. Moreover, $m - k(i) \le m - 1$, so $\delta_i^{\text{new}}$ is positive whenever the retained weights differ by less than 1, which holds, for example, when the weights are normalized to $[-1, 1]$ and $\gamma \ge 0$. To change the variance of the $i$th feature to $\delta_i^{\text{new}}$, every value of that feature is multiplied by a feature-specific scale factor $n$, which is obtained as follows:
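$$\delta_i^{\text{new}} = \frac{1}{N-1}\sum_{j=1}^{N}\left(n x_{ji} - n \bar{x}_i\right)^2, \qquad \delta_i^{\text{new}}(N-1) = \sum_{j=1}^{N}\left(n x_{ji} - n \bar{x}_i\right)^2, \qquad n = \sqrt{\frac{\delta_i^{\text{new}}(N-1)}{\sum_{j=1}^{N}\left(x_{ji} - \bar{x}_i\right)^2}}. \tag{5}$$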
Equation (5) shows how the scale factor $n$ is obtained for each feature, where $\delta_i^{\text{new}}$ is the new variance of the $i$th feature calculated using (4), $N$ is the number of samples, and $x_{ji}$ and $\bar{x}_i$ are the $i$th feature of the $j$th sample and the mean of the $i$th feature, respectively. After this adjustment, PCA is applied to the data. We call the proposed method RPCA, in reference to the use of ReliefF for weighting the features in the first step.
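The two preprocessing steps can be sketched as follows, assuming the feature weights have already been computed. The function name `adjust_variances` and its return values are hypothetical. The scale factor is taken as the square root of the ratio in (5), because multiplying a feature by $n$ multiplies its variance by $n^2$; the sketch also assumes that no retained feature is constant and that the target variances from (4) are positive.

```python
import numpy as np

def adjust_variances(X, weights, gamma):
    """Keep features with weight > gamma and rescale them per (4) and (5)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(weights, dtype=float)
    keep = np.flatnonzero(w > gamma)          # relevance analysis, as in (3)
    if keep.size == 0:
        raise ValueError("no feature has a weight above the threshold")
    Xr, wr = X[:, keep], w[keep]
    m = keep.size
    # k(i): weight rank, 1 = least important, m = most important.
    rank = np.empty(m, dtype=int)
    rank[np.argsort(wr, kind="stable")] = np.arange(1, m + 1)
    # Target variances from (4).
    delta_new = m - (wr.max() - wr) * (m - rank)
    # Scale factors from (5): sqrt(delta_new * (N - 1) / sum of squared deviations).
    n_samples = X.shape[0]
    sq_dev = ((Xr - Xr.mean(axis=0)) ** 2).sum(axis=0)
    n = np.sqrt(delta_new * (n_samples - 1) / sq_dev)
    return Xr * n, keep, delta_new
```

As a worked instance of (4), take the hypothetical weights (0.5, 0.1, 0.3, -0.2) with $\gamma = 0$. Three features are retained ($m = 3$) with ranks 3, 1, and 2, so the target variances are $3 - 0 \cdot 0 = 3$, $3 - 0.4 \cdot 2 = 2.2$, and $3 - 0.2 \cdot 1 = 2.8$. The most relevant feature therefore receives the largest variance before PCA is applied.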
Note that any feature weighting method can be used in the first step. Since the output of the first step is the input to the second step (variance adjustment), more effective feature weighting methods lead to better results. Hence, if a feature weighting method more effective than ReliefF is used, the obtained result is expected to be better than with ReliefF. The type of feature weighting is also very important. For example, if ReliefF is replaced with an unsupervised feature weighting method such as SUD [25], the proposed method can be used for dimensionality reduction on unlabeled datasets. The advantages of our preprocessing method are summarized as follows.
(1) The extracted features are formed using only relevant features.
(2) The preprocessing steps have low time complexity.
(3) The preprocessing steps approximately reveal the underlying class structure to PCA.
4. Simulation Results
This section presents experimental results that show the effectiveness of RPCA on four UCI datasets and on the synthetic data introduced in Section 2.2. Table 2 summarizes the four UCI datasets. In our experiments we applied ReliefF, which uses M nearest hits and misses instead of just one. The value of M was set to 10, as suggested in [22].
Table 2: Summary of four UCI data sets.
| Database | Training | Testing | Features |
| --- | --- | --- | --- |
| Twonorm | 400 | 7000 | 20 |
| Waveform | 400 | 4600 | 21 |
| Ringnorm | 400 | 7000 | 20 |
| Breast cancer | 100 | 545 | 9 |
To provide a common basis for comparing PCA and RPCA, KNN classification errors are used. The number of nearest neighbors is chosen by trial and error. To reduce statistical variation, each algorithm is run 20 times on each dataset. In each run, the dataset is randomly partitioned into training and testing sets. In addition, 50 irrelevant features with Gaussian distributions are added to the UCI datasets. The mean of the Gaussian distribution is zero, and the standard deviation is chosen according to the dataset.
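The evaluation protocol described above (added Gaussian noise features, repeated random splits, and KNN test error) can be sketched as follows. All function names are hypothetical, the plain majority-vote KNN is a simplification, and the optional `transform` hook, which would be fitted on the training split only (for example PCA or RPCA), is an assumption about how the compared methods plug in.

```python
import numpy as np

def add_noise_features(X, n_noise, std, rng):
    """Append zero-mean Gaussian features that carry no class information."""
    noise = rng.normal(0.0, std, size=(X.shape[0], n_noise))
    return np.hstack([X, noise])

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training samples (Euclidean)."""
    d2 = ((X_test ** 2).sum(axis=1)[:, None]
          + (X_train ** 2).sum(axis=1)[None, :]
          - 2.0 * X_test @ X_train.T)          # squared distances
    nearest = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for row in y_train[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def mean_test_error(X, y, n_train, k, transform=None, n_runs=20, seed=0):
    """Average KNN test error over random train/test partitions."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_runs):
        perm = rng.permutation(X.shape[0])
        tr, te = perm[:n_train], perm[n_train:]
        X_tr, X_te = X[tr], X[te]
        if transform is not None:
            # Hypothetical hook: fit on the training split, apply to both.
            X_tr, X_te = transform(X_tr, y[tr], X_te)
        pred = knn_predict(X_tr, y[tr], X_te, k)
        errors.append(np.mean(pred != y[te]))
    return float(np.mean(errors))
```

Fitting any feature weighting or projection inside the loop, on the training split only, keeps the test samples from influencing the extracted features.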
Table 3 shows the testing errors. The number of extracted features is five for all datasets except the synthetic dataset, for which it is two. The numbers of training and testing instances for the synthetic dataset are 100 and 200, respectively. The performance of KNN degrades significantly in the presence of a large number of irrelevant features [6]. Figure 5 illustrates the average testing errors of PCA and RPCA over 20 runs as a function of the number of extracted features. The figure reveals that RPCA significantly outperforms PCA in terms of classification error and effectiveness in reducing dimensionality. These results show that RPCA can significantly improve the performance of KNN. As discussed in Section 3, using a feature weighting method better than ReliefF in the first step can lead to better results.
Table 3: The testing errors.
| Database | PCA | RPCA |
| --- | --- | --- |
| Synthetic data | 0.5787 | 0.0083 |
| Twonorm | 0.2529 | 0.0349 |
| Waveform | 0.6653 | 0.2496 |
| Ringnorm | 0.5021 | 0.1797 |
| Breast cancer | 0.3581 | 0.0434 |
Figure 5: Classification errors of PCA and RPCA on the four UCI datasets.
5. Conclusion
We proposed a new two-step preprocessing method to improve the performance of PCA in classification tasks. After the features are weighted and the relevant features are selected in the first step, the variances of the features are adjusted according to their importance in the second step, so that the most important feature has the largest variance. Finally, PCA is applied to the modified data. Since ReliefF is used for feature weighting in the first step, we name the proposed preprocessing technique RPCA. Other feature weighting methods can be used instead of ReliefF; for example, SUD [25] can be employed for unsupervised data. The simulation results show that RPCA significantly improves the performance of PCA for classification purposes.
Acknowledgment
This research is supported by Iran Telecommunication Research Center (ITRC).
References
[1] M. Dash, H. Liu, and B. W. Wah, "Dimensionality reduction," vol. 2, pp. 958–966, John Wiley & Sons, Hoboken, NJ, USA, 2009.
[2] H. Liu and H. Motoda, Taylor & Francis Group, 2008.
[3] M. Yang, F. Wang, and P. Yang, "A novel feature selection algorithm based on hypothesis-margin," vol. 3, no. 12, pp. 27–34, 2008.
[4] Y. Sun and D. Wu, "A RELIEF based feature extraction algorithm," in Proceedings of the 8th SIAM International Conference on Data Mining, pp. 188–195, April 2008.
[5] K. Kira and L. A. Rendell, "A practical approach to feature selection," in Proceedings of the 9th International Conference on Machine Learning, pp. 249–256, 1992.
[6] Y. Sun, "Iterative RELIEF for feature weighting: algorithms, theories, and applications," vol. 29, no. 6, pp. 1035–1051, 2007, doi:10.1109/TPAMI.2007.1093.
[7] S. C. Yusta, "Different metaheuristic strategies to solve the feature selection problem," vol. 30, no. 5, pp. 525–534, 2009, doi:10.1016/j.patrec.2008.11.012.
[8] P. Pudil and J. Novovičová, "Novel methods for subset selection with respect to problem knowledge," vol. 13, no. 2, pp. 66–74, 1998.
[9] T. G. Dietterich, "Machine-learning research: four current directions," vol. 18, no. 4, pp. 97–136, 1997.
[10] D. Wettschereck, D. W. Aha, and T. Mohri, "A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms," vol. 11, no. 1–5, pp. 273–314, 1997.
[11] J. Yan, B. Zhang, N. Liu, S. Yan, Q. Cheng, W. Fan, Q. Yang, W. Xi, and Z. Chen, "Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing," vol. 18, no. 3, pp. 320–332, 2006, doi:10.1109/TKDE.2006.45.
[12] R. Jensen and Q. Shen, Press Series on Computational Intelligence, John Wiley & Sons, 2008.
[13] I. T. Jolliffe, 2nd edition, Wiley, 2002.
[14] M. Turk and A. Pentland, "Eigenfaces for recognition," vol. 3, no. 1, pp. 71–86, 1991.
[15] H. Murase, F. Kimura, M. Yoshimura, and Y. Miyake, "An improvement of the autocorrelation matrix in pattern matching method and its application to handprinted 'HIRAGANA'," vol. 64, no. 3, pp. 276–283, 1981.
[16] H. Murase and S. K. Nayar, "Visual learning and recognition of 3-d objects from appearance," vol. 14, no. 1, pp. 5–24, 1995, doi:10.1007/BF01421486.
[17] S. K. Nayar, S. A. Nene, and H. Murase, "Subspace methods for robot vision," vol. 12, no. 5, pp. 750–758, 1996.
[18] S. Chen and Y. Zhu, "Subpattern-based principle component analysis," vol. 37, no. 5, pp. 1081–1083, 2004, doi:10.1016/j.patcog.2003.09.004.
[19] J. F. Pinto Da Costa, H. Alonso, and L. Roque, "A weighted principal component analysis and its application to gene expression data," vol. 8, no. 1, pp. 246–252, 2011, doi:10.1109/TCBB.2009.61.
[20] K. Honda, A. Notsu, and H. Ichihashi, "Variable weighting in PCA-guided κ-means and its connection with information summarization," vol. 15, no. 1, pp. 83–89, 2011.
[21] A. M. Martinez and A. C. Kak, "PCA versus LDA," vol. 23, no. 2, pp. 228–233, 2001, doi:10.1109/34.908974.
[22] I. Kononenko, "Estimating attributes: analysis and extensions of RELIEF," in Proceedings of the European Conference on Machine Learning (ECML '94), pp. 171–182, 1994.
[23] R. Gilad-Bachrach, A. Navot, and N. Tishby, "Margin based feature selection - theory and algorithms," in Proceedings of the 21st International Conference on Machine Learning (ICML '04), pp. 337–344, July 2004.
[24] J. Z. Huang, M. K. Ng, H. Rong, and Z. Li, "Automated variable weighting in k-means type clustering," vol. 27, no. 5, pp. 657–668, 2005, doi:10.1109/TPAMI.2005.95.
[25] M. Dash, H. Liu, and J. Yao, "Dimensionality reduction of unsupervised data," in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '97), pp. 532–539, November 1997.