Many information processing problems can be transformed into some form of eigenvalue or singular value problem. Eigenvalue decomposition (EVD) and singular value decomposition (SVD) are usually used for solving these problems. In this paper, we give an introduction to various neural network implementations and algorithms for principal component analysis (PCA) and its various extensions. PCA is a statistical method that is directly related to EVD and SVD. Minor component analysis (MCA) is a variant of PCA, which is useful for solving total least squares (TLS) problems. The algorithms are typical unsupervised learning methods. Some other neural network models for feature extraction, such as localized methods, complex-domain methods, generalized EVD, and SVD, are also described. Topics associated with PCA, such as independent component analysis (ICA) and linear discriminant analysis (LDA), are mentioned in passing in the conclusion. These methods are useful in adaptive signal processing, blind signal separation (BSS), pattern recognition, and information compression.

In information processing tasks such as pattern recognition, data compression and coding, image processing, high-resolution spectrum analysis, and adaptive beamforming, feature extraction or feature selection is necessary to cope with the large volume of raw data. Feature extraction is a dimensionality-reduction technique that maps high-dimensional patterns onto a lower-dimensional space by extracting the most prominent features using orthogonal transforms. The extracted features have no direct physical meaning. In contrast, feature selection decreases the size of the feature set, or reduces the dimension of the features, by discarding part of the raw information according to a selection criterion.

Orthogonal decomposition is a well-known technique to eliminate ill-conditioning. The Gram-Schmidt orthonormalization (GSO) is suitable for feature selection: although the features in Gram-Schmidt space are not physically meaningful themselves, they can be linked back to the same number of variables in the measurement space, so no dimensionality reduction takes place. The GSO procedure starts with QR decomposition of the transpose of the full feature matrix,
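As an illustration of this decomposition step (not the full feature-ordering procedure), a minimal sketch with NumPy on a hypothetical feature matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical feature matrix: each row is one candidate feature
# measured over 100 samples (5 features x 100 samples).
F = rng.standard_normal((5, 100))

# QR decomposition of the transpose of the feature matrix:
# the columns of Q are the orthonormalized feature vectors
# (the Gram-Schmidt space).
Q, R = np.linalg.qr(F.T)

# Q has orthonormal columns ...
assert np.allclose(Q.T @ Q, np.eye(5), atol=1e-10)
# ... and Q @ R reconstructs the original features exactly.
assert np.allclose(Q @ R, F.T, atol=1e-10)
```

Because the orthonormalization decorrelates the candidate features, the contribution of each feature can then be evaluated independently in Gram-Schmidt space.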

An orthogonal transform can decompose the correlations among the candidate features so that the significance of the individual features can be evaluated independently. Principal component analysis (PCA) is a well-known orthogonal transform that is used for dimensionality reduction. Another popular technique for feature extraction is linear discriminant analysis (LDA), also known as Fisher’s discriminant analysis [

For a

In this paper, we give a state-of-the-art introduction to various neural network implementations and algorithms for PCA and its extensions. This paper is organized as follows. In Section

The stochastic approximation theory [

The Hebbian learning rule was introduced in [

For a stochastic input vector

Oja’s rule introduces a weight decay term into the Hebbian rule [
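A minimal sketch of the single-neuron version of Oja's rule on synthetic data (the learning rate, sample size, and covariance below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic zero-mean data with a well-separated dominant direction.
C = np.array([[3.0, 1.0], [1.0, 1.0]])          # target covariance
L = np.linalg.cholesky(C)
X = (L @ rng.standard_normal((2, 5000))).T

w = rng.standard_normal(2)
eta = 0.01
for x in X:
    y = w @ x
    # Oja's rule: the Hebbian term eta*y*x plus the weight-decay
    # term -eta*y^2*w, which keeps ||w|| close to 1.
    w += eta * y * (x - y * w)

# w should align with the principal eigenvector of C.
evals, evecs = np.linalg.eigh(C)
v1 = evecs[:, -1]                                # largest eigenvalue
assert abs(abs(w @ v1) / np.linalg.norm(w) - 1.0) < 0.1
```

The sign of the converged weight vector is arbitrary, which is why the alignment is checked up to sign.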

The Robbins-Monro conditions are not practical for implementation, especially for learning nonstationary data. Zufiria [

PCA is based on the spectral analysis of the second-order moment matrix called correlation matrix that statistically characterizes a random vector. In the zero-mean case, this matrix becomes the covariance matrix. In the area of image coding, PCA is known as Karhunen-Loeve transform (KLT) [

PCA allows the removal of the second-order correlation among given random processes. By calculating the eigenvectors of the covariance matrix of the input vector, PCA linearly transforms a high-dimensional input vector into a low-dimensional one whose components are uncorrelated. PCA is directly related to singular value decomposition (SVD), and the most common way to perform PCA is via SVD of the data matrix. However, the capability of SVD is limited for very large data sets.
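The SVD route can be sketched as follows on synthetic data; the check at the end is the standard relation between the singular values of the centered data matrix and the eigenvalues of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 6))
Xc = X - X.mean(axis=0)                  # center the data

# PCA via SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt                          # rows: principal directions
explained_var = s**2 / (len(X) - 1)      # eigenvalues of covariance

# Cross-check against EVD of the sample covariance matrix.
evals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
assert np.allclose(explained_var, evals)
```

Projecting `Xc` onto the first few rows of `Vt` yields the uncorrelated low-dimensional representation.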

PCA is often derived by optimizing some information criterion, such as the maximization of the variance of the projected data or the minimization of the reconstruction error. The objective of PCA is to extract

PCA finds those unit directions

By repeating maximization of

A linear least squares (LS) estimate

Neural PCA originates from the seminal work by Oja [

The single-neuron model was extended to a

Architecture of the PCA network.

By using Oja’s rule (

The symmetrical subspace learning algorithm (SLA) [

Weighted SLA introduces asymmetry into the SLA [

SLA and weighted SLA are nonlocal algorithms, and they rely on the calculation of the errors and the backward propagation of the values between the layers. A PCA algorithm is obtained by adding a term to SLA [

By combining Oja’s rule and the GSO procedure, Sanger proposed GHA for extracting the first

The GHA is given by [
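A minimal sketch of GHA (Sanger's rule) extracting the first two PCs of synthetic data with known eigenstructure (the learning rate and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data whose covariance has distinct eigenvalues 4, 2, 1, 0.5.
A = np.diag([4.0, 2.0, 1.0, 0.5])
X = rng.standard_normal((20000, 4)) @ np.sqrt(A)

m = 2                                   # number of PCs to extract
W = 0.1 * rng.standard_normal((m, 4))   # rows ~ principal directions
eta = 0.002
for x in X:
    y = W @ x
    # Sanger's rule: Hebbian term minus a lower-triangular deflation
    # term that enforces the ordering of the extracted PCs.
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

# The rows of W should converge to the leading eigenvectors,
# which here are the first two coordinate axes.
assert abs(abs(W[0, 0]) - 1.0) < 0.15
assert abs(abs(W[1, 1]) - 1.0) < 0.15
```

The lower-triangular term is exactly the first-order GSO deflation mentioned above: neuron i is trained on the input with the contributions of neurons 1, ..., i already removed.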

Both SLA [

In addition to the popular SLA, weighted SLA and GHA algorithm, there are some other Hebbian rule-based PCA algorithms such as local LEAP (learning machine for adaptive feature extraction via principal component analysis) [

The LEAP algorithm [

The DPD algorithm is a nonlocal PCA algorithm [

Existing PCA algorithms including the Hebbian rule-based algorithms can be derived by optimizing an objective function using the gradient-descent method. The least mean squared error- (LMSE-) based methods are derived from the modified MSE function

The gradient-descent or Hebbian rule-based algorithms are highly sensitive to

In [

The LMSER algorithm is derived on the MSE criterion using the gradient-descent method [

PASTd [

Kalman-type RLS [

RRLSA [

PCA can be derived by any optimization method based on a proper objective function. This leads to many other algorithms, including gradient-descent based algorithms [

The infomax principle [

The NIC algorithm [

The NIC algorithm is a PSA method. It can extract the principal eigenvectors when the deflation technique is incorporated. The NIC algorithm converges much faster than SLA [

Most popular PCA or MCA algorithms do not consider eigenvalue estimates in the update equations of the weights, and they suffer from the stability-speed problem. The convergence speed of a system depends on the eigenvalues of its Jacobian. In PCA algorithms, the eigenmotion depends on the principal eigenvalue of the covariance matrix, while in MCA algorithms it depends on all the eigenvalues [

In order to extract multiple PCs, one has to apply an orthonormalization procedure, like GSO, or its first-order approximation as used in SLA [

The anti-Hebbian learning rule updates a synaptic weight by the same amount as the Hebbian rule, but in the opposite direction [

Anti-Hebbian rule-based PCA algorithms can be derived by using a

Architecture of the PCA network with hierarchical lateral connections. The lateral weight matrix

The Rubner-Tavan PCA algorithm is based on the PCA network with hierarchical lateral connection topology [

The weights

The APEX algorithm is used to adaptively extract the PCs [

Assuming that the correlation matrix

A desirable number of neurons can be decided during the learning process. When the environment is changing over time, a new PC can be added to compensate for the change without affecting the previously computed PCs, and the network structure can be expanded if necessary.

For growing each additional PC, APEX requires

A class of learning algorithms, called

Most existing linear complexity methods including GHA [

PCA is based on the Gaussian assumption for data distribution, and the optimality of PCA results from taking into account only the second-order statistics. For non-Gaussian data distributions, PCA is not able to capture complex nonlinear correlations, and nonlinear processing of the data is usually more efficient. Nonlinearities introduce higher-order statistics into the computation in an implicit way. Higher-order statistics, defined by cumulants, are needed for a good characterization of non-Gaussian data. PCA can be generalized to distributions of the exponential family [

When the feature space is nonlinearly related to the input space, we need to use nonlinear PCA. The outputs of nonlinear PCA networks are usually more independent than their respective linear cases. For non-Gaussian input data, a nonlinear PCA permits the extraction of higher-order components and provides a sufficient representation.

Kernel PCA [
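A minimal batch kernel PCA sketch (RBF kernel; the kernel width `gamma` is a hypothetical choice for this illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 3))

# RBF kernel matrix on the sample.
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# Center the kernel matrix in feature space.
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J

# EVD of the centered kernel matrix gives the nonlinear PCs
# (up to scaling), without ever forming the feature space.
evals, evecs = np.linalg.eigh(Kc)
evals, evecs = evals[::-1], evecs[:, ::-1]      # descending order

# Projections of the sample onto the first two kernel PCs.
Z = evecs[:, :2] * np.sqrt(np.maximum(evals[:2], 0))
assert Z.shape == (50, 2)
```

The kernel trick replaces inner products in the high-dimensional feature space by kernel evaluations, so only the n x n kernel matrix is ever needed.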

In order to increase the robustness of PCA against outliers, a robust version of the covariance matrix based on the

The multilayer perceptron (MLP) can be used to perform nonlinear dimensionality reduction and hence nonlinear PCA. Both the input and output layers of the MLP have

Kramer’s nonlinear PCA network [

A hierarchical nonlinear PCA network composed of a number of independent subnetworks can extract ordered nonlinear PCs [

A hybrid hetero/autoassociative network [

MCA, a variant of PCA, finds the smallest eigenvalues and their corresponding eigenvectors of the autocorrelation matrix

The anti-Hebbian learning rule and its normalized version can be used for MCA [

Minor components (MCs) can be extracted in ways similar to that for PCs. A simple idea is to reverse the sign of the PCA algorithms. This is because in many algorithms PCs correspond to the maximum of a cost function, while MCs correspond to the minimum of the same cost function. However, this idea does not work in general and has been discussed in [
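For reference, the minor component can be obtained directly by batch EVD; the sketch below also illustrates the TLS connection on a synthetic line-fitting problem with noise in both coordinates:

```python
import numpy as np

rng = np.random.default_rng(5)
# Noisy points near the line y = 2x, with noise in both coordinates
# (the total least squares setting).
t = rng.standard_normal(500)
P = np.c_[t, 2 * t] + 0.05 * rng.standard_normal((500, 2))
Pc = P - P.mean(axis=0)

C = Pc.T @ Pc / len(P)
evals, evecs = np.linalg.eigh(C)     # ascending eigenvalue order

# The minor component (eigenvector of the smallest eigenvalue) is
# the line normal: the TLS fit of n . p = 0.
n_hat = evecs[:, 0]

# For y = 2x the unit normal is (2, -1)/sqrt(5), up to sign.
expected = np.array([2.0, -1.0]) / np.sqrt(5)
assert abs(abs(n_hat @ expected) - 1.0) < 0.01
```

Adaptive MCA algorithms estimate the same direction online, without forming or decomposing the autocorrelation matrix explicitly.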

A general algorithm that can extract, in parallel, principal and minor eigenvectors of arbitrary dimensions is derived based on the natural-gradient method [

The orthogonal Oja (OOja) algorithm consists of Oja’s MSA [

The above algorithms including Oja’s MSA [

By using the Rayleigh quotient as an energy function, invariant-norm MCA [

The nonlinear PCA problem can be solved by partitioning the data space into a number of disjoint regions and then estimating the principal subspace within each partition by linear PCA. This is the so-called localized PCA. The distribution is collectively modeled by a collection or a mixture of linear PCA models, each characterizing a partition. Most natural data sets have large eigenvalues in only a few eigendirections, while the variances in the other eigendirections are so small that they can be treated as noise. The localized PCA method provides an efficient means to decompose high-dimensional data compression problems into low-dimensional ones. The localized PCA method is commonly used in image compression [

VQ-PCA [
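A minimal localized-PCA sketch in the spirit of VQ-PCA (a crude 2-means partition followed by per-cluster linear PCA; the data and cluster count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
# Two well-separated clusters with different local principal axes.
X = np.vstack([
    rng.standard_normal((200, 2)) @ np.diag([3.0, 0.3]),
    rng.standard_normal((200, 2)) @ np.diag([0.3, 3.0]) + 20.0,
])

# Crude 2-means partition (seeded with one point from each cluster;
# a few fixed iterations suffice for this sketch).
centers = X[[0, 200]].copy()
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(0) for k in range(2)])

# Linear PCA within each partition.
local_pcs = []
for k in range(2):
    Xk = X[labels == k]
    Xk = Xk - Xk.mean(0)
    _, _, Vt = np.linalg.svd(Xk, full_matrices=False)
    local_pcs.append(Vt[0])          # leading local principal direction

# One partition is elongated along x, the other along y.
major_axes = {int(np.argmax(np.abs(pc))) for pc in local_pcs}
assert major_axes == {0, 1}
```

A single global PCA would average the two local structures away; the mixture of local linear models captures both.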

Similar to localized PCA, localized ICA is used to characterize nonlinear ICA. Clustering is first used for an overall coarse nonlinear representation of the underlying data and linear ICA is then applied in each cluster so as to describe local features of the data [

Complex PCA is a generalization of PCA to complex-valued data sets [

In [

There are many other complex PCA algorithms. Both PAST and PASTd are, respectively, the PSA and PCA algorithms derived for complex-valued signals [

Simple neural models, described by differential equations, are derived in [

When certain subspaces are less preferred than others, this yields constrained PCA [

The constrained PAST algorithm [

Generalized EVD is a statistical tool extremely useful in feature extraction, pattern recognition as well as signal estimation and detection. The generalized EVD problem involves the matrix equation

Generalized EVD achieves simultaneous diagonalization of
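Writing the problem as A v = lambda B v with symmetric positive definite A and B (the symbols here are illustrative), the generalized EVD and the simultaneous diagonalization it achieves can be sketched numerically by reduction to a standard symmetric EVD via the Cholesky factor of B:

```python
import numpy as np

rng = np.random.default_rng(7)
# A synthetic symmetric positive definite pair (A, B).
M = rng.standard_normal((4, 4)); A = M @ M.T + 4 * np.eye(4)
N = rng.standard_normal((4, 4)); B = N @ N.T + 4 * np.eye(4)

# Reduce A v = lambda B v to a standard EVD using B = L L^T.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
evals, Q = np.linalg.eigh(Linv @ A @ Linv.T)
V = Linv.T @ Q                      # generalized eigenvectors

# Each pair satisfies A v = lambda B v ...
for lam, v in zip(evals, V.T):
    assert np.allclose(A @ v, lam * B @ v, atol=1e-8)
# ... and V simultaneously diagonalizes both matrices:
# V^T B V = I and V^T A V = diag(evals).
assert np.allclose(V.T @ B @ V, np.eye(4), atol=1e-8)
assert np.allclose(V.T @ A @ V, np.diag(evals), atol=1e-8)
```

The neural generalized-EVD algorithms in the literature estimate the same eigenpairs adaptively from streaming data, without matrix inversion.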

Any generalized eigenvector

A recurrent network with invariant

Two-dimensional PCA [

Bidirectional PCA [

The uncorrelated multilinear PCA algorithm [

Given two sets of random vectors with zero mean,

Cross-correlation asymmetric PCA (APCA) network consists of two sets of neurons that are laterally hierarchically connected [

Architecture of the cross-correlation APCA network. The APCA network is composed of two hierarchical PCA networks. The connections with solid arrows denote feedforward connections, and the connections with hollow arrows denote lateral connections.

The network has the following relations:

The objective function for extracting the first principal singular value of the covariance matrix is given by

Using a deflation transformation, the two sets of neurons are trained with the cross-coupled Hebbian learning rules, which are given in [

Based on the APCA network, the principal singular component of

Coupled learning rules for SVD produce better results than Hebbian learning rules. Combined with first-order approximation of GSO, precise estimates of singular vectors and singular values with only small deviations from orthonormality are produced. Double deflation is clearly superior to standard deflation but inferior to first-order approximation of GSO, both with respect to orthonormality and diagonalization errors. Coupled learning rules converge faster than Hebbian learning rules, and the first-order approximation of GSO produces more precise estimates and better orthonormality than standard deflation [
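The core computation behind such cross-correlation networks, SVD of the sample cross-correlation matrix, can be sketched in batch form on synthetic data with a rank-one latent structure:

```python
import numpy as np

rng = np.random.default_rng(8)
# Two zero-mean random vectors driven by one common latent source,
# giving a rank-one cross-correlation structure.
z = rng.standard_normal((1000, 1))
X = z @ rng.standard_normal((1, 3)) + 0.1 * rng.standard_normal((1000, 3))
Y = z @ rng.standard_normal((1, 4)) + 0.1 * rng.standard_normal((1000, 4))
X -= X.mean(0); Y -= Y.mean(0)

# Sample cross-correlation matrix and its SVD.
Cxy = X.T @ Y / len(X)
U, sigma, Vt = np.linalg.svd(Cxy)

# The shared latent source shows up as one dominant singular value;
# its left/right singular vectors are the coupled principal
# directions of the two data sets.
assert sigma[0] > 5 * sigma[1]
```

The cross-coupled Hebbian rules estimate the leading singular triplets of this matrix adaptively, with deflation used to extract the subsequent ones.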

Tucker decomposition [

CCA [

CCA leads to a generalized EVD problem. Thus, we can employ a kernelized version of CCA to compute a flexible contrast function for ICA. Generalized CCA consists of a generalization of CCA to more than two sets of variables [

Given two centered random multivariables

Suppose that we are given a sample of instances

After manipulation, we have

Under a mild condition which tends to hold for high-dimensional data, CCA in the multilabel case can be formulated as an LS problem [
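A compact batch reference for CCA itself, via SVD of the whitened cross-covariance (synthetic data with one shared latent variable; the whitening uses the Cholesky factors of the autocovariances):

```python
import numpy as np

rng = np.random.default_rng(9)
# Two views sharing one latent variable z.
z = rng.standard_normal(500)
X = np.c_[z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)]
Y = np.c_[rng.standard_normal(500), -z + 0.1 * rng.standard_normal(500)]
X -= X.mean(0); Y -= Y.mean(0)

n = len(X)
Cxx = X.T @ X / n; Cyy = Y.T @ Y / n; Cxy = X.T @ Y / n

# CCA: singular values of the whitened cross-covariance matrix
# are the canonical correlations.
Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
U, rho, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)

# The shared latent variable yields one near-1 canonical
# correlation; the remaining pair is essentially uncorrelated.
assert rho[0] > 0.9 and rho[1] < 0.3
```

This whitening step is what turns CCA into a generalized EVD: the canonical vectors are generalized eigenvectors of the cross- and auto-covariance pair.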

In [

Two-dimensional CCA seeks linear correlation based on images directly. Motivated by locality-preserving CCA [

CCArc [

The concept of subspace is involved in many information processing problems. This requires EVD of the autocorrelation matrix of a data set or SVD of the cross-correlation matrix of two data sets. For example, in the area of array signal processing, the APCA algorithm is used in beamforming [

Image compression is usually implemented by partitioning an image into many nonoverlapping

In this example, we use the APEX algorithm to train the PCA network and then use the trained network to encode other pictures. We use the benchmark Lena picture of

To improve the quality of the restored image, we employ nonoverlapping
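A batch sketch of the block-based scheme on a synthetic image (the neural algorithms above compute the same principal subspace adaptively; the 8 x 8 block size and the number of retained components are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
# Synthetic 64 x 64 "image" with smooth structure plus noise.
gx, gy = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
img = np.sin(6 * gx) * np.cos(4 * gy) + 0.05 * rng.standard_normal((64, 64))

# Partition into nonoverlapping 8 x 8 blocks, one 64-dim vector each.
blocks = img.reshape(8, 8, 8, 8).swapaxes(1, 2).reshape(64, 64)

mu = blocks.mean(0)
B = blocks - mu
_, _, Vt = np.linalg.svd(B, full_matrices=False)

k = 8                                  # keep 8 of 64 components
code = B @ Vt[:k].T                    # compressed representation
recon = code @ Vt[:k] + mu             # decode

# Reassemble the image and measure reconstruction quality.
img_hat = recon.reshape(8, 8, 8, 8).swapaxes(1, 2).reshape(64, 64)
err = np.linalg.norm(img - img_hat) / np.linalg.norm(img)
assert err < 0.25
```

Each block is thus stored as k coefficients instead of 64 pixel values, an 8:1 compression ratio in this sketch, at the cost of a small reconstruction error.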

The benchmark Lena picture and its restored version.

The kid picture and its restored version.

In this paper, we have discussed various neural network implementations and algorithms for PCA and its various extensions, including MCA, generalized EVD, constrained PCA, two-dimensional methods, localized methods, complex-domain methods, and SVD. These neural network methods have an advantage over their conventional counterparts in that they are adaptive and have low computational as well as low storage complexity. They find wide applications in pattern recognition, blind source separation, adaptive signal processing, and information compression. Two methods that are strongly associated with PCA, namely, ICA and LDA, are described here in passing.

ICA [

ICA can extract the statistically independent components from the input data set. It works by adjusting the estimated demixing matrix so that the outputs are maximally independent, for example, by minimizing the mutual information between them. By applying ICA to estimate the independent input data from raw data, a statistical test can be derived to reduce the input dimension. The dimensions to remove are those that are independent of the output. In contrast, in PCA the dimensionality reduction is achieved by removing those dimensions that have a low variance.

Let a

The goal of ICA is to estimate

The demixing matrix

A well-known two-phase approach to ICA is to preprocess the data by PCA and then to estimate the necessary rotation matrix. A generic approach to ICA consists of preprocessing the data, defining measures of non-Gaussianity, and optimizing an objective function, known as a contrast function. Some measures of non-Gaussianity are kurtosis, differential entropy, negentropy, and mutual information, which can be derived from one another. Popular ICA algorithms include the Infomax, the natural-gradient, the equivariant adaptive separation via independence (EASI), and the FastICA algorithms [
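A minimal FastICA-style sketch following this two-phase approach: whitening, then a fixed-point iteration with a tanh nonlinearity and deflation (sources, mixing matrix, and iteration counts are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5000
# Two independent non-Gaussian (uniform) sources.
S = rng.uniform(-1, 1, (n, 2))
A = np.array([[1.0, 0.6], [0.4, 1.0]])   # "unknown" mixing matrix
X = S @ A.T

# Phase 1: whiten the mixtures (the PCA preprocessing step).
Xc = X - X.mean(0)
d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ E / np.sqrt(d)

# Phase 2: FastICA-style fixed-point iteration with deflation.
W = np.zeros((2, 2))
for i in range(2):
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        wx = Z @ w
        g, gp = np.tanh(wx), 1 - np.tanh(wx) ** 2
        w_new = (Z * g[:, None]).mean(0) - gp.mean() * w
        for j in range(i):               # deflate earlier components
            w_new -= (w_new @ W[j]) * W[j]
        w_new /= np.linalg.norm(w_new)
        w = w_new
    W[i] = w

Y = Z @ W.T                              # recovered sources
# Each recovered component matches one true source up to sign
# and permutation (the intrinsic ICA indeterminacies).
C = np.abs(np.corrcoef(Y.T, S.T)[:2, 2:])
assert C.max(axis=1).min() > 0.9
```

After whitening, only an orthogonal rotation remains to be estimated, which is what the fixed-point iteration finds by maximizing non-Gaussianity.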

The ICA methods can be easily extended to the complex domain by using Hermitian transpose and complex nonlinear functions. In the context of BSS, the higher-order statistics are necessary only for temporally uncorrelated stationary sources. Second-order statistics-based source separation exploits temporally correlated stationary sources and the nonstationarity of the sources [

Blind separation of the original signals in nonlinear mixtures has many difficulties such as the intrinsic indeterminacy, the unknown distribution of the sources as well as the mixing conditions, and the presence of noise. It is impossible to separate the original sources using only the source independence assumption of some unknown nonlinear transformations of the sources [

Nonnegativity is a natural condition for many real-world applications, for example, in the analysis of images, text, or air quality. Neural networks can be suggested by imposing a nonnegativity constraint on the outputs [

LDA creates a linear combination of the given independent features that yield the largest mean differences between the desired classes [

The objective for LDA is to maximize the between-class measure while minimizing the within-class measure after applying a

There are at most
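For the two-class case, the Fisher solution has the closed form w proportional to Sw^{-1}(m1 - m0), where Sw is the within-class scatter and m0, m1 the class means; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(12)
# Two Gaussian classes separated along the x-axis; the within-class
# scatter is largest along y.
X0 = rng.standard_normal((100, 2)) * [1.0, 3.0]
X1 = rng.standard_normal((100, 2)) * [1.0, 3.0] + [5.0, 0.0]

m0, m1 = X0.mean(0), X1.mean(0)
# Within-class scatter matrix.
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

# Fisher discriminant for two classes: the direction maximizing
# between-class scatter relative to within-class scatter.
w = np.linalg.solve(Sw, m1 - m0)
w /= np.linalg.norm(w)

# The discriminant points along the class separation (the x-axis),
# not along the direction of largest variance (the y-axis).
assert abs(w[0]) > 0.95
```

Note the contrast with PCA: the leading PC of this data set would be the y-axis, which carries no class information at all.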

In [