A NEW APPROACH TO MULTIPLE CLASS PATTERN CLASSIFICATION WITH RANDOM MATRICES

We describe a new approach to multiple class pattern classification problems with noise and a high-dimensional feature space. The approach uses a random matrix X that has a specified distribution with mean M and covariance γ_ij(Σ_s + Σ_ε) between any two columns of X. When Σ_ε is known, the maximum likelihood estimators of the expectation M, correlation Γ, and covariance Σ_s can be obtained. Patterns with high-dimensional features and noise are then classified by a modified discriminant function according to the maximum likelihood estimation results. The new method is compared with a multilayer feed-forward neural network approach on nine digit recognition tasks of increasing difficulty. Both methods achieved good results on those classification tasks.


Introduction
This paper presents a new approach to the multiple class pattern classification process. Consider the situation where the multivariate observation x_i with p dimensions on an object consists of two independent components y_i and ε_i (i = 1,2,...,n), where n is the total sample size. Let y_i have a p-dimensional multivariate normal distribution with mean vector µ and covariance matrix Σ_s, while ε_i has a multivariate normal distribution with mean vector 0 and covariance Σ_ε.
If we let x_i = y_i + ε_i, then x_i has a multivariate normal distribution with mean vector µ, and the covariance structure between x_i and x_j is

Cov(x_i, x_j) = γ_ij (Σ_s + Σ_ε),   (1.1)

where γ_ij = γ_ji is the correlation between x_i and x_j, with γ_ii = 1. Thus the probability density function of the observation matrix X_{p×n} = (x_1, x_2, ..., x_n) is given by Wang and Lawoko [1] as

f(X) = (2π)^{-np/2} |Γ|^{-p/2} |Σ_s + Σ_ε|^{-n/2} exp{ -(1/2) tr[(Σ_s + Σ_ε)^{-1} (X - M) Γ^{-1} (X - M)^T] },   (1.2)

where M = µ1^T, Γ = {γ_rs} is the correlation matrix, which is estimated by the sample correlation matrix of X. The term tr(·) denotes the trace of a matrix, and 1 is a column vector of unit elements. It is assumed that Σ_s, Σ_ε, and Γ are positive definite matrices. Note that in pattern recognition y_i and ε_i are referred to as "signal" and "noise," respectively. Under this model, the observation matrix X decomposes into independent signal and noise components, and its covariance matrix can be written as Γ ⊗ (Σ_s + Σ_ε), where ⊗ denotes the Kronecker product.
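The Kronecker covariance structure can be checked numerically. The following Python sketch (the dimensions and matrix values are illustrative, not from the paper) builds Γ ⊗ (Σ_s + Σ_ε) and confirms that its (i, j) block is γ_ij(Σ_s + Σ_ε), the cross-covariance between columns x_i and x_j:

```python
import numpy as np

# Illustrative dimensions: p = 2 features, n = 3 observations.
Sigma_s = np.array([[2.0, 0.5],
                    [0.5, 1.0]])        # signal covariance (p x p)
Sigma_e = np.array([[0.3, 0.0],
                    [0.0, 0.3]])        # noise covariance (p x p)
Gamma = np.array([[1.0, 0.4, 0.1],
                  [0.4, 1.0, 0.4],
                  [0.1, 0.4, 1.0]])     # correlation between columns (n x n)

# Covariance of the whole observation matrix X under the model.
V = np.kron(Gamma, Sigma_s + Sigma_e)   # (np x np)

def block(V, i, j, p=2):
    """Cross-covariance Cov(x_i, x_j): the (i, j) block of V."""
    return V[i * p:(i + 1) * p, j * p:(j + 1) * p]
```

For instance, `block(V, 0, 1)` equals 0.4(Σ_s + Σ_ε), and each diagonal block equals Σ_s + Σ_ε.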

Estimation of parameters
2.1. Estimation of the covariance matrix Σ_s. With the assumptions that Γ and Σ_ε are known and n ≥ p in (1.2), Wang and Lawoko [1] proved the following results for normally distributed random matrices.
(a) The maximum likelihood estimators of Σ = Σ_s + Σ_ε and M are obtained as

µ̂ = (1^T Γ^{-1} 1)^{-1} X Γ^{-1} 1,   M̂ = µ̂ 1^T,

Σ̂ = n^{-1} (X - M̂) Γ^{-1} (X - M̂)^T.

(b) Σ*_s = (n - 1)^{-1} (X - M̂) Γ^{-1} (X - M̂)^T - Σ_ε is an unbiased estimator of Σ_s, and the covariance of the elements of Σ*_s is given by

Cov(σ*_ij, σ*_kl) = (σ_ik σ_jl + σ_il σ_jk)/(n - 1),   (2.5)

where the σ_rs are the elements of Σ_s + Σ_ε.

2.2. Estimation of the correlation matrix Γ.
Suppose now that Γ is unknown in (1.2). The estimation of Σ_s (and Σ_ε), which is only possible after the separation of "signal" from "noise," is considered for a specific model in [2]. We now need to estimate the correlation matrix Γ in (1.2). The log-likelihood function from (1.2) can be written as

L(Γ, ∆) = -(np/2) log 2π - (p/2) log|Γ| - (n/2) log|∆| - (1/2) tr[∆^{-1} H(X - M) Γ^{-1} (X - M)^T H^T],   (2.6)

where ∆ = H(Σ_s + Σ_ε)H^T and H is an orthogonal matrix. Differentiating L with respect to Γ and ∆ yields the derivatives (2.7). Equating dL to zero, we obtain stationary equations that reduce to a matrix equation in Γ^{-1}, where 0_{n×n} denotes a zero matrix of dimension n × n and 1 is a column vector of unit elements.
From this we obtain an equation for Γ^{-1}. If we let Y = I_n - (1^T Γ^{-1} 1)^{-1} 1 1^T Γ^{-1} and A = X^T X, the expression becomes a quadratic form in Y. It is known that a solution Y = Y* can be obtained through numerical methods [3]. Once Y* is known, we can solve for Γ^{-1}; the resulting equation is linear in Γ^{-1} and has a unique solution [4], provided that the coefficient matrix is nonsingular. Finally, the elements of Γ are obtained from Γ^{-1}. Note that the solution Γ = {γ_ij} obtained from these equations will not necessarily satisfy the usual properties of a true correlation matrix, namely that
(1) Γ is a positive definite symmetric matrix;
(2) |γ_ij| ≤ 1, with γ_ii = 1 for all i.
In order to find a true correlation matrix which, on the basis of some measure, is as close as possible to Γ, several methods can be used. Some of these techniques, such as the "shrinking" and "eigenvalue" methods, are summarized in Rousseeuw and Molenberghs [5].
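One such repair technique, the eigenvalue method, can be sketched as follows. This is a rough illustration under our own simplifications, not the exact procedure of [5]: non-positive eigenvalues of the estimate are clipped and the result is rescaled to unit diagonal.

```python
import numpy as np

def repair_correlation(G, eps=1e-8):
    """Eigenvalue method (sketch): turn a symmetric estimate G into a
    valid correlation matrix by clipping non-positive eigenvalues and
    rescaling to unit diagonal."""
    G = (G + G.T) / 2.0                  # symmetrize
    w, V = np.linalg.eigh(G)
    w = np.clip(w, eps, None)            # force positive definiteness
    A = (V * w) @ V.T                    # V diag(w) V'
    d = np.sqrt(np.diag(A))
    return A / np.outer(d, d)            # gamma_ii = 1
```

The output is symmetric positive definite with unit diagonal, so |γ_ij| ≤ 1 holds automatically by the Cauchy-Schwarz inequality.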

2.3. Generating a modified discriminant function (MDF).
When Σ_ε is known and Σ_s and Γ are estimated as in Sections 2.1 and 2.2, the covariance matrix Γ ⊗ (Σ̂_s + Σ_ε) of X can be obtained by the maximum likelihood method. Then the modified discriminant function (MDF) for each class j is given by

MDF_j(x) = -(1/2)(x - x̄_j)^T (Σ̂_s + Σ_ε)^{-1} (x - x̄_j) + log p_j,   (2.16)

where j = 1,2,...,c; x̄_j is an estimator of the sample mean vector of the jth cluster of the random matrix X; p_j is the a priori probability of class j; and c is the number of classes. Note that we use the covariance Γ ⊗ (Σ̂_s + Σ_ε) in the MDF. Since the quadratic term x^T(Σ̂_s + Σ_ε)^{-1}x is common to all classes, (2.16) can be further simplified to give a linear modified discriminant function (LMDF):

LMDF_j(x) = x̄_j^T (Σ̂_s + Σ_ε)^{-1} x - (1/2) x̄_j^T (Σ̂_s + Σ_ε)^{-1} x̄_j + log p_j.   (2.17)

The classification rule is as follows: assign an observation x* to class i if LMDF_i(x*) ≥ LMDF_j(x*) for all j ≠ i. The classification functions of linear modified discriminant analysis assume equal variance-covariance matrices for all groups and a multivariate normal distribution. Note that we appeal to the central limit theorem, under which the total noise can be approximated by a Gaussian (normal) distribution. The Polar method can be used for generating this noise; it relies on having a good uniform random number generator. We describe the noise generation in Section 3, where it is used to adjust the mean and variance of the signals to produce digit recognition classification problems of increasing difficulty. Also note that the MDF approach differs from the method of support vector machines presented in Duda et al. [6]. Their method is a special case of (2.16) and (2.17); here the random samples are correlated with covariance Γ ⊗ (Σ̂_s + Σ_ε), and the classification functions can also be used to develop discriminant regions.
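The classification rule can be made concrete with a small sketch. The code below computes linear scores of the form x̄_j' Σ^{-1} x - (1/2) x̄_j' Σ^{-1} x̄_j + log p_j and assigns the class with the largest score; the means, covariance, and priors are hypothetical, not estimates from the paper's data.

```python
import numpy as np

def lmdf_scores(x, means, Sigma, priors):
    """Linear discriminant scores in the spirit of the LMDF:
    one score per class, sharing a common covariance matrix."""
    Si = np.linalg.inv(Sigma)
    return np.array([m @ Si @ x - 0.5 * (m @ Si @ m) + np.log(p)
                     for m, p in zip(means, priors)])

def classify(x, means, Sigma, priors):
    """Assign x to the class with the largest score."""
    return int(np.argmax(lmdf_scores(x, means, Sigma, priors)))

# Hypothetical two-class example with a shared covariance.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = np.eye(2)
priors = [0.5, 0.5]
```

An observation near one class mean is assigned to that class, e.g. `classify(np.array([0.1, -0.2]), means, Sigma, priors)` returns 0.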

3.1. Simulation data sets.
To investigate the power of our new approach, we used nine multi-dimensional digit recognition tasks in the experiments. Each task involves a file (a collection) of binary digit images. Each file contains 100 examples of each of the 10 digits (0,1,...,9), a total of 1000 digit examples. Each digit example is a 7×7 bitmap image. These tasks were chosen to provide classification problems of increasing difficulty, as shown in Table 3.1. In all of these recognition problems, the goal is to automatically recognize which of the 10 classes (digits 0,1,2,...,9) each pattern (digit example) belongs to. Except for the first file, which contains clean patterns, all data patterns in the other eight files have been corrupted by noise. The amount of noise in each file was randomly generated according to the percentage of flipped pixels and is given by the two digits nn in the file name. For example, the first row of the table shows that recognition Task 1 is to classify the clean digit patterns into the ten classes. In this task there are 1000 patterns in total: 500 are used for training and 500 for testing. In Task 3, 10% of pixels, chosen at random, have been flipped. Before the training/learning process starts, all the training examples are randomly ordered.
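The noise-corruption step can be sketched as follows. This is a minimal illustration; the function name and the use of Python's `random` module are our own, not the paper's generator.

```python
import random

def flip_pixels(bitmap, rate, rng=None):
    """Corrupt a binary image by flipping the given fraction of its
    pixels, chosen at random (as in Tasks 2-9)."""
    rng = rng or random.Random(0)
    bits = list(bitmap)
    k = round(rate * len(bits))          # number of pixels to flip
    for i in rng.sample(range(len(bits)), k):
        bits[i] = 1 - bits[i]
    return bits
```

For a 7×7 bitmap (49 pixels) and a 10% flip rate, as in Task 3, exactly 5 pixels are flipped.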
Examples of the 9 tasks are shown in Figure 3.1. The 9 lines of digit examples correspond to the 9 recognition tasks in Table 3.1. The first 3 tasks, one with clean data and two with flip rates of only 5% and 10%, are relatively straightforward for human eyes, though there is still some difficulty in distinguishing between "3" and "9." As the flip rate increases, as in Tasks 4 and 5, it becomes more difficult to classify the digit patterns, although humans can still recognize the majority. From Task 6 to Task 9, however, it is very difficult, even impossible, for human eyes to make good discriminations. We hypothesized that our new method would do a good job on the first three tasks but could not excel on Tasks 6 to 9. We also wanted to investigate whether the new method could achieve acceptable performance on these difficult tasks and whether it outperforms a neural network approach (Section 4) on them.

3.2. An MDF classification example. This subsection uses an example to briefly describe how to obtain the classification error for each task by applying the new method (2.16).
After applying (2.17) to each task, the coefficients of the discriminant functions can be obtained. In Task 6 in Table 3.1, for example, 30% of the pixels in the 1000 digits have been flipped, and the discriminant functions are evaluated on the vector x = (x_1, x_2, x_3, ..., x_48, x_49)^T. Task 6 classification results for the test data are summarized in Table 3.2. The total optimal error rate (OER) for this task is 0.106, that is, a classification accuracy of 89.40% averaged over 10 runs.

The neural networks approach
This section briefly describes the neural network approach to this problem. The approach involves the following steps: determination of the neural network architecture, network training, and network testing for classification.

4.1. Network architecture.
Multilayer feed-forward neural networks have proved suitable for classification and prediction problems [7,8,9,10,11]. In this approach, we use a three-layer network (with a single hidden layer) to perform the digit recognition tasks. The task then becomes determining the numbers of input, output, and hidden nodes.
To avoid feature selection and the hand-crafting of feature extraction programs, we used the raw pixels directly as inputs to the neural networks. Since each digit example in our recognition tasks is a 7×7 bitmap, we used 49 input nodes in the network architecture. The ten classes of digits, 0 to 9, form the output nodes of the network. Example input patterns and corresponding output patterns for the ten classes of digits in the first digit recognition task are shown in Table 4.1.
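The output patterns in Table 4.1 are one-hot codes over the ten classes; a tiny helper (the name is ours) to generate them:

```python
def target_pattern(digit):
    """One-hot target pattern for a digit class, e.g. 3 -> '0001000000'."""
    return "".join("1" if i == digit else "0" for i in range(10))
```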
The number of hidden nodes was determined by an empirical "trial and error" search during network training. We found that 10-20 hidden nodes were suitable for these classification problems and that the process was relatively robust across these numbers of hidden nodes. An example neural network architecture with a non-flipped bitmap pattern from class "0" in Task 1 is shown in Figure 4.1.

4.2. Network training and testing. We used the back error propagation algorithm [12] with the following two variations to train the network.
(i) Online learning. Rather than updating the weights after presenting all the examples in a full epoch, we update the weights after presenting each bitmap pattern.
(ii) Fan-in. Weight initialization and weight changes are modified by the fan-in factor. The weights are divided by the number of inputs of a node (referred to as the fan-in of the node) before network training, and the size of the weight change of a node is scaled accordingly during network training.
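The fan-in initialization can be sketched as follows; the uniform range [-1, 1] is our own assumption, as is the helper name.

```python
import random

def init_weights(n_in, n_out, rng=None):
    """Fan-in weight initialization (sketch): each weight is drawn
    uniformly and then divided by the fan-in of the receiving node."""
    rng = rng or random.Random(0)
    return [[rng.uniform(-1.0, 1.0) / n_in for _ in range(n_in)]
            for _ in range(n_out)]

# e.g. the hidden layer of a 49-input, 15-hidden-node network:
W_hidden = init_weights(49, 15)
```

Dividing by the fan-in keeps the initial net input of each node small regardless of how many connections feed it.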

Figure 4.1. Example network architecture: input layer, hidden layer, and output layer, with input pattern "0".

During network training and testing, the network classification is considered correct if the largest activation value produced by the neural network occurs at the output node corresponding to the target class. Otherwise, the classification is incorrect. For example, if the actual activation values of the output nodes for a given digit pattern are (0.32, 0.12, 0.45, 0.85, 0.23, 0.21, 0.13, 0.15, 0.33, 0.45) and the target output pattern is "0001000000," then this digit was correctly classified as digit "3" by the network; if the target output pattern is "0000000001," then this digit ("9") was incorrectly classified as digit "3" by the network.
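The correctness criterion above can be expressed directly (a small sketch; the helper name is ours):

```python
def is_correct(activations, target):
    """A classification is correct when the most active output node is
    the node marked '1' in the target pattern."""
    predicted = max(range(len(activations)), key=activations.__getitem__)
    return target[predicted] == "1"
```

With the activation values from the example above, the target "0001000000" is judged correct and "0000000001" incorrect.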

Results and discussion
This section compares the experimental results of the modified discriminant analysis method and the neural network method. For the neural network approach, we used a network architecture of 49-15-10; the network was trained with a learning rate of 0.5, without momentum. For the MDF approach, the a priori probabilities are easy to set, since they can usually be set equally (0.1 for each of the ten classes). Thus the MDF method generally takes a shorter preparation time and a shorter training time for the same problem.

Conclusions and further work
The goal of this paper was to develop an effective and efficient approach to high-dimensional, multiple class, noisy pattern classification problems. This goal was achieved by developing a noise factor and a modified discriminant function in our discriminant approach. A second goal was to investigate whether this approach could perform well enough on both clean and noisy data. Nine digit recognition tasks of increasing difficulty were used as examples in the experiments. A neural network classifier was also developed for comparison.
The results suggest that both the MDF approach and the neural network approach performed very well on the noisy data. On all 8 noisy tasks presented in this paper, the new MDF approach always achieved better classification performance than the neural network method. Furthermore, it was also more stable and took a shorter preparation time and a much shorter training time than the neural network method. As expected, the performance of both approaches deteriorated as the difficulty of the recognition problems increased.
Both the MDF approach and the neural network approach also performed quite well on the very noisy tasks. This is inconsistent with our hypothesis, under which we did not expect them to achieve good results on those tasks. It suggests that our new method and the neural network classifier are better than human eyes on these multivariate tasks.
To further investigate the power of the MDF method, we will apply it to other classification problems with high-dimensional, noisy data in the future.

Table 4.1. Example patterns used in neural networks for Task 1.