Determining Mental State from EEG Signals Using Parallel Implementations of Neural Networks

EEG analysis has played a key role in the modeling of the brain's cortical dynamics, but relatively little effort has been devoted to developing EEG as a limited means of communication. If several mental states can be reliably distinguished by recognizing patterns in EEG, then a paralyzed person could communicate to a device such as a wheelchair by composing sequences of these mental states. EEG pattern recognition is a difficult problem and hinges on the success of finding representations of the EEG signals in which the patterns can be distinguished. In this article, we report on a study comparing three EEG representations, the unprocessed signals, a reduced-dimensional representation using the Karhunen - Loeve transform, and a frequency-based representation. Classification is performed with a two-layer neural network implemented on a CNAPS server (128 processor, SIMD architecture) by Adaptive Solutions, Inc. Execution time comparisons show over a hundred-fold speed up over a Sun Sparc 10. The best classification accuracy on untrained samples is 73% using the frequency-based representation.


Introduction
A physically-disabled person who has no control over their motor responses has no means of communicating to the outside world.Is there a way for such a person to use their mental capabilities to a ect their environment?This question drives the search for patterns in EEG signals that are related to a person's mental state.A device that can reliably and quickly discriminate between several mental states could be used, for example, to generate commands to control a wheelchair.
Computerized analysis of EEG signals has evolved over the past three decades Gev87], with much of the e ort directed towards a better understanding of the functioning of the brain.The work reported here has a di erent goal, to extract information from EEG signals with which we can discriminate mental states.Primarily two approaches have been taken towards this goal.
The rst approach is based on the discovery that a characteristic signal appears in the EEG approximately 300 ms following the occurrence of a relatively rare, but expected, stimulus.Such signals are referred to as event related potentials, or ERP's.An example of how ERP's can be used to communicate with a computer is the work of Farwell and Donchin FD88], who used In one other study, by Lin, Tsai, and Liou LTL93], neural networks were applied to classify EEG signals collected in a paradigm similar to that of KA90].Lin, et al., used Kohonen's algorithm Koh84] to train a matrix of units to identify clusters of similar patterns and associate each cluster with a particular mental task.They trained their classi er on data for all tasks performed by one subject in one recording session and tested the resulting classi er on data from other sessions and other subjects.Most tests showed very poor classi cation accuracy, though the accuracy tended to be higher for some tasks, particularly the mental arithmetic task (see Section 2.1).
The remaining sections are organized as follows.In Section 2, we describe the methods and algorithms used to collect, process, and classify the data.The implementations of the processing and classi cation algorithms are described in Section 3. Results of classi cation experiments are discussed in Section 4, and conclusions are presented in Section 5.

Data Collection, Preprocessing, and Neural Network Algorithm 2.1 Mental Tasks
All data used in this study were recorded previously by Keirn and Aunon.Their selection of the set of mental tasks was guided by Galin and Ornstien GO73] whose results showed detectable hemispheric di erences in some tasks.The following tasks were studied in KA90]: Baseline|Alpha Wave Production The subject is asked to open and close their eyes at approximately ve second intervals.With their eyes closed the subject is to relax as much as possible.This is considered the baseline session for alpha wave production, and other asymmetries.Non-task associated levels of alpha wave production and asymmetries produced across di erent electrodes and across EEG bands are thus obtained.

Mental Arithmetic
The subject is given a non-trivial multiplication problem to solve and, as in all of the tasks, is instructed not to vocalize or make overt movements while solving the problem.An example of such a task is to multiply the numbers 49 times 78.The problems are non-repeating and are designed so that an immediate answer will not apparent.The subject veri es at the end of the task whether or not they arrived at a solution.

Geometric Figure Rotation
The subject is given 30 seconds to study a drawing of a complex three dimensional block gure after which the drawing is removed and the subject instructed to visualize the object being rotated about an axis.

Mental Letter Composing
The subject is instructed to mentally compose a letter to a friend or relative without vocalizing.Since the task is repeated several times the subject will be told to try to pick up where they left o in the previous task.
Visual Counting The subject is asked to imagine a blackboard and to visualize numbers being written on the board sequentially, with the previous number being erased before the next number is written.The subjects is further instructed not to verbally read the numbers but to visualize them, and to pick up counting from the previous task rather than starting over each time.The experiments reported here used only the data recorded from a single subject performing the baseline task and the mental arithmetic task.

Recording of EEG Signals
Subjects were seated in a sound-proof, dimly-lit, room.As shown in Figure 1, electrodes were placed at O 1 , O 2 , P 3 , P 4 , C 3 , and C 4 , standard electrode locations in the 10-20 System Jas88].The electrodes were connected to Grass 7P511 ampli ers that bandpass ltered the signals at

Unprocessed Data Representation
Keirn and Aunon found that quarter-second segments of the 10 second data resulted in classi cation accuracy approximately the same as that obtained from two-second segments.Therefore, for the experiments reported here, we divided the data into quarter-second segments.Segments were actually 62 samples long, slightly less than one quarter second.Ideally, we would like to use all quarter-second segments to train the classi er.This would result in 2,438 (i.e., 2,500-62) overlapping segments, each o set from the previous one by one sample.To make this more tractable, we divided the data into segments that overlap by an eighth of a second, resulting in 79 segments (40 non-overlapping + 39 overlapping) shown in Figure 2. Data from each segment becomes one pattern, so each pattern is a 62 x 6, or 372, component vector.We collected 79 such patterns from one subject performing the baseline and mental arithmetic tasks 10 times each, for a total of 1,580 patterns.Repetitions were performed by the subject in two recording sessions on di erent days.
Examples of actual recordings are shown in Figure 3.The top two graphs show the rst and last quarter-second segment from a subject performing the baseline task.The bottom two graphs are data from the same subject doing the mental arithmetic task.There are no obvious features that di erentiate the top graphs from the bottom graphs.

K-L Representation
It is possible that the large number of components in each pattern is many more than actually needed.Often, when classifying high-dimensional data, equivalent accuracy can be achieved by classifying data obtained by projecting the original data onto the rst n eigenvectors, where n is much smaller than the dimensionality of the original data.This is usually referred to as the Karhunen-Lo eve decomposition, called the K-L representation here.To perform the K-L decomposition, the covariance matrix of the mean-subtracted set of 1,580 372-dimensional patterns was calculated.The eigenvalues and eigenvectors of the covariance matrix were calculated using the Jacobi method.This involves performing a sequence of similarity transformations in order to reduce the o -diagonal elements to zero.A series of plane rotations are applied, each one annihilating one of the o -diagonal elements.Successive transformations result in non-zero values at these elements, but the process ultimately converges and the result is a diagonal matrix.The matrix of eigenvectors is the product of the transformation matrices, and the eigenvalues are the elements of the nal diagonal matrix.
The rst four eigenvectors for the combined baseline and mental arithmetic task data are graphed in Figure 4 as six curves corresponding to the six electrode channels.The number of eigenvectors to project to can be chosen empirically or determined by examining the eigenvalues.One estimate of dimensionality is the global Karhunen-Lo eve estimate, given by the index i for which i = max 0:01, where the i are the eigenvalues in decreasing order for i = 1; 2; : : :.For the baseline and mental arithmetic data used in our experiments, i = 50.Figure 5 shows the rst 50 eigenvalues normalized by the total sum of the 372 eigenvalues.
The K-L representation was formed by projecting each 372-dimensional pattern onto the rst 50 eigenvalues.Examples of patterns in this representation that correspond to the raw

Frequency-Band Representation
In KA90] features were extracted from spectral density estimates using asymmetry ratios given by (R ?L)=(R + L), where R is the area under the spectral density curve of a right hemisphere channel for a speci c frequency band and L is de ned similarly for the corresponding left hemisphere channel.These asymmetry ratios were calculated for each possible right-to-left combination of channels and for each of four frequency bands: delta (0-3 Hz), theta (4-7 Hz), alpha (8-13 Hz), and beta (14-20 Hz).This resulted in 36 asymmetry ratios.In addition the 24 power values themselves (R and L) were added for a total of 60 features.
In the current study, the spectral density was estimated from autoregressive (AR) parameters calculated using the Burg method PM92].This method is based on the minimization of forward and backward linear prediction errors, subject to satisfying the Levinson-Durbin recursions.The spectral density is given by where n are the estimated AR coe cients.An AR model of order 6 (M = 6) was used here since it yielded good results in KA90].
Example patterns in the frequency-based representation are shown in Figure 7.These were derived from the same data as the graphs in Figures 3 and 6.

Neural Network Classi cation Algorithm
In their previous work, Keirn and Aunon used a quadratic Bayesian classi er.This Bayesian classi er assumes a Gaussian shape to the probability density function of data from each class and a linear discrimination between these density functions is determined.This classi cation technique is insu cient accurate classi cation requires a nonlinear combination of the pattern components.
Neural networks have the capability of nding a nonlinear transformation of the pattern in order to classify more accurately.However, the increased complexity of a neural network can result in large computation times to train the network.The quantity of data involved in our experiments prompted out implementation of the neural networks on a SIMD parallel computer, described in the next section.resent the computational units of the network.The interconnections represent scalar values passed as input to each unit.Each unit has a unique vector of weights corresponding to its input vector.The computation performed by the units is a weighted sum of their inputs and a nonlinear squashing function that restricts the range of the output to be between 0 and 1.We used the typical sigmoid squashing function.Let the inputs to a unit be x i , the weights be w i , and the output of the unit be y.The weighted sum and sigmoid function are combined to produce the unit's output: y = 1 1 + e ?P i x i w i : The network consists of two layers of units.The units in the rst layer are called hidden units, because the outputs of these units are used internal to the network to transform the input into another representation for the output unit.The output of the output unit is taken as the classi cation of the current input pattern.
To train the network, a set of training patterns and corresponding correct outputs is repetitively presented to the network.After each pass through the training data, called an epoch, the weights are adjusted to reduce the error between the correct output and the actual output of the network.To determine how to adjust each weight, we applied the error backpropagation algorithm RHW86].This algorithm calculates the gradient of an error function with respect to each weight, then adjusts the weights in the negative gradient direction to reduce the error.The error function is the squared error summed over all training patterns: where p is an index into the set of training patterns, z (p) is the correct output for pattern p, and y (p) is the output of the network for pattern p.For the experiments reported here, z (p) is set to 0.1 for the baseline task and 0.9 for the mental arithmetic task.This determines how we interpret the output of the network as a classi cation: if y (p) > 0:5, pattern p is said to be classi ed as being from a mental arithmetic task; if y (p) < 0:5, the pattern is classi ed as a baseline task.
The gradient of E with respect to the weights results in the following expressions.Let h (p)   i signify the output of hidden unit i when the network receives input pattern p.To change the weights of the output unit, we sum the following w (p) i 's w (p)   i = c(z (p) ?y (p) )h (p)   i over all training patterns and, at the end of the epoch, add the result to the weights in the output unit.The constant c is a scale factor that is chosen empirically to produce weight changes that are not too large nor too small.If too large, the network would converge quickly to a suboptimal local minimum; if too small, the time required for the network to converge would be impractically long.Similarly, we sum the w (p) ij 's for hidden unit j, given by w (p) ij = c h (z (p) ?y (p) )h (p) j (1 ?h (p) j )x (p)   i where x (p)  i is a component of the input pattern give to the network.The hidden units have their own scale factor, c h .

The Over tting Problem
Although this algorithm is designed to minimize the squared error over a training set, the true goal of this procedure is to nd a set of weights for which the squared error is minimized over a novel set of data, i.e., data to which the network was not trained.Only if the algorithm is able to nd a model (the network and weights) that generalizes well to this test set can we say that the model captures the regularities present in the data.This problem has been called the over tting problem|the network too closely matches the training data and does not interpolate and extrapolate well to novel data.The is usually tested by dividing the data into a training and a testing set.The network is trained to convergence on the training set, after which the error on the testing set is calculated and used as an estimate of how well the network will perform on novel data.
To limit the amount of over tting, one may decrease the complexity of a trained network through pruning, or by limiting the growth in complexity during training, or by terminating the training when the network begins to over t.Each is described below.
A trained network that over ts may be pruned by removing weights and units that have minimal e ect on the network's error.This may be performed by sequentially setting each weight to zero and testing the error of the resulting network.Methods that require less computation rely on measures of utility for each weight.For example, the Hessian of the error with respect to the weights for a given input pattern is the sensitivity of the error with respect to each weight LDS90].Other measures of utility include MS89] and RP92].
The complexity of the function learned by a neural network is related to the number of hidden units and the number of weights of magnitude signi cantly di erent from zero.One way to limit the growth of complexity during training is to add an error term that biases the gradient search for weight values to regions in weight space of low-magnitude weights.Weigend, Rumelhart, and Huberman WRH91] developed a term that penalizes large magnitude weights and applied this to several time series prediction problems.
The third method for controlling over tting is more dependent on the data than the previous two.The data is divided into three parts, two for training and testing, and the third for determining when the network is over tting.The third part is called the validation set.After each pass through the training data, the error on the validation set is calculated.Initially, the error on the training and validation set decreases.As the network begins to over t, the error on the validation set begins to increase.At this point, prior to convergence on the training data, training is halted and the error on the testing set is calculated as a measure of the network's performance.Weigend Wei94] used this technique on a time-series prediction problem and found it to be useful even for very small networks.
For the experiments described here, we applied the validation set method for early stopping.We divided the data into 10 distinct subsets, corresponding to the 10 recording sessions for one subject performing two tasks.One subset was selected to be the validation set, another subset was the testing set, and the remaining eight subsets comprised the training data.Thus, there are 90 possible combinations of training, validation, and testing sets.Notice that training, validation, and testing always were performed on data from distinct recording sessions.All 90 combinations were used and classi cation performance (on the testing sets) were averaged over the 90 runs.

Programming Technique
The data windowing, K-L transform, and frequency analysis were performed with a combination of Unix shell scripts and C code and run on a Sun Sparc 10.The eigenvector and frequency analysis were performed using implementations from PFTV88].
The classi cation experiments were performed on a CNAPS Server II from Adaptive Solutions, Inc..Our CNAPS system is a parallel, SIMD architecture with 128, 20 MHz processors, upgradable to 512 processors.It can be programmed at three levels, using assembly language, C with parallel programming extensions, or a library of routines that implements error backpropagation.Assembly language can be used to write highly e cient code, but the library of backpropagation routines is by far the most e cient in terms of development time.The C language level is intermediate, giving the programmer the ability to write e cient code in a familiar language.We chose to use the existing library to implement the error backpropagation algorithm.Speci cally, we used Adaptive Solutions' BPfolded routine, which distributes the load even when the network is larger than the number of processors.
The library consists of routines to set algorithm parameters, to identify the source of the data, and to execute the algorithm.The following is an outline of the C program that we used to call the backpropagation routines.All steps, except step 9, were performed on the CNAPS server; step 9 was implemented in C. The C program was compiled and run on a Sun Sparc 10.
1. Connect to CNAPS server and specify the training algorithm.2. Specify the algorithm's parameters, including the number of inputs, outputs, and hidden units, the number of training and validation patterns and the names of les containing them, the maximum number of epochs to train, the error criterion at which to terminate training, the learning rate constants for output and hidden layers.3. Specify names of training and validation data les.4. Set control modes of CNAPS to log the validation error by epoch. 5. Generate random initial weight values between ?0:2 and 0.2.6. Initiate execution on the CNAPS.Upon completion, results are returned in a prede ned C structure.7. Identify epoch number at which error on validation set was lowest.8. Retrain, starting with same initial weights, stopping at epoch found in previous step.9. Using the weights found in the previous step, calculate the mean square error over the testing set.10.Save results to a le and repeat after changing the learning rates, number of hidden units, or training and validation sets or input representation (by specifying di erent data les).All data for a given representation is stored in a le with one pattern per line and alternating between baseline and mental arithmetic tasks.To produce the training, validation, and testing data les, this le is split into ten equally-sized pieces and each piece is converted into the binary format required by the CNAPS server.This step is performed using the conversion tool cv from Adaptive Solutions.The assembly from these 10 parts into training, validation, and testing sets was implemented in the Tcl scripting language.The Tcl/Tk toolkit and shell is a public domain, interactive programming environment for creating graphical user interfaces Ous94] and general utilities.We chose Tcl to implement this for two reasons.First, we plan to develop a graphical user interface to facilitate the development and visualization of di erent representations of EEG signals.We will also add components to the GUI to allow us to control the running of the CNAPS server and analyze the results.The second reason we chose Tcl is the ease of programming in its interactive environment.

Results
Table 1 shows the results of all classi cation experiments as the average percent of test patterns classi ed correctly, out of 158 patterns.This table includes 90% con dence intervals, based on 90 repetitions (see Section 2.7).
Clearly the best classi cation accuracy is achieved with the frequency-band representation, giving an average accuracy of about 74% for a network with 40 hidden units, though the accuracy varies little for other network sizes, including a network with a single hidden unit.Performance with the other representations is signi cantly lower, ranging from 50% to 53%.The e ect of the number of hidden units is slightly more signi cant for the unprocessed and K-L representations than for the frequency-band representation.These results suggest that the energy within standard frequency bands is more useful in discriminating the two mental tasks than is the unprocessed data, or dimensionally-reduced data.This hypothesis must be tested by further experimentation.It appears that the performance of the unprocessed and K-L representations is increasing with network size, but the di erences are not statistically signi cant.We have not yet tried networks with more than 80 hidden units.
To determine which frequency bands or asymmetry ratios were most useful, the weights to which the neural network converged must be examined.The weight magnitudes provide some information about which inputs are most relevant.More informative would be the partial derivatives of the network output with respect to each input component, showing which input components have the greatest e ect on the output for a given input pattern.

Bene t of Parallel Implementation
To obtain the results reported here, each network was trained a number of times to approximately optimize the training algorithm's parameters|the learning rates for the output and hidden layer.After nding good parameter values, 90 runs of 1,000{5,000 epochs were made with each network to calculate the averages in Table 1.The use of the parallel CNAPS server made the scope of this study practical.
To estimate the actual bene t of using the CNAPS server, we observed the execution time of training various sized networks for 1000 epochs using the unprocessed data representation.Results are graphed in Figure 9  Sparc 10, but the execution time for the parallel CNAPS server increases from 3.8 minutes for two hidden units to only 4.1 for 80 hidden units.A network of 80 hidden units takes about 240 times longer on the Sparc 10 than on the CNAPS server.Running 90 repetitions of the training procedure for a 40 hidden unit network took approximately 6 hours on the CNAPS Server; on the Sparc 10 this would require 37 days.

Program Development E ort
Most of the implementation e ort for this study was devoted to the extraction of the binary EEG data from tape, separating and gathering the data for each experiment, dividing it into segments, then normalizing and converting to the binary format for CNAPS machine.Writing the programs for running on the CNAPS server took one graduate student approximately one month to learn how to use the library.This e ort is detailed below: Retrieving data from tape: This required discussions with Keirn to fully understanding the format of the data on the tape.One graduate student wrote a C program to convert the data to ASCII and extract the data we needed to run each program.E ort: 2 weeks.
Generating training and testing data: The data had to be segmented into quarter second intervals, assigned a correct classi cation value, normalized, divided into training and testing sets, and converted into the binary form required by the CNAPS machine.Two graduates students worked on this phase.E ort: 3 weeks.
Implementing the eigenvector analysis: The Numerical Recipes algorithm PM92] was embedded into a C program that performed the eigenvector decomposition and projected all data vectors onto the highest ranked eigenvectors.E ort: 1 week.
Implementing the frequency analysis: A C program was written to compute the Burg AR coe cients using the evlmem and memcof C code from PM92].E ort: 1 week.
Implementing backpropagation: One student taught himself how to program the CNAPS machine using the library of backpropagation routines.He wrote a C program that accepts a number of command line arguments, calls the appropriate CNAPS routines to initialize the machine and the backpropagation algorithm, and then calls the routines to perform the training and saving of results.E ort: 1 month.
Interpretation of results: Awk AKW88] scripts were written to calculate means and standard deviations of the results from multiple runs.E ort: 1 day.

Conclusion
Two conclusions can be drawn from this work.The rst is related to the results, the second to the method.The results strongly suggest that the frequency-based representation produces signi cantly more accurate classi cation than do the unprocessed or K-L representations.The size of the neural network appears to have little e ect on the classi cation accuracy.This means that the signal representations we have considered do not bear signi cant relationships much beyond the near-linear relationships that are possible with networks having a single hidden unit.
This conclusion must be supported by further experimentation.We are currently investigating the e ect of averaging the output of the network over successive quarter-second segments.Preliminary results show that by doing so the classi cation accuracy can be increased up to 90% correct by averaging over segments from the full 10 second recording period.This, of course, would be impractical for the real-time control of a wheel chair.We are also considering other representations, such as wavelets.Wavelets represent both frequency and time features of a signal.For this reason, periods over which a signal is nonstationary are more accurately represented with wavelets than with a strictly frequency-based representation.
The second conclusion from this study is the utility of the parallel implementation of the error backpropagation algorithm.A much greater number of network sizes and initial weight vectors could be evaluated on the CNAPS server than could be completed in a comparable amount of time on a serial machine.The experiments reported here would have required over one month to complete on a Sun Sparc 10.
The utility of this parallel implementation for a portable, real-time, EEG pattern recognizer is currently not known.If networks with only one or two hidden units su ce, then a parallel implementation of the neural network classi er may be unnecessary.However, a parallel implementation of the Burg algorithm for computing the frequency-based representation might signi cantly reduce the overall response time.Parallel algorithms for computing the FFT (Fast Fourier Transform) on SIMD architectures are well-known KV87, JJS86, Swa84], but the Burg method, based on an AR model, produces a smoother spectrum than does the FFT when applied to noisy signals, such as EEG.Incremental methods, based on Kalman lter techniques, exist for calculating the coe cients of an AR model as new samples are received.We are investigating parallel implementations of these algorithms as e cient means for computing the frequency-based representations.However, the primary bottleneck we currently face in a realtime implementation is the apparent requirement of averaging over a number of successive time segments to gain su cient classi cation accuracy.Even if the classi er provides a real-time response, several seconds of EEG samples must be processed before a con dent task identi cation can be made.Thus, further experimentation with di erent representations is required.

Figure 1 :
Figure 1: Location of the six surface electrodes used to record EEG signals.0.1-100 Hz.The EEG signals were sampled at 250 samples per second and digitized with 12 bits of accuracy.Data was recorded from each subject for a duration of 10 seconds while the subject was performing a single task with their eyes open.Each session resulted in 250 samples/second x 10 seconds x 6 channels, or 15,000 values.

Figure 3 :
Figure 3: EEG signals recorded from six channels as subject performed the baseline or mental arithmetic tasks.

Figure 4 :
Figure 4: First four eigenvectors of mean-subtracted data for baseline and mental arithmetic tasks.

Figure 5 :
Figure 5: First 50 eigenvalues for mean-subtracted data from subject 3 for baseline and math tasks.patterns in Figure 3 are shown in Figure 6.

Figure 6
Figure 6: K-L representations of the rst and last quarter-second of data from the baseline task (top graphs) and the mental arithmetic task (bottom graph).

Figure 7 :Figure 8 :
Figure 7: Frequency-based representations of the rst and last quarter-second of data from the baseline task (top graphs) and the mental arithmetic task (bottom graph).

Figure 9 :
Figure 9: Minutes of execution time for increasing network size for the parallel CNAPS server and a serial Sun Sparc 10.All runs were for 1000 epochs.