Objective and Subjective Evaluation of Online Error Correction during P300-Based Spelling



Introduction
A brain-computer interface (BCI) is a system that connects the brain to a computer directly and avoids the need for peripheral nerve and muscle activity to execute the user's actions. A major aim of BCI research is to allow patients with severe motor disabilities to regain autonomy and communication abilities [1]. This raises the crucial challenge of achieving reliable control by measuring and interpreting brain activity on the fly. Due to the highly complex, noisy, and variable nature of brain signals, especially those obtained with noninvasive recordings using scalp EEG, the computer sometimes misinterprets the signals and makes a decision that does not match the user's intention. In this context, it is highly relevant to look for a way to detect and correct errors. One way to tackle this issue is to appeal to the hybrid BCI approach [2], where it has been shown that BCI performance could be improved by supplementing the first-order brain signal with second-level information to aid the primary classifier and to improve the final decision or BCI output [3]. This complementary signal can be either of cerebral origin or of a very different nature [2].
Along that line, a couple of recent studies have proposed to use error-related brain signals in BCI applications. It has been established for years that the brain produces specific evoked responses in case of errors. The error-related negativity (ERN or Ne, [4,5]) and error positivity [6,7] are phase locked to the motor response in alternative forced-choice tasks, whereas the feedback-related negativity (FRN or feedback ERN) is produced in response to negative feedback [8] (see [9] for a review on ERN and FRN). In a human-computer interface, the ERN has already been used to detect errors online [10]. In BCI, an FRN-like signal has been observed in response to erroneous feedback [11]. Ferrez and del R. Millan used the term "interaction error potential," as the decision results from the interaction between the user and the machine. Indeed, either one or both may be responsible for the error [12].
In particular, the well-known P300-speller has been used to compare classifiers in their ability to discriminate between correct and incorrect trials [13][14][15], in order to achieve real-time automatic error detection. In [15], the authors compared several classifiers offline, using a threefold cross-validation procedure. The best classification was obtained with an LDA applied on polynomial coefficients. It was then used to evaluate the putative improvement due to ErrP detection, by estimating the ensuing reduction in the number of trials needed to properly spell a letter. They found that four out of five subjects could benefit from error detection, provided that the accuracy of the P300-Speller remains below 75%. The same group was the first to test online automatic error detection in a P300-Speller BCI. The ErrP was detected with a 68% specificity (the probability of detecting a correct trial) and a 62% sensitivity (the probability of detecting an error trial), in two out of three subjects. However, this was not sufficient to improve the information transfer rate [13]. Finally, in a recent P300-Speller study, healthy and motor-impaired participants increased their bitrate by 0.52 (in bits/trial) using online error detection during copy spelling [14]. However, these studies implemented and evaluated automatic error detection but not automatic error correction. In other words, they could remove a wrong letter by detecting the ensuing ErrP but did not attempt to immediately replace this letter by another highly probable one [13,14]. In the current study, we evaluated both error detection and correction, where correction was based on the second best guess of a probabilistic classifier.
In a previous experiment, we tested automatic error detection offline [16]. We achieved a very high specificity (above 0.9) and a fairly good sensitivity (up to 0.6), which yielded a significant improvement in offline spelling accuracy in about half of the participants considering an automatic correction based on the second best guess of the classifier. Importantly, it turned out that, for about 50% of the error trials, the second best guess of the classifier corresponded to the true target. Interestingly, this good correction rate (GCR) correlated with the spelling accuracy over subjects, suggesting that more attentive subjects would produce more distinguishable feedback response signals and should be more prone to benefit from automatic correction.
In the current study, we implemented and tested automatic error correction online, in a fairly large group of subjects. To ensure a high error rate in most participants, we used settings that render the spelling fast and challenging. We also evaluated subjective perception of error correction in each participant by means of a questionnaire. Importantly, the group clearly split into participants with low and participants with high ErrP detection specificity. We thus also report and discuss the difference between those two groups, both at the psychological and neurophysiological levels.
The paper is organized as follows. First, Section 2 describes the experimental design in detail, including the online OpenViBE scenario that we implemented for spelling and error detection and correction. It also describes the evaluation procedure we used to analyze data offline. Results are presented in the next section and discussed in the final section of the paper.

General Principle of the P300-Speller.
The P300-Speller is a BCI paradigm developed to restore communication in locked-in patients [17]. The P300 signal is an EEG positive deflection that occurs approximately 300 ms after stimulus onset and is typically recorded over centro-parietal electrodes. This response is evoked by attention to rare stimuli in a random series of stimulus events (the oddball paradigm) [18] and is even stronger when the subject is instructed to count the rare stimuli [19]. It can be used to select items displayed on a computer screen [17,20]. In practice, all possible items are displayed while the user focuses his attention (and gaze) onto the target item. Groups of items are successively and repeatedly flashed, but only the group that contains the target will elicit a P300 response. Correct spelling thus relies on both the user's attentional state and the ability for the BCI to detect the P300 response online.

Participants.
Sixteen healthy subjects took part in this study (7 men, mean age = 28.2 ± 5.1 (SD), range 20-37). They all signed an informed consent form approved by the local Ethical Committee and received monetary compensation for their participation. All participants reported normal or corrected-to-normal vision. None of the subjects had previous experience with the P300-Speller paradigm or any other BCI application.

Experimental Design
2.3.1. Setup. Participants' brain activity was simultaneously recorded with 56 passive Ag/AgCl EEG sensors (VSM-CTF compatible system) and 275 MEG axial gradiometers (VSM-CTF Omega 275). However, only 32 EEG electrodes were used for the online and offline analysis reported in this paper. The EEG electrode placement followed the extended 10-20 system (see Figure 1). Their signals were all referenced to the nose. The ground electrode was placed on the shoulder and impedances were kept below 10 kΩ. Signals were sampled at 600 Hz.

Stimuli.
We used a standard 6 × 6 matrix of items for stimulation [17], which we combined with our own implementation of the P300-speller in OpenViBE [21,22]. We moved away from the traditional row and column way of flashing items, by adapting the pseudorandom stimulation procedure proposed by Townsend et al. [23]. Townsend and collaborators showed that this improves the spelling accuracy significantly since it minimizes the risk for the subject or patient to be distracted by the letters adjacent to the target. Therefore, we constructed six pairs of groups of six letters, each pair being associated with a particular item of the matrix. This item then belongs to two groups only and those two groups have only this single item in common (see Figure 2). In other words, a unique pair of groups of nonadjacent letters is defined in order to replace the original pair of row and column that was associated with a unique possible target [24]. This way of flashing letters also minimizes the probability that a letter will be flashed twice in a row, which minimizes the risk for the user to miss one target presentation. The flash duration was equal to 60 ms and we set the stimulus onset asynchrony (SOA) to 110 ms. We chose these small parameter values in order to make the trials as short as possible. This has a twofold advantage. It enables us to acquire more trials per subject and it makes the spelling more difficult [22]. Both aspects are essential to generate enough error trials and study error detection and correction.
Figure 1: EEG channel montage. Only yellow channels were used for the online and offline analysis reported in this paper.

One Trial.
We call a trial the succession of stimulations and observations that are needed to select one item (see Figure 2). Each trial is thus made of several sequences depending on the spelling condition. A sequence of stimulations corresponds to the successive flashing of all the groups, once and in a pseudorandom order. The longer the trial (i.e., the more sequences per trial), the more observations to rely on for the BCI to find the target.
We used two spelling conditions: a fast, more error-prone condition, made of short (2-sequence long) trials and a slower, less risky one, made of four-sequence long trials. These two modes are fairly fast and challenging, which ensures the recording of many trials per subject, including enough error trials for subsequent analysis.
Since we used copy spelling, each trial started with the presentation of the current target, both at the top of the screen and within the matrix, using a green circle for 1 second. There was no break between sequences. At the end of each trial, 2.5 to 4 seconds after the last flash, the feedback was displayed in a blue square at the middle of the screen for 1.3 seconds. It was simultaneously written on the top of the screen (see Figure 2). This large presentation at the center of the visual field was made to favor clear single-trial responses to feedback. Participants were explicitly instructed to wait for the feedback at the middle of the screen and not to blink during feedback presentation.
In the session including automatic correction (see below), the second best guess of the classifier was used to replace the current feedback, in an orange square, whenever an ErrP was detected. In this case, the new feedback was presented for 1 second and the item was also simultaneously corrected at the top of the screen (see Figure 2). After a 0.5 second break, the new target for the next trial was presented.
Advances in Human-Computer Interaction

Full Procedure.
The experiment was divided into five parts.
(1) Installation and Instructions. After having read about the experiment and signed an informed consent form, each subject was prepared for EEG/MEG data acquisition. During preparation, the subject was asked to read the task instructions. Finally, a couple of typical trials were presented so that the subject would be familiar with stimulus presentation before starting the actual experiment.
(2) Speller Training. The aim of the first session was to gather training data in order to set the supervised algorithms subsequently used in the test phase, for individual feature selection and classification. Specifically, those data were used to compute both the individual spatial filters and the class parameters. In this training session, subjects were all required to successively copy spell the same 36 items (the whole matrix). Each item or trial was spelled using 3 flash sequences. The session lasted about 10 minutes and was interleaved with short breaks after the 12th and 24th items. No feedback was provided.
(3) Speller Testing and Error Detection Training. After training, the subjects had to go through four true spelling sessions, lasting approximately 12 minutes each. Each session was made of twelve 5-letter words. Subjects received feedback after each letter and words were separated by a 4.5 second break. Within a word, letters were spelled using the same number of sequences (either 2 or 4) and short- and long-lasting words were counterbalanced within sessions. The responses to feedback over those four sessions were used to train the feature extraction and classification algorithms for subsequent error detection.
(4) Speller and Error Correction Testing. In the fifth spelling session, participants had to spell twenty 5-letter words that they had chosen themselves before the beginning of the whole experiment. This was still copy spelling, since those words were entered in the computer by the experimenter before the actual session. All letters were spelled using two sequences only (fast mode). This last session lasted approximately 17 minutes. Importantly, whenever an error was detected, automatic correction was applied.
(5) Debrief. After recording, participants had to fill in a questionnaire. Using marks between 1 and 10, they had to respond to questions about their perception of the performance of the machine, the difficulty of the task, and the quality and usefulness of the correction.

Feature Extraction.
We used similar preprocessing steps and feature extraction algorithms to process the responses evoked by flashes (for spelling) and feedbacks (for error detection), respectively. Raw data were first downsampled to 100 Hz and bandpass filtered between 1 and 20 Hz, online. 100 Hz here corresponds to a good compromise between the need to sample above 60 Hz (the so-called engineer Nyquist frequency to avoid aliasing) and the advantage of reducing the dimension of the data for online processing. Feature extraction then consisted in linear spatial filtering, whose effect is to reduce the dimension of the data as well as to maximize the discriminability between the two classes (i.e., between targets and nontargets, during spelling, and between error and correct feedbacks, during error detection) [25].
The xDAWN algorithm provides orthogonal linear spatial filters that can be learned from training samples [25]. Based on our previous studies [16,22], we used a five-dimensional filter for both spelling and error detection. As mentioned above, those filters were learned based on the first session, for subsequent spelling, and based on the first four spelling sessions which included feedbacks, for error detection in the last session. For online spelling, the features consisted in the spatially filtered epochs from 0 to 600 ms after flash onset. The evoked response thus included both the P300 and the early visual response [22,26,27]. For online error detection, the features consisted in the spatially filtered epochs from 200 to 600 ms after feedback onset [28].

Feature Classification.
We used a mixture of multidimensional Gaussians as a classifier. The model parameters (i.e., the mean and variance of each of the two Gaussians) were learned from the same training samples as the parameters of the xDAWN algorithm, for the spelling and error detection task, respectively. Importantly, we assumed conditional independence in time and space between features (naïve Bayes hypothesis) [16,22]. This makes the real-time computation of the posterior probability of each new feature very efficient. This is particularly relevant for the spelling part, since it enables the BCI to update its posterior probability or belief about the target location, after each new observation or flash. Indeed, for spelling, all items are initially assumed to be equiprobable targets. At each new observation, this belief is updated following Bayes rule, by optimally combining the data likelihood and priors. The obtained posteriors then furnish the prior for the next observation, in a Markovian fashion.
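The sequential belief update described above can be sketched as follows. This is a minimal illustration, not the OpenViBE implementation used in the study: the function names, the two-feature class models, and the flashed groups are all hypothetical, and each feature is treated as an independent univariate Gaussian in line with the naïve Bayes hypothesis.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def naive_bayes_loglik(features, means, variances):
    """Sum of per-feature log densities (conditional independence)."""
    return sum(gaussian_logpdf(x, m, v)
               for x, m, v in zip(features, means, variances))

def update_belief(belief, flashed_items, features, target_model, nontarget_model):
    """One Bayes update: the posterior over items becomes the next prior."""
    log_post = []
    for item, prior in enumerate(belief):
        # If this item were the target, a flash containing it should look
        # like a target response; otherwise like a nontarget response.
        means, variances = target_model if item in flashed_items else nontarget_model
        log_post.append(math.log(max(prior, 1e-300))
                        + naive_bayes_loglik(features, means, variances))
    peak = max(log_post)  # log-sum-exp normalisation for numerical stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-feature class models: (means, variances).
TARGET = ([1.0, 1.0], [1.0, 1.0])
NONTARGET = ([0.0, 0.0], [1.0, 1.0])

belief = [1.0 / 36] * 36  # all 36 items equiprobable before the first flash
belief = update_belief(belief, {0, 1, 2, 3, 4, 5}, [1.2, 0.8], TARGET, NONTARGET)
belief = update_belief(belief, {0, 7, 14, 21, 28, 35}, [0.9, 1.1], TARGET, NONTARGET)
ranking = sorted(range(36), key=lambda i: belief[i], reverse=True)
best_guess, second_best_guess = ranking[0], ranking[1]
```

In this toy run, item 0 is the only item contained in both flashed groups, so it ends up with the highest posterior. The second entry of the ranking plays the role of the second best guess used for automatic correction.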
For error detection, a decision was made after each single feedback observation. Hence Bayes rule was applied once per trial only. We used individual priors based on each subject's averaged error rate, as given by the first four spelling and ErrP training sessions.
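A single-trial decision of this kind might look like the sketch below. The one-feature class models, the function names, and the 0.5 decision threshold are assumptions made for illustration; the text only states that Bayes rule was applied once per trial with a subject-specific prior.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def error_posterior(features, error_model, correct_model, prior_error):
    """Posterior probability that the feedback signalled an error."""
    def loglik(model):
        means, variances = model
        return sum(gaussian_logpdf(x, m, v)
                   for x, m, v in zip(features, means, variances))
    log_err = math.log(prior_error) + loglik(error_model)
    log_cor = math.log(1.0 - prior_error) + loglik(correct_model)
    peak = max(log_err, log_cor)
    w_err = math.exp(log_err - peak)
    w_cor = math.exp(log_cor - peak)
    return w_err / (w_err + w_cor)

def detect_error(features, error_model, correct_model, prior_error, threshold=0.5):
    """Flag the trial as an error when the posterior exceeds the threshold
    (the threshold value is an assumption, not taken from the study)."""
    return error_posterior(features, error_model, correct_model, prior_error) > threshold

# Hypothetical one-feature models: (means, variances).
ERROR_MODEL = ([1.0], [1.0])
CORRECT_MODEL = ([0.0], [1.0])
# Prior set to a subject's average error rate over the training sessions.
p_ambiguous = error_posterior([0.5], ERROR_MODEL, CORRECT_MODEL, prior_error=0.3)
flagged = detect_error([2.5], ERROR_MODEL, CORRECT_MODEL, prior_error=0.3)
```

When the observation is equally likely under both classes, as for the midpoint feature above, the posterior simply equals the prior, which shows how a low individual error rate biases the detector toward "correct".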

Evaluation of Online Error Detection.
To evaluate ErrP classification, we use the common confusion matrix for a two-class problem (see Figure 3).
[Figure 3: confusion matrix for ErrP classification. Recoverable cell descriptions: the ErrP classifier recognises an error and the second best guess of the classifier replaces the actually correct letter (false positive, worst case); the ErrP classifier does not recognise an error, even though the letter displayed is actually incorrect (false negative, bad case); the ErrP classifier does not recognise an error and the letter is indeed correct (true negative, best case).]
It involves the estimation of the following complementary measures (reported in percent in the results section):
(i) sensitivity = TP/(TP + FN), that is, the capacity to correctly detect errors;
(ii) specificity = TN/(TN + FP), that is, the capacity to correctly detect correct trials;
(iii) accuracy = (TP + TN)/(TP + TN + FP + FN), that is, the global efficacy of the classifier.
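The three measures above translate directly into code. The counts in the example are invented for illustration and are not data from the study.

```python
def detection_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts.

    A "positive" is a trial flagged as an error by the ErrP classifier.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Illustrative counts only: 50 error trials and 100 correct trials.
metrics = detection_metrics(tp=30, fp=10, fn=20, tn=90)
```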

Evaluation of Online Error Correction.
We computed three quantitative measures to evaluate error correction, both at the single subject and group levels.
We denote the first one by θ. It is the percentage of error trials for which the classifier's second best guess corresponded to the actual target. Note that θ is independent of error detection and only measures how well the classifier might help correcting errors automatically. We estimated θ based on the whole five spelling sessions.
The second measure evaluates automatic error correction on the very last session. It is the good correction rate (GCR) and corresponds to the percentage of detected true error trials that were appropriately corrected. While θ is an offline (predictive) measure, GCR is an online (true BCI behavioral) measure of performance.
Finally, we estimated the individual gain in spelling accuracy due to automatic error correction in the last session. It is simply the difference between the observed accuracy and the one that would have been observed with no online correction.
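The three measures defined above can be computed from a per-trial log, as in this sketch. The record layout (`target`, `first`, `second`, `flagged`) is a hypothetical format chosen for the example, and the five trials are made up.

```python
def fraction(flags):
    """Share of True values, or NaN when the list is empty."""
    return sum(flags) / len(flags) if flags else float("nan")

def correction_measures(trials):
    """Return theta, GCR, and the accuracy gain due to automatic correction."""
    errors = [t for t in trials if t["first"] != t["target"]]
    detected_errors = [t for t in errors if t["flagged"]]
    # theta: second best guess is right, regardless of error detection.
    theta = fraction([t["second"] == t["target"] for t in errors])
    # GCR: second best guess is right among detected true errors.
    gcr = fraction([t["second"] == t["target"] for t in detected_errors])
    acc_without = fraction([t["first"] == t["target"] for t in trials])
    # With correction, every flagged trial is replaced by the second guess.
    acc_with = fraction([(t["second"] if t["flagged"] else t["first"]) == t["target"]
                         for t in trials])
    return {"theta": theta, "gcr": gcr, "gain": acc_with - acc_without}

trials = [
    {"target": "A", "first": "A", "second": "B", "flagged": False},  # correct, untouched
    {"target": "B", "first": "C", "second": "B", "flagged": True},   # error, well corrected
    {"target": "C", "first": "D", "second": "E", "flagged": True},   # error, badly corrected
    {"target": "D", "first": "D", "second": "A", "flagged": True},   # false alarm
    {"target": "E", "first": "F", "second": "E", "flagged": False},  # missed error
]
measures = correction_measures(trials)
```

Here one good correction is cancelled out by one false alarm, so the gain is zero even though the GCR is 50%. This illustrates why the gain depends on detection specificity as well as on the GCR.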
A commonly used although imperfect measure of BCI performance is the bit rate [29], originally derived from [30]. It can be computed from the following formula:
$$\mathrm{br} = \log_2 M + p \log_2 p + (1 - p) \log_2\left(\frac{1 - p}{M - 1}\right), \qquad (1)$$
where M is the number of classes and p is the accuracy of the P300 classifier. We report br in bits per minute and use it to compare the spelling accuracy with and without online correction as well as the accuracy that would have been observed if error detection had simply been followed by the opportunity to spell the letter again. To estimate the latter, we consider that the letter would be spelled again once, with an accuracy corresponding to the one observed online for each subject. Importantly, the respelling of a letter includes a short instruction indicating that the user should focus on the same target again [14].
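As a sketch, the bit rate can be computed as below, assuming the standard definition of [29]: log2(M) + p·log2(p) + (1 − p)·log2((1 − p)/(M − 1)) bits per selection, multiplied by the number of selections per minute. The function names and the example rate of 7 selections per minute are illustrative assumptions.

```python
import math

def bits_per_selection(n_classes, accuracy):
    """Information carried by one selection among n_classes items."""
    bits = math.log2(n_classes)
    if accuracy > 0.0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1.0:
        bits += (1.0 - accuracy) * math.log2((1.0 - accuracy) / (n_classes - 1))
    return bits

def bits_per_minute(n_classes, accuracy, selections_per_minute):
    """Bit rate in bits per minute for a given selection speed."""
    return bits_per_selection(n_classes, accuracy) * selections_per_minute

# 36-item matrix at 64% accuracy, with an assumed 7 selections per minute.
example_rate = bits_per_minute(36, 0.64, 7.0)
```

The guards on `accuracy` handle the limiting cases p = 0 and p = 1, where the corresponding term of the formula vanishes.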
For each of the above parameters, we also report the obtained values at the population level. Importantly, since the subjects did not show the same number of errors, computing the average values for TP, FN, and other parameters would yield a biased estimate of what could be predicted at the population level. Therefore, we instead report the values obtained by concatenating all the trials over subjects, that is, by considering the so-called metasubject. We refer to such quantities at the population level as metavalues. As an example, the metasensitivity corresponds to the sensitivity of the metasubject with mTP true positives and mFN false negatives (mTP and mFN being the sums of TP and FN over all 16 participants).
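The difference between averaging individual values and pooling trials into a metasubject can be seen in a small sketch. The two subjects below are invented to make the effect obvious.

```python
def meta_sensitivity(counts):
    """Sensitivity of the metasubject: pool TP and FN over subjects first."""
    m_tp = sum(tp for tp, fn in counts)
    m_fn = sum(fn for tp, fn in counts)
    return m_tp / (m_tp + m_fn)

def mean_sensitivity(counts):
    """Plain average of the individual sensitivities."""
    return sum(tp / (tp + fn) for tp, fn in counts) / len(counts)

# (TP, FN) per subject: one subject with few errors, one with many.
counts = [(9, 1), (10, 40)]
pooled = meta_sensitivity(counts)    # 19 / 60
averaged = mean_sensitivity(counts)  # (0.9 + 0.2) / 2
```

The pooled value weights each subject by the number of error trials contributed, whereas the plain average gives the subject with only ten errors as much influence as the one with fifty.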
Finally, the spelling accuracy using automatic correction was related to the specificity, the sensitivity, and the GCR using the following simple formula:
$$P_c = P \cdot \mathrm{Sp} + (1 - P) \cdot \mathrm{Se} \cdot \mathrm{GCR}, \qquad (2)$$
where P indicates the spelling accuracy in the absence of correction, P_c the spelling accuracy after automatic correction, and Sp and Se the specificity and sensitivity of error detection. The first term counts the correct letters that are left untouched; the second counts the errors that are both detected and properly corrected. The correction becomes useful as soon as P_c > P, which for a given initial spelling accuracy and GCR yields the following limit condition on error detection sensitivity and specificity:
$$\mathrm{Se} > \frac{P \, (1 - \mathrm{Sp})}{(1 - P) \, \mathrm{GCR}}. \qquad (3)$$

2.8. Additional Offline Analysis.
Subjects clearly separated between a low (below 0.75) and a high (above 0.85) ErrP detection specificity group. We compared those two groups in terms of performance, responses to questionnaire, and electrophysiological measures. Because of the small sample size in each group, we used the nonparametric Mann-Whitney test for statistical inference. The electrophysiological responses we considered for quantitative comparisons are the differences between averaged responses to target and non-target stimuli, or to correct and incorrect feedback, respectively. These were computed from the downsampled and bandpass filtered data (see Section 4). We compared the amplitudes and latencies of the negative and positive peaks of these differential responses.
For the responses to feedback, only the (last) session with correction was considered. We computed the difference between responses to correct and incorrect feedback on central electrode Cz, as in [14]. We typically observe a first negative component (between 250 and 450 ms) followed by a positive component (between 350 and 550 ms), which we denote by neg-ErrP and pos-ErrP, respectively. For the responses to flashes, we used the data from all sessions. We computed the difference between the averaged responses to target and non-target stimuli and selected the channels exhibiting the maximum absolute differences at the group level. We thus focused our comparison on channel P7 for a negative peak difference in time window 150-270 ms and on channel P8 for positive peak difference in time window 250-500 ms. These two components correspond to the N1 and P300 responses, respectively. N1 is known to be associated with automated stimulus processing that is affected by early attentional processes [31] and to predominate at parieto-occipital sites [27].
For each participant, both the amplitudes and latencies of the peaks of the above-defined components were used for subsequent analysis and statistical comparisons. Note that for technical reasons, the electrophysiological data of one participant (S08) could not be saved during the experiment. Therefore, all the results relying on offline evoked potential analysis were obtained from the other 15 subjects.
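Extracting a peak amplitude and latency from a difference wave within a time window can be sketched as follows. The function name and the synthetic waveform are hypothetical, and the sketch assumes that the first sample of the wave coincides with stimulus or feedback onset.

```python
def peak_in_window(diff_wave, sfreq, t_start, t_end, polarity):
    """Return (amplitude, latency in seconds) of the extreme value in a window.

    diff_wave: samples of the differential response, starting at onset.
    polarity: "negative" picks the minimum, "positive" the maximum.
    """
    first = int(round(t_start * sfreq))
    last = int(round(t_end * sfreq))
    segment = diff_wave[first:last + 1]
    pick = min if polarity == "negative" else max
    offset = pick(range(len(segment)), key=segment.__getitem__)
    return segment[offset], (first + offset) / sfreq

# Synthetic 600 ms difference wave at 100 Hz with two artificial peaks.
wave = [0.0] * 60
wave[35] = -3.0  # negative deflection at 350 ms
wave[48] = 4.0   # positive deflection at 480 ms
neg_amp, neg_lat = peak_in_window(wave, 100.0, 0.25, 0.45, "negative")
pos_amp, pos_lat = peak_in_window(wave, 100.0, 0.35, 0.55, "positive")
```

The two windows used here are those given in the text for the neg-ErrP and pos-ErrP components; the N1 and P300 windows would be handled the same way on channels P7 and P8.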

Performance in Spelling.
In fast mode (including the last session), the online spelling accuracy was 64% ± 21 (SD). In slow mode, it was 80% ± 18 (SD). Accounting for the delay between two trials (5.8 s), this corresponds to rates of 4.52 ± 1.2 (SD) and 4.31 ± 1 correct letters per minute, respectively. The information transfer rate is higher in fast mode, meaning that the loss in accuracy is compensated by the speed increase.
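The order of magnitude of these rates can be checked with a short calculation. It assumes that a sequence contains 12 flashes (six pairs of groups), that flashes follow each other at the 110 ms SOA, and that the 5.8 s delay is added to the flashing time of every trial; these assumptions are ours and the function name is illustrative.

```python
def correct_letters_per_minute(accuracy, n_sequences, soa=0.110,
                               flashes_per_sequence=12, inter_trial_delay=5.8):
    """Correct selections per minute under the assumed trial timing."""
    trial_duration = n_sequences * flashes_per_sequence * soa + inter_trial_delay
    return accuracy * 60.0 / trial_duration

fast = correct_letters_per_minute(0.64, n_sequences=2)  # 8.44 s per trial
slow = correct_letters_per_minute(0.80, n_sequences=4)  # 11.08 s per trial
```

Applied to the group mean accuracies, this gives roughly 4.5 and 4.3 correct letters per minute, close to the reported values, which are averages of individual rates rather than rates of the average accuracy.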

Performance in Error Detection.
The metavalues for the percent of true positive, false positive, false negative, and true negative are shown in Figure 4. The individual sensitivity and specificity values are given in Table 1. At the group level, the (meta) sensitivity, specificity, and accuracy reached 63%, 88%, and 78%, respectively.
Most striking is the split into two groups according to the individual specificities in error detection. Six subjects have quite low specificities, below 75%, while specificities for the other 10 subjects rise above 85% (Figure 5(a)). At the group level (see Table 1), the GCR reaches 34%, meaning that out of one hundred correctly detected errors, thirty-four were properly corrected. The metavalue for θ is 36%, which is very close to the GCR. Over subjects, θ correlates with global spelling accuracy (P < 0.0001, r = 0.87). Specifically, the fewer the errors during initial spelling, the higher the probability of effectively correcting those errors (Figure 5(b)).
Over the whole group, the spelling accuracy rose by 0.5% due to automatic error correction, relative to a 62% initial accuracy (i.e., without correction). However, interindividual variability proved quite large. Automatic error correction yielded an improvement in 50% of the subjects (with a maximum gain of 12%), while it caused a degradation of spelling accuracy in 37.5% of the subjects (with a maximum drop of 19%).
Table 2 shows the bit rates corresponding to the three compared spelling modes: no correction, predicted correction based on automatic error detection and hypothetical respelling, and online automatic correction. Over subjects, the information transfer rate decreases when we move from automatic correction to automatic detection and respelling. Moreover, the highest bit rate is obtained for the no correction case. However, when restricting the comparison to the high specificity group, the highest bit rate is obtained during automatic correction.
Figure 6 provides a graphical representation of individual performance in terms of sensitivity and specificity in error detection. It also summarizes the behavior of the different groups, by displaying the metaspelling accuracy corresponding to the whole group, the low specificity group and the high specificity group, respectively. For those three groups, it further emphasizes the boundaries given by (3), that is, the minimum required trade-off between specificity and sensitivity. Specifically, given the observed GCR and spelling accuracy (in the absence of any correction) for each group, each boundary represents the limit above which the automatic correction becomes fruitful. The further a subject lies above the boundary, the higher the expected increase in spelling accuracy (e.g., S10 had a 12% gain of accuracy). Conversely, the further a subject lies below the boundary, the larger the expected drop of spelling accuracy (e.g., S01 had a 19% drop of accuracy). Table 3.
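The trade-off behind these boundaries can be explored numerically. The sketch below assumes a simple model in which a letter ends up correct either because it was correct and not flagged, or because it was wrong, flagged, and well corrected, that is, Pc = P·Sp + (1 − P)·Se·GCR. This model and the function names are our assumptions, and the inputs are the rounded group-level values quoted in the text.

```python
def corrected_accuracy(p, sensitivity, specificity, gcr):
    """Expected spelling accuracy after automatic correction (assumed model)."""
    return p * specificity + (1.0 - p) * sensitivity * gcr

def minimum_sensitivity(p, specificity, gcr):
    """Sensitivity above which correction improves accuracy (Pc > P)."""
    return p * (1.0 - specificity) / ((1.0 - p) * gcr)

# Rounded group-level values: P = 62%, Se = 63%, Sp = 88%, GCR = 34%.
pc = corrected_accuracy(0.62, 0.63, 0.88, 0.34)
se_limit = minimum_sensitivity(0.62, 0.88, 0.34)
```

With these inputs the predicted accuracy is about 62.7% and the sensitivity limit about 58%, so the group sits only slightly above its boundary, in line with the small average gain reported.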
Individual responses for Q1, Q2, and Q8 are detailed in Table 1. Interestingly, the group average answer to Q2 equals 6.6, which is very close to the observed global spelling accuracy of 68.5%. Moreover, the response to Q2 correlates with global spelling accuracy over individuals (P < 0.05, r = 0.61). It also correlates with responses to Q1 (P < 0.01, r = 0.74). Answers to Q1 and Q2 further correlate significantly with error detection specificity (resp. P < 0.05, r = 0.59, and P < 0.01, r = 0.62), as well as with the observed gain in spelling accuracy due to automatic correction (P < 0.05, r = 0.54). The gain in spelling accuracy also correlates positively with the subject's answer about the usefulness of automatic correction (Q8, P < 0.05, r = 0.61).
Table 1: Individual and group performance in spelling, error detection, and correction. The first three columns indicate the subject number and demographical information such as gender and age. The next two columns indicate the spelling accuracy (percent of correct letters) for the whole five sessions (mixing words spelled with 2 and 4 sequences) and for the fifth session alone (the one dedicated to online evaluation of error correction, without considering the online correction), respectively. The next four columns indicate the individual sensitivity, specificity, θ, and gain in spelling accuracy obtained in the last session, reported in percent. The last four columns show the individual responses to the most relevant questions (see Table 2 for a full description). The last three lines provide results at the group level (first line) and after having distinguished between low specificity (second line) and high specificity (third line) performers.

Finally, to the last (binary) question: "Did you prefer the spelling with or without correction?" a slight majority reported a preference for no correction (56%). However, when distinguishing between the low and high specificity groups, we find that 83% of the former preferred spelling without correction whereas 60% of the latter preferred spelling with correction.

Electrophysiology.
The grand average feedback-related responses in correct and incorrect trials as well as their difference are depicted in Figure 7. Scalp topographies are also represented for each condition and for the two latencies corresponding to the negative and positive peaks of the difference. The negative peak average latency is 350 ms ± 49 (SD) after feedback onset, while the positive peak average latency is 480 ms ± 51 (SD).
The pos-ErrP amplitude correlates with the accuracy of error detection (P < 0.01, r = 0.64, Figure 8(a)), while the neg-ErrP amplitude correlates with the sensitivity of error detection (P < 0.01, r = −0.70, Figure 8(b)). Hence the larger the difference between responses to correct and incorrect trials, the more efficient the automatic error detection. Besides, we found a significant relationship between the initial spelling accuracy (without correction) and both the amplitude and the latency of the pos-ErrP. Indeed, the higher the spelling accuracy, the larger (P < 0.05, r = 0.58) and the earlier (P < 0.05, r = −0.55) the positivity.
Similarly, the group average responses to target and nontarget stimuli as well as the difference between the two are depicted in Figure 9. Figure 9 also shows the topographies obtained for those three responses, at the peak latency of the N1 (229 ms ± 19 (SD)) and P300 components (380 ms ± 72 (SD)).
Figure 6: Error detection sensitivity as a function of error detection specificity. All subjects are represented as well as the global metasubject (triangle) and the two metasubjects for the low specificity (circle) and high specificity (square) groups, respectively. The lines are the boundaries above which the automatic correction becomes fruitful, for the three groups (plain, dotted, and dashed, resp.).
The amplitude of the N1 component correlates with both the global spelling accuracy (P < 0.05, r = −0.55, Figure 8(c)) and θ (P < 0.01, r = −0.72). The larger the negative difference between the responses to target and non-target stimuli, the higher the spelling accuracy and the θ. This corroborates what has been already observed in healthy subjects using the P300-speller, namely, that spelling accuracy depends on the ability to focus on the desired character, which yields a larger N1 response and provides a complementary feature to the P300 in order to achieve good classification [26].
Note that for correlations involving neg-ErrP or N1 amplitudes, the correlation coefficients are negative since the amplitudes are negative.
3.6. Between-Group Differences. The low specificity group spelled significantly less accurately than the high specificity group (53% compared to 78%, P < 0.05). They also benefited significantly less from automatic correction (P < 0.05): correction improved the spelling accuracy by 4% in the high specificity group, while it degraded it by 5% in the low specificity group (Table 1). Accordingly, participants showing a low specificity tended to perceive the machine as less efficient (P < 0.05) and felt that they had less control over the BCI (P < 0.05). In agreement with their lower spelling accuracy, the subjects in the low specificity group also present a significantly lower value for θ (0.29 compared to 0.45, P < 0.05).
Finally, the latency of neg-ErrP peak proved shorter (P < 0.05, Figure 10(b)) and the amplitude of the N1 peak proved larger (P < 0.01, Figure 10(c)) for the high specificity group.
We found no significant differences between the two groups on other physiological parameters or on answers to questions Q3 to Q8.

Performance in Spelling and Error Detection.
We used our own implementation of the P300-speller BCI for the evaluation of online automatic error correction in 16 healthy volunteers. We considered two spelling conditions, namely, a slow and a fast mode, whose trial length was guided by our own previous experiments [16,22].
In the current study, the online spelling accuracy proved fairly high. On average, participants could spell about 4.5 correct letters per minute, compared with 1.57 correct letters per minute in [14]. This high performance level might be partly attributable to the use of the xDAWN algorithm for spatial filtering [32] and to the departure from the row-column paradigm adapted from [23]. This initially high spelling accuracy has to be kept in mind when interpreting the outcome of error correction. Indeed, the effectiveness of error detection and correction might depend on the ongoing bit rate, which directly relies on the speed-accuracy trade-off that was targeted when choosing specific stimulation settings (i.e., short or long flash durations, short or long sequences).
Regarding the responses to feedback, we observed fronto-central evoked signals whose time courses resemble the ones that have been described recently in a BCI context [12,14,16]. As we already noted in [23], those components, which are usually referred to as interaction ErrP in BCI [2], exhibit spatial and temporal patterns that strongly resemble those of feedback responses classically reported in cognitive neuroscience studies [8]. This suggests that although contextual modulations can be observed [33,34], responses to feedback may present spatial and temporal components that are independent of the context but specific to the core process of learning from external outcomes. This raises the question of which part of the signal and underlying process is specific to a BCI type of interaction and which part is not.
As in a couple of earlier studies [13,14], we showed that those responses to feedback can be detected online, from single trials. At the group level, we obtained 88% specificity and 63% sensitivity. For comparison, Dal Seno and collaborators tested two subjects and obtained an average specificity and sensitivity of 68% and 62%, respectively [13]. In nine healthy subjects, Spuler and collaborators report a 96% specificity and 40% sensitivity [14]. Note that in the latter study, the authors used a biased classifier to favor specificity. Indeed, a high specificity guarantees that correctly spelled letters will not be mistakenly flagged as errors.
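For readers less familiar with these two measures, the sketch below shows how they can be computed from single-trial detection outcomes. It is only an illustration: the function name `detection_rates` and the toy labels are hypothetical and do not come from the study's data or code.

```python
def detection_rates(was_error, flagged):
    """Return (sensitivity, specificity) of an error detector.

    was_error[i] is True when feedback i was actually erroneous;
    flagged[i] is True when the detector labelled feedback i as an error.
    Both arguments are hypothetical inputs used for illustration only.
    """
    pairs = list(zip(was_error, flagged))
    tp = sum(1 for e, f in pairs if e and f)          # errors correctly flagged
    fn = sum(1 for e, f in pairs if e and not f)      # errors missed
    tn = sum(1 for e, f in pairs if not e and not f)  # correct letters left alone
    fp = sum(1 for e, f in pairs if not e and f)      # correct letters wrongly flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy example: 2 true errors (1 detected), 4 correct letters (1 wrongly flagged).
toy_was_error = [True, True, False, False, False, False]
toy_flagged = [True, False, False, False, False, True]
toy_sensitivity, toy_specificity = detection_rates(toy_was_error, toy_flagged)
```

With this convention, a low specificity means that correctly spelled letters are often discarded, which is the situation a biased classifier tries to avoid.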
In our experiment, we did not use a biased classifier. However, specificity was higher than sensitivity for most of the subjects. This is because spelling accuracy was fairly high, which yields many more training samples for the correct than for the incorrect feedback responses. Indeed, sensitivity highly depends upon how well the error class is learned. As an example, subject S07 made only 13 errors, whereas the authors of [14] recommend the use of at least 50 samples to train the ErrP classifier. Hence, part of the interindividual variability in sensitivity is simply due to the variability in the initial spelling accuracy between subjects, which directly affected the quality of the training. This is not the case for specificity since, as a corollary, high spelling accuracies yield a lot of training samples for the correct class. Even the poorest performer (S03) provided 84 training samples.
Nevertheless, the current results for error detection are slightly worse than the ones we obtained offline in a previous experiment [16]. This might be due to several factors. One factor might be the offline use of ICA to clean up the signals in our previous study. A second factor is that the fairly fast modes used here induced shorter intervals between two consecutive feedbacks, which might have diminished the expectancy for each new outcome, a known possible cause for smaller feedback responses [35,36]. The latter effect might not be seen in patients, for whom more sequences would typically be used. In any case, there is certainly room for improvement. At least two lines of research are worth mentioning in that respect. One is to make use of preexisting databases to tune the algorithms and suppress individual training [14]. However, this does not seem to be a good option in patients [14]. The second option, which could be used in combination with the first one, would be to use adaptive algorithms and keep updating the individual parameters while using the BCI. Reinforcement learning methods might prove very useful in this context [35].

Global Performance in Automatic Error Correction. The novelty of our approach comes from the implementation and online evaluation of automatic error correction. Whenever an error is detected, we simply replace the supposedly erroneous choice by our classifier's second best guess. We evaluated the efficiency of this automatic correction through the computation of the good correction rate (GCR). At the group level, the GCR was equal to 34%. On the one hand, this is much higher than chance level (1/35 = 2.86%), which speaks in favor of applying this strategy for automatic error correction. On the other hand, this is too low to yield a significant gain in spelling accuracy (0.5% only). However, the meta-bit-rate is slightly better with correction than with detection only. This highlights the efficiency of even poor automatic error correction compared to error detection alone, which requires cumbersome additional spelling. In order to improve the GCR, one promising option is to use priors from a dictionary in order to bias the classifier [36]. This might improve both the initial spelling accuracy and the ensuing GCR.
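The replacement rule described above can be sketched as follows. This is a minimal illustration and not the implementation used in the study: the function names, the dictionary of classifier scores, and the toy trials are all hypothetical. In particular, the GCR is assumed here to be the fraction of truly erroneous, detected trials in which the second best guess equals the target.

```python
def rank_items(scores):
    """Order candidate items from most to least likely (hypothetical scores)."""
    return sorted(scores, key=scores.get, reverse=True)


def corrected_choice(scores, error_detected):
    """Return the spelled item, switching to the second best guess on detection."""
    ranked = rank_items(scores)
    if error_detected and len(ranked) > 1:
        return ranked[1]
    return ranked[0]


def good_correction_rate(trials):
    """Assumed GCR: share of detected true errors fixed by the second guess.

    trials is an iterable of (target, scores, error_detected) tuples.
    """
    detected_errors = 0
    well_corrected = 0
    for target, scores, error_detected in trials:
        first_guess = rank_items(scores)[0]
        if error_detected and first_guess != target:
            detected_errors += 1
            if corrected_choice(scores, True) == target:
                well_corrected += 1
    return well_corrected / detected_errors if detected_errors else float("nan")


# Chance level when one of the 35 remaining items is picked at random.
CHANCE_LEVEL = 1 / 35

toy_trials = [
    ("A", {"A": 0.2, "B": 0.9, "C": 0.1}, True),   # error detected, second guess is right
    ("B", {"A": 0.5, "B": 0.1, "C": 0.9}, True),   # error detected, second guess is wrong
    ("C", {"A": 0.1, "B": 0.2, "C": 0.9}, False),  # correct letter, nothing flagged
]
toy_gcr = good_correction_rate(toy_trials)
```

A dictionary prior, as suggested in the text, would act on the scores before ranking, so the same rule could be reused unchanged.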

4.3. The Importance of Interindividual Differences. Beyond results at the group level, we observe a large interindividual variability. Specificity of error detection enabled us to clearly distinguish between two groups of participants: the low (<75%, N = 6) and the high (>85%, N = 10) specificity group. Our analysis revealed that those two groups differ not only in terms of specificity but also in terms of electrophysiological responses, initial spelling accuracy, θ values, and spelling accuracy gain as well as in their subjective perception of the BCI experience. One obvious possible explanation is that the high specificity group corresponds to subjects who were more engaged in the task, which yields electrophysiological responses with a large signal-to-noise ratio. Indeed, the N1 and P300 responses are known to reflect the participant's involvement in the task ([27,37], see also [38] for a review on the P300 and [31] for a review on the N1). The P300 has been shown to increase with motivation in a BCI context [39], as has the ensuing feedback-related negativity (the neg-ErrP) [16,33]. As an expected consequence of a higher signal-to-noise ratio, the spelling accuracy increases, as does θ. We indeed observe that physiological amplitudes in response to flashes, spelling accuracy, and θ are strongly and positively correlated. Similarly, the larger the ErrP, the more efficient the error detection, especially in terms of specificity. This is in agreement with the correlations we observed between the spelling accuracy, θ, and specificity and the perception of the BCI by the user (Q1 and Q2). Indeed, a high θ associated with an efficient ErrP classification yields a significant improvement due to automatic correction.
All the correlations and group differences we observed are consistent with the hypothesis of a role of task engagement. However, we should remain cautious in our interpretations, and this assumption requires further in-depth investigation. Indeed, we could not show any significant correlation between objective measures and the subjective responses to the motivation-related questions (viz. Q5 and Q7).
Nevertheless, our conclusion can be refined by looking closely at the results in the high specificity group (N = 10). In this group, only 2 subjects experienced a drop in spelling accuracy due to automatic error correction. Accordingly, 6 subjects reported a preference for spelling with automatic correction. Six subjects showed a higher bit rate when using the automatic correction compared to no correction, and 7 out of the 10 participants obtained a higher bit rate with automatic correction compared to automatic detection with predicted respelling. Three out of the 4 participants who reported a preference for spelling with no correction did not improve their spelling accuracy with correction. Interestingly, the remaining one (S10), who obtained the largest improvement (+12%), reported verbally that efficient correction required too much attentional effort. All those results demonstrate the putative usefulness of automatic error correction but highlight the fact that it should certainly be used in a different way depending on the user's profile.

Requirements for Automatic Correction to Be Relevant.
In theory, any initial spelling accuracy could benefit from automatic correction, provided that the sensitivity, the specificity, and the GCR are high enough. All studies concur in stating that specificity is a primary requirement, in order to avoid the highly frustrating situation of automatically discarding a correctly spelled item. But contrary to Visconti and colleagues [15], our results show that the better the spelling accuracy, the more relevant the correction. At first, this might sound counter-intuitive, since the higher the spelling accuracy, the rarer the errors and the more difficult it is to detect them. However, we do observe that good performers achieve better correction. Indeed, we show that both ErrP detection accuracy and θ (the subsequent ability to automatically correct for errors) are closely related to the spelling accuracy. In other words, the higher the spelling accuracy, the higher the performance in error detection and correction. As also supported by larger averaged ErrP and P300 responses, these results suggest that the more the subject engages in the task, the higher the performance in terms of both spelling accuracy and error correction. This is a strong indication in favor of a possible use of the P300-speller to train subjects in their abilities to focus attention [40].
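The condition "high enough" can be made concrete with a simple back-of-the-envelope model, sketched below. It is an assumption-laden illustration and not an analysis from the study: it treats sensitivity, specificity, and GCR as fixed average rates, takes the GCR to be the probability that a detected true error is replaced by the intended item, and uses hypothetical numbers. Under these assumptions, a letter ends up correct either because it was correct and not flagged, or because it was wrong, flagged, and well corrected.

```python
def accuracy_after_correction(p, sensitivity, specificity, gcr):
    """Expected accuracy once automatic correction is applied.

    p           : initial spelling accuracy (probability of a correct letter)
    sensitivity : probability that a true error is flagged
    specificity : probability that a correct letter is left untouched
    gcr         : probability that a flagged true error is well corrected
                  (assumed definition, see the lead-in)
    """
    kept_correct = p * specificity
    repaired_errors = (1 - p) * sensitivity * gcr
    return kept_correct + repaired_errors


def correction_helps(p, sensitivity, specificity, gcr):
    """True when the expected accuracy with correction exceeds the initial one."""
    return accuracy_after_correction(p, sensitivity, specificity, gcr) > p


# Hypothetical settings that differ only in specificity.
helps_high_spec = correction_helps(0.8, 0.6, 0.95, 0.5)
helps_low_spec = correction_helps(0.8, 0.6, 0.85, 0.5)
```

In this toy model, correction pays off only when the errors repaired, (1 − p) × sensitivity × GCR, outweigh the correct letters lost, p × (1 − specificity), which illustrates why specificity is the primary requirement.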

Conclusion
The BCI presented here is the first online P300-speller that automatically replaces the selected item with another one when an ErrP is detected. Our results are competitive in terms of ErrP detection, although they could probably be improved using adaptive training and a biased classifier. The automatic correction could also be improved, possibly by using the information from a probabilistic dictionary. However, it already proved relevant in terms of bit rate, compared to classical automatic ErrP detection alone. It also proved beneficial in most of the participants who had the best initial spelling accuracy. Importantly, the correction needs to be adjusted to each participant, depending on their initial spelling accuracy and preference.