Handwritten digit recognition is an important benchmark task in computer vision. Learning algorithms and feature representations which offer excellent performance for this task have been known for some time. Here, we focus on two major practical considerations: the relationship between the amount of training data and error rate (corresponding to the effort of collecting training data to build a model with a given maximum error rate) and the transferability of models' expertise between different datasets (corresponding to their usefulness for general handwritten digit recognition). While the relationship between amount of training data and error rate is very stable and to some extent independent of the specific dataset used—only the classifier and feature representation have a significant effect—it has proven impossible to transfer low error rates on one or two pooled datasets to similarly low error rates on another dataset. We have called this weakness brittleness.
Intelligent image analysis is an interesting research area in Artificial Intelligence that is also relevant to a variety of current open research problems. Handwritten digit recognition is a well-researched subarea within this field, concerned with learning models to distinguish presegmented handwritten digits. The application of machine learning techniques over the last decade has proven successful in building systems which are competitive with human performance and which perform far better than the manually designed classical AI systems used in the early days of optical character recognition. However, not all aspects of such models have been previously investigated.
Here, we systematically investigate two new aspects of such systems.
For the first aspect, we have found that all three datasets considered here give similar performance relative to absolute training set size. This indicates that the quality of input data is similar across these three datasets. A relatively small number of high-quality samples is already sufficient for acceptable performance. Accuracy as a function of absolute training set size follows a smooth asymptotic curve, in which low error rates (below 10%) are reached quite quickly, but very low error rates are reached only after sustained effort.
For the second aspect, we were surprised to find that none of the considered learning systems were able to transfer their expertise to other datasets. In fact, the error rates on the other datasets were always significantly higher, often unacceptably so.
This may point to a general weakness of present intelligent image analysis systems. We have named this weakness brittleness.
Small differences in preprocessing methods, which have not been documented in sufficient detail (except perhaps in [ ]), may be responsible for it.
More detailed documentation of preprocessing methods and classification systems, in the form of Open Source code, would be needed to investigate how to build more robust learning systems for this domain, and possibly for intelligent image analysis systems in general.
We used two well-known datasets (USPS, MNIST) and one relatively unknown dataset (DIGITS) for handwritten digit recognition. We created the latter ourselves, so we had complete control over, and documentation of, all preprocessing steps, documented in [ ].
The US Postal (USPS) handwritten digit dataset is derived from a project on recognizing handwritten digits on envelopes [ ]. It consists of 16×16 grayscale images, with 7,291 training and 2,007 test samples.
(a) USPS, (b) MNIST.
The MNIST dataset, one of the most famous in digit recognition, is derived from the NIST dataset and was created by LeCun et al. [ ]. It contains 60,000 training and 10,000 test samples, stored as 28×28 grayscale images.
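MNIST is nowadays easy to obtain programmatically. As an illustration (this API postdates the original experiments and is not what was used here), the following sketch loads it via scikit-learn and applies the canonical split:

```python
from sklearn.datasets import fetch_openml

# Fetch MNIST as 70,000 flat 784-dimensional rows (28x28 pixels each).
# By convention, the first 60,000 rows form the training set and the
# last 10,000 the test set.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test = X[:60000] / 255.0, X[60000:] / 255.0  # scale to [0, 1]
y_train, y_test = y[:60000], y[60000:]
```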
The DIGITS dataset was created in 2005, based on samples from students of a lecture given by the author. Each student contributed 100 samples, equally distributed among the digits from 0 to 9. The complete preprocessing is described in [ ].
Figure [ ] shows sample digits from the DIGITS dataset.
(a) digits (…).
We considered two feature sets: pixel-based features and gradient-based features.
Additional feature sets could have been considered, but we felt that these two sets would be sufficient for the purpose of this paper.
We considered a variety of classifiers in three groups: instance-based learners, support vector machines, and convolutional neural networks. All of the classifiers except convNN were taken from WEKA [ ].
Initial experiments indicated that—probably because of the high number of classes—a simple nearest-neighbor classifier (IBk with k = 1) already performed well among the instance-based methods. We therefore used IBk1 with three distance measures: Euclidean distance, normalized cross-correlation (NCC), and tangent distance (TD).
We also considered linear, polynomial, and RBF kernel support vector machines (SVM; see, e.g., [ ]).
From earlier experiments, we already knew optimized parameter settings for these classifiers on the DIGITS training set. We extended these experiments and determined similarly optimized parameter settings for the other classifier variants.
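Purely as an illustration, the classifier lineup can be approximated with scikit-learn analogues of the WEKA and custom implementations actually used; the NCC metric below and the polynomial degree are our assumptions, not the paper's exact settings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def ncc_distance(a, b):
    """1 - normalized cross-correlation between two flat pixel vectors."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return 1.0 - float(a @ b) / len(a)

# Rough analogues of the classifier variants used in the experiments.
# A callable metric forces brute-force search and is slow; illustration only.
classifiers = {
    "IBk1 euclidean": KNeighborsClassifier(n_neighbors=1),
    "IBk1 NCC":       KNeighborsClassifier(n_neighbors=1, metric=ncc_distance),
    "SVM linear":     SVC(kernel="linear"),
    "SVM polynomial": SVC(kernel="poly", degree=3),  # degree is a guess
    "SVM RBF":        SVC(kernel="rbf"),
}
```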
The de facto standard for handwritten digit recognition is the convolutional network (convNN, e.g., LeNet-5) by [ ].
Training was done using the following parameters, as these proved to give the best results on the original MNIST training/test sets (according to [ ]): initial learning rate 0.001; minimum learning rate 0.00005; learning-rate decay factor 0.79418335, applied every two epochs until the minimum learning rate was reached; at least 52 epochs with elastically deformed training inputs; and exactly 5 polishing epochs with non-deformed training input at a learning rate of 0.0001.
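For concreteness, a minimal sketch of this schedule (the function name and the 0-based epoch indexing are ours; only the constants come from the text):

```python
INITIAL_LR = 0.001
MIN_LR = 0.00005
DECAY = 0.79418335       # applied every two epochs
DEFORMED_EPOCHS = 52     # epochs on elastically deformed inputs
POLISH_EPOCHS = 5        # final epochs on non-deformed inputs
POLISH_LR = 0.0001

def learning_rate(epoch):
    """Learning rate for a 0-based epoch index."""
    if epoch >= DEFORMED_EPOCHS:             # polishing phase
        return POLISH_LR
    lr = INITIAL_LR * DECAY ** (epoch // 2)  # decay every two epochs
    return max(lr, MIN_LR)

for e in range(DEFORMED_EPOCHS + POLISH_EPOCHS):
    print(e, round(learning_rate(e), 6))
```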
In this section, we present the full results of our experiments.
This section is concerned with analyzing the relationship between training set size and recognition accuracy, depending on dataset and learning algorithm.
Figures [ ] and [ ] show absolute training set size versus test set accuracy for pixel-based features, for the instance-based and SVM classifiers, respectively.
Abs. training set size versus test set accuracy for pixel-based features on MNIST, USPS, and DIGITS (IBk variants).
Abs. training set size versus test set accuracy for pixel-based features on MNIST, USPS, and DIGITS (SVM variants).
When using instance-based learning, the three datasets behave remarkably similarly. Only for the tangent distance variant does DIGITS perform noticeably worse. We presume this is due to the way this dataset was collected: digits had to be written into a regular grid, which forced a very uniform orientation. As tangent distance was constructed to compensate for non-uniform orientations—which is not needed here—the additional degrees of freedom of this method may have led to overfitting on this dataset, resulting in inferior performance.
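To make the degrees-of-freedom argument concrete, here is a heavily simplified one-sided tangent distance sketch with a single rotation tangent (a full implementation, e.g., following Simard et al., uses several transformations on both sides):

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_tangent(img, eps=1.0):
    """Finite-difference derivative of an image w.r.t. rotation (degrees)."""
    plus = rotate(img, eps, reshape=False, order=1)
    minus = rotate(img, -eps, reshape=False, order=1)
    return ((plus - minus) / (2 * eps)).reshape(-1, 1)

def one_sided_tangent_distance(x, p):
    """min over a of ||x - (p + T a)||: distance from image x to the
    tangent plane of prototype p, solved as a least-squares problem."""
    T = rotation_tangent(p)
    diff = (x - p).ravel()
    a, *_ = np.linalg.lstsq(T, diff, rcond=None)
    return float(np.linalg.norm(diff - (T @ a)))
```

If all digits already share one orientation, the extra coefficient can only absorb noise, which is consistent with the overfitting we observe on DIGITS.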
When using SVM learning, the picture is similar, albeit less clear. Only for the polynomial variant do we see very similar behavior on the three datasets. For the other two variants, some differences appear. In particular, MNIST performs very badly with the RBF kernel variant. We presume that this is due to the high variance within MNIST and the higher number of parameters for the RBF kernel, such that the amount of training data is no longer sufficient for stable parameter estimation. Also, parameters were optimized for the DIGITS training set, and this may have led to some overfitting.
In a second step, we analyzed gradient-based features. Since pixel-based features are a very imprecise way to encode information about handwritten digits, we chose to use direction-specific feature maps, which were previously found to work best (see Section [ ]).
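The exact feature definition follows the cited reference; as a rough stand-in, direction-specific feature maps can be computed by splitting Sobel gradient energy into orientation bins:

```python
import numpy as np
from scipy.ndimage import sobel

def gradient_feature_maps(img, n_bins=4):
    """Split gradient magnitude into n_bins orientation-specific maps.
    A rough stand-in for the gradient features used in the experiments."""
    gy = sobel(img, axis=0)
    gx = sobel(img, axis=1)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)   # undirected orientation in [0, pi)
    maps = []
    for b in range(n_bins):
        lo, hi = b * np.pi / n_bins, (b + 1) * np.pi / n_bins
        maps.append(np.where((ang >= lo) & (ang < hi), mag, 0.0))
    return np.stack(maps)                     # shape: (n_bins, H, W)
```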
Figure [ ] shows the corresponding results for the instance-based classifiers.
Abs. training set size versus test set accuracy for gradient features on MNIST, USPS, and DIGITS (IBk variants).
Figure [ ] shows the corresponding results for the SVM variants.
Abs. training set size versus test set accuracy for gradient features on MNIST, USPS, and DIGITS (SVM variants).
The shape of all learning curves is remarkably similar and might be estimated with just a few data points. They seem to depend on the learning algorithm, the feature representation, and to a lesser extent on the specific dataset in question (e.g., dataset complexity, sample distribution, or other factors).
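As an illustration of estimating such a curve from a few points, the sketch below fits an inverse power law err(n) = a·n^(−b) + c; both the functional form and the data points are our assumptions for illustration, not measured values from our experiments:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # assumed asymptotic form: error falls as a power of training set size
    return a * np.power(n, -b) + c

# hypothetical (training set size, error %) points, for illustration only
sizes = np.array([100, 300, 1000, 3000, 10000, 30000, 60000], dtype=float)
errors = np.array([25.0, 14.0, 8.0, 5.0, 3.2, 2.1, 1.6])

(a, b, c), _ = curve_fit(power_law, sizes, errors,
                         p0=(100.0, 0.5, 1.0), maxfev=10000)
print(f"err(n) ~ {a:.1f} * n^(-{b:.2f}) + {c:.2f}")
```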
All previous results mean nothing if the task has not really been solved. Since it is clear that—small differences between the datasets notwithstanding—all these datasets deal with writer-independent recognition of handwritten digits and were created by disjoint sets of writers (which were also properly distributed between training and test set), we estimated the quality of each model by testing it on the other datasets. First, we converted both DIGITS and USPS into MNIST format by centering each digit in a 28×28 grayscale image, following MNIST conventions (sketched below).
(a) USPS, (b) DIGITS, both reformatted to MNIST format.
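A minimal sketch of such a conversion, assuming the usual MNIST convention (digit scaled into a 20×20 box, then centered by center of mass in a 28×28 frame); the exact resampling details of our conversion may differ:

```python
import numpy as np
from scipy.ndimage import center_of_mass, shift, zoom

def to_mnist_format(img):
    """Render a grayscale digit (2-D array, background = 0) into 28x28."""
    rows = np.any(img > 0, axis=1)             # crop to the digit's
    cols = np.any(img > 0, axis=0)             # bounding box
    img = img[rows][:, cols]
    img = zoom(img, 20.0 / max(img.shape), order=1)  # longer side -> ~20 px
    out = np.zeros((28, 28))
    r0 = (28 - img.shape[0]) // 2
    c0 = (28 - img.shape[1]) // 2
    out[r0:r0 + img.shape[0], c0:c0 + img.shape[1]] = img
    cy, cx = center_of_mass(out)               # shift the center of mass
    return shift(out, (13.5 - cy, 13.5 - cx), order=1)  # to the image center
```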
First, we trained on each training set in turn and tested on the other two sets. Note that the training sets are of different sizes: for example, MNIST builds a model from 60,000 samples, while DIGITS builds one from only about 1,800. According to the results from the previous section, we would expect a range of about one order of magnitude (best versus worst) in error rates on the test set corresponding to the training set, with MNIST better than USPS and USPS better than DIGITS. This is exactly what we observed. Surprisingly, the performance on the other test sets is much worse.
This time, we also tested LeCun's original convolutional neural network model, as reconstructed by [ ], which was trained on the same data.
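The protocol itself is simple; a hedged sketch (the names are ours: `datasets` maps a dataset name to its train/test splits, `make_clf` builds a fresh classifier):

```python
def cross_dataset_errors(datasets, make_clf):
    """Train on each dataset's training split, test on all test splits.
    Returns {(trained_on, tested_on): error rate in %}."""
    errors = {}
    for train_name, (Xtr, ytr, _, _) in datasets.items():
        clf = make_clf()
        clf.fit(Xtr, ytr)
        for test_name, (_, _, Xte, yte) in datasets.items():
            errors[(train_name, test_name)] = \
                100.0 * (clf.predict(Xte) != yte).mean()
    return errors
```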
Tables [ ] and [ ] show the results for pixel-based and gradient-based features, respectively.
Dataset independence for pixel-based features, each dataset separately.
Classifier | Trained on | Error % on MNIST | Error % on DIGITS | Error % on USPS | Avg. error on other sets ÷ own error |
IBk1 euclidean | MNIST | 3.09 | 19.21 | 17.49 | 5.94x |
IBk1 euclidean | DIGITS | 36.22 | 16.59 | 52.72 | 2.68x |
IBk1 euclidean | USPS | 28.41 | 55.01 | 5.33 | 7.83x |
IBk1 NCC | MNIST | 2.83 | 17.65 | 13.70 | 5.54x |
IBk1 NCC | DIGITS | 32.42 | 14.14 | 44.59 | 2.72x |
IBk1 NCC | USPS | 26.06 | 51.17 | 4.58 | 8.43x |
IBk1 TD | MNIST | 1.51 | 13.53 | 5.63 | 6.34x |
IBk1 TD | DIGITS | 25.88 | 10.02 | 37.77 | 3.18x |
IBk1 TD | USPS | 10.51 | 36.47 | 3.64 | 6.45x |
SVM linear | MNIST | 6.83 | 34.97 | 16.24 | 3.75x |
SVM linear | DIGITS | 31.57 | 16.09 | 45.54 | 2.40x |
SVM linear | USPS | 40.64 | 63.25 | 6.53 | 7.95x |
SVM polynomial | MNIST | 1.27 | 16.20 | 11.56 | 10.93x |
SVM polynomial | DIGITS | 30.05 | 11.47 | 47.68 | 3.39x |
SVM polynomial | USPS | 44.78 | 74.33 | 4.43 | 13.44x |
SVM RBF | MNIST | 4.31 | 53.34 | 20.78 | 8.60x |
SVM RBF | DIGITS | 51.50 | 33.74 | 60.09 | 1.65x |
SVM RBF | USPS | 81.05 | 89.98 | 7.37 | 11.60x |
convNN | MNIST | 0.74 | 8.24 | 3.48 | 7.92x |
convNN | DIGITS | 21.43 | 5.73 | 30.0 | 4.49x |
convNN | USPS | 4.25 | 27.56 | 3.08 | 5.16x |
Dataset independence for gradient-based features, each dataset separately.
Classifier | Trained on | Error % on MNIST | Error % on DIGITS | Error % on USPS | Avg. error on other sets ÷ own error |
IBk1 euclidean | MNIST | 1.29 | 12.08 | 5.98 | 7.00x |
IBk1 euclidean | DIGITS | 21.91 | 7.29 | 37.62 | 4.08x |
IBk1 euclidean | USPS | 10.30 | 33.07 | 3.49 | 6.21x |
SVM linear | MNIST | 1.34 | 12.92 | 5.63 | 6.92x |
SVM linear | DIGITS | 19.76 | 5.12 | 30.54 | 4.91x |
SVM linear | USPS | 14.62 | 39.37 | 3.34 | 8.08x |
SVM polynomial | MNIST | 0.47 | 8.07 | 4.43 | 13.30x |
SVM polynomial | DIGITS | 17.81 | 3.67 | 24.86 | 5.81x |
SVM polynomial | USPS | 14.68 | 39.03 | 2.79 | 9.63x |
SVM RBF | MNIST | 0.57 | 8.30 | 4.28 | 11.04x |
SVM RBF | DIGITS | 17.75 | 4.06 | 25.46 | 5.32x |
SVM RBF | USPS | 14.89 | 40.03 | 2.79 | 9.84x |
Second, we also tested combining two training sets and testing on the remaining dataset. We downsampled the larger training set to the size of the smaller one and combined them, shuffling the result to prevent order effects. The same test sets as before were used. This time, we computed the error on the remaining, completely unseen dataset divided by the average of the errors on the two seen datasets (i.e., those whose training sets were part of the pool). Again, convNN was trained on the same data. The sketch below illustrates the pooling and the reported ratio.
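A minimal sketch of the pooling and of the reported ratio (the random seed is our choice; the text does not specify one):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption, not from the text

def pool_two(X1, y1, X2, y2):
    """Downsample the larger training set to the size of the smaller one,
    concatenate, and shuffle to prevent order effects."""
    if len(X1) < len(X2):
        X1, y1, X2, y2 = X2, y2, X1, y1          # make X1 the larger set
    idx = rng.choice(len(X1), size=len(X2), replace=False)
    X = np.concatenate([X1[idx], X2])
    y = np.concatenate([y1[idx], y2])
    perm = rng.permutation(len(X))
    return X[perm], y[perm]

def unseen_ratio(err_unseen, err_seen_a, err_seen_b):
    """Error on the completely unseen dataset divided by the average
    error on the two seen datasets (the 'x' factor in the tables)."""
    return err_unseen / ((err_seen_a + err_seen_b) / 2.0)
```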
Tables [ ] and [ ] show the results, again for pixel-based and gradient-based features, respectively.
Dataset independence for pixel-based features, two datasets combined.
Classifier | Trained on | Error % on MNIST | Error % on DIGITS | Error % on USPS | Unseen error ÷ avg. own error |
IBk1 eucl. | MNIST-DIGITS | 9.23 | 18.32 | 25.56 | 1.86x |
IBk1 eucl. | MNIST-USPS | 5.97 | 25.89 | 5.28 | 4.60x |
IBk1 eucl. | USPS-DIGITS | 22.27 | 22.05 | 6.98 | 1.53x |
IBk1 NCC | MNIST-DIGITS | 7.81 | 13.64 | 21.67 | 2.02x |
IBk1 NCC | MNIST-USPS | 4.74 | 24.39 | 4.53 | 5.26x |
IBk1 NCC | USPS-DIGITS | 18.95 | 15.53 | 6.98 | 1.68x |
IBk1 TD | MNIST-DIGITS | 4.34 | 11.41 | 10.81 | 1.37x |
IBk1 TD | MNIST-USPS | 2.65 | 16.09 | 3.64 | 5.12x |
IBk1 TD | USPS-DIGITS | 9.61 | 14.09 | 4.53 | 1.03x |
SVM linear | MNIST-DIGITS | 10.62 | 21.10 | 21.92 | 1.38x |
SVM linear | MNIST-USPS | 12.27 | 43.10 | 8.27 | 4.20x |
SVM linear | USPS-DIGITS | 20.37 | 23.83 | 9.67 | 1.22x |
SVM poly. | MNIST-DIGITS | 4.96 | 8.85 | 16.54 | 2.40x |
SVM poly. | MNIST-USPS | 2.66 | 22.16 | 3.89 | 6.77x |
SVM poly. | USPS-DIGITS | 14.56 | 9.97 | 5.23 | 1.92x |
SVM RBF | MNIST-DIGITS | 12.60 | 34.58 | 31.19 | 1.32x |
SVM RBF | MNIST-USPS | 13.60 | 71.66 | 5.63 | 7.45x |
SVM RBF | USPS-DIGITS | 39.89 | 47.38 | 7.03 | 1.47x |
convNN | MNIST-DIGITS | 3.21 | 4.00 | 6.57 | 1.82x |
convNN | MNIST-USPS | 1.25 | 11.85 | 2.74 | 5.94x |
convNN | USPS-DIGITS | 7.03 | 5.79 | 4.88 | 1.32x |
Dataset independence for gradient-based features, two datasets combined.
Classifier | Trained on | Error % on MNIST | Error % on DIGITS | Error % on USPS | Unseen error ÷ avg. own error |
IBk1 eucl. | MNIST-DIGITS | 3.58 | 7.41 | 9.87 | 1.80x |
IBk1 eucl. | MNIST-USPS | 1.98 | 15.59 | 3.34 | 5.86x |
IBk1 eucl. | USPS-DIGITS | 9.51 | 7.80 | 4.14 | 1.59x |
SVM linear | MNIST-DIGITS | 3.43 | 4.79 | 11.01 | 2.68x |
SVM linear | MNIST-USPS | 2.23 | 15.14 | 3.39 | 5.39x |
SVM linear | USPS-DIGITS | 7.50 | 6.24 | 4.38 | 1.41x |
SVM poly. | MNIST-DIGITS | 1.84 | 3.51 | 6.58 | 2.46x |
SVM poly. | MNIST-USPS | 0.91 | 10.75 | 2.54 | 6.23x |
SVM poly. | USPS-DIGITS | 5.94 | 4.45 | 2.94 | 1.61x |
SVM RBF | MNIST-DIGITS | 1.88 | 3.51 | 6.98 | 2.59x |
SVM RBF | MNIST-USPS | 1.00 | 11.53 | 2.44 | 6.70x |
SVM RBF | USPS-DIGITS | 6.22 | 4.57 | 2.84 | 1.68x |
The better gradient-based feature representation is probably responsible for preventing such outliers in the second set of tables, as more stable models are learned. This time, SVM polynomial and SVM RBF give the best performance (averaged over the completely unseen test datasets' error rates), closely followed by convNN, which uses pixel-based features. Still, this translates to errors of 5.94%, 10.75%, and 6.58% on MNIST, DIGITS, and USPS when each is completely unseen, which is at least an order of magnitude higher than the best results for handwritten digit recognition (reported on MNIST).
We have shown that relatively small amounts of training data are sufficient for state-of-the-art accuracy in handwritten digit recognition, and that the relationship between training set size and accuracy follows a simple asymptotic function.
We have also shown that low error rates on one or two pooled datasets could not be transferred to similarly low error rates on another dataset; we have named this weakness brittleness.
More work is needed to determine how to resolve this weakness. As a first step, we propose more detailed documentation of preprocessing methods and classification systems in the form of Open Source code for further work in the field, more comprehensive sharing of both data and methods among active research groups, and focusing specific efforts on building more robust learning systems. An investigation into specific preprocessing choices and their effect on accuracy would be highly desirable and a major step toward building systems with truly stable, dataset-independent performance.
The authors gratefully acknowledge the support of the students of AI Methods of Data Analysis, class 2005. They also acknowledge Mike O’Neill, who has written and validated the non-scriptable convolutional network code, which was used for the convNN experiment (thanks, Mike, you saved us a lot of work.) Finally, special thanks to Julian A. for one important suggestion. This research has been funded by Seewald Solutions.
Unfortunately, preprocessing is in most cases not fully documented, which makes such an investigation rather hard. We already did a short analysis of this issue in [ ].
These samples actually come from the supposedly cleaner part of the test set, written by Census Bureau employees (SD-3), which indicates that the proportion of segmentation errors in the remaining dataset may be even higher.
The digits were entered in a regular grid, and visual inspection showed the slant to be minimal.
Note that DIGITS is by far the most accurately recognized dataset for the RBF kernel variant.
USPS was already sufficiently similar; for DIGITS we additionally applied gamma correction.
Although gamma correction results in digits which, by visual inspection, seem less similar to MNIST than the original set, it reduced the error rate of the original MNIST-trained convolutional neural network by almost a third. On the other hand, although the aspect ratio was lower by 12.5% for DIGITS, additionally compensating for this increased the error almost back to the original level. These anecdotes support our conclusion that performance is very sensitive to a number of factors that are currently not well understood.
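For completeness, gamma correction itself is a one-liner; the gamma value used for the DIGITS conversion is not given here, so it remains a free parameter in this sketch:

```python
import numpy as np

def gamma_correct(img, gamma):
    """Gamma-correct a grayscale image with values in [0, 1].
    The gamma value used for the DIGITS-to-MNIST conversion is not
    specified in this text; it is a free parameter here."""
    return np.clip(img, 0.0, 1.0) ** gamma
```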