In spite of advances in object recognition technology, handwritten Bangla character recognition (HBCR) remains largely unsolved due to the presence of many ambiguous handwritten characters and the excessively cursive nature of Bangla handwriting. Even many advanced existing methods do not lead to satisfactory HBCR performance in practice. In this paper, a set of state-of-the-art deep convolutional neural networks (DCNNs) is discussed, and their performance on HBCR is systematically evaluated. The main advantage of DCNN approaches is that they can extract discriminative features from raw data and represent them with a high degree of invariance to object distortions. The experimental results show the superior performance of DCNN models compared with other popular object recognition approaches, which suggests that DCNNs can be a good candidate for building an automatic HBCR system for practical applications.
Automatic handwritten character recognition has many academic and commercial applications. The main challenge in handwritten character recognition is dealing with the enormous variety of handwriting styles produced by different writers. Furthermore, some complex handwriting scripts comprise different styles for writing words. Depending on the language, characters are written isolated from each other in some cases (e.g., Thai, Lao, and Japanese), while in other cases they are cursive and sometimes connected to each other (e.g., English, Bangla, and Arabic). This challenge has already been recognized by many researchers in the field of natural language processing (NLP) [
Handwritten character recognition is more challenging than recognition of printed characters for the following reasons: (1) characters written by different writers are not only nonidentical but also vary in aspects such as size and shape; (2) numerous variations in the writing style of an individual character make the recognition task difficult; (3) the similarity of different characters in shape, and the overlaps and interconnections of neighbouring characters, further complicate the problem. In summary, the large variety of writing styles and the complex features of handwritten characters make it challenging to accurately classify them.
Bangla is one of the most widely spoken languages, ranked fifth in the world, and is spoken by more than 200 million people [
The Bangla language has 10 digits and 50 characters, including vowels and consonants, some of which carry additional signs above and/or below. Moreover, Bangla includes many similarly shaped characters; in some cases, a character differs from a similar one only by a single dot or mark. Furthermore, Bangla also contains some special characters that are equivalent representations of vowels. This makes it difficult to achieve good performance with simple classification techniques and hinders the development of a reliable handwritten Bangla character recognition (HBCR) system.
There are many applications of HBCR, such as Bangla optical character recognition, national ID number recognition, automatic license plate recognition for vehicle and parking lot management, post office automation, and online banking. Some example images of these applications are shown in Figure. The main contributions of this work are as follows:

- The first comprehensive evaluation of the state-of-the-art DCNN models, including VGG Net [
- Extensive experiments on HBCR, including handwritten digit, alphabet, and special character recognition.
- Better recognition accuracy, to the best of our knowledge, compared with other existing approaches reported in the literature.
Application of handwritten character recognition: (a) national ID number recognition system, (b) post office automation with code number recognition on envelope, and (c) automatic license plate recognition.
Although some studies on Bangla character recognition have been reported in the past years [
Recently, deep learning-based methods have drawn increasing attention in handwritten character recognition [
Deep neural network (DNN) is an active area in the field of machine learning and computer vision [
Among all deep learning approaches, CNN is one of the most popular models and has been providing the state-of-the-art performance on segmentation [
CNN was initially applied to digit recognition task by LeCun et al. [
The overall architecture of a CNN, as shown in Figure
Basic CNN architecture for digit recognition.
The subsampling or pooling layer abstracts the feature through average or maximum operation on input nodes. For example, if a
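The pooling operation described here can be sketched with a small worked example; the 4 × 4 input and non-overlapping 2 × 2 window below are illustrative values, not data from the paper:

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2-D feature map.

    Assumes the input height and width are even; this is an
    illustrative sketch, not the implementation used in the paper.
    """
    h, w = x.shape
    # Reshape into 2x2 blocks and take the maximum of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
])
pooled = max_pool_2x2(feature_map)
# Each output value is the maximum of one 2x2 region:
# [[4, 2],
#  [2, 8]]
```

Each output node thus summarizes one 2 × 2 region of its input, halving the spatial resolution in both directions.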
In contrast to traditional neural networks, a CNN extracts features from low to high level; the higher-level features are derived from the features propagated from the lower-level layers. As features propagate to the highest layer, their spatial dimension is reduced depending on the size of the convolution and pooling kernels, while the number of feature maps usually increases so that the most suitable features of the input images are selected for better classification accuracy. The outputs of the last layer of the CNN are used as inputs to a fully connected network, which typically applies a Softmax operation to produce the classification outputs. For an input sample
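The Softmax operation mentioned above can be sketched in a few lines; the class scores below are hypothetical values for a 10-class digit problem:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of class scores."""
    z = z - np.max(z)  # shift scores for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical scores from the last fully connected layer (10 digit classes).
scores = np.array([1.0, 2.0, 0.5, 0.1, 3.0, 0.2, 0.3, 0.4, 0.6, 0.7])
probs = softmax(scores)                 # probabilities summing to 1
predicted_class = int(np.argmax(probs)) # index of the most probable digit
```

The predicted class is simply the index of the largest probability, here class 4 (the largest score).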
Many variants of the DCNN architecture have been proposed over the last few years. The following sections discuss six popular DCNN models.
As far as CNN architecture is concerned, there are some fundamental components used to construct an efficient DCNN architecture: the convolution layer, the pooling layer, the fully connected layer, and the Softmax layer. An advanced architecture of this kind consists of a stack of convolutional and max-pooling layers, followed by fully connected and Softmax layers at the end. Noticeable examples of such networks include LeNet [
The visual geometry group (VGG) was the runner up of the ImageNet Large Scale Visual Recognition Competition (ILSVRC) in 2014 [
Basic architecture of VGG Net: convolution (Conv) and FC for fully connected layers and Softmax layer at the end.
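A well-known property of VGG-style networks is the stacking of small 3 × 3 convolutions: two stacked 3 × 3 layers cover the same 5 × 5 receptive field as a single 5 × 5 layer but need fewer weights. A quick check, with a hypothetical channel count C = 128 (not a value from the paper):

```python
# Weight count of two stacked 3x3 convolutions versus one 5x5 convolution
# over C input and output channels (bias terms omitted for simplicity).
C = 128  # hypothetical channel count

two_3x3 = 2 * (3 * 3 * C * C)  # two stacked 3x3 layers: 18 * C^2 weights
one_5x5 = 5 * 5 * C * C        # one 5x5 layer, same receptive field: 25 * C^2

# The stacked design is cheaper and inserts an extra nonlinearity
# between the two layers.
```

This is why deep stacks of small filters, rather than a few large filters, dominate VGG-style designs.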
The layer specification of All-Conv is given in Figure
All convolutional network framework.
This model is quite different from the aforementioned DCNN models due to the following properties [: it uses multilayer convolution, where convolution is performed with 1 × 1 filters, and it uses global average pooling (GAP) instead of a fully connected layer.
The concept of using
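The two ingredients above, 1 × 1 convolution and GAP, can be sketched in a few lines of NumPy; the feature-map shapes below are illustrative, not taken from the paper:

```python
import numpy as np

# A 1x1 convolution mixes channels at each spatial position independently,
# which is equivalent to a matrix product over the channel axis.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 32))  # H x W x C_in feature map (illustrative)
w = rng.standard_normal((32, 16))    # C_in x C_out 1x1 kernel

y = x @ w  # 1x1 convolution: output shape (8, 8, 16)

# Global average pooling: one scalar per channel, replacing the large
# weight matrix of a fully connected layer with a parameter-free reduction.
gap = y.mean(axis=(0, 1))  # shape (16,)
```

Because GAP has no weights, it sharply reduces the parameter count compared with a fully connected classifier head.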
The ResNet architecture has become very popular in the computer vision community. ResNet variants have been experimented with using different numbers of layers, for example: convolution layers: 49 (34, 152, and 1202 layers for other versions of ResNet); fully connected layers: 1; weights: 25.5 M. The basic block diagram of the ResNet architecture is shown in Figure
Basic diagram of residual block.
The Residual Network consists of several basic residual units. Different residual units have been proposed with different types of layers. However, the operations between the residual units vary depending on the architecture, as explained in [
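A residual unit can be sketched as follows; this simplified version uses plain linear transforms with an identity shortcut, omitting the convolutions and batch normalization of the actual ResNet:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """Simplified residual unit: y = relu(W2 * relu(W1 * x) + x).

    The identity shortcut (+ x) lets gradients flow around the
    transformed branch; real ResNet units use convolutions and
    batch normalization instead of plain matrices.
    """
    return relu(w2 @ relu(w1 @ x) + x)

rng = np.random.default_rng(1)
x = rng.standard_normal(64)

# With zero weights the residual branch vanishes, so the unit reduces
# to relu(x): the identity path alone carries the signal through.
w_zero = np.zeros((64, 64))
y = residual_unit(x, w_zero, w_zero)
```

This degenerate case illustrates why very deep residual networks remain trainable: even an "unhelpful" unit does not block signal propagation.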
The FractalNet architecture is an advanced alternative to ResNet; it is very efficient for designing very large networks with shallow subnetworks, providing shorter paths for the propagation of gradients during training [
An example of FractalNet architecture.
DenseNet is a densely connected CNN in which each layer is connected to all previous layers [
A 4-layer dense block with growth rate of
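The channel bookkeeping in a dense block can be traced directly: each layer receives the concatenation of all preceding feature maps, so with growth rate k every layer adds k channels. The values k0 = 16 and k = 12 below are hypothetical, chosen only to illustrate the arithmetic:

```python
# Channel growth inside a dense block with num_layers layers.
k0, k = 16, 12    # hypothetical input channels and growth rate
num_layers = 4

inputs_per_layer = []
channels = k0
for _ in range(num_layers):
    inputs_per_layer.append(channels)  # channels this layer receives
    channels += k                      # its k new maps are concatenated

# inputs_per_layer == [16, 28, 40, 52]; the block outputs 64 channels.
```

The linear growth k0 + l·k of layer inputs is what makes the growth rate k the main capacity knob of a DenseNet.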
The number of network parameters is an important criterion for assessing the complexity of an architecture, and it can be used to compare different architectures. First, the dimension of the output feature map can be computed as
Parameter specification in All-Conv model.
Layers | Operations | Feature maps | Size of feature maps | Size of kernels | Number of parameters |
---|---|---|---|---|---|
Inputs | | | 32 × 32 × 3 | | |
C1 | Convolution | 128 | 30 × 30 | 3 × 3 | 3,456 |
C2 | Convolution | 128 | 28 × 28 | 3 × 3 | 147,456 |
S1 | Max-pooling | 128 | 14 × 14 | 2 × 2 | N/A |
C3 | Convolution | 256 | 12 × 12 | 3 × 3 | 294,912 |
C4 | Convolution | 256 | 10 × 10 | 3 × 3 | 589,824 |
S2 | Max-pooling | 256 | 5 × 5 | 2 × 2 | N/A |
C5 | Convolution | 512 | 3 × 3 | 3 × 3 | 1,179,648 |
C6 | Convolution | 512 | 3 × 3 | 1 × 1 | 262,144 |
GAP1 | GAP | 512 | 3 × 3 | N/A | N/A |
Outputs | Softmax | 10 | 1 × 1 | N/A | 5,120 |
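The entries in the table above follow from two standard formulas: the spatial output size of a convolution, (N − F + 2P)/S + 1, and its weight count, F × F × C_in × C_out (biases omitted, matching the table). A small script to verify a few rows:

```python
def conv_params(f, c_in, c_out):
    """Weights in an F x F convolution layer (biases omitted)."""
    return f * f * c_in * c_out

def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (n - f + 2 * pad) // stride + 1

# Reproduce rows of the All-Conv table (32 x 32 x 3 input, no padding).
c1_size = conv_output_size(32, 3)       # C1: 32x32 -> 30x30
c1_params = conv_params(3, 3, 128)      # C1 weights: 3,456
c2_params = conv_params(3, 128, 128)    # C2 weights: 147,456
c6_params = conv_params(1, 512, 512)    # C6: 1x1 convolution, 262,144
out_params = 512 * 10                   # Softmax layer after GAP: 5,120
```

Every convolution row in the table is consistent with these two formulas, which is a useful sanity check when modifying the architecture.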
The entire experiment is performed on a desktop computer with an Intel® Core-i7 CPU @ 3.33 GHz and 56.00 GB memory, using Keras with a Theano backend in a Linux environment. We evaluate the state-of-the-art DCNN models on three datasets from CMATERdb (available at:
The statistics of three datasets used in this paper are summarized in Table
Statistics of the database used in our experiment.
Dataset | # training samples | # testing samples | Total samples | Number of classes |
---|---|---|---|---|
Digit-10 | 4,000 | 2,000 | 6,000 | 10 |
Alphabet-50 | 12,000 | 3,000 | 15,000 | 50 |
SpecialChar-13 | 2,196 | 935 | 3,131 | 13 |
Standard samples of the numerals with the corresponding Arabic numerals are shown in Figure
First row shows the Bangla actual digits and second row shows the corresponding Arabic numerals.
Sample handwritten Bangla numeral images from CMATERdb 3.1.1 database, including digits from 1 to 10.
Figure
Training loss of different architectures for Bangla handwritten 1–10 digits.
Validation accuracy of different architectures for Bangla handwritten 1–10 digits.
Testing accuracy for Bangla handwritten digit recognition.
In our implementation, the fifty basic alphabet characters, including 11 vowels and 39 consonants, are considered. Samples of the 39 consonant and 11 vowel characters are shown in Figures
Example images of handwritten characters: (a) Bangla consonant characters and (b) vowels.
Randomly selected handwritten characters of Bangla alphabets from Bangla handwritten Alphabet-50 dataset.
The training loss for different DCNN models is shown in Figure
Training loss of different DCNN models for Bangla handwritten Alphabet-50.
The validation accuracy of different architectures for Bangla handwritten Alphabet-50.
Figure
Testing accuracy for handwritten 50-alphabet recognition using different DCNN techniques.
There are several special characters (SpecialChar-13) that are equivalent representations of vowels and are combined with consonants to make meaningful words. In our evaluation, we use 13 special characters: 11 corresponding to vowels and two additional special characters. Some samples of Bangla special characters are shown in Figure
Randomly selected images of special character from the dataset.
The training loss and validation accuracy for SpecialChar-13 are shown in Figures
Training loss of different architectures for Bangla 13 special characters (SpecialChar-13).
Validation accuracy of different architectures for Bangla 13 special characters (SpecialChar-13).
Testing accuracy of different architectures for Bangla 13 special characters (SpecialChar-13).
The testing performance is compared to several existing methods. The results are presented in Table
The testing accuracy of VGG-16 Network, All-Conv Network, NiN, ResNet, FractalNet, and DenseNet on Digit-10, Alphabet-50, and SpecialChar-13 and comparison against other existing methods.
Types | Methods | Digit-10 (%) | Alphabet-50 (%) | SpecialChar-13 (%) |
---|---|---|---|---|
Existing approaches | MLP [ | 96.67 | — | — |
 | MPCA + QTLR [ | 98.55 | — | — |
 | GA [ | 97.00 | — | — |
 | LeNet + DBN [ | 98.64 | — | — |
DCNN | VGG Net [ | 97.57 | 97.56 | 96.15 |
 | All-Conv [ | 97.08 | 94.31 | 95.58 |
 | NiN [ | 97.36 | 96.73 | 97.24 |
 | ResNet [ | 98.51 | 97.33 | 97.64 |
 | FractalNet [ | 98.92 | 97.87 | 97.98 |
 | DenseNet [ | | | |
For an impartial comparison, we trained and tested the networks with the same optimized numbers of parameters as in the references. Table
Number of parameter comparison.
Models | Number of parameters |
---|---|
VGG-16 [ | |
All-Conv Net [ | |
NiN [ | |
ResNet [ | |
FractalNet [ | |
DenseNet [ | |
We also calculate computational cost for all methods, although the computation time depends on the complexity of the architecture. Table
Computational time (in sec) per epoch for different DCNN models on Digit-10, Alphabet-50, and SpecialChar-13.
Models | Digit-10 | Alphabet-50 | SpecialChar-13 |
---|---|---|---|
VGG-16 [ | 32 | 83 | 15 |
All-Conv Net [ | 7 | 23 | 4 |
NiN [ | 9 | 27 | 5 |
ResNet [ | 64 | 154 | 34 |
FractalNet [ | 32 | 102 | 18 |
DenseNet [ | 95 | 210 | 58 |
In this research, we investigated the performance of several popular deep convolutional neural networks (DCNNs) for handwritten Bangla character recognition (digits, alphabets, and special characters). The experimental results indicate that DenseNet is the best performer in classifying Bangla digits, alphabets, and special characters. Specifically, we achieved a recognition rate of
The data used to support the findings of this study are available at
The authors declare that they have no conflicts of interest.