An accurate and efficient eye detector is essential for many computer vision applications. In this paper, we present an efficient method to estimate eye locations in facial images. First, a group of candidate regions is quickly proposed from regional extreme points; then, a set of convolutional neural networks (CNNs) is adopted to determine the most likely eye region and classify it as the left or right eye; finally, the eye center is located with additional CNNs. In experiments on GI4E, BioID, and our own datasets, our method attained detection accuracy comparable to existing state-of-the-art methods while being faster and robust to image variations, including external lighting changes, facial occlusion, and changes in image modality.
In recent years, eye detection has become an important research topic in computer vision and pattern recognition [
The rest of this paper is organized as follows. In Related Work, a review of related work is presented. In Method, we propose an efficient eye detection method consisting of candidate region generation, eye region determination and classification, and eye center positioning. The training scheme, evaluation results, and discussion are then presented in Training Scheme, Evaluation, and Further Discussion, respectively. Finally, concluding remarks are drawn in the last section.
Many algorithms have been proposed for image-based eye detection. These can be summarized into two categories, namely, traditional eye detectors and machine learning eye detectors.
Traditional eye detectors are typically designed according to the geometric characteristics of the eye. These eye detectors can be divided into two subclasses. The first subclass is the geometric model. Valenti and Gevers [
Machine learning eye detectors can also be further divided into two subclasses. The first subclass is the traditional feature extraction followed by a cascaded classifier. Chen and Liu [
Besides the determination of eye regions, the classification of the left/right eye and the positioning of the eye center are also important for some applications, such as eye tracking systems and eye disease detection. However, most existing eye detectors cannot efficiently determine the eye regions, distinguish between the left and right eyes, and detect the eye centers in a single pass. Therefore, we aim to propose a new method that overcomes these disadvantages.
The overall workflow for the proposed method is shown in Figure
Workflow of the proposed method.
Directly using CNNs on the high resolution face images (e.g.,
We selected the top
Top
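The candidate-point stage can be illustrated with a minimal Python sketch that finds dark local extrema in a grayscale face image, on the assumption that pupils appear as locally darkest pixels. The 8-neighbour rule and the `top_n` cutoff here are simplifying stand-ins for the regional-extremum operator used in the paper, not its exact definition.

```python
def local_minima(img, top_n=3):
    """Return up to top_n darkest strict local minima as (row, col) points.

    img is a 2-D list of grayscale intensities. A pixel is a candidate if it
    is strictly darker than all 8 of its neighbours (border pixels skipped).
    """
    h, w = len(img), len(img[0])
    minima = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            v = img[r][c]
            neighbours = (img[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0))
            if all(v < n for n in neighbours):
                minima.append((v, r, c))
    minima.sort()  # darkest (smallest intensity) candidates first
    return [(r, c) for _, r, c in minima[:top_n]]
```

In practice the extrema would be taken from the full-resolution face image and ranked before the top candidates are passed on to the CNN stage.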
After generating the candidate eye regions, we aimed to develop a method that can quickly and robustly predict eye location and classify the eye as left eye or right eye. We developed a set of convolutional neural networks (CNNs) to make effective use of the datasets. The core architecture of our 1st set of CNNs is summarized in Figure
Structure of the 1st set of CNNs.
In our 1st set of CNNs, three sub-CNNs were built, each with the same structure. In each sub-CNN, the first layer was a convolutional layer with a kernel size of 5 × 5 pixels, a stride of two pixels, and a padding of one pixel, followed by a maximum pooling layer with a window size of 3 × 3 and a stride of two pixels. The second layer was a convolutional layer with a kernel size of 3 × 3 pixels, a stride of one pixel, a padding of one pixel, and no pooling layer. The third layer was similar to the first, except that the convolutional kernel size was 3 × 3 pixels. Through these three stages of convolution and pooling, the convolutional layers learned edges, eye structure, and other basic features, while the pooling layers made the networks robust to small local changes. Next, we used fully connected (FC) layers to combine these features into higher-level representations and produce the final region label and confidence index for each candidate region. Finally, we chose the candidate region with the maximum confidence index as the eye region and used its region label to classify the eye as left or right. We then mapped the coordinates of this region back to the original facial image. All CNNs’ weights were initialized based on ImageNet’s [
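The spatial dimensions flowing through these three stages can be checked with the standard convolution/pooling output-size formula. The 36 × 36 input below is an assumed example for illustration, since the candidate-region resolution is not stated in the text; the kernel, stride, and padding values follow the layer description above.

```python
def out_size(size, kernel, stride, padding=0):
    # Standard output-size formula for a convolution or pooling layer:
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

s = 36                                            # assumed candidate-region side length
s = out_size(s, kernel=5, stride=2, padding=1)    # stage 1 conv  -> 17
s = out_size(s, kernel=3, stride=2)               # stage 1 pool  -> 8
s = out_size(s, kernel=3, stride=1, padding=1)    # stage 2 conv  -> 8 (no pooling)
s = out_size(s, kernel=3, stride=2, padding=1)    # stage 3 conv  -> 4
s = out_size(s, kernel=3, stride=2)               # stage 3 pool  -> 1
```

Tracing the sizes this way is a quick sanity check that the three stages reduce an assumed 36 × 36 candidate region to a 1 × 1 feature map before the FC layers.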
Although the 1st set of CNNs outputs the eye region and the eye class (left or right), it still lacks precise positioning of the eye center. Existing eye detectors usually treat the center of the eye region as the eye center, which is inaccurate if the subject is not looking straight ahead. Some cases are shown in Figure
Detected eye regions, region centers, and actual eye centers. The yellow cross represents the center of the eye region and the red cross is the actual center of the eye.
To locate the actual eye center, we built the 2nd set of CNNs that locates the pupil region in the eye region. The 2nd CNNs’ architecture is shown in Figure
Structure of the 2nd CNNs.
Samples of images in the datasets. (a) shows samples from GI4E datasets and (b) is from BioID datasets, while (c) is from our datasets. Note the significant variation in illumination, head position, infrared/visible image, and full/half covered face.
In our experiments, we used the GI4E [
In our work, we set
The percentage of images with at least one effective feature point versus the parameter
| | 100 | 150 | 200 | 250 | 300 |
|---|---|---|---|---|---|
| Percentage of images | 60% | 85% | 96% | 97% | 97% |
For each selected candidate point, we generated three candidate boxes as shown in Figure
CNNs’ output. The green points represent the selected eye candidate feature points. The indexes show the confidences and region labels predicted by the first set of CNNs. The magenta rectangle was selected as the final prediction because it had the highest index, 0.89, and this eye was predicted to be the left eye. The eye region was then fed into the second set of CNNs, which outputs the accurate eye center. The red cross represents the eye center.
Next, we put the marked regions into the 1st set of CNNs, resulting in three indexes between 0 and 1 as shown in Figure
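This selection step amounts to a simple arg-max over the candidates' confidence indexes. In the sketch below, the boxes, scores, and labels are hypothetical values in the spirit of the figure's example (a winning left-eye box with index 0.89), not output from the paper's networks.

```python
def select_eye_region(boxes, scores, labels):
    """Pick the candidate with the highest confidence index.

    boxes  : list of (x, y, w, h) candidate regions
    scores : confidence indexes in [0, 1], one per box
    labels : 'left' or 'right' predictions, one per box
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    return boxes[best], labels[best], scores[best]

# Hypothetical candidates for one face image:
boxes = [(12, 40, 24, 24), (52, 38, 24, 24), (90, 41, 24, 24)]
scores = [0.31, 0.89, 0.12]
labels = ["right", "left", "left"]
winner = select_eye_region(boxes, scores, labels)
```

The winning box's coordinates would then be mapped back to the original facial image before the eye-center stage.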
The evaluation of our method was carried out on an Intel(R) Core(TM) i5-6600 desktop computer with 16 GB RAM and an NVIDIA GeForce GTX 745 GPU. The algorithm was implemented in MATLAB (R2016a). During the evaluation, the outputs of the two stages of cascaded CNNs were assessed separately. We evaluated our method on the GI4E, BioID, and our own facial datasets, randomly dividing each dataset into two parts for CNN training and testing.
Figure
Some results of eye location and classification. The blue box represents the right eye while the orange box represents the left eye.
To measure the eye classification and detection capability, we defined the normalized error as e = max(d_l, d_r)/d, where d_l and d_r are the Euclidean distances between the estimated and ground-truth centers of the left and right eyes, respectively, and d is the distance between the two ground-truth eye centers.
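This normalized error (the worst per-eye error divided by the inter-ocular distance, as commonly used on BioID) can be computed directly from predicted and ground-truth centers; the sketch below assumes 2-D pixel coordinates.

```python
import math

def normalized_error(pred_l, pred_r, gt_l, gt_r):
    """Worst per-eye center error normalized by the inter-ocular distance.

    pred_l/pred_r : predicted (x, y) centers of the left and right eyes
    gt_l/gt_r     : hand-labeled ground-truth centers
    """
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    d = dist(gt_l, gt_r)                       # inter-ocular distance
    return max(dist(pred_l, gt_l), dist(pred_r, gt_r)) / d
```

For example, with ground-truth centers 100 pixels apart, a 5-pixel error on one eye and a perfect prediction on the other gives e = 0.05.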
The comparison between the proposed method and the state-of-the-art methods is presented in Table
Eye localization and classification comparison on the BioID datasets.
| Method | e ≤ 0.05 | e ≤ 0.10 | e ≤ 0.25 | Time (ms) |
|---|---|---|---|---|
| Valenti and Gevers [ | 86.1% | 91.7% | 97.9% | |
| Timm and Barth [ | 82.5% | 93.4% | 99.7% | |
| Amos et al. [ | 84.1% | 90.2% | 96.1% | 500 |
| Araujo et al. [ | 88.3% | 92.7% | 98.9% | 83 |
| Asadifard and Shanbezadeh [ | 47.0% | 86.0% | 96.0% | 45 |
| Markuš et al. [ | 85.7% | 95.3% | 99.7% | 28 |
| Leo et al. 2013 [ | 78.0% | 86.0% | 90.0% | 330 |
| Leo et al. 2014 [ | 80.6% | 87.3% | 94.0% | 330 |
| Gou et al. [ | | | | 67 |
| Ours | 85.6% | 95.9% | 99.5% | |
Figure
Samples of eye detection, classification, and eye center estimation results on BioID, GI4E, and our datasets. Red cross represents estimated eye center.
Figure
Some samples from the BioID, GI4E, and our datasets in which the proposed method fails in the eye classification and eye center detection.
To verify the effectiveness of our proposed method, we also report our results in terms of the average eye center detection rate as a function of the pixel distance between the algorithmically estimated and hand-labeled eye locations. Figure
Eye center detection performance of our method on different datasets.
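Curves of this kind can be reproduced by sweeping an error threshold over the per-image errors. The error values in the example below are made-up illustrative numbers, not measurements from the paper.

```python
def detection_rate(errors, thresholds):
    """Fraction of test images whose per-image error falls within each threshold."""
    n = len(errors)
    return [sum(e <= t for e in errors) / n for t in thresholds]

# Hypothetical per-image normalized errors for a 5-image test set:
errors = [0.02, 0.04, 0.07, 0.12, 0.30]
rates = detection_rate(errors, [0.05, 0.10, 0.25])
```

Plotting `rates` against the thresholds yields the monotonically increasing detection-rate curve shown for each dataset.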
The experimental results demonstrate that our proposed method achieves satisfactory results on both eye region and eye center detection on benchmark datasets. The cascaded CNNs framework not only improved the eye detection speed but also reduced the training time. A single network performing both eye region detection and eye center positioning might achieve better accuracy, but it would require a larger dataset and much longer training. For example, it took 3 days to train 500 images (
In our experimental environment, approximately 2 ms was spent proposing the eye candidate feature points, 2 ms determining the eye region and its left/right label (1st set of CNNs), and 30 ms locating the eye center (2nd CNNs), for a total of roughly 34 ms per frame. Most eye detection tasks require frame rates of around 30–60 fps, which shows that our proposed method can deal with real-time problems that arise in realistic scenarios.
In this paper, we presented an effective cascaded CNNs method to detect eye locations in facial images. Our method simultaneously detects the left and right eye locations and their centers even when the face is partially occluded, and it works on both visible-light and infrared images. In addition, eye positioning does not rely on a face detector. For the evaluation, we tested our method on over 5,000 facial images and found that our proposed eye detector was efficient and effective. We used feature points combined with cascaded CNNs to achieve significantly high efficiency and a satisfactory classification rate. In future work, we plan to collect more facial images to train more powerful eye recognition models.
The authors declare that they have no conflicts of interest.
The work presented in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region (Reference no. UGC/FDS13/E04/14).