Exploration of Ear Biometrics Using EfficientNet

Biometrics is the recognition of a human using biometric characteristics for identification, which may be physiological or behavioral. The physiological biometric features are the face, ear, iris, fingerprint, and handprint; behavioral biometrics are signatures, voice, gait pattern, and keystrokes. Numerous systems have been developed to distinguish biometric traits used in multiple applications, such as forensic investigations and security systems. With the current worldwide pandemic, facial identification has failed due to users wearing masks; however, the human ear has proven more suitable as it is visible. Therefore, the main contribution is to present the results of a CNN developed using EfficientNet. This paper presents the performance achieved in this research and shows the efficiency of EfficientNet on ear recognition. The nine variants of EfficientNets were fine-tuned and implemented on multiple publicly available ear datasets. The experiments showed that EfficientNet variant B8 achieved the best accuracy of 98.45%.


Introduction
The ear begins to develop in a fetus between the fifth and seventh weeks of pregnancy [1]. At this stage, the face acquires a more distinguishable shape as the mouth, nostrils, and ears begin to form. There is still no exact timeline for when the outer ear is created, but it is accepted that a cluster of embryonic cells connects to establish the ear. These are called auricular hillocks, which begin growing in the lower portion of the neck. The auricular hillocks broaden and intertwine within the seventh week to produce the ear's shape. Within the ninth week, the hillocks move to the ear canal and become more noticeable as the ear [1]. The external anatomy of the ear can be seen in Figure 1. The growth of the ear in the first four months after birth is linear, and the ear is then stretched in development between the ages of four months and eight years. After this, the ear size and shape remain constant until age seventy, when the ear increases in size again.
Biometrics is the recognition of a human using their biometric characteristics, which may be physiological or behavioral. The physiological biometric features are DNA, face, ear, iris, fingerprint, hand geometry, hand vein, and palm print, and behavioral biometrics are signatures, gait patterns, and keystrokes. Voice is considered a combination of behavioral and physiological characteristics. Numerous systems have been developed to distinguish biometric traits, and they have been used in multiple applications, such as forensic investigations and security systems. With the current worldwide pandemic, facial identification has failed due to users wearing masks. However, the human ear has proven more suitable as it remains visible. Table 1 presents an investigation of the performance, distinctiveness, permanence, collectability, and acceptability of each biometric.
Among the physiological biometric traits, the ear has received much attention of late, as it is considered a reliable biometric for human recognition [2]. An ear biometric system is dependable because the ear does not change, is of uniform color, and has a fixed position at the center of the side of the face. The ear is larger than a fingerprint, which makes it simpler to capture an image of the subject without necessarily needing to gain information from the subject [2]. Several factors make it difficult to capture the details of the ear correctly, including concealment of the ear by clothing, hair, ear ornaments, and jewelry. Another source of interference is the angle at which the image was taken, which can conceal essential characteristics of the ear's anatomy.
These difficulties have given ear recognition a secondary role among the systems and techniques commonly used for identification and verification.
Although several computer-aided detection models have been developed to identify ears, low accuracy and sensitivity, which lead to misidentified ears, are still significant concerns. Existing models are also computationally complex and expensive. The contributions of this work are summarized as follows: (1) Implementation of state-of-the-art EfficientNets to develop an effective and inexpensive ear detection system. It is the first time the EfficientNet model has been applied to classify ears. (2) Evaluation of the accuracy of the proposed model using EfficientNet. (3) Finally, benchmark datasets were used to evaluate the performance of the model.
The remainder of the work is structured as follows: Section 2 presents related works, and Section 3 presents the data and methodology explored in this study in detail. The experimental results and discussion are provided in Section 4, and Section 5 concludes the paper.

Related Work
This section presents different algorithms that use the convolutional neural network (CNN) for ear identification, and a summary of the related works is shown in Table 2.
Emeršič et al. [3] organized the dataset of the UERC, which was used for the benchmark, training, and testing sets. In the competition, handcrafted feature extraction methods, such as LBP [13] and patterns of oriented edge magnitudes (POEM) [14], and CNN-based feature extraction methods were used for ear identification.
The challenges were to find methods to remove occlusions such as earrings, hair, other obstacles, and background from the ear image. Occlusions were removed by creating a binary ear mask, and recognition was then performed using the handcrafted features. Another proposed approach was to calculate score matrices from the fused CNN-based and handcrafted features, and a 30% detection rate was achieved.
Tian and Mu [4] applied a CNN to ear recognition; the network they designed was made up of three convolutional layers, a fully connected layer, and a softmax classifier. The database used was USTB ear, which consisted of 79 subjects with various pose angles. The images utilized excluded earrings, headsets, or similar occlusions. Chowdhury et al. [15] proposed an ear biometric recognition system that uses local features of the ear and then uses a neural network to identify the ear. The method estimates where the ear could be in the input image and then takes the edge features from the identified ear. After identifying the ear, a neural network matches the extracted features against a feature database.
Raveane et al. [5] noted that it is difficult to precisely detect and locate an ear within an image and that this challenge increases under variable conditions, owing to the odd shape of the human ear, lighting conditions, and the changing profile shape of an ear when photographed [5]. The ear detection system used multiple CNNs, combined with a detection grouping algorithm, to identify an ear's presence and location. The proposed method matches the performance of other methods when analyzed on clean, purpose-shot photographs, reaching an accuracy of upward of 98%. It outperforms them, with a rate of over 86%, when the system is subjected to noncooperative natural images in which the subject appears in challenging orientations and photographic conditions.
A multiple scale faster region-based convolutional neural network (Faster R-CNN) to detect ears from 2D profile images was proposed by Zhang and Mu [6]. The method detects three regions of different scales to infer the ear location from the context of the ear in the image, so that the ear is extracted correctly. The system was tested on 200 web images and achieved 98% accuracy. Further experiments were conducted on Collection J2 of the University of Notre Dame Biometrics Database (UND-J2) and the University of Beira Interior Ear dataset (UBEAR); these achieved detection rates of 100% and 98.22%, respectively, although these datasets contained large occlusions, scale, and pose variation.
Kohlakala and Coetzer [7] presented semi-automated and fully automated ear-based biometric verification systems. The ear region is identified either manually or with a CNN and morphological postprocessing, which classify image regions as belonging to either the foreground (ear) or the background. The binary contour image was used for feature extraction, and matching was performed with a Euclidean distance measure, whose ranking was used to verify authentication. The Mathematical Analysis of Images Ear database and the Indian Institute of Technology Delhi Ear database were the two databases used, on which 99.20% and 96.06% were achieved, respectively.
Geometric deep learning (GDL), which generalizes CNNs to non-Euclidean domains, was presented by Tomczyk and Szczepaniak [8]. It used convolutional filters with a mixture of Gaussian models. These filters were used so that the images could be easily rotated without interpolation. The paper reports published experimental results for the approach.
Alshazly et al. [9] presented and compared ear recognition models built with handcrafted and CNN features. The paper used seven well-performing handcrafted descriptors to extract discriminating features from the ear image. The extracted features were then used to train a support vector machine (SVM) to learn a suitable model. CNN-based models, which used a variant of the AlexNet architecture, were then evaluated.
The results obtained on three ear datasets showed that the performance of the CNN-based models increased by 22%. This paper also investigated whether the left and right ears are symmetric. The results obtained on the two datasets indicate a high degree of symmetry between the ears.
Alkababji and Mohammed [10] presented the use of a deep learning object detector, the faster region-based convolutional neural network (Faster R-CNN), for ear detection. This CNN is used for feature extraction. Principal component analysis (PCA) and a genetic algorithm were used for feature reduction and selection, and a fully connected artificial neural network was used as the matcher. The method achieved an accuracy of 97.8%.
Jamil et al. [11] built and trained a CNN model for ear biometrics under various uniform illuminations measured using lumens. They considered their work the first to test the performance of a CNN on underexposed or overexposed images. The results showed that images with uniform illumination above 25 lux achieved a result of 100%. The CNN model had problems recognizing images when the illumination was below 10 lux, but still produced an accuracy of 97%. This result shows that the CNN architecture performs just as well as the other systems. It was found that the dataset contained rotations, which affected the results.
Hansley et al. [12] presented an unconstrained ear recognition framework that was better than the current state-of-the-art systems on publicly available databases. They developed CNN-based solutions for ear normalization and description, which were fused with handcrafted descriptors to improve recognition.
This was done in two stages. The first stage was to find the landmark detectors, which were untrained scenarios. The next stage was to generate a geometric image normalization to boost the performance. The CNN descriptor was found to be better than other CNN-based works in the literature, and the obtained results were higher than other reported results for the UERC challenge.

Dataset.
In this study, all the experiments were performed with numerous public ear datasets, which are described below. The UBEAR, EarVN1.0, IIT, ITWE, and AWE databases are best suited for ear identification due to their large size. EarVN1.0, however, has seen the most prominent use in age estimation with CNN techniques. It is an appropriate dataset for ear images taken in a controlled environment, while ITWE is suited to classifying ears in an uncontrolled environment. A summary of the datasets is shown in Table 3.

Mathematical Analysis of Images (AMI) Ear Database.
The AMI Ear database [19] was collected at the University of Las Palmas. The database comprises 700 ear images of 100 distinct Caucasian male and female adults between the ages of 19 and 65. All images within the database were taken under the same illumination and with a fixed camera position. Both the left and right ears were captured. The images were cropped so that the ear area covers almost half the image. The pose of the images varies in yaw and severely in pitch angles, and the dataset is publicly available.

The Indian Institute of Technology (IIT) Delhi Ear Database.
The IIT database [16] was collected by the Indian Institute of Technology Delhi in New Delhi between October 2006 and June 2007. The database consists of 421 images of 121 distinct adults, both male and female. All images were taken indoors, with no significant occlusions present, and only the right ear was captured. The dataset provides both raw and normalized images. The normalized images are in grayscale and of size 272 × 204 pixels.

The University of Beira Ear (UBEAR) Database.
The University of Beira presented the UBEAR database [25]. The database comprises 4429 images of 126 subjects, both male and female. The images were taken under varying lighting conditions and angles, and partial occlusions were present. These images were of both the left and right ears.

The Annotated Web Ear (AWE) Database.
The AWE database [18] is a set of web images of public figures. The database consists of 1000 images of 100 different subjects; the images vary in size and were tightly cropped. Both the left and right ears were captured.

The EarVN1.0 Database.
The EarVN1.0 database [22] comprises 28412 images of 164 Asian male and female subjects, and both the left and right ears were captured. The images were collected during 2018 under unconstrained conditions, including varying camera systems and lighting conditions. The ear images are cropped from facial images and have significant variations in pose, scale, and illumination.

The Western Pomeranian University of Technology Ear (WPUTE) Database.
The Western Pomeranian University of Technology Ear (WPUTE) database [20] was collected in 2010 to evaluate ear recognition performance for images obtained in the wild. The database contains 2071 ear images belonging to 501 subjects. The images were of various sizes, covered both the left and right ears, and were taken under different indoor lighting conditions and rotations. Some occlusions were included in the database, namely headsets, earrings, and hearing aids.

The Unconstrained Ear Recognition Challenge (UERC).
The Unconstrained Ear Recognition Challenge (UERC) database [21] was obtained in 2017, then extended in 2019, and is a mix of two databases that currently exist and a newly

In-the-Wild Ear (ITWE) Database.
The In-the-Wild Ear (ITWE) database [23] was created for recognition evaluation and has 2058 images in total of 231 male and female subjects. The ear images were obtained using bounding boxes, and the coordinates of those bounding boxes were released with the collection. The images contained cluttered backgrounds and were of variable size and resolution. The database includes both left and right ears, but no distinction between them was provided.

The University of Science and Technology Beijing (USTB) Ear Database.
The University of Science and Technology Beijing (USTB) Ear database [17] contains cropped ear and head profile images of male and female subjects split into four sets. Dataset one includes 60 subjects and has 180 close-up images of the right ear taken during 2002. These images were taken under different lighting and experienced some shearing and rotation. Dataset two contains 77 subjects and has 308 images of the right ear taken from approximately 2 meters away in 2004. These images were taken under different lighting conditions. Dataset three contains 103 subjects and has 1600 images, which were taken during 2004. The images include right and left rotations and have dimensions of 768 × 576 pixels. Dataset four contains 25500 images of 500 subjects, obtained from 2007 to 2008; the subject was in the center of the camera circle. The images were taken while the subject looked upward, downward, and at eye level, and they contain different yaw and pitch poses. The databases are available on request and accessible for research.

The Carreira-Perpinan (CP) Ear Database.
The Carreira-Perpinan (CP) Ear database [24] is an early ear dataset used for ear recognition systems. It was created in 1995 and contains 102 images of 17 subjects. The images were captured in a controlled environment and therefore include only minor pose variation.

The Indian Institute of Technology Kanpur (IITK) Ear Database.
The Indian Institute of Technology Kanpur (IITK) ear database [26] was compiled by the Indian Institute of Technology Kanpur. The database is split into three sets. The first set consists of 801 profile images of 190 male and female subjects. The second set also contains 801 images, from a total of 89 subjects, with variations in pitch angle. The third set contains 1070 images of the same 89 subjects, but with variation in yaw angle.

The Forensic Ear Identification Database (FEARID).
The Forensic Ear Identification Database (FEARID) [27] differs from the other databases in that it only includes ear prints. These contain no occlusions, variable angles, or illumination. Though there is no mention of any variables, other influences, such as the force with which the ear was pressed against the scanner and the scanner's cleanliness, need to be considered. This database comprises 7364 images of 1229 subjects. It was used for forensic applications and not for biometric use.

The University of Notre Dame (UND) Database.
The University of Notre Dame (UND) database [28] contains many subsets of 2D and 3D ear images. These images were acquired over a period from 2003 to 2005. The database contains 3480 3D images from 952 male and female subjects and 464 2D images from 114 male and female subjects. These images were taken under different lighting conditions, yaw and pitch poses, and angles. The images are only of the left ear.

The Face Recognition Technology (FERET) Database.
The Face Recognition Technology (FERET) database [29] is a sizeable facial image database collected between 1995 and 1996. It contains 1564 subjects and a total of 14126 images. These images were collected for face recognition and include left and right profile images, which makes them well suited to 2D ear recognition.

The Pose, Illumination and Expression (PIE).
Carnegie Mellon University collected the Pose, Illumination and Expression database [30], which contains 40000 images of 68 subjects. The images are of the facial profile and have different poses, illuminations, and expressions.

The XM2VTS Ear Database.
The XM2VTS Ear database [31] consists of frontal and profile facial images from the University of Surrey; the database contains 295 subjects and 2360 images captured under controlled conditions. The images were cropped to 720 × 576 pixels and were taken from video data.

The West Virginia University (WVU) Ear Database.
The West Virginia University (WVU) Ear database [32] is a video database of 137 subjects. The capture system used an advanced procedure that allowed the ear to be captured at different angles; these images included earrings and eyeglasses.

Preprocessing.
Image preprocessing is a considerable part of a deep learning task. Most CNN models require a large dataset to learn discriminative features suitable for making predictions and obtaining good performance. As the images in the datasets are of different sizes, the input images need to be resized to conform to the input size of the CNN models, while the features are preserved during resizing. Examples of the original and the preprocessed images are shown in Figures 2 and 3.
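The paper does not state how the resizing preserves features. One common option, assumed here purely for illustration, is to scale the image by a single factor and pad the remainder ("letterboxing"), so that the ear is not stretched. The sketch below only computes the geometry; the function name `letterbox_geometry` and the 224-pixel target (the default input size of EfficientNet B0) are assumptions, not details taken from the study.

```python
def letterbox_geometry(width, height, target=224):
    """Return the scaled size and padding that fit an image into a square.

    Hypothetical helper: the study does not describe its resizing procedure.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # One scale factor for both axes keeps the aspect ratio unchanged.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    # Split the leftover pixels between the two sides of each axis.
    pad_x = target - new_w
    pad_y = target - new_h
    left, top = pad_x // 2, pad_y // 2
    return {
        "size": (new_w, new_h),
        # Padding order: left, top, right, bottom.
        "padding": (left, top, pad_x - left, pad_y - top),
    }


# Example with the normalized IIT Delhi image size of 272 x 204 pixels.
geometry = letterbox_geometry(272, 204)
print(geometry)
```

For the 272 × 204 example, the image is scaled to 224 × 168 and padded by 28 pixels at the top and bottom. Plain rescaling to a square would instead distort the ear's proportions, which may or may not matter for a given model.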

Transfer Learning.
In this study, the concept of transfer learning was adopted, in which a CNN model pretrained on a large dataset is reused to learn the features of the target dataset. Training a CNN from the beginning requires a large amount of data, which makes it computationally complex, and applying such models directly to a small, new dataset results in feature extraction bias, overfitting, and poor generalization. The pretrained CNN was therefore modified and its structure fine-tuned to suit the given dataset. Transfer learning is computationally inexpensive, requires less training time, overcomes the limitations of the dataset, improves performance, and is faster than training a model from the beginning. The pretrained CNN models fine-tuned in this work are the EfficientNets. The proposed structure is represented in Figure 4.

EfficientNet Architecture.
EfficientNet is a lightweight model family built with an automated machine learning framework: a baseline network, EfficientNet B0, was developed, and its depth, width, and resolution were uniformly scaled up using a simple and effective compound coefficient to produce the EfficientNet models B1-B8. The models performed efficiently and attained superiority over existing CNN models on other datasets. EfficientNets are smaller, with fewer parameters, and faster, and they generalize well to obtain higher accuracy on other datasets, which makes them popular for transfer learning tasks. The proposed study fine-tuned EfficientNet models B0-B8 on the dataset to detect the ears. In transferring the pretrained EfficientNets to the ear dataset, the models were fine-tuned by adding a global average pooling layer to reduce the number of parameters and counter overfitting. Dense layers with a ReLU activation function and a dropout rate of 0.4 follow the global average pooling layer, before the final output layer [33]. The output layer uses the softmax activation function to determine the probabilities that the input data represents the ears, as given in equation (1):

$$\sigma(q)_i = \frac{e^{q_i}}{\sum_{y=1}^{N} e^{q_y}}, \tag{1}$$

where $\sigma$ is the softmax activation function, $q$ is the input vector to the output layer, $e^{q_i}$ is the exponential of the $i$-th element of the input vector, $N$ is the number of classes, and $e^{q_y}$ is the exponential of the $y$-th element in the normalizing sum. Because too many iterations can lead to model overfitting while too few can cause model underfitting, this study used an early stopping strategy. Training was configured to run for approximately 90 iterations before terminating, which allowed early stopping to control overfitting and improve performance, and the models were optimized with gradient descent. The EfficientNet B0-B8 models were trained with 100 iterations (epochs). The batch size for each iteration was 32, and the momentum was regulated at 0.2.
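The early stopping strategy mentioned above can be illustrated with a minimal, framework-independent sketch. It assumes the usual patience-based rule: training halts once the validation loss has failed to improve for a fixed number of consecutive epochs. The function name, the patience value, and the loss values are hypothetical; the study does not report them.

```python
def early_stopping_epoch(val_losses, patience=10, max_epochs=100):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Illustrative only: the patience value is an assumption.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            # A new best loss resets the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every available epoch.
    return min(len(val_losses), max_epochs), best_epoch


# Made-up loss curve: improves for five epochs, then starts to overfit.
losses = [1.0, 0.8, 0.6, 0.5, 0.4, 0.45, 0.5, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 8 5
```

In practice the weights from the best epoch would be restored after stopping, so the extra epochs spent waiting for an improvement do not degrade the final model.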
Categorical cross-entropy is the loss function used to update the weights at each iteration. The hyperparameters used were evaluated and found to perform optimally. The weight update is defined in equation (2):

$$\alpha = \alpha - \eta \cdot \nabla_{\alpha} J(\alpha; x, y), \tag{2}$$

where $\nabla_{\alpha} J$ is the gradient of the loss with regard to $\alpha$, $\eta$ is the defined learning rate, $\alpha$ is the weight vector, and $x$ and $y$ are the respective training sample and label.
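To make the interplay of the softmax output, the categorical cross-entropy loss, and the gradient descent update concrete, the following sketch trains a tiny linear softmax classifier in plain Python. It is a toy stand-in for the final layer only, not the authors' training code: the data, the learning rate, and all function names are assumptions, and momentum is omitted for brevity.

```python
import math


def softmax(q):
    """Softmax of a list of logits, shifted by the max for numerical stability."""
    m = max(q)
    exps = [math.exp(v - m) for v in q]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Categorical cross-entropy for one sample with an integer class label."""
    return -math.log(probs[label])


def logits(weights, x):
    """One logit per class: dot product of that class's weight row with x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sgd_step(weights, x, label, lr):
    """Apply weights <- weights - lr * gradient of the loss for one sample."""
    probs = softmax(logits(weights, x))
    for k, row in enumerate(weights):
        # For softmax + cross-entropy, dJ/d(logit_k) = p_k - 1[k == label].
        error = probs[k] - (1.0 if k == label else 0.0)
        for j in range(len(row)):
            row[j] -= lr * error * x[j]


def mean_loss(weights, samples):
    """Average cross-entropy over all samples."""
    losses = [cross_entropy(softmax(logits(weights, x)), y) for x, y in samples]
    return sum(losses) / len(losses)


# Toy data: two features plus a constant bias input, three classes.
samples = [([1.0, 0.0, 1.0], 0), ([0.0, 1.0, 1.0], 1), ([1.0, 1.0, 1.0], 2)]
weights = [[0.0] * 3 for _ in range(3)]

loss_before = mean_loss(weights, samples)
for _ in range(200):
    for x, y in samples:
        sgd_step(weights, x, y, lr=0.1)
loss_after = mean_loss(weights, samples)
print(loss_before, loss_after)
```

With all weights at zero, every class receives probability 1/3, so the initial loss equals ln 3; repeated updates then drive the loss down as the correct class gains probability mass.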

Results and Discussion
Various EfficientNet variants were fine-tuned on all the ear datasets to detect the ear. Each dataset was split into 20% training and 80% test sets. The experiments were performed entirely in the Keras deep learning framework with the TensorFlow backend. The models were evaluated using the popular evaluation metrics given in equations (3)-(5).
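The split described above can be sketched as follows. The paper does not say whether the split was stratified by subject; the sketch assumes it was, so that every subject contributes images to the training set. The function name, the seed, and the toy label list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.2, seed=0):
    """Split sample indices class by class into train and test index lists.

    Assumption: stratification by subject is not stated in the paper.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep at least one training image per class.
        n_train = max(1, round(len(indices) * train_fraction))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Toy example: 10 subjects with 10 images each.
labels = [subject for subject in range(10) for _ in range(10)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 20 80
```

Fixing the seed makes the split reproducible across the different EfficientNet variants, so that differences in accuracy are not caused by different partitions.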

Specificity.
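It is the ratio of negative instances correctly classified by a model to the total number of actual negative instances being tested, equation (5):

$$\text{Specificity} = \frac{TN}{TN + FP}, \tag{5}$$

where $TN$ is the number of true negatives and $FP$ is the number of false positives.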

Accuracy.
It is a measure that indicates the ratio of all the correctly recognized cases to the overall number of cases. While this metric generally gives a decent reflection of the classifier, it may not reflect a classifier's true performance in a scenario where there is an uneven class distribution. Accuracy can be computed using equation (3):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{3}$$

where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives, respectively.
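The caveat about uneven class distributions can be shown with a small worked example. The sketch below uses the standard confusion-matrix definitions of accuracy, sensitivity, and specificity, which is an assumption about the exact formulas used in the study, and the counts are invented for illustration.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is absent.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# 95 negatives and 5 positives; the classifier labels everything negative.
skewed = confusion_metrics(tp=0, tn=95, fp=0, fn=5)
print(skewed)
```

Here the accuracy is 0.95 even though the classifier never finds a single positive case (sensitivity 0.0, specificity 1.0), which is why accuracy alone can be misleading on imbalanced data.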

Sensitivity.
It is the ratio of all positive instances correctly classified by a model to the total number of actual positive instances. A low sensitivity indicates that a model suffers from a high number of false negatives. Sensitivity, also called the true-positive rate (TPR), can be computed using equation (4):

$$\text{TPR} = \text{Sensitivity} = \frac{TP}{TP + FN}, \tag{4}$$

where $TP$ is the number of true positives and $FN$ is the number of false negatives.
The results obtained are presented in Figures 5 and 6, which show the accuracy and loss on these datasets. The results of the various EfficientNet models are averaged at 100 epochs, and the accuracy is determined using the test set. The models performed well at extracting and learning discriminative features from the dataset. EfficientNet B8 attains the best accuracy of 98.45%, and the EfficientNet results are reported in Table 4.
An advantage of EfficientNets is that they are smaller, with fewer parameters, and faster, and they transfer successfully to the datasets. The worst performing EfficientNet is B2, as shown in Table 4. Even though it has few parameters, it may have performed poorly because the images were downsampled to conform to the model's image input size. Performance generally improves as the model gets deeper. EfficientNet B0 started poorly, began to converge, with little noise, around the 30th iteration, and then stabilized until the 50th iteration, when overfitting started. The best performing EfficientNet is B8, as shown in Table 4, which is attributed to its large number of parameters. It began to converge from the 60th iteration and then stabilized until the 90th iteration, when overfitting started. The results were high when the dataset was large and had an equal number of samples per class. Determining the most suitable hyperparameters was one of the challenges faced, as was overfitting, which resulted from the limited data samples.
The results of the proposed methods compared with related studies are presented in Figure 7.

Conclusion
This study investigated and implemented EfficientNet models to automatically identify ears on the most prominent publicly available datasets. EfficientNets, which have achieved state-of-the-art performance over other architectures in maximizing accuracy and efficiency, were explored and fine-tuned on profile images. The fine-tuning technique makes it possible to use rich generic features learned from large dataset sources such as ImageNet to compensate for the lack of annotated datasets in the ear domain. The experimental results show the effectiveness of EfficientNets in extracting and learning distinctive features from the ear images and then classifying them into the appropriate left or right class. Out of the nine EfficientNet variants explored in this study, EfficientNet B8 outperformed the others, as evident in Table 5 and depicted in Figure 7. One of the significant drawbacks of the proposed approach is training the model on small datasets and on images with low resolutions.
These limitations can easily result in significant overfitting. Overcoming this requires effective image preprocessing techniques. Although the proposed methodology is designed for ear detection, it could be extended to detect other parts of the face, given suitable datasets.

Table 5: Comparison with related studies (reported result, %).
Study                          Result (%)
Tian and Mu [4]                69.33
Raveane et al. [5]             98
Zhang and Mu [6]               99.11
Kohlakala and Coetzer [7]      95.63
Tomczyk and Szczepaniak [8]    NA
Alshazly et al. [9]            22
Alkababji and Mohammed [10]    97.8
Jamil et al. [11]              97
Hansley et al. [12]            NA
Average of our work            97.07

Data Availability
Datasets used to support the findings of the study are publicly available.

Conflicts of Interest
The authors declare that they have no conflicts of interest.