Ear Biometrics Using Deep Learning: A Survey

This paper explores ear biometrics using a mixture of feature extraction techniques and classifies the resulting feature vector using deep learning with a convolutional neural network. This exploration of ear biometrics uses images from 2D facial profiles and facial images. The investigated feature techniques are Zernike moments, local binary pattern, Gabor filter, and Haralick texture moments. The normalised feature vector is used to examine whether deep learning using a convolutional neural network is better at identifying the ear than other commonly used machine learning techniques. The widely used machine learning techniques used for comparison are decision tree, naïve Bayes, K-nearest neighbors (KNN), and support vector machine (SVM). This paper showed that using a bag of feature techniques together with deep learning classification using a convolutional neural network was better than standard machine learning techniques. The result achieved by deep learning using a convolutional neural network was a 92.00% average ear identification rate for both left and right ears.


Introduction
The ear begins to develop on a fetus between the fifth and seventh weeks of pregnancy [1]. At this stage of the pregnancy, the face acquires a more distinguishable shape as the mouth, nostrils, and ears begin to form. There is still no exact timeline at which the outer ear is created during pregnancy, but it is accepted that a cluster of embryonic cells connect to establish the ear. These are called auricular hillocks, which begin to grow in the lower portion of the neck. The auricular hillocks broaden and intertwine within the seventh week to deliver the ear's shape. Within the ninth week, the hillocks move to the ear canal and are more noticeable as the ear [1]. The external anatomy of the ear can be seen in Figure 1. The growth of the ear in the first four months after birth is linear. Between the ages of four months and eight years, the ear continues to stretch as it develops. After this, the ear size and shape are constant until the age of seventy, when the ear increases in size again.
Biometrics is the recognition of a human using their biometric characteristics, which may be physiological or behavioural.
The physiological biometric features are DNA, face, ear, iris, fingerprint, hand geometry, hand vein, and palm print, with the behavioural biometrics being signatures, gait pattern, and keystrokes. Voice is considered a combination of behavioural and physiological. Numerous systems have been developed to distinguish biometric traits, and these have been used in applications such as forensic investigations and security systems. With the present worldwide pandemic, facial identification has failed due to users wearing masks. However, the human ear has proven more suitable as it remains visible. Table 1 compares the biometrics on the following characteristics: performance, distinctiveness, permanence, ability to be collected, and acceptability.
Among the different physiological biometric traits, the ear has received much attention of late, as it is considered a reliable biometric for human recognition [2]. The ear biometric is dependable as it does not change, it is of uniform tone, and its position is fixed at the centre of the side of the face. An individual's ear is larger than a fingerprint, which makes it simpler to capture an image of the subject without necessarily needing to gain information from the subject [2].
There are numerous difficulties in correctly gauging the details of the ear. These include concealment of the ear by clothes, hair, ear ornaments, and jewellery. Another source of interference is the angle at which the image was taken, which can conceal essential characteristics of the ear's anatomy. These difficulties have given ear recognition a secondary role among the systems and techniques commonly used for identification and verification. This paper's contributions are summarised below.
(1) A survey has been conducted of different deep learning architectures
(2) A study of the present ear benchmark databases and their suitability for ear identification
(3) Different algorithms used for ear identification were outlined, highlighting the weaknesses and strengths
(4) A review of the present deep learning algorithms used for ear identification
The remainder of this work is organised as follows: Section 2 presents the background on deep learning; Section 3 presents the majority of the ear databases that are accessible for research; Section 4 presents a study of ear recognition algorithms; the different deep learning techniques used to identify the ear are introduced in Section 5; and Section 6 presents the conclusion.

Review of Deep Learning
Deep learning is an AI model that utilises numerous layers to progressively understand the data. This paper will discuss the structures and contemporary strategies for deep learning designs in AI models that find the correct representation for the input data.

Neural Network (NN).
A neural network (NN) is a type of machine learning algorithm that learns representations from data [3,4]. A neuron is a processing unit that is connected to other units in the network. Each connection has a weight that is adjusted during the training process. In a feed-forward neural network, each neuron is a function $f(x; \theta)$ that maps an input to an output. The network learns the values of the parameters $\theta = (w, b)$, where $w$ is a weight vector and $b$ a scalar.
The first layer within the network is the input layer, and the last layer is the output layer. The middle layers are referred to as the hidden layers. When there are many hidden layers, the network is referred to as a deep neural network; this is depicted in Figure 2.
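The mapping from input to output can be sketched in a few lines of Python. This is a minimal illustration, not the formulation used in the surveyed works: the affine form w·x + b for a single neuron, the tanh activation on the hidden layer, and the names `neuron` and `forward` are all assumptions made for the example.

```python
import numpy as np

def neuron(x, w, b):
    """Single neuron f(x; theta) with theta = (w, b): weighted sum plus bias (assumed form)."""
    return float(np.dot(w, x) + b)

def forward(x, layers):
    """Pass x through a list of (W, b) layers; tanh (an assumption) on hidden layers only."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b
        a = np.tanh(z) if i < len(layers) - 1 else z
    return a

x = np.array([1.0, 2.0])
out = neuron(x, w=np.array([0.5, -0.25]), b=0.1)  # 0.5 - 0.5 + 0.1 = 0.1

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(3, 2)), np.zeros(3)),  # input -> hidden layer
          (rng.normal(size=(1, 3)), np.zeros(1))]  # hidden -> output layer
y = forward(x, layers)
```

Adding more `(W, b)` pairs to `layers` gives more hidden layers, which is what makes the network "deep" in the sense described above.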

Convolutional Neural Network (CNN).
A convolutional neural network (CNN) is an NN that joins two or more layers together to produce one composite layer. The convolutional layer is able to learn features from the input data. By stacking many convolutional layers, the network is able to learn a hierarchy of increasingly complex features [3]. A pooling layer is usually added between successive convolutional layers to reinforce essential elements. In doing so, the CNN reduces the number of parameters that are passed to the subsequent layers. This is depicted in Figure 3.

Building Block for Convolutional Neural Networks
The convolutional layer is a set of learnable filters or kernels that slide over the entire input volume, performing a dot product between the entries of the filter and the input layer [5]. The convolutional operation first extracts patches from its input in a sliding window fashion and then applies the same linear transformation to all the patches. The output of the convolutional operation is referred to as a feature map. The network learns filters that recognise the visual patterns in the input data. This is often written in terms of $x^{l}_{ij}$, where $x^{l}_{ij}$ is the computed input and is the sum of the contributions from the cells of the previous layer.
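The sliding-window operation described above can be sketched as follows. This is a minimal single-channel illustration under stated assumptions: no padding, a stride of 1, and no kernel flipping (cross-correlation, as most deep learning libraries compute it). The function name `conv2d` and the toy kernel are hypothetical.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take a dot product with each patch."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]   # patch extracted by the sliding window
            out[i, j] = np.sum(patch * kernel)  # same linear transformation for every patch
    return out

image = np.arange(16, dtype=float).reshape(4, 4)  # image[i, j] = 4*i + j
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                  # difference along the diagonal
feature_map = conv2d(image, kernel)               # every entry equals -5.0
```

Because the same kernel weights are reused at every position, the layer has far fewer parameters than a fully connected layer over the same input.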

Pooling Layer.
A pooling layer usually follows one or more convolutional layers and is used to reduce the dimensions of the feature map while keeping the essential elements [3]. A pooling layer is applied to a rectangular neighbourhood using a sliding window operation. Common pooling operations are maximum, depicted in Figure 4, average, depicted in Figure 5, and weighted global pooling.
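Maximum and average pooling can be illustrated with a short sketch. It assumes non-overlapping square windows (stride equal to the window size), which is one common choice rather than something the text specifies; `pool2d` is a hypothetical helper.

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Pool non-overlapping size x size neighbourhoods of a 2D feature map."""
    H, W = fmap.shape
    H2, W2 = H // size, W // size
    # Group pixels so that axes 1 and 3 index positions inside each window.
    blocks = fmap[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 0., 5., 6.],
                 [1., 1., 7., 8.]])
max_pooled = pool2d(fmap, 2, "max")  # [[4, 2], [1, 8]]
avg_pooled = pool2d(fmap, 2, "avg")  # [[2.5, 1.0], [0.5, 6.5]]
```

Both variants halve each spatial dimension here; maximum pooling keeps the strongest response in each window, while average pooling smooths it.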

Nonlinearity Layer.
The nonlinearity layer involves three steps. In step one, the layer performs the convolutional operation on the input feature map and produces a linear activation [3]. The second step applies the nonlinear transformation, and lastly, the pooling layer is used to modify the output. The nonlinear transformation can be carried out using activation functions; this gives the network the ability to learn a nontrivial representation, making the network resilient to slight modifications or noise in the input data and improving the computational efficiency.
This is often written in terms of $Y_i^{(l-1)}$, where $l$ is the nonlinearity layer and the volume $Y_i^{(l-1)}$ comes from the convolutional layer $l - 1$.
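As a hedged illustration of the nonlinear step, the sketch below applies an activation function element-wise to a volume from the previous layer. The choice of ReLU and sigmoid is an assumption (the text does not name specific activation functions here), and `Y_prev` is a hypothetical stand-in for the volume from layer l − 1.

```python
import numpy as np

def relu(y):
    """Rectified linear unit: keeps positive activations, zeroes the rest."""
    return np.maximum(y, 0.0)

def sigmoid(y):
    """Logistic function: squashes activations into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))

Y_prev = np.array([[-2.0, 0.0],
                   [1.5, -0.5]])  # linear activations from the convolutional layer
Y_relu = relu(Y_prev)             # [[0.0, 0.0], [1.5, 0.0]]
Y_sig = sigmoid(Y_prev)           # same shape, values in (0, 1)
```

The output keeps the shape of the input volume; only the values are transformed.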

Fully Connected Layer.
The convolutional layers are used as a feature extractor. The features produced are then passed to the fully connected layers for classification. Each unit in the fully connected layer is connected to all the units in the previous layer. The last layer is usually a classifier that produces a probability map over the different classes. All the features are converted into one-dimensional feature vectors before being passed into the fully connected layer. The drawbacks of this step are that spatial information in the image data is lost, it has a high computational cost, and it can only work with images that are of the same size [6].
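The flattening and classification steps can be sketched as below. This is an illustrative example only: the softmax classifier, the random weights, and the sizes (4 feature maps of 3 × 3, 5 classes) are assumptions chosen for the demonstration.

```python
import numpy as np

def softmax(z):
    """Turn class scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def fully_connected(feature_maps, W, b):
    """Flatten the feature maps to a 1D vector, then apply a dense layer and softmax."""
    v = feature_maps.reshape(-1)  # spatial layout is discarded at this point
    return softmax(W @ v + b)

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(4, 3, 3))  # hypothetical output of the last pooling layer
n_classes = 5
W = rng.normal(size=(n_classes, feature_maps.size)) * 0.1
b = np.zeros(n_classes)
probs = fully_connected(feature_maps, W, b)
```

The weight matrix `W` has one column per flattened feature, which is why the layer only accepts inputs of a fixed size.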

Optimisation.
The performance of the deep CNN can be improved by training the network on a large data set. Training involves looking for the parameters of the model that minimise the cost function [3]. Gradient descent, shown in equation (5), is a widely used method for updating the network parameters through the backpropagation algorithm. The optimisation can be carried out at any stage in the process.

Applied Computational Intelligence and Soft Computing

Loss Function.
A loss function is used in machine learning to evaluate how well a specific model fits the data. The main goal of training an NN is to make the loss low. The loss is high when the prediction is far from the actual value and low when the prediction is close to the actual value [3]. Two commonly used loss functions are the mean squared error, which is calculated by taking the mean of the squared differences between the actual and predicted values, and the binary cross entropy, which is used to classify the data into two classes by passing the output node through a sigmoid function that gives an output between 0 and 1.
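The two loss functions and a plain gradient descent update can be sketched on a toy problem. This is not the source's equation (5), which is not reproduced in the text; the update rule (parameter minus learning rate times gradient), the learning rate of 0.1, and the linear model y = w·x + b are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Loss for two-class problems; p is a sigmoid output in (0, 1)."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Toy data generated from y = 2x + 1 (hypothetical).
x = np.linspace(-1, 1, 20)
y = 2.0 * x + 1.0

w, b, lr = 0.0, 0.0, 0.1  # starting point and learning rate (assumed values)
losses = []
for _ in range(500):
    pred = w * x + b
    losses.append(mse(y, pred))
    grad_w = np.mean(2 * (pred - y) * x)  # d(MSE)/dw
    grad_b = np.mean(2 * (pred - y))      # d(MSE)/db
    w -= lr * grad_w                      # gradient descent step
    b -= lr * grad_b

bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
```

In this toy run the loss shrinks at every step and `w` and `b` approach 2 and 1; in a real CNN the gradients come from backpropagation rather than closed-form expressions.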

Parameter Initialisation.
Deep learning optimisation algorithms are iterative and require the user to state a starting point for the algorithm; choosing this starting point is referred to as parameter initialisation. The starting point that the user chooses influences how fast learning converges [3].

Hyperparameter Tuning.
Hyperparameters are the parameters that the user supplies to control the algorithm's behaviour before training starts, such as the learning rate, batch size, or image size; hyperparameter tuning is the process of choosing their values [3].

Regularisation.
Regularisation is a technique for improving the performance of machine learning algorithms on unseen data [3]. Regularisation is carried out to reduce overfitting to the training set, which occurs when the gap between the training and test error is too large.

Single Pathway.
A single pathway is a basic network that resembles a feed-forward deep neural network [7]. Using one path, the data moves from the input layer to the classification layer. Kleesiek et al. [8] proposed a 3D single-path CNN that has fully connected convolutional layers and a classification layer, which allows the network to classify multiple 3D pixels in a single pass.

Cascaded Architecture.
In the cascaded architecture, the output of one CNN is concatenated with another [9]. There are many variations of this architecture within the literature, but the input cascade is prominent. In this architecture, the output of one CNN becomes a direct input of another CNN. The input cascade is employed to concatenate the contextual information to the second CNN as additional image channels. The cascaded architecture is an improvement on the single pathway, which performs multiscale label prediction separately.
There are many other cascaded architectures, such as local pathway concatenation and hierarchical segmentation.

UNET.
UNET is an improved convolutional network that resembles an encoder and decoder network and was designed for biomedical image segmentation [10]. The network consists of a contracting path and an expansive path, which give it its U-shaped architecture. The contracting path consists of the repeated application of two convolutional layers, each followed by a rectified linear unit, and a max pooling layer; along this path the spatial information is reduced while the feature information is increased. The expansive path consists of upsampling operations combined with high-resolution features from the contracting path through skip connections.

AlexNet Architecture.
The AlexNet architecture is a simple but powerful CNN architecture consisting of convolutional and pooling layers [11], with fully connected layers at the top. The benefits of AlexNet include the scale at which it uses the GPU for training and performing the task. This architecture remains a starting point for applying deep neural networks, specifically for computer vision and speech recognition.

Visual Geometry Group Architecture.
The Visual Geometry Group (VGG) architecture is a network created by Visual Geometry Group researchers at Oxford University [12]. It is characterised by a pyramidal shape because it comprises groups of convolutional layers followed by pooling layers; these pooling layers make the layers narrower in shape. Its benefits include being a good architecture for benchmarking on any task. The pretrained VGG networks are also widely used for different applications, but they require considerable computational resources and are slow to train, above all when training from scratch.

GoogLeNet Architecture.
The GoogLeNet architecture is referred to as the inception network and was created by Google researchers [13]. It is made up of twenty-two layers, and these layers can either convolve or pool the input. The architecture contains many inception modules stacked over each other, allowing joint and parallel training, which helps with faster convergence. The benefits are faster training and a reduced size. It has, however, been followed by the Xception network, which could increase the point of divergence of the inception module.

Residual Network (ResNet) Architecture.
The residual network (ResNet) architecture is a 152-layer deep CNN architecture built from residual blocks. It is deeper than the AlexNet and VGG architectures, yet it is less computationally complex than these networks. It is referred to as a residual network [14] and is made up of numerous successive residual modules, which are the essential building blocks of the architecture. These modules are stacked to produce an end-to-end network.
The advantage of this architecture is that its many residual layers improve performance and make such a deep network trainable.
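A residual module can be sketched as a small transformation whose input is added back to its output through a skip connection. The two-layer form of F, the ReLU activation, and the omission of an activation after the addition are simplifying assumptions for this illustration; `residual_block` is a hypothetical name.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = F(x) + x, where F(x) = W2 @ relu(W1 @ x) and '+ x' is the skip connection."""
    return W2 @ relu(W1 @ x) + x

x = np.array([1.0, -2.0, 3.0])
zero = np.zeros((3, 3))
y_identity = residual_block(x, zero, zero)  # F(x) = 0, so the block returns x unchanged

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
# Stacking blocks end to end: the output of one block feeds the next.
y_stacked = residual_block(residual_block(x, W1, W2), W1, W2)
```

Because a block with zero weights reduces to the identity, extra blocks can be stacked without making the signal harder to pass through, which is one common explanation for why very deep residual networks remain trainable.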

ResNeXt Architecture.
The ResNeXt architecture is a state-of-the-art technique for visual perception and is a hybridisation of the inception and ResNet architectures [15]. ResNeXt is referred to as the aggregated residual transform network and is an improvement over the inception network. It applies the split, transform, and merge concept in a powerful but simple way by introducing cardinality. It uses residual learning, which enhances the convergence of deep and wide networks. ResNeXt uses many transformations within a split, transform, and merge block, and cardinality defines the number of these transformations. ResNeXt uses a mixture of the VGG topology and the GoogLeNet architecture, fixing the spatial resolution by using 3 × 3 filters within the split, transform, and merge blocks. The increase in cardinality improves the performance and produces a different and improved architecture.

Advance Inception Network.
The advanced inception networks include Inception-V3, Inception-V4, and Inception-ResNet. These are improved versions of Inception-V1, Inception-V2, and GoogLeNet [16]. Inception-V3 reduces the computational cost of deep networks without affecting generalisation. Szegedy et al. [17] replaced large-sized filters (5 × 5 and 7 × 7) with small and asymmetric filters (1 × 7 and 1 × 5) and used 1 × 1 convolution as a bottleneck before the large filters. Inception-ResNet combines the strengths of residual learning and the inception block.
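The saving from these filter replacements can be checked with simple weight counts. The sketch below is illustrative arithmetic only: pairing the 1 × 7 filter with a 7 × 1 filter is an assumption (the text lists only the 1 × 7 and 1 × 5 shapes), and the channel count of 64 and the bottleneck reduction factor of 4 are hypothetical values.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of weights in a convolutional layer (biases ignored)."""
    return kh * kw * c_in * c_out

c = 64  # hypothetical number of input and output channels

full_7x7 = conv_weights(7, 7, c, c)                                 # 200704
factorised = conv_weights(1, 7, c, c) + conv_weights(7, 1, c, c)    # 57344
bottleneck = conv_weights(1, 1, c, c // 4) + conv_weights(7, 7, c // 4, c)  # 51200
```

Under these assumptions the asymmetric pair needs about 29% of the weights of the full 7 × 7 filter, and a 1 × 1 bottleneck placed before the 7 × 7 filter needs about 26%.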

DenseNet Architecture.
The DenseNet architecture [16] is similar to ResNet and was created to address the vanishing gradient problem. DenseNet utilises cross-layer connectivity by connecting each preceding layer to the subsequent layers in a feed-forward manner. This was carried out to address a shortcoming of ResNet, which explicitly preserves information through identity transformations, increasing its complexity. As it uses dense blocks, it allows the feature maps of all previous layers to be used as the inputs into the subsequent layers.
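The cross-layer connectivity can be sketched with feature vectors standing in for feature maps. This is a simplified illustration: dense layers replace convolutions, and the growth rate of 2, the input size of 3, and the name `dense_block` are assumptions.

```python
import numpy as np

def dense_block(x, weights):
    """Each layer receives the concatenation of the input and all earlier layer outputs."""
    features = [x]
    for W in weights:
        inp = np.concatenate(features)             # feature maps of all previous layers
        features.append(np.maximum(W @ inp, 0.0))  # new features produced by this layer
    return np.concatenate(features)

growth, c0, n_layers = 2, 3, 3  # hypothetical sizes
rng = np.random.default_rng(0)
# Layer i sees c0 + i * growth inputs, because earlier outputs accumulate.
weights = [rng.normal(size=(growth, c0 + i * growth)) for i in range(n_layers)]
out = dense_block(np.ones(c0), weights)  # length c0 + n_layers * growth = 9
```

The original input is still present, unchanged, at the start of `out`, which shows how earlier features stay directly available to later layers.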

SqueezeNet Architecture.
Hu et al. [18] proposed an auxiliary block for selecting the feature maps used for object discrimination. The new block, named the SE-block, suppresses the less important feature maps and boosts the class-specific feature maps. It was created so that it can be added to any CNN architecture before the convolution layer. It has two primary operations: squeeze and convolution. The convolution kernel captures local information but ignores the contextual relations between features, while the squeeze operation captures global information from the feature maps. The result is a more robust architecture that is also helpful when there is low bandwidth.
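A rough sketch of how such a block can reweight feature maps is given below. Only the squeeze step (a global summary per feature map) is taken from the description above; the gating step with two small dense layers and a sigmoid follows the commonly described form of the SE-block and is an assumption here, as are the sizes and the name `se_block`.

```python
import numpy as np

def se_block(fmaps, W1, W2):
    """Reweight each feature map by a gate computed from global information."""
    s = fmaps.mean(axis=(1, 2))          # squeeze: one global value per feature map
    h = np.maximum(W1 @ s, 0.0)          # small hidden layer (assumed form)
    g = 1.0 / (1.0 + np.exp(-(W2 @ h)))  # one gate per feature map, between 0 and 1
    return fmaps * g[:, None, None], g   # low gates suppress maps, high gates keep them

rng = np.random.default_rng(0)
fmaps = rng.normal(size=(8, 4, 4))  # 8 hypothetical feature maps of 4 x 4
W1 = rng.normal(size=(2, 8))
W2 = rng.normal(size=(8, 2))
scaled, gates = se_block(fmaps, W1, W2)
```

The block leaves the shape of the feature maps unchanged, which is why it can be inserted into an existing CNN without altering the surrounding layers.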

Xception Architecture.
The Xception architecture is referred to as an extreme inception architecture that exploits depthwise separable convolution [19]. The original inception block is modified by making it wider and substituting the different spatial dimensions (1 × 1, 5 × 5, and 3 × 3) with a single dimension (3 × 3) followed by a 1 × 1 convolution to regulate computational complexity. It makes the network computationally efficient by decoupling the spatial and feature map channels.
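The efficiency gain from decoupling the spatial and channel dimensions can be illustrated with a weight count. The channel sizes (128 in, 256 out) are hypothetical, and biases are ignored.

```python
def standard_conv_weights(k, c_in, c_out):
    """A standard k x k convolution mixes space and channels in one step."""
    return k * k * c_in * c_out

def separable_conv_weights(k, c_in, c_out):
    """Depthwise separable: one k x k filter per input channel, then a 1 x 1 convolution."""
    depthwise = k * k * c_in  # spatial filtering, channel by channel
    pointwise = c_in * c_out  # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_weights(3, 128, 256)   # 294912
sep = separable_conv_weights(3, 128, 256)  # 33920
```

For these example sizes the separable form needs roughly 11.5% of the weights of the standard 3 × 3 convolution.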

Deep Reinforcement Learning.
Deep reinforcement learning [20] is a system trained entirely from scratch, starting from random behaviour and progressing to accurate domain knowledge gained from experience. It is a mixture of reinforcement learning and deep learning that uses fewer computational resources and less data. The algorithm can learn from its environment and can be applied to any sequential decision-making problem, including image analysis.

Fully Convolutional Network.
A fully convolutional network [21] is a set of convolutional and pooling layers. Bi et al. [22] developed a multistage fully convolutional network with a parallel integration method for segmentation. The network in [23] is a particular type of artificial neural network that builds on a pyramidal structure by utilising skip connections that skip some convolutional layers. It is composed mainly of multiple convolutional layers.

Convolutional and Deconvolutional Neural Networks.
This architecture is formed from two significant parts: convolutional and deconvolutional networks [24]. Deconvolutional networks are CNNs that operate in a reversed process, while the convolutional networks extract discriminative features. The deconvolutional layers are applied to smooth the segmentation maps and obtain the final high-resolution output.

Residual Attention Neural.
Zhou et al. [25] designed the residual attention neural network, which improves the feature representation of CNNs by incorporating attention modules into the CNN, forming a network capable of learning object-aware features. It employs a feed-forward CNN that stacks residual blocks with an attention module. It combines two different learning strategies in the attention module, which permits fast feed-forward processing and top-down attention feedback in a single feed-forward process to supply dense features that infer each pixel. The bottom-up feed-forward structure produces low-resolution feature maps with reliable semantic information. The top-down learning strategy globally optimises the network such that it gradually outputs the maps to the input during the training process. Table 2 shows a summary of the deep convolutional neural network architectures used for ear identification.

Overview of the Ear Dataset
Many factors can affect an ear detection system's performance. Some ear image datasets are easier to use than others. The more ear datasets are available to researchers, the more this field can evolve and grow. It is always good to use high-quality images in research associated with soft biometrics. A brief description of some of the available ear databases is given in Table 3, and examples of images are shown in Figures 6 and 7.

Mathematical Analysis of Images (AMI) Ear Database.
The AMI ear database was collected at the University of Las Palmas. The database comprises 700 ear images of 100 distinct Caucasian adult males and females between 19 and 65 years of age. All images within the database were taken under the same illumination and with a fixed camera position. Both the left- and right-hand sides of the ears were captured. The images are cropped so that the ear area covers almost half of the image. The pose of the subjects varies in yaw and pitch angles, and the dataset is publicly available.

The Indian Institute of Technology (IIT) Delhi Ear Database.
The IIT database [26] was collected by the Indian Institute of Technology Delhi in New Delhi between October 2006 and June 2007. The database is formed from 421 images of 121 distinct adults, both male and female. All images were taken indoors, with no significant occlusions present, and only the right-hand side of the ear was captured. The images in the dataset are provided both raw and normalised. The normalised images are in greyscale with a size of 272 × 204 pixels.

The University of Beira Ear (UBEAR) Database.
The University of Beira presented the UBEAR database [27]. The database comprises 4429 images of 126 subjects, both male and female. The images were taken under varying lighting conditions and angles, and partial occlusions were present. Both left- and right-hand side ear images are provided.

The Annotated Web Ear (AWE) Database.
The AWE ear database [28] is a set of images of public figures collected from the web. The database is formed from 1000 images of 100 different subjects; the images vary in size and are tightly cropped. Both the left- and right-hand sides of the ears were taken.

EarVN1.0.
The EarVN1.0 database [29] comprises 28412 images of 164 Asian male and female subjects, and both the left- and right-hand sides of the ear were captured. It was collected during 2018 under unconstrained conditions, including varying camera systems and lighting conditions. The images are cropped from facial images to obtain the ears, and they have significant variations in pose, scale, and illumination.

The Western Pomeranian University of Technology Ear (WPUTE) Database.
The Western Pomeranian University of Technology Ear (WPUTE) database [32] was obtained in 2010 to gauge ear recognition performance on images obtained in the wild. The database contains 2071 ear images belonging to 501 subjects. The images are of various sizes, include both the left- and right-hand sides of the ear, and were taken under different indoor lighting conditions and rotations. Some occlusions are included in the database, namely headsets, earrings, and hearing aids.

The Unconstrained Ear Recognition Challenge (UERC).
The Unconstrained Ear Recognition Challenge (UERC) database [14] was obtained in 2017, then extended in 2019, and is a mix of two existing databases and a newly created one. The database contains 3706 subjects with 11804 ear images, and it includes both right- and left-hand side images.

In the Wild Ear (ITWE) Database.
The In the Wild Ear (ITWE) database [33] was created for recognition evaluation and has 2058 images in total of 231 male and female subjects. The ear images were obtained using bounding boxes, and the coordinates of these bounding boxes were released with the collection. The images contain cluttered backgrounds and are of variable size and resolution. The database includes both the left- and right-hand sides of the ear, but no distinction between the two is given.

The University of Science and Technology, Beijing (USTB) Ear Database.
The University of Science and Technology Beijing (USTB) Ear Database [30] contains cropped ear and head profile images of male and female subjects split into four sets. Dataset one includes 60 subjects and has 180 close-up images of the right ear taken during 2002. These images were taken under different lighting and experienced some shearing and rotation. Dataset two contains 77 subjects and has 308 images of the right-hand side ear, taken approximately 2 m away from the ear in 2004. These images were taken under different lighting conditions. Dataset three contains 103 subjects and has 1600 images. These images were taken during 2004. The images cover right and left rotations and have dimensions of 768 × 576. Dataset four contains 25500 images of 500 subjects; these were obtained from 2007 to 2008, with the subject in the centre of the camera circle. The images were taken when the subject looked upwards, downwards, and at eye level. The images in this dataset contain different yaw and pitch poses. The databases are available on request and accessible for research.

The Carreira-Perpinan (CP) Ear Database.
The Carreira-Perpinan (CP) ear database [34] is an early ear dataset utilised for ear recognition systems. It was created in 1995 and contains 102 images of 17 subjects. The images were captured in a controlled environment and therefore include only minor pose variation.

The Indian Institute of Technology, Kanpur (IITK) Ear Database.
The Indian Institute of Technology Kanpur (IITK) ear database [35] was compiled by the Indian Institute of Technology Kanpur. The database is split into three sets. The first set consists of profile images of 190 male and female subjects, with 801 images in total. The second set also contains 801 images in total, of 89 subjects.

Table 2: Summary of the deep convolutional neural network architectures used for ear identification.
Architecture | Description | Reported value
AlexNet [11] | AlexNet is a deep convolutional neural network architecture that has been applied to numerous ear recognition systems. | 53.6
DenseNet [16] | DenseNet connects each layer in the CNN to the others and has been applied to ear image datasets, yielding positive results. | 62.0
ResNet [14] | ResNet is a class of extremely deep CNN architectures that addresses the vanishing gradient by using skip connections that prevent information loss as the network goes deeper. As ResNet addresses the vanishing gradient issue, it has been applied to numerous ear image datasets, yielding positive results. | 15.0
ResNeXt [15] | ResNeXt is a modularised CNN architecture that has been applied to ear image datasets, yielding positive results. | 95.8
Visual geometry group [12] | The visual geometry group network is a very deep CNN and is one of the top performers. The VGG is used in recognition systems and has been applied to unconstrained ear image datasets, yielding positive results. | 83.0

The Forensic Ear Identification Database (FEARID).
The Forensic Ear Identification Database (FEARID) [36] is different from other databases as it contains ear prints. These contain no occlusions, variable angles, or illumination. Though there is no mention of any variables, other influences, such as the force with which the ear was pressed against the scanner and the scanner's cleanliness, need to be considered. This database comprises 7364 images of 1229 subjects. This database was used for forensic application and not for biometric use.

The University of Notre Dame (UND) Database.
The University of Notre Dame (UND) database [37] contains many subsets of 2D and 3D ear images. These images were acquired over a period from 2003 to 2005. The database contains 3480 3D images from 952 male and female subjects and 464 2D images from 114 male and female subjects. These images were taken under different lighting conditions, yaw and pitch poses, and angles. The images are only of the left-hand side ear.

The Face Recognition Technology (FERET) Database.
The Face Recognition Technology (FERET) database [38] is a sizeable facial image database that was obtained between 1995 and 1996. It contains 1564 subjects and has a total of 14126 images. These images were collected for face recognition and include left- and right-hand profile images, which makes them well suited to 2D ear recognition.

The Pose, Illumination, and Expression (PIE).
Carnegie Mellon University obtained the Pose, Illumination, and Expression database [39], which contains 40000 images of 68 subjects. The images are of the facial profile and have different poses, illuminations, and expressions.

The XM2VTS Ear Database.
The XM2VTS ear database [40] consists of frontal and profile face images from the University of Surrey; the database contains 295 subjects and 2360 images captured under controlled conditions. These images are a set of cropped images of size 720 × 576 and were taken from video data.

The West Virginia University (WVU) Ear Database.
The West Virginia University (WVU) ear database [41] is a video database formed from 137 subjects. The system used an advanced capturing procedure that allowed the ear to be captured at different angles; these images included earrings and eyeglasses.

Description of Ear Algorithms
This section presents different algorithms and techniques used for ear identification. It presents a description of these algorithms and suggests the most effective approach. A brief description of the ear algorithms is given in Table 4.
Ansari and Gupta [42] used the outer helix curves of the ears, as they move parallel to at least one feature spot in the ear image. Helix curves were obtained using the Canny edge detector to extract the ear from the entire image. The obtained edges are then separated into convex or concave edges, allowing the system to determine the helix edges. This technique was run on 700 side-ear images and had an accuracy of roughly 93%.
Abdel-Mottaleb and Zhou [43] segmented the ear from a facial profile image using template matching, where they modelled the ear by its external curve. Yuizono et al. [44] also used a template matching technique for detection, in which they used hierarchical 2D images. In 3D ear detection, Chen and Bhanu [45] used a model-based (template matching) technique for ear detection. An averaged histogram of the shape index represents the model template. The detection is a four-step process: edge detection and thresholding, image dilation, connected component labelling, and template matching. A test set of 30 subjects from the UCR database achieved a 91.5% detection rate with a 2.52% false alarm rate. Later, Chen and Bhanu [45] developed another shape-model-based technique for locating human ears inside face range images, where the ear shape model is represented by a group of discrete 3D vertices corresponding to the helix and antihelix parts. They started by locating the edge segments and grouping them into different clusters that are potential ear candidates. Arbab-Zavar and Nixon [46] developed an ear recognition system based on the ear's elliptical shape, employing a Hough transform (HT). They achieved a 100% detection rate using the XM2VTS face profile database, consisting of 252 images from 63 subjects, and 91% using the UND database, collection F. Burge and Burger [47] proposed a way to perform ear recognition using geometric information about the ear. The ear is represented by a neighbourhood graph obtained from a Voronoi diagram of the ear edge segments, whereas template comparison is performed using subgraph matching. Choras [48] used the ear's geometric properties to propose an ear recognition technique in which feature extraction is performed in two steps. In the first step, global features are extracted. The second step extracts local features, which are then used for matching.
In another geometry-based technique proposed by Shailaja and Gupta [49], an ear is represented by two sets of features, global and local, obtained using the outer and inner ear edges, respectively. In this technique, two ears are declared similar if both feature sets match. The proposed method treats the ear as a planar surface and creates a homography transform using SIFT feature points to register ears accurately. It achieved robust results under background clutter, viewing-angle changes, and occlusion. Cummings et al. [50] used the image ray transform, based upon an analogy to light rays, to detect ears in an image. This transform can highlight tubular structures such as the helix of the ear and spectacle frames. By exploiting the elliptical shape of the helix, this method segmented the ear region and achieved a detection rate of 99.6% on the XM2VTS database.
Chen and Bhanu [45] fused skin colour from colour images and edges from range images to perform ear detection. They observed that the edge magnitude is more prominent around the helix and antihelix parts. They clustered the resulting edge segments and deleted the short, irrelevant edges. On the UCR database, they reported a correct detection rate of 99.3% (896 out of 902). On the UND databases (collection F and a subset of collection G), they reported a correct detection rate of 87.71% (614 out of 700). Hajsaid et al. [51] addressed automated ear segmentation by employing morphological operators. They used low-computational-cost appearance-based features for segmentation and a learning-based Bayesian classifier to determine whether the segmentation output was incorrect. They achieved 90% accuracy on 3750 facial images of 376 subjects from the WVU database.
Prakash and Gupta [52] used skin colour and template-based techniques for automatic ear detection in a side-profile face image. The technique first separates skin regions from nonskin regions and then searches for the ear within the skin regions using a template matching approach. Finally, the ear region is validated using a moment-based shape descriptor. Experiments on an assembled database of 150 side-profile face images yielded an accuracy of 94%. Basrur et al. [53] introduced the notion of "jet space similarity" for ear detection, which denotes the similarity between Gabor jets and reconstructed jets obtained via principal component analysis (PCA). They used the XM2VTS database for evaluation; however, they did not report their algorithm's accuracy.
Rahman et al. [54] used a cascaded AdaBoost technique based on Haar features for ear detection. This approach is widely known in the domain of face detection as the Viola-Jones method, a fast and comparatively robust face detection technique. They trained the AdaBoost classifier to detect the ear region even in the presence of occlusions and degradation in image quality. They reported 100% detection performance for the cascaded detector tested on 203 profile images from the UND database, with a false detection rate of 5 × 10. A second experiment detected 54 ears out of 104 partially occluded images from the XM2VTS database.
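Haar features of the kind used in Viola-Jones-style cascades are differences of rectangle sums, which an integral image (summed-area table) makes available in constant time. The sketch below shows only this feature computation on a toy image; the boosting and cascade stages are omitted, and the function names are illustrative assumptions rather than code from [54].

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def haar_two_rect(ii, top, left, height, width):
    """Two-rectangle feature: left half minus right half (width must be even)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

# Toy image (hypothetical): bright left half, dark right half.
img = [[10, 10, 1, 1],
       [10, 10, 1, 1],
       [10, 10, 1, 1],
       [10, 10, 1, 1]]
ii = integral_image(img)
feature = haar_two_rect(ii, 0, 0, 4, 4)  # 80 - 8 = 72
```

A boosted cascade thresholds many such features, so that most background windows are rejected after only a few cheap evaluations.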
Chang et al. [55] built a multimodal recognition system based on face and ear recognition. For the ear images, the manually identified coordinates of the triangular fossa and the antitragus are used for ear detection. Their ear recognition system was based on the concept of eigen-ears, using principal component analysis (PCA). In one experiment, they reported a performance of 72.7% for the ear alone, compared with 90.9% for the multimodal system, using 114 subjects from the UND collection E database.
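The eigen-ear idea can be sketched as follows: ear images are flattened into vectors, the mean ear is subtracted, and each image is described by its projection onto the leading principal components. The minimal example below extracts only the first component by power iteration on toy three-dimensional vectors; the helper names, the toy data, and the use of power iteration are assumptions made for illustration and are not taken from [55].

```python
import math

def top_component(samples, iterations=200):
    """Return (mean, unit vector) of the leading principal component."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centred = [[s[j] - mean[j] for j in range(d)] for s in samples]
    v = [1.0] * d  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iterations):
        # w = X^T (X v), computed without forming the covariance matrix
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centred]
        w = [sum(proj[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    return mean, v

def project(sample, mean, v):
    """Coordinate of a sample along the component (its 'eigen-ear' weight)."""
    return sum((sample[j] - mean[j]) * v[j] for j in range(len(v)))

# Toy "images" (hypothetical) that vary only along the direction (1, 2, 0).
samples = [[0.0, 0.0, 5.0], [1.0, 2.0, 5.0], [2.0, 4.0, 5.0], [3.0, 6.0, 5.0]]
mean, v = top_component(samples)
weight = project([3.0, 6.0, 5.0], mean, v)
```

Recognition then typically compares the projection weights of a probe image with those of the enrolled images, for example with a nearest-neighbour rule.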
Naseem et al. [56] proposed a general classification algorithm for (image-based) visual perception, based on a sparse representation computed by L1 minimisation. This framework provides new insights into two crucial issues in ear recognition: feature extraction and robustness to occlusion. The ear portion is manually cropped from each image, and no normalisation of the ear region is required. They conducted several experiments on the UND and USTB databases with session variability, various head rotations, and different lighting conditions. These experiments yielded a high recognition rate, on the order of 98%.
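Classification by sparse representation can be sketched under some stated assumptions: the gallery images form the columns of a dictionary, a probe is coded as a sparse combination of those columns by L1-regularised least squares, and the class whose columns best reconstruct the probe is selected. The solver below is a plain iterative soft-thresholding loop, and the residual rule, parameter values, and toy data are illustrative choices; the source does not state which solver or decision rule [56] used.

```python
def soft_threshold(value, t):
    """Proximal operator of the L1 norm."""
    if value > t:
        return value - t
    if value < -t:
        return value + t
    return 0.0

def ista(columns, y, lam=0.01, iterations=500):
    """Minimise 0.5*||y - A x||^2 + lam*||x||_1; A is given as a list of columns."""
    k, d = len(columns), len(y)
    # 1 / ||A||_F^2 is a safe (conservative) step size for convergence.
    step = 1.0 / sum(v * v for col in columns for v in col)
    x = [0.0] * k
    for _ in range(iterations):
        recon = [sum(x[j] * columns[j][i] for j in range(k)) for i in range(d)]
        resid = [y[i] - recon[i] for i in range(d)]
        grad = [-sum(columns[j][i] * resid[i] for i in range(d)) for j in range(k)]
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(k)]
    return x

def classify(columns, labels, y, lam=0.01):
    """Pick the class whose own coefficients reconstruct y with least error."""
    x = ista(columns, y, lam)
    best_label, best_err = None, float("inf")
    for label in sorted(set(labels)):
        recon = [sum(x[j] * columns[j][i]
                     for j in range(len(columns)) if labels[j] == label)
                 for i in range(len(y))]
        err = sum((y[i] - recon[i]) ** 2 for i in range(len(y)))
        if err < best_err:
            best_label, best_err = label, err
    return best_label

# Toy gallery (hypothetical): two subjects, two feature vectors each.
gallery_columns = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
gallery_labels = ["a", "a", "b", "b"]
predicted = classify(gallery_columns, gallery_labels, [0.95, 0.05, 0.0])
```

Because an occluded probe can still be coded mostly by columns of the correct subject, this formulation is often credited with robustness to partial occlusion.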
Table 4: Summary of the ear algorithms.

| Author | Algorithms used | Accuracy (%) | Summary |
|---|---|---|---|
| Ansari and Gupta [42] | Canny edge detector | 93 | Uses outer helix curves of the ears with the Canny edge detector; this obtains only the edges of the ear and is used only to determine the helix |
| Abdel-Mottaleb and Zhou [43] | Template matching | 91.5 | They used a segmented ear obtained from a facial profile and modelled only the ear's external curve |
| Arbab-Zavar and Nixon [46] | Hough transform | 91 | They looked only at the ear's elliptical shape and used a small sample of profile ears |
| Burge and Burger [47] | Geometric information | 94 | They performed ear recognition using geometric information of the ear and used neighbourhood graphs obtained from a Voronoi diagram of the ear edge segments |
| Cummings et al. [50] | Image ray transform | 99.6 | Used the image ray transform to detect the ear in an image; it highlights only the helix of the ear and spectacle frames |
| Chen and Bhanu [45] | Fusion of skin colour from colour images and edges from range images | 87.71 | Fused skin colour from colour images and edges from range images to perform ear detection |
| Prakash and Gupta [52] | Skin colour and template-based technique | 94 | Used skin colour and template-based techniques for automatic ear detection |
| Basrur et al. [53] | Gabor jets and reconstructed jets obtained via principal component analysis | NA | Introduced the notion of "jet space similarity" but did not report their algorithm's accuracy |
| Rahman et al. [54] | Cascaded AdaBoost technique based on Haar features | 100 | This approach is widely known in the domain of face detection as the Viola-Jones method, a fast and comparatively robust face detection technique |
| Chang et al. [55] | Multimodal recognition system | 90.9 | This system is based on both face and ear recognition |
| Naseem et al. [56] | General classification algorithm | 98 | This system investigated two crucial issues: feature extraction and robustness to occlusion |
| Nanni and Lumini [57] | Multi-matcher-based technique | NA | This system considers overlapping subwindows to extract local features |
| Yan and Bowyer [58] | Contour extraction algorithm | 21 (images incorrectly segmented) | This system used only the ear contour, obtained with an active contour |
| Minaee et al. [59] | Independent component analysis and a radial basis function network | 94.11 | Decomposes the original ear image database into linear combinations of many basic images |
| Abdel-Mottaleb and Zhou [43] | Support vector machine | 100 | This approach is used for 3D ear detection, with a sliding window approach and a linear SVM classifier to identify the ear |

Nanni and Lumini [57] proposed a multi-matcher-based technique for ear recognition that captures appearance-based local properties of the ear. It considers overlapping subwindows and extracts local features from them using Gabor filters. Laplacian eigenmaps are then used to reduce the dimensionality of the feature vectors. The ear is represented using the features obtained from a set of the most discriminative subwindows, selected using the sequential forward floating selection (SFFS) algorithm. Matching in this technique is performed by combining the outputs of several 1-nearest neighbour classifiers constructed on different subwindows. Another technique, based on the fusion of colour spaces, was proposed by Nanni and Lumini, in which a few colour spaces are selected using the SFFS algorithm and Gabor features are extracted from them. Matching is performed by combining several nearest neighbour classifiers constructed on different colour components.
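Combining several 1-nearest neighbour classifiers, one per subwindow or colour component, can be sketched as a score-level fusion. The example below assumes a sum rule over per-class nearest-neighbour distances; the source does not specify the combination rule used by Nanni and Lumini, so the rule, the function names, and the toy data are assumptions.

```python
import math

def nearest_distance_per_class(query, gallery):
    """gallery: list of (label, feature_vector). Returns {label: min distance}."""
    best = {}
    for label, feat in gallery:
        d = math.dist(query, feat)
        if label not in best or d < best[label]:
            best[label] = d
    return best

def fuse_sum_rule(queries, galleries):
    """Sum the per-class 1-NN distances over all matchers; smallest total wins."""
    total = {}
    for query, gallery in zip(queries, galleries):
        for label, d in nearest_distance_per_class(query, gallery).items():
            total[label] = total.get(label, 0.0) + d
    return min(total, key=total.get)

# Toy data (hypothetical): two matchers, e.g. two subwindows of the ear.
galleries = [
    [("A", [0.0, 0.0]), ("B", [1.0, 1.0])],   # matcher 1 templates
    [("A", [5.0, 5.0]), ("B", [0.0, 0.0])],   # matcher 2 templates
]
queries = [[0.6, 0.6], [5.0, 5.0]]            # probe features per matcher

first_matcher = nearest_distance_per_class(queries[0], galleries[0])
first_vote = min(first_matcher, key=first_matcher.get)
identity = fuse_sum_rule(queries, galleries)
```

In this toy case the first matcher alone prefers subject B, whereas the fused decision is subject A, which illustrates why combining matchers can correct the errors of an individual subwindow.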
Yan and Bowyer [58] developed an automatic ear contour extraction algorithm. This was carried out by detecting the ear pit based on the position of the nose and extracting the ear contour using an active contour starting around the ear tip. Their results showed that 21% of the images tested were incorrectly segmented, but when only depth information was used, and not colour, only 15% of the images were incorrectly segmented.
A hybrid system for ear recognition was investigated by Minaee et al. [59]. This system combines independent component analysis (ICA) and a radial basis function (RBF) network. The original ear image database is decomposed into linear combinations of many basic images. The corresponding coefficients of these combinations are then used in the RBF network. They achieved an accuracy of 94.11% using two databases of segmented ear images.
A 3D ear detection system was investigated by Abdel-Mottaleb and Zhou [43]. They presented a novel shape-based feature set called histograms of categorised shapes (HCS). This feature set is used for 3D ear detection together with a sliding window approach and a linear support vector machine (SVM) classifier to identify the ear. They reported a perfect detection rate, that is, a 100% detection rate and a 0% false-positive rate.
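The combination of a sliding window with a linear classifier can be sketched as follows. For brevity, the raw pixel values of each window stand in for the HCS descriptor, and the weights and bias are hand-picked instead of being learned by an SVM; all of these are assumptions made only to keep the example self-contained.

```python
def linear_score(features, weights, bias):
    """Decision value of a linear classifier: w . x + b."""
    return sum(f * w for f, w in zip(features, weights)) + bias

def sliding_window_detect(image, win_h, win_w, weights, bias, stride=1):
    """Return (score, top, left) for every window with a positive score."""
    hits = []
    for top in range(0, len(image) - win_h + 1, stride):
        for left in range(0, len(image[0]) - win_w + 1, stride):
            # Stand-in descriptor: the flattened window (an HCS histogram
            # would be computed here in a system like the one described).
            feats = [image[top + r][left + c]
                     for r in range(win_h) for c in range(win_w)]
            score = linear_score(feats, weights, bias)
            if score > 0:
                hits.append((score, top, left))
    return sorted(hits, reverse=True)

# Toy image (hypothetical): a 2x2 "ear" blob at row 1, column 2.
image = [[0, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 0]]
weights = [1.0, 1.0, 1.0, 1.0]  # hand-picked, not trained
bias = -3.5
hits = sliding_window_detect(image, 2, 2, weights, bias)
```

Only the window that fully covers the blob receives a positive score; partially overlapping windows fall below the decision boundary.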

Review of Ear Algorithms Using CNN
This section presents different CNN-based algorithms used for ear recognition. It describes these algorithms and suggests the most effective approach. A brief description of the CNN-based ear algorithms is given in Table 5.
Emeršič et al. [60] organized the Unconstrained Ear Recognition Challenge (UERC), whose dataset was introduced and used for the benchmark, training, and testing sets. In this study, handcrafted feature extraction methods such as local binary patterns (LBP) [61] and patterns of oriented edge magnitudes (POEM) [62], as well as CNN-based feature extraction methods, were used for ear identification. In this challenge, one method had to find a way to remove occlusions such as earrings, hair, other obstacles, and background from the ear image. The occlusion removal was carried out by creating a binary ear mask, and recognition was then performed using the handcrafted features. Another proposed approach was to compute score matrices from the CNN-based features and the handcrafted features and fuse them. A 30% detection rate was produced.
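As an illustration of a handcrafted descriptor of the LBP type mentioned above, the sketch below computes the basic 3 × 3 local binary pattern code of each interior pixel and accumulates the codes in a 256-bin histogram. The neighbour ordering and the greater-than-or-equal comparison are common conventions assumed here; the exact LBP variant used by the challenge participants is not stated in this section.

```python
def lbp_code(img, r, c):
    """8-bit LBP code of pixel (r, c); neighbours taken clockwise from top-left."""
    centre = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Histogram of LBP codes over all interior pixels (the texture descriptor)."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

# Toy patch (hypothetical): bright top row, darker elsewhere.
patch = [[9, 9, 9],
         [1, 5, 1],
         [1, 1, 1]]
code = lbp_code(patch, 1, 1)   # bits 0, 1, 2 set -> 7
hist = lbp_histogram(patch)
```

Histograms of this kind are usually computed per image block and concatenated, then compared with a distance measure or passed to a classifier.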
Tian et al. [21] applied a deep convolutional neural network (CNN) to ear recognition, designing a CNN made up of three convolutional layers, a fully connected layer, and a softmax classifier. The database used was the USTB ear database, which consists of 79 subjects with various pose angles. There were no occlusions such as earrings or headsets.
Chowdhury et al. [63] proposed an ear biometric recognition system that uses local features of the ear and then uses a neural network to identify the ear. The method estimates where the ear could be in the input image and then extracts edge features from the identified ear. After the ear is identified, a neural network matches the extracted features against a feature database. The databases used in this system were AMI, WPUT, IITD, and UERC, on which it achieved accuracies of 70.58%, 67.01%, 81.98%, and 57.75%, respectively.
Raveane et al. [64] noted that it is difficult to precisely detect and locate an ear within an image. This challenge increases under variable conditions, owing to the odd shape of human ears and changing lighting conditions. The changing profile shape of an ear when photographed is illustrated in [64]. Their ear detection system combined multiple convolutional neural networks with a detection grouping algorithm to identify the ear's presence and location. The proposed method matches the performance of other methods when analysed on clean, purpose-shot photographs, reaching an accuracy of upwards of 98%. It outperforms other works, with a rate of over 86%, when the system is subjected to noncooperative natural images in which the subject appears in challenging orientations and photographic conditions.
Zhang and Mu [65] proposed a multiple-scale faster region-based convolutional neural network (Faster R-CNN) to detect ears in 2D profile images. This method uses three regions of different scales to exploit information about the ear's location within the context of the ear image. The system was tested on 200 web images and achieved an accuracy of 98%. Other experiments were conducted on Collection J2 of the University of Notre Dame Biometrics Database (UND-J2) and the University of Beira Interior Ear (UBEAR) dataset; these achieved detection rates of 100% and 98.22%, respectively, even though these datasets contain large occlusion, scale, and pose variations.
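Detectors of this kind typically produce several overlapping candidate boxes per ear, which must be merged into a single detection. The sketch below shows greedy non-maximum suppression based on intersection over union (IoU), a grouping step commonly paired with region-based detectors; it is offered as an assumption-laden stand-in, since the grouping algorithm of [64] and the post-processing of [65] are not described in this section.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, iou_threshold=0.5):
    """detections: list of (score, box). Keep the best box of each overlapping group."""
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap strongly with a better one.
        if all(iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append((score, box))
    return kept

# Toy detections (hypothetical): two boxes on the same ear, one elsewhere.
detections = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 12, 52, 52)),
    (0.7, (100, 100, 140, 140)),
]
kept = non_max_suppression(detections)
```

The IoU threshold trades missed detections against duplicates; 0.5 is a conventional default and is not a value taken from the cited works.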
Kohlakala and Coetzer [66] presented semiautomated and fully automated ear-based biometric verification systems. A convolutional neural network (CNN) and morphological postprocessing were used to identify the ear region by classifying image regions as belonging either to the foreground or to the background of the image. Features were extracted from the binary contour image, and matching was carried out with a Euclidean distance measure, whose ranking was used for verification. On the Mathematical Analysis of Images (AMI) ear database and the Indian Institute of Technology Delhi (IITD) ear database, the system achieved 99.20% and 96.06%, respectively.
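Verification by Euclidean distance can be sketched as ranking enrolled templates by their distance to a probe and accepting a claimed identity when the distance falls below a threshold. The feature vectors, the identities, and the threshold below are hypothetical and serve only to illustrate the matching step, not the features used in [66].

```python
import math

def rank_gallery(probe, gallery):
    """Return (distance, identity) pairs sorted from best to worst match."""
    return sorted((math.dist(probe, feat), ident)
                  for ident, feat in gallery.items())

def verify(probe, claimed_id, gallery, threshold):
    """Accept the claim if the probe is close enough to the claimed template."""
    return math.dist(probe, gallery[claimed_id]) <= threshold

# Hypothetical enrolled templates and probe.
gallery = {
    "alice": [0.10, 0.90, 0.30],
    "bob": [0.80, 0.20, 0.50],
}
probe = [0.12, 0.88, 0.31]

ranking = rank_gallery(probe, gallery)
accepted = verify(probe, "alice", gallery, threshold=0.1)
rejected = verify(probe, "bob", gallery, threshold=0.1)
```

The threshold controls the balance between false acceptances and false rejections and would normally be tuned on a validation set.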
Tomczyk and Szczepaniak [67] presented geometric deep learning (GDL), which generalises convolutional neural networks (CNNs) to non-Euclidean domains. It used convolutional filters based on a mixture of Gaussian models. These filters were chosen because they can be easily rotated without interpolation. Their paper reported experimental results on using the rotation equivariance property to detect rotated structures. The results showed that the approach did not require labour-intensive training on all rotated and nonrotated images.
Alshazly et al. [68] presented and compared ear recognition models built with handcrafted and convolutional neural network (CNN) features. The paper used seven handcrafted descriptors to extract discriminating features from the ear image. The extracted features were used to train support vector machines (SVMs) to learn a suitable model, whereas the CNN-based model used the AlexNet architecture. The results obtained on three ear datasets show that the CNN-based models improved performance by 22%. The paper also investigated whether the left and right ears are symmetric. The results obtained on two datasets indicate a high degree of symmetry between the ears.
Alkababji and Mohammed [69] presented the use of a deep learning object detector, the faster region-based convolutional neural network (Faster R-CNN), for ear detection. A convolutional neural network (CNN) is used for feature extraction. Principal component analysis (PCA) and a genetic algorithm are used for feature reduction and selection, and a fully connected artificial neural network is used as the matcher. The system achieved an accuracy of 97.8%.
Jamil et al. [70] built and trained a CNN model for ear biometrics under various uniform illumination levels, measured using lumens. They considered their work to be the first to test the performance of a CNN on very underexposed or overexposed images. The results showed that, for images with uniform illumination and a luminance above 25 lux, the accuracy achieved was 100%. The CNN model had problems recognising images when the illumination was below 10 lux, but it still obtained an accuracy of 97%. This result shows that the CNN architecture performs just as well as the other systems. It was also found that the dataset contained rotations that affected the results.
Hansley et al. [71] presented an unconstrained ear recognition framework that outperformed the then state-of-the-art systems on publicly available databases. They developed CNN-based solutions for ear normalisation and description. They also used handcrafted descriptors, which were fused with the learned ones to improve recognition, and the work was carried out in two stages. The first stage was to develop the landmark detectors, which handle untrained scenarios. The next stage was to generate a geometric image normalisation to boost performance. The CNN descriptor was found to be better than other CNN-based works in the literature. The results obtained were higher than other reported results for the UERC challenge.
Tables 6 and 7 compare this review paper with recent and existing review papers to establish their differences. A critical analysis of Tables 6 and 7 reveals that the most recent and closest review paper to this article is the excellent review work. Tables 8 and 9 show the differences between this review article and the existing review papers.

Conclusion
This paper presented a comparative survey of various convolutional neural network architectures, with their strengths and weaknesses. A thorough analysis of the existing deep convolutional neural network methods used for ear identification was presented. Furthermore, the paper discussed and investigated the success of using the ear as a primary biometric for identification and verification. It was found that other works struggled to identify the ear when the pose and angle of the image changed. How this limitation can be eliminated will be investigated in future work. It was also found that clothes, hair, ear ornaments, and jewellery, if not removed, interfered with the identification of an ear. In addition, a study was performed on ear identification benchmarks and the performance of other CNN models on them, measured by standard evaluation metrics.
Future work will investigate and implement EfficientNet models to automatically identify ears on the most prominent publicly available datasets. EfficientNets, which have achieved state-of-the-art performance over other architectures in maximizing accuracy and efficiency, will be explored and fine-tuned on profile images. Fine-tuning is valuable because it utilizes rich generic features learned from large datasets such as ImageNet to compensate for the lack of annotated datasets in the ear domain.

Abbreviations
NN: Neural network
CNN: Convolutional neural network.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.