An Overview of Multimodal Biometrics Using the Face and Ear

In recent years, we have witnessed the rapid development of face recognition, though it is still plagued by variations such as facial expression, pose, and occlusion. In contrast to the face, the ear has a stable 3D structure and is nearly unaffected by aging and expression changes. Both the face and the ear can be captured from a distance and in a nonintrusive manner, which makes them applicable to a wide range of application domains. Together with their physiological structure and location, this makes the ear a natural supplement to the face for biometric recognition. It has become a trend to combine the face and ear into nonintrusive multimodal recognition systems for improved accuracy, robustness, and security. However, when either the face or the ear suffers from data degeneration, a multimodal system with a fixed or insufficiently flexible fusion rule may perform worse than a unimodal system using only the modality with the better-quality sample. Biometric quality-based adaptive fusion is one avenue to address this issue. In this paper, we present an overview of the literature on multimodal biometrics using the face and ear. All the approaches are classified into categories according to their fusion levels. Finally, we pay particular attention to an adaptive multimodal identification system, which adopts a general biometric quality assessment (BQA) method and dynamically integrates the face and ear via sparse representation. Apart from a refinement of the BQA and fusion weight selection, we extend the experiments for a more thorough evaluation by using more datasets and more types of image degeneration.


Introduction
Face recognition (FR) has been intensively studied and has made significant progress in the recent decade. Besides its better acceptability and the widely available low-cost camera sensors, compared with other popular biometrics such as fingerprint, iris, and retina recognition, face recognition has the potential to recognize uncooperative subjects from a distance and in a nonintrusive manner [1]. Therefore, face recognition can be applied to a wide range of applications, including biometric authentication, surveillance security, border control, forensics, and digital entertainment. However, owing to variations such as facial expression, aging, pose, illumination, and occlusion (e.g., sunglasses, scarf, and mask), and the increasing risk of spoof attacks, FR is not yet as accurate, flexible, and secure as desired [2,3].
Recent studies have also validated that the ear can be used for biometric recognition. Compared with FR, ear recognition (ER) has several appealing advantages: the ear has a stable 3D structure with rich information and is nearly unaffected by aging and facial expressions; ER is also contactless and nonintrusive; the ear is located near the face and can be captured along with the face using the same type of sensor, or by a single sensor at two instants; and many popular face feature extraction and classification techniques are directly applicable to the ear. Therefore, it is highly attractive to combine the face and the ear and, thus, develop nonintrusive multimodal biometric systems with better recognition performance [2][3][4][5].
However, it must be noted that data degeneration of either one of the combined modalities can degrade the multimodal recognition performance [2,3]. In particular, when one or a subset of the modalities used is corrupted severely enough, a multimodal system may perform worse than the unimodal system using only the other modality with a good-quality sample.
This consequence mainly results from the fact that most existing multimodal systems either use fixed fusion rules or employ fusion rules that cannot effectively adapt to variations of the biometric traits and changes of the environment. Given the independence among the modalities used in multimodal systems, biometric quality-based adaptive fusion, which favors the modalities presenting high-quality samples, is an effective way to handle such situations. The face can be occluded by sunglasses, a scarf, or a mask, and its appearance is prone to change with variations such as expression, aging, pose, and illumination. The ear, on the other hand, is more likely to be covered by hair and earrings, and uneven illumination can change its appearance. Hence, biometric quality assessment is critical in developing adaptive face- and ear-based multimodal recognition systems.
In this paper, we present an overview of multimodal biometrics using the face and ear. All the approaches are classified according to their fusion methodologies, for which we also give an illustration of biometric fusion methodologies. Finally, we pay particular attention to an adaptive face- and ear-based multimodal identification system presented in [2], which uses a general BQA method and dynamically consolidates the traits via sparse representation (SR, or sparse coding). We refine its BQA and fusion weight selection partly according to literature [3]. Then, we extend the experiments for a more thorough evaluation by using more datasets and more types of image degeneration. The rest of the paper is structured as follows. Section 2 briefly introduces multibiometrics and biometric fusion methodologies. Section 3 presents an overview of multimodal biometrics using the face and ear. In Section 4, we illustrate an adaptive multimodal identification system and provide extended experimental results and discussions. Finally, we conclude the paper in Section 5.

Multibiometrics
Unimodal biometric systems that rely on a single biometric trait have to contend with a variety of application problems such as noise, unsatisfactory accuracy, nonuniversality, spoof attacks, and restricted degrees of freedom. In order to address or alleviate some limitations of unimodal biometric systems, multibiometric systems that consolidate multiple sources of biometric information to establish an identity have been investigated for two decades [6][7][8]. A variety of sources of biometric information can be utilized to establish a multibiometric system. According to the nature of the biometric sources selected, multibiometric systems can be broadly classified into four categories: multimodal (e.g., the face and ear), multi-instance (or multiunit, e.g., the left and right irises), multisensor (e.g., 2D and 3D face sensors), and multialgorithm (e.g., minutia-based and ridge-based fingerprint matchers) [7,9]. In a multibiometric system, biometric information fusion can be accomplished at several different levels: the sensor level, feature level, score level, rank level, or decision level [7], as shown in Figure 1. Sanderson and Paliwal [10] categorize the fusion schemes at these levels into two broad categories: preclassification (fusion before matching) and postclassification (fusion after matching).
Preclassification schemes include fusion at the sensor (or raw data) and feature levels, while postclassification schemes include fusion at the match score, rank, and decision levels. In the literature, postclassification fusion schemes are fairly popular due to the ease of accessing and processing the match scores, ranks, and decisions. However, preclassification fusion has also attracted much attention recently because of its capability of utilizing more biometric information for classification and the emergence of advanced computational techniques.

Sensor-Level Fusion.
Fusion at this level combines the raw data acquired from multiple sensors or samples; preprocessing approaches, such as denoising and normalization, are generally employed before fusion. Chang et al. concatenate the normalized, masked ear and face images of a subject to form a combined face-plus-ear image [11]. Jain and Ross [12] introduce a fusion scheme of mosaicking multiple impressions of the same finger to form an enhanced fingerprint. Sensor-level fusion could possibly make full use of all the evidence presented, but it is rarely used in the literature because the raw biometric data may contain noisy or redundant data and because of its high computational complexity.
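As a concrete sketch, the combined face-plus-ear image of Chang et al. [11] can be imitated in a few lines. The min-max normalization and the side-by-side layout here are illustrative assumptions; the original work uses masked, landmark-aligned images.

```python
import numpy as np

def min_max_normalize(img):
    """Min-max normalize a grayscale image to [0, 1] (one common choice)."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)

def sensor_level_fuse(face, ear):
    """Concatenate normalized face and ear images side-by-side into one
    combined face-plus-ear image; both inputs must share a height here
    (in practice they would be cropped and aligned beforehand)."""
    assert face.shape[0] == ear.shape[0], "images must share a height"
    return np.hstack([min_max_normalize(face), min_max_normalize(ear)])

face = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
ear = np.random.randint(0, 256, (64, 32)).astype(np.uint8)
combined = sensor_level_fuse(face, ear)   # a single 64 x 96 "image"
```

The fused array can then be fed to any holistic feature extractor (e.g., PCA) exactly as if it were a single image.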

Feature-Level Fusion.
Fusion at this level is performed after feature extraction but before matching. It consolidates the feature sets extracted from multiple biometric samples, or by multiple algorithms, into a single feature set. Common feature fusion techniques include serial concatenation [13], parallel fusion using a complex vector [14], and methods that extract correlated features of multiple modalities [15]. Compared with the latter two, serial concatenation is simple but effective and is easy to extend to combining more than two modalities. Feature fusion schemes are expected to retain most of the discriminative information from multiple biometric sources while containing less redundant data; thereby, they are expected to be the best way to improve multibiometric performance. However, the feature sets of multiple modalities may be incompatible; for example, the minutiae set of a fingerprint and the eigen-coefficients of a face are irreconcilable [8]. Besides, concatenating several feature vectors of fixed dimensionality might lead to the curse-of-dimensionality problem.
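Serial concatenation can be sketched as follows; the per-modality L2 normalization and optional weights are common conventions, assumed here for illustration rather than taken from any one surveyed method.

```python
import numpy as np

def serial_concat(features, weights=None):
    """Serial (optionally weighted) concatenation of per-modality
    feature vectors. Each vector is L2-normalized first so that no
    modality dominates purely by scale -- a common convention, though
    not the only one used in the literature."""
    weights = weights or [1.0] * len(features)
    parts = []
    for f, w in zip(features, weights):
        f = np.asarray(f, dtype=np.float64)
        n = np.linalg.norm(f)
        parts.append(w * (f / n if n > 0 else f))
    return np.concatenate(parts)

face_feat = np.array([3.0, 4.0])      # toy PCA-style face feature
ear_feat = np.array([1.0, 0.0, 0.0])  # ear feature; dimensions may differ
fused = serial_concat([face_feat, ear_feat])
```

Note that the fused dimensionality is the sum of the per-modality dimensionalities, which is what exposes the curse-of-dimensionality risk mentioned above.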

Score-Level Fusion.
A match score represents the result of comparing two feature sets extracted using the same feature extractor, and each biometric matcher provides a match score. After being transformed into a common domain, the match scores can be easily combined by using simple rules such as the Sum, Product, and Max or Min rules; alternatively, all scores can be concatenated to form a score vector, which is then classified using Fisher's discriminant analysis, a support vector machine (SVM), a Bayesian classifier, a neural network, or a decision tree. Fusion approaches at this level are the most commonly used in the biometric literature, primarily due to the ease of accessing and processing match scores. In contrast to feature-level fusion, fusion at the score level is applicable to all kinds of multibiometric systems, while the information contained in match scores is much richer than that in ranks and decisions. It is widely recognized that score-level fusion strikes the best balance between the effectiveness and the ease of fusion.
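The normalize-then-combine pattern can be sketched as below; min-max normalization and the weighted Sum rule are one standard pairing, assumed here for illustration.

```python
import numpy as np

def min_max_norm(scores):
    """Map raw match scores to [0, 1]; assumes higher means a better match."""
    s = np.asarray(scores, dtype=np.float64)
    lo, hi = s.min(), s.max()
    return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)

def sum_rule(score_sets, weights=None):
    """Weighted Sum rule over per-modality score vectors (one score per
    enrolled identity), after min-max normalization to a common domain."""
    weights = weights or [1.0] * len(score_sets)
    return np.sum([w * min_max_norm(s) for s, w in zip(score_sets, weights)],
                  axis=0)

face_scores = [0.9, 0.2, 0.4]    # cosine-like similarities to 3 identities
ear_scores = [10.0, 30.0, 5.0]   # a matcher with a different score range
fused = sum_rule([face_scores, ear_scores])
best = int(np.argmax(fused))     # identity with the highest fused score
```

The normalization step is what makes the two matchers' otherwise incomparable score ranges combinable.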

Rank-Level Fusion.
Rank-level fusion is typically used in, but not limited to, multibiometric identification systems where each classifier associates a rank with every enrolled identity; typically, a higher rank indicates a better match. The goal of rank-level fusion schemes is to consolidate the rankings output by the individual biometric subsystems in order to derive a consensus rank for each identity. Ranks provide more insight into the decision-making process of the matcher than just the identity of the best match, but they reveal less information than match scores [8]. Since the ranks generated by different biometric matchers are directly comparable, rank-level fusion schemes, such as the highest rank method, the Borda count method, and the logistic regression method, are simpler to implement than score-level fusion.
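Of the three schemes named, the Borda count is the easiest to sketch; the point scheme below (rank r earns num_identities − r points) is the classic unweighted variant.

```python
def borda_count(rankings, num_identities):
    """Borda count fusion: the identity at rank r (0 = best) in a
    matcher's list earns (num_identities - r) points; identities are
    then re-ranked by total points to give the consensus ranking."""
    points = {i: 0 for i in range(num_identities)}
    for ranked in rankings:
        for r, identity in enumerate(ranked):
            points[identity] += num_identities - r
    return sorted(points, key=lambda i: -points[i])

face_rank = [2, 0, 1]   # face matcher ranks identity 2 first
ear_rank = [2, 1, 0]    # ear matcher agrees on identity 2
consensus = borda_count([face_rank, ear_rank], 3)
```

Because only ordinal information is used, no score normalization is needed, which is exactly the implementation simplicity noted above.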

Decision-Level Fusion.
Each biometric system makes its own recognition decision based on its own feature vectors or match scores. While a verification system returns either a "match/accept" or a "nonmatch/reject" decision, an identification system makes such a binary decision for every enrolled identity. In multibiometric systems, the binary decisions output by each matcher can be fused with AND or OR rules, majority voting, weighted majority voting, Bayesian decision fusion, or the Dempster-Shafer theory of evidence. Since a binary decision contains the least information compared with the inputs of the other fusion approaches, decision-level fusion is unlikely to achieve the performance and popularity of score-level fusion.
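The simplest of these rules can be sketched directly; the tie-breaking choice in the majority vote is an assumption for illustration.

```python
def majority_vote(decisions):
    """Fuse binary accept/reject decisions by simple majority; a tie
    rejects (a conservative choice, not the only possible one)."""
    return sum(decisions) > len(decisions) / 2

def and_rule(decisions):
    """Accept only if every matcher accepts (lowers false accepts)."""
    return all(decisions)

def or_rule(decisions):
    """Accept if any matcher accepts (lowers false rejects)."""
    return any(decisions)

votes = [True, True, False]   # e.g. face and ear accept, a third matcher rejects
decision = majority_vote(votes)
```

The AND and OR rules trade false accepts against false rejects in opposite directions, which is why the choice between them depends on the application's security requirements.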
Postclassification fusion is fairly popular due to the ease of accessing and processing the match scores, ranks, and decisions; among these, score-level fusion is the most often seen in the literature. In contrast, combination at an early stage is relatively difficult because the raw biometric data may contain noisy or redundant data, while feature sets extracted from different biometric traits may be incompatible. Nevertheless, because of the capability of utilizing more discriminative information for classification and the emergence of advanced computational techniques, preclassification fusion has drawn much attention in recent years. Feature-level fusion is generally believed to have the potential to exploit the most discriminative information contained in the raw data and thus to lift multibiometric performance to a higher level.

Face- and Ear-Based Multimodal Biometrics
According to the literature [1][2][3][4][5], we summarize the benefits of combining the face and ear as follows: (a) the ear and the face are weakly correlated, and combining their discriminative information generally improves recognition performance; (b) the ear is not affected by facial expressions or aging, and when the face is unreliable or even unavailable, the multimodal system can rely on the ear, and vice versa; (c) it is much more difficult to spoof both the face and the ear simultaneously than to spoof only one; (d) both data collections are contactless, so the multimodal recognition can still operate from a distance and in a nonintrusive manner; (e) they can share the same feature extraction and matching algorithms, whereby the derived feature sets are compatible for biometric fusion; (f) benefiting from their physiological location and structure, the multimodal recognition enjoys a wider range of recognition angles in the yaw direction; (g) the ear can be captured along with the face by using the same type of sensor, or by a single sensor at two instants; and (h) face detection helps speed up ear detection by offering an ear region of interest. Inspired by these attractive features, face- and ear-based multimodal biometrics has attracted much attention in the computer vision community. We classify the approaches in the literature according to their fusion levels and operation modes, as shown in Table 1.

Sensor-Level Fusion Methods.
Chang et al. [11] study the comparison and combination of ear and face biometrics with a principal component analysis (PCA)-based feature extraction algorithm. The probe images differ from the gallery in one of three respects: day variation (88 subjects), lighting variation (111 subjects), and pose variation (101 subjects). Every subject has two images: one is used as the gallery image, and the other as the probe image. Fusion is performed at the sensor level, where the normalized face and ear images of a subject are concatenated to form a combined face-plus-ear image. In the day variation experiment, the multimodal method achieves a 90% rank-one recognition rate, while the face and the ear achieve comparable rates of 70.5% and 71.6%, respectively. Significant progress is also achieved in the lighting variation experiment: face recognition reaches 64.9%, ear recognition 68.5%, and the multimodal recognition 87.4%. Overall, the multimodal recognition gains roughly 20% in rank-one recognition rate over FR and ER in both the day and lighting variation experiments.
Yuan et al. [16] adopt a similar fusion scheme but use full-space linear discriminant analysis (FSLDA) instead of PCA for feature extraction. In the experiments, the 4 ear images of a subject from the USTB ear database are randomly coupled with 4 face images of a subject from the ORL face database to form 4 multimodal biometric samples, of which 3 are used for the gallery and the remaining one is used as the probe.
There are 75 subjects in total. As their experimental results show, face recognition achieves an 88.0% rank-one recognition rate, lower than the 94.7% of ear recognition. The multimodal recognition reaches 98.7%, outperforming the recognition using either modality alone.
Yaman et al. [17] explore three approaches to fuse the profile face and ear at the sensor level, namely, spatial fusion, intensity fusion, and channel fusion. In spatial fusion, they concatenate the profile face and ear images side-by-side. In channel fusion, they stack the color channels of the face and ear images; for example, the combined images will have 6 color channels if the input data are RGB images. In intensity fusion, they average the pixel intensity values of the profile face and ear images. After sensor-level fusion, they employ a fine-tuned CNN model, VGG-16 or ResNet-50, to extract the multimodal feature and classify the user's age and gender simultaneously. In their experiments, compared with the state-of-the-art methods, the proposed method using channel fusion and VGG-16 achieves the best results on both the age and gender classification tasks, which are, respectively, 67.59% and 99.11% on two multimodal datasets selected from publicly available face datasets.

Feature-Level Fusion Methods.
Xu and Mu [18] report a multimodal biometric system integrating the ear and profile face at the feature level. After PCA-based feature extraction, they obtain the fused feature by means of canonical correlation analysis (CCA). A best accuracy of 97.37% is achieved on a subset of the USTB ear database with 38 subjects, using 3 images per subject for the gallery and 2 images per subject for the probe. Furthermore, they develop a kernel CCA-based feature fusion algorithm, which achieves a further improved accuracy of 98.68% on the same database [19].
Abate et al. [5] employ an iterated function systems (IFS) transformation to characterize the self-similarity of a face or ear image. Each face component, as well as the ear, is described by a list of centroids, which are then concatenated to form an overall feature vector. The system is tested on a subset of 100 subjects (2 images/subject) from the FERET database. When both the face and the ear probe images are occluded by blocks of 50 × 50 pixels, the multimodal system achieves a 100% correct recognition rate; in contrast, the recognition rates of FR and ER based on IFS are only 90% and 81%, respectively. The respective rates are 93%, 80%, and 70% when the occluded blocks grow to 100 × 100 pixels.
Theoharis et al. [20] propose a unified approach that fuses 3D face and ear data. They construct, in advance, an annotated face model (AFM) and an annotated ear model (AEM) based on statistical data. As a preprocessing step, the AFM or AEM is fitted to the 3D face or ear data by using ICP and simulated annealing (SA). Wavelet coefficients, used as features, are calculated from the resulting geometry images. For the multimodal fusion, the feature vector of each modality is multiplied by a global normalization weight, and then all of them are concatenated into a single vector. The weights are selected empirically, slightly in favor of the face feature, as the authors consider the face modality more reliable. On a virtual multimodal database with 324 subjects, based on the FRGC v2 face database (4007 range images, 466 subjects) and the UND ear database (830 range images, 415 subjects), this multimodal method achieves a 99.7% rank-one recognition rate, clearly better than the 97.5% for FR and 95% for ER.
Huang et al. [2] introduce a multimodal method called MSRC, which combines the PCA feature sets of the face and the ear by serial concatenation and employs SR-based classification for the multimodal decision. MSRC is reported to improve significantly on Xu's CCA-based methods. However, the authors argue that although all existing multimodal systems report evidently better performance than unimodal recognition using either the face or ear alone, they may perform much worse than unimodal systems when the face or ear encounters image degeneration, owing to their fixed fusion rules. Their experiments show that, even using advanced classification techniques, MSRC cannot avert a rapid decline of accuracy in the cases of face or ear data degeneration. To handle this limitation, they propose a quality-based adaptive feature fusion scheme, thereby developing a multimodal approach called MRSCW [2]. Experiments demonstrate that MRSCW achieves quite encouraging robustness against image degeneration and outperforms many up-to-date methods. Impressively, even when the query sample of one modality is extremely degenerated, MRSCW can still achieve performance comparable to unimodal recognition using the other modality.
In [17], Yaman et al. also explore the feature fusion strategy to combine the profile face and ear for age and gender classification. They first train two separate CNN models using the profile face and ear images. The CNN-based feature vectors of the two traits are concatenated to form a multimodal feature vector, which is then fed to fully connected layers for simultaneous age and gender classification. Although the network models are more sophisticated, the feature fusion methods based on both VGG-16 and ResNet-50 are inferior to the methods using channel fusion at the sensor level. Insufficient training samples may be responsible for this result. Meanwhile, it is noted that the channel fusion methods preserve all the face and ear information for training the multimodal CNN models, while the feature-level fusion methods may have lost discriminative information before the multimodal combination.

Score-Level Fusion Methods.
Xu and Mu [22] combine the ear and face profile and use the FSLDA feature extraction algorithm for both. They test three score fusion schemes, i.e., the Sum, Product, and Median rules, together with a decision-level fusion scheme. A subset of the USTB II ear database is used, which consists of 294 images of 42 subjects. Each subject has seven profile-view head images with variations of head position and slight facial expressions. In the experiments, for each subject, 5 images are used for the gallery and the other 2 are used as probes. As a result, 94.05% of the subjects can be recognized correctly by ER and 88.10% by FR. Among the fusion schemes, the best performance is achieved by the Sum rule, while the Median and Product rules achieve accuracies of 97.62% and 96.43%, respectively.
Yan [23] explores multimodal biometrics using 3D face and ear data. Experiments are performed on a database of 174 subjects, each with two ear shapes and two face shapes. The proposed system uses an improved Iterative Closest Point (ICP)-based approach and is fully automatic. The recognition rates of FR and ER are 93.1% and 97.7%, respectively, while the multimodal recognition with the Sum rule yields a 100% recognition rate.
Mahoor et al. [24] combine 2.5D ear data and 2D face images at the score level. For 2.5D ear recognition, a series of frames is extracted from a video clip, the ear segment in each frame is independently reconstructed using the shape-from-shading method, and then various ear contours are extracted and registered using the ICP algorithm. For 2D face recognition, a set of facial landmarks is extracted from frontal facial images using an active shape model, and the responses of the facial images to a series of Gabor filters at the landmark locations are calculated and used for recognition. They report accuracies of 81.67%, 95%, and 100% for the face, the ear, and the fusion, respectively, on the WVU database.
Islam et al. [26] fuse 3D local features of the ear and face at the score level using weighted sum rules. They use the FRGC v2 3D face database and the UND ear database. The proposed multimodal method achieves an identification rate of 98.71% and a verification rate of 99.68% (at 0.001 FAR) for the neutral facial expression. For other types of facial expressions, they achieve identification and verification rates of 98.1% and 96.83%, respectively.

Table 1: Face- and ear-based multimodal methods classified by fusion level.
Sensor-level fusion: Chang et al. [11], Yuan et al. [16], Yaman et al. [17].
Feature-level fusion: Xu and Mu [18,19], Theoharis et al. [20], Huang et al. [2][3][4], Yaman et al. [17], Badrinath and Gupta [21].
Score-level fusion: Huang et al. [4], Xu and Mu [22], Yan [23], Mahoor et al. [24], Cadavid et al. [25], Yaman et al. [17], Islam et al. [26], Darwish et al. [27], Kisku et al. [28], Raghavendra et al. [29], Yazdanpanah et al. [30], Huang et al. [31], Huang et al. [32].
Rank-level fusion: Monwar and Gavrilova [33][34][35], Kumar et al. [36].
Decision-level fusion: Rahman and Ishikawa [37], Xu and Mu [22], Kisku et al. [38], Boodoo and Subramanian [39].

Huang et al. [31] propose a face- and ear-based multimodal verification system using a sparsity-based matching metric. They construct a dynamic dictionary from the training samples of the claimed client and some nontarget subjects. The face and ear query samples are first encoded separately, and then the resulting sparsity-based matching scores are combined with Sum-rule fusion for multimodal verification. They view sparse coding as a competitive one-to-many matching process in which a verification request is accepted only when the genuine class defeats almost all the nontarget classes and achieves an eligible sparsity-based matching score in encoding the query data. Hence, the system not only examines the matching score but also implicitly compares the correlations of the query data to the client and many nontarget subjects, thereby offering double insurance for identity security. Their experiments demonstrate that the proposed multimodal method is not only better than its unimodal counterparts but also significantly superior to well-known multimodal methods such as likelihood ratio (LLR), support vector machine (SVM), and Sum-rule fusion using cosine similarity.
Furthermore, to enhance resistance against face or ear unimodal spoof attacks, Huang et al. [32] propose using the collaborative representation fidelity to measure the anomaly degree of a query sample with respect to the claimed client. They combine the anomaly degrees and sparsity-based matching scores obtained from the face and ear query samples in a stacked manner. Finally, they use genuine, impostor, and partially spoofed multimodal score samples to train an SVM classifier for multimodal verification. Extensive experimental results demonstrate the superiority of the proposed method in the licit scenario and under worst-case partial spoof attacks. More importantly, the proposed method achieves a good balance between accuracy and security when the system uses a fixed operating threshold for both the licit and spoofing scenarios.
In [17], Yaman et al. also evaluate score-level fusion for multimodal age and gender classification using the profile face and ear. They use two individual CNN models to generate the probabilities associated with all age or gender classes, compute confidence scores from these probabilities, and finally select the prediction of the CNN model with the maximum confidence score. In their experiments, they evaluate the proposed score fusion approach with a series of confidence calculation methods, but both the age and gender classification results are inferior to those of their sensor- and feature-level fusion methods.

Rank-Level Fusion Methods.
Monwar and Gavrilova [33] combine the face, ear, and signature for identity verification by utilizing rank-level fusion approaches, i.e., the highest rank, Borda count, and logistic regression. For feature extraction, the same PCA or Fisher's linear discriminant (FLD) approach is employed for all modalities. They build a chimeric database consisting of faces, ears, and signatures for testing. For the face, they use the ORL database, containing 400 images, 10 each of 40 different subjects. For the ear, they use the Carreira-Perpinan database. For the signature, they use 160 signatures from the Rajshahi database.
The experimental results indicate that fusion with the weighted Borda count can improve the overall equal error rate (EER) to 9.76%, compared with an average of 16.78% for verification with the individual modalities. Later, Monwar and Gavrilova [34] extend their experiments by including more data from the USTB database. For the signatures, they use 500 signatures in total, with 10 signatures per subject, from the Rajshahi database. They achieve an EER of 1.12% when using the logistic regression rank-level fusion scheme.

Decision-Level Fusion Methods.
Rahman and Ishikawa [37] put forward a multimodal biometric verification system using the ear and face profile. For each modality, they utilize PCA to extract the features and perform classification individually. If the multimodal system successfully recognizes either the ear or the face of a particular person, the subject is considered correctly identified; this is a typical decision-level fusion scheme called the OR rule. In their experiments on a database with 18 subjects and 5 images per subject, they achieve a best recognition rate of 94.44% (by adjusting the threshold value), while the best face and ear unimodal verification methods achieve 88.88% and 77.77%, respectively.
Xu and Mu [22] propose a decision-level fusion scheme called the Modified-Vote rule to combine the decision results of the ear and face profile subsystems. The Modified-Vote rule is reported to be slightly inferior to score-level fusion schemes such as the Sum and Product rules, while it is comparable to the Median rule. Kisku et al. [38] apply two trained Gaussian mixture models (GMM) to estimate the match score distributions of Gabor features of the face and ear, respectively, and then verify an identity based on the Dempster-Shafer theory of evidence. They achieve an EER of 4.47% on the IIT Kanpur multimodal database, which contains 400 subjects in total.

Adaptive Face- and Ear-Based Multimodal Fusion
Quality-based adaptive multimodal fusion is an intuitive but very effective way to improve multimodal recognition performance in cases where one or a subset of the modalities suffers from data degeneration. In [2], Huang et al. propose a general biometric quality assessment method for the face and the ear by means of sparse representation. By taking advantage of BQA and sparse representation-based classification, they develop an adaptive multimodal identification system that is able to dynamically integrate the face and ear features. When the face or ear image degeneration becomes severe, the multimodal system can effectively reduce its negative effect. Their experiments demonstrate that when the face (ear) query sample suffers from 100% random pixel corruption, the multimodal system can still achieve performance close to the ear (face) unimodal recognition. Moreover, the employment of BQA and the related dynamic feature fusion does not reduce efficiency, because the biometric quality is derived from the coding results of the adopted sparse representation-based classification. In this section, we introduce the adaptive multimodal system in detail and refine its BQA and fusion weight selection partly according to literature [3]. We also evaluate the BQA and the overall performance against more types of image degeneration, including random pixel corruption, random block occlusion, real face disguise (i.e., sunglasses and scarf), and illumination variation.
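The core idea of quality-weighted feature fusion can be sketched as follows. This is a minimal illustration only: the relative-quality weights q / (q_face + q_ear) and the L2 normalization are assumptions for the sketch, not the exact weight selection of the MRSCW system in [2,3].

```python
import numpy as np

def adaptive_fuse(face_feat, ear_feat, q_face, q_ear):
    """Quality-weighted serial feature concatenation (illustrative only).

    Each feature vector is L2-normalized and scaled by its relative
    quality score, so a badly degenerated modality contributes little
    to the fused vector. The simple weight rule q / (q_face + q_ear)
    is an assumption, not the published MRSCW weight selection.
    """
    w_face = q_face / (q_face + q_ear)
    w_ear = q_ear / (q_face + q_ear)
    f = face_feat / (np.linalg.norm(face_feat) + 1e-12)
    e = ear_feat / (np.linalg.norm(ear_feat) + 1e-12)
    return np.concatenate([w_face * f, w_ear * e])

# a high-quality face sample dominates a corrupted ear sample
fused = adaptive_fuse(np.ones(4), np.ones(4), q_face=0.9, q_ear=0.1)
```

In the limit q_ear → 0, the fused vector reduces to the face feature alone, which mirrors the behavior described above for a fully corrupted modality.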
Given a dictionary A whose columns are the training samples and a query sample y, the SRC algorithm proceeds as follows. (a) We solve the L1-norm minimization problem

α̂ = argmin_α ‖α‖₁ subject to ‖y − Aα‖₂ ≤ ε, (1)

where ε > 0 is a constant. (b) We compute the class-specific sparse representation error for each class i:

η_i = ‖y − Aα_i‖₂, (2)

where α_i retains only the coefficients of α̂ associated with class i. (c) We perform classification via O(y) = argmin_i η_i.
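These classification steps can be sketched in a few lines. As an assumption for the sketch, an ISTA (iterative soft-thresholding) Lasso solver with a small penalty stands in for the constrained L1 solver of the original SRC formulation; the class-wise residual comparison is the same.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def src_classify(A, labels, y, lam=0.01, iters=200):
    """Minimal SRC sketch: an ISTA Lasso solver plus class-wise residuals.

    A: dictionary whose columns are L2-normalized training samples;
    labels: class label of each column; y: query sample.
    """
    L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the gradient
    alpha = np.zeros(A.shape[1])
    for _ in range(iters):            # ISTA iterations for the Lasso
        alpha = soft(alpha + A.T @ (y - A @ alpha) / L, lam / L)
    # class-specific residuals: eta_i = ||y - A (alpha restricted to class i)||
    classes = sorted(set(labels))
    etas = [np.linalg.norm(y - A @ (alpha * (np.array(labels) == c)))
            for c in classes]
    return classes[int(np.argmin(etas))]

# toy dictionary: orthonormal atoms, three per class
A = np.eye(6)
labels = [0, 0, 0, 1, 1, 1]
y = A[:, 1]                           # query equals a class-0 atom
pred = src_classify(A, labels, y)
```

With the orthonormal toy dictionary, nearly all coefficient mass lands on the matching atom, so the class-0 residual is far smaller than the class-1 residual.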
Yang et al. [42] argue that the original SRC model is based on the assumption that the coding residual follows a Gaussian or Laplacian distribution, which may not hold well in practice, especially when occlusion and corruption occur in the query image. Hence, they develop a new model, namely, the robust sparse coding (RSC) model, to seek the maximum likelihood estimation (MLE) solution of the sparse coding problem. Compared with the SRC model, RSC can estimate and suppress outliers (e.g., corrupted or occluded image pixels or regions) by dynamically assigning them lower weights. The RSC model can be defined by

α̂ = argmin_α ‖W_pw^{1/2}(y − Aα)‖₂² subject to ‖α‖₁ ≤ σ, (3)

where σ > 0 is a constant and W_pw^{1/2} = diag(ω₁, ω₂, …, ω_N) (ω_k ∈ [0, 1]) is a diagonal matrix with nonnegative scalars; the subscript "pw" stands for "pixel weight". The reconstruction residual is r = y − Aα = [r₁, r₂, …, r_N]. Let r′ = [r₁², r₂², …, r_N²], and reorder its elements in ascending order to obtain a new vector r″ = [r″₁, r″₂, …, r″_N] (r″_k ≤ r″_{k+1}). Then, ω_k is selected with a logistic function as

ω_k = exp(μ(δ − r_k²)) / (1 + exp(μ(δ − r_k²))), (4)

where μ is a positive constant that controls the decreasing rate and the parameter δ controls the location of the demarcation point, which is defined by δ = r″(⌊τN⌋), where τ ∈ (0, 1].
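The logistic weighting step can be sketched directly from its definition; the values of μ and τ below are illustrative, not the tuned parameters of [42].

```python
import numpy as np

def rsc_weights(residual, mu=8.0, tau=0.8):
    """Logistic pixel weights in the spirit of the RSC model.

    residual: r = y - A @ alpha. Pixels with large squared residuals
    (likely outliers, e.g. occluded or corrupted) receive weights
    near 0, so they barely influence the weighted coding problem.
    """
    r2 = residual ** 2
    r2_sorted = np.sort(r2)                 # the ascending reordering r''
    k = int(np.floor(tau * len(r2))) - 1    # demarcation index
    delta = r2_sorted[k]
    return np.exp(mu * (delta - r2)) / (1.0 + np.exp(mu * (delta - r2)))

r = np.array([0.1, 0.05, 0.02, 5.0])   # the last pixel is a gross outlier
w = rsc_weights(r)                     # outlier weight driven toward 0
```

In the full RSC algorithm this weight update alternates with re-solving the weighted coding problem until the weights stabilize.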

Biometric Quality Assessment.
It is generally acknowledged that a biometric quality metric should mirror the utility of a biometric sample, which refers to the impact of the individual biometric sample on the overall performance of a biometric system [43,44]. However, assessing and measuring the utility of a biometric sample is often difficult because a variety of factors may degrade the biometric quality in practice, such as illumination, pose, expression, and the aging effect in face recognition. Assembling multiple quality measures, e.g., signal-to-noise ratio (SNR), mean squared error (MSE), image resolution, and sharpness, may yield a more comprehensive representation of biometric quality. Nevertheless, this may result in a complex quality-based fusion scheme or classifier and, hence, a higher chance of overfitting. Thus, it is desirable to seek a uniform biometric quality measure that is applicable to various degrading factors. Moreover, for an adaptive multimodal system that intends to eliminate the adverse effect of modalities with a low-quality sample, access to the biometric quality difference among the samples of the modalities is critical to its effectiveness. Therefore, the compatibility of the quality assessment results of different modalities is very important. The success of sparse representation is generally believed to benefit from a simple but important property of natural data: although the high-dimensional images (or their features) belonging to the same class exhibit degenerate structure, they lie on or near low-dimensional subspaces, submanifolds, or stratifications [45]. Generally, a typical face/ear query image of an enrolled subject can be expected to be encoded with high fidelity over the dictionary A via the L1-norm minimizations in equation (1) or equation (3).
However, in many applications, the query samples are often degenerated by a variety of factors such that they may not lie on or near the targeted low-dimensional subspace spanned by the atoms of the established dictionary. In this context, it is not reasonable to expect a high fidelity of sparse representation. Nevertheless, it is still possible to seek a linear combination of dictionary atoms that represents the query sample as closely as possible, using a relatively relaxed L1-norm sparsity constraint. As a result, the overall representation error ρ = ‖y − Aα‖_2 would be rather evident.

Mathematical Problems in Engineering
Motivated by the abovementioned observations, Huang et al. [2] propose to use the overall representation error as a biometric quality measure for the face and the ear, namely, the collaborative representation fidelity (CRF). Note that some corruptions, such as isolated/impulsive noise and small occlusion regions, often cause a large coding residual but have a relatively small impact on biometric recognition. Therefore, aiming to predict biometric performance more effectively, they revise the CRF quality measure to use only a certain part of the coding residual.
Suppose that y ∈ R^N is a query sample and A ∈ R^(N×mc) is an overcomplete dictionary consisting of all training samples. The CRF is then computed from the coding residual of the L1-norm optimization result of equation (1). The details of the experiment settings will be given in Section 4.5. Table 2 provides the rank-one recognition rates of SRC-based FR and ER against various levels of random pixel corruption. The accuracies of both the face and the ear decrease drastically as corruption increases. Specifically, the performance of the ear biometric is more sensitive to corruption, which is in accordance with the comparison between Figures 2(b) and 2(d), where the CRF of the ear increases faster than that of the face.
This detail strengthens the correlation between the biometric performance and the CRF value.
The abovementioned experimental results demonstrate that the CRF is able to mirror the utility of face and ear samples degenerated by random pixel corruption or pose variation. In Section 4.5, more evidence supporting the feasibility of CRF will be found in the experiments under conditions such as illumination changes, sunglasses and scarf occlusion, and random block occlusion.
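A CRF computation in the spirit of the description above can be sketched as follows. Since the exact rule for selecting "a certain part of the coding residual" is not spelled out here, keeping only the smallest ⌊τN⌋ squared residuals (thereby discarding the largest, likely outlier, entries) is our assumption for illustration.

```python
import numpy as np

def crf_quality(y, A, alpha, tau=0.85):
    """Collaborative representation fidelity (CRF) sketch.

    A smaller CRF indicates a better-quality sample.  Trimming the largest
    (1 - tau) fraction of squared residuals is an illustrative assumption
    about which "part of the coding residual" is retained.
    """
    r2 = np.sort((y - A @ alpha) ** 2)   # squared residuals, ascending
    k = int(np.floor(tau * len(r2)))     # keep only the smallest floor(tau*N)
    return float(np.sqrt(r2[:k].sum()))
```

A perfectly represented query yields a CRF near 0, while a degenerated query yields a larger value, which is the property the quality-based fusion relies on.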

Quality-Based Fusion Weights Selection.
The basic idea of the quality-based adaptive multimodal fusion in [2] is to assign a punitive weight smaller than 1 to the less reliable modality according to the CRF quality. The quality difference between the face and ear query samples is assessed by the ratio of their CRFs, formulated in the following equation: where b is a balance factor, which is set to 1.0 unless otherwise specified. γ_fe (γ_ef) represents the ratio of the face (ear) CRF to the ear (face) CRF. A γ_fe (γ_ef) larger than 1 indicates the lower quality of the face (ear) query sample relative to the ear's (face's). Generally, when γ_fe (γ_ef) is larger than a certain threshold θ (θ > 1) and, meanwhile, ρ_f > ρ_max (ρ_e > ρ_max, where ρ_max is a threshold), the face (ear) feature or score should be assigned a weight w_f < 1 (w_e < 1), while letting w_e = 1 (w_f = 1). Ideally, the weight w_f (w_e) should be a monotonically decreasing function of γ_fe (γ_ef). A uniform fusion weights selection function for the face and the ear, i.e., w_f(γ_fe) and w_e(γ_ef), is formulated with a logistic function as follows:
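A weight selection function with the stated properties can be sketched as follows. The exact logistic form used in [2] is not reproduced here; the expression 2/(1 + exp(φ(γ − θ))) is an assumption that satisfies the description (equal to 1 up to the threshold θ, then monotonically decreasing), with θ = 1.2 and φ = 3 taken from the parameter settings reported later.

```python
import numpy as np

def fusion_weight(gamma, theta=1.2, phi=3.0):
    """Monotonically decreasing logistic fusion weight w(gamma).

    gamma : CRF ratio of the candidate modality to the other modality
    theta : threshold above which the modality is penalized (paper's 1.2)
    phi   : steepness of the decay (paper's 3); the overall form is an
            illustrative assumption, not the paper's exact formula
    """
    if gamma <= theta:
        return 1.0          # quality difference not evident: full weight
    # continuous at gamma == theta, where the value is 2 / (1 + 1) = 1
    return 2.0 / (1.0 + np.exp(phi * (gamma - theta)))
```

As γ grows past θ, the penalized modality's weight decays smoothly toward 0, so a badly degenerated face (or ear) contributes less and less to the fused feature.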

System Architecture.
The quality-based weighting scheme is designed to alleviate the adverse effect of the lower quality modality. It may not help if both the face and ear samples are degenerated at a similar level. Thus, to enhance the robustness to simultaneous data degeneration, Huang et al. adopt the RSC model for sparse coding, which helps suppress the outlier pixels or regions in the images by using an iterative coding scheme [42]. Their strategy is that the quality-based fusion scheme is utilized to handle the unimodal data degeneration cases, while the RSC model is adopted as a supplementary measure for tackling the multimodal data degeneration cases. They finally propose a quality-based adaptive multimodal RSC identification system, AMRSC in short, whose block diagram is shown in Figure 3. AMRSC integrates the face and ear features using serial concatenation and performs sparse coding at the feature level. It actually uses a two-level weighting strategy comprising the pixel level and the feature level: it seeks to suppress outlier pixels at the former level and to reduce the adverse effect of the less reliable modality at the latter.
Suppose that there are c subjects and m samples per subject for both the face and ear.
Let z_f and z_e denote the face and ear feature vectors, respectively. For simplicity, the quality-based fusion weights for the face and ear are formulated as follows: where W^f_fw = diag(w_f, w_f, ..., w_f) and W^e_fw = diag(w_e, w_e, ..., w_e), and they have the same dimensions as z_f and z_e, respectively.
Finally, the multimodal sparse coding problem in AMRSC can be formulated as equation (9), which can be simplified to equation (10), where z = P^T W_pw^(1/2) y and D = P^T W_pw^(1/2) A. The L1-norm optimization problem in equation (10) can be solved by l1_ls [48]. AMRSC performs sparse coding in the multimodal feature space. The resulting code vector α is utilized to estimate the pixel weight matrix W_pw with equation (4), as well as the CRF quality. Similar to the algorithm for solving the RSC model, AMRSC uses an iterative sparse coding process. In each iteration, the outliers of the face and ear query images are gradually detected and suppressed, and the biometric qualities are assessed. The face and ear features are integrated dynamically from the second iteration onward.
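The iterative coding loop described above can be sketched as follows. This is a simplified, single-modality sketch: the projection P and the feature-level weight matrix W_fw are omitted, a plain ISTA routine stands in for the l1_ls solver, and all parameter defaults are illustrative.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator used by ISTA."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista_l1(D, z, lam=0.01, n_iter=300):
    """Minimal ISTA solver for min_a 0.5*||z - D a||_2^2 + lam*||a||_1
    (a stand-in for the l1_ls solver [48] used in the paper)."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft(a + D.T @ (z - D @ a) / L, lam / L)
    return a

def amrsc_coding(y, A, n_outer=2, mu=8.0, tau=0.85, lam=0.01):
    """Iterative weighted sparse coding in the spirit of AMRSC: code the
    query, re-estimate the RSC pixel weights from the residual, re-code."""
    w = np.ones_like(y)                  # W_pw starts as the identity
    for _ in range(n_outer):
        ws = np.sqrt(w)
        # solve the weighted problem min ||W^(1/2)(y - A a)||^2 + lam||a||_1
        a = ista_l1(ws[:, None] * A, ws * y, lam)
        r2 = (y - A @ a) ** 2            # squared coding residual
        delta = np.sort(r2)[int(np.floor(tau * len(r2))) - 1]
        w = 1.0 / (1.0 + np.exp(mu * (r2 - delta)))  # logistic pixel weights
    return a, w
```

In the full system, the residual of each pass would also feed the CRF quality measure, and from the second pass the face and ear features would be reweighted by W_fw before concatenation.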

Databases and Settings.
The multiview USTB III ear database [47] and face databases including the Extended Yale B [49], Georgia Tech (GT) [50], and AR (the first 79 subjects) [46] are adopted to build three multimodal databases, namely, MD I, MD II, and MD III for short. Sample face and ear images are shown in Figure 4. Table 3 provides the compositions of the multimodal databases. MD I and MD II use the first 38 and 50 subjects of USTB III, respectively. For each virtual subject, we pair 7 face images with 7 ear images to form 7 multimodal samples for training. Each multimodal training sample is a unique pair of face and ear samples from the same virtual subject. In contrast, to obtain more instances for testing, each face probe image is paired with all ear probe images of a subject. For example, in MD I each subject has 38 face images and 13 ear images for testing, so there are 38 × 13 = 494 multimodal test samples per subject. The feature dimensionalities for both modalities on MD I, II, and III are empirically selected as 100, 150, and 200, respectively. AMRSC is compared with seven multimodal methods, including MRSC, MSRCef, MSRCs, MCCA (multimodal CCA) [18], MSVM (multimodal SVM, using a polynomial kernel k(x, x′) = (0.05 × ⟨x, x′⟩ + 1)^3), MNFL (multimodal nearest feature line), and MNN (multimodal nearest neighbor). MRSC is a special case of AMRSC in which the feature weight matrix W_fw is an identity matrix, i.e., without the quality-based feature weighting scheme. MSRCef is also a special case of AMRSC in which both the weight matrices W_pw and W_fw are identity matrices. MSRCs combines the class-specific sparse representation errors of the face and ear with sum-rule fusion for classification. MSVM, MNFL, and MNN combine the face and ear features with serial concatenation before classification. All the L1-norm optimization problems in AMRSC, MSRCef, and MSRCs are solved by using l1_ls [48]. The iteration number of both AMRSC and MRSC is two, and the parameter τ is set to 0.85 for both.
Let b = 1.0, θ = 1.2, ρ_max = 0.02, and φ = 3 for the quality-based fusion weights selection of AMRSC.
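The probe-pairing protocol described above (every face probe crossed with every ear probe of the same subject) can be sketched as follows; the sample names are illustrative.

```python
from itertools import product

def build_probe_pairs(face_probes, ear_probes):
    """Cross-pair every face probe with every ear probe of one subject,
    as done when forming the multimodal test sets of MD I-III.

    face_probes, ear_probes : lists of sample identifiers for one subject
    Returns the list of all (face, ear) multimodal probe combinations.
    """
    return [(f, e) for f, e in product(face_probes, ear_probes)]
```

For a subject with 38 face probes and 13 ear probes, this yields 38 × 13 = 494 multimodal test samples, far more than the 7 uniquely paired training samples.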

Experiments.
We conduct six categories of experiments: controlled environment, illumination variation, face disguise (sunglasses and scarf), random pixel corruption, random block occlusion, and multimodal degeneration. For all methods, the experiments under random pixel corruption and random block occlusion are repeated five times, and the average accuracies are used as the final results. The recognition accuracy (or rate) hereafter refers to the rank-one recognition rate.
(i) Controlled environment: the face and ear unimodal recognition methods are used as baselines. On MD I, Probe Subset 4 is excluded because of its severe degeneration; it will be used in the illumination test instead. Correspondingly, its face part, i.e., Subset 5 of Yale B, is not used in the face recognition experiments. As shown in Table 4, the SR-based methods, RSC and SRC, significantly outperform the other methods in both FR and ER. RSC performs the best on all datasets owing to its outlier detection ability. Table 5 summarizes the multimodal recognition results on the three multimodal databases. Compared with their unimodal counterparts, all the SR-based methods significantly improve the recognition performance, and their superiority over the other multimodal competitors is significant. For instance, compared with MNN, all of them achieve an accuracy increase of over 20%. AMRSC achieves the best results on MD I and MD II. Although the advantage of AMRSC is not evident here, this is reasonable because quality-based fusion is not designed for cases in which neither the face nor the ear query image is obviously degenerated.
(ii) Illumination variation: MD I is used for testing the robustness of the multimodal methods against illumination variation in face images. As the face part of MD I, the images of Yale B become more and more challenging from Subset 1 to Subset 5, as shown in Figure 4. As a result, from Probe Subset 1 to Probe Subset 4, the multimodal data quality becomes increasingly inferior. Table 6 collects the multimodal recognition results. All the methods perform well when the face images are of good quality, such as on Probe Subsets 1 and 2. However, on Probe Subsets 3 and 4, MSRCef, MSRCs, and MRSC suffer evident performance degradation. They are all obviously inferior to their counterparts in ER; that is, their accuracies are lower than the 89.271% of SRC-based ER (refer to Table 4). Similarly, in spite of the outlier suppression scheme, MRSC is much inferior to RSC with the ear. On the contrary, with the help of quality-based feature fusion, AMRSC obtains quite encouraging results.
(iii) Face disguise: as shown in Figure 5, sunglasses cover almost 25% of the face image area, while the coverage of a scarf is about 40%. On Subset 2, RSC can effectively detect and suppress the sunglasses region because of its evident contrast with the same region in a common face image. As shown in Table 7, MRSC achieves the best accuracy of 96.414%. AMRSC also benefits from the RSC model and obtains a comparable accuracy of 96.186%. The slight disadvantage of AMRSC relative to MRSC could probably be explained by the fact that, apart from suppressing the sunglasses region, AMRSC also reduces the importance of the face modality based on its poor quality, while MRSC can make full use of the remaining information of the face image.
On Subset 3, it is more challenging for RSC to detect the scarf region on the face. MRSC only gets an accuracy of 91.951%, while ER with RSC gets 92.105%. On the other hand, with the advantage of quality-based adaptive fusion, AMRSC obtains an evident improvement of 3.76% compared with ER with RSC. That is to say, AMRSC can still take advantage of the discriminative information of the disguised face image. Although AMRSC fails to get the best results on both datasets, on the whole, it is more stable than its competitors.
(iv) Random pixel corruption: by applying random pixel corruption to the face or ear images of Probe Subset 1 of MD III and MD II, we can evaluate the multimodal methods at different corruption levels, i.e., 20%, 40%, 60%, 80%, and 100%. Figure 5 shows sample images at these corruption levels. We refer to the cases in which only the face or only the ear image suffers from corruption/occlusion as unimodal corruption/occlusion. Table 8 summarizes all the multimodal recognition results on MD III. In the experiments on both types of unimodal corruption, all multimodal methods' performance inevitably decreases more or less as corruption increases. However, in the face corruption case, AMRSC achieves the best accuracy at all corruption levels except 40% corruption. Even at 100% face corruption, compared to ER using RSC with 92.105%, AMRSC can still obtain a comparable accuracy of 92.141%. With the help of outlier detection, MRSC is much better than MSRCef and MSRCs, but it is evidently inferior to AMRSC. The advantage of AMRSC over MRSC clearly demonstrates the effectiveness of quality-based fusion. AMRSC's superiority over the other methods is overwhelming in the ear corruption case. When the ear corruption is 100%, AMRSC's accuracy is 95.549%, which is very close to the 95.841% of FR with RSC. Overall, AMRSC is validated to be capable of tolerating 100% image corruption of the face or the ear, and the effectiveness of quality-based fusion is proved again.
On MD II, we obtain results similar to those obtained on MD III. Figure 6 shows the recognition accuracy curves of all multimodal methods in the unimodal corruption experiments. A dashed line representing the RSC-based unimodal recognition with the uncorrupted modality is used as a baseline for evaluating AMRSC's effectiveness. We can see that all the curves obtained by the multimodal methods with traditional classification descend dramatically with the increase in the corruption level. MSRCef and MSRCs are able to tolerate about 50% corruption. In contrast, AMRSC's curve is very stable, descending much more gently. AMRSC's curve never falls below the dashed line in the face corruption case; although this happens in the ear corruption case, the curve never strays far from the dashed line. These experiments on MD II demonstrate again that AMRSC can tolerate the most extreme random pixel corruption of the face or the ear image.
(v) Random block occlusion: we simulate random block occlusion by using a Baboon picture of various sizes, as shown in Figure 7. Figure 8 shows the performance under the conditions in which only the face or only the ear is occluded on MD III. While all other methods suffer significant performance reduction with aggravating occlusion, AMRSC can still achieve performance comparable to ER with RSC when 100% of the area of the face or the ear image is occluded. The competitors seem more sensitive to ear random block occlusion. The performance of MRSC in this series of experiments reveals that RSC is not quite effective in dealing with complex occlusion. This, in turn, illustrates the importance of quality-based adaptive fusion in AMRSC.
(vi) Multimodal degeneration: Figures 9 and 10 show the multimodal performance in the cases in which the face and the ear simultaneously suffer from random pixel corruption and random block occlusion. It can be observed that all the accuracy curves of the multimodal methods with traditional classification descend sharply after 20% corruption in both degeneration scenarios. AMRSC and MRSC are always comparable. This indicates that quality-based adaptive fusion does not help improve performance when the face and ear encounter the same level of data degeneration; but, at least, it does not lead to an adverse effect. On the other hand, compared with the multimodal methods with the SRC model, their superiority is evident. They are able to tolerate 40% simultaneous random pixel corruption and 20% simultaneous random block occlusion. This result validates again the effectiveness of the RSC model.
Overall, these six categories of experiments verify that the SR-based biometric quality assessment and the associated adaptive multimodal recognition are highly effective. The RSC model is a beneficial complement to the quality-based adaptive multimodal recognition.

Conclusions
Multimodal biometric systems are believed to improve recognition accuracy and robustness by integrating evidence presented by multiple biometric modalities. In this paper, we gave an overview of multimodal biometrics using the face and ear. All the approaches are classified according to their fusion methodologies. Many multimodal systems have shown the feasibility and advantages of combining the ear and the face. This is a promising way to develop more accurate multimodal systems that are able to recognize a person contactlessly/nonintrusively. Nevertheless, when the face or the ear suffers from data degeneration, if the fusion rule is fixed or insufficiently flexible, a multimodal system may perform worse than the unimodal system using only the modality with the better quality sample. Biometric quality-based adaptive fusion is an avenue to address this issue. In this paper, we particularly emphasized a quality-based adaptive multimodal identification system, which adopts a general biometric quality assessment method and dynamically integrates the face and ear via sparse representation. We refined and gave more details about the quality measure and the related fusion weights selection. Moreover, for a more thorough evaluation, we extended the experiments by using more datasets and five types of image degeneration.
Our experimental results demonstrate that the sparse representation-based biometric quality measure is able to mirror the utilities of face and ear images degenerated by pose, expression, and illumination variations, facial disguise such as sunglasses and a scarf, random pixel corruption, and random block occlusion. The quality-based adaptive multimodal method achieves striking robustness to various types of unimodal corruption/occlusion. Even when the face or ear image suffers from 100% random pixel corruption or random block occlusion, it can still achieve performance comparable to unimodal recognition with the ear or the face alone. It can also tolerate a high level of simultaneous face and ear degeneration. In the future, biometric quality assessment and quality-based adaptive multimodal fusion deserve more attention.

Data Availability
No data were used to support this study.

Conflicts of Interest
There are no conflicts of interest regarding the publication of this paper.