Heterogeneous Visible-Thermal and Visible-Infrared Face Recognition Using Cross-Modality Discriminator Network and Unit-Class Loss

Heterogeneous face recognition (HFR) aims to match face images across different imaging domains, such as visible-to-infrared and visible-to-thermal. Recently, the growing utility of non-visible imaging has broadened the application prospects of HFR in areas such as biometrics, security, and surveillance. HFR is a challenging variant of face recognition due to the differences between imaging domains. While current research has proposed image preprocessing, feature extraction, or common subspace projection for HFR, optimizing these multi-stage methods is difficult because each step must be optimized separately and the error accumulates over the stages. In this paper, we propose a unified end-to-end Cross-Modality Discriminator Network (CMDN) for HFR. The proposed network uses a Deep Relational Discriminator module to learn deep feature relations for cross-domain face matching. Simultaneously, the CMDN is used to extract modality-independent embedding vectors for face images. The CMDN parameters are optimized using a novel Unit-Class Loss that shows higher stability and accuracy than other popular metric-learning loss functions. Experimental results on five popular HFR datasets demonstrate that the proposed method achieves significant improvement over existing state-of-the-art methods.


Introduction
The applications of facial recognition (FR) [1] systems have increased exponentially with the advent of deep convolutional neural networks. Currently, automated FR is being used in personal devices, public surveillance, access control, security, marketing, and other applications. While the performance of FR algorithms has achieved near-human accuracy for frontal images, the performance is limited in scenarios involving extreme variations in illumination, expression, pose, presentation attacks, and disguises [2][3][4]. The use of infrared, thermal, and 3D imaging is being explored to overcome the limitations of the visible modality. These modalities have shown advantages against pose, illumination, spoofing, and disguise. Infrared and thermal images are captured without visible light and are less likely to be affected by variations in illumination. Thermal sensors detect the IR radiation emitted from an object, which is converted to surface temperature. Similarly, the heat emitted by the human body as IR radiation can be recorded by thermal sensors and stored as an intensity image. In addition to robustness to illumination, thermal imagery also shows promise against disguise [5] and spoofing [6], making it suitable for security-sensitive applications of FR. These advantages of the visible and thermal domains have led to increased usage of these modalities for FR applications. The increasing application of different modalities for image acquisition results in an abundance of face data where the images belong to different imaging domains. This leads to scenarios where images from different domains need to be matched for face recognition. The infrared spectrum is robust to illumination, making it ideal for capturing images in the dark. This makes infrared imaging suitable for security cameras, surveillance, and monitoring around the clock. As thermal imaging captures the surface temperature of a person's face, it is a suitable choice for face recognition against spoofing and disguise.
The applications of thermal imaging include border control, security-sensitive entrances, and disease monitoring in public places. As large-scale identity databases and social media platforms contain visible images only, cross-domain matching is needed to match identities across various applications and scenarios [7], such as visible images from social media and infrared images from CCTV footage. The scenario in which the modality of the probe images is different from that of the enrollment image is known as heterogeneous face recognition (HFR) [8], e.g., visible-to-infrared (VIS-NIR), visible-to-thermal (VIS-THE), and visible-to-sketch. While numerous works have been proposed for VIS-NIR [9][10][11] face matching, VIS-THE has received relatively less attention owing to the high cost of thermal imaging devices and the lack of availability of large-scale visible-thermal face datasets. In addition, the large modality gap between the two domains makes visible-thermal face recognition a challenging task. Figure 1 shows the sample images for a subject in the visible, infrared, and thermal modalities. As can be observed, the VIS-NIR modality gap is significantly less prominent than the VIS-THE modality gap, making VIS-THE image matching a more difficult task.
Several methods have been proposed to reduce the domain gap between different modalities, e.g., common subspace projection-based methods, synthesis-based methods, and local feature extraction-based methods. However, most manually designed feature descriptors are not capable of dealing with large intraclass variations. Recently proposed deep learning-based techniques learn these descriptors automatically and have achieved promising results in multiple cross-modality scenarios. These approaches rely on the extracted embedding vector distances for training and face matching. Typically, a handcrafted loss function is utilized to project the extracted embedding vectors into a hypothetical feature space. The goal of this projection is to optimize the network to increase interclass separation and decrease intraclass distances. Furthermore, simplistic distance measures such as cosine or Euclidean distance are used at the inference stage for face matching. We hypothesize that the usage of handcrafted loss functions and hardcoded distance measures is not an optimal solution for HFR. Rather, a relational learning function that distinguishes between embeddings of different classes should be more efficient for HFR and other metric learning tasks.
Motivated by the success of convolutional neural networks for relational learning, we propose a Cross-Modality Discriminator Network (CMDN) for heterogeneous face recognition. Instead of using a distance-based loss function for face matching, the network employs a Deep Relational Discriminator (DRD) module to learn the relationships between cross-domain images. We argue that a discriminator function, learned during network training, should outperform traditional distance metrics for HFR tasks. Furthermore, we formulate a metric learning-based loss, namely, Unit-Class Loss, that is robust to a small amount of training data, noisy samples, and large modality differences in the training data. The proposed loss enhances the feature learning of the network by considering individual samples as well as the whole class distributions.
This paper presents a Cross-Modality Discriminator Network and Unit-Class Loss for heterogeneous face recognition. The proposed loss can learn modality-invariant identity features from unaligned facial images. In our training process, VGGFace2 [12] weights are used to initialize the backbone network for FR. The backbone network is then fine-tuned on the IRIS [13] face database with a classification layer for visible and thermal face classification. Then, for the HFR training, the Deep Relational Discriminator module is integrated with the backbone network, and the training is performed on the HFR dataset. The proposed network can be used to (a) extract face embedding vectors, (b) obtain HFR classification, and (c) create a fusion of embedding vectors and classification probabilities. The presented network is tested on multiple datasets: the TUFTS Face database [14], Collection X1 from the University of Notre Dame (UND-X1) database [15,16], the University of Science and Technology China Natural Visible and Infrared facial Expression (USTC-NVIE) database [17,18], CASIA NIR-VIS [19], and the Sejong Face Database [20]. Our results show that our proposed method outperforms the existing methods for VIS-THE and VIS-NIR HFR.
In this article, our main contributions are as follows: (i) We propose an efficient end-to-end Cross-Modality Discriminator Network (CMDN) for heterogeneous face recognition. (ii) We propose a Deep Relational Discriminator (DRD) module that eliminates the need for a handcrafted loss function for cross-domain face matching. Rather than using a hard-coded distance metric, the DRD module learns to differentiate between same- and different-class samples. (iii) We propose a novel Unit-Class Loss (L_uc) for optimizing the CMDN. L_uc compels the network to learn identity-discriminative embedding representations by penalizing intraclass variations and encouraging interclass distances. (iv) We demonstrate the superior performance of our proposed network over previous works on various VIS-THE and VIS-NIR face datasets. The rest of the paper is organized as follows. In Section 2, we review the related works on VIS-NIR and VIS-THE HFR. Section 3 presents the details of the proposed methodology. The quantitative and qualitative results on different HFR datasets are presented in Section 4. Finally, the conclusion and future works are presented in Section 5. An earlier version of this study has been presented as a preprint in arXiv [21].

Related Works
In this section, we present a review of the related literature. The significant increase in multi-modal and cross-domain data calls for methods that can meet the newly rising needs of multi-modal data [22,23]. Usually, multi-modal domains consist of two very distinct modalities, such as images and text, or text and speech, and are restricted to close-set classification [24] problems. On the other hand, HFR considers very similar data from different imaging domains and needs the system to handle open-set verification, making it a more challenging task. A close-set problem is one where the system is trained to classify categories previously seen during training, such as object recognition, whereas in an open-set scenario the system should be able to match unseen identities at the inference stage. Person re-identification [25] is a similar problem where a subject's whole body is matched against its previously stored information. While person re-identification considers all existing features of the body, such as clothes, color, height, gait, etc., HFR only considers the face features for identity matching. Heterogeneous face matching has been receiving increasing attention from the biometrics community. There are three categories of existing methodologies for HFR, namely, latent subspace learning, cross-domain synthesis, and feature extraction-based methods.

Latent Subspace Learning
Latent subspace learning-based algorithms project cross-domain features to a common latent space, in which the similarity of heterogeneous information can be compared. The most common approach is to derive a set of facial features from both domains such that the modality-related information is removed and the identity-related information is retained. This is done by finding a mapping from both modalities to a common feature subspace. Zhu et al. [10] proposed a transductive heterogeneous face-matching method. They first apply Log-DoG filtering, local encoding, and uniform feature normalization to reduce the domain gap between VIS-NIR images. Then, a transductive classifier is trained for face matching. Yi et al. [26] extracted Gabor features at localized facial points and then used Restricted Boltzmann Machines (RBMs) to remove the heterogeneity around the focal points by learning a shared representation. Then, these shared representations of local RBMs were concatenated and classified using PCA. Klare et al. [27] proposed a generic HFR framework where the probe and enrollment images are represented in terms of nonlinear similarity by using a prototype random subspace (similarity kernel space) such that the prototype subjects each have an image in both modalities. Hu et al. [28] proposed to use discriminant partial least squares (PLS) by specifically building PLS gallery models for each subject with the help of thermal cross examples. The images are first preprocessed by the PLS regression-based approach to reduce the domain gap, and then recognition is performed using a one-vs-all model. He et al. [29] proposed a deep convolutional network approach to learn a mapping that projects both VIS and NIR images into a common compact Euclidean space. The training strategy is similar to that of Ref. [30]. Reale et al. [31] applied coupled dictionary learning to the thermal-to-visible matching problem.
The coupled dictionaries representing the two domains provide a sparse representation that transforms the data into a single, domain-independent latent space. Reale et al. [32] propose to use a complex deep model to learn the mapping of both VIS and NIR faces into a domain-invariant latent feature space so that they can be compared directly. Peng et al. [33] proposed to use Markov networks to extract features so that spatial information is preserved; patches are extracted for both probe and gallery images. These patches are then compared using a coupled representation similarity metric. Latent subspace-based methods need to find the mapping of the input domain to a target domain using manually selected feature extractors. These manually designed extractors fail in scenarios where there are large variations in the data. Deep learning-based methods offer a solution in the form of generic feature extractors that can be trained for the specific dataset and domains.

Cross-Domain Synthesis
Cross-domain synthesis methods aim to generate an image in the gallery domain from the given test image, so that heterogeneous face recognition can be treated as homogeneous face recognition. Typically, these methods involve two steps: first, a visible image is synthesized from a thermal image, and next, face matching is performed between the synthesized image and the gallery. Li et al. [34] proposed one of the earlier attempts, using a learning-based model to synthesize visible images from their thermal counterparts. Iranmanesh et al. [35] proposed to use coupled generative adversarial networks (GANs) consisting of two generator and two discriminator networks to synthesize visible images from thermal images. FR is then performed on the intermediate network features. Fu et al. [36] propose a dual variational generator elaborately designed to learn the joint distribution of paired heterogeneous images. The variational generator is then used to augment the multi-modal training data by synthesizing infrared images from visible images. The augmented data is then used to train an HFR network. Synthesis-based methods rely on GANs, which lack an objective evaluation measure, making it difficult to assess the quality and validity of the generated data [37].

Modality-Independent Feature Learning
Modality-independent feature learning aims to extract features related to face identity while discarding the modality information. Various deep learning approaches have been proposed for cross-domain FR. Ghosh et al. [38] proposed a subclass heterogeneity-aware loss to train deep neural networks for cross-domain and cross-resolution face recognition. Deep Perceptual Mapping [39] captures the highly nonlinear relationship between the visible and thermal modalities by using a deep neural network that attempts to learn a nonlinear mapping from the visible to the thermal spectrum while preserving the identity information.
Riggan et al. [40] used coupled auto-associative neural networks along with deep perceptual mapping (DPM) to learn common features that are useful for cross-domain face recognition. He et al. [30] proposed a Wasserstein Convolutional Neural Network to learn domain-invariant features. The low-level layers are trained using the VIS images, and the high-level layers are further divided into three parts, i.e., a NIR layer, a VIS layer, and a NIR-VIS shared layer. The first two aim to learn modality-specific features, whereas the shared layer aims to learn modality-invariant features.
Metric learning losses and their variations have been used for face verification, but these are sensitive to the selection of image pairs and require hard-sample mining for effective training. Synthesis-based methods rely on effective training of generative adversarial networks, which is an inherently complex problem. Facial feature-based methods depend on facial alignment, which is an independent problem for highly noncorrelated modalities. Various effective approaches have been proposed using multiple network branches, but this increases the network parameters multifold and requires a large number of training resources.
This paper overcomes the shortcomings of the current methods by proposing an end-to-end deep relational network for cross-domain face matching.

Proposed Method
This section presents the building blocks of our proposed algorithm. The goal of an efficient cross-modality recognition system is to project the input images such that intraclass projections have a small distance whereas interclass projections have a large distance. The choice of our backbone architecture, the motivation for our loss function, and the details of the Deep Relational Discriminator module for achieving this goal are explained below. Figure 2 shows the overall architecture of the proposed network. For a given cross-modality image pair, the network outputs an embedding vector for each image and an HFR match score. The embedding represents the identity of the face in the image as a 256 × 1 float vector. These embeddings can be compared using cosine distance to find similar identities in the gallery set. The HFR score output by the network is a float value between 0 and 1, where a higher value signifies higher similarity between the input image pair.

Backbone Architecture.
Deep residual networks were introduced to solve the issue of degradation in deep neural networks [41]. They introduced skip connections with the identity function to allow the gradient of the cost function to flow directly from deeper layers to shallow layers. The identity blocks are also effective at improving the performance of networks when vanishing gradients and degradation are not an issue. The efficiency of deep residual networks (ResNet-50) [41] has been proven for face recognition on the MS1M [42] and VGGFace2 [12] datasets.
Squeeze and Excitation (SE) blocks recalibrate channel-wise feature responses by explicitly modeling the interdependencies between channels. The SE block models channel interdependencies by selectively emphasizing informative channels and suppressing less useful ones. For the transformation x_l to x_{l+1}, squeeze (F_sq), excitation (F_ex), and scaling (F_sc) operations are performed for a given feature matrix x_l of size H × W × C. The advantages of integrating SE blocks have been demonstrated for face recognition using the VGGFace2 dataset. In this work, we use ResNet-50 with SE blocks (SENet-50) [43] as our backbone network. The backbone network takes a 224 × 224-sized image input and calculates a 256-dimensional embedding vector of the input image.
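As an illustration, the squeeze, excitation, and scale operations can be sketched in NumPy (a minimal sketch; the reduction ratio r and the random weights are illustrative assumptions, not trained parameters of the network):

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation for a feature map x of shape (H, W, C)."""
    # F_sq: squeeze the H x W x C map to a C-dim channel descriptor
    z = x.mean(axis=(0, 1))
    # F_ex: bottleneck FC -> ReLU, then FC -> sigmoid (channel weights in (0, 1))
    s = np.maximum(z @ w1 + b1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(s @ w2 + b2)))
    # F_sc: rescale each channel of x by its excitation weight (broadcasts over H, W)
    return x * s

# Toy example with C = 8 channels and an assumed reduction ratio r = 4
H, W, C, r = 7, 7, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(H, W, C))
w1, b1 = rng.normal(size=(C, C // r)), np.zeros(C // r)
w2, b2 = rng.normal(size=(C // r, C)), np.zeros(C)
y = se_block(x, w1, b1, w2, b2)
assert y.shape == x.shape  # recalibration preserves the feature map shape
```

Because the excitation output lies in (0, 1), the block can only attenuate channels, never amplify them, which is what makes it a recalibration rather than a transformation.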

Triplet Loss
Triplet loss was used by Schroff et al. [44] for face recognition and clustering. Since then, triplet loss and its variations have been used for single-modal, multi-modal, and cross-modal face recognition. For the extracted embedding vectors of a given anchor image, positive image, and negative image, the triplet loss aims to minimize the anchor-to-positive distance and increase the anchor-to-negative distance. The idea is to train the network such that same-class images are projected in a nearby region, whereas different-class images are projected farther from the anchor.
The triplet loss is defined as

L_T(a, p, n) = max(D(a, p) − D(a, n) + α, 0),

where Θ(·) is the feature extraction function of the network, I_a is the anchor image, a = Θ(I_a) is the vector embedding of image I_a, and L_T(a, p, n) is the loss function for the embedding vectors of the anchor a, the positive-class image p, and the negative-class image n. D(·, ·) is the distance function for the learned vector representations of two images. The L2 or cosine distance is typically used as the distance measure, and α is the margin parameter, usually set to 1.0. A graphical illustration of the triplet loss is shown in Figure 3(a).
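For concreteness, the triplet loss can be sketched as follows (a minimal NumPy sketch using the Euclidean distance for D(·, ·) and the default margin α = 1.0 mentioned in the text; the toy embeddings are illustrative):

```python
import numpy as np

def l2_dist(u, v):
    """Euclidean distance D(., .) between two embedding vectors."""
    return np.linalg.norm(u - v)

def triplet_loss(a, p, n, alpha=1.0):
    """L_T(a, p, n) = max(D(a, p) - D(a, n) + alpha, 0)."""
    return max(l2_dist(a, p) - l2_dist(a, n) + alpha, 0.0)

# Toy embeddings: the positive lies close to the anchor, the negative far away
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.1])   # D(a, p) = 0.1
n = np.array([-3.0, 0.0])  # D(a, n) = 4.0
assert triplet_loss(a, p, n) == 0.0            # margin already satisfied
assert triplet_loss(a, p, n, alpha=5.0) > 0.0  # a larger margin activates the loss
```

The hinge form means well-separated triplets contribute zero gradient, which is why training with this loss is sensitive to the sampling of informative triplets.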

Class Mean Triplet Loss
The goal of a heterogeneous face recognition system is twofold: (a) to minimize the distances between multi-modal image representations of the same class and (b) to increase the distance between image representations of different classes. Given a vector representation a_c of an image of class c, the class mean triplet loss is defined as

L_S(a_c, m_c, m_n) = max(D(a_c, m_c) − D(a_c, m_n) + α, 0),

where U is the universal set containing the classes, c, n ∈ U with n ≠ c, and m_c and m_n are the means of the vector representations of classes c and n, respectively. L_S(a_c, m_c, m_n) is the loss function for the vector representation of the anchor image a_c belonging to class c. Figure 3(b) shows the conceptual representation of the class mean triplet loss.

Unit-Class Loss.
The training procedure for triplet loss is sensitive to the selection of effective samples, noise, outliers, the number of classes, and minibatch diversity. Using class means for loss calculation is more robust to noise and random sample selection; however, convergence becomes more difficult because of the larger number of parameters involved. We present a novel Unit-Class Loss (L_uc) that combines principles from the triplet loss and the class mean triplet loss to benefit from the advantages of both, as shown in Figure 3(c). The weight parameter β is introduced for the optimal weight distribution of the sample-based and class-mean-based optimization of gradients. The Unit-Class Loss is formalized as

L_uc = β · L_T(a, p, n) + (1 − β) · L_S(a_c, m_c, m_n).

It is to be noted that L_uc is the weighted sum of the triplet loss and the class mean triplet loss.
This means that there is no computational or memory advantage in the calculation of L_uc itself. However, the proposed loss function optimizes the network faster, reducing the required number of training cycles (epochs) and consequently improving the memory and time consumption of the training process.
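Since the text defines L_uc as the β-weighted sum of the two losses, it can be sketched as follows (a NumPy sketch; α = 1.6 and β = 0.6 follow the training settings reported later in the paper, and the toy samples are illustrative):

```python
import numpy as np

def dist(u, v):
    return np.linalg.norm(u - v)

def hinge(d_pos, d_neg, alpha):
    return max(d_pos - d_neg + alpha, 0.0)

def unit_class_loss(a, p, n, m_c, m_n, alpha=1.6, beta=0.6):
    """L_uc = beta * L_T(a, p, n) + (1 - beta) * L_S(a_c, m_c, m_n).

    a, p, n : anchor, positive, and negative sample embeddings
    m_c, m_n: mean embeddings of the anchor's class c and a negative class n
    """
    l_t = hinge(dist(a, p), dist(a, n), alpha)      # sample-based term
    l_s = hinge(dist(a, m_c), dist(a, m_n), alpha)  # class-mean-based term
    return beta * l_t + (1.0 - beta) * l_s

# Toy check: class means computed from a few same-class / other-class samples
c_samples = np.array([[1.0, 0.0], [1.0, 0.2], [0.8, 0.0]])
n_samples = np.array([[-2.0, 0.0], [-2.2, 0.1]])
a, p, n = c_samples[0], c_samples[1], n_samples[0]
loss = unit_class_loss(a, p, n, c_samples.mean(axis=0), n_samples.mean(axis=0))
assert loss >= 0.0  # both hinge terms are non-negative
```

Averaging over class members dampens the influence of any single noisy sample on the class-mean term, while the sample-based term preserves the per-triplet gradient signal.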

Deep Relational Discriminator Module.
We hypothesize that a network trained with the end goal of matching cross-modal face pairs would train more efficiently than a network trained only to project class vector representations. Traditionally, vector distance measures such as the Euclidean or cosine distance have been employed to calculate the similarity between the vector representations of two images. We introduce a Deep Relational Discriminator (DRD) module to distinguish same-class face images from different-class face images. An L2-normalized 256-dimensional embedding vector of the input image is obtained from the backbone network and processed such that each image embedding is concatenated with every image embedding in the batch. The b × 256 output, where b is the batch size and 256 is the embedding dimension, is remapped to a b × b × 512 (batch size × batch size × vector size) input for the DRD module.
The concatenated pair representation for images i and j can be written as x_{ij} = [a_i, a_j], where a_i and a_j are the vector representations of images i and j, respectively, and b is the total number of images in the batch. The goal is to train the DRD module to classify same-class image pairs from the concatenated image embeddings. These concatenated image pair representations are fed into subsequent dense layers followed by a sigmoid activation layer, as shown in Figure 4. The proposed DRD module is trained using binary cross-entropy loss to classify the matching image pairs. Our experiments demonstrate that the proposed module improves the network's performance and outperforms hardcoded distance measures for face matching. As the DRD module outputs a match score between 0 and 1, we use the binary cross-entropy loss, defined as

L_DRD = −(1/n²) Σ_{a,b} [ y(a, b) · log(p(a, b)) + (1 − y(a, b)) · log(1 − p(a, b)) ].

Here, n is the total number of images in a batch, n² is the total number of image pairs, a, b ∈ {1, . . . , n} index the images in a pair, y(a, b) is 1 if the two images belong to the same class and 0 otherwise, and p(a, b) is the match probability predicted by the DRD module.
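The pairwise remapping and the binary cross-entropy objective described above can be sketched in NumPy (a sketch; the DRD's dense layers are abstracted away, and a constant 0.5 prediction is used purely to exercise the loss):

```python
import numpy as np

def pairwise_concat(emb):
    """Remap a (b, d) batch of embeddings to a (b, b, 2d) pair tensor,
    where entry (i, j) is the concatenation of emb[i] and emb[j]."""
    b, d = emb.shape
    left = np.repeat(emb[:, None, :], b, axis=1)   # (b, b, d): emb[i] at (i, j)
    right = np.repeat(emb[None, :, :], b, axis=0)  # (b, b, d): emb[j] at (i, j)
    return np.concatenate([left, right], axis=-1)  # (b, b, 2d)

def drd_bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy averaged over all n^2 image pairs.
    p: (b, b) predicted match probabilities, y: (b, b) same-class labels."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Toy batch: 4 images with 256-dim embeddings belonging to classes [0, 0, 1, 1]
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 256))
pairs = pairwise_concat(emb)
assert pairs.shape == (4, 4, 512)
assert np.allclose(pairs[1, 2, :256], emb[1]) and np.allclose(pairs[1, 2, 256:], emb[2])

labels = np.array([0, 0, 1, 1])
y = (labels[:, None] == labels[None, :]).astype(float)  # same-class indicator y(a, b)
loss = drd_bce_loss(np.full((4, 4), 0.5), y)  # an uninformative predictor
assert np.isclose(loss, np.log(2.0))          # BCE of p = 0.5 is log 2
```

Note that the b × b construction scores every ordered pair in the batch, including self-pairs, so a batch of b images yields b² training pairs for the discriminator.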
3.6. Training Process. This section presents the training algorithm for the proposed method in detail. We aim to reduce intraclass variations across the domains while increasing interclass distances. Furthermore, the proposed DRD module improves the network's ability to differentiate between same-class and different-class embedding vectors. The proposed modules and loss are generic and can be integrated with existing state-of-the-art architectures for a further boost in performance. Figure 5 shows an overview of the proposed training methodology.
Our training process consists of three stages: weight initialization, pretraining, and HFR training. Network weights from a SENet-50 trained on the VGGFace2 [12] face database are used to initialize the backbone network. As the VGGFace2 dataset contains only visible images, we pretrain the base network on a VIS-THE face dataset to learn thermal as well as visible feature extraction. The IRIS [13] face dataset is used to pretrain the network using cross-entropy loss. The dataset contains 2,552 images each in the visible and thermal modalities for 29 subjects. The facial images contain variations in expression, pose, and illumination. Feature-wise centering and feature-wise standard normalization, calculated on the entire dataset (visible and thermal images), are applied to each sample. Furthermore, geometric transformations are used as data augmentation techniques to mitigate overtraining. The training data is fed in a pseudo-random manner, ensuring a randomized class and modality distribution. The network is trained until there is no further decrease in loss. We choose cross-entropy loss to train the network at this stage, given its established performance and stable behavior for FR algorithms.
For HFR training, the last Softmax layer from the pretraining stage is replaced with an L2-normalized dense layer, and the network is connected to the proposed DRD module. The resulting network has two outputs, i.e., the L2-normalized embedding vectors and the binary classification from the DRD module. We optimize the network using the proposed L_uc for the vector representations and L_DRD for the DRD output. Class identities are passed as label information to the Unit-Class Loss and the DRD loss. The pretraining of the backbone network and the training process of the proposed network are summarized in Algorithms 1 and 2, respectively. It should be noted that our training and testing procedure does not require specific image preprocessing, facial alignment, or landmark labeling.

Experimental Evaluation
In this section, we evaluate the performance of the proposed architecture on popular VIS-THE and VIS-NIR face datasets. We provide comprehensive parameter settings and implementation details for research reproducibility. Rank-1 accuracy and verification rates (VR) at a False Acceptance Rate (FAR) of 1% and 0.1% are compared with those of previously proposed HFR algorithms.
The USTC-NVIE database [17,18] is a facial expression database with visible-thermal image pairs for 215 subjects. The database is further divided into two subsets: a spontaneous database consisting of image sequences from the onset to the apex of facial expressions and a posed database consisting of apex images.
The images are captured with three illumination variations, i.e., illumination from the left, right, and front. This is an expression recognition database but is used for VIS-THE HFR in this study. Among the 215 subjects, 126 subjects were found to have usable data (having sufficient images in both modalities). Both sub-datasets are used to maximize the number of training and test data. The training was performed using 1600² image pairs (visible and thermal images) from 100 subjects, and 416² image pairs from 26 subjects were used for testing. The TUFTS database [14] contains data belonging to multiple categories, i.e., 2D visible, thermal, infrared, 3D, 3D LYTRO, sketch, and video. The database contains images having variations in pose, expression, and sunglasses. To avoid the additional challenge of pose, only frontal images with variations in expression and glasses were used for training and testing. The thermal and visible image corpus was used for our HFR experiments. The training was performed using 481² image pairs (visible and thermal images) of 74 subjects, and 247² image pairs of 38 subjects were used for testing.
UND-X1 [15,16] contains 82 subjects with LWIR and visible-light image pairs with varying illumination, expression, and time-lapse. Out of the 82 subjects, 50 subjects' images were used for training, with forty image pairs used for each subject. The training was performed using 1000² image pairs from 50 subjects, and testing was performed using 1280² image pairs from 32 subjects.
The Sejong Face Database [20] (SFD) contains images in the visible, infrared, and thermal modalities for 100 subjects. The subjects are captured wearing disguises and facial add-ons such as fake beards, caps, scarves, and wigs, with and without makeup, and so on. The addition of facial add-ons makes HFR a more challenging problem for SFD. Here, 975² image pairs from 75 subjects were used for training and 325² image pairs from 25 subjects were used for testing. The CASIA NIR-VIS 2.0 [19] database contains visible-infrared image pairs for 725 subjects. The number of images per subject ranges between 1-22 and 5-50 for the visible and infrared modalities, respectively. The database provides two views for the evaluation protocols: View-1 is used for training and View-2 is used for testing. To simulate practical situations, the VIS images are used as the gallery and the NIR images are used as the probe. For each subject in the gallery set, only one VIS image is selected. For a fair comparison with other results, we follow the standard protocol in View-2 for testing.

Training Parameters and Implementation Details.
The proposed network is implemented using TensorFlow [45], an open-source deep learning framework. The experiments are performed using two Nvidia GTX 1080Ti GPUs. In the pretraining stage, we adopt the IRIS Visible-Thermal image dataset to train the backbone network. At this stage, the categorical cross-entropy loss is used for multi-modal face recognition. Here, 5,100 images of 29 subjects are used as training data. The training images are resized to 224 × 224 pixels, and the labels contain class information only. The network is trained using the Adam optimizer [46] with an initial learning rate of 3e−4, and the learning rate is reduced by a factor of 0.8 when the error plateaus. The network is trained using a minibatch size of 64 (32 visible and 32 thermal images) until the loss plateaus.
In the HFR training stage, the Softmax layer is replaced with a 256-dimensional L2-normalized dense layer, and the DRD module is added to the head of the network. The backbone network is trained on the HFR dataset with the proposed Unit-Class Loss, and the DRD head is trained using cross-entropy loss. The values of α and β are set to 1.6 and 0.6, respectively. The batch size is set to 128 (64 visible and 64 thermal) images.
The Adam optimizer, with an initial learning rate of 3e−4, is used for gradient descent optimization. The learning rate is reduced by a factor of 0.8 when the loss plateaus. The fine-tuning process for each dataset takes approximately 2 hours.
Owing to the difference in image sizes between different modalities and across different databases, the images are cropped to a square matching the shorter edge. The square images are then scaled to 224 × 224 pixels for all databases. Data augmentation is performed to mitigate overtraining and increase the training set size. Geometric transformations, i.e., rotation, shifts along both axes, brightness shift, shear, and horizontal flip, are applied to the training data. Image normalization is performed using feature-wise centering and feature-wise standard normalization calculated on the entire training dataset.
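The cropping and normalization steps can be sketched as follows (a NumPy sketch; the center-crop policy is an assumption where the text only specifies cropping to the shorter edge, and the toy image sizes are illustrative):

```python
import numpy as np

def center_crop_square(img):
    """Crop an (H, W, C) image to a square matching its shorter edge."""
    h, w = img.shape[:2]
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    return img[top:top + s, left:left + s]

def featurewise_normalize(batch):
    """Feature-wise centering and std normalization, with statistics
    computed over the entire training set (visible and thermal together)."""
    mean = batch.mean(axis=0)
    std = batch.std(axis=0) + 1e-7  # avoid division by zero for constant pixels
    return (batch - mean) / std, mean, std

# Toy batch of landscape 'images'; scaling to 224 x 224 would follow the crop
rng = np.random.default_rng(0)
batch = rng.uniform(0, 255, size=(10, 240, 320, 3))
cropped = np.stack([center_crop_square(im) for im in batch])
assert cropped.shape == (10, 240, 240, 3)
normed, mean, std = featurewise_normalize(cropped)
assert abs(normed.mean()) < 1e-6  # centered around zero feature-wise
```

At test time, the same stored mean and std must be reapplied to probe images so that the gallery and probe share the normalization statistics.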

Training and Test Data.
To verify the performance of our proposed HFR algorithm, we compare our method with the state-of-the-art HFR methods. The numbers of available test subjects, gallery images, and probe images for the face verification experiments are listed in Table 1. The proposed network can be used in multiple ways: (a) the embedding vectors (emb) can be retrieved from the L2-normalized layer for a single image and HFR performed using the cosine distance between the gallery and probe images; (b) an HFR classification (hfr), i.e., genuine versus imposter, can be extracted from the DRD output for an image pair; (c) score fusion (fus) of the emb distance measure and the hfr output can be performed to obtain an averaged recognition measure. The Rank-1 accuracies for emb, hfr, and fus are presented. The emb results are computed for one visible image matched against the whole thermal gallery. For testing the hfr results, all available VIS-THE image pairs from the test set are used. The fus score is reported on the test pairs used for embedding recognition.
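The emb, hfr, and fus evaluation modes can be sketched with NumPy. The embeddings and DRD probability below are random placeholders, and the fusion rule (averaging the cosine score mapped to [0, 1] with the genuine-pair probability) is an assumption for illustration, since the paper only states that an average recognition measure is computed:

```python
import numpy as np

def rank1_match(probe_emb, gallery_embs):
    """(a) emb: cosine-similarity match of one probe against the gallery.
    Embeddings are assumed L2-normalized, so similarity is a dot product."""
    scores = gallery_embs @ probe_emb
    return int(np.argmax(scores)), scores

def fuse(cos_score, hfr_prob):
    """(c) fus: average the cosine score (mapped from [-1, 1] to [0, 1])
    with the DRD genuine-pair probability from (b)."""
    return 0.5 * ((cos_score + 1.0) / 2.0 + hfr_prob)

rng = np.random.default_rng(2)
gallery = rng.normal(size=(5, 256))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
probe = gallery[3] + 0.05 * rng.normal(size=256)  # probe close to identity 3
probe /= np.linalg.norm(probe)

pred, scores = rank1_match(probe, gallery)
print(pred)                      # identity index 3
print(fuse(scores[pred], 0.9))   # fused genuine score in [0, 1]
```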

Evaluation Metrics.
Given two test images, 1:1 verification determines whether the images belong to the same identity. For the embedding output of the model, the distance between the extracted embedding features is calculated using the cosine distance, and the same/different identity classification is performed based on a threshold value. For the hfr output, the model directly outputs the same-identity probability for the input image pair. 1:N identification determines the matching identity from a gallery of N images for a given probe image. We present the Rank-1 recognition rate and the true positive rate (TPR) at a False Acceptance Rate (FAR) of 1% and 0.1% for the face recognition and verification tasks [47]. Rank-1 accuracy is defined as the proportion of correctly predicted face image pairs (true positives) among the total number of image pairs. TPR at a given FAR describes the proportion of correct predictions at a specific percentage of acceptable false predictions; TPR and FAR are calculated as TPR = TP/(TP + FN) and FAR = FP/(FP + TN), where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. The decision threshold is set to obtain the specified FAR. Table 2 presents the Rank-1 identification and verification rates on UND-X1. We compare the performance of our method with those of the recent CpGAN [35], DPM [39], DVG-Face [36], and W-CNN [30].
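The TPR@FAR metric can be computed by picking the score threshold at which the desired fraction of impostor (different-identity) pairs is falsely accepted, then measuring how many genuine (same-identity) pairs clear that threshold. A minimal sketch with synthetic similarity scores (the score distributions below are placeholders, not measured results):

```python
import numpy as np

def tpr_at_far(genuine, impostor, far=0.01):
    """TPR = TP / (TP + FN) at the threshold where the fraction of
    impostor scores accepted (FAR = FP / (FP + TN)) equals `far`."""
    threshold = np.quantile(impostor, 1.0 - far)  # accept scores above this
    return float(np.mean(genuine >= threshold))

rng = np.random.default_rng(3)
genuine = rng.normal(0.8, 0.1, size=1000)   # same-identity similarity scores
impostor = rng.normal(0.1, 0.1, size=1000)  # different-identity scores

print(tpr_at_far(genuine, impostor, far=0.01))   # well-separated scores -> near 1.0
print(tpr_at_far(genuine, impostor, far=0.001))  # stricter FAR -> TPR can only drop
```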

Experimental Results.
As can be seen in Table 2, the proposed network achieves a Rank-1 accuracy of 95.21% with the fus output, a significant improvement over the 83.73% and 76.45% recognition rates achieved by DPM and CpGAN, respectively. While DPM and CpGAN aim to find a translation from thermal to visible images, the proposed method focuses on identity feature extraction and learning the relationships between those features. DPM achieves a verification rate similar to those of the proposed emb and hfr outputs, but the proposed method outperforms CpGAN by a large margin.
Overall, the proposed method achieves improved performance on the UND-X1 dataset. We also compare our results with the existing results on the TUFTS dataset. As the TUFTS database is relatively new, few VIS-THE HFR results have been reported on it. DVG-Face [36] proposes a two-step dual variation generation method to generate thermal images from visible images; the generated dataset is then used to train a LightCNN for recognition. As can be seen in Table 3, we achieve a marginally improved Rank-1 recognition rate over DVG-Face for the emb output, while the proposed hfr and fus achieve significant improvements of 21.3% and 22.8% over the current methods, respectively. The proposed end-to-end CMDN outperforms the two-stage methods by minimizing error accumulation. Further baseline results using triplet loss for the TUFTS database are presented in the ablation studies.
As USTC-NVIE is primarily an expression database, it lacks significant HFR results. We report baseline results for triplet loss in the ablation studies. The PCA, Fisherface, and G-HFR results in Table 4 are taken from a previous study [33]. The proposed fus method achieves a Rank-1 recognition rate of 99.7%, outperforming G-HFR and showing that the proposed method surpasses the graphical representations of facial identities used by G-HFR. Furthermore, the proposed emb and hfr achieve recognition rates very similar to fus, of 99.3% and 99.4%, respectively. The proposed emb achieves the highest TPR@FAR for the NVIE dataset, closely followed by hfr and fus.
The Sejong Face Database has been proposed for disguised face recognition across various modalities. The inclusion of facial add-ons that hide large parts of the face makes it a particularly challenging face dataset. The Rank-1 recognition rates on this database for single-modal visible images (single-modal) and multi-modal images (visible, thermal, and infrared) using score fusion (score-fusion) are reported using the methods of Ref. [20]. Furthermore, the database is tested in detail for VIS-THE HFR using DPM, CpGAN, DVG-Face, IDNet [32], and IDR [29], and the results are reported in Table 5. As can be observed, the HFR recognition rates for W-CNN (74.9%) and IDR (74.3%) are close to the single-modal recognition rate (77.2%). As single-modal FR is a simpler problem, better results are expected for FR than for HFR. Using multiple modalities (score-fusion) improves the face recognition results (92.3%), as a multi-network ensemble and additional image data are being used for recognition. The proposed method, with a single network and cross-domain matching, achieves a Rank-1 recognition rate of 92.4%, a significant improvement over the other methods.
CASIA NIR-VIS 2.0 is used to present our results on the VIS-NIR modality. The dataset has two subsets: View-1 is used for training, and View-2, with 10 different splits, is used for testing. The CASIA NIR-VIS 2.0 protocol restricts testing to one gallery image per subject; therefore, there are 358 visible gallery images and about 6,000 probe infrared images for testing. Table 6 summarizes the results of our proposed model compared to other recent methods applied to CASIA NIR-VIS 2.0. It should be noted that the proposed network is designed and optimized for VIS-THE data. The proposed method achieves a Rank-1 recognition rate of 99.5%, compared to 98.7% achieved by W-CNN and 97.3% achieved by IDR. While the proposed method achieves the highest TPR@FAR of 1%, IDR performs better at TPR@FAR of 0.1%.

Cause Analysis.
The proposed network offers multiple advantages over the current HFR methods. The Cross-Modality Discriminator Network employs a single backbone, which learns shared weights for both modalities.
The usage of shared weights prevents overfitting and results in better open-set performance. Moreover, having a single network trained end-to-end for HFR avoids the error accumulation possible with multiple networks, resulting in improved recognition rates. Instead of relying on a hard-coded matching function, the proposed DRD module learns the interclass and intraclass relationships of the deep features of test images. The learned relationships used for HFR classification outperform manually designed feature descriptors, as shown by the performance of the proposed method. Finally, the proposed CMDN gives two outputs, the embedding vectors and the match probability for an image pair, which are combined to improve HFR performance over the individual outputs.

Ablation Studies.
To verify the effectiveness of our proposed method, we perform ablation studies to explore the effects of different loss functions, the DRD module, and the hyperparameter values; the results are presented in Table 7. The results are calculated using embedding distances for a fair comparison. The ablation experiments are performed as follows:
(1) Experiment 1: the SENet-50 backbone network is trained using triplet loss without the DRD module.
(2) Experiment 2: the SENet-50 backbone network is trained using the class mean triplet loss without the DRD module.
(3) Experiment 3: the network is trained using the proposed class mean triplet loss without the DRD module.
(4) Experiment 4: the network is trained using the proposed Unit-Class Loss without the DRD module.
(5) Experiment 5: the network is trained using triplet loss with the DRD module added.
(6) Experiment 6: the network is trained using the proposed Unit-Class Loss and the DRD module (proposed method).
As can be seen in Table 7, the combination of the Unit-Class Loss and the DRD module outperforms the ablation variations in terms of Rank-1 accuracy and verification rate. The addition of the DRD module with triplet loss achieves the second-best performance, as the network is reinforced for the final goal of HFR as well as for embedding vector optimization. The effects of changing the values of α and β are shown in Table 8. As can be seen, changing the value of β affects the network performance, whereas changing the value of α does not have a noticeable effect. We determine the values α = 1 and β = 0.5 to be optimal for our HFR results. Figure 7 shows the test accuracy of the proposed network trained with the triplet loss, class mean triplet loss, and Unit-Class Loss for incremental training cycles. As can be seen, the network trained with the Unit-Class Loss achieves improved performance over the networks trained with triplet loss and class mean triplet loss starting from 50 training epochs.
The optimal performance is achieved with the proposed loss function within 300 epochs, while the networks trained with triplet and class mean triplet loss achieve their optimal performance after 500 and 450 epochs, respectively. This shows that the proposed loss function not only improves network performance but also decreases the memory and time consumption of the training process by reducing the number of required training epochs.

Conclusion
We present an end-to-end network for visible-to-thermal HFR using a novel Unit-Class Loss and a Deep Relational Discriminator module. The backbone network is initialized with weights trained on a large-scale visible dataset. Next, multi-modal features are learned by training on a visible-thermal database with cross-entropy loss. The backbone network is then integrated with the proposed Unit-Class Loss and DRD module for HFR. The proposed loss function maximizes the positive-to-negative pair distance by reducing intraclass variation and increasing interclass variance. The HFR performance is further enhanced by deep relational learning to classify same-class image pairs. Experiments on multiple cross-domain face datasets show that the proposed CMDN outperforms the existing state-of-the-art methods on visible-thermal and visible-infrared datasets, validating the effectiveness of the proposed method.
The usage of the proposed loss and discriminator module is simple, and they can be adopted in any network architecture for other HFR modalities. Furthermore, the methodology can be adapted to generic processing machines given its low computational complexity, making further research and industrial application feasible. In the future, more advanced architectures can be adopted to further improve and fine-tune the proposed strategy, and we will explore optimizing our network for other heterogeneous recognition problems. Incorporating modality labels into training is also worth exploring.

Data Availability
Previously reported face image databases were used to support this study and are available at their respective repositories. These prior datasets are cited at the relevant places within the text as references for USTC-NVIE [17,18], TUFTS [14], UND-X1 [15,16], the Sejong Face Database [20], and the CASIA NIR-VIS 2.0 face database [19]. The source code and data used to support the findings of this study have been deposited in the GitHub repository (https://github.com/usmancheema89/ForkCNN-Triplet).

Disclosure
An earlier version of this study was presented as a preprint on arXiv (https://arxiv.org/abs/2111.14339).

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.