Finger Vein Verification on Different Datasets Based on Deep Learning with Triplet Loss

In this study, deep learning with a triplet loss function is applied to finger vein verification, and the model is trained and validated across different datasets, namely FV-USM, HKPU, and SDUMLA-HMT. We report the accuracy and other evaluation indexes of finger vein verification for each training-validation set combination, together with the corresponding ROC curves and AUC values. The best accuracy reaches 98%, and all ROC AUC values are above 0.98, indicating that the obtained model identifies finger veins well. Because the model is validated on datasets other than the one it was trained on, the results suggest that it has good adaptability and applicability. The experimental results also show that a model trained on a dataset that is harder to distinguish is better and more robust.


Introduction
With economic development and advances in science and technology, information security has attracted increasing attention, and identity authentication based on biological characteristics is a popular research direction in this field [1]. Common biometrics include face, fingerprint, voice, iris, palm vein, finger vein, and gait. As an emerging biometric verification technology, finger vein features have the following outstanding advantages [2]. First, finger vein features can only be collected while blood is flowing; when a person dies, blood stops flowing and finger vein images can no longer be collected. Second, finger vein authentication uses the veins inside the finger rather than the surface of the finger, so wear, humidity, and cleanliness of the finger surface do not hinder verification. Third, finger veins must be imaged under infrared light, which makes them difficult to steal, and the veins of the human body are complex and difficult to copy or forge. In general, biometric verification, including finger vein verification, is more secure, confidential, and convenient.
There have been many works related to finger vein feature extraction. Miura et al. proposed a method to extract finger vein patterns using maximum curvature points [3]. Huang et al. introduced a wide line detector method that can obtain the width information of finger veins [4]. Miura et al. also proposed a repeated line tracking method to extract the texture features of finger veins [5]. Choi et al. proposed a principal curvature calculation method that can effectively extract finger vein patterns regardless of vein thickness or brightness [6]. Matsuda et al. proposed a method based on feature point matching to combat changes in finger pose and illumination [7]. Lee et al. used the local binary pattern method to extract finger vein codes and achieved good verification results [8]. Mahri et al. implemented a phase correlation function to match finger vein images [9]. Recently, there have been a lot of related works on finger vein verification using deep learning. Hu et al. gave an FV-net to learn the more discriminative and robust feature representations of finger veins [10]. Hong et al. proposed a method based on the modified VGGNet16 network for finger vein verification and achieved good verification results [11]. Zhang et al. proposed adaptive Gabor convolutional neural networks to decrease the number of parameters of networks for finger vein verification [12].
Traditional finger vein verification methods based on hand-crafted features are sensitive to image quality and to changes in finger posture and illumination, and they are not very stable. Finger vein verification methods based on convolutional neural networks learn image features automatically from training datasets without hand-crafted design, and they improve considerably on traditional methods in stability and verification accuracy. This work uses modified ResNet18 and VGGNet16 networks combined with a triplet loss function to perform finger vein verification and cross-validation on different datasets. The paper is organized as follows: the next section introduces the datasets and the idea of cross-validation between different datasets; Section 3 introduces the triplet loss function, the convolutional neural network models, and the evaluation indexes used in this work; Section 4 gives the finger vein verification results for the training-validation sets, followed by a brief discussion; finally, the conclusions of this paper are given.

Datasets, Preprocessing, and Cross-Validation
HKPU database [14]: the HKPU database has 156 subjects and contains 6 finger vein images of the index and middle finger of the left hand for each subject; that is, the database has 312 (156 × 2) finger classes and 6 instances for each class. The resolution of the raw images is 513 × 256 pixels.
SDUMLA-HMT database [15]: the SDUMLA-HMT database is a multimodal biometric database produced by Shandong University. The finger vein database is part of the SDUMLA-HMT database and consists of finger vein images of 6 fingers of 106 individuals. The 6 fingers are the index, middle, and ring fingers of both hands. There are 6 finger vein images for each finger; that is, the finger vein database has 636 (106 × 6) classes and 6 instances for each class. The resolution of the images of this database is 320 × 240 pixels. Figure 1 shows original finger vein samples from the HKPU database and the SDUMLA-HMT database. The images show the tilt of the finger posture and the background noise of the acquisition device. Table 1 gives the details of the public finger vein datasets employed in this experiment. It should be noted that the order of resolutions from the highest to the lowest is FV-USM, HKPU, and SDUMLA-HMT.
2.2. Preprocessing. As described above, the FV-USM database provides extracted ROI images, which are employed directly in this experiment. For the HKPU and SDUMLA-HMT databases, ROI images need to be extracted, because the raw images of these databases are affected by the edge of the instrument. If we directly detect the edge of the finger using the Sobel operator, which detects edges by calculating image gradients, the edge of the instrument is detected as well. We found that the middle part of the finger is almost unaffected by the edge of the instrument, so we first use the Sobel operator to detect the finger edge there. Because the edge of the finger is continuous, we then extend gradually from the middle of the finger to the left and right sides to find the edge of the entire finger. From the detected edge we find the midline of the finger and rotate the finger to the horizontal direction according to the angle between the midline and the horizontal direction [4]. Then according to the edge of the [...] In Figure 2, the preprocessed image samples corresponding to Figure 1 are given. It should be noted that we calculated the mean and standard deviation of each dataset and normalized the images to speed up the convergence of the model.
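The midline-and-rotation step can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the upper and lower finger edges have already been located for every image column (for example, from Sobel gradients), and the function name `midline_angle_deg` and its arguments are hypothetical. Plain Python is used to keep the sketch dependency-free.

```python
import math


def midline_angle_deg(upper_edge, lower_edge):
    """Angle in degrees between the finger midline and the horizontal axis.

    upper_edge[c] and lower_edge[c] are the row indices of the upper and
    lower finger boundaries in column c (assumed to be detected already).
    The midline is the least-squares line through the per-column midpoints.
    """
    n = len(upper_edge)
    if n != len(lower_edge) or n < 2:
        raise ValueError("need two edge lists of equal length >= 2")
    cols = list(range(n))
    mids = [(u + l) / 2.0 for u, l in zip(upper_edge, lower_edge)]
    mean_c = sum(cols) / n
    mean_m = sum(mids) / n
    # Ordinary least-squares slope of midpoint row versus column index.
    sxx = sum((c - mean_c) ** 2 for c in cols)
    sxy = sum((c - mean_c) * (m - mean_m) for c, m in zip(cols, mids))
    slope = sxy / sxx
    return math.degrees(math.atan(slope))
```

Rotating the image by the opposite of this angle would bring the midline to the horizontal; the sign convention depends on how the image library defines its row axis and rotation direction.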

Cross-Validation.
In this experiment, interdataset cross-validation is employed, and intradataset cross-validation is deliberately not performed. Although the training set and the validation set taken from a single dataset contain different samples, they come from the same input space, so it is not difficult for the trained model to map that input space into the feature space. In interdataset cross-validation, we train the model on one dataset and validate it on the other datasets, so the input spaces of the training set and the validation sets are different and uncorrelated. We believe that this verification method better tests the recognition ability of the model and is closer to the actual application scenario of finger vein verification.
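As a small illustration of this protocol, the training-validation combinations can be enumerated as ordered pairs of distinct datasets. The sketch below is an assumption about how the combinations might be listed, not code from the study; the helper name `cross_dataset_pairs` is hypothetical.

```python
from itertools import permutations

# Dataset names as used in the text.
DATASETS = ["FV-USM", "HKPU", "SDUMLA-HMT"]


def cross_dataset_pairs(datasets):
    """Return every ordered (training set, validation set) pair.

    A dataset is never paired with itself, which mirrors the choice of
    interdataset rather than intradataset validation.
    """
    return list(permutations(datasets, 2))
```

With three datasets this gives six ordered pairs, since each of the three training sets is validated on the two remaining datasets.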

Triplet Loss and Convolutional Neural Networks and Evaluation Indexes
3.1. Triplet Loss. The model uses a triplet loss function to learn better feature space representations. The triplet loss function is explained in detail in [16]; here, we only give a brief description. Computing the triplet loss requires standard (anchor) samples, positive samples, and negative samples. The goal of training is to make the embedding distance between positive samples and standard samples smaller than the embedding distance between negative samples and standard samples, as seen in Figure 3. A positive sample belongs to the same class as the standard sample, that is, the same finger, and a negative sample belongs to a different class. The loss function is defined as follows:

$$L = \sum_{i} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+ \tag{1}$$

where $f(x)$ is the representation in the feature space, $x_i^a$ is the standard sample of the input space, $x_i^p$ is the positive sample of the input space, $x_i^n$ is the negative sample of the input space, and $\alpha$ is the margin hyperparameter between the positive-to-standard distance and the negative-to-standard distance. The subscript "+" means that when the value in "[ ]" is greater than zero, that value is taken as the loss, and when it is less than zero, the loss is zero. In other words, the triplets that are useful for training are those in which the positive distance is not yet smaller than the negative distance by at least the margin. A triplet that already satisfies this condition is what training aims for; it contributes zero loss and therefore does not help the training of the model.
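The triplet loss described above can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes squared Euclidean distances between embeddings and a sum over triplets, following the formulation in [16]; the function names are hypothetical, and a real training loop would use a tensor library instead of Python lists.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embeddings (lists of floats)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha):
    """Sum over triplets of max(d(a, p) - d(a, n) + alpha, 0).

    anchors, positives, and negatives are equally long lists of embeddings;
    alpha is the margin hyperparameter.
    """
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(squared_l2(a, p) - squared_l2(a, n) + alpha, 0.0)
    return total
```

A triplet whose negative is already farther from the anchor than the positive by more than the margin adds nothing to the sum, which is why such triplets do not drive training.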
Triplets can be divided into three categories [16]. Easy triplets satisfy $\| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha < \| f(x_i^a) - f(x_i^n) \|_2^2$; their loss is zero, so such triplets are not helpful for training. Hard triplets satisfy $\| f(x_i^a) - f(x_i^n) \|_2^2 < \| f(x_i^a) - f(x_i^p) \|_2^2$ and are necessarily misidentified, but there are often very few triplets that meet this condition. Semihard triplets, which satisfy $\| f(x_i^a) - f(x_i^p) \|_2^2 < \| f(x_i^a) - f(x_i^n) \|_2^2 < \| f(x_i^a) - f(x_i^p) \|_2^2 + \alpha$, are the triplets used in the experiments. As shown in Table 2, as training progresses, fewer and fewer triplets satisfy the semihard condition, which also shows that the model keeps improving. Figure 4 shows how the number of valid training triplets changes as the training epoch increases. The number of valid training triplets decreases rapidly in the first few epochs, indicating that with the triplet loss function the model converges very quickly in the first few training epochs; as the training epoch increases, the convergence gradually slows down.
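The three categories can be expressed as a small classification helper. The sketch below assumes the usual definitions from the triplet-loss literature [16], written in terms of the anchor-positive distance `d_ap`, the anchor-negative distance `d_an`, and the margin `alpha`; the function name is hypothetical, and boundary cases (exact equality) are assigned to the semihard class for simplicity.

```python
def classify_triplet(d_ap, d_an, alpha):
    """Label a triplet as 'easy', 'hard', or 'semi-hard'.

    d_ap: distance between anchor and positive embeddings.
    d_an: distance between anchor and negative embeddings.
    alpha: margin hyperparameter of the triplet loss.
    """
    if d_ap + alpha < d_an:
        return "easy"       # margin already satisfied, zero loss
    if d_an < d_ap:
        return "hard"       # negative is closer to the anchor than the positive
    return "semi-hard"      # negative is farther, but still inside the margin


def select_semi_hard(triplet_distances, alpha):
    """Keep only the (d_ap, d_an) pairs that fall in the semi-hard class."""
    return [(d_ap, d_an) for d_ap, d_an in triplet_distances
            if classify_triplet(d_ap, d_an, alpha) == "semi-hard"]
```

Counting the output of `select_semi_hard` per epoch would give a curve of valid training triplets of the kind the text describes for Figure 4.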
This experiment adopts the method of generating triplet datasets online [16]. Of course, many of the generated triplet datasets are easy triplets, and not many meet the conditions of semihard triplets. We used the same format as the "pairs.txt" file described in the paper to generate validation

Convolutional Neural Networks.
In the experiment, a modified 18-layer residual network is trained to represent the transformation from the input space to the feature space, that is, f(x) in Equation (1). The structure of ResNet18 is described in detail in [18][19][20]; we only give the structure of the weight layers in Table 3. In the last linear transformation layer, the output of the network is a feature embedding of dimension 512 rather than a vector of class scores. This is the only change with respect to the original ResNet18: the last linear layer maps its 512-dimensional input to the feature embedding instead of to the number of categories, and all other layers are identical to those of the original ResNet18. The model is trained to map the input image to a 512-dimensional embedding that represents the finger vein image well. Whether two images show the same finger vein is determined by comparing the Euclidean distance between their embeddings with a threshold: if the distance is less than the chosen threshold, the two finger veins are considered the same, and if it is greater than the chosen threshold, they are considered different. This network model is trained with the triplet loss function to obtain a good representation of the feature space.
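The threshold decision on embeddings can be sketched as follows. This is an illustrative helper rather than the authors' code: embeddings are plain lists of floats, the names `euclidean` and `same_finger` are hypothetical, and a pair exactly at the threshold is treated as different because the text only specifies the strictly-smaller and strictly-larger cases.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def same_finger(emb_a, emb_b, threshold):
    """Return True if the two embeddings are judged to be the same finger.

    The pair is accepted when the distance is strictly below the threshold.
    """
    return euclidean(emb_a, emb_b) < threshold
```

In practice the threshold would be chosen on validation pairs, for example as the distance that maximizes accuracy.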
The experiment also trains a modified VGGNet16 model [21][22][23]; the structure of its weight layers is given in Table 4. We added a BatchNorm layer between each convolutional layer and the following ReLU (Rectified Linear Unit) activation layer of the original VGGNet16 model. BatchNorm layers can alleviate the vanishing or exploding gradient problem to a certain extent and speed up the convergence of the model by normalizing the output data. From Table 4, we can see that the modified VGGNet16 model has only one linear transformation layer, whereas the original VGGNet16 model ends with three: the feature dimensions are transformed from the 7 × 7 × 512 output of the max-pooling layer to 4096, then from 4096 to 4096, and finally from 4096 to the number of classes. Our model keeps a single linear layer that transforms the 7 × 7 × 512 features into the uniform 512-dimensional feature embedding. All other layers are identical to those of the original VGGNet16. It should be noted that the preprocessed finger vein images are resized to 224 × 224, and the mean and standard deviation of each dataset are calculated to normalize the images, which speeds up the convergence of the model.
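The per-dataset normalization mentioned above can be sketched as follows. This is a simplified illustration, assuming single-channel images stored as nested lists and a population standard deviation computed over all pixels of a dataset; the function names are hypothetical, and the actual pipeline may compute the statistics differently.

```python
import math


def dataset_mean_std(images):
    """Mean and population standard deviation over all pixels of a dataset.

    images is a list of 2-D lists (rows of pixel intensities).
    """
    values = [p for img in images for row in img for p in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def normalize(image, mean, std):
    """Subtract the dataset mean and divide by the dataset standard deviation."""
    return [[(p - mean) / std for p in row] for row in image]
```

Each dataset gets its own statistics, so an image from one dataset is normalized with the mean and standard deviation of that same dataset.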
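The definition of precision is

$$\text{Precision} = \frac{TP}{TP + FP}.$$

The definition of recall (also called the true positive rate, TPR) is

$$\text{Recall} = \text{TPR} = \frac{TP}{TP + FN}.$$

Here TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The full name of the ROC curve is the receiver operating characteristic curve. Its vertical axis is the TPR, and its horizontal axis is the false positive rate (FPR), $\text{FPR} = FP / (FP + TN)$, where TN is the number of true negatives. The ROC curve is drawn by adjusting the threshold and plotting each resulting FPR against its corresponding TPR. AUC refers to the area under the ROC curve and is calculated by integrating the ROC curve. The larger the AUC, the better the performance of the model.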
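These indexes can be computed directly from pair distances and labels. The sketch below is illustrative only: it assumes that a pair is accepted as "same" when its distance is at most the threshold, sweeps the threshold over the observed distances, and integrates the ROC curve with the trapezoidal rule. The function names are hypothetical, and the input must contain at least one same-finger pair and one different-finger pair.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(distances, labels):
    """Area under the ROC curve for distance-based verification.

    distances[k] is the embedding distance of pair k; labels[k] is True
    when the pair comes from the same finger.
    """
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold below every distance: nothing accepted
    for d in sorted(set(distances)):
        tp = sum(1 for x, y in zip(distances, labels) if y and x <= d)
        fp = sum(1 for x, y in zip(distances, labels) if not y and x <= d)
        points.append((fp / neg, tp / pos))  # (FPR, TPR) at threshold d
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0  # trapezoidal rule
    return auc
```

A perfectly separated set of pairs gives an AUC of 1, while distances that carry no information about the label give a value near 0.5.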
The Euclidean distance $D(x_i, x_j)$ between two finger vein images and the threshold $d$ determine the classification into same and different. $N_{same}$ denotes all finger vein pairs $(i, j)$ of the same finger, and $N_{diff}$ denotes all pairs of different fingers. All true accepts are defined as

$$TA(d) = \{ (i, j) \in N_{same} : D(x_i, x_j) \le d \}.$$

These are the finger pairs $(i, j)$ that were correctly classified as same at threshold $d$. Similarly,

$$FA(d) = \{ (i, j) \in N_{diff} : D(x_i, x_j) \le d \}$$

is the set of all pairs that were incorrectly classified as same (false accepts). The true accept rate $TAR(d)$ and the false accept rate $FAR(d)$ for a given threshold $d$ are defined as [16]

$$TAR(d) = \frac{|TA(d)|}{|N_{same}|}, \qquad FAR(d) = \frac{|FA(d)|}{|N_{diff}|}.$$
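The true accept rate and false accept rate can be sketched in code as follows. This is an illustration under assumptions: a pair counts as accepted when its distance is at most the threshold, following the convention of [16], and the helper `tar_at_far`, which reports the best TAR among thresholds whose FAR does not exceed a target, is a hypothetical way to obtain a figure such as TAR at FAR = 0.001.

```python
def tar_far(distances, labels, d):
    """True accept rate and false accept rate at threshold d.

    distances[k] is the distance of pair k; labels[k] is True for a
    same-finger pair and False for a different-finger pair.
    """
    same = [x for x, y in zip(distances, labels) if y]
    diff = [x for x, y in zip(distances, labels) if not y]
    true_accepts = sum(1 for x in same if x <= d)
    false_accepts = sum(1 for x in diff if x <= d)
    return true_accepts / len(same), false_accepts / len(diff)


def tar_at_far(distances, labels, target_far):
    """Largest TAR over observed thresholds whose FAR is <= target_far."""
    best = 0.0
    for d in sorted(set(distances)):
        tar, far = tar_far(distances, labels, d)
        if far <= target_far:
            best = max(best, tar)
    return best
```

With few different-finger pairs the FAR can only take coarse values, so a target such as 0.001 is meaningful only when the validation set contains enough negative pairs.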

Results and Discussion
In this experiment, CNN model training and validation are performed on a Linux server with Intel® Xeon® Gold 5218R CPUs @ 2.10 GHz (4 CPUs), 512 GB of memory, and NVIDIA GeForce RTX 3090 graphics cards with 24 GB of memory each (4 GPUs). The experimental models are all implemented in the PyTorch framework and are trained using the backpropagation algorithm and the AdaGrad optimizer [24].
In Tables 5 and 6, the accuracy, precision, recall, ROC AUC (area under the receiver operating characteristic curve), and best distance threshold of each training-validation combination are given for the ResNet18 and VGGNet16 models, respectively. The TAR (true accept rate) at a FAR (false accept rate) of 0.001 is also given for each training-validation combination in the two CNN models. It can be seen that the verification accuracies of [...]

We plot the accuracy and ROC AUC for each training set in Figure 5; each value is the average over the different validation sets. For example, "F-*" means that the training set is the FV-USM dataset and the validation sets are the other two datasets. We find that, across validation sets, the model obtained from the SDUMLA-HMT training set is better than the model obtained from the HKPU training set, which in turn is better than the model obtained from the FV-USM training set. The same result can be seen in Figure 5(b). It should be noted that the results of the ResNet18 model in Figures 5(a) and 5(b) are better than those of the VGGNet16 model because of the residual structure in the ResNet18 model.

We also plot the accuracy and ROC AUC for each validation set in Figure 6; each value is the average over the different training sets. For example, "*-F" means that the validation set is the FV-USM dataset and the training sets are the other two datasets. We find that, across training sets, the results obtained with FV-USM as the validation set are better than those obtained with HKPU as the validation set, which in turn are better than those obtained with SDUMLA-HMT as the validation set. The same result can be seen in Figure 6(b). The results of the ResNet18 model in Figures 6(a) and 6(b) are also better than those of the VGGNet16 model because of the residual structure in the ResNet18 model.
We found that the trends of the curves in Figures 5 and 6 are exactly opposite: the curve in Figure 5 rises, whereas the curve in Figure 6 falls. A simple explanation can be given for this phenomenon. The harder a dataset is, that is, the worse the results it yields when used as a validation set (Figure 6), the better and more robust the model trained on it will be (Figure 5).
Through the analysis of Figures 5 and 6, it can be concluded that the SDUMLA-HMT dataset is the most suitable as a training set and the FV-USM dataset is the most suitable as a testing set. We also notice that the resolution of the SDUMLA-HMT dataset is the lowest, while that of the FV-USM dataset is the highest, which suggests the following. If the image resolution is lower, the deep learning model will learn more deep features from the image; on the contrary, if the image resolution is higher, the deep learning model will [...]

Figure 6: (a) The accuracy for the same validation set in the two CNN models; (b) the ROC AUC value for the same validation set in the two CNN models.

Computational and Mathematical Methods in Medicine

[...] the ROC curve is close to a square, which also shows that the model can map the input space of the finger vein into the feature space well. It should be noted that in the lower left corner of Figures 7 and 8, where the training set is FV-USM and the test set is SDUMLA-HMT, the ROC curve is the farthest from a square, because this combination has the worst training set and the worst testing set.

Conclusions
(a) In this experiment, the triplet loss function is applied to finger vein verification, and the output embedding of the trained model represents the features of finger veins well. Good experimental results are obtained: the accuracies of the ResNet18 and VGGNet16 models are at least 92%, the best accuracy reaches 98%, and the ROC AUC values are above 0.98.
(b) It is worth mentioning that our training set and validation set are completely unrelated, because we use cross-validation across different datasets, so the resulting model is more adaptable. From the results of this experiment, we find that the harder a dataset is to identify, the better and more robust the model trained on it will be.

Data Availability
The experimental datasets are all publicly available online.

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.