A Boosting-Based Deep Distance Metric Learning Method

By leveraging neural networks, deep distance metric learning has yielded impressive results in computer vision applications. However, existing approaches mostly focus on a single deep distance metric based on pairs or triplets of samples, which makes it difficult for them to handle heterogeneous data and avoid overfitting. This study proposes a boosting-based method for learning multiple deep distance metrics, which generates the final distance metric through iterative training of multiple weak distance metrics. Firstly, the distance of sample pairs was mapped by a convolutional neural network (CNN) and evaluated by a piecewise linear function. Secondly, the evaluation function was added as a weak learner to the boosting algorithm to generate a strong learner. Each weak learner targets difficult samples that differ from those of the previous learners. Next, an alternating optimization method was employed to train the network and loss function. Finally, the effectiveness of our method was demonstrated against state-of-the-art methods on image retrieval from the CUB-200-2011, Cars-196, and Stanford Online Products (SOP) datasets.


Introduction
In the past decades, distance metric learning has been applied effectively in image retrieval, face recognition, person re-identification, clustering, etc. It is now a hot topic in the field of computer vision. Thanks to the recent success of convolutional neural networks (CNNs), deep distance metric learning methods have attracted considerable attention [1].
Each deep distance metric aims to map training samples to features via CNNs. The mapping should narrow the distance between similar sample pairs and increase that between dissimilar sample pairs. To learn deep distance metrics, many approaches have been developed based on sample pairs [2,3], triplets [2,4], or quadruplets [5]. This study attempts to learn simple similarity functions of sample pairs. The distance metric was defined as the Euclidean distance between sample pairs, which can be computed rapidly compared with other metrics.
Most of the existing methods of deep distance metric learning try to improve a single loss function based on a single distance metric. However, a single distance metric is insufficient to handle all the samples from the given data distribution. In fact, feature data are generally not distributed uniformly: the density varies from region to region in the data distribution [6,7]. To solve the problem, some scholars resorted to ensemble techniques and employed several learners to map each sample to multiple subspaces [8][9][10]. Nevertheless, these strategies do not support end-to-end training of the network and loss function of each weak learner. The lack of such training suppresses the discrimination ability and increases the susceptibility of the metric to noise. The accuracy of deep distance metric learning could be further improved through joint training of the network and loss function.
This study aims to improve the adaptability of the conventional deep distance metric between pairs of samples. The main idea is to divide the last fully connected layer of the CNN into multiple nonoverlapping groups (Figure 1), each of which is a separate feature mapping of the network. The distance metric of sample pairs mapped by one of the groups was evaluated by a piecewise linear function. Each group has a corresponding evaluation function, which is added as a weak learner to the boosting algorithm to generate a strong learner. This finally forms a multidistance metric ensemble.
In addition, the same underlying feature representation, which was pretrained through experiments, was applied to the fully connected layers of all groups. In this way, the high computing cost of CNN training in the boosting framework was significantly reduced. In that framework, each learner reweights the training samples for successive learners, according to the gradient of the loss function. As a result, the successive learners focus on difficult sample features, producing more suitable feature representations. The final ensemble output is a linear composition of multiple weak learners. Furthermore, the performance of the conventional distance metric was improved by introducing a piecewise linear function, which evaluates the similarity of sample pairs in distance metric learning. This facilitates the joint training of the network and loss function.
Through the evaluation of various deep distance metric learning methods on the image retrieval task, the Recall@1 of the proposed method is 4.2%, 2.8%, and 0.4% higher than the previous best scores on the CUB-200-2011, Cars-196, and SOP datasets, respectively. Experimental results show that the proposed method outperforms the comparison methods, while avoiding overfitting to a certain extent. The contributions of our work are as follows: (1) The last fully connected layer of the CNN was used to form multiple groups of features, which were designed to form an ensemble of distance metrics and formulated as a boosting problem. Then, an alternating optimization method was adopted to jointly train the network and loss function. (2) A piecewise linear function was employed as the evaluation function of the distance metric of sample pairs mapped by the CNN and added as a weak learner to the boosting algorithm to generate a strong learner.

Literature Review
This section reviews the works most closely related to ours among the numerous publications on distance metric learning.

Deep Distance Metric Learning.
Many methods employ a discriminative distance metric loss function to increase interclass distance and reduce intraclass distance [2,4,11-14]. For instance, contrastive loss is a popular tool of deep distance metric learning that minimizes the distance between the feature vectors of positive sample pairs and widens the distance between those of negative sample pairs [2,3]. Building on contrastive loss, triplet loss creates a 3-tuple containing a positive sample pair and a negative sample pair, in light of the relative relationship between intraclass distance and interclass distance [2,4], and ensures that the positive sample pairs are closer in the mapped feature space than the negative sample pairs. Many other loss functions have been extended from these two losses, namely, histogram loss [15], quadruplet loss [5], N-pair loss [11], angular loss [12], and hierarchical triplet loss [16].
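To make the two baseline losses concrete, the following is a minimal sketch of one common form of each, written for a single pair or triplet of already-mapped feature vectors. The function names, the squared form of the contrastive terms, and the default margins are illustrative assumptions, not the exact formulations used in [2,3,4].

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_class, margin=1.0):
    """One common contrastive form (assumed here): pull positive pairs
    together, push negative pairs apart until they exceed the margin."""
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)
```

In this form, a negative pair that is already farther apart than the margin contributes zero loss, which is why the choice of margin matters for which pairs still drive training.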
Taking a tuple of samples as training samples yields a huge amount of training data. Deep distance metric learning would be greatly enhanced by acquiring more effective samples. Recently, several scholars have designed sampling strategies to tackle hard and semihard negative mining [16][17][18]. For example, Xuan et al. [7] observed that easy positive samples help to preserve the intraclass difference and thus improve the generalization ability of triplet loss. However, the use of easy positive samples constantly underchallenges the metric, making the embedding space less discriminative.

Ensemble Learning.
The methods above all strive to improve the loss function based on a single distance metric. However, it is difficult for them to adapt to all available data. Recently, ensemble learning, which iteratively trains an ensemble of several weak learners for the final prediction, has been incorporated to boost the generalization performance of deep metric learning.
Negrel et al. [8] explained how to use their boosting-based metric learning algorithm to compute hierarchical organizations of face databases. Kim et al. [9] introduced multiple attention-based learners for an ensemble. Xuan et al. [10] grouped labels randomly to create a large family of related embedding models, which can serve as an ensemble. Sanakoyeu et al. [6] employed a divide-and-conquer strategy to divide the embedding space into several clusters and used each cluster to train a single learner.

Other Related Metric Learning.
In addition, other types of distance metric approaches have appeared recently, such as sample selection, local metrics, and hierarchical metrics. Wu et al. [19] proposed a distance-weighted sampling procedure, which selects more informative and stable examples than traditional approaches and achieves excellent results. Wang et al. [13] generalized tuple-based losses and reformulated them as different weighting strategies of positive and negative pairs within a minibatch. Roth et al. [20] proposed to learn the distribution for sampling negative examples instead of using a predefined one. Local metric learning methods [21,22] learn a collection of Mahalanobis distance metrics, each operating on a different subset of the data obtained by K-means or Gaussian mixture clustering. The methods in [23,24] learn a two-level category hierarchy by using coarse and fine classifiers. Ge et al. [16] proposed a hierarchical version of triplet loss that learns the sampling together with the feature embedding. Different from the above approaches, our approach realizes the end-to-end training of the network and loss function of each weak learner, thereby enhancing the accuracy of the deep distance metric and reducing the probability of overfitting.

Figure 1: Structure of boosting-based deep distance metric.
Computational Intelligence and Neuroscience

Methodology
Let $\{(x_n, y_n)\}_{n=1}^{N}$ be $N$ training sample pairs, where each pair $x_n = (x_n^1, x_n^2)$ carries one of the two class labels $y_n \in \{-1, +1\}$. If the two samples belong to the same class, the pair is labeled $y_n = +1$ and called a positive sample pair; if the two samples belong to different classes, the pair is labeled $y_n = -1$ and called a negative sample pair.
We divide the last fully connected layer of the CNN into $M$ nonoverlapping groups. The training sample pair $x_n$ is mapped to the feature pair $(f_m(x_n^1), f_m(x_n^2))$, which is extracted from the $m$th group of the last fully connected layer. Thus, each training sample pair $x_n$ is mapped to multiple groups of features, which together form an ensemble of distance metrics that is formulated as a boosting problem.
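The grouping of the last fully connected layer can be illustrated on plain lists of numbers. The sketch below assumes that the layer output is a flat vector whose length is divisible by the number of groups and that the groups are equal-sized contiguous slices; the paper does not state how the groups are laid out, and the function names are hypothetical.

```python
import math


def split_groups(embedding, num_groups):
    """Split a flat embedding into equal, nonoverlapping contiguous groups."""
    size, remainder = divmod(len(embedding), num_groups)
    if remainder:
        raise ValueError("embedding length must be divisible by num_groups")
    return [embedding[m * size:(m + 1) * size] for m in range(num_groups)]


def group_distances(emb1, emb2, num_groups):
    """Euclidean distance between the two samples of a pair, one per group."""
    groups1 = split_groups(emb1, num_groups)
    groups2 = split_groups(emb2, num_groups)
    return [math.dist(a, b) for a, b in zip(groups1, groups2)]
```

Each entry of the returned list is the distance seen by one weak learner, so a single forward pass of the shared network supplies the inputs for all groups.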
Drawing on the idea of the boosting algorithm, multiple weak learners are combined into a strong learner of distance metrics between the mapped values of training sample pairs. The weak learners are trained on reweighted samples, according to the gradient of the loss function. In general, we want to find a set of weak learners and their corresponding boosting model:
$$D_M(x_n) = \sum_{m=1}^{M} \varphi_m(x_n), \tag{1}$$
where $M$ is the number of weak learners and $\varphi_m$ is the distance metric evaluation function between the feature vectors of the training sample pair mapped by the $m$th group of the fully connected layer.
In the above formula, $\varphi_m$ quantifies the similarity between two training samples and responds to this similarity according to whether the two samples should be considered to represent the same class. Therefore, a threshold was defined to deal with the distance metric between two training samples, and a piecewise linear function was adopted as the evaluation function. This function reduces the distance between similar training samples and increases that between dissimilar ones in the mapped space. The evaluation function can be defined as
$$\varphi_m(x_n) = \begin{cases} \alpha_m, & d\big(f_m(x_n^1), f_m(x_n^2)\big) < t_m, \\ \beta_m, & \text{otherwise}, \end{cases} \tag{2}$$
where $d(f_m(x_n^1), f_m(x_n^2))$ is a generic distance metric (here, the simple Euclidean distance), $\alpha_m$ and $\beta_m$ are the evaluated similarity and dissimilarity between the two samples, respectively, and $t_m$ is a distance metric threshold. If the Euclidean distance between two mapped training samples is smaller than the threshold $t_m$, then the evaluation value is $\alpha_m$; otherwise, it is $\beta_m$ (Figure 2).
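The evaluation function and the resulting strong learner can be sketched in a few lines. The sketch assumes that the strong learner is the plain sum of the weak evaluations, as the linear composition described in the text suggests; the tuple layout `(alpha, beta, threshold)` and the function names are hypothetical.

```python
def evaluate_pair(distance, alpha, beta, threshold):
    """Weak learner: alpha if the pair is closer than the threshold, else beta."""
    return alpha if distance < threshold else beta


def strong_learner(distances, learners):
    """Sum of weak evaluations; distances[m] comes from the m-th group and
    learners[m] is a hypothetical (alpha, beta, threshold) tuple."""
    return sum(
        evaluate_pair(d, alpha, beta, t)
        for d, (alpha, beta, t) in zip(distances, learners)
    )
```

Under this convention a positive sum indicates that the pair is judged similar and a negative sum that it is judged dissimilar, which matches the labels +1 and -1 used for positive and negative sample pairs.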
In each round of boosting, a new weak learner is trained on the reweighted training set in the minibatch, according to the gradient of the loss function, and then added to form a strong learner. As demonstrated by Friedman [25], the training of a single learner can be written as a loss function minimization problem:
$$\varphi_m = \arg\min_{\varphi} \sum_{i=1}^{N} \ell\big(y_i; D_{m-1}(x_i) + \varphi(x_i)\big), \tag{3}$$
where $\ell$ is a loss function. Here, the exponential loss function $\ell(y_i; D_M(x_i)) = e^{-y_i D_M(x_i)}$ is utilized. Inspired by Schapire and Singer [26], formula (3) can be rewritten as
$$\varphi_m = \arg\min_{\varphi} \sum_{i=1}^{N} w_i^m e^{-y_i \varphi(x_i)}, \tag{4}$$
where $w_i^m$ is the weight of training sample $x_i$ in iteration $m$. The weak learner is selected to minimize the loss function in each iteration to update the strong learner. The parameters $\alpha_m$, $\beta_m$, and $t_m$ of the distance evaluation function and the parameters $w^{net}_m$ of the $m$th group of the fully connected layer all need to be optimized. The proposed approach is easily integrated into some deep metric learning approaches, such as triplet loss, N-pair loss, and hierarchical triplet loss. However, it is not directly applicable to some loss functions, such as histogram loss and angular loss, and needs to be adapted for them.
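The step from formula (3) to the weighted form in formula (4) can be made explicit. The derivation below is a sketch that assumes the boosting model is additive, i.e., $D_m = D_{m-1} + \varphi_m$ with $D_0 = 0$, and that the weight of a sample is the exponential loss of the previous strong learner on that sample; this weight definition is an assumption consistent with the exponential loss, not a quotation from the paper.

```latex
\begin{align*}
\sum_{i=1}^{N} \ell\big(y_i; D_{m-1}(x_i) + \varphi_m(x_i)\big)
  &= \sum_{i=1}^{N} e^{-y_i \left(D_{m-1}(x_i) + \varphi_m(x_i)\right)} \\
  &= \sum_{i=1}^{N} e^{-y_i D_{m-1}(x_i)} \, e^{-y_i \varphi_m(x_i)} \\
  &= \sum_{i=1}^{N} w_i^m \, e^{-y_i \varphi_m(x_i)},
  \qquad w_i^m := e^{-y_i D_{m-1}(x_i)} .
\end{align*}
```

Because the factor $e^{-y_i D_{m-1}(x_i)}$ does not depend on the new learner, it acts as a fixed per-sample weight in round $m$: pairs that the current ensemble already scores correctly receive small weights, and misjudged pairs receive large ones.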

Joint Training.
As to formula (3), we need to jointly learn both the network and the loss function. The objective is nonconvex, which makes it difficult to solve in general. Referring to Zhang et al. [27], this study applies an alternating optimization method to jointly train the network and loss function.
Since a learner needs to be optimized in each round of boosting, optimization problem (4) is first solved while fixing the parameters $w^{net}_m$ of the $m$th group of the fully connected layer. Writing $d_i = d(f_m(x_i^1), f_m(x_i^2))$, formula (4) can be decomposed into
$$\sum_{i=1}^{N} w_i^m e^{-y_i \varphi_m(x_i)} = \sum_{i:\, d_i < t_m} w_i^m e^{-y_i \alpha_m} + \sum_{i:\, d_i \ge t_m} w_i^m e^{-y_i \beta_m}. \tag{5}$$
Taking partial derivatives of formula (5) with respect to $\alpha_m$ and $\beta_m$ and setting both to zero yields
$$\alpha_m = \frac{1}{2} \ln \frac{\sum_{i:\, d_i < t_m,\, y_i = +1} w_i^m}{\sum_{i:\, d_i < t_m,\, y_i = -1} w_i^m}, \tag{6}$$
$$\beta_m = \frac{1}{2} \ln \frac{\sum_{i:\, d_i \ge t_m,\, y_i = +1} w_i^m}{\sum_{i:\, d_i \ge t_m,\, y_i = -1} w_i^m}. \tag{7}$$
After each iteration, the weights of the training sample pairs are updated using the exponential loss function:
$$w_i^{m+1} = w_i^m e^{-y_i \varphi_m(x_i)}. \tag{8}$$
Then, the weights of all training sample pairs are normalized. As shown in formulas (6) and (7), $\alpha_m$ and $\beta_m$ depend only on $t_m$, whose optimal value is obtained by the traversal method. If a training sample pair is classified correctly, its weight for successive learners tends to be small; otherwise, its weight tends to be large. Hence, successive learners focus on different training sample pairs than previous learners, increasing the diversity among learners (Figure 3).
The next step is to update the parameters $w^{net}_m$ of the $m$th group of the fully connected layer, while fixing $\alpha_m$, $\beta_m$, and $t_m$ of the evaluation function. These parameters were trained with the contrastive loss function, using the standard backpropagation algorithm. In the forward process, the similarity distance metric is computed for each input training sample pair. In the backward process, the gradient of the loss function is iteratively propagated for each group (Figure 4).
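The weak-learner half of the alternating optimization can be sketched as follows, with the network held fixed so that the pair distances are given numbers. The sketch assumes that candidate thresholds are the midpoints between consecutive distinct distances, that the closed-form values follow from minimizing the weighted exponential loss, and that a small smoothing constant `eps` avoids division by zero; the candidate set, the smoothing, and all names are assumptions rather than details taken from the paper.

```python
import math


def fit_weak_learner(distances, labels, weights, eps=1e-9):
    """Traverse candidate thresholds and return the (alpha, beta, threshold)
    with the lowest weighted exponential loss. Labels are +1 or -1."""
    values = sorted(set(distances))
    if len(values) < 2:
        raise ValueError("need at least two distinct distances")
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0  # assumed candidate: midpoint between neighbors
        # Weight mass of positive/negative pairs on each side of t.
        mass = {(near, y): 0.0 for near in (True, False) for y in (1, -1)}
        for d, y, w in zip(distances, labels, weights):
            mass[(d < t, y)] += w
        # Closed-form minimizers of the weighted exponential loss (smoothed).
        alpha = 0.5 * math.log((mass[(True, 1)] + eps) / (mass[(True, -1)] + eps))
        beta = 0.5 * math.log((mass[(False, 1)] + eps) / (mass[(False, -1)] + eps))
        loss = (mass[(True, 1)] * math.exp(-alpha)
                + mass[(True, -1)] * math.exp(alpha)
                + mass[(False, 1)] * math.exp(-beta)
                + mass[(False, -1)] * math.exp(beta))
        if best is None or loss < best[0]:
            best = (loss, alpha, beta, t)
    return best[1], best[2], best[3]


def update_weights(distances, labels, weights, alpha, beta, t):
    """Exponential reweighting followed by normalization to sum to one."""
    raw = [w * math.exp(-y * (alpha if d < t else beta))
           for d, y, w in zip(distances, labels, weights)]
    total = sum(raw)
    return [w / total for w in raw]
```

A pair whose label disagrees with the sign of its evaluation value has its weight multiplied by a factor greater than one, so the next weak learner is fitted mainly on the pairs that the current one gets wrong.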
For the contrastive loss function, the distance metric threshold $t_m$ obtained through weak learner training serves as the distance margin of a negative training sample pair, and the contrastive loss function is established with this margin. The training procedure is illustrated in Algorithm 1.

Experiments and Results' Analysis
The performance of our method in retrieving images from the three datasets (CUB-200-2011, Cars-196, and SOP) was evaluated by Recall@K. For each retrieval task, we computed the percentage of the testing images whose top-K retrieved images contain at least one image with the same class label. The K value was set to K∈{1, 2, 4, 8, 16, 32} for CUB-200-2011 and Cars-196, and K∈{1, 10, 100, 1000} for SOP. Our method was implemented in TensorFlow. Following Oh Song et al. [2], GoogLeNet was adopted as the feature extractor. The batch size was fixed at 128 in all experiments.
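The Recall@K measure used in the evaluation can be sketched as below. The sketch assumes that every test image serves as a query against all other test images, that the query itself is excluded from its own ranking, and that ranking uses the Euclidean distance; these conventions are common for the three benchmarks but are assumptions here, and the function name is hypothetical.

```python
import math


def recall_at_k(embeddings, labels, ks):
    """Fraction of queries whose top-K neighbors include a same-class item."""
    n = len(embeddings)
    hits = {k: 0 for k in ks}
    for i, query in enumerate(embeddings):
        # Rank all other items by distance to the query (query excluded).
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(query, embeddings[j]),
        )
        for k in ks:
            if any(labels[j] == labels[i] for j in ranked[:k]):
                hits[k] += 1
    return {k: hits[k] / n for k in ks}
```

This brute-force version sorts all items for every query, which is adequate for illustration; for a dataset the size of SOP, a batched matrix computation or an approximate nearest-neighbor index would be the practical choice.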
Since the deep distance metric could be affected by the number of weak learners, the influence of that number on our method was observed through experiments on each of the three datasets. As shown in Figure 5, with the growing number of weak learners, the Recall@1 score first increased [...]

[Figure legend: training samples whose weights are updated at the (m-1)th iteration; training samples whose weights are updated at the mth iteration.]

Figure 6 compares the retrieval performance of our method with that of the multi-similarity (MS) method [13]. As shown in Figure 6, the retrieval performance of both methods gradually increased with the feature vector size. Our method performed stably when the size was equal to or greater than 256 and consistently outperformed MS. Hence, the feature vector size was fixed at 256 in subsequent experiments.
Next, the training results and testing results were contrasted on Cars-196. As shown in Figure 7, the training R@1 differed only slightly from the testing R@1. On Cars-196, the R@1 score on the training set was around 93%, only 7% higher than that on the test set.
This shows that our method largely avoids overfitting. Figure 8 shows the convergence curves of our method and several state-of-the-art methods on Cars-196. In the first 40 epochs, our method reached the performance of the state-of-the-art methods and converged faster than the other methods. Over epochs 0 to 50 in Figure 8, however, the convergence rate of our model was not the highest, although it was satisfactory on the whole. Except for MS, the compared methods took hundreds of epochs to converge. Thus, the training time of our method was compared with that of MS. On a single NVIDIA Tesla V100 GPU, the mean running time of our method was 24 [...]

[Figure: Recall@1 curves; legend: Ours, MS [13], Triplet [2], LiftedStruct [2], N-Pairs [11].]

Finally, the image retrieval performance of our method was compared with that of the state-of-the-art methods on CUB-200-2011 and Cars-196, respectively. The comparison results (Tables 1 and 2) show that our method outperformed these methods, including higher-order tuple methods such as LiftedStruct and N-Pairs, as well as angular loss and ensemble methods such as attention-based ensemble (ABE) and deep randomized ensembles for metric learning (DREML). In particular, on the challenging CUB-200-2011 dataset, our method led the best-performing state-of-the-art method by a large margin: 4.2% in R@1. On SOP, our method also attained the best performance (Table 3). On all the datasets, our method, with a low feature dimension, performed better than the existing methods with high feature dimensions.

Conclusions
This study presents a deep distance metric ensemble method based on boosting, which generates the final distance metric through iterative training of multiple weak distance metrics. Specifically, the last fully connected layer of the CNN was used to form multiple groups of features. The sample pairs were mapped by the CNN, and the distance between the mapped sample pairs was evaluated by a piecewise linear function. The function was added as a weak learner to the boosting algorithm to generate a strong learner. Then, an alternating optimization method was utilized to optimize the parameters of the network and loss function. The effectiveness of our method was demonstrated on three datasets widely used in image retrieval tasks. Future research will further improve our method by cascading more models and combining it with other loss functions.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.