A Novel Sequence Distance Measuring Algorithm Based on Optimal Transport and the Cross-Attention Mechanism

Background.
Sequence data is one of the most popular data types in real-world applications of machine learning and data mining [1-6]. For example, in natural language processing, a sentence is a sequence of words; in computer vision, a video is a sequence of frames; and in bioinformatics, a protein structure is a sequence of amino acids in a polypeptide chain. Unlike the flat vector data of most machine learning problems, sequence data has the following inherent features: (1) Sequence data varies in the number of items. A flat feature vector is usually given at a fixed size, while the lengths of sequences can differ, due to the sampling process that forms each sequence. (2) Sequence data has a temporal and relational nature. The order of the items in a sequence plays an important role in understanding the sequence. Given two sequences of the same items but in different orders, their meanings could be completely different. This is a critical difference from flat vector data, where the entries of a vector are considered independent of each other and their order is not important for the learning problem.
Given these two natures of sequence data, common machine learning methods, such as classification, similarity comparison, representation, and regression models, are not necessarily applicable to it. The most popular way to handle sequence data is to map a sequence to a flat vector and then apply the conventional methods. However, this methodology usually cannot capture the sequential features of the data; thus, the results are not satisfying [4,7-12]. Comparing the similarity/dissimilarity of a pair of sequences is a fundamental problem of sequence data analysis and understanding. Its applications include similarity search [13-16] and nearest neighbor-based classification [17,18]. However, the similarity of two sequences is essentially different from the distance/similarity metrics of flat vectors, such as the Euclidean distance, ℓp-norm distances, correlation, the Mahalanobis distance, and all kinds of learned metrics. Calculating the distance between a pair of sequences is more difficult than between flat vectors, due to the complex nature of sequences mentioned above. Two similar sequences may have different lengths because they are generated with different sampling rates, and encoding the temporal patterns and sequential relations of the items of sequences into distance measures is also difficult. To tackle these challenges, various solutions have been proposed, such as dynamic time warping (DTW) [19-22] and optimal transport (OT) [23-26]. Most of these methods are based on the item-to-item ground distances between the item pairs of the two sequences and on matching them accordingly. The ground distance is extremely important for these methods, but ground distance learning has not received enough attention in previous research.
In this paper, we study the problem of learning effective ground distance between the items of the two sequences for the purpose of sequence distance comparison.

Existing Works.
In this section, we review a few ground distance-based sequence distance learning methods.
(1) Villani [23] proposed to compare the distance between two sequences by OT. OT treats one sequence as a set of masses and the other sequence as a set of demands. The effort to move one unit of mass from the ith item of the source sequence to the jth item of the target sequence is treated as the ground distance between the pair (i, j). The purpose of OT is to move all the masses from the source sequence to the target sequence with the minimum amount of effort. To this end, OT minimizes the overall effort of mass moving with regard to the amounts of mass moved from the ith source item to the jth target item, for all pairs (i, j). With the solution of the moved amounts, the overall effort is the OT distance between the sequences. (2) Su and Hua [4] improved the OT method to consider the positions of the items of both the source and target sequences. The idea behind this method is that the amount of mass moved from a source item to a neighboring target item should be larger than to the other items. To this end, two regularization objectives are imposed on the learning process of the moved amounts. The first one calculates a position similarity between each pair of source and target items and imposes the corresponding moved amount to be large if the similarity is large. The second one first constructs a position distance between each pair of source and target items, converts it to the probability of the positions being nearby, and finally minimizes the Kullback-Leibler (KL) divergence between this probability and the moved amount for each pair. (3) Su and Wu [7] developed a novel ground distance metric learning algorithm by first combining a sequence with its label to form a metasequence and then learning the ground distance to compare the sequence to the metasequence. A linear transformation function is designed to map the sequence to a new space, where the ground distances between the sequence items are calculated.
With the ground distances of the item pairs, the OT method is applied to compare the sequence to the metasequence. The linear transformation parameters and the transport amounts are learned jointly in a minimization problem over a training set of sequences. (4) Su et al. [5] designed a novel sequence representation and similarity learning method by applying dimensionality reduction to the feature vectors of the items of sequences. It first maps the features of the items to a low-dimensional space so that the sequence classes are separated as much as possible. The class separability is measured by sequence statistics, and different forms of statistics lead to different dimensionality reduction methods. Two statistics are considered, which are model-based and distance-based. The model-based method explores the dynamical structure of the sequences, while the distance-based one explores the similarity of pairs of sequences.
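The OT comparison reviewed above can be sketched as a small linear program. The sketch below is illustrative, not the implementation of any of the cited papers; the `ot_distance` helper, the toy one-dimensional sequences, and the squared-difference ground distance are all our own assumptions for the example.

```python
import numpy as np
from scipy.optimize import linprog

def ot_distance(D, a, b):
    # Minimize sum_ij T_ij * D_ij subject to the mass constraints
    # T 1 = a (mass moved out of each source item) and
    # T^T 1 = b (mass received by each target item), with T >= 0.
    m, n = D.shape
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0   # row-sum (source) constraints
    for j in range(n):
        A_eq[m + j, j::n] = 1.0            # column-sum (target) constraints
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None))
    return res.fun

# Two toy "sequences" of one-dimensional items with uniform masses;
# the ground distance here is simply the squared difference of the items.
X = np.array([0.0, 1.0])
Y = np.array([0.0, 1.0, 2.0])
D = (X[:, None] - Y[None, :]) ** 2
dist = ot_distance(D, np.full(2, 1 / 2), np.full(3, 1 / 3))  # -> 0.5
```

For one-dimensional items with a squared-difference cost, the optimal plan moves mass monotonically, which is why the toy example above yields 0.5.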

Our Contribution.
It has been shown that OT-based sequence distance comparison is among the most powerful methods for the classification and retrieval of sequence data. The most critical factor of an OT-based method is actually the ground distance measure between the items. Although there are many studies on how to improve OT-based sequence distance learning, most of them focus on learning the optimal parameters of OT with a given ground distance metric. However, the ground distance is a critical component of OT, and its quality directly affects the quality of OT-based distance methods. In this paper, we propose a novel ground distance metric learning method, which employs the cross-attention mechanism [27-31]. To calculate the ground distance between two items from the two sequences, respectively, we first represent each item by paying attention from itself to the neighbors of the other item and then paying attention back. The representation vector of one item is the linear combination of its neighbors weighted by the attention scores. The attention scores are the normalized similarities between each neighboring item and the target item. To learn the parameters, we build a unified learning framework to optimize the attention layers and the OT parameters, which are the transported amounts. The contributions of this work are listed as follows: (1) We propose a novel learning framework to learn the ground distance and the OT parameters jointly. In this framework, the ground distance model is composed of cross-attention layers, and the OT-based sequence distance is parameterized by the transport amounts. The learning framework allows the attention layers and the transport amounts to regularize the learning of each other. To our knowledge, this is the first learning framework to guide the learning of attention layers by OT.
(2) We model the learning framework as a minimization problem and develop an iterative algorithm to solve it. In this algorithm, the attention weight parameters and the transport amounts are updated alternately until the algorithm converges. In each iteration, we optimize the parameters one by one, fixing the others, by solving the corresponding suboptimization problems. (3) We conducted extensive experiments over four benchmark datasets to compare our algorithm against other sequence distance comparison algorithms. The experimental results show the advantage of the attention-based OT algorithm, and we also show the stability of the algorithm with respect to changes of the trade-off parameter and the iteration number.

Paper Organization.
We organize the rest of this paper as follows: in Section 2, we introduce the proposed algorithm for sequence distance comparison; in Section 3, we conduct experiments to compare our algorithm against other popular sequence distance methods and also study the properties of the algorithm; and in Section 4, we give the conclusion of this paper and some future directions of attention-based sequence distance learning.

Problem Setup.
We consider two sequences X = (x1, . . . , xm) and Y = (y1, . . . , yn), where xi ∈ R^dx is the vector of the ith item of X, and yj ∈ R^dy is the vector of the jth item of Y. To calculate the distance between them, we first define an attention-based ground distance metric and then measure the optimal transport distance according to the ground metric.

Cross-Attention-Based Ground Distance.
To calculate the ground distance between the ith item of X, xi, and the jth item of Y, yj, we first explore their neighboring items. For xi, we collect the h items in X before it, xi−h, . . . , xi−1, and the h items after it, xi+1, . . . , xi+h, to form a subsequence around xi, denoted as Ni and called the contextual sequence of xi. Similarly, we take the h items before and after yj from Y as its contextual sequence, Mj = (yj−h, . . . , yj−1, yj+1, . . . , yj+h). In this way, the contextual and temporal information of each item is effectively encoded in the subsequences Ni and Mj. To compare the dissimilarity between xi and yj, we compare the two subsequences by the cross-attention mechanism.
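As a concrete illustration of the contextual subsequences, the sketch below gathers the neighbors of an item; the `context_window` helper and the clipping of the window at the sequence boundaries are our own assumptions (the paper does not specify the boundary handling).

```python
import numpy as np

def context_window(S, i, h):
    # Collect the up-to-h items before and after item i of sequence S,
    # excluding item i itself and clipping at the sequence boundaries.
    lo, hi = max(0, i - h), min(len(S), i + h + 1)
    idx = [t for t in range(lo, hi) if t != i]
    return S[idx]

S = np.arange(10, dtype=float).reshape(10, 1)  # a toy sequence of 10 items
N = context_window(S, 5, 2)                    # items 3, 4, 6, 7
```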
Attention from Items of Ni to yj.
First, to compare the dissimilarity between Ni and yj, we calculate the attention from the items of Ni to each yk ∈ Mj. To estimate the attention weights, we first calculate the affinity between xl ∈ Ni and yk ∈ Mj as s(xl, yk) = f(θ⊤[xl; yk]), where [xl; yk] is the concatenation of the two item vectors, f(·) is a nonlinear activation function, such as the hyperbolic tangent transformation, and θ ∈ R^(dx+dy) is the parameter of the affinity function. The attention weights are obtained by softmax normalization over the items of Ni, αlk = exp(s(xl, yk)) / Σ(l′: xl′ ∈ Ni) exp(s(xl′, yk)). With the attention weights from xl to yk, we calculate a representative vector of Ni with attention to yk, z^i_k = Σ(l: xl ∈ Ni) αlk xl, as the weighted sum of the items xl ∈ Ni.
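The three steps above (affinity, softmax normalization, weighted sum) can be sketched as follows. The tanh affinity over the concatenated vectors matches θ ∈ R^(dx+dy), but the helper name and the random toy dimensions are our own choices for the example.

```python
import numpy as np

def cross_attention_rep(N_i, y_k, theta):
    # Affinity s_l = tanh(theta^T [x_l; y_k]) for each context item x_l,
    # softmax-normalized over the items of N_i, then an attention-weighted
    # sum of the x_l as the representative vector z^i_k.
    pairs = np.concatenate([N_i, np.tile(y_k, (len(N_i), 1))], axis=1)
    s = np.tanh(pairs @ theta)
    w = np.exp(s - s.max())
    w = w / w.sum()            # attention weights, summing to one
    return w @ N_i

rng = np.random.default_rng(0)
N_i = rng.standard_normal((4, 3))   # 4 context items with d_x = 3
y_k = rng.standard_normal(2)        # a target item with d_y = 2
theta = rng.standard_normal(5)      # theta lives in R^(d_x + d_y)
z = cross_attention_rep(N_i, y_k, theta)  # z^i_k has dimension d_x
```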

Attention from z^i_k of Mj to Ni.
For each yk ∈ Mj, the previous step yields a representative vector z^i_k of Ni. We also represent the subsequence Ni by averaging the vectors of its items. Again, we would like to pay attention from each z^i_k to xi. Similarly, we first calculate the affinity between them as g(xi, z^i_k) = f(ϕ⊤[xi; z^i_k]), where ϕ ∈ R^(2dx) is the affinity function parameter. From the affinities between xi and {z^i_k | k: yk ∈ Mj}, we calculate the attention weights from xi to each yk ∈ Mj with a softmax function, βk = exp(g(xi, z^i_k)) / Σ(k′: yk′ ∈ Mj) exp(g(xi, z^i_k′)).

Cross-Attention-Based Ground Distance.
To compare the representative vector z^i_k with yk, we first perform a linear transformation over z^i_k as W⊤z^i_k, with W ∈ R^(dx×dy) as the parameter. This transformation maps z^i_k to the same space as yk. Then, we compare their distance by the squared Euclidean distance, d(z^i_k, yk) = ‖W⊤z^i_k − yk‖₂². The final distance between Ni and Mj is the attention-weighted sum of the distances {d(z^i_k, yk) | k: yk ∈ Mj}. The attention weights βk are calculated in equation (7), and the distance is d(Ni, Mj) = Σ(k: yk ∈ Mj) βk d(z^i_k, yk).

Optimal Transport Distance.
With the ground distance d(Ni, Mj) between each pair of items (xi, yj), xi ∈ X, yj ∈ Y, of the two sequences, we can compute the transport distance. The ground distance between xi and yj is viewed as the effort to move one unit of mass from xi to yj. We define a variable, ηij, to denote the amount of mass moved from xi to yj; then the total effort to move the mass from X to Y is calculated as Σ(i,j) ηij d(Ni, Mj). Moreover, we define an amount of mass for each item xi of X to be moved out, ci. Thus, the constraint on the amounts moved out of xi is Σ(j: yj ∈ Y) ηij = ci. We also define an amount of mass to be received by each item yj of Y, δj, and accordingly Σ(i: xi ∈ X) ηij = δj. (13) The optimal transport distance between X and Y is achieved by solving for the moved amounts to minimize the moving effort under the above constraints: d(X, Y) = min(η ≥ 0) Σ(i,j) ηij d(Ni, Mj), subject to the two constraints above. (14) We rewrite the optimal transport distance in matrix form by defining the ground distance matrix D = [d(Ni, Mj)] ∈ R^(m×n), the transport amount matrix T = [ηij] ∈ R^(m×n), and the mass vectors c = [c1, . . . , cm]⊤ and δ = [δ1, . . . , δn]⊤. Then equation (14) becomes d(X, Y) = min(T ≥ 0) tr(T⊤D), subject to T1n = c and T⊤1m = δ, (16) where tr(·) is the trace of a matrix and 1n is a vector of n ones.
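A minimal sketch of the attention-weighted ground distance above; the helper name, the toy dimensions, and the uniform attention weights are our own illustrative choices.

```python
import numpy as np

def ground_distance(Z, Y_items, W, beta):
    # d(N_i, M_j) = sum_k beta_k * || W^T z^i_k - y_k ||^2: map each
    # representative z^i_k into the space of y_k (rows of Z @ W equal
    # W^T z^i_k), take squared Euclidean distances, and combine them
    # with the attention weights beta.
    diffs = Z @ W - Y_items
    return float(beta @ np.sum(diffs ** 2, axis=1))

rng = np.random.default_rng(1)
Z = rng.standard_normal((3, 4))        # representatives z^i_k, d_x = 4
Y_items = rng.standard_normal((3, 2))  # the items y_k of M_j, d_y = 2
W = rng.standard_normal((4, 2))        # W in R^(d_x x d_y)
beta = np.full(3, 1 / 3)               # uniform attention weights (illustrative)
d_ij = ground_distance(Z, Y_items, W, beta)
```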

Supervised Learning of Attention Parameters and Ground Distance Metric.
In the optimal transport distance measure, we need to learn the parameters of the two attention layers, θ and ϕ, and the parameter of the ground distance, W. To learn these parameters, we have a training set of T triplets of sequences, {(Xt, Y+t, Y−t)}, t = 1, . . . , T. The tth triplet is composed of an anchor sequence, Xt, a must-link sequence, Y+t, and a cannot-link sequence, Y−t. The must-link sequence is supposed to have a short distance to the anchor sequence, while the cannot-link sequence is supposed to have a long distance to the anchor. In our scenario, we impose that the cannot-link sequence has a longer distance to the anchor than the must-link one, with a margin of ε: d(Xt, Y−t) ≥ d(Xt, Y+t) + ε. Accordingly, we define the hinge loss function as ℓt = max(0, d(Xt, Y+t) + ε − d(Xt, Y−t)). The corresponding minimization problem is modeled to learn the parameters: min(W, θ, ϕ) (1/T) Σt ℓt + C(‖W‖₂² + ‖θ‖₂² + ‖ϕ‖₂²). In the objective, the first term is the average of the hinge losses over the training triplets. The second term is the squared ℓ2-norms of the parameters, to reduce the complexity of the model. C is the trade-off parameter.
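The margin constraint and hinge loss can be sketched directly; the function name and the numeric distances below are illustrative.

```python
def triplet_hinge(d_pos, d_neg, eps):
    # Zero loss when the cannot-link distance d_neg exceeds the
    # must-link distance d_pos by at least the margin eps.
    return max(0.0, d_pos + eps - d_neg)

loss_ok = triplet_hinge(1.0, 3.0, 0.5)   # margin satisfied -> 0.0
loss_bad = triplet_hinge(2.0, 2.2, 0.5)  # margin violated by 0.3
```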

Shock and Vibration
where T+t and T−t are the transport amount matrices of the positive and negative pairs of the tth training triplet, and D+t and D−t are the corresponding ground distance matrices. In this optimization problem, the optimization of the transport amounts is coupled with the optimization of the parameters of the attention layers and the ground distance metric. The optimizations of (W, θ, ϕ) and {(T+t, T−t)}, t = 1, . . . , T, are dependent on each other, making the problem difficult to solve directly. Instead of seeking a closed-form solution of equation (21), we propose to solve for the attention and ground distance metric parameters and the optimal transport variables jointly in a unified minimization problem. In this problem, both (W, θ, ϕ) and {(T+t, T−t)}, t = 1, . . . , T, are variables of a joint objective function, and the optimization over both sets of variables is conducted simultaneously. To solve this problem, we use the alternating optimization method. In each iteration of the iterative algorithm, to optimize one parameter, we first fix the other parameters and then solve the suboptimization problem with regard to this parameter. The optimizations of these parameters are introduced as follows.
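The alternation between parameter groups can be illustrated on a toy objective; the two-variable function below is not the paper's objective, only a stand-in that shows the fix-one-solve-the-other pattern.

```python
def alternating_minimize(steps=50):
    # Toy objective f(u, v) = (u - 1)^2 + (v - u)^2, minimized by
    # alternation: fix v and solve for u in closed form, then fix u
    # and solve for v, mirroring the alternation between (W, theta, phi)
    # and the transport amounts.
    u, v = 0.0, 0.0
    for _ in range(steps):
        u = (1.0 + v) / 2.0   # argmin over u with v fixed
        v = u                 # argmin over v with u fixed
    return u, v

u, v = alternating_minimize()  # both approach the joint minimizer u = v = 1
```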

Optimization of W.
By fixing the other parameters and considering only W, we obtain a suboptimization problem with objective o1(W). We substitute equations (9) and (10) into equation (16) and rewrite the optimal transport distance between two sequences X and Y accordingly. Substituting equation (24) into equation (23), we rewrite the objective function as o1(W). The problem of minimizing o1(W) in equation (23) has a closed-form solution, which is obtained by setting the derivative of o1(W) with regard to W to zero.
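As an illustration of such a gradient-zeroing closed-form update, consider the analogous ridge-regression-style problem min_W ‖WA − B‖²_F + C‖W‖²_F, whose solution is W = BA⊤(AA⊤ + CI)⁻¹. This is only structurally similar to o1(W); the matrices A and B below are purely illustrative, not the paper's quantities.

```python
import numpy as np

def closed_form_W(A, B, C):
    # Setting the gradient of ||W A - B||_F^2 + C ||W||_F^2 to zero
    # gives W (A A^T + C I) = B A^T, solved here by matrix inversion.
    d = A.shape[0]
    return B @ A.T @ np.linalg.inv(A @ A.T + C * np.eye(d))

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 10))
B = rng.standard_normal((2, 10))
W = closed_form_W(A, B, 0.1)
# W satisfies the zero-gradient condition W (A A^T + C I) = B A^T.
```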

Optimization of θ.
We optimize the attention parameter of equation (2), θ. Fixing the other parameters and removing the irrelevant terms from the objective function, we obtain a suboptimization problem for θ with objective o2(θ). To solve this problem, we use the gradient descent algorithm, θ ← θ − υ∇θo2(θ), where υ is the descent step size and ∇θo2(θ) is the gradient function. To this end, we calculate the gradient of o2(θ) with regard to θ by the chain rule, which requires ∇θd(θ; Ni, Mj), the gradient of the ground distance between Ni and Mj with regard to θ. We substitute equation (9) into equation (10), rewrite the relevant variables as functions of θ, and then derive the gradients of these constituent functions of θ from the definitions of the affinity and softmax functions.
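The update rule is plain gradient descent; the sketch below applies it to a toy objective with a known gradient (the quadratic and its minimizer t_star are our own illustrative choices, since the full gradient of o2(θ) depends on the omitted equations).

```python
import numpy as np

def gradient_descent(grad, theta0, step=0.1, iters=200):
    # theta <- theta - step * grad(theta), the same update scheme
    # used for the attention parameters theta and phi.
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        theta = theta - step * grad(theta)
    return theta

# Toy objective o(theta) = ||theta - t_star||^2, gradient 2 (theta - t_star).
t_star = np.array([1.0, -2.0])
theta = gradient_descent(lambda th: 2.0 * (th - t_star), np.zeros(2))
```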

Optimization of ϕ.
To optimize ϕ, we obtain a suboptimization problem with objective o3(ϕ). Again, we use the gradient descent algorithm to update ϕ as ϕ ← ϕ − υ∇ϕo3(ϕ), where ∇ϕo3(ϕ) is the gradient of o3(ϕ) with regard to ϕ, obtained by the chain rule.

Experiments.
In this section, we conduct experiments over four benchmark datasets to (1) compare our method with other sequence similarity/distance methods, (2) study the impact of the trade-off parameter C, and (3) study the convergence of the iterative algorithm.

Datasets.
In our experiments, we used four benchmark datasets of sequences: (1) Spoken Arabic Digits (SAD). This dataset has 8,800 sequences [32]. Each sequence is a series of speech frames of a wave of a spoken Arabic digit. The vector of each item is the 13-dimensional Mel-frequency cepstrum coefficients feature vector. These sequences belong to 10 classes, and each class is a digit. Each class has 880 sequences. The number of items in each sequence ranges from 4 to 93.

(2) NTU RGB + D (NTU).
This dataset has 56,880 sequences [33]. Each sequence is a Kinect video, and each item is a frame of the video. The sequences belong to 60 action classes. The feature vector of each item is constructed by combining the joint locations and the skeleton-based frame-wise features.
(3) Rice Blast Sequence (RBS). This dataset has 66,153 protein sequences of rice genome proteins, collected from the MSU Rice Genome Database [34]. Each sequence is a sequence of amino acids, and each amino acid is represented by an amino acid embedding. The embedding vectors are also learned as parameters of the model. The sequences are tagged by whether they are associated with rice blast disease or not. (4) Australian Sign Language (ASL) Signs. This dataset is composed of 2,565 sequences of sign language signs [32]. The sequences are from 95 classes, and each class has 27 sequences. Each item of a sequence is represented by a 22-dimensional feature vector.
A summary of the statistics of the benchmark datasets is listed in Table 1.

Experimental Setting
(1) Training. To measure the quality of a distance/similarity measure of sequences, we perform nearest neighbor classification over the sequence data. Given a dataset of sequences with their class labels, we first split the entire dataset by a 10-fold cross-validation protocol. Each fold is used as a test set in turn, while the other folds are used as training folds.
Within the training set, we use each sequence as an anchor sequence, randomly pick another sequence of the same class as its must-link sequence, and meanwhile pick a sequence of a different class as its cannot-link sequence. In this way, we construct the training set of triplets of sequences. The model parameters are trained on the training set and then tested over the test set.
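The triplet construction above can be sketched as follows; the helper name and the tiny label list are illustrative.

```python
import random

def sample_triplets(labels, seed=0):
    # For each anchor index, pick a random same-class index (must-link)
    # and a random different-class index (cannot-link).
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    triplets = []
    for a, lab in enumerate(labels):
        same = [i for i in by_class[lab] if i != a]
        diff = [i for i in range(len(labels)) if labels[i] != lab]
        if same and diff:
            triplets.append((a, rng.choice(same), rng.choice(diff)))
    return triplets

labels = [0, 0, 1, 1]
triplets = sample_triplets(labels)  # one triplet per anchor sequence
```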
(2) Testing. With the trained sequence distance metric, we calculate the distance between each test sequence and each training sequence. The class label of the training sequence with the shortest distance to a test sequence is assigned to it as the classification result.
The accuracy over the test sequences is calculated as the performance measure. The accuracy rate is the percentage of correctly classified test sequences over the total number of test sequences.
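A minimal sketch of this evaluation step; the distance matrix and labels below are made up for the example.

```python
import numpy as np

def nn_accuracy(dist_matrix, train_labels, test_labels):
    # Each test sequence receives the label of the training sequence with
    # the smallest distance; return the fraction classified correctly.
    pred = np.asarray(train_labels)[np.argmin(dist_matrix, axis=1)]
    return float(np.mean(pred == np.asarray(test_labels)))

D = np.array([[0.1, 0.9, 0.8],
              [0.7, 0.2, 0.6]])          # 2 test x 3 training distances
acc = nn_accuracy(D, [0, 1, 1], [0, 0])  # -> 0.5 (second test misclassified)
```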

Comparison to Other Methods.
We compare the proposed AGD algorithm against the most popular sequence distance learning methods, including optimal transport (OT) [23], Order-Preserving Optimal Transport (OPOT) [4], Regressive Virtual Sequence Metric Learning (RVSML) [7], and Linear Sequence Discriminant Analysis (LSDA) [5]. The accuracy is reported in Figure 1. From this figure, we can observe that on all the benchmark datasets, the proposed AGD method always has the best performance.
The differences between AGD and the other methods vary across datasets. For example, on the NTU dataset, AGD has much better accuracy than the others, while on the RBS dataset, it is only slightly better than the second-best method, LSDA. The main factor behind this phenomenon is the power of the attention mechanism, which embeds each item with its attention to the neighboring items from both the source and target sequences. In most cases, LSDA is the second-best method, while the original OT method is the worst.
(Algorithm 1) In each iteration: (1) update the transport amounts according to equation (19) for each training triplet; (2) update W according to equation (28).

Sensitivity to Trade-Off Parameter.
In the objective function of our method, there is only one trade-off parameter, C. It controls the importance of the regularization term. We perform experiments with varying values of C, and the results are shown in Figure 2. From the curves in the figure, we can see that the proposed AGD algorithm is stable to changes of the trade-off parameter in most cases. The only exception is the result on the NTU dataset, but even there, the change of accuracy over the range of C values is acceptable. The overall conclusion is that AGD is not sensitive to C; thus, tuning C is easy for users. One more observation is that as the value of C increases, the accuracy improves slightly.
This also verifies that the regularization term is beneficial to the model.

Convergence Study.
Since our algorithm is iterative, we are also interested in its convergence. Thus, we plot the curves of accuracy versus the number of iterations, given in Figure 3. From this figure, we can see that as the iteration number increases, the accuracy keeps improving until convergence. The number of iterations needed for convergence is around 50. The convergence of the algorithm is thus experimentally verified, and for datasets of sizes comparable to our benchmarks, this number of iterations is acceptable.
We test the significance of the convergence of the accuracy by the ratio test, and the r values are reported in Table 2. All of these r values are smaller than 1, meaning all the curves have significantly converged.

Running Time.
We also compare the running time of the proposed method. The running times over the four benchmark datasets are shown in Figure 4. From this figure, we draw the following conclusions: (1) Running time and data size are positively correlated. The largest dataset has the longest running time, while the smaller ones have shorter running times. This is natural, since both the training and test processes scan the data points one by one, and more data points mean more scanning time.
(2) Our algorithm is faster than the LSDA and RVSML algorithms, while it is slower than the OPOT and OT algorithms. This is acceptable given the significant improvement in accuracy.

Conclusion
In this paper, we proposed a novel sequence distance measuring algorithm. This algorithm is based on OT, but its main focus is how to learn an effective ground distance measure for the two sequences. The ground distance learning falls into the framework of the cross-attention mechanism, and the attention layer parameters and the OT parameters are learned jointly. This design can use the OT to guide the learning of the attention layers.
Thus, this framework can provide the representations of the two sequences, the ground distance, and the OT simultaneously to optimize the model. The learning is also guided by the supervision of the must-link and cannot-link triplets of sequences. The parameters are optimized by an iterative algorithm, and the algorithm is tested over four sequence datasets. The experimental results show its advantage over the other sequence comparison algorithms.

Data Availability
All the datasets used in this paper to produce the experimental results are publicly accessible online.