Low-Complexity Scalable Architectures for Parallel Computation of Similarity Measures

Processor array architectures have been employed, as an accelerator, to compute similarity distance found in a variety of data mining algorithms. However, most of the proposed architectures in the existing literature are designed in an ad hoc manner without taking into consideration the size and dimensionality of the datasets. Furthermore, data dependencies have not been analyzed, and often, only one design choice is considered for the scheduling and mapping of computational tasks. In this work, we present a systematic methodology to design scalable and area-eﬃcient linear (1-D) processor arrays for the computation of similarity distance matrices. Six possible design options are obtained and analyzed in terms of area and time complexities. The obtained architectures provide us with the ﬂexibility to choose the one that meets hardware constraints for a speciﬁc problem size. Comparisons with the previously reported architectures demonstrate that one of the proposed architectures achieves less area and area-delay product besides its scalability to high-dimensional data.


Introduction
e computational complexity of machine learning and data mining algorithms, that are frequently used in today's embedded applications such as handwritten analysis, fingerprint/iris/signature verification, and face recognition, makes the design of efficient hardware architectures for these algorithms a challenge. e computation of similarity distance matrices is one of the computational kernels that is required by several machine learning and data mining algorithms to measure the degree of similarity between data samples [1]. For several algorithms such as K-means [2], SVM [3], and KNN [4], distance calculation is a computationally intensive task that accounts for a significant portion of the overall processing time [5].
Given the complexity of today's applications, machine learning and data mining algorithms are expected to handle big and high-dimensional data. In [6], an optimized FPGA implementation of the K-means clustering algorithm has been presented. e authors reported that the maximum number of features that could fit on Stratix V A7 FPGA is around 160. Even partitioning the computation and caching partial results in local memory to accommodate larger sizes was not efficient due to excessive global memory transactions. Most of the existing hardware architectures for similarity distance computation have not taken into consideration the size and/or dimensionality of the datasets. eoretical time and area complexities for some architectures, including the ones presented in [7][8][9], have not been validated experimentally. Implementation results reported for other architectures, including [10,11], are for low-dimensional datasets of dimensions 4 and 9, respectively. Despite the fact that these architectures have low theoretical time complexities, their poor scalability to high-dimensional data makes them not suitable for hardware implementation or implementable with poor performance, as discussed in [6].
In our recent work [12], we have systematically explored the design space of 2-D processor array architectures for similarity distance computation. Using the employed methodology, we were able to obtain the same architectures, as proposed in [7,8] and also to identify an additional four architectures with improved area and time complexities. Furthermore, the obtained architectures have been classified into two groups based on the size and dimensionality of input datasets. 2-D processor arrays are generally faster than 1-D (linear) processor arrays as more processing elements (PEs) are used to perform the computation in parallel. On the contrary, linear arrays are more suitable for resource-constrained applications with limited area and I/O bandwidth, typically found in embedded applications. In this work, we present a systematic technique to explore the design space of linear processor arrays for the computation of similarity distance matrices in order to obtain additional design options for area and bandwidth efficiency optimization, which is desirable in the embedded system design.
In summary, the key contributions of this paper are as follows: (i) We present an algebraic technique to design scalable low-complexity linear processor arrays for the computation of similarity distance matrices based on an algebraic analysis of data dependencies. Compared to the classical approach of analyzing data dependencies that relies on studying how output variables depend on inputs, the employed technique relies on defining a computational domain using algorithm indices and studying how input and output variables depend on these indices. (ii) We propose six scheduling functions using computational geometry and matrix algebra. In addition to the minimum restrictions we used in [12] to obtain valid scheduling vectors, more time restrictions are introduced in this work to meet area and bandwidth constraints. Associated projection matrices for the obtained scheduling vectors are also introduced, to map points in the 3-D computation domain to PEs in the projected 1-D processor arrays. (iii) We perform full design space exploration using the proposed scheduling vectors and their associated projection matrices. Six design options are obtained, analyzed in terms of area, speed, and bandwidth efficiency, and compared analytically and experimentally with existing architectures in the literature. e rest of this paper is organized as follows: related work is presented in the next section. e similarity distance computation problem is formulated in Section 3. In Section 4, the systematic technique used to parallelize distance computation is introduced. In Section 5, a systematic design space exploration is performed to obtain the proposed architectures. Design comparison and implementation results are presented in Section 6 and Section 7, respectively. Finally, Section 8 concludes the paper.

Related Work
Several processor array architectures have been proposed for accelerating the computation of similarity distance. In [7], a distance calculation unit for a VLSI cluster analysis architecture has been proposed as a K × N 2-D processor array to calculate similarity distances between N samples of an input dataset and K cluster centroids. For datasets with large number of samples N, the proposed architecture is not feasible for hardware implementation as it consists of a large number of processing elements (PEs) with numerous input features being fed simultaneously. e authors of [8] proposed a K × M 2-D processor array for the calculation of similarity distances between samples of an M-dimensional dataset and K cluster centroids. For high-dimensional datasets with large number of features M per sample, the proposed architecture is not feasible for hardware implementation due to chip constraints in I/O bandwidth and number of pins.
Compared to 2-D processor arrays, linear arrays are generally more area-efficient with less bandwidth and energy demands. In [9], a linear processor array for the computation of similarity distance has been proposed. e proposed architecture is used to calculate similarity distances between data samples of an input dataset and clusters centroids in a VLSI clustering analyzer. Input data samples are fed in a feature-serial format. e proposed linear array has higher time complexity than 2-D processor arrays. However, both area complexity and number of I/O pins have been reduced. Another linear array for the computation of similarity measures has been proposed in [13]. e proposed architecture is used to calculate a special case of the similarity distance matrix that is required by some machine learning algorithms, in which pairwise distances among all samples of a dataset are calculated.
In [10], a distance calculation unit has been proposed to calculate similarity distances between data samples and cluster centroids in a hardware implementation of the K-means clustering algorithm. e proposed design calculates K distances between a data sample of M features and K cluster centroids concurrently using K adder trees of M − 1 adders each. A similar architecture with pipelined adder trees has been presented in [11] to minimize the critical path delay and improve the throughput.

Similarity Distance Computation
Given dataset X of N samples and dataset Y of K samples with each sample in the two datasets having M features. A similarity measure such as Manhattan, Euclidean, or Cosine distance [1,13] can be used to generate a distance matrix D of K × N elements. e distance between the n th sample of dataset X and the k th sample of dataset Y is represented by the value of element D(k, n) of matrix D. In this work, the calculation of the similarity distance matrix, using Manhattan distance between data samples of the two datasets X and Y, is used to illustrate the introduced concepts and methodologies. Manhattan distance can be expressed as follows: where N and K are the number of samples of datasets X and Y, respectively, and M is the dimensionality (number of features) of the two datasets. e emphasis of this paper is on the parallelization of similarity distance computation rather than the similarity measure used. Hence, the work presented in this paper can be generalized to other similarity measures.
Similarity distance computation in the K-means clustering algorithm [2], for instance, is performed in the same way as described in this section. Distances between N samples of dataset X and the set of centroids of K clusters Y are calculated in order to identify the closest cluster for each data sample.

Parallelizing the Computation of Similarity Distance
In our recent work [12], we have systematically explored the design space of 2-D processor array architectures for similarity distance computation using the methodology proposed by Gebali for designing digital filters systolic arrays [14]. In this work, we focus on extending the methodology to explore the design space of area-efficient linear processor arrays for the computation of similarity distance matrices. For more details on the employed methodology, refer [15]. Figure 1, the computation domain of Manhattan distance (1) is defined by the algorithm indices k, m, and n. Every point in the computation domain has three coordinates, represented as follows:

Data Dependencies.
In the traditional approach, data dependencies are analyzed in dependence graphs, by showing how output variables depend on input variables. In this work, however, data dependencies are analyzed using dependence matrices that show how input and output variables depend on indices k, m, and n, as discussed in our work on 2-D processor array architectures [12]. From (1), output variable D depends on indices k and n of the algorithm. Hence, its dependence matrix is given by the 2 × 3 integer matrix: e three elements in each row of the dependence matrix refer to the ordered algorithm indices k, m, and n. e first row shows that variable D depends on index k, and the second row shows that the variable depends on index n. From (1) also, the dependence matrices of input variables X(m, n) and Y(k, m) are given by the following equation: e associated null space basis vectors of these dependence matrices could be given by the following equation [12]: e employed methodology, described in the following subsections, relies on these vectors to obtain valid scheduling and projection vectors to explore the design space of linear processor arrays for the computation of similarity distance matrices.

Data Scheduling.
A scheduling function determines the computation load to be executed at each time step by assigning each point in the computation domain a time value. All tasks assigned the same time value will be executed in parallel. Broadcasting an algorithm variable results in assigning all points in its broadcast subdomain the same time value. Points in the subdomain of a pipelined variable, on the other hand, are assigned different time values. When an input variable is broadcast, a copy of each data element is available to all PEs through a global broadcast bus, while pipelined input variables are stored by each PE and passed to its neighbor through a local link in the next clock cycle. Broadcasting an output variable results in performing all computations on partial results from all PEs in the same clock cycle. For a pipelined output variable, the partial result that is generated by each PE is accumulated and passed to the next PE until the final result is accumulated by the last PE. One simple scheduling function that is used to schedule computation tasks is the linear scheduling function [15]: where t(p) is the time value assigned to a point p in the computation domain and s � s 1 s 2 s 3 is the scheduling vector. To broadcast an algorithm variable whose null vector is e, we must have Scientific Programming se � 0, (9) and to pipeline this variable, we must have Conditions in (9) and (10) are the minimum constraints that can be used to get a valid scheduling function.
Our strategy for arriving at suitable scheduling functions combines pipelining and broadcast restrictions in (9) and (10). We start by choosing to pipeline the evaluation of all points that lie in a plane perpendicular to one of the three k-, m-, and n-axes. Next, we pipeline the evaluation of all points that lie on lines in the chosen plane. ese lines are parallel to one of the remaining two axes in that plane. Finally, we broadcast the evaluation of all points in the chosen line. In total, we have three axes to choose the planes and two directions to choose the lines in the planes. is gives rise to six possible scheduling functions. Subsection 4.3.1 illustrates how this technique is used to derive our first scheduling vector s 1 .

Calculation of the First Scheduling
Vector s 1 . Let us choose to broadcast input variable X. From (5) and (9), we have which implies s 1 � 0.
To avoid feeding large number of features simultaneously, we choose to supply input variable X in a featureserial format (i.e., along the m-axis). is implies that, for any data sample, the time between the calculations for feature m and feature m + 1 is a one-time step: which implies s 2 � 1. We choose to start the first calculation for sample n + 1 after the last calculation for sample n.
e time between these two calculations is also a one-time step: which implies s 3 � M. Hence, the first valid scheduling vector is given by the following equation: In accordance with Figure 2, the calculated scheduling vector s 1 results in assigning all points on each of the continuous lines the same time value. ese lines are called equitemporal zones since the computations for all points on each line are performed simultaneously [15]. From the geometric perspective, scheduling vector s 1 results in executing all points in a plane with a fixed value of coordinate n before points in the plane with the next value of n. Within each plane, all points on a line with a fixed value of coordinate m are executed before points on the line with the next value of m. Points on each of these lines are executed in parallel.

Calculation of the Remaining Scheduling Vectors.
e remaining five scheduling vectors can be calculated using the same procedure employed to calculate s 1 with different orders of execution along the three axes. In Figure 3, the equitemporal zones are also lines along the k-axis with another order of execution. e associated scheduling vector is given by the following equation: Other two timing alternatives with equitemporal zones along the m-axis are shown in Figures 4 and 5. e associated scheduling vectors for these timing alternatives are given by the following equation: Figures 6 and 7 show two timing alternatives with equitemporal zones along the n-axis.
e associated scheduling vectors are given as follows: A projection matrix P that can be used to perform the projection operation can be obtained using a set of l � (n − k) projection direction vectors d i that belong to the null space of the projection matrix and satisfy the following condition as illustrated in the following equation [14]: where s is the chosen scheduling vector. In this work, our goal is to map the points in the 3-D computation domain shown in Figure 1 to a 1-D domain. Hence, two projection direction vectors have to be specified for each of the six scheduling vectors presented in the previous section. For the scheduling vector s 1 � 0 1 M and according to (18), two possible projection directions could be given as follows:  Scientific Programming ese projection directions are then used to calculate the associated projection matrix according to the procedure described in [14]: Using projection directions d 11 and d 12 results in eliminating the m-and n-axes, respectively, and ensures that the projected processor array will be a linear array of K PEs along the k-axis. Table 1 shows the chosen projection directions and the associated project matrices for the six obtained scheduling vectors.

Design Space Exploration
In this section, we explore the design space of linear processor arrays for similarity distance computation using the derived scheduling vectors and projection matrices in Table 1.

Design #1
: Using s 1 � 0 1 M and P 1 � 1 0 0 . In this design option, each point p � k m n t ∈ D is assigned a time value using scheduling function (8): e projection matrix P 1 maps any point p in the computation domain to the point: which implies that the resulting processor array is a linear array along the k-axis with K PEs. All points in the computation domain with the same k coordinate are mapped to the same point, or PE, in the projected computation domain. e input variable X is broadcast, and the broadcast direction is mapped to the vector: which implies that input data are fed using broadcast lines along the k-axis in the projected architecture. Input variable Y is pipelined since the pipeline condition in (10) is satisfied. e pipeline direction is mapped to the vector: which implies that input Y is localized in the projected architecture. Hence, the k th PE uses only the M features of the k th sample of input matrix Y. Output variable D is also pipelined, and the pipeline direction is mapped to the vector: which implies that output D is also localized. For every M cycles, each PE generates the distance D(k, n) between the n th sample of dataset X and the k th sample of dataset Y. K distances are calculated in parallel by the K PEs. Hence, the total number of time steps is MN steps or clock cycles. e time complexity of the proposed architecture can also be determined by calculating the time value assigned by the scheduling function in (21) to the point with maximum values of coordinates k, m, and n: Since the first time value is zero, the total number of time steps is t(p max ) + 1 � MN steps. e resulting processor array and the structure of each PE are shown in Figures 8 and  9, respectively. e remaining five processor array Scientific Programming architectures shown in the following subsections are obtained using the same procedure as for Design #1.

Design #2
: Using s 2 0 N 1 and P 2 1 0 0 . e projection matrix is the same as that of Design #1. Hence, all points in the computation domain are mapped to a linear processor array of K PEs similar to that in Figure 8 with variable X being broadcast while variables Y and D are localized. e chosen scheduling vector results in assigning each point in the computation domain the time value: e total number of time steps is also MN steps. However, the scheduling vector imposes a di erent order of execution. As shown in Figure 3, N computations for feature m of all data samples are performed before the N computations for feature m + 1. Hence, N registers are required by each PE to store the intermediate results compared to only one register in Design #1. 3 1 0 K and P 3 0 1 0 . e scheduling function for this design alternative is

Design #3: Using s
Accordingly, the total number of time steps is KN steps. e projection matrix P 3 maps any point p in the computation domain to the point: which implies that the resulting processor array is a linear array along the m-axis with M PEs. Both input variables X and Y are localized, and output D is broadcast with its broadcast direction mapped to a line along the m-axis.
Broadcasting an output variable requires that partial results from all PEs are used concurrently to generate one data element of the output matrix in every clock cycle, as shown in Figure 10. e PE structure is shown in Figure 11. Compared to the PE of Design #1 in Figure 9, no registers are required to store partial results since distance calculation is performed within a single clock cycle.

Design #4: Using s 4
N 0 1 and P 4 0 1 0 . e scheduling function for this design choice is e time complexity for this design is equivalent to that of Design #3 that is KN time steps. e processor array architecture and the PE structure are the same as in Figures 10  and 11, respectively. e main di erence between the two designs is in the order of execution that results in generating elements of the output matrix D in a di erent order.
e total number of time steps is KM steps. e projection matrix P 5 maps any point p in the computation domain to the point: which implies that the resulting processor array is a linear array along the n-axis with N PEs. Both input variable X and output variable D are localized. e input variable Y is broadcast with its broadcast direction mapped to a line along the n-axis. For every M cycles, each PE generates the distance D(k, n) between the n th sample of dataset X and the k th sample of dataset Y. N distances are calculated in parallel by the N PEs. e processor array architecture is shown in Figure 12 with the PE structure being the same as that of Design #1 in Figure 9.
5.6. Design #6: Using s 6 1 K 0 and P 6 0 0 1 . e scheduling function for this design option is e processor array architecture and its time complexity are the same as of Design #5. However, the order of execution that is imposed by the chosen scheduling vector, shown in Figure 7, dictates that K registers are required by each PE to store the intermediate results compared to only one register in Design #5.

Design Comparison
e systematic approach adopted in this paper facilitates design space exploration of linear processor arrays for the computation of similarity distance matrices. e obtained architectures provide us with the exibility to choose the one that meets hardware constraints for speci c values of system parameters K, M, and N.
Design #1 and Design #2 are suitable for parallelizing distance calculation tasks involved in processing dataset X of large size N compared to K that represents the size of dataset Y. Distance calculation that is required for clustering data samples of a large-scale dataset X using the K-means clustering algorithm, for example, ts these design options since the size of dataset N and the number of features M are generally much larger than the number of clusters K. Compared to Design #1, Design #2 is not practical since it has the same time complexity but with a large number of additional registers to store intermediate results.
e systematic methodology adopted in this work can be used to obtain the previously devised architecture in [9]. is architecture is similar to Design #1 with input X being pipelined rather than broadcast. Scheduling vector s 1 can be modi ed to re ect this change by applying the pipeline restriction in (10) instead of the broadcast restriction in (9). e modi ed scheduling vector is as follows: processor array architecture is shown in Figure 13. A total of K(K − 1)/2 delay registers are required to feed features of the input dataset Y. e structure of PE is the same as of Design #1 shown in Figure 9 with one more register for the pipelined input X.
Design #3 and Design #4 are similar to the distance calculation unit proposed in [10]. e main drawback of these three architectures is their slow clock rate when used for processing high-dimensional data since M partial results have to be added within a single clock cycle. Clock speed of these architectures can be improved by pipelining the adder trees, with each level representing a pipeline stage and a total of extra K(M − 1) pipeline registers, as proposed in [11]. Large-scale datasets are usually stored in o -chip or on-chip RAMs with limited bandwidth and number of data ports. Hence, architectures in [10,11] are not practical for highdimensional data with large number of features M as they require the feeding of a large number of features simultaneously.
ese two architectures are implemented as K instances of Design #3. However, the proposed architecture is more scalable to high-dimensional data, even if multiple instances of it are used, since only one feature is fed at a time. Another drawback of architectures in [10,11] is that all features of each cluster centroid are read simultaneously. Hence, cluster centroids cannot be stored in on-chip RAMs with limited data ports and have to be stored in register banks. Our proposed architectures, on the contrary, are more exible as the designer can choose to store cluster centroids in either on-chip RAMs or register banks based on data dimensionality and hardware constraints.
is exibility is critical for implementing hardware architectures for the K-means clustering algorithm since cluster centroids have to be updated at the end of each iteration of the algorithm.
Design #5 is another option that is similar to Design #1. e main di erences between the two architectures are in the choice of broadcasting or localizing input variables X and Y and the number of PEs. Design #5 is not amenable for hardware implementation when the value of N is very large since it results in a huge number of PEs. However, this design option is suitable for processing high-dimensional, low sample size (HDLSS) datasets [16]. One example of these datasets is the gene expression microarray datasets. ese datasets typically have a small number of samples N and a large number of genes that represent the features [17]. e time complexity for Design #6 is the same as that of Design #5 with extra K × N registers to store intermediate results. Table 2 summarizes circuit and time complexities of the six proposed processor array architectures obtained in this work and the three previous architectures in [9][10][11]. Critical path delays are also presented, with T a referring to the delay of a w-bit adder to model the delay of subtraction, absolute value, and addition operations, where w is the data width.

Implementation Results
As discussed in the previous section, Design #1 is the best design for calculating similarity distances in the K-means clustering algorithm for large-scale datasets with size N being much larger than the number of clusters K. For a fair comparison, Design #1 and previous architectures in [9][10][11] are implemented on the same FPGA device to accelerate distance computation involved in clustering 4,096 samples of an image dataset (Bridge) [18], with each sample consisting of 16 numerical features of the integer data type. e four architectures are implemented in Verilog hardware description language with Xilinx ISE Design Suite 13.4 targeting Xilinx Virtex7 XC7VX330T. Table 3 shows implementation results for distance calculation involved in one iteration of the K-means clustering algorithm with N 4,096 samples, M 16 features, and number of clusters K 64.
Implementation results show that Design #1 outperforms Design of [9] in terms of area and speed with 25% decrease in area and 51% decrease in the area-delay product (ADP). Design of [9] occupies more slices due to the delay registers used to feed features of input dataset Y. Execution time is determined by the number of clock cycles required to calculate all elements of distance matrix D and the clock rate. Design of [9] requires K − 1 additional clock cycles and has a slower clock speed. Although the clock speed for Design #1 is a ected by the delay of long broadcast bus used to feed features of dataset X, it still attains a higher clock rate due to the higher clock skew for Design of [9], as inspected by the Xilinx tool. e e ect of clock skew and long broadcast buses can be minimized by using clock distribution networks and bu er insertion for Design of [9] and Design #1, respectively, at the cost of more area and power consumption.
Based on time complexities of the four designs in Table 2, as expected, Designs of [10,11] require less time to complete the computation of all elements of distance matrix D. On the contrary, Design #1 achieves 58% and 14% decrease in ADP compared to Designs of [10,11], respectively, using only 6% of their area.

Conclusion
e systematic technique presented in this work is used to explore the design space of scalable and area-e cient processor arrays for the computation of similarity distance matrices. Six new processor arrays, in addition to a previously devised one, are obtained systematically. Implementation results for previous architectures and one of our proposed architectures show that the proposed architecture achieves the best compromise between area and speed with an average decrease of 71% in area and 41% in ADP. e proposed architecture is more scalable to highdimensional data as it requires the feeding of only one feature at time.
Scheduling and projection operations introduced in this work allow for control on time and area complexities of the proposed architectures. We intend to analyze the proposed architectures in terms of power e ciency and extend the proposed methodology to design power-e cient architectures that are critical for embedded applications.

Data Availability
Data sources in this work are from the public clustering datasets available at http://cs.uef. /sipu/datasets.

Conflicts of Interest
e authors declare that they have no con icts of interest.    Figure 13: Processor array architecture for Design of [9] with K 4.