FCM Clustering Approach Optimization Using Parallel High-Speed Intel FPGA Technology

Fuzzy C-Means (FCM) is a widely used clustering algorithm that performs well in various scientific applications. Implementing FCM involves a massive number of computations, and many parallelization techniques based on GPUs and multicore systems have been suggested. In this study, we present a method for optimizing the FCM algorithm for high-speed field-programmable gate array (FPGA) technology using a high-level C-like programming language called Open Computing Language (OpenCL). The method was designed to enable the high-level compiler/synthesis tool to exploit a task-parallelism model and create an efficient design. Our experimental results (based on several datasets) show that the proposed method makes the FCM execution time more than 186 times faster than the conventional design running on a single-core CPU platform. Also, its processing power reached 89 giga floating-point operations per second (GFLOPs).


Introduction
Clustering is a topic of great interest in machine learning fields dealing with the process of partitioning sets of data into homogeneous groups or clusters based on the similarities between data points. Clustering techniques are useful because they allow the exploration of labeled and unlabeled data to find similarities and assign observations to corresponding clusters. The distance measure between elements in a dataset is commonly used, with points that are close to each other assigned to the same cluster. Clustering algorithms are widely used in many fields such as medical applications [1], computer vision [2], data segmentation [3], marketing [4], networking [5], and security [6]. Some of the most commonly used clustering algorithms are K-means clustering [7], fuzzy clustering [8], hierarchical clustering [9], two-step clustering [10], and k-harmonic clustering [11]. While the overall goals of different clustering algorithms tend to be similar, diverse starting points and rules usually result in diverse taxonomies of clustering algorithms [12]. There are various kinds of clustering methods in which objects can be arranged into distinct groups based on a set of strategies and rules. In hierarchical clustering, the data objects are organized into a tree of clusters using top-down or bottom-up approaches by recursively splitting or combining clusters. While it is easy to generate the tree without determining the number of clusters in advance, most hierarchical algorithms have high time complexity (up to $O(n^3)$), and once a cluster merge or split is done, it cannot be undone [13,14]. Centroid clustering algorithms are based on the distance calculation between the dataset objects and the proposed centroids; generally, these approaches give reasonable performance, but they are very sensitive to noise and outliers [15].
Density-based clustering approaches group objects into separate regions based on their density; they are less sensitive to noise and outliers, but they do not work easily with large datasets. In grid-based clustering, the data space of the objects is mapped to a grid space with a limited number of cells, and clustering is performed on the cells; this makes the model fast, but it has difficulty finding clusters in a low-dimensional subspace [16,17]. Model-based algorithms use statistical or neural learning strategies to choose a certain model for each cluster; they provide flexible implementation but with relatively low prediction quality [16].
Several studies provided comparative analyses between many clustering algorithms by considering various parameters [18][19][20][21]. In [20], Abaas introduced a comparative analysis in which the most popular clustering approaches are compared in terms of the ability to handle large datasets, the type of dataset, and the number of clusters to classify. Results indicated that these algorithms vary in their performance and their degree of classification accuracy. In [19], the authors presented a performance comparison study between nine of the most well-known clustering approaches to provide guidance on how to choose the clustering approach based on dataset characteristics, where the data have a normal distribution. In [21], the authors investigated how the choice of clustering algorithm employed within pattern recognition applications affects the overall recognition accuracy and clustering quality, the validity of the clustering approach, and the execution time of the clustering method; the results showed that the choice of clustering algorithm is essential when the model size is small. In general, there are many challenges associated with most clustering approaches, such as the noise problem and the initialization of several parameters [22]; however, the FCM clustering approach is among the popular clustering approaches that have reasonable performance and accuracy [23]. The K-means clustering algorithm is probably the most common procedure, and several research studies have proposed different techniques for enhancing the traditional algorithm's performance [24][25][26][27][28][29]. Given a dataset X with n points, each of which is a d-dimensional vector, the K-means algorithm divides these points into k clusters via several steps. In the first step, the initial centers are chosen randomly.
In the following steps, each point in the dataset is assigned to a single cluster (hard clustering) according to which center it is closest to. New centers are then determined for each cluster. These steps are repeated r times, and the overall complexity of this algorithm is O(nkdr).
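The K-means steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the study; the function names `squared_distance` and `kmeans` and the `seed` parameter are hypothetical. The nested loops over r repetitions, n points, k centers, and d dimensions reflect the O(nkdr) cost.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two d-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, r, seed=0):
    """Run r rounds of hard assignment followed by a center update."""
    rng = random.Random(seed)
    # Step 1: choose the initial centers randomly from the data points.
    centers = [list(c) for c in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(r):
        # Assign every point to the single closest center (hard clustering).
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda j: squared_distance(p, centers[j]))
        # Recompute each center as the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                d = len(members[0])
                centers[j] = [sum(m[t] for m in members) / len(members) for t in range(d)]
    return centers, labels
```

For two well-separated groups of points, a call such as `kmeans(points, 2, 5)` is expected to place each group in its own cluster.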
However, there are many limitations associated with using the K-means algorithm. Among these are the sensitivity to the choice of initial centroids, the weak ability to deal with noise and outlier data points, and the challenges that arise when dealing with a large data size [30]. The selection of initial centroids is poor when there are many outliers. Also, applying this algorithm to a large dataset increases the number of iterations and the overall execution time, thus making it difficult to meet the desired level of performance.
Another common clustering technique is the fuzzy c-means (FCM) algorithm, whose computations are accelerated in this study using high-speed FPGA technology. The FCM algorithm differs from K-means in that data points can belong to more than one cluster with varying probabilities (soft clustering).
Several studies that have used clustering approaches for diverse purposes have concluded that the FCM produced better results than the K-means algorithm with respect to a set of predefined criteria such as the handling of outliers and stability. However, the FCM algorithm's computation time is higher [31][32][33][34][35][36][37][38]. The longer computation time is linked to the complexity of the FCM algorithm, which is $O(nk^2dr)$, whereas that of the K-means algorithm is $O(nkdr)$.
Therefore, the use of parallel algorithms and high-speed computation platforms is suggested to overcome the relatively high computation time. The concept of fuzzy data was introduced in 1965 [39], which provides a mechanism for manipulating and interpreting complex multifeature data objects [40]. The fuzzy c-means algorithm [41] starts by determining the number of clusters k, whose initial centroids are chosen randomly.
Those centroids are updated with every iteration of the algorithm, as is the membership function of each data point in the dataset. This updating process continues until the changes in the centroid values, known as centroid movements, become minimal and fall below a predefined threshold value. The degree of membership for each observation is a value in the interval [0, 1]. A greater value toward a specific cluster means a stronger relationship with the center of that cluster. The FCM algorithm determines whether an object belongs to a particular class based on its membership function [36]. The membership function vector depends mainly on the Euclidean distance between the corresponding target object and the current centroid of each class; however, the object could belong to more than one class with a different degree of membership [42,43]. The FCM algorithm is an iterative approach in which centroids and membership functions are updated in every iteration as described in equations (5) and (6); the membership degree of each object is estimated such that the value is bounded between 0 and 1. Finally, when there are almost no changes in the centroid list, objects are classified into a set of K classes according to the principle of highest membership. The FCM formulation can be summarized as below: A dataset X contains n vectors of observations such that $X = \{x_1, x_2, \ldots, x_n\}$, and there are K clusters with initial centroids C, where $C = \{c_1, \ldots, c_K\}$. Also, we can define the 2D membership function array F, where F is an array of n rows and K columns, as defined in equation (1). At first, all elements in the F array are filled randomly; however, the conditions in equations (2) and (3) should still be satisfied.
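The equations referenced in this paragraph did not survive in this text. For orientation only, the membership array and its constraints are commonly written as below in the standard FCM formulation. The element symbol $f_{ij}$ is an assumption, and the form and numbering may differ from the article's own equations (1) to (3).

```latex
\begin{aligned}
F &= [f_{ij}], \qquad i = 1, \ldots, n, \quad j = 1, \ldots, K \\
0 &\le f_{ij} \le 1 \\
\sum_{j=1}^{K} f_{ij} &= 1 \qquad \text{for every } i = 1, \ldots, n
\end{aligned}
```

Under these constraints each row of F describes how one observation is shared among the K clusters.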
The FCM algorithm minimizes an objective function in which m is the fuzzifier exponent. It is generally hard to select the optimal value of m; since it could be in the range of 1.5 to 4 [44,45], in this study we chose $m = 3$. The term $\|x_i - c_j\|^2$ refers to the Euclidean distance between data point $x_i$ and centroid $c_j$. We can summarize the steps of the FCM algorithm as below:
Step 1:
Step 2: Calculate the new centers vector
Step 3: Increase the number of iterations (iter = iter + 1).
Step 4: If the stop condition is not satisfied, return to step 2; otherwise, STOP.
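The iteration above can be sketched in plain Python as follows. Because the article's update equations are not reproduced in this text, the sketch assumes the standard textbook FCM updates: each centroid is the mean of the points weighted by membership raised to the power m, and each membership is proportional to the distance raised to the power -2/(m-1). The function name `fcm`, the `max_iter` and `seed` parameters, and the handling of a zero distance are hypothetical choices, not the authors' implementation.

```python
import math
import random


def fcm(points, k, m=3.0, eps=1e-7, max_iter=1000, seed=0):
    """Standard fuzzy c-means sketch; returns (centers, F, labels)."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # Fill the n x k membership array randomly, each row normalized to sum to 1.
    F = []
    for _ in range(n):
        row = [rng.random() + 1e-12 for _ in range(k)]
        total = sum(row)
        F.append([v / total for v in row])
    centers = [[0.0] * d for _ in range(k)]
    for it in range(1, max_iter + 1):
        # Centroid update: mean of the points weighted by f_ij ** m.
        new_centers = []
        for j in range(k):
            w = [F[i][j] ** m for i in range(n)]
            sw = sum(w)
            new_centers.append(
                [sum(w[i] * points[i][t] for i in range(n)) / sw for t in range(d)]
            )
        # Membership update from the distances to the new centroids.
        for i in range(n):
            dist = [math.dist(points[i], c) for c in new_centers]
            if min(dist) == 0.0:
                # The point sits exactly on a centroid: full membership there.
                hit = dist.index(0.0)
                F[i] = [1.0 if j == hit else 0.0 for j in range(k)]
            else:
                inv = [dj ** (-2.0 / (m - 1.0)) for dj in dist]
                total = sum(inv)
                F[i] = [v / total for v in inv]
        # Stop when the average centroid movement drops below eps.
        move = sum(math.dist(a, b) for a, b in zip(centers, new_centers)) / k
        centers = new_centers
        if it > 1 and move < eps:
            break
    # Classify each object by the principle of highest membership.
    labels = [max(range(k), key=lambda j: F[i][j]) for i in range(n)]
    return centers, F, labels
```

The default `m=3.0` and `eps=1e-7` mirror the fuzzifier and threshold values stated in the text; everything else is illustrative.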
In this study, the stop condition is met when there are almost no changes in the centroids, that is, when the average change of all centroids is less than the threshold value ε, where $\varepsilon = 10^{-7}$. The remainder of this paper is organized as follows: Section 2 introduces the FPGA computing platform and its characteristics. OpenCL is presented in Section 3. In Section 4, we discuss related work and literature reviews. The proposed approach for performance tuning is discussed in Section 5. Experimental results are presented in Section 6. Section 7 provides a conclusion of the proposed study.

FPGA Computing Platform
High-speed computing platforms such as FPGAs have become popular in many applications, including image and signal processing, machine learning, security, pattern recognition, and scientific problems [46][47][48][49][50][51]. FPGA technology allows a customized hardware design that reflects the desired objectives while retaining software-like flexibility: the platform can be reconfigured many times with many possible configurations, which makes optimizing the proposed design more convenient.
In Intel FPGA devices, the structure usually incorporates adaptive logic modules (ALMs), RAM blocks, and extensive digital signal processing (DSP) blocks. FPGAs can also carry other kinds of blocks, such as phase-locked loops (PLLs), which can adjust the internal clock frequency. Each ALM contains at least one lookup table (LUT) together with one or more flip-flops (FFs).
These ALMs are distributed throughout the FPGA fabric, making FPGAs very amenable to temporally parallel (systolic or pipelined) computations that can be applied to exploit loop-level concurrency in several applications.
In such cases, the body of the loop is divided into executable pieces, with each piece targeted for execution on a different stage of computational logic generated within the FPGA. The data passed along pipelined stages are stored in discrete and accessible ALM flip-flop resources. Generally, when the design is completely pipelined, it takes one clock cycle to pass an item of data from one stage to the next in a purely temporal pipeline.
All stages concurrently perform their computations on different data. In such cases, the expected number of clock cycles required to handle any single item (usually referred to as pipeline latency) equals the number of stages used to implement the body of the loop.
However, if there are a substantial number of elements in the loop, then the most important metric is the initiation interval (II). This metric reflects how many clock cycles the system should wait, on average, before permitting the next item to enter the pipeline. A custom-created pipeline within an FPGA effectively reveals the low-level structure of an application. The device used, namely the Intel DE5a-Net Arria 10, has 427,200 ALMs, which are used to implement hardware circuit functions; 1518 DSP blocks to ensure the efficient implementation of floating-point operators; and 2713 RAM blocks to store data for the synthesized design.

OpenCL
Open Computing Language (OpenCL) is a programming framework that simplifies the distribution of work among multiple computation platforms of different or similar kinds [52]. For the FPGA-based platform, the most influential benefits come from abstracting most of the hardware details and significantly reducing the development time [46,52]. The OpenCL framework also enables a variable number of threads to be created, such that the number of threads can be set by the programmer according to the platform's architecture. A single thread could be used, such as when using a task-parallel model on an FPGA, or several threads can be created for each core in multicore systems. In some cases, such as when a graphics processing unit (GPU) computation model is used, millions of threads might be created. The OpenCL programming model has two main parts: the host and device programs. The host code is usually written in the C/C++ programming language, and it manages all communication with the device. This part of the code is compiled using the GNU g++ compiler. The device code is an OpenCL-based program that implements the segment of code that should be accelerated using a high-speed computation device (in this study, the DE5a-Net FPGA device). The device code is compiled using the Intel FPGA compiler. As this process could take hours to complete, it should be carried out before compiling the host code. Figure 1 depicts the programming model flow.

Related Work
The FCM yields good clustering results but requires a long computation time.
Therefore, many studies have attempted to improve the speed of the FCM algorithm. High-speed computation platforms such as GPUs [54][55][56][57][58], multicores [53], and FPGAs are utilized to run these complex computations, with the FCM algorithm modified to suit the hardware features and achieve reasonable improvements.
Among these platforms, the FPGA provides an especially high-speed hardware solution that can be customized according to the algorithm's specifications. Afshin introduced a hardware solution approach allowing the FCM to be utilized for brain MR image segmentation [59]. In this case, the Xilinx Virtex-7 FPGA was used with the ModelSim tool to obtain the simulation results. MATLAB was used to run the FCM algorithm on an Intel Core i5 machine as the software baseline for comparison. Although the speed improved approximately 100 times, the design was not implemented on a real hardware device.
In the field of image processing, an improved FCM algorithm was introduced that uses a pulse mode hardware structure to save resource usage and offers a reasonable speed improvement. The proposed design was tested and verified on the Virtex-6 FPGA platform [60].
Another study discussed the development of an improved FCM algorithm applicable to real-time applications such as video processing [61]. Specifically, the study discussed two hardware approaches to tune the performance of two different hardware structure devices from Xilinx and Altera.
In other work, Hwang et al. [62] discussed a hardware approach to optimize the FCM algorithm by creating a well-pipelined hardware circuit designed to perform the overall FCM computation steps. The proposed circuit design was implemented on a Stratix III device, with the results indicating the favorable performance of the proposed design. It also consumed a relatively small percentage of the available resources.
This study demonstrates the ability of a parallel high-level computing language (OpenCL) to tune the performance of a common computation-heavy clustering algorithm to run on high-speed Intel FPGA technology. In addition to reducing latency, FPGAs are also widely utilized to reduce the overall energy consumption, as shown in [46].

Methods of Optimization and the Proposed Approach Performance Tuning
The continuously improving performance and effectiveness of FPGA platforms have promoted their extensive use to accelerate diverse computation-heavy applications. Several optimization techniques can be used for performance tuning and to maximize the possible benefits of different high-speed processing platforms.
The Intel FPGA compiler generates a pipelined datapath hardware circuit in which several operations of different loop iterations are executed concurrently (as shown in Figure 2). This model of parallelism is called a task-parallel model (single work-item), by which processing is sped up by overlapping the execution of the loop iterations in a given piece of code. The execution of each loop iteration is divided into multiple steps, each of which involves one or more instructions. Also, each step is executed in one clock cycle. In an ideal case, the next loop iteration starts the execution of its first step one clock cycle (one step) after the previous iteration. This offset is commonly referred to as the II and should be equal to 1 if the pipelined circuit is created perfectly.
As indicated in Figure 2, the execution of the loop iterations overlaps, meaning the total execution time for all loop iterations is approximately equal to M clock cycles when M >> N, where M is the number of loop iterations and N is the number of steps per iteration.
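The cycle count behind this statement can be made concrete with the usual pipelining model, sketched below. It assumes that M denotes the number of loop iterations and N the number of steps per iteration (an interpretation of the symbols above), and that a pipeline delivers its first result after N cycles and one further result every II cycles. The function names are hypothetical.

```python
def pipelined_cycles(iterations, stages, ii=1):
    """Cycles to finish all iterations when their execution overlaps."""
    if iterations <= 0:
        return 0
    # The first iteration needs `stages` cycles; each later one adds `ii`.
    return stages + ii * (iterations - 1)


def sequential_cycles(iterations, stages):
    """Cycles needed when every iteration runs to completion before the next."""
    return iterations * stages
```

With 1000 iterations of 10 steps each, the overlapped loop takes 1009 cycles instead of 10,000, which is close to M when M >> N. With an II of 4, the same loop takes 4006 cycles.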
The Intel compiler generates multiple files when it creates the pipelined datapath. Among these files is the optimization report file, which can be interpreted to modify the design so that the best possible pipelined circuit is created. The loop unrolling technique can also be utilized to improve the performance of the created design by increasing the amount of work performed per clock cycle. This reduces the total number of clock cycles by a factor of X when the main loop is unrolled X times. As shown in Figure 3, the number of iterations is reduced by half when the loop is unrolled by a factor of two.
However, the loop unrolling technique increases resource usage and may also increase the clock cycle time. If the clock cycle time for a given design is C, then the new clock cycle time after loop unrolling is $\alpha C$, where $\alpha \ge 1$. Loop unrolling (where U is the unrolling factor) is feasible if it reduces the overall time T given in equation (7); that is, $T_{\mathrm{no\_loop\_unroll}}$ should be greater than $T_{\mathrm{loop\_unrolled\_}U}$, which leads to the condition in equation (8). If the condition in equation (8) is satisfied, the application of the loop unrolling technique is feasible.
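The two equations referenced in this paragraph are not reproduced in this text. A plausible reconstruction, based only on the statements that unrolling by U divides the number of clock cycles by U and scales the clock cycle time by α, is sketched below. The symbol $N_c$ for the number of clock cycles without unrolling is an assumption, and the article's own equations (7) and (8) may be written differently.

```latex
\begin{aligned}
T_{\mathrm{no\_loop\_unroll}} &= N_c \, C \\
T_{\mathrm{loop\_unrolled\_}U} &= \frac{N_c}{U} \, \alpha \, C \\
T_{\mathrm{no\_loop\_unroll}} > T_{\mathrm{loop\_unrolled\_}U}
  &\iff \alpha < U
\end{aligned}
```

Under this model, unrolling by a factor of two pays off as long as the clock cycle time less than doubles.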
Data dependencies between consecutive operations prevent the creation of effective pipeline circuits with an II that is close to one. A possible solution to this problem is to use the shift register concept, as it eliminates dependencies and allows the compiler to create a robust design with a better II value [52]. For example, if II is equal to four because there is a dependency between the output and at least one of the inputs, the next iteration waits four clock cycles before starting its execution (see Figure 4).
Other techniques might also resolve this problem, such as using local memory, which removes a significant part of the execution time by reducing data transfer time and moving the dependencies from global memory to local memory.
Using the mentioned optimization techniques together with the Intel compiler, we created an effective hardware design that runs the fuzzy c-means algorithm much more quickly than a similar implementation on a general-purpose CPU-based computation platform. The characteristics of the FPGA device used in this study, which has a substantial amount of local memory and the ability to create an effective pipeline circuit with a potentially large number of pipeline stages, make the FPGA platform a favorable choice for accelerating complex computation-based algorithms. Figure 5 shows the optimization process of a significant part of the code. In Figure 5(a), the loop runs without any optimization technique other than the Intel FPGA compiler's own optimization of the hardware design to maximize performance. The II equals 9 because of the data dependency on the variable that accumulates the sum of weights (sumW). Using loop unrolling (Figure 5(b)) reduces the execution time. However, the II increases to approximately 51, as more dependencies are generated in each iteration, and there is still room for optimization. By combining local memory and a shift register with loop unrolling (Figure 5(c)), we can create a more powerful pipeline design that overcomes the data dependency obstacle with an II of one. The size of the shift register is set so that the number of pipeline stages is increased to hide the latency and eliminate pipeline stalls. As a result, the speed is increased by more than two times, as will be discussed further in the results and discussion section.
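The shift-register idea for the sumW accumulation can be illustrated with a small emulation. The sketch below is plain Python rather than the OpenCL kernel of the study, which is not shown in this text; it follows the commonly documented pattern in which each new partial sum depends on a value written `depth` iterations earlier instead of on the immediately preceding one. The function name and the default `depth` are hypothetical.

```python
def accumulate_with_shift_register(weights, depth=8):
    """Sum `weights` through `depth` interleaved partial sums."""
    # depth + 1 slots: slot 0 is the oldest partial sum, slot depth the newest.
    shift_reg = [0.0] * (depth + 1)
    for w in weights:
        # The new value depends only on the oldest slot, so consecutive
        # iterations no longer wait on each other's result.
        shift_reg[depth] = shift_reg[0] + w
        # Shift every slot one position toward the front.
        for j in range(depth):
            shift_reg[j] = shift_reg[j + 1]
    # Final reduction: add the interleaved partial sums together.
    return sum(shift_reg[:depth])
```

In hardware, the depth would be chosen to cover the latency of the floating-point adder so that a new iteration can start every clock cycle. In this emulation the depth only changes the order in which the values are added.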
The primary challenge associated with optimization is the amount of resource usage in the first usable FPGA device (DE5-Net Stratix V GX FPGA). Unrolling the loop five times increases the number of ALUTs from 169 k (36% of the total ALUTs) to 495 k (105%) and the number of DSPs from 200 blocks (78% of the total DSP blocks) to 584 blocks (228% of the total DSP blocks). This challenge motivates the ...
[Figure: pipelined execution of loop iterations, each iteration passing through Step-1 to Step-N]

Experimental Results
Several datasets are considered in this study to compare the performance of the FPGA with that of single-core CPU-based platforms. The first dataset is the BIRCH (balanced iterative reducing and clustering using hierarchies) dataset, which is a two-dimensional dataset containing 100 clusters and 100,000 vectors in a regular grid structure [63]. The A1 and A3 datasets are two-dimensional datasets with circular clusters. A1 has 20 clusters and 3000 vectors of elements, whereas A3 contains 7500 vectors with 50 clusters [64]. The last dataset is the unbalanced dataset, which is a two-dimensional dataset with eight clusters and 6500 vectors [65]. A sample output that presents the FCM clustering approach for one of the used datasets is shown in Figure 6. The traditional serial-based code [66] is modified, compiled, and run on an Intel Core i7-6700 CPU at 3.4 GHz with 16 GB of memory. The code is compiled using the GNU Compiler Collection (GCC/G++) with the "O3" standard level of optimization. The second code is optimized to run on the FPGA platform using a high-capability Intel FPGA device, namely, the DE5a-Net Arria 10. Table 1 shows that the speedup factor increases from 25 (for a smaller number of computations) to 39 (for a larger number of computations) when the FPGA design is compared with the optimized version of the sequential code. Speed is increased by up to 186-fold when the FPGA is used to accelerate the computations instead of the sequential code compiled using the GNU g++ compiler with the default level of optimization.
Journal of Electrical and Computer Engineering
The other measurement that can be used to compare performance is the number of floating-point operations (addition or multiplication) per second (FLOPs) performed. All high-order functions, such as square-root and exponential functions, can be constructed using basic adder and multiplier components.
For the DE5a-Net board used in this research, the maximum theoretical FLOPs can be calculated by multiplying the total number of DSP blocks (each of which can perform two FLOPs per clock cycle) by the maximum clock frequency (which is approximately 400 MHz). Thus, the maximum theoretical GFLOPs is 1248.
It is hard to achieve the maximum possible number of FLOPs in practice because it is unlikely that all computational units will work at the same time [67]. The communication overheads between the utilized components and the data transfer, which generates more timing constraints, also prevent the maximum number of FLOPs from being achieved.
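The peak-throughput arithmetic described above can be written out as a short calculation. This is a sketch under the figures stated in the text (1518 DSP blocks, two FLOPs per block per clock cycle, a clock of approximately 400 MHz); the function names are hypothetical.

```python
def peak_gflops(dsp_blocks, flops_per_dsp_per_cycle, clock_hz):
    """Theoretical peak in GFLOPs: blocks x FLOPs per cycle x clock rate."""
    return dsp_blocks * flops_per_dsp_per_cycle * clock_hz / 1e9


def implied_clock_mhz(gflops, dsp_blocks, flops_per_dsp_per_cycle):
    """Clock frequency in MHz that would yield the given peak GFLOPs."""
    return gflops * 1e9 / (dsp_blocks * flops_per_dsp_per_cycle) / 1e6
```

With exactly 400 MHz, `peak_gflops(1518, 2, 400e6)` gives about 1214 GFLOPs. The quoted 1248 GFLOPs would correspond to a clock of roughly 411 MHz, which is consistent with the frequency being stated only approximately.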
Using the performance application program interface (PAPI) [68], we can measure the number of FLOPs and other performance parameters such as the number of cache hits and misses. Table 2 shows how often each arithmetic function is used inside the proposed FCM algorithm per iteration, in units of MFLOP. From Table 2, we can see that the processing speed reaches up to 89 GFLOPs when the FPGA accelerator device is used. Each basic operation (add, sub, mul, and div) is counted as one floating-point operation. Furthermore, the square-root function is counted as three floating-point operations, and the power function as eight floating-point operations.
The total number of floating-point operations carried out during all iterations is shown in Table 3. Table 3 also shows the GFLOPs performance measured using the PAPI tool.
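The counting convention above can be expressed as a small helper. The weights come from the text (one operation for add, sub, mul, and div, three for the square root, eight for the power function); the function names and the operation counts in the usage note are hypothetical and are not taken from Table 2.

```python
# Floating-point operations charged per call, as stated in the text.
FLOP_WEIGHTS = {"add": 1, "sub": 1, "mul": 1, "div": 1, "sqrt": 3, "pow": 8}


def weighted_flops(op_counts):
    """Total floating-point operations for a dict of {operation: call count}."""
    return sum(FLOP_WEIGHTS[op] * count for op, count in op_counts.items())


def gflops(total_flops, seconds):
    """Throughput in GFLOPs for a given operation total and run time."""
    return total_flops / seconds / 1e9
```

For example, 10 additions, 5 multiplications, 2 square roots, and 1 power call amount to 29 floating-point operations under this convention.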

Conclusion
The study introduced a hardware solution approach that utilizes a high-level abstraction language (namely, OpenCL) to accelerate the computation-heavy FCM algorithm. The study also provides a simple way to create an efficient design that can be synthesized on the FPGA acceleration platform. The results demonstrate the effectiveness of the acceleration device and the optimization, indicated by speed improvements of up to 187 times when compared to a regular single-core CPU platform.

Data Availability
Datasets are derived from public resources and made available with the article through providing the required references.

Conflicts of Interest
The authors declare that they have no conflicts of interest.