Controllable Clustering Algorithm for Associated Real-Time Streaming Big Data Based on Multi-Source Data Fusion

Aiming at the problems of poor security and low clustering accuracy in current data clustering algorithms, a controllable clustering algorithm for associated real-time streaming big data based on multi-source data fusion is proposed. The FIR filter structure model is used to suppress network interference, and the ant colony algorithm is used to detect abnormal data in the big data. Through iterative optimization, the data points whose pheromone concentration ranks at the front are identified as abnormal data, and a filter is introduced to remove them. The fusion scope of multi-source data fusion is set. Combined with the data similarity function, the multi-source data fusion concept is used to construct the associated real-time streaming big data fusion device, and the data deduplication results are substituted into the fusion device to obtain the data clustering result. The experiments show that the proposed algorithm has a high safety factor, good data clustering accuracy, high clustering efficiency, and low energy consumption.


Introduction
The arrival of big data not only promotes industrial development but also promotes the evolution of new business models [1,2]. To this end, the rapid mining and discovery of knowledge from massive data has become the focus of researchers and companies. Data clustering is an important research direction in data mining. It aims to divide data into categories consisting of similar objects [3,4]. However, the associations in real-time streaming big data are varied and complex. Traditional frameworks only consider computing resources for data stream extraction and storage and cannot effectively use computing resources to provide more and faster clustering services [5][6][7]. This requires a new data clustering algorithm [8,9].
Cao and Qian proposed a big data clustering algorithm based on local key nodes. To address the initial node uncertainty and the time consumed by the fitness function calculation, local key nodes were introduced and the fitness formula was improved to reduce the time consumption. Experiments against classical algorithms in small-scale and large-scale data networks showed that the clustering time was short, but security was poor [10][11][12][13]. Zhang et al. proposed a SOM hybrid attribute data clustering algorithm based on heterogeneous value difference metrics. The algorithm used a self-organizing map neural network as the framework and used the heterogeneity difference based on sample probability to measure the dissimilarity of mixed attribute data. The frequency of occurrence of the classification feature in the Voronoi set was used as the basis of the reference vector update rule for classification attribute data, and the numerical attribute and classification attribute data were updated through a hybrid update rule. The proposed clustering algorithm was tested on the classification attribute and mixed attribute datasets in the UCI public database. The experimental results showed that the algorithm had low running complexity, but its classification accuracy was poor [14]. Wang et al. proposed a data classification algorithm based on Spark-FCM. First, the matrix was distributed by horizontal partitioning, and different vectors were stored in different nodes. Then, based on the computational characteristics of the FCM algorithm, distributed and cache-sensitive common matrix operations and the Spark-FCM algorithm were designed. The main data structure adopted distributed matrix storage, which required less data movement between nodes and allowed distributed computation at each step. The experimental results showed that the algorithm had good stability, but it was time-consuming [15][16][17][18]. Liu et al. proposed an improved manifold clustering algorithm based on density peak search. The global and local spatial manifold distribution of the dataset was comprehensively considered, and the local density of each sample point was defined. According to the local density of each sample point and its local density relationship with other sample points, the cluster center criterion was defined, and clustering was implemented. The algorithm achieved a certain classification accuracy, but the energy consumption of clustering was high [19].
Aiming at the problems existing in the current research results, a controllable clustering algorithm for associated real-time streaming big data based on multi-source data fusion is proposed. The general framework is as follows:
(1) Anti-interference filtering of the associated real-time streaming big data is realized by the FIR filtering algorithm, and the abnormal data in the filtering result are detected by the ant colony algorithm and filtered out to improve the data clustering security in real time.
(2) Redundant data in the associated real-time streaming big data are removed to reduce the energy consumption of data clustering.
(3) Data clustering is achieved through multi-source data fusion.
(4) Experiments and discussion are used to verify the controllable clustering algorithm for associated real-time streaming big data based on multi-source data fusion.
(5) The full text is summarized and the next research direction is proposed.

Processing Abnormal Associated Real-Time Streaming Big Data.
In order to improve the security of the data clustering process, the abnormal data need to be eliminated [20][21][22][23]. In the process, the FIR filtering algorithm is first used to realize anti-interference filtering of the associated real-time streaming big data. The structure diagram is shown in Figure 1. Assuming that the data traffic is generated by a linearly correlated nonlinear time series, the FIR filter structure model of equation (1) is used for interference suppression, where $x_n$ represents the network traffic information model of the data center, $a_0$ represents the sampling amplitude of the initial network traffic, $x_{n-i}$ represents the scalar time series of network traffic with the same mean and variance in the data center, $b_j$ represents the oscillation amplitude of the network traffic in the data center, and $\eta_n$ represents the delay sequence of data transmission. According to the calculation of equation (1), the network traffic information flow of the data center is Fourier transformed to obtain the time series $x(k)$, and the oscillation attenuation of the network traffic is obtained after the interference filtering process, where $a$ represents the inter-domain variance coefficient of network traffic, $m$ represents the embedded dimension in phase space, and $B_H(t)$ represents the correlation function of data flow anomaly feature detection. Assuming that the input sequence $x(k)$ is a set of wide-sense stationary time series, the transfer function of the filter is $H_B(c)$, where $G(c)$ represents the filter transfer model. Based on the above calculations, the interference suppression FIR filter for network traffic is designed.
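As an illustration only, the interference suppression step can be sketched with a generic causal FIR filter in Python. The tap values, the sample traffic series, and the names `fir_filter`, `traffic`, and `taps` are assumptions made for this sketch; they are not the coefficients of the filter designed in this section.

```python
def fir_filter(samples, taps):
    """Apply a causal FIR filter: y[n] = sum_k taps[k] * x[n - k].

    Samples before the start of the series are treated as zero.
    """
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        output.append(acc)
    return output


# Hypothetical traffic series with a single interference spike.
traffic = [1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
# Assumed taps: a 4-point moving average (a simple low-pass FIR filter).
taps = [0.25, 0.25, 0.25, 0.25]

smoothed = fir_filter(traffic, taps)
print(smoothed)
```

With these assumed taps the spike is spread over four output samples, so its peak amplitude is reduced; a filter designed from the transfer model in the text would use different coefficients.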
According to the above calculation and analysis, the final data streaming output result obtained by the FIR anti-interference filtering process is $c(t)$, where $x(t)$ represents the real part of the data streaming time series, $y(t)$ represents the imaginary part of the data streaming time series, and $n(t)$ represents the other influence vectors. Based on the above interference suppression result, the ant colony algorithm is used to detect the abnormal data therein. The detection process is mainly divided into the following steps:
(6) Return to step (4) until convergence or until the termination condition is met.
(7) Select all paths whose pheromone is greater than the set threshold and save or modify $S$ as required. The data whose pheromone concentration ranks at the front of the table are determined to be abnormal data.
In step (2), each edge $e_{i'}$ ($1 \le i' \le En$) on DG carries a pheromone $\chi_{i'}$, where $En$ represents the number of edges in DG.
In step (2), the table is $S = (T, A, V, M)$, where $T$ represents the tuple address of the ant walking the path, $A$ is the target attribute name, $V$ is the target attribute value in the tuple, and $M$ is the attribute measure value of $A$ in the tuple.
In step (4), given the node $v$, the ant selects the adjacent edge $e_{i'}$ with probability $P_{i'}(t)$, where $\lambda_{i'}$ represents the edge heuristic factor. The larger the value of $\lambda_{i'}$, the greater the probability of selecting this path.
In step (5), the pheromone on each edge of the most recent path $L$ is updated, where $\rho$ represents the volatilization rate of the pheromone, which can effectively inhibit the ant colony from rapidly converging to a path that has already been traversed. As the pheromone is continuously updated, the paths whose pheromone concentration exceeds the set threshold $w$ are stored in the table $S$, and the corresponding data are judged to be abnormal. The abnormal data filter $F(x)$ is introduced here to filter out the abnormal data, which yields the normal dataset $C(t)$ in the associated real-time streaming big data.
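The selection, update, and thresholding steps can be sketched as follows. Because the exact expressions are not reproduced here, this sketch assumes the textbook ant colony forms: selection probability proportional to pheromone times heuristic factor, evaporation by a factor of (1 - rho) followed by a fixed deposit, and a uniform initial pheromone of 1/En. The function names and the `deposit` parameter are hypothetical.

```python
import random


def select_edge(pheromone, heuristic, rng):
    """Pick an edge index with probability proportional to pheromone * heuristic."""
    weights = [p * h for p, h in zip(pheromone, heuristic)]
    r = rng.random() * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        if r <= cumulative:
            return i
    return len(weights) - 1


def update_pheromone(pheromone, path_edges, rho, deposit):
    """Evaporate pheromone at rate rho, then deposit on the latest path's edges."""
    updated = [(1.0 - rho) * p for p in pheromone]
    for i in path_edges:
        updated[i] += deposit
    return updated


def flag_abnormal(pheromone, w):
    """Return edge indices whose pheromone exceeds threshold w, strongest first."""
    flagged = [i for i, p in enumerate(pheromone) if p > w]
    return sorted(flagged, key=lambda i: pheromone[i], reverse=True)


# Toy run: four edges, the last one has a much larger heuristic factor.
rng = random.Random(0)
En = 4
pheromone = [1.0 / En] * En  # assumed uniform initialisation
heuristic = [1.0, 1.0, 1.0, 5.0]
for _ in range(50):
    edge = select_edge(pheromone, heuristic, rng)
    pheromone = update_pheromone(pheromone, [edge], rho=0.1, deposit=0.1)
print(flag_abnormal(pheromone, w=0.5))
```

Sorting the flagged edges by pheromone mirrors the idea that the entries at the front of the table are the ones treated as abnormal.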

Deduplication of Real-Time Streaming Big Data.
In order to reduce the burden and energy consumption of data clustering and to improve the classification accuracy, the redundant data need to be cleared [24]. For a new data segment, there are two possible cases of similarity, shown by the black points $A_1$ and $A_2$ in Figure 2. The black point $A_1$ indicates that the data segment lies outside the three class boundaries in the plane, so its most similar data segments are the closest ones, namely the white points on the class boundaries [25,26]. The black point $A_2$ indicates that the data segment lies inside a class, so its most similar data segments are the closest white and black points in the same class. Since the similarity of the data objects in the same class is high, the data segments on the class boundary are selected to build the check cache, which still provides a good deduplication ratio [27]. According to the problem description, the redundant data are filtered out by using the check metadata deduplication algorithm. First, the weighted dataset is clustered; then the compressed neighboring algorithm is used to obtain the weighted subset, and the similar metadata are eliminated based on the weighted subset, thereby reducing the size of the index. Eliminating metadata with high similarity can effectively reduce the amount of metadata and further reduce system resource overhead while maintaining the deduplication ratio [28,29]. The whole process of deduplication is divided into two parts: similar clustering of data segments and deduplication.
Figure 1: Structure of FIR filter.

Wireless Communications and Mobile Computing

In the similar clustering phase, the symbols used below are defined as follows: the sim fingerprint set of the data to be checked is $S' = \{s'_1, \ldots, s'_n\}$, the clusters before similar clustering are $C' = \{C'_1, \ldots, C'_K\}$, the clusters after clustering are $C'' = \{C''_1, \ldots, C''_K\}$, the distance measure between two similar data segments is $\mathrm{dist}(s'_i, s'_j)$, and the Hamming distance between two sim fingerprint values is $\mathrm{Ham}(s'_i, s'_j)$. Here, the saved sim fingerprint value is used to represent a data object, and the fingerprint set $S'$ of the check data is obtained. Then, the data objects in $S'$ are clustered, and the data segments are divided into the $K$ classes $C''$. The distance measure between two similar data segments is expressed as the Hamming distance of their sim fingerprint values, that is, $\mathrm{dist}(s'_i, s'_j) = \mathrm{Ham}(s'_i, s'_j)$. Therefore, the entire clustering process can be described as follows: (1) Select $K$ representative objects $b_1, \ldots, b_K$ from $S'$ as the initial center points.
The object with the lowest total cost is selected as the new center point. (4) Repeat steps (2) and (3) until the $K$ center points no longer change. The $K$ clusters $C'' = \{C''_1, \ldots, C''_K\}$ are then obtained, that is, the required $K$ classes of similar data.
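The clustering loop described above resembles a k-medoids procedure over Hamming distance. The following is a minimal sketch under that assumption: fingerprints are modelled as integers, each fingerprint is assigned to its nearest center, and the new center of each cluster is the member with the lowest total distance to the other members. The function names and the 8-bit sample fingerprints are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")


def k_medoids(fingerprints, initial_centers, max_iter=100):
    """Cluster fingerprints around K center points using Hamming distance."""
    centers = list(initial_centers)
    clusters = {}
    for _ in range(max_iter):
        # Assign every fingerprint to its nearest center.
        clusters = {c: [] for c in centers}
        for s in fingerprints:
            nearest = min(centers, key=lambda c: hamming(s, c))
            clusters[nearest].append(s)
        # In each cluster, pick the member with the lowest total cost.
        new_centers = []
        for c, members in clusters.items():
            if not members:
                new_centers.append(c)
                continue
            best = min(members, key=lambda m: sum(hamming(m, o) for o in members))
            new_centers.append(best)
        # Stop when the K center points no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return clusters


# Hypothetical 8-bit fingerprints forming two well-separated groups.
data = [0b00000000, 0b00000001, 0b00000010,
        0b11111111, 0b11111110, 0b11111101]
result = k_medoids(data, initial_centers=[0b00000001, 0b11111110])
for center, members in result.items():
    print(format(center, "08b"), [format(m, "08b") for m in members])
```

Real sim fingerprints are typically much longer than 8 bits; the short values here only keep the example readable.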
In the deduplication phase, the check weight subset is defined as $S'' = \{s''_1, \ldots, s''_n\}$, and the process is as follows.
The sim fingerprint value of each data segment is used in place of the data segment, and the sim fingerprint value set $S' = \{s'_1, \ldots, s'_n\}$ of all data segments is obtained. Two memories, st and gr, are set up for $S'$. All the samples of $S'$ are put into gr, and the sim fingerprint value of one data segment is randomly extracted from gr and put into st. Then a sim fingerprint value $s'_K$ is randomly extracted from gr and classified using the sim fingerprint values in st as the reference set: the fingerprint $s'_{nK}$ in st that is closest to $s'_K$ is found. If $s'_{nK}$ and $s'_K$ belong to the same class $C'_i$, the redundancy judgment is considered correct and $s'_K$ is deleted; otherwise, $s'_K$ is stored in st as a new category.
The above steps are performed for all samples in gr until gr is empty. The set that remains after the redundant data are removed is $S''' = \{s'''_1, \ldots, s'''_n\}$. According to the above process, redundant data in the real-time streaming big data can be cleared.
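The st/gr procedure can be sketched as below. The sketch assumes integer fingerprints, a precomputed mapping `cluster_of` from each fingerprint to its class label, and a seeded shuffle standing in for random extraction; these names and the sample data are hypothetical.

```python
import random


def hamming(a, b):
    """Hamming distance between two integer fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(fingerprints, cluster_of, seed=0):
    """Keep a reduced set of fingerprints using the two memories st and gr."""
    rng = random.Random(seed)
    gr = list(fingerprints)
    rng.shuffle(gr)      # popping from a shuffled list emulates random extraction
    st = [gr.pop()]      # the first randomly drawn sample seeds st
    while gr:
        s = gr.pop()
        nearest = min(st, key=lambda t: hamming(s, t))
        if cluster_of[nearest] == cluster_of[s]:
            continue     # same class as its nearest reference: redundant, drop it
        st.append(s)     # different class: keep it as a new category
    return st


# Hypothetical fingerprints with class labels from a previous clustering step.
cluster_of = {
    0b00000000: "A", 0b00000001: "A", 0b00000010: "A",
    0b11111111: "B", 0b11111110: "B", 0b11111101: "B",
}
kept = deduplicate(list(cluster_of), cluster_of)
print([format(s, "08b") for s in kept])
```

Which member of each class survives depends on the extraction order, but a class can never lose all of its members: a sample is only dropped when st already holds a member of the same class.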

Controllable Clustering for Associated Real-Time Streaming Big Data Based on Multi-Source Data Fusion.
Assume that the fusion range of multi-source data fusion is $[q_{\min}, q_{\max}]$, where $q_{\min}$ represents the minimum number of neighbors and $q_{\max}$ represents the maximum number of neighbors.
The $q$ nearest neighbors obtained from the $N''$ searches over the $n''$ data points are mapped to a similarity function, and the attribute similarity of the data points $(z_1, z_2)$ is recorded as $\mathrm{similarity}(z_1, z_2)$ in equation (12). The similarity function of equation (12) is then updated for the data points $(z_1, z_2)$ when $z_1$ is among the $q$ neighbors of $z_2$, or vice versa. The loop variable is incremented, $t'' = t'' + 1$; if $t'' < N''$, the similarity function is reconstructed; otherwise, the following steps are performed.
According to the similarity function formed by the above process, the multi-source data fusion concept is used to construct the real-time streaming big data fusion device, and the output of the fusion device is the final clustering result, where $\otimes$ represents the connector of attribute similarity data in the real-time streaming big data and $\delta$ represents the clustering factor. Controlling the value of $\delta$ within [0.6, 0.7] can improve the clustering precision.
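To make the role of the neighbor count q and the clustering factor δ concrete, the following toy sketch builds a neighbor-based similarity and groups points whose similarity reaches δ. It is not the fusion device defined above: the similarity form 1 / (1 + Euclidean distance), the connected-components grouping, and all function names and sample points are assumptions chosen only for illustration.

```python
import math


def q_nearest(points, i, q):
    """Indices of the q points closest to points[i] (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:q])


def similarity_matrix(points, q):
    """Nonzero similarity only if one point is among the other's q neighbors."""
    n = len(points)
    neighbors = [q_nearest(points, i, q) for i in range(n)]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in neighbors[i] or i in neighbors[j]):
                sim[i][j] = 1.0 / (1.0 + math.dist(points[i], points[j]))
    return sim


def fuse(points, q, delta):
    """Label points by linking pairs whose similarity is at least delta."""
    sim = similarity_matrix(points, q)
    n = len(points)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= delta:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical two-attribute data points forming two groups.
points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (5.0, 5.0), (5.3, 5.0)]
labels = fuse(points, q=2, delta=0.65)
print(labels)
```

In this toy setting a larger δ demands closer points before they are merged, which is one way a single factor can control how coarse or fine the resulting clusters are.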

Results
In order to verify the effectiveness of the controllable clustering algorithm for associated real-time streaming big data based on multi-source data fusion, an experiment is conducted. The experimental hardware environment is shown in Figure 3. The algorithm is implemented in C under Linux with the associated real-time streaming database. The experimental indicators are the data clustering security coefficient, the data clustering accuracy, the clustering time, and the clustering energy consumption. The experimental results are as follows. Analysis of Figure 4 shows that the big data clustering algorithm based on local key nodes and the SOM hybrid attribute data clustering algorithm based on heterogeneous value difference metrics have lower operational safety factors. The controllable clustering algorithm for associated real-time streaming big data based on multi-source data fusion has a higher data clustering security coefficient under different types of attack data.
This is mainly because the algorithm uses the ant colony algorithm to detect the abnormal data before data clustering and eliminates them, which effectively enhances the security performance of the algorithm.
Analysis of Figure 5 shows that the proposed algorithm is superior to the current data clustering algorithms in data clustering accuracy. The algorithm uses filtering technology to suppress network interference, which provides an initial improvement in data clustering accuracy. The clustering factor is then introduced to further improve the data clustering accuracy of the proposed algorithm.
In Figures 6 and 7, the clustering time of the controllable clustering algorithm for associated real-time streaming big data based on multi-source data fusion is not affected by the amount of data being clustered, the clustering efficiency is high, and the overall energy consumption of clustering is lower.
This is mainly because, by eliminating redundant data, the algorithm reduces the time, energy consumption, and burden of data clustering, which effectively controls the time and energy consumption of data clustering and enhances the overall performance of the proposed algorithm.

Discussion
In the discussion, the clustering factor $\delta$ is taken as the object of study, and the influence of its value range on data clustering is observed. MATLAB R2017a is used to simulate the influence of changes in its value on the data clustering accuracy. It is observed that the data clustering accuracy is highest when the value is controlled within the interval [0.6, 0.7], that is, $\delta$ can effectively improve the clustering accuracy in this interval. The results are as follows.
It can be seen from Figure 8 that when the value of $\delta$ is in [0.4, 0.6], the data clustering accuracy generally improves but is not ideal; when the value of $\delta$ is in [0.7, 0.9], the data clustering accuracy decreases; when the value of $\delta$ is in [0.6, 0.7], the data clustering accuracy is the highest, about 98%. The data comparison shows that the value of $\delta$ has a great influence on the accuracy of data clustering, and that a value of $\delta$ in [0.6, 0.7] gives the best clustering accuracy, indicating that this value range is reliable.

Conclusions
Efficient clustering of large-scale data is very important for data utilization. At present, the performance of relevant research results remains to be improved, so a controllable clustering algorithm for associated real-time streaming big data based on multi-source data fusion is proposed. By detecting and eliminating abnormal data and redundant data, it lays a foundation for data clustering, and it realizes data clustering by constructing an associated real-time streaming big data fusion device. The experimental results show that the proposed algorithm has strong clustering performance and is feasible.
In the next step, the following aspect can be regarded as the research focus: data clustering technology, including data clustering systems, is constantly innovating, so the clustering system can be integrated with the clustering algorithm, and the clustering algorithm can be used to control the clustering system to further improve data clustering performance.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.