Design of Management Platform Architecture and Key Algorithm for Massive Monitoring Big Data

With the construction and development of industrial informatization, industrial big data has become a trend within the smart industry. To obtain valuable information from massive data, the acquisition, storage, analysis, and mining of these data are becoming important areas of research. Focusing on the application requirements of industrial fields, we propose a data acquisition and analysis system based on NB-IoT for industrial applications. The system is an integrated platform that includes sensor data acquisition, data transmission, data storage, and analysis and mining. In this study, we mainly focus on using the NB-IoT network to collect and transmit real-time sensor data. First, we address the problem of long time series (for example, if we collect the data stream of a sensor sampled at 1 Hz for one year, the length of the series reaches the order of 10^7). Second, we propose DSCS-LTS, a distributed storage and calculation model, and CCCA-LTS, an algorithm for estimating the correlation coefficient of long time series in a distributed environment. Third, we propose a granularity selection algorithm and query process logic for visualization. We tested the platform in our laboratory and on an automated production line for one year, and the experimental results on real data sets show that our approach is effective and scalable, achieves efficient data management, and provides a basis for intelligent enterprise decision-making.


Introduction
With the rapid development of "interconnected" and "intelligent" industries, industrial big data has become a focus of current research. The collection, storage, analysis, and deep mining of all types of real-time monitoring data in the industrial field face great challenges. The construction and development of the "interconnected" and "intelligent" industries require a large amount of basic data. These data should not remain at the collection stage but should be stored, analyzed, and mined in depth, for example, for real-time analysis of industrial field data, alarms on abnormal data, supervision of every step of the production process, and intelligent decision-making. This series of problems is becoming an important topic in industrial informatization research.
With the development of Internet of Things technology, traditional big data acquisition platforms have been unable to keep up with the growing mass of data. The scheme of "centralized collection and centralized management" provides an effective solution for massive data management in the industrial field. In this paper, a specific project serves as an example: on a large tomato sauce factory production line, all kinds of monitoring data (such as sterilization temperature, tank pressure, pipeline flow rate, and motor power) are collected, the real-time data are transmitted through the network to the data management platform, distributed storage and analysis methods are designed for the massive data, deep data mining and visualization are realized, and a basis for decision-making is provided.
Based on the current situation of large tomato sauce production lines, this project designs an integrated platform for data collection, analysis, mining, and visualization. We propose the distributed storage and calculation model DSCS-LTS and the correlation coefficient estimation method CCCA-LTS for massive, long time series data, and we propose a data visualization method, addressing the key problems in managing massive monitoring data. In recent years, time series database management systems (TSDBs) have developed rapidly, and many such systems have emerged. Popular open-source systems include InfluxDB [1] and OpenTSDB [2]. InfluxDB defines field types and statistical queries for time series; however, it does not support complex queries such as similarity search. OpenTSDB is a distributed time series management system based on HBase that supports simple statistical queries on time series. Recently, Tsinghua University in China developed IoTDB [3], which has become an Apache incubation project; its time series storage structure is similar to that of Parquet [4], and it supports distributed computing frameworks such as MapReduce and Spark. There are also many commercial time series databases, such as the domestic TDengine. However, these systems support only basic aggregation queries and do not support association analysis, similarity queries, or other such functions.

Similarity Query Technology for Time Series.
Time series similarity query technology has been studied for approximately 20 years. In 1993, the research problem of similarity queries over time series databases was first proposed [5]: the discrete Fourier transform was used to reduce the dimension of the time series, and an R-tree was then built over the reduced representation for index-based query processing. Subsequent research applied other dimensionality reduction techniques, following the same routine of dimensionality reduction before index building, including the discrete wavelet transform [6], APCA [7], and SAX [8]. In 2013, for whole-sequence similarity queries, a VLDB paper [9] proposed a mechanism combining dimensionality reduction and index construction. In 2014, Zoumpatianos et al. proposed refining the index during the query stage, which shortens the wait for index construction [10]. In 2012, Rakthanmanon et al. proposed the UCR suite algorithm for subsequence similarity queries [11]; their approach supports normalized subsequence similarity queries but cannot build an index and must scan the entire sequence. In 2019, a VLDB paper proposed a variable-length normalized subsequence similarity query algorithm [12]. In 2016, a Harbin Institute of Technology team proposed a set-based approximate query algorithm for time series at the SIGMOD conference [13]. A team from the University of Chinese Academy of Sciences and the State Grid Electric Power Research Institute proposed DGFIndex, a multidimensional query system for smart grid data [14], and a Beihang University team proposed an approximate representation and query algorithm for trajectory time series [15]. These approaches support one-dimensional or low-dimensional time series or specific aggregate functions and cannot achieve aggregate query processing over large-scale time series. In summary, scientific computing, the Internet of Things, and intelligent manufacturing have become research hotspots globally.
The team of Meng Xiaofeng at Renmin University of China proposed a scientific big data management framework suitable for the entire life cycle of scientific data management and analyzed the key technologies in scientific big data management systems [16].

Correlation Coefficient Calculation of Time Series.
In the past 20 years, many mining and query algorithms for time series data have been proposed. Time series mining algorithms include classification, clustering, outlier detection, and motif mining [17]. Time series query algorithms include approximate queries [18], aggregate queries [19], and range queries [20, 21]. However, these algorithms run only in a single-machine environment and are not suitable for processing massive, long time series (for example, a sensor sampled at 1 Hz continuously generates on the order of 10^7 data points in one year).
Computing correlation coefficients of long time series in a distributed environment faces the following problems: (1) the computation cannot be fully distributed; although the Euclidean distance can be computed in a distributed fashion, the correlation coefficient requires the mean and standard deviation of the entire sequence, which prevents a purely distributed computation; (2) when a query sequence is long, extensive I/O and network costs cause delays, making the method unusable in interactive query applications. To solve these problems, we propose a method to estimate the correlation coefficient of two sequences stored on HBase and design CCCA-LTS, a fast estimation method for the upper and lower bounds of the correlation coefficient, which refines the estimate iteratively.

Research Contents. This study builds a mass monitoring data management platform to provide a basis for intelligent decision-making and control in enterprise production. The specific work is as follows: first, we designed collection terminals for various types of monitoring data and used the NB-IoT network to transmit the data to the management platform; second, according to the characteristics of the data, we designed the distributed storage and calculation model DSCS-LTS, which realizes the efficient storage of long time series; third, to calculate the correlation between series, we designed the correlation coefficient estimation method CCCA-LTS for long time series data; and fourth, we designed the granularity selection algorithm and query process logic that realize the visualization of the data. The overall system architecture and the data acquisition terminal design are shown in Figures 1 and 2, respectively.
The overall framework of the system is composed of a collection equipment layer, a communication channel layer, and a master station layer. The collection equipment layer realizes the acquisition, processing, and real-time monitoring of monitoring data; the communication channel layer transmits the massive real-time data stream to the master station layer; and the master station layer completes data stream processing, storage, analysis, mining, and visualization. According to the data characteristics of sensors and instruments in the industrial field, we designed the data acquisition and transmission terminal (including the latest NB-IoT module), as shown in Figure 2. The terminal uses an STM32 single-chip microcomputer as the processor, connects field sensors, meters, etc., through the acquisition module, and transmits the collected data to the data management platform through the NB-IoT network.
The master station layer takes the construction of the sensor data management system as its overall goal and conducts research according to actual business needs and data characteristics, covering platform construction, data collection, data analysis, data mining, and other levels.
Definition 1 (time series). A time series can be expressed as S = ((t_1, s_1), (t_2, s_2), …, (t_n, s_n)), where n is the total length of S, t_i is a timestamp, and s_i is the value at t_i (1 ≤ i ≤ n).
Definition 2 (equal-interval time series). An equal-interval time series is a series S whose indicator values of a certain phenomenon are arranged in time order at equal time intervals, denoted S = (s_1, s_2, …, s_n).
For ease of description, the time series presented in this article are all equally spaced time series, and this algorithm is also suitable for nonequally spaced time series.
Definition 4 (Pearson correlation coefficient). Let the time series X = (x_1, x_2, …, x_n) and Y = (y_1, y_2, …, y_n) both have length n. The Pearson correlation coefficient is calculated as ρ(X, Y) = (1/n) Σ_{i=1}^{n} (x_i − μ_X)(y_i − μ_Y) / (σ_X σ_Y), where μ_X and μ_Y are the means of X and Y, and σ_X and σ_Y are their standard deviations, respectively. Problem definition: the time series database SS is stored in a distributed architecture (HDFS or HBase). For two subsequences X and Y in database SS, determine whether the Pearson coefficient satisfies ρ(X_{i,l}, Y_{i,l}) ≥ ε, where i is any integer, l is the subsequence length, and ε is the correlation coefficient threshold. In this study, the query window is set to (i, i + 1, i + 2, …, i + l − 1) [17]. There are two schemes for the HBase-based time series storage method.
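Definition 4 and the threshold test in the problem definition can be sketched directly; the following is a minimal Python check, assuming plain lists and the population standard deviation (function names are illustrative, not from the paper):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences (Definition 4)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    sigma_x = math.sqrt(sum((v - mu_x) ** 2 for v in x) / n)
    sigma_y = math.sqrt(sum((v - mu_y) ** 2 for v in y) / n)
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return cov / (sigma_x * sigma_y)

def correlated(x, y, eps):
    """The decision problem from the text: is rho(X, Y) >= eps?"""
    return pearson(x, y) >= eps

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]   # perfectly linearly related
print(pearson(x, y))        # close to 1.0
```

For a perfectly linear pair the coefficient is 1 up to floating-point error, so `correlated(x, y, 0.9)` holds.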
Scheme 1: Figure 3 shows the first HBase storage scheme, in which the value of any sequence at any timestamp can be accessed directly.
Scheme 2: Figure 4 shows the second HBase storage scheme. It builds on Scheme 1, with each row storing a subsequence of consecutive values over a continuous period.
The algorithm proposed in this paper works with both of the above schemes.
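The two schemes differ mainly in their row-key layout. The following sketch illustrates one plausible key design; the key formats and field names are assumptions for illustration, not the paper's exact HBase schema:

```python
# Hypothetical row-key layouts for the two HBase schemes described above.

def rowkey_scheme1(series_id: str, timestamp: int) -> str:
    """Scheme 1: one value per row -> key = series ID + timestamp.
    Allows random access to any single point of any series."""
    return f"{series_id}#{timestamp:012d}"

def rowkey_scheme2(series_id: str, window_start: int, w: int) -> str:
    """Scheme 2: one row per window of w consecutive values ->
    key = series ID + aligned window-start timestamp + window length."""
    aligned = (window_start // w) * w
    return f"{series_id}#{aligned:012d}#{w}"

print(rowkey_scheme1("sensor42", 1609459200))
print(rowkey_scheme2("sensor42", 1609459207, 60))
```

Zero-padding the timestamp keeps HBase's lexicographic row-key order consistent with time order, so range scans over a time window remain contiguous.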

Distributed Storage and Computing Model DSCS-LTS.
To improve generality, we use the distributed storage and computing model DSCS-LTS (distributed storage and calculation scheme for long time series) for the above schemes. Distributed storage: there are L storage nodes {N_1, N_2, …, N_L} in the distributed environment. Time series data in the database SS are divided into several disjoint subsequences, which are stored across the L nodes. The subsequence database stored by node N_j is denoted SS_j, and a subsequence S of series S_i stored on node N_j is denoted S_i ∈ SS_j. When the length of a subsequence equals 1 (w = 1), each row holds only a single value of a time series, and the scheme degenerates to Scheme 1.
Distributed computing: Figure 5 shows the distributed computing process. There are L computing nodes {N_1, N_2, …, N_L}, all of which have both storage and computing capabilities. N_0 is the query-driving node, which aggregates all partial results.
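The storage-and-compute split described above can be sketched as follows. The round-robin placement and helper names are illustrative assumptions (the paper does not fix a placement policy); the key point is that each node ships only per-subsequence summaries to N_0:

```python
import statistics

def partition(series, w):
    """DSCS-LTS-style partitioning: split a series into disjoint
    subsequences of length w (the last one may be shorter)."""
    return [series[i:i + w] for i in range(0, len(series), w)]

def assign(subseqs, L):
    """Round-robin placement of subsequences onto L storage nodes
    (an assumed policy, for illustration only)."""
    nodes = [[] for _ in range(L)]
    for j, sub in enumerate(subseqs):
        nodes[j % L].append(sub)
    return nodes

def local_stats(node_subseqs):
    """Per-node work: mean and (population) standard deviation of each
    stored subsequence; only these summaries are shipped to N_0."""
    return [(statistics.fmean(s), statistics.pstdev(s), len(s)) for s in node_subseqs]

series = list(range(12))
nodes = assign(partition(series, 4), 2)
summaries = [local_stats(n) for n in nodes]  # what N_0 receives
print(summaries)
```

Each summary is three numbers per subsequence regardless of w, which is the source of the network savings discussed later.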

CCCA-LTS: Calculation Method of the Long Time Series Correlation Coefficient in a Distributed Environment
2.5.1. CCCA-LTS Algorithm. CCCA-LTS (correlation coefficient calculation algorithm for long time series) is a Pearson correlation coefficient estimation method for a distributed environment. In the correlation coefficient estimation algorithm, we assume that X = (x_1, x_2, …, x_n) and Y = (y_1, y_2, …, y_n) are complete subsequences and that the query window is (1, 2, …, n). The CCCA-LTS algorithm can be directly extended to any query window.
As shown in Figure 6 (steps 1 and 2), the sequences X and Y are divided into six subsequences (step 1), which are then distributed to the data nodes (step 2). A simple approach is to transmit all subsequences to N_0 for calculation, but this incurs heavy network transmission. To reduce the cost of network transmission, this study proposes the CCCA-LTS algorithm.
The core of the CCCA-LTS algorithm is illustrated in Figure 6 (step 3). All L nodes calculate the mean and standard deviation of the subsequences stored locally (step 3). For example, in Figure 6, the subsequence X_1 is stored at node N_1, and node N_1 calculates μ_{X_1} and σ_{X_1}. The calculated values are then transmitted to node N_0, which estimates the correlation coefficient from them. Next, the CCCA-LTS algorithm is introduced in detail.

Relationship between Correlation Coefficient and Euclidean Distance. We first provide the estimation formulas based on the upper and lower bounds in Figure 6 (step 3). The normalized sequences of X and Y are X̂ = {x̂_i}_{1≤i≤n} with x̂_i = (x_i − μ_X)/σ_X, and Ŷ = {ŷ_i}_{1≤i≤n} defined analogously. The relationship between the correlation coefficient of X and Y and the Euclidean distance is d²(X̂, Ŷ) = 2n(1 − ρ(X, Y)) (Equation (2)); hence the condition ρ(X, Y) ≥ ε is equivalent to d²(X̂, Ŷ) ≤ 2n(1 − ε), which can be used to estimate the upper and lower bounds. (Figure 6: the segments of X and Y are stored across the cluster; in step 1 the sequences X and Y are divided into segments, in step 2 the segments are distributed, and in step 3 the per-segment statistics are sent to the control node N_0.) The bounds rely on the approximate representation method EAPCA of the time series, proposed in Reference [9]. We represent R(X̂) and R(Ŷ) according to the EAPCA of X̂ and Ŷ and then give estimates of d(X̂, Ŷ). EAPCA first divides a series S into S = (S_1, S_2, …, S_m), where an arbitrary segment is S_j = (s_{r_{j−1}+1}, s_{r_{j−1}+2}, …, s_{r_j}) (1 ≤ j ≤ m, 1 ≤ r_1 < r_2 < … < r_m ≤ n). The EAPCA of S is denoted R(S) = ((μ_{S_1}, σ_{S_1}, r_1), (μ_{S_2}, σ_{S_2}, r_2), …, (μ_{S_m}, σ_{S_m}, r_m)), where μ_{S_j} and σ_{S_j} are the mean and standard deviation of S_j, respectively. We represent X̂ and Ŷ by R(X̂) and R(Ŷ), as follows:

R(X̂) = ((μ_{X̂_1}, σ_{X̂_1}, r_1), (μ_{X̂_2}, σ_{X̂_2}, r_2), …, (μ_{X̂_m}, σ_{X̂_m}, r_m)), and R(Ŷ) is defined analogously. According to [9], we obtain upper and lower bounds on the Euclidean distance between X̂ and Ŷ from these representations. Through the above analysis, we summarize as follows: (i) if sqrt(2n(1 − ε)) is greater than the upper bound of d(X̂, Ŷ), then ρ(X, Y) > ε must hold; (ii) if sqrt(2n(1 − ε)) is less than the lower bound of d(X̂, Ŷ), then ρ(X, Y) < ε must hold; (iii) if neither of the above two situations holds, it is impossible to judge whether ρ(X, Y) > ε is true.
2.5.4. Distributed Estimation Methods. In a distributed environment, as shown in Figure 6 (step 2), we cannot directly calculate the mean and standard deviation of the standardized subsequence X̂; the same holds for the standardized subsequence Ŷ. This paper proposes a new method for estimating the standardized mean and standard deviation in a distributed environment. As shown in Figure 6 (step 3), each node first calculates μ_X^i, σ_X^i, μ_Y^i, and σ_Y^i and then sends the results to N_0. These mean and standard deviation values are used to estimate the upper and lower bounds, respectively. We provide estimation methods for μ_{X̂_i} and σ_{X̂_i}, 1 ≤ i ≤ m; the complete series can be estimated by Equations (5)–(8), where μ_X^i and σ_X^i are the mean and standard deviation of a known subsequence, and μ_X and σ_X are the overall mean and standard deviation, respectively. The proofs of Equations (6)–(9) are given in Reference [2] and will not be repeated here.

CCCA-LTS Algorithm.
This section discusses two problems with the CCCA-LTS algorithm. The first problem is that, in the previous description, we assumed the query window is the entire window, expressed as (1, 2, …, n). In practice, the query window boundary does not necessarily fall on a subsequence boundary and may fall inside a subsequence. In that case, the boundary subsequences must be read in full, and the mean and standard deviation of the portions that lie within the query window must be calculated exactly.
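The boundary handling just described can be sketched as follows: split the query window by the storage grid, answer aligned interior pieces from precomputed summaries, and mark the boundary pieces for raw reads (function and label names are illustrative assumptions):

```python
def window_segments(a, b, w):
    """Split query window [a, b) by the storage grid of width w.
    Interior, grid-aligned segments can be answered from precomputed
    (mean, std) summaries; the boundary pieces must be read raw so
    their exact partial statistics can be computed."""
    segments = []
    pos = a
    while pos < b:
        seg_end = min(((pos // w) + 1) * w, b)
        aligned = (pos % w == 0) and (seg_end - pos == w)
        segments.append((pos, seg_end, "summary" if aligned else "raw"))
        pos = seg_end
    return segments

# window 5..22 over subsequences of length 8: rows [0,8), [8,16), [16,24)
print(window_segments(5, 22, 8))
# -> [(5, 8, 'raw'), (8, 16, 'summary'), (16, 22, 'raw')]
```

At most two pieces per query are "raw", so the extra exact computation stays bounded regardless of the window length.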
The second problem concerns the three possible outcomes. When the third outcome occurs, that is, when it is impossible to judge, we must estimate using finer-grained means and standard deviations, which requires a second or even a third round of calculation. In the first round, the mean and standard deviation are calculated over subsequences of length w and returned to N_0 for a comprehensive calculation and judgment. If the judgment yields the third outcome, a second round is needed: the means and standard deviations of subsequences of length w/2 are calculated and again returned to N_0 for judgment. If the judgment yields the first or second outcome, the algorithm stops; otherwise, a third round is performed over subsequences of length w/4, and so on, until the judgment accuracy meets the needs.
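The multi-round procedure above amounts to a simple driver loop. In this sketch, `estimate_bounds` stands in for the per-node summary exchange and bound computation (its toy implementation here is an assumption for demonstration, not the paper's bound formula):

```python
def ccca_lts_rounds(estimate_bounds, w, eps, min_w=1):
    """Multi-round driver loop from the text: estimate lower and upper
    bounds on rho at subsequence width w; halve w and retry while the
    result is undetermined. estimate_bounds(w) returns (rho_lo, rho_hi)."""
    while True:
        lo, hi = estimate_bounds(w)
        if lo >= eps:
            return True, w      # rho >= eps is guaranteed
        if hi < eps:
            return False, w     # rho < eps is guaranteed
        if w <= min_w:
            return None, w      # cannot refine further
        w //= 2                 # second/third round at finer granularity

# toy bound model: bounds tighten around rho = 0.8 as w shrinks
def toy_bounds(w):
    slack = 0.05 * w
    return 0.8 - slack, 0.8 + slack

print(ccca_lts_rounds(toy_bounds, 8, 0.7))   # -> (True, 2)
```

The loop stops as soon as either bound resolves the threshold test, which is why most queries never read the full sequence.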
CCCA-LTS algorithm analysis: CCCA-LTS is a multiround algorithm. If only one round of calculation is required, CCCA-LTS has the same I/O overhead as the direct calculation method, but because only the means and standard deviations of subsequences are transmitted, the network transmission cost is significantly reduced. If multiple rounds are required, the query window sequence must be read multiple times, causing more I/O overhead.

Visualization Technology. This section focuses on two issues: the granularity selection algorithm for data visualization and the query process technology. The selected granularity has a significant influence on the response time of a query. When the granularity is too fine, a large network transmission is required, the query response time becomes too long, and the large amount of data cannot be held in client memory. When the granularity is too coarse, the amount of data transmitted to the user is small but cannot accurately represent the trend of the original time series.
In the query process, the user specifies a time interval and a data channel and, in general, chooses an appropriate statistic; if none is explicitly given, the median is used by default. The statistics currently supported include the mean and median. When a user queries data for a certain period, he or she can select a period of interest in the front end to view more detailed trends and other information for that period.
2.6.1. Granularity Algorithm. The required granularity is determined according to the amount of data. If the granularity is too coarse, the query response time is short and the amount of data returned to the client is small, but the error increases. If the granularity is too fine, the amount of data returned to the client is large; the error is small, but the query response time is long.
To improve query response time, we designed a historical data statistics table, which contains statistics at different granularities, such as the maximum and minimum values of a time series within one hour. The available granularities are day, hour, minute, and second. The challenge is to represent the trend of the original time series well using the historical statistics table while avoiding transmitting too much data to the client. Our solution is to calculate, from the frequency of the time series, the amount of data each granularity would produce and to sort these amounts in descending order; following this order, we find the first granularity whose amount of data is smaller than the maximum the client can display. This granularity both represents the original trend well and avoids an excessive client data volume. The granularity is thus jointly determined by the maximum amount of data the client can display, the amount of data the user queries, and the granularities in the historical data statistics table. The procedure is illustrated in Algorithm 1: (1) calculate the amount of data at each granularity for the queried interval; (2) sort these amounts in descending order; (3) select the first granularity whose amount is smaller than the maximum the client can display.
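The three steps above can be sketched compactly. The granularity durations below are the standard day/hour/minute/second values and a 1 Hz source is assumed, so the point count is simply the window span divided by the granularity width:

```python
# The four granularities named in the text, with their widths in seconds.
GRANULARITIES = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def select_granularity(start, end, max_points):
    """Algorithm 1 sketch: (1) compute the point count each granularity
    would return for [start, end); (2) sort counts in descending order;
    (3) pick the first granularity whose count fits on the client."""
    span = end - start
    counts = sorted(
        ((span // secs, name) for name, secs in GRANULARITIES.items()),
        reverse=True,
    )
    for count, name in counts:
        if count <= max_points:
            return name, count
    # even the coarsest granularity exceeds the client limit
    return counts[-1][1], counts[-1][0]

# one day of 1 Hz data, client can display at most 2000 points
print(select_granularity(0, 86400, 2000))  # -> ('minute', 1440)
```

For a one-day window with a 2000-point client limit, second granularity (86400 points) is rejected and minute granularity (1440 points) is the first that fits, matching the descending-order selection rule.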

Query Process.
When the user provides a channel and a time interval, he or she then selects a statistic; if none is given, the median is the default. The statistics currently supported are the mean, median, maximum, minimum, and variance. The query process is illustrated in Figure 7. DisplayHistoryAction is the front-end Servlet, which accepts user requests and passes the request parameters to the background. HistoryQuery is the core processing class in the background and is responsible for interacting with the HistoryDataHandler class in the data layer. HistoryDataHandler encapsulates HBase's API for reading and writing data and is mainly responsible for interacting with HBase. When the user is particularly interested in the data of a certain period, he or she can select that period in the front end to view more detailed trend information.
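The layered flow just described (Servlet → HistoryQuery → HistoryDataHandler → storage) can be mirrored in a small Python sketch. The real system is a Java Servlet stack over HBase; here the class names echo the text, but the signatures and the in-memory store are assumptions for illustration:

```python
import statistics

class HistoryDataHandler:
    """Data layer: stands in for the class that wraps HBase reads.
    Here it just serves an in-memory series keyed by channel."""
    def __init__(self, store):
        self.store = store
    def scan(self, channel, start, end):
        return [v for t, v in self.store.get(channel, []) if start <= t < end]

class HistoryQuery:
    """Core background class: applies the requested statistic.
    The median is the default when the user does not choose one."""
    STATS = {"mean": statistics.fmean, "median": statistics.median,
             "max": max, "min": min, "variance": statistics.pvariance}
    def __init__(self, handler):
        self.handler = handler
    def run(self, channel, start, end, stat="median"):
        values = self.handler.scan(channel, start, end)
        return self.STATS[stat](values)

store = {"pressure": [(t, float(t % 5)) for t in range(100)]}
query = HistoryQuery(HistoryDataHandler(store))
print(query.run("pressure", 0, 100))          # median by default
print(query.run("pressure", 0, 100, "mean"))
```

Keeping the statistic table in the query layer, rather than the data layer, matches the text's separation of concerns: the handler only reads, and the query class only computes.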

Experimental Data.
We use experimental results to illustrate the effectiveness of the distributed storage and calculation model DSCS-LTS designed in this study and the long time series correlation coefficient calculation method CCCA-LTS.
Algorithm 1: Select granularity (X, start time, end time, M). Input: time series ID X, the start and end time, and the maximum amount of data M the client can display. Output: the selected granularity. (1) Calculate the amount of data at each granularity for time series X in the historical statistics table from the start time to the end time; (2) sort these amounts in descending order; (3) select the first granularity whose amount is smaller than the maximum amount of data the client can display.
As shown in Figure 8, although the CCCA-LTS algorithm runs more iterations, it is more efficient than the brute-force algorithm. As shown in Figure 9, because the brute-force algorithm must read all the data for its calculation, changing the threshold does not affect its running time; the CCCA-LTS algorithm, despite its many iterations, does not need to read the entire sequence: it uses segment summaries to estimate the correlation coefficient of the entire sequence iteratively, which makes it more efficient than the brute-force algorithm.

Conclusion
Aiming at a massive monitoring data management platform, this study investigated the acquisition, storage, and analysis of monitoring big data. We designed a data collection terminal, collected sensor data into a distributed storage platform through the NB-IoT network, and proposed the storage and calculation method DSCS-LTS. For the calculation of the Pearson correlation coefficient of long time series, the algorithm CCCA-LTS was designed, which effectively improves the efficiency of similarity queries, and granularity selection and query process algorithms were designed to realize the visualization system. The algorithms were tested during the development of the core platform, and the results show that their efficiency is better than that of traditional methods. The platform provides a reference for the intelligent and efficient management of massive monitoring time series data.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no competing interests.