Industrial Efficiency Algorithm Based on Spatio-Temporal-Data-Driven

The stochastic frontier model is an important and effective method for calculating industry efficiency. However, when dealing with spatio-temporal industry data, it is difficult to calculate industrial production efficiency accurately because of spatial correlation and time-lag effects. If traditional spatial statistical methods are used, the way the spatial weight matrix is specified is often questioned. To address these problems, one possible approach is to design a spatial data mining process based on stochastic frontier analysis. First, the stochastic frontier model is extended to analyze spatio-temporal data: to measure technical efficiency accurately under dual correlation in time and space, a spatio-temporal stochastic frontier model is proposed. Based on the generalized method of moments, an estimation method for the spatio-temporal stochastic frontier model is designed, and the consistency of the estimators is proved. To ensure that the most appropriate spatial weight matrix is selected during model construction, K-fold crossvalidation is adopted to evaluate predictive performance in a data-driven manner. This set of spatio-temporal data mining methods is then used to measure the technical efficiency of high-tech industries in the provinces of China.


Introduction
Stochastic frontier analysis (SFA) is an important method to measure technical efficiency and calculate total factor productivity. The whole process is divided into two steps: the first step is the model estimation process, which can be regarded as a supervised learning process; the second step is to use the estimated model to calculate the technical efficiency, which can be regarded as an unsupervised learning process.
From the perspective of machine learning, supervised processes have three main objectives: (a) feature selection and reduction of the dimension of feature variables; (b) selecting the optimal one from multiple classifiers or prediction models; (c) model evaluation, which estimates the prediction error of the selected classifier or prediction model on the new data.
The traditional stochastic frontier analysis method has the following defects: (a) it is not suited to the special structure of spatial or spatio-temporal data; (b) the modeling process lacks variety, since the traditional analysis process is knowledge-driven and relies entirely on a single theoretical model for estimation and testing. These two characteristics lead to large deviations when the traditional stochastic frontier model is used to analyze spatio-temporal data, and they make it impossible to measure production efficiency accurately in the presence of spatial relationships. To solve these two problems, this study considers two improvements to the industrial efficiency calculation process based on temporal and spatial data: (1) improve the existing stochastic frontier model and make it suitable for spatial or spatio-temporal data; (2) turn the modeling process into a spatial data mining process. In view of the unique structure of spatio-temporal data, a more suitable crossvalidation method is proposed for selecting the prediction model.
Stochastic frontier analysis (SFA) was successively proposed by Aigner et al. (1977) [1], Meeusen and Broeck (1977) [2], and Battese and Corra (1977) [3]. Over the past 40 years, its theoretical system and methods have been continuously expanded and innovated; it is widely used to measure the operating efficiency of different industries.
The development of spatial statistics provides a theoretical basis for studying spatial interactions in stochastic frontier models. Druska and Horrace (2004) [4] first applied the method of spatial econometrics to the analytical framework of stochastic frontier model and started the research of spatial stochastic frontier model. Affuso (2010) [5] established the spatial stochastic frontier model and gave the maximum likelihood estimation in the empirical study. Tonini and Pede (2011) [6] applied maximum entropy method to parameter estimation of spatial stochastic frontier model. Vidolia et al. (2016) [7], Tsionas and Michaelides (2016) [8], Carvalho (2018) [9], and Adetutu et al. (2015) [10] consider SF models with local spatial dependence. Jin and Lee (2020) [11] proved the asymptotic properties of a maximum likelihood estimator of a spatial autoregressive stochastic frontier model. Kutlu et al. (2020) [12] proposed a spatial autoregressive stochastic frontier model, which allows for the endogeneity in both the frontier and environmental variables, and discussed a single-stage control function approach to estimate the parameters.
Because spatial stochastic frontier analysis methods account for spatial correlation, they give more accurate results in efficiency analysis of data with spatial spillover effects and have therefore been more widely used in recent years. Bergantino et al. (2020) [13] analysed the potential impact of airport competition on technical efficiency by applying the spatial stochastic frontier. Graaff (2020) [14] used a spatial stochastic frontier model to estimate spatially correlated technical efficiencies within a European regional production function context. Some studies have considered panel spatial stochastic frontier models, for example, Druska and Horrace (2004) [4], Tonini and Pede (2011) [6], and Lin Jia-Xian (2014) [15]. These studies all focus on the static panel spatial stochastic frontier model, which uses the two-dimensional information in panel data; formally, the spatial lag of the explained variable and the spatial lag of the error are used to capture the spatial correlation of the production units. The time lag term is not included, which means that the model cannot fit well when the research problem exhibits significant inertia. In input-output analysis, current behavior depends largely on past behavior; for example, the adjustment of the capital stock is often influenced by previous capital. Therefore, a dynamic stochastic frontier model should be established that describes the double lag effect of space and time, so as to reflect the relationships between economic variables more objectively. The spatial weight matrix in spatial statistics is often considered to be "subjective." Moreover, because the spatial weight matrix can be specified in many ways, different choices may lead to different estimation results. In addition, no unified principle for selecting the spatial weight matrix has been established.
Based on the above three points, the spatial weight matrix is often questioned. But in the era of "big data," such skepticism may end [16]. This paper proposes the spatiotemporal stochastic frontier model; considering that the model may be endogenous in both the time and space dimensions, a generalized method of moments (GMM) estimation process is designed to estimate it. When Druska and Horrace (2004) [4] studied the static panel spatial stochastic frontier model, they proposed a generalized moment estimation process for spatial error correlation by referring to Kelejian and Prucha (1999) [17]. In this paper, the approach of Druska and Horrace (2004) [4] is used to handle the spatial autocorrelation of the model's error term, which differs from that of the stochastic frontier model. Following the method of Kapoor et al. (2007) [18], the compound error term is processed, and moment conditions are constructed to estimate the distribution parameters of the error term. Jacobs et al. (2009) [19] is used as a reference to construct the moment conditions, and Anselin (1988) [20] is used as a reference for the selection of instrumental variables to obtain the generalized moment estimator. Furthermore, the consistency of the resulting structural parameter estimators is proved by using the consistency theorem for extremum estimators and the uniform law of large numbers (ULLN). To solve the problem of selecting the spatial weight matrix, we consider a crossvalidation method suitable for spatio-temporal data. Machine learning already provides a range of methods for dimensionality reduction, feature selection, and model generalization. The earliest crossvalidation method was called hold-out; it relied on only one partition of the data, with no crossover process, so it was also called the validation method [21].
Noting that the hold-out method relies on a single partition of the data and is easily affected by chance, Geisser (2010) [22] proposed a crossvalidation method that averages multiple hold-out estimates, realizing the transition from validation estimation to crossvalidation estimation. In order to reduce the number of data partitions in crossvalidation, Shao (1993) [23] proposed leave-P-out crossvalidation (LPOCV), in which the number of test samples in each data partition is the same. In the special case P = 1, the method reduces to leave-one-out crossvalidation (LOOCV). LOOCV is the simplest and most widely used crossvalidation in traditional analysis. In contrast to LPOCV, which considers all data partitions, Geisser (2010) also proposed a crossvalidation based on only part of the data partitions, which is called the RLT method. K-fold crossvalidation was proposed as an alternative to LOOCV, which has a large computational overhead; it relies on a basic partition of the data into K folds, each of which contains N/K observations. With limited samples, K-fold crossvalidation is the simplest and most widely used method of generalization error estimation. Each of these crossvalidation methods relies on the randomness of the validation set to ensure the generalization ability of the tested model. However, for special panel data such as spatio-temporal data, there is usually an internal connection between spatial individuals, and the data as a whole also tend to have a time trend. Previous crossvalidation methods do not take this into account and may therefore break the inherent regularity of spatio-temporal data. Based on the above considerations, this paper designs a crossvalidation scheme suitable for spatio-temporal data. It is used to select stochastic frontier models, especially models with different weight matrices. Finally, the technical efficiency of China's high-tech industry is analyzed by establishing a spatiotemporal stochastic frontier model.

Wireless Communications and Mobile Computing

Methodology
Previous studies on panel spatial stochastic frontier models mainly involved static panel spatial stochastic frontier models. Spatial lag effect is considered in the process of model building, but the influence of time lag effect is not included. If the time lag term and time-spatial lag term are added into the model, this kind of model can be called spatiotemporal stochastic frontier model. Obviously, the time-space double lag effect will produce stronger endogeneity, and new estimation methods should be considered to solve it.
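For reference, a form of Equation (1) that is consistent with the symbols defined in this section can be written as below. This is an assumed reconstruction from those definitions, not a quotation: it takes the explanatory variables to be $W Y_t$, $W Y_{t-1}$, $Y_{t-1}$, and $X_t$, and it assumes a production frontier, so that the inefficiency term $u$ enters the composite error with a negative sign.

```latex
\begin{aligned}
Y_t &= \lambda_1 W Y_t + \lambda_2 W Y_{t-1} + \gamma Y_{t-1} + X_t B + E_t, \\
E_t &= \rho M E_t + \varepsilon_t, \\
\varepsilon_t &= v_t - u, \qquad t = 1, \cdots, T,
\end{aligned}
```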
where $Y_t$, $E_t$, $\varepsilon_t$, and $v_t$ are $N$-dimensional vectors, whose components at time $t = 1, \cdots, T$ are given by $Y_t = [y_{1t}, \cdots, y_{Nt}]'$, $E_t = [E_{1t}, \cdots, E_{Nt}]'$, $\varepsilon_t = [\varepsilon_{1t}, \cdots, \varepsilon_{Nt}]'$, and $v_t = [v_{1t}, \cdots, v_{Nt}]'$. The vector $Y_t$ consists of the outputs of the $N$ production units, $E_t$ and $\varepsilon_t$ are the composite error vectors corresponding to $Y_t$, $v_t$ is the heterogeneous error vector, and $u = [u_1, \cdots, u_N]'$ is the vector of time-invariant inefficiency terms. This kind of setting is appropriate when the time span is not large. As $u$ is time invariant, it can be regarded as the individual effect, and thus, this paper primarily considers $u$ as a fixed effect. $X_t$ is an $N \times K$ matrix consisting of the $K$ exogenous input variables of the $N$ production units at time $t$. $W$ and $M$ are $N \times N$ spatial weight matrices, which are usually assumed to be different. If $W = M$, $\lambda_1$ and $\rho$ cannot be distinguished by the maximum likelihood method, although they can be effectively distinguished by the GMM method [19]. The variables are stacked by cross-section and time series in the following matrix form:

where $\otimes$ represents the Kronecker product of matrices, $I_T$ and $I_N$ are, respectively, the identity matrices of orders $T$ and $N$, and $e_T$ is a $T$-dimensional column vector with all entries equal to 1. The parameter vector of the model to be estimated is $(\lambda_1, \lambda_2, \gamma, B, \rho, \sigma_v^2, \sigma_u^2)$, and its dimension is $(K + 6)$, where $B$ is the vector of parameters corresponding to the $K$ exogenous explanatory variables in $X_t$. $B$ and $\lambda_1$, $\lambda_2$, $\gamma$ together constitute the structural parameters, and $\rho$, $\sigma_v^2$, $\sigma_u^2$ are the error term parameters.
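To make the roles of $W$, $M$, and the Kronecker stacking concrete, the following is a minimal simulation sketch. It assumes the period-by-period form $Y_t = \lambda_1 W Y_t + \lambda_2 W Y_{t-1} + \gamma Y_{t-1} + X_t B + E_t$ with $E_t = \rho M E_t + v_t - u$, a half-normal $u$, and a zero initial condition; the function names `simulate_panel` and `stack` are illustrative and do not come from the paper.

```python
import numpy as np

def simulate_panel(W, M, X, lam1, lam2, gamma, beta, rho,
                   sigma_v=1.0, sigma_u=1.0, seed=0):
    """Simulate Y_1..Y_T from the assumed spatio-temporal frontier model.

    X has shape (T, N, K); W and M are (N, N) spatial weight matrices.
    Returns Y and E with shape (T + 1, N), where row 0 is the assumed
    initial condition (zeros), and the inefficiency vector u of length N.
    """
    rng = np.random.default_rng(seed)
    T, N, _ = X.shape
    I_N = np.eye(N)
    A_inv = np.linalg.inv(I_N - lam1 * W)  # resolves the contemporaneous spatial lag
    B_inv = np.linalg.inv(I_N - rho * M)   # resolves the spatial error process
    u = np.abs(rng.normal(0.0, sigma_u, size=N))  # time-invariant, half-normal (assumed)
    Y = np.zeros((T + 1, N))
    E = np.zeros((T + 1, N))
    for t in range(1, T + 1):
        v_t = rng.normal(0.0, sigma_v, size=N)
        E[t] = B_inv @ (v_t - u)           # E_t = rho * M E_t + (v_t - u)
        rhs = lam2 * (W @ Y[t - 1]) + gamma * Y[t - 1] + X[t - 1] @ beta + E[t]
        Y[t] = A_inv @ rhs
    return Y, E, u

def stack(W, T):
    """Return I_T (Kronecker) W, which applies W to every period of a
    vector stacked period by period."""
    return np.kron(np.eye(T), W)
```

With the output stacked as `Y[1:].reshape(-1)`, multiplying by `stack(W, T)` gives the spatial lag of every period at once, which is the purpose of the Kronecker notation in the stacked form.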

Model
Assumption. The assumptions of the spatiotemporal stochastic frontier model are the following: and $v$ and $u$ have finite fourth moments. Assumption 1 is a classic assumption of the spatial error autocorrelation model. By Assumption 2, the inefficiency term of the same individual remains constant over time. When GMM is used to estimate the structural parameters of the model, the distribution of the error terms can be ignored; nevertheless, in order to improve computational efficiency, a half-normal distribution is usually assumed for the inefficiency term. Assumption 3 ensures the boundedness of the variance of the error term in this model, which is an important condition for the consistency of the estimator. Assumption 4 is a classical assumption commonly used in traditional regression analysis, and the moment conditions in the generalized moment estimation of this model are set according to it. Assumption 5 is set according to the spatial weight matrix of this model and the properties of the spatial autoregressive coefficient and the spatio-temporal autoregressive coefficient; it also ensures the consistency of the parameter estimators.

Parameter
where can be obtained from the assumptions of the regression model, and $\mathrm{COV}\{[W A^{-1} E_t], E_t\}$ is a quadratic form. In spatial econometrics, $W$ is usually not a zero matrix, and so $W A^{-1}$ is not a zero matrix. Because the expected value of the compound error term cannot be 0, the quadratic form $\mathrm{COV}\{[W A^{-1} E_t], E_t\}$ is almost never equal to 0 (see Appendix A for proof). Therefore, dynamic panel spatial stochastic frontier models have an endogeneity problem that makes traditional estimators inconsistent, and GMM is considered here as a way to address it.
The parameter vector to be estimated in the model is $(\lambda_1, \lambda_2, \gamma, B, \rho, \sigma_v^2, \sigma_u^2)$, where $\lambda_1$, $\lambda_2$, $\gamma$, and $B$ are the structural parameters of the model, and $\rho$, $\sigma_v^2$, and $\sigma_u^2$ are the error distribution parameters of the model. The estimation of the model is completed in three steps:
Step 1. Use the GMM to estimate the structural parameters $(\lambda_1, \lambda_2, \gamma, B)$ of the model.
Step 2. Compute a moment estimate of the parameters $(\rho, \sigma_v^2, \sigma_u^2)$ contained in the error term.
Step 3. Use the estimator obtained in Step 2 to correct the result of Step 1.

Estimation of Structural Parameter $(\lambda_1, \lambda_2, \gamma, B)$
(1) Difference Model and Level Model. Anderson and Hsiao (1981) [24] proposed using $y_{i,t-2}$ as the instrumental variable (IV) for $\Delta y_{i,t-1}$ and then carrying out 2SLS estimation. This estimator is called the "Anderson-Hsiao estimator." By the same logic, lags of higher order are also valid IVs. Arellano and Bond (1991) [25] used all possible lagged variables as IVs (so that the number of IVs exceeds the number of endogenous variables) to conduct GMM estimation. This GMM estimator is called the Arellano-Bond estimator or difference GMM. The disadvantage of difference GMM is that any variable that does not change over time is eliminated, so its coefficient cannot be estimated. Moreover, if the series $\{y_{i,t}\}$ is strongly persistent, that is, the first-order autoregressive coefficient is close to 1, then the correlation between the lagged levels and the differenced variable may be very weak, which leads to the problem of weak instrumental variables. In order to solve these two problems, Arellano and Bover (1995) [26] returned to the level equation and used $\{\Delta y_{i,t-1}, \Delta y_{i,t-2}, \cdots\}$ as IVs for GMM estimation of the level equation, which was called "level GMM." Blundell and Bond (1998) [27] combined difference GMM with level GMM and estimated the difference equation and the level equation as one equation system by GMM, which was called "system GMM." The advantage of system GMM is that it can improve the efficiency of estimation (its small-sample properties are better), and it can estimate the coefficients of variables that do not change over time (because the system contains the level equation). In summary, to solve the endogeneity problem of the dynamic panel data model, Arellano and Bond (1991) [25], Arellano and Bover (1995) [26], and Blundell and Bond (1998) [27] worked from the perspective of the difference model and the level model, respectively, and selected different instrumental variables.
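The Anderson-Hsiao idea described above can be illustrated on a plain dynamic panel without spatial terms. The sketch below is a simplified stand-in, not the spatial estimator of this paper: it assumes the model $y_{i,t} = \gamma y_{i,t-1} + a_i + e_{i,t}$, first-differences it, and uses the single instrument $y_{i,t-2}$, so that 2SLS reduces to a ratio of cross moments. The names `simulate_ar1_panel` and `anderson_hsiao` are illustrative.

```python
import numpy as np

def simulate_ar1_panel(N, T, gamma, burn_in=50, seed=0):
    """Simulate y_it = gamma * y_{i,t-1} + a_i + e_it and return the last T periods."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=N)                      # individual effects
    y = np.empty((N, T + burn_in))
    y[:, 0] = a / (1.0 - gamma)                 # start at the individual long-run mean
    for t in range(1, T + burn_in):
        y[:, t] = gamma * y[:, t - 1] + a + rng.normal(size=N)
    return y[:, -T:]

def anderson_hsiao(y):
    """Just-identified IV estimate of gamma from an (N, T) panel, T >= 3.

    The differenced equation is  dy_t = gamma * dy_{t-1} + de_t,
    and y_{t-2} instruments dy_{t-1}.
    """
    dy = np.diff(y, axis=1)     # dy[:, j] = y[:, j + 1] - y[:, j]
    dep = dy[:, 1:]             # dy_t
    endog = dy[:, :-1]          # dy_{t-1}, correlated with de_t
    instr = y[:, :-2]           # y_{t-2}, uncorrelated with de_t
    return float((instr * dep).sum() / (instr * endog).sum())
```

Differencing removes the individual effect $a_i$, which is also why a time-invariant regressor would drop out of the difference equation, as noted above.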
The corresponding difference model and level model of Equation (1) can be written compactly as $\Delta Y_t = \Delta Z_t \theta + \Delta E_t$ (4) and $Y_t = Z_t \theta + E_t$ (5). Equations (4) and (5) can also be collectively called the spatial system model, where Equation (4) is the difference model and Equation (5) is the level model; $Z_t = (W Y_t, W Y_{t-1}, Y_{t-1}, X_t)$ is the matrix composed of all explanatory variables, and $\theta = [\lambda_1, \lambda_2, \gamma, B']'$ is the vector composed of the structural parameters. The expansion of Equation (5) is Equation (1); the expansion of Equation (4) can be expressed as $\Delta Y_t = \lambda_1 W \Delta Y_t + \lambda_2 W \Delta Y_{t-1} + \gamma \Delta Y_{t-1} + \Delta X_t B + \Delta E_t$.

(2) Moment Condition and Instrumental Variable. Since $\Delta X_t$ is a strictly exogenous variable, it is correlated neither with the compound error term $\Delta E_t$ nor with $E_t$. The moment conditions for identifying $B$ in the difference model and the level model are as follows:

The moment conditions for identifying $\lambda_2$ and $\gamma$ in the two models are constructed as follows. Since the spatial lag term and the time lag term of the dependent variable $\Delta Y_t$ are both endogenous, it is necessary to find a set of instrumental variables that are related to the time lag, the space lag, and the exogenous explanatory variables but not to the difference error term $\Delta E_t$ $(t = 3, \cdots, T)$. Arellano and Bond (1991) [25] use all possible lagged levels $(Y_{t-2}, \cdots, Y_1)$ of $Y_t$ as instrumental variables for the time-lagged first-difference term $\Delta Y_{t-1}$ of the dependent variable. These instrumental variables are related to $\Delta Y_{t-1}$ but not to $\Delta E_t$. The moment conditions corresponding to the difference model and the level model are as follows:

The moment conditions for identifying $\lambda_1$ in the two models are as follows.
Construct a spatial lag term $W Y_t$ as follows: Jacobs et al. (2009) [19] provided a method of finding instrumental variables, namely the time lags of the spatially lagged dependent variable, and also proved that the moment conditions obtained by this method are as valid as Equation (8). So, corresponding to the difference model and the level model, the following moment conditions can be listed: where $l$ is the power to which the matrix $W$ is raised and the integer $L$ is the maximum order of spatial lag that can be used as an instrumental variable.
In addition, based on the method provided by Kelejian and Robinson (1993) [28], formula (1) shows that WY t depends on WX t , so the instrumental variable WΔX t can be selected by the first-order difference method for WΔY t .
Since $\Delta X_t$ is a strictly exogenous variable, it is not related to the compound error term $\Delta E_t$, so corresponding to the difference model and the level model, the instrumental variables satisfy the following moment conditions:

(3) GMM Estimation. To estimate the parameters of the spatio-temporal stochastic frontier model, we construct a spatial system GMM estimator analogous to the system GMM method for the general dynamic panel model. Unlike in system GMM, the IVs of spatial system GMM are composed of both time-lagged and spatially lagged variables.

For each period $t$, $J \geq K + 2$ moment conditions can be given. The moment conditions corresponding to the difference model and the level model can be abbreviated as The matrices $H_{N,ABt}$ and $H_{N,Lt}$ are expressed as follows That is, $H_{N,ABt}$ and $H_{N,Lt}$ are matrices with instrumental variables as column vectors, and the subscript $N$ means that the matrix depends on the number of individual units. Let $H_{N,AB}$ and $H_{N,L}$ be block diagonal matrices composed of the blocks $H_{N,ABt}$ and $H_{N,Lt}$, respectively. In order to define the GMM (Spatial Blundell Bond, SBB) estimator of the spatial dynamic panel stochastic frontier model, the difference variables and level variables are combined to define the matrix as follows: , where $H_{N,AB}$ is the instrumental matrix of spatial difference GMM estimation, and $H_{N,L}$ is the instrumental matrix of spatial level GMM estimation. The weight matrix is The diagonal of this matrix is composed of the weight matrix defined in the process of spatial difference GMM estimation and an identity matrix, where G N, This weight matrix was proposed by Arellano and Bond (1991) [25] and is used to further define the weight matrix: Through the above process, combining the spatial difference equation with the spatial level equation, we obtain the spatial system GMM estimation process. The objective function of generalized moment estimation for the spatial system is as follows: The one-stage SBB estimator of $\theta$ can be obtained by minimizing Equation (20). The resulting estimator, Equation (21), can also be called the spatial system GMM estimator.
(4) Improvement of Instrumental Variable Matrix. The instrumental variable matrix constructed by the above method has a high dimension, and the number of its columns grows rapidly as the values of $T$ and $L$ increase. In order to reduce the dimension of the instrumental variable matrix and avoid overfitting the instrumental variables, we can simplify it by using the "condensing instrumental variable matrix" proposed by Beck and Levine (2004) [29].
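The effect of condensing can be seen on the time-lag instruments alone. The sketch below builds, for a single individual and without any spatial lags, the full Arellano-Bond style block (one separate column for every period and lag) and a collapsed version (one column per lag distance). The collapsed layout shown is one common convention and is assumed here for illustration; the text does not confirm that it matches the condensed matrix of [29] exactly. Both function names are illustrative.

```python
import numpy as np

def ab_instruments_full(y_i):
    """Full instrument block for one unit with observations y_1..y_T.

    Row r corresponds to the difference equation at t = r + 3 and holds
    y_1..y_{t-2} in its own set of columns, so the number of columns is
    1 + 2 + ... + (T - 2) = (T - 2)(T - 1) / 2.
    """
    T = len(y_i)
    rows = T - 2
    H = np.zeros((rows, rows * (rows + 1) // 2))
    start = 0
    for r in range(rows):
        n_lags = r + 1
        H[r, start:start + n_lags] = y_i[:n_lags]
        start += n_lags
    return H

def ab_instruments_collapsed(y_i):
    """Collapsed block: column j holds y_{t-2-j}, or 0 when that lag is unavailable.

    The number of columns is T - 2 instead of (T - 2)(T - 1) / 2.
    """
    T = len(y_i)
    rows = T - 2
    H = np.zeros((rows, rows))
    for r in range(rows):
        H[r, :r + 1] = y_i[:r + 1][::-1]   # y_{t-2}, y_{t-3}, ..., y_1
    return H
```

For $T = 18$, the full block has $16 \cdot 17 / 2 = 136$ columns per individual while the collapsed block has 16, which illustrates why condensing helps when $N$ is small relative to the instrument count.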
We still set $s = 2$ and $L = 1$ in the instrumental variable matrix of the spatial system GMM, and the corresponding condensed instrumental variable matrix is where $H_i^L$ $(i = 1, \cdots, N)$ is the instrumental variable submatrix of the level model corresponding to the $i$th individual.

Once the estimator of the structural parameters of model (1) is obtained in the first stage, the model residual $\hat{E}_t = Y_t - Z_t \hat{\theta}_N$ can be computed, in which $Z_t = (W Y_t, W Y_{t-1}, Y_{t-1}, X_t)$ is the matrix composed of all explanatory variables in model (1). A consistent GMM estimate can then be obtained by using the residual $\hat{E}_t$ and modifying the moment conditions proposed by Kapoor et al. (2007) [18]. The specific process is as follows.
According to the assumptions of the model, the individual effect of the model is the inefficiency term. According to the covariance structure of the compound error term, it can be known that Introduce the transformation matrices: where $I_T$ and $I_N$ are the identity matrices of order $T$ and order $N$, and $J_T = e_T e_T'$ is the $T \times T$ matrix whose elements are all 1. Properties of transformation . From property (ii), we can further deduce the special property (iv) $Q_0 \varepsilon = Q_0 v$ of the matrix $Q_0$ in this paper, where $\varepsilon$ and $v$ are the corresponding error terms in model (1). Then, Based on the above transformation and referring to the first three of the six moment conditions given by Kapoor et al. (2007) [18] and related properties, the following three moment conditions are given in this paper: To further integrate the above moment conditions, we can get that Substitute Equations (26) and (27) into Equation (29) to obtain that The residual $\hat{E}_t = Y_t - Z_t \hat{\theta}_N$ estimated in the first stage is substituted for the error terms in Equation (30) to obtain the sample moment equation. In the sample moment equation, the estimates $\hat{\rho}$ and $\hat{\sigma}_v^2$ of $\rho$ and $\sigma_v^2$ can be solved from the following objective function: where

2.3.2. Estimation of $\sigma_u^2$. The fourth moment condition given by Kapoor et al. (2007) [18] is where $\sigma_1^2 = T \sigma_u^2 + \sigma_v^2$. However, given the characteristics of the stochastic frontier model, the compound error term $\varepsilon$ follows an asymmetric distribution with nonzero expectation, so the moment condition (33) cannot be applied directly, and the following formula can be proved:
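The property $Q_0 \varepsilon = Q_0 v$ used above can be checked numerically. The sketch assumes the transformation matrices take the form used by Kapoor et al. (2007), $Q_0 = (I_T - J_T / T) \otimes I_N$ and $Q_1 = (J_T / T) \otimes I_N$, and that the stacked composite error is $\varepsilon = v - (e_T \otimes I_N) u$; the function names are illustrative.

```python
import numpy as np

def within_between(T, N):
    """Return (Q0, Q1) for data stacked period by period (T blocks of length N)."""
    J_bar = np.ones((T, T)) / T                    # J_T / T
    Q0 = np.kron(np.eye(T) - J_bar, np.eye(N))     # removes each unit's time mean
    Q1 = np.kron(J_bar, np.eye(N))                 # replaces values by the time mean
    return Q0, Q1

def composite_error(v, u, T):
    """Stacked composite error v - (e_T kron I_N) u, with u repeated in every period."""
    return v - np.kron(np.ones(T), u)

# Illustrative check with arbitrary draws.
T, N = 4, 3
rng = np.random.default_rng(0)
v = rng.normal(size=T * N)
u = np.abs(rng.normal(size=N))
Q0, Q1 = within_between(T, N)
eps = composite_error(v, u, T)
same_after_Q0 = np.allclose(Q0 @ eps, Q0 @ v)      # time-invariant u is wiped out by Q0
```

Because $Q_0$ annihilates anything that is constant over time, moment conditions built on $Q_0$ depend only on $v$, which is why $\rho$ and $\sigma_v^2$ can be estimated before $\sigma_u^2$.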

Spatial Correction of Estimators.
Although it can be proved that the estimator (21) remains consistent even if the model has spatial error autocorrelation, it does not account for the spatial dependence of the error term, and its variance is relatively large. After the consistent estimator of $\rho$ is obtained from Equation (31), a corrected estimator can be obtained by a correcting transformation. Following the spatial correction method given by Jacobs et al. (2009) [19], the estimator obtained in the first step is corrected.
The estimator $\hat{\rho}$ obtained from Equation (31) is used to construct the matrix $I - \hat{\rho} M$, which left-multiplies the explained variables, the explanatory variables, and the instrumental variable matrices of the difference GMM and system GMM estimation. The corresponding explanatory variable set $Z_t = [W Y_t, W Y_{t-1}, Y_{t-1}, X_t]$ is corrected as The instrumental variable matrix and weight matrix corresponding to the GMM estimation of the spatial system are corrected as follows: Then, the corrected system GMM estimator is

The estimators given by Equations (21) and (27) are consistent.
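The correcting transformation amounts to pre-multiplying every $N$-row block by $I - \hat{\rho} M$, in the spirit of a spatial Cochrane-Orcutt filter. A minimal sketch follows; the helper name `spatial_filter` is illustrative, and the blocks are assumed to be arrays with $N$ rows (a vector such as $Y_t$ or a matrix such as $Z_t$).

```python
import numpy as np

def spatial_filter(rho_hat, M, blocks):
    """Pre-multiply each block by (I - rho_hat * M).

    blocks is a list of arrays of shape (N,) or (N, k); a list of the
    filtered arrays is returned in the same order.
    """
    N = M.shape[0]
    F = np.eye(N) - rho_hat * M
    return [F @ b for b in blocks]
```

If $E_t = \rho M E_t + \varepsilon_t$ and $\hat{\rho}$ is close to $\rho$, the filtered error $(I - \hat{\rho} M) E_t$ is close to $\varepsilon_t$, so the transformed equation has an approximately spatially uncorrelated error.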

Results and Discussion
Proof. See Appendix C.

Crossvalidation Scheme and Selection of Spatial Weight Matrix
To prevent the choice of spatial weight matrix from affecting the accuracy of model estimation, the optimal spatial weight matrix is selected by crossvalidation, a model selection and generalization method widely used in machine learning. However, since the data are panel data and the model is a spatio-temporal model, the structural features of the spatio-temporal data may be destroyed if the training set and validation set are generated by the hold-out method, LOOCV, or K-fold crossvalidation. Therefore, this paper considers a stratified crossvalidation approach. For spatio-temporal data, let $N$ and $T$ be the number of spatial individuals and the number of periods in the observed sample, respectively, with the remaining conventions on independent and dependent variables the same as in Equation (1). Stratified crossvalidation can then take the following three forms.

Leave-One-Out Crossvalidation for the Time Dimension (TLOOCV)
Select period $t$ as the validation set and the remaining $T - 1$ periods as the training set. Let $C_1, C_2, \cdots, C_T$ denote, respectively, the index sets of the observations contained in period $t$ $(t = 1, 2, \cdots, T)$, and $N_1, N_2, \cdots, N_T$ the number of observations contained in period $t$ $(t = 1, 2, \cdots, T)$. Let $n_1, n_2, \cdots, n_T$ denote the number of observations in part $t$. Do the above for each period $t = 1, 2, \cdots, T$ in turn and calculate the crossvalidation statistic, where $\hat{y}_{it}$ is the fitted value of the $i$th observation $y_{it}$ in period $t$. This crossvalidation method is suitable when the number of periods $T$ is not too large.
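The TLOOCV loop can be sketched as follows. Two assumptions are made because the text does not fix them: the CV statistic is taken to be the average over periods of the per-period mean squared prediction error, and a pooled OLS fit stands in for the frontier model. The callables `fit` and `predict` and all function names are illustrative.

```python
import numpy as np

def tloocv(y, X, fit, predict):
    """Leave-one-period-out crossvalidation.

    y has shape (T, N), X has shape (T, N, K). For each period t the model
    is fitted on the other T - 1 periods and evaluated on period t.
    Returns the mean over periods of the per-period mean squared error
    (an assumed form of the CV statistic).
    """
    T = y.shape[0]
    fold_mse = []
    for t in range(T):
        train = [s for s in range(T) if s != t]
        model = fit(y[train], X[train])
        y_hat = predict(model, X[t])
        fold_mse.append(float(np.mean((y[t] - y_hat) ** 2)))
    return float(np.mean(fold_mse))

def ols_fit(y, X):
    """Pooled OLS on the stacked training periods (stand-in for the real model)."""
    K = X.shape[-1]
    beta, *_ = np.linalg.lstsq(X.reshape(-1, K), y.reshape(-1), rcond=None)
    return beta

def ols_predict(beta, X_t):
    return X_t @ beta

def select_weight_matrix(cv_scores):
    """Return the key of the candidate with the smallest CV statistic."""
    return min(cv_scores, key=cv_scores.get)
```

In the setting of this paper, `fit` would rebuild the spatial lags with one candidate weight matrix, and `select_weight_matrix` would then be applied to the CV statistics of all candidates. Note that for a model with time lags, dropping period $t$ also affects the lagged regressors of period $t + 1$; the sketch ignores this.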

K-Fold Pooled Crossvalidation for the Spatial Dimension (SK-Fold PCV)
This method is suitable when both the total number of periods $T$ and the number of spatial individuals $N$ are large. All observed values in each period $t$ $(t = 1, 2, \cdots, T)$ are randomly divided into $K$ groups of equal size (i.e., the subsample size of each group is $N/K$), and one group is randomly selected from each period $t$ $(t = 1, 2, \cdots, T)$, so that $NT/K$ observed values are combined as the validation set and the $N(K - 1)T/K$ remaining observed values are combined as the training set. Do the above for each group in turn and calculate the crossvalidation statistic, where $\hat{y}_{it}$ is the fitted value of the $i$th observation $y_{it}$ in period $t$.
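The stratified split described above can be generated with boolean masks. The sketch assumes that $K$ divides $N$ and that the random grouping is drawn independently in each period; the generator name `sk_fold_pcv_splits` is illustrative. Setting $K = N$ leaves exactly one individual per period in the validation set.

```python
import numpy as np

def sk_fold_pcv_splits(N, T, K, seed=0):
    """Yield (train_mask, valid_mask) pairs, each a boolean array of shape (T, N).

    Within every period the N units are randomly assigned to K equal groups;
    fold k validates on group k of every period, giving N*T/K validation
    observations and N*(K-1)*T/K training observations.
    """
    if N % K != 0:
        raise ValueError("this sketch assumes that K divides N")
    rng = np.random.default_rng(seed)
    # One independent random labelling 0..K-1 per period, N/K units per label.
    labels = np.stack([rng.permutation(N) % K for _ in range(T)])
    for k in range(K):
        valid = labels == k
        yield ~valid, valid
```

Each observation appears in exactly one validation set, and every period contributes the same number of validation observations, so the time structure of the panel is preserved in both the training and the validation data.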

Leave-One-Out Crossvalidation for the Spatial Dimension (SLOOCV)
When the crossvalidation method presented in Section 3.2.2 is applied with the condition $K = N$, it can be called N-fold pooled crossvalidation for the spatial dimension (SN-fold PCV) or stratified leave-one-out crossvalidation (SLOOCV). This method is suitable when the total number of periods $T$ is large and the number of spatial individuals $N$ is small.

Determination of Weight Matrix and Industrial Efficiency Measure
To discuss industrial efficiency from the perspective of spatial statistics or spatial data mining, a good spatial weight matrix should be determined first. In this paper, the spatial lag production function is chosen as the basic model, and the spatial weight matrix involved in the construction of the model can take various alternative forms. In a data-driven way, the training samples are fed into the model for parameter estimation, and the most appropriate spatial weight matrix is then determined by stratified crossvalidation. To determine whether a spatial model should be used for the analysis, a spatial correlation test is then carried out. If there is a strong spatial correlation, a spatial stochastic frontier model or a spatio-temporal stochastic frontier model is established; if the spatial correlation is weak, an ordinary panel stochastic frontier model is selected. After the estimation is completed, the best performing model is used to measure the technical efficiency. The flow chart of the entire analysis process is shown in Figure 1.

The Efficiency of the High and New Technology Industries in China
China has developed high and new technology industries for many years in order to transform its mode of economic growth, cultivating knowledge- and technology-intensive new companies with great growth potential and low resource consumption that support sustainable development. As the technology of such industries disseminates, some issues emerge, such as spatial technology spillover, continuity of technological upgrading, and the delay from research and development to market acceptance. In this paper, we analyze the efficiency of these industries by the above analysis process.

Introduction to the Model and Data.
In the framework of spatio-temporal model analysis, the spatial lag production function model based on the Cobb-Douglas production function is chosen as the basic model for weight selection.
The matrix form of the model is as follows: where $\ln Y$, $\ln k$, and $\ln l$ are, respectively, the logarithm vectors of the main business income, the asset investment, and the mean number of employees of the high and new technology industries in every province in China; $W_N$ is the spatial weight matrix, and the candidate weight matrices are listed in Table 1.
Before the fitting, each weight matrix was row-standardized, and the optimal weight matrix determined by crossvalidation was used to establish the stochastic frontier model. Starting from the ordinary panel stochastic frontier model and considering the spatial correlations, we construct the static panel spatial stochastic frontier model and the spatiotemporal stochastic frontier model. To determine whether the model variables are spatially correlated, we apply a spatial correlation test, and to determine whether a time lag term should be included in the model, we test its significance. Comparing the results of the three models and selecting the one that provides the best fit, we estimate the technical efficiency of the high and new technology industries in every province.
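Row standardization and the construction of a reciprocal-distance matrix can be sketched as follows. The sketch uses Euclidean distance between coordinate pairs as a simple stand-in for the geographical distance between provincial capitals, which is an assumption made for illustration; the function names are illustrative as well.

```python
import numpy as np

def row_standardize(W):
    """Scale each row of W to sum to 1; all-zero rows (isolated regions) stay zero."""
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums != 0)

def inverse_distance_weights(coords):
    """Return w_ij = 1 / d_ij for i != j and w_ii = 0.

    coords is an (N, 2) array of distinct points; d_ij is the Euclidean
    distance between points i and j (stand-in for geographical distance).
    """
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    W = np.zeros_like(d)
    off_diag = ~np.eye(len(coords), dtype=bool)
    W[off_diag] = 1.0 / d[off_diag]
    return W
```

After row standardization, the spatial lag $W y$ of a variable is a weighted average of its values in the neighbouring regions, which keeps the spatial coefficients of different candidate matrices on a comparable scale.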
The matrix forms of the three models (the ordinary panel stochastic frontier model, the static panel spatial stochastic frontier model, and the spatiotemporal stochastic frontier model) use the following notation: $v$ is the vector of random errors, $u$ is the inefficiency term, $I_T$ is the identity matrix of order $T$, $\lambda$ and $\rho$ are the spatial correlation coefficients of the corresponding equations, $\pi$ is the time lag coefficient of the spatiotemporal stochastic frontier model, and $B$ is the regression coefficient vector.
The development plan for the high and new technology industry in China started in 1988, but because progress was relatively slow at the beginning, large-scale development of this industry did not start until the beginning of the twenty-first century. For this reason, we have chosen as the research sample the panel data of the high and new technology industries of the 31 provinces in China from 2001 to 2018. The data on capital and labor input factors are taken from the "China high-tech industry yearbook." Descriptive statistics are shown in Table 2. Figure 2 is a histogram of the within-group mean of the data for each region over the years. It shows the differences in investment and average development level of high-tech industries across the provinces of China from 2001 to 2018. As can be seen from the figure, Guangdong, Jiangsu, and Shandong provinces have the highest input and output levels of high-tech industries, and the differences among these three provinces are also very large. Over the 18 years, the average output value of the high-tech industry of Guangdong, which ranks first, reached 23.7 billion yuan, while that of Shandong, which ranks third, reached only 6.997 billion yuan, less than one third of that of Guangdong. In terms of the development and distribution of high-tech industries nationwide, the gap between provinces is even more obvious. The average output value of the high-tech industries in Tibet, which ranks last, is only 0.085 billion yuan, less than 0.4% of that of Guangdong.

Table 1: Spatial weight matrices.
$W_1$, adjacency weight matrix. The weight matrix is constructed by rook, bishop, or queen adjacency; the queen adjacency matrix is selected in this paper. Matrix element: $w_{ij} = 1$ if regions $i$ and $j$ are adjacent, and $w_{ij} = 0$ if they are not.
$W_2$, geographical distance weight matrix. The weight matrix is constructed from the geographical distance between regions; this paper uses the reciprocal of the distance between the centers of the provincial capitals in China. Matrix element: $w_{ij} = 1/d_{ij}$, where $d_{ij}$ is the geographical distance between regions $i$ and $j$.
$W_3$, economic distance weight matrix. The weight matrix is constructed from the difference in economic level among regions, that is, the smaller the economic gap, the stronger the spatial correlation. In this paper, the GDP of each province in China is used as the proxy variable for the level of economic development. The matrix element is expressed in terms of $d_{ij}$, the geographical distance between regions $i$ and $j$, and $\mathrm{GDP}_i$, the annual average GDP of region $i$.
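The adjacency and reciprocal-distance weight matrices, together with the row standardization applied before fitting, can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the code used in the paper: the function names, the `neighbors` dictionary, and the distance matrix `D` are hypothetical, and a zero diagonal is assumed so that no region is its own neighbor.

```python
import numpy as np

def adjacency_weights(neighbors, n):
    """Binary contiguity matrix: w_ij = 1 if regions i and j are adjacent, else 0.

    `neighbors` maps a region index to a list of adjacent region indices
    (hypothetical input format); the result is made symmetric.
    """
    W = np.zeros((n, n))
    for i, adjacent in neighbors.items():
        for j in adjacent:
            W[i, j] = W[j, i] = 1.0
    return W

def inverse_distance_weights(D):
    """Reciprocal-distance matrix: w_ij = 1 / d_ij off the diagonal, 0 on it.

    Assumes all off-diagonal distances are strictly positive.
    """
    D = np.asarray(D, dtype=float)
    W = np.zeros_like(D)
    off_diagonal = ~np.eye(D.shape[0], dtype=bool)
    W[off_diagonal] = 1.0 / D[off_diagonal]
    return W

def row_standardize(W):
    """Scale each row to sum to 1; rows with no neighbors stay all zero."""
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
```

After row standardization, the product of the matrix with a vector of regional values is the weighted average of each region's neighbors, which is what makes coefficients comparable across candidate weight matrices.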

Crossvalidation Results and the Selection of Weight Matrix.
The data cover a time span of 18 years, which is relatively short and smaller than the number of regions. Therefore, the leave-one-out crossvalidation method for the time dimension (TLOOCV) was chosen. Each weight matrix in Table 1 was introduced into model (31), the training sets were fitted one by one, and Equation (40) was then evaluated to obtain the CV statistic corresponding to each weight matrix. The results are shown in Table 3. Comparing the CV statistics on the validation sets identifies the optimal spatial weight matrix required by this paper.
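The selection procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's implementation: in place of model (31) it fits an ordinary least squares regression on the inputs and their spatial lags, and in place of Equation (40) it uses the mean squared prediction error on the held-out year. The function names and the array layout (`Y` of shape `(T, n)`, `X` of shape `(T, n, k)`) are hypothetical.

```python
import numpy as np

def row_standardize(W):
    """Scale each row to sum to 1; rows with no neighbors stay all zero."""
    W = np.asarray(W, dtype=float)
    row_sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

def design(X_t, W):
    """Stand-in design matrix for one year: intercept, X, and spatially lagged X."""
    n = X_t.shape[0]
    return np.hstack([np.ones((n, 1)), X_t, W @ X_t])

def tloocv(Y, X, W):
    """Leave one time period out, fit on the rest, and score the held-out year.

    Returns the mean squared prediction error averaged over all held-out years
    (an assumed stand-in for the CV statistic).
    """
    T = Y.shape[0]
    W = row_standardize(W)
    errors = []
    for t in range(T):
        train = [s for s in range(T) if s != t]
        A = np.vstack([design(X[s], W) for s in train])
        b = np.concatenate([Y[s] for s in train])
        coef, *_ = np.linalg.lstsq(A, b, rcond=None)
        prediction = design(X[t], W) @ coef
        errors.append(np.mean((Y[t] - prediction) ** 2))
    return float(np.mean(errors))

def select_weight_matrix(Y, X, candidates):
    """Return the name of the candidate with the smallest CV score, plus all scores."""
    scores = {name: tloocv(Y, X, W) for name, W in candidates.items()}
    return min(scores, key=scores.get), scores
```

Holding out whole years rather than individual observations keeps every region in each training set, which suits panels where the number of periods is smaller than the number of regions.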

Empirical Results.
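To ensure that spatial econometrics is applicable to the problem under study, we need to test the spatial correlation of the variables of interest. The most popular measure of spatial autocorrelation is Moran's index (Moran's I):
$$I = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{S^2 \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}},$$
where $x_i$ is the observed value in region $i$, $\bar{x}$ is the sample mean, $S^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is the sample variance, $w_{ij}$ is the $(i, j)$ element of the spatial weight matrix (used to measure the distance between region $i$ and region $j$), and $\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}$ is the sum of all spatial weights.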
The value of Moran's I generally lies between -1 and 1. A value greater than 0 indicates positive autocorrelation: high values are adjacent to high values and low values are adjacent to low values. A value less than 0 indicates negative autocorrelation: high values are adjacent to low values. If Moran's I is close to 0, the spatial distribution is random and there is no spatial autocorrelation.
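The sign interpretation of Moran's I can be checked with a small numerical sketch. It assumes the conventional definition of the global index, with the sample variance computed using divisor $n$; the function name `morans_i` and the six-region ring layout in the usage note are hypothetical.

```python
import numpy as np

def morans_i(x, W):
    """Global Moran's I for values x (length n) and an n-by-n weight matrix W."""
    x = np.asarray(x, dtype=float)
    W = np.asarray(W, dtype=float)
    z = x - x.mean()                 # deviations from the sample mean
    s2 = np.mean(z ** 2)             # sample variance with divisor n
    total_weight = W.sum()           # sum of all spatial weights
    if s2 == 0 or total_weight == 0:
        raise ValueError("Moran's I is undefined for constant x or an all-zero W")
    return float(z @ W @ z / (s2 * total_weight))

def ring_adjacency(n):
    """Binary adjacency for n regions arranged in a ring (illustrative layout)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, (i + 1) % n] = 1.0
        W[i, (i - 1) % n] = 1.0
    return W
```

On a ring of six regions, the alternating pattern `[1, 0, 1, 0, 1, 0]` gives an index of -1 (every high value borders low values), while the clustered pattern `[1, 1, 1, 0, 0, 0]` gives 1/3 (like values mostly border like values).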
To test the existence of spatial correlations in the variables of the high and new technology industry, we calculate the global Moran's I indices of the production values of the industry from 2011 to 2018 (Table 4).
From the Moran's I results, we find that the P value of the index is smaller than 0.01 in every year, so the index is significant at the 1% level in every year, and the average Moran's I index is also significant. The Moran's I index reached its minimum value of 0.286 in 2011 and its maximum value of 0.340 in 2013. The production value of the high and new technology industries of the provinces therefore shows significant spatial correlation in every year, and we conclude that the production values of these industries in the provinces of China exhibit a clear spatial agglomeration effect. Furthermore, the spatial correlation test and the location quotient calculation both demonstrate that the high and new technology industries in different regions of China are clearly spatially correlated. We therefore choose the panel spatial stochastic frontier model for the analysis. Table 5 presents the estimation results of the static and spatiotemporal stochastic frontier models, where the static model was estimated under both fixed and random effects.
The spatial autoregressive coefficients $\lambda$ and $\rho$ of the three models all pass the significance test. From this and the spatial correlation test, one can conclude that the spatial stochastic frontier model is more reasonable. The estimated spatial autoregressive coefficients of the three models are all positive, which implies that the spatial effects have a positive impact on the development of the high and new technology industries. The negative value of the Hausman statistic of the static panel spatial model implies that the random effect model should be chosen. In the random effect static panel spatial stochastic frontier model, the value of $\sigma_u^2$ is far greater than the value of $\sigma_v^2$, and the value of $\gamma$ is 0.719, implying that technical inefficiency is clearly present. Comparing the spatiotemporal stochastic frontier model with the static panel spatial random effect estimates, the spatiotemporal lag coefficient of the former passes the 10% significance test, indicating that the spatiotemporal lag term plays a significant role in the model. The distance statistic of the spatiotemporal stochastic frontier model passes the 5% significance test, showing that the spatiotemporal stochastic frontier model is globally significant. Moreover, the estimated $\gamma$ of the spatiotemporal stochastic frontier model is higher than that of the static panel spatial stochastic frontier model, demonstrating that the inefficiency term plays a more significant role in the spatiotemporal model. All the explanatory variables of the technical inefficiency terms in the two models pass at least the 5% significance test, the signs of the regression coefficients of the two models are consistent, and their numerical values are relatively close.
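The reading of $\gamma$ as evidence of inefficiency can be made concrete with a short sketch. It assumes the common parameterization $\gamma = \sigma_u^2 / (\sigma_u^2 + \sigma_v^2)$, which is not restated in this section; the function name and the variance values in the usage note are hypothetical and are not taken from Table 5.

```python
def inefficiency_variance_share(sigma_u2: float, sigma_v2: float) -> float:
    """Share of the composed-error variance attributed to inefficiency.

    Assumes gamma = sigma_u^2 / (sigma_u^2 + sigma_v^2). Values near 1 mean
    deviations from the frontier are mostly inefficiency; values near 0 mean
    they are mostly random noise.
    """
    if sigma_u2 < 0 or sigma_v2 < 0:
        raise ValueError("variances must be non-negative")
    total = sigma_u2 + sigma_v2
    if total == 0:
        raise ValueError("at least one variance must be positive")
    return sigma_u2 / total
```

For example, hypothetical variances of 0.719 and 0.281 give a share of 0.719, meaning that under this parameterization about 72% of the composed-error variance would be attributed to the inefficiency term.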
Taken together, the above results demonstrate that the spatiotemporal stochastic frontier model is the more reasonable specification and that the development of the high and new technology industry is positively correlated in both space and time.
Due to space limitations, this paper does not report the annual technical efficiency of the high-tech industries. It can be seen from the results in Table 6 that the average technical efficiency of the high-tech industries in all provinces of China is less than 1, which indicates that the actual output of the high-tech industries in all provinces has not reached the most efficient output level and that there is technical inefficiency in production. The five-year national average technical efficiency was 0.837, and there are obvious regional differences in the technical efficiency values presented in Table 6. Nine provinces (Beijing, Chongqing, Fujian, Gansu, Jiangsu, Liaoning, Shanghai, Tianjin, and Zhejiang) achieved an average technical efficiency of more than 0.9; seven of them are located in the eastern region, one in the central region, and one in the western region. There are 11 provinces with average technical efficiency below 0.8, namely, Guangdong, Guangxi, Guizhou, Hebei, Henan, Jiangxi, Ningxia, Qinghai, Shaanxi, Xinjiang, and Tibet. Only one of these provinces is in the east, six are in the central region, and four are in the west.
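Why every efficiency value falls below 1, and how provinces are grouped by the 0.9 and 0.8 thresholds, can be illustrated with a brief sketch. It assumes a log-linear production frontier, for which technical efficiency is conventionally computed as $\exp(-u)$ from the non-negative inefficiency term $u$; the function names, band labels, and inefficiency values are hypothetical and do not reproduce Table 6.

```python
import numpy as np

def technical_efficiency(u):
    """Technical efficiency exp(-u) for non-negative inefficiency terms u.

    Assumes a log-linear frontier, so efficiency lies in (0, 1] and equals 1
    only when u = 0.
    """
    u = np.asarray(u, dtype=float)
    if np.any(u < 0):
        raise ValueError("inefficiency terms must be non-negative")
    return np.exp(-u)

def efficiency_band(te, high=0.9, low=0.8):
    """Label an average efficiency value using the thresholds quoted in the text."""
    if te > high:
        return "above 0.9"
    if te < low:
        return "below 0.8"
    return "between 0.8 and 0.9"
```

For instance, hypothetical inefficiency terms of 0.05, 0.15, and 0.30 correspond to efficiencies of about 0.95, 0.86, and 0.74, which fall in the three bands respectively.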

Conclusion
In this paper, taking into account that the explained variable might be affected by the time lag term and by space-time interaction, we develop a dynamic model within the framework of the panel spatial stochastic frontier model. Because the model is clearly endogenous, we use the system GMM method to estimate the parameters, choose suitable instrumental variables according to the model assumptions and variable characteristics, and construct a suitable spatiotemporal stochastic frontier model. We use the consistency theorem for extremum estimators and the uniform law of large numbers (ULLN) to prove the consistency of the structural parameter estimators and of the estimators of the error term distribution parameters. For the selection of the spatial weight matrix of the spatiotemporal model, a stratified crossvalidation method is designed to select the most appropriate spatial weight matrix in a data-driven way according to the characteristics of spatiotemporal data. Although the spatial weight matrix selected by supervised learning may not be suitable for every problem, this data-driven model selection method is valuable and efficient.
From the analysis of the stochastic frontier model of high-tech industries in China and the measurement of their technical efficiency, we can draw the following conclusions.
There is a positive spatial correlation in the development of high-tech industries between different regions of China. The positive correlation between the output values of these industries in different regions is obtained by calculating the global Moran's I index in each year. The estimation of the spatial panel stochastic frontier model also indicates that the spatial autoregressive coefficient is positive, confirming the existence of a positive correlation that has a positive impact on the development of high-tech industries.
There is also a spatial agglomeration effect and a spatiotemporal lag effect in these industries, illustrating that both static spillover and dynamic continuity occur in the development of the high-tech industries in China.
The technical efficiency of high-tech industries is relatively low. The strategic emerging industries started earlier in the eastern region but developed more slowly there than in the central and western regions.
The Chinese economy is at a critical stage of replacing old drivers of growth with new ones and of transforming and upgrading its industries. The new round of technological and industrial revolution, Industry 5.0, has given rise to new technologies, new industries, new forms of business, and new models. In this study, a data mining algorithm based on the stochastic frontier is used to calculate industrial efficiency. The approach is suitable not only for the high-tech industry; it also helps to enrich research on the efficiency of new industries and new business models, and it has practical significance for promoting the steady development of the new round of scientific and technological revolution under Industry 5.0.