Multivariate Multiple Regression Models for a Big Data-Empowered SON Framework in Mobile Wireless Networks

In the 5G era, the operational cost of mobile wireless networks will significantly increase. Further, massive network capacity and zero latency will be needed because everything will be connected to mobile networks. Thus, self-organizing networks (SON), which expedite the automatic operation of mobile wireless networks, are needed but face challenges in satisfying the 5G requirements. Therefore, researchers have proposed a framework to empower SON using big data. The recent framework of a big data-empowered SON analyzes the relationship between key performance indicators (KPIs) and related network parameters (NPs) using machine-learning tools, and it develops regression models using a Gaussian process with those parameters. The problem, however, is that the NPs related to each KPI must be found individually. Moreover, the Gaussian process regression model cannot determine the relationship between a KPI and its various related NPs. In this paper, to solve these problems, we propose multivariate multiple regression models to determine the relationship between various KPIs and NPs. If we regard one KPI and its multiple NPs as one set, the proposed models allow us to process multiple sets at one time. We can also find out whether some KPIs conflict. We implement the proposed models using MapReduce.


1. Introduction
The technology of self-organizing networks (SON) has been developed to more economically manage wireless communication and mobile networks in increasingly complex environments [1,2]. SON, however, do not fully handle data from all sources in mobile wireless networks, such as mobile app-based data (mobile data) and channel baseband power (wireless communication information) [3,4]. Thus, SON encounter challenges that hinder the current self-organizing networking paradigm from meeting the 5G requirements because 5G networks are more complex [4].
Engineers have thus come up with a big data-empowered SON (BSON), which develops a SON with big data in mobile wireless networks. BSON, currently a necessary technology for 5G [3][4][5][6], is still in its initial stage; indeed, in its current iteration, it is insufficient for practical use. The BSON framework was proposed in [4] and includes the concrete concept of using big data in mobile wireless networks and applying it to SON. It ranks the key performance indicators (KPIs), selects the network parameters (NPs) related to each KPI, and creates a Gaussian process regression model in which the KPI is the dependent variable and each NP related to this KPI is the independent variable. The Gaussian process regression models are then applied to the SON engine for management optimization. In this context, the KPIs include capacity, quality of service (QoS), capital expenditure (CAPEX), and operational expenditure (OPEX) from the perspective of a wireless communication operator. From a user perspective, the KPIs include seamless connectivity, spatiotemporal uniformity of service, demand for almost infinite capacity or zero latency, and cost of service. For instance, because 5G technology aims to connect everything, such as automobiles, wearable devices, and home networks, and to help humans escape emergency situations, massive network capacity and zero latency are needed in the wireless ecosystem.
This BSON framework [4], however, has some aspects that need to be improved. For example, the individual selection of the NPs related to a KPI is considerably intricate because a typical 5G node is expected to have more than 2000 parameters. Moreover, in a single Gaussian process regression model, computing an exact KPI value according to the simultaneous change of several NP values is difficult because each model relates a KPI to only one NP.

2. Background and Related Work
SON facilitates the automatic operation of mobile wireless networks. Current research exploits big data in mobile wireless networks to improve them; this research direction is called BSON [3]. The authors in [4] proposed the BSON framework.

2.1. SON.
Operating mobile wireless networks is a challenging task, especially in cellular mobile communication systems, due to their latent complexity. This complexity arises from the number of network elements and the interconnections among their configurations. In heterogeneous networks, handling various technologies and their precise operational paradigms is difficult. Today, planning and optimization tools are typically semiautomated, and management tasks need to be closely supervised by human operators. This manual effort is time-consuming, expensive, and error prone and requires a high degree of expertise. SON can be used to reduce operating costs by reducing the tasks at hand and to enhance profit by minimizing human error. The next subsections detail the SON taxonomies.
2.1.1. Self-Configuration. Configuration may also be needed when the system changes, such as on the failure of a node, a drop in network performance, or a change in service type. In future systems, the conventional process of manual configuration must be replaced with self-configuration. We can foresee that nodes in future cellular networks should be able to self-configure all their initial parameters, including the IP addresses, neighbor lists, and radio-access parameters.
2.1.2. Self-Optimization. After the initial self-configuration phase, the system parameters must be continuously optimized to ensure efficient performance and to maintain all optimization objectives. Optimization in legacy systems can be done through periodic drive tests or through analysis of log reports generated by the network operating center. Self-optimization includes load balancing, interference control, coverage extension, and capacity optimization.
2.1.3. Self-Healing. Wireless cellular systems are prone to faults and failures due to component malfunctions or natural disasters. In traditional systems, failures are mainly detected by the centralized operation and maintenance (O&M) software. Events are recorded, and the necessary alarms are set off. When alarms cannot be remotely cleared, radio network engineers are usually mobilized and sent to cell sites. This process can take days or even weeks before the system returns to normal operation. In future self-organized cellular systems, this process needs to be improved by consolidating the self-healing functionality. Self-healing is a process that consolidates remote detection, diagnosis, and the triggering of compensation or recovery actions to minimize the effect of faults in mobile wireless network equipment.

2.2. Big Data in 5G and BSON.
A massive amount of information comes from various elements in mobile wireless networks, such as base stations, mobile terminals, gateways, and management entities, as shown in Figure 1 [3]. The authors in [4] classified the big data in cellular networks as follows.
2.2.1. Subscriber Level Data. This classification contains control data, contextual data, and voice data, which not only can be used to optimize, configure, and plan network-centric operations but are also equally meaningful in supporting key business processes such as customer experience and retention enhancement.

2.2.2. Cell Level Data. This classification contains physical layer measurements that are reported by a base station and all user equipment within the coverage of this base station to the O&M center. The cell level data can complement the subscriber level data. For example, minimization of drive test measurements, which contain the reference signal received power and reference signal received quality values of the serving and adjacent cells, are particularly useful for autonomous coverage estimation and optimization [9].

2.2.3. Core Network Level Data. This classification can be exploited to fully automate fault detection and the troubleshooting of network level problems. The complexity of identifying problems in a core network increases many times, particularly if the equipment used is supplied by different vendors that provide their own proprietary solutions for different network performance.

2.2.4. Additional Sources of Data. This classification contains the structured information already stored in separate databases, including customer relationship management and billing data. It also includes unstructured information such as social media feeds, specific application usage patterns, and data from smartphone built-in sensors and applications.
As discussed in the Introduction, SON technology uses the aforementioned big data to improve itself. This process is facilitated by BSON. The three main features that make BSON distinct from state-of-the-art SON are (i) full intelligence of the current network status, (ii) the capability to predict user behavior, and (iii) the capability to dynamically associate the network response with the NPs.
These three capabilities can go a long way in designing a SON that can meet the 5G requirements.The BSON framework shown in Figure 2 involves the following steps.
Step 1 (data gathering). This includes gathering data from all sources of information into an aggregate data set.
Step 2 (transforming). This includes transforming the big data into the right data. The steps in this transformation are explained below; the underlying machine learning and data analytics are subsequently explained.
(1) Classifying. This means classifying the data with respect to key operational and business objectives (OBOs), which include accessibility, retainability, integrity, mobility, and business intelligence.
(2) Unifying/Diffusing. This means unifying multiple PIs into more significant KPIs.
(3) Ranking. This means ranking the KPIs within each OBO with respect to their effect on that OBO.
(4) Filtering. This means filtering out KPIs that affect the OBO below a predefined threshold.
(5) Relating. This means, for each KPI, finding the NPs that affect that KPI.
(6) Ordering. This means, for each KPI, ordering the associated NPs with respect to the strength of their association.
(7) Cross-Correlation. This means, for each NP, determining a vector that quantifies its association with each KPI.
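Steps (3) through (6) above can be sketched in a few lines. The sketch below is illustrative only: it uses synthetic data, absolute Pearson correlation as a stand-in for whatever association measure the framework actually employs, and hypothetical names (`obo`, `kpis`, `nps`, and the 0.05 threshold are all assumptions, not values from the paper).

```python
import numpy as np

# Hypothetical example of ranking KPIs by their effect on one OBO score,
# filtering weak KPIs, and relating/ordering the NPs for each surviving KPI.
rng = np.random.default_rng(0)
obo = rng.normal(size=100)                                  # OBO score samples
kpis = {f"KPI_{k}": rng.normal(size=100) for k in range(5)}  # synthetic KPIs
nps = {f"NP_{j}": rng.normal(size=100) for j in range(8)}    # synthetic NPs

def corr(a, b):
    """Absolute Pearson correlation between two equal-length sample vectors."""
    return abs(np.corrcoef(a, b)[0, 1])

# (3) Ranking: order KPIs by their association with the OBO.
ranked = sorted(kpis, key=lambda k: corr(kpis[k], obo), reverse=True)
# (4) Filtering: drop KPIs whose association falls below a threshold.
kept = [k for k in ranked if corr(kpis[k], obo) > 0.05]
# (5)-(6) Relating and ordering: for each kept KPI, order NPs by association.
related = {k: sorted(nps, key=lambda j: corr(nps[j], kpis[k]), reverse=True)
           for k in kept}
```

In a real deployment the association measure, threshold, and data sources would come from the operator's OBO definitions rather than the toy values above.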
Step 3 (modeling). This includes developing a network behavior model by learning from the right data obtained in Step 2 using Gaussian process regression and Kolmogorov-Wiener prediction.
Step 4 (running the SON engine). This includes using the SON engine on the model to determine new NPs and the expected new KPIs.
Step 5 (validating). If the simulated behavior tallies with the expected behavior (KPIs), proceed with the new NPs.
Step 6 (relearning/improving). If the validation in Step 5 fails, feedback is sent to the concept drift block, which in turn updates the behavior model [8,10].

2.3. Multiple Regression Models
Step 2 (transforming) and Step 3 (modeling) presented in Section 2.2 (the BSON framework) are replaced with the multiple regression models. The key factors in Step 2 and Step 3 are finding the associated NPs for each KPI and creating the model using a KPI and its associated NPs. The associated NPs, however, must be determined separately using machine-learning tools [11]. Moreover, calculating the accurate value of a KPI according to changes in the NP values is difficult. In other words, the model presented in Section 2.2 allows us to determine the value of a KPI according to only one NP because the model is merely a single regression model. The single regression model shown in Figure 2 identifies the relationship between a KPI and only one NP. Of course, many single regression models exist according to the NPs, but calculating a KPI value when the NP values change simultaneously is difficult. In contrast, the multiple regression models shown in Figure 3 enable easy identification of the relationship between a KPI and the NPs.
We proposed the multiple regression models to enhance the previous BSON framework [8]. The multiple regression model is written in [10] as

Y = XB + E,

where the elements of X and Y are the values of the NPs and the KPI, respectively, and the parameter is estimated as

B̂ = (X^T X)^(-1) X^T Y.

We can create the multiple regression model (B̂) by multiplying (X^T X)^(-1) and (X^T Y). Figure 3 shows the four steps to compute B̂ using MapReduce (among them, computing X^T X and X^T Y and then computing (X^T X)^(-1)); we provided the details of each step in [8]. MapReduce [12,13] is a computation method that has been implemented in several systems, including Google's internal implementation and the popular open-source implementation, Hadoop.
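As a concrete single-KPI illustration, the estimate B̂ = (X^T X)^(-1) X^T Y can be computed directly with NumPy. The sample count, NP count, and coefficient values below are hypothetical, chosen only to show the shape of the computation:

```python
import numpy as np

# Sketch of the multiple regression estimate for one KPI:
# beta_hat = (X^T X)^(-1) X^T y. All data here are synthetic.
rng = np.random.default_rng(42)
n_samples, n_nps = 60, 3                  # e.g., 60 one-minute samples, 3 NPs

nps = rng.normal(size=(n_samples, n_nps))           # NP values per sample
true_beta = np.array([2.0, 1.5, -0.5, 3.0])         # intercept + 3 NP weights
X = np.hstack([np.ones((n_samples, 1)), nps])       # prepend intercept column
y = X @ true_beta + rng.normal(scale=0.01, size=n_samples)  # one noisy KPI

beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)       # (X^T X)^(-1) X^T y
```

With low noise, `beta_hat` recovers the generating coefficients closely; in practice `np.linalg.lstsq` would be preferred numerically, but the explicit normal-equation form mirrors the MapReduce steps in the text.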

2.4. Matrix Multiplication Using MapReduce
(Hadoop, along with the Hadoop Distributed File System, can be obtained from the Apache Software Foundation.) We can use an implementation of MapReduce to manage many large-scale computations in a manner that is tolerant of hardware faults. Only two functions need to be written, Map and Reduce, while the system manages the parallel execution, coordinates the tasks that execute Map or Reduce, and deals with the possibility that one of these tasks will fail to execute.

2.4.1. Matrix Multiplication with One MapReduce Step. If M is a matrix with element m_ij in row i and column j, and N is a matrix with element n_jk in row j and column k, then the product P = MN is the matrix with element p_ik in row i and column k, where p_ik = Σ_j m_ij n_jk. We can perform the matrix multiplication P = MN with only a single MapReduce pass. Here, we present an abstract of the Map and Reduce functions.
(1) Map Function. For each element m_ij of M, produce the key-value pair ((i, k), (M, j, m_ij)) for every column k of N; similarly, for each element n_jk of N, produce ((i, k), (N, j, n_jk)) for every row i of M. Here, M and N are the names of the matrices, not the matrices themselves.
(2) Reduce Function. Each key (i, k) will have an associated list of all the values (M, j, m_ij) and (N, j, n_jk) for all possible values of j. The jth values on each list must have their third components, m_ij and n_jk, extracted and multiplied. These products are then summed, and the sum is paired with (i, k) in the output of the Reduce function.
For matrix inversion, the authors in [14] proposed an LU decomposition-based block method. The LU algorithm splits the matrix into square submatrices and individually updates these submatrices; the block method splits the input matrix as shown in Figure 4.
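The one-pass Map and Reduce functions above can be sketched in plain Python. This is a single-process simulation of the scheme, not a Hadoop job; `map_matmul` and `reduce_matmul` are illustrative names:

```python
from collections import defaultdict

def map_matmul(M, N):
    """Map: emit ((i, k), (name, j, value)) for every element of M and N."""
    n_rows_M, n_mid, n_cols_N = len(M), len(N), len(N[0])
    pairs = []
    for i in range(n_rows_M):
        for j in range(n_mid):
            for k in range(n_cols_N):            # m_ij contributes to every p_ik
                pairs.append(((i, k), ("M", j, M[i][j])))
    for j in range(n_mid):
        for k in range(n_cols_N):
            for i in range(n_rows_M):            # n_jk contributes to every p_ik
                pairs.append(((i, k), ("N", j, N[j][k])))
    return pairs

def reduce_matmul(pairs):
    """Reduce: for each key (i, k), pair up the j-th values and sum products."""
    grouped = defaultdict(lambda: ({}, {}))      # (M-values, N-values) per key
    for (i, k), (name, j, v) in pairs:
        grouped[(i, k)][0 if name == "M" else 1][j] = v
    return {key: sum(ms[j] * ns[j] for j in ms)
            for key, (ms, ns) in grouped.items()}

M = [[1, 2], [3, 4]]
N = [[5, 6], [7, 8]]
P = reduce_matmul(map_matmul(M, N))              # P[(i, k)] == (MN)[i][k]
```

The grouping step performed here by the `defaultdict` is exactly what the MapReduce framework's shuffle phase does between Map and Reduce.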

2.5. Matrix Inversion Using MapReduce
In this method, the lower triangular matrix L and the upper triangular matrix U are each split into three submatrices, whereas the original matrix A is split into four submatrices. These smaller matrices satisfy

P_1 A_1 = L_1 U_1,  P_2 (A_4 − L′_2 U_2) = L_3 U_3,

where both P_1 and P_2 are permutations of the rows. The entire LU decomposition can be represented as

PA = LU,

where P is also a permutation of the rows, obtained by augmenting P_1 and P_2.
If submatrix A_1 is sufficiently small (e.g., of order 10^3 or less), it can be decomposed into L_1 and U_1 very efficiently on a single node. If submatrix A_1 is not small enough, we can recursively partition it into smaller submatrices, as shown in Figure 4. After obtaining L_1 and U_1, the elements of L′_2 and U_2 can be computed using the following two equations:

U_2 = L_1^(-1) (P_1 A_2),  L′_2 = A_3 U_1^(-1).

We can then compute A_4 − L′_2 U_2 using the L′_2 and U_2 matrices above and subsequently decompose it into L_3 and U_3.
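The block scheme can be checked numerically on a small matrix. The sketch below is a single-node NumPy illustration with the row permutations omitted (so the input is chosen symmetric positive definite, where no pivoting is needed); it is not the MapReduce implementation of [14]:

```python
import numpy as np

def lu_nopivot(A):
    """Doolittle LU without pivoting (sketch; assumes nonzero pivots)."""
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        for j in range(i + 1, n):
            L[j, i] = (A[j, i] - L[j, :i] @ U[:i, i]) / U[i, i]
    return L, U

def block_lu(A, b):
    """Split A into four blocks at index b and combine their LU factors,
    following the block equations above (permutations omitted)."""
    A1, A2, A3, A4 = A[:b, :b], A[:b, b:], A[b:, :b], A[b:, b:]
    L1, U1 = lu_nopivot(A1)
    U2 = np.linalg.solve(L1, A2)            # U2 = L1^(-1) A2
    L2 = np.linalg.solve(U1.T, A3.T).T      # L2 = A3 U1^(-1)
    L3, U3 = lu_nopivot(A4 - L2 @ U2)       # Schur complement block
    n = A.shape[0]
    L = np.block([[L1, np.zeros((b, n - b))], [L2, L3]])
    U = np.block([[U1, U2], [np.zeros((n - b, b)), U3]])
    return L, U

rng = np.random.default_rng(1)
Msq = rng.normal(size=(4, 4))
A = Msq @ Msq.T + 4 * np.eye(4)             # SPD, so no pivoting is needed
L, U = block_lu(A, 2)
```

Reconstructing L @ U recovers A, confirming that the four block factors assemble into one valid decomposition; the recursion in the text simply reapplies `block_lu` to A_1 when it is still too large.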

3. Multivariate Multiple Regression Models for BSON Framework
The multiple regression models presented in Section 2.3 suffer from a shortcoming: they can calculate the relationship between only one KPI and the NPs. Many KPIs exist, however, such as OPEX, CAPEX, QoS, and capacity from the operator perspective, and seamless connectivity, cost of service, capacity, and latency from the user perspective [4]. These are high-level KPIs, but many precise technical KPIs also exist, such as the cell power and cell coverage. To reveal the relationship between the KPIs and the NPs with the previous multiple regression models, we must calculate the multiple regression models several times, once for each KPI. This process is inconvenient and time-consuming. Meanwhile, finding the conflicting or concordant relationships among KPIs is not easy when the NP values change simultaneously: as mentioned earlier, we would have to perform multiple regressions several times, once per KPI, to finally learn those relationships.
In contrast, the proposed multivariate multiple regression models shown in Figure 5 allow simultaneous determination of the relationship between the KPIs and NPs.
To enhance the multiple regression models for BSON, we propose the multivariate multiple regression models. The multivariate multiple regression is expressed as follows [15,16]:

Y = ZB + E,

where the elements of Z and Y are the values of the NPs and the KPIs, respectively, and the parameter is estimated as

B̂ = (Z^T Z)^(-1) Z^T Y.

We can create the multivariate multiple regression models (B̂) by calculating the product of (Z^T Z)^(-1) and (Z^T Y). Figure 5 shows the four steps to compute B̂ using MapReduce, and we describe each step in detail below.
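The multivariate estimate differs from the single-KPI case only in that Y has one column per KPI, so one fit covers every KPI at once. A NumPy sketch with synthetic data, using the dimensions of the example given later in the text (60 samples, 30 NPs, 10 KPIs):

```python
import numpy as np

# Sketch of the multivariate estimate B_hat = (Z^T Z)^(-1) Z^T Y, where each
# column of Y is one KPI and each column of Z (after the intercept) is one NP.
rng = np.random.default_rng(7)
n_samples, n_nps, n_kpis = 60, 30, 10

Z = np.hstack([np.ones((n_samples, 1)),          # intercept column
               rng.normal(size=(n_samples, n_nps))])
true_B = rng.normal(size=(n_nps + 1, n_kpis))    # synthetic ground truth
Y = Z @ true_B + rng.normal(scale=0.01, size=(n_samples, n_kpis))

B_hat = np.linalg.inv(Z.T @ Z) @ (Z.T @ Y)       # one fit covers all 10 KPIs
```

Column k of `B_hat` holds the coefficients relating every NP to KPI_k, which is exactly the "one KPI plus its NPs as one set, all sets at once" property claimed for the proposed models.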
Step 1 (integrating). Each message contains limited information, such as the location, time, reception sensitivity, cell power, mobile power, data traffic, and mobility status. Hence, we integrate all the messages to determine the values of the KPIs according to the NPs in the Map function. Then, we extract the values of each KPI and all the NPs in the Reduce function. The MapReduce key-value pair of Step 1 is presented in Algorithm 1.
In the Map function, the key is the time, and the value is the name and value of each NP and KPI. When the Map tasks are all completed, the key-value pairs are grouped by time. Thus, the input of each Reduce task contains the corresponding information, and the key-value pairs are grouped according to each KPI (i.e., KPI_k) in the Reduce tasks. Therefore, we can simultaneously obtain the value of each KPI and NP as the output of the Reduce tasks.
For example, if we take one sample per minute for 1 hour, we obtain 60 samples. Assuming that the numbers of NPs and KPIs are 30 and 10, respectively, the orders of Z and Y are 60 × 30 and 60 × 10, respectively. Therefore, in the Reduce function, we can convert the key (i.e., time_ℓ), the NP_i elements, and the KPI_k elements into the ℓth rows of Z and Y, the ith column of Z, and the kth column of Y, respectively.

Step 2 (computing Z^T Z and Z^T Y). We compute Z^T Z and Z^T Y using the result of Step 1. Because the result of Step 1 includes the Z and Y matrices, we can easily compute Z^T Z and Z^T Y using MapReduce. As noted in Section 2.4, we can perform matrix multiplication with one MapReduce step [12]. For instance, when calculating the matrix multiplication P = MN, element m_ij is used to obtain p_i1, p_i2, ..., p_ik (k is the number of columns in N). Therefore, by forking off the jth elements of m_ij in the Map function, we can calculate the elements of P in the Reduce function at the same time.
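The grouping in Step 1 can be sketched as a tiny map/reduce simulation. The message fields, names, and values below are hypothetical placeholders for the real measurement stream:

```python
from collections import defaultdict

# Hypothetical message stream: each record carries a timestamp plus the name
# and value of one NP or KPI, as in the Map function of Algorithm 1.
messages = [
    (0, "NP_1", 3.0), (0, "NP_2", 1.0), (0, "KPI_1", 7.0),
    (1, "NP_1", 2.0), (1, "NP_2", 4.0), (1, "KPI_1", 10.0),
]

def map_step(messages):
    """Map: key = time, value = (name, value) of each NP/KPI."""
    return [(t, (name, v)) for t, name, v in messages]

def reduce_step(pairs, np_names, kpi_names):
    """Reduce: group pairs by time, then emit one row of Z and Y per time."""
    by_time = defaultdict(dict)
    for t, (name, v) in pairs:
        by_time[t][name] = v
    Z = [[by_time[t][n] for n in np_names] for t in sorted(by_time)]
    Y = [[by_time[t][k] for k in kpi_names] for t in sorted(by_time)]
    return Z, Y

Z, Y = reduce_step(map_step(messages), ["NP_1", "NP_2"], ["KPI_1"])
```

Each timestamp becomes one row ℓ of Z and Y, matching the row/column conversion described above.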

Algorithm 1 (the MapReduce key-value pair of Step 1). The Map Function: {time, (NP_i, value of NP_i, KPI_k, value of KPI_k)}.
The MapReduce key-value pair of Step 2 is presented in Algorithm 2. Note that Z^T, Z, Z^T Z, and Z^T Y are the names of these matrices and not the entire matrices. Note also that i reaches up to the number of samples (i.e., time), j reaches up to the number of NPs plus one, and ℓ reaches up to the number of KPIs.

Step 3 (computing (Z^T Z)^(-1)). To calculate the multivariate multiple regression, we compute (Z^T Z)^(-1) using the result of Step 2. However, computing the inverse of a matrix using MapReduce is difficult when the order of the matrix is large. Fortunately, the authors in [14] proposed a block method for scalable matrix inversion using MapReduce, which enables parallel calculation of the LU decomposition.
If the order of the matrix is not large (≤10^3), it can be decomposed into L and U very efficiently on a single node, and the inverse can then be computed sequentially via the LU decomposition. We can compute the L and U matrices using the following equations of the LU decomposition algorithm [14,17]:

u_ij = a_ij − Σ_{k=1}^{i−1} l_ik u_kj (i ≤ j),
l_ij = (a_ij − Σ_{k=1}^{j−1} l_ik u_kj) / u_jj (i > j), with l_ii = 1.

We can then easily compute L^(-1) = [b_ij] using the following equations [14]:

b_jj = 1 / l_jj,
b_ij = −(Σ_{k=j}^{i−1} l_ik b_kj) / l_ii (i > j).

The inverse of the upper triangular matrix (U^(-1)) can be computed equivalently: we invert U by calculating the inverse of U^T, which is a lower triangular matrix. The output key-value pair of Step 3 is presented in Algorithm 3. Note that (Z^T Z)^(-1) is the name of this matrix and not the entire matrix.
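The triangular-inverse equations above translate directly into code. This is a single-node sketch of the forward-substitution inverse, not the MapReduce version:

```python
import numpy as np

def tri_lower_inv(L):
    """Invert a lower triangular matrix by forward substitution,
    following the element-wise equations b_jj and b_ij above."""
    n = L.shape[0]
    B = np.zeros((n, n))
    for j in range(n):
        B[j, j] = 1.0 / L[j, j]
        for i in range(j + 1, n):
            B[i, j] = -sum(L[i, k] * B[k, j] for k in range(j, i)) / L[i, i]
    return B

L = np.array([[2.0, 0.0, 0.0],
              [1.0, 3.0, 0.0],
              [4.0, 5.0, 6.0]])
L_inv = tri_lower_inv(L)
# U^(-1) can be obtained the same way via tri_lower_inv(U.T).T,
# since the inverse of U^T (lower triangular) is the transpose of U^(-1).
```

Multiplying `L @ L_inv` yields the identity, confirming the recurrence.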
Algorithm 3 (the output key-value pair of Step 3).
Step 4 (computing B̂). We compute B̂ = (Z^T Z)^(-1) Z^T Y using the results of Steps 2 and 3. We perform the multiplication of the two matrices (Z^T Z)^(-1) and Z^T Y using MapReduce; as in Step 2, this can be done in one MapReduce step [12].
The MapReduce key-value pair of Step 4 is presented in Algorithm 4. From the estimated parameters (i.e., β̂_(i−1)k), we can separate the NPs related to a KPI from the unrelated ones: if β̂_(i−1)k is close to zero for KPI_k, then NP_(i−1) is unrelated to KPI_k. In addition, we can identify whether a conflicting or concordant relationship exists among the KPIs. For example, if the signs of all the row elements of the B̂ columns for KPI_i and KPI_j are completely different, these KPIs are conflicting; otherwise, they are concordant.
Figure 6: MapReduce pipeline for the estimated parameter (B̂), including the LU decomposition and LU inverse phases.
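The sign-based check described above can be sketched as follows. The function name, tolerance, and toy coefficient matrix are illustrative assumptions, not part of the framework:

```python
import numpy as np

def kpi_relationship(B_hat, i, j, tol=1e-8):
    """Classify KPI_i and KPI_j as 'conflicting' when every shared NP
    coefficient has the opposite sign in the two columns of B_hat,
    else 'concordant'. Row 0 (the intercept) is excluded; rows with
    near-zero coefficients indicate an unrelated NP and are skipped."""
    bi, bj = B_hat[1:, i], B_hat[1:, j]
    active = (np.abs(bi) > tol) & (np.abs(bj) > tol)   # NPs related to both
    if active.any() and np.all(np.sign(bi[active]) != np.sign(bj[active])):
        return "conflicting"
    return "concordant"

# Toy estimate: KPI_0 and KPI_1 respond with opposite signs to both NPs.
B_hat = np.array([[0.5,  0.1],    # intercept row (ignored by the check)
                  [2.0, -1.0],    # NP_1 coefficients
                  [0.3, -0.7]])   # NP_2 coefficients
```

Here `kpi_relationship(B_hat, 0, 1)` reports a conflict, since raising either NP pushes the two KPIs in opposite directions.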

4. Time Complexity of the Multivariate Multiple Regression Models
We now calculate the time complexity of the multivariate multiple regression models. The result of the multivariate multiple regression models is obtained as the product of (Z^T Z)^(-1) and (Z^T Y), and the time complexities of the required matrix multiplications and inversion are listed in Table 1 [18,19]. Thus, the entire time complexity of the multivariate multiple regression models is O(n^3), where n is the largest matrix dimension. We can reduce this time complexity using distributed programming such as MapReduce. Let T(L) be the time complexity with L tasks. Assuming an ideal case without network bottlenecks, T(L) = O(n^3)/L. Thus, the time complexity with L tasks is O(n^3/L), and if L is sufficiently large, we can obtain almost constant or linear time complexity, which shows that the time complexity of the proposed models is equal to that of the multiple regression models [8].
All experiments were performed in our laboratory cluster, which has 32 machines. Each machine has four CPU cores and 24 GB of memory, where each CPU is an Intel Xeon X5650 at 2.67 GHz. The implementation in MapReduce requires several phases; thus, we have a pipeline of MapReduce jobs, as shown in Figure 6, where MR_i denotes one MapReduce job. Three phases are required to calculate (Z^T Z)^(-1).
In MR_0, we computed the product of Z^T and Z. In MR_1, we computed the L and U matrices using (13); in MR_1, we can also easily compute L^(-1) using (14), and the inverse of the upper triangular matrix (U^(-1)) can be computed equivalently by inverting U through the inverse of U^T, which is a lower triangular matrix. Finally, in MR_2, we computed (Z^T Z)^(-1) as the product of U^(-1) and L^(-1). Meanwhile, MR_3 calculates Z^T Y. From the outputs of MR_2 and MR_3, we can calculate the estimated parameters (i.e., B̂) as the product of (Z^T Z)^(-1) and Z^T Y in MR_4. In reference to Section 3, the Step 1 phase creates Z and Y, Step 2 corresponds to MR_0 and MR_3, Step 3 corresponds to MR_1 and MR_2, and Step 4 corresponds to MR_4.
In this implementation, we compared the execution time according to the number of MapReduce tasks, as shown in Figure 7. We used a 600 × 400 matrix as input Z and a 600 × 100 matrix as input Y; thus, the order of the estimated parameter (i.e., B̂) was 400 × 100. In a practical experiment, we would need to calculate matrices of much larger order; however, much time is needed to calculate matrix multiplication in our cloud when the matrices are large. Hence, we reduced the order of the matrices and simply compared the execution times according to the number of tasks.
Figure 7 shows the execution time for calculating each phase (i.e., MR_i). In Figure 7, the execution times of MR_0, MR_2, MR_3, and MR_4 decrease linearly when the number of Reduce tasks increases from 10 to 20. They decrease only gradually, however, when the number of Reduce tasks increases from 20 to 50 because of network bottlenecks, communication costs, and additional management time [22,23].
The first three bars for MR_1 in Figure 7 show the execution time for calculating MR_1 on a single node. No reduction in the execution time can be observed by increasing the number of Map tasks. Thus, if we want to reduce the execution time of MR_1, we need to use parallel LU decomposition.
The last bar for MR_1 in Figure 7 shows the execution time for calculating MR_1 using the parallel LU decomposition presented in Section 2.5. On a single node (i.e., one Reduce task), approximately 110 s are needed to calculate the LU decomposition of a 400 × 400 matrix and to obtain the inverses of the L and U matrices. With parallel LU, we split the 400 × 400 matrix into four submatrices, A_1 to A_4 (each of order 200 × 200), and then obtain L_1, L_2, L_3, U_1, U_2, and U_3 as presented in Section 2.5. This requires two MapReduce phases and 89 s to obtain the same results as on a single node.
Figure 8 shows the total execution time to obtain the estimated parameter (i.e., B̂). Increasing the number of tasks reduces the execution time. If we increase the task capacity by adding machines to the cluster, we may be able to calculate the matrix operations even faster. In addition, we can easily perform numerous matrix operations using MapReduce.
Figure 9 compares the execution times for calculating MR_3 and MR_4 when we use the multiple regression and the multivariate multiple regression models. We compare the two models using only MR_3 and MR_4 because MR_0, MR_1, and MR_2 are the same in both models. The multiple regression models consider only one KPI at a time; thus, the order of the Y matrix is 600 × 1. The multivariate multiple regression models consider 100 KPIs; thus, the order of the Y matrix is 600 × 100. Given the complexity of matrix multiplication, one might expect the execution time for MR_3 and MR_4 in the multiple regression models to be 100 times shorter than that in the multivariate multiple regression models.
In Figure 9, however, the execution time for MR_3 and MR_4 when the order of the Y matrix is 600 × 1 is only about 1.4 times shorter than when the order is 600 × 100. This happens because a minimum amount of time is needed for every MapReduce execution, including the time for forking the Map tasks, sorting, and merging in the Reduce tasks. Therefore, in this case, the multivariate multiple regression models are more efficient than repeatedly running the multiple regression models.

5. Conclusion
In BSON, recent research has indicated that a framework using machine-learning tools and the Gaussian process regression model facilitates a more automatic operation of SON. This approach, however, suffers from some limitations: it must determine the NPs related to each KPI individually, and it cannot give the exact value of a KPI according to changes in the NP values. Therefore, we proposed the multiple regression models to easily determine the relationship between a KPI and the NPs [8]. These multiple regression models, however, have their own shortcoming: to identify the relationship between various KPIs and the NPs, we must calculate the multiple regression models several times. In this paper, we therefore proposed the multivariate multiple regression models, which determine the relationship between various KPIs and NPs at one time and reveal whether KPIs conflict, and we implemented them using MapReduce.


Figure 1: Big data gathering path in mobile wireless network architecture.

Algorithm 4 (the MapReduce key-value pair of Step 4). The Map Function: {(i, k), ((Z^T Z)^(-1), j, (Z^T Z)^(-1)_ij)} for j = 1, 2, ..., up to the number of rows of (Z^T Y), or {(i, k), ((Z^T Y), j, (Z^T Y)_jk)} for j = 1, 2, ..., up to the number of rows of (Z^T Z)^(-1). The Reduce Function: {(i, k), (β̂_(i−1)k)}. Note that (Z^T Z)^(-1) and (Z^T Y) are the names of these matrices and not the entire matrices in the Map function. In the Reduce function, the jth element of (Z^T Z)^(-1) multiplies the jth element of Z^T Y with the same (i, k) key; then all the results are added. The result is the (i, k) element of B̂. In the Reduce function, note that i reaches up to the number of NPs plus one, and k reaches up to the number of KPIs.

Figure 7: Execution time for calculating each phase (MR_i) according to the number of tasks.

Figure 9: Execution time for calculating MR_3 and MR_4 according to the number of tasks when the order of Y is 600 × 100 or 600 × 1.

Table 1: Time complexity of matrix multiplication.