An Efficient SDN Load Balancing Scheme Based on Variance Analysis for Massive Mobile Users

In a traditional network, server load balancing is used to satisfy the demand for high data volumes. The technique requires large capital investment while offering poor scalability and flexibility, which makes it difficult to support the highly dynamic workload demands of massive mobile users. To solve these problems, this paper analyses the principle of software-defined networking (SDN) and presents a new probabilistic method of load balancing based on variance analysis. The method can be used to dynamically manage traffic flows for supporting massive mobile users in SDN networks. The paper proposes a solution using the OpenFlow virtual switching technology instead of the traditional hardware switching technology. An SDN controller monitors the data traffic of each port by means of variance analysis and provides a probability-based selection algorithm to redirect traffic dynamically with the OpenFlow technology. Compared with existing load balancing methods, which were designed to support traditional networks, this solution has lower cost, higher reliability, and greater scalability, which satisfy the needs of mobile users.


Introduction
In considering network overhead, techniques for load balancing are of significant importance. Load balancing directly impacts application and service availability for mobile users [1]. It aims to optimize resource utilization by maximizing throughput, minimizing response time, and avoiding the overloading of any single resource. To alleviate heavy network traffic and reduce the risk that a single server becomes the main overhead contributor, many data centers adopt dedicated hardware to enable load balancing in order to support a large number of users [2,3]. However, these hardware systems are usually expensive to procure, can be technically challenging to deploy, and may require human intervention to work consistently.
Software-defined networking (SDN) is a type of computer networking that provides a simple, convenient, and manageable method of network flow control with minimal investment, reducing cost and increasing benefit for massive mobile users. It controls data transport by means of a software implementation of switches. When a data flow arrives at a switch, a flow table lookup has to be carried out. Flow tables ([Header : Counters : Actions]) are widely used in SDN. For each network flow, the headers and counters are updated if flow changes are required or actions are imposed. By recording the header information in a database, an OpenFlow switch can process the data flow according to the header records. Based on the SDN model with a centralized controller, an OpenFlow switch can be given different rules that control the network traffic using the header records. Such a flow control system makes it possible, in principle, to define an algorithm that balances the network load.
This paper aims to present a new probabilistic method of load balancing based on variance analysis in SDN networks for supporting dynamic demand from massive mobile users. The SDN controller can monitor the data traffic of each server port and manage all inbound and outbound traffic from server clusters. By deploying a dynamically extensible load balancing strategy, the proposed model reduces the packet latency found in traditional communication networks and guarantees the reliability, continuity, and timeliness of service for massive mobile users. In comparison to existing load balancing methods, the proposed method is able to solve the observed deficiencies of traditional methods such as high cost, low reliability, and poor extensibility.
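The [Header : Counters : Actions] lookup described above can be illustrated with a small sketch. This is not an OpenFlow implementation: the class `FlowEntry`, the function `lookup`, and the field names (`ip_dst`, `packet_count`, `byte_count`) are hypothetical names chosen for illustration, and matching is reduced to exact equality on the header fields an entry specifies.

```python
from dataclasses import dataclass


@dataclass
class FlowEntry:
    """One [Header : Counters : Actions] row (illustrative field names)."""
    header: dict           # match fields, e.g. {"ip_dst": "10.0.0.100"}
    actions: list          # opaque action descriptions, e.g. ["output:1"]
    packet_count: int = 0  # counters updated on every match
    byte_count: int = 0


def lookup(table, packet, size):
    """Return the actions of the first entry whose header fields all match.

    Returns None on a table miss; a real switch would then hand the
    packet to the controller instead.
    """
    for entry in table:
        if all(packet.get(k) == v for k, v in entry.header.items()):
            entry.packet_count += 1
            entry.byte_count += size
            return entry.actions
    return None


table = [FlowEntry(header={"ip_dst": "10.0.0.100"}, actions=["output:1"])]
print(lookup(table, {"ip_src": "10.0.0.5", "ip_dst": "10.0.0.100"}, 64))
```

A miss (return value `None`) corresponds to the case in which the switch has no matching entry and must consult the controller.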

Related Work
SDN (software-defined networking) is a technology in the field of computer networking which is presently generating significant interest. It originated from a project that began at UC Berkeley and Stanford University around 2008 [4]. SDN is currently seen as one of the emerging approaches to computer networking that allows network researchers to manage network services through abstraction of lower level functionality [5,6]. This is achieved by decoupling the network control that makes decisions about where traffic is sent (the control plane) from the forwarding systems that forward traffic to the selected destination (the data plane). The network becomes directly programmable, and the infrastructure can be abstracted for applications and network services. The experts and vendors of these systems claim that this simplifies networking [7]. At its core, SDN offers higher flexibility and rapid routing of traffic flows. Within the framework of this separation, developers can use the control plane to change the behavior of the network without physically modifying the existing network infrastructure. This allows developers to conduct experiments flexibly and efficiently and enables the rapid deployment of new network architectures. This architecture is visualized in Figure 1. Within the SDN architecture, the application layer provides users with a wide range of innovative services and applications, while the control layer is realized by SDN software on the server. For ease of use, the SDN software includes a uniform application program interface (API) [8]. The data layer comprises generic network devices that provide hardware or switching operations, which are defined in software in the control layer and communicated through the OpenFlow standard protocol [9].
The OpenFlow protocol is a fundamental element for building SDN solutions. It is the first standard communication protocol defined between the control layer and the infrastructure layer in the SDN architecture [10,11]. OpenFlow uses the concept of flows to identify network traffic based on matching rules that can be statically or dynamically programmed by the SDN control software. Switches are responsible for applying the proper actions to packets and updating the records in the flow table entry. The switches simply forward packets according to the relevant entry in the flow tables without being concerned with how the flow table is constructed or modified. The controller creates and installs a rule in the flow table for the corresponding packet if necessary, and it may at any time manage all switches through their flow tables. OpenFlow-based SDN architectures provide extremely granular control, enabling the network to react to real-time changes from the application or the service user [12]. OpenFlow-based SDN technologies increase bandwidth capability, support the dynamic nature of applications, and significantly reduce operation and management complexity [13].
At present, the existing traffic scheduling algorithms mainly include the Round-Robin scheduling algorithm and the Greedy scheduling algorithm. These scheduling algorithms have some drawbacks, such as high cost, low reliability, and low scalability. The ability of such algorithms to deal with mass traffic becomes more important as the number of mobile users increases. To solve the complex selection problem that the network faces, a probability selection algorithm can be regarded as a good method. For a probability selection algorithm, the concern is not a single choice but the developing trend of server traffic and the load of the servers. Briefly, within part of the solution space, we look for an optimal solution under a complex environment. In each iteration, we save a set of candidate solutions, choose a better feasible solution by using the probability selection algorithm based on the mapping of server load, and then produce a new generation of candidate solutions. The process is repeated until the F-test value converges to the threshold.

Design and Implementation of Our Scheme
3.1. Load Balancing Technology. Load balancing provides a transparent way to increase the bandwidth of servers and other network devices and to enhance data packet processing capacity and network throughput, ultimately improving the usability and flexibility of a network [14,15]. Load balancing aims to optimize resource use, maximize throughput, minimize response time, and avoid overloading any single resource. The importance of server load balancing is widely recognized, and methods to improve it are actively and continually researched. In comparison to the rapid development of network technology, the growth rate of server processor speed and memory access is comparatively slow. At present, the processing overhead of servers is a major bottleneck of network development. Meanwhile, with the development of high-speed networks and increasing demands for services, many enterprise data centers and portal servers are becoming overwhelmed by the explosive growth in data traffic. Load balancing is the key technology used to distribute data demands across a cluster of server systems.
For server load balancing based on a forward switching method, this paper proposes a novel method that uses Network Address Translation (NAT) in the SDN architecture to construct a hybrid load balancing model. NAT here refers to using a virtual address to represent the actual server address and rewriting the destination address of the request packet. Ultimately, the data is retransmitted to the selected server [16,17]. Present load balancing techniques are characterized by high investment, high consumption, low agility, and low reliability. Many of these issues can be solved by software-defined networking. This paper presents a new probabilistic method of load balancing based on variance analysis in SDN networks.

Variance Analysis
3.2.1. F-Test. Analysis of variance (ANOVA) is a set of statistical models, and their associated procedures, which are used to analyze the differences between group means; it was developed by R. A. Fisher. In the ANOVA setting, the observed variance of a particular variable is partitioned into components attributable to different sources of variation. This paper utilizes a variance analysis method to determine whether the averages of several sets of data are equal by analyzing data statistics.
For analyzing the statistical characteristics of port flux, this paper adopts the F-test method to detect whether there are significant differences among ports in order to determine if the operation is valid for the current state. Additionally, because data flow in the network is randomly selected, the traffic from each port can be viewed as independent with a normal distribution. The overall differences are divided into two basic classes: within-group variation and between-group variation. The between-group variation measures the dispersion of the average traffic of each group about the population mean. The within-group variation measures the dispersion of the samples in a group about the mean of that group. F-test analysis is a statistical technique that is used to distinguish a set of groups based on these differences. Each mean square is obtained by dividing the corresponding sum of squared differences by its degrees of freedom. The F-inspection value is defined as the ratio of the between-group ("inter-") mean square to the within-group ("intra-") mean square and is compared with the critical value at the chosen significance level [18]. This determines whether there is a significant difference between ports. Based on the above, the F-test formula is as follows:
$$F = \frac{MS_B}{MS_W}. \quad (1)$$
In (1), $MS_B$ is the between-group difference and $MS_W$ is the within-group difference.
To clarify by example, let $A_1, A_2, \ldots, A_k$ be a factor set having $k$ different parts, let $n_i$ be the number of monitoring times at level $A_i$, let $X$ be related to the traffic, and let $X_{i1}, X_{i2}, \ldots, X_{in_i}$ be the set of $n_i$ samples at level $A_i$. Consider
$$\bar{X} = \frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i} X_{ij}. \quad (2)$$
In (2), $\bar{X}$ equals the average of all of the traffic values, $k$ refers to the number of groups, and $N = \sum_{i=1}^{k} n_i$ refers to the total number of monitoring times. Consider
$$SS_T = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(X_{ij} - \bar{X}\right)^2. \quad (3)$$
In (3), $SS_T$ is the total sum of squares, which equals the sum of squared deviations between every sample in the population and the population mean. Consider
$$SS_B = \sum_{i=1}^{k} n_i \left(\bar{X}_i - \bar{X}\right)^2. \quad (4)$$
In (4), $SS_B$ is the between-group sum of squares, which is the sum of squared deviations between each group mean $\bar{X}_i$ and the population mean. Consider
$$SS_W = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(X_{ij} - \bar{X}_i\right)^2. \quad (5)$$
In (5), $SS_W$ is the within-group sum of squares, which is the sum of squared deviations between every sample value in a group and the mean of that group. Consider
$$MS_W = \frac{SS_W}{\mathrm{df}_W}, \quad \mathrm{df}_W = N - k. \quad (6)$$
In (6), dividing $SS_W$ by its degrees of freedom $\mathrm{df}_W$ gives $MS_W$, which indicates the within-group variance. Similarly, $MS_B$, which refers to the between-group variance, is obtained from
$$MS_B = \frac{SS_B}{\mathrm{df}_B}, \quad \mathrm{df}_B = k - 1. \quad (7)$$
The F-test is employed to compare the components of the total deviation. The F-inspection value is defined as the ratio of the between-group ("inter-") to the within-group ("intra-") differences. An observed value of $F$ which is greater than the critical value of $F$ determined from tables indicates that there are significant differences among groups. Conversely, a small F-test value which does not exceed the critical value of $F$ determined from tables indicates that there is no fundamental distinction among groups.
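As a worked illustration of the F value described above, the following minimal sketch computes the between-group and within-group mean squares for a list of per-port traffic samples using the standard one-way ANOVA decomposition. The function name `f_statistic` and the sample numbers are illustrative assumptions, not taken from the paper, and the sketch assumes the within-group variance is nonzero.

```python
def f_statistic(groups):
    """One-way ANOVA F value for a list of per-port traffic samples.

    Returns (F, MS_between, MS_within). Assumes at least two groups,
    more samples than groups, and nonzero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group means about the grand mean.
    ss_b = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: samples about their own group mean.
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_b = ss_b / (k - 1)        # df_B = k - 1
    ms_w = ss_w / (n_total - k)  # df_W = N - k
    return ms_b / ms_w, ms_b, ms_w


# Three ports, three monitoring samples each (made-up traffic values).
f_value, ms_b, ms_w = f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_value, ms_b, ms_w)
```

The resulting F value would then be compared with the tabulated critical value for $(k-1, N-k)$ degrees of freedom; the table lookup itself is left out of the sketch.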

𝑡-Test and Multiple Comparisons.
Based on the results of the above calculations, we obtain the F-inspection value, which can only be used to indicate whether there are significant differences among groups. The F-inspection value does not make it clear which of these groups, which should be few in number, contain noteworthy differences. The calculated averages therefore need to be compared further by adopting multiple t-tests. Before discussing multiple t-tests, we first focus on the t-test for two independent samples and assume that $H_0: \mu_0 = \mu_1$, $H_1: \mu_0 \neq \mu_1$. The t-test expression is shown as follows:
$$t = \frac{\bar{X}_0 - \bar{X}_1}{\sqrt{S_p^2\left(\frac{1}{n_0} + \frac{1}{n_1}\right)}}.$$
Then,
$$S_p^2 = \frac{(n_0 - 1)S_0^2 + (n_1 - 1)S_1^2}{n - 2}.$$
Let $n = n_0 + n_1$, where $n$ denotes the total number of monitoring times and $S_0^2$, $S_1^2$ are the sample variances of the two groups.
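According to the above principle, a multiple t-test can be formulated for $k$ ($k > 2$) ports. The null hypothesis is as follows:
$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k.$$
Then the multiple t-test statistic is given by the following formula:
$$t = \frac{\bar{X}_i - \bar{X}_j}{\sqrt{MS_W\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}},$$
where $\bar{X}_i$ and $\bar{X}_j$ are any two of the group averages, $MS_W$ is the within-group mean square, and $\mathrm{df}_W$ is its degrees of freedom. Consider
$$\left|\bar{X}_i - \bar{X}_j\right| \geq t_{\alpha}(\mathrm{df}_W)\sqrt{MS_W\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}.$$
That is, if the difference between any two averages reaches or exceeds the critical difference at significance level $\alpha$, then the null hypothesis is rejected. It is then necessary to proceed with dynamic load balancing to avoid contention.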
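The pairwise comparison can be sketched as follows, assuming the standard least-significant-difference form in which the critical difference for ports $i$ and $j$ is $t_{crit}\sqrt{MS_W(1/n_i + 1/n_j)}$. The function name `lsd_pairs` is hypothetical, and the critical value `t_crit` is passed in by the caller (for example, read from a t table for the within-group degrees of freedom) rather than computed here.

```python
import math
from itertools import combinations


def lsd_pairs(means, sizes, ms_w, t_crit):
    """Return the index pairs whose mean difference reaches the critical difference.

    means  : per-port mean traffic
    sizes  : number of monitoring samples per port
    ms_w   : within-group mean square from the ANOVA step
    t_crit : tabulated t critical value (supplied by the caller)
    """
    flagged = []
    for i, j in combinations(range(len(means)), 2):
        critical_diff = t_crit * math.sqrt(ms_w * (1 / sizes[i] + 1 / sizes[j]))
        if abs(means[i] - means[j]) >= critical_diff:
            flagged.append((i, j))
    return flagged


# Made-up example: three ports, three samples each, MS_W = 1.0,
# t_crit = 2.447 (two-sided 5% value for 6 degrees of freedom).
print(lsd_pairs([2.0, 3.0, 6.0], [3, 3, 3], 1.0, 2.447))
```

In this example the third port differs from the other two, while the first two ports are not distinguishable at the chosen level.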

The Existing Problems in 𝑡-Test.
As to the comparison among the service port flux, when the number of groups is greater than 2, the probability of making a type I error is increased.
When factor $A$ consists of multiple independent ports, we can assume the following: $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$. If $H_0$ is true, the number of pairwise t-tests is $m = C_k^2 = k(k-1)/2$. Now suppose that the significance level is $\alpha$; then the probability of a correct decision in a single comparison is $1 - \alpha$. Across a series of $m$ comparisons, the probability of avoiding a type I error is $(1-\alpha)^m$, so the probability of making at least one type I error is $1 - (1-\alpha)^m$. When the significance level is $\alpha = 0.05$ and the number of ports is $k = 4$ ($m = 6$), the probability of making a type I error is 0.265. When the significance level is $\alpha = 0.05$ and the number of comparisons is $m = 10$ (that is, $k = 5$ ports), the probability is 0.401; for $k = 10$ ports ($m = 45$) it reaches about 0.90. The error probability increases considerably.
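The growth of the error probability with the number of ports can be checked numerically. The sketch below assumes the $m = k(k-1)/2$ pairwise comparisons are independent, each made at level $\alpha$, and evaluates $1 - (1-\alpha)^m$; the function name `familywise_error` is an illustrative choice.

```python
from math import comb


def familywise_error(alpha, k):
    """Return (m, p): the number of pairwise comparisons among k ports and
    the probability of at least one type I error across them, assuming
    independent comparisons at significance level alpha."""
    m = comb(k, 2)
    return m, 1 - (1 - alpha) ** m


for ports in (4, 5, 10):
    m, p = familywise_error(0.05, ports)
    print(ports, m, round(p, 3))
```

With $\alpha = 0.05$ this gives roughly 0.265 for 4 ports (6 comparisons), 0.401 for 5 ports (10 comparisons), and 0.901 for 10 ports (45 comparisons).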

The 𝑡-Test Adjustment of Multiple Comparisons.
Research shows that analyzing the pairwise differences with an unadjusted significance level increases the probability of making a type I error. Therefore we adjust the significance level: the original level $\alpha$ is replaced by a new, stricter significance level $\tilde{\alpha}$, and $\tilde{\alpha}$ is substituted for $\alpha$ in the multiple t-test formula. Finally, the minimum critical value is recalculated at the new significance level to ensure that the probability of making a type I error can be controlled within a reasonable scope.
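The exact adjustment formula is not recoverable from the text here, so the following sketch shows two standard adjustments as assumptions rather than as the paper's own choice: the Šidák correction, which inverts $1 - (1-\tilde{\alpha})^m = \alpha$, and the simpler Bonferroni correction $\tilde{\alpha} = \alpha/m$. Both function names are illustrative.

```python
def sidak_alpha(alpha, m):
    """Per-comparison level such that m independent comparisons have an
    overall type I error probability of exactly alpha."""
    return 1 - (1 - alpha) ** (1 / m)


def bonferroni_alpha(alpha, m):
    """Slightly more conservative per-comparison level, alpha / m."""
    return alpha / m


m = 6  # pairwise comparisons among 4 ports
print(sidak_alpha(0.05, m), bonferroni_alpha(0.05, m))
```

Either adjusted level would then be used to look up the new critical value of $t$ for the pairwise comparisons; the two corrections give very similar values when $\alpha$ is small.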

Selection Probability-Based Algorithm
Our ultimate aim is to reach a balance during transit from the source to its final destination and to find an alternative server that relieves the overloaded one. The SDN controller modifies flow table entries for all possible switches in advance and sends the flow tables to the switches in time. By monitoring the flow direction, dynamic load balancing can be effectively implemented.
The stability of the network is analyzed with the F-test. Lower F-test values indicate greater network stability. Hence network stability is inversely related to the F-test value, which is adopted as a threshold parameter. The main process is illustrated in Algorithm 1.
Step 1. If $F(\mathrm{df}_B, \mathrm{df}_W) > F(\alpha)$, calculate the minimum boundary value of differences from $t_{\tilde{\alpha}}(\mathrm{df})$.
Step 2. If a pairwise difference exceeds this boundary value, record the IDs of the ports which might have considerable differences in the comparison matrix (Table 1), which is integrated into the control module of our SDN network. In Table 1, the row and column indices $i$ ($1 \leq i \leq 4$) denote the ports. After the operation, the difference of each pair of means is compared with the minimum threshold. If the difference is significant at $\tilde{\alpha} = 0.05$, the symbol * is used in the comparison matrix to indicate that there are differences between the ports. If the difference is significant at $\tilde{\alpha} = 0.01$, the symbol ** is used to indicate that there are significant differences. Pairs marked ** are always treated as a priority task by the SDN controller.
Step 3. Perform the selection with our algorithm, which is similar to roulette wheel selection. The selection formula is as follows:
$$\mathrm{Port}_i = \frac{\mathrm{Flow}_i}{\sum_{j=1}^{n} \mathrm{Flow}_j}.$$
According to the traffic, a collection of port probabilities can be calculated: $\{\mathrm{Port}_1, \mathrm{Port}_2, \ldots, \mathrm{Port}_i, \ldots, \mathrm{Port}_n\}$.
Step 4. Arrange the probabilities in descending order.
Step 5. Compare a uniformly distributed random number $r$ with the cumulative probability. The index $i$ that satisfies the following condition is the index of the selected port:
$$\sum_{j=1}^{i-1} \mathrm{Port}_j < r \leq \sum_{j=1}^{i} \mathrm{Port}_j.$$
Step 6. Repeat from Step 1. The analysis emphasizes the real-time behavior of the SDN network.
The important thing to note here is that performance overhead of control signals is not considered.We assume that there is no propagation delay.
For example, when the F-test value exceeds the significance level threshold, this indicates that the current network load is not balanced. Then we can find the port with the largest inflow traffic in the network by the t-test adjustment of multiple comparisons and can find the busiest server. At this time, the central controller in the network will perform traffic scheduling according to the algorithm based on selection probability, and the remaining traffic will be transferred to other servers.
It should be noted that the above probability selection algorithm is mainly for network traffic analysis and scheduling, and the overhead of SDN controller sending flow tables to SDN switches is not taken into account.
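The selection steps can be sketched as a roulette wheel over the ports other than the overloaded one. This is a minimal sketch under stated assumptions: the function name `select_port` is hypothetical, the weights follow the flow-proportional formula of Step 3 literally, and renormalizing the probabilities over the remaining ports after excluding the busiest one is an assumption of this sketch. It also assumes the remaining ports carry nonzero total traffic.

```python
import random


def select_port(flows, overloaded, rng=random):
    """Pick a target port by roulette wheel selection.

    flows      : dict mapping port id -> measured traffic
    overloaded : port id to relieve (excluded from the candidates)
    rng        : object with a random() method returning a float in [0, 1)
    """
    candidates = {p: f for p, f in flows.items() if p != overloaded}
    total = sum(candidates.values())  # assumed > 0 in this sketch

    # Steps 3-4: flow-proportional probabilities, in descending order.
    ranked = sorted(((f / total, p) for p, f in candidates.items()), reverse=True)

    # Step 5: compare a uniform random number with the cumulative probability.
    r = rng.random()
    cumulative = 0.0
    for prob, port in ranked:
        cumulative += prob
        if r <= cumulative:
            return port
    return ranked[-1][1]  # guard against floating-point rounding


print(select_port({1: 50, 2: 30, 3: 20, 4: 100}, overloaded=4))
```

Passing a custom `rng` makes the choice reproducible in tests; in normal use the default `random` module supplies the uniform number.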

ARP Processing. The Address Resolution Protocol (ARP) is a telecommunication protocol used for resolution of network layer addresses into link layer addresses, which is a critical function in multiple-access networks. In the SDN network, before sending an HTTP request, the client first sends a gratuitous ARP to the Openswitch [19]. The Openswitch does not find a matching flow table entry and generates a Packet-In message, which is sent to the SDN controller. The load balancing module, which is integrated into the controller, resolves the Packet-In message. A new ARP packet, filled in with the destination address, IP, forwarding port information, and so on, is reassembled as a Packet-Out message and sent back to the Openswitch. The client receives the new ARP reply packet and adds the ARP entry to its ARP table. The ARP request processing is carried out by our load balancing module.

TCP Request.
For end-users accessing the site, load balancing makes all servers appear as a single server with a single IP address; all load balancing is transparent. When the Openswitch receives the initial HTTP access request, it does not have a matching flow table entry and generates a Packet-In message which is sent to the SDN controller [20]. The controller parses the message and reassembles the flow table as shown in Figure 2 using object methods in the SDN controller such as OFMatch, OFAction, and OFFlowMod. On deployment, the flow table is sent to the Openswitch. This process replaces the virtual address with the physical address.
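The address replacement can be pictured with a small sketch that builds the match and action parts of two rules as plain dictionaries. This is not the Floodlight API (OFMatch, OFAction, and OFFlowMod are Java classes and are not reproduced here): the function `build_nat_rules`, the field names, and the action names are hypothetical, and the reverse rule that restores the virtual address on replies is an assumption based on how NAT-style load balancing commonly works.

```python
def build_nat_rules(client_ip, virtual_ip, server_ip,
                    server_switch_port, client_switch_port):
    """Return (forward, reverse) rules as plain dicts (illustrative only)."""
    # Client -> virtual IP: rewrite the destination to the chosen real server.
    forward = {
        "match": {"ip_src": client_ip, "ip_dst": virtual_ip},
        "actions": [("set_ip_dst", server_ip), ("output", server_switch_port)],
    }
    # Server -> client: restore the virtual IP as the source so that the
    # client keeps seeing a single server address (assumed behavior).
    reverse = {
        "match": {"ip_src": server_ip, "ip_dst": client_ip},
        "actions": [("set_ip_src", virtual_ip), ("output", client_switch_port)],
    }
    return forward, reverse


fwd, rev = build_nat_rules("10.0.0.5", "10.0.0.100", "10.0.0.1", 1, 4)
print(fwd)
print(rev)
```

In a real controller module these two dictionaries would correspond to two flow-mod messages installed on the switch after the first Packet-In of a connection.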

Experiment Result and Performance Analysis
In the experiment, the operating system was Ubuntu 14.04.3 desktop-amd64 and the controller was Floodlight version 1.0. Mininet 2.1.0, a network emulator that creates virtual networks using the Ubuntu kernel, was used to define the topology of the whole network, and Open vSwitch 2.3.2 was used to simulate the required OpenFlow switches. To carry out the measurements, an OpenFlow test platform was built as depicted in Figure 3. The SDN network model contains four independent server nodes, three OpenFlow switches, and a Floodlight controller. The server code in our experiment was written in Python, but almost the same design would apply to nearly any language. Python comes with a simple built-in HTTP server.
In the experiment, Mininet was first used to build an architecture with different paths. Each server represented a physical machine in Mininet and had its own actual IP address. We created a virtual IP address which is advertised from the NAT, and incoming traffic destined to this virtual IP address was routed by the Floodlight controller to different actual IP addresses. We created automated scripts of access requests on the clients. Next, we performed real-time statistical analysis of the traffic flux through the analysis of variance and traffic scheduling algorithm modules, which were integrated into the Floodlight controller, and chose an appropriate path based on the analysis result.
These switches do not limit the transmission speed and work with maximum link rate.Java code was used to implement the analysis of variance and traffic scheduling algorithm modules which are integrated into the Floodlight controller.By analyzing the results of the experiment and comparing with other algorithms, the improvement in the performance of the proposed algorithm was verified.The experiment consists of initialization of data flow, traffic analysis of variance, and calling of the load balancing algorithm.
The experiment requires quantitative measurement based on a measurement server in the SDN. The controller can save data plane information and the interaction between the controller and the OpenFlow switches to a local log file. The measurement server (testing server) synchronizes with the Floodlight controller and the servers and obtains the traffic data. The general arrangement of the testing platform is shown in Figure 4.
Through a simulation experiment, the performance of the proposed algorithm is verified. Assume that a user requests access to a virtual address. The communication time of each server, including a web service request, is 5 seconds.
When excessive load occurs on one server, the following three load balancing algorithms are executed individually: Round-Robin scheduling, Greedy scheduling, and probability scheduling. The traffic at each server is captured by the variance analysis module in the Floodlight controller, and this information is saved to log files. Figures 5, 6, and 7 illustrate the starting and ending positions of port traffic. Figures 5 and 6 show the response of the proposed algorithm and the Greedy algorithm, respectively. It can be seen from these figures that the peak values of all the columns are essentially flat, meaning that all four servers are load balanced by the two methods. In a real environment, it is not necessary to keep strict equilibrium at all times. Figure 7 shows the result of the Round-Robin scheduling algorithm. There is considerable difference in the peak values of the columns, which indicates that this method is ineffective.
Figure 8 displays the curves for all scheduling algorithms in one figure and their relationship with the F-test value. The smaller the F-test values in a series of experiments are, the fewer the differences among the monitored data traffic of the ports are. As shown here, the proposed algorithm matches well with the Greedy algorithm. As the number of interaction transmissions increases, the F-test values are evidently reduced, and these methods can adjust the load balance of all servers in real time and redirect traffic more efficiently. It is also clear from the figure that the Round-Robin algorithm is not the best one which means there are

Figure 2: Package format of the flow table.

Figure 3: Test platform set up to measure model realization.

Figure 8: The F-test values of three scheduling algorithms.

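Algorithm 1
while $F(\mathrm{df}_B, \mathrm{df}_W) > F(\alpha)$ do
    calculate $\tilde{\alpha}$, which replaces the primary $\alpha$, and set it to be our object parameter;
    find $t_{\tilde{\alpha}}(\mathrm{df}_W)$, which is calculated from the parameter $\tilde{\alpha}$, and use it as a variable to calculate the minimum boundary value of differences;
    while a pairwise difference exceeds the boundary value do
        calculate the probability parameters for all ports, $\mathrm{Port}_i = \mathrm{Flow}_i / \sum_{j=1}^{n} \mathrm{Flow}_j$, and find $\mathrm{Port}_{\max}$, the port with the maximum data flow, among $(\mathrm{Port}_1, \mathrm{Port}_2, \ldots, \mathrm{Port}_i, \ldots, \mathrm{Port}_n)$;
        arrange the rest of the ports in descending order;
        generate a random number $r$ and compare it with the cumulative probability $P_i = \sum_{j=1}^{i} \mathrm{Port}_j$.

Table 1: Pairwise comparison among different ports.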