Hybrid Approach for Resource Allocation in Cloud Infrastructure Using Random Forest and Genetic Algorithm

In cloud computing, virtualization is a significant technology for optimizing the power consumption of the cloud data center. Most services are now moving to the cloud, which increases the load on data centers. As a result, data centers grow in size and consume more energy. To resolve this issue, an efficient optimization algorithm is required for resource allocation. In this work, a hybrid approach for virtual machine allocation is proposed, based on a genetic algorithm (GA) and random forest (RF), a supervised machine learning technique. The aim of the work is to minimize power consumption while maintaining better load balance among available resources and maximizing resource utilization. The proposed model uses the genetic algorithm to generate a training dataset, on which the random forest model is then trained. Real workload traces from PlanetLab are used to evaluate the approach. The results show that the proposed GA-RF model improves energy consumption, execution time, and resource utilization of the data center and hosts compared to existing models. Power consumption, execution time, resource utilization, average start time, and average finish time are used as performance metrics.


Introduction
Cloud computing is a form of distributed computing that brings in utility models to deliver measurable and scalable resources remotely. The cloud is also a concrete implementation of parallel computing, grid computing, and distributed computing [1]. The cloud environment offers a shared pool of resources to users as a service on an "on-demand" basis [2]. Effective computing capability and enormous storage capacity allow users to access cloud services anytime and anywhere. A cloud data center comprises IT resources such as databases, servers, communication devices, networks, and software systems. Growing user demand for cloud resources makes cloud providers scale up the number of servers and other required hardware. As a result, adding more physical nodes increases the power consumption of the data center. Data centers consume 2% of today's worldwide electricity, and this share is expected to reach 8% by 2030. There are three main power consumers in a data center: cooling systems, data center networks, and servers. The network consumes 10 to 25% of the power, cooling systems consume 15 to 30%, and servers consume around 40 to 55% [3].
IaaS (Infrastructure as a Service) offers computing resources such as RAM, CPU, network, and storage as a service, and their use is typically governed by an SLA (Service Level Agreement). Resource utilization also has an impact on energy consumption. Low resource utilization is one of the reasons for the energy inefficiency of the data center [4]. Even when CPU utilization is as low as 10%, meaning that the workload is light, energy consumption is above 50% of the peak power. The virtualization techniques in IaaS play a vital role here, because they efficiently improve resource or cloud utilization [5]. Virtualization offers resource sharing, which allows virtual machines (VMs) to execute on physical machines (PMs) to process user requests. Three possible operations performed using virtualization are VM isolation, VM migration, and VM consolidation. The VM migration technique shifts running virtual machines from one physical machine to another. In the VM consolidation process, virtual machines running on different hosts leave those hosts and gather on fewer ones, which reduces energy wastage by switching off the vacated hosts or moving them to hibernate mode [6]. Virtual Machine Placement (VMP) is the technique of executing virtual machines on suitable physical machines. To enhance power efficiency and maximize resource utilization, an efficient VMP technique is necessary [7]. The VMP problem is an NP-hard optimization problem [8].
In this work, an efficient hybrid VMP technique is proposed using the genetic algorithm (GA) and the random forest (RF) algorithm. Our objective is to reduce the energy consumption of the data center while balancing the load across several physical machines. Maximizing resource utilization of the physical machines is also taken as one of the key parameters for the evaluation of the proposed method. Cloud users demand the least waiting time and the least request completion time; therefore, minimizing execution time, average start time, and average finish time is an objective of this work. Another objective of the proposed model is to reduce the time needed to find the optimal solution, which is where iterative metaheuristic algorithms such as GA, ACO, and PSO spend most of their time. The model aims to train the machine learning model with the best optimal solution; in the next iteration, the trained model can then be used to predict the optimal solution in constant time, removing the time taken by evolutions in search of the global best solution.
One of the metaheuristic techniques used to find a globally optimal solution is the genetic algorithm. Firstly, the GA generates an optimized schedule for resource allocation, which acts as a training dataset that contains the mapping of virtual machines to physical machines. Next, the dataset derived from the GA is used to train the random forest algorithm, which then performs the classification, i.e., the allocation of virtual machines to physical machines. The classification accuracy of the RF is tested using a subset of the dataset obtained from the GA. The random forest is a supervised machine learning technique, and it can place virtual machines on the best physical machines with good accuracy because it reduces the overfitting seen in individual decision trees. Figure 1 represents the system model. It comprises several physical machines within a data center. Each physical machine can execute many virtual machines.
The Virtual Machine Monitor (VMM), also known as the hypervisor, is a software program that facilitates the creation, management, and monitoring of virtual machines. It also manages a virtualized environment on top of physical machines. When a request for VM execution is received by the data center manager, it first gathers the status information from all available physical machines and delivers it to the VM scheduler. The VM scheduler is developed using the GA-RF technique. In the next step, the VM scheduler analyzes the status information and then allocates virtual machines to apt physical machines. The rest of the work is organized as follows: Section 2 presents a literature survey with existing models and their comparison. The proposed method is discussed in Section 3, Section 4 gives an evaluation of the proposed method with the experimental setup and results, and, lastly, the conclusion and future work are given in Section 5.

Literature Review
This section reviews some of the work done by researchers. Table 1 lists different approaches in the field of virtual machine placement and the parameters they consider for performance evaluation. The authors in [9] propose a VMP technique using a combination of the genetic algorithm and the Tabu search algorithm. They focus on achieving energy efficiency together with improved load balance, and they also use execution time to compare different algorithms. Abohamama et al. [10] presented a VMP algorithm using an improved permutation-based genetic algorithm to improve the energy consumption rate by reducing the number of active hosts that run VMs. The proposed method is compared with the Flow Shop Scheduling Problem and Traveling Salesman Problem. Yao et al. [11] introduced a VM placement procedure based on Weighted PageRank. They focused on minimizing the number of active physical machines while increasing the resource utilization of all hosts in the data centers. To keep the method from falling into a local optimum, the impact of not-yet-placed virtual machines is considered. While selecting a physical machine for a VM, Weighted PageRank covers unplaced VMs of several types, and the algorithm measures the possibility of a physical machine making complete use of its resources under different conditions.
The authors in [12] proposed a stochastic VMP approach to increase the energy efficiency and resource utilization of the data center. Here resource requirements are modeled as random variables instead of deterministic values. Because the resource requirements vary, the proposed optimization model is subject to a probabilistic restriction on the resource overflow probability on every physical machine. In [13], VMP in a heterogeneous data center using a binary gravitational search algorithm (BGSA) was proposed. The aim of this work is to decrease energy consumption. The proposed method uses agents as objects, and their mass is used to measure performance: an object with a higher mass represents a better solution. In [14], the authors proposed an evolutionary approach to increase energy efficiency. The proposed approach also incorporates reserved virtual machines. Both simulation and real-time cloud environments were used to evaluate the performance of different techniques.

Scientific Programming
Consolidation of VMs on fewer hosts resulted in a decrease in energy consumption. Ghasemi et al. [15] designed a reinforcement learning-based approach to address VM placement. The authors focused on load balancing while maximizing resource utilization and the number of hosts shut down. The proposed method chooses an action from the available acceptable actions and executes it on a cloud environment. It then receives a reinforcement signal reflecting the suitability of the virtual machine placement solution produced by that action. In [16], the authors proposed a hybrid approach based on a Naive Bayesian classifier and Random Key Cuckoo Search for the VM consolidation problem to minimize energy consumption. Here Naive Bayes is used to detect the future state of the hosts, which is necessary to perform virtual machine placement efficiently. Khan et al. [17] presented HeporCloud, a framework for a hybrid cloud platform that includes an integrated, workload-aware single resource scheduler and orchestrator. The proposed resource management may assign and forecast the placement and transfer of effective workloads. The empirical study shows that HeporCloud can efficiently plan and consolidate various types of workloads in terms of energy, performance, and cost. An extended version of the CloudSim simulator was proposed in [20] to improve the accuracy and precision of CloudSim. The evaluation of the extended version showed that it performed better in terms of energy, the performance of resource allocation, and even consolidation in heterogeneous data centers. The authors in [18] presented a consolidation approach that prioritizes the most efficient migration, which could be a VM, a container, or a specific application running within a container.
Here the authors modeled the heterogeneity of cloud applications and resources and demonstrated how the consolidation of heterogeneous apps, containers, and virtual machines affects heterogeneous data center performance and energy efficiency. In [19], game-theoretic resource management techniques for multi-access edge computing were developed. Google's workload traces were used to evaluate the proposed work. The goal is to develop a resource management technique that is efficient in terms of energy, performance, and cost. Ilias Mavridis and Helen Karatza [21] proposed an approach that combines virtual machines and containers to enhance the isolation and extend the functionality of the cloud. The authors highlighted the benefits of running containers on virtual machines and investigated how different virtualization approaches and configurations influence the method's performance. Docker containers were run on KVM and XEN virtual machines, and Linux containers were run on Windows Server to see how they performed. By running multiple benchmarks and installing real-world apps as use cases, the authors were able to estimate the performance cost caused by the additional virtualization layer of virtual machines. Lastly, the authors investigated several operating systems designed to host containers, as well as techniques for storing persistent data, to see how isolation is implemented on virtual machines and containers.

Proposed Method
The proposed virtual machine placement technique in this work is a hybrid model using a genetic algorithm and the random forest technique. A genetic algorithm is an optimization technique, and here it generates the dataset required for training the random forest algorithm. The training dataset comprises the allocation, or mapping, of virtual machines to physical machines.

Genetic Algorithm.
A genetic algorithm (GA) is a metaheuristic technique that searches for a globally optimal solution. Here we use the genetic algorithm to generate a mapping of VMs to the available physical machines. The following steps are incorporated in our work:
(1) Initialize the population: the population is initially generated in a random manner, and all the virtual machines $VM_i$ ($i = 1$ to $n$) are mapped to physical machines $PM_j$ ($j = 1$ to $m$). Figure 2 shows a sample of the initial population.
(2) Fitness function: the fitness value of each host is computed by a fitness function in which β is a random constant. The individuals having higher fitness values are used for the reproduction process.
(3) Selection: the fittest individuals are selected from the population using the tournament selection method. The tournament size is taken as N individuals, and they are chosen randomly from the population. The winner of the tournament is taken for the crossover operation.
(4) Crossover: in this work, a 2-point crossover is used to generate new individuals (offspring) from the selected best individuals. Other crossovers, such as single-point, multipoint, and uniform crossover, are also available. To improve diversity, each newly generated individual is added back to the population. An example of a 2-point crossover is given in Figure 3. The pseudocode of the proposed genetic algorithm is shown in Figure 5. We have used 100 evolutions to get better results.
The final result is a mapping of virtual machines to physical machines.
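The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `fitness` function is a hypothetical stand-in (the paper's fitness equation is not reproduced here), the names `vm_load` and `pm_capacity` are invented for the example, and no mutation step is included because the text does not describe one.

```python
import random

def fitness(chromosome, vm_load, pm_capacity):
    """Hypothetical fitness (an assumption): rewards few active hosts, penalizes overload."""
    used = [0.0] * len(pm_capacity)
    for vm, pm in enumerate(chromosome):
        used[pm] += vm_load[vm]
    overload = sum(max(0.0, u - c) for u, c in zip(used, pm_capacity))
    active = sum(1 for u in used if u > 0)
    return 1.0 / (1.0 + active + 10.0 * overload)

def tournament(population, scores, size, rng):
    """Pick `size` random individuals and return the fittest of them."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]

def two_point_crossover(a, b, rng):
    """Swap the gene segment lying between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def run_ga(vm_load, pm_capacity, pop_size=20, generations=100, k=3, seed=0):
    """Return the best VM-to-PM mapping found; gene index = VM, gene value = PM."""
    rng = random.Random(seed)
    n, m = len(vm_load), len(pm_capacity)
    # Step 1: random initial population
    population = [[rng.randrange(m) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: evaluate fitness
        scores = [fitness(c, vm_load, pm_capacity) for c in population]
        # Step 3: tournament selection of two parents
        p1 = tournament(population, scores, k, rng)
        p2 = tournament(population, scores, k, rng)
        # Step 4: 2-point crossover, offspring are added back to the population
        population.extend(two_point_crossover(p1, p2, rng))
        # Keep the population size fixed by retaining the fittest individuals
        population.sort(key=lambda c: fitness(c, vm_load, pm_capacity), reverse=True)
        population = population[:pop_size]
    return population[0]

if __name__ == "__main__":
    mapping = run_ga(vm_load=[2, 2, 2, 2, 2, 2], pm_capacity=[8, 8, 8, 8])
    print(mapping)  # mapping[i] is the PM index chosen for VM i
```

Each chromosome is a list whose index is the VM and whose value is the host, which matches the VM-to-PM mapping the GA is meant to produce as training data.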

Random Forest Classifier.
Random forest is a supervised machine learning technique used for both regression and classification. It is a flexible and easy-to-use algorithm. A random forest is made up of trees, and the more trees there are, the more robust the random forest is. The random forest creates each decision tree by first selecting at random, at each node, a small set of features to split on and then calculating the best split based on these features in the training set. Finally, it gets a prediction from each tree and chooses the best solution by means of either 'majority voting' or 'performance voting', as expressed in Figure 6.
Module 1 (dataset creation): the dataset is created using the genetic algorithm. The dataset consists of the mapping of each VM to the best possible physical machine. The procedure to create the dataset using the GA is discussed in Section 3.1. Here, the dataset is divided into 2 sets. The first set consists of 80% of the dataset and is used as the training dataset to train the model, and the remaining 20% of the dataset is used for testing.
Module 2 (training): consider a training dataset $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ consisting of $N$ observations of the random vector $(x, y)$. The vector $x = (x_1, \ldots, x_P)$ contains the predictors, or independent variables, and $y \in C$, where $C$ is the set of class labels. Using this training set, the developed random forest is an ensemble of $B$ trees $t_1(x), \ldots, t_B(x)$, and the ensemble produces $B$ outputs, one per tree. The forest is built as follows:
1. For $b = 1$ to $B$, repeat steps 2 and 3.
2. Draw a sample $Z$ of size $N$ from the training data.
3. Fit a random-forest tree $T_b$ to the sampled data by recursively repeating the following steps for each terminal node of the tree, until either the minimum node size $n_{min}$ or the given depth is reached:
i. Select $m$ variables at random from the $p$ variables.
ii. Pick the best variable/split-point among the $m$.
iii. Split the node into two daughter nodes.
4. Output the ensemble of trees $\{T_b\}_{b=1}^{B}$.
5. Classification: let $C_b(x)$ be the class prediction of the $b$th random-forest tree for the data object $x$. The outputs of all the trees are combined to produce the final class $y$, which is the class that receives the maximum number of votes across the trees.
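The sampling and voting steps of this procedure can be illustrated with a small, dependency-free sketch. It is only an approximation under stated assumptions: each "tree" is reduced to a single split on one randomly chosen feature (a decision stump, i.e. m = 1 and depth 1), and the feature layout (CPU demand, RAM demand) with host labels is invented for the example. A production system would more likely rely on an established random forest library.

```python
import random
from collections import Counter

def majority_vote(labels):
    """Return the label that occurs most often."""
    return Counter(labels).most_common(1)[0][0]

def bootstrap_sample(rows, rng):
    """Draw a sample of size N from the training data, with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(sample, rng):
    """Stand-in for a full tree: one split on one randomly selected feature."""
    feature = rng.randrange(len(sample[0][0]))
    values = sorted({x[feature] for x, _ in sample})
    default = majority_vote([y for _, y in sample])
    best = (len(sample) + 1, None, default, default)
    # Candidate split points are midpoints between consecutive feature values
    for low, high in zip(values, values[1:]):
        thr = (low + high) / 2.0
        left = [y for x, y in sample if x[feature] <= thr]
        right = [y for x, y in sample if x[feature] > thr]
        l_lab, r_lab = majority_vote(left), majority_vote(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best[0]:
            best = (errors, thr, l_lab, r_lab)
    _, thr, l_lab, r_lab = best
    if thr is None:  # only one distinct value in the sample: constant predictor
        return lambda x: default
    return lambda x: l_lab if x[feature] <= thr else r_lab

def fit_forest(rows, n_trees=25, seed=0):
    """Build B stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng), rng) for _ in range(n_trees)]

def predict(forest, x):
    """Combine the per-tree predictions by majority voting."""
    return majority_vote([tree(x) for tree in forest])

if __name__ == "__main__":
    # Hypothetical rows: ((cpu_demand, ram_demand), host_label)
    rows = [((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.15, 0.25), 0),
            ((0.80, 0.90), 1), ((0.90, 0.70), 1), ((0.85, 0.80), 1)]
    forest = fit_forest(rows)
    print(predict(forest, (0.12, 0.20)), predict(forest, (0.90, 0.85)))
```

The final class is whichever host label collects the most votes, so a few weak or unlucky stumps do not change the outcome.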

Tree Construction.
The training dataset is used in tree building, which follows a top-down approach. We use information gain to identify the attribute that best splits the given training set; "best" is measured as $$Gain(D, A) = E(D) - \sum_{i=1}^{2} \frac{|D_i|}{|D|} E(D_i).$$ The gain is produced by partitioning the set $D$ of examples into two subsets $D_i$ according to the given attribute $A$. Here $E(d) = -\sum_{i=1}^{N} q_i \log_2(q_i)$ is the entropy, with $q_i$ the proportion of examples in $d$ belonging to class $i$, and $|\cdot|$ is the size of a set. The process of selecting the attribute is repeated for each nonterminal node; it stops when a node receives too few examples or when the given depth is reached.
Module 3 (testing): once the model has been trained on the training set, prediction is performed on the test set. The accuracy is then checked by comparing the actual values with the predicted values. If the obtained accuracy is less than the desired value, some tuning is done and the model is trained and tested again. This repeats until the desired accuracy is achieved. Figure 7 depicts the block diagram of the proposed hybrid technique. The genetic algorithm reads the PlanetLab dataset, which is a real workload dataset. The genetic algorithm goes through all the steps prescribed in Section 3.1 to generate the dataset containing the mapping of virtual machines to physical machines. This dataset is used as the input dataset for the random forest classifier. The random forest classifier splits the dataset into training and testing datasets. Using the training data, the RF generates the specified number of random trees, one for each subsample. The test dataset is used to test the classification process. Finally, majority voting decides the final class label (output value). Here the output value is the number of the physical host on which the required virtual machine is executed.
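As a worked illustration of the split criterion, the sketch below computes entropy and information gain for a candidate partition. It assumes the usual definition of information gain (parent entropy minus the size-weighted entropy of the subsets); the function names and the toy labels are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """E(d) = -sum_i q_i * log2(q_i), where q_i is the share of class i in d."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(labels) - remainder

if __name__ == "__main__":
    parent = [0, 0, 1, 1]  # host labels of four VMs
    print(information_gain(parent, [[0, 0], [1, 1]]))  # split separates the classes
    print(information_gain(parent, [[0, 1], [0, 1]]))  # split tells us nothing
```

A split that separates the classes perfectly recovers the full parent entropy, while a split that leaves both subsets as mixed as the parent has zero gain, which is why the attribute with the highest gain is chosen at each node.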

Flow Diagram.
Once the virtual machines are placed on the physical machines, execution starts. After some time, a physical machine may receive more VMs to execute and become overloaded. We use a VM migration technique to handle this scenario. When a PM gets overloaded, some of its virtual machines are selected and migrated to other PMs with less load, thereby maintaining load balance across all the physical machines in the data center. The interquartile range (IQR), which is one of the available overload detection methods, is used to detect overloaded physical machines in the data center [22]. Next, the maximum correlation policy is used to select the VMs to be migrated from the overloaded PM. It selects the VMs whose CPU utilization has the maximum correlation with that of the other VMs.
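A minimal sketch of IQR-based overload detection follows. The text does not spell out the threshold rule, so the form used here, upper threshold = 1 - s × IQR with a safety parameter s = 1.5, is an assumption borrowed from the way IQR detection is commonly formulated for CloudSim-style consolidation; the function names and sample histories are hypothetical.

```python
import statistics

def iqr(values):
    """Interquartile range Q3 - Q1 of a utilization history."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def is_overloaded(cpu_history, current_utilization, safety=1.5):
    """Flag a host whose current utilization exceeds the adaptive threshold.

    Assumed rule: threshold = 1 - safety * IQR(history). A more volatile
    history lowers the threshold, so the host is relieved earlier.
    """
    threshold = 1.0 - safety * iqr(cpu_history)
    return current_utilization > threshold

if __name__ == "__main__":
    steady = [0.5] * 10          # IQR is 0, so the threshold stays at 1.0
    volatile = [0.2, 0.4] * 5    # IQR is about 0.2, so the threshold is about 0.7
    print(is_overloaded(steady, 0.95), is_overloaded(volatile, 0.8))
```

Hosts flagged this way would then hand some of their VMs to the migration policy described above.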

Simulation Results
Experimental setup, performance metrics, and experimental results are discussed in this section.

Experimental Setup.
We have used the CloudSim 3.0 toolkit simulator to evaluate the proposed algorithm. CloudSim provides different VM provisioning techniques and virtualized resources. To carry out the experiment, we have taken real workload traces from PlanetLab. The PlanetLab data is part of the CoMon project and consists of CPU utilization from more than 1000 virtual machines running on various hosts in more than 500 locations around the world. In our experimental setup, we used 4 different types of virtual machines: Micro, Small, Medium, and Extra-Large instances. 800 heterogeneous hosts are deployed, which belong to the HP ProLiant G4 and HP ProLiant G5 categories. The characteristics of these servers are shown in Table 2. For the simulation, 500 hosts are used and the number of VMs varies from 500 to 650, with the data center configuration shown in Table 2. The PlanetLab dataset is a log file of a real-world data center with incoming traffic, task size, and the size of the VM requested by each task, with VM configuration such as RAM, number of processors, and MIPS count.

Performance Metrics and Results.
The following metrics are used to evaluate the proposed algorithm and other algorithms.

Energy Consumption.
It denotes the total energy consumed by all the physical machines (PMs) in the data center. The energy consumption of a PM is calculated according to the linear power consumption model. In this power model, the power consumption of the physical host grows linearly with an increase in CPU utilization.
Let us consider the following parameters for the power model:
(i) $P_k^{max}$: maximum power consumed when host $k$ is completely utilized
(ii) $P_k^{idle}$: idle power value of host $k$
(iii) $U_k$: current CPU utilization of host $k$
(iv) $T$: total number of hosts in the data center
The power consumption $P_k$ of host $k$ can be expressed as $$P_k = P_k^{idle} + (P_k^{max} - P_k^{idle}) \cdot U_k. \tag{3}$$ Our goal is to minimize the power consumption of the data center; that is, we aim to minimize $$\sum_{k=1}^{T} P_k. \tag{4}$$ Figure 8 shows the energy consumption of different algorithms. The energy consumption of our proposed algorithm, GA-Random Forest (RF), is lower on average by 17%, 31%, and 39% compared to the default GA (genetic algorithm), ACO (ant colony optimization), and PSO (particle swarm optimization), respectively, for the PlanetLab traces "20110409/planetlab-1_amst_nodes_planet-lab_org_Arizona_beta" and "20110409/pl1_rcc_uottawa_ca_Google_highground".
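The linear power model can be checked numerically with a short sketch. It assumes the standard linear form, host power = idle power + (maximum power - idle power) × CPU utilization, summed over all hosts for the data center; the wattages below are round illustrative numbers, not values taken from the paper's server table.

```python
def host_power(p_idle, p_max, utilization):
    """Linear model: power grows from p_idle to p_max as utilization goes 0 -> 1."""
    return p_idle + (p_max - p_idle) * utilization

def datacenter_power(hosts):
    """Total power of the data center: the sum over all hosts."""
    return sum(host_power(p_idle, p_max, u) for p_idle, p_max, u in hosts)

if __name__ == "__main__":
    # Hypothetical hosts: (idle watts, max watts, CPU utilization)
    hosts = [(100.0, 200.0, 0.5), (100.0, 200.0, 0.1), (100.0, 200.0, 0.0)]
    print(host_power(100.0, 200.0, 0.5))  # 150.0
    print(datacenter_power(hosts))
```

With these numbers a host at 10% utilization still draws 110 W, more than half of its 200 W peak, which illustrates why consolidating VMs onto fewer hosts and switching off the idle ones saves energy.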

Execution Time.
Executing all the user requests in less time is an important factor from the cloud provider's perspective, so execution time is taken as one of the key performance factors to evaluate the algorithms.
Here $t_l(i, j)$ represents the time required to execute the instructions of length $l$ of task $i$ on virtual machine $j$, and $VM_c$ denotes [...]. The runtime of task $i$ on virtual machine $j$ follows from these quantities, and the completion time of virtual machine $j$ is the sum of the run times of all tasks on that virtual machine.
Total execution time represents the time required to execute all the virtual machines, thereby completing the execution of all the user requests. As shown in Figure 9, the average execution time of GA-RF is lower by 15%, 29%, and 37% compared to GA, ACO, and PSO, respectively; i.e., GA-RF has a faster execution time.

Resource Utilization.
To process a user's request based on the user's resource requirements, the cloud data center creates various types of VMs. The VMP technique aims to place a virtual machine on a suitable physical machine to improve resource utilization. Let $P_{ij}$ denote whether $VM_j$ is placed on $PM_i$: if $VM_j$ is placed on $PM_i$, then $P_{ij} = 1$; otherwise, $P_{ij} = 0$. Equation (9) states that the requirements of all the virtual machines placed on a physical machine cannot exceed the resource capacity of that physical machine.
Once the virtual machine types are determined, the resource utilization of all the physical machines needs to be maximized. The CPU utilization of each physical machine $PM_i$ is computed, and the objective is to maximize the CPU utilization of all the physical machines. Figure 10 shows the average CPU utilization of all the active PMs. The average CPU utilization of GA with the random forest is higher by 6%, 10%, and 11% compared to GA, ACO, and PSO, respectively.
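The placement indicator and the capacity constraint can be made concrete with a small sketch. The utilization formula used here (demand placed on a host divided by that host's capacity) is an assumption, since the paper's own expression is not reproduced in this text, and the MIPS figures are illustrative.

```python
def host_demand(row, vm_demand):
    """Total demand placed on one PM; row[j] is 1 when VM j sits on this PM."""
    return sum(p * d for p, d in zip(row, vm_demand))

def capacity_respected(placement, vm_demand, pm_capacity):
    """Check that no PM hosts more demand than its capacity."""
    return all(host_demand(row, vm_demand) <= cap
               for row, cap in zip(placement, pm_capacity))

def cpu_utilization(placement, vm_demand, pm_capacity):
    """Assumed utilization per PM: placed demand divided by capacity."""
    return [host_demand(row, vm_demand) / cap
            for row, cap in zip(placement, pm_capacity)]

if __name__ == "__main__":
    # placement[i][j] == 1 when VM j is placed on PM i (hypothetical values, in MIPS)
    placement = [[1, 1, 0],
                 [0, 0, 1]]
    vm_demand = [500, 1000, 2000]
    pm_capacity = [2000, 2500]
    print(capacity_respected(placement, vm_demand, pm_capacity))  # True
    print(cpu_utilization(placement, vm_demand, pm_capacity))     # [0.75, 0.8]
```

A placement that violates the check is infeasible regardless of how well it scores on energy, so the constraint is typically evaluated before any utilization is compared.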

Average Start Time and Finish Time.
Delivering high performance to the cloud user is becoming an important criterion for the cloud provider. In this regard, parameters such as the start time and finish time of the user request/task can be considered major factors. Figures 11 and 12 show the average start time and finish time of different algorithms. The results show that GA-RF can finish the user's requests/tasks in less time than the other existing algorithms.
In cloud computing, time complexity plays an important role in studying the performance of an algorithm. Various studies present a comparative analysis of the complexity of GA, ACO, and PSO [23,24]. These studies show that GA finds a better solution than ACO and PSO, the best global optimal solution, but at the cost of a high search time. To reduce the search time of GA, a random forest model is trained on the optimal solution and then used for prediction, which gives an optimal solution in constant time. The cost overhead is incurred only once, namely for generating the optimal solution using GA and training the model.

Conclusion
With the development of virtualization technology, designing a multiobjective virtual machine placement technique has become a hot research topic. Our work is two-fold. Firstly, a hybrid approach for VMP using a genetic algorithm and the random forest algorithm is proposed to reduce the search time needed to find the best optimal solution. The genetic algorithm is used to find an optimal solution in the form of a dataset that contains the mapping of the virtual machines to the physical machines. This dataset is used to train the random forest algorithm to place the virtual machines on apt physical machines, and the placement accuracy of the RF is evaluated with the test data. Secondly, the trained model is used for load balancing by migrating virtual machines from overloaded physical machines to underloaded physical machines. This is accomplished using the IQR technique to detect overloaded physical machines, and the VMs to be migrated are selected using the maximum correlation policy. The results show that the proposed model provides an energy-efficient placement scheme by reducing power consumption, execution time, average start time, and finish time compared to the existing approaches. In the future, the model can be tested with various machine learning and deep learning approaches for better solutions and performance study.

Data Availability
The dataset for simulation is supported by parallel workload.com for real-time analysis.

Authors' Contributions
Madhusudhan H. S., Satish Kumar T., and Punit Gupta developed the theory and proposed model. Dr. S.M.F. D. Syed Mustapha and Rajan Prasad Tripathi worked on data preparation, ML model for the dataset, and verified the analytical methods. All authors discussed the results and contributed to the final manuscript. All authors confirm sole responsibility for the following: study conception and design, data collection, analysis and interpretation of results, and manuscript preparation. All coauthors have seen and agreed with the contents of the manuscript.