FPGA-Aware Scheduling Strategies at Hypervisor Level in Cloud Environments

Current open issues regarding cloud computing include the support for nontrivial Quality of Service-related Service Level Objectives (SLOs) and reducing the energy footprint of data centers. One strategy that can contribute to both is the integration of accelerators as specialized resources within the cloud system. In particular, Field Programmable Gate Arrays (FPGAs) exhibit an excellent performance/energy consumption ratio that can be harnessed to achieve these goals. In this paper, a multilevel cloud scheduling framework is described, and several FPGA-aware node level scheduling strategies (applied at the hypervisor level) are explored and analyzed.These strategies are based on the use of a multiobjective metric aimed at providing Quality of Service (QoS) support. Results show how the proposed FPGA-aware scheduling policies increment the number of users requests serviced with their SLOs fulfilled while energy consumption is minimized. In particular, evaluation results of a use case based on a multimedia application show that the proposal can save more than 20% of the total energy compared with other baseline algorithms while a higher percentage of Service Level Agreement (SLA) is fulfilled.


Introduction
Nowadays, the processing computational resources have shown an important change to achieve a balance between performance and power consumption.Devices such as GPUs and FPGAs among others are integrated as specialized processing elements into cloud environments to extend the capacity of the cloud.
FPGAs are commercial off-the-shelf reconfigurable silicon devices that achieve hardware-like performance with software-like flexibility.These are facts of great interest in the context of cloud computing.Cloud mainly leverages virtualization technology (i.e., virtual machines) to manage resources of a data center.Thus, providers can share the physical resources between clients.
In this context, resource scheduling is a critical issue that can contribute to increasing the benefits of cloud platforms.On one hand, depending on how many resources are allocated to different users' requests for service, more or fewer requests can be serviced while fulfilling their SLA.This fact impacts the benefits obtained by the cloud provider.On the other hand, additional advantages can exist, such as achieving a positive effect on the data center energy consumption.The combination of both implies getting a better Return on Investment (ROI) from the infrastructure which is crucial when serving Software as a Service (SaaS) requests with QoS requirements.Typically, a SaaS user issues a request for a specific service, with particular QoS-related constraints (i.e., a deadline).The user does not need to be aware of which type and how many resources are required to get its response.The system must be able to allocate and schedule the necessary resources (both in quantity and type) to serve this request.
This paper addresses the scheduling of jobs within a cloud infrastructure which is composed of heterogeneous physical nodes (with and without FPGA accelerators) from a hierarchical point of view.In this structure the two levels of scheduling are (1) the Cluster Level Scheduler (CLS), to decide the type of node where the virtual machine (VM) that will serve the request will be deployed, (2) the Node Level Scheduler (NLS), to decide which one of the VMs allocated to the node will get the use of the accelerators (FPGAs in particular) available in a node.
The problem of high-level scheduling among physical nodes, referred to as CLS, has previously been addressed by the authors in [1,2].Now, this work focuses on the impact of different scheduling strategies within a physical node with FPGA resources, which are applied at the hypervisor level.In a nutshell, the main contributions of the paper are the following: (i) to extend the hypervisor functionality to support dynamic control of FPGA devices as cloud computational resources, (ii) to propose a novel FPGA-aware Node Level Scheduler (NLS) metric aimed at QoS provision while reducing energy consumption, (iii) to present a novel fine-grained FPGA-aware Node Level Scheduler, (iv) to evaluate the Node Level Scheduler proposals in a real testbed, comparing them to some simple scheduler techniques.
This paper is structured as follows.Sections 1 and 2 introduce and motivate the problem being tackled.Section 3 reviews literature related to the topic of the paper.The framework where the presented research has been carried out is outlined in Section 4. Next, Section 5 provides the details on how accelerators (and in particular, FPGAs) are included into the scheduling strategies, and the additional fine-grained level of scheduling is introduced in Section 6.The evaluation of results are presented in Section 7. Finally, Section 8 provides some concluding remarks.

Background and Motivation
During the last few years, cloud computing has consolidated as a paradigm that enables a flexible and on-demand use of IT resources at different levels (infrastructure, platform, or software) as a service.Typical cloud providers support their service by means of massive data centers (usually spread around the globe), while cloud users get access to the resources with a pay-per-use model.This is referred to as a public cloud model.The cloud computing paradigm can also be applied within an organization, leading to the private cloud deployment model.Resources are pooled and shared among the different organization user groups to meet their demands on IT resources.In either case (public or private), cloud computing platforms are composed of pools of computing, storage, and networking resources that are made available as a service to their users.Service Level Agreements (SLAs) are established between providers and users, to specify the conditions of the service, both from technical (availability, Quality of Service, . ..) and formal (cost of the service, penalties in case of contract breach, . ..) points of view.
Many challenges exist in this model that turn out to be the focus of many research projects, such as efficient resource management [3], security and privacy concerns [4], or standardization [5], just to name a few of them.In particular, some kind of orchestration is needed in order to allocate the adequate resources to every user request [6].
As it has been pointed out before, cloud providers face the following dilemma: admitting more users into the system would lead to more incomes, but if the available resources are not enough for fulfilling the established SLAs, penalizations will cut benefits down.Thus, care must be taken when admitting more users into the system.
From another point of view, data center's energy consumption is nowadays a big concern.Energy costs are a major contributor to a data center's Total Cost of Ownership (TCO) [7].Thus, strategies to improve data center's efficiency are the topic of many research efforts, such as adjusting the voltage supplied to servers according to their workload [8] or even creating specific hardware designs [9].One strategy that can provide benefits regarding energy consumption as well as performance in data centers is the integration of accelerators as specialized resources within the cloud infrastructure [10], such as General Purpose Graphics Processing Units (GPG-PUs) or Field Programmable Gate Arrays (FPGAs).
FPGAs allow cloud computing data centers to scale up in a more efficient way than using just conventional processors [11].It is interesting to note that the initiatives towards building exascale systems with low-energy consumption driven by the European Union (EU) are based on the integration of FPGAs with ARM processors [12].
However, using heterogeneous resources in cloud environments is still an open challenge, as illustrated by the work presented by Microsoft in [13].Frequently, cloud applications which are running on VMs with a fixed amount of hardware resources including accelerators do not use the same number of them all the time due to their features.Thus, a proper scheduling strategy might help to optimize the utilization of these resources.
Moreover, different applications could benefit from the FPGA over time, depending on their degree of compliance with QoS requirements.To achieve this objective, the FPGA should be properly shared among competing service requests.

Related Work
Incorporation of FPGAs in the cloud context is a relatively new area of research.So, the Catapult project [13] led by Microsoft represents the first detailed investigation of applying FPGAs within an enterprise-level data center application.In this case, FPGA-accelerated nodes are used for the Bing web search engine.Results show great improvements in both the latency and the throughput of the service.
Nevertheless, FPGA-aware scheduling in a cloud environment is still at initial stages.Integrating FPGAs as first class computational resources to provide cloud service demands novel high-level programming models to simplify the development of software applications, maximizing communication to FPGAs, and the development of efficient FPGA-aware scheduling algorithms.
One crucial step to efficiently run applications across heterogeneous hardware is to provide optimized versions of computational kernels (as BLAS routines or FFT) [14].OpenCL [15], VIVADO [16], and Lime [17] are high-level programming models focused on reducing the time-tomarket in the designing process.
In addition, providing FPGAs as cloud computing resources demands virtualization support in hardware with near-native performance.Open source interface frameworks have recently emerged (RIFFA [18], DyRACT [19]) that enable FPGA designs to be accessed through an abstracted software API on the host with communication throughput between the host and FPGA close to the limits of modern PCIe interfaces.Moreover, efficient resource managements and scheduling algorithms are open key challenges to providing FPGA cloud services.The scheduling of jobs faces a multiobjective optimization problem in the process of assigning jobs to a pool of limited resources.Resource management and scheduling have been a hot research topic in a cloud context, and many approaches have been proposed to place the VMs to physical CPU hosts.A survey of the most recent proposals about metaheuristic scheduling solutions for cloud can be found in [20].
In this context, cloud-centric integration of FPGAs frameworks is developed in [11,21] by extending the open source cloud management system OpenStack [22].The systems support the on-demand deployment of a user designed custom hardware while maintaining the cloud computing benefits of scalability and flexibility.The bitstream of the application is considered as a special VM image.So, FPGAs are eligible as computational resources by the clients.
Another recent related work presents a framework that integrates FPGAs in a standard server with virtualized resource management and communication [23].The resource management selects the smallest reconfigurable FPGA area able to attend the request.If no one is found, the user request is rejected, and the request can be processed in software.As an application case study, they built a MapReduce accelerator for word counting and preliminary results are evaluated about integrating FPGA devices in the cloud using partial reconfiguration.
At some points, the systems mentioned above can be complementary to the work presented in this paper because all of them focusing on supporting FPGAs as computational cloud devices.But in contrast, in our case FPGAs are not visible to the clients.The idea is to provide a QoS Software as a Service by making efficient use of FPGA-aware scheduling algorithms.Our work focuses on developing a cloud management framework able to scale up and down FPGA devices in a dynamical way.

Hierarchical Scheduler Architecture Overview
In previous works [1,2] the problem of integrating FPGAs into a cloud was faced by "HECCO" (Heterogeneous Cloud Computing) architecture.In those approaches, every node of the cloud system is composed of one or more CPUs with a certain number of cores each and zero or more accelerators.
Frequently, the number of accelerators available within a node is lower than the number of CPU cores available.Therefore, they must be shared between clients.HECCO enables the provision of IaaS and SaaS services with Quality of Service (QoS) requirements.
In this work, the proposal architecture leveraged the use of accelerators to (a) finish job tasks within a particular time frame in order to fulfill their QoS requirements, expressed as Service Level Objectives (SLO) within a Service Level Agreement (SLA), and (b) reduce the energy footprint of cloud data centers.
The whole picture of the framework is depicted in Figure 1.It is organized into several layers: (i) The Management Layer is responsible for receiving client requests and selecting the most suitable node to allocate their applications.Each request is defined by a template in which parameters such as the type of service and the utilization time are defined.This layer is composed of the Heterogeneous Cloud Computing (HECCO), which includes the Cluster Level Scheduler (CLS).
As outlined above, it provides the "intelligence" to fulfill the QoS requirements expressed by clients within their SLAs.In particular, the CLS implements a series of mechanisms that are mainly aimed at allocating every request to the most appropriate resource (i.e., a node with or without accelerators).Its decision is based on the client's SLA and the availability of the resources.
More details on the management layer can be found in [2].
(ii) The Deployment Layer, which receives instructions from the management layer, is responsible for performing actions such as the creation and monitoring of the whole virtual environment.It has a complete view of the cloud and is composed of the Virtual Infrastructure Manager (VIM).The VIM is a centralized data center manager that allows to build and manage the cloud environment.
(iii) The Hypervisor Layer provides the control of the physical resources.This layer has a local view of the system and is composed of a hypervisor, which is responsible for the supervision of the VMs and resources for each node.This layer also contains a Node Level Scheduler (NLS) that is responsible for sharing the local resources within a node.In this case, the scheduling's decision is focused on the management of the use of the accelerators within each node, in order to get the most out of them.
More details about this element are provided in Section 5.
Figure 2 depicts the two different levels of scheduling.First, the CLS receives the tasks and selects the most suitable node to allocate each of them, following the strategies proposed in [2].Second, the NLS manages the tasks allocated to a specific node.
More precisely, it selects the task which is going to receive the accelerator and for how long, by certain metrics.
In other words, given the fact that there may be more virtual machines allocated to a node than accelerators available, which virtual machine will deserve the use of an accelerator?In the rest of this work, some insight will be given on this issue.

Node Level Scheduling
In this section we will focus on the dynamic scheduling strategy for allocated jobs within a node and timing-related indicators, such as a deadline.This deadline is related to when the job must have finished in order to deem its SLA as fulfilled.
The main objective of the node level scheduling strategy is twofold: first, to meet application SLOs and second, to consume as less energy as possible.To achieve this multiobjective, the scheduling algorithm is aware of the properties of the physical nodes and due to the good performance and lowcost energy offered by the FPGAs, the scheduling maximizes their utilization.
Basically, the algorithm uses a heuristic process for assigning the physical resources inside the node (FPGAs and CPUs) to the virtual machines that will be deployed at each point of time.It is worth to mention that the use of this simple algorithm avoids the overhead of the system.Recall that the NLS is implemented within the hypervisor and should be able to run with minimal computational resources.In other words, a matching is done between computational resources and tasks (deployed as VMs) minimizing the computational impact.On one side, the scheduler algorithm has as input a list of tasks (1, 2, . ..), assigned by the CLS algorithm, with all the nonattended job requests that have been received in the node (see Figure 2).On the other side, each time a computational virtual resource is released in the node, the scheduler is notified and a matching process is done.Then, a resource driven algorithm selects the best candidates to match resources according to some efficient metric.This scheduling is called a coarse-grained scheduling because FPGA resources are busy (not available for allocation) until the end of an assigned task.
In addition, in this paper, a multiobjective metric is proposed and computed as the fraction of the computational work of a task and the deadline or time available to finish it up in order to fulfill the SLA.
Therefore, the multiobjective metric measures the computational workload demanded on the node over the time; that is, it provides a measure of the computational stress.And it is a fact that the higher the computational requirements, the more the energy consumption.That is why the task with most demanding requirements is matched to an FPGA resource.In other words, the algorithm matches the most demanding job according to this criterion to the VM with FPGA accelerators.
The computational work parameter involved in the metric reflects the workload that must process the VM.Then, we need to apply some criteria to calculate this value.For example, in our experiments with multimedia applications this parameter is computed as the number of video frames to be processed.Other examples are data-encryption services, where this parameter could be the size of the data that will be encrypted, or social network analysis services, where it could be the number of tweets stored in a log file.Techniques such as application profiling and using data from previous executions are frequently used to estimate this value.
The other parameter used in the metric is the deadline that is established by the client in the SLA.
To sum up, the node level scheduling process works as follows (see Algorithm 1): First, the selected metric is computed for each task.Next, the scheduler sorts all tasks based on that metric and the FPGA is assigned to the most demanding task.This process is repeated every time the FPGA becomes available.To achieve this goal, the status of the FPGA is periodically monitored.This synchronization process is an important issue in the overall scheduling process.Thus, some implementation details will be explained in the next subsection.The algorithm is aware of the heterogeneity of the node, because the most demanding task is executed in a VM deployed with FPGA resources.

Implementation Details.
FPGAs are physically plugged into computing nodes via the PCI express bus.The communication between VMs and the FPGAs is made through the Intel Virtualization Technology for Directed I/O (VT-d) [24], which is an extension of the Intel Virtualization Technology.It allows a direct exchange of data between I/O devices and VMs.
As pointed out above, a strategy able to control the communication between the Node Level Scheduler (NLS) and the VMs currently running in the node is required.So, an FPGA synchronization service has been developed which consists of an exchange of status messages between the NLS and every VM deployed in the node.The full node information is wrapped into a "JSON" [25] data structure because it is easy to parse.The NLS is implemented as a daemon.It is constantly listening to the status information from the VMs currently deployed at the node.Each VM also has a daemon, which is periodically monitoring the status of the FPGA (whether an FPGA is attached to the VM or not).Besides, the VMs periodically send their identification and the progress of the running applications through an UDP socket connection to the NLS.If a VM is using an FPGA, its status is also forwarded to the scheduler by means of updating a "status file." Figure 3 depicts the process of reassigning the FPGA to different VMs.In this example, the FPGA is initially attached to VM1.Then, when computation on the FPGA ends, VM1 updates the status of the FPGA as "free."As a result, the NLS learns this state and gets aware of the fact that the reassignment process can be made safely.At this point the NLS can detach the FPGA from the current VM and attach it to another VM (i.e., the one in the head of the list of pending requests) through the hypervisor.The VM which finds an FPGA attached to it (VM2, in this case) immediately uses it and updates its status accordingly.We use KVM [26] as hypervisor.It enables the management of the FPGA because it supports hot-plug devices with VT-d.So, it is a must that the underlying hardware supports VT-d.The NLS uses the "device add" [27] command to attach the FPGA to a particular VM.Also, when the FPGA has fulfilled a request, the NLS uses "device del" [27] to release the FPGA.
The NLS is not only focused on the FPGA assignment, it also considers the CPUs.In this case, the NLS matches VMs to physical CPUs using First-Fit criteria.Moreover, the scheduler uses the "taskset" [28] linux command to set the CPU affinity of a running process given its identification (PID).So, each VM has a PID and the CPU affinity "bonds" this process to the first available CPU.This is made in order to avoid swapping when a VM is assigned to different cores along time.Finally, due to the complexity of coding FPGAs, the RIFFA framework offered by Jacobsen and Kastner in [18] has been used to develop the hardware acceleration design.
In Section 7.3 the evaluation of this strategy is shown.

Fine-Grained Strategy
In the previous section, the Node Level Scheduler decides to assign the FPGA to a VM and the FPGA is not released until the end of the task running in that VM.But it might be the case that the task does not need to be accelerated during its whole runtime in order to fulfill its deadline.
Moreover, it might occur that other tasks (which initially was not placed at the head of the list due to the value of its metric) would depend on the use of the FPGA to fulfill its deadline.Thus, reconsidering the allocation of the FPGA with a finer granularity could improve the number of fulfilled SLAs and, consequently, improve the system ROI.Under this context, a strategy which consists of dividing each task into smaller subtasks (called "chunks") has been explored.A trade-off between the execution time of a chunk and the frequency of checkpoints must be taken into account to decide the size of a chunk.The rationale is to reschedule the FPGA more often, so that more tasks can potentially benefit from the acceleration and power efficiency of the FPGA.More precisely, all the chunks have the same requirements of computational workload, but different deadline restrictions.Figure 4 shows how a task with size  and deadline  is divided into three chunks, with sizes  1 ,  2 ,  3 and deadline .Chunk size is set according to the characteristics of the application that implements the task.This can be a certain amount of data to be processed or any other application relevant parameter.As an example, for the video processing service, the chunk size is set as a certain number of video frames to process.
Thus, Node Level Scheduler decisions on the use of the FPGA are taken every time and a VM finishes the processing of a chunk (see check point in Figure 4).In the current implementation, this is done every minute.The Node Level Scheduler selects the best candidate VM and attaches the FPGA to it as is depicted in Figure 5.The process is similar as in the previous case but now communication and scheduling are done at the chunk level.It is worth to notice that the application is completely independent and parallelizable.What it means is that the application receives a group of data as input and when the task has been finished the system sends the result as output.For acceleration process a wrapper function is used to map the application that will use CPU and FPGA at the same time because it is different than the one who is only running over a CPU.All the VMs of a node can take advantages of the FPGA's features and the whole node can keep the performance while the energy consumption is reduced.In Section 7.3, some experiments are made showing the necessity to adopt a balanced criteria for all requests due to the fact that FIFO leads to lack of efficiency.

Evaluation
To evaluate the impact of the scheduling algorithms proposed in this paper some experiments have been carried out on a Heterogeneous Cloud Computing testbed environment, with a real application case study.Details on the platform, application, and workload are given next, prior to explaining some evaluation results.
All the experiments have been run on a real testbed, depicted in Figure 6.
Hardware is composed of an Intel Core i5 with 6 Gb of DRAM and an FPGA Virtex 6 by Xilinx (a ML605 Evaluation Kit, based on the powerful XC6VLX-240T-1FFG1156 [29]).
Additionally, as a way to measure the energy the testbed includes a power consumption monitor node.More precisely, the WattsUp PRO [30] power meter has been used.This device aims to provide an independent managed and accessed power data collecting mechanism.This device must be positioned between the computing node power supply and the main power plug, as shown in Figure 6.The output data can be recorded into a file according to a predefined interval or when particular events occur (i.e., when the power consumption exceeds a threshold previously configured).In our experiments, the monitoring frequency has been setup to one sample per second.

Application.
As real use case application an Image Convolution Software (ICS) is used as a service provided by cloud.This application consists on convolving an image with a filter or kernel (integer value) in both directions horizontal and vertical.This technique can be applied also to process sequences of video.Different type of filters can be used depending on the target.For our experiments we use a Sobel filter to process a sequence of video.This video is an AVI file with resolution 720 × 384.The application has two versions, the first one runs over a CPU and the another one over a combination between CPU and FPGA.

Workload.
In order to generate a realistic cloud workload we have run the real ICS application and monitored its behavior.Thus, for every test, a bag of tasks (20) was created.Each of these requests are defined by two parameters: the deadline of the task and the number of videos to process (which relates to the amount of data to be processed by the task).There are two types of service requests, depending on how demanding their deadline.The first type of requests is based on more relaxed deadline requirements (referred to as soft requirements, SR), while the second type of requests exhibits a more demanding deadline (referred to as high requirements, HR).
The HR requests are more challenging than SR because they require to fulfill stricter deadlines and specific resources such as FPGAs to ensure the QoS-related SLOs.
To set up the deadline of each service request, we have previously characterized the application.Basically, the Sobel filter application has been run on the cloud using different computational resources for different video sequences.In this step, the number of frames, the power consumption, and the time have been stored.Thus, we have got the profile of each video stream.Then, the deadline is assigned to each input request as a random value between the execution times obtained in the profile with a margin of tolerance.For instance, for a video stream composed of three chunks the deadline could be a random value between 450 and 480 seconds while for nine chunks, the values are between 1400 and 1430.
To emulate the behavior of cloud clients, the input requests rate follows a Poisson distribution  = 1/ interval .The  interval is the arrival time of the next requests and it was tuned for this concrete scenario to avoid an early saturation.
Moreover, three types of bags of tasks have been created, depending on the ratio between SR and HR requests.Note that the same input request rate has been used for the different types of bags of task, only the deadline of the requests has been modified, in order to get a fair comparison of the scheduling algorithms under evaluation.In particular, workloads include 10%, 25%, and 50% of HR requests.

Evaluation for Coarse-Grained Scheduling.
To carry out a performance evaluation of the novel multiobjective metric proposed in Section 5 hereinafter called Highest-Job-First (HJF), QoS compliance and energy consumption are taken into account.Moreover, for the sake of comparison both a random (RND) (with none consideration for QoS requirements) and an Earliest-Deadline-First (EDF) scheduling algorithm (the VM with closest deadline gets the FPGA use, independently of the amount of data to be processed) have also been implemented.
In order to understand the behavior of the whole system we have selected several groups of metrics.
The first one describes the behavior of the whole system.It is composed of the total energy used for the system to process a bag of tasks (energy-to-solution [31]).The second group is relative to SLA compliance, the percentage of SLA fulfilled successfully, and the average energy invested per fulfilled SLA.
And the last one is focused on application performance, namely, the average number of frames successfully processed per unit of time (fps).
Table 1 shows energy consumed by the system.Results show that the HJF and EDF metrics reduce the energy-tosolution for all the type of bags of tasks with respect to the RND baseline algorithm.
Hence, from these results we can point out that the FPGAaware Node Level Scheduler should use a QoS-aware metric.
Nevertheless, before getting a clue time an energy-tosolution should be analyzed together with the number of SLAs fulfilled.Table 2 shows the percentage of fulfilled SLAs for the different scheduling techniques.Results show that HJF gets more fulfilled SLA for all the input workloads.
Figure 7 shows the cost per SLA measured as the energy consumed.In all the cases, the cost per SLA increases with the percentage of high requirements.These result from the fact that some unlucky scheduling decisions can affect in a quite negative way the future tasks.The allocation of the FPGA to the most demanding task will change the state of the FPGA to nonfree for a period of time.Then, if the physical node receives a quite demanding task over this period of time, as no preemption of task is possible, the new request only can run in CPU resources.This can lead to a nonfulfilled SLA.Finally, regarding the performance as is shown in Figure 8   section we will evaluate our solution (fine-grained scheduling see Section 6) to address this problem.

Evaluation for Fine-Grained Scheduling.
In this section the fine-grained scheduling strategy proposed in Section 6 will be evaluated.Evaluation conditions are the same as detailed in Section 7. Recall that now the task is subdivided into chunks and the FPGA assignation mechanism is jobaware.
Table 3 shows that the energy tends to rise with the percentage of high requirements requests (these metrics present the same behavior as in previous evaluation).However, the energy per SLA fulfilled decreases 35% for the HJF scheduling strategy in comparison with the other strategies (RND, EDF) (see Figure 9).Results also show that the energy relative to the number of fulfilled SLAs tends to decrease which means a save of effective energy.Thus, the combination of both an FPGA as an accelerator and an efficient scheduling strategy allows the system to save energy.The reason is the system's selection of the most suitable resources for each request and the speedup of the FPGA with lower energy impact.On the other hand, the percentage of fulfilled SLAs (see Table 4) shows a slightly upward trend (20%) for the HJF scheduling with 50% of high requirements.What it means is that the system is able to keep an acceptable ratio of SLA fulfilled.
Figure 10 shows the performance of the system in terms of the number of processed frames per second.The use of  the proposed HJF strategy clearly outperforms the RND and obtains similar results to the EDF.However, as shown before, this is achieved with a lower energy consumption for the HJF metric.
7.5.Fine-Grained Scheduling Optimization.The last strategy presented above raises the following concern: when we check with a certain frequency which VM needs most the FPGA in order to fulfill its requirements, why not to check also if the deadline is past?In case the deadline of the task had arrived, the VM would be killed.In this way, no task keeps running after its deadline, savings on energy.
When a SLA is violated, apart from penalties, some amount of energy is wasted because the system uses the resources to run a task that will not successfully end before its deadline.The system behavior shows that keeping VMs alive has a direct effect on the total energy and number of SLA fulfilled.Now, an optimized scheduling, where VMs running a request whose SLA has been violated are killed, is analyzed.
In this scenario, even though the scheduling and communication issues are the same as before, energy consumption for all the FPGA assignation strategies is reduced as shown in Table 5.On the other hand, if we consider the effective energy per fulfilled SLA, 40% of consumed energy is saved (see Figure 11).In addition, the percentage of successfully fulfilled SLAs is 15% better for HJF (see Table 6).Finally, Figure 12 shows an increase of 20% of frames per second for HJF in the most demanding scenario (50% of high requirements).
To sum up, releasing the resources involved in the execution of a task when its deadline is due improves the percentage of fulfilled SLAs and increases the performance of the cloud system while reducing the energy waste.A comparison summary of the node level scheduling impact is presented in Table 7.
Figure 13 shows the number of frames unsuccessfully processed due to the SLA violation occurring.However, the number of frames unsuccessfully processed decreases significantly for the optimized fine-grained strategy because it allows taking advantage of killing the VMs when their deadline is achieved.for all the cases, the proposed HJF policy obtains the best performance, in terms of SLA fulfillment and energy consumption.Moreover, the configuration that yields the best results is combining HJF policy with the optimized finegrained assignation strategy.Recall that for "fulfilled SLAs" and "Frames processed per second," the greater is better, while for "Energy/fulfilled SLA" the lower is better.

3 Figure 4 :
Figure 4: Division of a task to chunks.
the amount of frames per second keeps an acceptable level even when more demanding requests are processed.So, in the
Task  ,  = 1, 2, . . ., : Task   to be executed in the local node List Task = Set of tasks metric(Task  ): function which computes the metric for task Task  include(Task  , List Task ): function which includes Task  in List Task rank(List Task ): function which sorts List Task according to the selected metric top(List Task ): function which extracts the first VM from List Task Require:

Table 2 :
Percentage of fulfilled SLAs for different scheduling criteria.

Table 6 :
Percentage of SLAs fulfilled for different FPGA assignation criteria (optimized fine-grained).
Table 7 summarizes the different results when applying the different FPGA assignation strategies and scheduling techniques.It can be seen that