Low Power Scheduling Approach for Heterogeneous System Based on Heuristic and Greedy Method

Big data, cloud computing, and artificial intelligence technologies supported by heterogeneous systems are constantly changing our lives and our understanding of the world. At the same time, the energy consumption of these systems affects operating cost and system reliability, which has attracted the attention of architecture designers and researchers. To address the energy problem in heterogeneous system environments, and inspired by the results of 0-1 programming, a heuristic and greedy energy saving (HGES) scheduling approach is proposed to allocate tasks reasonably and thereby save energy. First, all tasks are assigned to each GPU in the system, and the tasks are then divided into high-value tasks and low-value tasks according to the calculated average and variance of the execution times of all tasks. Using the greedy method, the high-value tasks are assigned first, followed by the low-value tasks. To verify the effectiveness and rationality of HGES, tasks with different inputs, different task numbers, and different comparison methods are designed and tested. The experimental results on different platforms show that HGES saves more energy than existing methods and obtains results faster than 0-1 programming.


Introduction
As an important driving force of social development and world economic growth in the 21st century, the ICT (information and communication technology) industry consumes 10% of global power consumption [1], and its carbon emissions account for 2%-2.5% of total global carbon emissions, reaching 10% in some developed countries [2]. The Intergovernmental Panel on Climate Change of the United Nations has released a report pointing out that if global warming is to be limited to 1.5°C above pre-industrial levels, unprecedented changes are needed: efforts should be made to completely stop using fossil fuels by 2050 and to achieve zero carbon emissions [3]. In order to promote the sustainable development of the ICT industry, green computing [4][5][6][7][8][9][10][11][12] has become a consensus among researchers at home and abroad.

At present, the ICT industry, represented by big data and artificial intelligence technologies, is constantly changing how we live, travel, learn, and understand the world, which has made the heterogeneous computing system (HCS) based on the GPU (graphics processing unit), which supports these technologies, the mainstream of computer systems. The characteristics of GPU heterogeneous systems, such as high acceleration, ease of learning, and ease of expansion, have driven their rapid development; they are now widely used in big data processing, deep learning, cloud computing, artificial intelligence, autonomous driving, molecular simulation, and other fields, and this huge application market has in turn greatly promoted the development of GPU heterogeneous systems. In a typical GPU heterogeneous system, the CPU allocates computing tasks to GPUs for execution; in a system composed of multiple GPUs, how computing tasks are allocated to each GPU greatly affects the power consumption of the whole system. This paper studies low power task scheduling for GPU heterogeneous systems.

Although the performance and power efficiency of GPU heterogeneous systems are greatly improved compared with traditional computer systems, the GPUs' power consumption remains high within the whole computer system. To keep pace with the development of the ICT industry, the power optimization of GPU heterogeneous systems should be studied in depth. To reduce the power consumption of HCS, scholars have put forward various methods and models, but the current research still has problems; for example, the target task code must be rewritten manually [13,19], the energy consumption of heterogeneous systems is affected by the order of task execution [32,34], the power of the GPU is assumed to be constant while tasks run [30], and tasks must be run in advance to obtain parameters [12,26] before they can be scheduled. To alleviate these problems and make HCS adapt more widely to the diversity of tasks, this paper focuses on the energy saving of heterogeneous systems with multiple identical GPUs. Inspired by 0-1 programming, the HGES (heuristic and greedy energy saving) scheduling model is proposed using heuristic and greedy methods. The model first obtains the power consumption and execution time of the tasks on each GPU, and the energy optimization problem is then transformed into a scheduling problem.
HGES consists of the following steps: (1) Power measurement and task execution time acquisition: a HIOKI 3334 power meter is used to obtain the energy consumption of each task by measuring current and voltage (a minimal computation sketch follows this list). This study does not change the running time of tasks, so the running time of a task is its actual running time in the experimental environment.
(2) Task scheduling: in an environment with multiple identical GPUs, the average power consumed by processing a certain number of tasks in an arbitrary order is the same; the difference lies in the overall execution time. Therefore, the energy saving problem can be transformed into a time minimization problem. In this paper, the HGES method is designed to solve this problem.
(3) Verification: in order to verify the effectiveness of HGES, its performance is first analyzed, and then its effectiveness, rationality, and feasibility are verified by experiments.
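As a minimal illustration of how per-task energy follows from the measured current and voltage, the sketch below integrates sampled power over a task's run time; the sample format, the fixed sampling interval, and the function name are illustrative assumptions, not the HIOKI 3334 interface.

def task_energy(samples, dt):
    # samples: list of (voltage_V, current_A) pairs taken every dt seconds
    # energy (J) = sum of P * dt, with instantaneous power P = V * I
    return sum(v * a * dt for v, a in samples)

# Example: three samples taken 0.5 s apart during a task's execution
print(task_energy([(12.0, 5.1), (12.0, 5.3), (12.0, 5.0)], 0.5))  # -> 92.4 J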
The contributions of this paper are as follows: (1) This paper analyzes the essence of the energy saving problem in heterogeneous systems with multiple identical GPUs and transforms it into a scheduling problem.
(2) Based on 0-1 programming, the HGES scheduling method is proposed using heuristic and greedy methods. It calculates the average and variance of the execution times of all tasks to be executed and then divides the tasks on each GPU into high-value tasks and low-value tasks according to these values; after sorting the high-value tasks, the greedy method assigns the high-value tasks first and then the low-value tasks. (3) The experimental results show that the HGES method saves more energy than existing methods on different platforms. Compared with the 0-1 programming method under its best solution, HGES obtains results faster.

The rest of the paper is structured as follows. Section 2 reviews related works; heuristics from 0-1 programming are introduced in Section 3; Section 4 presents the HGES method; the proposed method is verified and compared in Section 5; Section 6 concludes the paper.

Related Works
Research on energy saving in task scheduling can be divided into two categories: energy saving scheduling technologies based on task characteristics and energy saving technologies for task scheduling. They are described in detail as follows.
The first category is energy saving scheduling technology based on task characteristics. [9,10] point out that the storage requirements of tasks, task migration, and improvements to the scheduling strategy all help to improve system performance. Based on this, Zhan et al. [11] study the energy optimization of hybrid scratchpad memory consisting of SRAM and nonvolatile memory, and propose a data allocation scheme for energy optimization composed of a program analysis stage and a data allocation stage. The GPU's support for concurrent kernel execution provides another avenue for energy saving. Li et al. [12] obtain the parameter R_i by running the CUDA profiler tool in advance to determine the kernel category and use the complementary characteristics of task categories to implement concurrent kernel execution for energy saving. Jiao et al. [13] propose a static power-performance estimation model that predicts the ratio of block numbers and guides GPU energy saving by establishing the relationship between the block number ratio and the energy consumption of concurrent kernels; however, this method requires converting task code. Inspired by [12], which saves energy using the complementary characteristics of task categories, Li et al. [14] use an energy saving regression prediction model and a scheduling method to achieve energy saving after classifying tasks. Li et al. [15] compare the energy consumption of concurrent and sequential kernels and choose the execution mode with less energy consumption; the energy is obtained with an energy estimation model and a performance estimation model. Wen et al. [16] propose a graph-based algorithm that schedules co-run kernels in pairs to optimize system performance. Workloads are represented by a graph in which vertices stand for distinct kernels and an edge between two vertices indicates that co-executing the corresponding kernels delivers better performance than running them one after another; edges are weighted with the performance gain from co-execution. Wen and O'Boyle [17] propose a runtime framework that uses a machine-learning prediction model to decide whether to merge OpenCL kernels or to schedule them to the most appropriate devices separately, so as to schedule multi-user OpenCL tasks to the most appropriate devices in heterogeneous systems.

The second category is energy saving technologies for task scheduling. Compared with a uniprocessor, multiprocessors have been shown to reduce the power problem, and task migration is an effective method for saving energy in multiprocessor architectures. Based on this, Rupanetti and Salamy [18] propose a three-part framework to reduce energy consisting of a task allocation technique, task migration, and a task scheduling scheme based on the earliest deadline first method. Liu and Luk [19] obtain task and processor resource parameters by running tasks in advance and then use linear programming to achieve energy saving scheduling of the LINPACK program on each processor, but this method requires manually rewriting the code for the target processor. Using the analysis method proposed in [21], Barik et al.
[20] obtain task characteristics and execution time parameters to adjust the load rate and thereby reduce processor energy consumption. Ma et al. [22] propose a two-layer energy management framework with a dynamic allocation layer and a frequency regulation layer, compare four dynamic allocation schemes, and analyze their advantages and disadvantages. Li et al. [23] point out the deficiency of research on the energy and thermal issues of real-time applications with precedence-constrained tasks on heterogeneous systems and propose an energy/thermal-aware task scheduling approach that assigns tasks in an energy/thermal-aware heuristic way and reduces the waiting time between parallel tasks. Bansal et al. [24] combine dynamic voltage scaling (DVS) and dynamic power management (DPM) to save energy while scheduling preference-oriented fixed-priority periodic real-time tasks, and propose preference-oriented energy-aware rate-monotonic scheduling and preference-oriented extended energy-aware rate-monotonic scheduling algorithms to maximize energy savings while fulfilling the preference values of tasks. Silberstein and Maruyama [25] consider the energy of tasks on each processor, construct a minimum energy consumption scheduling method for multiple interdependent tasks according to a directed acyclic graph, and verify the feasibility of the method when the processors have no overhead. Jang et al. [26] study the energy optimization of a single task in a multiprocessor environment and of multiple tasks under an adaptive power-aware allocation scheme, and propose an optimal task allocation algorithm for the single-task case and an optimal voltage/frequency adjustment scheme for the multi-task case; although the multi-task energy saving method is studied, more attention is paid to voltage/frequency adjustment. Because the dynamic random access memory (DRAM)-based main memory subsystem is a major contributor to the energy consumption of mobile devices, Zhong et al. [27] propose direct read (DR) swap, which uses the byte addressability of NVMs to guarantee zero memory copy for read-only requests when accessing a page in the swap area. The research in [28] shows that using a shared memory architecture in mobile devices can improve cooperation among processors, accelerate the calculation of PCA (principal component analysis), and effectively reduce the energy of mobile devices. Khalid et al. [29] propose the OSCHED scheduling method for the case of unbalanced processor computing power, which comprehensively considers the computing power of devices and the computing requirements of tasks to achieve load balancing among processors. Hamano et al. [30] propose an energy saving method for dynamic scheduling in which the task with the smallest energy delay product (EDP) is selected and assigned to the corresponding processor, but the method assumes that the power of the scheduled task is constant. Huang [31] points out that processing elements are idle when the required data have not been received, which leads to low utilization of processing elements. Choi et al. [34] propose estimated-execution-time (EET) scheduling to predict the remaining execution time of programs according to the remaining execution time of tasks and point out the deficiencies of the alternate assignment (AA), first free (FF), and performance history (PH) scheduling in [32,33].
That is, PH scheduling does not consider the remaining time of the application currently executed by each device, which leads to overutilization of a single device. Based on the methods proposed in [32][33][34], the scheduling of tasks is extended to multiple tasks in [35,36], and a 0-1 programming method is proposed to allocate tasks and solve the problem of excessive utilization of a single processor. However, the results obtained by this method are greatly affected by the parameter values.
In summary, although research on the energy saving of heterogeneous systems has made great progress, there are still deficiencies. In view of the existing problems and research gaps, this paper proposes the HGES method to alleviate them.

Heuristics from 0-1 Programming
In [36], the authors use 0-1 programming, formalizing the problem into formulas (1)-(6), to solve the low power scheduling problem in heterogeneous systems. They assume that the currently available processor resources in the system are GPU_i (0 ≤ i ≤ n), the CPU, and the motherboard; the energy consumption of the system E_system can then be expressed as the sum of the energy consumption of all GPUs (E_GPU), the CPU (E_CPU), and the motherboard (E_Motherboard), and further as the product of their respective powers (P_GPU, P_CPU, P_Motherboard) and time (T). For a group of tasks to be scheduled on the same number of GPUs, different scheduling algorithms generate different task sequences without changing the task structure, and the power of a single task is not changed; that is, the average power consumption of the task sequence to be scheduled remains unchanged. Therefore, E_system can be further expressed as the product of the average power consumption P̄ and the time T, i.e., E_system = P̄ · T. In order to minimize the energy of the system when executing the program sequence, the average power consumption P̄ and the time T must be as small as possible. For a given task set the average power consumption is fixed across scheduling methods; therefore, to minimize the energy consumption of the system, the execution time T must be minimized. When using 0-1 programming to solve this problem, we define the following symbols. Let m represent the number of GPUs in the system and n the number of programs to be processed. Let T_ij represent the time consumed by the jth program running on the ith GPU. Let x_ij represent assigning the ith processor to complete the jth program, so that x_ij = 1 if the ith GPU is assigned to execute the jth program, and x_ij = 0 otherwise.
The goal is to choose a suitable combination that minimizes the difference in execution time among the processors. We use the first processor as the baseline, so the objective function is the minimum difference between the execution time of every other processor and that of the first processor; this objective function minimizes the task execution time differences among processors. According to the requirements of the problem, each program is run by exactly one processor, which gives the processor constraint shown in (4).
When assigning tasks to processors, the time to complete all tasks should be as equal as possible across processors. We cannot reduce performance without limit in order to save energy, so a performance constraint is added to the formulation. Owing to the randomness of program execution times, allocating tasks to each GPU in exactly equal time shares is not suitable for most scenarios. To better fit real environments, we allow an unequal distribution of time over the GPUs: the total time of the tasks allocated to a single processor must not exceed the average time, calculated as the total time of all tasks divided by the number of processors, and the performance tuning parameter Q is used for this purpose. The range of Q is 0.01-1. Considering the above factors, we obtain the time constraint shown in (5).
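For reference, a compact rendering of the model described above is given below. It is our reading of formulas (1)-(6) from the surrounding prose; in particular, the use of absolute values in the objective and the exact placement of Q in the time constraint are assumptions, not a verbatim reproduction of the original equations.

E_system = E_GPU + E_CPU + E_Motherboard = \bar{P}\, T

\min \; \sum_{i=2}^{m} \Big| \sum_{j=1}^{n} T_{ij} x_{ij} - \sum_{j=1}^{n} T_{1j} x_{1j} \Big|

\text{s.t.} \quad \sum_{i=1}^{m} x_{ij} = 1, \quad j = 1, \ldots, n,

\qquad\; Q \sum_{j=1}^{n} T_{ij} x_{ij} \le \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij} x_{ij}, \quad i = 1, \ldots, m,

\qquad\; x_{ij} \in \{0, 1\}.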
In summary, the use of 0-1 programming to solve the above problem can be described by (6). To better understand it, let m be 4 and n be 11, let T_ij be the time consumed by the 11 programs on the 4 GPUs, and let x_ij indicate assigning the ith processor to complete the jth program. The constraints x_ij = 0 or 1 and \sum_{i=1}^{m} x_ij = 1 (j = 1, 2, ..., n) guarantee that each of the 11 tasks is assigned to exactly one GPU; the time constraint with parameter Q bounds the total time of the tasks allocated to each GPU by the average time over all GPUs; and the objective function ensures that the optimal solution is chosen from the feasible solutions. Figure 1 shows the impact of changes in the Q parameter on the result of 0-1 programming when 20 tasks are scheduled; the abscissa is the value of Q, and the ordinate represents energy consumption. It can be seen from the figure that as the value of Q gradually increases, the energy consumption gradually decreases. Therefore, a reasonable value of Q has a great influence on the energy consumption of the system. Since the solution obtained by 0-1 programming is strongly affected by the Q parameter, an unreasonable Q value often yields a low-quality solution. When solving the 0-1 program, the feasible solutions satisfying the constraints are computed first, and the optimal solution satisfying the objective function is then selected among them. Analyzing the solutions obtained, we find that time-consuming tasks are assigned to every processor regardless of the value of Q. Figure 2 shows the result of assigning tasks using 0-1 programming with a Q parameter of 0.4 (sub-figure (a)) and of 0.9 (sub-figure (b)); different Q parameters clearly affect the scheduling results. The reason for this phenomenon is that a feasible solution must satisfy the constraints, so the time-consuming tasks are distributed evenly over the processors, which lets the feasible solution minimize the objective function.
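For concreteness, the following is a minimal sketch of how such a 0-1 program could be set up with the open-source PuLP solver, assuming identical GPUs (so T_ij does not depend on i). The task times and the value of Q are invented, and the linearisation of the |load_i - load_1| objective with auxiliary variables d_i is our own modelling choice, not code from the paper.

import pulp

times = [4.0, 3.5, 3.0, 2.5, 2.0, 1.5, 1.2, 1.0, 0.8, 0.5]   # invented per-task times
m, n, Q = 3, len(times), 0.9
avg = sum(times) / m                                          # total task time / number of GPUs

prob = pulp.LpProblem("low_power_schedule", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (range(m), range(n)), cat="Binary")
d = pulp.LpVariable.dicts("d", range(m), lowBound=0)          # d_i >= |load_i - load_0|
load = [pulp.lpSum(times[j] * x[i][j] for j in range(n)) for i in range(m)]

prob += pulp.lpSum(d[i] for i in range(1, m))                 # objective: time differences vs GPU 0
for j in range(n):                                            # each task runs on exactly one GPU
    prob += pulp.lpSum(x[i][j] for i in range(m)) == 1
for i in range(m):
    prob += Q * load[i] <= avg                                # Q-scaled per-GPU time bound
    prob += d[i] >= load[i] - load[0]                         # linearise the absolute value
    prob += d[i] >= load[0] - load[i]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for i in range(m):
    print("GPU", i, [j for j in range(n) if pulp.value(x[i][j]) > 0.5])

Raising Q toward 1 tightens the per-GPU time bound, which reproduces the behaviour discussed around Figure 1: more balanced loads at large Q and infeasibility at Q = 1.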
Based on the inspiration from 0-1 programming, we should first seek the tasks that satisfy the constraints and then, among them, select the tasks that minimize the objective function. When searching for tasks that satisfy the constraints, the execution time allocated to each processor should be kept as equal as possible so as to minimize the objective function; therefore, the cumulative execution time of each processor must be considered when assigning tasks. Because tasks with long execution times have a greater impact on the constraints, they should be allocated first. When allocating a task to a suitable processor that satisfies the objective function, the greedy method comprehensively considers the total time of the tasks already allocated to each processor and the time and energy of the task to be allocated. After the tasks with longer execution times are assigned, the tasks with shorter execution times are allocated according to the same rules.

HGES Approach
Inspired by the above 0-1 programming, we call the proposed method the heuristic and greedy energy saving (HGES) approach. The main idea is as follows: first, all tasks are assigned to each processor; secondly, on each processor the tasks are divided into two parts according to their execution times. The tasks with long execution times are assigned first; when a task is assigned, it is deleted from the task lists of all other processors, and this process is repeated until all tasks with long execution times are allocated. Thirdly, the tasks with short execution times are assigned to the processors; when such a task is assigned, it is likewise deleted from the task lists of all other processors, and the process is repeated until all short tasks are assigned. The allocation rule is to select, among the K processors with the smallest cumulative execution time out of the m processors, the task with the smallest product of the processor's cumulative time and the energy consumption of the corresponding task to be allocated. For scheduling, the scheduling parameters of the P tasks must be obtained first; therefore, for each processor i we build a two-dimensional array GPU_Time_Energy_i[P][] to store the time and energy consumption of the tasks, namely GPU_Time_Energy_i[P][0] and GPU_Time_Energy_i[P][1]. The variable AccumPer_i stores the sum of the execution times of the tasks already assigned to processor i. AVE_i and SD_i are the average execution time and the corresponding standard deviation of the tasks on processor i, and equation (7) is used to calculate the criterion Crit_i that distinguishes long from short execution times. Crit_i is used to separate the tasks that affect the performance constraint: the tasks on processor i with long execution times are put into the Th_i array, and the remaining tasks are put into the Tm_i array. Each Th_i array is sorted in descending order, and the tasks are then scheduled by the greedy method. The steps of the greedy method are as follows: first, select the K processors with the smallest cumulative time; second, assign the task to the processor, among these K, with the smallest product of cumulative time and the energy consumption of the corresponding task to be allocated. After the Th_i list of each processor is scheduled, the same rules are used to schedule Tm_i.
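As a small illustration of the classification step, the sketch below computes AVE_i, SD_i, and a criterion Crit_i for one processor and splits its task list. Since equation (7) is not reproduced in this text, Crit_i = AVE_i + SD_i is an assumption made only for illustration, as is the helper name classify().

import statistics

def classify(times):
    # times[j]: execution time of task j on this processor
    ave = statistics.mean(times)          # AVE_i
    sd = statistics.pstdev(times)         # SD_i
    crit = ave + sd                       # assumed form of Crit_i (equation (7))
    th = sorted((j for j, t in enumerate(times) if t > crit),
                key=lambda j: -times[j])  # long tasks, descending order -> Th_i
    tm = [j for j, t in enumerate(times) if t <= crit]   # remaining tasks -> Tm_i
    return th, tm

print(classify([4.0, 3.5, 1.0, 0.8, 0.6, 0.5]))  # -> ([0, 1], [2, 3, 4, 5])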
According to the above ideas, Figure 3 shows the flow of the HGES method. The specific steps are as follows: Step 1: All P tasks are allocated to each processor i, and the corresponding times and energies are stored in GPU_Time_Energy_i[P][0] and GPU_Time_Energy_i[P][1], respectively; the performance accumulator AccumPer_i is initialized to 1 for each processor i, and the task allocation sequence AllocTask_i[] is initialized to null.
Step 2: The average execution time AVE_i and the standard deviation SD_i of the P tasks on each processor i are calculated.
Step 3: Tasks whose execution time on processor i is higher than Crit_i are put into Th_i[], the remaining tasks are put into Tm_i[], and each Th_i[] is sorted in descending order of execution time.

Step 4: For each of the K candidate processors, determine whether the time of the task currently pointed to by pth_i in Th_i is equal to 0. If it is equal to 0, the task has already been allocated to another processor, and pth_i is increased by one to examine the next task; if it is not equal to 0, calculate the product of the accumulated time of the processor and the energy consumption of the current task, namely AccumPer_i * GPU_Time_Energy_i[Th_i[pth_i]][1], and assign the result to TTE_i. After the Th lists are exhausted, the same check is made for the task pointed to by ptm_j in Tm_j: if its time is equal to 0, the task has been allocated to another processor and ptm_j is increased by one; if it is not equal to 0, calculate the product of the accumulated time of the processor and the energy consumption of the current task, namely AccumPer_j * GPU_Time_Energy_j[Tm_j[ptm_j]][1], and assign the result to TTE_j.
Step 5: Among the K values TTE_i (or TTE_j), select the processor number minENumGPU corresponding to the smallest value, and determine whether minENumGPU comes from Th_i[]. If so, accumulate the corresponding time to the corresponding processor, that is, add GPU_Time_Energy_minENumGPU[Th_i[pth_i]][0] to AccumPer_minENumGPU, and assign the corresponding task to the list of the corresponding processor, namely TaskAlloc_minENumGPU = Th_i[pth_i]. This task will not be considered in the next allocation, so its execution time is set to zero in the lists of all processors. Then add one to the corresponding subscript pth_i to prepare for the next task. At this point one task has been assigned, so the counter of assigned tasks, Counter, is increased by one. If minENumGPU comes from Tm_i[], the processing is the same as for a task from Th_i[]; the only difference is that the processed data come from Tm_i[].

Algorithm 1 describes the overall HGES allocation. Line 20 accumulates the execution time of the assigned task into the execution time accumulator of the corresponding processor. Line 21 clears the time of the assigned task in all processor lists, indicating that the task has been assigned to the corresponding processor and cannot be assigned to another processor in the next assignment. Lines 22-26 handle the case in which ret_taskindex is in Tm_i[]; the processing is similar to lines 18-21, the difference being that the task to be processed is in Tm_i[]. Line 28 returns the obtained result. Algorithm 2 uses the idea of bubble sorting to select the K processor numbers with the smallest cumulative execution time. Its input is the cumulative execution time of each processor; its output is the K processor numbers with the smallest cumulative execution time. Line 1 initializes the related variables. Lines 2-8 traverse the cumulative time of each processor; lines 3-7 obtain the K return values; lines 4-6 filter out the NumGPU − K processor numbers with the largest values, where the maximum value is stored in the variable high; line 9 returns the remaining K processor numbers. Algorithm 3 obtains the task number and processor number with the smallest product of accumulated time and task energy among the K processors output by Algorithm 2; lines 1-12 process the tasks in Sort_Th_i[] corresponding to the K processors.

Regarding the efficiency of HGES, for a problem of size n, the time complexity of acquiring task time, power consumption, AVE_i, and SD_i is O(n); the time complexity of filtering feasible solutions is O(NumGPU * n); the time complexity of sorting Th_i[] is O(n log n); and the time complexity of assigning tasks according to the heuristic and greedy methods is O(2 * NumGPU * n). Since the number of processors NumGPU in the system is generally constant, the overall time complexity of the HGES method is O(n log n).
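To make Steps 4-5 and Algorithms 1-3 concrete, the following sketch implements the greedy assignment loop; it takes the Th_i/Tm_i lists produced by the classification step (e.g., the classify() helper shown earlier) plus per-task time and energy tables. The initialisation of AccumPer_i to 1 follows Step 1, while the function name, the default K = 2, and the example numbers are illustrative assumptions, not the authors' exact implementation.

def hges_assign(th, tm, time, energy, K=2):
    # th[i], tm[i]: Th_i and Tm_i task-number lists for processor i
    # time[i][j], energy[i][j]: GPU_Time_Energy_i[j][0] and GPU_Time_Energy_i[j][1]
    m = len(time)
    accum = [1.0] * m                      # AccumPer_i, initialised to 1 (Step 1)
    alloc = [[] for _ in range(m)]         # AllocTask_i[]
    assigned = set()
    for lists in (th, tm):                 # long tasks first, then short tasks
        while any(j not in assigned for lst in lists for j in lst):
            # Algorithm 2: the K processors with the smallest accumulated time
            k_gpus = sorted(range(m), key=lambda i: accum[i])[:K]
            best = None
            for i in k_gpus:               # Algorithm 3: min of AccumPer_i * task energy
                for j in lists[i]:
                    if j in assigned:      # time cleared to 0, i.e. already allocated
                        continue
                    score = accum[i] * energy[i][j]
                    if best is None or score < best[0]:
                        best = (score, i, j)
                    break                  # only the head of each (sorted) list is considered
            if best is None:
                break
            _, i, j = best
            alloc[i].append(j)             # assign task j to processor i
            accum[i] += time[i][j]
            assigned.add(j)                # removes the task from every other list
    return alloc

# Example with invented numbers: 3 identical GPUs, 6 tasks (Th = [0, 1], Tm = [2, 3, 4, 5]).
t = [[4.0, 3.5, 1.0, 0.8, 0.6, 0.5]] * 3
e = [[400.0, 360.0, 90.0, 70.0, 55.0, 45.0]] * 3
print(hges_assign([[0, 1]] * 3, [[2, 3, 4, 5]] * 3, t, e))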

Experiment
In order to verify the effectiveness and adaptability of the HGES method, we choose two platforms for verification; their configurations are listed in Table 1.
To better verify HGES, six typical CUDA benchmark tasks are selected, and different input sizes and different task numbers are used in the simulations. These benchmarks are matrix multiplication (MM), histogram (HG), scalar products (SP), BlackScholes (BS), vectorAdd (VA), and mergeSort (MS). Their specific parameters are shown in Table 2.
To measure energy, a HIOKI 3334 AC/DC power meter is used to measure the energy of the system. When the number of tasks is smaller than the number of GPUs, the PH and EET methods almost always degenerate into FIFO methods, resulting in little performance difference, so those experiments are unnecessary. The HGES approach in this paper is implemented as follows. First, the pseudocode of Algorithms 1 through 3 is run in VS2015. Second, we reorder the tasks based on the output of HGES from the first step. For measuring energy, the energy consumed by the algorithm itself and the energy consumed by running the tasks are the two parts of the approach's energy, so we record them as Energy 1 and Energy 2, respectively; finally, the sum of Energy 1 and Energy 2 is the energy of HGES.

Figures 4 and 5 compare, on the two platforms, the time difference among processors obtained by HGES and by 0-1 programming under different Q parameters and different numbers of tasks. The trend in these figures shows that the optimal solution obtained by 0-1 programming is affected by the Q parameter: the smaller the Q value, the worse the scheduling effect, and when Q is 1 there is no solution. Compared with 0-1 programming, the time difference of HGES under different tasks on the different platforms is the smallest. In Figures 4 and 5, when the number of scheduled tasks is 10 and Q is set to 0.1, the time difference of 0-1 programming is 5.22 s in Figure 4 and 2.259 s in Figure 5; when Q is set to 0.9, the time difference is 0.573 s in Figure 4 and 0.3667 s in Figure 5. The time difference of HGES under the 10 scheduled tasks is 0.445 s in Figure 4 and 0.3424 s in Figure 5. When 20 tasks are scheduled, the time difference of 0-1 programming is 7.86 s in Figure 4 and 5.417 s in Figure 5 when Q is 0.1, and 0.536 s in Figure 4 and 0.361 s in Figure 5 when Q is 0.9; the time difference of HGES is 0.365 s in Figure 4 and 0.233 s in Figure 5. In the case of 40 tasks, the time difference of 0-1 programming is 9.709 s in Figure 4 and 7.367 s in Figure 5 when Q is 0.1, and 0.974 s in Figure 4 and 0.756 s in Figure 5 when Q is 0.9; the time difference of HGES is 0.271 s in Figure 4 and 0.173 s in Figure 5. When 80 tasks are scheduled, the time difference of 0-1 programming is 22.8 s in Figure 4 and 12.3 s in Figure 5 when Q is 0.1, and 2.53 s in Figure 4 and 2.1 s in Figure 5 when Q is 0.9; the time difference of HGES is 0.203 s in Figure 4 and 0.129 s in Figure 5. In Figures 4 and 5, the results of the HGES method are similar under different numbers of scheduled tasks. From the experimental data, HGES clearly reduces the time difference when scheduling tasks.

In Figures 6 and 7, as the Q parameter gradually approaches 1, the energy is gradually reduced; when Q is set to 0.9, the energy consumption is the smallest. When the Q parameter is 1, the 0-1 programming in sub-graphs (a), (b), (c), and (d) cannot be solved. The trend in the figures shows that the optimal solution obtained by 0-1 programming is affected by the Q parameter: the larger the Q value, the lower the energy consumption, and when Q is 1 there is no solution. Compared with 0-1 programming, the HGES method consumes the least energy under the different tasks in Figures 6 and 7. In the experiment, when the number of scheduled tasks is 10 and Q is 0.1, the energy of 0-1 programming is 456.21 J in Figure 6 and 323.6 J in Figure 7; when Q is 0.9, the energy is 321.13 J in Figure 6 and 237.5 J in Figure 7.
The energy of HGES under 10 tasks is 313.12 J in Figure 6. When 0-1 programming schedules 20 tasks and Q is 0.1, its energy is 596.16 J in Figure 6 and 427.2 J in Figure 7; when Q is 0.9, its energy is 418.74 J in Figure 6 and 318.5 J in Figure 7. The energy of HGES under 20 tasks is 393.73 J in Figure 6 and 291.7 J in Figure 7. When 0-1 programming schedules 40 tasks and Q is 0.1, its energy is 1,453.42 J in Figure 6 and 1,044.4 J in Figure 7; when Q is 0.9, its energy is 1,012.82 J in Figure 6 and 761.9 J in Figure 7. The energy consumption of HGES under 40 tasks is 945.84 J in Figure 6 and 703.3 J in Figure 7. When 0-1 programming schedules 80 tasks and Q is 0.1, its energy is 1,801.86 J in Figure 6 and 1,278.5 J in Figure 7; when Q is 0.9, its energy is 1,227.45 J in Figure 6.

Input: the set of P programs to be executed, the number of GPUs NumGPU
Output: program sequence to be executed on each GPU (AllocTask_NumGPU[])
Algorithm:
(1) getTime_Energy_AVE_SD(P, NumGPU)
(2) for i ← 0, NumGPU do // Filter feasible solutions
(3) for j ← 0, P do
(4) if the time of the jth task on the ith processor > Crit_i
(5) put task number j into the Th_i[] array // Only store the task number (subscript)
(6) else
(7) put task number j into the Tm_i[] array
(8) end if
(9) end for
(10) end for
(11) record the numbers of entries in Th_i and Tm_i as NumInTh, NumInTm
(12) for i ← 0, NumGPU do // Sort
(13) use the sort(Th_i) function to sort each Th_i array and store the result in Sort_Th_i[]
(14) end for
(15) for i ← 0, i < P do // Filter the K processor numbers with the smallest cumulative time
(16) K_GPUIndex[] ← select_Min_Time(AccumPer_i, K)
(17) (ret_GPUIndex, ret_taskindex, thortm) ← select_Min_Energy(Sort_Th_i, Tm_i, K_GPUIndex)
(18) if thortm is 0 // ret_taskindex is from the Th array
(19) put the task number Sort_Th_ret_GPUIndex[ret_taskindex] into AllocTask_ret_GPUIndex[]
(20) add the time of ret_taskindex in Sort_Th_ret_GPUIndex to AccumPer_ret_GPUIndex
(21) assign the time of ret_taskindex in Sort_Th of the other GPUs to 0
(22) else // ret_taskindex is from the Tm array
(23) put the task number Tm_ret_GPUIndex[ret_taskindex] into AllocTask_ret_GPUIndex[]
(24) add the time of ret_taskindex in Tm_ret_GPUIndex to AccumPer_ret_GPUIndex
(25) assign the time of ret_taskindex in Tm of the other GPUs to 0
(26) end if
(27) end for
(28) return AllocTask_NumGPU[]
ALGORITHM 1: HGES task allocation.

Input: Sort_Th_i, Tm_i (0 ≤ i ≤ NumGPU−1), K_GPUIndex
Output: the GPU index, the task index, and a flag indicating whether the task comes from the Th array or the Tm array
Function select_Min_Energy(Sort_Th_i, Tm_i, K_GPUIndex)
(1) if pth_K_GPUIndex < NumInTh // The task number is in the Th array
(2) while the time of pth_K_GPUIndex in Sort_Th_K_GPUIndex is 0 // skip the assigned tasks
(3) pth_K_GPUIndex++
(4) if pth_K_GPUIndex > NumInTh − 1 // The task number is not in the Th array
(5) break
(6) end if
(7) end while
(8) if pth_K_GPUIndex ≤ NumInTh − 1
(9) Energytemp_K_GPUIndex ← AccumPer_K_GPUIndex * the energy of task number pth_K_GPUIndex
(10) flag_K_GPUIndex ← 1
(11) end if
(12) end if
(13) if flag_K_GPUIndex is 0 // The task number is in the Tm array
(14) while the time of ptm_K_GPUIndex in Tm_K_GPUIndex is 0 // skip the assigned tasks
(15) ptm_K_GPUIndex++
(16) if ptm_K_GPUIndex > NumInTm − 1
(17) break
(18) end if
(19) end while
(20) if ptm_K_GPUIndex ≤ NumInTm − 1
(21) Energytemp_K_GPUIndex ← AccumPer_K_GPUIndex * the energy of task number ptm_K_GPUIndex
(22) end if
(23) end if
(24) // The above calculation traverses the K_GPUIndex array
(25) for i ← 1, NumGPU − 1 do // Select the minimum product
(26) if Energytemp_i < MinEnergy
(27) assign the GPU number i to Return_GPUIndex
(28) assign Energytemp_i to MinEnergy
(29) end if
(30) end for
(31) if Return_GPUIndex is from the Th array
(32) assign 0 to flag_Return_GPUIndex
(33) return Return_GPUIndex, pth_Return_GPUIndex, 0
(34) else
(35) return Return_GPUIndex, ptm_Return_GPUIndex, 1
(36) end if
ALGORITHM 3: Getting the task and GPU number with the smallest product of accumulated time and energy of its task.

Figures 8 and 9 compare, on the two platforms, the time consumed by the scheduling methods themselves. In the case of 10 tasks in Figure 8, the minimum consumed time of 0-1 programming is 0.011 s when Q is 0.2, and the maximum consumed time is 0.031 s when Q is 0.1. The consumed time of HGES is 0.001 s, which improves the processing speed by 11 times and 31 times compared with 0-1 programming when Q is 0.2 and 0.1, respectively. In Figure 9, the consumed time of HGES is 0.001 s, which improves the processing speed by 10 times and 23 times compared with 0-1 programming when Q is 0.2 and 0.9, respectively. In the case of 20 tasks in Figure 8, the minimum consumed time of 0-1 programming is 0.03 s when Q is 0.1, and the maximum consumed time is 0.9146 s when Q is 0.9. The consumed time of HGES is 0.0016 s, which improves the processing speed by 18.75 times and 517.62 times compared with 0-1 programming when Q is 0.1 and 0.9, respectively. In Figure 9, the consumed time of HGES is also 0.0016 s, which improves the processing speed by 37 times and 585 times compared with 0-1 programming when Q is 0.1 and 0.7, respectively. In the case of 40 tasks in Figure 8, the minimum consumed time of 0-1 programming is 0.103 s when Q is 0.1, and the maximum consumed time is 0.346 s when Q is 0.9.
The consumed time of HGES is 0.019 s, which improves the processing speed by 5.42 times and 18.21 times compared with 0-1 programming when Q is 0.1 and 0.9, respectively. In Figure 9, the consumed time of HGES is also 0.019 s, which improves the processing speed by 5.6 times and 50.5 times compared with 0-1 programming when Q is 0.4 and 0.9, respectively. In the case of 80 tasks in Figure 8, the minimum consumed time of 0-1 programming is 0.402 s when Q is 0.1, and the maximum consumed time is 2.368 s when Q is 0.7. The consumed time of HGES is 0.047 s, which improves the processing speed by 8.55 times and 50.38 times compared with 0-1 programming when Q is 0.1 and 0.7, respectively. In Figure 9, the consumed time of HGES is 0.046 s, which improves the processing speed by 11.1 times and 21.2 times compared with 0-1 programming when Q is 0.1 and 0.4, respectively. It can be seen from the experimental data that HGES has a great advantage in processing speed.

Figures 10 and 11 show the energy comparison between HGES and 0-1 programming. The trend shown in Figures 10 and 11 is similar to that in Figures 8 and 9 and is not repeated here. Figures 12 and 13 show the comparison of the allocation time of HGES and of 0-1 programming with different parameters on each processor. The abscissa in the figures represents the methods, namely 0-1 programming with different parameters and the HGES method; the ordinate is the ratio of processor allocation time. It can be seen from the figures that as the value of Q increases, the execution time of the processors becomes more balanced, and there is no solution when Q is 1. When Q is 0.9 in Figure 12, the proportions of task execution time on the processors are 30%, 33.07%, and 37.59%, respectively, whereas the HGES method makes the proportions 33.69%, 33.08%, and 33.23%, respectively. When Q is 0.9 in Figure 13, the proportions of task execution time on the processors are 30%, 30.03%, and 39.97%, respectively, whereas the HGES method makes the proportions 33.69%, 33.08%, and 33.23%, respectively. In conclusion, HGES is more balanced in task allocation than 0-1 programming.

Figures 14 and 15 show the time of each method; from the figures, we can see that the HGES method is more balanced than the PH and EET methods in assigning tasks. In Figure 14, EET consumes 1,355.61 J of energy and HGES consumes 1,140.76 J; HGES saves 22.14% of energy compared with PH and 15.84% compared with EET. In Figure 15, PH consumes 1,057.16 J, EET consumes 986.82 J, and HGES consumes 848.65 J; HGES saves 19.72% of energy compared with PH and 14.01% compared with EET. In conclusion, HGES is more effective than the PH and EET methods in task allocation and energy saving.

Conclusion
Today's society increasingly advocates sustainable development, and the energy consumption of heterogeneous systems has become an important issue of public concern.
This paper studies the energy saving problem of heterogeneous systems composed of multiple identical GPUs, analyzes the sources of their energy consumption, summarizes the characteristics of the solutions obtained by 0-1 programming, lists the shortcomings of current methods, and adopts heuristic and greedy methods to solve the energy saving problem of such systems. On this basis, the HGES scheduling method is proposed. First, HGES assigns all tasks to each processor in the system and calculates the average and standard deviation of the execution times. The tasks are then divided into a high-value part and a low-value part according to the average and standard deviation. The high-value part is allocated first according to the accumulated time of each processor, and the low-value part is allocated afterwards. In order to verify the effectiveness and rationality of HGES, experiments were conducted with different numbers of tasks and different inputs, and different methods were used for comparison. The experimental results show that HGES performs well in terms of both performance and energy saving.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.