Energy-Aware High-Performance Computing: Survey of State-of-the-Art Tools, Techniques, and Environments

The paper presents the state of the art of energy-aware high-performance computing (HPC), in particular the identification and classification of approaches by system and device types, optimization metrics, and energy/power control methods. System types include single devices, clusters, grids, and clouds, while the considered device types include CPUs, GPUs, multiprocessor, and hybrid systems. Optimization goals include various combinations of metrics such as execution time, energy consumption, and temperature, with consideration of imposed power limits. Control methods include scheduling, DVFS/DFS/DCT, power capping with programmatic APIs such as Intel RAPL and NVIDIA NVML, application optimizations, and hybrid methods. We discuss tools and APIs for energy/power management as well as tools and environments for prediction and/or simulation of energy/power consumption in modern HPC systems. Finally, programming examples, i.e., applications and benchmarks used in particular works, are discussed. Based on our review, we identify a set of open areas and important up-to-date problems concerning methods and tools for modern HPC systems allowing energy-aware processing.


Introduction
In today's high-performance computing (HPC) systems, consideration of energy and power plays an increasingly important role. New cluster systems are designed not to exceed 20 MW of power [1] with the aim of reaching exascale performance soon. Apart from the performance-oriented TOP500 ranking (https://www.top500.org/lists/top500/), the Green500 list (https://www.top500.org/green500/) ranks supercomputers by performance per watt. Wide adoption of GPUs helped to increase this ratio for applications that can be efficiently run on such systems. Programming and parallelization in such hybrid systems has become a necessity to obtain high performance but is also a challenge when using multi- and manycore environments. In terms of power and energy control methods, apart from scheduling, DVFS/DFS/DCT and power capping APIs have become available for CPUs and GPUs of mobile, desktop, and server lines. Power capping is now also available in job management systems for clusters such as Slurm, which allows shutting down idle nodes, starting them again when required, and setting a cap on the power used through DVFS [2]. Metrics such as execution time, energy, power, and temperature are used in various contexts and in various combinations, for various applications. There is a need for constant and thorough analysis of possibilities, mechanisms, tools, and results in this field to identify current and future challenges, which is the primary aim of this work.

Existing Surveys
Firstly, the matter of appropriate energy and performance metrics has been investigated in the literature [3]. There are several survey papers related to energy-aware high-performance computing, but as the field, technology, and features are evolving very rapidly, these lack certain aspects that we present in this paper.
Early works concerning data centers and clouds were surveyed in [4], showing a variety of energy-aware aspects in related literature. The authors proposed a taxonomy of power/energy management in computing systems, distinguishing different abstraction levels, and presented energy-related works, including those describing models, hardware, and software components. Our survey extends the above work with newer solutions and provides a more compact view of today's energy/power-related issues.
The study in [5] categorizes energy-aware computing methods for servers, clusters, data centers, grids, and clouds but lacks discussion of all currently considered optimization criteria, mechanisms such as power capping, and detailed analysis of applications and benchmarks used in the field. Thus, we include analysis of available target optimization metrics, energy-aware control methods, and benchmarks in our classification.
The study in [6] reviews energy-aware performance analysis methodologies for HPC available in 2012, listing hardware, software, and hybrid approaches as well as tools dedicated to energy monitoring. However, the paper does not review methodologies for controlling the energy/power budget. Its main goal is to collect available energy/power monitoring techniques. In addition, it evaluates the existing tools in terms of overhead, portability, and user-friendliness. Consequently, we add energy and power control methods to our analysis.
The study in [7] includes a survey of software methods for improving energy efficiency in parallel computing from a slightly different perspective; namely, it focuses on increasing energy efficiency for parallel computations. It discusses components such as processor, memory, and network, from the application to the system level, and elements such as load and mixed precision computations in parallel computing.
A survey of techniques for improving energy efficiency in distributed systems, focused more on grids and clouds, was presented in [8]. Compared to our work, it does not analyze in such detail possible optimization goals, node- and cluster-level techniques, or energy-aware simulation systems. Thus, we include an exhaustive list of optimization criteria used in various works and classify approaches also by device types and computing environments.
Power- and energy-related analytical models for high-performance computing systems and applications are discussed in detail in [9], with references to contributions of other works in this particular subarea. Node architecture is discussed, and models considering CPUs, GPUs, Intel Xeon Phis, and FPGAs are included. Counter-based models are analyzed. We focus more on methods and tools as well as whole simulation environments that can make use of such models.
Techniques related to energy efficiency in cluster computing are surveyed in [10], including software- and hardware-related factors that influence energy efficiency, adaptive resource management, dynamic power management (DPM), and dynamic voltage and frequency scaling (DVFS) methods. Our paper extends that work considerably in terms of the number of methods considered.
A survey of concepts, techniques, and algorithms for energy-efficient processing in ultrascale systems was presented in [11], along with hardware mechanisms, software mechanisms for energy and power consumption, energy-aware scheduling, energy characteristics of algorithms, and algorithmic techniques for energy-aware processing. The paper can be considered complementary to ours, as it provides descriptions of energy-aware algorithms and algorithmic techniques that we do not focus on. In turn, we provide a wider consideration of energy metrics and methods.
Paper [12] presents current research related to energy efficiency and solutions related to power-constrained processing in high-performance computing, on the way towards exascale computing. Specifically, it considers the power cap of 20 MW for future systems, objectives such as energy efficiency and power-aware computing, and energy and power management technologies such as DVFS and DCT. The work also surveys various power monitoring tools such as Watts Up? Pro, vendor tools such as Intel RAPL, NVIDIA NVML, AMD Application Power Management, and IBM EnergyScale, and finer grained tools such as PowerPack, Penguin PowerInsight, PowerMon [13], PowerMon2 [13], Ilsche, and High-Definition Energy Efficiency Monitoring (HDEEM). While the paper provides a detailed description of selected methods, especially DVFS and tools for monitoring, we extend the characterization of energy approaches per device and system types and various optimization metrics.
The study in [14] presents how to adapt performance measuring tools for energy efficiency management of parallel applications, specifically the libadapt library and an OpenMP wrapper.
The study in [15] presents a survey of several energy savings methodologies with analysis concerning their effectiveness in an environment in which failures do occur. Energy costs of reliability are considered. An energy-reliability metric is proposed that considers the energy required to run an application in such a system.
The survey presented in [16] provides a systematic approach for analyzing works related to energy efficiency, covering the main data center domains from basic equipment, including server and network devices, through management systems to end-user software, all in the context of cloud computing. The proposed analysis made it possible to present existing challenges and possible future work. Our survey is more concerned with HPC solutions; however, some aspects are common also for cloud-related topics.
Topics related to power monitoring for ultrascale systems are presented in [17]. The paper describes solutions used for online power measurement, including a profound analysis of the current state of the art, a detailed description of selected tools with examples of their usage, open areas concerning the subject, and possible future research directions. Our paper is more focused on power/energy management, providing a review of control tools, models, and simulators.

Motivations for This Work
In view of the existing reviews of work on energy-related aspects in high-performance computing, the contribution of our work can be considered as the up-to-date survey and analysis of progress in the field including the following aspects: ( In the paper, we focus on a survey of available methods and tools allowing proper configuration, management, and simulation of HPC systems for energy-aware processing. While we do not discuss designing applications, we discuss available APIs and power management tools that can be used by programmers and users of such systems. Methods that require hardware modifications such as cooling or architectural changes are out of the scope of this paper.

Tools for Energy/Power Management in Modern HPC Systems
Available tools for energy/power management can be considered in two categories: monitoring and controlling. Depending on the approach or vendor, some tools allow only reading the energy/power consumption, while others allow both reading and limiting (capping) it. Also, some tools limit energy/power consumption only indirectly: a user can modify, e.g., the device frequency to lower the energy consumption. Finally, there are many derived tools that wrap the aforementioned low-level drivers in a more user-friendly form.
A solid survey of available tools for energy/power management was presented in paper [12]. Below we propose a slightly different classification, choosing the most significant tools available in 2019 and filling some gaps in the aforementioned survey.

Power Monitoring.
After HPC started focusing not only on job execution time but also on energy efficiency, researchers started monitoring the energy/power consumption of the system as a whole using external meters such as Watts Up? Pro. Such an approach has a big advantage, as it monitors actual energy/power consumption. However, such external meters cannot report the energy/power consumption of system subcomponents (e.g., CPU, GPU, and memory).

Power Controlling.
As mentioned before, there are several indirect tools or methods that allow us to control energy and power consumption. Dynamic voltage and frequency scaling (DVFS), sometimes considered separately as DFS and DVS, is one of the approaches that allow us to lower the processor voltage and/or frequency in order to reduce energy/power consumption, while at the same time degrading performance. DVFS is available for both CPUs and GPUs. The study in [18] discusses differences between using DVFS on a CPU and on a GPU.
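To make the DVFS trade-off concrete, the following minimal sketch estimates runtime, power, and energy of a compute-bound task at several voltage/frequency levels. It relies on the textbook approximation that dynamic power scales with C·V²·f plus a constant static term. The effective capacitance, static power, operating points, and cycle count are illustrative assumptions and not measurements of any real processor.

```python
# Hypothetical illustration of the DVFS trade-off for a compute-bound task.
# Assumed model: P = P_static + C_eff * V^2 * f and runtime = cycles / f.

def dvfs_estimate(cycles, freq_hz, volt, c_eff=1.0e-8, p_static=10.0):
    """Return (runtime_s, power_w, energy_j) for one voltage/frequency level."""
    runtime = cycles / freq_hz
    power = p_static + c_eff * volt ** 2 * freq_hz
    return runtime, power, power * runtime

# Assumed operating points: (frequency in Hz, voltage in V).
LEVELS = [(1.2e9, 0.8), (2.0e9, 1.0), (3.0e9, 1.2)]

def best_level(cycles, levels=LEVELS, **model):
    """Pick the level with the lowest estimated energy under the assumed model."""
    return min(levels, key=lambda lv: dvfs_estimate(cycles, *lv, **model)[2])

if __name__ == "__main__":
    for f, v in LEVELS:
        t, p, e = dvfs_estimate(6.0e9, f, v)
        print(f"{f / 1e9:.1f} GHz: {t:.2f} s, {p:.2f} W, {e:.2f} J")
```

Under these assumed parameters the lowest frequency minimizes energy at the cost of the longest runtime. Raising the assumed static power (for example, `best_level(6.0e9, p_static=30.0)`) shifts the optimum towards higher frequencies, which is why the best setting has to be determined per platform and application.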
Dynamic concurrency throttling (DCT) and concurrency packing [19] are other techniques that can result in energy/power savings. By reducing the number of available resources, such as the number of threads for an OpenMP application, a user is able to control the power consumption and performance of the application.
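The idea behind concurrency throttling can be sketched as a selection over profiled runs: given execution time and average power measured for several thread counts, pick the fastest configuration that stays under a power cap, or the one with the lowest energy. The `Profile` type, the function names, and the measurements below are hypothetical and serve only as an illustration.

```python
# Hypothetical DCT sketch: choose a thread count from profiled runs.
from typing import Iterable, NamedTuple, Optional

class Profile(NamedTuple):
    threads: int
    time_s: float    # measured execution time
    power_w: float   # measured average power

def pick_threads(profiles: Iterable[Profile], power_cap_w: float) -> Optional[Profile]:
    """Fastest configuration whose average power stays under the cap."""
    feasible = [p for p in profiles if p.power_w <= power_cap_w]
    return min(feasible, key=lambda p: p.time_s) if feasible else None

def min_energy(profiles: Iterable[Profile]) -> Profile:
    """Configuration with the lowest energy (time multiplied by average power)."""
    return min(profiles, key=lambda p: p.time_s * p.power_w)

# Made-up measurements for a memory-bound parallel region.
RUNS = [Profile(4, 20.0, 60.0), Profile(8, 12.0, 85.0),
        Profile(16, 10.0, 130.0), Profile(32, 9.8, 190.0)]

if __name__ == "__main__":
    print(pick_threads(RUNS, power_cap_w=100.0))
    print(min_energy(RUNS))
```

In this made-up data set, adding threads beyond 8 brings little speedup but a large power increase, so both the capped and the energy-oriented selection settle on 8 threads.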

Power Monitoring and Controlling.
Full power management, including monitoring energy/power consumption as well as controlling the power limits, was implemented by many hardware manufacturers. Vendor-specific tools were described in detail in an appendix of [12]. The authors identified the following power management tools: Running Average Power Limit (RAPL) for Intel, Application Power Management (APM) for AMD, EnergyScale for IBM, and the NVIDIA Management Library (NVML) for NVIDIA. It is worth noting that besides the C-based programming library (NVML), NVIDIA introduced nvidia-smi, a command-line utility built on top of NVML. Both NVML and nvidia-smi are supported for most of the Tesla, Quadro, Titan, and GRID lines [20].
Intel RAPL provides capabilities for monitoring and controlling power/energy consumption for privileged users through model-specific registers (MSR). In its first release (Sandy Bridge), RAPL used a software power model for estimating energy usage based on hardware performance counters. According to the study [21], Haswell RAPL introduced an enhanced implementation with fully integrated voltage regulators, allowing for actual energy measurements and improving the measurement accuracy. The precision of RAPL was evaluated in [22] against an external power meter, showing that the measurements are almost identical.
The study in [23] reviews existing CPU RAPL measurement validations and focuses on validating RAPL DRAM power measurements using different types of DDR3 and DDR4 memory and comparing these with those from an actual hardware power meter.
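As an illustration of RAPL-based monitoring, the sketch below derives average package power from two readings of a cumulative energy counter, including the wraparound case. It assumes a Linux system that exposes RAPL through the powercap sysfs tree at the path shown. The path, the zone index, and read permissions vary between systems, so these are assumptions and not guarantees.

```python
import time
from pathlib import Path

# Assumed location of the package-0 RAPL zone in the Linux powercap sysfs tree.
ZONE = Path("/sys/class/powercap/intel-rapl:0")

def energy_delta_uj(start_uj: int, end_uj: int, max_range_uj: int) -> int:
    """Energy consumed between two counter readings, allowing for one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped: add the part before and the part after the wrap.
    return (max_range_uj - start_uj) + end_uj

def read_int(path: Path) -> int:
    return int(path.read_text().strip())

def average_power_w(zone: Path = ZONE, interval_s: float = 1.0) -> float:
    """Sample the energy counter twice and convert microjoules to watts."""
    max_range = read_int(zone / "max_energy_range_uj")
    start = read_int(zone / "energy_uj")
    t0 = time.monotonic()
    time.sleep(interval_s)
    end = read_int(zone / "energy_uj")
    elapsed = time.monotonic() - t0
    return energy_delta_uj(start, end, max_range) / 1e6 / elapsed

if __name__ == "__main__":
    try:
        print(f"package power: {average_power_w():.2f} W")
    except OSError as err:
        # Missing zone or insufficient permissions (often root is required).
        print(f"RAPL counters not readable here: {err}")
```

Sampling a cumulative counter in this way yields average power over the interval, so the interval length determines how fine-grained the observed power profile can be.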
Although Intel RAPL is well known and well described in the literature, and research on processor power management and power capping has been documented since Sandy Bridge was released, competitors' tools such as AMD's APM TDP Power Cap and IBM's EnergyScale were mostly just mentioned in many papers but never fully examined in any significant work. This seems to be one of the open areas for researchers.
Table 1 collects basic information regarding the aforementioned tools for energy/power management, with comments and examples of related work.

Derived Tools.
Performance Application Programming Interface (PAPI) has been developed continuously since its release and first papers [27], and recently, besides processor performance counters, it was extended to offer access to RAPL and the NVML library through the PAPI interface [28].
Processor Counter Monitor (PCM) [29] is an open-source library as well as a set of command-line utilities designed by Intel, very similar to PAPI. It also accesses performance counters and allows energy/power monitoring via the RAPL interface.
Performance under Power Limits (PUPiL) [30] is an example of a hybrid hardware-software approach to achieving energy/power consumption benefits. It manipulates DVFS as well as core allocation, socket usage, memory usage, and hyperthreading. The authors compared this approach to raw RAPL power capping, and the results are in favor of PUPiL.
Score-P, intended for analysis and subsequent optimization of HPC applications, allows energy-aware analysis. It is shown in [31] how clock frequency affects the execution time of a finite element application with a minimum of energy consumption on the SuperMUC infrastructure. Consequently, both energy-optimal and time-optimal configurations are distinguished, the former saving 2% energy while extending execution time by 14% and the latter saving 14% time while taking 6% more energy.
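One way to compare such configurations is a combined metric such as the energy-delay product (EDP), which also appears among the target metrics surveyed later. The sketch below applies it to the percentages quoted above. It assumes that both configurations are expressed relative to the same default configuration, which the text does not state explicitly, and the names `edp`, `CONFIGS`, and `rank` are illustrative.

```python
def edp(energy, time, weight=1):
    """Energy-delay product E * T**weight (weight=2 gives ED2P)."""
    return energy * time ** weight

# Relative (energy, time) factors versus an assumed common baseline (1.0, 1.0).
CONFIGS = {
    "baseline": (1.00, 1.00),
    "energy-optimal": (0.98, 1.14),  # 2% less energy, 14% longer run
    "time-optimal": (1.06, 0.86),    # 6% more energy, 14% shorter run
}

def rank(weight=1):
    """Configuration names ordered from lowest to highest E * T**weight."""
    return sorted(CONFIGS, key=lambda name: edp(*CONFIGS[name], weight))

if __name__ == "__main__":
    for name in rank():
        print(f"{name}: EDP = {edp(*CONFIGS[name]):.4f}")
```

Under this assumption the time-optimal configuration has the lowest EDP (about 0.91 of the baseline), while the energy-optimal one ends up above the baseline (about 1.12). This shows how strongly the preferred configuration depends on the metric being optimized.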
Since the Ubuntu 18.04 LTS release, power capping has become available as a user-friendly command-line utility, powercap-set [32]. This tool is also based on RAPL, so it is only useful for Intel processors. It allows setting a power limit on each of the available domains (PKG, PP0, PP1, and DRAM).
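On Linux, RAPL limits are also exposed as files in the powercap sysfs tree, so setting a cap amounts to writing a value in microwatts to a constraint file. The following minimal sketch assumes that layout (for example, a zone directory such as `/sys/class/powercap/intel-rapl:0`) and root privileges on a real system. The helper names are hypothetical and this is not the implementation of any particular utility.

```python
from pathlib import Path

def _limit_file(zone: Path, constraint: int) -> Path:
    # File naming follows the Linux powercap sysfs convention.
    return zone / f"constraint_{constraint}_power_limit_uw"

def set_power_limit(zone: Path, watts: float, constraint: int = 0) -> int:
    """Write a power limit to a powercap zone; returns the value in microwatts."""
    if watts <= 0:
        raise ValueError("power limit must be positive")
    limit_uw = int(round(watts * 1_000_000))
    _limit_file(zone, constraint).write_text(str(limit_uw))
    return limit_uw

def get_power_limit(zone: Path, constraint: int = 0) -> float:
    """Read the current power limit of a zone in watts."""
    return int(_limit_file(zone, constraint).read_text().strip()) / 1e6
```

A call such as `set_power_limit(Path("/sys/class/powercap/intel-rapl:0"), 80.0)` would then request an 80 W package limit. Whether the hardware honors the value, and within which range, is platform specific.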

Classification of Energy-Aware Optimizations for High-Performance Computing
The paper classifies existing works in terms of several aspects and features, including the following major factors:
Computing Environment. What and how many devices, especially compute devices, are considered, and whether optimization is considered at the level of a single device, a single multiprocessor system, a cluster, a grid, or a cloud (Table 2).
Device Type. What type(s) of devices are considered in optimization, specifically CPU(s), GPU(s), or hybrid CPU + accelerator environments (Table 3). It can be seen that all identified types of systems are represented by several works in the literature. However, there are few works that address energy-aware computing for hybrid CPU + accelerator systems. Additionally, there are more works addressing these issues for multicore CPUs compared to GPUs.
Target Metric(s) Being Optimized. Specifically, these include execution time, power limit, energy consumption, and temperature (Table 4). We can see that many works address the minimization of energy consumption at the cost of minimal performance impact. This may be performed by identification of application phases in which power minimization can contribute to that goal. Relatively few works address consideration of network and memory components for that purpose. There is a lack of automatic profiling and adjustment for parallel applications running in hybrid CPU + accelerator systems.
Energy/Power Control Method. How the devices are managed for optimization, including selection of devices/scheduling, lower-level CPU frequency control, power capping APIs for CPUs/GPUs, application-level modifications, or hybrid methods (Table 5). It can be seen that direct power capping APIs, described in more detail in Section 4.3, are relatively new and have not been investigated in many works yet, which opens possibilities for new solutions. In terms of system components that can be controlled in terms of power and energy, the literature distinguishes frequency, core and uncore [45], disk [53], and network [53]. The latter can also be done through Energy-Efficient Ethernet (EEE) [78], which can turn physical layer devices into a low-power mode with savings of up to 70%; the work in [78] shows that the overhead of the technology is negligible for many practical scenarios.
The MREEF framework considered in [57] distinguishes optimization steps such as detection of system phases, characterization of phases, classification, prediction of the upcoming system state, and reconfiguration for minimization of energy consumption (with consideration of disk and network scaling).
Table 6 correlates three of the major factors defined in Tables 3-5 and presents the existing works in the context of target metrics, energy/power control methods, and device types. The combination of the factors is a strong foundation for identification of both the recent trends in research regarding energy-aware high-performance computing and open areas for future research.
While the majority of the presented works in the literature focus on performance and power or energy optimization during an application's execution, it is also possible to consider pre- or postexecution scenarios. For example, the study in [37] considers postexecution scenarios after an application on a GPU has terminated. Through forced frequency control, it is possible to achieve lower energy consumption in such a situation compared to the default scenario. Details are considered in the tables.
Finally, applications and benchmarks used for power/energy-aware optimization in HPC systems are summarized in Table 7. It can be seen that NAS Parallel Benchmarks,

Table 2: Computing environment.
Optimization level / Works / Description
(1) Single device
[33] A platform based on ARM Cortex A9, 4, 8, and 16 core architectures
[34] Scheduling kernels on a GPU and frequency scaling
[35] A chip with k cores with specific frequencies is considered, and chips with 36 cores are simulated
[36] Finding the best application configuration and settings on a GPU
[37] Server-type NVIDIA Tesla K20m/K20c GPUs
[38] Exploration of thermal-aware scheduling for tasks to minimize peak temperature in a multicore system through selection of core speeds
[39] Comparison of energy/performance trade-offs for various GPUs
[40] Server multicore and manycore CPUs, desktop CPU, mobile CPU
[41] Single CPU under Linux kernel 2.6-11
[42] Intel Xeon Phi KNL 7250 computing platform, flat memory mode
[43] Exploration of execution time and energy on a multicore Intel Xeon CPU
(2) Multiprocessor system
[44] Task scheduling with thermal consideration for a heterogeneous real-time multiprocessor system-on-chip (MPSoC) system
[30] Presents PUPiL (Performance under Power Limits), a hybrid software/hardware power capping system based on a decision framework going through nodes and making decisions on configuration, considered for single and multiapplication scenarios (cooperative and oblivious applications)
[45] With notes specific to clusters
[14] Systems with 2 socket Westmere-EP, 2 socket Sandy Bridge-EP, and 1 socket Ivy Bridge-HE CPUs
[46] Dual-socket server with two Intel Xeon CPUs
(3) Cluster
[47] Proposes integration of power limitation into a job scheduler and implementation in SLURM
[48] Proposes the enhanced power adaptive scheduling (E-PAS) algorithm with integration of a power-aware approach into SLURM for limiting power consumption
[49] Approach applicable to MPI applications but focusing on states of processes running on CPUs, i.e., reducing power consumption of CPUs on which processes are idle or perform I/O operations
[50] Proposes DVFS-aware profiling that uses design time profiling and a nonprofiling approach that performs computations at runtime
[51] Split compilation is used with offline and online phases, results from the offline phase are passed to runtime optimization, grey box approach to autotuning, assumes code annotations
[52] Proposes a runtime library that performs power-aware optimization at runtime and searches for good configurations with DFS/DCT for application regions
[53] Approaches for modeling, monitoring, and tracking HPC systems using performance counters and optimization of energy used in a cluster environment with consideration of CPU, memory, disk, and network
[54] Proposes an energy-saving framework with ranking and correlating counters important for improving energy efficiency
[55,56] Energy savings on a cluster with Sandy Bridge processors
[57] With consideration of disk and network scaling
[58] Including disk, memory, processor, or even fans
[24] Analysis of performance vs power of a 32-node cluster running a NAS parallel benchmark
[59] A procedure for a single device (a compute node with CPU); however, it is dedicated to using such devices coupled into a cluster (tested on 8-9 nodes)
[60] Homogeneous multicore cluster
[61] Cluster
[62] Computer system with several nodes each with multicore CPUs
[63,64] Cluster with several nodes each with multicore CPUs
[65] Cluster with several nodes with CPUs
[66,67] Cluster in a data center
[68] Sandy Bridge cluster
[69] Cluster with InfiniBand
[70] Overprovisioned cluster which can run a certain number of nodes at peak power and more at lower power caps
[71] Cluster with 1056 Dell PowerEdge SC1425 nodes
[72] A cluster with 9421 servers connected by InfiniBand
(4) Grid
[73] A cluster or collection of clusters allowed in the model and implementation
[74] Implementations of a hierarchical genetic strategy-based grid scheduler and algorithms evaluated against genetic algorithm variants
(5) Cloud
[75] Meant for cloud storage systems
[76] Related to assignment of applications to virtual and physical machines
[77] Used as IaaS for computations

Table 3: Device type.
Device type / Works / Description
(1) Single/multicore/manycore CPU
[33] A platform based on ARM Cortex A9, 4, 8, and 16 core architectures
[49] Multicore CPUs as part of a node and cluster on which an MPI application runs
[35] A chip with k cores with specific frequencies is considered
[54] Cluster, 40 performance counters are investigated and correlated for energy-aware optimization, related to runtime, system, CPU, and memory power
[38] A multicore system with cores as discrete thermal elements
[58] Possibly also (2) multiprocessor system
[24] 32-node cluster each with 2 Sandy Bridge 8 core CPUs
[40] Multicore and manycore CPUs
[52] Sandy Bridge and Haswell Xeon CPUs
[77] Servers with single CPUs hosting VMs
[41] Single-core Pentium-M (32-bit) in an off-the-shelf laptop
[59] Single-core AMD Athlon-64
[42] Intel Xeon Phi KNL 7250 processor with 68 cores, flat memory mode
[43] Multicore Intel Xeon CPU
(2) Multiprocessor system
[44] A heterogeneous real-time multiprocessor system-on-chip (MPSoC) system, consisting of a number of processors each of which runs at its own voltage and speed
[47] A cluster with Intel Xeon CPUs
[30] A multiprocessor system with Intel Xeon CPUs
[48] A cluster with ARM CPUs, a cluster with Intel Ivy Bridge CPUs
[74] A grid system parametrized with the number of hosts, distribution of computing capacities, and host selection policy
[50] A system with a number of nodes with multicore CPUs assumed in the simulated HPC platform and cores of an Intel Core M CPU with 6 voltage/frequency levels assumed
[53] Cluster with consideration of CPU, memory, disk, and network
[55,56,61,62,65] Cluster with CPUs
[45] Many cores within a system, and core and uncore frequencies are of interest
[57] With consideration of disk and network scaling
[14] Systems with 2 socket Westmere-EP, 2 socket Sandy Bridge-EP, and 1 socket Ivy Bridge-HE CPUs
[76] Undefined machines in a data center capable of hosting up to 15 VMs
[60] Homogeneous multicore cluster
[63,64] Cluster with multicore CPUs
[66,67] Cluster in a data center
[68] Sandy Bridge cluster
[69] Cluster with InfiniBand
[46] Dual-socket server with two Intel Xeon CPUs
[70] Overprovisioned HPC cluster with CPUs
[71] Cluster with 1056 Dell PowerEdge SC1425 nodes
(3) GPU/accelerator
[34] A GPU allowing concurrent kernel execution and frequency scaling
[36] Focus on the GPU version and comparison to serial and multithreaded CPU versions
[75] GPUs used for generation of parity data in a RAID
[37] Postapplication minimization of energy consumed
[39] Server, desktop, mobile GPUs, not yet existing GPUs can be simulated
[72] A cluster with two Intel Xeon CPUs per node
(4) Hybrid
[73] Consideration of both GPUs and CPUs in a cluster or collection of clusters
[51] Targeted at optimization on a cluster with Intel Xeon CPUs and MICs, early evaluation performed using OpenMP on multicore Intel and AMD CPUs

Table 4: Target metric.

Target metric / Works / Description
(1) Performance/execution time with power limit
[73] Minimization of application running time with an upper bound on the total power consumption of compute devices selected for computations
[47] Shows the benefit of power monitoring for a resource manager and compares results for fixed frequency mode, minimum power level assigned to a job, and automatic mode with consideration of available power
[30] Performance under power cap, timeliness, and efficiency/weighted speedups are considered
[48] Execution time vs maximum power consumption per system considered, consideration of system utilization, power consumption profiles, and cumulative distribution function of the job waiting time
[52] Application slowdown vs power reduction and optimization of performance per watt
[24] Analysis of performance vs power limit configurations
[42] Analysis of performance vs power limit configurations
[62] Analysis of performance/execution time vs power limit configurations
[68] Turnaround time vs cluster power limits
[46] Consideration of the impact of power allocation for CPU and DRAM domains on performance when power capping
[70] Optimization of the number of nodes and power distribution between CPU and memory in an overprovisioned HPC cluster
(2) Performance/execution time/energy minimization + thermal awareness
[33] Task partitioning and scheduling; a heuristic task partitioning and scheduling (TPS) algorithm based on task partitioning is compared to Min-min and PDTM
[38] Finding such core speeds that tasks complete before deadlines and peak temperature is minimized
[67] Thermal-aware task scheduling algorithms are proposed for reduction of temperature and power consumption in a data center, and job response times are considered
[44] Minimization of energy consumption of the system with consideration of task deadlines and temperature limit
[66] Workload placement with thermal consideration and analysis of cooling costs vs data center utilization
(3) Performance/execution time/value + energy optimization
[34] Concurrent kernel scheduling on a GPU + impact of frequency scaling on performance and energy consumption
[45] Dynamic core and uncore tuning to achieve the best energy/performance trade-off; the approach is to lower core frequency and increase uncore frequency for codes with low operational intensity and to increase core frequency and lower uncore frequency for other codes; tuning for energy, energy delay product, and energy delay product squared
Energy minimization with no impact on performance
[35] Energy minimization at the cost of increased execution time, integer linear programming-based approach in order to find a configuration with the number of cores minimizing energy consumption
[55,56] Energy minimization at the cost of increased execution time, achieving energy savings while running a parallel application on a cluster through DVFS and frequency minimization during periods of lower activity: intranode optimization related to inefficiencies of communication, intranode optimization related to nonoptimal data, and computation distribution among processes of an application
[54] A what-if prediction approach to predict energy savings of possible optimizations; the work focuses on identification of a set of performance counters for a power and performance model
[36] Finding an optimal GPU configuration (in terms of the number of threads per block and the number of blocks)
[37] Minimization of energy after an application has finished through frequency control
[40] Energy minimization at the cost of increased execution time through power capping for parallel applications on modern multi- and manycore processors
[41] Energy minimization with low performance degradation (aiming at up to 5%)
[61] Energy minimization of MPI programs through frequency scaling with constraints on execution time; a linear programming approach is used, and traces from MPI application execution are collected
(5) Product of energy and execution time
[36] Finding an optimal GPU configuration (in terms of the number of threads per block and the number of blocks)

Table 5: Energy/power control method.
Energy/power control method / Works / Description
(1) Selection of devices/scheduling
[73] Selection of devices in a cluster or collection of clusters such that the maximum power consumption limit is followed + data partitioning and scheduling of computations
[35] Selection of cores for a configuration minimizing energy consumption
[75] Using GPUs for optimization/generation of parity data
[39] Selection of the best GPU architectures from the performance/energy usage point of view
[58] Specific scheduling and switching off unused cluster nodes
[33,71] Task partitioning and scheduling
[44] Task scheduling; a two-stage energy-efficient temperature-aware task scheduling algorithm is proposed: in the first stage dynamic energy consumption under task deadlines and, in the second, temperature profiles of processors are improved
[76] Application assignment to virtual and physical nodes of the cloud
[66] Workload placement in a data center
[68] Proposal of RMAP, a resource manager that minimizes average turnaround time for jobs and provides an adaptive policy that supports overprovisioning and power-aware backfilling
(2) DVFS/DFS/DCT
[49] For MPI applications with the goal not to impact performance
[47] Uniform frequency power limiting; investigates results for the fixed frequency mode, minimum power level assigned to a job, and automatic mode with consideration of available power
[45] Core and uncore frequency scaling of CPUs
[55,56] Minimization of energy usage through DVFS on particular nodes
[52] DFS, DCT
[14] DVFS, DCT
[37] Control of frequency on a GPU
[41] DVFS with dynamic detection of computation phases (memory and CPU bound)
[59] DVFS with a posteriori (using logs) detection and prioritization of computation phases (memory and CPU bound)
[61] Sysfs interface is used
[63,64] DCT, combined DVFS/DCT
[65] Sysfs interface
[72] Setting the frequency according to the established computing center policies
(3) Power capping
[24] Using Intel RAPL for power management
[40] Using Intel RAPL for analyzing energy/performance trade-offs with power capping for parallel applications on modern multi- and manycore processors
[42] Using PAPI and Intel RAPL
[62] Using Intel RAPL
[46] Using Intel's power governor tool and Intel RAPL

physical phenomena simulations, and compute-intensive applications are mainly used for measurements of solution performance. Identifying the same benchmarks in various papers makes it possible to either cross-check conclusions or integrate complementary approaches in future work.

Tools for Prediction and/or Simulation of Energy/Power Consumption in an HPC System
There are several systems that allow us to predict and/or simulate energy/power consumption in HPC systems. Table 8 presents a summary of the currently used tools.
GSSim [80] (Grid Scheduling Simulator) is a tool dedicated to simulating scheduling in a grid environment. The tasks are assigned to the underlying computational resources, and their communication is evaluated according to the defined network equipment. Its extension DCworms [81] provides additional plugins for temperature and power/energy usage in a modeled data center. The simulator provides three approaches to energy modeling: static, with various power-level modes; dynamic, where the energy consumption depends on the resource load; and application specific, which can be used for advanced model tuning. The experimental results of the performed simulations, compared to real hardware measurements, showed a high correlation between the simulation and a real HPC environment for both power and thermal models [91].
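The difference between the static and dynamic modeling approaches can be illustrated with a minimal Python sketch. It is not DCworms code: the function names, the power-state table, the wattage values, and the linear load model are all assumptions chosen for illustration.

```python
# Illustrative only: not the DCworms API. All names and numbers are assumptions.

# Static model: each power state has a fixed power draw in watts.
STATE_POWER_W = {"off": 0.0, "idle": 60.0, "busy": 200.0}


def static_energy_j(intervals):
    """Energy in joules for a list of (state, duration_s) intervals."""
    return sum(STATE_POWER_W[state] * duration for state, duration in intervals)


def dynamic_power_w(load, p_idle=60.0, p_max=200.0):
    """Dynamic model: power grows linearly with load in [0, 1] (assumed form)."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be within [0, 1]")
    return p_idle + (p_max - p_idle) * load


def dynamic_energy_j(samples):
    """Energy in joules for a list of (load, duration_s) samples."""
    return sum(dynamic_power_w(load) * duration for load, duration in samples)
```

For example, `static_energy_j([("idle", 10), ("busy", 20)])` charges 10 s at the idle level and 20 s at the busy level, whereas `dynamic_energy_j([(0.0, 10), (0.5, 20)])` charges the second interval at a power level halfway between idle and maximum. An application-specific model would replace `dynamic_power_w` with a function fitted to a given application.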
MERPSYS [92] (Modeling Efficiency, Reliability and Power consumption of multilevel parallel HPC SYStems using CPUs and GPUs) is a simulator that enables hierarchical modeling of a grid, a cluster, or a single-machine architecture and testing it against a defined application. The tool provides means (Java scripts specified using the web simulator interface) for flexible system and application definitions for simulating energy consumption and execution time. The simulator was tested using typical SPMD (Single Program, Multiple Data) and DAC (Divide and Conquer) applications [82].
CloudSim [84] is a framework dedicated to simulating the behavior of a cloud or a whole cloud federation, supporting an IaaS model. The tool enables modeling of all main elements of the cloud architecture, including physical devices, VM allocation, cloud market, network behavior, and dynamic workflows. The results of the simulation support data center resource provisioning, QoS, and energy-consumption analysis. CloudSim is used by researchers in academic and commercial organizations, e.g., HP Labs in the USA.
(4) Application optimizations
Theoretical consideration of optimizations of an application that results in improvement of performance countervalues
[36] Finding an optimal GPU configuration (in terms of the number of threads per block and the number of blocks)
[53, 57] Control of CPU frequency, spinning down the disk, and network speed scaling
[43] Exploration of various loop scheduling ways, chunk sizes, optimization levels, and thread counts

(5) Hybrid
[30] Software + RAPL; the proposed PUPiL approach combines hardware's fast reaction time with the flexibility of a software approach
[48] Scheduling/software + resource management (including RAPL); the proposed algorithm takes into account real power and energy consumption
[34] Concurrent kernel execution + DVFS
[50, 74] Scheduling + DVFS
[38] Scheduling + DVS for minimization of temperature and meeting task deadlines
[51] Scheduling jobs and management of resources and DVFS
[77] Selection of the resources for a given user request, with VM migration and putting unused machines in the sleep mode
[60] Workload distribution + DVFS-based multiobjective optimization
[69] Polling, interrupt-driven execution (relinquishing CPU and waiting on a network event), DVFS power levers
[70] Selection of nodes in an overprovisioned HPC cluster and Intel RAPL

[24, 42]
Nx CPU: [68], [47], [46, 62], [30, 48, 70]
GPU:
Hybrid: [73]
Performance/execution time/energy minimization + thermal aware
1x CPU: [33], [38]
Nx CPU: [44, 66, 67]

SimGrid [85] is a discrete-event simulation framework for grid environments focusing on versatility and scalability. The tool supports three different sources of input data: two kinds of API, including MPI tracing from real applications, and a DAG (directed acyclic graph) format for task workflows. The SimGrid extension [93] makes it possible to account for the energy consumption of concurrent applications in HPC grids featuring DVFS technology of multicore processors.
GENSim [94] is a data center simulator capable of modeling a mixed task input of both interactive web service calls and batch tasks. The tool has been used for estimation of power consumption, assuming usage of both brown and green energy, where the latter is used for accelerating the current batch computations during the predicted peak times of the renewable energy sources. The results were validated using a real hardware experimental testbed consisting of a collection of CPU-based (Intel Nehalem) cloud servers [86].
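The brown/green energy accounting described for GENSim can be illustrated with a small Python sketch. It is not GENSim code; the per-slot model, the function names, and the sample values are assumptions. In each time slot, the demand is served from the renewable supply first and the remainder is drawn from the brown source; any renewable surplus marks a slot in which batch computations could be accelerated.

```python
def split_green_brown(demand_wh, green_supply_wh):
    """Split per-slot energy demand into green and brown totals (in Wh).

    Both arguments are equally long sequences, one value per time slot.
    Returns (green_used_wh, brown_used_wh).
    """
    if len(demand_wh) != len(green_supply_wh):
        raise ValueError("demand and supply must cover the same time slots")
    green_used = 0.0
    brown_used = 0.0
    for demand, supply in zip(demand_wh, green_supply_wh):
        green = min(demand, supply)   # renewable energy is consumed first
        green_used += green
        brown_used += demand - green  # the rest comes from the brown source
    return green_used, brown_used


def green_surplus_wh(demand_wh, green_supply_wh):
    """Per-slot renewable energy left over after serving the demand."""
    return [max(0.0, supply - demand)
            for demand, supply in zip(demand_wh, green_supply_wh)]
```

With a demand of `[100, 200, 150]` Wh and a renewable supply of `[50, 300, 150]` Wh, only the second slot has a surplus, so a scheduler following this idea would move deferrable batch work into that slot.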
A combination of the OMNeT++ and INET tools [58] was used for modeling HPC computations, where energy-aware scheduling algorithms were tested. A specific cluster configuration was assumed, and a number of clients requesting a total of 400 jobs were simulated. The behavior of the main server components was evaluated, including procedures such as switching off idle nodes. The simulation results were compared to the results obtained in a real testbed environment.
Work | Applications/benchmarks
[33] MiBench and MediaBench
[44] Automotive-industrial, consumer-networking, telecom, mpeg
[73] MD5 password breaking application
[47] Monte Carlo simulation of particle transport, unstructured implicit finite element method, molecular dynamics, unstructured shock hydrodynamics
[30] Both single-application and multiapplication workloads: PARSEC (x264, swaptions, vips, fluidanimate, Black Scholes, bodytrack), MineBench (ScalParC, kmeans, HOP, PLSA, svmfe, btree, kmeans_fuzzy), Rodinia (cfd, nn, lud, particlefilter), Jacobi, swish++, dijkstra
[48] CoMD, Lulesh, MP2C
[34] NAS parallel benchmarks (NPB) kernel EP, European option pricing benchmark Black Scholes
[74] Workload simulated in a scheduling simulator
[49] Matrix multiplication + P_write_priv benchmark, an I/O benchmark of the IMB package
[50] Job models from historical data from HLRS (High-Performance Computing Center Stuttgart)
[51] Early evaluation through an application extracted from Drug Discovery code, computation of interatomic distances and overlap of drug molecule, and protein active site probes
[52] HPCG, NAS parallel benchmarks, NICAM-DC-MINI (for Post-K Japanese national flagship supercomputer development)
[53] For system adaptation experiments: molecular dynamics simulation (MDS), Advanced Research Weather Research and Forecasting (WRF-ARW)

GDCSim [87] (Green Data Center Simulator) provides a holistic solution for evaluation of data center energy consumption. The tool enables an analysis of data center geometries, workload characteristics, platform power management schemes, and applied scheduling algorithms. It supports both thermal analysis under different physical configurations (using CFD) and energy efficiency analysis of resource management algorithms (using an event-based approach). The simulator was used for evaluation of scheduling in an HPC environment and of a transactional workload on an Internet data center.
GreenCloud [88] is a packet-level simulator for a cloud, providing an energy consumption model for various data center architectures. The model covers the workload and basic infrastructure elements: computing servers and access, aggregation, and core network devices, including various L2/L3 switches working at various network speeds (1, 10, and 100 Gb Ethernet). For power management purposes, the simulator uses DVFS (dynamic voltage and frequency scaling) and DNS (DyNamic Shutdown) schemes along with the different workload characteristics incorporated into a defined data center model. The presented use case shows evaluation of energy consumption for two- and three-layer data center architectures, including a variant supporting a high-speed (100 Gb Ethernet) interconnection.
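How DVFS and dynamic shutdown enter a server power model can be sketched as follows. This is an illustration under stated assumptions rather than the simulator's actual model: the function names and the default wattages are hypothetical, and the cubic dependence on frequency is a commonly used approximation (supply voltage scaling with frequency), not a formula taken from [88].

```python
def server_power_w(freq_ratio, powered_on=True, p_fixed=130.0, p_freq=70.0):
    """Power draw of one server in watts.

    freq_ratio: CPU frequency divided by the maximum frequency, in (0, 1].
    powered_on: False models a server switched off by dynamic shutdown.
    p_fixed, p_freq: assumed frequency-independent and frequency-dependent parts.
    """
    if not powered_on:
        return 0.0
    if not 0.0 < freq_ratio <= 1.0:
        raise ValueError("freq_ratio must be within (0, 1]")
    # Assumed DVFS approximation: dynamic power scales with the cube of frequency.
    return p_fixed + p_freq * freq_ratio ** 3


def cluster_power_w(servers):
    """Total power for an iterable of (freq_ratio, powered_on) pairs."""
    return sum(server_power_w(freq, on) for freq, on in servers)
```

Under these assumptions, halving the frequency removes most of the frequency-dependent part but leaves the fixed part untouched, which is why shutting down idle servers typically saves more than frequency scaling alone.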
TracSim [89] is a simulator for a typical HPC cluster with a fixed power cap, which should not be exceeded due to cooling and electrical connection limitations. The assumption is that some compute jobs do not need as much power; thus, others can use more energy-consuming resources. The tool implements various scheduling policies to simulate different approaches for evaluation of the possible power level.
The experiments showed that this solution can be tuned for a specific environment, i.e., a production HPC cluster at Los Alamos National Laboratory (LANL), and that the overall simulation results are, in most cases, 90% accurate.
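The idea of scheduling under a fixed cluster-wide power cap can be sketched with a tiny event-driven simulation. This is not TracSim code and does not reproduce any of its policies; the function name, the job tuple layout, and the first-fit policy (which lets a later job start ahead of an earlier one that does not fit) are assumptions made for illustration.

```python
import heapq


def simulate_power_capped(jobs, cap_w):
    """Simulate first-fit job starts under a cluster-wide power cap.

    jobs: list of (name, power_w, duration_s) in submission order.
    Returns a dict mapping each job name to its start time in seconds.
    """
    for name, power, _ in jobs:
        if power > cap_w:
            raise ValueError(f"job {name} alone exceeds the power cap")
    pending = list(jobs)
    running = []  # min-heap of (end_time_s, power_w)
    now = 0.0
    starts = {}
    while pending:
        used_w = sum(power for _, power in running)
        waiting = []
        # Start every pending job that still fits under the cap.
        for name, power, duration in pending:
            if used_w + power <= cap_w:
                starts[name] = now
                used_w += power
                heapq.heappush(running, (now + duration, power))
            else:
                waiting.append((name, power, duration))
        pending = waiting
        if pending:
            # Advance time to the next completion, which frees its power.
            now, _ = heapq.heappop(running)
    return starts
```

For instance, with a 100 W cap and jobs `a` (60 W, 10 s), `b` (50 W, 5 s), and `c` (40 W, 5 s), jobs `a` and `c` start immediately, while `b` has to wait until enough power is released.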
In [90], Ostermann et al. proposed a combination of three tools, providing a sophisticated, event-based simulator for a cloud environment working under the Infrastructure as a Service (IaaS) model with a given power cap for the whole modeled system. The simulator consists of the following components: (i) ASKALON [95], responsible for the scientific workflow; (ii) GroudSim [96], the main event-based engine of the solution; and (iii) DISSECT-CF [97], containing functionality related to cloud modeling. The evaluation of the approach was based on the simulation of scientific workflows (using traces of real executions) and showed good performance and scalability despite the complexity of the solution.
The energy-aware HyperSim-G simulator [74] was used for testing genetic-based scheduling algorithms deployed in a grid environment. The tool is based on a basic version of the HyperSim-G event-based simulation package described in [98]. As an energy-saving technique, the tool utilized DVFS, and the performed experiments demonstrated a systematic method for evaluating compute grid schedulers that support biobjective (energy and performance) minimization.
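Biobjective minimization of energy and execution time usually yields a set of trade-off schedules rather than a single optimum. The following sketch, which is independent of HyperSim-G and uses a hypothetical function name and made-up sample values, filters candidate schedules described by (makespan, energy) pairs down to the non-dominated ones.

```python
def pareto_front(points):
    """Return the non-dominated (makespan_s, energy_j) pairs, sorted by makespan.

    A point dominates another if it is no worse in both objectives and
    strictly better in at least one (both objectives are minimized).
    """
    front = []
    best_energy = float("inf")
    for makespan, energy in sorted(set(points)):
        # Points are visited by increasing makespan, so a point is
        # non-dominated only if it strictly lowers the best energy so far.
        if energy < best_energy:
            front.append((makespan, energy))
            best_energy = energy
    return front
```

Given the candidates `[(10, 500), (12, 450), (11, 520), (15, 450), (9, 700)]`, the schedules `(11, 520)` and `(15, 450)` are dropped because another schedule is at least as good in both objectives.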
Design space exploration for GPGPUs was proposed in [39], providing a tool for multiobjective evaluation of GPGPU devices in the context of specific medical or industrial applications. The analysis is performed for various parameters of the modeled devices, including energy efficiency, performance, and real-time capabilities. The simulator was designed as a distributed application deployed in a heterogeneous cloud environment, supporting a variety of GPUs, including ones still to be released by the manufacturer. The solution was validated using a real-life streaming application and showed a low error level (below 4% in the worst case) in comparison to the real devices.
In [35], Langer et al. presented work on energy minimization of multicore chips for two mini-benchmark HPC applications. The optimal configuration was selected using integer linear programming. The simulation was based on the Sniper [99] package, aiming to increase efficiency by optimizing the level of the simulation accuracy. The tool was enhanced by the McPAT framework [100], providing energy-aware design space exploration for multicore chips, considering dynamic, short-circuit, and leakage power modeling.
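Selecting an energy-minimal configuration can be illustrated on a small search space. The sketch below deliberately swaps integer linear programming for plain exhaustive search, which is only practical for a handful of candidates; the function name, the (cores, frequency) parameterization, the optional deadline, and the caller-supplied time and power models are all assumptions and are not taken from [35].

```python
from itertools import product


def best_configuration(core_counts, freqs_ghz, time_s, power_w, deadline_s=None):
    """Exhaustively pick the (cores, freq) pair with the lowest energy.

    time_s and power_w are caller-supplied model functions of (cores, freq);
    deadline_s optionally discards configurations that run too long.
    Returns (energy_j, cores, freq), or None if nothing is feasible.
    """
    best = None
    for cores, freq in product(core_counts, freqs_ghz):
        t = time_s(cores, freq)
        if deadline_s is not None and t > deadline_s:
            continue
        energy = power_w(cores, freq) * t  # energy = power x time
        if best is None or energy < best[0]:
            best = (energy, cores, freq)
    return best
```

With toy models such as `time_s = lambda c, f: 80 / (c * f)` and `power_w = lambda c, f: 10 + c * f ** 3`, the search prefers many cores at a low frequency, and tightening `deadline_s` pushes the choice toward a higher frequency.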

Open Areas
Finally, based on the analyzed research, we can formulate open areas for research that seem crucial for further progress in the field of energy-aware high-performance computing:
(1) The variety of the HPC tools used for energy/power management, presented in Table 1, shows a need to unify the APIs provided by different vendors into a uniform power-aware API spanning available HPC computing devices such as multi- and manycore CPUs, GPUs, and accelerators, supporting a common, cardinal subset of universal parameters related to power/energy as well as performance measurements and management.
(2) The usability, precision, and performance of the currently used tools for prediction and simulation presented in Table 8, in the context of their support for specific computing environments (Table 2), device types (Table 3), and used metrics, show that further development of (possibly empirical) performance-energy models is required for a wide range of CPU and GPU architectures and for various classes of applications (e.g., the ones described in Table 7), including performance (power limit) functions available for runtime usage as well as in simulator environments.
(3) As a conclusion from Table 6, we can recognize several open research directions in the energy-aware HPC field that still need further development:
  (i) Energy/power-aware methods for hybrid (CPU + accelerator) systems
  (ii) Optimization with any energy/power control method but targeted at minimization of the product of energy and execution time
  (iii) Using hybrid energy/power control methods for minimization of energy consumption and of the energy-time product
(4) Finally, analysis of the energy/power control methods presented in Table 5 leads us to the following conclusions:
  (i) There is a need for development of tools for automatic configuration of an HPC system, including power caps, for a wide variety of application classes, focusing on performance and energy consumption and available for various parallel programming APIs. While there exist approaches for selected classes of applications using MPI (e.g., [65]), there are no general tools able to adjust to a variety of application APIs. These tools can use the models proposed in item (2) as well as detect and assign an application to one of selected classes in terms of performance-energy profiles.
  (ii) Automatic configuration of an HPC system in terms of performance and energy for a hybrid (CPU + accelerator) application at runtime, where off-loading of computations can be conditioned not only by the time of the computations but also by power/energy constraints.
  (iii) Further development and validation of currently existing tools focused on the energy/power management area, including their functionality extensions as well as quality improvements, e.g., validation of AMD's Application Power Management TDP Power Cap tool or IBM's EnergyScale capabilities.
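To make the first open area more tangible, the following Python sketch shows what a minimal vendor-neutral interface could look like. It is purely hypothetical: `PowerDevice`, `FakeDevice`, and all method names are invented for illustration and do not correspond to any existing library; a real backend would wrap a vendor API instead of the in-memory stand-in used here. The helper at the end computes the energy-time product mentioned in item (3)(ii).

```python
from abc import ABC, abstractmethod


class PowerDevice(ABC):
    """Hypothetical vendor-neutral interface; not an existing library."""

    @abstractmethod
    def read_energy_j(self) -> float:
        """Cumulative energy consumed so far, in joules."""

    @abstractmethod
    def set_power_cap_w(self, watts: float) -> None:
        """Request a power limit in watts."""


class FakeDevice(PowerDevice):
    """In-memory stand-in for a CPU or GPU backend, for demonstration only."""

    def __init__(self, power_w: float):
        self.power_w = power_w
        self.cap_w = None
        self._energy_j = 0.0

    def run(self, seconds: float) -> None:
        # Draw the lower of the nominal power and the cap, if one is set.
        drawn = self.power_w if self.cap_w is None else min(self.power_w, self.cap_w)
        self._energy_j += drawn * seconds

    def read_energy_j(self) -> float:
        return self._energy_j

    def set_power_cap_w(self, watts: float) -> None:
        if watts <= 0:
            raise ValueError("power cap must be positive")
        self.cap_w = watts


def energy_delay_product(energy_j: float, time_s: float) -> float:
    """Energy-time product used as a combined optimization metric."""
    return energy_j * time_s
```

Code written against `PowerDevice` would not need to know whether the underlying device is a CPU, a GPU, or another accelerator, which is the kind of portability the unified API is meant to provide.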

Summary and Future Work
In the paper, we have discussed APIs for controlling energy and power aspects of high-performance computing systems incorporating state-of-the-art CPUs and GPUs and presented tools for prediction and/or simulation of energy/power consumption in an HPC system. We analyzed approaches, control methods, optimization criteria, and programming examples as well as benchmarks used in state-of-the-art research on energy-aware high-performance computing. Specifically, solutions for systems such as workstations, clusters, grids, and clouds using various computing devices such as multi- and manycore CPUs and GPUs were presented. Optimization metrics including combinations of execution time, energy used, power consumption, and temperature were analyzed. Control methods used in the available approaches include scheduling, DVFS/DFS/DCT, power capping, application optimizations, and hybrid approaches. Finally, we have presented open areas and recommendations for future research in this field.

Table 1: Tools for energy/power management in modern HPC systems.

Table 6: Correlation of proposed metrics with energy/power control methods and devices.

Table 8: Tools for prediction and/or simulation of energy/power consumption in an HPC system.