Parallelizing Comprehensive Learning Particle Swarm Optimization by Open Computing Language on an Integrated Graphical Processing Unit



Introduction
A graphical processing unit (GPU) is a processor specially designed to rapidly manipulate the creation of images in a frame buffer intended for output to a display device. By providing functionalities such as texture mapping, rendering, shading, anti-aliasing, color space conversion, and video decoding, a GPU is an indispensable aid to a central processing unit (CPU) in managing and boosting the performance of graphics. A CPU consists of only a few processing elements optimized for sequential processing, whereas a GPU consists of a large number of compute units, with each compute unit in turn containing many processing elements, thereby constituting a massively parallel architecture for handling multiple computing tasks simultaneously. Researchers have recently leveraged the massively parallel architectures of GPUs to accelerate nongraphical general-purpose computing in a wide range of areas [1][2][3][4][5][6][7][8][9][10][11][12][13][14]. GPU-based parallel computing is implemented by a host program and kernel(s). The host program runs on the CPU and can launch a kernel on a connected GPU. A kernel is a function executed in parallel on the processing elements within the GPU. The parallel threads of the kernel can be organized into groups, with each group of concurrent threads executed on the same compute unit. Data are transferred between the CPU and the GPU.
Metaheuristics are naturally suitable for parallelization on GPUs. A metaheuristic is essentially a set of nature-inspired intelligent search strategies and is promising for single-objective global optimization [15]. A metaheuristic usually finds the optimum using a population of individuals, with each individual representing a candidate solution. All the individuals are initialized randomly and evolve iteratively and randomly in the search space to gradually locate the optimum. Each individual is thus associated with a fitness indicating the individual's search performance. The final solution obtained is the individual that exhibits the best fitness among the population when the evolution ends. Compared with traditional optimization methods such as linear programming, nonlinear programming, dynamic programming, and optimal control, metaheuristics do not require the objective and constraints of the optimization problem to be continuous, differentiable, linear, or convex. In addition, metaheuristics can be directly linked with simulation models without requiring any simplifying assumptions in the models. According to Tan and Ding [16], GPU-based parallelization implementations of metaheuristics are classified into four categories, namely, the naive model, the multiphase model, the all-GPU model, and the multipopulation model. The naive model offloads fitness calculation onto the GPU side and can be coarse-grained or fine-grained. In the coarse-grained naive model, the fitness of each individual is calculated by a separate thread. If the fitness is an aggregation of a number of partial calculations that can be run in parallel, then the fine-grained naive model can be used to further accelerate fitness calculation. Besides fitness calculation, the multiphase model also parallelizes other phases of a metaheuristic to leverage the GPU's computing power.
Multiple kernels are used to parallelize different phases, and different kernels can take different parallelization granularities to best fit the corresponding tasks. The multiphase model might incur the overhead of frequent kernel launching.
The all-GPU model combines multiple kernels into one kernel. Instead of relying on a single population, the multipopulation model employs multiple populations, with each population evolving separately with different thread(s).
Particle swarm optimization (PSO) is a class of metaheuristics simulating the food-searching behavior of a bird flock [17,18]. In PSO, the population is termed a swarm, and an individual is termed a particle. All the particles "fly" in the search space. Each particle is accordingly additionally associated with a position and a velocity. Each particle updates its flight trajectory iteratively and randomly and tries to gradually move towards the optimum. Many PSO variants have been proposed in the literature since 1995 [17,18], and some variants have been parallelized on GPUs. Open computing language (OpenCL) [19] and compute unified device architecture (CUDA) [20] are two commonly used industry standards that facilitate parallel computing on GPUs. Based on the C/C++ programming language, OpenCL and CUDA both provide a series of runtime application programming interfaces (APIs). OpenCL enables programming across heterogeneous types of processors, including not only GPUs but also digital signal processors and field-programmable gate arrays, whereas CUDA only supports NVIDIA proprietary GPUs. Global PSO (GPSO) is the earliest PSO variant [17]. In GPSO, the flight of each particle is guided by the particle's historical best search experience (i.e., the personal best position) and the historical best search experience of the entire swarm (i.e., the global best position). GPSO was parallelized on GPUs by CUDA following the multiphase model in [21], the all-GPU model in [1][2][3][22], and the multipopulation model in [4,5]. GPSO is liable to get stuck in premature convergence, and local PSO (LPSO) [23] is an improved variant based on GPSO. A static topology is constructed in LPSO. For a particle, its neighborhood contains itself and the particles that it directly connects to in the topology. Instead of referring to the global best position, the historical best search experience of the neighborhood (i.e., the local best position) is used for updating a particle's velocity.
An all-GPU parallelization implementation of LPSO by CUDA was proposed in [6]. In contrast with LPSO, standard PSO (SPSO) [24] maintains a dynamic topology. All-GPU parallelization models of SPSO implemented by CUDA and by OpenCL were, respectively, introduced in [25,26]. Comprehensive learning PSO (CLPSO) [27][28][29] differs from GPSO, LPSO, and SPSO in that CLPSO encourages each particle to learn from different exemplars on different dimensions during the flight trajectory update, whereas GPSO, LPSO, and SPSO enforce the same pair of personal best position and global/local best position for updating a particle's velocity on all the dimensions. CLPSO is good at preserving the particles' diversity and exhibits excellent optimization performance. Papadakis and Bakrtzis [7] investigated developing an all-GPU model of CLPSO by OpenCL. Kilic et al. [8], Ouyang et al. [9], Souza et al. [10], Sharma et al. [11,12], and Rabinovich [30] studied parallelizing other PSO variants on GPUs.
In this paper, we study parallelizing CLPSO by OpenCL on a platform with the Intel Core i7-6500U CPU, third-generation double data rate (DDR3) main memory, and the Intel HD Graphics 520 (IHDG520) GPU. IHDG520 is an integrated GPU; it can be found in various ultralow-voltage mobile CPUs and is suited for laptop (particularly, ultrabook) computers. The IHDG520 GPU lacks dedicated graphics memory and has to access the main memory. The sequential implementation of CLPSO was evaluated with a swarm of 40 particles in [27][28][29]. We propose a coarse-grained all-GPU model with the kernel concurrently executed by 40 threads. Two enhancement strategies, i.e., generating and transferring random numbers from the CPU to the IHDG520 GPU as well as reducing the number of instructions in the kernel, are introduced into the model for the purpose of shortening the execution time. We further investigate parallelizing deterministic optimization for implicit stochastic optimization (ISO) of China's Xiaowan Reservoir. The deterministic optimization is performed on an ensemble of 62 years' historical inflow records with monthly time steps and is solved by CLPSO.
The deterministic optimization is parallelized by a coarse-grained multipopulation model extended from the all-GPU model, using each swarm of particles to address the optimal operation in a separate year. The multipopulation model involves a large number of threads. Because of the limit on data transfer capacity between the CPU and the IHDG520 GPU, we modify the random number generation strategy by generating only a small number of random numbers that can be flexibly exploited by the large number of threads without hurting randomness.
A reservoir is a hydraulic structure that impounds water and uses water to serve various purposes such as hydropower generation, flood control, navigation, sediment control, and water provisioning for agricultural, domestic, and industrial demands. A reservoir system consists of one or more cascaded reservoirs constructed within the same river basin. The optimal operation of a reservoir system schedules the outflows of the reservoir(s) over a series of consecutive time steps in order to optimize a specific objective, trying to fulfill the multipurpose development of the system. The optimal operation of a reservoir system is complex because the optimization problem has to take into account inflow imprecision and uncertainties, the dynamic multistage nature of decision-making, and different physical and operational constraints [31][32][33][34][35][36].
The optimal operation of a reservoir system is either deterministic or stochastic. Deterministic optimization assumes that inflows into the reservoir(s) over all the time steps are available. However, in practice, only limited inflow forecasting information can be obtained. Alternatives that avoid requiring perfect inflow knowledge over the entire planning horizon include ISO and explicit stochastic optimization (ESO) [35,37]. ISO is also referred to as Monte Carlo optimization. ISO optimizes over a long continuous series of historically recorded or synthetically generated inflows, or over many shorter equally likely sequences. Accordingly, stochasticity of the inflows is implicitly addressed, and deterministic optimization can be applied on the ensemble of inflows. Operation rules for the outflows of each reservoir over all the time steps, conditioned on information such as the reservoir's present storage volume (or forebay elevation), previous inflows, and limited forecasted inflows, are then abstracted from the deterministic optimization results using a fitting method, e.g., rule curve [38][39][40][41][42], linear regression [43][44][45][46][47][48], artificial neural network [49][50][51][52], neuro-fuzzy inference system [53,54], decision tree [50,55], genetic programming [56], and support vector regression [57]. Compared with ISO, ESO directly operates on probabilistically described inflows [58][59][60][61][62]. The deterministic optimization is usually nonlinear, nonconvex, and nondifferentiable [63][64][65][66] and has recently been extensively addressed by metaheuristics [67].
Integrated GPUs are prevalent nowadays and can be found in both laptop computers (e.g., the IHDG520 GPU and the Intel HD Graphics 620 GPU) and desktop computers (e.g., the Intel HD Graphics 530 GPU). The clock rates of the Intel HD Graphics 520, 620, and 530 GPUs are all rather low, being 0.3 GHz, 0.3 GHz, and 0.35 GHz, respectively. The Intel HD Graphics 620 and 530 GPUs also lack dedicated graphics memory.
The Intel integrated GPUs feature significantly different architectures from the NVIDIA and AMD GPUs studied in the existing literature; e.g., the NVIDIA GPUs studied in [1-4, 6-10, 12, 21, 22, 25, 26, 30] and the AMD GPUs studied in [13,14,68,69] have much higher clock rates and have dedicated graphics memories. Many NVIDIA and AMD GPUs are considerably larger in size and require a higher power supply. The motherboard and the power supply of a brand-name desktop computer often differ considerably from those of a self-assembled desktop computer and may not support adding an NVIDIA or AMD GPU. As a result, this paper's experiments on the IHDG520 GPU are of significant practical value. In addition, to the best of our knowledge, this paper is the first work investigating parallelizing the deterministic optimization for the ISO of a reservoir system on a GPU. The rest of this paper is organized as follows. The working procedure of CLPSO, background on OpenCL, and the characteristics of the IHDG520 GPU are detailed in Section 2. Section 3 introduces the case study of the Xiaowan Reservoir and formulates the deterministic optimization for the ISO of the Xiaowan Reservoir. Section 4 presents our proposed coarse-grained all-GPU and multipopulation models of CLPSO implemented by OpenCL. The performance of the models is evaluated in Section 5. Section 6 concludes this paper.

Comprehensive Learning Particle Swarm Optimization.
Let the search space be D-dimensional, and let there be N particles in the swarm. Each particle, denoted as i (i = 1, 2, ..., N), is associated with a velocity V_i = (V_i,1, V_i,2, ..., V_i,D) and a position P_i = (P_i,1, P_i,2, ..., P_i,D). In each iteration (or generation), V_i and P_i are updated on each dimension d (d = 1, 2, ..., D) as follows:

V_i,d = w V_i,d + c δ_i,d (E_i,d − P_i,d), (1)
P_i,d = P_i,d + V_i,d, (2)

where w is the inertia weight; c is the acceleration coefficient; δ_i,d is a random number uniformly distributed in [0, 1]; and E_i = (E_i,1, E_i,2, ..., E_i,D) is the exemplar that guides the update of i's flight trajectory. The dimensional velocity V_i,d is clamped to a positive maximum V̄_d, i.e.,

V_i,d = min(max(V_i,d, −V̄_d), V̄_d). (3)

Let P_d and P̄_d, respectively, be the lower bound and the upper bound of the search space on each dimension d. The weight w linearly decreases from 0.9 to 0.4 with the generation k according to the following equation:

w = 0.9 − (0.9 − 0.4) k/K, (4)

where K is the predefined number of generations. CLPSO maintains a personal best position B_i = (B_i,1, B_i,2, ..., B_i,D) for each particle i. Initially, B_i = P_i. Let f be the fitness function. In each generation, if i's fitness value f(P_i) is better than i's personal best fitness value f(B_i), then B_i = P_i; otherwise, B_i does not change. The dimensional exemplar E_i,d can be B_i,d or B_j,d with j ≠ i, and the choice depends on i's learning probability L_i. To be specific, the value of L_i is

L_i = 0.05 + 0.45 (exp(10(i − 1)/(N − 1)) − 1)/(exp(10) − 1). (5)

On each dimension d, a random number uniformly distributed in [0, 1] is generated. If the generated number is no less than L_i, then E_i,d = B_j,d; otherwise, E_i,d = B_i,d. To determine j, two different particles excluding i are randomly selected, and j is the one with the better personal best fitness value. If E_i = B_i on all the dimensions, CLPSO randomly chooses one dimension d and one particle j with j ≠ i to force E_i,d = B_j,d. The exemplar E_i is not updated unless f(B_i) ceases improving for a refreshing gap of 7 generations.
In each generation, f(P_i) is evaluated only if P_i is feasible, i.e., P_i,d is within [P_d, P̄_d] on each dimension d. As E_i is always feasible, an infeasible P_i will eventually be drawn back into the search space. After K generations, CLPSO determines the global best position G = (G_1, G_2, ..., G_D) that exhibits the best fitness value among all the personal best positions. f(G) is the swarm's global best fitness value. The step-by-step flowchart of CLPSO is depicted in Figure 1. If N is large enough, the chances of selecting the same particle twice, or of selecting j = i, at Step 3 are very small; the procedure for determining j can therefore be simplified, with only a slight effect on the performance of CLPSO, by just randomly selecting two particles when the generated random number is no less than L_i and randomly selecting one particle when E_i = B_i on all the dimensions.
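The simplified exemplar construction can be illustrated with a small sketch. Minimization is assumed (smaller fitness is better), and the candidate pair (a, b) is assumed to have been drawn uniformly at random already; the demo fitness array is hypothetical.

```c
#include <assert.h>

/* Hypothetical personal best fitness values for three demo particles. */
static const double demo_pbest_fit[3] = {5.0, 2.0, 9.0};

/* Of the two randomly selected candidates, j is the one with the
   better (here: smaller) personal best fitness value. */
static int donor(int a, int b, const double *pbest_fit) {
    return pbest_fit[a] < pbest_fit[b] ? a : b;
}

/* Dimension-wise exemplar choice: a uniform draw r below L_i keeps the
   particle's own personal best; otherwise the donor j's is used (this
   paper's convention: a draw no less than L_i triggers learning from
   another particle). Returns the index of the source particle. */
static int exemplar_source(int i, int j, double r, double Li) {
    return r < Li ? i : j;
}
```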

Open Computing Language.
As can be seen from Figure 2, OpenCL views the hardware platform as a collection of heterogeneous compute devices attached to and managed by a CPU. A compute device can be a GPU or some other type of processor. OpenCL implements parallel computing with a host program and kernels. The host program runs on the CPU and can launch a kernel on a compute device. The parallel threads of a kernel are termed work items. The work items can be organized into a number of independent so-called work groups, and only work items executed on the same compute unit can be included in one work group. Each work item is identified by a unique global ID. The index space of the global IDs is one-, two-, or three-dimensional, with each attribute starting at zero. Each work item can also be identified by the group ID of its work group and the local ID of the work item relative to the work group. The host program creates a context. A context specifies kernel(s) to be executed on one or more compute devices. Besides kernels, a context also manages objects such as command queues, memory, and programs. A command queue holds commands (or operations) that will be executed on a compute device. Commands placed into a command queue can be classified into three categories, i.e., kernel management, memory management, and synchronization. Values for the input parameters of a kernel are transferred between the CPU and a compute device. OpenCL represents generic data by a buffer and supports creating a buffer only for a one-dimensional array. The memory space of a compute device is divided into four regions, i.e., global memory, constant memory, local memory, and private memory. The global memory permits read/write access by all the work items in all the work groups. Being writable by the CPU but not the compute device, the constant memory remains constant during the execution of a kernel. A local memory is shared only by the work items in one specific work group.
Each work item has a private memory, invisible to any other work item. The memory region qualifiers "__global," "__constant," "__local," and "__private" can be applied to an input parameter of a kernel to restrict the parameter to be stored in, respectively, the global, constant, local, or private memory region. An input parameter with no memory region qualifier is stored in the private memory region by default. An input parameter whose data are transferred by a buffer can only be stored in the global memory region. All the variables and constants additionally declared inside the kernel are stored in the private memory region. OpenCL is able to synchronize all the work items in the same work group but cannot synchronize work items across different work groups. A program consists of one or more kernels.
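As an illustration of the qualifiers and IDs discussed above, a hypothetical OpenCL C kernel (not the paper's actual kernel) can be held as a C string the way a host program would hand it to clCreateProgramWithSource. The buffer-backed parameter carries "__global"; the unqualified scalar parameter lands in private memory.

```c
#include <assert.h>
#include <string.h>

/* Illustrative OpenCL C kernel source: each work item scales one
   element of a __global buffer by a private scalar factor. */
static const char *demo_kernel_src =
    "__kernel void scale(__global float *data, const float factor) {\n"
    "    size_t gid = get_global_id(0);  /* unique global ID */\n"
    "    data[gid] *= factor;\n"
    "}\n";
```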

Intel HD Graphics 520 Graphical Processing Unit.
IHDG520 is an integrated GPU, i.e., it is embedded on the same die as the CPU. Integrated GPUs lead to less heat output and less power usage; thus, they have been widely adopted in laptop (particularly, ultrabook) computers. The IHDG520 GPU has 24 compute units clocked at 0.3 GHz. Each compute unit is composed of 256 processing elements. The IHDG520 GPU has to share the main memory with the CPU. For the IHDG520 GPU, the size of the constant memory region and that of the local memory region are both zero; in other words, only the global memory region and the private memory region, located in the main memory, can be used. The size of the global memory region is 1.3 GB. The maximum size of a buffer created in the global memory region is 511 MB. The IHDG520 GPU uses on-chip registers to store kernel instructions. The IHDG520 GPU supports single-precision floating point calculation but does not support double-precision floating point calculation.

Case Study.
The Xiaowan Reservoir, located on the Lancang River, is taken as the case study. The Lancang River is the upper stream of the Mekong River in China. The Mekong River is a cross-border river in Southeast Asia. Originating from the Qinghai-Tibet Plateau, the Mekong River runs sequentially through 6 countries, i.e., China, Myanmar, Laos, Thailand, Cambodia, and Vietnam. The Mekong River is the world's 12th longest river, with a length of 4350 km. The length of the Lancang River is 2139 km, draining an area of 0.16 million km² over the provinces of Qinghai, Tibet, and Yunnan. The Xiaowan Reservoir is constructed in the west of Yunnan on the middle reach of the Lancang River, at a longitude of 100°05′28″ and a latitude of 24°42′18″. Figure 3 shows the Lancang River basin in Yunnan and the Xiaowan Reservoir. The Xiaowan Reservoir is mainly used for hydropower generation and also serves flood control, irrigation, sediment control, and navigation. For the Xiaowan Reservoir, the installed power generation capacity is 4200 MW, the normal forebay elevation is 1240 m, the dead forebay elevation is 1166 m, the flood control forebay elevation is 1236 m, and the total storage volume is 14,914 million m³. The Xiaowan Reservoir is affected by a monsoon climate, and the inflows feature seasonal variations. The flood season is from June to September. The guaranteed hydropower generation per year is 190·10⁸ kWh. Historical inflow records for the Xiaowan Reservoir from the year 1953 to 2014 are available.

Deterministic Optimization Problems' Formulation.
The deterministic optimization problems for the ISO of the Xiaowan Reservoir are formulated with a yearly planning horizon of 12 monthly time steps and an ensemble of M years. Assuming that, for each year m (m = 1, 2, ..., M), the inflow I_m,t into the reservoir in each month t (t = 1, 2, ..., 12) of year m is already known, the deterministic optimization problem related to year m tries to determine the power discharge rate Q_m,t and the spillage rate S_m,t in each month t of year m with the objective of maximizing the total hydropower generation over the yearly planning horizon of year m; I_m,t, Q_m,t, and S_m,t are all measured in m³/s. The objective is given by the following equation:

maximize Σ_{t=1}^{12} 24 U_m,t X_m,t,

Figure 2: View of the hardware platform by OpenCL (a CPU managing compute devices, each consisting of compute units that contain processing elements).
Figure 1: Step-by-step flowchart of CLPSO. Step 1: for each particle i (i = 1, 2, ..., N), randomly initialize i's dimensional velocity V_i,d according to the maximum dimensional velocity and dimensional position P_i,d according to the dimensional search space on each dimension d (d = 1, 2, ..., D); calculate i's fitness value f(P_i); and set i's dimensional personal best position B_i,d = P_i,d on each dimension d, personal best fitness value f(B_i) = f(P_i), stagnation number R_i = 0, and learning probability L_i according to equation (5). Step 3: determine each particle i's exemplar; if E_i = B_i on all the dimensions, randomly choose one dimension for i to learn from some other particle. Step 4: update V_i,d according to equation (1), clamp V_i,d according to equation (3), and update P_i,d according to equation (2) on each dimension d. Step 5: determine the global best position G and the global best fitness value f(G).

where U_m,t is the power output in month t of year m, measured in kW, and X_m,t is the number of days in month t of year m. U_m,t is calculated by

U_m,t = 3600 Q_m,t / C_m,t,

where C_m,t is the water conversion rate in month t of year m, measured in m³/kWh. C_m,t is affected by the water head H_m,t in month t of year m. H_m,t is the difference between the forebay elevation Y_m,t and the tailrace elevation Z_m,t in month t of year m, i.e.,

H_m,t = Y_m,t − Z_m,t.

Let O_m,t be the outflow rate in month t of year m. Q_m,t cannot surpass the power discharge rate upper bound Q̄_m,t in month t of year m. As a result, Q_m,t takes the smaller value of O_m,t and Q̄_m,t, as expressed in the following equation:

Q_m,t = min(O_m,t, Q̄_m,t),

with the spillage rate S_m,t = O_m,t − Q_m,t. Let A be the initial/final storage volume bound. The initial storage volume A_m,1 is known, and A_m,1 = A. The storage volume at the end of each month is calculated based on water balance, i.e.,

A_m,t+1 = A_m,t + 86400 X_m,t (I_m,t − O_m,t).

The problem is associated with the following constraints:

O_t ≤ O_m,t ≤ Ō_t,
A_m,t+1 ≥ A_t+1, (12)
A_m,t+1 ≤ Ā_t+1, (13)

where O_t and Ō_t are, respectively, the lower and upper bounds of the outflow rate in each month t, and A_t+1 and Ā_t+1 are, respectively, the lower and upper bounds of the storage volume at the end of each month t. The deterministic optimization needs to solve M problems, one for each year m separately. Monthly operation rules can be abstracted from the deterministic optimization results of all the M problems.
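The per-month quantities of this formulation can be sketched as small host-side helpers. The conversion factors 3600 s/h and 86400 s/day follow from the stated units (m³/s, m³/kWh, kW, days) and are our assumption, since the paper's equations are not reproduced verbatim here.

```c
#include <assert.h>
#include <math.h>

/* Power output in kW from discharge Q (m^3/s) and water conversion
   rate C (m^3/kWh): Q * 3600 m^3/h divided by C m^3/kWh = kWh/h = kW. */
static double power_output_kw(double Q, double C) {
    return 3600.0 * Q / C;
}

/* Power discharge is the outflow O capped at the bound Qmax;
   the remainder O - power_discharge(O, Qmax) is spillage. */
static double power_discharge(double O, double Qmax) {
    return O < Qmax ? O : Qmax;
}

/* Water balance: end-of-month storage (m^3) from start-of-month
   storage A, inflow I and outflow O (m^3/s), over X days. */
static double end_storage(double A, double I, double O, double X) {
    return A + (I - O) * 86400.0 * X;
}
```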

Parallelizing Comprehensive Learning Particle Swarm Optimization

Basic Coarse-Grained All-GPU Model.
A basic parallelization scheme is presented here and works as the basis of our proposed enhancement strategies. The basic parallelization scheme follows the all-GPU model and implements a single kernel. CLPSO needs to generate random numbers uniformly distributed in [0, 1] at Steps 1 and 3. An OpenCL program is composed of both a host part and a kernel part. OpenCL provides no built-in primitive for generating any kind of random number in the kernel part. We write an auxiliary inline function that the kernel function can invoke for generating a random unsigned integer number based on the multiplicative linear congruential (MLC) principle [70], i.e.,

ϕ = (a × ϕ) mod 2147483647,

where ϕ is a random unsigned integer number that also works as the seed for generating the next random unsigned integer number and a is the multiplier. A random float number δ uniformly distributed in [0, 1] is obtained by

δ = ϕ × 1.0/2147483647.0.

When the inline function is called, the function code gets directly inserted at the point of each function call, thereby avoiding the function call overhead. The all-GPU model is coarse-grained, with each particle mapped to a separate work item in a one-dimensional index space. Each work item is identified by its global ID. N, D, and ϕ_i are input parameters of the kernel function, while k and K are additionally declared as variables/constants inside the kernel function. ϕ_i is the seed for each particle/work item i to generate a random unsigned integer number. Buffers are created for V_i,d, P_i,d, E_i,d, B_i,d, L_i, f(P_i), f(B_i), G_d, f(G), and ϕ_i, and the memory region qualifier "__global" is applied on them; hence, they are stored in the global memory region. N, D, k, and K are stored in the private memory region. The detailed working procedures of the host part and the kernel part are illustrated in Figure 4. The CPU first needs to initialize numerical values for some input parameters including N, D, P_d, P̄_d, V̄_d, and ϕ_i. The numerical values are then transferred from the CPU to the IHDG520 GPU before the kernel function executes.
The work items execute concurrently, and each work item is responsible for performing the operations related to the corresponding particle at Steps 1, 3, and 4. Only one prespecified work item executes Steps 2 and 5. When the kernel function finishes execution, G_d and f(G) are transferred back to the CPU.
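The MLC generator can be sketched in plain C. The modulus 2147483647 is quoted in the text; the multiplier 16807 is the classical Park-Miller choice and is an assumption, since the paper does not show its constant.

```c
#include <assert.h>

/* One step of the MLC (Lehmer) generator: the returned value also
   serves as the seed for the next call. */
static unsigned int mlc_next(unsigned int phi) {
    return (unsigned int)(((unsigned long long)phi * 16807ULL)
                          % 2147483647ULL);
}

/* Uniform float in [0, 1], obtained as in the text by multiplying the
   integer by 1.0 and dividing by 2147483647.0. */
static float mlc_uniform(unsigned int phi) {
    return (float)(phi * 1.0 / 2147483647.0);
}
```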

Enhancement Strategies.
Two enhancement strategies, namely, generating and transferring random numbers from the CPU as well as reducing the number of instructions in the kernel, are employed to accommodate the characteristics of the IHDG520 GPU and the OpenCL APIs for the purpose of significantly shortening the execution time of the basic coarse-grained all-GPU model.
OpenCL provides no built-in primitive for generating any type of random number. In the basic coarse-grained all-GPU model, an auxiliary inline function is written to assist the kernel function in generating random numbers based on the MLC principle. The MLC principle generates a random unsigned integer number from an unsigned integer input, and a random float number uniformly distributed in [0, 1] can be obtained by multiplying the random unsigned integer number by 1.0 and dividing by 2147483647.0. Most GPUs, including the IHDG520 GPU, are not good at the integer multiplication and modulo operations and the float division operation involved in the MLC random number generation process and need many clock cycles to execute such costly operations. In addition, the IHDG520 GPU is slow at execution because its clock rate is just 0.3 GHz.
Step 1 of CLPSO randomly initializes each particle i's dimensional velocity V_i,d and position P_i,d on each dimension d and requires 2ND random numbers. At Step 3, a random number is compared with each particle i's learning probability L_i on each dimension d, two particles are randomly selected for determining the dimensional exemplar E_i,d if the number is no less than L_i, and a dimension is randomly selected to learn from a particle that is also randomly selected when E_i equals the personal best position B_i on all the dimensions; thus, maximally, 3((K − 1)/7 + 1)ND + 2((K − 1)/7 + 1)N random numbers are needed. Step 4 updates each particle i's dimensional velocity V_i,d with a random coefficient on each dimension d and uses KND random numbers. It would be very time-consuming to generate all the 2ND + 3((K − 1)/7 + 1)ND + 2((K − 1)/7 + 1)N + KND random numbers on the IHDG520 GPU. The CPU is clocked at 2.5 GHz, and the host part of the parallelization program can invoke a highly efficient C/C++ library function to generate a random number uniformly distributed in [0, 1]. Therefore, we can generate all the random numbers on the CPU and transfer them from the CPU to the IHDG520 GPU before the kernel function begins execution. As in the basic coarse-grained all-GPU model, V_i,d, P_i,d, E_i,d, B_i,d, L_i, f(P_i), and f(B_i) are input parameters of the kernel function. OpenCL supports creating a buffer only for a one-dimensional array. Buffers representing one-dimensional arrays with ND elements are created for V_i,d, P_i,d, E_i,d, and B_i,d, and buffers representing one-dimensional arrays with N elements are created for L_i, f(P_i), and f(B_i). All the work items are indexed in a one-dimensional space, and the global IDs of the work items range from 0 to N − 1, with each work item standing for a separate particle.
The kernel function thus accesses V_i,d, P_i,d, E_i,d, and B_i,d by the index iD + d from the corresponding one-dimensional array and accesses L_i, f(P_i), and f(B_i) by the index i from the corresponding one-dimensional array, with i also being the global ID of the work item. B_i,d and f(B_i) are shared among the particles for exemplar redetermination at Step 3 of CLPSO, while V_i,d, P_i,d, E_i,d, L_i, and f(P_i) are not shared by the particles. We can alternatively declare each particle's dimensional velocities, positions, and exemplars as one-dimensional arrays with D elements and each particle's learning probability and fitness value as variables inside the kernel function, thereby reducing the number of addressing instructions. There are a number of if-clause conditional statements in the kernel function of the basic coarse-grained all-GPU model. Most GPUs, including the IHDG520 GPU, are also slow at performing conditional operations. Some if-clause conditional statements in the kernel function can actually be replaced by OpenCL built-in primitives. To be specific, clamping the dimensional velocity V_i,d based on the maximum dimensional velocity V̄_d can be done by the primitive clamp(V_i,d, −V̄_d, V̄_d), and judging whether the dimensional position P_i,d is feasible can be implemented as step(P_d, P_i,d) + step(P_i,d, P̄_d) through using the step primitive. The step primitive returns 0 if the right input parameter is less than the left input parameter, and 1 otherwise; the sum thus equals 2 if and only if P_i,d lies within [P_d, P̄_d]. The use of the OpenCL built-in primitives to replace some if-clause conditional statements therefore helps to shorten the time consumed on conditional operations.
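The branch-free replacements and the buffer indexing can be checked with host-side C analogues of the OpenCL built-ins (clamp and step do exist in OpenCL C with these semantics; the C versions below are our re-implementations for illustration).

```c
#include <assert.h>

/* Analogue of OpenCL clamp(x, lo, hi). */
static float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Analogue of OpenCL step(edge, x): 0 if x < edge, 1 otherwise. */
static float stepf(float edge, float x) {
    return x < edge ? 0.0f : 1.0f;
}

/* P is feasible on a dimension iff it lies in [lo, hi]; the two step
   terms sum to 2 exactly in that case. */
static int feasible(float P, float lo, float hi) {
    return stepf(lo, P) + stepf(P, hi) == 2.0f;
}

/* Row-major flattened index of particle i, dimension d in an N x D
   buffer: i*D + d. */
static int flat_index(int i, int d, int D) {
    return i * D + d;
}
```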

Coarse-Grained Multipopulation Model.
Like in [66], the storage volume constraints expressed in equations (12) and (13) are repaired as follows.
First, for each year m, reversely starting from t = 12 and ending at t = 1, sequentially calculate the new lower bound A_m,t+1 and upper bound Ā_m,t+1 for the storage volume at the end of month t according to equations (16) and (17). Second, incrementally starting from t = 1 and ending at t = 12, sequentially calculate Δ_m,t, the deviation of the storage volume at the end of month t, according to equation (18); modify O_m,t according to equation (19); and update A_m,t+1 if Δ_m,t ≠ 0. Note that O_m,t is kept feasible in equation (19).

Figure 4: Basic coarse-grained all-GPU model (the work items execute Steps 1, 3, and 4 concurrently, while only one prespecified work item performs Steps 2 and 5).
Let Θ(A_m,t+1) be the violation of the constraint for the storage volume at the end of month t. Θ(A_m,t+1) is calculated according to the following equation:

Θ(A_m,t+1) = max(A_t+1 − A_m,t+1, 0) + max(A_m,t+1 − Ā_t+1, 0).

The original constrained problem is converted into an unconstrained problem by optimizing the following objective that incorporates the violations:

maximize Σ_{t=1}^{12} 24 U_m,t X_m,t − λ Σ_{t=1}^{12} Θ(A_m,t+1),

where λ is the penalty factor and is a large positive number. The term Σ_{t=1}^{12} 24 U_m,t X_m,t is the benefit, and the term λ Σ_{t=1}^{12} Θ(A_m,t+1) is the violation cost. The unconstrained problem is solved by CLPSO. Each particle's position is a 12-dimensional vector representing the outflow rates over the yearly planning horizon. The power discharge and spillage rates in each month can be easily determined from the outflow rate in the corresponding month. The deterministic optimization for ISO on the ensemble of M years is parallelized by a coarse-grained multipopulation model extended from the all-GPU model. The kernel part of the multipopulation model is illustrated in Figure 5. M work groups are used to concurrently tackle the M optimal operation problems, with each work group consisting of N work items and solving the optimal operation problem related to a different year following the all-GPU model. Each work group determines the global best position, global best fitness, global best benefit, and global best violation cost for the corresponding optimal operation problem. The global best benefit is the benefit of the global best position, and the global best violation cost is the violation cost of the global best position. The global best position, global best fitness, global best benefit, and global best violation cost results of all the work groups are transferred back from the GPU to the CPU when the kernel function finishes execution. By summing the global best benefit results and the global best violation cost results, respectively, the CPU obtains the total best benefit and the total best violation cost for the deterministic optimization.
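The penalized objective can be sketched as follows; the violation measure Θ as the distance of the end-of-month storage from its feasible band is our reconstruction, and the demo arrays are hypothetical toy data.

```c
#include <assert.h>

/* Violation of the storage band [lo, hi] by the end-of-month storage A. */
static double theta(double A, double lo, double hi) {
    if (A < lo) return lo - A;
    if (A > hi) return A - hi;
    return 0.0;
}

/* Hypothetical two-month horizon: power outputs (kW), month lengths
   (days), and storage violations. */
static const double demo_U[2]    = {1000.0, 2000.0};
static const double demo_X[2]    = {30.0, 31.0};
static const double demo_viol[2] = {0.0, 2.0};

/* Benefit (kWh) minus lambda times the summed violations. */
static double penalized_objective(const double *U, const double *X,
                                  const double *viol, int T,
                                  double lambda) {
    double benefit = 0.0, cost = 0.0;
    for (int t = 0; t < T; ++t) {
        benefit += 24.0 * U[t] * X[t];  /* kW * 24 h/day * days = kWh */
        cost    += viol[t];
    }
    return benefit - lambda * cost;
}
```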
The summation can only be done by the host part on the CPU because OpenCL does not support synchronizing different work groups. A serious challenge arises and needs to be addressed for the multipopulation model. The maximum size of a buffer created in the global memory region of the IHDG520 GPU is limited to 511 MB, and the size of the global memory region is 1.3 GB. The multipopulation model needs at most M (2ND + 3((K − 1)/7 + 1)ND + 2((K − 1)/7 + 1)N + KND) random numbers uniformly distributed in [0, 1]. Suppose N = 40, D = 12, K = 10000, and M = 62; as a single-precision float number occupies 4 bytes, storing about 432 million random numbers requires a memory space of around 1.6 GB, exceeding both the 511 MB limit for a buffer and the 1.3 GB size of the global memory region. We propose to create a buffer representing a one-dimensional array of 2ND + 3((K − 1)/7 + 1)ND + 2((K − 1)/7 + 1)N + KND + M − 1 random numbers. Each work item is identified by the group ID m and the local ID i simultaneously, with m ranging from 0 to M − 1 and i ranging from 0 to N − 1 in a one-dimensional index space. The work item with group ID m and local ID i can access 2D + 3((K − 1)/7 + 1)D + 2((K − 1)/7 + 1) + KD random numbers starting at the index (2D + 3((K − 1)/7 + 1)D + 2((K − 1)/7 + 1) + KD)i + m of the one-dimensional array stored in the buffer. This modified random number generation strategy aims to generate a small number of random numbers that can be flexibly used by the large number of work items without negatively impacting randomness. The models listed in Table 1 are implemented and compared. The sequentialization model involves a swarm of 40 particles. In all the 3 all-GPU models, 40 work items are used to concurrently execute the kernel on the IHDG520 GPU. Each particle/work item loops for 5000 generations. Eight 30-dimensional benchmark functions, which were also used in [27][28][29], are evaluated.
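The index arithmetic above can be sketched as follows. This is an illustrative reading of the paper's formulas, assuming (K − 1)/7 is integer division; the function and variable names are ours, not the implementation's.

```python
def per_item_count(D, K):
    """Random numbers consumed by one work item: 2D + 3((K-1)/7 + 1)D
    + 2((K-1)/7 + 1) + KD, with (K-1)/7 assumed to be integer division."""
    blocks = (K - 1) // 7 + 1
    return 2 * D + 3 * blocks * D + 2 * blocks + K * D

def rand_start_index(m, i, D, K):
    """Start index in the shared array for the work item with
    group ID m and local ID i."""
    return per_item_count(D, K) * i + m

def buffer_length(D, K, N, M):
    """Length of the shared one-dimensional random-number array:
    per-item count times N, plus M - 1 for the group-ID offsets."""
    return per_item_count(D, K) * N + M - 1
```

Under the case-study sizes (N = 40, D = 12, K = 10000, M = 62) this shared array holds roughly 7 million floats, about 28 MB, comfortably under the 511 MB per-buffer limit, whereas the naive per-group scheme would need about 432 million. Note that even the last work item (m = M − 1, i = N − 1) reads entirely within the buffer.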
Table 2 gives the name, description, global optimum P*, corresponding function value f(P*), and search space of each function. f1, f2, and f3 are unimodal, and all the other functions are multimodal. f7 and f8 are rotated.
Regarding the 3rd issue, the multipopulation model is compared with the sequentialization model. The deterministic optimization for the ISO of the Xiaowan Reservoir is performed on the historical monthly inflow data recorded during the 62-year period from 1953 to 2014. The optimal operation related to each year is solved by CLPSO with a swarm of 40 particles in the sequentialization model. The multipopulation model takes advantage of 62 work groups to concurrently tackle the 62 optimal operation problems. Each work group solves the optimal operation with respect to a separate year, is composed of 40 work items, and follows the final coarse-grained all-GPU model. Each work item iterates for 10,000 generations. The multipopulation model employs the modified random number generation strategy. The initial/final storage volume bound is 145.57 × 10⁸ m³, corresponding to the normal forebay elevation of 1240 m. The penalty factor is 368 × 10⁸.
Concerning the 4th issue, we need to understand the overhead of kernel launching. The execution time of a parallelization model is the time gap between the initialization of parameters and the release of OpenCL objects and is the sum of the CPU-side execution time and the GPU-side execution time. The GPU-side execution time is the time spent blocking until all the enqueued commands in the command queue have been issued to the GPU and have completed. The CPU-side execution time can be divided into 6 segments: segment 1 is the time initializing the numerical values of some input parameters; segment 2 is the time creating a context, a command queue, and buffers, as well as enqueueing commands to write some of the buffers; segment 3 is the time building an executable program; segment 4 is the time creating a kernel declared in the program, setting the input parameters of the kernel, and enqueueing a command to execute the kernel; segment 5 is the time reading the results from the GPU and releasing the kernel; and segment 6 is the time releasing the other objects.
The speedup of a parallelization model as compared with the sequentialization model is calculated as the mean execution time of the sequentialization model divided by that of the parallelization model. Table 3 gives the statistical (i.e., mean, standard deviation, maximum, and minimum) execution time and global best fitness value results of the sequentialization model and the intermediate, basic, and final all-GPU models on all the functions. To determine whether the solutions obtained by the sequentialization model are statistically different from those obtained by the final all-GPU model, two-tailed t-tests with the assumption of equal variances and a significance level of 0.05 are carried out, and the t-test results are listed in Table 4. Table 5 lists the statistical execution time, total best benefit, and total best violation cost results of the sequentialization model and the multipopulation model on the case study. The two-tailed t-test and speedup results from the comparison of the sequentialization model and the multipopulation model on the case study are listed in Table 6. Table 7 gives the mean CPU-side execution time results of the final all-GPU model on some benchmark functions and of the multipopulation model on the case study. The original sequentialization implementation of CLPSO proposed in [27][28][29] also uses 40 particles that iterate for 5000 generations and was also evaluated on the 30-dimensional Sphere, Schwefel's P2.22, Noise, Rosenbrock's, Rastrigin's, Ackley's, rotated Schwefel's, and rotated Rastrigin's functions for 25 runs. Our sequentialization model described in Table 3 differs from the original sequentialization implementation in that the inequality conditions enforced when determining the dimensional exemplar are relaxed to facilitate generating and transferring random numbers from the CPU.
The statistical global best fitness results of our sequentialization model as given in Table 3 are similar to those of the original sequentialization implementation found in [27][28][29], verifying that the relaxation of the inequality conditions causes little negative impact on the final solution's quality when the number of particles is large enough. The sequentialization model and the basic, intermediate, and final all-GPU models all use single-precision float numbers and directly report 0 for sufficiently small float values; e.g., the statistical global best fitness results of all the models are 0 on functions f1 and f2 in Table 3.
The t-test results given in Table 4 indicate that the global best fitness results of the final model and the sequentialization model are statistically indifferent on f3 to f8, as the t-test results are greater than the significance level 0.05. A t-test cannot proceed when the two sets of data samples are all 0; hence, the t-test results on f1 and f2 are blank. Functions f1 to f3 are unimodal, and f4 to f8 are multimodal. The statistical global best fitness results given in Table 3 show that the sequentialization model and all the all-GPU models of CLPSO can find the global optimum on f1 and f2.
In Table 3, the shortening of the execution time is most pronounced on f8: multiplying the position P i,d by the orthogonal matrix is actually a two-layer nested for-loop; as a result, the shortening of the execution time is much greater on f8 than on the other functions. Particles are likely to be infeasible when the work items concurrently optimize the highly mountainous function f7; hence, the shortening of the execution time is not that noticeable on f7. The mean execution time results of the final model, however, are still greater than those of the sequentialization model on all the functions; this is because a considerable amount of time (on the scale of hundreds of ms) must be spent on creating/releasing OpenCL objects (e.g., context, command queue, buffer, program, and kernel) and transferring buffers between the CPU and the GPU, as verified by the mean CPU-side execution time results given in Table 7. It can be seen from Table 3 that the standard deviation execution time results of the sequentialization model and all the all-GPU models are relatively small compared with the corresponding mean execution time results, meaning that the execution time of each model does not vary much across runs. The deterministic optimization for the ISO of the Xiaowan Reservoir is multimodal, as reflected in the statistical total best benefit results of the sequentialization model listed in Table 5. The standard deviation of the total best benefit of the sequentialization model is 0.42 × 10⁸ kWh. The statistical total best benefit results of the multipopulation model are similar to those of the sequentialization model. The t-test results given in Table 6 indicate that the total best benefit results of the two models are statistically indifferent. Accordingly, the modified random number generation strategy does not hurt randomness.
The mean total best benefit of the sequentialization model and that of the multipopulation model are both around 14,550 × 10⁸ kWh; hence, on average, the optimized hydropower generation is about 235 × 10⁸ kWh per year, much more than the guaranteed hydropower generation of 190 × 10⁸ kWh per year, validating the powerful global optimization capability of CLPSO. The solutions are feasible, as the statistical total best violation cost results of the 2 models are all 0. The sequentialization model is very time-consuming, and its mean execution time is 16,090.00 ms. As we can see from Tables 5 and 6, the mean execution time of the multipopulation model is 1165.60 ms, which is significantly less than that of the sequentialization model, and the speedup is 13.80. The significant speedup is achieved by parallelizing the 62 optimal operation problems and parallelizing the 40 particles within each optimal operation problem. The 2 models are robust in terms of execution time, as the standard deviation execution time results are small in all the runs.
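The reported speedup can be reproduced directly from the Table 5 means; the one-line sketch below simply applies the definition (mean sequential time over mean parallel time), consistent with the reported value of 13.80.

```python
def speedup(seq_mean_ms, par_mean_ms):
    """Speedup of a parallelization model over the sequentialization
    model: mean sequential execution time / mean parallel execution time."""
    return seq_mean_ms / par_mean_ms
```

With the case-study means (16,090.00 ms sequential, 1,165.60 ms multipopulation), the result rounds to the 13.80 listed in Table 6.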
It can be observed from Figure 6 that the natural inflows are large in the flood season from June to September for all 3 typical years. The natural inflows of year 1954 are considerably larger than those of 1976 in July, August, September, and October, and those of 1976 are noticeably larger than those of 2011 in June, July, September, and October. As Figures 7-9 show, the Xiaowan Reservoir needs to release much larger outflows in April, May, June, July, September, October, and November of 1954 than in the corresponding months of 1976, leading to much lower forebay elevations in April, May, June, July, and August, lower power outputs in March, June, and July, and higher power outputs in April, October, November, and December. The outflow rates and power outputs of 1976 are larger than those of 2011 in April, May, June, September, October, November, and December.
As we can see from Table 7, the mean segment 1, 2, 4, 5, and 6 time results of the final all-GPU model are similar on functions f1, f3, and f7. Compared with the multipopulation model on the case study, the final all-GPU model's mean segment 1 time results are larger, its mean segment 2 time results are smaller, and its mean segment 4, 5, and 6 time results are similar. The segment 1 time is mainly the time generating the random numbers and is thus affected by the number of random numbers. The segment 2 time increases with the number of buffers created. The mean segment 3 time results of the final all-GPU model on f1 and f3 are similar and are much less than those of the final all-GPU model on f7 and of the multipopulation model on the case study, indicating that the more difficult a particle's fitness evaluation, the more time is needed to build the program. Steps 1, 3, and 4 of CLPSO involve operations related to each work item, while Steps 2 and 5 are executed by just one prespecified work item. Steps 2, 3, and 4 constitute a for-loop. When multiple kernels are used for parallelizing different phases of CLPSO, intermediate results must be transferred back from a kernel, and, if the kernel is not the last one, the intermediate results then need to be transferred to the next kernel. Steps 2, 3, and 4 cannot be implemented as multiple kernels, as the for-loop would cause the overhead of frequently enqueueing commands to write some buffers, enqueueing commands to execute the kernels, and reading results from the kernels; this overhead can be very large across many generations. Steps 1 and 5 also do not benefit from being split into separate kernels because all the work items are occupied at Step 1 and, for the multipopulation model, each work group has one work item occupied at Step 5.
An alternative is to implement 3 kernels, respectively corresponding to Step 1, the for-loop, and Step 5. This alternative incurs a small but avoidable additional overhead of enqueueing commands to write some buffers, creating the kernels, setting the input parameters of the kernels, enqueueing commands to execute the kernels, reading results from the kernels, and releasing the kernels. Accordingly, our proposed single-kernel final all-GPU model and multipopulation model for parallelizing CLPSO are appropriate.
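The single-kernel structure argued for above can be summarized schematically. The sketch below is an illustration of the step scheduling only, not the actual OpenCL implementation: it counts how often each of the paper's Steps 1-5 runs, which makes concrete why splitting the generation loop (Steps 2-4) into separate kernels would multiply the launch/read overhead by the number of generations.

```python
def kernel_step_counts(n_items, generations):
    """Schematic of the single-kernel CLPSO structure: Steps 1, 3, and 4
    run on every work item, Steps 2 and 5 on one prespecified work item,
    and Steps 2-4 form the per-generation loop."""
    counts = {"step1": 0, "step2": 0, "step3_4": 0, "step5": 0}
    for i in range(n_items):
        counts["step1"] += 1          # Step 1: initialization, all work items
    for g in range(generations):
        counts["step2"] += 1          # Step 2: one prespecified work item
        for i in range(n_items):
            counts["step3_4"] += 1    # Steps 3-4: all work items
    counts["step5"] += 1              # Step 5: one prespecified work item
    return counts
```

With 40 work items and 5,000 generations, a per-generation kernel split would imply on the order of 5,000 enqueue/read cycles rather than the single launch used by the proposed models.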

Conclusions
In this paper, we have studied parallelizing CLPSO by OpenCL on the integrated IHDG520 GPU. We have first proposed a basic coarse-grained all-GPU model, with one kernel written and each work item representing a separate particle. As the IHDG520 GPU features a low clock rate and the CPU has a high clock rate, two strategies, i.e., generating and transferring random numbers from the CPU to the GPU as well as reducing the number of instructions in the kernel, have been adopted to shorten the basic model's execution time. To facilitate the parallelization implementation of CLPSO, the inequality conditions used when determining a dimensional exemplar are relaxed. We have also studied a real-world case parallelizing the deterministic optimization for the ISO of the Xiaowan Reservoir. The deterministic optimization has been solved by CLPSO on 62 years' monthly natural inflow records and has been parallelized by a multipopulation model, extended from the all-GPU model, that uses a large number of work items. Owing to the size limits for a buffer transferring data from the CPU to the GPU and for storing the data in the global memory region, the random number generation strategy has been further modified to generate a small number of random numbers that can be flexibly exploited by the large number of work items without harming randomness. Experiments have been conducted on various unimodal/multimodal 30-dimensional benchmark global optimization functions and on the case study. The experimental results demonstrate that (1) the relaxation of the inequality conditions causes little negative impact on the final solution's quality; (2) the two enhancement strategies help improve the basic model's efficiency; (3) the modified random number generation strategy is suitable for the case of a large number of work items; and (4) the multipopulation model consumes significantly less execution time than the corresponding sequentialization model.
In the future, we will investigate adapting and applying the proposed models for parallelizing more advanced metaheuristics [29,[71][72][73][74] and solving more real-world large-scale problems.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.