
Comprehensive learning particle swarm optimization (CLPSO) is a powerful metaheuristic for global optimization. This paper studies parallelizing CLPSO by open computing language (OpenCL) on the integrated Intel HD Graphics 520 (IHDG520) graphical processing unit (GPU), which has a low clock rate. We implement a coarse-grained all-GPU model that maps each particle to a separate work item. Two enhancement strategies, namely, generating random numbers on the central processor and transferring them to the GPU as well as reducing the number of instructions in the kernel, are proposed to shorten the model’s execution time. This paper further investigates parallelizing deterministic optimization for implicit stochastic optimization of China’s Xiaowan Reservoir. The deterministic optimization is performed on an ensemble of 62 years’ historical inflow records with monthly time steps, is solved by CLPSO, and is parallelized by a coarse-grained multipopulation model extended from the all-GPU model. The multipopulation model involves a large number of work items. Because of the capacity limit of a buffer transferring data from the central processor to the GPU and the limited size of the global memory region, the random number generation strategy is modified to generate a small number of random numbers that can be flexibly reused by the large number of work items. Experiments conducted on various benchmark functions and the case study demonstrate that our proposed all-GPU and multipopulation parallelization models are appropriate, and that the multipopulation model requires significantly less execution time than the corresponding sequential model.

A graphical processing unit (GPU) is a processor specially designed to rapidly create and manipulate images in a frame buffer intended for output to a display device. By providing functionalities such as texture mapping, rendering, shading, anti-aliasing, color space conversion, and video decoding, a GPU is an indispensable aid to a central processing unit (CPU) in managing and boosting graphics performance. A CPU consists of only a few processing elements optimized for sequential processing, whereas a GPU consists of a large number of compute units, each in turn containing many processing elements, thereby constituting a massively parallel architecture for handling multiple computing tasks simultaneously. Researchers have recently studied leveraging the massively parallel architectures of GPUs to accelerate nongraphical general-purpose computing in a wide range of areas [

Metaheuristics naturally lend themselves to parallelization on GPUs. A metaheuristic is essentially a set of nature-inspired intelligent search strategies and is promising for single-objective global optimization [

Particle swarm optimization (PSO) is a class of metaheuristics simulating the food-searching behavior of a bird flock [

In this paper, we study parallelizing CLPSO by OpenCL on a platform with the Intel Core i7-6500U CPU, third-generation double data rate (DDR3) main memory, and the Intel HD Graphics 520 (IHDG520) GPU. IHDG520 is an integrated GPU; it can be found in various ultralow-voltage mobile CPUs and is suited to laptop (particularly, ultrabook) computers. The IHDG520 GPU lacks dedicated graphics memory and has to access the main memory. The sequential implementation of CLPSO was evaluated with a swarm of 40 particles in [

A reservoir is a hydraulic structure that impounds water and uses it to serve various purposes such as hydropower generation, flood control, navigation, sediment control, and water provisioning for agricultural, domestic, and industrial demands. A reservoir system consists of one or more cascaded reservoirs constructed within the same river basin. Optimally operating a reservoir system means scheduling the outflows of the reservoir(s) over a series of consecutive time steps so as to optimize a specific objective, trying to fulfill the multipurpose development of the system. The optimal operation of a reservoir system is complex because the optimization problem has to take into account inflow imprecision and uncertainties, the dynamic multistage nature of decision-making, and different physical and operational constraints [

Integrated GPUs are prevalent nowadays and can be found in both laptop (e.g., the IHDG520 GPU and the Intel HD Graphics 620 GPU) and desktop computers (e.g., the Intel HD Graphics 530 GPU). The clock rates of the Intel HD Graphics 520, 620, and 530 GPUs are all rather low, being 0.3 GHz, 0.3 GHz, and 0.35 GHz, respectively. The Intel HD Graphics 620 and 530 GPUs also lack dedicated graphics memory. The Intel integrated GPUs feature architectures significantly different from the NVIDIA and AMD GPUs studied in the existing literature, e.g., the NVIDIA GPUs studied in [

The rest of this paper is organized as follows. The working procedure of CLPSO, the knowledge about OpenCL, and the characteristics of the IHDG520 GPU are detailed in Section

Let the search space be D-dimensional. Each particle i of the swarm has a velocity V_{i} = (V_{i, 1}, V_{i, 2}, …, V_{i, D}) and a position X_{i} = (X_{i, 1}, X_{i, 2}, …, X_{i, D}). In each iteration (or generation), V_{i} and X_{i} are updated on each dimension d according to V_{i, d} = wV_{i, d} + cr_{i, d}(E_{i, d} − X_{i, d}) and X_{i, d} = X_{i, d} + V_{i, d}, where w is the inertia weight; c is the acceleration coefficient; r_{i, d} is a random number uniformly distributed in [0, 1]; and E_{i} = (E_{i, 1}, E_{i, 2}, …, E_{i, D}) is the exemplar that guides the update of particle i.
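As a concrete illustration, the per-dimension trajectory update can be sketched in C as follows. This is only a sketch: the variable names (v, x, e, r, w, c, vmax) and the velocity-clamping convention are assumptions consistent with standard CLPSO, not this paper's exact notation.

```c
#include <assert.h>

/* One CLPSO trajectory update on a single dimension (illustrative sketch).
 * v: dimensional velocity, x: dimensional position, e: dimensional exemplar,
 * r: uniform random number in [0, 1], w: inertia weight, c: acceleration
 * coefficient, vmax: velocity bound for this dimension (assumed clamping). */
static void clpso_update_dim(float *v, float *x, float e, float r,
                             float w, float c, float vmax)
{
    *v = w * (*v) + c * r * (e - *x);   /* learn from the exemplar */
    if (*v >  vmax) *v =  vmax;         /* clamp the velocity */
    if (*v < -vmax) *v = -vmax;
    *x = *x + *v;                       /* move the particle */
}
```

In the all-GPU model described later, each work item would execute this update for its own particle on every dimension in every generation.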

The dimensional velocity V_{i, d} is clamped to the range [−V_{d}^{max}, V_{d}^{max}] to keep the particle from flying out of the search space.

Let

The weight w linearly decreases from 0.9 to 0.4 over the course of the generations.

CLPSO maintains a personal best position P_{i} = (P_{i, 1}, P_{i, 2}, …, P_{i, D}) for each particle i. If f(X_{i}) is better than f(P_{i}), then P_{i} = X_{i}; otherwise, P_{i} does not change. The dimensional exemplar E_{i, d} can be P_{i, d} or P_{j, d} with j ≠ i.

On each dimension d, a uniform random number is compared with the learning probability Pc_{i}; if the random number is not less than Pc_{i}, E_{i, d} = P_{i, d}; otherwise, E_{i, d} = P_{j, d}. To avoid determining E_{i} = P_{i} on all the dimensions, CLPSO randomly chooses one dimension d and lets E_{i, d} = P_{j, d}. The exemplar E_{i} is not updated unless f(P_{i}) ceases improving for a refreshing gap of 7 generations. In each generation, CLPSO also tracks the global best position that exhibits the best fitness value among all the personal best positions. The implementation in this paper relaxes the inequality conditions used when determining E_{i} and randomly selects a particle j when E_{i} = P_{i} on all the dimensions.
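The per-dimension exemplar choice described above can be sketched as follows. The selection details differ between the original CLPSO (tournament selection) and the relaxed variant this paper adopts, so the helper below is deliberately minimal, with assumed names (`r`, `pc`, `self`, `other`).

```c
#include <assert.h>

/* Choose which particle's personal best supplies one exemplar dimension
 * for particle `self` (illustrative sketch). r is a uniform random number
 * in [0, 1]; pc is the particle's learning probability; `other` is another
 * (randomly pre-selected) particle index. */
static int exemplar_source(double r, double pc, int self, int other)
{
    /* With probability pc, learn this dimension from another particle;
     * otherwise keep the particle's own personal best. */
    return (r < pc) ? other : self;
}
```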

Flowchart of CLPSO.

As can be seen from Figure

View of the hardware platform by OpenCL.

The host program creates a context. A context specifies kernel(s) to be executed on one or more compute devices. Besides kernels, a context also manages objects such as command queues, memory, and programs. A command queue holds commands (or operations) that will be executed on a compute device. Commands placed into a command queue can be classified into three categories, i.e., kernel management, memory management, and synchronization. Values for the input parameters of a kernel are transferred between the CPU and a compute device. OpenCL represents generic data by a buffer and supports creating a buffer only for a one-dimensional array. The memory space of a compute device is divided into four regions, i.e., global memory, constant memory, local memory, and private memory. The global memory permits read/write access by all the work items in all the work groups. Being writable by the CPU but not by the compute device, the constant memory remains constant during the execution of a kernel. A local memory is shared only by the work items in one specific work group. Each work item has a private memory, invisible to any other work item. The memory region qualifiers “__global,” “__constant,” “__local,” and “__private” can be applied to an input parameter of a kernel to specify that the parameter is stored in the global, constant, local, or private memory region, respectively. An input parameter with no memory region qualifier is stored in the private memory region by default. An input parameter whose data are transferred by a buffer can only be stored in the global memory region. All the variables and constants additionally declared inside the kernel are stored in the private memory region. OpenCL is able to synchronize all the work items in the same work group, but cannot synchronize work items across different work groups. A program consists of one or more kernels.

IHDG520 is an integrated GPU, i.e., it is embedded on the same die as the CPU. Integrated GPUs produce less heat and use less power; thus, they have been widely adopted in laptop (particularly, ultrabook) computers. The IHDG520 GPU has 24 compute units clocked at 0.3 GHz. Each compute unit is composed of 256 processing elements. The IHDG520 GPU has to share the main memory with the CPU. For the IHDG520 GPU, the size of the constant memory region and that of the local memory region are both zero; in other words, only the global memory region and the private memory region, both located in the main memory, can be used. The size of the global memory region is 1.3 GB. The maximum size of a buffer created in the global memory region is 511 MB. The IHDG520 GPU uses on-chip registers to store kernel instructions. The IHDG520 GPU supports single-precision floating-point calculation, but does not support double-precision floating-point calculation.

The Xiaowan Reservoir, located on Lancang River, is taken as the case study. Lancang River is the upper reach of Mekong River in China. Mekong River is a cross-border river in Southeast Asia. Originating from the Qinghai-Tibet Plateau, Mekong River runs successively through 6 countries, i.e., China, Myanmar, Laos, Thailand, Cambodia, and Vietnam. Mekong River is the world’s 12^{th} longest river, with a length of 4350 km. Lancang River is 2139 km long, draining an area of 0.16 million km^{2} over the provinces of Qinghai, Tibet, and Yunnan. The Xiaowan Reservoir is constructed in the west of Yunnan on the middle reach of Lancang River, at a longitude of 100°05′28″ and a latitude of 24°42′18″. Figure ^{3}. The Xiaowan Reservoir is affected by a monsoon climate, and the inflows feature seasonal variations. The flood season lasts from June to September. The guaranteed hydropower generation per year is 190⋅10^{8} kWh. Historical inflow records for the Xiaowan Reservoir from the year 1953 to 2014 are available.

Lancang River basin in Yunnan and the Xiaowan Reservoir.

The deterministic optimization problems for the ISO of the Xiaowan Reservoir are formulated with a yearly planning horizon of 12 monthly time steps and an ensemble of _{m, t} into the reservoir in each month _{m, t} and the spillage rate _{m, t} in each month _{m, t}, _{m, t}, and _{m, t} are all measured by the unit of m^{3}/s, and the following equation gives the objective:_{m, t} is the power output in month _{m, t} is the number of days in month

_{m, t} is calculated by_{m, t} is the water conversion rate in month ^{3}/kWh. _{m, t} is affected by the water head _{m, t} in month _{m, t} is the difference of the forebay elevation _{m, t} and the tailrace elevation _{m, t} in month
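The symbols in the extracted equations above are incomplete, but the monthly energy bookkeeping they describe can be illustrated as follows. All names here (`discharge_m3s`, `days`, `rate_m3_per_kwh`) are hypothetical; the only assumption carried over from the text is that the water conversion rate is expressed in m^{3}/kWh.

```c
#include <assert.h>

/* Hypothetical sketch: energy generated in one month (kWh), given the power
 * discharge rate (m^3/s), the number of days in the month, and the water
 * conversion rate (m^3/kWh, i.e., water volume required per kWh). */
static double monthly_energy_kwh(double discharge_m3s, int days,
                                 double rate_m3_per_kwh)
{
    /* total volume discharged through the turbines during the month */
    double volume_m3 = discharge_m3s * days * 24.0 * 3600.0;
    return volume_m3 / rate_m3_per_kwh;  /* volume / (m^3 per kWh) */
}
```

In the actual model the conversion rate itself depends on the water head, which in turn depends on the forebay and tailrace elevations, as the text explains.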

Let _{m, t} be the storage volume at the beginning of month _{m, t} is a function of the average storage volume

_{m, t}.

Let _{m,1} is known, and

The problem is associated with the following constraints:

The deterministic optimization needs to solve

A basic parallelization scheme is presented here and works as the basis of our proposed enhancement strategies. The basic parallelization scheme follows the all-GPU model and implements a single kernel. CLPSO needs to generate random numbers uniformly distributed in [0, 1] at Steps 1 and 3. An OpenCL program is composed of both the host part and the kernel part. OpenCL provides no built-in primitive for generating any kind of random number in the kernel part. We write an auxiliary inline function that the kernel function can invoke for generating a random unsigned integer number based on the multiplicative linear congruential (MLC) principle [

When the inline function is called, the function code is inserted directly at the point of each call, thereby eliminating the function-call overhead.
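A common choice for such an MLC generator is the Park–Miller minimal standard (multiplier 16807, modulus 2^{31} − 1). The paper does not show its exact constants, so the ones below are an assumption, although they are consistent with the divisor 2147483647.0 mentioned later.

```c
#include <assert.h>

/* Multiplicative linear congruential generator (Park-Miller constants,
 * assumed here). Advances the seed and returns the new state; a 64-bit
 * intermediate avoids overflow in the multiplication. */
static unsigned int mlc_next(unsigned int seed)
{
    return (unsigned int)(((unsigned long long)seed * 16807ULL) % 2147483647ULL);
}

/* Map the integer state to a float uniformly distributed in [0, 1]. */
static float mlc_to_unit(unsigned int state)
{
    return (float)(state * 1.0 / 2147483647.0);
}
```

The integer multiplication, modulo, and float division in these two helpers are exactly the operations the text identifies as expensive on the IHDG520 GPU.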

The all-GPU model is coarse-grained, with each particle mapped to a separate work item in a one-dimensional index space. Each work item is identified by the global ID. _{i, d}, _{i, d}, _{i}, _{i}), _{d}, _{i} are input parameters of the kernel function, while _{i} is the seed for each particle/work item _{i, d}, _{i, d}, _{i}, _{i}), _{d}, _{i}; hence, they are stored in the global memory region. _{i}. The numerical values are then transferred from the CPU to the IHDG520 GPU before the kernel function executes. The work items execute concurrently, and each work item is just responsible for performing the operations related to the corresponding particle at Steps 1, 3, and 4. Only one prespecified work item executes Steps 2 and 5. When the kernel function finishes execution, _{d} and

Basic coarse-grained all-GPU model.

Two enhancement strategies, namely, generating and transferring random numbers from the CPU as well as reducing the number of instructions in the kernel, are employed to accommodate the characteristics of the IHDG520 GPU and the OpenCL APIs for the purpose of significantly shortening the execution time of the basic coarse-grained all-GPU model.

OpenCL provides no built-in primitive for generating any type of random number. In the basic coarse-grained all-GPU model, an auxiliary inline function is written to help the kernel function generate random numbers based on the MLC principle. The MLC principle generates a random unsigned integer number from an unsigned integer input, and a random float number uniformly distributed in [0, 1] can be obtained by multiplying the random unsigned integer number by 1.0 and dividing by 2147483647.0. Most GPUs, including the IHDG520 GPU, are not good at the integer multiplication and modulo operations as well as the float division operation involved in the MLC random number generation process and need many clock cycles to execute such costly operations. In addition, the IHDG520 GPU is slow at execution because its clock rate is just 0.3 GHz. Step 1 of CLPSO randomly initializes each particle _{i} on each dimension _{i, d} if the number is greater than _{i}, and a dimension is randomly selected to learn from a particle that is also randomly selected when _{i} equals the personal best position _{i} on all the dimensions; thus, maximally, 3 ((

With respect to the basic coarse-grained all-GPU model, _{i, d}, _{i, d}, _{i}, _{i}) are input parameters of the kernel function. OpenCL supports creating a buffer only for a one-dimensional array. Buffers representing one-dimensional arrays with _{i, d}, and _{i, d}, and buffers representing one-dimensional arrays with _{i}, _{i}). All the work items are indexed in a one-dimensional space, and the global IDs of the work items range from 0 to _{i, d}, and _{i, d} by the index _{i}, _{i, d} and _{i}) are shared among the particles for exemplar redetermination at Step 3 of CLPSO, while _{i, d}, _{i}, and
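Since OpenCL buffers wrap one-dimensional arrays only, a particle-major layout with an explicit index computation is the natural mapping for the two-dimensional quantities above. The helper below is a sketch with assumed names (N particles, D dimensions).

```c
#include <assert.h>

/* Flatten the (particle i, dimension d) pair into an index of a
 * one-dimensional buffer laid out particle-major: element (i, d) of an
 * N-by-D array lives at offset i * D + d. */
static int flat_index(int i, int d, int D)
{
    return i * D + d;
}
```

With a swarm of 40 particles and D = 30, the buffer holds 1200 elements, and work item i accesses elements flat_index(i, 0, 30) through flat_index(i, 29, 30).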

Like in [

First, for each year

Second, incrementally starting from _{m, t} which is the deviation of the storage volume at the end of month _{m,t} according to equation (_{m,t + 1} if Δ_{m,t} ≠ 0. Note that _{m,t} is kept feasible in equation (
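The feasibility-repair step described above cannot be reproduced exactly from the extracted symbols, but its water-balance core can be sketched as follows. All names (`storage_m3`, `inflow_m3s`, `outflow_m3s`, `vmin`, `vmax`, `seconds`) are hypothetical.

```c
#include <assert.h>

/* Hypothetical sketch of one water-balance step with bound repair: advance
 * the storage volume by (inflow - outflow) * seconds, clamp it to
 * [vmin, vmax], and return the deviation that was clipped away (zero when
 * the step was already feasible). */
static double step_storage(double *storage_m3, double inflow_m3s,
                           double outflow_m3s, double seconds,
                           double vmin, double vmax)
{
    double next = *storage_m3 + (inflow_m3s - outflow_m3s) * seconds;
    double dev = 0.0;
    if (next > vmax) { dev = next - vmax; next = vmax; }
    if (next < vmin) { dev = next - vmin; next = vmin; }
    *storage_m3 = next;
    return dev;  /* nonzero deviation means the outflow must be adjusted */
}
```

A nonzero returned deviation corresponds to the Δ term in the text: the outflow of the following month is adjusted so that the storage trajectory stays feasible.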

Let

The original constrained problem is converted to an unconstrained problem by optimizing the following objective that incorporates the violations:
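The extracted objective itself is missing here, but a standard exterior penalty of the kind described can be sketched as follows. The benefit/violation decomposition and the name `penalty_factor` are assumptions.

```c
#include <assert.h>

/* Hypothetical sketch of the penalized objective: the benefit (e.g., total
 * hydropower generation) minus a large penalty factor times the summed
 * constraint violations. Maximizing this value drives violations to zero
 * while maximizing the benefit. */
static double penalized_objective(double benefit, double total_violation,
                                  double penalty_factor)
{
    return benefit - penalty_factor * total_violation;
}
```

Because the penalty factor is chosen to be very large, any infeasible solution scores far worse than every feasible one, which is consistent with the all-zero violation costs reported in the experiments.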

The unconstrained problem is solved by CLPSO. Each particle’s position is a 12-dimensional vector representing the outflow rates over the yearly planning horizon. The power discharge and spillage rates in each month can be easily determined from the outflow rate in the corresponding month. The deterministic optimization for ISO on the ensemble of

Coarse-grained multipopulation model.

A serious challenge arises and needs to be addressed for the multipopulation model. The maximum size of a buffer created in the global memory region of the IHDG520 GPU is limited to 511 MB, and the size of the global memory region is 1.3 GB. The multipopulation model needs maximally
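The modified strategy of letting many work items share a small pool of host-generated random numbers can be sketched as follows; the pool size and the offsetting scheme are assumptions, chosen only to illustrate how a pool far smaller than the total demand can serve every work item.

```c
#include <assert.h>

/* Hypothetical sketch of the modified random number strategy: a small pool
 * of host-generated uniform numbers is shared by all work items. Work item
 * `gid` reads its k-th number starting from an item-specific offset and
 * wrapping around, so different work items consume different subsequences
 * of the same pool. */
static double pool_rand(const double *pool, int pool_size, int gid, int k)
{
    return pool[(gid + k) % pool_size];
}
```

This keeps the transferred buffer well under the 511 MB limit while still giving each work item a distinct stream of numbers.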

In [

For the 1^{st} and 2^{nd} issues, a sequentialization model of CLPSO and 3 coarse-grained all-GPU models as listed in Table _{1}, _{2}, and _{3} are unimodal, and all the other functions are multimodal. _{7} and _{8} are rotated.

Coarse-grained all-GPU models of CLPSO.

Model | Description
---|---
Basic | Parallelize CLPSO without employing any enhancement strategy
Intermediate | The basic model with the enhancement strategy generating and transferring random numbers from the CPU
Final | The intermediate model with the enhancement strategy reducing the number of instructions in the kernel

Benchmark global optimization functions.

Function | Description | Global optimum | Search space
---|---|---|---
_{1} | Sphere | 0 |
_{2} | Schwefel’s P2.22 | 0 |
_{3} | Noise | 0 |
_{4} | Rosenbrock’s | 0 |
_{5} | Rastrigin’s | 0 |
_{6} | Ackley’s | 0 |
_{7} | Rotated Schwefel’s | 0 |
_{8} | Rotated Rastrigin’s | 0 |

Regarding the 3^{rd} issue, the multipopulation model is compared with the sequentialization model. The deterministic optimization for the ISO of the Xiaowan Reservoir is performed on the historical monthly inflow data recorded during the 62-year period from 1953 to 2014. The optimal operation related to each year is solved by CLPSO with a swarm of 40 particles in the sequentialization model. The multipopulation model takes advantage of 62 work groups to concurrently tackle the 62 optimal operation problems. Each work group solves the optimal operation with respect to a separate year, is composed of 40 work items, and follows the final coarse-grained all-GPU model. Each work item iterates for 10,000 generations. The multipopulation model employs the modified random number generation strategy. The initial/final storage volume bound is 145.57⋅10^{8} m^{3}, corresponding to the normal forebay elevation of 1240 m. The penalty factor is 368^{8}.

Concerning the 4^{th} issue, we need to understand the overhead of kernel launching. The execution time of a parallelization model is the time gap between the initialization of parameters and the release of OpenCL objects and is the sum of the CPU-side and GPU-side execution times. The GPU-side execution time is the time spent blocking until all the enqueued commands in the command queue have been issued to the GPU and have completed. The CPU-side execution time can be divided into 6 segments: segment 1 is the time for initializing the numerical values of some input parameters; segment 2 is the time for creating a context, a command queue, and buffers, as well as enqueueing commands to write some of the buffers; segment 3 is the time for building an executable program; segment 4 is the time for creating a kernel declared in the program, setting the input parameters of the kernel, and enqueueing a command to execute the kernel; segment 5 is the time for reading the results from the GPU and releasing the kernel; and segment 6 is the time for releasing the other objects.

All the sequentialization and parallelization models are executed for 25 runs on all the benchmark functions and the case study. The speedup of a parallelization model as compared with the sequentialization model is calculated as the mean execution time of the sequentialization model divided by that of the parallelization model. Table ^{3}/s on average, 1976 is a typical normal year with natural inflow 1186 m^{3}/s on average, and 2011 is a typical dry year with natural inflow 974 m^{3}/s on average. The monthly natural inflow records as well as the outflow rate, forebay elevation, and power output results determined from the median run by the multipopulation model for 1954, 1976, and 2011 are, respectively, shown in Figures
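As a sanity check on the speedup definition, a one-line helper suffices; the function name is illustrative only.

```c
#include <assert.h>

/* Speedup of a parallel model over the sequential model: the ratio of the
 * sequential mean execution time to the parallel mean execution time. */
static double speedup(double t_seq_ms, double t_par_ms)
{
    return t_seq_ms / t_par_ms;
}
```

With the case-study means reported in the tables (16,090.00 ms for the sequentialization model and 1165.60 ms for the multipopulation model), speedup(16090.0, 1165.6) evaluates to about 13.80, matching the tabulated speedup.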

Statistical execution time and global best fitness results of the sequentialization model and the coarse-grained all-GPU models on all the benchmark functions.

Benchmark function | Model | Exec. time mean (ms) | Exec. time std. dev. | Exec. time max | Exec. time min | Fitness mean | Fitness std. dev. | Fitness max | Fitness min
---|---|---|---|---|---|---|---|---|---
_{1} | Sequentialization | 167.24 | 7.17 | 172.00 | 156.00 | 0 | 0 | 0 | 0
 | Basic | 5823.84 | 16.12 | 5866.00 | 5803.00 | 0 | 0 | 0 | 0
 | Intermediate | 574.08 | 10.93 | 608.00 | 561.00 | 0 | 0 | 0 | 0
 | Final | 450.58 | 11.29 | 483.00 | 436.00 | 0 | 0 | 0 | 0
_{2} | Sequentialization | 192.84 | 7.70 | 203.00 | 187.00 | 0 | 0 | 0 | 0
 | Basic | 5881.24 | 12.77 | 5912.00 | 5865.00 | 0 | 0 | 0 | 0
 | Intermediate | 638.32 | 13.38 | 671.00 | 624.00 | 0 | 0 | 0 | 0
 | Final | 506.72 | 11.94 | 531.00 | 483.00 | 0 | 0 | 0 | 0
_{3} | Sequentialization | 164.72 | 7.90 | 172.00 | 156.00 | 5.41 | 1.34 | 6.70 | 3.18
 | Basic | 6148.96 | 28.44 | 6194.00 | 6084.00 | 6.47 | 2.04 | 9.29 | 2.04
 | Intermediate | 571.64 | 11.71 | 593.00 | 561.00 | 5.95 | 1.36 | 8.76 | 3.37
 | Final | 451.76 | 7.05 | 468.00 | 437.00 | 5.12 | 1.71 | 9.14 | 1.93
_{4} | Sequentialization | 161.64 | 7.68 | 172.00 | 156.00 | 32.54 | 14.89 | 54.60 | 21.19
 | Basic | 6007.28 | 14.83 | 6037.00 | 5975.00 | 36.61 | 18.11 | 82.75 | 10.87
 | Intermediate | 596.96 | 10.09 | 609.00 | 577.00 | 35.76 | 19.33 | 74.10 | 9.75
 | Final | 454.28 | 8.18 | 468.00 | 437.00 | 33.62 | 20.24 | 84.01 | 10.90
_{5} | Sequentialization | 254.60 | 7.47 | 266.00 | 249.00 | 3.00 | 3.00 | 8.00 | 0
 | Basic | 5982.96 | 15.74 | 6022.00 | 5959.00 | 1.08 | 1.12 | 4.60 | 1.30
 | Intermediate | 621.52 | 7.42 | 640.00 | 608.00 | 6.60 | 4.80 | 1.87 | 6.00
 | Final | 487.36 | 10.36 | 515.00 | 468.00 | 8.00 | 1.10 | 5.30 | 1.00
_{6} | Sequentialization | 310.12 | 5.20 | 312.00 | 296.00 | 1.10 | 2.00 | 1.50 | 8.00
 | Basic | 5997.92 | 15.76 | 6038.00 | 5974.00 | 1.60 | 2.00 | 1.90 | 1.10
 | Intermediate | 646.68 | 11.11 | 671.00 | 624.00 | 1.50 | 1.00 | 1.50 | 1.10
 | Final | 514.16 | 10.62 | 531.00 | 499.00 | 1.20 | 2.00 | 1.50 | 8.00
_{7} | Sequentialization | 492.32 | 7.87 | 500.00 | 483.00 | 1289.46 | 181.45 | 1496.70 | 874.44
 | Basic | 3059.44 | 31.37 | 3105.00 | 2995.00 | 1279.78 | 153.06 | 1586.59 | 962.10
 | Intermediate | 746.92 | 10.50 | 765.00 | 733.00 | 1307.11 | 122.96 | 1601.35 | 1119.69
 | Final | 714.48 | 10.99 | 733.00 | 702.00 | 1309.49 | 158.23 | 1560.73 | 1042.22
_{8} | Sequentialization | 531.04 | 7.16 | 546.00 | 514.00 | 26.56 | 4.24 | 36.60 | 18.25
 | Basic | 6571.36 | 15.69 | 6599.00 | 6552.00 | 27.93 | 3.19 | 34.43 | 22.45
 | Intermediate | 934.08 | 14.53 | 967.00 | 920.00 | 28.76 | 4.60 | 38.60 | 20.65
 | Final | 594.68 | 12.17 | 624.00 | 577.00 | 28.43 | 3.50 | 33.49 | 21.56

Two-tailed

Benchmark function | _{1} | _{2} | _{3} | _{4} | _{5} | _{6} | _{7} | _{8}
---|---|---|---|---|---|---|---|---
Two-tailed | — | — | 0.40 | 0.11 | 0.08 | 0.20 | 0.37 | 0.89

Statistical execution time, total best benefit, and total best violation cost results of the sequentialization model and the coarse-grained multipopulation model on the case study.

Model | Exec. time mean (ms) | Exec. time std. dev. | Exec. time max | Exec. time min | Benefit mean (10^{8} kWh) | Benefit std. dev. | Benefit max | Benefit min | Violation cost mean | Violation cost std. dev. | Violation cost max | Violation cost min
---|---|---|---|---|---|---|---|---|---|---|---|---
Sequentialization | 16,090.00 | 25.39 | 16,177.00 | 16,053.00 | 14,550.50 | 0.42 | 14,550.80 | 14,548.85 | 0 | 0 | 0 | 0
Multipopulation | 1165.60 | 9.52 | 1185.00 | 1139.00 | 14,550.42 | 0.23 | 14,550.72 | 14,550.04 | 0 | 0 | 0 | 0

Two-tailed

Two-tailed | 0.38
Speedup | 13.80

Natural inflow records of the three typical years.

Outflow rate results of the three typical years.

Forebay elevation results of the three typical years.

Power output results of the three typical years.

Mean CPU-side execution time results of the final all-GPU model on some benchmark functions and the multipopulation model on the case study.

Model | Benchmark function/case study | Segment 1 | Segment 2 | Segment 3 | Segment 4 | Segment 5 | Segment 6 | Total
---|---|---|---|---|---|---|---|---
Final all-GPU | _{1} | 134.16 | 18.96 | 171.24 | 0.60 | 0.60 | 2.96 | 328.52
 | _{3} | 132.84 | 18.16 | 167.16 | 0.64 | 0.60 | 2.56 | 321.96
 | _{7} | 133.40 | 18.55 | 277.16 | 0.60 | 0.58 | 2.79 | 433.08
Multipopulation | Case study | 108.52 | 21.24 | 301.40 | 0.64 | 0.55 | 2.62 | 434.97

The original sequentialization implementation of CLPSO proposed in [_{1} and _{2} in Table

As can be seen from Table _{1} and _{2} and similar to those of the sequentialization model on the other functions. The _{3} to _{8} as the _{1} and _{2} are blank. Functions _{1} to _{3} are unimodal, and _{4} to _{8} are multimodal. The statistical global best fitness results given in Table _{1} and _{2} in all the runs, can find the global optimum on _{5} in some runs and a near-optimum in the other runs, can find a near-optimum on _{3} and _{6} in all the runs, and can only find a local optimum on _{5}, _{7}, and _{8} in all the runs.

For the sequentialization model, a significant part of the execution time is spent on fitness calculation. Functions _{5} and _{6} include cosine operations. Compared with _{5}, _{6} additionally needs to calculate exponential values. _{7} and _{8} are rotated functions and multiply the original decision vector by an orthogonal matrix. _{8} is a rotated variant of _{5}. There are sine operations in _{7}. Therefore, the evaluation of _{7} and _{8} is the most time-consuming, followed by _{5} and _{6}, while _{1} to _{4} are the least time-consuming. This analysis is clearly validated by the statistical execution time results of the sequentialization model given in Table

In Table _{1}, _{2}, _{3}, _{4}, _{5}, _{6}, and _{8} and around 3000 ms on _{7}. The mean execution time results of the intermediate all-GPU model are around 600 ms on _{1} to _{6}, around 700 ms on _{7}, and around 900 ms on _{8}, far less than those of the basic model. The dramatic execution time difference between the basic model and the intermediate model on the same function is attributed to random number generation. The basic model generates random numbers in the kernel function based on the MLC principle. Performing the integer multiplication and modulo as well as float division operations involved in the MLC random number generation process on the IHDG520 GPU is very time-consuming. In contrast, the intermediate model generates random numbers efficiently on the high-clock-rate CPU and transfers the random numbers from the CPU to the GPU. The basic model takes less time on _{7} than on the other functions because the landscape of _{7} is highly mountainous; each particle is likely to fly to a position that leads to a better personal best fitness value during the trajectory update, and the model thus goes through exemplar redetermination and random number generation far fewer times. The mean execution time results of the final model are around 100 ms less than those of the intermediate model on _{1} to _{6}, around 30 ms less on _{7}, and around 300 ms less on _{8}, verifying that the strategy reducing the number of instructions in the kernel helps shorten the execution time. With respect to _{8}, multiplication of the dimensional position _{8} than on the other functions. Particles are likely to be infeasible when the work items concurrently optimize the highly mountainous function _{7}; hence, the shortening of the execution time is not that noticeable on _{7}.
The mean execution time results of the final model are, however, still longer than those of the sequentialization model on all the functions; this is because a considerable amount of time (on the scale of hundreds of ms) must be spent on creating/releasing OpenCL objects (e.g., context, command queue, buffer, program, and kernel) and transferring buffers between the CPU and the GPU, as verified by the mean CPU-side execution time results given in Table

The deterministic optimization for the ISO of the Xiaowan Reservoir is multimodal, as reflected in the statistical total best benefit results of the sequentialization model listed in Table ^{8} kWh. The statistical total best benefit results of the multipopulation model are similar to those of the sequentialization model. The ^{8} kWh; hence, on average, the optimized hydropower generation is about 235⋅10^{8} kWh per year, much more than the guaranteed hydropower generation of 190⋅10^{8} kWh per year, validating the powerful global optimization capability of CLPSO. The solutions are feasible, as the statistical total best violation cost results of the 2 models are all 0. The sequentialization model is very time-consuming, with a mean execution time of 16,090.00 ms. As we can see from Tables

It can be observed from Figure

As we can see from Table _{1}, _{3}, and _{7}, with the mean segment 1 time results more than, the mean segment 2 time results less than, and the mean segment 4, 5, and 6 time results similar to the corresponding results of the multipopulation model on the case study. The segment 1 time is mainly the time for generating the random numbers and is thus affected by the number of random numbers. The segment 2 time increases with the number of buffers created. The mean segment 3 time results of the final all-GPU model on _{1} and _{3} are similar and are much less than those of the final all-GPU model on _{7} and the multipopulation model on the case study, indicating that the more complex the fitness evaluation of a particle, the more time is needed to build the program. Steps 1, 3, and 4 of CLPSO involve operations related to each work item, while Steps 2 and 5 are executed by just one prespecified work item. Steps 2, 3, and 4 constitute a for-loop. When multiple kernels are used to parallelize different phases of CLPSO, intermediate results must be transferred back from a kernel, and if the kernel is not the last kernel, the intermediate results need to be transferred to the next kernel. Steps 2, 3, and 4 cannot be implemented as multiple kernels because the for-loop would cause the overhead of frequently enqueueing commands to write some buffers, enqueueing commands to execute the kernels, and reading results from the kernels. This overhead can be very large over many generations. Steps 1 and 5 also do not benefit from being split into multiple kernels because all the work items are occupied at Step 1, and, for the multipopulation model, each work group has one work item occupied at Step 5. An alternative is to implement 3 kernels corresponding, respectively, to Step 1, the for-loop, and Step 5.
The alternative incurs a small overhead of enqueueing commands to write some buffers, creating kernels, setting input parameters of the kernels, enqueueing commands to execute the kernels, reading results from the kernels, and releasing the kernels. Accordingly, our proposed final all-GPU model and multipopulation model for parallelizing CLPSO are appropriate.

In this paper, we have studied parallelizing CLPSO by OpenCL on the integrated IHDG520 GPU. We have first proposed a basic coarse-grained all-GPU model, with one kernel written and each work item representing a separate particle. As the IHDG520 GPU features a low clock rate while the CPU has a high clock rate, two strategies, i.e., generating random numbers on the CPU and transferring them to the GPU as well as reducing the number of instructions in the kernel, have been adopted to shorten the basic model’s execution time. To facilitate the parallel implementation of CLPSO, the inequality conditions used when determining a dimensional exemplar are relaxed. We have also studied a real-world case parallelizing the deterministic optimization for the ISO of the Xiaowan Reservoir. The deterministic optimization has been solved by CLPSO on 62 years’ monthly natural inflow records and has been parallelized by a multipopulation model, extended from the all-GPU model, that uses a large number of work items. Owing to the size limits for a buffer transferring data from the CPU to the GPU and for storing the data in the global memory region, the random number generation strategy has been further modified to generate a small number of random numbers that can be flexibly exploited by the large number of work items without harming randomness. Experiments have been conducted on various unimodal/multimodal 30-dimensional benchmark global optimization functions and the case study. The experimental results demonstrate that (1) the relaxation of the inequality conditions causes little negative impact on the final solution’s quality; (2) the two enhancement strategies help improve the basic model’s efficiency; (3) the modified random number generation strategy is suitable for the case of a large number of work items; and (4) the multipopulation model consumes significantly less execution time than the corresponding sequentialization model.
In the future, we will investigate adapting and applying the proposed models for parallelizing more advanced metaheuristics [

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that they have no conflicts of interest.

This work was financially supported by the National Natural Science Foundation of China Projects (61703199, 61866023, and 61865012), the Shaanxi Province Natural Science Foundation Basic Research Project (2020JM-278), and the Central Universities Fundamental Research Foundation Project (GK202003006).