Accelerating the SCE-UA Global Optimization Method Based on Multi-Core CPU and Many-Core GPU

1State Key Laboratory of Simulation and Regulation of Water Cycle in River Basin, Research Center on Flood & Drought Disaster Reduction of the Ministry of Water Resources, China Institute of Water Resources and Hydropower Research, Beijing 100038, China 2College of Hydrology and Water Resources, Hohai University, Nanjing 210098, China 3Hydrologic Bureau (Information Center) of the Huaihe River Commission, Bengbu 233001, China


Introduction
There are a large number of intelligent optimization algorithms in the field of parameter optimization of the environmental models, such as the genetic algorithm (GA) [1] and the particle swarm optimization (PSO) [2,3].Among these algorithms, the shuffled complex evolution method developed at The University of Arizona (SCE-UA) is recognized as an effective and robust global optimization technique for calibrating environmental models [4][5][6][7][8][9].However, for problems with high dimensionality, complex objective function response surface, and high objective function computational load, it is resource-demanding and time-consuming.This disadvantage evokes the need for efficient acceleration of the SCE-UA method.The SCE-UA method is inherently parallel and should be accelerated at algorithm level.Besides, the parallel algorithm needs to be properly implemented on powerful parallel computation hardware.With the development of the parallel computing technology, the best way is the utilization of the heterogeneous computing system, which is composed by the multi-core central processing units (CPUs) and the many-core graphics processing units (GPUs).However, in previous literatures, the algorithm level parallelization analysis and the parallelization based on the heterogeneous computing system for the SCE-UA method are rare.
Parallel computing has been more and more popular in one form or another for many decades.In the early stages it was generally restricted to practitioners who had access to large and expensive machines.Today, things are quite different.Almost all consumer desktop and laptop computers have CPUs with multiple cores.Multi-core CPU hardware systems are build up on a set of processors which have access to a common memory.This architecture is recognized as shared-memory system.By placing several cores on a chip, multi-core processors offer a way to improve the performance of microprocessors.In the programming model of the multicore processors, the parallelization is implemented by creating "threads" which represent separate tasks run by different  CPU cores.With multi-core CPUs, several existing or new programming models and environments can help users [10].For example, OpenMP, Pthread, Cilk [11], and even MATLAB can be considered tools to help users implement programs on multi-core CPUs.Among those tools, OpenMP is adopted for the parallelization of the SCE-UA method on multi-core CPU systems owing to its simplicity and good efficiency.Many-core GPU and multi-core CPU are two kinds of completely different hardware systems.The GPU is a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figures 1(a) and 1(b) [12,13].NVIDIA introduced CUDA (compute unified device architecture), a general purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve complex computational problems.CUDA guides the programmer to partition the problem into coarse subproblems that can be solved independently in parallel by blocks of threads and each subproblem into finer pieces that can be solved cooperatively in parallel by all threads within the block [13].Therefore, we adopted CUDA Fortran as the tool for the parallelization of the SCE-UA method on many-core NVIDIA GPU systems.
The objectives of this paper are the following.(1) Rare previous literature is about the parallelization and acceleration of the SCE-UA method.We analyze which part of the SCE-UA method could be parallelized.We redesigned and accelerated the SCE-UA method in the algorithm level and made the method highly suited to the multi-core CPU and many-core GPU.(2) The multi-core CPUs and manycore GPUs have not been applied for the parallelization and acceleration of the SCE-UA method previously.We implement parallel SCE-UA on these two kinds of hardware systems by utilizing the OpenMP and CUDA Fortran.
(3) The Griewank benchmark function is used in this paper to test and compare the performances of the serial and parallel SCE-UA methods.According to the results of the comparison, some useful advises are given to direct how to properly use the parallel SCE-UA methods.

The Serial SCE-UA Method (Serial-SCE-UA).
The SCE-UA method is specifically designed to deal with the peculiarities encountered in environmental model calibration.The method is based on a synthesis of four concepts: (1) combination of deterministic and probabilistic approaches; (2) systematic evolution of a "complex" of points spanning the parameter space, in the direction of global improvement; (3) competitive evolution; (4) complex shuffling.The synthesis of these elements makes the SCE-UA method effective and robust, and also flexible and efficient.A detailed presentation of the theory underlying the SCE-UA algorithm could be found in Duan's papers [5,14].Duan provides MATLAB and Fortran 77 SCE-UA codes on his official website.These two versions are serial codes and are implemented on single core CPU.They can be recognized as the standard SCE-UA codes.In this paper, the serial SCE-UA CPU code is revised from Duan's MATLAB version and is implemented in Fortran 90.This serial SCE-UA CPU code is utilized as the base line for the performance comparisons.by OpenMP and many-core GPU by CUDA Fortran, respectively.The SCE-UA method is inherently parallel and should be redesigned to implement the parallelization.According to the computation flow path of the SCE-UA algorithm, the following parts of the SCE-UA method are parallelizable: (1) initialization of complexes, (2) evolution of complexes, (3) random number generating, and (4) sorting and rearranging of complexes.The flowchart of the parallel SCE-UA method is shown in Figure 2.

Parallel Initialization of Complexes.
Before the evolution of complexes, the SCE-UA method randomly generates a set of initial points to constitute the initial complexes where each point represents a potential solution for the problem.The generation steps include the generation of initial points and the computation of objective function value of each point.The generation of initial points can be parallelized and requires a parallel random number generator which will be described in Section 2.2.4.The computation of objective function value for each initial point is unrelated to other points and can also be parallelized.Supposing we need to generate npt initial points and their corresponding objective function values, we create npt threads on the CPU or GPU.
In each thread, we compute the objective function value for the corresponding randomly generated point.By using multi-core CPU and many-core GPU, the npt threads can be executed in parallel to obtain the objective function values.

Parallel Evolution of Complexes. The evolution of complexes improves the objective function values by evolving
ngs complexes towards the global optimum.The process of complex evolution is inherently parallel and should be parallelized.For each complex evolution loop, we create ngs threads to represent ngs complexes and evolve these threads in parallel on the CPU or GPU to implement the parallel evolution.When the stopping criterion is satisfied, the complex evolution is stopped.In each complex evolution loop, for each complex, we perform the CCE (complex competitive evolution) and the points sequence rearranging for nspl steps.
The CCE contains three typical operations to imitate the genetic algorithm and the downhill simplex algorithm, which includes simplex choosing (the selection), the worst point reflection (the crossover), and the reflection failed point random regeneration (the mutation).In the mutation step, we need a parallel random number generator which is described in Section 2.2.4.The points sequence rearranging utilizes the quick sort algorithm to sort the evolved points in increasing objective function values and rearranges the point sequence according to the objective function values.After that, the algorithm is ready for the next round CCE operation.

Parallel Random Number Generating.
During the initialization and evolution of complexes, we need to generate uniformly distributed random numbers for each thread.
Because the initialization and evolution process is parallel, the random number generating process should be run in parallel.
We design the following parallel random number generating method: (1) For each thread, we generate a random number sequence, respectively [15][16][17].This method promises that the generation process for each thread is unrelated to other threads and guarantees the high quality of the generated random number sequence.
(2) In order to avoid the correlation between different random number sequences and obtain better random characteristics, we adopt different randomly generated seeds for each thread and utilize the Mersenne twister random number generator instead of the widely used linear congruential generator [18,19].For each thread, a separately generated random seed is adopted and catered for the Mersenne twister random number generator to generate a random number sequence for the corresponding thread.

Parallel Sorting and Rearranging of Complexes.
After the initialization or the shuffling evolution loop, the initial or evolved complexes should be sorted in increasing objective function values and rearranged according to their corresponding objective function values.Because the sorting and rearranging processes are inherently parallel, we should parallelize these processes by using the radix parallel sorting method [20,21] and the parallel rearranging method.The parallel sorting and rearranging are also implemented on the multi-core CPU and the many-core GPU.
Here,  represents the number of dimensions (i.e., the number of decision variables) and   ∈ [−100, 100] for  = 1, 2, . . ., .The global optimum of the Griewank problem is ( 1 ,  2 , . . .,   ) = 0 for   = 0 for  = 1, 2, . . ., .As a simple example, the two-dimensional Griewank function is demonstrated in Figure 3.The Griewank problem is complex enough to test the global property of the SCE-UA and the dimension can be changed to test the performance of the SCE-UA.Therefore, we adopt it as the benchmark function in this study.

Settings for the Hardware and Software Utilized in This Study
(1) The Hardware Utilized in This Study.In this study, we utilized the Intel Core i7-4710HQ CPU with hyperthreading (4 CPU cores with 8 threads) and the NVIDIA Geforce GTX 850M (DDR3 version) (see Figure 4).We can see from Figure 4 that the i7-4710HQ CPU has 8 logical CPU cores and the GTX 850M GPU has 640 CUDA GPU cores, which means the CPU and GPU can make full use of their computation capability by using 8 and 640 threads in parallel, respectively.
(2) Software Settings for This Study.The SCE-UA method has several algorithm parameters which control the convergence behavior of the algorithm.They are maxn, maximum number of objective function trials allowed before optimization is terminated; kstop, number of shuffling loops in which the objective function must improve by the specified percentage pcento or else optimization will be terminated; pcento, percentage by which the objective function must change in the specified number of shuffling loops kstop or else the optimization is terminated; peps, minimum parameter space allowed before optimization is terminated.For the purpose of fair comparison, we set the parameters as follows: maxn = positive infinity; kstop = 5; pcento = 0.1; peps = 0.000001.In order to analyze the performance of the serial and parallel SCE-UA algorithm, we need to adjust settings for the serial and parallel algorithm to test how the algorithm performs.These settings are nopt, number of decision variables; ngs, number of complexes; nobj, loop number which is used to test the objective function computation overhead (each loop contains four floating-point arithmetic operations, i.e., including one addition, one subtraction, one multiplication, and one division.Larger nobj corresponds to higher computation overhead).There is one thing that must be noted for the purpose of fair comparison, the time consumed by memory allocation on GPU, transfer between CPU and GPU, and deallocation from GPU that is also considered in this research.All the SCE-UA methods are implemented in single floating-point precision.

Performance Comparison Based on Total Execution Time.
In this section, we test the performances of the serial and parallel SCE-UA methods based on the total execution time (in seconds).For the purpose of fair comparison, we set the nobj = 1000000 for these comparisons.We adjust the nopt and ngs to test how the algorithm performs.

Serial-SCE-UA.
The total execution time of the serial SCE-UA is demonstrated in Figure 5 and Table 1.As we can see, with the increasing of the nopt, the total execution time increases.As for the increasing of the ngs, the execution time varies from 43.95 s to 46559.85 s in proportion.Let nopt = 30 as an example (the bolded line of Table 1); the ngs  increases from 4 to 1024 with a multiple of two.We can observe that the total execution time increases from 130.04 s to 28609.81 s with approximately the same multiple of two.This is because the serial version only utilizes one CPU core; with the increasing of ngs, the total execution time of course increases with approximately the same proportion.For ngs < 16, the execution time is not doubled.This is mainly because of less computational overhead; the CPU resources used for the thread dispatch are relatively higher than the computation.The dispatch of threads cost relatively more time which makes the execution time not doubled.

OMP-SCE-UA. The total execution time of the
OpenMP SCE-UA is demonstrated in Figure 6 and Table 2.As we can see, with the increasing of the nopt, the total execution time increases.As for the increasing of the ngs, the total execution time varies from 13.32 s to 6186.20 s.The time consumed by the OpenMP version is much less than the serial version.Different from the serial version, with the increasing of the ngs and when the ngs is smaller than 16, the total execution time increases not according to the same increasing multiple of the ngs.Let nopt = 30 as an example (the bolded line of Table 2); the ngs increases from 4 to 1024 with a multiple of to boost the performance.Therefore, the increasing multiple of total execution time is not two (less than two).However, when the ngs is equal to or larger than 16, with the increasing of ngs, the OMP-SCE-UA requires more CPU cores to boost the performance.However, there are only 8 CPU cores and the CPU cannot provide more computation resources for the OMP-SCE-UA to boost the performance.Therefore, the execution time increases with the same increasing multiple of the ngs (when ngs is equal to or larger than 16).

CUDA-SCE-UA.
The total execution time of the CUDA SCE-UA is demonstrated in Figure 7 and Table 3.As we can see, with the increasing of the nopt, the total execution time increases.As for the increasing of the ngs, the total execution time varies from 224.62 s to 1400.51 s.The performance of the CUDA version is completely different from the serial and OpenMP version.With the increasing of the ngs, the execution time changes little and decreases a little.This is because with less ngs (i.e., less GPU threads) the GPU cannot hide the latency of the global memory access.The latency slows down the performance of the algorithm.With larger ngs, the GPU can hide the latency by launching many threads simultaneously and therefore boost the performance of the algorithm.

Speedup Ratio Analysis.
The speedup ratio statistics are demonstrated in Figure 8 and Tables 4 and 5 varies from 3.28x to 7.81x, and the speedup ratio of the CUDA version varies from 0.17x to 45.95x.The speedup ratio of the OpenMP version is all less than 8x.The reason is as follows.The OpenMP program creates as many threads as possible to make full use of all the CPU cores.When the number of threads becomes larger than 8, because the CPU utilized in this study has only 8 CPU cores, the creation, management, dispatching, and destroying of CPU threads consume more CPU resources and prevent the speedup ratio from being larger than 8x.As for the CUDA version, when ngs (that equals the number of threads created by the GPU) is small, the speedup ratio of the CUDA version is less than the OpenMP version.This is because the clock speed of the CPU core is 2.5 GHz which is much higher than the clock speed of the GPU core (902 MHz).However, when ngs is very large (ngs > 128), the speedup ratio of the CUDA version becomes much higher than the OpenMP version.This is because the GPU has much more cores (640 CUDA GPU cores) than the CPU (only 8 CPU cores) and can launch much more parallel threads to boost the performance.As for the OpenMP version, the speedup ratio nearly reaches maximum value when ngs equals 8 and 16, and the speedup ratio cannot become larger by increasing the ngs.As for the CUDA version, the speedup ratio can become very large by increasing the ngs and the performance becomes much more satisfactory than the OpenMP version.At last, we can observe from Figure 8(f) that the nopt (i.e., the number of decision variables) has very little impact on the speedup ratio.This means that although the total execution time increases with the increasing of the nopt, the speedup ratio has very little relationship with the nopt.

Analysis of the Impact of the Objective Function
Computational Overhead.The relationship between the nobj and the speedup ratio is demonstrated in Figure 9.We set nopt = 20.The ngs varies from 4 to 1024.We can see that with the increasing of the nobj (i.e., the increasing of the objective function computational overhead), the speedup ratio becomes larger.There is one difference between the OpenMP and the CUDA version.With the increasing of the ngs and the nobj, the optimization problem becomes more complex and the computational overhead becomes heavier.The increasing extent of the speedup ratio of the OpenMP version is not very wide, and the speedup ratio is not higher than 8x.With the increasing of the ngs, the speedup ratio of the CUDA version becomes much higher than the OpenMP version, up to approximately 45x.These results show that the CUDA version performs much better than the OpenMP version under the condition of solving the problem with complex and high computational load objective function.

Optimization Accuracy Comparison.
After checking the optimization results, we found that all the optimization problems converge to the global optimum.This fact shows that the parallel SCE-UA can find the global optimum with the same accuracy as the serial SCE-UA.

Some Useful Advices on How to
Properly Utilize the Parallel SCE-UA Method.After carefully checking the results, we give some useful advices on how to properly utilize the parallel SCE-UA methods: (1) Both of the serial and parallel versions can give the correct optimization result.Therefore, for normal problems, the serial and parallel versions are both applicable and can converge to the global optimum.For simple problems (ngs < 256), the serial and parallel versions are all good to use but the parallel version may not obtain satisfactory speedup ratio because the overhead of creating, dispatching, and destroying of the threads is usually higher than the computation overhead of the optimization algorithm.Therefore, for simple problems (ngs < 256), we recommend using the serial version.
(2) The parallel version runs faster than the serial version especially for complex (ngs > 128) and high dimensional problems (nopt > 40).Therefore, we recommend using the parallel version for these problems.
For complex problem with relatively small ngs (ngs < 256), we recommend using the OpenMP version to obtain a better performance.For complex problem with large ngs (ngs > 128), the CUDA version is a better choice.As for problems with very high objective function computational load, we recommend to use the CUDA version.
(3) We should announce that the parallel version needs more memory than the serial version.This is because the parallel version needs to create arrays and vectors separately for each thread to ensure the correctness of parallel execution.Therefore, for very complex and very high dimensional problems, the parallel version may need much more memory and cause the memory overflow.We recommend using the 64-bit exe on 64-bit operating systems to cope with these kinds of problems because that 64-bit program can utilize more memory than the 32-bit program.The dimensionality of the problem (nopt) do not affect the speedup ratio very much; the only constraint is the memory size.Larger nopt need more memory to store the large and complex population.Therefore, for these problems we recommend installing more CPU memory for the OpenMP version or adopting the Tesla GPU card (usually has more GPU memory) for the CUDA version to provide more memory to ensure the successful execution of the optimization.

Conclusions
In this paper, we proposed parallel SCE-UA method and implemented it on Intel multi-core CPU and NVIDIA manycore GPU by OpenMP and CUDA Fortran.The serial and parallel SCE-UA codes were optimized at the same level to ensure a fair comparison.The Griewank benchmark function was adopted in this paper to test and compare the performances of the serial and parallel SCE-UA methods.
According to the results of the comparison, some useful advises were given to direct how to properly use the parallel SCE-UA.Three conclusions can be stated here: (1) Both of the serial and parallel versions can obtain the global optimum with a satisfactory probability.We can now produce reliable estimates of global optima for large complex optimization problems by both the serial and parallel versions of the SCE-UA methods.
(2) The experimental studies were carried out by using the Griewank benchmark function.This benchmark function embodies many typical problems encountered in the global optimization.Therefore, the recommendations for the parallel SCE-UA derived here can be recognized as guidelines for most applications.
(3) With the advent of the newly developed parallel SCE-UA methods, we can now produce reliable and much faster estimates of global optima for large complex optimization problems by using the parallel SCE-UA methods.The OpenMP version is recommended to be used on medium problems and the CUDA version is recommended to be used on large problems.

Figure 1 :
Figure 1: Comparison of computation capability of CPU and GPU: (a) floating-point operations per second (FLOP/s) for the CPU and GPU; (b) memory bandwidth for the CPU and GPU [13].

2. 3 .
Experimental Studies of the Serial and Parallel SCE-UA 2.3.1.The Griewank Benchmark Function.The performance comparison of serial and parallel SCE-UA methods is based on the Griewank benchmark function.The Griewank global optimization problem is a multimodal minimization problem defined as follows:

Figure 4 :Figure 5 :
Figure 4: The CPU and GPU used in this study.

Figure 8 :
Figure 8: Speedup ratio of the parallel SCE-UA versus the serial SCE-UA.

= 4 Figure 9 :
Figure 9: Relationship between the nobj and the speedup ratio.

Table 1 :
Total execution time (seconds) of the Serial-SCE-UA.

Table 2 :
Total execution time (seconds) of the OMP-SCE-UA.

Table 3 :
Total execution time (seconds) of the CUDA-SCE-UA.

Table 4 :
Speed-up ratio of the OMP-SCE-UA versus the Serial-SCE-UA.

Table 5 :
Speed-up ratio of the CUDA-SCE-UA versus the Serial-SCE-UA.
16).We can observe that the total execution time increases from 34.02 s to 3737.83 s.When ngs is smaller than 16, the increasing multiple of total execution time is not two (less than two).When ngs is equal to or larger than 16, the increasing multiple is approximately equal to two.This is because the OpenMP version can utilize at most 8 CPU cores (the CPU used in this study has at most 8 cores).When the ngs is less than 16, the number of threads equals the CPU cores used.Larger ngs requires more CPU cores and the CPU can provide enough cores for the OMP-SCE-UA (when ngs is less than16)