Implementation of Membrane Algorithms on GPU

Membrane algorithms are a new class of parallel algorithms that incorporate components of membrane computing models, such as the structure of the models and the way cells communicate, into the design of efficient optimization algorithms. Although the importance of their parallelism has been well recognized, membrane algorithms have usually been implemented on a serial computing device, the central processing unit (CPU), which prevents them from working efficiently. In this work, we consider the implementation of membrane algorithms on a parallel computing device, the graphics processing unit (GPU). In such an implementation, all cells of a membrane algorithm can work simultaneously. Experimental results on two classical intractable problems, the point set matching problem and the traveling salesman problem (TSP), show that the GPU implementation of membrane algorithms is much more efficient than the CPU implementation in terms of runtime, especially for problems of high complexity.


Introduction
Membrane computing is an emergent branch of natural computing initiated by Păun in 2000 [1], with the aim of abstracting innovative computing models or ideas from living cells and from higher order structures of living cells, such as tissues and organs. The obtained models, called P systems, are distributed and parallel computing devices. Most variants of P systems have been proved to be computationally complete (equivalent to Turing machines or other equivalent computing devices; we also say that P systems are universal) as number computing devices [2][3][4], language generators [5,6], and function computing devices [7,8]. For general information on this area, please refer to the Handbook of Membrane Computing [9]; for up-to-date information, refer to the membrane computing website http://ppage.psystems.eu/.
P systems have proved to be a rich framework for handling many problems related to computing. Such systems can theoretically solve presumably intractable problems in a feasible time (solving NP-complete problems [10][11][12] or even PSPACE-complete problems [13]). P systems can also provide new ideas for designing optimization algorithms that obtain approximate solutions to intractable problems [14][15][16][17][18][19][20][21][22]. The optimization algorithms inspired by P systems are usually called membrane algorithms (some researchers also call them P systems based optimization algorithms). The first membrane algorithm was proposed by Nishida in 2004 [14], where the nested structure and the communication mechanism between cells were taken from P systems. Experimental results showed that this algorithm is effective and efficient for solving an intractable problem, the traveling salesman problem (TSP) [14]. Since then, many membrane algorithms have been proposed for solving various optimization problems, such as the knapsack problem [15], the point set matching problem [16], numerical optimization problems [17], multiobjective optimization problems [18,19], the DNA sequence design problem [20], and many practical problems [21,22]. An analysis of the dynamic behavior of membrane algorithms indicated that a membrane algorithm balances exploration and exploitation better than its counterpart evolutionary algorithm, which helps to prevent premature convergence [23]. We should stress that all membrane algorithms mentioned above can work in a parallel way, in the sense that the evolution of each cell can be performed simultaneously. Although the importance of this parallelism has been well recognized, membrane algorithms have usually been implemented on a serial computing device, the CPU, so they cannot work as efficiently as expected. To address this, this paper considers the implementation of membrane algorithms on parallel computing devices.
Graphics processing units (GPUs) are computing devices with high parallelism on numerical operations, whose massively parallel processors can support several thousands of concurrent threads. The computational power of GPUs has turned them into attractive platforms for general-purpose scientific and engineering applications, especially for tackling large scale numerical computing problems [24]. In this work, we consider the implementation of membrane algorithms on a GPU in an attempt to make membrane algorithms work in a parallel way. Under the GPU implementation presented here, the work of each cell of a membrane algorithm is carried out by one GPU thread, so that all the cells can evolve simultaneously through the concurrent threads of the GPU. The GPU implementation allows membrane algorithms to work more efficiently, in the sense that each operation of a membrane algorithm is performed in parallel as much as possible. Experimental results on two classical intractable problems, the point set matching problem and TSP, show that the GPU implementation of membrane algorithms is effective and efficient. Compared with the CPU implementation, the GPU implementation consumes much less runtime, especially for large scale problems. Software with a friendly interface has also been developed for the GPU implementation of membrane algorithms, which provides users with a convenient tool for solving intractable problems. We note that GPU implementations of P systems also exist, for example, those developed in Seville, Spain [25,26]. A parallel simulator for membrane computing models on GPU, called PMCGPU, can be found on the website [27].
The rest of the paper is organized as follows. In Section 2, we present the GPU implementation procedure of membrane algorithms. Experimental results on the point set matching problem and TSP are presented in Section 3. Section 4 presents software for the GPU implementation of membrane algorithms. Finally, conclusions and remarks are presented in Section 5.

Implementation of Membrane Algorithms on GPU
In this section, we present a GPU implementation procedure of membrane algorithms. We first review the CPU implementation procedure.
Let us recall that a membrane algorithm with a nested structure mainly consists of the following four operations: (1) initialize solutions in each of the cells; (2) update solutions in each of the cells; (3) exchange solutions between adjacent cells; (4) select good solutions in each of the cells.
Figure 1 illustrates the CPU implementation procedure of membrane algorithms with a nested structure. Under the CPU implementation, within each of the four operations one cell starts to work only after the work of the previous cell is finished. This means that the cells evolve one by one during each of the four operations.
Therefore, each of the four operations is carried out sequentially under the CPU implementation. However, it is not difficult to see that the four operations can be carried out in parallel, in the sense that the cells evolve simultaneously, if a parallel computing device is used. The GPU implementation of membrane algorithms achieves this parallelism of the four operations.
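The sequential procedure can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy problem (minimizing x^2), and the simplified exchange rule (each cell receives the updated solution of the next cell) are assumptions made only to show how the four operations are executed cell by cell.

```python
import random

def run_sequential(num_cells, generations, init, update, cost, seed=0):
    """Run the four operations cell by cell, as a serial CPU implementation would."""
    rng = random.Random(seed)
    # (1) initialize one solution in each cell
    cells = [init(rng) for _ in range(num_cells)]
    for _ in range(generations):
        # (2) update the solution of each cell, one cell after another
        updated = [update(s, rng) for s in cells]
        # (3) exchange: cell i receives the updated solution of adjacent cell i + 1
        #     (assumed rule; the last cell has no such neighbour and receives its own)
        received = [updated[min(i + 1, num_cells - 1)] for i in range(num_cells)]
        # (4) select: each cell keeps the best of its old, updated, and received solutions
        cells = [min((cells[i], updated[i], received[i]), key=cost)
                 for i in range(num_cells)]
    return cells

# Hypothetical toy problem: minimize x^2 over the reals.
def toy_init(rng):
    return rng.uniform(-10.0, 10.0)

def toy_update(x, rng):
    return x + rng.gauss(0.0, 1.0)

def toy_cost(x):
    return x * x

if __name__ == "__main__":
    final = run_sequential(8, 100, toy_init, toy_update, toy_cost)
    print(min(toy_cost(x) for x in final))
```

Every list comprehension in the loop visits the cells one after another, which is exactly the serial behavior described above.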
Briefly, a GPU program is executed by a large number of thread blocks, and each block can hold many concurrent threads (up to 1024 on recent NVIDIA GPUs), so that a GPU can support several thousands of concurrent threads. A GPU works with the help of a host CPU. Data can be transferred between threads in the same block through the shared memory of the block, but the amount of transferred data should be small because the shared memory is small. Threads in different blocks cannot exchange data through shared memory; in the implementation considered here, such data are exchanged through the host. Figure 2 depicts the structure of a GPU.
The proposed GPU implementation procedure of membrane algorithms with a nested structure is shown in Figure 3. Under this implementation, the cells work in parallel instead of one by one during the four operations. The parallelism is based on the following idea: one GPU thread does the work of one cell, so the cells can work in parallel through the concurrent threads. This means that the GPU has to create as many threads as there are cells in the membrane algorithm. Note that a synchronization function is called after each of the four operations. The threads may not finish their work at the same time, so this function ensures that the next operation starts only after every cell has finished the current one. The parallel work of the cells enables membrane algorithms to work more efficiently in terms of runtime, as illustrated by the simulation experiments in the following section.
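The one-worker-per-cell idea with a synchronization point after each operation can be sketched on a CPU thread pool. This is only a structural analogy in Python, not GPU code and not the authors' implementation: the names, the toy problem, and the exchange rule are assumptions, and Python threads do not provide the speed-up of GPU threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_parallel(num_cells, generations, init, update, cost, seed=0):
    """One worker per cell; each operation finishes for all cells before the next starts."""
    # One private random generator per cell, so workers share no mutable state.
    rngs = [random.Random(seed + i) for i in range(num_cells)]
    ids = range(num_cells)
    with ThreadPoolExecutor(max_workers=num_cells) as pool:
        # list(pool.map(...)) returns only when every cell has finished,
        # which plays the role of the synchronization after each operation.
        cells = list(pool.map(lambda i: init(rngs[i]), ids))                 # (1) initialize
        for _ in range(generations):
            updated = list(pool.map(lambda i: update(cells[i], rngs[i]), ids))  # (2) update
            received = list(pool.map(
                lambda i: updated[min(i + 1, num_cells - 1)], ids))          # (3) exchange
            cells = list(pool.map(
                lambda i: min((cells[i], updated[i], received[i]), key=cost),
                ids))                                                        # (4) select
    return cells

# Hypothetical toy problem: minimize x^2 over the reals.
def toy_init(rng):
    return rng.uniform(-10.0, 10.0)

def toy_update(x, rng):
    return x + rng.gauss(0.0, 1.0)

def toy_cost(x):
    return x * x

if __name__ == "__main__":
    final = run_parallel(8, 100, toy_init, toy_update, toy_cost)
    print(min(toy_cost(x) for x in final))
```

Because every cell reads only the results of the previous, completed operation, the outcome does not depend on the order in which the workers happen to run.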

Experimental Results and Analysis
In this section, we evaluate the performance of the GPU implementation of membrane algorithms on two classical intractable problems: the point set matching problem and TSP. Since some familiarity with these two problems is useful, we briefly recall them here.
A point set matching problem can be formulated as follows. Given two point sets $P = \{p_1, \ldots, p_m\}$ and $Q = \{q_1, \ldots, q_n\}$, find a map $f : P' \to Q'$ (with $P' \subseteq P$ and $Q' \subseteq Q$) such that the matching objective value Gobj is minimized; Gobj is defined in terms of $d(p, q)$, the Euclidean distance between points $p$ and $q$; $|P'|$ and $|Q'|$, the sizes of $P'$ and $Q'$; and $\lambda$, a penalty factor.
The TSP raises the following question. Given a list of $n$ cities $C_i$, $1 \le i \le n$, find a route $C_{\pi_1} C_{\pi_2} \cdots C_{\pi_n} C_{\pi_1}$ (i.e., a route that visits each city exactly once and returns to the origin city) such that the total distance Dist of this route is minimized:
$$\mathrm{Dist} = \sum_{i=1}^{n-1} d(C_{\pi_i}, C_{\pi_{i+1}}) + d(C_{\pi_n}, C_{\pi_1}),$$
where $d(C_i, C_j)$ is the Euclidean distance between cities $C_i$ and $C_j$.
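As a small illustration of the TSP objective, the following Python sketch computes the total Euclidean length of a closed route. The function name and the example coordinates are hypothetical; cities are assumed to be given as (x, y) pairs and a route as a permutation of city indices.

```python
import math

def tour_length(cities, route):
    """Total Euclidean length of the closed route that visits cities in the given order."""
    total = 0.0
    for k in range(len(route)):
        x1, y1 = cities[route[k]]
        # The modulo wraps around, so the last city is connected back to the first.
        x2, y2 = cities[route[(k + 1) % len(route)]]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

if __name__ == "__main__":
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]   # hypothetical four-city instance
    print(tour_length(square, [0, 1, 2, 3]))    # 4.0 (walks around the square)
    print(tour_length(square, [0, 2, 1, 3]))    # longer: uses both diagonals
```

A membrane algorithm for TSP would evaluate a function of this kind once per candidate route in every cell, which is why the per-cell cost grows with the number of cities.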
The membrane algorithm proposed in [16] is adopted to test the performance of the GPU implementation. The number of generations is set to 100, each cell performs 8 updates per generation, and the other parameters are the same as those suggested in [16]. Each test is executed 5 times. All simulations reported in this work are conducted on a PC with a 3.40 GHz Intel Core i7-2600K CPU, the Windows XP Professional SP3 64 bit operating system, and an NVIDIA GeForce GTX 560 Ti graphics card (it uses the GF114 GPU, which offers a maximum of 384 cores). Note that the runtime of the GPU implementation reported in the following subsections covers all the time that running a membrane algorithm on the GPU takes, including data reorganization, host-to-GPU data transfer, and GPU-to-host data transfer.

Experiments on the Point Set Matching Problem.
In the experiments, the point set matching problem is created as follows: a point set $P$ consisting of $N$ points is generated randomly in the region $\{(x, y) \mid 0 \le x, y \le 256\sqrt{N/10}\}$, and the observed set $Q$ is generated by adding Gaussian noise with a variance of 10 to the set $P$.
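A test instance of this kind could be generated as in the following Python sketch. It is an illustration under stated assumptions, not the authors' generator: the function and parameter names are hypothetical, points are assumed to be drawn uniformly, the region bound is read as 256 times the square root of N/10, and the noise is assumed to be added independently to each coordinate.

```python
import math
import random

def make_instance(n, noise_variance=10.0, seed=0):
    """Return (model, observed): n random points and a noisy copy of them."""
    rng = random.Random(seed)
    side = 256 * math.sqrt(n / 10)        # assumed reading of the region bound
    sigma = math.sqrt(noise_variance)     # gauss() expects a standard deviation
    model = [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]
    observed = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                for x, y in model]
    return model, observed

if __name__ == "__main__":
    model, observed = make_instance(200)
    print(len(model), len(observed))
```

Note that the side of the region grows with the number of points, so the point density stays constant across instance sizes.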
Figure 4 presents the runtime of the CPU and GPU implementations of the membrane algorithm with 200 cells on the point set matching problem for different sizes of the point set. As shown in Figure 4, the GPU implementation outperforms the CPU implementation, in the sense that it consumes much less runtime, especially for large point sets. The runtime of both implementations increases with the size of the point set, but the runtime of the CPU implementation increases dramatically, while that of the GPU implementation increases only slightly. The reason is that the membrane algorithm performs the initializing, updating, exchanging, and selecting operations in each cell, and under the GPU implementation the cells are processed in parallel instead of one by one as under the CPU implementation. Therefore, the runtime of the GPU implementation does not increase dramatically as the size of the point set increases.
Table 1 shows the runtime of the CPU and GPU implementations of the membrane algorithm with different numbers of cells on the point set matching problem with 200 points. The GPU implementation outperforms the CPU implementation, in the sense that it allows a membrane algorithm with a large number of cells to work more efficiently. The runtime of the CPU implementation increases dramatically with the number of cells, while the runtime of the GPU implementation remains almost the same. The slight increase in the runtime of the GPU implementation is partially caused by the growing overhead of synchronizing the cells as their number increases.

Experiments on the TSP.
In the experiments, each TSP instance is chosen from the TSP benchmark problems in TSPLIB proposed by Reinelt [28]. These benchmark problems have been widely used for testing the performance of optimization algorithms.
Figure 5 shows the runtime of the CPU and GPU implementations of the membrane algorithm with 200 cells on TSP instances with different numbers of cities. The result is similar to that for the point set matching problem: the GPU implementation outperforms the CPU implementation in terms of runtime, and it consumes much less runtime especially on TSP instances with a large number of cities.
Table 2 shows the runtime of the CPU and GPU implementations of the membrane algorithm with different numbers of cells on the TSP benchmark problem with 493 cities. As shown in Table 2, the runtime of the CPU implementation increases with the number of cells, while the runtime of the GPU implementation remains almost the same. Unlike in the case of the point set matching problem, the runtime of the CPU implementation on TSP does not increase dramatically with the number of cells. This is mainly because the time consumed by each cell is quite small, since the operations in each cell have a low complexity for this TSP instance. Therefore, the total runtime of the CPU implementation increases only slightly with the number of cells.
Comparing these results with those on the point set matching problem, we find that the GPU implementation of membrane algorithms is quite efficient for tackling large scale intractable problems, while it is less suited to small problem instances. The reason is that running the cells in parallel saves only a little runtime when the computational cost in each cell is very low, as it is for small instances. At the same time, under the GPU implementation the data must be transferred repeatedly between the GPU and the host, which consumes additional runtime. Therefore, the GPU implementation will not be more efficient than the CPU implementation for solving intractable problems of small size.
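This trade-off can be made concrete with a deliberately simple cost model in Python. The model and all of its numbers are hypothetical and are not measurements from this work: it assumes that the serial runtime is proportional to the number of cells, that the parallel runtime processes the cells in waves of at most `cores` cells, and that each generation pays a fixed host-GPU transfer cost.

```python
import math

def cpu_time(cells, generations, t_cell):
    """Serial model: every cell is processed one after another."""
    return generations * cells * t_cell

def gpu_time(cells, generations, t_cell, cores, t_transfer):
    """Parallel model: cells run in waves of `cores`, plus a transfer cost per generation."""
    waves = math.ceil(cells / cores)
    return generations * (waves * t_cell + t_transfer)

if __name__ == "__main__":
    # Hypothetical per-cell costs (seconds): a cheap and an expensive problem.
    for t_cell in (1e-6, 1e-2):
        c = cpu_time(200, 100, t_cell)
        g = gpu_time(200, 100, t_cell, cores=384, t_transfer=1e-3)
        print(f"t_cell={t_cell:g}: cpu={c:g} s, gpu={g:g} s")
```

In this model the parallel version loses when the per-cell work is much cheaper than one transfer and wins clearly when the per-cell work dominates, which mirrors the qualitative behavior described above.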

Software with a Friendly Interface
For the GPU implementation of membrane algorithms, new code has to be written for each intractable problem, which is quite inconvenient for users. Therefore, software with a friendly interface (termed MAGPU) has been developed, which provides users with a convenient tool to implement membrane algorithms with a nested structure on GPU. This software can be found on the website https://github.com/warmheart0/magpu/.
This software mainly provides the following functions: (1) the setting of a few parameters, including the number of generations, the number of cells, and the number of updates of each cell in one generation; (2) the choice of the experimental data and of the updating strategy of each cell (users can also define their own updating strategy); (3) the saving and display of experimental results; (4) the saving and comparison of the runtimes of several experiments; (5) the calculation of a few measurement indexes, such as mean and variance. Figure 6 presents an interface of the software for the GPU implementation of membrane algorithms with a nested structure.

Conclusions and Remarks
In this paper, membrane algorithms were implemented on a parallel computing device, the GPU. The GPU implementation achieves the parallelism of membrane algorithms, which makes them work more efficiently in terms of runtime. Experimental results on the point set matching problem and TSP show that the GPU implementation of membrane algorithms outperforms the CPU implementation, in the sense that it consumes much less runtime.
Although the GPU implementation of membrane algorithms has shown good performance in terms of runtime, many problems remain to be solved. An interesting one is to further reduce the runtime of the GPU implementation. In the presented implementation, data are transferred between the host and the GPU a large number of times, which takes a lot of runtime. A possible solution is therefore to reduce the number of data transfers between the host CPU and the GPU. We conjecture that the runtime of the GPU implementation can be greatly reduced by improving the implementation procedure so that only a small number of data transfers are performed between the host and the GPU.
In order to provide users with a convenient tool to implement membrane algorithms on GPU, software with a friendly interface has been developed. This software is only a simple version, and many functions need to be improved or added, which deserves further investigation.

Figure 1: The CPU implementation procedure of membrane algorithms.

Figure 4: Runtime (s) of CPU and GPU implementation of the membrane algorithm having 200 cells on the point set matching problem with different sizes of point set.

Figure 5: Runtime (s) of CPU and GPU implementation of the membrane algorithm having 200 cells on TSP with different numbers of cities.

Figure 6: An interface of the software for GPU implementation of membrane algorithms with a nested structure.

Table 1: Runtime (s) of CPU and GPU implementation of the membrane algorithm with different numbers of cells on the point set matching problem having 200 points.

Table 2: Runtime (s) of CPU and GPU implementation of the membrane algorithm with different numbers of cells on the TSP benchmark problem with 493 cities.