As the number of rules and the sample rate of type 2 fuzzy logic systems (T2FLSs) increase, computation speed becomes a problem. The T2FLS has a large amount of inherent algorithmic parallelism that modern CPU architectures do not exploit. In the T2FLS, many rules and algorithms can be sped up on a graphics processing unit (GPU), as long as most of the computations at the various stages and in the various components do not depend on each other. This paper demonstrates how to implement interval type 2 fuzzy logic systems (IT2FLSs) on the GPU and reports experiments on the obstacle avoidance behavior of robot navigation. GPU-based computation is a high-performance solution that also frees up the CPU. The experimental results show that, for large systems, the GPU implementation runs up to about 30 times faster than the CPU implementation.
1. Introduction
Graphics processing units (GPUs) offer a new way to perform general purpose computing on hardware that is better suited to complicated fuzzy logic systems. However, implementing these systems on GPUs is difficult because many algorithms are not designed in a parallel form conducive to GPU processing. In addition, there may be too many dependencies between the stages of an algorithm, which slows down GPU processing.
Type 2 fuzzy logic has been developed in both theory and practice and has been applied successfully to real applications [1–10]. A review of the methods used in the design of interval type 2 fuzzy controllers is given in [11]. However, the computational complexity of the T2FLS is still high, and many studies try to reduce it through either algorithmic approaches or hardware implementation. Some proposals for implementing type 2 FLSs focus on design and software development, such as coding a high-speed defuzzification stage based on the average of two type 1 FLSs [12] or optimizing an incremental fuzzy PD controller with a genetic algorithm [13]. In more recent work, an interval type 2 fuzzy inference system with Karnik-Mendel type reduction was designed, tested, and implemented in hardware [14].
Many recent studies use GPUs for general purpose computing to speed up complicated algorithms by parallelizing them for the GPU architecture, especially in applications of fuzzy logic. Anderson et al. [15] presented a GPU solution for fuzzy C-means (FCM) clustering. This solution used OpenGL and Cg to achieve a computational speedup of approximately two orders of magnitude for some clustering profiles on an nVIDIA 8800 GPU. They later generalized the system to non-Euclidean metrics [16]. Further, Sejun Kim [17] describes a method for adapting a multilayer tree structure composed of fuzzy adaptive units to the CUDA platform. Chiosa and Kolb [18] present a framework for mesh clustering implemented solely on the GPU with a new generic multilevel clustering technique. Chia et al. [19] propose the implementation of a zero-order TSK fuzzy neural network (FNN) on GPUs to reduce training time. Harvey et al. [20] present a GPU solution for fuzzy inference. Anderson et al. [21] present a parallel implementation of fuzzy inference on a GPU using CUDA.
Again, a speed improvement of over two orders of magnitude can be achieved for this naturally parallel algorithm under particular inference profiles. One problem with this system, as well as with the FCM GPU implementation, is that both rely on OpenGL and Cg (graphics libraries), which makes the systems, and any generalization of them, difficult for newcomers to GPU programming.
Therefore, we analyzed fuzzy logic systems in order to take advantage of the processing capabilities of GPUs. The algorithm must be altered so that it can be computed quickly on a GPU. In this paper, we explore the use of nVIDIA's Compute Unified Device Architecture (CUDA) for the implementation of an interval type 2 fuzzy logic system (IT2FLS). CUDA exposes the functionality of the GPU in a language that most programmers are familiar with, C/C++, so that it can be understood widely and integrated more easily into applications that otherwise have no need to interface with a graphics API. Experiments are carried out on the obstacle avoidance behavior of robot navigation on an nVIDIA platform, and the measured run times are summarized.
The paper is organized as follows: Section 2 presents an overview of GPUs and CUDA; Section 3 introduces interval type 2 fuzzy logic systems; Section 4 proposes a speedup of the IT2FLS using the GPU and CUDA; Section 5 presents experimental results for the IT2FLS implemented on the GPU in comparison with the CPU; Section 6 gives the conclusion and future work.
2. Graphics Processor Units and CUDA
Traditionally, graphics operations, such as mathematical transformations between coordinate spaces, rasterization, and shading operations have been performed on the CPU. GPUs were invented in order to offload these specialized procedures to advanced hardware better suited for the task at hand. Because of the popularity of gaming, movies, and computer-aided design, these devices are advancing at an impressive rate. Classically, before the advent of CUDA, general purpose programming on a GPU (GPGPU) was performed by translating a computational procedure into a graphics format that could be executed in the standard graphics pipeline. This refers to the process of encoding data into a texture format, identifying sampling procedures to access this data, and converting the algorithms into a process that utilized rasterization (the mapping of array indices to graphics fragments) and frame buffer objects (FBO) for multipass rendering. GPUs are specialised stream processing devices.
This processing model takes batches of elements and applies a similar, independent calculation to all elements in parallel. Each calculation is performed by a program, typically called a kernel. GPUs are advancing at a faster rate than CPUs, and their architecture and stream processing design make them a natural choice for many algorithms that can be parallelised, such as computational intelligence algorithms.
nVIDIA's CUDA is a data-parallel computing environment that does not require the use of a graphics API, such as OpenGL, or a shader language. CUDA applications are written in C/C++. CPU and GPU programs are developed in the same environment (i.e., a single C/C++ program), and the GPU code is later translated from C/C++ into instructions to be executed by the GPU. nVIDIA has even gone as far as providing a CUDA Matlab plugin. A C/C++ program using CUDA can interface with a single GPU, or it can identify multiple GPUs and use them in parallel, allowing for unprecedented processing power on a desktop or workstation.
CUDA allows multiple kernels to be run simultaneously on a single GPU. CUDA refers to each kernel as a grid. A grid is a collection of blocks. Every block runs the same kernel but is independent of the other blocks (this has significance in terms of access to memory types). A block contains threads, which are the smallest divisible unit on a GPU. This architecture is shown in Figure 1.
Figure 1: CUDA processing model design [22].
The next critical component of a CUDA application is the memory model. There are multiple types of memory, and each has a different access time. GPU memory is divided into read-write per-thread registers, read-write per-thread local memory, read-write per-block shared memory, read-write per-grid global memory, read-only per-grid constant memory, and read-only per-grid texture memory. This model is shown in Figure 2.
Figure 2: CUDA GPU memory model design [22].
Texture and constant memory have relatively small access latency, while global memory has the largest access latency. Applications should minimize the number of global memory reads and writes. This is typically achieved by having each thread read its data from global memory and store it in shared memory (a block-level memory structure with smaller access latency than global memory). Threads in a block synchronize after this step. Memory is allocated on the GPU with a mechanism similar to malloc in C, using the functions cudaMalloc and cudaMallocArray. GPU functions that can be called by the host (the CPU) are prefixed with the qualifier __global__, GPU functions that can only be called from the GPU are prefixed with __device__, and standard functions that are called from and executed on the CPU are prefixed with __host__ (this qualifier can be omitted, as it is the default). GPU functions can take parameters, as in C. When the CPU needs to pass only a few variables to the GPU, parameters are a good choice; otherwise, as in the case of large arrays, the data should be stored in global, constant, or texture memory, and a pointer to this memory is passed to the GPU function. Whenever possible, data should be kept on the GPU and not transferred back and forth to the CPU.
3. Interval Type 2 Fuzzy Logic Systems
3.1. Type 2 Fuzzy Sets
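A type 2 fuzzy set in $X$ is denoted $\tilde{A}$, and its membership grade at $x \in X$ is $\mu_{\tilde{A}}(x,u)$, $u \in J_x \subseteq [0,1]$, which is a type 1 fuzzy set in $[0,1]$. The elements of the domain of $\mu_{\tilde{A}}(x,u)$ are called the primary memberships of $x$ in $\tilde{A}$, and the memberships of the primary memberships in $\mu_{\tilde{A}}(x,u)$ are called the secondary memberships of $x$ in $\tilde{A}$.
Definition 1.
A type 2 fuzzy set, denoted $\tilde{A}$, is characterized by a type 2 membership function $\mu_{\tilde{A}}(x,u)$, where $x \in X$ and $u \in J_x \subseteq [0,1]$, that is,
$$\tilde{A} = \{((x,u), \mu_{\tilde{A}}(x,u)) \mid \forall x \in X, \forall u \in J_x \subseteq [0,1]\} \tag{1}$$
or
$$\tilde{A} = \int_{x \in X} \int_{u \in J_x} \mu_{\tilde{A}}(x,u) / (x,u), \quad J_x \subseteq [0,1] \tag{2}$$
in which $0 \le \mu_{\tilde{A}}(x,u) \le 1$.
At each value of $x$, say $x = x'$, the 2D plane whose axes are $u$ and $\mu_{\tilde{A}}(x',u)$ is called a vertical slice of $\mu_{\tilde{A}}(x,u)$. A secondary membership function is a vertical slice of $\mu_{\tilde{A}}(x,u)$. It is $\mu_{\tilde{A}}(x = x',u)$ for $x' \in X$ and for all $u \in J_{x'} \subseteq [0,1]$, that is,
$$\mu_{\tilde{A}}(x = x',u) \equiv \mu_{\tilde{A}}(x') = \int_{u \in J_{x'}} f_{x'}(u) / u, \quad J_{x'} \subseteq [0,1] \tag{3}$$
in which $0 \le f_{x'}(u) \le 1$.
In terms of embedded fuzzy sets, a type 2 fuzzy set [1] is the union of its type 2 embedded sets, that is,
$$\tilde{A} = \sum_{j=1}^{n} \tilde{A}_e^j, \tag{4}$$
where $n \equiv \prod_{i=1}^{N} M_i$ and $\tilde{A}_e^j$ denotes the $j$th type 2 embedded set of $\tilde{A}$, that is,
$$\tilde{A}_e^j \equiv \{(u_i^j, f_{x_i}(u_i^j)),\ i = 1, 2, \ldots, N\}, \tag{5}$$
where $u_i^j \in \{u_{ik},\ k = 1, \ldots, M_i\}$.
A type 2 fuzzy set is called an interval type 2 fuzzy set if its secondary membership function satisfies $f_{x'}(u) = 1$ for all $u \in J_{x'}$. That is, an interval type 2 fuzzy set is defined as follows.
Definition 2.
An interval type 2 fuzzy set $\tilde{A}$ is characterized by an interval type 2 membership function $\mu_{\tilde{A}}(x,u) = 1$, where $x \in X$ and $u \in J_x \subseteq [0,1]$, that is,
$$\tilde{A} = \{((x,u), 1) \mid \forall x \in X, \forall u \in J_x \subseteq [0,1]\}. \tag{6}$$
The uncertainty of $\tilde{A}$ is described by its footprint of uncertainty (FOU), which is the union of all primary memberships, that is, $\mathrm{FOU}(\tilde{A}) = \bigcup_{x \in X} J_x$. The upper and lower membership functions (UMF/LMF) of $\tilde{A}$, denoted $\bar{\mu}_{\tilde{A}}(x)$ and $\underline{\mu}_{\tilde{A}}(x)$, are two type 1 membership functions that bound the FOU.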
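To make the FOU concrete, the following minimal C++ sketch models an interval type 2 fuzzy set as a pair of type 1 membership functions, one for the UMF and one for the LMF. The trapezoidal shape is chosen because the experiments later mention trapezoidal membership functions; the type names `Trapezoid`, `Interval`, and `IT2Set`, the parameter names, and the scaled LMF height are illustrative assumptions and are not taken from the paper.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical type 1 trapezoidal membership function with feet a and d,
// shoulders b and c (a <= b <= c <= d), and peak height h in (0, 1].
struct Trapezoid {
    double a, b, c, d, h;
    double operator()(double x) const {
        if (x <= a || x >= d) return 0.0;          // outside the support
        if (x < b) return h * (x - a) / (b - a);   // rising edge
        if (x <= c) return h;                      // plateau
        return h * (d - x) / (d - c);              // falling edge
    }
};

// Membership interval [lower, upper] of one input value.
struct Interval { double lower, upper; };

// Interval type 2 set: the FOU is the region between lmf and umf.
struct IT2Set {
    Trapezoid umf, lmf;
    Interval grade(double x) const {
        assert(umf.h > 0.0 && lmf.h > 0.0);
        const double up = umf(x);
        const double lo = std::min(lmf(x), up);    // keep the LMF inside the UMF
        return {lo, up};
    }
};
```

For example, with a UMF of (0, 2, 4, 6) at height 1 and an LMF of (1, 2.5, 3.5, 5) at height 0.8, the input x = 3 receives the membership interval [0.8, 1].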
3.2. Interval Type 2 Fuzzy Logic Systems (IT2FLSs)
The general type 2 fuzzy logic system is shown in Figure 3. The output stage of a type 2 fuzzy logic system consists of two blocks: the type reducer and the defuzzifier. The type reducer maps a type 2 fuzzy set to a type 1 fuzzy set, and the defuzzifier maps a fuzzy set to a crisp value. The membership function of an interval type 2 fuzzy set is described by its FOU, which is bounded by two type 1 membership functions, the UMF and the LMF (see Figure 4).
Figure 3: Diagram of a type 2 fuzzy logic system [4].
Figure 4: The membership function of an interval type 2 fuzzy set [1].
The combination of the antecedents in a rule of the IT2FLS is called the firing strength process and is illustrated in Figure 5.
Figure 5: The combination of antecedents in a rule for IT2FLS [23].
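The firing strength process can be sketched in C++ as follows. The sketch assumes the minimum t-norm, which matches the later description of the first GPU kernel (the minimum of the antecedents is stored per rule); the names `Interval` and `firing_interval` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Membership interval [lower, upper] of one antecedent.
struct Interval { double lower, upper; };

// Combine the antecedent membership intervals of one rule with the
// minimum t-norm, giving the firing interval of that rule.
Interval firing_interval(const std::vector<Interval>& grades) {
    assert(!grades.empty());
    Interval f = grades.front();
    for (const Interval& g : grades) {
        f.lower = std::min(f.lower, g.lower);  // combined LMF grades
        f.upper = std::min(f.upper, g.upper);  // combined UMF grades
    }
    return f;
}
```

With two antecedents graded [0.3, 0.7] and [0.5, 0.6], the rule fires with the interval [0.3, 0.6].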
In the IT2FLS, the calculation of the outputs involves five steps: fuzzification, combining the antecedents (applying the fuzzy operators), implication, aggregation, and defuzzification.
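Because each pattern has a membership interval bounded by the upper grade $\bar{\mu}_{\tilde{A}}(x)$ and the lower grade $\underline{\mu}_{\tilde{A}}(x)$, each centroid is represented by the interval between $c_L$ and $c_R$. The following iterative algorithm finds $c_L$ and $c_R$.
Step 1.
Calculate $\theta_i$ by the following equation:
$$\theta_i = \frac{1}{2}\left[\bar{\mu}(x_i) + \underline{\mu}(x_i)\right]. \tag{7}$$
Step 2.
Calculate $c'$ as follows:
$$c' = c(\theta_1, \theta_2, \ldots, \theta_N) = \frac{\sum_{i=1}^{N} x_i \theta_i}{\sum_{i=1}^{N} \theta_i}. \tag{8}$$
Step 3.
Find $k$ such that $x_k \le c' \le x_{k+1}$.
Step 4.
Calculate $c''$ as follows. When $c''$ is used for finding $c_L$,
$$c'' = \frac{\sum_{i=1}^{k} x_i \bar{\mu}(x_i) + \sum_{i=k+1}^{N} x_i \underline{\mu}(x_i)}{\sum_{i=1}^{k} \bar{\mu}(x_i) + \sum_{i=k+1}^{N} \underline{\mu}(x_i)}. \tag{9}$$
When $c''$ is used for finding $c_R$,
$$c'' = \frac{\sum_{i=1}^{k} x_i \underline{\mu}(x_i) + \sum_{i=k+1}^{N} x_i \bar{\mu}(x_i)}{\sum_{i=1}^{k} \underline{\mu}(x_i) + \sum_{i=k+1}^{N} \bar{\mu}(x_i)}. \tag{10}$$
Step 5.
If $c' = c''$, go to Step 6; otherwise set $c' = c''$ and go back to Step 3.
Step 6.
Set $c_L = c'$ or $c_R = c'$.
Finally, compute the mean of the centroid interval, $y$, as
$$y = \frac{c_R + c_L}{2}. \tag{11}$$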
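Steps 1 to 6 and the final averaging can be sketched on the CPU as below. This is an illustrative C++ version, not the paper's code: the function names are hypothetical, the sample points are assumed to be sorted in ascending order, and the exact equality test of Step 5 is replaced by a small tolerance `tol` with an iteration cap `max_iter`, both of which are assumptions made to keep the floating-point loop well behaved.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted average sum(x_i * w_i) / sum(w_i), as in equation (8).
double weighted_mean(const std::vector<double>& x, const std::vector<double>& w) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) { num += x[i] * w[i]; den += w[i]; }
    assert(den > 0.0);
    return num / den;
}

// One endpoint of the centroid interval: c_L if left is true, else c_R.
// x must be sorted ascending; upper[i] and lower[i] are the UMF and LMF grades at x[i].
double km_endpoint(const std::vector<double>& x, const std::vector<double>& upper,
                   const std::vector<double>& lower, bool left,
                   double tol = 1e-9, int max_iter = 100) {
    const std::size_t n = x.size();
    assert(n >= 2 && upper.size() == n && lower.size() == n);
    std::vector<double> w(n);
    for (std::size_t i = 0; i < n; ++i) w[i] = 0.5 * (upper[i] + lower[i]);  // Step 1
    double c = weighted_mean(x, w);                                          // Step 2
    for (int it = 0; it < max_iter; ++it) {
        std::size_t k = 0;                                                   // Step 3
        while (k + 2 < n && x[k + 1] < c) ++k;      // now x[k] <= c <= x[k+1]
        for (std::size_t i = 0; i < n; ++i) {                                // Step 4
            const bool first_part = (i <= k);
            // c_L: upper grades up to k, lower grades after; c_R: the reverse.
            w[i] = (first_part == left) ? upper[i] : lower[i];
        }
        const double c_new = weighted_mean(x, w);
        if (std::fabs(c_new - c) <= tol) return c_new;                       // Steps 5 and 6
        c = c_new;
    }
    return c;
}

// Crisp output y = (c_R + c_L) / 2, as in equation (11).
double km_defuzzify(const std::vector<double>& x, const std::vector<double>& upper,
                    const std::vector<double>& lower) {
    return 0.5 * (km_endpoint(x, upper, lower, true) + km_endpoint(x, upper, lower, false));
}
```

For the symmetric example x = {0, 1, 2, 3, 4} with upper grades of 1 and lower grades of 0.5 everywhere, the sketch converges to c_L = 11/7 and c_R = 17/7, so y = 2.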
4. Speedup of IT2FLS Using GPU and CUDA
The first step in implementing the IT2FLS on the GPU is the selection of memory types and sizes. This is a critical step because the choice of format and type dictates performance. Memory should be laid out so that read and write accesses are sequential as far as the algorithm permits.
Let the number of inputs be N, the number of parameters that define a membership function be P, the number of rules be R and the discretization rate be S. Inputs are stored on the GPU as a one-dimensional array of size N (see Figure 6).
Figure 6: Input vector.
The consequents are stored on the CPU as a two-dimensional array of size R×P. They are used only on the CPU, when calculating the discrete fuzzy set membership values.
The antecedents are a two-dimensional array on the GPU of size R×(N×P).
The fired antecedents are a one-dimensional array of size R on the GPU, which stores the result of combining the antecedents of each rule (see Figures 7, 8, and 9). The last memory layout is the discretized consequent, which is an S×R matrix created on the GPU.
Figure 7: Consequent.
Figure 8: Antecedent.
Figure 9: Discretized consequent.
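The array sizes above can be sketched as flat, row-major host buffers, which is also how they would typically be copied to the GPU. The struct `FlsBuffers`, its member names, and the row-major ordering are assumptions for illustration; the sketch stores a single value per entry, so a real IT2FLS would need one such set of values for the UMF and one for the LMF.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the memory layout.
// N inputs, P parameters per membership function, R rules, S samples.
struct FlsBuffers {
    std::size_t N, P, R, S;
    std::vector<float> inputs;       // size N
    std::vector<float> antecedents;  // size R x (N x P), one row per rule
    std::vector<float> fired;        // size R, one combined value per rule
    std::vector<float> consequents;  // size S x R, discretized consequents

    FlsBuffers(std::size_t n, std::size_t p, std::size_t r, std::size_t s)
        : N(n), P(p), R(r), S(s),
          inputs(n), antecedents(r * n * p), fired(r), consequents(s * r) {}

    // Offset of parameter `param` of input `input` in rule `rule`.
    std::size_t antecedent_index(std::size_t rule, std::size_t input,
                                 std::size_t param) const {
        assert(rule < R && input < N && param < P);
        return rule * (N * P) + input * P + param;
    }

    // Offset of rule `rule` at sample point `sample`.
    std::size_t consequent_index(std::size_t sample, std::size_t rule) const {
        assert(sample < S && rule < R);
        return sample * R + rule;
    }
};
```

Keeping each rule's parameters contiguous means that the thread handling rule r reads one sequential run of N×P values.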
The inputs and antecedents are stored in texture memory because they do not change during the firing of an FLS, although they can change between consecutive firings and then need to be updated. Figure 10 shows the proposed GPU program flow for a CUDA application computing an IT2FLS.
Figure 10: IT2FLS diagram for a CUDA application on the GPU.
In the IT2FLS, every membership computation produces two values, one for the UMF and one for the LMF. The first step is a kernel that fuzzifies the inputs and combines the antecedents. The next steps are implication and the aggregation of the rule outputs. The last GPU kernel is the defuzzification step.
The first kernel reads from the input and antecedent textures and stores its results in the fired antecedents region of global memory. All inputs are sampled for each rule: the rth rule samples the rth row of the antecedent memory, the membership values are calculated, and the minimum of the antecedents is computed and stored in the rth element of the fired antecedents array. This kernel uses B blocks, partly because the number of threads that can be created per block is limited (the current maximum is 512 threads). One must also consider the number of threads together with the registers and local memory that a kernel requires, in order to avoid memory overflow. These limits can be looked up for each GPU. We limited the number of threads per block to 128 (an empirical value found by trying different block and thread profiles for a system with two inputs and trapezoidal membership functions). In general, a kernel should fetch a small number of data points and have high arithmetic intensity. This is why each thread makes only a few memory fetches, and why the membership calculations and the combination step are performed in a single kernel.
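The mapping from blocks and threads to rules can be emulated on the CPU as below. The paper does not state how B is computed; the ceiling division used here is a common convention and is an assumption, as are the names `block_count` and `emulate_launch`. In CUDA the global index would be formed as `blockIdx.x * blockDim.x + threadIdx.x`, which the two nested loops imitate.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Threads per block, the empirical value reported in the text.
const std::size_t kThreadsPerBlock = 128;

// Assumed rule: smallest number of blocks whose threads cover all rules.
std::size_t block_count(std::size_t rules) {
    return (rules + kThreadsPerBlock - 1) / kThreadsPerBlock;
}

// CPU emulation of a one-dimensional launch with one thread per rule.
// Returns how many times each rule index was visited.
std::vector<int> emulate_launch(std::size_t rules) {
    std::vector<int> visits(rules, 0);
    const std::size_t blocks = block_count(rules);
    for (std::size_t block = 0; block < blocks; ++block) {
        for (std::size_t thread = 0; thread < kThreadsPerBlock; ++thread) {
            const std::size_t r = block * kThreadsPerBlock + thread;
            if (r < rules) ++visits[r];  // guard: surplus threads do nothing
        }
    }
    return visits;
}
```

Under this assumption, the 24 rules of the experiment fit in a single block, and 512 rules need 4 blocks.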
The next steps are the implication and rule aggregation kernels. At first, one might imagine that using two separate kernels to calculate the implication results and the rule aggregation would be desirable. However, a standalone implication kernel, which simply calculates the minimum of the respective combined antecedent result and the discretized consequent, is inefficient. As stated above, the ratio of arithmetic operations to memory operations is important: a kernel should have more arithmetic intensity than memory access. To minimize the number of global memory samples, we perform implication in the first step of the reduction. Reduction, in this context, is the repeated application of an operation to a series of elements to produce a single scalar result. In the case of rule aggregation, this is the application of the maximum operator over the rules at each discrete consequent sample point. The advantage of GPU reduction is that it uses the parallel processing units to carry out a divide-and-conquer strategy. The last step is the defuzzification kernel. As described above, reduction is used for both rule output aggregation and defuzzification in the IT2FLS on the GPU.
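The fused implication and aggregation can be sketched on the CPU for a single sample point as follows. The pairwise loop imitates the halving pattern of a parallel tree reduction but runs sequentially; the function name `aggregate_sample` and the padding to a power of two are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Aggregated output at one discretized sample point.
// fired[r] is the combined antecedent of rule r; consequent_row[r] is the
// discretized consequent of rule r at this sample point.
float aggregate_sample(const std::vector<float>& fired,
                       const std::vector<float>& consequent_row) {
    const std::size_t rules = fired.size();
    assert(rules > 0 && consequent_row.size() == rules);

    // Pad to a power of two with 0, the identity of max on [0, 1].
    std::size_t width = 1;
    while (width < rules) width *= 2;
    std::vector<float> buf(width, 0.0f);

    // Implication (minimum) is applied while loading, so no separate pass
    // over global memory is needed.
    for (std::size_t r = 0; r < rules; ++r)
        buf[r] = std::min(fired[r], consequent_row[r]);

    // Tree reduction with the maximum operator: each pass halves the range.
    for (std::size_t stride = width / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)
            buf[i] = std::max(buf[i], buf[i + stride]);

    return buf[0];
}
```

On a GPU, every iteration of the inner loop would be handled by its own thread, so the number of sequential passes grows only with the logarithm of the number of rules.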
The output of rule output aggregation is two rows in the discretized consequent global memory array. The defuzzification step uses the Karnik-Mendel algorithm, whose two inputs are rule_combine_UMF and rule_combine_LMF and whose two outputs are $y_l$ and $y_r$. The crisp output $y$ is calculated as $y = (y_l + y_r)/2$.
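The steps for finding $y_l$ and $y_r$ on the GPU are as follows (notation: Rule_Combine_UMF(i) = $\bar{\mu}_A(x_i)$, Rule_Combine_LMF(i) = $\underline{\mu}_A(x_i)$, and $N$ is the sample rate).
Step 1.
Calculate $\theta_i$ on the GPU by the following equation:
$$\theta_i = \frac{1}{2}\left[\bar{\mu}_A(x_i) + \underline{\mu}_A(x_i)\right]. \tag{12}$$
Step 2.
Calculate $c'$ on the GPU as follows:
$$c' = c(\theta_1, \theta_2, \ldots, \theta_N) = \frac{\sum_{i=1}^{N} x_i \theta_i}{\sum_{i=1}^{N} \theta_i}. \tag{13}$$
Next, copy $c'$ to host memory.
Step 3.
Find $k$ such that $x_k \le c' \le x_{k+1}$ (calculated on the CPU).
Step 4.
Calculate $c''$ on the GPU as follows. When $c''$ is used for finding $y_l$,
$$c'' = \frac{\sum_{i=1}^{k} x_i \bar{\mu}_A(x_i) + \sum_{i=k+1}^{N} x_i \underline{\mu}_A(x_i)}{\sum_{i=1}^{k} \bar{\mu}_A(x_i) + \sum_{i=k+1}^{N} \underline{\mu}_A(x_i)}. \tag{14}$$
When $c''$ is used for finding $y_r$,
$$c'' = \frac{\sum_{i=1}^{k} x_i \underline{\mu}_A(x_i) + \sum_{i=k+1}^{N} x_i \bar{\mu}_A(x_i)}{\sum_{i=1}^{k} \underline{\mu}_A(x_i) + \sum_{i=k+1}^{N} \bar{\mu}_A(x_i)}. \tag{15}$$
Next, copy $c''$ to host memory.
Step 5.
If $c' = c''$, go to Step 6; otherwise set $c' = c''$ and go back to Step 3 (calculated on the CPU).
Step 6.
Set $y_l = c'$ or $y_r = c'$.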
5. Experiments
5.1. Problems
We implement the IT2FLS for the collision avoidance behavior of robot navigation. The fuzzy logic system has two inputs, the extended fuzzy directional relation (FDR) [24] and the range to the obstacle; the output is the angle of deviation (AoD). The fuzzy rules have the following form:
IF FDR is $\tilde{A}_i$ AND Range is $\tilde{B}_i$ THEN AoD is $\tilde{C}_i$, where $\tilde{A}_i$ and $\tilde{B}_i$ are the type 2 fuzzy sets of the antecedent and $\tilde{C}_i$ is the type 2 fuzzy set of the consequent.
The fuzzy directional relation has six linguistic values (NLarge, NMedium, NSmall, PSmall, PMedium, and PLarge). The range from the robot to the obstacle is divided into four subsets: VNear, Near, Medium, and Far. The output of each fuzzy if-then rule is a linguistic variable representing the angle of deviation; it has the same six linguistic values as the fuzzy directional relation but with different membership functions. The linguistic values are interval type 2 fuzzy subsets whose membership functions are shown in Figures 11, 12, and 13. The problem is built with the 24 rules given in Table 1.
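Table 1: The rule base of collision avoidance behavior.

| FDR | Range | AoD | FDR | Range | AoD |
|-----|-------|-----|-----|-------|-----|
| NS | VN | PL | PS | VN | NL |
| NS | N | PL | PS | N | NL |
| NS | M | PM | PS | M | NM |
| NS | F | PS | PS | F | NS |
| NM | VN | PM | PM | VN | NM |
| NM | N | PM | PM | N | NM |
| NM | M | PM | PM | M | NM |
| NM | F | PS | PM | F | NS |
| NL | VN | PM | PL | VN | NM |
| NL | N | PM | PL | N | NM |
| NL | M | PS | PL | M | NS |
| NL | F | PS | PL | F | NS |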
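The 24 rules of Table 1 can be encoded as a small lookup table indexed by the FDR and Range labels. The sketch below is only one possible encoding; the enum names, the array `kRuleBase`, and the function `consequent` are hypothetical, and the abbreviations follow the linguistic values listed above (NL = NLarge, VN = VNear, and so on).

```cpp
#include <cassert>

// Linguistic labels shared by FDR (input) and AoD (output).
enum class Label { NL, NM, NS, PS, PM, PL };
// Linguistic labels of the range input.
enum class Range { VN, N, M, F };

using L = Label;

// kRuleBase[fdr][range] is the AoD label of the rule
// "IF FDR is fdr AND Range is range THEN AoD is ...", copied from Table 1.
const Label kRuleBase[6][4] = {
    /* FDR = NL */ {L::PM, L::PM, L::PS, L::PS},
    /* FDR = NM */ {L::PM, L::PM, L::PM, L::PS},
    /* FDR = NS */ {L::PL, L::PL, L::PM, L::PS},
    /* FDR = PS */ {L::NL, L::NL, L::NM, L::NS},
    /* FDR = PM */ {L::NM, L::NM, L::NM, L::NS},
    /* FDR = PL */ {L::NM, L::NM, L::NS, L::NS},
};

// Look up the consequent label of one rule.
Label consequent(Label fdr, Range range) {
    const int f = static_cast<int>(fdr);
    const int r = static_cast<int>(range);
    assert(f >= 0 && f < 6 && r >= 0 && r < 4);
    return kRuleBase[f][r];
}
```

The table is antisymmetric in the sign of the direction: each positive FDR row is the corresponding negative row with every AoD label mirrored.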
Figure 11: Membership grades of FDR.
Figure 12: Membership grades of Range.
Figure 13: Membership grades of AoD.
5.2. Experiments
The performance of the GPU implementation of the IT2FLS was compared with that of a CPU implementation. The program was written in C/C++ as a console application, built with Microsoft Visual Studio 2008, and run on a computer with 32-bit Windows 7 and nVIDIA CUDA support. The specifications were as follows.
The CPU was a Core i3-2310M at 2.1 GHz, and the system had 2 GB of system RAM (DDR3).
The GPU was an nVIDIA GeForce GT 540M graphics card with 96 CUDA cores, 1 GB of texture memory, and a PCI Express x16 interface.
The number of inputs was fixed to 2, the number of rules was varied between 32, 64, 128, 256, and 512, and sample rate was varied between 256, 512, 1024, 2048, 4096, and 8192.
We take the ratio of CPU to GPU performance. A value below 1 indicates that the CPU performs better, and a value above 1 indicates that the GPU performs better. The CPU/GPU performance ratios for the IT2FLS are given in Table 2, and the run-time graph of the implementation is shown in Figure 14.
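Table 2: CPU/GPU performance ratio (rows: number of rules R; columns: sample rate S).

| R \ S | 128 | 256 | 512 | 1024 | 2048 | 4096 | 8192 |
|-------|------|------|------|-------|-------|-------|-------|
| 32 | 0.1 | 0.24 | 0.41 | 0.76 | 1.47 | 2.64 | 3.22 |
| 64 | 0.21 | 0.39 | 0.73 | 1.58 | 2.7 | 4.69 | 8.99 |
| 128 | 0.32 | 0.76 | 1.25 | 3.07 | 3.95 | 9.01 | 18.26 |
| 256 | 0.72 | 1.19 | 2.95 | 5.35 | 10.7 | 17.56 | 21.3 |
| 512 | 0.77 | 2.43 | 3.32 | 12.57 | 18.43 | 20.51 | 29.3 |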
Figure 14: Run-time graph of the problem implementation.
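The point at which the GPU starts to outperform the CPU can be read off Table 2 programmatically. The sketch below copies the ratios from Table 2 and, for each number of rules, returns the smallest sample rate whose ratio exceeds 1; the names `kRatio` and `switch_point` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Sample rates (columns) and rule counts (rows) of Table 2.
const int kSampleRates[7] = {128, 256, 512, 1024, 2048, 4096, 8192};
const int kRuleCounts[5] = {32, 64, 128, 256, 512};

// CPU/GPU performance ratios copied from Table 2.
const double kRatio[5][7] = {
    {0.10, 0.24, 0.41, 0.76, 1.47, 2.64, 3.22},
    {0.21, 0.39, 0.73, 1.58, 2.70, 4.69, 8.99},
    {0.32, 0.76, 1.25, 3.07, 3.95, 9.01, 18.26},
    {0.72, 1.19, 2.95, 5.35, 10.70, 17.56, 21.30},
    {0.77, 2.43, 3.32, 12.57, 18.43, 20.51, 29.30},
};

// Smallest sample rate at which the GPU wins for the given row of Table 2,
// or -1 if the ratio never exceeds 1 in that row.
int switch_point(std::size_t rule_row) {
    assert(rule_row < 5);
    for (std::size_t s = 0; s < 7; ++s)
        if (kRatio[rule_row][s] > 1.0) return kSampleRates[s];
    return -1;
}
```

For 32 rules the GPU wins from a sample rate of 2048 upward, whereas for 256 and 512 rules it already wins at a sample rate of 256.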
6. Conclusion
This paper has demonstrated an implementation of an interval type 2 FLS on a GPU that does not use a graphics API and can therefore be used by any researcher with knowledge of C/C++. We have shown that the CPU outperforms the GPU for small systems. As the number of rules and the sample rate grow, the GPU outperforms the CPU. There is a switch point in the performance ratio matrix (Table 2) that indicates when the GPU becomes more efficient than the CPU. With a sample rate of 8192 and 512 rules, the GPU runs approximately 30 times faster than the CPU on the test computer.
Future work will extend the interval type 2 FLS to the general type 2 FLS and apply it to various applications.
Acknowledgment
This paper is sponsored by Intelligent Robot Project at LQDTU and the Research Fund RFit@LQDTU, Faculty of Information Technology, Le Quy Don University.
References
[1] Karnik N. N., Mendel J. M., Liang Q. Type-2 fuzzy logic systems.
[2] Karnik N. N., Mendel J. M. Centroid of a type-2 fuzzy set.
[3] Liang Q., Mendel J. M. Interval type-2 fuzzy logic systems: theory and design.
[4] Mendel J. M., John R. I., Liu F. Interval type-2 fuzzy logic systems made simple.
[5] Liu F. An efficient centroid type-reduction strategy for general type-2 fuzzy logic system.
[6] Ngo L. T., Pham L. T., Nguyen P. H., Hirota K. On approximate representation of type-2 fuzzy sets using triangulated irregular network.
[7] Ngo L. T., Pham L. T., Nguyen P. H., Hirota K. Refinement geometric algorithms for type-2 fuzzy set operations. Proceedings of the IEEE International Conference on Fuzzy Systems, August 2009, pp. 866–871. doi:10.1109/FUZZY.2009.5277211.
[8] Ngo L. T. Refinement CTIN for general type-2 fuzzy logic systems. Proceedings of the IEEE International Conference on Fuzzy Systems (IEEE-FUZZ '11), 2011, Hanoi, Vietnam, pp. 1225–1232.
[9] Starczewski J. T. Efficient triangular type-2 fuzzy logic systems.
[10] Hagras H. A. A hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots.
[11] Castillo O., Melin P. A review on the design and optimization of interval type-2 fuzzy controllers.
[12] Sepúlveda R., Montiel O., Castillo O., Melin P. Modelling and simulation of the defuzzification stage of a type-2 fuzzy controller using VHDL code.
[13] Maldonado Y., Castillo O., Melin P. Optimization of membership functions for an incremental fuzzy PD control based on genetic algorithms.
[14] Sepúlveda R., Montiel-Ross O., Castillo O., Melin P. Embedding a KM type reducer for high speed fuzzy controller into an FPGA.
[15] Anderson D. T., Luke R. H., Keller J. M. Speedup of fuzzy clustering through stream processing on graphics processing units.
[16] Anderson D., Luke R. H., Keller J. M. Incorporation of non-euclidean distance metrics into fuzzy clustering on graphics processing units. Proceedings of the Inertial Fusion Sciences and Applications (IFSA '07), vol. 41, 2007, pp. 128–139. doi:10.1007/978-3-540-72432-2_14.
[17] Sejun K., Donald C. A GPU based parallel hierarchical fuzzy ART clustering. Proceedings of the International Joint Conference on Neural Networks, vol. 1, 2011, Rolla, Mo, USA, pp. 2778–2782.
[18] Chiosa I., Kolb A. GPU-based multilevel clustering.
[19] Chia F., Teng C., Wei Y. Speedup of implementing fuzzy neural networks with high-dimensional inputs through parallel processing on graphic processing units.
[20] Harvey N., Luke R., Keller J. M., Anderson D. Speedup of fuzzy logic through stream processing on graphics processing units. Proceedings of the IEEE Congress on Evolutionary Computation (CEC '08), June 2008, pp. 3809–3815. doi:10.1109/CEC.2008.4631314.
[21] Anderson D. Parallelisation of fuzzy inference on a graphics processor unit using the compute unified device architecture. Proceedings of the UK Workshop on Computational Intelligence (UKCI '08), 2008, pp. 1–6.
[22] Harris M. Optimizing Parallel Reduction in CUDA. NVIDIA Whitepaper, 18 March 2008. http://www.nvidia.com/object/cuda/sample/dat/a%20-%20parallel.html
[23] Imam R., Kharisma B. Design of interval type-2 fuzzy logic based power system stabilizer.
[24] Ngo L. T., Pham L. T., Nguyen P. H. Extending fuzzy directional relationship and applying for mobile robot collision avoidance behavior.