A Fading Channel Simulator Implementation Based on GPU Computing Techniques

Channel simulators are powerful tools that permit performance tests of the individual parts of a wireless communication system. This is relevant when new communication algorithms are tested, because it allows us to determine whether they fulfill the requirements of the communications standard. One of these tests consists of evaluating the system performance when a communication channel is considered. In this sense, it is possible to model the channel as an FIR filter with time-varying random coefficients. Increasing the number of coefficients gives a better approximation to real scenarios, but it also increases the computational complexity. To address this issue, a design methodology for computing the time-varying coefficients of fading channel simulators using consumer-grade graphics processing units (GPUs) is proposed. With GPUs and the proposed methodology, users who are not specialists in parallel computing can accelerate their simulations compared with conventional software. Implementation results show that the proposed approach allows the easy generation of communication channels while reducing the processing time. Finally, the GPU-based implementation outperforms the CPU-based implementation, owing to the scattered nature of the channel.


Introduction
Currently, the high demand for integrated services (voice, data, and video) means that new data transmission schemes have to be developed to handle high transmission data rates while offering high levels of quality of service. The fourth generation (4G) of mobile communication systems is still under development; its main goal is to provide a digital communication network (land, mobile, and satellite) with peak data rates of 100 Mbps for high-mobility devices and 1 Gbps for users or devices in low-mobility environments or stationary conditions. The main technologies used in 4G include techniques based on multiple-input and multiple-output (MIMO) antennas, turbo decoding, adaptive modulation, coding schemes and error correction, and orthogonal frequency-division multiple access (orthogonal FDMA, OFDMA) [1,2]. Current versions of standards that incorporate 4G are LTE-A (Long Term Evolution-Advanced) and IEEE 802.16m mobile WiMAX (Worldwide Interoperability for Microwave Access). Therefore, the new issues imposed by the standards require new processing algorithms to be tested in high-mobility environments affected by Doppler shifts (time-selective channels) and multipath propagation (frequency-selective channels). The temporal channel variability occurs when the characteristics of the transmission medium change over time or when there is relative motion between the receiver and transmitter, as in communication systems such as LTE and WiMAX. The frequency selectivity appears when multiple copies of the transmitted signal arrive at the receiver due to physical mechanisms such as multipath propagation.
Moreover, knowing the behavior or performance of a mobile communication system under real conditions (in situ test) can be very expensive, owing to the transfer of the communications system and test equipment to the place under study, among other issues. Additionally, the system behavior cannot be tested under the same propagation conditions due to the nature of the communication channel. Faced with this problem, an economical alternative is to use mathematical models, which represent the radio channels under consideration. In this sense, we can define a channel simulator as a software tool that permits reproduction of the behavior or the propagation conditions of a mobile communications channel under controlled or laboratory conditions.
On the other hand, GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU in order to accelerate scientific, engineering, and business applications [3]. Recently, several works in the wireless communication area that use GPU devices have been published [4][5][6][7]. Those works follow an implementation strategy in order to handle the channel complexity using multiple cores. For example, in [4] a wireless channel simulator is implemented. In that work, the potential of GPU-based processing is studied in order to improve the runtime performance of computationally intensive, accurate wireless network simulation. In [5], the use of general-purpose GPUs is investigated in order to provide the computational capabilities required for performing the radio frequency path loss computation. A discussion of the acceleration of wireless channel simulation using GPUs is provided in [6]. In addition, in [7], an implementation of a parallel lattice-reduction-aided 2 × 2 MIMO detector using GPUs for the WiMAX standard is presented.
Although several works related to the use of GPUs in communication systems exist, there are currently no works that describe in detail the implementation of a fading channel simulator based on GPUs. In this paper, the methodology for implementing a fading channel simulator (time and frequency selective) via GPU computing is presented.
The proposed methodology considers the use of common GPU software libraries that permit users who are not specialists in GPU programming to easily implement the proposed simulator. On the other hand, the generation of the Rayleigh fading variates is achieved using the filtering method [8][9][10]. In this case, the filtering method is carried out in the time domain by using a finite impulse response (FIR) filter for coloring Gaussian noise samples. Furthermore, it is well known that if the filter order is increased, then the accuracy of the channel statistics can be improved, though at the cost of increasing the computational complexity. Therefore, in this work, we take advantage of GPUs for handling such computational complexity (multiplication and addition operations) in order to implement an accurate communication channel for SISO systems. Moreover, this methodology paves the way for implementing MIMO channel simulators in the future.
The rest of this paper is organized as follows. In Section 2, the background of the wireless communication system is stated, specifically as regards the communication channel model. In Section 3, how to simulate the communication channel is explained. Next, in Section 4, the GPU implementation of the fading channel simulator is detailed. Section 5 is devoted to presenting the implementation results when a WiMAX scenario is considered. Finally, the conclusions are presented in Section 6.

Communication System
Consider a single-input and single-output (SISO) communication system in which in-phase $x_I(t)$ and quadrature $x_Q(t)$ signals are modulated by orthogonal carriers $c_I(t)$ and $c_Q(t)$, respectively, and mixed to obtain $x(t)$. This signal $x(t)$ is propagated through the communication channel $H(t,\tau)$, which is considered to be a causal time-varying linear system. The signal filtered by the channel reaches the receiver, where a noisy version $y(t)$ is detected. It can be expressed mathematically as follows:

$$y(t) = \int_{0}^{\infty} H(t,\tau)\,x(t-\tau)\,d\tau + u(t), \qquad (1)$$

where

$$x(t) = x_I(t)\,c_I(t) + x_Q(t)\,c_Q(t).$$

For the sake of simplicity, if $c_I(t) = \cos(2\pi f_c t + \theta)$ and $c_Q(t) = \sin(2\pi f_c t + \theta)$, where $f_c$ is any carrier frequency and $\theta$ is any phase, the system becomes the well-known single-carrier communication system. It is important to emphasize that an OFDM system implemented with IFFT/FFT produces a base-band signal that is modulated as in a single-carrier system.
If we consider that both signals $x_I(t)$ and $x_Q(t)$ are band-limited to a maximum frequency of $f_{\max}$ and $f_c \gg f_{\max}$ (a condition that is always met in real communication systems), it is easy to demonstrate [11,12], with the aid of the Hilbert transform, the existence of base-band equivalent signals $\tilde{y}(t)$, $\tilde{x}(t)$, $\tilde{u}(t)$, and $\tilde{H}(t,\tau)$ for $y(t)$, $x(t)$, $u(t)$, and $H(t,\tau)$, respectively. In general, these equivalent base-band signals are complex, where the real part corresponds to the in-phase component and the imaginary part to the quadrature component; thus, $\tilde{x}(t) = x_I(t) + j\,x_Q(t)$ and $\tilde{y}(t) = y_I(t) + j\,y_Q(t)$ for $j = \sqrt{-1}$. The relations (2) between the original pass-band signals and their base-band equivalents follow [12], where Re(⋅) is the real part of the complex number in parentheses. Considering (2), the base-band equivalent of (1) is

$$\tilde{y}(t) = \int_{0}^{\infty} \tilde{H}(t,\tau)\,\tilde{x}(t-\tau)\,d\tau + \tilde{u}(t), \qquad (3)$$

which can be interpreted as a collection of multiple paths (scatterers) along which the transmitted signal $\tilde{x}(t)$ is propagated.
The fact that these paths have different lengths and pass through different propagation conditions causes the received signal from a specific path to be a delayed, attenuated, and phase-shifted version of $\tilde{x}(t)$. In this sense, for a specific time $t_1$ and a specific delay $\tau_1$, the channel coefficient $\tilde{H}(t_1,\tau_1)$ will be a complex variable whose magnitude represents the attenuation factor and whose phase represents the phase-shift factor. On the other hand, due to the constant changes in the environment and the possible relative movement between transmitter and receiver, these factors are time dependent. According to [12], $\tilde{H}(t,\tau)$ can be modeled as a complex stochastic process composed of the sum of a deterministic part (the ensemble average of $\tilde{H}(t,\tau)$) and a random part (a zero-mean random process). From this point, we will only consider the random part (an assumption generally accepted when a channel simulator is developed). The autocorrelation function of this random process is equal to

$$R_{\tilde{H}}(t_1,t_2;\tau_1,\tau_2) = E\{\tilde{H}(t_1,\tau_1)\,\tilde{H}^{*}(t_2,\tau_2)\}, \qquad (4)$$

where $E(\cdot)$ is the expectation operator and $(\cdot)^{*}$ represents the complex conjugate. This channel model is difficult to implement; nevertheless, some assumptions can be made that simplify the model. The first is the absence of correlation between the different scatterers, and the second is that each scatterer is a wide-sense stationary process, which together comprise the well-known wide-sense stationary uncorrelated scattering (WSSUS) model. Therefore, (4) transforms into

$$R_{\tilde{H}}(t_1,t_2;\tau_1,\tau_2) = R_{\tilde{H}}(\Delta t,\tau)\,\delta(\tau_1-\tau_2), \qquad (5)$$

where $\tau = \tau_1 = \tau_2$, $\Delta t = t_1 - t_2$, and $R_{\tilde{H}}(\Delta t,\tau)$ is the autocorrelation function with respect to the time difference variable $\Delta t$ for the scatterer located at the delay $\tau$. From (5), it is possible to calculate the scattering function, which is defined as the Fourier transform of the correlation function with respect to the time difference variable $\Delta t$, as follows:

$$S(\tau;\nu) = \mathcal{F}_{\Delta t}\{R_{\tilde{H}}(\Delta t,\tau)\}, \qquad (6)$$

where $\mathcal{F}\{\cdot\}$ is the Fourier transform operator and $\nu$ is the resulting (Doppler) frequency variable. This scattering function $S(\tau;\nu)$ gives the Doppler spectrum for a given value of the delay variable $\tau$.
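In many communication standards, a discrete number of scatterers is considered instead of a continuum, as suggested in the previous equations. Under this assumption,

$$\tilde{H}(t,\tau) = \sum_{l=0}^{L-1} a_l(t)\,\delta(\tau-\tau_l), \qquad (7)$$

where $l$ is an index variable that enumerates the discrete scatterers from 0 to $L-1$ and $a_l(t)$ is a complex variable that encloses the gain and phase-shift factor of the scatterer located at delay $\tau_l$. If a WSSUS channel is considered, the correlation function of (7) is

$$R_{\tilde{H}}(\Delta t,\tau) = \sum_{l=0}^{L-1} R_{a_l}(\Delta t)\,\delta(\tau-\tau_l), \qquad (8)$$

with scattering function

$$S(\tau;\nu) = \sum_{l=0}^{L-1} S_{a_l}(\nu)\,\delta(\tau-\tau_l), \qquad (9)$$

where $R_{a_l}(\Delta t)$ is the autocorrelation function of $a_l(t)$ and $S_{a_l}(\nu)$ is its Fourier transform with respect to $\Delta t$.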
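A short worked step may help connect the discrete-scatterer model to the FIR view used in the next section. It assumes that the base-band input-output relation is the convolution of the transmitted signal with the channel plus noise (the base-band counterpart of (1)), and it writes the complex gain of scatterer $l$ as $a_l(t)$ and its delay as $\tau_l$; these symbol names are assumptions made for this sketch. Substituting the discrete-scatterer channel and applying the sifting property of the Dirac delta gives:

```latex
\begin{aligned}
\tilde{y}(t)
  &= \int_{0}^{\infty} \tilde{H}(t,\tau)\,\tilde{x}(t-\tau)\,d\tau + \tilde{u}(t) \\
  &= \sum_{l=0}^{L-1} a_l(t) \int_{0}^{\infty} \delta(\tau-\tau_l)\,\tilde{x}(t-\tau)\,d\tau + \tilde{u}(t) \\
  &= \sum_{l=0}^{L-1} a_l(t)\,\tilde{x}(t-\tau_l) + \tilde{u}(t).
\end{aligned}
```

Under these assumptions, the received signal is a sum of $L$ delayed copies of the transmitted signal, each weighted by a time-varying complex gain, which is why the channel can be simulated as a filter with time-varying random coefficients.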

Channel Simulation
In order to perform a computational simulation of the communication channel, it is necessary to deal with the discrete version of the base-band equivalent channel presented in (7). This discrete channel results from band-limiting and sampling (7) in the time and time-delay domains at a rate of $f_s \ge 2 f_{\max}$. Thus, it is defined as

$$h[n,k] = h(nT_s, kT_s), \qquad (10)$$

where $T_s = 1/f_s$, $h(t,\tau) = \tilde{H}(t,\tau) \otimes g(\tau)$, the symbol ⊗ represents the convolution operator, and $g(\tau)$ is a function for band-limiting the channel to $f_{\max}$, which, for practical purposes, could be a time-windowed cardinal sine function. Substituting (7) into (10) results in

$$h[n,k] = \sum_{l=0}^{L-1} a_l[n]\,g(kT_s-\tau_l), \qquad (11)$$

where $a_l[n] = a_l(nT_s)$, $h[n,k]$ corresponds to the coefficients of the FIR filter for simulating the communication channel, $n$ enumerates the samples in the time domain, and $0 \le k \le K-1$ enumerates the taps of the filter. Likewise, $K$ can be calculated as $\lceil(\tau_{\max}+T_g)/T_s\rceil$, where $\tau_{\max}$ is the maximum delay of the paths in the channel $\tilde{H}(t,\tau)$ and $2T_g$ is the length of the filter $g(\tau)$. This filter could be anticausal; nevertheless, it is possible to introduce a delay in order to convert it into a causal filter, which is therefore physically realizable.
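The offline computation of the band-limiting coefficients can be sketched as follows. This is a minimal CPU illustration in plain Python, not the paper's implementation: the Hann window, the function names, and the optional `delay` argument (the shift that would make the filter causal) are assumptions made here, since the text only states that the band-limiting function could be a time-windowed cardinal sine.

```python
import math


def windowed_sinc(tau, f_max, half_len):
    """Hann-windowed cardinal sine with cutoff f_max, zero for |tau| >= half_len.

    The Hann window is an assumption; the text only asks for "a time-windowed
    cardinal sine function".
    """
    if abs(tau) >= half_len:
        return 0.0
    x = 2.0 * f_max * tau
    sinc = 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)
    window = 0.5 * (1.0 + math.cos(math.pi * tau / half_len))
    return sinc * window


def num_taps(delays, f_s, half_len):
    """Number of taps, ceil((tau_max + half_len) / T_s) with T_s = 1 / f_s."""
    return math.ceil((max(delays) + half_len) * f_s)


def tap_matrix(delays, f_s, f_max, half_len, delay=0.0):
    """Coefficient matrix B[k][l] = g(k * T_s - delay - tau_l).

    `delay` is a hypothetical knob for the causal shift mentioned in the text;
    delay=0.0 reproduces g(k * T_s - tau_l) directly.
    """
    k_total = num_taps(delays, f_s, half_len)
    return [
        [windowed_sinc(k / f_s - delay - tau_l, f_max, half_len) for tau_l in delays]
        for k in range(k_total)
    ]


# Toy numbers chosen so that every delay falls exactly on the sampling grid.
B = tap_matrix(delays=[0.0, 0.25], f_s=8.0, f_max=4.0, half_len=0.5)
print(len(B), len(B[0]))  # taps x paths
```

Because the matrix depends only on the path delays and the sampling rate, it can be computed once and reused for every channel realization.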
In order to implement (11), it is necessary to generate $L$ uncorrelated discrete complex Gaussian stochastic processes at rate $f_s$. Many algorithms for obtaining these stochastic processes exist in the state of the art, as mentioned in [13][14][15][16] and references therein. Such processes must be filtered (colored) in order to achieve the desired scattering function. It is important to note that these filters only affect the frequency components below a maximum Doppler frequency $f_D^{\max}$; therefore, it is possible to generate the samples at a rate $f_r \ge 2 f_D^{\max}$, where typically $f_r \ll f_s$, and then to use any upsampling technique to reach the rate $f_s$.
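To give a sense of how much work the low-rate generation saves, the arithmetic can be sketched with the parameters reported later in the paper (a maximum Doppler frequency of 2000 Hz, a low rate of ten times that value, and a 10 Msps output rate). The function name and the extra end-point sample kept for interpolation are assumptions of this sketch.

```python
import math


def low_rate_plan(f_doppler_max, f_s, n_samples, oversampling=10):
    """Return (low rate, upsampling factor, low-rate samples per path).

    The low rate is oversampling * f_doppler_max and must respect the
    Nyquist bound (at least twice the maximum Doppler frequency).
    """
    if oversampling < 2:
        raise ValueError("the low rate must be at least twice the maximum Doppler frequency")
    f_r = oversampling * f_doppler_max
    factor = f_s / f_r
    # One extra sample so the interpolator has a right-hand end point (assumption).
    n_low = math.ceil(n_samples / factor) + 1
    return f_r, factor, n_low


f_r, factor, n_low = low_rate_plan(f_doppler_max=2000.0, f_s=10e6, n_samples=1_000_000)
print(f_r, factor, n_low)  # 20000.0 500.0 2001
```

With these numbers, each path needs about two thousand Gaussian samples and filter outputs instead of one million; the remaining samples come from interpolation.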
The impulse response of the filter for coloring the $l$-th process is the discrete version (at rate $f_r$) of a continuous-time expression chosen so that the process attains the desired scattering function. Finally, an interpolation technique such as splines, polynomial, or basis expansion is used for obtaining the samples at rate $f_s$. The entire process is presented in Figure 1 and summarized in Algorithm 1.

GPU Implementation
The emergence of GPUs has allowed complex algorithms to be executed almost in real time. A GPU is conceptualized as a set of streaming multiprocessors (SMs), where each SM is characterized by a single instruction multiple data (SIMD) architecture. Therefore, in each clock cycle, each processor of the multiprocessor executes the same instruction, operating on multiple data streams. Each of these processors can access a shared memory (common to all processors belonging to the same SM) and a local cache memory. In addition, all the processors have access to the global GPU (device) memory. Figure 2 illustrates the GPU hardware architecture.

Algorithm 1
Require: Scattering function
Require: Define the gain $\sigma_l^2$ that corresponds to the variance of the process $a_l[n]$ for all the $L$ paths
(1) for all $l$ such that $0 \le l \le L-1$ do
(2)   Generate the zero-mean, unit-variance complex Gaussian stochastic process at rate $f_r$ samples per second
(3)   Multiply the stochastic process by $\sqrt{\sigma_l^2}$ to ensure the gains of the paths
(4)   Filter the process with the discrete coloring filter of the $l$-th path
(5)   Interpolate the process to obtain samples at rate $f_s$
(6) end for
(7) Compute $h[n,k]$ according to (11)
Our strategy for implementing the fading channel simulator is aimed at improving the overall performance by chaining software functions (called kernels) representing each communication step.In order to implement the parallel fading simulator as illustrated in Figure 3, we distinguish five stages in the GPU design methodology as follows.

Gaussian Random Number Generator.
In this stage, the CUDA Random Number Generation (cuRAND) library [17] is employed in order to obtain Gaussian random numbers (GRN) by means of efficient generation of high-quality pseudorandom numbers. In particular, the curand_init function is launched to create a random number generator in a massively parallel scheme. There are seven types of random number generators in cuRAND; in this study, we have selected the XORWOW algorithm, which is a member of the xorshift family of pseudorandom number generators, with customized parameters for operating on GPUs.
The curand_normal2 function generates two normally distributed pseudorandom numbers in each call. Because the underlying algorithm is based on the Box-Muller transform, it is suitable for generating random complex numbers; that is, each call generates the real and imaginary parts at the same time.
A CUDA kernel computes a set of $L$ independent GRN vectors. Each vector corresponds to a path, which is computed in chunks by the GPU multiprocessors and then stored in device global memory. The implementation of the GRN generator is presented in Algorithm 2, where the function setup_kernel initializes the threads of the same block with a different sequence number but the same seed and offset (zero offset). Furthermore, generate_normal_kernel computes several pseudorandom values with a Gaussian distribution by calling curand_normal2.
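The Box-Muller idea behind this stage can be illustrated on the CPU with a small Python sketch. It is not the cuRAND code: Python's Mersenne Twister stands in for XORWOW, one independently seeded generator per path stands in for the per-thread sequence numbers, and the scaling by $\sqrt{1/2}$ (so that each complex sample has unit variance) is an assumption of this sketch.

```python
import math
import random


def box_muller_pair(rng):
    """One Box-Muller draw: two independent N(0, 1) values from two uniforms."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], which avoids log(0)
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def complex_gaussian_paths(num_paths, num_samples, seed=1234):
    """Return paths[l][n]: zero-mean, unit-variance complex Gaussian samples."""
    scale = math.sqrt(0.5)  # real and imaginary parts each get variance 1/2
    paths = []
    for l in range(num_paths):
        # Hypothetical stand-in for "same seed, different sequence number".
        rng = random.Random(seed * 1_000_003 + l)
        row = []
        for _ in range(num_samples):
            re, im = box_muller_pair(rng)
            row.append(complex(re * scale, im * scale))
        paths.append(row)
    return paths


paths = complex_gaussian_paths(num_paths=3, num_samples=5)
print(paths[0][:2])
```

Each Box-Muller draw yields one complex sample, which mirrors the remark that a single call produces the real and imaginary parts at the same time.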

[Figure 3: Parallel fading channel simulator; recoverable block labels include "Upsampler (npp)" and "Tap generator (matrix-matrix multiplication, cuBLAS)".]

In the filtering stage, the coloring filter is applied to the random numbers generated in the previous subsection. Since the filter coefficients are fixed for all channel realizations and paths, they are stored in the constant memory of the GPU. This memory is devoted to storing and broadcasting read-only data to all threads on the GPU. In addition, the results of the GRN stage are stored in shared memory, since many threads must access them simultaneously. The filtering is conceptualized as a convolution, so a kernel that performs the convolution in parallel is used.
There is a set of $L$ independent 1D signal convolutions to be computed, one for each path. The filtering is performed using the NVIDIA Performance Primitives library (npp) [18]; specifically, one of the nppiFilterRow functions is used, which performs 1D filtering on 2D data, each row being a channel path.
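The row-wise view of the filtering can be sketched in plain Python as follows. This is only a CPU reference for the idea (one causal 1D convolution per row, with the same taps shared by all rows); the zero-padding at the start of each row is an assumption and may differ from the border and anchor handling of the npp routine.

```python
def filter_rows(noise_rows, taps):
    """Apply the same causal FIR filter to every row (one row per path).

    out[l][n] = sum_m taps[m] * noise_rows[l][n - m], with samples before
    the start of the row treated as zero (assumption of this sketch).
    """
    out = []
    for row in noise_rows:
        filtered = []
        for n in range(len(row)):
            acc = 0j
            for m, tap in enumerate(taps):
                if n - m >= 0:
                    acc += tap * row[n - m]
            filtered.append(acc)
        out.append(filtered)
    return out


# An impulse on the first row returns the filter taps themselves.
rows = [[1 + 0j, 0j, 0j, 0j], [0j, 1j, 0j, 0j]]
print(filter_rows(rows, taps=[0.5, 0.25]))
```

Every output sample depends only on its own row, so the rows, and the samples within a row, can be computed independently, which is what makes this stage a natural fit for a parallel convolution kernel.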

Path Gain Implementation.
The path gain is implemented with a multiplication function.The resulting colored noise from the previous stage is multiplied by a scalar.This could be carried out with a specific kernel or by using a standard library, such as CUDA Basic Linear Algebra Subroutines (cuBLAS) [19] or npp.The proposed implementation uses the nppiMulC function of the npp library.

Upsampler.
The upsampler stage is responsible for generating noise samples at the rate $f_s$ and is implemented as an interpolation. The usual interpolation available for GPUs is the linear interpolation offered by texture memory; npp offers other methods for more accurate results. In this case, the nppiResize function with a cubic interpolation is used. It returns the interpolated value for a given coordinate between two known noise values.
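A cubic upsampler of this kind can be sketched in plain Python. The sketch uses a Catmull-Rom cubic, which is one common cubic interpolation kernel and is not necessarily the kernel that npp applies; the integer upsampling factor and the clamping of neighbors at the ends of the sequence are also assumptions.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic through p1 (t = 0) and p2 (t = 1), shaped by neighbors p0 and p3."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t * t * t
    )


def upsample_cubic(samples, factor):
    """Insert factor - 1 interpolated values between consecutive samples.

    Works for real or complex samples; neighbors are clamped at both ends.
    """
    n = len(samples)
    out = []
    for i in range(n - 1):
        p0 = samples[max(i - 1, 0)]
        p1 = samples[i]
        p2 = samples[i + 1]
        p3 = samples[min(i + 2, n - 1)]
        for j in range(factor):
            out.append(catmull_rom(p0, p1, p2, p3, j / factor))
    out.append(samples[-1])
    return out


dense = upsample_cubic([0, 1, 2, 3], factor=4)
print(len(dense), dense[4], dense[6])  # 13 1.0 1.5
```

The original low-rate samples are preserved at every multiple of the factor, and only the values in between are synthesized.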

Tap Generator.
Multiple paths have been treated separately so far. In this stage, they are correlated using predefined (computed offline) coefficients according to (11). This correlation operation can be seen as the multiplication of the $L$ upsampled, scaled, colored noise sequences (paths) by the coefficient matrix $B[k,l] = g(kT_s - \tau_l)$. This could be carried out with a programmer's own implementation or by using a standard library such as cuBLAS. This proposal uses the cublasSgemm function, which performs a matrix-matrix multiplication with optional scaling.
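The tap generation step can be written as a plain matrix product, sketched here in Python as a CPU reference. The array layout (`paths[l][n]` for the processed noise and `coeff[k][l]` for the offline coefficient matrix) is an assumption of this sketch. Note also that the source names a real-valued single-precision routine and does not state how the complex samples are laid out for it; the sketch simply uses Python complex numbers.

```python
def generate_taps(paths, coeff):
    """Return h[n][k] = sum_l paths[l][n] * coeff[k][l].

    paths[l][n]: upsampled, scaled, colored noise of path l at time n.
    coeff[k][l]: offline coefficient of path l for tap k.
    """
    num_paths = len(paths)
    num_samples = len(paths[0])
    return [
        [sum(paths[l][n] * row[l] for l in range(num_paths)) for row in coeff]
        for n in range(num_samples)
    ]


# Two paths, two time samples, two taps.
paths = [[1 + 0j, 2 + 0j], [1j, 0j]]
coeff = [[1.0, 0.0], [0.5, 2.0]]
print(generate_taps(paths, coeff))
```

Since the coefficient matrix is fixed, each time sample yields its tap vector through one matrix-vector product, and all time samples together form a single matrix-matrix multiplication.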

Implementation Results
In order to corroborate the functionality of the proposed fading channel simulator in modern communication systems such as WiMAX, it was configured with the following parameters [20, page 404]: a maximum Doppler frequency $f_D^{\max} = 2000$ Hz and a sample rate $f_s = 10$ Msps, with $f_r = 10 f_D^{\max}$. In Figure 4, a resulting GPU-based realization of the fading channel according to the specified parameters for $N = 1024$ time samples is presented. It is important to note that the offline computed data (see Figure 3) are transferred to the GPU simulator through text files. The simulation was carried out using an iMac computer with the following specifications: OS X 10.9.4 (Mavericks), Intel Core i5 processor (3.4 GHz), 16 GB of RAM, and a GeForce GTX 780M graphics card with 4 GB of RAM and 1536 CUDA cores.
For evaluating the time performance, the parameters used in the previous test have been maintained; however, the parameter $N$ was fixed to $N = 1 \times 10^{6}$ samples. In this sense, Table 1 presents the average, maximum, and minimum time consumption for a CPU-based implementation (MATLAB) versus the proposed GPU-based methodology (CUDA). The GPU methodology achieves a 30-fold gain (mean value) when compared with the CPU-based implementation, which is attractive if parallel versions of the channel simulator are required, as could be the case in MIMO applications. Table 2 reports the percentage of time spent on each task of the channel simulator in the GPU. It should be noted that the reading and device memory allocation, which are the most time-consuming tasks, are not considered in this table. These tasks are performed only once, at the initialization stage of the simulation.
On the other hand, Table 3 and Figure 5 present the overall time consumption in milliseconds for the CPU- and GPU-based implementations when the number of samples is fixed to $N$ = 5120, 10240, 20480, 81920, 327680, 655360, 1000000, and 1310720 samples. This shows that while the time consumption in the CPU-based implementation grows exponentially, it remains almost linear in the GPU-based implementation.
Similarly, the good performance achieved with the GPU implementation with respect to the CPU implementation can be observed in the x-fold gain reported in Table 3. This gain is calculated as the quotient of the time consumption of the two implementations. The behavior of this gain has been reported for each value of $N$ stated in the previous paragraph.
Finally, it is important to emphasize that the presented approach can deal with several path realizations.This suggests that the developed fading channel simulator can be considered for generating large MIMO channels, which represents a new simulation paradigm.

Conclusions
The principal result of this study is the introduction of a methodology for designing fading channel simulators via GPU devices. Such a methodology permits nonspecialized users to easily implement channel simulators in parallel. As was shown, the use of GPUs in the development of fading channel simulators greatly reduces simulation time when channel realizations are generated for testing communication systems. Moreover, a case study for WiMAX systems demonstrated the functionality of the implemented channel simulator. We believe that the proposed parallel channel simulator can aid in testing mobile communication systems based on LTE and WiMAX. Additionally, the presented GPU-based approach will allow the design of more sophisticated simulators of complex channel models such as triply selective MIMO fading channels (i.e., time, frequency, and space selective).

Figure 1: Structure of the fading channel simulator.

Figure 4: Impulse response realization of the fading channel simulator considering the vehicular class B ITU multipath channel model ($f_D^{\max} = 2000$ Hz, $f_s = 10$ Msps).
In (1), $x(t) = x_I(t)\,c_I(t) + x_Q(t)\,c_Q(t)$, and $t$ is a time variable. The impulse response $H(t,\tau)$ gives the response of the channel at the instant $t$ when a stimulus is applied at $t-\tau$, which reflects the time variability of the channel impulse response. Likewise, $u(t)$ is the aggregated stochastic noise. The received signal $y(t)$ is demodulated in order to obtain the in-phase and quadrature signals $y_I(t)$ and $y_Q(t)$.

Table 1: Channel emulator implementation comparison. Time consumption for generating 1 megasample and 10 channel realizations (in milliseconds), with $f_D^{\max} = 2000$ Hz, a sample rate $f_s = 10$ Msps, and $f_r = 10 f_D^{\max}$.

Table 2: Time consumption by module when computing 10 channel realizations.