Channel simulators are powerful tools that permit performance tests of the individual parts of a wireless communication system. This is relevant when new communication algorithms are tested, because it allows us to determine if they fulfill the communications standard requirements. One of these tests consists of evaluating the system performance when a communication channel is considered.
In this sense, it is possible to model the channel as an FIR filter with time-varying random coefficients. Increasing the number of coefficients gives a better approximation to real scenarios, but it also increases the computational complexity. To address this issue, a design methodology for computing the time-varying coefficients of fading channel simulators using consumer-designed graphics processing units (GPUs) is proposed. With GPUs and the proposed methodology, users who are not specialists in parallel computing can accelerate their simulation developments when compared to conventional software. Implementation results show that the proposed approach allows the easy generation of communication channels while reducing the processing time. Finally, the GPU-based implementation is preferable to the CPU-based implementation because of the scattered nature of the channel.
1. Introduction
Currently, the high demand for integrated services (voice, data, and video) means that new data transmission schemes have to be developed for dealing with high transmission data rates while offering high levels of quality of service. The fourth generation (4G) of mobile communication systems is still under development; its main goal is to provide a digital communication network (land, mobile, and satellite) with peak data rates of 100 Mbps for high mobility devices and 1 Gbps for users or devices in low mobility environments or stationary conditions. The main technologies used in 4G include techniques based on multiple-input and multiple-output (MIMO) antennas, turbo decoding, adaptive modulation, coding schemes and error correction, and orthogonal frequency-division multiplexing (OFDM) [1, 2]. Current versions of standards that incorporate 4G are LTE-A (Long Term Evolution-Advanced) and IEEE 802.16m mobile WiMAX (Worldwide Interoperability for Microwave Access). Therefore, the new issues imposed by the standards require new processing algorithms to be tested in high mobility environments affected by Doppler shifts (time-selective channels) and multipath propagation (frequency-selective channels). The temporal channel variability occurs when the characteristics of the transmission medium change over time or when there is relative motion between the receiver and the transmitter, as in communication systems such as LTE and WiMAX. The frequency selectivity appears when multiple copies of the transmitted signal arrive at the receiver due to physical mechanisms such as multipath propagation.
Moreover, measuring the behavior or performance of a mobile communication system under real conditions (in situ test) can be very expensive, owing to the transfer of the communications system and test equipment to the place under study, among other issues. Additionally, the system behavior cannot be tested repeatedly under the same propagation conditions because of the random nature of the communication channel. Faced with this problem, an economical alternative is to use mathematical models that represent the radio channels under consideration. In this sense, we can define a channel simulator as a software tool that reproduces the behavior or the propagation conditions of a mobile communications channel under controlled or laboratory conditions.
On the other hand, GPU-accelerated computing is the use of a graphics processing unit (GPU) together with a CPU in order to accelerate scientific, engineering, and business applications [3]. Recently, several works in the wireless communication area that use GPU devices have been published [4–7]. Those works follow an implementation strategy that handles the channel complexity using multiple cores. For example, in [4] a wireless channel simulator is implemented. In that work, the potential of GPU-based processing is studied in order to improve the runtime performance of computationally intensive, accurate wireless network simulation. In [5], the use of general purpose GPUs is investigated in order to provide the computational capabilities required for performing the radio frequency path loss computation. A discussion of the acceleration of wireless channel simulation using GPUs is provided in [6]. In addition, in [7], an implementation of a parallel lattice reduction-aided 2 × 2 MIMO detector using GPUs for the WiMAX standard is presented.
Although several works related to the use of GPUs in communication systems exist, there are currently no works that describe in detail the implementation of a fading channel simulator based on GPUs. In this paper, the methodology for implementing a fading channel simulator (time and frequency selective) via GPU computing is presented.
The proposed methodology relies on common GPU software libraries, which allow users who are not specialists in GPU programming to implement the proposed simulator easily. The Rayleigh fading variates are generated using the filtering method [8–10]. In this case, the filtering method is carried out in the time domain by using a finite impulse response (FIR) filter for coloring Gaussian noise samples. Furthermore, it is well known that increasing the filter order improves the accuracy of the channel statistics, though at the cost of increasing the computational complexity. Therefore, in this work, we take advantage of GPUs for handling such computational complexity (multiplication and addition operations) in order to implement an accurate communication channel for SISO systems. Moreover, this methodology paves the way for implementing MIMO channel simulators in the future.
The rest of this paper is organized as follows. In Section 2, the background of the wireless communication system is stated, specifically as regards the communication channel model. Section 3 explains how to simulate the communication channel. Next, in Section 4, the GPU implementation of the fading channel simulator is detailed. Section 5 is devoted to presenting the implementation results when a WiMAX scenario is considered. Finally, the conclusions are presented in Section 6.
2. Communication System
Consider a single-input and single-output (SISO) communication system in which the in-phase signal $x_i(t)$ and the quadrature signal $x_q(t)$ modulate the orthogonal carriers $\phi_i(t)$ and $\phi_q(t)$, respectively, and are then mixed to obtain $x(t)$. This signal $x(t)$ is propagated through the communication channel $H(t,\tau)$, which is considered to be a causal time-varying linear system. The signal filtered by the channel reaches the receiver, where a noisy version $y(t)$ is detected. It can be expressed mathematically as follows:
$$y(t)=\int_{-\infty}^{\infty}x(t-\tau)H(t,\tau)\,d\tau+u(t), \tag{1}$$
where $x(t)=x_i(t)\phi_i(t)+x_q(t)\phi_q(t)$ and $t$ is the time variable. The impulse response $H(t,\tau)$ gives the response of the channel at the instant $t$ to a stimulus applied at $t-\tau$, which reflects the time variability of the channel impulse response. Likewise, $u(t)$ is the aggregated stochastic noise. The received signal $y(t)$ is demodulated in order to obtain the in-phase and quadrature signals $y_i(t)$ and $y_q(t)$.
For the sake of simplicity, if $\phi_i(t)=\cos(2\pi f_c t+\theta)$ and $\phi_q(t)=\sin(2\pi f_c t+\theta)$, where $f_c$ is any carrier frequency and $\theta$ is any phase, the system becomes the well-known single carrier communication system. It is important to emphasize that an OFDM system implemented with IFFT/FFT produces a baseband signal that is modulated as in a single carrier system.
If we consider that both signals $x_i(t)$ and $x_q(t)$ are band limited to a maximum frequency $f_{\max}$ and that $f_c\gg f_{\max}$ (a condition that is always met in real communication systems), it is easy to demonstrate [11, 12], with the aid of the Hilbert transform, the existence of baseband equivalent signals $\tilde{y}(t)$, $\tilde{x}(t)$, $\tilde{u}(t)$, and $\tilde{H}(t,\tau)$ for $y(t)$, $x(t)$, $u(t)$, and $H(t,\tau)$, respectively. In general, these equivalent baseband signals are complex, where the real part corresponds to the in-phase component and the imaginary part to the quadrature component; thus, $\tilde{x}(t)=x_i(t)+jx_q(t)$ and $\tilde{y}(t)=y_i(t)+jy_q(t)$ for $j=\sqrt{-1}$. The relations between the original passband signals and their baseband equivalents are as follows [12]:
$$y(t)=\mathrm{Re}\{\tilde{y}(t)e^{j2\pi f_c t}\},\quad x(t)=\mathrm{Re}\{\tilde{x}(t)e^{j2\pi f_c t}\},\quad u(t)=\mathrm{Re}\{\tilde{u}(t)e^{j2\pi f_c t}\},\quad H(t,\tau)=\mathrm{Re}\{\tfrac{1}{2}\tilde{H}(t,\tau)e^{j2\pi f_c t}\}, \tag{2}$$
where $\mathrm{Re}\{\cdot\}$ is the real part of the complex number in braces. Considering (2), the baseband equivalent of (1) is
$$\tilde{y}(t)=\int_{-\infty}^{\infty}\tilde{x}(t-\tau)\tilde{H}(t,\tau)\,d\tau+\tilde{u}(t), \tag{3}$$
which can be interpreted as a collection of multiple paths (scatterers) through which the transmitted signal $\tilde{x}(t)$ is propagated. Because these paths have different lengths and pass through different propagation conditions, the signal received from a specific path is a delayed, attenuated, and phase-shifted version of $\tilde{x}(t)$. In this sense, for a specific time $t_1$ and a specific delay $\tau_1$, the channel coefficient $\tilde{H}(t_1,\tau_1)$ is a complex variable whose magnitude represents the attenuation factor and whose phase represents the phase shift factor. Moreover, due to the constant changes in the environment and the possible relative movement between transmitter and receiver, these factors are time dependent. According to [12], $\tilde{H}(t,\tau)$ can be modeled as a complex stochastic process composed of the sum of a deterministic part (the ensemble average of $\tilde{H}(t,\tau)$) and a random part (a zero-mean random process). From this point on, we will only consider the random part (an assumption generally accepted when a channel simulator is developed).
The autocorrelation function of this random process is equal to
$$R_{\tilde{H}}(t_1,t_2;\tau_1,\tau_2)=E\big(\tilde{H}(t_1,\tau_1)\tilde{H}^{*}(t_2,\tau_2)\big), \tag{4}$$
where $E(\cdot)$ is the expectation operator and $(\cdot)^{*}$ represents the complex conjugate. This channel model is difficult to implement; nevertheless, some assumptions can be made that simplify the model. The first is the absence of correlation between the different scatterers, and the second is that each scatterer is a wide-sense stationary process; together they comprise the well-known wide-sense stationary uncorrelated scattering (WSSUS) model. Therefore, (4) transforms into
$$R_{\tilde{H}}(t_1,t_2;\tau_1,\tau_2)=R_{\tilde{H}}(t_1,t_1-\Delta t;\tau_1,\tau_2)\,\delta(\tau_1-\tau_2)=P_{\tilde{H}}(\Delta t;\xi), \tag{5}$$
where $\xi=\tau_1=\tau_2$, $\Delta t=t_1-t_2$, and $P_{\tilde{H}}(\Delta t;\xi)$ is the autocorrelation function with respect to the time difference variable $\Delta t$ for the scatterer located at the delay $\xi$. From (5), it is possible to calculate the scattering function, which is defined as the Fourier transform of the correlation function with respect to the time difference variable $\Delta t$, as follows:
$$S(f;\xi)=\mathcal{F}\{P_{\tilde{H}}(\Delta t;\xi)\}, \tag{6}$$
where $\mathcal{F}\{\cdot\}$ is the Fourier transform operator. The scattering function $S(f;\xi)$ gives the Doppler spectrum for a given value of the delay variable $\xi$.
In many communication standards, a discrete number of scatterers is considered instead of a continuum, as suggested in the previous equations. Under this assumption,
$$\tilde{H}(t,\tau)=\sum_{k=0}^{K-1}A_k(t)\,\delta(\tau-\tau_k), \tag{7}$$
where $k$ is an index variable that enumerates the $K$ discrete scatterers and $A_k(t)$ is a complex variable that contains the gain and phase shift factor of the $k$th scatterer. If a WSSUS channel is considered, the correlation function of (7) is
$$P_{\tilde{H}}(\Delta t;k)=E\big(A_k(t)A_k^{*}(t-\Delta t)\big) \tag{8}$$
with scattering function
$$S(f;k)=\mathcal{F}\{P_{\tilde{H}}(\Delta t;k)\}. \tag{9}$$
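As a short worked step (a sketch added for illustration, not an equation taken from the source), substituting the discrete model (7) into the baseband relation (3) and applying the sifting property of the Dirac delta shows that the received signal is a sum of delayed copies of the transmitted signal, each weighted by its time-varying complex gain:

```latex
\begin{aligned}
\tilde{y}(t) &= \int_{-\infty}^{\infty}\tilde{x}(t-\tau)\sum_{k=0}^{K-1}A_k(t)\,\delta(\tau-\tau_k)\,d\tau+\tilde{u}(t) \\
             &= \sum_{k=0}^{K-1}A_k(t)\,\tilde{x}(t-\tau_k)+\tilde{u}(t).
\end{aligned}
```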
3. Channel Simulation
In order to perform a computational simulation of the communication channel, it is necessary to deal with the discrete version of the baseband equivalent channel presented in (7). This discrete channel results from band-limiting and sampling (7) in the time and time-delay domains at a rate of $f_s\geq 2f_{\max}$. Thus, it is defined as
$$h[n,m]=h(t,\tau)\big|_{t=nT_s,\ \tau=mT_s}, \tag{10}$$
where $T_s=1/f_s$, $h(t,\tau)=\tilde{H}(t,\tau)\otimes B(\tau)$, the symbol $\otimes$ represents the convolution operator, and $B(\tau)$ is a function for band-limiting the channel to $f_{\max}$, which, for practical purposes, could be a time-windowed cardinal sine function. Substituting (7) into (10) results in
$$h[n,m]=\sum_{k=0}^{K-1}A_k(nT_s)\,B(mT_s-\tau_k)=\sum_{k=0}^{K-1}A_k[n]\,B(mT_s-\tau_k), \tag{11}$$
where $h[n,m]$ corresponds to the coefficients of the FIR filter for simulating the communication channel, $n$ enumerates the samples in the time domain, and $0\leq m\leq M-1$ enumerates the taps of the filter. Likewise, $M$ can be calculated as $(\tau_{\max}+t_B)/T_s$, where $\tau_{\max}$ is the maximum delay of the paths in the channel $\tilde{H}(t,\tau)$ and $2t_B$ is the length of the filter $B(\tau)$. This filter could be anticausal; nevertheless, it is possible to introduce a delay in order to convert it into a causal, and therefore physically feasible, filter.
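The construction in (11) can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`windowed_sinc`, `tap_matrix`, `channel_taps`), the rectangular window, the normalized values of $T_s$, $t_B$, and $f_{\max}$, and the two example delays are all hypothetical, and no causal shift is applied to the pulse.

```python
import math


def windowed_sinc(tau, f_max, t_b):
    """Assumed band-limiting pulse B(tau): a cardinal sine with its first
    nulls at +/- 1/(2*f_max), truncated to |tau| <= t_b (rectangular window)."""
    if abs(tau) > t_b:
        return 0.0
    x = 2.0 * f_max * tau
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def tap_matrix(delays, ts, t_b, f_max):
    """Return B[m][k] = B(m*Ts - tau_k) for m = 0..M-1, M = (tau_max + t_B)/Ts."""
    m_taps = round((max(delays) + t_b) / ts)
    return [[windowed_sinc(m * ts - tau_k, f_max, t_b) for tau_k in delays]
            for m in range(m_taps)]


def channel_taps(a_n, b):
    """Eq. (11) for one time index n: h[n][m] = sum_k A_k[n] * B[m][k]."""
    return [sum(a_k * b_mk for a_k, b_mk in zip(a_n, row)) for row in b]


# Hypothetical example: Ts = 1 (normalized), two paths, the second off-grid.
ts = 1.0
delays = [0.0, 1.5]
b = tap_matrix(delays, ts, t_b=2.5, f_max=0.5)
a_n = [1.0 + 0.0j, 0.0 + 0.5j]     # path gains A_k[n] at one instant n
h_n = channel_taps(a_n, b)          # the M = 4 taps of the FIR filter at n
```

A path whose delay falls between sampling instants spreads its energy over several taps, which is why $M$ is larger than the number of paths $K$.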
In order to implement (11), it is necessary to generate $K$ uncorrelated discrete complex Gaussian stochastic processes at rate $f_s$. Many algorithms for obtaining these stochastic processes have been reported in the literature, as mentioned in [13–16] and references therein. Such processes must be filtered (colored) in order to obtain the desired scattering function. It is important to note that these filters only affect the frequency components below a maximum Doppler frequency $f_{\mathrm{maxD}}$; therefore, it is possible to generate the samples at a lower rate $f_{ls}\geq 2f_{\mathrm{maxD}}$, where typically $f_{ls}\ll f_s$, and then to use any upsampling technique to reach the rate $f_s$.
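To give a sense of the saving, the following arithmetic uses the parameter values that Section 5 adopts ($f_{\mathrm{maxD}}=2000$ Hz, $f_{ls}=10f_{\mathrm{maxD}}$, $f_s=10$ Msps, $N=10^6$ output samples). The symbol $L$ for the upsampling factor is introduced here only for illustration:

```latex
f_{ls}=10\,f_{\mathrm{maxD}}=20\ \text{kHz}\;\geq\;2f_{\mathrm{maxD}}=4\ \text{kHz},
\qquad
L=\frac{f_s}{f_{ls}}=\frac{10\times 10^{6}}{20\times 10^{3}}=500,
\qquad
\frac{N}{L}=\frac{10^{6}}{500}=2000 .
```

Under these values, only about 2000 samples per path need to be generated and colored at the low rate; the remaining samples come from interpolation.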
The impulse response of the filter for coloring the $k$th process is the discrete version (at rate $f_{ls}$) of the following expression:
$$G_k(t)=\mathcal{F}^{-1}\{S(f;k)\}. \tag{12}$$
Finally, an interpolation technique such as splines, polynomial, or basis expansion is used for obtaining the samples at fs rate. The entire process is presented in Figure 1 and summarized in Algorithm 1.
Algorithm 1
Require: the gains $\sigma_k^2$, which correspond to the variance of the process $A_k[n]$, for all $K$ paths
(1) for all $k$ such that $0\leq k\leq K-1$ do
(2)     Generate a zero-mean, unit-variance complex Gaussian stochastic process at rate $f_{ls}$ samples per second
(3)     Multiply the stochastic process by $\sigma_k$ (the square root of the gain $\sigma_k^2$) to set the gain of the path
(4)     Filter the process with the discrete version of $G_k(t)$
(5)     Interpolate the process to obtain samples at rate $f_s$
(6) end for
(7) for all $n$ do
(8)     Obtain the $M$ filter coefficients $h[n,m]=\sum_{k=0}^{K-1}A_k(nT_s)\,B(mT_s-\tau_k)$
(9) end for
Figure 1: Structure of the fading channel simulator.
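Once the coefficients $h[n,m]$ are available, the channel acts on a transmitted sequence as a time-varying FIR filter, which is the discrete counterpart of (3). The sketch below is a hypothetical CPU reference under simplifying assumptions: the function name `apply_channel` is illustrative, the noise term is omitted, samples before $n=0$ are taken as zero, and the tap values are made up.

```python
def apply_channel(x, h):
    """Time-varying FIR filtering: y[n] = sum_m h[n][m] * x[n - m].

    x is the transmitted baseband sequence and h[n] holds the M taps valid
    at time n (one row per output sample). Samples before n = 0 are zero.
    """
    y = []
    for n, taps in enumerate(h):
        acc = 0j
        for m, h_nm in enumerate(taps):
            if n - m >= 0:
                acc += h_nm * x[n - m]
        y.append(acc)
    return y


# Hypothetical example: a unit impulse through a 2-tap channel whose
# coefficients change at every sample.
x = [1 + 0j, 0j, 0j, 0j]
h = [[1.0, 0.5], [0.9, 0.4], [0.8, 0.3], [0.7, 0.2]]
y = apply_channel(x, h)
```

Because the taps differ from sample to sample, the echo of the impulse at $n=1$ is weighted by `h[1][1]`, not by `h[0][1]` as it would be in a time-invariant filter.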
4. GPU Implementation
The emergence of GPUs has allowed complex algorithms to be executed almost in real time. A GPU is conceptualized as a set of streaming multiprocessors (SMs), where each SM is characterized by a single instruction, multiple data (SIMD) architecture. Therefore, in each clock cycle, each processor of the multiprocessor executes the same instruction, operating on multiple data streams. Each of these processors can access a shared memory (common to all processors belonging to the same SM) and a local cache memory. In addition, all the processors have access to the global GPU (device) memory. Figure 2 illustrates the GPU hardware architecture.
Figure 2: GPU data distribution for P multiprocessors with Q processors each.
Our strategy for implementing the fading channel simulator is aimed at improving the overall performance by chaining software functions (called kernels) representing each communication step. In order to implement the parallel fading simulator as illustrated in Figure 3, we distinguish five stages in the GPU design methodology as follows.
Figure 3: Proposed GPU design flow.
4.1. Gaussian Random Number Generator
In this stage, the CUDA Random Number Generation (cuRAND) library [17] is employed in order to obtain Gaussian random numbers (GRN) by means of efficient generation of high-quality pseudorandom numbers. In particular, the curand_init function is called to set up the state of a random number generator for each thread in a massively parallel scheme. There are seven types of random number generators in cuRAND; in this study, we have selected the XORWOW algorithm, which is a member of the xorshift family of pseudorandom number generators, with customized parameters for operating on GPUs.
The curand_normal2 function generates two normally distributed pseudorandom numbers in each call. Because the underlying algorithm is based on the Box-Muller transform, it is suitable for generating random complex numbers; that is, each call generates real and imaginary parts at the same time.
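The Box-Muller idea can be illustrated with a small CPU sketch. It is not the cuRAND implementation: the function names and the use of Python's `random` module are assumptions made for illustration. One pair of uniform variates yields one complex Gaussian sample whose real and imaginary parts each have unit variance.

```python
import math
import random


def box_muller(u1, u2):
    """Map uniforms u1 in (0, 1] and u2 in [0, 1) to two independent
    standard normal variates (basic Box-Muller transform)."""
    r = math.sqrt(-2.0 * math.log(u1))
    phi = 2.0 * math.pi * u2
    return r * math.cos(phi), r * math.sin(phi)


def complex_gaussian(n, seed=0):
    """Draw n complex samples; real and imaginary parts come from one call."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # 1 - random() lies in (0, 1], which keeps log() finite.
        re, im = box_muller(1.0 - rng.random(), rng.random())
        out.append(complex(re, im))
    return out


samples = complex_gaussian(20000)
mean_power = sum(abs(z) ** 2 for z in samples) / len(samples)
```

With unit variance per component, the mean power of the complex samples is close to 2; a complex process with unit total variance would need an extra scaling by $1/\sqrt{2}$.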
A CUDA kernel computes a set of K independent GRN vectors. Each vector corresponds to a path, which is computed in chunks by the GPU multiprocessors and then stored in device global memory. The implementation of the GRN generator is presented in Algorithm 2, where the function setup_kernel initializes the threads of the same block with a different sequence number but the same seed and offset (zero offset). Furthermore, generate_normal_kernel computes several pseudorandom values with Gaussian distribution by calling curand_normal2.
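#include <curand_kernel.h>

__global__ void generate_normal_kernel(curandState *state, int n, float *result)
{
    /* Global thread index; it also selects this thread's generator state. */
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Work on a local copy of the state for efficiency. */
    curandState localState = state[id];
    for (int i = 0; i < n; i++) {
        /* One call yields two normal variates: real and imaginary parts. */
        float2 x = curand_normal2(&localState);
        /* Assumed layout: thread id owns n interleaved (re, im) pairs,
           so result must hold 2 * n floats per thread. */
        result[2 * (id * n + i)]     = x.x;
        result[2 * (id * n + i) + 1] = x.y;
    }
    /* Copy state back to global memory. */
    state[id] = localState;
}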
4.2. Parallel Doppler FIR-Filter
The Doppler filter uses the coefficients obtained by sampling (12) and the random numbers generated in the previous subsection. Since the filter coefficients are fixed for all channel realizations and paths, they are stored in the constant memory of the GPU. This memory is devoted to storing and broadcasting read-only data to all threads on the GPU. In addition, the GRN results are stored in shared memory, since many threads must access them simultaneously. The filtering is conceptualized as a convolution, so a kernel that performs the convolution in parallel is used.
There is a set of K independent 1D signal convolutions to be computed, one for each path. Instead of writing a custom kernel, the filtering is performed using the NVIDIA Performance Primitives (NPP) library [18]; specifically, one of the nppiFilterRow functions is used, which performs 1D filtering on 2D data, each row being a channel path.
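The row-wise arrangement can be pictured with a plain CPU reference in which each row of a 2D array is one path and the same taps are applied along every row. This is an assumed sketch, not the NPP routine: `fir_filter_rows` is a hypothetical name, the taps are made up, and the zero initial conditions at the start of each row may differ from how a library handles borders and kernel anchors.

```python
def fir_filter_rows(rows, taps):
    """Causal FIR filtering along each row:
    out[k][n] = sum_j taps[j] * rows[k][n - j], with zeros before n = 0."""
    out = []
    for row in rows:
        filtered = []
        for n in range(len(row)):
            acc = 0j
            for j, g in enumerate(taps):
                if n - j >= 0:
                    acc += g * row[n - j]
            filtered.append(acc)
        out.append(filtered)
    return out


# Hypothetical data: K = 2 paths (rows) with 4 complex noise samples each.
paths = [[1 + 0j, 0j, 0j, 0j],      # path 0: unit impulse
         [1j, 1j, 1j, 1j]]          # path 1: constant sequence
taps = [0.25, 0.5, 0.25]            # made-up coloring filter coefficients
colored = fir_filter_rows(paths, taps)
```

The rows never interact, so the K convolutions are independent and can be distributed over GPU threads without synchronization between paths.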
4.3. Path Gain Implementation
The path gain is implemented with a multiplication function: the colored noise resulting from the previous stage is multiplied by a scalar. This could be carried out with a specific kernel or by using a standard library, such as CUDA Basic Linear Algebra Subroutines (cuBLAS) [19] or NPP. The proposed implementation uses the nppiMulC function of the NPP library.
4.4. Upsampler
The upsampler stage is responsible for generating noise samples at the rate $f_s$ and is implemented as an interpolation. The usual interpolation available on GPUs is the linear interpolation offered by texture memory; NPP offers other methods for more accurate results. In this case, the nppiResize function with cubic interpolation is used. It returns the interpolated value for a given coordinate between two known noise values.
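A cubic upsampler can be sketched on the CPU as below. The Catmull-Rom spline is used here as a stand-in and is an assumption: the cubic kernel applied by the library may differ, the function names are hypothetical, and the end samples are simply replicated at the borders.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic Catmull-Rom interpolation between p1 (t = 0) and p2 (t = 1)."""
    return 0.5 * (2 * p1 + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)


def upsample_cubic(samples, factor):
    """Insert factor - 1 interpolated values between consecutive samples.
    The first and last samples are replicated to supply missing neighbors."""
    n = len(samples)
    out = []
    for i in range(n - 1):
        p0 = samples[max(i - 1, 0)]
        p1, p2 = samples[i], samples[i + 1]
        p3 = samples[min(i + 2, n - 1)]
        for j in range(factor):
            out.append(catmull_rom(p0, p1, p2, p3, j / factor))
    out.append(samples[-1])
    return out


# Hypothetical low-rate complex sequence, upsampled by a factor of 4.
low_rate = [0j, 1 + 1j, 2 + 2j, 3 + 3j, 4 + 4j]
high_rate = upsample_cubic(low_rate, 4)
```

The interpolant passes through the original samples and uses four neighbors per segment, which gives a smoother result than the two-point linear interpolation of texture memory.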
4.5. Tap Generator
Up to this point, the multiple paths have been treated separately. In this stage, they are combined using predefined (computed offline) coefficients according to (11). This operation can be seen as the multiplication of the matrix of upsampled, scaled, colored noise samples ($N$ time samples for each of the $K$ paths) by the coefficient matrix $B[m,k]=B(mT_s-\tau_k)$. This could be carried out with a programmer's own implementation or by using a standard library such as cuBLAS. The proposed implementation uses the cublasSgemm routine, which performs a matrix-matrix multiplication with optional scalar scaling.
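Written as a general matrix multiplication, (11) becomes $H=\alpha AB^{T}+\beta C$ with $\alpha=1$ and $\beta=0$, where $A$ is $N\times K$ and $B$ is $M\times K$. The sketch below is a CPU reference under stated assumptions: `gemm_nt` is a hypothetical helper, the matrices are stored as row-major Python lists (cuBLAS itself uses column-major storage), and the numerical values are made up.

```python
def gemm_nt(alpha, a, b, beta=0.0, c=None):
    """Return alpha * A * B^T + beta * C for A (N x K) and B (M x K),
    both given as lists of rows."""
    n_rows, m_rows = len(a), len(b)
    if c is None:
        c = [[0j] * m_rows for _ in range(n_rows)]
    return [[alpha * sum(a_nk * b_mk for a_nk, b_mk in zip(a[n], b[m]))
             + beta * c[n][m]
             for m in range(m_rows)]
            for n in range(n_rows)]


# Hypothetical sizes: N = 2 time samples, K = 2 paths, M = 3 taps.
a = [[1 + 1j, 2 + 0j],      # A[n][k]: colored, scaled, upsampled noise
     [0 + 1j, 1 - 1j]]
b = [[1.0, 0.0],            # B[m][k] = B(m*Ts - tau_k), computed offline
     [0.5, 0.5],
     [0.0, 1.0]]
h = gemm_nt(1.0, a, b)      # h[n][m]: M channel taps for each time index n
```

Casting the tap generation as one large matrix product lets a tuned library routine handle the bulk of the arithmetic, which is consistent with this stage dominating the GPU time budget reported in Section 5.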
5. Implementation Results
In order to corroborate the functionality of the proposed fading channel simulator in modern communication systems such as WiMAX, it was configured with the following parameters [20, page 404]: a maximum Doppler frequency $f_{\mathrm{maxD}}=2000$ Hz, a sample rate $f_s=10$ Msps, and $f_{ls}=10f_{\mathrm{maxD}}$. In addition, the vehicular class B ITU multipath channel model was considered, which consists of six discrete paths with relative power [-2.5, 0, -12.8, -10, -25.2, -16] dB at delay time [0, 300, 8900, 12900, 17100, 20000] ns, respectively. For implementing the filter $B[m,k]$, a raised cosine function with a roll-off factor of 0.5 and a duration of $6T_s$ s was considered. This results in the generation of
$$M=\frac{20\ \mu\text{s}+0.6\ \mu\text{s}}{0.1\ \mu\text{s}}=206$$
taps. In Figure 4, a resulting GPU-based realization of the fading channel according to the specified parameters for $N=1024$ time samples is presented. It is important to note that the offline computed data (see Figure 3) are transferred to the GPU simulator via text files.
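The derived quantities for this configuration can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and mapping each relative power in dB directly to the path variance $\sigma_k^2$ (without normalizing the total power) is an assumption.

```python
import math

TS_NS = 100                                         # Ts = 1/fs = 0.1 us at 10 Msps
DELAYS_NS = [0, 300, 8900, 12900, 17100, 20000]     # path delays tau_k
POWERS_DB = [-2.5, 0.0, -12.8, -10.0, -25.2, -16.0]
PULSE_NS = 6 * TS_NS                                # duration of B, 6 Ts = 0.6 us

# Number of taps, computed as in the text: (20 us + 0.6 us) / 0.1 us.
num_taps = (max(DELAYS_NS) + PULSE_NS) // TS_NS

# Position of each path on the tap grid, tau_k / Ts.
tap_positions = [d / TS_NS for d in DELAYS_NS]

# Assumed mapping from relative power in dB to variance and amplitude.
variances = [10.0 ** (p / 10.0) for p in POWERS_DB]
amplitudes = [math.sqrt(v) for v in variances]

print(num_taps, tap_positions)
print([round(v, 4) for v in variances])
```

With these values, the six paths occupy widely separated positions among the 206 taps, so most columns of the tap grid carry only the tails of the band-limiting pulse.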
Figure 4: Impulse response realization of the fading channel simulator considering the vehicular class B ITU multipath channel model ($f_{\mathrm{maxD}}=2000$ Hz, $f_s=10$ Msps).
The simulation was carried out using an iMac computer with the following specifications: OS X 10.9.4 (Mavericks), Intel Core i5 processor (3.4 GHz), 16 GB of RAM, and a GeForce GTX 780M graphics card with 4 GB of RAM and 1536 CUDA cores.
For evaluating the time performance, the parameters used in the previous test have been maintained; however, the parameter $N$ was fixed to $N=1\times 10^{6}$ samples. In this sense, Table 1 presents the average, maximum, and minimum time consumption for a CPU-based implementation (Matlab) versus the proposed GPU-based methodology (CUDA). The GPU methodology achieves a speedup of more than 30-fold in the mean value (1760.821 ms versus 46.935 ms, that is, about 37-fold) when compared with the CPU-based implementation, which is attractive if parallel versions of the channel simulator are required, as could be the case in MIMO applications.
Table 1: Channel emulator implementation comparison. Time consumption for 1 mega samples generation and 10 channel realizations (in milliseconds).

|      | Matlab^1 | CUDA Libs^2 |
|------|----------|-------------|
| Min  | 1640.895 | 37.376      |
| Max  | 1821.171 | 75.186      |
| Mean | 1760.821 | 46.935      |

^1 CPU: Intel Core i5 3.4 GHz 16 GB.
^2 GPU: GeForce GTX 780M 4 GB.
Table 2 reports the time percentage for accomplishing each task of the channel simulator in the GPU. It should be noted that in this table the reading and device memory allocation—the most time-consuming tasks—are not considered. These tasks are performed only once at the initialization stage of the simulation.
Table 2: Time consumption by module computing 10 channel realizations.

| Time (%) | Module                                   |
|----------|------------------------------------------|
| 78.18%   | Matrix-matrix multiplication             |
| 13.62%   | Initializing random number generator^1   |
| 3.83%    | FIR filter                               |
| 2.37%    | Upsampling                               |
| 1.62%    | Gaussian number generation               |
| 0.01%    | Path gain                                |

^1 The seed initialization is carried out only once before the first channel realization.
On the other hand, Table 3 and Figure 5 present the overall time consumption in milliseconds for CPU- and GPU-based implementations when the number of samples is fixed to N = 5120, 10240, 20480, 81920, 327680, 655360, 1000000, and 1310720 samples. The time consumption of the CPU-based implementation grows considerably faster with N than that of the GPU-based implementation, which remains almost linear.
Table 3: Time consumption comparison (in milliseconds): CPU-based implementation (Matlab) versus GPU-based implementation (CUDA).

| N (samples) | Matlab^1  | CUDA Libs^2 | x-fold (gain) |
|-------------|-----------|-------------|---------------|
| 5,120       | 31.5466   | 0.614496    | 51            |
| 10,240      | 38.5282   | 1.350240    | 28            |
| 20,480      | 54.7829   | 2.331968    | 23            |
| 81,920      | 179.8391  | 5.785952    | 31            |
| 327,680     | 633.4515  | 17.09622    | 37            |
| 655,360     | 1204.584  | 25.36316    | 47            |
| 1,000,000   | 1769.243  | 37.81030    | 47            |
| 1,310,720   | 3024.966  | 47.43065    | 64            |

^1 CPU: Intel Core i5 3.4 GHz 16 GB.
^2 GPU: GeForce GTX 780M 4 GB.
Figure 5: Time consumption comparison: CPU-based implementation versus GPU-based implementation.
Similarly, the good performance achieved with the GPU implementation with respect to the CPU implementation can be observed in the x-fold gain reported in Table 3. This gain is calculated as the quotient of the time consumption of the CPU implementation and that of the GPU implementation. The gain is reported for each value of N listed in the previous paragraph.
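The quotient can be recomputed directly from the timings in Table 3. The short script below is an illustrative check with hypothetical variable names; it only reuses the numbers reported in the table.

```python
N_SAMPLES = [5120, 10240, 20480, 81920, 327680, 655360, 1000000, 1310720]
CPU_MS = [31.5466, 38.5282, 54.7829, 179.8391,
          633.4515, 1204.584, 1769.243, 3024.966]
GPU_MS = [0.614496, 1.350240, 2.331968, 5.785952,
          17.09622, 25.36316, 37.81030, 47.43065]
REPORTED_GAIN = [51, 28, 23, 31, 37, 47, 47, 64]

# x-fold gain: CPU time divided by GPU time for the same N.
gains = [cpu / gpu for cpu, gpu in zip(CPU_MS, GPU_MS)]

for n, gain, reported in zip(N_SAMPLES, gains, REPORTED_GAIN):
    print(f"N = {n:>9,d}  computed gain = {gain:6.2f}  reported = {reported}")
```

The recomputed quotients agree with the reported integer gains to within one unit for every value of N.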
Finally, it is important to emphasize that the presented approach can deal with several path realizations. This suggests that the developed fading channel simulator can be considered for generating large MIMO channels, which represents a new simulation paradigm.
6. Conclusions
The principal result of this study is the introduction of a methodology for designing fading channel simulators via GPU devices. Such a methodology permits nonspecialized users to easily implement channel simulators in parallel. As was shown, the use of GPUs in the development of fading channel simulators greatly reduces the simulation time when channel realizations are generated for testing communication systems. Moreover, a case study for WiMAX systems demonstrated the functionality of the implemented channel simulator. We believe that the proposed parallel channel simulator can aid in testing mobile communication systems based on LTE and WiMAX. Additionally, the presented GPU-based approach will allow the design of more sophisticated simulators of complex channel models such as triply selective MIMO fading channels (i.e., time, frequency, and space selective).
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported by the Programa para el Desarrollo Profesional Docente (PRODEP) 2014 and CONACYT, Ciencia Básica, 2014 (CB2014-241272), Mexico.
References
[1] Longoria-Gandara O., Parra-Michel R. Estimation of correlated MIMO channels using partial channel state information and DPSS.
[2] Bazdresch M., Cortez J., Longoria-Gándara O., Parra-Michel R. A family of hybrid space-time codes for MIMO wireless communications.
[3] NVIDIA. High-performance computing. 2014, http://www.nvidia.com/object/what-is-gpu-computing.html
[4] Andelfinger P., Mittag J., Hartenstein H. GPU-based architectures and their benefit for accurate and efficient wireless network simulations. Proceedings of the 19th Annual IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '11), July 2011, 421–424, doi: 10.1109/mascots.2011.40.
[5] Henz B. J., Richie D., Jean E., Park S. J., Ross J. A., Shires D. R. Real-time radio wave propagation for mobile ad-hoc network emulation using GPGPUs. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '13), 2013.
[6] Bai S., Nicol D. M. Acceleration of wireless channel simulation using GPUs. Proceedings of the European Wireless Conference (EW '10), April 2010, Lucca, Italy, 841–848, doi: 10.1109/ew.2010.5483525.
[7] Yang H., Kim T., Ahn C., Kim J., Choi S., Glossner J. Implementation of parallel lattice reduction-aided MIMO detector using graphics processing unit.
[8] Pätzold M.
[9] Stüber G. L.
[10] Cyril-Daniel I. A MATLAB-based object-oriented approach to multipath fading channel simulation. 2008, Québec, Canada, Hi-Tek Multisystems.
[11] Jeruchim M. C., Balaban P., Shanmugan K. S.
[12] Bello P. Characterization of randomly time-variant linear channels.
[13] Castillo J. V., Atoche A. C., Longoria-Gandara O., Parra-Michel R. An efficient Gaussian random number architecture for MIMO channel emulators. Proceedings of the IEEE Workshop on Signal Processing Systems (SiPS '11), October 2011, Beirut, Lebanon, 316–321, doi: 10.1109/sips.2011.6088996.
[14] Vela-Garcia L., Castillo J. V., Parra-Michel R., Pätzold M. An accurate hardware sum-of-cisoids fading channel simulator for isotropic and non-isotropic mobile radio environments.
[15] Vázquez Castillo J., Vela-Garcia L., Gutiérrez C., Parra-Michel R. A reconfigurable hardware architecture for the simulation of Rayleigh fading channels under arbitrary scattering conditions.
[16] Kontorovich V., Primak S., Alcocer-Ochoa A., Parra-Michel R. MIMO channel orthogonalisations applying universal eigenbasis.
[17] NVIDIA. CUDA random number generation library (cuRAND), 2014, https://developer.nvidia.com/curand
[18] NVIDIA. NVIDIA performance primitives, NVIDIA developer zone, https://developer.nvidia.com/npp
[19] NVIDIA. CUDA basic linear algebra subroutines (cuBLAS), 2014, https://developer.nvidia.com/cublas
[20] Andrews J. G., Ghosh A., Muhamed R.