A Triply Selective MIMO Channel Simulator Using GPUs

Amethodology for implementing a triply selectivemultiple-inputmultiple-output (MIMO) simulator based on graphics processing units (GPUs) is presented.The resulting simulator is based on the implementation of multiple double-selective single-input singleoutput (SISO) channel generators, where the multiple inputs and the multiple received signals have been transformed in order to supply the corresponding space correlation of the channel under consideration. A direct consequence of this approach is the flexibility provided, which allows different propagation statistics to each SISO channel to be specified and thus more complex environments to be replicated. It is shown that under some specific constraints, the statistics of the triply selectiveMIMO simulator are the same as those reported in the state of art. Simulation results show the computational time improvement achieved, up to 650-fold for an 8 × 8MIMO channel simulator when compared with sequential implementations. In addition to the computational improvement, the proposed simulator offers flexibility for testing a variety of scenarios in vehicle-to-vehicle (V2V) and vehicle-toinfrastructure (V2I) systems.


Introduction
With the growing demand by users of rapid transfer of high amounts of data, it has been necessary to develop new transmissions strategies and, consequently, to verify their performance in order to accomplish the established goals.In this sense, multiple-input multiple-output (MIMO) simulator has been considered in recent years a fundamental part of new communication standards embraced by long-term evolutionvehicle (LTE-V) and vehicle-to-vehicle (V2V) technologies.This is due to the fact that MIMO can take advantage of space diversity and multipath scattering for increasing the data transmission rate considerably in comparison with single-input single-output (SISO) communication systems [1].Because of its relevance, several mathematical channel models and hence diverse channel simulators/emulators have been proposed in order to prove the performance of MIMObased communication systems.
In this sense, as summarized by [2], MIMO channel models can be classified in diverse ways; the broadest classification considers physical models and analytical models.Physical models take into account the electromagnetic propagation and the environment under study for obtaining a channel model.Moreover, such models can be categorized as deterministic models [3], geometric-based stochastic models [4], or stochastic models [5,6].Analytical models, on the other hand, abstract the complex electromagnetic propagation mechanisms into tractable channel impulse responses for modeling the MIMO channels.Moreover, analytical models can be classified as propagation-based models and correlation-based models, where the distinguished Kronecker model [7] and the Weichselberger model [8] fit in the latter classification.These models are the most commonly used models for simulating MIMO channels due to their simplicity and the conceptualization of the propagation environment.This paper considers an analytical model based on correlation, which assumes a triply selectivity; that is, the propagation channel for each transmitter and receiver antenna pair presents both time and frequency selectivity and a spatial correlation.This model was chosen to capture the nature of the propagation channel with greater precision.As expected, the implementation of this triply selective channel simulator is highly complex; for example, if a MIMO system made up of   = 8 transmitter antennas and   = 8 receiver antennas is considered, then it is necessary to implement   ×   = 64 independent SISO channel simulators in parallel.This number of SISO channels increases as well as the number of antennas is increased.Therefore, graphics processing units and GPU-accelerated computing techniques can help to manage the computational complexity in the MIMO channel simulator.
The GPU-accelerated computing techniques use the GPU together with CPUs, which are available in affordable computers or servers.The purpose is to accelerate scientific computation and engineering calculations, among others.As a result, several works related to wireless channel simulators have been presented in order to handle the computational complexity in the implementation of channel simulators [9][10][11][12] and high demanding digital signal processing algorithms [13,14].
In [9], the authors present a GPU-based implementation, which uses the filtering method for developing a doubly selective SISO channel simulator (frequency and time selective).In [10], a full 3-D GPU-based beam-tracing method is presented for propagation modeling in complex indoor environments.Likewise, the study in [11] presents an improved path loss simulation incorporating a three-dimensional terrain model using parallel coprocessors or GPUs.In addition, in [12] and references therein, a selection of GPU-based implementations is presented, which improves the computational processing time and reports GPU-based implementations with significant speedups.However, even though many GPUbased implementations have been reported in the state of the art, a complete triply selective fading channel simulator has not been reported in the open literature.Examples of complete MIMO fading channel simulators and emulators can be found in [15][16][17], but these architectures do not exploit the triple selectivity at the same time, or they configure the hardware with a low number of antennas.
This paper presents a triply selective MIMO channel simulation methodology using GPU techniques; the methodology will be based on SISO channel generators presented in [9].This simulator includes the phenomenology of the propagation environment (time, frequency and space selectivities), which has been left out of the state-of-art simulators due to the computational complexity.Likewise, the introduced methodology exhibits enough flexibility for implementing channel simulators with different MIMO channel configurations.
1.1.Notation.Bold upper (lower) case letters are used for denoting matrices (vectors); (⋅)  , (⋅)  , ⌈⋅⌉, and (⋅) denote transpose, complex transpose (Hermitian), the ceil function, and the expectation operator, respectively.[A] , denotes the element in the th row and th column of A. vec(A) is the reordering of the columns of A into a single column vector.diag(a) is a diagonal matrix whose elements are those from vector a.I  denotes an identity matrix of length  × .Finally, ⊗ stands for the Kronecker product of two matrices.

Organization.
The paper is organized as follows: in Section 2, the mathematical model of the triply selective channel is analyzed.The methodology for implementing this mathematical model using GPUs is described in Section 3. Implementation results assuming diverse scenarios are presented in Section 4. Finally, some concluding remarks in Section 5 close this paper.

Triply Selective Channel Model
Assume a baseband discrete time MIMO communication channel that presents time, frequency, and spatial selectivity (triple-selective channel).Moreover, consider that this communication system is conformed of   transmit antennas and   receive antennas as depicted in Figure 1.This system can be envisaged as an array of   ×   SISO channels where a transmitter and a receiver correlation stage are included in order to provide the corresponding spatial correlation statistics.Without loss of generality, if it is stated that all the SISO channels are modeled as FIR filters of  coefficients, then the MIMO system at time index  can be expressed mathematically as follows: where y() ∈ C where h , () = [ℎ , (, 1), ℎ , (, 2), . . ., ℎ , (, )]  is the discrete time-varying channel impulse response from the transmitter  to the receiver .The value  is selected as max(⌈( max , +   )/  ⌉) ∀ = 1, . . .,   ; ∀ = 1, . . .,   , where  max , is the maximum delay corresponding to the subchannel from transmitter  to receiver ,   is the impulse response duration of the combined transmitter and receiver filters, and   is the symbol period.Thus, it is possible to consider the MIMO channel as whose autocorrelation function can be stated as follows: This channel model can be interpreted as a form of the Kronecker model reviewed in [7,18], where time and frequency selectivity have also been considered.In order to simplify the calculations, autocorrelation of H  () is analyzed instead, which does not affect the statistics of the channel under analysis.Thus, the autocorrelation of where it has already made use of If it is assumed that all the subchannels in H  () are (i) uncorrelated among themselves, (ii) wide-sense stationary uncorrelated scattering (WSSUS), (iii) Rayleigh, then, vec(H  ()) can be expressed as From ( 6), w , () ∈ C  , ×1 is a zero-mean circularly symmetric complex Gaussian random vector with an autocorrelation function: where Σ 2 , and Λ , ( 1 −  2 ) are both diagonal matrices, such that [Σ 2 , ] , =  2  is the power of the th path of the subchannel from antenna  to antenna  with delay  ,  and autocorrelation function [Λ , ( 1 −  2 )] , evaluated at the instant difference  1 −  2 for  = 1, . . .,  , , where  , is the number of paths of the subchannel from  to .
Let G be a block diagonal matrix, where the elements of each block G , ∈ C × , are given by where () is the impulse response of the band limiting filter, for example, sinc or raised cosine filter with duration   ,  = 1, . . ., , and  = 1, . . .,  , .From (6), where are both block diagonal matrices.Substituting ( 9) into (5), Function (11) is the autocorrelation function of the proposed model (3).It is possible to observe that this model offers great flexibility by allowing the subchannels to have different statistics, for example, different power delay profile (PDP) and power Doppler spectrum (PDS).As a special case, if all the paths of all subchannels are forced to have the same PDS, then Λ( 1 −  2 ) = I( 1 −  2 ), where ( 1 −  2 ) is the autocorrelation function of all the paths.As a particular case, if Jakes' model is considered, where the different scatters propagate in the azimuth plane and arrive at the receivers with an angle of arrival distributed uniformly between [0, 2], then ( 1 −  2 ) =  0 (2 max ( 1 −  2 )  ), where  0 (⋅) is the Bessel function of order 0 and  max is the maximum Doppler frequency.If it is also considered that all the subchannels have the same PDP composed of  paths, then Σ2 = I     ⊗ Σ 2  and G = I     ⊗ G  , where Σ 2  ∈ R × is a diagonal matrix containing the gains of the paths and G  ∈ R × is calculated as in (8).However, assuming that the paths of all the subchannels have the same delay, then (11) transforms into (12), which coincides with the model proposed in [16]:

GPU Implementation
The simulator implementation takes advantage of the parallel capabilities of General-Purpose Computing on GPU (GPGPU) and the use of the VexCL library [19].VexCL is an OpenCL/CUDA library developed in C++ by Denis Demidov.It supports multidevice, multiplatform computations and provides functions for floating-point vector/matrix operations.VexCL uses vector expressions, which are automatically processed in parallel across all devices.
The simulation process comprises two stages: channel coefficient generation and data frame processing.These stages are depicted in Figure 2. The first stage corresponds to the generation of the elements of matrix H of (2); the second stage represents (1).

Complex
Operations.GPU frameworks, like OpenCL and CUDA, do not provide complex data types; the closest data type is the 2-vector type cl double2 of OpenCL.Hence, custom vex functions were implemented to perform addition and multiplication of complex numbers.These functions are used in vector expressions.The codes implemented for performing the multiplication and addition of complex numbers are presented in Listings 1 and 2, respectively.The previous functions receive two 2-vectors used as complex numbers.

Channel Coefficient Generation.
The channel coefficients require random numbers, which are filtered for generating the channel taps with specific PDS and correlation.=1  , of pairs of double precision floating-point numbers, vex::element index() is the function to get the th random number, and 123 is the seed for the generator.Instead of using doubles, the vector noise is composed of cl double2 numbers; for example, a pair of double precision numbers is used as complex numbers.

Doppler Filter.
The generated complex random numbers are passed through a filter which provides the corresponding time domain statistics.The transfer function of this filter is the square root of the autocorrelation function in the time domain for each path.As a particular case where all the paths in all the subchannels follow the Jakes model, the impulse response of this filter is Γ(3/4)( max /(|( − /2)  |)) 1/4  1/4 ( max |(−/2)  |), where Γ(⋅) is the gamma function,  is the length of the filter, and  = 0, . . .,  is an index that enumerates the coefficients [22].In order to perform the noise filtering, a custom function, named convolution, was coded; it receives pointers to arrays of 2-vectors corresponding to the noise and Doppler filter coefficients, respectively.The code used for this convolution is presented in Listing 3, where i is the OpenCL's element index, x is the noise, y is the filter, tF is the length of the filter, fS is the number of samples to be filtered, nS is the number of samples, and nP is the number of paths.The convolution function processes in parallel the complete noise vector.It executes the function body for each noise sample.

Path Gains.
They are implemented as the product of a vector and a scalar.Each path has its own gain.

Upsampling.
The upsampling is a quadratic interpolation to a desired number of values of the noise vector.

Tap Generation.
The resulting noise vector, representing the paths, is correlated with a matrix G , (e.g., raised cosine).This correlation is implemented as a matrixmatrix product.A MIMO system has multiple channels; the correlation is done for each channel.

Data Frame
Processing.This section describes the implementation of Figure 1 and (1): a transmitter and a receiver correlation stage and the discrete time-varying channel impulse response from the transmitter  to the receiver .

Transmitter Correlation.
The correlation on the transmitter side is implemented as a product of the data x() and correlation matrices C  ; the latter matrix contains the coefficients that correlate data frames and the transmit antennas.

FIR Filter.
The transmitted data frames are filtered with the channel coefficients, H(), described in Section 3.2.In order to achieve this goal, a convolution is implemented.The code used for this convolution is presented in Listing 4, where the variable i is the OpenCL's element index, the variable x is a pointer to the data to be processed, the variable y is a pointer to the generated channel coefficients, the variable tRc is the number of taps of the raised cosine, the variable nD is the data frame size, the variable uS is the size of the path, the variable nT is the number of transmitters, and the variable nR is the number of receivers.

Receiver Correlation.
The correlation on the receiver side is implemented as a product of the received data and the correlation matrix C  ; the latter matrix contains the coefficients that correlate data frames and the receive antennas.

Results
In this section, the time performance of the proposed simulator is evaluated.In order to achieve this goal, a MIMO 2 × 2 channel is considered, where all the subchannels have the same PDS with Jakes shape and  max = 2000 Hz.If a carrier frequency of 5.9 GHz is considered, then the assumed  max corresponds to a scenario where the speed of the mobile is 360 km/h.Likewise, the PDP of each SISO subchannel is fixed as described in Table 1.The PDPs are selected in order to prove the functionality of the simulator, with the .It is assumed that the transmitter and the receiver filters are both a square root raised cosine with a rolloff factor of 0.5 and the duration of its convolution is 2  = 6  , where the period symbol is fixed at   = 0.1 s.According to the maximum delay values presented in Table 1, as well as the values of   and   , the parameter  is equal to 206 taps.Finally, the MIMO simulator generates channels assuming data frames composed of 1024 symbols.
Figure 3 shows a parallel realization of each MIMO subchannel using the proposed GPU-based simulator, where each of the subfigures presents the time-variation of the corresponding filter coefficients, which are associated with the assigned PDS.Moreover, it is possible to observe that each subchannel satisfies the specifications of the PDP assigned.Thus, PDPs with large delays correspond to filters with more coefficients, as observed in the channel realizations ℎ 1,2 and ℎ 2,1 , respectively.
The time performance evaluation is carried out using a personal computer (PC) with the following specifications: The simulations are carried out over 100 frames and the elapsed processing time per frame is obtained by averaging all the frames.Table 2 shows the time required to execute each of the simulator blocks.Figure 4 presents the comparison of the time required to execute all the considered cases when the GPU-based simulator is used, as well as the sequential implementation of this simulator using C language.For a better appreciation, the results are plotted in logarithmic scale.
Table 3 summarizes the time consumption and also presents the -fold time gain.This gain is calculated as the quotient of the time required by the sequential implementation divided by the GPU implementation.It is possible to observe that for the cases from 2 × 2 to 7 × 7, as the complexity of simulator increments, the gain increments too.This is due to the fact that as more operations are required, better utilization of the parallel resources of the GPU is achieved.In the remaining cases, the gain is around 680; starting at the 7 × 7 case, the simulator simultaneously uses the available resources in the device.As a matter of fact, with the GPU considered for testing the simulator, the 24 × 24 case is the most complex MIMO channel that can be simulated; nevertheless, if the hardware is upgraded, then this inconvenience can be addressed, opening the possibility for the simulation of more complex scenarios, for example, massive MIMO.4.1.Discussion.Recently, several channel simulator approaches have been introduced in the open literature.However, these implementations are based on field programmable gate array (FPGA) devices, and they are restricted to using a single configuration in the developed simulator; that is, the time-and frequency-selective correlations are fixed.As a result, setting new channel statistics in the simulator (new channel propagation conditions) implies the redesign of the implemented hardware.For example, in the channel simulator introduced by [16], the time selectivity is generated by using a sum of sinusoids, which approximates a Jakes PDS.This simulator uses uniform random number generators for defining the configuration of the parameters of these sinusoids.Thus, if one wishes to generate a channel with a different PDS, then this will imply changing the probability density function of the random number generator block.In contrast, for changing the statistics of the simulator proposed in this paper, it is necessary only to upgrade the coefficients of the filters, which allows for a quick redesign of the simulator.
As regards performance, even though the reported FPGA implementations can operate in real time, they are constrained in terms of the number of operations that they  can execute.For example, in [16] it is reported that the MIMO channels which can be simulated should be limited to   ×   ×  ≤ 160 total taps.Meanwhile, the tests performed on the proposed simulator with the selected GPU device reveal that the most complex channel that can be simulated contains 118656 total taps (24 × 24 × 206).Moreover, the implementations presented in the open literature for achieving the reported performance consider that the paths of the PDP occur in multiples of the symbol period and all the subchannels share the same PDP.This contrasts with our GPU-based implementation, which can deal with paths with arbitrary positions and distinct PDPs for each subchannel.This is noteworthy because communication standards recommend PDPs with paths whose allocations are symbol period independent, while the flexibility of setting different PDPs for each subchannel makes it possible to simulate more complex scenarios.If the constrains assumed in the reported simulators were considered, then the proposed simulator could generate more total taps and consequently MIMO channels with larger numbers of antennas.

Conclusions
This paper presents a methodology for implementing a triply selective MIMO simulator based on GPUs.The proposed simulator considers the use of multiple SISO simulators, which are also implemented with GPUs, together with input and output correlation matrices in order to provide the corresponding space selectivity.This approach allows for great flexibility because each SISO subchannel can be set with different statistics and, therefore, more complex environments can be modeled.It is shown that under some considerations, the scheme fits with the more specific model proposed in the state of the art.Moreover, the implementations with GPUs considerably reduce the execution time compared with sequential implementations.Implementation results show a 688-fold gain for MIMO 7 × 7 when compared with sequential implementations.This suggests that this approach will enable more complex MIMO channels to be simulated with a greater number of antennas and a minimum penalty of time execution; in this way, communication systems can be tested in less time.Likewise, the considered model allows for the simulation of different scenarios, which are not forced to have the same statistics in each subchannel.Therefore, it becomes possible to simulate propagation environments that are more adequate for V2I and V2V communication systems.

Figure 4 :
Figure 4: Comparison of execution time for different MIMO channels.
with y() = [ 1 (),  2 (), . . .,    ()]  is a vector containing the samples received from each antenna, C  ∈ C   ×  is the matrix that provides the spatial correlation due to the receive antennas, and Ĉ = C  ⊗ I  , C  ∈ C   ×  correlates the transmitted samples according to the correlation of the transmit antennas.   ()]  is a vector composed of the actual and past samples of each transmitter, where x  () = [  (),   ( − 1), . . .,   ( − )]  .The matrix H() ∈ C   ×   is defined as follows:

Table 1 :
Power delay profiles for configuring the SISO subchannels of the MIMO 2 × 2 simulator.

Table 2 :
Time consumption by each module of the proposed simulator considering an 8 × 8 MIMO simulator.PDPs of h 1,1 and h 1,2 corresponding to the vehicular test environment channel A and channel B as defined in [23] respectively.The values of matrices C  and C  are fixed in such a way that they make [C  C   ] , = 0.3 |−| and [C  C   ] , = 0.9 |−| , respectively [24]

Table 3 :
Time consumption for different MIMO channel realizations when the GPU-based simulator and a sequential implementation are used.