Processing-Efficient Distributed Adaptive RLS Filtering for Computationally Constrained Platforms

In this paper, a novel processing-efficient architecture for a group of inexpensive and computationally incapable small platforms is proposed for parallely distributed adaptive signal processing (PDASP) operation. The proposed architecture runs computationally expensive procedures, such as the complex adaptive recursive least squares (RLS) algorithm, cooperatively. The proposed PDASP architecture operates properly even if perfect time alignment among the participating platforms is not available. An RLS algorithm for the application of MIMO channel estimation is deployed on the proposed architecture. The complexity and processing time of the PDASP scheme with the MIMO RLS algorithm are compared with those of the sequentially operated MIMO RLS algorithm and the linear Kalman filter. It is observed that the PDASP scheme exhibits much lower per-node computational complexity than the sequential MIMO RLS algorithm as well as the Kalman filter. Moreover, at a low Doppler rate, the proposed architecture provides an improvement of 95.83% and 82.29% decreased processing time compared to the sequentially operated Kalman filter and MIMO RLS algorithm, respectively. Likewise, at a high Doppler rate, the proposed architecture entails an improvement of 94.12% and 77.28% decreased processing time compared to the Kalman and RLS algorithms, respectively.


Introduction
Adaptive filtering techniques play a very important role in the emerging fields of science and technology [1-3], and the last two decades have witnessed tremendous research in the field of adaptive filtering aimed at improving convergence and complexity requirements [4-6]. However, achieving fast convergence on an energy-constrained and computationally incapable platform still remains elusive in spite of the magnificent advancements in integrated circuit (IC) technologies. For instance, in video conferencing, echo cancellation requires a high definition adaptive filtering algorithm to obtain robust convergence performance while tracking the time varying uncertainties present in the communication link. Nevertheless, such a high definition adaptive algorithm cannot be run on an energy-constrained, computationally incapable, inexpensive platform. The following lines present a brief review of the literature, where significant efforts have been made to propose low complexity and distributed solutions to this problem.
In [7, 8], the Banachiewicz inversion formula is used to perform the matrix inversion for MIMO-OFDM based software defined radio (SDR) signal detection. The inversion of a 4 × 4 matrix is divided into four 2 × 2 matrices, which reduces the computational operations. Likewise, the authors in [9] derive a low complexity algorithm for Hermitian positive-definite recursive matrix inversion that provides lower computational complexity than [7, 8], utilizing only 52 operations for the matrix inversion. However, the matrix inversion concept of [7-9] does not have a significant impact on the computational cost of high definition adaptive filtering [10]. In [11], Xiao et al. introduce an LR-MMSE algorithm based on QR decomposition and complex lattice reduction (CLR) which provides 35.5% lower computational complexity than the MMSE based scheme [12]; however, a 35.5% reduction in computational cost still cannot meet the demands of high data rate communications. A low complexity reduced rank linear interference suppression scheme is proposed in [13, 14], based on the use of polynomial expansion (PE). The matrix inverse is represented by a matrix polynomial [13, 15-17] whose order selection is a tedious task when trading off complexity against detection performance. In [18], a comparison is made among renowned subband adaptive filtering (SAF) structures with a parallel arrangement of multirate filter banks. The SAF technique exhibits reduced complexity through the use of the least mean square (LMS) adaptive filtering algorithm in an acoustic noise environment. However, due to phase, aliasing, and amplitude distortions and the extra processing delay, these systems may be ruled out for real-time implementation. Another architectural configuration for reducing runtime is MMSE signal estimation using wireless sensor nodes [19]. In this architecture, the authors use the distributed adaptive node-specific signal estimation (DANSE) technique to estimate the channel coefficients by following the Wiener-Hopf equation. However, the DANSE technique only follows the MMSE criterion rather than running an adaptive filtering algorithm, which makes it incapable of tracking time varying channel conditions. To the best of our knowledge, there is no parallel structure of recursive adaptive filtering algorithms in the literature in which a complex adaptive algorithm runs in parallel fashion over computationally incapable platforms without perfect time alignment. Nevertheless, software parallelism is provided by Matlab [20] and LabVIEW [21], which are available on various architectures of present-day fast processing computers with perfectly time-aligned processors. These software packages provide parallel processing toolboxes to divide large problems into smaller computations, hence reducing running time. Likewise, a graphics processing unit (GPU) enables running high definition graphics on a personal computer (PC) by exploiting hundreds of cores [22]. Furthermore, the compute unified device architecture (CUDA) [23] is NVIDIA's GPU architecture that supports multithreaded applications in which cores can communicate and exchange information with each other. However, these cores have not been used to run adaptive algorithms in parallel with nonaligned time indexes.
In this paper, our objective is to provide a novel low complexity solution for inexpensive and computationally incapable platforms by proposing their parallely distributed adaptive signal processing (PDASP) operation, which makes them run computationally expensive procedures cooperatively.
The implementation of the proposed PDASP technique using recursive least squares (RLS) makes these inexpensive and computationally incapable platforms work in parallel even with nonaligned time indexes, providing much lower processing time per node than the sequential Kalman [24] and RLS [25, 26] algorithms. RLS is a special case of the Kalman filter and is one of the most popular filters in the adaptive filtering domain; it offers a superior convergence rate, especially in time varying environments, at the price of an increased nonlinear computational cost.
The idea behind the distributed signal processing is based on parallel processing and the "divide and conquer" paradigm: part of the algorithm is divided into subalgorithms, which are then passed to other processing nodes to provide an efficient low complexity solution.
The rest of the paper is organized as follows. Section 2 describes the system model. Section 3 presents the proposed PDASP scheme for computationally incapable platforms. Complexity analysis is introduced in Section 4. In Section 5, simulation based results are presented, and Section 6 draws the conclusions.

System Model
In this section, we discuss the working procedure of our proposed parallely operated recursive least square (RLS) filter in the light of its conventional sequential operation.
In the conventional RLS adaptive algorithm and its variants, all filter subparts are interdependent and operate sequentially. Before introducing the proposed parallel RLS operation over individual platforms with different clock systems, we define some timing variables, illustrated in Figure 1, where a single iteration of an RLS algorithm consists of several sequential blocks.
(i) Computational Time. The time taken by the processor for a single computation; it follows directly from the speed of the processor.

(ii) Block Processing Time. The processing time of one block of the algorithm. It depends on the number of computations involved in the block and is therefore a multiple of the computational time.

(iii) Fetch Time. The time in which one block fetches information from another block, usually its predecessor.

(iv) Algorithm Step Time. The processing time of one complete iteration of the algorithm.
If RLS filtering is operated on a single computationally capable platform, all algorithm blocks are executed sequentially, as shown in Figure 1, with a fetch time approaching zero. However, if the same RLS filtering is operated on a group of computationally incapable platforms using the proposed PDASP architecture, different algorithm blocks are executed in parallel on various individual platforms, with fetch times that vary depending on the media among the nodes, as shown in Figure 2.
The only feasible way to operate the RLS algorithm in parallel on individual platforms with different clock systems is to leave the time indexes nonaligned. While setting up this time nonalignment, two things must be ensured. First, the filter must not exhibit any uncertain behavior, whatever application it is implemented on. Second, all the filter subparts must be able to work in parallel with fetch times that are favorable with respect to the block processing times. In this way, the sequential structure is able to work in parallel even with nonaligned time indexes. In Figure 2, the cooperative parallely operated RLS filtering architecture consists of four processing nodes, node 1 through node 4. Nodes 1 and 4 are interlinked with nodes 2 and 3, respectively, while also being connected to themselves. Likewise, node 2 is interconnected with nodes 1 and 4, and node 3 is linked only with node 4. All the processing nodes first share information with one another and then carry out the desired processing. The processing time of each block differs from the others and is known to all nodes; therefore, every node that completes its task early waits until the block with the maximum processing time finishes. In this way, the inexpensive and computationally incapable platforms work in parallel toward a common goal.
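The timing behavior described above can be made concrete with a small sketch. The block times and fetch time below are illustrative assumptions (arbitrary units), not values from the paper; the point is only the structural difference between the two step times.

```python
# Per-iteration timing model: sequential execution sums all block times,
# while PDASP execution is bounded by the slowest block plus one fetch,
# because faster nodes wait for the block with maximum processing time.

fetch_time = 0.5                      # assumed inter-node fetch time
block_times = {                       # assumed per-block processing times
    "error_covariance": 4.0,          # the dominant block
    "kalman_gain": 2.0,
    "signal_estimate": 1.0,
    "estimation_error": 0.5,
    "coefficient_update": 1.5,
}

# Sequential operation on one platform: blocks run back to back.
t_sequential = sum(block_times.values())

# PDASP operation: each block on its own node; the step time is the
# maximum block time plus the fetch time between nodes.
t_parallel = max(block_times.values()) + fetch_time

print(t_sequential)   # 9.0
print(t_parallel)     # 4.5
```

Under these assumed numbers the parallel step time is dominated entirely by the error covariance block, which matches the waiting rule described above.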

Proposed PDASP Technique for RLS with Nonaligned Time Indexes
In adaptive filtering, all the filter subparts are interdependent. Because of this cascaded structure, the algorithm accumulates the processing times of all subparts while attaining convergence under uncertain channel conditions. Using the PDASP technique, the RLS algorithm runs in parallel even with nonaligned time indexes, providing a low processing time at each machine or processing node. The flow diagram of the PDASP technique using RLS is shown in Figure 3. The notation τ_X represents the time used in processing X, whose computation is done inside the indicated block. Let the processing times taken by the error covariance matrix Φ_k, the Kalman gain g_k, the received signal estimate r̂_k, the estimation error e_k, and the updated filter coefficient matrix Ĥ_k be τ_Φ, τ_g, τ_r, τ_e, and τ_Ĥ, respectively. The total time taken by the whole algorithm running in cascaded fashion is therefore

τ_total = τ_Φ + τ_g + τ_r + τ_e + τ_Ĥ.

The maximum processing time among τ_Φ, ..., τ_Ĥ is τ_Φ, because computing Φ_k takes more multiplications than any of g_k, r̂_k, e_k, and Ĥ_k. To operate the RLS algorithm in parallel while distributing its blocks over individual nodes with nonaligned time indexes, the strict and sufficient condition for a fast convergence rate, in terms of multiplication and addition computations, is that the processing and fetch time of every other block be bounded by τ_Φ. Due to the nonaligned time indexes, the mismatch Δe between the aligned and nonaligned operation can be written as

Δe = e_Seq − e_NA,

where e_Seq is the error of the sequential algorithm and e_NA is the error of the PDASP algorithm with nonaligned time indexes. The proposed architecture can also be run in sequential format for convergence calibration. The sequential implementation of the PDASP RLS algorithm with nonaligned time indexes is nearly the same as that of a conventional RLS algorithm run on a single machine. The steps of this sequential format are shown in Algorithm 1.

Complexity Analysis
The linear Kalman filter requires 2n³ + 6n² + 3n + 1 multiplications and 3n³ + 4n² + 2n additions per iteration, where n represents the dimension of the filter order. Likewise, the RLS algorithm, a special case of the Kalman filter, entails 5n² + 2n + 1 multiplications and 4n² + 2n additions per iteration. The implementation of the proposed PDASP technique on the RLS algorithm exhibits a much lower computational cost for each parallely distributed block entity: with nonaligned time indexes it entails at most 2n² multiplications and n² additions per iteration per node. The proposed parallel technique thus provides much lower processing time than the sequential Kalman and RLS algorithms.
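The operation counts in this section, read as polynomials in the filter order n, can be tabulated directly; n = 4 below matches the 4 × 4 MIMO case used in the simulations.

```python
# Per-iteration (multiplications, additions) for each scheme,
# as polynomials in the filter order n.

def kalman_ops(n):
    return 2 * n**3 + 6 * n**2 + 3 * n + 1, 3 * n**3 + 4 * n**2 + 2 * n

def rls_ops(n):
    return 5 * n**2 + 2 * n + 1, 4 * n**2 + 2 * n

def pdasp_ops(n):
    # Worst-case cost of any single distributed block per node.
    return 2 * n**2, n**2

n = 4
print(kalman_ops(n))   # (237, 264)
print(rls_ops(n))      # (89, 72)
print(pdasp_ops(n))    # (32, 16)
```

Even at this small order, the per-node PDASP cost is well under half the sequential RLS cost, and the gap widens with n because the Kalman counts grow cubically.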

Simulation Results
In this section, Monte Carlo simulations with binary phase shift keying (BPSK) are performed on a 4 × 4 MIMO communication system to validate our proposed PDASP architecture. The forgetting factor λ is set to 0.98 for both the proposed PDASP and sequential RLS algorithms.
The proposed parallel technique implemented on MIMO RLS is compared with the sequential MIMO RLS adaptive algorithm and the Kalman filter in terms of computational complexity, mean square error (MSE), and processing time with nonaligned time indexes. The parallel technique provides much lower per-node computational complexity than the sequential Kalman and MIMO RLS algorithms. Figure 4 presents the multiplication and addition complexity comparison of the proposed PDASP technique against the sequentially operated Kalman and MIMO RLS algorithms. It is observed that the proposed PDASP technique using nonaligned time indexes provides much lower multiplication and addition complexity per node than the sequential Kalman and MIMO RLS algorithms. Figures 5 and 6 show the MSE performance at the low Doppler rate f_dT = 10⁻⁶ and the high Doppler rate f_dT = 10⁻³, respectively, and Figure 7 shows the MSE difference between the proposed PDASP MIMO RLS and the sequentially operated algorithms. The difference in convergence between the proposed PDASP scheme run with nonaligned time indexes and the sequential Kalman and MIMO RLS algorithms is only 20 and 10 iterations, respectively, at the low Doppler spread, and about 30 and 20 iterations, respectively, at the relatively high Doppler spread. Considering the difference in Figure 7, it can be seen that, due to the initialization of the PDASP algorithm parameters, the error difference is small at the start, gradually increases, then decreases, and eventually becomes zero after a certain number of iterations.

The fast Ethernet speed of 125 Mbit/s is taken as the reference peak bit rate for the wired communication. In the 4 × 4 MIMO PDASP scheme, at most a 4 × 4 matrix is transmitted from one machine node to another over the wired link. Each entry of the 4 × 4 MIMO matrix consists of 4 bytes: two significant bytes before the decimal point and two after it. The total number of bits for a 4 × 4 matrix is thus 4 × 16 × 8 = 512 bits, and the corresponding fetch time τ_F is about 4.1 μs. The processing time comparison and the processing time difference between the sequential algorithms and the proposed PDASP technique at τ_F = 4.1 μs are presented in Figures 8 and 9, respectively. It is clear that the proposed PDASP technique provides much lower processing time than the sequentially operated Kalman filter and MIMO RLS algorithm. The percentage improvement in decreased processing time is shown in Tables 1 and 2. At the low Doppler rate, the proposed PDASP MIMO RLS algorithm converges in about 35 iterations, with τ_F = 4.1 μs added at each iteration, yet still utilizes 95.83% and 82.29% less processing time than the sequential Kalman and MIMO RLS algorithms, respectively. Likewise, at the high Doppler rate, the proposed PDASP MIMO RLS takes 50 iterations for its convergence, an increase of 30 and 20 iterations compared to the sequential Kalman and MIMO RLS algorithms, respectively, and still entails 94.12% and 77.28% less processing time.
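The fetch-time arithmetic above can be checked in a few lines: a 4 × 4 matrix of 4-byte entries transferred at the fast Ethernet rate of 125 Mbit/s.

```python
# Bits per inter-node transfer: 16 entries of 4 bytes each.
entries = 4 * 4
bits_per_entry = 4 * 8
total_bits = entries * bits_per_entry
print(total_bits)          # 512

# Fetch time at the fast Ethernet reference rate.
link_rate = 125e6          # bits per second
fetch_time = total_bits / link_rate
print(fetch_time * 1e6)    # 4.096 (microseconds), i.e. about 4.1 us
```

This confirms that the per-iteration fetch overhead is on the order of microseconds, small relative to typical block processing times.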

Conclusions
In this paper, a novel low complexity architecture for the parallely distributed adaptive signal processing (PDASP) operation of inexpensive and computationally incapable small platforms has been proposed. The proposed architecture makes such devices run computationally expensive procedures, like the complex adaptive Kalman and RLS algorithms, cooperatively. The operation of the proposed PDASP architecture has been evaluated in the presence of time nonalignment in the execution of its parallel block entities. The complexity and processing time of the proposed PDASP scheme with the RLS algorithm have been compared with those of the sequentially operated Kalman and RLS algorithms. It has been observed that the PDASP scheme exhibits much lower per-node computational complexity and processing time than the sequentially operated Kalman and RLS algorithms.

Figure 1: Sequential working of conventional RLS algorithm. (a) Sequential working of individual blocks. (b) Processes involved in a single block.

Figure 2: Cooperative parallely operated RLS filtering architecture with four processing nodes.

Figure 3: Proposed PDASP architecture for MIMO RLS algorithm with nonaligned time indexes.

Algorithm 1: PDASP RLS algorithm with nonaligned time indexes when run sequentially.


Figure 7: Mean square error difference between the proposed PDASP scheme and the sequential RLS and Kalman algorithms at f_dT = 10⁻³ and f_dT = 10⁻⁶.

Figure 8: Processing time comparison among sequential algorithms and the proposed PDASP technique.

Table 1: Percentage improvement in decreased processing time of PDASP MIMO RLS with respect to sequential MIMO RLS.

Table 2: Percentage improvement in decreased processing time of PDASP with respect to the sequential linear Kalman filter.
The proposed PDASP technique with nonaligned time indexes entails at most 2n² multiplications and n² additions per iteration per node. The proposed technique utilizes 95.83% and 82.29% less processing time than the sequential Kalman and MIMO RLS algorithms, respectively, at the low Doppler rate, and 94.12% and 77.28% less, respectively, at the high Doppler rate. In a nutshell, the processing time and per-node complexity of the proposed PDASP based MIMO RLS scheme have been observed to be much lower than those of the Kalman and RLS adaptive filtering algorithms operated sequentially on a single unit.