In this paper, a novel processing-efficient architecture is proposed in which a group of inexpensive and computationally incapable small platforms performs parallel distributed adaptive signal processing (PDASP). The proposed architecture runs computationally expensive procedures, such as the complex adaptive recursive least squares (RLS) algorithm, cooperatively. The proposed PDASP architecture operates properly even if perfect time alignment among the participating platforms is not available. An RLS algorithm applied to MIMO channel estimation is deployed on the proposed architecture. The complexity and processing time of the PDASP scheme with the MIMO RLS algorithm are compared with those of the sequentially operated MIMO RLS algorithm and the linear Kalman filter. It is observed that the PDASP scheme exhibits much lower computational complexity per parallel block than the sequential MIMO RLS algorithm and the Kalman filter. Moreover, for a low Doppler rate, the proposed architecture reduces the processing time by 95.83% and 82.29% compared to the sequentially operated Kalman filter and MIMO RLS algorithm, respectively. Likewise, for a high Doppler rate, the proposed architecture reduces the processing time by 94.12% and 77.28% compared to the Kalman and RLS algorithms, respectively.
1. Introduction
Adaptive filtering techniques play a very important role in emerging fields of science and technology [1–3], and the last two decades have witnessed extensive research aimed at improving their convergence and reducing their complexity [4–6]. However, achieving fast convergence on an energy-constrained and computationally incapable platform remains out of reach despite major advances in integrated circuit (IC) technologies. For instance, echo cancellation in video conferencing requires a high-definition adaptive filtering algorithm to achieve robust convergence while tracking the time-varying uncertainties present in the communication link. Such a high-definition adaptive algorithm cannot be run on an energy-constrained, computationally incapable, inexpensive platform. Significant efforts have been made in the literature to propose low-complexity and distributed solutions to this problem.
In this paper, our objective is to provide a novel low-complexity solution for inexpensive and computationally incapable platforms by proposing their parallel distributed adaptive signal processing (PDASP) operation, which lets them run computationally expensive procedures cooperatively.
The implementation of the proposed PDASP technique using recursive least squares (RLS) lets the inexpensive and computationally incapable platforms work in parallel even with nonaligned time indexes, and it requires much less processing time than the sequential Kalman [24] and RLS [25, 26] algorithms. RLS is a special case of the Kalman filter and is one of the most popular filters in the adaptive filtering domain. It offers a superior convergence rate, especially in time-varying environments, at the price of a computational cost that grows nonlinearly with the filter order.
The idea behind distributed signal processing is based on parallel processing and the “divide and conquer” paradigm. In this paradigm, an algorithm is divided into subalgorithms, which are then passed to other processing nodes to provide an efficient low-complexity solution.
The rest of the paper is organized in the following manner. Section 2 describes the system model. Section 3 presents the proposed PDASP scheme for computationally incapable platforms. Complexity analysis is introduced in Section 4. In Section 5, simulation based results are presented and Section 6 draws the conclusions.
2. System Model
In this section, we discuss the working procedure of the proposed parallel recursive least squares (RLS) filter in light of its conventional sequential operation.
In the conventional RLS adaptive algorithm and its variants, all filter subparts are interdependent and operate sequentially. Before introducing the proposed parallel RLS operation over individual platforms with different clock systems, we define some timing variables, illustrated in Figure 1, where a single iteration of an RLS algorithm consists of N sequential blocks.
Sequential working of conventional RLS algorithm. (a) Sequential working of individual blocks. (b) Processes involved in a single block.
(i) Computational Time $T_c$. This is the time taken by the processor for a single computation. It is determined by the speed of the processor.
(ii) Block Processing Time $T_b$. This is the processing time of one block of the algorithm. It depends on the number of computations involved in the block and is thus a multiple of $T_c$.
(iii) Fetch Time $T_f$. This is the time in which one block fetches information from another block, usually its predecessor.
(iv) Algorithm Step Time $T_s$. This is the processing time of a complete iteration of the algorithm.
If RLS filtering is operated on a single computationally capable platform, all algorithm blocks are executed sequentially as shown in Figure 1, with fetch time $T_f \to 0$. However, if the same RLS filtering is operated on a group of computationally incapable platforms using the proposed PDASP architecture, different algorithm blocks are executed in parallel on individual platforms, with fetch times that vary depending on the media connecting the nodes, as shown in Figure 2.
The only possible way to operate the RLS algorithm in parallel on individual platforms with different clock systems is to allow the time indexes to be nonaligned. When setting the time nonalignment, two things must be taken into account. First, the filter must not show any uncertain behavior in the application in which it is deployed. Second, all the filter subparts must be able to work in parallel with fetch times that are favorable relative to the block processing times. In this way, the sequential structure is able to work in parallel even with nonaligned time indexes. In Figure 2, the cooperative parallel RLS filtering architecture consists of four processing nodes, namely, M1, M2, M3, and M4. The processing nodes M1 and M4 are interlinked with M2 and M3, respectively, while also being connected to themselves. Likewise, M2 is interconnected with M1 and M4, and M3 is linked only with M4. All the processing nodes first share information with one another and then carry out the desired process. The processing time of each block differs from that of the others and is known to all nodes; therefore, every node that completes its processing task earlier than the others waits until the block with the maximum processing time has finished. In this way, the inexpensive and computationally incapable platforms work in parallel toward a combined goal.
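The waiting rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the block times assigned to M1 to M4 and the fetch time are hypothetical values in microseconds, the function names are introduced here for illustration, and one fetch per algorithm step is assumed.

```python
def sequential_step_time(block_times):
    """Step time when all blocks run one after another on one platform (T_f -> 0)."""
    return sum(block_times.values())


def parallel_step_time(block_times, fetch_time):
    """Step time when blocks run on separate nodes: longest block plus one fetch
    (the single fetch per step is an assumption of this sketch)."""
    return max(block_times.values()) + fetch_time


def idle_times(block_times):
    """Time each node waits for the node with the longest block to finish."""
    longest = max(block_times.values())
    return {node: longest - t for node, t in block_times.items()}


# Hypothetical block processing times (microseconds) for the four nodes.
block_times_us = {"M1": 8.0, "M2": 3.0, "M3": 2.0, "M4": 5.0}
fetch_time_us = 4.0  # hypothetical fetch time between nodes

print(sequential_step_time(block_times_us))               # 18.0
print(parallel_step_time(block_times_us, fetch_time_us))  # 12.0
print(idle_times(block_times_us))
```

With these illustrative numbers, M1 holds the longest block, so M2, M3, and M4 idle for 5, 6, and 3 microseconds, respectively, before the next exchange of information.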
3. Proposed PDASP Technique for RLS with Nonaligned Time Indexes
In adaptive filtering, all the filter subparts are interdependent. Because the subparts are cascaded, the algorithm accumulates the processing time of every subpart while converging under uncertain channel conditions. With the PDASP technique, the RLS algorithm runs in parallel even with nonaligned time indexes, which lowers the processing time at each machine or processing node. The flow diagram of the PDASP technique using RLS is shown in Figure 3. The notation $T_X$ represents the time used in processing $X$, whose computation is done inside the indicated block. Let the processing times taken by the error covariance matrix $\Phi_k$, Kalman gain $g_k$, received signal estimate $r_k$, estimation error $e_k$, and updated filter coefficient matrix $\hat{H}_k$ be $T_\Phi$, $T_g$, $T_r$, $T_e$, and $T_{\hat{H}}$, respectively.
Proposed PDASP architecture for MIMO RLS algorithm with nonaligned time indexes.
Therefore, the total time taken by the whole algorithm when it runs in cascaded fashion is
$$T_\Phi + T_g + T_r + T_e + T_{\hat{H}} = T_{\mathrm{tot}}. \quad (1)$$
The maximum processing time among $T_\Phi, \ldots, T_{\hat{H}}$ is $T_\Phi$ because $\Phi_k$ takes more multiplications than any of $g_k$, $r_k$, $e_k$, and $\hat{H}_k$. To operate the RLS algorithm in parallel, with its blocks distributed over individual nodes with nonaligned time indexes, the strict and sufficient conditions with respect to fast convergence rate, in terms of multiplication and addition computations, can thus be written as
$$T_g, T_r, T_e, T_{\hat{H}} \le T_\Phi, \qquad T_\Phi + T_f \ll T_{\mathrm{tot}}. \quad (2)$$
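As a rough illustration, conditions (2) can be checked numerically for a given set of block times. The sketch below is not part of the original work: the function name, the sample timings, and in particular the `margin` parameter used to give the "much less than" relation a concrete threshold are all assumptions made for this example.

```python
def check_conditions(t_phi, t_g, t_r, t_e, t_h, t_f, margin=0.5):
    """Evaluate the two conditions in (2).

    margin is a hypothetical threshold: the parallel step time T_phi + T_f is
    treated as 'much less than' T_tot when it is at most margin * T_tot.
    """
    t_tot = t_phi + t_g + t_r + t_e + t_h            # total sequential time, Eq. (1)
    phi_is_longest = max(t_g, t_r, t_e, t_h) <= t_phi
    parallel_is_faster = (t_phi + t_f) <= margin * t_tot
    return phi_is_longest, parallel_is_faster


# Hypothetical timings in microseconds.
print(check_conditions(8.0, 3.0, 2.0, 2.0, 3.0, t_f=0.5))   # (True, True)
print(check_conditions(8.0, 3.0, 2.0, 2.0, 3.0, t_f=20.0))  # (True, False)
```

The second call shows how a slow link can violate the second condition even when the covariance block remains the longest one.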
Due to nonaligned time indexes, the mismatch $\zeta$ between the aligned and nonaligned time indexes can be written as
$$\zeta = e_{\mathrm{Seq}} - e_{\mathrm{NA}}, \quad (3)$$
where $e_{\mathrm{Seq}}$ is the error of the sequential algorithm and $e_{\mathrm{NA}}$ is the error of the PDASP algorithm with nonaligned time indexes. The proposed architecture can be run in sequential format for convergence calibration. The sequential implementation of the PDASP RLS algorithm with nonaligned time indexes is nearly the same as that of a conventional RLS algorithm run on a single machine. The steps of this sequential format are shown in Algorithm 1.
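Algorithm 1: PDASP RLS algorithm with nonaligned time indexes when it runs sequentially.
Initialize: $\lambda, \hat{H}_{k+1}, a_k, \hat{r}_{k+1}, \Phi_{k+1}, \Phi_k, e_k, g_k$
$\hat{r}_{k+2}^{T} = s_{k+2}^{T} \hat{H}_{k+1}$
$e_{k+1} = r_{k+1} - \hat{r}_{k+1}$
$a_{k+1}^{T} = \lambda^{-1} s_{k+1}^{T} \Phi_{k+1}$
$g_{k+1} = \dfrac{a_{k+1}}{a_{k+1}^{T} s_{k+1} + 1}$
$\Phi_{k+2} = \lambda^{-1} \Phi_{k} - g_{k} a_{k}^{T}$
$\hat{H}_{k+2} = \hat{H}_{k+1} + e_{k} g_{k}^{T}$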
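For reference, the conventional time-aligned RLS recursion on which Algorithm 1 is based can be sketched in plain Python. This is an illustrative sketch only and does not reproduce the nonaligned indexing of Algorithm 1. The channel matrix `H_TRUE`, the deterministic ±1 training vectors, the noise-free observations, and the initial value `phi0` are all hypothetical choices. The coefficient update is written as $g e^{T}$ so that the dimensions agree with $\hat{r}^{T} = s^{T}\hat{H}$; that orientation is an assumption of this sketch.

```python
def rls_step(H, Phi, s, r, lam):
    """One conventional (time-aligned) RLS iteration for the model r^T = s^T H."""
    n, m = len(s), len(r)
    # a = lam^-1 * Phi * s (Phi is symmetric, so a^T = lam^-1 * s^T * Phi)
    a = [sum(Phi[i][j] * s[j] for j in range(n)) / lam for i in range(n)]
    # Gain vector g = a / (a^T s + 1)
    denom = 1.0 + sum(a[i] * s[i] for i in range(n))
    g = [a_i / denom for a_i in a]
    # A-priori estimate of the received vector and its error
    r_hat = [sum(s[i] * H[i][j] for i in range(n)) for j in range(m)]
    e = [r[j] - r_hat[j] for j in range(m)]
    # Phi <- lam^-1 * Phi - g a^T
    Phi_new = [[Phi[i][j] / lam - g[i] * a[j] for j in range(n)] for i in range(n)]
    # H <- H + g e^T (orientation assumed in this sketch)
    H_new = [[H[i][j] + g[i] * e[j] for j in range(m)] for i in range(n)]
    return H_new, Phi_new, e


# Hypothetical 4x4 channel and deterministic +/-1 (BPSK-like) training vectors.
H_TRUE = [[0.9, -0.2, 0.1, 0.3],
          [0.4, 0.7, -0.5, 0.2],
          [-0.3, 0.6, 0.8, -0.1],
          [0.2, -0.4, 0.5, 0.6]]
TRAINING = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]


def estimate_channel(num_iter=40, lam=0.98, phi0=1e4):
    """Run noise-free RLS training and return the estimated channel matrix."""
    n = len(H_TRUE)
    H = [[0.0] * n for _ in range(n)]
    Phi = [[phi0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(num_iter):
        s = TRAINING[k % n]
        r = [sum(s[i] * H_TRUE[i][j] for i in range(n)) for j in range(n)]
        H, Phi, _ = rls_step(H, Phi, s, r, lam)
    return H


H_est = estimate_channel()
max_err = max(abs(H_est[i][j] - H_TRUE[i][j]) for i in range(4) for j in range(4))
print(max_err)
```

In this noise-free setting the largest coefficient error becomes very small after a few training cycles, which gives a baseline against which a nonaligned variant could be calibrated.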
4. Complexity Analysis
The linear Kalman filter requires $2N^3+6N^2+3N+1$ multiplications and $3N^3+4N^2+2$ additions per iteration, where $N$ represents the filter order. Likewise, the RLS algorithm, which is a special case of the Kalman filter, entails $5N^2+2N+1$ multiplications and $4N^2+2$ additions per iteration. The implementation of the proposed PDASP technique on the RLS algorithm exhibits a much lower computational cost for each parallel distributed block. The proposed PDASP technique with nonaligned time indexes entails at most $2N^2$ multiplications and $N^2$ additions per iteration in each parallel block. The proposed parallel technique thus requires much less processing time than the sequential Kalman and RLS algorithms.
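To make these expressions concrete, the short sketch below evaluates them for a given filter order. The function names are introduced here for illustration, and N = 4 is an assumed example value rather than a figure taken from the analysis above.

```python
def kalman_ops(n):
    """(multiplications, additions) per iteration of the linear Kalman filter."""
    return 2 * n**3 + 6 * n**2 + 3 * n + 1, 3 * n**3 + 4 * n**2 + 2


def rls_ops(n):
    """(multiplications, additions) per iteration of sequential RLS."""
    return 5 * n**2 + 2 * n + 1, 4 * n**2 + 2


def pdasp_ops(n):
    """Maximum (multiplications, additions) per iteration in one PDASP block."""
    return 2 * n**2, n**2


N = 4  # assumed example filter order
for name, ops in (("Kalman", kalman_ops), ("RLS", rls_ops), ("PDASP", pdasp_ops)):
    mults, adds = ops(N)
    print(f"{name}: {mults} multiplications, {adds} additions")
```

For N = 4 this gives 237 and 258 operations for the Kalman filter, 89 and 66 for sequential RLS, and at most 32 and 16 for a single PDASP block.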
5. Simulation Results
In this section, Monte Carlo simulations with binary phase shift keying (BPSK) are performed on a 4×4 MIMO communication system to validate the proposed PDASP architecture. The forgetting factor λ is set to 0.98 for both the proposed PDASP and sequential RLS algorithms.
The proposed parallel technique, implemented on MIMO RLS, is compared with the sequentially operated, nondistributed MIMO RLS adaptive algorithm and Kalman filter in terms of computational complexity, mean square error (MSE), and processing time with nonaligned time indexes. Figure 4 compares the multiplication and addition complexity of the proposed PDASP technique with those of the sequentially operated Kalman and MIMO RLS algorithms. It is observed that the proposed PDASP technique using nonaligned time indexes requires far fewer multiplications and additions per parallel block than the sequential Kalman and MIMO RLS algorithms. Figures 5 and 6 show the MSE performance at a low Doppler rate $f_D T = 10^{-6}$ and a high Doppler rate $f_D T = 10^{-3}$, respectively, and Figure 7 shows the MSE difference between the proposed PDASP MIMO RLS and the sequentially operated algorithms. The proposed PDASP scheme run with nonaligned time indexes converges only 20 and 10 iterations later than the sequential Kalman and MIMO RLS algorithms, respectively, at low Doppler spread, and about 30 and 20 iterations later, respectively, at relatively high Doppler spread. Figure 7 shows that, owing to the initialization of the algorithm parameters of the PDASP technique, the error difference is small at the start, gradually increases, then decreases, and eventually becomes zero after a certain number of iterations.
Computational complexity comparison among 4×4 MIMO sequential algorithms and proposed PDASP technique.
Mean Square Error (MSE) tracking performance versus training length for 4×4 MIMO when $f_D T = 10^{-6}$.
Mean Square Error (MSE) tracking performance versus training length for 4×4 MIMO when $f_D T = 10^{-3}$.
Mean Square Error difference among sequential algorithms and proposed PDASP scheme.
The fast Ethernet speed of 125 Mbits/s is taken as the reference peak bit rate for wired communication. In the 4×4 MIMO PDASP scheme, the largest item transmitted from one machine node to another over the wired link is a 4×4 matrix. Each entry of the 4×4 MIMO matrix occupies 4 bytes, of which two bytes hold the digits before the decimal point and two bytes hold the digits after it. The total number of bits for a 4×4 matrix is therefore 4×16×8=512 bits. The fetch time $T_f$ corresponding to this number of bits is 41 μs. The processing time comparison and the processing time difference at $T_f = 41$ μs between the sequential algorithms and the proposed PDASP technique are presented in Figures 8 and 9, respectively. It is clear that the proposed PDASP technique requires much less processing time than the sequentially operated Kalman filter and MIMO RLS algorithm. The percentage improvement in decreased processing time is shown in Tables 1 and 2. At the low Doppler rate, the proposed PDASP MIMO RLS algorithm converges in about 35 iterations, with $T_f = 41$ μs added at each iteration, but still uses 95.83% and 82.29% less processing time than the sequential Kalman and MIMO RLS algorithms, respectively. Likewise, at the high Doppler rate, the proposed PDASP MIMO RLS takes 50 iterations to converge, which is 30 and 20 iterations more than the sequential Kalman and MIMO RLS algorithms, respectively. The proposed technique still uses 94.12% and 77.28% less processing time than the sequentially operated Kalman and MIMO RLS algorithms, respectively.
Table 1: Percentage improvement in decreased processing time of PDASP MIMO RLS with respect to sequential MIMO RLS.
| Doppler rate | MIMO RLS MSE convergence | MIMO RLS processing time, $P_{\mathrm{RLS}}$ | Proposed PDASP MIMO RLS convergence | Proposed PDASP RLS processing time, $P_{\mathrm{PDASP}}$, at $T_f = 41$ μs | % improvement in decreased processing time, $(P_{\mathrm{RLS}} - P_{\mathrm{PDASP}})/P_{\mathrm{RLS}} \times 100$ |
| --- | --- | --- | --- | --- | --- |
| $f_D T = 10^{-6}$ | 25 iterations | 0.0096373 s | 35 iterations | 0.0016825 s | 82.29% |
| $f_D T = 10^{-3}$ | 30 iterations | 0.0107447 s | 50 iterations | 0.0024410 s | 77.28% |
Table 2: Percentage improvement in decreased processing time of PDASP with respect to sequential linear Kalman.
| Doppler rate | Kalman MSE convergence | Kalman processing time, $P_{\mathrm{Kalman}}$ | Proposed PDASP MIMO RLS convergence | Proposed PDASP RLS processing time, $P_{\mathrm{PDASP}}$, at $T_f = 41$ μs | % improvement in decreased processing time, $(P_{\mathrm{Kalman}} - P_{\mathrm{PDASP}})/P_{\mathrm{Kalman}} \times 100$ |
| --- | --- | --- | --- | --- | --- |
| $f_D T = 10^{-6}$ | 15 iterations | 0.0403 s | 35 iterations | 0.0016825 s | 95.83% |
| $f_D T = 10^{-3}$ | 20 iterations | 0.0415 s | 50 iterations | 0.0024410 s | 94.12% |
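The last column of the Kalman comparison follows directly from the two processing times in each row. The sketch below simply evaluates the stated formula on the tabulated values; the function and variable names are introduced here for illustration only.

```python
def improvement_percent(p_sequential, p_pdasp):
    """Percentage reduction in processing time relative to the sequential algorithm."""
    return (p_sequential - p_pdasp) / p_sequential * 100.0


# Processing times in seconds, taken from the Kalman comparison table.
low_doppler = improvement_percent(0.0403, 0.0016825)
high_doppler = improvement_percent(0.0415, 0.0024410)

print(f"low Doppler:  {low_doppler:.2f}%")   # 95.83%
print(f"high Doppler: {high_doppler:.2f}%")  # 94.12%
```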
Processing time comparison among sequential algorithms and proposed PDASP technique.
Processing time difference for 4×4 MIMO system. (a) Sequential RLS versus proposed PDASP technique. (b) Sequential linear Kalman versus proposed PDASP technique.
6. Conclusions
In this paper, a novel low-complexity architecture for the parallel distributed adaptive signal processing (PDASP) operation of inexpensive and computationally incapable small platforms has been proposed. The proposed architecture lets such devices cooperatively run computationally expensive procedures such as complex adaptive Kalman and RLS algorithms. The operation of the proposed PDASP architecture has been evaluated in the presence of time nonalignment in the execution of its parallel blocks. The complexity and processing time of the proposed PDASP scheme with the RLS algorithm have been compared with those of the sequentially operated Kalman and RLS algorithms. It has been observed that the PDASP scheme exhibits much lower computational complexity per parallel block and much lower processing time than the sequentially operated Kalman and RLS algorithms. The proposed PDASP technique with nonaligned time indexes requires at most $2N^2$ multiplications and $N^2$ additions per iteration in each parallel block. For a low Doppler rate, the proposed technique uses 95.83% and 82.29% less processing time than the sequential Kalman and MIMO RLS algorithms, respectively. Likewise, for a high Doppler rate, it uses 94.12% and 77.28% less processing time than the sequentially operated Kalman and MIMO RLS algorithms, respectively. In a nutshell, the processing time and parallel complexity of the proposed PDASP-based MIMO RLS scheme have been observed to be much lower than those of the Kalman and RLS adaptive filtering algorithms operated sequentially on a single unit.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
[1] Pelekanakis K., Chitre M. Robust equalization of mobile underwater acoustic channels.
[2] Zhang B., Zhang L., Karray F., Zhang L. Retinal vessel extraction by matched filter with first-order derivative of Gaussian.
[3] Bidyadhar S., Pradhan R. An adaptive predictive error filter based maximum power point tracking algorithm for a photovoltaic system.
[4] Schüldt C., Lindstrom F., Claesson I. Low-complexity adaptive filtering implementation for acoustic echo cancellation. Proceedings of the IEEE Region 10 Conference (TENCON '06), November 2006, Hong Kong. doi:10.1109/TENCON.2006.343795.
[5] Soni R. A., Gallivan K. A., Jenkins W. K. Low-complexity data reusing methods in adaptive filtering.
[6] Schüldt C., Lindstrom F., Claesson I. A low-complexity delayless selective subband adaptive filtering algorithm.
[7] Zhang F. Z.
[8] Müller R. R., Cottatellucci L., Vehkaperä M. Blind pilot decontamination.
[9] Eberli S., Cescato D., Fichtner W. Divide-and-conquer matrix inversion for linear MMSE detection in SDR MIMO receivers. Proceedings of the 26th Norchip Conference, November 2008, Tallinn, Estonia, pp. 162–167. doi:10.1109/NORCHP.2008.4738303.
[10] Bhotto M. Z. A., Antoniou A. Robust recursive least-squares adaptive-filtering algorithm for impulsive-noise environments.
[11] Xiao L., Liu S., Yang D. A low-complexity block diagonalization algorithm for MU-MIMO two-way relay systems with complex lattice reduction.
[12] Wubben D., Bohnke R., Kuhn V., Kammeyer K. Near-maximum likelihood detection of MIMO systems using MMSE-based lattice reduction. Proceedings of the IEEE International Conference on Communications, vol. 2, June 2004, pp. 798–802. doi:10.1109/ICC.2004.1312611.
[13] Honig M. L., Xiao W. Performance of reduced-rank linear interference suppression.
[14] Kovačević B., Banjac Z., Kovačević I. K. Robust adaptive filtering using recursive weighted least squares with combined scale and variable forgetting factors.
[15] Moshavi S., Kanterakis E. G., Schilling D. L. Multistage linear receivers for DS-CDMA systems.
[16] Sessler G. M. A., Jondral F. K. Low complexity polynomial expansion multiuser detector for CDMA systems.
[17] Hoydis J., Debbah M., Kobayashi M. Asymptotic moments for interference mitigation in correlated fading channels. Proceedings of the IEEE International Symposium on Information Theory Proceedings (ISIT '11), August 2011, Russia, pp. 2796–2800. doi:10.1109/ISIT.2011.6034083.
[18] Noor A. O. A., Samad S. A., Hussain A. A review of advances in subband adaptive filtering.
[19] Bertrand A., Moonen M. Distributed adaptive node-specific MMSE signal estimation in sensor networks with a tree topology. Proceedings of the 17th European Conf. Signal Process, 2009, Glasgow, UK, pp. 794–798.
[20] Kajan S., Sekaj I., Oravec M. The use of MATLAB parallel computing toolbox for genetic algorithm-based MIMO controller design. Proceedings of the 17th International Conference on Process Control, vol. 9, 2009, Štrbské Pleso, Slovakia, pp. 9–12.
[21] Starkloff E. Designing a parallel, distributed test system. Proceedings of IEEE AUTOTESTCON, Anaheim, Calif, pp. 564–567.
[22] Keckler S. W., Dally W. J., Khailany B., Garland M., Glasco D. GPUs and the future of parallel computing.
[23] Che S., Boyer M., Meng J., Tarjan D., Sheaffer J. W., Skadron K. A performance study of general-purpose applications on graphics processors using CUDA.
[24] Sayed A. H., Kailath T. A state-space approach to adaptive RLS filtering.
[25] Haykin S.
[26] Benesty J., Huang Y.