Low Power Digital Multimedia Telecommunication Designs

The increasing prominence of wireless multimedia systems and the need to limit power consumption in very-high-density VLSI chips have led to rapid and innovative developments in low-power design. Power reduction has emerged as a significant constraint in VLSI design, and wireless multimedia systems consume much more power than traditional portable applications. This paper presents possible optimization techniques to reduce the energy consumption of wireless multimedia communication systems. Four topics are presented for wireless communication systems: PN acquisition, the parallel correlator, the matched filter and channel coding. Two topics are presented for multimedia applications: the IDCT and motion estimation. These topics consider algorithms and architectures for low-power design, such as using a hybrid architecture in PN acquisition, analyzing the algorithm and optimizing the sample storage in the parallel correlator, using a complex matched filter built from analog operational circuits controlled by digital signals, adopting bit-serial arithmetic for the ACS operation in the Viterbi decoder, using a CRC to adaptively terminate the SOVA iterations in the turbo decoder, using hardware/software codesign in the RS codec, disabling the processing elements as soon as the distortion values become greater than the minimum distortion value in motion estimation, and exploiting the relative occurrence of zero-valued DCT coefficients in the IDCT.


INTRODUCTION
In recent years, the need for personal mobile communications, that is, "anytime, anywhere" access to data and communication services, has become increasingly clear. In addition, portable multimedia systems are expected to be used more frequently and for longer durations [1]. This trend has been accelerated by the growing popularity of portable multimedia systems [2].
With the advent of Personal Communications Service (PCS) and Personal Digital Assistant (PDA) devices, the future trend is to run multimedia applications on those wireless devices. The need for high-speed data and video transmission will lead to much higher power consumption than in traditional portable applications [3]. Because of the limited power-supply capability of current battery technology, we face a dilemma with two design problems to conquer. One is to explore high-performance designs and implementations that can meet the stringent speed constraints of real-time multimedia applications. The other is to pursue low-power design so as to prolong the operating time of wireless multimedia systems (PCS/PDA devices) [3].
In this paper, we review the design of low-power wireless multimedia systems. The paper is organized as follows. In Section 2, we introduce four low-power design techniques for wireless communication: PN acquisition, the parallel correlator, the matched filter, and channel coding, which corrects errors in the data received at the receiver. Section 3 introduces the design of an IDCT macrocell and of motion estimation suited for mobile multimedia systems. In Section 4 we draw a conclusion.

WIRELESS COMMUNICATION
We discuss (1) PN acquisition, (2) the correlator, (3) the matched filter and (4) channel coding in Code Division Multiple Access (CDMA) communication systems, which are being widely deployed today.
(1) In spread-spectrum systems, the receiver must synchronize to the transmitted PN (Pseudo-Noise) code and despread the received signal into the original symbols by calculating the correlation of the input data with the PN sequence [6].
(2) The correlator is used for cell synchronization, timing recovery, data recovery and channel estimation in a Direct-Sequence Code Division Multiple Access (DS-CDMA) receiver [23]. The synchronization process determines the location of a periodically occurring marker (say N bits long, where N may be as large as 256) transmitted by all transmitters (base stations) in the system.
(3) In CDMA communications, matched filters and sliding correlators are usually employed to despread the spread signal. Generally speaking, a sliding correlator is suitable for narrow-band CDMA systems because it takes a long time to carry out a complete correlation calculation for a long PN sequence. A matched filter, however, is more attractive in W-CDMA (Wideband Code Division Multiple Access) communication, which offers high flexibility, spectral efficiency and advanced data services up to 2 Mbps, because it takes just a single chip time to perform the correlation calculation.
(4) Power-efficient system design requires attention to the implementation of algorithms and functions as well as proper selection of system-level components such as error-control coding and modulation schemes. We must consider which combinations of components give optimum performance. For error-control codes, we concentrate on the decoders because these consume much more power than the encoders. The tradeoffs between decoder performance and power consumption for several common codes are compared in [4,5,21], which show how an example communication system can be optimized with respect to energy consumption. For a given error-control code, tremendous savings in power consumption can be attained through both algorithm reformulation and architectural innovations specifically targeted at energy conservation.
We revisit these power-hungry components in the following subsections.

PN Acquisition
To compute the autocorrelation function, either matched filters or serial correlators can be used. The matched filters compute values of the autocorrelation function after each chip duration, while serial correlators produce a value after each period of a data bit. Since a chip duration is much shorter than a data bit duration, the matched filters compute the autocorrelation values much faster than serial correlators. Consequently, the acquisition time, that is, the amount of time it takes to find the autocorrelation peak and thus the alignment of the PN sequences, is much shorter for a matched-filter design than for a serial design. However, a matched-filter design requires significantly higher complexity and power than a serial correlator. In the serial-correlator approach, each value of the autocorrelation function is computed at the data rate, which is much slower than the chipping rate at which the matched filters operate.
Because PN acquisition must process the spread-spectrum signal at a speed much faster than the transmitted data rate, its energy consumption can become significant and should be minimized for portable applications. Typically, either matched filters or serial correlators are used to acquire the PN code timing. We describe a hybrid PN acquisition architecture which employs both matched filters and serial correlators to achieve lower energy consumption and a fast acquisition time compared with the traditional approaches of using either matched filters or serial correlators alone [6]. The result shows a factor-of-four reduction in energy for the hybrid scheme compared with the matched-filter architecture and a factor-of-two reduction compared with the serial architecture. Figure 1 shows the hybrid PN acquisition architecture. For better performance, a double-dwell scheme is used.
The key to low energy dissipation in the hybrid architecture is to use the higher-power matched filters during the first dwell, so that a fast acquisition time can be maintained, and the low-power serial correlators during the second dwell. In contrast, in the matched-filter-only approach, the filters are also used for the second dwell. Since the second dwell averages the autocorrelation value at a single code phase over many bits, the matched filters then dissipate more energy than necessary.
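The double-dwell idea can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the architecture of [6]: the PN code is a random ±1 sequence rather than an LFSR sequence, the channel adds Gaussian noise, and the function names and the threshold value are hypothetical.

```python
import random

def make_pn(length, seed=1):
    """Random +/-1 chip sequence standing in for a PN code (assumption)."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(length)]

def received(pn, phase, periods, noise, seed=2):
    """PN code delayed by `phase` chips, repeated, plus Gaussian noise."""
    rng = random.Random(seed)
    n = len(pn)
    return [pn[(k + phase) % n] + rng.gauss(0.0, noise)
            for k in range(periods * n)]

def first_dwell(rx, pn):
    """Matched-filter stage: test every code phase on a single period."""
    n = len(pn)
    scores = [sum(rx[k] * pn[(k + p) % n] for k in range(n))
              for p in range(n)]
    return max(range(n), key=scores.__getitem__)

def second_dwell(rx, pn, phase, periods, threshold):
    """Serial-correlator stage: average ONE phase over many periods."""
    n = len(pn)
    total = sum(rx[k] * pn[(k + phase) % n] for k in range(periods * n))
    return total / (periods * n) >= threshold

def acquire(rx, pn, periods, threshold=0.5):
    """Return the verified code phase, or None if verification fails."""
    candidate = first_dwell(rx, pn)
    if second_dwell(rx, pn, candidate, periods, threshold):
        return candidate
    return None
```

In this model the expensive all-phase search runs on one period only, while the long averaging is done at a single phase, which is where a serial correlator is sufficient.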

Parallel Correlator
Direct Sequence Spread Spectrum (DSSS) transmissions require a despreading stage within the standard receiver block to recover the spread-spectrum signal. For long spread-spectrum codes, the correlation block can be a large portion of the receiver size, and hence a considerable portion of the power consumption. In [7], a correlator structure for detecting a periodically repeating marker sequence in W-CDMA systems is suggested.
We describe two power-reduction alternatives for a parallel spread-spectrum correlator, obtained by analyzing the algorithm and designing a baseline correlator: one streamlines the arithmetic operations, and the other optimizes the sample storage.
A commonly used correlator structure is shift-register storage [8]. Given shift-register storage, there are potential power savings that can be realized in the adder tree by looking at the data statistics. As the data is shifted by one position, the previous coefficient and the new coefficient remain the same for half the number of samples (in runs of length 2 or greater). In order to capture this behavior we define "bypass bits" (see Fig. 2), which tell the adder stages that a term is not changing and thus makes zero contribution to the difference between the present and next correlation sums.
By storing the previous sum and identifying the factors that are changing, we can streamline the arithmetic operation to reduce the number of terms. The overall number of adders cannot be reduced, because different codes change the locations of the inactive adders, but we can shut down the unused adders [8] and so limit their power consumption. The equation for the correlator can be rewritten as Eq. (1) to express the new computation. If the coefficient for a sample has not changed from the previous calculation, then $h^*_t$ is 0 in (1); otherwise $h^*_t$ reflects the new polarity (+1 or −1). When the coefficient changes, the original sample value must be removed from the sum, and then the sample with the new polarity must be added. In order to take advantage of the new method of calculating the correlation sum, a specialized adder cell was developed, as shown in Figure 3. In the case where a coefficient has not changed as a sample is shifted, its contribution to the overall sum should be zero.
When a term is bypassed, the adder can be configured to ignore its value (using cs) and pass only the other input as the result (using ca or cb).
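The differential update described above can be checked with a small software model. The sketch below is an assumption-laden illustration, not the circuit of [8]: coefficients are ±1, `reg[0]` holds the newest sample, and the names `bypass_bits` and `shift_and_update` are hypothetical. A changed coefficient contributes twice the sample with its new polarity (remove the old term, add the new one); a bypassed tap contributes nothing.

```python
def bypass_bits(h):
    """bypass[k] is True when tap k has the same coefficient as tap k-1,
    so a sample moving from tap k-1 to tap k leaves the sum unchanged."""
    return [False] + [h[k] == h[k - 1] for k in range(1, len(h))]

def direct_sum(h, reg):
    """Baseline correlator: full sum over all taps."""
    return sum(c * s for c, s in zip(h, reg))

def shift_and_update(h, bypass, reg, prev_sum, x_new):
    """Shift in x_new and update the correlation sum differentially.

    Returns (new_register, new_sum, active_terms)."""
    # Newest sample enters tap 0; oldest sample leaves the last tap.
    delta = h[0] * x_new - h[-1] * reg[-1]
    active = 2
    for k in range(1, len(h)):
        if bypass[k]:
            continue  # coefficient unchanged: adder input is bypassed
        # Polarity flipped: remove old contribution and add the new one.
        delta += 2 * h[k] * reg[k - 1]
        active += 1
    return [x_new] + reg[:-1], prev_sum + delta, active
```

Feeding the same sample stream to `direct_sum` and to `shift_and_update` should give identical sums, while `active_terms` stays at two plus the number of polarity changes in the code.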
Another approach to reducing power dissipation is to reduce the activity in the storage area. One way to reduce unnecessary activity on the data lines is to use a register-file (with pointer) implementation instead of the n-bit-wide shift register, as seen in Figure 4. With this scheme, only one register out of the total of $2m - 1$ experiences clock and output transitions for each new sample. Because the global bus feeds every register in the register file, minimizing the transitions on this bus can reduce the overall power consumption considerably. A simple but effective coding technique for reducing power is the Bus Invert method [22]; for a 6-bit bus, it reduces transition activity by approximately 20%.
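A minimal sketch of bus-invert coding is given below to make the idea concrete. It is a simplified model, not the exact scheme of [22]: the invert decision here looks only at the data lines, the bus is assumed to start at zero, and the function names are hypothetical. A word is sent inverted (with an extra invert line set) whenever sending it directly would toggle more than half of the data lines.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(words, width):
    """Return a list of (bus_value, invert_flag) pairs, one per word."""
    mask = (1 << width) - 1
    bus = 0  # assumed initial state of the data lines
    encoded = []
    for word in words:
        if 2 * hamming(bus, word) > width:
            bus, invert = ~word & mask, 1  # send the complement instead
        else:
            bus, invert = word, 0
        encoded.append((bus, invert))
    return encoded

def bus_invert_decode(encoded, width):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [(~bus & mask) if invert else bus for bus, invert in encoded]

def data_line_transitions(values, start=0):
    """Total number of bit toggles when `values` are driven in sequence."""
    total, prev = 0, start
    for v in values:
        total += hamming(prev, v)
        prev = v
    return total
```

With this rule no transfer toggles more than half of the data lines; the achievable average saving depends on the data statistics and on the bus width.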

Matched Filter
We describe the mixed analog-digital LSI implementation of a high-speed, low-power complex matched filter. The basic idea behind this complex matched filter is that the massive, high-speed despreading operation on the QPSK-modulated complex spread signals is carried out directly in the analog domain, using low-power analog operational circuits controlled by digital signals. Compared to pure digital matched filters, two high-speed Analog-to-Digital (A/D) converters operating at the chip rate are omitted, and the total power dissipation is greatly reduced [9].
We address the despreading algorithm of the complex matched filter for a QPSK-modulated spread signal. A simplified CDMA receiver is shown in Figure 5. At the receiver side, after RF demodulation and quadrature demodulation, the spread baseband complex signal $R(t) = R_I(t) + jR_Q(t)$ is separated into an in-phase signal $R_I(t)$ and a quadrature signal $R_Q(t)$, which are then despread by a matched-filter bank combined with two A/D converters operating at the chip rate. The despread signal $D(t) = D_I(t) + jD_Q(t)$ coming from the matched-filter section is then fed to a RAKE combiner in order to compensate for the fading or multipath effects that arise in the radio channel. Now let us address the algorithm of the complex matched filter. Let $N$ and $T_C$ denote the number of taps and the chip duration of the PN codes, respectively. The despreading operation of the complex matched filter can then be formulated in terms of the PN codes $C(i)$.
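To illustrate how a complex despreading operation decomposes into four real-valued basic matched filters (B-MFs), the following sketch may help. It assumes, as is common but not confirmed by the text here, that despreading multiplies the received chips by the complex conjugate of a complex PN code $C(i) = C_I(i) + jC_Q(i)$ with ±1 components; the function names are hypothetical.

```python
def basic_mf(samples, code):
    """One real-valued basic matched filter: code-weighted sum of samples."""
    return sum(s * c for s, c in zip(samples, code))

def complex_mf(r_i, r_q, c_i, c_q):
    """Complex despreading D = sum(R * conj(C)) built from four B-MFs.

    r_i, r_q: in-phase and quadrature samples of the last N chips.
    c_i, c_q: in-phase and quadrature PN code chips (+1 or -1)."""
    d_i = basic_mf(r_i, c_i) + basic_mf(r_q, c_q)
    d_q = basic_mf(r_q, c_i) - basic_mf(r_i, c_q)
    return d_i, d_q
```

Because the code chips are ±1, each B-MF needs only sign selection and addition, which is what makes an analog adder-based realization attractive.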
Based on the algorithmic structure shown in Figure 6, a complex matched filter corresponding to 128-tap PN codes can be realized by the system architecture shown in Figure 7. The four B-MFs shown in Figure 6 are configured by using two ... For each SH unit, the sampled value is held until the next sampling pulse arrives, and the held value is multiplied by the corresponding bit of the PN code. Since the PN-code bit only takes the value "1" (corresponding to +1) or "0" (corresponding to −1), the multiplication can be realized simply by a MUX which outputs the input analog signal when the PN-code bit is "0". The output signal of each MUX is simultaneously connected to the analog parallel adder banks at the plus side and at ...

Channel Coding
The purpose of Forward Error Correction (FEC) is to improve the capacity of a channel by adding some carefully designed redundant information to the data being transmitted through the channel.

Bit-serial Viterbi Decoder
Viterbi decoders employed in digital wireless communications are complex and dissipate high power, so the low-power design of Viterbi decoders has been an important issue for mobile and portable applications. The power dissipation of two different implementations of the Viterbi decoder (the register-exchange approach with less switching activity, and the traceback approach with a shift-update scheme) is investigated in [10].
This subsection presents a low-power bit-serial Viterbi decoder chip [11] with coding rate r = 1/3 and constraint length K = 9 (256 states). It is a low-power Viterbi decoder design with a large number of states, targeted at next-generation wireless applications. We discuss the bit-serial arithmetic adopted for the ACS (Add-Compare-Select) operations. The bit-serial approach makes it possible to obtain an area- and power-efficient ACS architecture. The traceback technique is very power efficient due to the use of application-specific memories.
The architecture of the bit-serial ACS unit is depicted in Figure 8. Each ACS unit has three full adders. Two of them are used to add the state metrics and the branch metrics, and the third is used to compare the two new state metrics. After the most significant bit has been processed, a decision bit is stored in a register, and a new state metric is selected from the two candidates stored in two sets of FIFOs, depending on the decision bit.
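The bit-serial ACS operation can be modelled in a few lines. The sketch below is a behavioural illustration only, under assumptions not taken from [11]: metrics are unsigned, processed LSB first, with a fixed word length and no overflow handling or metric normalization. It uses three full-adder streams, two for the additions and one for the comparison, and the decision is known only after the most significant bit.

```python
def to_bits(value, width):
    """LSB-first bit list of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(b << i for i, b in enumerate(bits))

def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def serial_add(a_bits, b_bits):
    """Add two LSB-first words one bit per step (final carry dropped)."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out

def serial_less_than(a_bits, b_bits):
    """a < b via serial subtraction a + ~b + 1; no final carry means a < b."""
    carry = 1
    for a, b in zip(a_bits, b_bits):
        _, carry = full_adder(a, b ^ 1, carry)
    return carry == 0

def acs(sm0, bm0, sm1, bm1, width=8):
    """Add-compare-select: returns (surviving metric, decision bit)."""
    cand0 = serial_add(to_bits(sm0, width), to_bits(bm0, width))
    cand1 = serial_add(to_bits(sm1, width), to_bits(bm1, width))
    decision = 1 if serial_less_than(cand1, cand0) else 0
    return from_bits(cand1 if decision else cand0), decision
```

For example, `acs(10, 3, 7, 9)` compares the candidates 13 and 16 and keeps the first path, while `acs(10, 5, 7, 2)` compares 15 and 9 and keeps the second.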
The traceback strategy is depicted in Figure 9, where W, T and D represent "WRITE", "TRACE BACK" and "DECODE", respectively. During a "WRITE" operation, 256 decision bits are written to the survivor path memories simultaneously. Both "TRACE BACK" and "DECODE" are read operations, but only "DECODE" gives decoded outputs. After 48 "TRACE BACK" operations, 24 decoded bits are obtained consecutively.

Turbo Code Decoder
One of the problems in decoding a turbo code at the receiver is the complexity and high power consumption, since multiple iterations of the Soft Output Viterbi Algorithm (SOVA) have to be carried out to decode a data frame. An approach that uses Cyclic Redundancy Checking (CRC) to adaptively terminate the SOVA iterations of each frame is presented in [12]. This results in a system with a variable workload, in which the amount of computation required differs from one data frame to the next. Dynamic voltage scaling is then used to further reduce the power consumption. A turbo code consists of multiple RSC (Recursive Systematic Convolutional) component codes, and the encoder creates a powerful code by feeding randomly shuffled (interleaved) versions of the same information sequence to the encoders of these component codes. Using the component code structures, the turbo decoder first estimates the likelihood ratios of the information bits and then iteratively revises these estimates. For a turbo code with two component codes, one turbo decoding iteration corresponds to two SOVAs. It follows that the complexity of one iteration corresponds to that of four classical Viterbi decoders. This significant increase in complexity also implies that the power consumption of the decoder goes up considerably, which has a severe effect on many applications, especially portable ones.
We introduce a turbo decoding algorithm using adaptive iteration. The basic idea is to employ a CRC scheme in each decoding iteration so that no redundant iteration is performed once the frame is correctly decoded. Let $S_{max}$ denote the maximum number of decoding iterations that can be used. The proposed adaptive iteration decoding algorithm consists of steps (a) to (d), which are listed with the figure captions at the end of the paper, and is shown in Figure 10. The $x_k$ are the received systematic bits and the $x'_k$ are the interleaved version of $x_k$. The $y_{k,1}$ and $y_{k,2}$ are the received coded bits after the first and second constituent recursive systematic encoders, respectively. $L_E(u_k)$ is the extrinsic information after the SOVA, and $\{u_k\}$ is the non-interleaved survivor sequence after passing through a SOVA. The CRC is performed right after each SOVA.
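The control flow of CRC-terminated iteration can be sketched as below. This is a hedged illustration rather than the decoder of [12]: the SOVA pass is replaced by a caller-supplied function `decode_iteration` (hypothetical), the reference CRC is passed in separately instead of being carried inside the frame, and CRC-32 from Python's `zlib` stands in for whatever CRC polynomial a real system would use.

```python
import zlib

def crc_ok(bits, expected_crc):
    """True when the CRC-32 of the decoded bits matches the expected one."""
    return zlib.crc32(bytes(bits)) == expected_crc

def adaptive_decode(decode_iteration, frame, expected_crc, s_max):
    """Iterate until the CRC passes or s_max iterations have been used.

    decode_iteration(frame, state) -> (bits, state) is a hypothetical
    stand-in for one decoding iteration; `state` carries whatever the
    decoder needs between iterations (None on the first call).
    Returns (decoded_bits, iterations_used)."""
    state = None
    s = 1                                   # (a) initialize S to 1
    while True:
        bits, state = decode_iteration(frame, state)   # (b) decode
        # (c) CRC check; (d) stop if the frame is clean or S == S_max
        if crc_ok(bits, expected_crc) or s >= s_max:
            return bits, s
        s += 1
```

Frames received over a good channel then finish after one or two iterations, and the unused iterations are the slack that dynamic voltage scaling can exploit.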

Reed-Solomon Codec
Reed-Solomon (RS) coders are used for error-control coding in many applications such as digital audio, digital TV, software radio, CD players, and wireless and satellite communications. This subsection presents hardware/software codesign of low-energy programmable RS codecs [13]. These universal RS(n, k) codecs are implemented as a combination of hardware and software in an application-specific DSP processor, with a specially designed programmable finite-field datapath and dedicated, optimized software to reduce the total energy consumption. With current scaled technologies, many DSP algorithms based on binary two's complement arithmetic can be realized using a domain-specific programmable digital signal processor (DS-PDSP) optimized for the targeted application.
If finite-field arithmetic is implemented in a programmable DSP datapath, universal RS codecs (and other finite-field-based systems) can easily be implemented in software. A low-energy architecture for vector-vector (or vector-matrix) multiplication, one of the most frequently used DSP operations, can lead to a 70% energy reduction compared with the straightforward multiplication datapath containing one parallel multiplier.
Universal RS decoders are coded using this decoding algorithm as well as other frequency-domain and time-domain decoding schemes, based on the three types of datapath architectures presented in [14]. The performance characteristics of the RS encoder and of these RS decoders on the various datapaths are evaluated and compared.
A DS-PDSP is assumed whose datapath is specialized for finite-field operations, especially multiplications. There are two major operations involved in finite-field multiplication, namely polynomial multiplication and the polynomial modulo operation over GF(2). They can be implemented together in one parallel multiplier, as shown in the parallel datapath architecture in Figure 11(a). Alternatively, the two operations can be implemented using two separate units, a MAC array for polynomial multiplication and a DEGRED array for the polynomial modulo operation, with separate instructions, as illustrated in Figure 11(b). Using the finite-field vector-vector (or vector-matrix) multiplication (the most frequently used operation in finite-field applications) as a benchmark, this datapath architecture can lead to a 70% energy reduction compared with the datapath shown in Figure 11(a).
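The reason a separate MAC and DEGRED unit saves work on vector-vector products is that reduction modulo the field polynomial is linear over GF(2), so it can be deferred until all products have been accumulated. The sketch below illustrates this for GF(2^8); the field polynomial 0x11D and all function names are assumptions chosen for the example, not details taken from [13].

```python
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)

def poly_mul(a, b):
    """Polynomial multiplication over GF(2), without reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def degred(p, m=8, poly=PRIM_POLY):
    """Degree reduction: p modulo the degree-m field polynomial."""
    for bit in range(p.bit_length() - 1, m - 1, -1):
        if p & (1 << bit):
            p ^= poly << (bit - m)
    return p

def gf_mul(a, b):
    """Finite-field multiply as one multiply followed by one reduction."""
    return degred(poly_mul(a, b))

def dot_parallel(u, v):
    """Inner product with a full field multiplication per element."""
    acc = 0
    for a, b in zip(u, v):
        acc ^= gf_mul(a, b)          # N multiplies and N reductions
    return acc

def dot_mac_degred(u, v):
    """Inner product with N unreduced MACs and a single DEGRED."""
    acc = 0
    for a, b in zip(u, v):
        acc ^= poly_mul(a, b)        # accumulate unreduced products
    return degred(acc)               # reduce once at the end
```

Both functions return the same field element; the second performs one reduction per inner product instead of one per element.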
Low-energy programming schemes for the major computations in the RS encoding and decoding algorithms are as follows.
(A) Vector-vector multiplications. The vector-vector multiplications can be performed using N polynomial multiply-accumulate operations (MAC) and 1 polynomial modulo operation (DEGRED). This leads to about 70% energy reduction.
(B) Convolution.
The RS decoders considered are:
RSD1 uses the division-free BM (Berlekamp-Massey) algorithm for finding the error-locator polynomial.
RSD2 uses the original BM algorithm with division for finding the error-locator polynomial.
RSD3 is based on the Modified Euclidean algorithm.
RSD4 corresponds to the transform decoding algorithm, which uses the original frequency-domain BM algorithm and computes the error polynomial using the DFT/IDFT.
RSD5 is a time-domain RS decoding algorithm based on the division-free BM algorithm; the coefficients of the error polynomial are computed using the time-domain counterpart of Forney's algorithm.

Low-energy RS codecs through datapath selection
We can conclude that the RS(255,247) encoder based on the (MAC + DEGRED) datapath consumes about 30.9% of the energy of the encoder based on the non-pipelined parallel multiplier (Mult(0-p)) datapath, and about 42.7% of the energy of the encoder based on the one-level pipelined parallel multiplier (Mult(1-p)) datapath. The RS decoder based on the (MAC + DEGRED) datapath consumes about 28.8% of the energy of that based on the Mult(0-p) datapath, and about 40% of the energy of that based on the Mult(1-p) datapath. As a result, the energy-latency product of the RSD1 decoder based on (MAC + DEGRED) is only 29.5% of that based on the Mult(0-p) datapath, and 39.6% of that based on the Mult(1-p) datapath.

Low-energy RS codecs through algorithm selection
RS decoding based on RSD4 and RSD5 consumes much more energy than decoding based on the frequency-domain decoding algorithms RSD1, RSD2 and RSD3. Moreover, RS decoding based on RSD4 and RSD5 also has a much greater latency. Therefore, these two are not suitable for low-energy software implementations.
RSD1, RSD2 and RSD3 differ only in the computation of the error-locator polynomial and the error-evaluator polynomial. If the parallel multiplication datapaths (Mult(0-p) or Mult(1-p)) are used, the RSD1 decoder outperforms the RSD2 and RSD3 decoders for small t; as t increases (t ≥ 8), the RSD2 decoder has the best performance. However, if the (MAC + DEGRED) datapath is used, the RS decoder programmed using the RSD1 algorithm always consumes the least energy.
We can conclude that for RS(255,k) codes with the generally used error-correcting range of 2 ≤ t ≤ 16, the RS encoder using the generator-matrix approach and the RS decoder using the proposed RSD1 algorithm, both based on the (MAC + DEGRED) datapath, have the best overall performance in terms of both energy and energy-latency product.

MULTIMEDIA APPLICATION
In video coding, motion estimation has been shown to be very useful in reducing temporal redundancy but requires very high computational complexity.The most commonly used motion estimation architectures are based on the block matching motion estimation algorithm.
The discrete cosine transform (DCT), which can exploit spatial redundancy, has played an important role in video compression standards. A technique for reducing the power dissipation of the DCT by targeting the multiplier section of a DCT processor is presented in [15]. The core of this technique is a multiplication algorithm for low-power implementation of the DCT on CMOS-based signal processing systems. The algorithm reduces power consumption by reducing the effective switched capacitance of the multiplier through effective manipulation of the multiplication process between the cosine and data matrices.
The computational complexity of the inverse DCT (IDCT) means that a hardware solution is adopted in real-time applications.

Motion Estimation
Among the algorithms used for motion estimation by block matching, the full-search block-matching (FS-BM) algorithm uses an exhaustive search to find the candidate block that is closest to the reference block. A low-power FS-BM motion estimation design for H.263 low-bit-rate video coding was proposed in [16]. In that design, the registers named G registers are registers whose clock is gated. To reduce the power consumption in each PE, the design uses gated clock control in the block accumulator.
FS-BM estimation processors typically adopt a systolic array architecture and are responsible for the major part of the power consumption in a video coding system. Reference [17] eliminates unnecessary computations by computing a conservative estimate of the distortion values before computing the exact value. Another method reduces power consumption by disabling the Processing Elements (PEs) as soon as the distortion value becomes greater than the minimum distortion value already computed. Equations (3) and (4) describe the conventional full-search algorithm. The power consumption of the architecture is reduced by blocking new values of $x$ from entering the PEs, which prevents the circuits from switching when $D_i(l,c) = D_{i-1}(l,c)$ in Eq. (5). To reduce power consumption with such pipelined architectures, it is necessary to compute $D_{i-1}(l,c)$ and $D_{\min,t}$ in time to block the computation of $D_i(l,c)$. This problem can be solved by spacing out in time the computation of $D_i(l,c)$ and $D_{i+1}(l,c)$ for a given $(l,c)$, i.e., by computing in sequence the partial distortion values for different candidate blocks.
The absolute differences corresponding to a single row of the reference block are accumulated for a line of candidate blocks in sequence (c loop). The $(p+1)$ elements of the BLOCKING bit vector identify when the computation for any one of those candidate blocks should be disabled. The $(p+1)$ elements of this vector are updated in every iteration of the i loop, and the value of $D_{\min,t}$ is updated after processing a line of candidate blocks. The low-power linear architecture proposed in Figure 12(a) is derived from the single-assignment code presented in Algorithm 1 [18]. Candidate blocks are placed in raster format at the input of the first PE. For each row $i$ of the reference block, the sums of absolute differences corresponding to all $(p+1)$ candidate blocks in a line are computed in consecutive clock cycles. $2 \times n$ clock cycles after $x_{t-1}(l+i, [-p/2])$ appears on the input, the value of $D_i(l, [-p/2])$ is provided by the last PE to one of the final adder inputs. This adder sums $D_i(l, [-p/2])$ with $D_{i-1}(l, [-p/2])$, provided by the output shift register. The value of the sum is then compared with the minimum distortion value found for the AAS, and the comparison result is stored in the blocking register. At the same time, the updated value of $D_i$ is stored in the output shift register. This processing is repeated in the next $p$ clock cycles for the remaining $p$ candidate blocks in line $l$. Every time a pixel of the first column of a candidate block appears at the array input, the blocking shift register provides a disable signal to eliminate the unnecessary computations. Blocking registers are introduced in each PE to prevent the internal AD and adder circuits from switching when the disable signal is asserted, as depicted in Figure 12(b). The disable signal is pipelined through the array to match the pipelining of the distortion computation. A few additional registers (marked with '*') are required to implement the part of the algorithm designed to reduce the power consumption. This architecture fully implements the processing presented in Eq. (3) with no restrictions on the values of $p$ and $n$.
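The effect of disabling the accumulation once the partial distortion reaches the current minimum can be illustrated in software. The sketch below is a sequential model under stated assumptions, not the linear array of Figure 12: candidate positions are given as non-negative offsets into the previous frame, the minimum is updated after every candidate rather than after a line of candidates, and the function names are hypothetical.

```python
def row_sad(ref, prev, l, c, i):
    """Sum of absolute differences for row i of the reference block."""
    n = len(ref)
    return sum(abs(ref[i][j] - prev[l + i][c + j]) for j in range(n))

def full_search(ref, prev, positions, early_stop=True):
    """Full-search block matching with optional early termination.

    ref:       n x n reference block of the current frame.
    prev:      previous frame (list of rows).
    positions: candidate (l, c) offsets into prev.
    Returns (best_position, min_distortion, row_sads_computed)."""
    n = len(ref)
    d_min, best, rows = float("inf"), None, 0
    for (l, c) in positions:
        d = 0
        for i in range(n):
            if early_stop and d >= d_min:
                break  # this candidate can no longer win: block it
            d += row_sad(ref, prev, l, c, i)
            rows += 1
        if d < d_min:
            d_min, best = d, (l, c)
    return best, d_min, rows
```

Both modes find the same motion vector; the difference is the number of row computations, which is a rough software proxy for the switching activity saved in the PEs.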
Processors with linear array architectures require a high working frequency. For the proposed ... The minimum working frequency can be decreased by using multiple linear arrays in order to process the distortion for different blocks in parallel. For example, if two linear arrays are used, the values of the working frequency referred to in the previous paragraph are reduced by half. The control of these two linear arrays is quite independent, assuming that the frame buffer supports two simultaneous accesses to different positions. Another way of decreasing the minimum working frequency is to design low-power 2-D array architectures based on Eq. (3). However, it is more difficult to solve the dependency problem in 2-D structures, because two loops have to be processed in parallel. Consequently, the reduction in power consumption is lower [19].

IDCT
We discuss the design of an IDCT macrocell suited for mobile and highly integrated applications [20]. The strategy for reducing the chip power was twofold.
First, the designers selected an IDCT algorithm that minimizes activity by exploiting the relative occurrence of zero-valued DCT coefficients in compressed video. Previous IDCT implementations have relied on conventional fast IDCT algorithms that perform a constant number of operations per block, independent of the data distribution. Typically, DCT blocks of MPEG-compressed video sequences have only five to six nonzero coefficients, mainly located in the low spatial frequency positions. Based on this statistical distribution of the DCT coefficients, the design departs radically from conventional IDCT algorithms that perform a fixed number of operations per block. Given such input data statistics, direct application of the IDCT equation results in a small average number of operations, since multiplication and accumulation with a zero-valued $X[k]$ coefficient constitutes a "no operation" (NOP). As a result, the data-driven algorithm requires a smaller number of operations per block than the conventional Chen algorithm.
Second, the design minimizes energy through aggressive voltage scaling, using deep pipelining and appropriate circuit techniques, so that the chip can produce 14M samples/sec (640 × 480, 30 fps, 4:2:0) at 1.3 V in a standard 3.3 V process ($V_{TP} = -0.9$ V, $V_{TN} = 0.7$ V) and meet the requirement for MPEG-2 MP@ML (main profile at main level). Details are as follows. The presence of many zero-valued coefficients must be exploited in order to reduce the switching activity and reap the low-power benefits of the selected algorithm. Clock gating in a pipeline can be implemented if each stage uses a separate clock net gated by an appropriate qualifying pulse. The qualifying pulse propagates from stage to stage along with the coefficient; when a zero-valued coefficient enters the pipeline, only the stage that holds the zero is powered down.
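The data-driven idea can be illustrated with a direct 1-D IDCT that skips zero-valued coefficients. This is a simplified sketch, not the macrocell of [20]: it is one-dimensional rather than 8 × 8, it uses floating point and the orthonormal scaling convention, and the function name is hypothetical. The returned counter shows that the number of multiply-accumulate operations is proportional to the number of nonzero coefficients.

```python
import math

def idct_1d_data_driven(coeffs):
    """Direct 1-D inverse DCT that treats zero coefficients as NOPs.

    Returns (samples, mac_count), where mac_count is the number of
    multiply-accumulate operations actually performed."""
    n = len(coeffs)
    out = [0.0] * n
    macs = 0
    for k, xk in enumerate(coeffs):
        if xk == 0:
            continue  # zero-valued X[k]: no operation
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        for i in range(n):
            angle = (2 * i + 1) * k * math.pi / (2 * n)
            out[i] += scale * xk * math.cos(angle)
            macs += 1
    return out, macs
```

For an 8-point block with a single nonzero (DC) coefficient, this performs 8 multiply-accumulates instead of the 64 a dense direct evaluation would need.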
Clock gating raises one problem. A clock-gated pipeline (Fig. 13(a)) is susceptible to race conditions because the clock nets are not nominally equipotential: if $t_1 - t_0 > t_{CLK \to Q} + t_{pd} - t_{hold}$, the wrong data will be sampled at the second stage.
In such cases a negative level-sensitive latch was inserted (Figure 13(b)) to ensure functional correctness with a minimal penalty (< 2%) in power and no effect on the critical path.

CONCLUSION
Battery lifetime will dominate system design issues for wireless multimedia systems in the next generation. Low-power design of these systems requires vertical integration of the design process at all levels, from algorithm development to system architecture to circuit layout. This paper presented possible optimizations to reduce the energy consumption of the wireless multimedia communication systems that will be used in the next generation. We illustrated, with design examples, how some of the issues in low-power wireless multimedia systems have been addressed.

FIGURE 1 Hybrid PN acquisition architecture.

FIGURE 2 Bypass bit generation.
FIGURE 3 Modified adder cell with bypass function.

FIGURE 7 Architecture of the complex matched filter for 128-tap PN-code.

FIGURE 6 Algorithmic structure of a complex matched filter consisting of four basic matched filters (B-MF).

Adaptive iteration decoding algorithm:
(a) Initialize the number of iterations S to 1.
(b) Decode the frame using SOVA.
(c) Check if there is any bit error in the frame by checking the CRC.
(d) If the frame contains bit errors and $S < S_{max}$, increment S by 1 and go back to step (b). Otherwise, stop.

FIGURE 10 Turbo decoder employing SOVA with adaptive iteration algorithm.
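$$D(l,c) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \left| x_t(i,j) - x_{t-1}(l+i,\, c+j) \right| \tag{3}$$

$$(u,v) = (l,c)\,\big|_{D_{\min}}, \qquad -p/2 \le l, c \le p/2 \tag{4}$$

Equation (5) is the method of disabling the PE.

$$D_i(l,c) = \begin{cases} D_{i-1}(l,c) + \sum_{j=0}^{n-1} \left| x_t(i,j) - x_{t-1}(l+i,\, c+j) \right|, & \text{if } D_{i-1}(l,c) < D_{\min,t} \\ D_{i-1}(l,c), & \text{otherwise} \end{cases} \qquad 0 \le i \le n-1$$

$$D(l,c) = D_{n-1}(l,c)$$

$$D_{\min,t} = \min \{ D(l,c) \}, \qquad (l,c) \in \mathrm{AAS} \tag{5}$$

... D_min_t then D_min_t := D(l,c,n−1,n); (u,v) := (l,c)
end{c}
end{l}
Algorithm 1. Single assignment code to derive low-power linear architecture.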

FIGURE 12 Low-power architecture (a) linear array (b) processing element.