Lifting-Based Fractional Wavelet Filter: Energy-Efficient DWT Architecture for Low-Cost Wearable Sensors

This paper proposes and evaluates the LFrWF, a novel lifting-based architecture to compute the discrete wavelet transform (DWT) of images using the fractional wavelet filter (FrWF). In order to reduce the memory requirement of the proposed architecture, only one image line is read into a buffer at a time. Aside from an LFrWF version with multipliers, i.e., the LFrWF_m, we develop a multiplier-less LFrWF version, i.e., the LFrWF_ml, which reduces the critical path delay (CPD) to the delay T_a of an adder. The proposed LFrWF_m and LFrWF_ml architectures are compared in terms of the required adders, multipliers, memory, and critical path delay with state-of-the-art DWT architectures. Moreover, the proposed LFrWF_m and LFrWF_ml architectures, along with the state-of-the-art FrWF architectures (with multipliers (FrWF_m) and without multipliers (FrWF_ml)), are compared through implementation on the same FPGA board. The LFrWF_m requires 22% fewer look-up tables (LUTs), 34% fewer flip-flops (FFs), and 50% fewer compute cycles (CCs) and consumes 65% less energy than the FrWF_m. Also, the proposed LFrWF_ml architecture requires 50% fewer CCs and consumes 43% less energy than the FrWF_ml. Thus, the proposed LFrWF_m and LFrWF_ml architectures appear suitable for computing the DWT of images on wearable sensors.

In many visual applications of wearable sensors and portable imaging devices, images captured by the camera need to be transmitted wirelessly to a body-worn or nearby hub device. The wearable sensors and portable imaging devices have limited resources, and the wireless links have narrow bandwidth [28], making it impossible to directly send the raw (uncompressed) images. Thus, the images need to be compressed before transmission [29], which requires an image coder. In an image coder, an image is generally first transformed using the discrete cosine transform (DCT) [30] or the discrete wavelet transform (DWT) [31,32] and is then quantized and entropy coded. The DWT, which is also used in JPEG 2000 [33], is popular in a wide variety of applications, including activity monitoring [34], fault detection in inverter circuits [35], medical imaging [36], image denoising [37], image recognition [38], image reconstruction [39], watermarking [40], computer graphics, and real-time processing [41], due to its multiresolution feature and excellent energy compaction properties [42,43]. The hardware architectures for wearable visual sensors and portable imaging devices in the IoT and wireless multimedia sensor networks should require minimal hardware resources and consume little energy to achieve a small form factor and long battery life [44,45]. Generally, the computational capabilities of visual sensor nodes have been increasing in recent years [46]. Nevertheless, due to the economic pressures on visual sensor designs and despite the emergence of specialized hardware acceleration, e.g., FPGA and, components [47][48][49], the computational resources of visual sensors will likely remain scarce.
Emerging computing and communication paradigms, such as mobile ad hoc cloud computing [50,51], expect the nodes to not only transmit sensed images but also to participate in some service computing functions, e.g., for localized image analysis and decision making, which can be orchestrated through software-defined networking and control structures [52][53][54]. In order to make the economical functioning of wearable visual sensors in such networked systems feasible, the resource usage of the image coding and transform must be very low. In particular, as the DWT is an important component of an image coder for visual sensors, the DWT hardware architecture should have minimal area and energy consumption.

Related Work.
The conventional convolution-based DWT computation of an image requires a huge amount of memory due to its row- and column-wise scanning [55,56], making it unsuitable for memory-constrained wearable sensors. The different low-memory architectures reported in the literature for computation of the DWT can be categorized as line-based architectures [57], stripe-based architectures [58,59], block-based architectures [60,61], and the fractional wavelet filter (FrWF) architecture [62]. For an image of dimension J × J pixels, the line-, stripe-, and block-based architectures require random access memory (which we refer to as RAM or memory for brevity) in the range of 3J to 5.5J words, while the FrWF architecture requires 2J + 22 words of RAM [62].
Another low-memory pipeline-based architecture has been proposed in [63]. However, the design in [63] is based on the nonseparable DWT computation approach, which is unpopular because of its higher computational requirements than the conventional separable approach. At a given throughput, the separable 2D DWT computation approach is computationally more efficient than the nonseparable approach [64]. A dual data scanning-based DWT architecture is reported in [65]. In this architecture, several 2D DWT units are combined into a parallel multilevel architecture for computing up to six DWT levels. However, this architecture needs 13J words of memory. An architecture based on an interlaced read scan algorithm (IRSA) is proposed in [66] in conjunction with a lifting-based approach with a 5/3 filter-bank, which requires 2J words of memory. However, the long critical path delay (CPD) of 2T_m + 4T_a (where T_m is the multiplier delay and T_a is the adder delay) of the architecture in [66] may limit its use in real-time applications.
An LUT-based lifting architecture for computing the DWT has been reported in [67]. The design in [67] has low area and power requirements. However, it has a long CPD equal to T_LUT + (W/4 − 1)T_FA + 2T_a (where T_LUT is the look-up table (LUT) delay, W = 16 bits is the word length, and T_FA is the full adder delay). A lifting-based architecture for computing both the 1D and 2D DWT has been presented in [68]. However, this design uses a transpose buffer of size J^2. An energy-efficient block-based DWT architecture has been proposed in [61]. However, this architecture requires a large number of multipliers, namely, 16 and 36 multipliers for 5/3 and 9/7 filters, respectively. Another energy-efficient lifting-based reconfigurable DWT architecture has been proposed in [69], mainly for medical applications. However, the frequency of operation of this architecture is limited to 20 MHz. An energy-efficient lifting-based configurable DWT architecture for neural sensing applications has been proposed in [70], requiring 12 adders and 12 multipliers. However, its operating frequency is limited to only 400 kHz and 80 kHz for the gating and interleaving architectures used in the main architecture, respectively.
A power-efficient modified form of the DWT architecture has been presented in [71], using radix-8 Booth multipliers. This architecture uses bit truncation to reduce the area and power. However, bit truncation degrades the quality of the reconstructed image when the inverse DWT is applied. There have been some DWT implementations on graphics processing units (GPUs) [72][73][74][75][76][77][78]; however, GPUs are relatively expensive for low-cost sensing platforms. The recently proposed FrWF architecture requires only 2J + 22 words of memory and has a CPD equal to the delay of a multiplier T_m [62]. A multiplier-less FrWF architecture was also reported in [62], which reduces the CPD to the delay of an adder T_a, with T_a < T_m. However, the FrWF architecture (with and without multipliers) has high energy consumption owing to its large number of compute cycles. The high energy consumption of the FrWF architecture may be prohibitive for wearable sensors and portable imaging devices with tight memory and energy constraints [79].

Contributions and Structure of This Article.
This paper proposes the LFrWF_m, a novel lifting-based energy-efficient architecture to compute the DWT coefficients of an image with a 5/3 filter-bank. At the core of the proposed LFrWF_m architecture is a novel basic Lift_block that computes the H and L subband coefficients with only two two-input adders and one multiplier (plus two pipeline registers), thus greatly reducing the hardware requirements compared to prior convolution architectures. Moreover, a multiplier-less implementation of the proposed architecture, denoted by LFrWF_ml, is designed. The multiplier-less LFrWF_ml has a shorter CPD than the proposed multiplier-based LFrWF_m architecture.
The proposed LFrWF_m and LFrWF_ml architectures are not only efficient in terms of energy but also require fewer adders, multipliers, and registers than the state-of-the-art FrWF architectures (with multipliers (FrWF_m) and without multipliers (FrWF_ml)). We compare the proposed architectures with state-of-the-art DWT computation architectures in terms of the required adders, multipliers, memory, and critical path delay. We also implement the proposed architectures and the state-of-the-art FrWF architectures on the same FPGA board. Experimental results demonstrate that the proposed LFrWF_m and LFrWF_ml architectures have lower hardware resource requirements and energy consumption than the state-of-the-art FrWF_m and FrWF_ml architectures.
The remaining part of the paper is arranged as follows. Section 2 gives a brief overview of the DWT and FrWF techniques. The proposed lifting-based LFrWF architecture is described in detail in Section 3 along with its memory requirement. The evaluation results along with related discussions are presented in Section 4. Finally, Section 5 concludes the paper.

Background
This section briefly reviews the DWT and FrWF techniques along with the FrWF architecture. The main notations used in this article are summarized in Table 1.

Discrete Wavelet Transform (DWT).
The most popular approach for computing the two-dimensional (2D) DWT of an image is the separable approach, in which the rows are filtered first, followed by column-wise filtering of the resulting coefficients. When a row is convolved (filtered) by a low-pass filter (LPF) and a high-pass filter (HPF), followed by downsampling by a factor of two, the results are known as approximation and detail coefficients, respectively. For a 1D signal of dimension J, which we consider as a preliminary step for computing the 2D DWT, there are J/2 approximation coefficients and J/2 detail coefficients. Combining the downsampling with the convolution operation, the approximation coefficients a(i) and the detail coefficients d(i) for i = 0, 1, ..., J/2 − 1 can be expressed mathematically as [55]
\[ a(i) = \sum_{j=-\lfloor f_1/2 \rfloor}^{\lfloor f_1/2 \rfloor} l_j \, x_{2i+j}, \qquad (1) \]
\[ d(i) = \sum_{j=-\lfloor f_2/2 \rfloor + 1}^{\lfloor f_2/2 \rfloor + 1} h_j \, x_{2i+j}, \qquad (2) \]
respectively, whereby l_j and h_j denote the j-th LPF and HPF coefficient, respectively, x_{2i+j} denotes the (2i + j)-th signal sample, while f_1 and f_2 are the numbers of LPF and HPF coefficients, respectively. The largest integer less than or equal to x is denoted by the symbol ⌊x⌋.
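The filtering with combined downsampling can be illustrated with a minimal Python sketch for the 5/3 filter-bank. The tap index ranges (LPF taps l_{-2}, ..., l_2 and HPF taps h_0, h_1, h_2), the symmetric boundary extension, and the names `reflect` and `dwt53_conv` are assumptions of this sketch and are not taken from [55].

```python
from fractions import Fraction as F

# 5/3 filter-bank taps; the index conventions are assumptions of this sketch.
LPF = {-2: F(-1, 8), -1: F(2, 8), 0: F(6, 8), 1: F(2, 8), 2: F(-1, 8)}
HPF = {0: F(-1, 2), 1: F(1), 2: F(-1, 2)}

def reflect(k, n):
    """Whole-sample symmetric extension of index k into [0, n-1] (assumed)."""
    if k < 0:
        k = -k
    if k >= n:
        k = 2 * (n - 1) - k
    return k

def dwt53_conv(x):
    """Return (a, d): approximation and detail coefficients of one line."""
    n = len(x)
    a, d = [], []
    for i in range(n // 2):  # downsampling by two: one output pair per two samples
        a.append(sum(c * x[reflect(2 * i + j, n)] for j, c in LPF.items()))
        d.append(sum(c * x[reflect(2 * i + j, n)] for j, c in HPF.items()))
    return a, d
```

For a constant line, the detail coefficients vanish and the approximation coefficients reproduce the constant, since the LPF taps sum to one and the HPF taps sum to zero.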
In the separable approach, all image rows are first convolved separately by an HPF and an LPF, followed by downsampling by a factor of two, resulting in the H and L subbands. Then, the columns of the H and L subbands are convolved by an HPF and an LPF, followed by downsampling by a factor of two, resulting in the HH, HL, LH, and LL subbands [80]. However, this approach needs to save the entire J × J image in the RAM on the sensor (board) system. Thus, this DWT computation approach requires a huge amount of memory, making it unsuitable for low-cost wearable sensors and portable imaging devices with limited RAM [55,56].
The lifting scheme [81] computes the DWT of images using in-place computations, which save memory. Moreover, the lifting scheme uses predict and update steps for computing the subbands. In particular, the low-pass filtered coefficients are predicted using the high-pass filtered coefficients. Thus, the lifting scheme reduces the convolution operations needed by the LPF coefficients. Hence, the lifting scheme reduces the number of arithmetic computations required for computing the image DWT [82]. The lifting scheme for a 5/3 filter-bank is shown in Figure 1. In this figure, x_0, x_1, ..., x_6 are the input signal samples. Among these samples, x_0, x_2, x_4, and x_6 are the even-indexed samples, while x_1, x_3, and x_5 are the odd-indexed samples. Also, α and β are the high-frequency and low-frequency lifting parameters, respectively; G_0 and G_1 are the scaling parameters, whereby α = −0.5, β = 0.25, and G_0 = G_1 = 1 [66]; d_0, d_1, and d_2 are the high-frequency wavelet coefficients; while a_0, a_1, a_2, and a_3 are the low-frequency wavelet coefficients. The high- and low-frequency wavelet coefficients are computed following the diagram in Figure 1; for instance,
\[ d_0 = x_1 + \alpha (x_0 + x_2), \qquad a_1 = x_2 + \beta (d_0 + d_1). \]
It should be noted that the arrows without an associated symbol in Figure 1 have the unit multiplication factor, i.e., 1.
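The two lifting steps for the 5/3 filter-bank can be sketched in Python as follows. The sketch assumes an even-length line, unit scaling (G_0 = G_1 = 1), and mirrored samples at the line boundaries; the function name `lift53` and the boundary handling are illustrative assumptions rather than details of Figure 1.

```python
ALPHA, BETA = -0.5, 0.25  # lifting parameters stated in the text

def lift53(x):
    """One lifting pass over an even-length line; returns (a, d)."""
    n = len(x)
    half = n // 2
    # High-frequency step: odd sample plus ALPHA times its two even neighbours.
    d = []
    for i in range(half):
        # Mirror the last even sample at the right boundary (assumed).
        right = x[2 * i + 2] if 2 * i + 2 < n else x[2 * i]
        d.append(x[2 * i + 1] + ALPHA * (x[2 * i] + right))
    # Low-frequency step: even sample plus BETA times its two neighbouring d values.
    a = []
    for i in range(half):
        # Mirror the first d value at the left boundary (assumed).
        left = d[i - 1] if i > 0 else d[0]
        a.append(x[2 * i] + BETA * (left + d[i]))
    return a, d
```

Each output coefficient costs one multiplication and two additions here, compared with up to five multiplications for a direct five-tap low-pass convolution.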

Fractional Wavelet Filter (FrWF).
The FrWF is a low-memory DWT computation technique [56]. It uses a specific image data scanning technique in order to reduce the memory required for computing the DWT. It selects a vertical filter area (VFA), scanning f_1 rows of the image from an SD-card (where f_1 is the number of LPF coefficients). The rows in a VFA are read in raster scan order. Once the reading of all the image rows in a VFA is complete, the VFA is shifted by two lines in the vertical direction. This shifting of the VFA is done in order to incorporate the dyadic downsampling. One line of the HH, HL, LH, and LL subbands is computed from one VFA. All the image lines are covered by shifting the VFA. The VFA will be shifted J/2 times for an image of dimension J × J. The FrWF has been combined with a low-memory image coding algorithm to design an efficient image coder for WMSNs in [83]. An FPGA architecture for the FrWF with a 5/3 filter-bank has been proposed in [62]. This FrWF architecture, which follows the FrWF data scanning order, requires 2J + 22 words of memory and a total of 5J^2/2 compute cycles. The large number of compute cycles results in a high energy consumption, which may be prohibitive for resource-constrained wearable visual sensors and portable imaging devices. The proposed LFrWF focuses on reducing the energy consumption for computing the DWT of images.

Proposed LFrWF Low Energy Architecture
This section presents the proposed LFrWF lifting-based architecture to compute the DWT of an image using the FrWF approach with a 5/3 filter-bank.

Data Scanning Order.
The proposed lifting-based architecture follows the data scanning order of the FrWF algorithm [56]. It is assumed (as is common for low-memory implementations of the DWT computation) that the original image is stored on an SD-card; throughout, the SD accesses are appropriately buffered to compensate for the latencies of the SD-card accesses. Initially, a vertical filter area (VFA) which spans f_1 image lines (f_1 is the number of LPF coefficients) is marked in the SD-card. The rows of the image are read in raster scan order from the VFA, one line at a time, into the RAM buffer P_store (as shown in Figure 2). After the processing of all the rows of the VFA is completed, the VFA is shifted down by two lines and the new rows are again read into the buffer P_store in raster scan order. The complete image is read by repeatedly shifting the VFA downwards by two lines until all the rows are read. In the proposed architecture, one complete line is read at a time and scanned in raster order; in contrast, the FrWF architecture in [62] reads only 5 coefficients of an image line at a time.
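The scanning order can be sketched as a small Python generator that lists which image rows are read for each VFA position. The centering of the VFA on every second row and the mirroring at the top and bottom image borders are assumptions of this sketch; `vfa_rows` is a hypothetical helper name.

```python
def vfa_rows(J, f1=5):
    """Yield, for each VFA position, the image row indices read in raster order."""
    half = f1 // 2
    for top in range(0, J, 2):              # the VFA moves down by two lines
        rows = []
        for r in range(top - half, top + half + 1):
            if r < 0:
                r = -r                      # mirror above the first row (assumed)
            elif r >= J:
                r = 2 * (J - 1) - r         # mirror below the last row (assumed)
            rows.append(r)
        yield rows

# Example: an image with J = 8 rows gives J/2 = 4 VFA positions.
positions = list(vfa_rows(8))
```

With f_1 = 5 and a shift of two lines, each image row is read from the SD-card in two or three different VFA positions, which is the price paid for holding only one line in P_store.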

Proposed Lifting-Based LFrWF Architecture.
This subsection describes the proposed lifting-based DWT architecture in detail. Figure 2 shows the top-level block diagram of the proposed LFrWF architecture. The LFrWF architecture works as follows. First, the input image pixels of a line are read into the register P_store. This P_store register stores the original image pixels of 8 bits each. The pixels of the image from P_store are sent to the Lift_block (as detailed in Figure 3) to compute the H and L subband coefficients using the lifting scheme. The generated H and L subband coefficients are saved in the register 1D_store. The contents of the 1D_store register are used as inputs for the Conv_block (as shown in Figure 4), which generates intermediate coefficients that are saved in the HH_store, HL_store, LH_store, and LL_store registers. These intermediate values are successively updated by the next image lines. The intermediate values in the registers HH_store, HL_store, LH_store, and LL_store, after updating, will give the values of the HH, HL, LH, and LL subbands, respectively. Once the final subband coefficients of the HH, HL, LH, and LL subbands are computed, they are transferred and saved in an external SD-card. The functioning of the different blocks leading to the computation of the subbands is described next.

Lifting Block.
In the lifting scheme with a 5/3 filter-bank, two previous high-pass filtered coefficients are used to predict a low-pass filtered coefficient. For the efficient implementation of the lifting scheme, we introduce a novel basic Lift_block. As illustrated in Figure 3, the basic Lift_block computes two H subband coefficients and one L subband coefficient from a group of five input pixels in three steps. The inputs (Input_1, Input_2, Input_3, and Lift_par) and the output (Out_1) of the adders and the multiplier to be used in Figure 3 for the different steps are shown in Table 2. The first two steps compute two coefficients of the H subband and the third step computes a coefficient of the L subband. In Table 2, P_0, P_1, P_2, P_3, and P_4 are the first five pixels of an image line. H_0 and H_1 are the first two high-pass filtered coefficients, which are stored as the first two elements of the register 1D_store. L_0 is the first low-pass filtered coefficient and is stored as the third element of the register 1D_store. The high-pass filtered coefficients (H_0 and H_1) and the low-pass filtered coefficient (L_0) are computed as
\[ H_0 = P_1 + \alpha (P_0 + P_2), \qquad (3) \]
\[ H_1 = P_3 + \alpha (P_2 + P_4), \qquad (4) \]
\[ L_0 = P_2 + \beta (H_0 + H_1), \qquad (5) \]
where α = −0.5 and β = 0.25 are the lifting parameters [66]. Once the five pixels (P_0, P_1, P_2, P_3, and P_4) are processed, the first two pixels are discarded and two new pixels are read along with the previous last three pixels. The same procedure, in equations (3)-(5), is repeated on these new pixels to compute the H and L subband coefficients. The basic Lift_block in Figure 3 requires two two-input adders and one multiplier. The functionality of this basic Lift_block essentially replaces the functionality of the convolution stage-1 block in the FrWF_m architecture, as shown in Figure 3 in [62] and elaborated in Figures 4-7 in [62].
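The three steps of the basic Lift_block and the sliding five-pixel window can be sketched in Python as follows. The sketch assumes the standard 5/3 lifting relations, in which each step evaluates Input_2 + Lift_par · (Input_1 + Input_3); the mapping onto the rows of Table 2 and the function names are assumptions for illustration, not the hardware description.

```python
ALPHA, BETA = -0.5, 0.25  # lifting parameters from the text

def lift_step(in1, in2, in3, lift_par):
    """Two two-input additions and one multiplication: in2 + lift_par * (in1 + in3)."""
    return in2 + lift_par * (in1 + in3)

def lift_block(p):
    """Process a window of five pixels; returns (H0, H1, L0)."""
    p0, p1, p2, p3, p4 = p
    h0 = lift_step(p0, p1, p2, ALPHA)   # step 1: first H subband coefficient
    h1 = lift_step(p2, p3, p4, ALPHA)   # step 2: second H subband coefficient
    l0 = lift_step(h0, p2, h1, BETA)    # step 3: L subband coefficient
    return h0, h1, l0

def lift_line(line):
    """Slide the five-pixel window by two pixels along one image line."""
    return [lift_block(line[k:k + 5]) for k in range(0, len(line) - 4, 2)]
```

Because all three steps share the same add-multiply-add structure, a single hardware unit with two adders and one multiplier can serve them in successive cycles.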
For an LPF length of f_1 and an HPF length of f_2, the FrWF_m convolution stage-1 block in [62] requires f_1 − 1 two-input adders and f_1 multipliers for the low-pass filtering as well as f_2 − 1 two-input adders and f_2 multipliers for the high-pass filtering. Thus, for a 5/3 filter, the FrWF_m convolution stage-1 block requires six adders as well as eight multipliers.

Convolution Block.
In the Conv_block in Figure 4, the H subband coefficients from the 1D_store register are multiplied by a suitable HPF and LPF coefficient (as determined by a multiplexer) and then added to and stored with the previous value in the registers HH_store and HL_store, respectively. Similarly, the L subband coefficient in the 1D_store register is multiplied by a suitable HPF and LPF coefficient (as determined by a multiplexer) and then added to and stored with the previous value in the registers LH_store and LL_store, respectively. The values in the registers HH_store, HL_store, LH_store, and LL_store are updated to compute the coefficients of the HH, HL, LH, and LL subbands, respectively.
We note that the Conv_block in Figure 4 is essentially equivalent to the aggregation of the FrWF convolution stage-2 blocks in Figures 4-7 in [62]. The Conv_block in Figure 4 requires four two-input adders and four multipliers. On the other hand, the aggregation of the FrWF convolution stage-2 blocks in Figures 4-7 in [62] requires two two-input adders and two multipliers.
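The accumulation performed by the Conv_block can be sketched in Python as a multiply-accumulate over the lines of one VFA. The alignment of the vertical HPF taps with the last three lines of the VFA, the dictionary-based buffers, and the function names are assumptions of this sketch; in the architecture, a multiplexer selects the coefficient for each line.

```python
# Vertical taps per VFA line (assumed alignment): LPF on all five lines,
# HPF on the last three lines only.
V_LPF = [-1 / 8, 2 / 8, 6 / 8, 2 / 8, -1 / 8]
V_HPF = [0.0, 0.0, -1 / 2, 1.0, -1 / 2]

def new_stores(width):
    """Create zeroed line buffers for the four subbands."""
    return {name: [0.0] * width for name in ("HH", "HL", "LH", "LL")}

def conv_block_update(stores, h_row, l_row, vfa_line):
    """Accumulate one horizontally filtered image line into the four buffers."""
    lp, hp = V_LPF[vfa_line], V_HPF[vfa_line]
    for k, (h, l) in enumerate(zip(h_row, l_row)):
        stores["HH"][k] += hp * h   # H coefficient times HPF tap
        stores["HL"][k] += lp * h   # H coefficient times LPF tap
        stores["LH"][k] += hp * l   # L coefficient times HPF tap
        stores["LL"][k] += lp * l   # L coefficient times LPF tap
    return stores
```

Each loop iteration uses four multiplications and four additions, which mirrors the four multipliers and four two-input adders attributed to the Conv_block above.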

Pipeline Registers.
The Lift_block and the Conv_block use two and four pipeline registers, respectively, to temporarily save the intermediate results after each compute cycle. Through the use of the pipeline registers, the critical path delay (CPD) of the proposed LFrWF architecture becomes the multiplier delay T_m.
Overall, for a 5/3 filter, considering both the basic Lift_block (Figure 3) and the Conv_block (Figure 4), the proposed LFrWF_m requires six two-input adders and five multipliers compared to eight two-input adders and ten multipliers of the FrWF_m architecture (Figures 4-7 in [62]).
The proposed LFrWF architecture stores the original image and the subbands in the SD-card. Thus, higher wavelet decomposition levels can be computed with the same architecture, whereby the LL subband coefficients are taken as input.

Proposed Multiplier-less LFrWF_ml Implementation.
The 5/3 filter-bank coefficients (shown in Table 3) and the 5/3 filter-bank lifting parameters are integer multiples of negative powers of two. Thus, the convolution with the 5/3 filter-bank can be implemented with only shift and add operations. For example, z · 0.25 = z · 2^(−2), i.e., shifting the number z two times to the right is equivalent to dividing z by 4. The shift and add concept, as applied to the 5/3 filter coefficients, operates as follows:
(1) The filter coefficient l_{−2} = l_2 = −1/8 = −1/2^3 can be implemented by three right shift operations, followed by a complement operation
(2) The filter coefficient l_{−1} = l_1 = 2/8 = 1/2^2 can be implemented by two right shift operations
(3) The filter coefficient l_0 = 6/8 = 1/2^2 + 1/2 can be implemented by two right shift operations, followed by addition with one right shift
(4) The filter coefficient h_0 = h_2 = −1/2 can be implemented by one right shift operation, followed by a complement operation
(5) The coefficient h_1 = 1; thus, no shifting is required
With these specified shifting operations, the convolution block can be simplified and implemented using only shifters and adders. Multiplier-less computation blocks for the 5/3 LPF and HPF coefficients are given in Figures 5 and 6, respectively. One benefit of the multiplier-less implementation over the multiplier-based architecture in Section 3.2 is that the multiplier-less implementation reduces the CPD from the multiplier delay T_m down to the adder delay T_a.
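The five shift-and-add rules can be sketched with Python integer shifts. The function names are hypothetical, and the unary minus stands in for the complement operation; note that right shifts truncate, so the results are exact only when z is divisible by the corresponding power of two.

```python
def mul_l2(z):
    """l_{-2} = l_2 = -1/8: three right shifts, then negation."""
    return -(z >> 3)

def mul_l1(z):
    """l_{-1} = l_1 = 2/8: two right shifts."""
    return z >> 2

def mul_l0(z):
    """l_0 = 6/8 = 1/4 + 1/2: two right shifts plus one right shift."""
    return (z >> 2) + (z >> 1)

def mul_h0(z):
    """h_0 = h_2 = -1/2: one right shift, then negation."""
    return -(z >> 1)

def mul_h1(z):
    """h_1 = 1: the value passes through unchanged."""
    return z

def lpf_tap_sum(x0, x1, x2, x3, x4):
    """Multiplier-less five-tap low-pass output for five neighbouring samples."""
    return mul_l2(x0) + mul_l1(x1) + mul_l0(x2) + mul_l1(x3) + mul_l2(x4)
```

For example, `mul_l0(64)` evaluates 16 + 32 = 48, which equals 64 · 6/8 without any multiplication.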

Memory Requirement.
In order to compute the DWT coefficients, the proposed LFrWF architecture uses four registers (HH_store, HL_store, LH_store, and LL_store), two register arrays (P_store and 1D_store), and six pipeline registers. The register array P_store (of size J words) is used to store an image line. The H and L subband coefficients computed by the Lift_block are saved in the register array 1D_store of 3 words. The four registers HH_store, HL_store, LH_store, and LL_store are of J/2 words each. The total memory requirement of the proposed architecture is equal to the sum of all registers, i.e., J + 3 + 4 · J/2 + 6 words, which gives
\[ \mathrm{Mem}_{\mathrm{LFrWF}} = 3J + 9 \ \text{words}. \qquad (6) \]

Line Segmentation.
Equation (6) indicates that the LFrWF memory requirement grows with the image dimension J and thus will be significantly greater than the FrWF memory requirement of 2J + 22 words for large images. In order to reduce the memory requirement of the proposed LFrWF architectures, each image line may be segmented, as illustrated in Figure 7, with overlapping of ⌊f_1/2⌋ coefficients at both boundaries of each interior segment (the first and last segments only require overlapping at one boundary) (Appendix E in reference [88]). In this approach, only one line segment needs to be read into the register array P_store. Thus, the memory requirement of the LFrWF with G line segments is
\[ \mathrm{Mem}_{\mathrm{LFrWF}}(G) = \frac{J}{G} + 2 \lfloor f_1/2 \rfloor + 2J + 9 \ \text{words}. \qquad (7) \]
For the 5/3 filter-bank with a VFA of f_1 = 5 lines, the memory requirement is
\[ \mathrm{Mem}_{\mathrm{LFrWF}}(G) = \frac{J}{G} + 2J + 13 \ \text{words}. \qquad (8) \]
The other resource requirements are independent of line segmentation and remain unchanged. The FrWF architecture of [62] does not include a line segmentation provision; therefore, its memory requirement cannot be reduced further. We observe from Table 4 that, without segmentation, the memory requirements of the proposed LFrWF architectures are greater than the FrWF memory requirement of 2J + 22 words. However, with G line segments, the LFrWF memory requirement becomes J/G + 2J + 13 words, see equation (8). Therefore, the LFrWF memory requirement is less than the FrWF memory requirement if J/G < 9, i.e., if G > J/9.
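The memory trade-off can be checked numerically with a short Python sketch. It uses the expressions stated in the text for the 5/3 filter-bank (3J + 9 words without segmentation, J/G + 2J + 13 words with G segments, and 2J + 22 words for the FrWF); the function names and the assumption that G divides J are illustrative.

```python
def mem_frwf(J):
    """FrWF memory requirement in words, as stated in the text."""
    return 2 * J + 22

def mem_lfrwf(J, G=1):
    """LFrWF memory requirement in words for the 5/3 filter-bank."""
    if G == 1:
        return 3 * J + 9           # unsegmented line, no overlap needed
    return J / G + 2 * J + 13      # segmented line; assumes G divides J

J = 512
for G in (1, 2, 4, 8, 64):
    print(G, mem_lfrwf(J, G), mem_frwf(J))
```

For J = 512, the threshold G > J/9 ≈ 56.9 means that, among power-of-two segment counts, G = 64 is the first for which the LFrWF needs fewer words than the FrWF.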

Results and Discussion
This section presents the implementation of the proposed LFrWF architecture and its comparison with state-of-the-art architectures. First, we compare the proposed LFrWF architecture with several state-of-the-art architectures in terms of the required numbers of adders and multipliers, as well as the critical path delay (CPD) and required memory. Next, the postimplementation results of the proposed LFrWF architectures are compared with the state-of-the-art FrWF architecture [62] by implementing both architectures on the Xilinx Artix-7 FPGA platform.

Adders, Multipliers, CPD, and Memory.
Table 4 compares the numbers of required adders and multipliers, as well as the CPD and the required RAM of the proposed LFrWF architectures with state-of-the-art architectures. The numbers of adders and multipliers of the existing state-of-the-art architectures shown in Table 4 have been taken from the corresponding papers. We observe from Table 4 that [...] reduces the number of adders down to less than half of the other prior architectures. Among the architectures using multipliers, the proposed LFrWF_m architecture also requires the least number of multipliers, namely, only five multipliers, see Figures 3 and 4. Only the RMA [85] has a similarly low multiplier requirement with six multipliers (but requires approximately twice the memory compared to the LFrWF). The other prior architectures require twice or more multipliers than the proposed LFrWF_m architecture.
We also observe from Table 4 that the CPDs of the proposed LFrWF_m architecture and the FrWF_m architecture [62] are T_m, which is shorter than the CPDs of the architectures in [85,86]. We note from Table 4 that the multiplier-less LFrWF_ml and FrWF_ml have reduced the CPD to T_a, which is shorter than the CPDs of the other state-of-the-art architectures. The CPD of T_a achieved by the proposed LFrWF_ml architecture cuts the CPD of 2T_a achieved by the Aziz architecture [87], the shortest CPD of any existing architecture, down to half. Note that the shifter delay T_s is commonly larger than the adder delay T_a, i.e., T_s > T_a; thus, the PMA architecture [85] has a longer CPD than the Aziz architecture. The benefit of the reduction in CPD is that the architectures can be operated at higher frequencies, since the maximum operating frequency = 1/CPD. Table 4 furthermore indicates that the FrWF architecture has the lowest memory requirement. However, the memory requirement of the proposed LFrWF architecture is less than the memory requirement of the other state-of-the-art architectures in Table 4. As noted in Section 4.3, with segmentation of a line of J words (pixels) into G segments (of J/G words each), the LFrWF memory requirement drops below the FrWF memory requirement if more than J/9 segments are used.

FPGA Implementation.
The proposed LFrWF architecture computes the DWT coefficients of images based on lifting while following the FrWF approach. As observed from Table 4, the FrWF architecture [62] requires the least memory among the state-of-the-art architectures. Thus, we implemented the FrWF architectures [62] and the proposed LFrWF architectures (initially without segmentation, i.e., G = 1) on an Artix-7 FPGA (family: Artix-7, device: xc7a15t, package: csg324, speed: −2L). The implementations used identical multipliers, adders, and other components provided by the Xilinx Artix-7 FPGA family. All architectures used an input pixel width of 8 bits and a data-path width of 16 bits. Table 5 summarizes the FPGA implementation comparison. We report averages for evaluations with seven popular 512 × 512 (8 bits/pixel) test images, namely, "lena," "barbara," "goldhill," "boat," "mandrill," "peppers," and "zelda," obtained from the Waterloo Repertoire (http://links.uwaterloo.ca). The energy consumption is evaluated by multiplying the number of compute cycles with the average power consumption and the compute (clock) cycle durations of 5.0 ns and 1.5 ns for the architectures with multipliers and without multipliers, respectively. These clock cycle durations have been selected to satisfy the CPD constraint, as given in Table 5, namely, a CPD of 4.8 ns for the design with multipliers and a CPD of 1.45 ns for the multiplier-less design. The number of compute cycles and the average power consumption were evaluated by simulation with the Xilinx Vivado software suite, version 2018.2.
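The energy evaluation described above amounts to a product of three quantities, which the following Python sketch makes explicit. The clock cycle durations and CPDs are the values stated in the text; the cycle counts and the 0.1 W average power in the example are hypothetical placeholders, not values from Table 5.

```python
def energy_joules(cycles, avg_power_w, clock_period_s):
    """Energy = number of compute cycles x average power x clock cycle duration."""
    return cycles * avg_power_w * clock_period_s

def max_frequency_hz(cpd_s):
    """Maximum operating frequency = 1 / CPD."""
    return 1.0 / cpd_s

# Hypothetical illustration: halving the cycle count at equal power and clock
# period halves the energy.
e_ref = energy_joules(10_000_000, 0.1, 5.0e-9)
e_half = energy_joules(5_000_000, 0.1, 5.0e-9)

# CPDs stated in the text: 4.8 ns with multipliers, 1.45 ns without.
f_with_mult = max_frequency_hz(4.8e-9)
f_mult_less = max_frequency_hz(1.45e-9)
```

With the stated CPDs, the maximum operating frequencies are roughly 208 MHz for the designs with multipliers and roughly 690 MHz for the multiplier-less designs.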
We observe from Table 5 that the proposed LFrWF_m architecture requires approximately 22% fewer LUTs, 34% fewer FFs, and 50% fewer compute cycles, and consumes 65% less energy than the FrWF_m architecture. Due to the reduced number of hardware components (LUTs and FFs), the area occupied by the LFrWF_m architecture will be less than the area of the corresponding FrWF_m architecture. Moreover, the proposed multiplier-less LFrWF_ml architecture requires 2.6% fewer FFs and 50% fewer cycles and consumes 43% less energy than the multiplier-less FrWF_ml architecture [62]. The proposed LFrWF_ml architecture requires slightly more LUTs than the multiplier-less FrWF_ml architecture.
We also observe from Table 5 that the proposed LFrWF reduces the number of required compute cycles to roughly half the compute cycles required by the FrWF. More specifically, while the FrWF requires on the order of 10 million compute cycles for a 512 × 512 image, the proposed LFrWF requires only a little more than 5 million compute cycles.
This substantial reduction is primarily due to the computational efficiency of the novel Lift_block (see Section 3.2.2) for computing the decomposition subband coefficients.
Moreover, we observe from Table 5 that the power consumption of the proposed LFrWF architecture with multipliers is less than the power consumption of the corresponding FrWF architecture with multipliers, while the multiplier-less LFrWF and FrWF have approximately the same power consumption. The energy consumption is evaluated by multiplying the clock cycle duration (which is based on the CPD) with the number of clock cycles and the consumed power. Due to the reduced (almost half) number of compute cycles and the lower (or same) power consumption, the energy consumption levels of the proposed LFrWF architectures are substantially lower than the energy consumption levels of the FrWF architectures. We further observe from Table 5 that compared to the designs with multipliers, the multiplier-less designs of both the LFrWF and the FrWF have the same numbers of clock cycles, but shorter CPDs and (slightly) reduced power levels; thus, the multiplier-less designs have substantially reduced energy consumption levels. We also observe from Table 5 that both architectures have the same CPD. We note that the numbers of hardware components, e.g., adders, multipliers, LUTs, and FFs, and other parameters, such as the number of clock cycles, memory, and CPD (T_m or T_a), are independent of the platform on which the design is implemented and the test image. Among the results presented in Tables 4-6, only the energy consumption, the power consumption, and the energy delay product (EDP) depend on the platform and image.

Line Segmentation.
We observe from Table 6 that increasing the number of line segments G reduces the memory requirement while increasing the number of compute cycles and the energy consumption. The compute cycle and energy consumption increases are mainly due to the overlapping of f_1/2 coefficients at the line segment boundaries, which need to be read twice. However, we observe from Tables 5 and 6 that for all line segmentations (G = 2, 4, 8), i.e., even with G = 8 segments per line, the number of compute cycles and the energy consumption of the proposed LFrWF architectures are less than those of the corresponding FrWF architectures. Since the FrWF architectures of [62] read only 5 pixels at a time, the segmentation approach cannot be incorporated into the FrWF architecture. Hence, the memory of the FrWF architectures cannot be further reduced by incorporating line segmentation.
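The extra reads caused by the overlap can be made concrete with a small Python sketch that lists the pixel range loaded for each segment. It assumes that G divides J and that each segment is extended by ⌊f_1/2⌋ pixels at every interior boundary; `segment_ranges` is a hypothetical helper, not part of the architecture description.

```python
def segment_ranges(J, G, f1=5):
    """Return (start, stop) pixel ranges of G segments with boundary overlap."""
    seg = J // G          # nominal segment length; assumes G divides J
    ov = f1 // 2          # overlap at each interior boundary
    ranges = []
    for g in range(G):
        start = max(0, g * seg - ov)        # no overlap before the first pixel
        stop = min(J, (g + 1) * seg + ov)   # no overlap after the last pixel
        ranges.append((start, stop))
    return ranges

def pixels_read(J, G, f1=5):
    """Total number of pixels loaded for one line, including re-read overlaps."""
    return sum(stop - start for start, stop in segment_ranges(J, G, f1))
```

For J = 512 and G = 4, the segments cover 524 pixels in total, i.e., 12 pixels per line are read twice, which illustrates why the compute cycles grow moderately with G.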
The EDPs of the LFrWF and FrWF architectures with and without multipliers are compared in Figures 8 and 9, respectively. The EDP, which characterizes both the consumed energy and the computational performance, is evaluated by multiplying the consumed energy with the corresponding clock cycle duration. We observe from Figures 8 and 9 that the EDPs of the proposed LFrWF architectures (with and without multipliers) are less than the EDPs of the corresponding FrWF architectures (with and without multipliers). The EDP of the proposed LFrWF_m architecture with multipliers (G = 1) is approximately 65% less than that of the FrWF architecture with multipliers, and the EDP of the proposed LFrWF_ml multiplier-less architecture (G = 1) is approximately 43% less than that of the multiplier-less FrWF architecture. Since the LFrWF and FrWF designs of each type use the same clock cycle duration, these EDP reductions equal the corresponding energy reductions.

Conclusion
This paper proposed and evaluated a lifting-based architecture to compute the DWT coefficients of an image based on the FrWF approach with a 5/3 filter-bank. The proposed architecture requires fewer adders and multipliers than state-of-the-art architectures. The proposed architecture with multipliers (LFrWF_m) and without multipliers (LFrWF_ml) and the state-of-the-art FrWF architecture (with and without multipliers) [62] have been implemented on the same FPGA board and compared. The experimental results show that the proposed LFrWF_m architecture requires fewer hardware components (and thus less area) and consumes 65% less energy than the FrWF_m architecture. Moreover, the proposed LFrWF_ml architecture consumes 43% less energy with only a slight increase in area compared to the FrWF_ml architecture. The lower energy consumption with minimal area overhead makes the proposed architectures promising candidates for computing the DWT of images on resource-constrained wearable sensors.
An important direction for future research is to integrate the LFrWF architecture with efficient architectures of state-of-the-art wavelet-based image coding algorithms to design FPGA-based image coders for real-time applications on wearable visual sensors and IoT platforms. Another interesting future research direction is the examination of the use of our proposed approach in the context of compressive sensing [15,89].

Data Availability
The evaluation data used to support the findings of this study are included within the article.