Energy-Efficient Hardware Architectures for the Packet Data Convergence Protocol in LTE-Advanced Mobile Terminals

In this paper, we present and compare efficient low-power hardware architectures for accelerating the Packet Data Convergence Protocol (PDCP) in LTE and LTE-Advanced mobile terminals. Specifically, our work proposes the design of two cores: a crypto engine for the Evolved Packet System Encryption Algorithm (128-EEA2) that is based on the AES cipher and a coprocessor for the Least Significant Bit (LSB) encoding mechanism of the Robust Header Compression (ROHC) algorithm. With respect to the former, we first propose a reference architecture, which reflects a basic implementation of the algorithm; we then identify area and power bottlenecks in the design; and finally we introduce and compare several architectures targeting the most power-consuming operations. With respect to the LSB coprocessor, we propose a novel implementation based on a one-hot encoding, thereby reducing the hardware's logic switching rate. Architectural hardware analysis is performed using Faraday's 90 nm standard-cell library. The obtained results, when compared against the reference architecture, show that these novel architectures achieve significant improvements, namely, 25% in area and 35% in power consumption for the 128-EEA2 crypto-core, and even larger reductions for the LSB coprocessor, that is, 36% in area and 50% in power consumption.


Introduction
New data-demanding mobile applications, such as video streaming and online gaming, are the main drivers for higher mobile data rates. New standards improve the mobile Internet experience by providing downlink data rates starting from 100 Mbit/s in LTE up to 1 Gbit/s for LTE-Advanced [1]. This huge increase in data rates imposes new challenges on the design of mobile devices, where computational power and battery lifetime are strictly limited.
A recent uplink/downlink performance analysis of an LTE protocol stack on a representative virtual mobile platform [2,3] has identified the Packet Data Convergence Protocol (PDCP) as the most time-critical component within the Layer 2 software architecture. PDCP incorporates two computationally expensive tasks: the Robust Header Compression (ROHC) algorithm, which compresses IP packet headers in order to improve the spectral efficiency of radio links, and the ciphering algorithm, responsible for user data protection and for providing secure communication towards the core network. While both protocol functions show long processing times, ciphering is the most expensive, followed by ROHC. Apart from performance requirements, PDCP algorithms must be designed for low power in order to enable a longer battery lifetime, a crucial feature for mobile devices. As more and more functionalities are being integrated in mobile handsets, area-optimized designs are also required to reduce chip costs.
Two architectural approaches exist to accelerate computationally expensive processing functions. The first one is based on embedded multicore processing to speed up software execution of protocol stack functionalities [4][5][6][7]. This approach provides flexibility and reuse of computational resources, as processor cores can be shared by several applications, including the protocol stack. However, it might be very challenging to identify and exploit parallelism, as most of the currently used mobile algorithms were not designed with the aim of running on a parallel architecture [8]. The second approach is based on the design of dedicated hardware accelerators in the form of cores. These hardware blocks are connected to an on-chip interconnect to offload the communication processor and boost processing speed, in order to satisfy the targeted timing requirements.
In addition to the above-mentioned issues on parallelism, the software approach requires scaling the number of processor cores in order to increase the processing performance. This does not fit well with the tight area and power constraints of mobile handsets, especially if we consider LTE-Advanced terminals, with data rates in the order of 1 Gbit/s. Another advantage of the hardware approach over its software counterpart is its ability to achieve very low absolute power consumption levels by appropriately tuning its architecture internals [9,10]. Consequently, mobile devices based on heterogeneous multicore platforms will remain mainstream for the next few years [11].
1.1. Main Contributions. Depending on the domain taken into consideration, Register Transfer Level (RTL) hardware architectures focus either on high-speed cores or on low-power designs. As the latter is a paramount key performance indicator for mobile devices, the focus of this work is on the design of low-power architectures for the PDCP protocol within Layer 2 of a cellular protocol stack. Our contribution is summarized as follows.
(i) We develop an energy-efficient crypto-core for the 128-EEA2 confidentiality algorithm, which is part of the PDCP protocol. This is achieved by optimizing the power of the core cipher on which 128-EEA2 is based, that is, the Advanced Encryption Standard (AES) in counter mode. For that purpose, we introduce several power-efficient architectures and evaluate them in terms of area and power consumption.
(ii) We present a configurable data path architecture for the 128-EEA2 crypto-core to limit the peak power in modern power-constrained mobile devices. The hardware adapts its peak power seamlessly at run time by switching between several architectural modes.
(iii) We propose a novel, compact, and energy-efficient hardware architecture for the Least Significant Bit (LSB) encoding, the core compression mechanism in the ROHC algorithm. The architecture is evaluated against a reference architecture.
1.2. Paper Organization. The rest of the paper is organized as follows: in Section 2, we present a brief literature survey. In Section 3, we describe both the confidentiality (128-EEA2) and ROHC algorithms. In Section 4, critical functional blocks in 128-EEA2 are identified and various architectural optimizations are presented. Section 5 describes the LSB coprocessor with techniques for power reduction. We evaluate the proposed architectures and discuss the results in Section 6, leading to the conclusion in Section 7.

Related Work
Ciphering algorithms in mobile handsets are typically accelerated by dedicated hardware [12][13][14][15]. Hardware architectures for the core ciphers of several LTE confidentiality algorithms were separately implemented in various works [16][17][18], where the main focus was to deliver high-speed FPGA or ASIC designs. A few recent works [19,20] proposed compact and power-efficient architectures for both the 128-EEA1 and 128-EEA3 ciphering algorithms. However, previous works proposing low-power implementations of AES, the core cipher of 128-EEA2, are limited to the byte substitution function [21][22][23][24]. To the best of our knowledge, no research work has introduced hardware power reduction techniques for AES that specifically target its mode of operation in the LTE protocol. There are, however, some attempts to accelerate both the 128-EEA1 and 128-EEA2 algorithms by parallelizing them on an embedded ARM multicore processor [25,26]. The focus in these works was first to deliver an efficient implementation with the minimum possible instruction count. Parallelism was then investigated and deployed at both the algorithmic level (128-EEA1) and the block level (128-EEA2) for further optimization of the processing time. In [25], different message blocks are dispatched to available cores for parallel processing. Energy savings were also achieved compared to single-core implementations by exploiting the excess in performance gain for frequency and voltage scaling [9], or by mapping protocol stack software to several small/big cores associated with different speeds and supply voltages [27].
Although the above-mentioned works do not require any additional resources in a multicore-based handset, they were not able to achieve processing throughputs fulfilling LTE-Advanced requirements. While some works proposed hardware architectures for ciphering algorithms, very few considered hardware designs for Robust Header Compression (ROHC). Two acceleration concepts for ROHC are presented in [28]: scratch pad memories and dedicated hardware support. These concepts were investigated at a high level with system simulations, without suggesting concrete hardware architectures. Another work [29] focused on implementing a high-performance hardware architecture for an outdated release of ROHC in network processors, where energy consumption is not an issue. In [30], an architecture for ROHC version 2 (ROHCv2) hardware with multiple-flow support was introduced for LTE devices. It investigates the advantages of ROHCv2 hardware by comparing it to its software counterpart on an embedded FPGA-based platform. The proposed hardware architecture is, however, too large and does not suggest any power reduction for the Window-based LSB encoding function, which consumes two thirds of the total hardware power [30].
The ROHC algorithm contains many memory checking operations; therefore, implementing its full functionality in hardware implies a huge design, which requires the transfer of all headers through an on-chip bus. Moreover, such a design is not efficient due to both the considerable power consumption levels of interconnects [31] and the high transfer time penalty caused by bus congestion. In this work, we instead suggest partitioning ROHC into software and hardware [32]. The software part executes the tasks responsible for memory checking, as packet header fields typically reside in the processor's cache memory, whereas the hardware part realizes the most time- and power-consuming operation, that is, the least significant bit (LSB) encoding functionality.

Packet Data Convergence Protocol
The Packet Data Convergence Protocol (PDCP) resides in Layer 2 of the LTE cellular communication protocol stack [33]. It is part of the data plane and receives user data in the form of IP packets during uplink communications. First, the headers of incoming packets are compressed using the specified Robust Header Compression (ROHC) algorithm. After that, IP packets are ciphered using one of the standardized (128-EEA1, 128-EEA2, and 128-EEA3) confidentiality algorithms [34]. The encryption algorithm is selected when the connection is established, depending on the control messages signaled by the Radio Resource Control (RRC) in the control plane. A PDCP header is then added to each ciphered packet before it is fed to the next downstream protocol, the Radio Link Control (RLC) sublayer. At the receiver side, PDCP performs the reverse functionality, represented by deciphering and header decompression operations. In the following, both the ciphering and compression algorithms considered in our study are described in more detail.

128-EEA2 Ciphering Algorithm.
The Evolved Packet System Encryption Algorithm (128-EEA2) is one of the three confidentiality algorithms specified in LTE [34]. It is based on the 128-bit Advanced Encryption Standard (AES) running in counter (CTR) mode [35,36]. The structure of the 128-EEA2 confidentiality algorithm is shown in Figure 1. The 128-bit counter block is used as an input to be processed by AES. The content of this block is initialized with communication parameters such as the packet count (COUNT), the radio bearer (BEARER), and the direction of communication (DIRECTION). The output of AES is then combined with the user data using the XOR operation to generate the so-called ciphertext (uplink encryption) or plaintext (downlink decryption). The PDCP packet length (LENGTH) is additionally used to strip out unwanted bits in case the packet being ciphered is not a multiple of the block size (128 bits). The counter mode is characterized by incrementing the Least Significant Eight Bytes (LSEB) value of the counter block by one upon ciphering of each data block within a PDCP packet. In addition, the COUNT value in the counter block is incremented by one before processing a new PDCP packet.
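The counter-mode flow described above can be sketched in a few lines of Python. This is only an illustrative model, not the paper's hardware: the bit layout of the counter block (COUNT, then BEARER, then DIRECTION, then zero padding in the upper 64 bits) follows our reading of the 3GPP specification and should be checked against it, the function names are hypothetical, and `toy_block_encrypt` is a hash-based stand-in used only to keep the example self-contained. It is NOT AES.

```python
import hashlib

BLOCK_BYTES = 16  # 128-bit block size


def initial_counter_block(count: int, bearer: int, direction: int) -> bytes:
    """Upper 64 bits: COUNT (32) | BEARER (5) | DIRECTION (1) | 26 zero bits.
    Lower 64 bits (the LSEB) start at zero. Layout is an assumption to verify."""
    assert 0 <= count < 2**32 and 0 <= bearer < 2**5 and direction in (0, 1)
    upper = (count << 32) | (bearer << 27) | (direction << 26)
    return upper.to_bytes(8, "big") + bytes(8)


def increment_lseb(block: bytes) -> bytes:
    """Increment the least significant eight bytes by one (modulo 2^64)."""
    lower = (int.from_bytes(block[8:], "big") + 1) % 2**64
    return block[:8] + lower.to_bytes(8, "big")


def ctr_crypt(block_encrypt, key, count, bearer, direction, data, length_bits):
    """XOR the data with the keystream; the same call encrypts and decrypts."""
    counter = initial_counter_block(count, bearer, direction)
    out = bytearray()
    for off in range(0, len(data), BLOCK_BYTES):
        keystream = block_encrypt(key, counter)
        chunk = data[off:off + BLOCK_BYTES]
        out.extend(a ^ b for a, b in zip(chunk, keystream))
        counter = increment_lseb(counter)
    # Strip bits beyond LENGTH when the packet is not a multiple of the block size.
    out = out[:(length_bits + 7) // 8]
    if out and length_bits % 8:
        out[-1] &= (0xFF << (8 - length_bits % 8)) & 0xFF
    return bytes(out)


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for the AES-128 block function (NOT AES)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]
```

Passing a real AES-128 block function as `block_encrypt` would turn the sketch into a functional model of the cipher; the stand-in only demonstrates the counter handling and the XOR symmetry between ciphering and deciphering.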
The AES cipher operates on the counter block organized in a 4 × 4 byte matrix, referred to as the state. The cipher consists of ten successive rounds preceded by a key addition operation (not shown in the figure). Each round is based on four operations: ByteSub, ShiftRow, MixColumn, and AddKey. The last round, however, performs no MixColumn operation. A key scheduling function is applied on every cipher key (KEY) update to generate the keys associated with each round. ByteSub performs a nonlinear byte-to-byte substitution with the Rijndael S-Box. In the ShiftRow operation, each row of the state is then rotated to the left by a number of bytes equal to its row index. The MixColumn function follows with a modulo multiplication of the state columns by rotated versions of the polynomial vector (03 01 01 02). Finally, AddKey combines the state bytes with the corresponding round key using XOR operations. The result of each round is fed to the next round until the keystream block is delivered by the last round.
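As a small illustration of the linear round operations, the following Python sketch models ShiftRow, MixColumn, and AddKey on a state stored as four rows of four bytes. The function names and the row-major list layout are choices made for this example only; ByteSub is omitted here because it requires the S-Box table.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 02 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def mul3(b: int) -> int:
    """Multiply a byte by 03, i.e. (02 * b) XOR b."""
    return xtime(b) ^ b


def shift_rows(state):
    """state[r] is row r of the 4x4 state; row r is rotated left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_column(col):
    """Multiply one state column by the fixed MixColumn matrix."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


def add_key(state, round_key):
    """XOR every state byte with the matching round-key byte."""
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]
```

For instance, `mix_column([0xdb, 0x13, 0x53, 0x45])` evaluates to `[0x8e, 0x4d, 0xa1, 0xbc]`, a commonly quoted MixColumn test vector.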
ROHC achieves a header size between one and four bytes, providing an optimum overhead reduction by a factor of 40. As a result, the effective communication bandwidth is increased. The ROHC framework [37] was specified by the Internet Engineering Task Force (IETF) [38]. An optimized version named ROHCv2 [39] was subsequently released with modifications aiming to improve robustness against high packet loss and retransmissions in wireless links. The ROHC framework defines encoding methods for several profiles. For the purpose of our study, we focus on the most complex profile, IP/UDP/RTP. Each packet header contains many fields; some are fixed (static) and others may change rarely or predictably (dynamic) during a communication session. A reference copy of all header fields is stored at both the compressor and the decompressor to ensure a lossless transmission. The compressor starts operating in the Initialization and Refresh (IR) state, where a number of complete packet headers are sent for synchronization with the decompressor. Afterwards, a transition to the Compression (CO) state is made and compressed packet headers are sent with Cyclic Redundancy Check (CRC) protection. If the CRC does not verify a successful decompression, more header information is transmitted to repair the context at the decompressor. In the bidirectional mode, a negative acknowledgement feedback signal (SNACK or NACK) is sent by the decompressor, requesting retransmission of full header context (IR) packets or dynamic context (CO-Repair) packets. In the unidirectional mode, however, where no feedback channel is provided, IR packets are sent periodically to refresh the reference context at the decompressor, resulting in less efficient compression.
Compressed packets are generated by encoding two dynamic fields: the IP Identifier (IP-ID) and the RTP Timestamp. An offset (IP_ID_DELTA) is computed by subtracting the Message Sequence Number (MSN) from IP-ID. Since real-time audio and video applications employ codecs with fixed sampling periods, the RTP Timestamp usually increases by a fixed value (TS_STRIDE) and hence can be expressed as a variable number (TS_SCALED) multiplied by the stride. Both TS_SCALED and IP_ID_DELTA are further encoded with the Least Significant Bit (LSB) scheme, where only the difference between the value to be encoded ($v$) and a successfully received previous value ($v_{\mathrm{ref}}$) is sent, as illustrated in Figure 3. LSB defines the following interpretation interval:
$$f(v_{\mathrm{ref}}, k) = \left[\, v_{\mathrm{ref}} - p,\; v_{\mathrm{ref}} + (2^{k} - 1) - p \,\right],$$
where $k$ is the number of least significant bits to be sent. The value of $k$ must be calculated so that $v$ lies in the interpretation interval. The function which determines $k$ is referred to as $g(v_{\mathrm{ref}}, v)$. A shifting parameter $p$ might be tuned according to channel conditions to improve compression efficiency. To improve the robustness of LSB encoding, a Window-based LSB (WLSB) is applied by employing a sliding window which stores a certain number of previous reference values. The value of $k$ is then determined using the lower ($v_{\mathrm{refmin}}$) and upper ($v_{\mathrm{refmax}}$) bounds of the sliding window as follows:
$$k = \max\bigl( g(v_{\mathrm{refmin}}, v),\; g(v_{\mathrm{refmax}}, v) \bigr).$$
WLSB guarantees correct decoding of $v$ as long as $v_{\mathrm{ref}}$ at both the compressor and the decompressor lies in the $[v_{\mathrm{refmin}}, v_{\mathrm{refmax}}]$ interval.
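The LSB and WLSB rules can be illustrated with a short Python sketch. Here `v` is the value to encode, `v_ref` a reference value, `k` the number of transmitted bits, and `p` the shifting parameter; the interval is taken in the form commonly given for ROHC, from `v_ref - p` to `v_ref + (2**k - 1) - p`. The sketch makes simplifying assumptions: `p` is a constant, arithmetic is not wrapped modulo the field size, and a hypothetical `width` parameter caps `k` at the full field width when no smaller `k` fits.

```python
def in_interval(v: int, v_ref: int, k: int, p: int = 0) -> bool:
    """True if v lies in the interpretation interval for k transmitted bits."""
    return v_ref - p <= v <= v_ref + (2**k - 1) - p


def g(v_ref: int, v: int, p: int = 0, width: int = 16) -> int:
    """Smallest k for which v lies in the interval around v_ref.
    Falls back to the full field width if no k fits (assumption)."""
    for k in range(width + 1):
        if in_interval(v, v_ref, k, p):
            return k
    return width


def wlsb_k(window, v: int, p: int = 0, width: int = 16) -> int:
    """Window-based LSB: take the larger k over the window's min and max references."""
    return max(g(min(window), v, p, width), g(max(window), v, p, width))
```

With a window holding the references 90, 95, and 100 and a new value of 105, the minimum reference requires four bits while the maximum requires three, so `wlsb_k` returns the more conservative value.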

128-EEA2 Crypto-Core
The hardware architecture of the EEA2 core is depicted in Figure 4. It is composed of three main blocks: a control unit, a counter block, and an AES engine. The latter contains a Galois Field (GF(2)) counter required for the computation of the round keys by the key scheduler. In addition, the control unit is adapted in such a way that the same counter is also used to track the number of encryption rounds executed by the round module. Round processing is split into four parallel quarter rounds (QR0 to QR3), each of which processes a 4-byte chunk, that is, a state column, of the message block. The counter block is a memory-mapped I/O 128-bit register. It can be directly accessed by software and updated with a new initialization vector before encryption is triggered. Synchronization between modules is maintained by the control unit, which checks the input message length and activates the relevant blocks to ensure a complete encryption of the message.
In order to analyze the improvements with respect to area and power consumption of the 128-EEA2 core, a basic hardware architecture was implemented as a reference design. This basic architecture was designed without any special architectural optimizations and realizes the Sboxes as lookup tables (LUT) using several multiplexers. It is described as a VHDL model and synthesized using Faraday's 90 nm standard cell library with a core power supply of 1 V.
Figure 5 displays the power and area breakdown of the 128-EEA2 core across its processing modules. In this figure, both the control and counter block modules are grouped together with the state registers under one category (Control + Reg). The other categories represent functionalities in the AES engine. In particular, SubBytes, MixColumn, and AddKey belong to the Round module. A large portion of the power (39%) is consumed by SubBytes, as it contains sixteen substitution boxes (Sbox). KeySchedule follows with 31% of the core's power, which is relatively high considering that it contains only a quarter of the number of Sboxes used in SubBytes. This can be explained by the heavy logic switching in KeySchedule's Sboxes: consecutive round keys are highly uncorrelated, as required to provide a higher level of security. In addition, most of the core's area is occupied by SubBytes (55%), followed by KeySchedule (21%). As a result, we identify both the round and the key scheduler as the most power-critical blocks, where Sboxes have a considerable share of power and area in both modules. In the following, we present some optimization techniques focusing on these three components in order to reduce the overall area and power consumption.

Substitution Box.
To optimize the Sboxes, we have investigated three architectural alternatives to the LUT reference implementation. The first is based on a Decoder-Switch-Encoder (DSE) architecture [23]. Within this architecture, the transformation value is generated in three steps. First, an 8-bit input word is one-hot encoded using a decoder. The generated 256 bits then enter a switch and are routed to another 256-bit one-hot code according to the input/output mappings of Rijndael's Sbox. Finally, the switch output is converted from the one-hot encoded scheme back to an 8-bit binary representation, which holds the output transformation value. As a second alternative, we propose a Nibble-Based Decoder-Switch-Encoder (NBDSE) architecture, which splits the input byte into high and low nibbles to reduce Sbox complexity, as shown in Figure 6. The transformation of each nibble is computed in parallel to obtain the low and high nibbles of the substitution value. In this context, the decoded high and low nibbles can be perceived as column and row lines, respectively. Each nibble switch consists of a permutation block which maps the decoded row lines to sixteen sub-switches (highlighted in grey) based on an offline analysis of the Sbox contents. Figure 6 depicts the structure of a sub-switch containing three levels of logic. Row nibble transformations within a column sharing the same value are grouped through OR gates followed by AND gates, as depicted in Figure 6. The output of the sub-switches forms a 16-bit one-hot code which is fed into a 16-to-4 bit encoder to deliver the nibble's transformation value.
Finally, the third architecture, a Correlation-Based Implementation (CBI) of the Sbox, realizes a logic array of 16 rows by 16 columns. Within a row, every four neighboring transformations are grouped and replaced by the output of a switch (SW), as shown in Figure 7. A 16-bit pattern word is computed once for all rows in the Sbox from the lower column address bits, whereas the upper bits control the row multiplexer to select the corresponding switch output. To avoid unnecessary switching in the multiplexers of nonselected rows, the pattern bits are masked with the decoded column address bits. The wiring of individual switch outputs with the masked pattern bits is obtained through an offline analysis of the Sbox contents, where the correlation between grouped transformations is studied. A two-stage output multiplexer uses the 4-bit row address to deliver the final transformation value.
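The three DSE steps (decode, switch, encode) can be mimicked in software to make the data flow concrete. The Python sketch below is a behavioural model only and says nothing about gate-level cost; the S-Box table is generated from its standard definition (multiplicative inverse in GF(2^8) followed by the affine transformation) rather than typed in, and all function names are chosen for this example.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def rotl8(x: int, n: int) -> int:
    """Rotate a byte left by n bit positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Rijndael S-Box: GF(2^8) inverse (0 maps to 0) plus affine transformation."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def decoder(x: int) -> int:
    """8-bit binary word -> 256-bit one-hot code (exactly one active line)."""
    return 1 << x


def switch(one_hot: int) -> int:
    """Fixed wiring: input line i drives output line SBOX[i]."""
    out = 0
    for i in range(256):
        if (one_hot >> i) & 1:
            out |= 1 << SBOX[i]
    return out


def encoder(one_hot: int) -> int:
    """256-bit one-hot code -> 8-bit binary word (index of the active line)."""
    return one_hot.bit_length() - 1


def sbox_dse(x: int) -> int:
    """Behavioural model of the decoder-switch-encoder S-Box."""
    return encoder(switch(decoder(x)))
```

Because only one line is active at every stage, a change of the input byte toggles very few signals in the switch, which is the intuition behind the low switching activity attributed to this structure.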

Key Caching.
According to the LTE protocol [33,34], the cipher key usually changes during cell handovers and is rarely altered within a communication session. In other words, many PDCP packets and their corresponding counter blocks share the same cipher key within a single connection. We exploit this property to reduce the power consumption by caching the round keys so that other packets can reuse them during an established connection. Figure 8 depicts the architecture of the key scheduler, where a RAM is inserted to cache the computed round keys for reuse in successive counter encryptions. The cache memory is implemented as a register set containing ten 128-bit registers, each dedicated to storing one round key. The control block is extended to detect any changes in the cipher key assigned by the radio resource control in the control plane. If a new cipher key is received, the round key computation mechanism is activated for the first counter block encryption. Simultaneously, the round keys are cached in their corresponding registers, as pointed to by the address generator. In successive counter encryptions, the round keys are read from the cache memory, whereas the key computation mechanism is deactivated by the control unit. In addition to the caching mechanism, we perform register retiming by moving the registers at the output of the Sboxes to their input. This helps in eliminating the power consumption of the Sboxes caused by logic hazards propagating from the feedback path through the input multiplexer.
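The caching policy can be expressed as a few lines of Python. This is a behavioural sketch under stated assumptions: the class and its `expansions` counter are hypothetical names, and `toy_expand` is a hash-based placeholder that merely returns ten 16-byte values in place of the real AES key schedule.

```python
import hashlib


class RoundKeyCache:
    """Recompute round keys only when the cipher key changes; otherwise reuse them."""

    def __init__(self, expand):
        self._expand = expand      # key schedule function: key -> list of round keys
        self._key = None           # cipher key the cache currently belongs to
        self._round_keys = None    # ten cached round keys
        self.expansions = 0        # how often the key schedule was actually run

    def round_keys(self, key: bytes):
        if key != self._key:       # new cipher key, e.g. after a handover
            self._round_keys = self._expand(key)
            self._key = key
            self.expansions += 1
        return self._round_keys    # read path: key schedule stays idle


def toy_expand(key: bytes):
    """Hypothetical stand-in for the AES key schedule (NOT the real schedule)."""
    return [hashlib.sha256(key + bytes([r])).digest()[:16] for r in range(1, 11)]
```

Requesting the round keys for a hundred counter blocks under the same cipher key runs the schedule once; only a key change triggers a second expansion.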

Round Shadowing.
According to the specification of the EEA2 algorithm [34], AES is not applied directly to user data blocks. Instead, it is applied to incrementing counter blocks, which are in turn mixed with the user data blocks to produce the ciphertext. Figure 9 [...] other counter block encryptions. Therefore, we identify three MixColumns and 24 Sbox redundant computations in the EEA2 algorithm during the first two rounds of each counter block encryption. For each PDCP packet with a typical length of thousands of bytes, equivalent to hundreds of block encryptions, thousands of redundant operations take place. We exploit the redundant calculations in the first two rounds to reduce the power consumed in the Round block. To this end, we insert a total of three 32-bit shadow registers distributed over the quarter rounds. Figure 10 shows the round shadowing architecture for the quarter rounds QR1 to QR3, where the shadow register is placed after the Sboxes. It is important to mention that we have reordered the ShiftRows and SubBytes operations to maximize the savings from redundant calculations. This modification allows us to save the SubBytes computations of the second round at low hardware complexity without affecting the ciphering functionality. In addition, the control unit is adapted to activate the shadow registers after the first two rounds of the first counter block encryption. Thereby, the shadow registers store the state of the three columns resulting from the first two rounds. Afterwards, the control unit switches to the shadowing mode. In this mode, only QR0 is enabled to compute the first column in the first round, whereas the other three columns are loaded from the shadow registers in the second round. The following rounds are processed with all quarter rounds enabled. When the least significant byte in the counter block reaches a value of 255, the control unit switches back to the normal mode for one block encryption in order to update the value of the shadow registers.
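The redundancy counts quoted above can be reproduced with a simple position-tracking model. The Python sketch below rests on several assumptions that are ours, not the paper's: only the least significant byte of the counter block changes between consecutive encryptions, the block is loaded into the state in the usual AES column-major order (byte 15 sits at row 3, column 3), ShiftRows precedes SubBytes as in the reordered data path, and round 1 is counted at quarter-round (column) granularity.

```python
def shift_rows_positions(dirty):
    """Row r rotates left by r, so a byte at (r, c) moves to (r, (c - r) % 4)."""
    return {(r, (c - r) % 4) for r, c in dirty}


def mix_columns_positions(dirty):
    """MixColumns spreads a changed byte over its whole column."""
    cols = {c for _, c in dirty}
    return {(r, c) for c in cols for r in range(4)}


def redundancy_first_two_rounds(changing_byte: int = 15):
    """Return (redundant MixColumns, redundant Sbox evaluations) in rounds 1 and 2."""
    # Bytes that differ from the previous counter block ("dirty" bytes).
    dirty = {(changing_byte % 4, changing_byte // 4)}

    # Round 1: after ShiftRows, untouched columns need neither Sboxes nor MixColumn.
    after_shift1 = shift_rows_positions(dirty)
    clean_cols_r1 = 4 - len({c for _, c in after_shift1})
    dirty = mix_columns_positions(after_shift1)

    # Round 2: after ShiftRows, untouched bytes need no Sbox evaluation.
    after_shift2 = shift_rows_positions(dirty)
    clean_bytes_r2 = 16 - len(after_shift2)

    redundant_mixcolumns = clean_cols_r1
    redundant_sboxes = 4 * clean_cols_r1 + clean_bytes_r2
    return redundant_mixcolumns, redundant_sboxes
```

Under these assumptions the changing byte lands in the first column after ShiftRows, which is consistent with QR0 being the only quarter round that stays active in the first round, and the model yields three redundant MixColumns and 24 redundant Sbox evaluations.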

Clock Gating.
To eliminate the dynamic power of idle components, clock gating [40] is applied to the entire 128-EEA2 hardware architecture. As the Round block stays in the shadowing mode for a large number of cycles, the shadow registers can be deactivated by inserting clock gating cells in order to avoid unnecessary internal power consumption. Similarly, the key cache in the key scheduler block can benefit from clock gating, as it operates in the read mode for a long time. Power can therefore be saved by deactivating the clock of the large 128-bit registers in the key scheduler. As a result, it is recommended to combine both the key caching and round shadowing architectures with clock gating to optimize the design's power as intended. Furthermore, clock gating is applied to the counter block to deactivate its 32-bit register chunks which are not affected by the incrementing process.

Dynamic Data Path Adaptation.
This optimization targets power-constrained embedded systems. It aims at reducing the peak power consumption of the EEA2 core by adapting the data path width at runtime, according to the requested ciphering speed. Figure 11 shows a modified architecture of the Round block, where four 32-bit multiplexers are inserted to route the state's columns to their respective quarter round. The Round architecture supports three modes of operation: Full-Round (FR), Half-Round (HR), and Quarter-Round (QR) modes. In FR, all quarter-round modules are enabled to deliver full ciphering speed. Only two quarter rounds (QR0 and QR1) are enabled in the HR mode, whereas only a single quarter-round module (QR0) is enabled in the QR mode. The control unit is equipped with a finite state machine dedicated to configuring the routing multiplexers.
The data path mode can be switched dynamically at round level with a small delay of two cycles. Thus, four 32-bit state registers (SR0 to SR3) are additionally required to buffer the block's state after each round, in order to maintain synchronization between rounds operating in different data path modes. In complex systems, decisions in the control unit are typically based on notifications from a power controller unit, which keeps track of the system's workload behaviour and the charge level of the device's energy buffer. The key scheduler is also slightly modified so that the number of simultaneously enabled Sboxes is equal to that of the quarter rounds for a given data path mode. The proposed architecture also facilitates the application of power gating to quarter-round modules in order to reduce the leakage power, which is a serious concern in deep submicron (65 nm and below) technology nodes [9].

LSB Coprocessor
The Window-based LSB (WLSB) encoding function in ROHC performs four LSB computations to encode each of the specified dynamic fields (IP_ID_DELTA and TS_SCALED), as described in Section 3.2. Therefore, optimizing the LSB operation has a large impact on the function's performance and memory consumption. The window buffer in WLSB can be realized as a memory array in software, which is more practical as it provides the flexibility needed to reconfigure window sizes according to channel conditions. The software routine would also keep track of the maximum and minimum reference values within a window by executing a few instructions with relational operators. The LSB computation is implemented by a dedicated hardware unit, which can be tightly attached as a coprocessor to the communication processor, that is, the CPU where the LTE protocol stack runs.
At least two values ($v$ and $v_{\mathrm{ref}}$) are required by the LSB coprocessor to compute the number of bits ($k$) to be transmitted, representing the compressed value of $v$. The value of $k$ as a function of $v$ and $v_{\mathrm{ref}}$ can be derived from the interpretation interval. A subtractor is needed to ensure the calculation of the absolute value of the difference ($v - v_{\mathrm{ref}}$) in case $v$ has a decrementing behaviour and hence is smaller than its reference value. Afterwards, the value $k$ is computed by searching for the index ($i$) of the leftmost "1" bit in the exponent value, starting from the most significant bit (hence the loop behaviour). If any of the fractional (lower index) bits is high, then $k$ evaluates to $i$ incremented by one through a second adder logic; otherwise, it is set to $i$. This way of computation is inspired by the mathematical equations, and hence the loop-based architecture can be considered a direct or conventional LSB implementation [29]. The data path width of the LSB coprocessor can be adapted according to the size of the input word, which might vary from 8 to 64 bits.
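The loop-based search can be sketched in Python as follows. The scan itself follows the description above (leftmost "1" at index i, plus one if any lower bit is set), which amounts to a ceiling of the base-2 logarithm. What exactly the hardware scans is not recoverable from the text, so the wrapper `lsb_k_loop` makes an explicit assumption: with a shifting parameter of zero, the scanned value is the absolute difference plus one, which follows from requiring the difference to be at most 2^k - 1. All names are chosen for this example.

```python
def ceil_log2_loop(x: int, width: int = 8) -> int:
    """Scan from the MSB for the leftmost 1 (index i).
    Return i + 1 if any lower-index bit is set, otherwise i."""
    for i in range(width - 1, -1, -1):
        if (x >> i) & 1:
            lower_bits = x & ((1 << i) - 1)
            return i + 1 if lower_bits else i
    return 0  # x == 0: nothing to scan


def lsb_k_loop(v: int, v_ref: int, width: int = 8) -> int:
    """Number of LSBs to send, assuming p = 0 and a scanned value of |v - v_ref| + 1.
    One extra bit of scan width is needed because the +1 can overflow the word."""
    return ceil_log2_loop(abs(v - v_ref) + 1, width + 1)
```

For example, a value of 105 against a reference of 100 gives a scanned value of 6, whose leftmost "1" is at index 2 with lower bits set, so three bits are sent.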

Parallel Architecture.
We propose a novel parallel architecture of the LSB coprocessor targeting a fast and low-power implementation of the LSB encoding algorithm. Figure 13 presents the structure of an 8-bit parallel LSB implementation. It contains three processing stages: unary encoding, one-hot encoding, and 8-to-3 bit priority encoding. The example shown in Figure 13 illustrates the operation of each processing stage. First, the difference ($v - v_{\mathrm{ref}}$) is computed before entering the unary encoder. The latter is composed of a chain of eight OR gates. The unary encoder output is fed into the next stage to produce a one-hot code highlighting the uppermost active bit in the difference. Eight AND and eight NOT gates are used to realize the one-hot encoder. Finally, $k$ is produced by the priority encoder, which outputs the order of the active bit as a 3-bit binary word. The parallel implementation can be extended with other 8-bit chunks to form wider parallel architectures suitable for input words of various sizes. In this case, the two inputs with index 0 are connected to their respective signals in the upper processing module. For 8-bit implementations, however, those inputs are wired to zero.
By applying the encoding logic directly to the word difference, the parallel architecture saves the increment-by-one adder required in the loop-based implementation. In addition, the parallel architecture does not include any arithmetic component in its encoding logic, in contrast to the loop architecture, where an adder is crucial for correct functional behaviour. Moreover, the parallel architecture is expected to be fast, as it provides a purely combinational realization of LSB versus the sequential behaviour of the loop-based implementation. Apart from that, the parallel LSB implementation is intentionally designed to reduce the rate of logic switching to a large extent through the one-hot encoder block.
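A bit-level Python model of the three stages may help to follow the data flow. It is a sketch under assumptions: the function names are ours, the chaining inputs from an upper module are taken as zero (the 8-bit case), and the priority encoder is modelled as returning the zero-based index of the active bit. How that index maps to the number of transmitted bits is not spelled out in the text, so the note after the code states our own reading.

```python
def unary_encode(d: int, width: int = 8) -> int:
    """OR chain from the MSB down: output bit i is 1 if any bit of d at
    position >= i is 1 (upper chaining input assumed wired to zero)."""
    u, carry = 0, 0
    for i in range(width - 1, -1, -1):
        carry |= (d >> i) & 1
        u |= carry << i
    return u


def one_hot_encode(u: int, width: int = 8) -> int:
    """Keep only the uppermost active bit: h_i = u_i AND NOT u_(i+1)."""
    return u & ~(u >> 1) & ((1 << width) - 1)


def priority_encode(h: int) -> int:
    """Zero-based index of the single active bit (0 for an all-zero input)."""
    return max(h.bit_length() - 1, 0)


def parallel_lsb(d: int, width: int = 8):
    """Return the one-hot code and the index of the uppermost active bit of d."""
    h = one_hot_encode(unary_encode(d, width), width)
    return h, priority_encode(h)
```

For a difference of `0b00101000`, the unary stage yields `0b00111111` and the one-hot stage `0b00100000`, so the encoder reports index 5. Under the assumption of a zero shifting parameter, a nonzero difference with uppermost bit at index i fits into i + 1 transmitted bits, and no adder is involved in finding i.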

Operand Isolation.
Typically, two subtractors with a multiplexer are employed to calculate the absolute value of the difference between $v$ and $v_{\mathrm{ref}}$, as shown in Figure 12. Since the result of only one subtractor is eventually selected for further processing, there is one redundant subtraction in every LSB encoding operation. The operand isolation technique can be applied, as shown in Figure 14, to avoid switching both subtractors at the same time. It deactivates the idle subtractor by inserting one demultiplexer (de-MUX) for each of the subtractor operands. The demultiplexers are controlled by the comparator signal. For instance, if the comparator selects the first subtractor, then both demultiplexers isolate the operands from the second subtractor by assigning its inputs to null.

Results
The new architectures proposed in the previous sections are described as VHDL models and synthesized with Synopsys' DesignVision Version B-2008.09 [41]. Synthesis is based on Faraday's standard cell library for 90 nm CMOS technology with a core power supply of 1 V. Post-synthesis simulations are performed in ModelSim, where the EEA2 and LSB architectures are verified against the 3GPP specifications. The EEA2 core is driven by randomly generated 1000-byte PDCP packets, whereas the LSB coprocessor is supplied with random values (an input word and its reference value) adapted to the target data path width. The switching activity in the netlist of each design is recorded and used for a subsequent accurate power analysis based on realistic power estimations. Area values are expressed in kilo Gate Equivalents (kGE), computed by dividing the total area by the size of a 2-input AND gate (5 µm²) in the adopted technology library. The results reported in the following paragraphs are obtained by comparing the performance of the optimized architectures against a straightforward or conventional reference architecture.
6.1. 128-EEA2 Crypto-Core. Synthesis results show that the crypto-core supports a maximum frequency of 365 MHz with a block ciphering latency of 11 cycles. Therefore, it is capable of providing a processing speed of 4.25 Gbit/s, a value that goes well beyond the currently defined LTE-Advanced requirements. The performance can be further improved by applying conventional techniques such as pipelining. However, we do not employ such techniques, as our main goal is not to achieve the fastest possible design but rather to provide a processing speed that is high enough to support LTE-Advanced and beyond while implementing techniques for power and area reduction. In the following, we use both terms, processing time-per-byte and processing speed, to refer to the performance of the hardware. The clock period required for a given processing time-per-byte is obtained by dividing that time-per-byte by the number of cycles-per-byte. In our discussions, we focus on three speed categories: a time-per-byte of 16 ns, corresponding to 500 Mbit/s, that is, the upper limit of the LTE requirements; a time-per-byte of 8 ns, corresponding to 1 Gbit/s, that is, within the LTE-Advanced definition; and finally a time-per-byte of 4 ns, corresponding to 2 Gbit/s, that is, in the beyond-LTE-Advanced domain.
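The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that one 128-bit block leaves the core every 11 cycles with no additional overhead; the function names are illustrative and do not come from the paper.

```python
BLOCK_BITS = 128  # AES block size in bits


def throughput_bps(clock_hz, cycles_per_block, block_bits=BLOCK_BITS):
    """Sustained throughput when one block completes every cycles_per_block cycles."""
    return clock_hz * block_bits / cycles_per_block


def rate_from_time_per_byte(time_per_byte_s):
    """Data rate in bit/s that corresponds to a given processing time per byte."""
    return 8 / time_per_byte_s


def clock_for_time_per_byte(time_per_byte_s, cycles_per_block, block_bits=BLOCK_BITS):
    """Clock frequency needed to meet a target time per byte."""
    cycles_per_byte = cycles_per_block / (block_bits / 8)
    return cycles_per_byte / time_per_byte_s


print(round(throughput_bps(365e6, 11) / 1e9, 2))           # Gbit/s at 365 MHz
print(round(rate_from_time_per_byte(16e-9) / 1e6))         # Mbit/s at 16 ns per byte
print(round(clock_for_time_per_byte(16e-9, 11) / 1e6, 1))  # MHz needed for 16 ns per byte
```

Under these assumptions the 365 MHz maximum frequency reproduces the stated 4.25 Gbit/s, and the 16 ns, 8 ns, and 4 ns categories map to 500 Mbit/s, 1 Gbit/s, and 2 Gbit/s.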
Figure 15 shows the power consumption of the EEA2 core based on several Sbox architectures. As the processing time per byte decreases, the required clock frequency rises sharply (it is inversely proportional to the time per byte), and the power consumption rises with it. A maximum power saving of 29% is achieved by DSE due to its one-hot encoding, which heavily reduces the switching rate in the Sbox. NBDSE comes second with a slightly lower power reduction (2% less than DSE). This is mainly due to the extra switching activity in the sub-switches of the design. Finally, CBI comes third with a power reduction of 21%. For very low processing times per byte (6 ns), the power savings of DSE increase up to 38%, since the critical path of the LUT implementation is longer than that of the other architectures, so that more resources and power are consumed to satisfy the target processing speed. This critical path effect can, however, be eliminated by inserting pipeline stages, which shorten the critical paths of the design.
Figure 16 shows a comparison of the area consumed by the EEA2 core based on different Sbox architectures. NBDSE outperforms the other architectures with a maximum area reduction of 26%. This large saving in area is caused by splitting the byte transformation into two nibble transformations, which reduces the complexity of the output encoders. CBI achieves an area reduction of 17%, which is mainly limited by the relatively large area consumed by the row multiplexers. Due to its large 256-to-8 bit output encoder, DSE delivers the lowest area reduction of 13%. To make a fair evaluation with respect to design efficiency, we consider the power-area product, which takes both the area and power dimensions into consideration. Figure 16 also shows the power-area product of the candidate architectures for several processing speeds. NBDSE is ranked first with an average power-area product reduction of 46%. DSE and CBI follow with product reductions of 38% and 36%, respectively. For 2 Gbit/s, the power-area product loses the scalability observed at lower speeds due to the critical path effect mentioned earlier. Nevertheless, NBDSE achieves further reductions reaching 52%. As a result, we identify the nibble-based decoder-switch-encoder design as the most efficient among the considered architectures.
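The way power and area savings combine into a power-area product reduction can be illustrated with the percentages quoted above. This is a rough cross-check, not a reproduction of the reported averages: it multiplies the maximum power and area reductions, and it assumes that the NBDSE power reduction is 27% (two percentage points below DSE). The function and dictionary names are illustrative.

```python
def product_reduction(power_reduction, area_reduction):
    """Reduction of the power-area product for given fractional power and area savings."""
    return 1.0 - (1.0 - power_reduction) * (1.0 - area_reduction)


# (power reduction, area reduction) relative to the LUT reference, taken from the text;
# the NBDSE power figure of 0.27 is an assumption derived from "2% less than DSE".
SBOX_REDUCTIONS = {
    "DSE": (0.29, 0.13),
    "NBDSE": (0.27, 0.26),
    "CBI": (0.21, 0.17),
}

for name, (power, area) in SBOX_REDUCTIONS.items():
    print(name, round(100 * product_reduction(power, area)), "%")
```

With these inputs the estimate lands on 38% for DSE and 46% for NBDSE, in line with the reported values, and on about 34% for CBI, slightly below the reported 36%, which is plausible given that the paper averages over several processing speeds.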
Figure 16: Area consumption and power-area product of the 128-EEA2 core for different Sbox approaches compared to the look-up table (LUT) implementation.
Next, we investigate the impact of the key caching and round shadowing optimizations on both power and area consumption. As depicted in Figure 17, the key caching (KC) technique achieves up to 25% of power savings. Here the cipher key is set to change once every 50 block encryptions. The power reductions obtained through the caching mechanism come, however, at the expense of a large area overhead of up to 30%, caused by the inserted cache memory. Round shadowing (RS), in contrast, reduces power by 12% by exploiting redundant round computations. Although three shadow registers are added to the design, the area of the EEA2 core based on RS is reduced by 12%. As clock gating is associated with round shadowing, this area reduction is explained by the use of clock gating cells, which are efficiently designed and therefore eliminate the cost of the enable multiplexers of the registers. By combining both RS and NBDSE optimizations, we are able to achieve a 35% reduction in power consumption and area savings of up to 25%. With this architectural combination the power of the Sboxes is heavily reduced. It has been illustrated in Section 4 that the switching activity of Sboxes in the key scheduler is much higher than that in the round block. Thus, by applying KC together with RS and NBDSE, further savings in power consumption will be negligible. By considering the power-area product depicted in Figure 18, we observe that KC provides only a slight (4%) product reduction at processing speeds below 1 Gbit/s, whereas it results in a slight (2%) product increase at a processing speed of 2 Gbit/s due to an increase in the occupied area. RS, in contrast, achieves a 24% power-area product reduction, which increases to 51% when RS is applied together with NBDSE. As a result, we identify both round shadowing and nibble-based one-hot encoding as the most efficient mechanisms for obtaining a low-power and compact 128-EEA2 hardware unit.
Figure 19 depicts the power consumption of the 128-EEA2 architecture with support for dynamic data path adjustment. Here the Sboxes are implemented with the NBDSE architecture. When operating in Half-Round mode, the peak power consumption is reduced by 53%. The peak power can be reduced even further, by 71%, when the Quarter-Round mode is activated. This architecture is quite useful, for example, for power-constrained energy-harvesting devices, such as modern mobile devices that might partly rely on solar energy to charge their batteries and hence can stay alive even at critical battery levels. The hardware would switch to the narrower data path modes if the target data rate is relatively low, depending on the requested service.
Alternatively, the data path can be adjusted to limit the peak power consumed by the design depending on the amount of charge available in the energy buffer of the device. In both scenarios, this optimization enables such devices to live longer and work properly at a lower quality of service rather than shutting down and losing communication completely. The proposed architecture is also capable of switching between modes rapidly with a fine, round-level granularity. Therefore, it has a major advantage over other techniques such as voltage and frequency scaling, which typically require design resetting and reinitialization mechanisms.
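As a rough illustration of the second scenario, the sketch below picks the widest data path mode whose peak power fits within an available budget. The selection policy, the function name, and the normalisation of the budget to the full-round peak power are hypothetical; only the relative peak power values (reductions of 53% and 71%) and the data path widths (128, 64, and 32 bits) are taken from the results above.

```python
# (mode name, data path width in bits, peak power relative to the full-round mode)
MODES = [
    ("full", 128, 1.00),
    ("half", 64, 0.47),     # 53% lower peak power
    ("quarter", 32, 0.29),  # 71% lower peak power
]


def pick_mode(peak_budget):
    """Return the widest mode whose relative peak power fits the budget, else None.

    peak_budget is a hypothetical quantity: the allowed peak power expressed as a
    fraction of the full-round peak power.
    """
    for name, _width, relative_peak in MODES:
        if relative_peak <= peak_budget:
            return name
    return None
```

A controller could re-evaluate such a choice at every round boundary, which is where the round-level switching granularity mentioned above becomes relevant.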

6.2. LSB Coprocessor. Both parallel and loop LSB architectures are investigated at various word sizes or architectural widths. For 16-bit architectures, the loop implementation supports a maximum frequency of 230 MHz with a corresponding throughput of 4.6 Gbit/s. The parallel implementation, in contrast, achieves a maximum clock speed of 536 MHz, which is equivalent to 8.4 Gbit/s. Therefore, our proposed architecture is almost twice as fast as the conventional implementation. By adding a pipelining register at the input of the encoding logic, the processing throughput of the parallel implementation can be raised up to 12.6 Gbit/s.
Figure 20 shows the power consumption of the LSB coprocessor unit with parallel and loop implementations configured for encoding 8-, 16-, 32-, and 64-bit word sizes. To analyze the impact of processing time on power consumption, we have considered several processing latencies, ranging from 20 ns down to 6 ns. The parallel implementation (Parallel-arch) is clearly more power efficient than the loop implementation (Loop-arch), as it consumes around 50% less power for 8- and 16-bit architectures. This is due to the low switching rate provided by the unary and one-hot encoding blocks. The power consumption of the loop architecture increases rapidly at short processing times because of its longer critical path compared to Parallel-arch. Moreover, Loop-arch does not scale to larger bit widths, as it is realized using a chain of cascaded multiplexers that grows with the size of the input word. Therefore, larger power savings of 62% and 74% are observed for the Parallel-arch at 32- and 64-bit architectures, respectively. The power consumption curves of Parallel-arch, in contrast, have a linear behavior and hence are scalable with respect to the word size and the processing latency.
Figure 21 depicts the area consumption of the LSB coprocessor unit with parallel and loop implementations. Results show that the proposed parallel architecture is more compact than the loop architecture and has a linear increase in area consumption with respect to word size. The area of the loop implementation, in contrast, increases exponentially with the word size, which has a strong impact on the length of its critical path. Thus, the area efficiency of Parallel-arch compared to Loop-arch increases with the word size. In particular, area savings start at 36% for 8-bit architectures and go up to 71% for 64-bit architectures. In between, average area reductions of 41% and 52% are obtained for 16-bit and 32-bit architectures, respectively.
As the difference between the word to be encoded and its reference value forms the input to the encoding logic, we study the impact of this difference (Δ) on the power consumption of both parallel and loop architectures. For this purpose, the test bench is adapted to limit the difference value to an upper bound of Δmax. Figure 22 depicts the power consumption of the LSB coprocessor with respect to Δmax. As the range of Δ increases, more logic gates are switched throughout the architecture, causing more power to be consumed. While the rise in power consumption is negligible for 8-bit and 16-bit architectures, it can be clearly observed for 32-bit architectures. This can be seen from the slopes of the linear power consumption curves: the slope of the power curves for the proposed Parallel-arch is smaller than that of the Loop-arch. Figure 22 shows a deviation of 9% in the power values of the 32-bit Loop-arch, which is roughly twice that of the Parallel-arch. This behaviour indicates that the parallel architecture has low power fluctuations over its operation time and hence can be easily integrated in high-level power estimation tools [42]. Moreover, Parallel-arch obtains even higher power reductions, compared to Loop-arch, for increasing values of Δ.
Figure 23: Impact of operand isolation on the power consumption of the LSB coprocessor.
Next, we investigate to what extent operand isolation can reduce the power consumed by the LSB coprocessor. Here, operand isolation is combined with the parallel architecture, as the latter has proved to be the most efficient in terms of power and area cost. Figures 23 and 24 depict the power and area consumption of the parallel implementation in both of its versions, that is, with and without operand isolation (Op-Isolation). For 8-bit architectures, no power reduction is achieved by operand isolation, because the additional power consumed in the de-multiplexers and the encoding logic (due to logic hazards) offsets the power savings obtained in the subtractors. However, wider architectures with operand isolation achieve power savings of 10%, equivalent to 20% with respect to Loop-arch, at processing latencies below 10 ns. By isolating the subtractor operands, we actually lengthen the critical path of the circuit through the de-multiplexer logic. This causes more glitches to appear at the input of the encoding logic, so that more area and power are consumed at very low processing latencies, that is, below 8 ns. This effect can, however, be eliminated using pipelined architectures, where the critical path can be reduced and the target latency can be fulfilled without additional synthesis effort. A clear drawback of operand isolation is shown in Figure 24, where an average area overhead of 25% is observed. As a result, operand isolation might only be recommended for applications with hard power constraints, that is, where the primary goal is to achieve the lowest possible power consumption.

Conclusion
Upcoming mobile devices have to perform fast packet processing in order to support high data rates of up to 1 Gbit/s in LTE-Advanced networks. In addition, implementations of packet processing functions must possess low energy consumption profiles in order to increase the battery lifetime of the handset. In this paper, we have presented and evaluated energy-efficient cores to accelerate the ciphering (128-EEA2) and header compression (ROHC) functions in the Layer 2 PDCP sublayer of a cellular protocol stack. In particular, we have investigated the 128-EEA2 confidentiality algorithm, which is based on the AES cipher. For this purpose, we have introduced and explored several architectures for optimizing the power of the AES substitution boxes and increasing their compactness. In addition, we have introduced power reductions at the algorithmic level through round shadowing, which eliminates redundant computations in the first two AES rounds within the encryption of a counter block. Similarly, key caching is applied to restrict the computation of round keys to situations where the cipher key is reassigned by the radio resource control.
We supported key caching and round shadowing with clock gating to eliminate the internal power of idle components, such as shadow and key cache registers. A special architecture of 128-EEA2 was also proposed and tailored for energy-harvesting devices with hard peak power constraints. A reconfigurable data path architecture was developed for this purpose to control the peak power consumption at run time, albeit at the expense of the quality of service. The second core, which supports header compression, implements an LSB coprocessor with a power-efficient parallel architecture based on a one-hot encoding. The hardware architectures have been implemented and synthesized for a 90 nm ASIC technology. Results show maximum power (35%) and area (25%) reductions for the 128-EEA2 core with an architecture combining round shadowing, nibble-based Decoder-Switch-Encoder substitution boxes, and clock gating. Furthermore, the proposed parallel implementation of the LSB coprocessor has shown power and area reductions starting from 50% and 36%, respectively.

F 3 :
Example illustrating the concept of LSB encoding.

F 9 : 32 F 10 :
Demonstrating redundant computations in the encryption rounds of two consecutive counter blocks.Quarter-Round architecture with a shadowing register.

F 12 :F 13 :
Loop-based architecture of the LSB coprocessor.e proposed parallel-based architecture of the LSB coprocessor.

F 14 :
Architecture of the LSB coprocessor with operand isolation.

2 F 15 :
Power consumption of the 128-EEA2 core with different Sbox approaches compared to the look-up table (LUT) implementation.

F 17 :
Power consumption of the 128-EEA2 core for different optimization approaches compared to the basic implementation.

F 18 :
product (500 Mbit/s) Power-area product (1 Gbit/s) Power-area product (2 Gbit/s) Area consumption and power-area product of the 128-EEA2 core for different optimization approaches compared to the basic implementation.

F 19 :F 20 :
(128-bit) Half-round (64-bit) Quarter-round (32-bit) Power consumption of the 128-EEA2 core with data path adjustment architecture for different round modes.Power consumption of the LSB coprocessor with parallel and loop implementations.

F 21 :
Area consumption of the LSB coprocessor with parallel and loop implementations.

F 22 :
Impact of the word difference Δ on the power consumption of the LSB coprocessor with parallel and loop implementations.

F 24 :
Impact of operand isolation on the area consumption of the LSB coprocessor.
Breakdown of the area and power consumed by the 128-EEA2 core.
F 6: Structure of the NBDSE Sbox architecture.F 7: Structure of the rows in the Sbox CBI architecture.