An 8-Bit ROM-Free AES Design for Low-Cost Applications

We have presented a memory-less design of the advanced encryption standard (AES) with 8-bit data path for applications of wireless communications. The design uses the minimal 160 clock cycles to process a 128-bit data block. For achieving the requirements of low area cost and high performance, new design methods are used to optimize the MixColumns (MC) and Inverse MixColumns (IMC) and ShiftRows (SR) and Inverse ShiftRows (ISR) transformations. Our methods can efficiently reduce the required clock cycles, critical path delays, and area costs of these transformations compared with previous designs. In chip realization, our design with both encryption and decryption abilities has a 29% area increase but achieves 4.85 times improvement in throughput/area compared with the best 8-bit AES design reported before. For encryption only, our AES occupies 3.5 k gates with the critical delay of 12.5 ns and achieves a throughput of 64 Mbps which is the best design compared with previous encryption-only designs.


Introduction
The AES algorithm has been widely used for data transmission in wireless communications [1-3] and RFID applications [4, 5]. An AES design implemented as an ASIC chip can achieve the requirements of low cost and high performance. A design with low area cost usually also results in low power consumption. The area of an AES design can be reduced by optimizing the architectures of its subfunctions [4, 6-11], sharing the same operations among subfunctions [6, 9, 10, 12], and reducing the data path of the overall architecture [1, 2, 4-7, 10, 12-15]. The inherently iterative nature of the AES algorithm can be exploited to reduce the data path of the overall architecture. The data path can be shrunk to 8 bits [1, 2, 4-7, 10, 12, 13] to reduce the area cost. The 8-bit ASIC design reported in [5] has the smallest area cost compared with other versions but also the lowest performance, since more clock cycles are needed for encryption and decryption. To reduce the area cost while keeping acceptable performance, the proposed AES uses an 8-bit data path and the minimum number of clock cycles to perform the encryption/decryption processes.
For portability across different platforms and CMOS technologies, our AES implements the overall circuit in pure combinational logic, without any memory blocks.
The newly proposed design methods for the major transformations reduce the area cost of the AES while keeping a high throughput that meets the requirements of wireless communications. The experimental results show that our AES design has a better performance/area ratio than previous designs. The remainder of this paper is organized as follows. Section 2 briefly describes the AES algorithm and its transformations. The new designs of the transformations and the overall AES architecture are proposed in Section 3. Section 4 describes experimental results and comparisons with previous designs. Finally, conclusions are given in Section 5.

AES Algorithm.
An AES implementation with an 8-bit data path takes at least 160 clock cycles to process a 128-bit data block. The encryption process performs the ShiftRows (SR), SubBytes (SB), MixColumns (MC), and AddRoundKey (ARK) transformations. A separate KeyExpansion (KE) unit is required to generate the Kth round key for each ARK. The decryption process has three inverse transformations, InvShiftRows (ISR), InvSubBytes (ISB), and InvMixColumns (IMC), and one ARK. The normal rounds perform these four transformations. The round keys used in the decipher process are those generated round by round in the cipher process, applied in reverse order.

AES Transformations. Four kinds of transformations and one key generation unit in the AES algorithm are described as follows.
(a) SB/ISB Transformations. The transformations are nonlinear substitution operations where each byte of the input state is replaced by its multiplicative inverse (MI) in GF($2^8$), followed by an affine transformation (AF) over the same field. Similarly, the ISB transformation performs the inverse affine transformation (IAF) followed by the MI operation in GF($2^8$).
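The SB/ISB description above can be sketched in a few lines of Python. This is a bit-level software model for illustration only, not the hardware datapath of this paper; the function names (`gf_mul`, `gf_inv`, `sub_byte`, `inv_sub_byte`) are hypothetical, and the reduction polynomial 0x11B and the affine constants 0x63 and 0x05 are the standard AES values.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse (MI) in GF(2^8); 0 maps to 0 by AES convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 == a^-1 because the group order is 255
        result = gf_mul(result, a)
    return result


def _rotl8(v: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((v << n) | (v >> (8 - n))) & 0xFF


def sub_byte(x: int) -> int:
    """SB: multiplicative inverse followed by the affine transformation (AF)."""
    b = gf_inv(x)
    return b ^ _rotl8(b, 1) ^ _rotl8(b, 2) ^ _rotl8(b, 3) ^ _rotl8(b, 4) ^ 0x63


def inv_sub_byte(y: int) -> int:
    """ISB: inverse affine transformation (IAF) followed by the same inverse."""
    b = _rotl8(y, 1) ^ _rotl8(y, 3) ^ _rotl8(y, 6) ^ 0x05
    return gf_inv(b)
```

Both directions call the same `gf_inv`, which mirrors why combined SB/ISB hardware usually shares the MI logic.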
(c) SR/ISR Transformations. The SR transformation rotates the last three rows of the state to the left by one, two, or three bytes depending on the row number. The ISR rotates them in the opposite direction.
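As a minimal sketch of the SR/ISR rotations, the following Python model assumes the usual AES convention that the 16 state bytes are numbered column by column, so the byte in row `r` and column `c` has index `4*c + r`. The function names are illustrative.

```python
def shift_rows(state):
    """SR: rotate row r of the 4x4 state left by r bytes (column-major input)."""
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def inv_shift_rows(state):
    """ISR: rotate row r of the 4x4 state right by r bytes."""
    return [state[4 * ((c - r) % 4) + r] for c in range(4) for r in range(4)]


# Using the state numbers themselves as data shows where each byte ends up.
print(shift_rows(list(range(16))))
```

Applying `inv_shift_rows` to the result of `shift_rows` returns the original byte order.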
(d) ARK Transformation. In each round, the ARK transformation performs an addition of the state with the round key using a bitwise XOR operation.
(e) KE Unit. In each round, the KE unit generates a new 128-bit round key for the XOR operation with the state in the ARK transformation.
To complete the rotation sequences, several multiplexers are added in Figure 1. The states are input to the SR unit in the order of their state numbers. Therefore, state $S_0$ is the first one input to the SR unit, in the first clock cycle, and state $S_{15}$ is the last, in the sixteenth clock cycle. The output sequence of the states in the first row is unchanged after the SR rotations. The input states $S_0$, $S_4$, and $S_8$ are stored in registers 8, 4, and 0, respectively, after several clock cycles. State $S_{12}$ is stored in register 0 after state $S_0$ has been output from register 8. In the second row, the first state $S_1$ is delayed to the last position, and the other states $S_5$, $S_9$, and $S_{13}$ bypass state $S_1$ using the multiplexer. These three states are output before state $S_1$.
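The byte-serial reordering described above can be checked with a small scheduling model. The sketch below is not the register and multiplexer circuit of Figure 1; it only assumes that one state byte enters per clock cycle in state-number order, that one byte leaves per cycle in the rotated order, and that a byte may be bypassed to the output in the cycle it arrives. Under those assumptions it computes the smallest lag between the input and output streams, which comes out as twelve cycles for both SR and ISR, consistent with the twelve 8-bit registers used by the combined unit.

```python
def sr_output_order():
    """State numbers in the order they leave a byte-serial SR unit."""
    return [4 * ((c + r) % 4) + r for c in range(4) for r in range(4)]


def isr_output_order():
    """State numbers in the order they leave a byte-serial ISR unit."""
    return [4 * ((c - r) % 4) + r for c in range(4) for r in range(4)]


def min_lag(order):
    """Smallest delay L such that output slot k (cycle k + L) never needs a
    byte before it has arrived (byte i arrives in cycle i)."""
    return max(src - slot for slot, src in enumerate(order))


print(min_lag(sr_output_order()), min_lag(isr_output_order()))
```

For SR the binding case is the state numbered 15: it arrives last but must be the fourth byte out, so the output stream cannot start less than twelve cycles after the input stream.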

Design of Our AES Architecture
The states $S_2$ and $S_6$ in the third row are delayed behind states $S_{10}$ and $S_{14}$ after the rotation of the third row. In the fourth row of the original state matrix, state $S_{15}$ is the last one but becomes the first output state of that row after the SR. Similarly, the ISR performs the rotations in the opposite direction, using the multiplexers to bypass some states so that the correct sequences are output. The design in [13] is the best method reported so far for the SR and ISR rotations, but our design saves four 8-bit registers and shortens the critical path delay of the unit.
As shown in Figure 2, our MC design uses eight 8-bit registers to store the states and two multiplication units, ({02} ×) and ({03} ×), to perform the multiplications in (2). These two multiplication units are realized by simple bit-level XOR operations. The MC design uses two levels of registers. The four registers in the first level receive data from the MC or the ARK units. The second-level registers prepare the calculation operations for the outputs. For example, the output state $S'_0$ is calculated as ({02} × $S_0$ + {03} × $S_1$ + {01} × $S_2$ + {01} × $S_3$). The states $S_1$, $S_2$, $S_3$, and $S_4$ are input to registers 4, 5, 6, and 7, respectively, after four clock cycles. In the next cycle, the four states are stored in registers 0-3, respectively, and the multiplications ({02} × $S_0$) and ({03} × $S_1$) are performed. The MC unit outputs the states $S'_0$-$S'_3$ in the subsequent four clock cycles. At the same time, the next four states are input to registers 4-7 and wait to be multiplied with the constant matrix. The MC unit needs sixteen clock cycles to complete the calculation of a 128-bit state. The designs in [1, 4-7, 14, 15] are the best methods reported before for the MC and IMC operations, but our design saves a further twenty-four 8-bit registers and shortens the critical path delay of the MC unit. Similar optimization results are also obtained in the design of the IMC unit.
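To make the ({02} ×) and ({03} ×) units concrete, here is a minimal software sketch of the column arithmetic. It models only the GF($2^8$) operations, not the two-level register pipeline of Figure 2. The helper names (`xtime`, `mul3`, `mix_column`, `inv_mix_column`) are illustrative, and the inverse coefficients {0E}, {0B}, {0D}, {09} are the standard AES values rather than something stated in this section.

```python
def xtime(b: int) -> int:
    """{02} x b in GF(2^8): one left shift plus a conditional XOR with 0x1B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def mul3(b: int) -> int:
    """{03} x b = ({02} x b) XOR b."""
    return xtime(b) ^ b


def mix_column(col):
    """MC on one column [s0, s1, s2, s3] of the state."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]


def gf_mul(a: int, b: int) -> int:
    """General GF(2^8) multiply built from repeated xtime."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


INV_ROW = [0x0E, 0x0B, 0x0D, 0x09]  # first row of the standard IMC matrix


def inv_mix_column(col):
    """IMC on one column; row r uses the first row rotated right by r."""
    return [
        gf_mul(INV_ROW[(0 - r) % 4], col[0])
        ^ gf_mul(INV_ROW[(1 - r) % 4], col[1])
        ^ gf_mul(INV_ROW[(2 - r) % 4], col[2])
        ^ gf_mul(INV_ROW[(3 - r) % 4], col[3])
        for r in range(4)
    ]
```

Because `xtime` is a shift and a conditional XOR with a constant, both multipliers reduce to XOR networks in hardware, which is why they are cheap compared with the larger IMC coefficients.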

The Design of Overall Architecture. We realized iterative AES architecture designs using the TSMC 0.18 μm cell library. Figure 3 shows the 8-bit AES processor architecture. A plaintext block and the encryption key are loaded into the AES through the 8-bit input ports data in and key in. The enc signal is used to select the encryption or decryption process.
The SB can be realized by the calculation of the Multiplicative Inverse (MI) in GF($(2^4)^2$) and Affine Transformation (AF) units. The ISB can be realized by the same MI calculation as the SB and inverse affine transformation (IAF) units. To reduce the area cost of the combined implementation of the SB/ISB units, the MI logic is usually shared. The key expansion unit is used to generate and output the required 8-bit round key to the ARK. Since the round keys are used in reverse order in decryption, the inverse cipher process can start only after the last round key has been generated. Afterward, the key expansion with the same round keys can be executed concurrently with the decryption process.
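The round-key ordering constraint can be illustrated with a small software model of the key schedule. This sketch assumes AES-128 (a 128-bit key and ten rounds) and produces whole 128-bit round keys at once, whereas the unit described above delivers them to the ARK one byte at a time. All names are hypothetical, and the S-box is computed directly in GF($2^8$) rather than through the composite-field decomposition.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo 0x11B."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def _sbox_entry(x: int) -> int:
    """S-box value: multiplicative inverse followed by the affine transform."""
    inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
    rotl = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    return inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def expand_key(key: bytes):
    """Return the eleven AES-128 round keys in encryption order."""
    words = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = list(words[i - 1])
        if i % 4 == 0:
            temp = temp[1:] + temp[:1]        # RotWord
            temp = [SBOX[b] for b in temp]    # SubWord
            temp[0] ^= rcon                   # round constant
            rcon = gf_mul(rcon, 0x02)
        words.append([a ^ b for a, b in zip(words[i - 4], temp)])
    return [bytes(sum(words[4 * r:4 * r + 4], [])) for r in range(11)]


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")  # FIPS-197 example key
enc_keys = expand_key(key)
dec_keys = enc_keys[::-1]  # decryption consumes the round keys in reverse
print(dec_keys[0].hex())
```

The first key needed for decryption, `dec_keys[0]`, is the last one the forward schedule produces, which is why the inverse cipher has to wait for the full expansion before it can start.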

Experimental Results
In Table 1, various 8-bit AES designs in different technologies are listed for comparison. The designs in [1, 4, 7, 14] only have the encryption ability. The design in [4] is the encryption-only version of the previous design [5], for application in radio frequency identification (RFID). The design in [5] adopts the clock gating method to reduce the power consumption. One pipeline stage is used to reduce the critical path delay in the SB/ISB design. The SR/ISR units are implemented by random access memory (RAM). In [7], the encryption-only design merges the ARK and SR operations by using four pipeline stages to generate the correct output order of the computed state. The design in [1] provides a low-power AES design for the RFID application. It uses a gated clock design to reduce unwanted switching activity, the same approach as in [5]. Good and Benaissa [14] proposed a low power/area AES chip design that provides a series of finite-field doubling, tripling, and XOR operations to perform the MC transformation. It also adopts separate data and key memories, the same approach as in [1], for processing the state and round key in parallel.
In Table 1, we provide two kinds of implementation information for our AES design at the chip level: AES with encryption ability only and AES with both encryption and decryption abilities. The chip design is used for comparison with other chip results whose circuits have been fabricated.
We observe that most realizations are encryption only, because such designs can be verified quickly. Most realizations of 8-bit data path AES require more clock cycles to compute a 128-bit data block, resulting in a smaller throughput rate. Therefore, most realizations are suitable for applications with low frequency and throughput rate requirements, such as RFID. On the other hand, our design, with its higher throughput, can be used in applications such as 802.11 series wireless networks. Our AES design with only encryption ability occupies 3.5 k gates with a critical delay of 12.5 ns. The major improvement of our AES in this version is that our architecture designs minimize the required number of clock cycles and the critical path delay for processing the MC and SR operations.
The area cost and critical path delay of our AES are similar to those of the best design in [5]. But our design can achieve a throughput of 64 Mbps, which is the best among previous encryption-only designs. The area cost of our AES design with both encryption and decryption abilities increases by about 29%, but the throughput improves 4.85 times compared with the best design in [5]. From the experimental results, we also observe that our AES design has the best normalized performance, in throughput per gate, compared with other previous designs.
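The reported encryption-only figures can be cross-checked with simple arithmetic. The sketch below assumes AES-128 with ten rounds, one state byte processed per clock cycle, and a clock period equal to the 12.5 ns critical delay. The function names are illustrative, and the per-kilogate figure is a plain ratio that may not match the exact normalization used in Table 1.

```python
BLOCK_BITS = 128   # AES block size
STATE_BYTES = 16   # 128 bits processed one byte per clock cycle
ROUNDS = 10        # assumption: AES-128


def min_cycles() -> int:
    """Clock cycles per block for an 8-bit data path with no idle cycles."""
    return ROUNDS * STATE_BYTES


def throughput_mbps(cycles: int, clock_period_ns: float) -> float:
    """Block throughput in Mbit/s for a given cycle count and clock period."""
    return BLOCK_BITS * 1e3 / (cycles * clock_period_ns)


def throughput_per_kgate(mbps: float, kgates: float) -> float:
    """Throughput normalized by area, in Mbit/s per thousand gates."""
    return mbps / kgates


cycles = min_cycles()
rate = throughput_mbps(cycles, 12.5)
print(cycles, rate, round(throughput_per_kgate(rate, 3.5), 2))
```

With 160 cycles and a 12.5 ns period, one block takes 2 μs, which reproduces the 64 Mbps encryption-only throughput quoted above.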

Conclusions
In this paper, we have presented new design methods for the AES transformations and their architecture. The major transformations, SR/ISR and MC/IMC, dominate the required clock cycles and path delays for processing the data encryption and decryption. We presented two design methods that can efficiently optimize these transformations, and the proposed architecture design can improve the throughput while keeping a low area cost compared with other previous designs. The design is suitable for area-limited applications that require high throughput, such as wireless communications. The implementation results demonstrate that the proposed design has the highest throughput with low area cost.

3.1. Designs of Major Transformations. The optimization of separate transformations focuses on two major transformations, SR and MC, and their inverses, ISR and IMC. The designs of these transformations are described as follows.
(a) The Design of SR/ISR Unit. In this paper, we propose a combined SR/ISR design as shown in Figure 1. It uses twelve 8-bit registers for receiving and storing data from the MC or ARK units. The output sequences are generated after performing the SR rotations. Equation (1) shows the original 4 by 4 state matrix and the output state matrix after the SR rotations. The states in the first row are unchanged by the SR. The states in the second, third, and fourth rows are rotated to the left by one, two, and three positions, respectively:
$$
\underbrace{\begin{bmatrix} S_0 & S_4 & S_8 & S_{12} \\ S_1 & S_5 & S_9 & S_{13} \\ S_2 & S_6 & S_{10} & S_{14} \\ S_3 & S_7 & S_{11} & S_{15} \end{bmatrix}}_{\text{Original States}} \xrightarrow{\text{SR}} \begin{bmatrix} S_0 & S_4 & S_8 & S_{12} \\ S_5 & S_9 & S_{13} & S_1 \\ S_{10} & S_{14} & S_2 & S_6 \\ S_{15} & S_3 & S_7 & S_{11} \end{bmatrix} \tag{1}
$$
(b) The Design of MC/IMC Unit. In our AES design, the MC and IMC units are separated to reduce the complexity of the data paths. Equation (2) shows four input states $S_0$, $S_1$, $S_2$, and $S_3$ that are multiplied by the constant values {02}, {03}, {01}, and {01}, respectively, in the Galois field GF($2^8$) to generate the output states $S'_0$ to $S'_3$. The equation also shows that the constant values in the second, third, and fourth rows are those of the first row rotated to the right by one, two, and three positions:
$$
\begin{bmatrix} S'_0 \\ S'_1 \\ S'_2 \\ S'_3 \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \end{bmatrix} \tag{2}
$$

Table 1: Performance comparison of different 8-bit AES designs.