Cellular Automata-Based Parallel Random Number Generators Using FPGAs

Cellular computing represents a new paradigm for implementing high-speed massively parallel machines. Cellular automata (CA), which consist of an array of locally connected processing elements, are a basic form of a cellular-based architecture. The use of field programmable gate arrays (FPGAs) for implementing CA accelerators has shown promising results. This paper investigates the design of CA-based pseudo-random number generators (PRNGs) using an FPGA platform. To improve the quality of the random numbers that are generated, the basic CA structure is enhanced in two ways. First, the addition of a superrule to each CA cell is considered. The resulting self-programmable CA (SPCA) uses the superrule to determine when to make a dynamic rule change in each CA cell. The superrule takes its inputs from neighboring cells and can itself be considered a second CA working in parallel with the main CA. When implemented on an FPGA, the use of lookup tables in each logic cell removes any restrictions on how the superrules should be defined. Second, a hybrid configuration is formed by combining a CA with a linear feedback shift register (LFSR). This is advantageous for FPGA designs due to the compactness of the LFSR implementations. A standard software package for statistically evaluating the quality of random number sequences, known as Diehard, is used to validate the results. Both the SPCA and the hybrid CA/LFSR were found to pass all the Diehard tests.


Introduction
Cellular computing is touted as one of the new paradigms for future computational systems due to three key properties: simplicity, massive parallelism, and local interconnect [1]. Such systems have advantages over the traditional general-purpose processor in terms of high-speed parallel computation and fault-tolerant capabilities. A variety of unique applications have been proposed and implemented using the cellular computational model, including fault-tolerant self-healing architectures [2], cellular neural networks [3,4], and pseudo-random number generators [5]. Cellular automata (CA), which consist of an array of locally interconnected, elementary processing elements, can be viewed as one form of a high-speed massively parallel machine. As CA are dynamical systems which evolve in discrete time and space, they are used to model a variety of physical and biological processes, including fluid dynamics, the immune system, the evolution of genetic regulatory networks, and urban traffic flow [6][7][8][9][10]. A CA-based architecture will likely form the basis for the development of ultra-high speed and compact quantum-based computers [11,12]. However, the programming of a CA-based machine to compute complex problems is a challenging and unresolved issue [13][14][15]. Also noteworthy is the development of a cellular automata hardware emulator known as the Wolfram machine [16].
This paper investigates a specific application for CA-based computation: the implementation of a high-quality pseudo-random number generator (PRNG) [5]. A good PRNG will produce a sequence of repeatable, but high-quality, random numbers based on an initial seed. This is in contrast to a true random number generator (TRNG), which produces an unrepeatable sequence of random numbers [17]. Both types of random number generators are needed, for example, for computer security applications, where a TRNG is used to seed a PRNG with a completely random number. In addition, PRNGs are required in a variety of areas including Monte Carlo simulations, on-chip self-test circuitry, and optimization methods such as simulated annealing and genetic algorithms.
International Journal of Reconfigurable Computing
Reconfigurable platforms such as field programmable gate arrays (FPGAs) have been investigated for implementing cellular computing machines [18][19][20]. As an FPGA consists of an array of reconfigurable logic cells, it provides an attractive platform for implementing CA-based computational structures. In this paper, the potential for using FPGAs to implement high-quality random number generators using cellular automata is explored.
In many hardware implementations, it is desirable to optimize performance of the PRNGs in terms of speed, area, and power dissipation, while producing high-quality random numbers. For example, due to advances in very large scale integration (VLSI) processing technology, the complexity of integrated circuit designs now makes it feasible, and even necessary, to place self-test circuits on the chip itself. The hardware overhead introduced by a built-in self-test (BIST) module should be a small portion of the overall circuit. A common method for implementing a PRNG for self-testing circuits is a linear feedback shift register (LFSR), since it can be compactly constructed from a series of cascaded flip-flops and a few XOR gates. However, for certain tests involving stuck-open faults that can convert a combinational circuit to a sequential one, the correlation between adjacent numbers in a test vector sequence should be minimized. Due to the shifting properties of the LFSR, it is known to have problems with detecting sequential faults [21]. Wolfram suggested in 1986 that cellular automata (CA) could be used for efficiently generating random numbers due to the use of nearest neighbor interconnectivity and regularity in their physical layout [5]. Subsequent research has demonstrated that heterogeneous CA, which are composed of two linear functions, are more suited for test vector generation for BIST than LFSRs and the homogeneous CAs originally proposed by Wolfram [22].
Advances in VLSI technology have also made it possible to implement complex digital systems on FPGAs. Because FPGA designs can be optimized for a given application, they often have superior performance in terms of speed and power dissipation over generic integrated circuit designs and microprocessor-based implementations. As an example, for an application involving signal processing for radio astronomy, an FPGA-based system built using a 130 nm process technology was compared with a DSP fabricated with a comparable technology and a microprocessor fabricated from a 90 nm technology. The FPGA system had 10 times the throughput compared with the DSP design and 4 times the throughput of the microprocessor-based system [23]. Similar improvements are reported by others [24,25]. For this reason, FPGAs are increasingly being used in areas formerly dominated by application specific integrated circuits (ASICs), such as embedded system design and digital signal processing applications. Hence, there is a need to explore the implementation of PRNGs on FPGAs.
An FPGA typically consists of an array of logic cells that can be arbitrarily connected through programmable interconnect. Each logic cell usually consists of a programmable lookup table (LUT), a flip-flop, and several multiplexers. The structure of the FPGA has an impact on the optimal design of PRNGs that will differ from a VLSI implementation. For example, a Xilinx FPGA has the ability to convert selected LUTs into shift registers. As such, a 52-bit LFSR can be efficiently implemented using only four logic cells [26]. By contrast, it takes at least one logic cell to implement each CA cell. For this reason, it is interesting to explore the construction of a PRNG that combines a CA with an LFSR. Another configuration of interest is a CA that updates its own internal rules based upon the states of cells in its neighborhood, as proposed in [27]. While an efficient VLSI design restricts the possible rules that can be implemented in order to minimize the number of logic gates used, the use of LUTs removes this constraint in an FPGA design. The object of this paper is to investigate the quality of the random numbers that can be produced using the aforementioned designs while considering the amount of resources required when implemented on an FPGA. The widely used statistical tests implemented in the Diehard program are used to evaluate the quality of the random numbers [28]. The Xilinx Spartan-3E FPGA is the reconfigurable platform used in this study.
This paper is organized as follows. Section 2 provides some background on the design of LFSRs and CAs and describes the relevant previous work in this area. Section 3, on research method, describes the PRNG configurations that are evaluated and the simulation strategy used. Section 4 contains the results and a discussion of their implementation. Conclusions and future work are given in Section 5.

Figure 3: Graphical representations of CA rules "90" and "150" [6].

Background and Previous Work
This section overviews the design of linear feedback shift registers (LFSRs) and cellular automata (CA), followed by a review of related works that have utilized LFSRs and CA for generating random numbers.
An LFSR consists of a shift register where selected outputs, known as taps, are fed into an XOR gate. In an LFSR using external feedback, the output of the XOR gate is then fed into the input of the shift register, as illustrated in Figure 1. Internal feedback LFSRs instead place XOR gates between selected flip-flops that form the shift register. The location of the taps determines the pattern generated by the LFSR. An LFSR is said to have maximal length if it generates a pattern of length $2^n - 1$ before repeating, where $n$ is the length of the shift register. While the LFSR can be compactly implemented in both VLSI and FPGA designs, the shifting operation produces a high degree of correlation between adjacent bits.
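As a minimal sketch of the external-feedback structure described above, the following Python model steps a 3-bit LFSR and measures its period. The tap positions (3, 2), the seed, and the function names are illustrative assumptions; they are not taken from Figure 1 or from the tap table used in this paper.

```python
def lfsr_states(taps, nbits, seed=1):
    """Yield successive states of an external-feedback LFSR.

    taps  -- 1-based bit positions XORed together to form the feedback bit
             (assumed to include position nbits so the update is invertible)
    nbits -- length of the shift register
    seed  -- nonzero initial state (hypothetical default)
    """
    mask = (1 << nbits) - 1
    state = seed & mask
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        # Shift left by one and insert the feedback bit at position 1.
        state = ((state << 1) | feedback) & mask


def period(taps, nbits, seed=1):
    """Number of steps before the LFSR returns to its seed state."""
    gen = lfsr_states(taps, nbits, seed)
    first = next(gen)
    for count, state in enumerate(gen, start=1):
        if state == first:
            return count


print(period((3, 2), 3))
```

With these assumed taps the register visits every nonzero 3-bit state once, which is the maximal-length property $2^n - 1$ for $n = 3$.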
A cellular automaton can be viewed as a state machine consisting of an array of cells which hold their current states. The CA evolves in discrete timesteps, where the next state of each cell is determined by the states of the cells in its "neighborhood" according to some specified rule. A common configuration is a one-dimensional CA with binary state values and a neighborhood consisting of the cell's own state and those of its immediate neighbors, as depicted in Figure 2.
Such a one-dimensional CA can be represented as an array of $n$ cells, $\{s_1(t), s_2(t), \ldots, s_n(t)\}$, where $s_i(t)$ represents the binary state of the $i$th cell at time $t$. As an example of an update rule for the one-dimensional CA, consider

$$s_i(t+1) = s_{i-1}(t) \oplus s_{i+1}(t). \qquad (1)$$

In this case, the next state of the $i$th cell is determined by taking the exclusive-OR of its two immediate neighbors. A pictorial representation of how this rule is encoded is illustrated in Figure 3, where the top row represents the eight possible configurations for a three-cell neighborhood and the bottom row represents the next state for the cell of interest. Since the eight bits in the bottom row represent a value of 90 in decimal, this update rule is dubbed rule "90" [6].
The current state of the $i$th cell can also be included in the update rule that uses the exclusive-OR operations, as represented by (2):

$$s_i(t+1) = s_{i-1}(t) \oplus s_i(t) \oplus s_{i+1}(t). \qquad (2)$$

In this case, the binary representation of the rule results in a value of 150 in decimal, and hence, (2) represents the CA rule "150." While higher dimensions and state values can be used, this paper will focus on one-dimensional CA with each state represented by a single bit.
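The rule numbering can be checked with a short Python sketch that builds the eight-entry truth table for an update function and reads it as a binary number. The function names and the periodic boundary used in `step` are assumptions made for illustration only.

```python
def rule_number(next_state):
    """Rule number of an update function over a 3-cell neighborhood.

    The neighborhood (left, centre, right) is read as a 3-bit index, and the
    next state for that index becomes the corresponding bit of the rule number.
    """
    number = 0
    for pattern in range(8):
        left, centre, right = (pattern >> 2) & 1, (pattern >> 1) & 1, pattern & 1
        number |= next_state(left, centre, right) << pattern
    return number


def xor_neighbors(left, centre, right):
    """Exclusive-OR of the two immediate neighbors."""
    return left ^ right


def xor_all(left, centre, right):
    """Exclusive-OR of the two neighbors and the cell's own state."""
    return left ^ centre ^ right


def step(cells, rule):
    """One synchronous update of a 1-D binary CA (periodic boundary assumed)."""
    n = len(cells)
    return [
        (rule >> (cells[i - 1] << 2 | cells[i] << 1 | cells[(i + 1) % n])) & 1
        for i in range(n)
    ]


print(rule_number(xor_neighbors), rule_number(xor_all))
```

Feeding the resulting rule numbers back into `step` reproduces the same XOR updates, which is a convenient cross-check between the equation form and the table form of a rule.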
As noted by Hortensius et al., if different rules are used in each cell (heterogeneous CA), higher quality random numbers can be generated from a CA than if a uniform (homogeneous) rule is applied to all cells [22]. Combinations of rules 90 and 150 were found to produce good random numbers with maximal length suitable for BIST. These rules are popular since they can be generated using XOR gates, where analysis in GF(2) can be used to determine maximal length CA. The use of time and site spacing was shown to further decorrelate adjacent bits. Further work on hybrid 90/150 CA for generating self-test circuitry was carried out by Nandi et al. [29].
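A hybrid 90/150 CA can be sketched by giving each cell a flag that says whether its own state takes part in the XOR. The sketch below assumes null boundary conditions (missing neighbors read as 0), and the 3-cell rule vector is a toy example chosen for illustration, not a configuration used in this paper.

```python
def hybrid_step(cells, use_150):
    """One update of a hybrid rule-90/150 CA with null boundaries (assumed).

    use_150[i] is 1 if cell i applies rule 150 and 0 if it applies rule 90.
    """
    n = len(cells)
    nxt = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        right = cells[i + 1] if i < n - 1 else 0
        # Rule 150 adds the cell's own state to the rule-90 XOR.
        nxt.append(left ^ (cells[i] & use_150[i]) ^ right)
    return nxt


def cycle_length(seed, use_150):
    """Steps until the CA returns to its seed state.

    Intended for small, invertible toy configurations such as the one below.
    """
    state = hybrid_step(seed, use_150)
    count = 1
    while state != seed:
        state = hybrid_step(state, use_150)
        count += 1
    return count


print(cycle_length([1, 0, 0], [1, 1, 0]))
```

For this toy rule vector (150, 150, 90) the three cells cycle through all seven nonzero states, a small instance of the maximal-length behaviour discussed above.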
Guan and Tan proposed a one-dimensional CA where the rule in each cell changes dynamically based upon the states of the cells within a new neighborhood of three cells [27]. Dubbed "self-programmable cellular automata" (SPCA), the rules are switched between 90 and 165 or 150 and 105. These rules were selected because they can be easily implemented with XOR gates. The SPCA is diagrammed in Figure 4. A "super-rule" is used to determine how the rules within the CA are switched, based on neighboring cells which can be up to a distance of three cells away. The use of the super-rule can help avoid the patterns that occur when cells have static rules. These patterns are indicative of low-quality random number sequences because they give rise to recurring structures (e.g., the "triangle pattern" seen in Figure 5(a)).
Each cell in the SPCA can be thought of as having a hierarchical structure (i.e., a "super rule" and a "rule state"), which together control the lower parts: two rules and a state. This type of CA may also be thought of as two interlinked parallel CA (Figure 4), one of which (the "lower" CA) depends on the other (the "upper" CA) to decide which of its two internal rules is to be used in the next timestep. But the upper CA also depends on the states of the lower CA to determine its own states. The neighborhoods of the upper and lower CA need not be the same.
During each timestep, two things happen simultaneously in each cell: the "upper" cell checks its neighborhood and applies its rule to determine and update its next state, while the "lower" cell checks its neighborhood and the state of the upper cell and then applies the appropriate rule to its inputs and updates its new state. Because these events happen simultaneously, the lower cell always uses the rule indicated by the upper cell during the previous timestep, not the currently updating state of the upper cell. The use of the upper CA to switch the rules of the lower CA provides an additional randomizing effect. An inspection of the space-time diagrams for a hybrid 90/150 CA and the SPCA, shown in Figure 5, reveals the improved quality of the random numbers generated from the latter.
In [27], SPCA of lengths from 16 to 24 bits were found that could pass all the statistical tests using the Diehard program. This paper investigates the implementation of the SPCA on FPGAs, where it should be noted that the use of LUTs allows more flexibility in the choice of rule selection.
Tkacik proposed a random number generator which combines the outputs of a CA with an LFSR [30]. A 37-bit CA was combined with a 43-bit LFSR. This maximal length configuration combined 32 bits from the CA and LFSR to produce a maximal length RNG. It was found that the LFSR and CA must be clocked at different frequencies to create a sequence of numbers that can pass all the Diehard tests. Figure 6 depicts the version implemented on FPGAs in our work. The last bit of the LFSR and CA can be combined in an XOR gate to produce a single random bit in order to generate maximal site spacing. In the design depicted in Figure 6, three additional internal nodes from the LFSR and CA are combined with XOR gates to generate four bits each clock cycle. The advantage of this site spacing [22], as we shall see, is that a sequence of random numbers can be produced that passes all of the Diehard tests without clocking the LFSR and CA at different rates. This technique makes use of site spacing in order to avoid any correlation between neighboring bits. It does, however, lead to decreased throughput (only up to 4 bits per timestep), which is undesirable because it may take several cycles to generate a multibit random number. In [30], Tkacik states that the maximal length of a combined CA and LFSR is $2^{m+n} - 2^m - 2^n + 1$ for an $m$-sized CA and an $n$-sized shift register. This is true if the cycle lengths of the individual CA and LFSR are relatively prime. A 37-bit shift register and a 16-bit cellular automaton represent one example of a maximal length CA/LFSR configuration that was studied in this work (Figure 6).
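The cycle-length expression is the product $(2^m - 1)(2^n - 1)$ written out, and the combined period equals that product only when the two individual periods share no common factor. The following Python sketch checks this numerically, assuming both the CA and the LFSR are individually maximal length; the function names are hypothetical.

```python
from math import gcd


def combined_cycle_length(m, n):
    """Period of two generators run in lockstep.

    Assumes an m-bit CA and an n-bit LFSR that are each maximal length,
    so their periods are 2**m - 1 and 2**n - 1. The combined period is
    the least common multiple of the two.
    """
    ca_period = 2**m - 1
    lfsr_period = 2**n - 1
    return ca_period * lfsr_period // gcd(ca_period, lfsr_period)


def product_formula(m, n):
    """The closed form quoted in the text: 2^(m+n) - 2^m - 2^n + 1."""
    return 2 ** (m + n) - 2**m - 2**n + 1


# 16-bit CA with a 37-bit LFSR: the periods are relatively prime.
print(combined_cycle_length(16, 37) == product_formula(16, 37))
```

When the two periods do share a factor, as in a toy pairing of $m = 4$ and $n = 6$, the least common multiple falls below the closed-form value, which is why the relatively prime condition matters.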

PRNGs Implemented on FPGAs
This section describes the various CA configurations that were considered for implementation on FPGAs and the simulation method used to determine the quality of random numbers that were produced by each design.
Two programs were written in C to simulate the two PRNGs: the SPCA and the hybrid LFSR/CA. These simulations output the states of the various components of the PRNGs, allowing analysis and confirmation of the hardware output. The Linux command diff was used to compare the output files from the simulations to the test data obtained from logic analyzer measurements.
Logisim was used as a secondary simulator to confirm the results of the C program. The usefulness of Logisim lies in its graphical representation of the simulation, which makes some flaws more obvious and easier to fix than in C code. C code, however, is itself sometimes easier to debug and is much faster. With the diff command, the outputs from Logisim, C, and the FPGA hardware could all be compared automatically. A combination of VHDL code and hard macros was used to implement these PRNGs on an FPGA. The hard macro used to form the SPCA cell is shown in Figure 7. One whole cell is contained in this hard macro, which uses one slice on the Spartan 3E FPGA. Both the theoretical upper and lower cells are included in this macro. The dashed circle and arrow constitute the "lower" cell, and the solid circle and arrow constitute the "upper" cell. The circles are lookup tables (LUTs). These process the inputs from neighboring cells, applying the desired rule and outputting the new cell state. The flip-flops (FFs) that hold the cell's state and super-rule state are indicated with dashed and solid arrows, respectively.
Our SPCA uses rules 90 and 165 for the lower cells and rule 90 for the super-rule. The upper cells have a neighborhood of −2, +1. We have simulated this CA using a C program and confirmed the results with a Logisim simulation.
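One timestep of such an SPCA might be modelled as in the Python sketch below. It is an interpretation, not the simulation code used in this work: it assumes periodic boundaries, assumes that the upper (super-rule) cell applies rule 90 to the lower-cell states at offsets −2 and +1, and assumes that a selector value of 1 chooses rule 165.

```python
def spca_step(lower, upper):
    """One synchronous SPCA update (sketch under the assumptions above).

    lower -- list of cell states (the random output bits)
    upper -- list of super-rule states; 0 selects rule 90, 1 selects rule 165
    Both new lists are computed from the previous timestep only, so the lower
    cell uses the rule selected by the upper cell on the previous step.
    """
    n = len(lower)
    new_lower = []
    new_upper = []
    for i in range(n):
        rule90 = lower[i - 1] ^ lower[(i + 1) % n]
        # Rule 165 is the bitwise complement of rule 90, so XOR with the selector.
        new_lower.append(rule90 ^ upper[i])
        # Super-rule: rule 90 applied to the cells at offsets -2 and +1 (assumed).
        new_upper.append(lower[(i - 2) % n] ^ lower[(i + 1) % n])
    return new_lower, new_upper


lower = [0, 0, 0, 1, 0, 0, 0, 0]
upper = [0] * 8
print(spca_step(lower, upper))
```

Returning both lists from a single pass over the old state mirrors the simultaneous update of the upper and lower cells in hardware.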
One of the limitations of the design in [27] is that the PRNG was designed for a VLSI implementation. Therefore, pairs of rules that can be implemented with minimal overhead were chosen. However, since our PRNGs will be implemented on FPGAs, we are not subject to the same types of overhead. As the basic logic unit in the Spartan 3E FPGA is a 4-input LUT, it takes up the same area whether it is implementing a logical AND, an XOR, or a more complicated rule or pair of rules. Therefore, all 256 rules or any pair of these rules are available at the same cost in overhead.
A pair of rules can be implemented in a 4-input LUT because three inputs can be used for the neighborhood (left neighbor, cell's own state, right neighbor) and the fourth input can be used to consider the state of the super rule, that is, to select which rule to use. If the rule selector is 0, rule 90 is applied to the other three inputs; but if the rule selector is 1, rule 165 is applied. As with the case of the SPCA, an LFSR is easily implemented using a 4-input LUT. In this case, the lookup table is configured as a 16-bit shift register. The LUT is configured as an addressable shift register rather than with fixed values and is referred to as a Shift Register LUT 16-bit (SRL16). The SRL16 allows for very efficient and compact FPGA implementations of LFSRs.
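The pair-of-rules LUT can be illustrated by packing two 8-bit rule tables into one 16-entry truth table. In the Python sketch below, the ordering of the LUT address bits (selector as the most significant bit, then left, centre, right) is an assumption; the actual pin assignment in the hard macro may differ.

```python
def lut4_init(rule_if_0, rule_if_1):
    """16-bit truth table for a 4-input LUT holding two 3-input CA rules.

    Assumed address layout: selector is the top address bit, followed by
    left, centre, right. The low byte therefore holds the rule used when
    the selector is 0, and the high byte the rule used when it is 1.
    """
    return (rule_if_1 << 8) | rule_if_0


def lut4_eval(init, selector, left, centre, right):
    """Look up one output bit, as the LUT would."""
    address = selector << 3 | left << 2 | centre << 1 | right
    return (init >> address) & 1


init = lut4_init(90, 165)
print(hex(init))
```

Under this assumed layout any other pair of the 256 rules fits the same way, which reflects the point that rule choice carries no area penalty in a LUT-based cell.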
As previously noted, a hybrid LFSR + CA PRNG consisting of two state machines whose cycle lengths are relatively prime will generate a sequence of length $2^{n+m} - 2^n - 2^m + 1$ [30]. The advantage of this hybrid approach for FPGA implementation is the possible design tradeoffs. An LFSR by itself does not produce a good random sequence but can be compactly implemented on an FPGA. By comparison, a CA consumes more FPGA resources but provides good pseudorandom sequences due to the absence of adjacent bit correlations. The objective here is to create a new PRNG that possesses the CA's randomness and the LFSR's compactness.
A 16-cell heterogeneous CA consisting of a mixture of rules 90 and 150 was utilized as the baseline design. The rules were arranged to yield a maximal length pattern as described in [22]. This PRNG was implemented on an FPGA and was verified to run properly. Usage of hard macros allowed each cell to fit in one slice. This baseline PRNG failed many of the Diehard tests, as shown in Table 4. A similar 22-cell hybrid CA was also simulated for comparison with the 22-cell SPCA. This CA too failed most of the Diehard tests, as expected (Table 4).

Performance Evaluation
This section evaluates the FPGA PRNG implementations in terms of the quality of the random numbers, hardware resources utilized, throughput, and test results. The taps required to implement the various LFSRs in this study were obtained from [26] and are summarized in Table 1.

4.1. Overhead
The different CA combinations for the SPCA and CA/LFSR hybrid that were synthesized on an FPGA and the associated resource usage are summarized in Table 2. When synthesized on a Xilinx Spartan 3E FPGA, a 52-bit LFSR requires 4 slices (3 flip-flops and 5 LUTs) while a 16-cell CA requires 16 slices (16 flip-flops and 16 LUTs). The 22-bit SPCA requires 30 slices (44 flip-flops and 60 LUTs). The SPCA is larger because it uses one slice per cell (1 bit) while the LFSR takes advantage of the compact 16-bit shift-register implementation in a single LUT (i.e., half a slice).
Figure 8(a) shows the FPGA editor view of the SPCA. Each colored box is a cell or some auxiliary circuit such as a multiplexer for initializing the SPCA.
Figure 8(b) shows the view of the LFSR + CA combination. The inset highlights the bulls-eye marking that indicates a slice containing the hard macro for implementing one cell of the cellular automaton.

Quality of the Generated Random Numbers
The Diehard suite of statistical tests was run on all the configurations listed in Tables 3 and 4. The tables also summarize the test results. The names of the tests are listed in Table 5 (for more details on each test, see [28]). Any test that returns a P-value equal to "1.000" or "0.000" is considered to have failed and is represented with an "F" in the table. A test with up to two P-values equal to "0.999" or "0.001" is said to barely pass and is represented with a "BP." Any test with at least three P-values equal to "0.999" or "0.001" is said to barely fail and is represented with a "BF." Otherwise, the test is said to pass (represented with a "P").
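The grading rule above can be written as a small Python function. This is a sketch of the stated criteria only; it assumes P-values are compared at the three decimal places shown in the quoted strings, and the function name is hypothetical.

```python
def classify(p_values):
    """Grade one Diehard test from its list of P-values.

    Returns "F" (fail), "BF" (barely fail), "BP" (barely pass) or "P" (pass),
    comparing values at three decimal places (an assumption about reporting).
    """
    shown = [f"{p:.3f}" for p in p_values]
    if any(s in ("1.000", "0.000") for s in shown):
        return "F"
    extreme = sum(s in ("0.999", "0.001") for s in shown)
    if extreme >= 3:
        return "BF"
    if extreme >= 1:
        return "BP"
    return "P"


print(classify([0.412, 0.999, 0.250]))
```

Checking for outright failure first means that a single P-value of "1.000" or "0.000" overrides any count of near-extreme values.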
The results for maximum site spacing techniques are discussed first, followed by the maximum throughput results.
For the various LFSR + CA configurations where only 1 bit is generated per clock cycle (i.e., the maximum site spacing case), the 37 bit LFSR + 16 bit CA and the 52 bit LFSR + 8 bit CA produced the best results, passing all 18 tests. An intriguing result is that the 52 bit LFSR + 8 bit CA performed slightly better than the 52 bit LFSR + 16 bit CA.
As can be seen from Table 4, a hybrid CA by itself is insufficient to pass all the Diehard tests. As for techniques with maximum throughput, the SPCA was simulated with twenty different initial seed patterns for the on-off states of the CA. Ten seeds were nonrandom, orderly patterns such as all-on, alternating on/off, a single cell on in the middle, and so forth. Six of these ten simulations failed Diehard using our standards and two barely passed. Ten random initial seed patterns were also utilized. These random seeds have an unpatterned set of on/off states with approximately equal numbers of on and off states. Five of these simulations failed Diehard by our standards, one barely passed, and the remaining four passed all the Diehard tests.
In sum, the SPCA has much greater throughput than the LFSR + CA, while the LFSR + CA is not sensitive to the initial seed values for passing all the Diehard tests.
According to Diehard, an "ideal" PRNG should have a uniform distribution of P-values. Figure 9 compares the P-value distributions for different PRNGs tested using the Diehard test suite. PRNG configurations that failed Diehard tests tend to contain a high frequency of P-values in the 0.9 to 1.0 range, as seen for the 35LFSR + 8CA and for the two baseline hybrid CA. PRNG configurations that passed all tests, such as the 37LFSR + 16CA, contain P-values that are more equally distributed across the range [0,1).

Tradeoffs between Throughput and Quality of the Random Numbers Generated
After these results were obtained, it was considered whether a better balance between throughput and quality of randomness could be found. In order to achieve a greater throughput for the LFSR + CA, more than just the last bits were XORed to generate the output. The number of bits that could be XORed was limited, however, because the shift register bits inside the compact LUT cannot be accessed. Therefore, only bits that were tapped out or that could not be placed in a LUT were used to XOR with bits from the CA. In two different variations of the 37 LFSR + 16 CA, throughput was increased to 2 and 4 bits per timestep. These PRNGs still pass all Diehard tests (Table 6). As seen in Figure 10, all the P-value distributions remained relatively level.

Test Results and Comparisons with Related Works
The designs were implemented on a Xilinx Spartan-3E FPGA and were tested using a Tektronix TLA 7012 logic analyzer, as shown in Figure 11.
In the SPCA design, each cell was programmed to assume an initial state when an onboard switch was in the "on" position. This initial condition allowed the SPCA to be set and held at an initial state, allowing for consistent readings. After setting up the initial cell states in the FPGA, four million timesteps were read, with the FPGA running at 50 MHz. Because the logic analyzer connection to the FPGA board only had 16 pins, the FPGA was programmed to use an onboard switch to multiplex the output pins between the first 16 cells of the 22-cell SPCA and the last 16 cells. In order to get all 22 states, the FPGA was run twice, starting from the same initial states each time. On the first run, the states of the first 16 cells were read out; on the second run, the states of the last 16 cells were read. The 10-cell overlap between these two sets of data helped confirm that they did indeed coincide. This method was used three times to generate three sets of data from the FPGA. All three sets of data matched the simulation. The LFSR + CA has also been successfully implemented, matching the simulation.
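The two-run readout can be reassembled as in the following Python sketch, which checks the 10-cell overlap before stitching each timestep back into a 22-cell state. The data layout (one list of 16 bits per timestep) and the function name are assumptions for illustration; the actual capture files were compared with diff.

```python
def merge_runs(first_run, second_run, total_cells=22, pins=16):
    """Stitch two 16-pin captures into full 22-cell states.

    first_run  -- per-timestep lists holding the first `pins` cells
    second_run -- per-timestep lists holding the last `pins` cells
    Raises ValueError if the overlapping cells disagree at any timestep.
    """
    overlap = 2 * pins - total_cells  # 10 cells for the 22-cell SPCA
    merged = []
    for step, (a, b) in enumerate(zip(first_run, second_run)):
        if a[-overlap:] != b[:overlap]:
            raise ValueError(f"runs disagree in the overlap at timestep {step}")
        merged.append(a + b[overlap:])
    return merged


state = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]
print(merge_runs([state[:16]], [state[6:]]) == [state])
```

A mismatch in the overlap would indicate that the two runs did not start from the same initial state or were not aligned in time.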
The hybrid LFSR + CA design could operate at a speed of 110 MHz with a power dissipation of 89 mW, while the SPCA design had a maximum operating frequency of 115 MHz and a power dissipation of 103 mW. In terms of throughput, the LFSR + CA design can output 440 Mb/sec while the SPCA can deliver 2530 Mb/sec. The SPCA has a significantly higher throughput since it outputs all 22 bits in one clock cycle, while the hybrid LFSR + CA design outputs a maximum of 4 bits per clock cycle. The previous designs by Guan and Tan [27] and Tkacik [30] were implemented on custom integrated circuit processes instead of FPGAs, so a direct comparison is not feasible. It can be observed that Guan and Tan reported an estimated throughput of 3510 Mb/sec for their 20 bit SPCA design [27]. Given that these design simulations targeted a 0.35 µm CMOS process and that the Spartan-3E FPGA used in our design is built from a 90 nm CMOS process, the comparable throughput numbers are consistent with the differences in performance typically experienced between ASIC and FPGA implementations [31].

Conclusions
Cellular automata represent a basic form of a high-speed massively parallel computation engine. Such forms of cellular computing can be implemented on current reconfigurable platforms, such as FPGAs, and will likely form the basis for quantum computers developed for emerging nanotechnologies. This paper has evaluated the performance of CA-based PRNGs suitable for implementation on FPGAs. Synthesis results for the Xilinx Spartan 3E FPGA give a good idea of the relative resources required for each configuration. The LFSR + CA combination uses less overhead than the SPCA, due to use of the compact LUT implementation of the LFSR. The Diehard suite of statistical tests was used to evaluate the quality of the random numbers produced from each configuration. It was found that the 37 bit LFSR + 16 bit CA, the 52 bit LFSR + 8 bit CA, and the SPCA with some of the random initial seeds passed all the tests. There is a large gap in throughput between the LFSR + CA and the SPCA. This is due to the inaccessibility of bits that are inside the compact LUT used for the LFSR. In order for more bits to be accessible, the LFSR must be split up, increasing the overhead. The SPCA, however, has a high throughput since the state of every cell can be used. Although the states of the theoretical upper CA in the SPCA could be used to double throughput, this technique was avoided because it could compromise usefulness in an encryption setting [27].
In the future, we will attempt to add more throughput to the LFSR + CA and also explore the aspect of maximum cycle length. The LFSR + CA combination can pack a large number of states in a small space using the current design

Figure 1: Schematic of a maximal length 3-bit linear feedback shift register.

Figure 5: Space-time evolution patterns for: (a) a simple rule-90 CA, (b) a hybrid 90/150 CA, and (c) the SPCA using rules 90/165 and super-rule 90. All are initialized with a single bit in the center.

Figure 7: FPGA editor view of the hard macro used to form an SPCA cell showing the lower and upper cells.

Figure 8: (a) The FPGA editor snapshot showing the layout of the 22-cell SPCA, using 30 slices (small, colored boxes). (b) The FPGA editor view of a portion of the CA + LFSR implementation. The inset shows a pair of hard macros used to realize the CAs.

Figure 11: Test setup showing the Spartan 3E FPGA development board connected to a Tektronix TLA 7012 logic analyzer.

Table 2: FPGA resources utilized by the CAs.

Table 4: Diehard results for hybrid CA and SPCA optimized for throughput.

Table 5: Diehard tests.