This paper presents an improved interconnect network for Tree-based FPGA architecture that unifies two unidirectional programmable networks. New tools are developed to place and route the largest benchmark circuits, where different optimization techniques are used to get an optimized architecture. The effect of variation in LUT and cluster size on the area, performance, and power of the Tree-based architecture is analyzed. Experimental results show that an architecture with LUT size 4 and arity size 4 is the most efficient in terms of area and static power dissipation, whereas the architectures with higher LUT and cluster size are efficient in terms of performance. We also show that unifying a Mesh with this Tree topology leads to an architecture which has good layout scalability and better interconnect efficiency compared to VPR-style Mesh.

The work presented in this paper can be divided into two parts. In the first part, we present an improved Tree-based FPGA architecture. In the second part of the paper, the architecture presented in the first part is used for the improvement of connection blocks and intracluster interconnect topologies in a cluster-based Mesh FPGA architecture.

The motivation behind the work presented in the first part is to reduce the dominance of the interconnect area in field programmable gate arrays (FPGAs). In FPGAs, the interconnect takes up to 90% of the total area, while the remaining 10% is used by the logic part of the architecture. This dominance greatly affects the delay and area efficiency of the architecture. In [

The motivation behind the second part of the paper is the optimization of connection blocks and intracluster interconnect topologies in a cluster-based Mesh FPGA architecture. There are different ways to connect signals to the LUT input muxes. In Xilinx Virtex architectures [

The remainder of the paper is organized as follows. Section

We propose a Tree-based architecture called MFPGA (Multilevel FPGA) where LBs (Logic Blocks) are grouped into clusters located at different levels. Each cluster contains a switch block to connect local LBs. A switch block is divided into MSBs (Miniswitch Blocks).

This architecture unifies two unidirectional networks. The downward network uses a "Butterfly Fat Tree" topology to connect DMSBs (Downward MSBs) to LB inputs. As shown in Figure

Tree-based interconnect: upward and downward networks.

As shown in Figure

Using UMSBs and DMSBs greatly enhances routability, but it increases the number of interconnect switches. However, this increase is compensated by reducing the in/out signal bandwidth of clusters at every level. In fact, netlists implemented on an FPGA architecture often communicate locally (intracluster), and this can be exploited to reduce the bandwidth of signals used for intercluster communication. A good estimation of netlist communication locality is given by Rent's Rule [

We define Rent's parameter for an architecture as follows:

Intuitively,

Tree-based interconnect depopulation using Rent's rule (level 1 with
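The bandwidth reduction described above can be illustrated numerically. The definition of the architecture Rent's parameter is not reproduced in this text, so the sketch below assumes the commonly used form IO = c * k^(l*p), where l is the Tree level, k the cluster arity, c the number of in/out pins of an LB, and p the Rent's parameter. The function name `cluster_io` and the convention that level 0 is a single LB are illustrative choices, not taken from the paper.

```python
import math


def cluster_io(level: int, arity: int, lb_pins: int, p: float) -> int:
    """Bound on the in/out signals of a cluster at `level`.

    Assumed form of Rent's rule: IO = c * k**(level * p), with
    c = lb_pins and k = arity. Level 0 is a single LB, so a cluster
    at `level` contains arity**level LBs.
    """
    if level < 0 or arity < 2 or lb_pins < 1:
        raise ValueError("invalid Tree parameters")
    if not 0.0 <= p <= 1.0:
        raise ValueError("Rent's parameter is expected in [0, 1]")
    # Small epsilon so that exact powers are not rounded up by float noise.
    return math.ceil(lb_pins * arity ** (level * p) - 1e-9)


if __name__ == "__main__":
    # Compare a fully populated Tree (p = 1) with a depopulated one.
    for p in (1.0, 0.75, 0.5):
        print(p, [cluster_io(level, 4, 5, p) for level in range(1, 5)])
```

With p = 1 the cluster bandwidth grows as fast as the number of LBs it contains; any p below 1 makes the bandwidth grow more slowly than the cluster size, which is where the switch savings come from.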

The way LBs are distributed between Tree clusters has an important impact on congestion. It is worthwhile to reduce external communications, since local connections are cheaper, not only in terms of delay but also in terms of routability, as this makes more levels (more paths) available for connecting sources to destinations. Another way to decrease congestion is to eliminate competition between sources trying to reach their sinks. This can be achieved by depopulating clusters based on the fanout of netlist instances. Instances with high fanout need more resources to reach their sinks. Thus, in the partitioning phase, instance weights are assigned according to fanout. We use a top-down recursive partitioning approach: first we construct the top-level clusters, then every cluster is partitioned into subclusters, until the bottom of the hierarchy is reached. Since logic block positions inside the owner cluster are equivalent, the detailed placement phase (arrangement inside clusters) is done randomly.
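The recursion and the fanout-based weighting can be sketched as follows. This is only an illustration of the control flow: it balances fanout weight between subclusters with a greedy heuristic and ignores connectivity, whereas a real partitioner would also minimize the number of cut nets and respect cluster capacity. The function name `partition_top_down` and its parameters are hypothetical.

```python
import random


def partition_top_down(instances, fanout, arity, levels, rng=None):
    """Recursively split `instances` into `arity` subclusters per level.

    Returns nested lists; the innermost lists are leaf clusters.
    Instance weight is assumed to equal its fanout (at least 1), so a
    high-fanout instance ends up in a less populated cluster.
    """
    rng = rng or random.Random(0)

    def split(group, level):
        if level == 0:
            leaf = list(group)
            # Positions inside a leaf cluster are equivalent, so the
            # detailed arrangement is random.
            rng.shuffle(leaf)
            return leaf
        parts = [[] for _ in range(arity)]
        loads = [0] * arity
        # Heaviest first: place each instance in the lightest part so far.
        for inst in sorted(group, key=lambda i: -max(1, fanout[i])):
            target = loads.index(min(loads))
            parts[target].append(inst)
            loads[target] += max(1, fanout[inst])
        return [split(part, level - 1) for part in parts]

    return split(list(instances), levels)


if __name__ == "__main__":
    fan = {"a": 8, "b": 1, "c": 1, "d": 1, "e": 1, "f": 1, "g": 1, "h": 1}
    print(partition_top_down(list(fan), fan, arity=2, levels=1))
```

In the usage example, the single high-fanout instance is isolated in its own cluster while the low-fanout instances share the other one, which is the depopulation effect described in the text.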

After placement, the routing process is started. Interconnect resources are represented by a routing graph whose nodes correspond to wires and LB pins and whose edges correspond to switches. We use the negotiation-based algorithm
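The idea behind negotiation-based routing on such a graph can be sketched as below. This is a simplified illustration, not the router used in this work: every net is re-routed in each iteration, each net has a single sink, and the cost function (a history term multiplied by a present-congestion term that grows with the iteration count) is one plausible choice. The names `route_negotiated`, `hist_gain`, and `pres_gain` are hypothetical.

```python
import heapq


def _shortest_path(graph, src, dst, node_cost):
    """Dijkstra over a directed graph with costs attached to nodes."""
    best = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            path = [node]
            while node != src:
                node = prev[node]
                path.append(node)
            return path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph[node]:
            cand = dist + node_cost(nxt)
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))
    return None


def route_negotiated(graph, nets, capacity=1, max_iters=20,
                     hist_gain=1.0, pres_gain=2.0):
    """graph: node -> successor nodes; nets: name -> (source, sink)."""
    history = {n: 0.0 for n in graph}
    for iteration in range(max_iters):
        usage = {n: 0 for n in graph}
        paths = {}
        for name, (src, dst) in nets.items():
            def cost(node):
                # Overuse that would result if this net also took `node`.
                over = max(0, usage[node] + 1 - capacity)
                present = 1.0 + pres_gain * iteration * over
                return (1.0 + history[node]) * present
            path = _shortest_path(graph, src, dst, cost)
            if path is None:
                raise ValueError(f"no path for net {name}")
            paths[name] = path
            for node in path:
                usage[node] += 1
        congested = [n for n, u in usage.items() if u > capacity]
        if not congested:
            return paths
        # Overused nodes become permanently more expensive.
        for node in congested:
            history[node] += hist_gain * (usage[node] - capacity)
    raise RuntimeError("routing did not converge")
```

Nets that have an alternative path are gradually pushed off contested wires, while nets with no alternative keep them.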

To evaluate the proposed architecture performance, we place and route the largest MCNC benchmark circuits available and consider as a reference the optimized clustered Mesh (VPR-style) architecture. We use t-vpack [

Standard cells characteristics.

Cell | Area |
---|---|
Sram | |
Tri-state | |
Buffer | |
Flip-flop | |
Mux | |

Unidirectional versus bidirectional wires.

As explained in Section

Just as VPR applies a binary search to find the smallest channel width for a Mesh architecture, we apply a binary search to determine the smallest Rent's parameter for each level of the Tree-based architecture. Depending on the order in which levels are processed, we tested 3 different approaches.
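The per-level binary search can be sketched as follows. The routability check `is_routable` stands in for a full place-and-route run and is a hypothetical callback; the sketch also assumes that routability is monotonic in each level's parameter, which the binary search needs. The `order` argument selects the approach: ascending level indices for bottom-up, descending for top-down, or a shuffled list for random.

```python
def minimize_rent_parameters(num_levels, is_routable, order, tol=0.01):
    """Shrink each level's Rent's parameter as far as routing allows.

    is_routable(params) -> bool is a hypothetical oracle that places and
    routes the netlist on the architecture described by `params`
    (params[0] is level 1). Levels are processed in `order`.
    """
    params = [1.0] * num_levels  # start from a fully populated Tree
    if not is_routable(params):
        raise ValueError("netlist is unroutable even with p = 1 everywhere")
    for level in order:
        lo, hi = 0.0, params[level]  # invariant: `hi` is known to route
        while hi - lo > tol:
            mid = (lo + hi) / 2
            trial = params[:level] + [mid] + params[level + 1:]
            if is_routable(trial):
                hi = mid
            else:
                lo = mid
        params[level] = hi
    return params


if __name__ == "__main__":
    # Toy oracle: each level independently needs a minimum parameter.
    need = [0.6, 0.5, 0.4]
    oracle = lambda ps: all(p >= n for p, n in zip(ps, need))
    print(minimize_rent_parameters(3, oracle, order=[0, 1, 2]))  # bottom-up
    print(minimize_rent_parameters(3, oracle, order=[2, 1, 0]))  # top-down
```

With the toy oracle every order gives the same answer because the levels are independent. On a real architecture, shrinking one level changes what the other levels need, which is why the processing order matters.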

MFPGA architecture optimization flow (bottom-up approach).

The 3 approaches have the same objective: reducing the signal bandwidth of clusters at each level. The difference is the order in which levels are processed. In Table

Levels Rent's rule parameters (21 benchmark average).

Level | Circuits partitioning | Archi. top-down | Archi. bottom-up | Archi. random |
---|---|---|---|---|
1 | 0.64 | 0.98 | 0.79 | 0.88 |
2 | 0.55 | 0.88 | 0.74 | 0.79 |
3 | 0.50 | 0.80 | 0.77 | 0.76 |
4 | 0.49 | 0.75 | 0.86 | 0.73 |
5 | 0.45 | 0.59 | 0.87 | 0.70 |

A netlist routing example showing architecture Rent's parameter increase.

Partitioned netlist

Routed netlist with conflict

Routed netlist with no conflict

In Figure

Overhead between Architecture and partitioned netlist Rent's parameters (21 benchmark average).
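As a worked example, the overhead can be recomputed from the table of level Rent's parameters. The exact definition of overhead used for the figure is not given in this text, so the sketch assumes it is the plain difference between the architecture parameter and the partitioned-netlist parameter at each level; the variable and function names are illustrative.

```python
# Values copied from the table of level Rent's parameters
# (21 benchmark average), levels 1 to 5.
PARTITIONING = [0.64, 0.55, 0.50, 0.49, 0.45]
ARCHITECTURE = {
    "top-down": [0.98, 0.88, 0.80, 0.75, 0.59],
    "bottom-up": [0.79, 0.74, 0.77, 0.86, 0.87],
    "random": [0.88, 0.79, 0.76, 0.73, 0.70],
}


def overhead(approach):
    """Per-level overhead, assumed to be architecture p minus netlist p."""
    return [round(a - n, 2)
            for a, n in zip(ARCHITECTURE[approach], PARTITIONING)]


if __name__ == "__main__":
    for name in ARCHITECTURE:
        per_level = overhead(name)
        print(name, per_level, "total:", round(sum(per_level), 2))
```

Under this assumed definition, the top-down approach concentrates its overhead at the low levels, the bottom-up approach at the high levels, and the random order spreads it almost evenly.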

A comparison of the average results obtained from the 3 optimizing approaches is shown in Table

Area and performance comparison between various optimizing approaches (21 benchmark average).

Optimizing approach | Area ( | Critical path switches |
---|---|---|
Top-down | 1498 | 98 |
Bottom-up | 1326 | 106 |
Random | 1221 | 101 |

To reduce the gap between the circuit and architecture Rent's parameters, the partitioning tool must be improved (especially its objective function) so as to reduce congestion and the resources (cluster inputs) required to route signals.

In this section we evaluate the effect of LUTs size

Total area for clusters sizes 4–8 (21 benchmark average).

To analyze the effect of LUT size on area further, we divide the area into two parts: logic block area and interconnect area. Our experiments show that logic area increases with LUT size. This area is the product of the total number of LUTs and the area per LUT. A plot of these two components for cluster arity equal to 4 is shown in Figure

LUTs number and LUT area versus LUT size (for cluster arity = 4).
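The product described above can be sketched with a simple LUT model. A k-input LUT stores 2^k configuration bits and selects one of them with a tree of 2^k - 1 two-input multiplexers; the per-cell areas passed in below are placeholders in arbitrary units, not values from the standard cell table, and the function names are illustrative.

```python
def lut_area(k, sram_area, mux2_area):
    """Area of one k-input LUT: 2**k SRAM bits plus a 2:1 mux tree."""
    if k < 1:
        raise ValueError("LUT size must be at least 1")
    return (2 ** k) * sram_area + (2 ** k - 1) * mux2_area


def logic_area(num_luts, k, sram_area, mux2_area):
    """Total logic area = number of LUTs times area per LUT."""
    return num_luts * lut_area(k, sram_area, mux2_area)


if __name__ == "__main__":
    # Placeholder cell areas; the LUT counts per size are hypothetical.
    for k, num_luts in [(3, 1000), (4, 800), (5, 700), (6, 650)]:
        print(k, logic_area(num_luts, k, sram_area=1.0, mux2_area=1.0))
```

Because the area per LUT roughly doubles with each additional input, the number of LUTs would have to halve at each step for the logic area to stay constant. That is consistent with the observation that logic area increases with LUT size.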

Varying clusters arity

The second key metric is the critical path delay. Since we have no accurate wire length estimation (we do not have a complete layout generator yet), we only evaluate the number of switches crossed by the critical path. Figure

Critical path switches number clusters sizes 4–8 (21 benchmark average).
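How the switch count of a connection depends on cluster arity can be sketched as follows. The sketch rests on an assumption about the Tree routing path that is not spelled out in this text: a signal whose source and sink first share a cluster at level l crosses one UMSB per level below l on the way up and one DMSB per level from l down to 1, giving 2l - 1 switches. LBs are numbered by leaf index, and the function names are illustrative.

```python
def lowest_common_level(lb_a, lb_b, arity):
    """Lowest Tree level (starting at 1) whose cluster holds both LBs."""
    if lb_a < 0 or lb_b < 0 or arity < 2:
        raise ValueError("invalid LB index or arity")
    level = 1
    lb_a //= arity
    lb_b //= arity
    while lb_a != lb_b:
        lb_a //= arity
        lb_b //= arity
        level += 1
    return level


def switches_crossed(lb_a, lb_b, arity):
    """Assumed model: (l - 1) UMSBs upward plus l DMSBs downward."""
    level = lowest_common_level(lb_a, lb_b, arity)
    return 2 * level - 1


if __name__ == "__main__":
    # The same pair of LBs is "closer" in a Tree with larger arity.
    for arity in (4, 8):
        print(arity, switches_crossed(0, 7, arity))
```

Under this model, a larger arity lets more LB pairs meet at a lower level, which reduces the number of switches on their connection.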

According to [

Buffers number clusters sizes 4–8 (21 benchmark average).

SRAM cells number clusters sizes 4–8 (21 benchmark average)

To get a good tradeoff between area and path delay, using different LUT sizes is necessary. This was confirmed by the Stratix II architecture [

Here we compare MFPGA to the Mesh-based architecture in terms of area efficiency. In both cases we consider architectures with clusters arity 4 and LUT size 4. In each case, we determine the smallest architecture implementing every benchmark circuit. In the case of Mesh we use VPR to find the smallest channel width, and in the case of MFPGA we use the random optimizing approach described in Section

In Figure

MFPGA area versus Mesh area (21 benchmark circuits).

We compare the areas of both architectures using a refined estimation model of effective circuit area. The Mesh area is the sum of its basic cells areas like SRAMs, tri-states, multiplexers and buffers. The same evaluation is made for the Tree, composed of SRAMs, multiplexers, and buffers. Both architectures use the same symbolic cells library.

The efficiency of the Tree architecture is due essentially to its ability to control logic block occupancy and interconnect population simultaneously, based on LBs number

Levels Rent's parameters for 2 circuits.

Circuits | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 |
---|---|---|---|---|---|
apex2 | 1 | 0.89 | 0.86 | 0.84 | 0.77 |
tseng | 0.79 | 0.79 | 0.79 | 0.72 | 0.67 |

Area distribution between interconnect and logic blocks: Tree and Mesh cases.

We showed that with a Tree-based topology, we obtain good density and we cut area by a factor of 2 compared to Mesh. Nevertheless, based on our layout experiments, we noticed that this Tree-based architecture is penalized in terms of physical layout generation. It does not scale well and does not fit a planar chip structure, especially for large circuits. Conversely, the Mesh, and especially the Mesh of Tree [

Layout view of Mesh and Tree interconnect structures.

Maximum wire lengths depending on Tree size (arity 4).

Maximum wire delays depending on Tree size (arity 4).

To take advantage of the positive points of both topologies, we propose an architecture where LBs are connected into a cluster (Mesh cluster) with a local interconnect built as a Tree. Mesh clusters are connected with an external interconnect with a Mesh topology. We use the same Tree topology presented previously. In the Mesh interconnect we use only unidirectional wires, since in [

As presented in Figure

Node of Mesh of Tree architecture.

Cluster interface: external input and output connections.

Each input Superpin contains 4 inputs connected to the 4 adjacent channels. Each input is connected to all UMSBs located at level

Each output Superpin contains 4 outputs connected to the 4 adjacent switch boxes. Each output is connected to all DMSBs placed at level

This distribution has an important impact on routability and eliminates constraints on the placement of LBs inside Tree clusters. All 4 Mesh cluster sides have the same number of inputs and outputs. The numbers of inputs and outputs per side depend on the number of Tree leaves and on the level where they are connected:

As presented in Figure

In the Mesh interconnect structure we use only single-driver unidirectional wires; in [

Mesh switch box topology.

SB: Cluster outputs to channel tracks

SB: channel tracks to channel tracks

The configuration flow used to implement benchmark netlists on the proposed architecture is described in Figure

Mesh of Tree configuration flow.

To evaluate the proposed architecture and tool performances, we place and route the 3 largest MCNC benchmark circuits and the

Interconnect distribution in Mesh of Tree architecture.

Comparison of various FPGA architectures areas.

We also notice that, compared to a stand-alone Tree, the total area is increased by 28%. This increase is compensated by the simplicity of Mesh of Tree layout generation and by the reduction in wire length compared to a stand-alone Tree, especially when targeting large circuits. In this case, wire lengths depend only on the Mesh cluster size and not on the total number of LBs in the architecture.

We proposed a Tree-based architecture with high interconnect and low logic utilizations. Based on the implementation of the largest MCNC benchmarks, we showed that this architecture has better area efficiency than the common VPR-style clustered Mesh. We showed that, in general, LUT size 4 and cluster size 4 produce the most efficient results in terms of area and static power dissipation for the Tree-based FPGA. We also determined how the number of switches crossed by the critical path evolves as a function of LUT and cluster size, and we showed that larger LUTs and larger clusters can be more efficient in terms of performance, though they are less efficient in terms of density. Nevertheless, this Tree-based architecture is penalized in terms of physical layout generation. To address this problem, we proposed an architecture unifying the strong points of both Mesh and Tree. The Mesh of Tree has good physical scalability: once the cluster layout is generated, we can abut it to generate Mesh layouts with the desired size and shape factor. The proposed Mesh of Tree architecture is a good tradeoff between area density and layout scalability.