PackeX: Low-Power High-Performance Packet Classifier Using Memory on FPGAs

Networks are continuously growing, and the demand for fast communication is rapidly increasing. With increasing network bandwidth requirements, efficient packet-classification techniques are required. To meet the requirements of future networks at the component level, every module, such as routers, switches, and gateways, needs to be upgraded. Packet classification, which separates incoming traffic into defined streams, is one of the main functions of a stable network. Existing packet classifiers lack the throughput to cope with the growing demands of the network. In this work, we propose a novel high-speed packet classifier, named PackeX, that enables the network to receive and forward data packets using a very simple structure. A 128-rule 32-bit configuration is successfully implemented on a Xilinx Virtex-7 FPGA. Experimental findings show that our proposed packet classifier is versatile and dynamic compared to current FPGA-based packet classifiers, achieving a speed of 119 million packets per second (Mpps) while consuming 53% less power than state-of-the-art architectures.


Introduction
In their early stages, network devices lacked packet classification because of their limited applications, but the Internet is now a multiservice ecosystem, and a classifier is one of the main components of the whole networking system [1][2][3]. It enables the network to deploy service classification, i.e., security, quality of service (QoS), multimedia communications, and monitoring, and to distinguish different network traffic flows from each other [4]. Some simple techniques are also applied in routers to decide whether packets should be forwarded or dropped in order to protect the network infrastructure. In conventional implementations, rules are kept relatively static, so fast classification can be accomplished with algorithms running over the classifier's well-designed data structure [5]. In the past, the primary objective of classifier design was high-speed packet processing, such as content detection, load balancing, and packet filtering. The classifier could be installed offline because rule updates were rare. Modern applications must respond simultaneously to a wide range of requests from different users, so the classifier must be updated regularly to satisfy different requirements. Standard network migration operations modify the topology of the network, and the procedure adjusts the classifier accordingly [5, 6]. Sorting of packets can then be carried out electronically with the help of fast dynamic policy updates, which is undoubtedly a necessary prerequisite for current and future classifiers.
Software-defined networking (SDN), traffic engineering, and network function virtualization (NFV), as next-generation networking technologies, provide highly flexible networks. FPGAs fit their requirements of reconfigurability and flexibility perfectly [17, 18]. Figure 1 shows the generalized structure of a packet classifier that helps the network forward an incoming packet to the corresponding node or hop based on its destination address. It uses content-addressable memory (CAM) to store the destination addresses (IP addresses in the packet) and random-access memory (RAM) to store the next node to which the packet needs to be transferred. There are many types of packet classifiers, but the one we cover in this paper is based on the destination address and the next hop [19]. A next-hop packet classifier consists of RAM and CAM. RAM performs a search using a memory address and returns the data stored at that address [20]. A CAM-based search does the opposite: a function calls the CAM by passing a key that consists of a data word, and the CAM search returns a memory address. CAM further differentiates itself from other types of memory in that it can perform a memory search in a single clock cycle. A CAM can be a binary CAM or a ternary CAM, depending on the requirements of the application [21]. Other classifiers, however, use other techniques to classify packets from the incoming stream.
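The RAM/CAM contrast above can be sketched in a few lines, with made-up table contents: RAM maps an address to a data word, while a CAM maps a data word back to the address where it is stored (hardware does this in a single clock cycle; the linear scan below only models the behavior, not the timing).

```python
# RAM: address -> data word (illustrative contents, not from the paper).
RAM = {0: "node-A", 1: "node-B", 2: "node-C"}

# CAM: the entry's index acts as its address (illustrative contents).
CAM = ["10.0.0.1", "10.0.0.2", "192.168.1.7"]

def ram_read(addr):
    """RAM search: given a memory address, return the stored word."""
    return RAM[addr]

def cam_search(key):
    """CAM search: given a data word, return the address where it is
    stored, or None if no entry matches."""
    for addr, word in enumerate(CAM):
        if word == key:
            return addr
    return None
```

A ternary CAM generalizes this by allowing "don't care" bits in the stored entries, which is what the TCAM blocks discussed later provide.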

Motivation
Existing packet classifiers on FPGAs have performance bottlenecks and cannot achieve the throughput required by high-speed networks [17]. Thus, a high-throughput packet classifier is needed to meet the demands of next-generation networks without compromising on power. The novelty of our proposed architecture, PackeX, is its minimalism, which does not compromise the high throughput required to classify the incoming traffic. The proposed architecture provides higher-speed packet classification while consuming less power than existing packet-classification architectures.

Key Contributions
The key contributions of our proposed packet-classification architecture are as follows:
(i) The proposed packet classifier, PackeX, is a state-of-the-art architecture for classifying network packets compared to existing architectures
(ii) The proposed PackeX consumes half the power of the state-of-the-art architecture
(iii) PackeX processes incoming data packets at a rate of 119 million packets per second (Mpps) using the distributed RAM on the target FPGA
(iv) The proposed architecture is scalable and dynamically reconfigurable compared to existing state-of-the-art TCAM-based packet-classification architectures
The rest of the paper is organized as follows: Section 4 addresses the related work. Section 5 explains the proposed PackeX classification system and architecture. The pipelining of the proposed architecture is described in Section 6. The implementation results and performance assessment of our proposed architecture are presented in Section 7. Section 8 concludes the paper.

Related Work
Hardware-based packet classifiers can be divided into three main types: decision-tree, exhaustive-search, and decomposition-based. Decision-tree classifiers have large hardware requirements and long processing times for the defined rule set. Our proposed architecture has lower hardware resource requirements and is implementable on FPGAs, providing high performance compared to the state-of-the-art packet classifiers, as shown in Table 1.
Zhang and Zhou reported a scheme based on reducing TCAM memory usage [22]. A code based on a split function was proposed in this study. First, the d-tuple rule set is split into d-tuple fields, and a unique field test is obtained for each dimension. Based on matching with the incoming packet, the rule is stored in SRAM and indexed in TCAM using a concatenated field. In contrast, our proposed packet classifier uses a novel and simple architecture that maps packet addresses into a TCAM partitioned across the distributed memory of the target FPGA. This improves the throughput compared to the available state-of-the-art packet-classification algorithms.
Using a Multimatch Using Discrimination (MUD) approach, Lakshminarayanan et al. used the extra bits of a TCAM entry for encoding and kept the encoded value in the same TCAM entry [23]. Evidently, the resulting multiple lookup cycles are a major disadvantage for high-speed networks. This fact is explained in detail in the packet-classification section of this article. Although multiple lookup cycles consume a considerable amount of power, the power consumption of PackeX remains low owing to its simple structure [24].
Another TCAM-based packet classifier has been proposed in [25]. This packet classifier reduces the width of the TCAM to 36 bits, which reduces the space required for the TCAM. Consequently, the rule set is stored in SRAM. However, due to its field-code formation and indexing mechanisms, this packet classifier requires large memory. In fact, the scattered and sparse SRAM array wastes the majority of SRAM entries. PackeX uses distributed RAM, which is partitioned to reduce RAM usage, and provides high-speed searching to achieve high-speed classification of the incoming data packets.
Updating the packet classifier is also an important aspect of the design, which sometimes becomes a bottleneck for dynamic networks. TCAM update modules have been developed to speed up the updating of the searching module of a packet classifier; their performance is constantly improving and depends on the type of RAM used in the FPGA [26, 27]. BRAM and distributed RAM have different update latencies based on their available depth in the target FPGA. We have chosen distributed RAM, which provides the minimum clock-cycle update time and makes our packet classifier unique in its updating process. This makes PackeX dynamic as well as fast enough to support network modifications that mitigate the incoming traffic from time to time. Lookup tables (LUTs) were used instead of memory blocks by Khatami and Ahmadi, where partial reconfiguration in the field-programmable gate array (FPGA) is used to shorten the time required to adjust the behavior of the architecture. Their pipeline system, on the other hand, results in unbalanced memory sharing, leading to poor throughput and inefficient resource allocation [28]. PackeX employs a very basic framework and a pipelining scheme, resulting in high throughput and optimal resource allocation.
Aceto et al. proposed MIMETIC, a scheme that can capitalize on the heterogeneity of data traffic by using both inter- and intramodalities. The scheme outperforms single-modality techniques by supporting more challenging mobile traffic scenarios. MIMETIC uses three datasets to validate its performance improvement over fusion classifiers, ML-based traffic classifiers, and single-modality DL-based schemes [29]. The authors in [30] applied a deep learning (DL) algorithm to packet classification on mobile traffic. The algorithm is centered on feature extraction and is capable of running on encrypted and complex data traffic. It outperforms state-of-the-art ML techniques and previous deep-learning-based algorithms by improving the F-measure. In contrast, our proposed packet classifier employs a novel and simple design that maps packet addresses into a TCAM partitioned for storage in the target FPGA's distributed memory.
The development of binary and ternary CAMs is also relevant to packet classifiers, because the CAM is the vital component that matches an address against the stored addresses in a short time [31]. Thus, Bi-CAM and TCAM designs include partitioning of the corresponding CAM, binary-to-ternary conversion of the storage cells, and pipelining of the internal signals to improve throughput. The proposed packet classifier uses the idea of partitioning taken from HP-TCAM [15] and pipelining taken from D-TCAM [16] to develop a novel structure for fast packet classification. We thus attain a higher speed in terms of millions of packets per second, giving, to the best of our understanding, the fastest packet classifier, while consuming 40% to 53% less power than state-of-the-art architectures. Table 2 shows the notations that are used to describe the proposed packet-classification design (PackeX).

Terminology.
(i) A-block: a TCAM that stores the addresses of the incoming packets. In this paper, these addresses are considered to be destination addresses. However, they can also be source addresses when PackeX is used as a filter that inspects where data packets come from and determines whether to drop or forward them. This block is known as the A-block (address block)
(ii) P-block: a RAM that stores the information (pointers) for the corresponding nodes. It is known as the P-block (pointers block)
Other blocks/components include a demultiplexer, a priority encoder (or a simple encoder), ANDing modules, and connecting circuitry. Figure 2 shows the internal structure of PackeX. The destination address (D_A) from the incoming packet is searched/compared against the stored addresses in the A-block. One or more of the stored entries match the input D_A and generate the corresponding Match-Lines (MLs). The priority encoder (PE) converts the Match-Lines into an address for the P-block, which provides the value of the corresponding node. For example, "00" represents node A, "01" represents node B, and so on. The packet is forwarded according to the node specification provided by the P-block using a demultiplexer (DEMUX). The A-block and P-block are the main blocks, storing the address information and node information, respectively. The size of the demultiplexer (DEMUX) depends on the number of nodes in the network. Here, there are 4 nodes, which require a DEMUX of size 1:4.

Classification Procedure
The two basic blocks involved in the classification process are the A-block and the P-block. Algorithm 1 shows the data flow. The destination address produces the Match-Lines, which determine the node, and the demultiplexer (DEMUX) activates only one channel out of many using the content stored in the P-block. The size of the DEMUX is determined by the number of nodes supported by PackeX. For instance, a 4-node PackeX requires a 1:4 DEMUX, while the size of the P-block depends on the number of destination addresses stored in the A-block.

Partitioning of TCAM Module
The TCAM part of PackeX is built from 6-bit sub-TCAM modules. Multiple sub-TCAM modules combine to form the complete TCAM component of the proposed packet classifier, as shown in Figure 3. The use of 6 bits is due to the built-in structure of distributed RAM inside modern FPGAs, which is based on 6-input lookup tables (LUTs), also known as LUTRAMs. The TCAM holds the destination addresses of the packets received by PackeX for classification.
TCAM submodules of 6 bits each form the TCAM module that stores the destination addresses of the incoming data packets. The symbol W represents the total size of the stored addresses, while the symbol w is 6 because of the 6-input LUTs available in the target FPGA.
The number of required sub-TCAM modules is determined by the required size of the addresses to be stored in the proposed packet classifier. The ideal address sizes are multiples of 6, i.e., 6, 12, 18, and so on. If the size of the addresses/rules is 6 bits, one sub-TCAM module is required; if it is 12 bits, two sub-TCAM modules are required, and so on.
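A minimal sketch of this partitioning, with invented rule contents: a W-bit address is split into w = 6-bit chunks, one per sub-TCAM (6-input LUTRAM). Each sub-TCAM is emulated as a 64-entry table mapping every possible 6-bit value to a bitmap of the rules whose chunk matches it, and the per-chunk bitmaps are ANDed to form the final match lines.

```python
import math

w = 6  # LUTRAM (6-input LUT) width on the target FPGA

def num_sub_tcams(W):
    """Number of 6-bit sub-TCAM modules needed for W-bit addresses."""
    return math.ceil(W / w)

def build_sub_tcam(rule_chunks):
    """One sub-TCAM as a 2^w-entry table: for every possible 6-bit
    input, a bitmap of the rules whose chunk matches it ('*' chunk
    positions match either bit value)."""
    table = []
    for value in range(2 ** w):
        bits = format(value, "06b")
        bitmap = 0
        for i, chunk in enumerate(rule_chunks):
            if all(c in ("*", b) for c, b in zip(chunk, bits)):
                bitmap |= 1 << i
        table.append(bitmap)
    return table

# Two illustrative 12-bit rules -> two sub-TCAMs of 6 bits each.
rules = ["000001******", "000001000010"]
tables = [build_sub_tcam([r[i * w:(i + 1) * w] for r in rules])
          for i in range(num_sub_tcams(len(rules[0])))]

def tcam_match(addr):
    """ANDing circuitry: intersect the per-chunk rule bitmaps."""
    mls = (1 << len(rules)) - 1
    for i, table in enumerate(tables):
        mls &= table[int(addr[i * w:(i + 1) * w], 2)]
    return mls
```

Under this scheme, the 32-bit configuration reported later needs ceil(32/6) = 6 sub-TCAM stages per rule.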

Pipelining
The performance of a digital design degrades when the longest path from input to output grows, as it reduces the speed of the whole circuit. Pipeline registers are introduced in stages in order to shorten this critical path and improve throughput. This was done recently by D-TCAM [16], which showed improvement by adding flip-flops (FFs) to each distributed RAM, improving the throughput of TCAM on FPGA. We adopt the pipelining procedure from D-TCAM [16] to recover the throughput of our system, which is otherwise degraded by the ANDing circuitry, P-block, encoder, and demultiplexer. The proposed packet classifier achieves a high speed of 119 million packets per second (Mpps) in classifying the incoming packets, while consuming 40% to 53% less power.
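A toy timing model of this argument: the clock period is set by the slowest register-to-register stage, so cutting one long combinational path into balanced registered stages raises the clock rate, and with one packet accepted per cycle the packet rate (Mpps) rises with it. All delay numbers below are invented for illustration, not measured from PackeX.

```python
def throughput_mpps(stage_delays_ns):
    """One packet per clock cycle; the clock is limited by the slowest
    register-to-register stage, so rate (Mpps) = 1000 / max delay (ns)."""
    return 1000.0 / max(stage_delays_ns)

# One long path through TCAM read, ANDing, encoder, P-block, and DEMUX...
unpipelined = [10.0]
# ...versus the same logic with pipeline registers after each step.
pipelined = [2.5, 2.5, 2.5, 2.5]
```

The total latency (sum of stage delays plus register overhead) does not shrink; only the rate at which new packets can enter the pipeline improves.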

Why an FPGA-Based Packet Classifier?
Reconfigurability and high performance can easily be put to work. As described in Section 4, packet classifiers are hardware- or software-based. Hardware-based packet classifiers are faster but mostly fixed. FPGAs are reconfigurable according to the requirements of the system. Building a packet classifier on an FPGA gives us hardware-like performance (speed) together with the dynamic nature (reconfigurability) of the whole system. If the network grows in size, so does the packet classifier, because of the reconfigurable blocks available on FPGAs. Thus, our proposed architecture for packet classification, PackeX, has a simple structure.

Algorithm 1: Forwarding an incoming packet to the appropriate output port.
Input: data packet with destination address (A_D)
Output: corresponding node to which the packet needs to be forwarded (O_P)
  MLs <= A-block[A_D]
  Node_n <= P-block[MLs]
  O_P <= DEMUX[Node_n]
procedure
  [Apply the incoming data packet to the A-block to get the Match-Lines];
  for D_T = 0 to n do
    D_MUX = D_P
    next D_T
  end for
  [Apply the resulting address to the P-block to get the destination output port];
  for D_R = 0 to n do
    if D_A = RAM(i, j) then
      O_P = D_A
      exit
    next D_R
  end for
  [Finish];
end procedure
Note: The A-block is the address block (a TCAM) where the destination addresses are stored, while the P-block is the pointers block (a RAM) where the pointers to the corresponding nodes are stored.

Implementation
We have successfully implemented PackeX on a Xilinx Virtex-7 FPGA using Xilinx Vivado 2018.2. The FPGA device is xc7vx690tffg1157 with speed grade -2. Table 3 shows the implementation results for the 64-rule 36-bit and 128-rule 32-bit packet classifiers, which classify incoming packets at 71 and 119 million packets per second (Mpps) with power consumption of 11 mW and 17 mW, respectively, compared to the state-of-the-art packet classifier.
Most importantly, the proposed design is purely hardware-based, which eliminates the deficiencies of a software-based packet classifier, yet it is reconfigurable just like a software-based design. Thus, PackeX combines the useful properties of both software (i.e., reconfigurability) and hardware (i.e., speed). The experimental results show the feasibility and scalability of PackeX for future software-defined networks through its ability to adapt its structure to the needs of the underlying application. Table 4 shows the hardware utilization of the proposed packet classifier. The design uses under 1% of the available logical resources on the target FPGA, except for the input/output (I/O) pins. The I/Os can be further reduced to optimize the performance of the system in future work. The lookup tables (LUTs) use 0.25%, LUTRAMs 0.44%, and slice registers (SRs) 0.09% of the available resources, which shows the scalability of our proposed architecture. Larger packet classifiers can be implemented with PackeX.
PackeX receives data from the server, classifies it according to the stored addresses/rules, and forwards it to the corresponding node. A node can be anything, including but not limited to a mobile phone, laptop, computer, or another server. Only one of the connected channels is activated to transfer the processed packet.
To implement the P-block, modern FPGAs offer many options, i.e., single-port distributed RAM, single-port block RAM, and single-port ROM. In our proposed design, the number of nodes is fixed to nodes 1, 2, ..., N − 1, and N, as shown in Figure 4.
We use single-port ROM to store the content of the P-block, which provides the information about the node to which the packet needs to be forwarded. It can also be implemented using RAM whenever the system needs a dynamic number of nodes in the network. Table 5 presents a comparative analysis of various approaches in terms of metrics such as dynamic power consumption, latency, and FPGA resource utilization. A comparison of the proposed PackeX-TCAM architecture with the current state-of-the-art FPGA-based TCAMs [36] in terms of normalized power consumption is given by equation (2).

Normalized Power Consumption = mW / [D_R × Performance]. (2)

Figure 5 presents a comparative analysis of the logical resource utilization of existing state-of-the-art TCAM-based packet-classification architectures implemented on FPGA. As shown, compared with other TCAM-based classifiers, our proposed PackeX classifier uses logical resources efficiently.
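Reading equation (2) with D_R as the depth (number of stored rules), the metric can be computed as sketched below; the 17 mW, 128-rule, and 119 Mpps figures are the PackeX numbers quoted in this paper, while competitor values would be substituted from Table 5.

```python
def normalized_power(power_mw, depth_rules, performance_mpps):
    """Equation (2): dynamic power normalized by depth (D_R) and
    performance, so classifiers of different sizes and speeds can be
    compared on a common scale (lower is better)."""
    return power_mw / (depth_rules * performance_mpps)

# 128-rule 32-bit PackeX configuration (17 mW at 119 Mpps, Table 3).
packex = normalized_power(17.0, 128, 119.0)
```

A design that burned twice the power at the same depth and rate would score twice as high, i.e., worse, on this metric.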
Hence, the efficient utilization and smaller number of resources of the PackeX-I and PackeX-II approaches lead to lower power consumption without compromising throughput.

Conclusions
Packet classification is a key operation needed in provisioning several important network services. One of the major challenges in designing next-generation high-speed switches is to deliver high-speed, low-power packet classification. TCAM-based labeling of packets is a de facto standard for high-performance packet processing. However, due to its inherently parallel structure, high cost and high energy consumption are the major challenges to its efficient usage/implementation. TCAMs are the dominant industry standard used for multibit classifiers; however, as packet-classification policies grow more thorough and complex, a fundamental tradeoff arises between the TCAM space and the number of addresses for hierarchical policies. This paper proposed PackeX, a novel memory-based architecture for solving important problems in packet classification. PackeX provides higher throughput in packet forwarding than existing designs with lower power consumption. The proposed packet-classification design classifies incoming packets with a simple structure, provides high throughput, consumes less power than existing designs, and uses fewer hardware resources. It is reconfigurable when the network needs modification, making PackeX favorable for future software-defined networks. Our design does not require any change to existing packet-classification systems and can be easily deployed. To the best of our knowledge, this is among the first works to study TCAM speed and memory optimization for packet classification.
In our current work, PackeX is deployed with fast mapping and updating algorithms that eliminate the complications of sequential steps in generating lookup tables (LUTs) and of iterative procedures for calculating the required packet address. Future work may include deploying submodules at locations that balance energy consumption and speed up the process. This may be possible by using a statistical distribution for the deployment of modules with respect to horizontal and virtual partitioning. In addition, further pipelining and partitioning of the proposed design can be investigated in the future.

Data Availability
The data that support the study's outcomes are all briefly introduced, and all information is included in the manuscript.

Table 5: Power consumption of PackeX compared with RE-TCAM [32], G-AETCAM [33], CLB-TCAM [34], and REST [35].