A CRC-Based Classifier Micro-Engine for Efficient Flow Processing in SDN-Based Internet of Things

,


Introduction
Our world is connected by Internet of things (IoT). In the past few years, the considerable growth of network bandwidth and development of hardware technologies, especially in mobile communications, have led to a significant increase in the speed of communication lines of this worldwide network [1,2]. at is, the speed of communication lines is reached to higher than "terabits per second." e SDN paradigm aims to achieve good performance in managing networks by accelerating routers and switches to process the packets with the rate of network links [3,4]. Making SDN flexible enough to satisfy the different requirements of heterogeneous IoT applications is desirable in terms of software-defined IoT (SD-IoT) [5,6]. For this purpose, network devices are equipped with a new mechanism, naming packet classification, which lets them to be flowaware.
at is, the network device, first classifies the incoming packets into predefined flows according to a set of filters, then any further processing is done accordingly. erefore, a variety of packet processor devices including routers, firewalls, intrusion detection systems, account management systems, and network management systems use packet classification [1,[7][8][9][10].
at is, a number of important network management functions such as access control, quality of service provisioning, firewall, traffic policing, and policy-based switching make use of packet classification.
ere are five fields in a typical classification rule including source and destination IP addresses (SA and DA), source and destination port numbers (SP and DP, respectively), and protocol (PT). SA and DA are address prefixes, SP and DP are number ranges, and the PT field may be either a specified value or a wildcard. e order of rules in a rule set determines their priority. e last rule is the default rule in which all the five fields are equal to the wildcard. If an incoming packet matches more than one rule, the action corresponding to the rule with the highest priority is performed. A description for packet classification algorithms is found in [1].
Classification of Internet packets in network devices is conducted through either software-based or hardware-based approaches. Considerable time overload of software-based methods makes them less popular among network equipment manufacturers [11]. On the other hand, there is a widespread tendency towards hardware-based methods and their higher throughput rate and lower delay [12]. Hardware-based implementation of packet classification algorithms may categorized into two groups. e first group consists of algorithms based on parallel search in the content addressable memory (CAM) chips Z-TCAM [13], E-TCAM [14], and ZI-CAM [15]. In spite of their relatively high speeds, the use of ternary memories in these algorithms leads to disadvantages such as excessive power consumption, lower speed than other memory cells, lack of scalability, undue consumption of chip resources, and high prices. e second group consists of algorithms such as decision tree, decomposition tree, geometric space, field encryption, and similar methods which are realized on programmable hardware devices like ASIC or FPGA. e main challenge in designing hardware-based methods is increasing the ratio of throughput to design cost. e throughput of a classifier is the number of packets that are classified in unit time. e required memory space is the main indicator of the system design cost. To reach this optimal point, we propose a micro-core that lowers memory consumption and simultaneously increases the classification throughput. e chief contributions of this paper are as follows: (1) e proposed micro-core uses SRAM and BRAM cells, which allow for dual-port implementations. FPGA shows that the memory cost reduces to 14.5 bytes per rule, and simultaneously the throughput of the classifier reaches 324 Mpps. is result confirms the superiority of the proposed architecture to its counterparts. e rest of this article is organized as follows. In Section 2, the related works on hardware-based packet classification systems are reviewed. e proposed microclassifier architecture is explained in Section 3. e performance evaluation of the proposed architecture is presented in Section 4 after introducing the metrics. Finally, conclusions and directions for future research are discussed in Section 5.

Related Work
So far, a wide variety of hardware-based classifier architectures have been proposed for packet classification. All of them attempt to increase the throughput and decrease the memory usage. e CAM-based classifier architectures benefit from the parallel search property of CAM modules but suffer high implementation costs and high levels of the consumption power. In [16,17], two of the most recent architectures are proposed. ey utilize a pipelined decision tree algorithm and a ternary memory, respectively. e architecture proposed in [16] has achieved a throughput of 103 Gbps, which is the highest among all the works mentioned here. However, depending on its hardware parameters like the number and length of pipelines on the distribution of the values of the classifier's rule fields, any updating requires reconfiguration of the architecture. On the other hand, the memory usage of this architecture is 63.5 bytes per packet. In [17], the researchers used ternary memories of the size 52 * 144 and were able to reduced memory usage down to 18 bytes per rule; however, their maximum throughput was as low as 38 Gbps. e architectures proposed in [18,19] could achieve a throughput of 100 Mpps while keeping the memory usage at 23.5 and 17.4 bytes, respectively. e focus of the architecture in [18] is on the rule search, and it does not address the issue of longest prefix matching (LPM). e architecture proposed in [19] adopts a TCAM-based approach to packet classification. Its major drawback is linear growth of TCAM usage is proportional to the number of rules that increases consumption of chip resources and power. Implementation of a merge algorithm based on decomposition tree in [20] achieves a throughput of 94 Gbps (amounting to 147 Mpps, given that each packet is 40 bytes). e study does neither provides the memory usage nor suggests any solution for updating rules.
Some classifiers like that presented in [21] use a special model to accelerate accessing the memory containing the rules, which in turn raises their memory consumption. Pipelined implementation of packet classification algorithms seeks appropriate solutions to reduce the number of pipeline stalls and the required memory space. For example, in the pipelined packet classifier of [22], the memory consumption varies from 16 to 24.5 bytes per rule.
To overcome the abovementioned disadvantages, we propose a packet-classifying micro-core with low memory consumption and high throughput. e proposed microcore makes use of SRAM and BRAM cells, which allow for dual-port implementations. Processing based on cyclic redundancy check (CRC) codes in the internal structure of the micro-core without any need for ternary memories reduces the consumption of FPGA hardware resources and the time required for memory access in this classifier.

Proposed Architecture
is section explains the architecture of the proposed classifier which is aimed at increasing packet classification speed. Underlying the architecture are two principles: first, making use of BCAM memory in prefix matching and, second, using hash codes to reduce memory usage.
In this classifier, a set of processing micro-cores act like a CAM, each one storing the information about one rule field of the rule set. e architecture is shown in Figure 1. In this architecture, n micro-cores are defined for each of the fields used for packet classification, where n is the number of rules in the classifier representing the number of micro-cores per field.
e incoming packets are classified as follows: First, the packets are transmitted through a shared bus to the microcore unit. As soon as a packet enters a micro-core, the fields of source and destination IP addresses of the header are read by Parallel Hash Calculator. Next, in a parallel manner and proportional to the length of the value of the Prefix register, CRC-16 generation process is performed on the input address and the result is stored in the Temp register. Each micro-core has a control unit that, in addition to controlling the function of the micro-core, manages the generation process as well as the process of matching the hash code generated and stored in the Temp register against the hash code of the prefix field of the corresponding rule of the micro-core in the Hash-of-rule register. If the hash code of the incoming packet header matches the hash code in the micro-core, the Adder will add one to the variable stored in the Rank register. Moreover, in the case of correct matching, the one-bit flag Match and the corresponding bit of the micro-core in the n-bit register Packet Matching will be set. If matching fails, these bits will be reset by default. e Packet Matching register is used to record matches or mismatches in other micro-cores. In this register, any bit with a value of one denotes a match and any bit with a value of zero denotes a mismatch in the micro-core corresponding to a field that is being searched. After matching all fields, the results are written to the bits corresponding to each field in the Packet Matching register. Next, the result of logical AND operation on all Packet Matching registers is stored in Matched Vector. Finally, a prioritized decoder selects the matching rule with the highest priority.
As seen in Figure 1, the classifier consists of a set of processing micro-cores. In the following, we shall discuss the internal architecture of the micro-cores that is illustrated in Figure 2. Also, the length of each register of the micro-cores is shown in the bottom of Figure 2.
e main body of the micro-core is composed of two modules, i.e., CRC Calculator and Controller.
e Controller module is responsible for management of all control lines, inputs, and outputs. Operations in this unit include management of selection lines, injection, and updating of registers. In fact, this unit is the decision-maker of the microcore, which is separately controlled by the main controller of the classifier. In other words, the functioning of the microcores is not disrupted by updating and changing one of the micro-cores. e second important module of the micro-core that bears a major part of its processing load is CRC Calculator. Figure 3 shows the function of this module. It receives the incoming packets and calculates their hash code in parallel (Line 1 of Algorithm 1). For this purpose, each IP address along with the corresponding prefix which has been already stored in the Prefix register enters the module. Next, a hash code is computed using them and sent to the Controller for the purpose of matching. Implementation of this module consumes 40 out of 204000 LUTs on Virtex-6. e micro-core has 7 input pins and 2 output pins. Table 1 lists the input and output ports of the micro-core along with their length in bits as well as their description. A micro-core is composed of registers, hash generator modules (CRC Calculator), and a controller.
One set of the input pins of the micro-core is named "Select," which determines the operation mode of the microcore (Line 2 of Algorithm 1). In fact, this pin is responsible for management of the micro-core's functioning. It is connected to a two-bit bus which is used to address different modes of the micro-core function as described below (also, see Table 2): Mode 0: the micro-core performs its main task, which is the packet classification (Lines 3-7 of Algorithm 1). Mode 1: when the address from the Address port is identical with that in the Address register, the information in the selected micro-core is updated by means of the information from the Prefix and Rules ports (Lines 9-11 of Algorithm 1). Mode 2: each micro-core sends the information related to its rank into a shared output bus. is operation is aimed at selecting the best candidate for being removed by the classifying controller. e central controller stores the number of the classifier with the lowest rank to update the rules (Lines 12-13 of Algorithm 1). Mode 3: the majority of the existing classifiers do not offer a dynamic solution for updating rules and perform this task by reconfiguration of the chipset. However, given its low-rank feature, our proposed architecture enables us to update the rules under certain conditions. It is easy to inject new rules without changing the order of the existing rules. e last mode is Select(1, 1), which is used to initialize the micro-cores by injecting rules into each micro-core (Lines 14-17 of Algorithm 1).
Before injecting the packets into a processing microcore, the rules are injected. In this step, the rules that have been converted to hash codes are stored in the Hash Code register. Prefix and Address registers keep the prefix length and the address of the processing core of the rules. Flag belongs to the internal controller which manages the input/ output operation inside the micro-core so that the acts of processing the input packets would not overlap. Rank register has a length of 32 bits and holds the correct matches between the incoming packets and the matched field in the micro-core. For each correct match in the micro-core, one is added to the value of this register. is is a criterion used for updating and removing rules in other processing micro-cores.
us, the micro-core with the lowest rank is selected for removing and updating. In fact, in this classifier, a lower rank is indicative of decreased use of the rule in the rule set.

Implementation and Evaluation
In this section, the results of implementation of the proposed micro-core architecture are discussed. is architecture is implemented on a XC6VLX75T chip from Virtex-6 FPGA family with a Xilinx ISE 14.7 simulator using VHDL language. Experiments are done on a system with the characteristics mentioned in Table 3. e evaluation criteria in this experiment are throughput and memory. roughput refers to the number of packets classified in a second. Assuming a minimum of 40 bytes for each packet [20], we use Gbps instead of Mpps for measuring the throughput. Memory usage is measured in bytes for each rule in the classifier.
We use Classbench tool to generate rules and experimental packets in our experiments. Classbench runs on Linux platform and is used for generating rulesets with desirable distributions in the model of the geometric space of rules. It generates rules and corresponding headers by using a set of input distribution parameters [23]. e highest throughput achieved so far belongs to Chang [16], which is 103.53 Gbps. However, it is 0.150 Gbps less than our throughput rate. Also, the memory usage for storing the classifier's rules in that architecture is four times greater than in our method. In fact, Chang's architecture resembles traditional TCAM-based architectures in terms of memory usage. Table 4 compares the proposed micro-core architecture with the existing architectures. With a clock frequency of 170 MHz, the processing time of each micro-core is in the worst case 6.2 nanoseconds per packet and power consumption is 118 mW. With a dual-port memory which can process two packets simultaneously, the proposed microcore is able to process 324 million packets of at least 40 bytes in a second which amounts to a throughput of more than 100 Gbps. In this simulation, our architecture used 137 Slice registers and 182 search tables.
In Table 4, the proposed micro-core architecture is compared with major recently proposed counterparts in terms of throughput and memory usage per rule. From among these architectures, Jiang and Prasanna [19] and Irfan et al. [17] require the least amount of memory, i.e., 17.4 and 18 bytes per rule, respectively. With a required memory of 14.5 bytes per rule, our proposed micro-core outperforms these two architectures. e major reason behind the low memory usage in our method is that, in contrast to the two mentioned architectures, our classifier does not rely on TCAM and mask for matching operation.

Mobile Information Systems
Updating rules and propertyof micro-cores 9-11 10 Outputting rank of micro-cores on BUS 12- 13 11 Initializing the selected micro-core 14-17 For this purpose, the throughput as well as the memory consumption of the proposed architecture is measured on the chipsets that are used in the evaluation of the competitor designs. e memory usage per rule is always constant in the proposed design. erefore, the performance per memory of the proposed design is the highest as compared to its major counterparts.
e ratio of superiority of the efficiency of the proposed method with respect to each counterpart is illustrated in Figure 4. Our comparisons suggest that the proposed architecture has considerably improved the throughput rate and memory usage of the Internet packet classification systems.
e performance per memory of the proposed design is at least 18% and at most 990% better than the best and the worst designs, namely, Pao and Lu [22] and Orosz et al. [25].

Conclusion
In this paper, we proposed a new micro-core architecture for classification of Internet packets that is capable of being updated. e proposed architecture allows for adding or removing rules during processing and enjoys lower memory usage as well as higher throughput rate in comparison with other architectures. Our evaluations suggest that this micro-core can classify packets with a throughput of more than 103 Gbps, which amounts to about 324 Mpps. Another advantage of this architecture is that memory usage per rule is always constant. erefore, the performance per memory of the proposed design is the highest as compared to its major counterparts.
is achievement helps the proposed micro-core to avoid the problem of resource requirement in the longest prefix matching (LPM). A fruitful topic for future research would be to apply this inherent feature of the micro-core to the implementation of pipeline classifiers in which LPM is a great problem in pipeline processing.

Data Availability
e data used to support the findings of this study are available from the corresponding author upon request.