An Energy-Efficient Silicon Photonic-Assisted Deep Learning Accelerator for Big Data

Deep learning has become the mainstream technology in artificial intelligence (AI) because it can match human performance in complex tasks. However, in the era of big data, ever-increasing data volumes and model scales mean that deep learning requires enormous computing power at an acceptable energy cost. For electrical chips, including most deep learning accelerators, transistor performance limitations make it challenging to meet the energy efficiency requirements of computing. Silicon photonic devices are expected to replace transistors and become the mainstream components in computing architectures thanks to their low energy consumption, large bandwidth, and high speed. Therefore, we propose a silicon photonic-assisted deep learning accelerator for big data. The accelerator uses microring resonators (MRs) to form a photonic multiplication array. It combines this array with wavelength division multiplexing (WDM), a technique specific to photonics, to compute multiple input feature maps and convolution kernels in parallel at the speed of light, promising improvements in both energy efficiency and calculation speed. The proposed accelerator achieves at least a 75x improvement in computational efficiency compared to the traditional electrical design.


Introduction
In a modern society driven by big data, artificial intelligence (AI) has brought great convenience to human life. As an indispensable tool for solving complex problems in AI, deep learning has been used in many applications, e.g., image and speech recognition, machine translation, self-driving, the Internet of Things (IoT), 5th generation (5G) mobile networks, and edge computing [1][2][3][4][5][6][7][8][9][10][11][12][13]. Deep learning uses effective learning and training methods to discover the inherent rules in the data, thus helping machines to perform advanced reasoning tasks as human beings do. In deep learning, convolutional neural networks (CNNs) are considered the most representative framework because of their simple structure, few parameters, effective feature extraction, and high recognition rate [14,15]. Because of the enormous amount of data involved, efficient CNN inference places high demands on computing. Therefore, developing hardware inference accelerators that provide strong computing power is the key to meeting the needs of CNNs.
At present, hardware accelerators that perform CNN operations mainly include GPUs, ASICs [16], FPGAs [17], the TPU [18], and the emerging near-data-processing accelerator ISAAC [19]. However, current accelerators rely heavily on data movement, and the energy consumed by moving data over electrical wires is even greater than the energy consumed by the computation itself. Because of the widening gap between abundant data and a limited power budget, the energy crisis of these electrical accelerators remains unresolved. Limited by the transmission rate of electrical wires, the calculation speed and throughput of these accelerators may not keep up with the increase in power, resulting in limited throughput per second per watt.
Recently, silicon photonic technology has emerged as a promising solution to the issues above [20][21][22][23][24][25]. Firstly, the power consumption of a transistor-based circuit scales with f^3 (f is the clock frequency), whereas a photonic circuit only consumes power proportional to f, so the photonic circuit can provide ultralow energy consumption [26]. Secondly, light has a very low transmission delay on a chip, typically 0.14 ps for 10 microns, which is 1-2 orders of magnitude faster than the transistor-based circuit [27]. Finally, the photonic circuit is insulated and strongly resistant to electromagnetic interference.
Furthermore, benefitting from the steady development of photonic integration technology and manufacturing platforms, various mature active and passive building blocks have been demonstrated experimentally, such as modulators, photodetectors, splitters, wavelength multiplexers, and filters [28][29][30][31]. Based on these photonic devices, photonic computing elements such as photonic adders, differentiators, integrators, and multipliers can be realized [32][33][34][35]. If photonic devices can be successfully applied to the design of CNN accelerators, they are expected to improve energy efficiency in deep learning significantly. In addition, by utilizing optical multichannel multiplexing technologies, such as wavelength division multiplexing (WDM) [36][37][38], we can exploit the speed of light to achieve massively parallel computing and significantly improve the inference speed of CNNs.
Thus, we propose a silicon photonic-assisted CNN accelerator for deep learning. We first use mature microring resonators (MRs) as the basic unit to design a photonic matrix-vector multiplier (PMVM) that performs the most complex operation in CNNs, the convolution. Then, we introduce an analytical model to determine the number of MRs used, the power consumption, the area, and the execution time in each layer of the CNN. Finally, we introduce our PMVM-based photonic-assisted CNN accelerator architecture and its workflow. The simulation results show that our accelerator can increase the CNN's inference speed by at least 75 times under the same energy consumption compared with current electrical accelerators.
The rest of the paper is organized as follows. Section 2 briefly discusses the related works. Section 3 discusses the proposed PMVM and accelerator architectures, followed by Section 4 presenting the performance evaluation of the silicon photonic-assisted accelerator. Section 5 concludes this paper.

Related Work
In this section, we first describe CNNs' structure and computing process in deep learning. Then, we introduce photonic devices that might be used. These related works can be used as the guide for our research on the photonic-assisted accelerator design.
2.1. Convolutional Neural Network (CNN) Basics.
A CNN is composed of multiple stacked computation layers for feature extraction and classification. Compared to fully connected neural networks, which are simple to train but have limited scalability, a CNN has very deep convolutional (CONV), pooling (POOL), and full connection (FC) layers. Therefore, it can achieve high accuracy [14]. In each CONV layer, the input maps are convolved with the kernels to generate output feature maps, which are increasingly abstract representations of the input. After nonlinearity and pooling, the output features can be used as the input for the next layer. After multiple CONV and POOL layers, the features are sent to the FC layers, which finally output the classification results. The CONV layers take more than 90% of the calculation time [39]. Therefore, designing an accelerator optimized for CONV layers can significantly improve the entire CNN's performance. Figure 1 shows a CONV layer. It has M 3D convolutional kernels with size S × R × C and N input maps with size W × H × C. The M kernels perform M 3D convolutions on the input maps with a sliding stride of S and generate an E × F × M output map. In each output map, the value of the element (m, f, e) is computed by applying the activation function σ(·) to the sum, over all C channels and all S × R kernel positions, of the products of the input elements inside the sliding window and the corresponding kernel weights. Here, I, K, and O are the input, kernel, and output matrices, respectively, and σ(·) is an activation function such as ReLU or sigmoid. The pseudocode to perform this normal convolution operation is shown in Figure 1. Note that in each layer, all kernels share the same input data. Therefore, if the accelerator supports multiple kernels that simultaneously convolve with the same input data, the number of buffer accesses is reduced. The cycle time can also be reduced, thereby increasing the throughput. As shown in the pseudocode, assuming the input map can be reused by G_m kernels simultaneously, the total number of convolution cycles is reduced by a factor of G_m. The size of G_m is determined by the accelerator.
Therefore, designing the corresponding accelerator architecture to maximize this data reuse capability is the paper's primary motivation.
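The data reuse argument above can be illustrated with a minimal Python sketch. It is not the pseudocode of Figure 1; the function name `conv_layer`, the nested-list layout `inputs[c][y][x]` and `kernels[m][c][i][j]`, the choice of ReLU as σ, and the `window_reads` counter are all assumptions made for illustration. The sketch fetches each input window once per group of `g_m` kernels, so the number of window reads falls by a factor of `g_m` while the outputs stay the same.

```python
def conv_layer(inputs, kernels, stride=1, g_m=1):
    """Direct convolution with ReLU (illustrative sketch, not the paper's code).

    inputs[c][y][x]  : C input channels of size H x W
    kernels[m][c][i][j]: M kernels of size C x S x R
    g_m              : number of kernels that share one fetched input window
    Returns (outputs[m][e][f], window_reads).
    """
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    M, S, R = len(kernels), len(kernels[0][0]), len(kernels[0][0][0])
    E = (H - S) // stride + 1
    F = (W - R) // stride + 1
    out = [[[0.0] * F for _ in range(E)] for _ in range(M)]
    window_reads = 0
    for m0 in range(0, M, g_m):              # one pass per group of g_m kernels
        for e in range(E):
            for f in range(F):
                window_reads += 1            # the window is fetched once per group
                for m in range(m0, min(m0 + g_m, M)):
                    acc = 0.0
                    for c in range(C):
                        for i in range(S):
                            for j in range(R):
                                acc += (inputs[c][e * stride + i][f * stride + j]
                                        * kernels[m][c][i][j])
                    out[m][e][f] = max(acc, 0.0)   # ReLU activation
    return out, window_reads
```

Calling `conv_layer(inputs, kernels, g_m=1)` and `conv_layer(inputs, kernels, g_m=4)` on a layer with four kernels should give identical outputs, with the second call reading each window a quarter as often.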

Silicon Photonic Devices.
Microelectronic devices are the basis of current CNN accelerators. However, as feature sizes shrink, electronic information processing is approaching its limits. Silicon photonic devices offer a promising route around the electrical processing bottleneck thanks to their low loss, high speed, low energy consumption, and compatibility with CMOS platforms. Among the various silicon photonic devices, MRs are considered the most critical for photonic computing because of their excellent wavelength selectivity, small size, high modulation rate, low energy consumption, and high quality factors [40,41]. Figure 2 shows two commonly used MR structures: the all-pass MR (Figure 2(a)) and the 1 × 2 cross-MR (Figure 2(e)). An all-pass MR consists of one straight waveguide and one MR. Assume that the resonant wavelength of the MR is λ_mr and the input signal wavelength is λ_in. When λ_in = λ_mr, the input signal is wholly coupled into the MR, so the signal power output from the through port is zero (transmittance rate is 0). When λ_in ≠ λ_mr, the coupling between the input waveguide and the MR weakens, and when it is weak enough, the signal is output from the through port (transmittance rate is 1). When the MR's resonance wavelength is between λ_1 and λ_2, the transmittance rate of the MR is between 0 and 1.

Wireless Communications and Mobile Computing
Therefore, we can use the resonance effect of the MR to adjust the output power and thereby realize photonic multiplication. For instance, as shown in Figure 2(a), assume that the input optical signal power is A and the transmittance of the MR is B (0 ≤ B ≤ 1). When the input optical signal passes through the MR, a fraction (1 − B) of the light is coupled into the MR, and the output optical power at the through port is C = A × B. Usually, the transmittance rate of the MR (B) is changed by applying a bias voltage, which tunes the MR through the thermo-optic or electro-optic effect. According to [34], each MR can store more than 16 levels of transmittance rate (i.e., 4 bits). Therefore, for a 16-bit floating-point calculation [19], only 4 MRs are needed. Since the multiplication in both of the above structures is carried out in the optical domain, they have a high processing speed, making them ideal choices for photonic multiplication units.
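The multiplication C = A × B with a 16-level transmittance can be sketched in a few lines of Python. The function names and, in particular, the assumption that the 16 levels are evenly spaced between 0 and 1 are illustrative choices; the source only states that more than 16 levels (4 bits) can be stored per MR.

```python
LEVELS = 16  # 4 bits per MR, as stated in the text

def quantize_transmittance(b, levels=LEVELS):
    """Snap a target transmittance in [0, 1] to the nearest stored level.

    Assumption: the levels are evenly spaced; a real device may use a
    different (e.g., calibrated) set of levels.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("transmittance must lie in [0, 1]")
    return round(b * (levels - 1)) / (levels - 1)

def mr_multiply(a, b):
    """Through-port output power C = A * B for input power a and target transmittance b."""
    return a * quantize_transmittance(b)
```

Under this assumption the worst-case quantization error of a single MR is half a level step, i.e., 1/30 of full scale.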

Silicon Photonic-Assisted CNN Accelerator Architecture Design
To use silicon photonic technology to improve the calculation rate in deep learning, we first propose a PMVM based on photonic devices in this section. Then, we create a photonic-assisted CNN accelerator architecture based on the PMVM.
3.1. Silicon Photonic Matrix-Vector Multiplier.
Matrix-vector multiplication is the most important operation in a CNN. Therefore, in this section, we use basic photonic devices to construct a PMVM and map the input feature map and kernel weight data onto it to complete the parallel multiplication operation. Figure 3 shows the PMVM architecture. It relies on an all-pass MR-based input matrix and a 1 × 2 cross-MR-based kernel matrix. Current CNNs have tens of kernels in each layer that convolve the same set of input data. Therefore, in the PMVM, we multiplex the input data so that it is convolved with multiple kernels simultaneously, reducing the waste of time.
The resonant wavelength λ_res of an MR depends on its effective refractive index n_eff and its radius R. Therefore, in this paper, we use MRs with different radii to realize different resonance wavelengths.
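For reference, the dependence of the resonant wavelength on the ring radius can be written with the standard ring-resonator resonance condition. This is the textbook form and may differ in notation from the authors' own equation; the integer mode order m is a symbol introduced here and does not appear in the surviving text.

```latex
\lambda_{\mathrm{res}} = \frac{2\pi R \, n_{\mathrm{eff}}}{m}, \qquad m = 1, 2, 3, \dots
```

For a fixed mode order and effective index, a larger radius R shifts the resonance to a longer wavelength, which is why rings of different radii can address different WDM channels.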
As shown in Figure 3, the weight value at coordinate (i, j, n) in the m-th kernel is represented by the drop-port transmittance rate of the MR in column m and row ((n − 1) × S × R + (i − 1) × S + j) of the crossbar array, where 0 < i ≤ S, 0 < j ≤ R, 0 < n ≤ C, and 0 < m ≤ M. According to CNN's characteristics, the state of all MRs in the kernel matrix remains unchanged during the inference process. In the PMVM, the feature data of the input feature maps are mapped to the MRs of the input matrix and updated as the sliding window moves. For example, in Figure 3, assuming the stride of the sliding window is 1, the value of the MR with wavelength λ_{1,1} is b_{1,1,1} at time t_1, and it is updated to b_{1,2,1} at time t_2. In this PMVM, the multiwavelength optical signals emitted by the lasers are injected from the input port of the input matrix and output from the kernel matrix after the photonic multiply-accumulate (MAC) operation. The output power is the sum of all wavelength signals. As shown in Figure 3, at time t_1 each output of the PMVM is therefore the sum, over all wavelengths, of the products of the input values and the corresponding kernel weights. The PMVM thus enables all MAC operations to finish with high parallelism. According to [39], the number of multiplexed wavelengths can reach 128. Thus, the computation speed of the PMVM will be 128 × 128 × 10 × 10^10 = 1.6384 × 10^15 MAC/s when all MRs work at 10 Gb/s modulation speed.
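The mapping and the per-cycle MAC described above can be sketched as follows. This is a purely numerical model under stated assumptions: `kernel_row` reproduces the row expression given in the text with 1-based indices, while `pmvm_step`, its argument names, and the representation of transmittances as plain floats are hypothetical and ignore all optical effects such as loss and crosstalk.

```python
def kernel_row(i, j, n, S, R):
    """1-based row of the kernel matrix that stores weight (i, j, n).

    Uses the expression from the text: (n - 1) * S * R + (i - 1) * S + j.
    """
    return (n - 1) * S * R + (i - 1) * S + j

def pmvm_step(input_values, kernel_matrix):
    """One PMVM cycle (idealized).

    input_values[r]    : input-matrix transmittance carried by wavelength r
    kernel_matrix[r][m]: drop-port transmittance of the MR in row r, column m
    Returns one accumulated output power per kernel column.
    """
    rows = len(kernel_matrix)
    cols = len(kernel_matrix[0])
    return [
        sum(input_values[r] * kernel_matrix[r][m] for r in range(rows))
        for m in range(cols)
    ]
```

For a square 3 × 3 × 3 kernel, `kernel_row` visits rows 1 to 27 exactly once, so each weight gets its own MR in a column. All columns share the same `input_values`, which is the input reuse the PMVM is designed for.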

Silicon Photonic-Assisted Accelerator Architecture Design.
Based on the PMVM, we propose a photonic-assisted CNN accelerator architecture, as shown in Figure 4. The accelerator consists of multiple CONV layers, pooling layers, and FC layers, and all layers are processed sequentially. The distribution of layers can be adjusted to suit different CNN models. The proposed PMVM is deployed in the CONV layers. The input matrix and kernel matrix values are read from the off-chip DRAM (the off-chip DRAM data are sent to the on-chip buffer first). Once the CNN model is sufficiently trained, the weight values of the kernels in each layer are fixed and programmed into the PMVMs by configuring each MR's transmittance rate in the kernel matrix. During the whole process, only the values of the input matrix are updated. After the highly parallel MAC operations, the output optical signals are converted into electrical signals by photodetectors (PDs) and then activated and pooled. This process is very fast because the operating frequency of all the photonic devices, e.g., lasers, MRs, and PDs, can reach tens of GHz. The calculation results are stored back to the off-chip DRAM, from which the next layer reads them. After multiple layers of convolution, pooling, and full connection operations, the accelerator outputs the final inference results.

Simulation Evaluations
In this section, we use a widely adopted deep learning accelerator simulator, FODLAM [42], to evaluate the performance of our accelerator. FODLAM totals the latency and energy for each layer, including the storage and read/write costs of the intermediate layers. The photonic part of our accelerator is simulated with a professional optical simulation platform, Lumerical Solutions [43]. The configuration parameters of the other accelerators are taken from the prior art as referenced.

Photonic Matrix Multiplication Function Verification.
The photonic vector multiplication results of B × W at different working frequencies are shown in Figure 5. Assuming the matrix size is 4 × 4, we perform the simulation using four CW lasers with different working wavelengths. The input matrix (B = [b1; b2; b3; b4]) is modulated by four 2^7 − 1 pseudorandom binary sequences (PRBS) from the pattern generators. The values in the kernel matrix W are randomly generated and programmed once into the corresponding MRs. It can be seen from Figure 5 that when the PMVM works at 1.28 GHz, the simulation results are almost the same as the ideal results. Although some error appears as the operating frequency increases, the designed PMVM still maintains good calculation accuracy at an operating frequency of 25 GHz.

Area and Power Consumption Evaluation Models.
The area of the PMVM depends on the number of MRs. According to [44], each MR unit occupies 25 μm × 25 μm and consumes 0.025 mW of power. The size of the kernels determines the number of MRs used in the PMVM. For example, the first CONV layer of the AlexNet architecture contains 96 kernels, and the size of each kernel is 11 × 11 × 3. Assuming that a set of input data completes all convolution operations of this layer within one cycle, the PMVM of this layer theoretically needs 69,696 MRs. The area and power of the PMVMs in this layer are then 43.56 mm^2 and 1.74 W, respectively. Because of current technological limitations, it is difficult to integrate so many MRs on a single chip. Therefore, multiple interconnected chips are usually used to complete the above functions [19,39]. Figure 6 shows the number of MRs, occupied area, and power consumption in each convolutional layer of AlexNet. It can be seen that the fourth layer of AlexNet has the largest consumption because this layer has the largest convolution kernel.
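The area and power figures quoted for the first AlexNet CONV layer follow directly from the per-MR values. The short Python sketch below reproduces that arithmetic; the function and constant names are hypothetical, and the MR count is taken from the text as an input rather than derived, since the counting rule is not spelled out here.

```python
MR_SIDE_UM = 25.0     # side length of one MR unit in micrometres, per [44]
MR_POWER_MW = 0.025   # power of one MR unit in milliwatts, per [44]

def pmvm_area_mm2(num_mrs):
    """Total MR area in mm^2 (1 mm^2 = 1e6 um^2)."""
    return num_mrs * MR_SIDE_UM ** 2 / 1e6

def pmvm_power_w(num_mrs):
    """Total MR power in watts."""
    return num_mrs * MR_POWER_MW / 1e3

# MR count quoted in the text for the first CONV layer of AlexNet.
ALEXNET_CONV1_MRS = 69_696
print(pmvm_area_mm2(ALEXNET_CONV1_MRS), pmvm_power_w(ALEXNET_CONV1_MRS))
```

With 69,696 MRs this gives about 43.56 mm^2 and 1.74 W, matching the values stated above.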

Execution Time Evaluation Models.
As mentioned in the previous section, our PMVM can compute the convolutions of multiple kernels in parallel for a single set of input data within one cycle. In AlexNet, the length and width of the input patches are the same. Assume that the size of the input patches is W × W, the kernel size is K × K, the padding size is P, and the stride is S. The number of convolution calculations for each input patch is then the number of sliding-window positions, $N_{conv} = \left(\frac{W - K + 2P}{S} + 1\right)^2$. Thus, the computation time of each input patch is $T = N_{conv} / f_{PMVM}$, where f_PMVM is the operating frequency of the PMVM. Assuming P = 0 and S = 1, the execution time results for each layer of AlexNet are shown in Table 1 when the working frequency of the PMVM is 25 GHz.
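A minimal Python sketch of this execution-time model is given below. It assumes the standard sliding-window count, ((W − K + 2P) / S + 1)^2 with integer division, and one window position per PMVM cycle; the function names are hypothetical, and the 227 × 227 input size used in the example is a commonly quoted AlexNet value chosen for illustration, not a figure taken from Table 1.

```python
def conv_cycles(w, k, p=0, s=1):
    """Number of sliding-window positions for a W x W patch and K x K kernel."""
    return ((w - k + 2 * p) // s + 1) ** 2

def execution_time_s(w, k, f_pmvm_hz, p=0, s=1):
    """Time per input patch, assuming one window position per PMVM cycle."""
    return conv_cycles(w, k, p, s) / f_pmvm_hz

# Illustrative example: 227 x 227 input, 11 x 11 kernel, P = 0, S = 1, 25 GHz.
print(conv_cycles(227, 11), execution_time_s(227, 11, 25e9))
```

Because all kernels of a layer are evaluated in the same cycle, the kernel count does not appear in the model; only the number of window positions and the PMVM clock do.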

Inference Performance.
To fully evaluate our accelerator's inference performance, we consider its energy efficiency in the simulation, measured in MAC/s/watt. We compare our accelerator with a GPU, an FPGA, the TPU, and the ReRAM-based CNN accelerator ISAAC. The CNN architectures are AlexNet, LeNet-5, and ResNet-18, and the datasets are ImageNet (AlexNet and ResNet-18) and MNIST (LeNet-5). In the simulation, we use the parameters of the electrical devices listed in Ref. [19]. The simulation results for MAC/s/watt are shown in Figure 7. Compared to the other, electrical accelerators, our accelerator increases energy efficiency by at least 75 times because it uses the advantages of silicon photonics to increase computing speed while reducing energy consumption.

Conclusions
This paper proposed a silicon photonic-assisted CNN accelerator to maximize inference performance in deep learning. It achieves a high inference throughput by exploiting MRs with a high modulation rate together with WDM technology. The proposed accelerator achieves at least a 75x improvement in computational efficiency compared to the state-of-the-art designs. However, a photoelectric hybrid CNN accelerator must match the operating frequency of its electronic devices, which constrains the performance of the photonic devices. In the future, we will explore all-optical accelerators to maximize acceleration performance.

Data Availability
Data are available on request by contacting Mengkun Li (limengkun@cnu.edu.cn).

Conflicts of Interest
The authors declare that there is no conflict of interest.