Nonlinear All-Optical Diffractive Deep Neural Network with 10.6 μm Wavelength for Image Classification

A photonic artificial intelligence chip based on an optical neural network (ONN) offers low power consumption, low delay, and strong anti-interference ability. The all-optical diffractive deep neural network (D²NN) has recently demonstrated its inference capability on image classification tasks. However, the physical model has not been miniaturized or integrated, and optical nonlinearity has not been incorporated into the diffractive neural network. Introducing nonlinearity into the network allows complex tasks to be completed with high accuracy. In this study, a nonlinear all-optical diffractive deep neural network (N-D²NN) model based on a 10.6 μm wavelength is constructed by combining the ONN with complex-valued neural networks and introducing a nonlinear activation function into the structure. Specifically, improved variants of the rectified linear unit (ReLU), i.e., Leaky-ReLU, parametric ReLU (PReLU), and randomized ReLU (RReLU), are selected as the activation functions of the N-D²NN model. Numerical simulation shows that the N-D²NN model based on the 10.6 μm wavelength has excellent representation ability, enabling it to perform classification tasks well on the MNIST handwritten digit dataset and the Fashion-MNIST dataset. The results show that the N-D²NN model with the RReLU activation function achieves the highest classification accuracy, 97.86% and 89.28%, respectively. These results provide a theoretical basis for the fabrication of miniaturized and integrated N-D²NN photonic artificial intelligence chips.


Introduction
Deep learning is a branch of machine learning that has been successfully used in various applications, such as image classification [1], natural language processing [2], and speech recognition [3]. Generally, deep neural networks have many layers and connections with a large number of parameters, making them highly capable of learning good feature representations [4]. Although the training phase for learning network weights can be completed on graphics processing units (GPUs), large models also require considerable power and storage during inference because of millions of repeated memory references and matrix multiplications. Optical computing offers high bandwidth and speed, inherently parallel processing, and low power consumption compared with digitally implemented neural networks. A variety of approaches to optical neural networks (ONNs) have been proposed, including Hopfield networks with LED arrays [5], optoelectronic implementations of reservoir computing [5,6], spiking recurrent networks with microring resonators [7,8], and fully connected feedforward networks using Mach-Zehnder interferometers (MZIs) [9]. An ONN uses optical methods to construct the neural network, which has many interconnected linear layers, and has the unique advantages of parallel processing, high-density wiring, and direct image processing. It can be realized by free-space optical interconnection (FSOI) and waveguide optical interconnection (WOI).
FSOI can implement an ONN using a spatial light modulator (SLM), microlens arrays (MLA), and holographic optical elements (HOE). An HOE is an optical element made according to holography, generally formed by a photosensitive film [10,11]. Many researchers have explored diffractive optical elements (DOE) based on the principle of diffraction. Bueno et al. introduced a network consisting of up to 2025 diffractive photonic nodes, forming a large-scale recurrent photonic network. A digital micromirror device (DMD) is used to realize reinforcement learning with significant convergence results. The network consists of 2025 nonlinear network nodes, each of which is an SLM pixel, and a DOE is used to implement the complex network structure [12]. Sheler Maktoobi et al. investigated diffractively coupled photonic networks with 30,000 photonic nodes and described their scalability in detail [13]. Lin et al. from UCLA realized the all-optical diffractive deep neural network (D²NN). They moved the neural network from the chip to the real world in 2018; the device relies on the propagation of light and achieves almost zero power consumption and zero delay in deep learning [14,15]. The physical model consists of an input layer, 5 hidden layers, and an output layer. A terahertz-band light source illuminates the input layer, and the phase or amplitude of the input surface encodes the optical information.
The incident light is diffracted through the input layer, and the hidden layers modulate the phase or amplitude of the light. An array of photodetectors at the output layer detects the intensity of the output light and identifies handwritten digits based on the difference in light intensity across 10 different areas. The updated phase models the diffraction grating produced by 3D printing. However, this scheme has some defects. Besides the lack of miniaturization and integration, the 3D-printed diffraction grating layer cannot be rapidly reprogrammed in real time. In 2019, the team proposed a broadband diffractive neural network based on the above architecture [16]. The requirements of the model for the light source are no longer limited to monochromatic coherent light, which extends the application scope of the framework. However, the experimental environment is limited by the use of terahertz light sources, the large size of the diffraction grating works against integration, and, in the D²NN model, the authors stated that no activation function was added in the simulation; hence the nonlinear representation ability and generalization ability of the model need to be improved. Thus, in our previous work, a phase grating was used to replace the 3D-printed diffraction grating. A carbon dioxide laser emits a 10.6 μm infrared beam, and a HgCdTe detector array detects the light transmitted from the output layer. The size of each neuron can be reduced to 5 μm, so that a 1 mm × 1 mm phase grating can contain 200 × 200 neurons. Thus, this kind of diffraction grating will obtain a wider range of applications [17]. The advantage of this diffraction grating is its 1 mm × 1 mm size, which is conducive to the miniaturization and integration of the all-optical D²NN architecture.
At present, complex-valued neural networks [18] have been successfully used for various tasks [19-27], such as the processing and analysis of complex-valued data and tasks with an intuitive mapping to complex numbers. Images and signals transformed into waveform or Fourier representations have been used as input data for complex-valued neural networks [28]. In an ONN, because the phase of light is complex-valued, both the phase and the amplitude of light need to be considered. If only a real-valued neural network were used, ignoring the imaginary parameters, part of the information would be lost [29,30]. Therefore, it is necessary to apply complex-valued neural networks to optical computing.
Nonlinear activation functions are widely used in various neural networks. They play a crucial role by learning the complex mapping between input and output. Without an activation function, no matter how many layers the neural network has, the output is a linear combination of the inputs. This is equivalent to a system without hidden layers and results in a low nonlinear representation ability. At present, the main nonlinear activation functions include sigmoid, tanh, and ReLU. Among them, ReLU is the most common for three reasons: (1) it avoids the so-called gradient explosion and vanishing-gradient problems, (2) it accelerates convergence [31], and (3) it sets the output of some neurons to 0, which makes the network sparse. The ReLU family of activation functions includes Leaky-ReLU, PReLU, and RReLU. These functions improve the speed and accuracy of classification on different datasets. The ReLU activation function allows the network itself to introduce sparsity.
This property is equivalent to the pretraining of unsupervised learning and greatly shortens the learning cycle.
In this study, an all-optical diffractive deep neural network model with nonlinear activation functions (N-D²NN) based on a 10.6 μm wavelength is proposed. Compared with the work of UCLA [14,15], the characteristic size of the neural network is reduced by a factor of 80, and the classification accuracy of the model is verified by simulation. Our model provides a theoretical basis for future research on the N-D²NN framework at the 10.6 μm wavelength and lays a foundation for the further realization of large-scale integrated and miniaturized photonic computing chips.
In summary, the main contributions of this study are as follows: (1) an N-D²NN framework with nonlinear activation functions based on a 10.6 μm wavelength is proposed by combining an ONN with complex-valued neural networks; (2) the representation ability of the N-D²NN with improved ReLU activation functions is evaluated in simulation, and the detailed evaluation process is given. The rest of this study is organized as follows. The method used in our research is described in Section 2. Section 3 presents the experimental results. The discussion is reported in Section 4. Finally, conclusions are given.

Materials and Methods
This part introduces the basic theory and the improved diffractive deep neural network method based on a 10.6 μm laser wavelength. First, the optical computation theory of the N-D²NN based on the 10.6 μm wavelength is introduced. Then, the network model structure is explained in detail. Finally, to improve the nonlinear representation ability of the N-D²NN, an improved method is given by adding a nonlinear activation function to the N-D²NN model.

2.1. Optical Computation. Figure 1 shows the structure of the N-D²NN. Light passing through each grating is modulated by grating grids of different thicknesses and is then received by all grating pixels on the next grating. This connection mode is similar to a fully connected neural network. The first grating layer receives the input image and corresponds to the input layer of the neural network; the middle grating layers correspond to the hidden layers; and the detection plane corresponds to the output layer. The phase modulation of the input light differs with the height of the gratings, which corresponds to different weights in the neural network structure.
According to the Rayleigh-Sommerfeld diffraction equation, the neurons in each layer of the N-D²NN can be treated as secondary wave sources, with the following formula [32,33]:

w_i^l(x, y, z) = ((z − z_i)/r²) · (1/(2πr) + 1/(jλ)) · exp(j2πr/λ),

where l represents the l-th layer of the network, i represents the i-th neuron of layer l, r = √((x − x_i)² + (y − y_i)² + (z − z_i)²) is the Euclidean distance between node i of layer l and a node of layer l + 1, λ is the wavelength, and j = √(−1). The input plane is the 0th layer; then, for the l-th layer (l ≥ 1), the output field can be expressed as

n_i^l(x, y, z) = w_i^l(x, y, z) · g(t_i^l(x_i, y_i, z_i) · Σ_k n_k^{l−1}(x_i, y_i, z_i)),

where n_i^l(x, y, z) represents the output of the i-th neuron of the l-th layer at (x, y, z), and g is the nonlinear activation function, whose role is to transmit the modulated secondary-wave neurons to the next layer through the nonlinear unit, with g = φ[t_i^l(x_i, y_i, z_i) · Σ_k n_k^{l−1}(x_i, y_i, z_i)] = φ[w_i^l(x, y, z) · |A| · e^{jφ_i^l}]. Here t_i^l denotes the complex modulation, i.e., t_i^l(x_i, y_i, z_i) = |A|exp(jφ_i^l(x_i, y_i, z_i)), where |A| = a_i^l(x_i, y_i, z_i) is the relative amplitude of the secondary wave and φ_i^l(x_i, y_i, z_i) represents the phase delay imposed on the input wave Σ_k n_k^{l−1}(x_i, y_i, z_i) by the complex-valued neuron modulation function t_i^l at each neuron. For a phase-only N-D²NN structure, the amplitude a_i^l(x_i, y_i, z_i) is considered a constant, ideally 1 when optical loss is ignored.
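As an illustrative numerical sketch (not the authors' implementation), the Rayleigh-Sommerfeld secondary-wave coefficient above can be evaluated directly with NumPy; the function name and argument layout are our own, and the 10.6 μm wavelength is taken from the text:

```python
import numpy as np

def secondary_wave(x, y, z, xi, yi, zi, wavelength=10.6e-6):
    """Rayleigh-Sommerfeld secondary-wave coefficient w_i^l linking neuron i
    at (xi, yi, zi) to the point (x, y, z) on the following layer."""
    r = np.sqrt((x - xi) ** 2 + (y - yi) ** 2 + (z - zi) ** 2)
    return ((z - zi) / r ** 2) * (1.0 / (2 * np.pi * r) + 1.0 / (1j * wavelength)) \
        * np.exp(1j * 2 * np.pi * r / wavelength)

# The coefficient is complex-valued; its magnitude falls off with distance.
w_near = secondary_wave(0.0, 0.0, 1e-4, 0.0, 0.0, 0.0)
w_far = secondary_wave(0.0, 0.0, 1e-3, 0.0, 0.0, 0.0)
```

Summing this coefficient over all neurons of a layer gives the diffracted field on the next layer, which is the fully connected behavior described above.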

The Architecture of N-D²NN.
To simplify the representation of the forward model, equation (1) can be rewritten as

h_p^{l+1} = Σ_i w_{i,p}^l · t_i^l · h_i^l,

where i refers to a neuron of the l-th layer and p refers to a neuron of the next layer, connected to neuron i by optical diffraction. The input pattern h_k^0 is located at layer 0. It is in general a complex-valued quantity, which can carry information in both its phase and amplitude channels. The diffracted wave generated by the interaction between the illuminating plane wave and the input pattern then propagates through the network. When the input light has been diffracted through the multilayer gratings, a result image is output on the detection plane. The detector measures the detection areas in the generated image and obtains the network classification result. Therefore, it is necessary to process the data labels in the parameter-training stage, and corresponding label patterns are designed in the result images for the different labels. As shown in Figure 2, the label represented by a generated image is obtained by judging which detection region has the highest light intensity. To match input data of different sizes, the result image corresponding to the label is also scaled.
For an N-D²NN containing N hidden layers, the light intensity on its output layer can be expressed as

I = |h^{N+1}(x, y, z)|².

The intensity measured by the detector on the output plane is normalized so that the values of each sample lie in the interval (0, 1). With I_l denoting the total optical signal incident on detector region l of the output layer, the normalized intensity I_l′ is

I_l′ = I_l / Σ_k I_k.

The Proposed Method.
Based on previous research, Lin et al. did not consider adding nonlinearity to the D²NN framework. Therefore, in classification tasks, the D²NN is weak in nonlinear representation. In this study, an N-D²NN model architecture is proposed, as shown in Figure 3. A neuron is assumed to be physically equivalent to a grid cell of the ONN, and the modulated secondary-wave neurons are transmitted to the next layer through the nonlinear unit.

Complex-Valued Neural Network. According to equation (3), the complex form of the wave function contains the spatial phase factor exp(jφ_i^l), so the product of the amplitude and the spatial phase factor is

t_i^l = x + jy = a_i^l exp(jφ_i^l).

t_i^l can be represented by two real numbers: the real part Re(t_i^l) = x and the imaginary part Im(t_i^l) = y. Any complex-valued function of complex variables can thus be represented by two real-valued functions. Although complex numbers can be used and represented directly in neural networks, they define an interaction between these two parts. Using Euler's formula, the equivalent polar representation is

e^{jφ_i^l} = cos(φ_i^l) + j sin(φ_i^l).

Because more operations are required, complex parameters increase the complexity of the neural network. Therefore, equations (7) and (8) can be used according to the chosen implementation and representation, which can significantly reduce the computational complexity. The product of the input t_i^l and the complex-valued weight matrix w_i^l is calculated as

w_i^l · t_i^l = [Re(w_i^l)Re(t_i^l) − Im(w_i^l)Im(t_i^l)] + j[Re(w_i^l)Im(t_i^l) + Im(w_i^l)Re(t_i^l)].

This exchange means that the model design needs to be rethought to simplify the structure: a deep learning architecture that performs poorly with real-valued parameters may be well suited to complex-valued parameters. According to the experimental results in [34], real-valued data do not require the full structure. When the imaginary part Im(t_i^l) is zero, equation (9) simplifies to

w_i^l · t_i^l = Re(w_i^l)Re(t_i^l) + j Im(w_i^l)Re(t_i^l).

For training, this means that the real parts Re(t_i^l) and Re(w_i^l) dominate the overall classification of real-valued data points.
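As a minimal numerical check (the helper names are ours, not the paper's), the real-arithmetic decomposition of the complex weight-input product in equations (9) and (10) can be sketched and verified against NumPy's built-in complex multiplication:

```python
import numpy as np

def complex_product(w, t):
    """Full complex product w * t via real/imaginary parts (equation (9))."""
    re = w.real * t.real - w.imag * t.imag
    im = w.real * t.imag + w.imag * t.real
    return re + 1j * im

def real_input_product(w, t_real):
    """Simplified product for a real-valued input, Im(t) = 0 (equation (10))."""
    return w.real * t_real + 1j * (w.imag * t_real)

w = 0.8 * np.exp(1j * 0.3)           # complex weight a * e^{j phi}
t = np.array([1.0, -2.0, 0.5])       # real-valued input, Im(t) = 0
full = complex_product(w, t)
simplified = real_input_product(w, t)
```

For real-valued inputs the two routines agree, illustrating why the simplified form suffices for real-valued data points.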

Activation Function.
The activation function can enhance the nonlinear representation ability needed to perform complex deep learning tasks. However, some nonlinear activation functions, such as sigmoid and tanh, have two disadvantages: (1) when performing backpropagation, computing the gradient of the activation function (an exponential function) involves division, so the computation is relatively expensive; and (2) when the sigmoid is close to its saturation region, the transformation is too slow and the derivative tends to zero, which causes information loss. Among these nonlinear activation functions, the most notable is the rectified linear unit (ReLU) [35]. It is generally believed that the excellent performance of ReLU comes from sparsity [36,37]: it reduces the interdependence of parameters and alleviates overfitting. There are also several improvements on ReLU, such as the leaky rectified linear unit (Leaky-ReLU), the parametric rectified linear unit (PReLU), and the randomized rectified linear unit (RReLU), collectively the ReLU family. These functions improve the speed and accuracy of neural network training. In this section, the three rectified units Leaky-ReLU, PReLU, and RReLU are introduced. They are illustrated in Figure 4.

Figure 4(a) shows the mathematical model of ReLU, which was first used in restricted Boltzmann machines. It is a piecewise linear function that cuts the negative part to zero and keeps the positive part, so the activations after ReLU are sparse. Formally, rectified linear activation is defined as

f(m_i^l · t_i^l) = max(0, m_i^l · t_i^l),

i.e., when the input signal m_i^l · t_i^l < 0, the output is 0; when m_i^l · t_i^l ≥ 0, the output equals the input signal. Figure 4(b) shows the mathematical model of Leaky-ReLU and PReLU. ReLU sets all negative values to zero; in contrast, the leaky rectified linear unit (Leaky-ReLU) assigns a nonzero slope to all negative values. The Leaky-ReLU activation function was first proposed in an acoustic model [38]. It is mathematically defined as

f(x_i) = x_i, x_i ≥ 0;  f(x_i) = a_i x_i, x_i < 0,

where a_i is a fixed parameter in the range (0, 1). In this study, a_i in the Leaky-ReLU function is set to 0.2. PReLU was proposed by He et al. [39], who reported that its performance is much better than ReLU in large-scale image classification tasks. In the PReLU function, the slopes of the negative part are learned from the data rather than defined in advance: PReLU learns a_i in equation (12) through backpropagation during training. Figure 4(c) shows the mathematical model of RReLU, the randomized version of Leaky-ReLU, which was first proposed and used in the Kaggle NDSB competition. The highlight of RReLU is that during training, a_ji is a random number sampled from a uniform distribution U(l, u):

f(x_ji) = x_ji, x_ji ≥ 0;  f(x_ji) = a_ji x_ji, x_ji < 0,  a_ji ∼ U(l, u),

where l < u and l, u ∈ [0, 1). As suggested by the NDSB competition winner, a_ji is sampled from U(3, 8) (i.e., the negative slope is taken as 1/a_ji). In this study, the same configuration is used.
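For concreteness, the ReLU family above can be sketched in NumPy as follows. The function names are ours and the Leaky-ReLU slope 0.2 follows the text; for RReLU the negative slope is drawn uniformly at random during training (here from U(1/8, 1/3), the reciprocals of the U(3, 8) endpoints of the NDSB setup) and replaced by its mean at inference time, a convention we state as an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Keep the positive part, zero the negative part."""
    return np.maximum(x, 0.0)

def leaky_relu(x, a=0.2):
    """Fixed slope a for negative inputs."""
    return np.where(x >= 0, x, a * x)

def prelu(x, a):
    """Same form as Leaky-ReLU, but a is learned by backpropagation."""
    return np.where(x >= 0, x, a * x)

def rrelu(x, low=1.0 / 8, high=1.0 / 3, training=True):
    """Random negative slope a_ji ~ U(low, high) during training;
    the mean slope is used at inference time (assumed convention)."""
    if training:
        a = rng.uniform(low, high, size=np.shape(x))
        return np.where(x >= 0, x, a * x)
    return np.where(x >= 0, x, 0.5 * (low + high) * x)
```

Positive inputs pass through unchanged in all four variants; the variants differ only in how the negative part is scaled.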

Model Training. The forward-propagation model compares the result on the physical output plane with the training target of the diffractive network, and the resulting error is backpropagated to iteratively update each layer of the diffractive network. Following [15], the cross-entropy function is adopted as the loss function for the N-D²NN, which significantly improves the classification accuracy on the MNIST dataset [40] and the Fashion-MNIST dataset [41]. The output of the N-D²NN is compared with the target values, error backpropagation is used to iterate the grating parameters, and the loss function is defined from the output of the N-D²NN with respect to the target characteristics. The cross-entropy loss is defined as

L = −Σ_i q_i^l(x) log p_i^l(x),

where p_i^l(x) = e^{I_i′} / Σ_k e^{I_k′} is the output of the Softmax layer (Softmax regression can be regarded as a learning algorithm that optimizes the classification result), q_i^l(x) is the actual target value for the image, and I′ is the normalized intensity on the output plane. To train the N-D²NN model as a digit classifier, the MNIST handwritten digit dataset and the Fashion-MNIST dataset are used as the input layers.
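A minimal sketch of the cross-entropy loss described above (the function name and example values are ours), applied to the ten normalized detector intensities I′ and a one-hot target:

```python
import numpy as np

def softmax_cross_entropy(intensity, target):
    """Cross-entropy between softmax(I') and a one-hot target q:
    L = -sum_i q_i * log p_i, with p_i = exp(I'_i) / sum_k exp(I'_k)."""
    e = np.exp(intensity - np.max(intensity))  # shift for numerical stability
    p = e / e.sum()
    return -np.sum(target * np.log(p + 1e-12))

# Ten detector regions, one-hot target for digit 3 (illustrative values).
intensity = np.zeros(10)
target = np.eye(10)[3]
loss = softmax_cross_entropy(intensity, target)
```

With uniform intensities the softmax is uniform and the loss is ln 10; concentrating intensity in the target region drives the loss toward zero, which is what the grating-parameter updates aim for.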
The training and inference times of the N-D²NN model with the different activation functions on the MNIST and Fashion-MNIST datasets are listed in Tables 1 and 2, respectively. From Tables 1 and 2, it can be seen that the N-D²NN model with the RReLU function takes the least training and inference time compared with the other activation functions on both datasets. In the training phase, the models with Leaky-ReLU and PReLU achieve the same training time. However, the inference time of the model with Leaky-ReLU is shorter than that with PReLU. In the Kaggle NDSB competition, it was reported that the randomness of a_ji in the RReLU function during training is favorable and can reduce overfitting. Therefore, in terms of inference time, training time, and recognition accuracy, the RReLU function has advantages. The a_i in the Leaky-ReLU function is fixed, while the a_i in the PReLU function changes based on the data; thus, the inference time of the PReLU function is slightly longer than that of the Leaky-ReLU function.

Experimental Results
To test the performance of the N-D²NN structure, the MNIST dataset and Fashion-MNIST dataset are introduced in Section 3.1. Section 3.2 describes the evaluation method. Performance evaluation is reported in Section 3.3. Section 3.4 compares the representation ability with that of a neural network framework without nonlinear activation functions.

MNIST Dataset and Fashion-MNIST Dataset.
In this study, the MNIST handwritten digit dataset and the Fashion-MNIST dataset are used at the input layer to train the digit classifier based on the 10.6 μm N-D²NN model. The MNIST dataset is a handwritten digit dataset composed of the numbers 0-9. It comprises four parts: training-set images, training-set labels, test-set images, and test-set labels. The MNIST dataset comes from the National Institute of Standards and Technology (NIST). The training and test sets are a mixture of handwritten digits from two sources, one from high school students and the other from Census Bureau employees. The MNIST handwritten dataset contains a training set of 60,000 samples and a test set of 10,000 samples. Each image contains 28 × 28 pixels, with the digits size-normalized and centered.
The Fashion-MNIST dataset is a ten-category clothing dataset designed as a drop-in replacement for the MNIST handwritten digit dataset. It has the same number of training samples and test samples and the same image size; its category labels are listed in Table 3.

Evaluation Method. The confusion matrix with ten classes is listed in Table 4. First, a one-vs-rest confusion matrix needs to be computed for each category H_i (i = 0-9) [42]. Then, for a single class, the evaluation is defined by TP_i, FN_i, TN_i, and FP_i. The accuracy of the proposed classifier for class H_i can be expressed as

accuracy_i = (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i),

where TP_i = χ_ii is the number of samples of H_i correctly predicted as H_i, and TN_i = Σ_{j≠i} Σ_{k≠i} χ_jk (j, k = 0-9) is the number of samples that neither belong to nor are predicted as H_i; FP_i and FN_i count the samples wrongly predicted as H_i and the samples of H_i predicted as other classes, respectively.

Performance Evaluation. The classification results are shown in Tables 5 and 6. The grid search method is used to select the hyperparameters of the neural network; the number of grating layers is among these hyperparameters. In the simulation, the batch size of the network model is set to 100. To reduce the simulation time, the number of epochs is 10, the pixel scale is 28 × 28, the loss function is the cross-entropy function, the optimizer is the Adam optimizer, and the learning rate is 0.01. The number of grating layers in the 10.6 μm N-D²NN influences the final classification result, which is also a unique advantage of this neural network compared with other linear networks. Figure 6 shows the recognition accuracy for different numbers of grating layers in N-D²NN models with the various activation functions. When the number of grating layers is ≤5, the classification accuracy of the neural network model increases with the number of grating layers; when it is >5, the classification accuracy saturates. In general, the deeper the neural network, the stronger its feature representation ability, and the better its potential performance on the image classification task. However, the choice of the number of layers also depends strongly on the dimensionality of the input data features.
If the feature dimension of the input data is low and the neural network is deep, the feature information is easily lost or saturated during training; therefore, the classification accuracy tends to saturate or even decrease. Consequently, in the simulation environment, the number of grating layers is set to 6.
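To make the evaluation of Section 3.2 concrete, the per-class accuracy can be computed from a ten-class confusion matrix χ as sketched below; the function name and the toy counts are ours:

```python
import numpy as np

def per_class_accuracy(cm, i):
    """Accuracy for class H_i from a confusion matrix chi (rows: true class,
    columns: predicted class): (TP + TN) / (TP + TN + FP + FN)."""
    tp = cm[i, i]                   # class i correctly predicted as i
    fn = cm[i, :].sum() - tp        # class i predicted as something else
    fp = cm[:, i].sum() - tp        # other classes predicted as i
    tn = cm.sum() - tp - fn - fp    # everything not involving class i
    return (tp + tn) / cm.sum()

# Toy example: 8 of 10 samples of class 0 are correct, 2 are confused with class 1.
cm = np.diag(np.full(10, 10.0))
cm[0, 0], cm[0, 1] = 8.0, 2.0
```

In this toy matrix the two misclassified samples lower the accuracy of both class 0 (as false negatives) and class 1 (as false positives), while every other class stays at 100%.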
After determining the number of grating layers in the neural network model (6 layers), the pixel scale and the diffraction-grating spacing among the model hyperparameters are optimized. For the N-D²NN model, the pixel sizes and classification accuracies corresponding to the three activation functions, Leaky-ReLU, PReLU, and RReLU, are shown in Tables 7-10, respectively. As can be seen from Tables 5-8, when the diffraction-grating spacing in the neural network model is fixed, the accuracy generally increases with pixel size. When the pixel size of the diffraction grating is fixed, the accuracy generally decreases as the grating spacing increases. When the model uses the RReLU activation function, with a pixel size of 100 × 100 and a diffraction-grating spacing of 30λ, the neural network has the highest recognition accuracy.
Finally, the learning rate of the Adam optimizer in the model is optimized. Figure 7 shows the classification accuracy on the MNIST dataset of the N-D²NN model with RReLU for learning rates of 0.01, 0.025, 0.05, and 0.075. It can be seen from Figure 7 that the classification accuracy is highest when the learning rate is 0.05. The hyperparameters of the N-D²NN model evaluated on the Fashion-MNIST dataset are optimized by the same method, and the selected hyperparameters are consistent with those of the model on the MNIST dataset. Without an activation function, the standard N-D²NN model based on the 10.6 μm wavelength achieves a classification accuracy of 86.78% (81.10%) on the MNIST (Fashion-MNIST) dataset in simulation.
As shown in Figure 8(a), the classification accuracy of the standard N-D²NN model differs across the labels of the MNIST dataset: the accuracy for label 1 is as high as 98%, whereas the accuracy for label 8 is only 73%. In Figure 8(b), the classification accuracy of the standard N-D²NN model likewise differs across the Fashion-MNIST labels: the accuracy for label 8 is as high as 95%, whereas the accuracy for label 6 is only 35%. It can be seen that the nonlinear fitting ability and generalization ability of the standard N-D²NN model without an activation function are weak. According to the accuracy curve, the recognition accuracy of the model tends to saturate at epoch 50.

Comparison with the N-D²NN Framework.
A comparison with the test results of the N-D²NN structure using the ReLU family of nonlinear activation functions is presented in Section 3.3. The simulation results show that N-D²NN frameworks with different nonlinear activation functions have significantly improved representation ability, which proves the necessity of nonlinear activation functions in the N-D²NN framework. Leaky-ReLU, PReLU, and RReLU are selected as the activation functions in the N-D²NN model. The classification accuracy results on the MNIST and Fashion-MNIST datasets obtained in simulation are shown in Table 11.
Among them, the neural network with the RReLU function achieves a classification accuracy of 97.86% on the MNIST dataset. Compared with the results reported in [14,15], the classification accuracy of the N-D²NN model based on 10.6 μm is improved by 0.05%. The neural networks with the PReLU and RReLU functions achieve a classification accuracy of 89.28% on the Fashion-MNIST dataset. This confirms the value of introducing ReLU family activation functions into the model. Figure 9 shows the accuracy curves and confusion matrices of the N-D²NN with the different activation functions.
According to the accuracy curves, the recognition accuracy of the model saturates at epoch 50. The confusion matrices reveal that, for the neural networks with the three activation functions, the classification accuracy of each label in the MNIST dataset is above 94%. Among them, the recognition accuracy for labels 0 and 1 is as high as 99% for all three activation functions. However, the classification ability for label 9 is slightly worse, with accuracy rates of 94%, 97%, and 94%, respectively. This may be due to the high similarity between label 9 and labels 4 and 8, causing the model to misclassify label 9 as other labels. Figure 10 shows the recognition accuracy of the various neural network models for each label in the MNIST dataset. It can be seen that, on the MNIST dataset, the recognition accuracy for each label of the models with the three ReLU family activation functions is higher than that of the standard model without an activation function.
According to the accuracy curves, the recognition accuracy of the model also saturates at epoch 50. The confusion matrices reveal that, for the neural networks with the three activation functions, the classification accuracy of each label in the Fashion-MNIST dataset is above 80%, except for labels 4 and 6. Among them, the recognition accuracy for label 8 is as high as 98%, 96%, and 97%, respectively. However, the classification ability for label 6 is slightly worse, with accuracy rates of 58%, 66%, and 62%, respectively. The low recognition accuracy for label 6 (shirt) may be because it is mistakenly classified as label 0 (T-shirt), label 2 (pullover), or label 4 (coat). Figure 11 shows the recognition accuracy of the various neural network models for each label in the Fashion-MNIST dataset. It can be seen that the recognition accuracy for each label of the models with the three ReLU family activation functions is higher than that of the standard model without an activation function.

Discussion
A nonlinear activation function can improve the representation ability of traditional deep learning. However, in previous work, optical nonlinearity was not incorporated into the design of deep optical networks, so whether a nonlinear effect could improve the representation ability of the framework remained unproven. In this study, nonlinear activation functions are added to the N-D²NN framework. The representation abilities of the nonlinear and linear N-D²NN frameworks are analyzed, and it is shown that nonlinear activation functions improve the representation ability of the N-D²NN framework. The proposed theory can also be extended to any laser with the required wavelength, that is, to any diffraction grating suitable for the all-optical D²NN model.
In practice, there are three kinds of methods to realize the nonlinear activation function. The first is nonlinear materials, including crystals, polymers, and semiconductors. Any material with a strong third-order optical nonlinearity χ⁽³⁾ can be used to form a nonlinear diffraction layer: glasses (e.g., As₂S₃ or metal-nanoparticle-doped glass), polymers (e.g., polydiacetylene), organic thin films, semiconductors (e.g., gallium arsenide, silicon, and CdS), and graphene. The second method is saturable absorber materials, such as semiconductors, quantum-dot films, carbon nanotubes, and even graphene films, which can serve as nonlinear elements for the N-D²NN. Recently, materials with a strong optical Kerr effect [43,44] have brought promise to the deep diffractive neural network architecture. The third method is to introduce optical nonlinearity into the layers of the N-D²NN by using the direct-current electro-optic effect.
This approach deviates from purely all-optical operation, since each layer of the diffractive neural network requires a direct-current field, which can be applied externally to each layer of the N-D²NN.
Graphene and cadmium sulfide (CdS) have yielded a series of important research results in the field of nonlinear optics. In future work, the nonlinear saturable absorption coefficients of these materials will be used to fit an optical-limiting-effect function, which will serve as the activation function in the miniaturized nonlinear diffractive deep neural network, and the classification accuracy of the N-D²NN model with these nonlinear optical materials will be verified in simulation. One approach is material coating, that is, plating a layer of graphene or CdS onto the germanium diffraction grating to realize the physical N-D²NN model. Another approach is to fabricate the diffraction gratings directly from nonlinear materials such as graphene and CdS.

Conclusions
In this study, an N-D²NN structure with nonlinear activation functions based on the 10.6 μm wavelength is proposed, built on the optical neural network and complex-valued neural networks, and its correctness is verified by simulation. The experimental results show that the classification performance of the N-D²NN framework with the three ReLU family functions is better than that of the N-D²NN framework without a nonlinear activation function, which proves the necessity of nonlinear activation functions in the N-D²NN framework and improves recognition accuracy. Compared with the D²NN model in [14,15], the N-D²NN model using the RReLU function improves the recognition accuracy on the MNIST dataset by 0.05%. However, two challenges remain: one is to find the corresponding nonlinear optical materials for the physical model; the other is that a better nonlinear activation function for the N-D²NN framework may exist. These two points are work to be completed in the future. In follow-up studies, the neural network model will be further optimized, and nonlinear activation functions more suitable for the N-D²NN will be sought, providing a theoretical basis for realizing a physical N-D²NN system at the 10.6 μm wavelength.
Data Availability

The raw/processed data required to reproduce these findings cannot be shared at this time, as the data also form part of an ongoing study.

Conflicts of Interest
The authors declare that there are no conflicts of interest.