A Crop Leaf Disease Image Recognition Method Based on Bilinear Residual Networks

Deep learning models are widely used in crop leaf disease image recognition. These models fall into two categories: global models and local models. A global model takes the whole leaf disease image as input for training and recognition. It supports end-to-end training and recognition, which makes it convenient to use, but it cannot accurately and completely extract features from very small diseased spots in the image. A local model first extracts the diseased spots from the image by image segmentation and then takes the diseased spot images as input for training and recognition. Features extracted by a local model are more accurate and complete, but the model cannot be trained and applied end to end, and the image segmentation brings additional overhead. Considering the disadvantages of both kinds of model, we propose a crop leaf disease image recognition method based on bilinear residual networks (named DIR-BiRN). DIR-BiRN extracts features with two residual network feature extractors and then integrates the features with a bilinear pooling function. In this way, it extracts features more accurately and completely while still supporting end-to-end training and recognition. Experiments on the PlantVillage dataset show that, compared with the standard ResNet-18 model, DIR-BiRN improves accuracy, recall, precision, and F1-measure by averages of 0.2918, 0.81641, 0.59185, and 0.52151 percentage points, respectively.


Introduction
Crop leaf disease image recognition is one of the most important agricultural engineering technologies. It is also a key supporting technology for crop disease recognition and control, and it helps guarantee the safety of agricultural production [1]. How to identify crop leaf diseases quickly and accurately is one of the key problems in crop leaf disease image recognition research.
To address this problem, machine learning methods have long been applied in crop leaf disease image recognition research, such as support vector machines [2], linear discriminant analysis [3], K-means [4], and Bayesian networks [5]. These methods require manual feature selection, which leads to high overhead. Moreover, their performance is limited: usually, they can recognize only a few diseases of a single crop [6]. In recent years, deep learning technology has developed rapidly and has become popular in crop leaf disease image recognition research.
Different from the traditional machine learning methods, deep learning methods can be applied in an end-to-end way while extracting features automatically and accurately [6]. The concept of deep learning comes from the artificial neural network, but deep learning constructs a deeper network model. As early as 2007, researchers [7] applied an artificial neural network to disease image recognition of Phalaenopsis seedlings, but this method still required manual feature selection before it could be applied. In 2012, Krizhevsky et al. [8] applied convolutional neural networks (CNN) to image recognition and achieved a top-5 error rate of 15.3% on the ImageNet dataset. Since then, many researchers have applied CNN models to crop disease image recognition. Amara et al. [9] realized banana leaf disease image recognition based on LeNet and verified the effectiveness of deep learning models in complex environments. Wu et al. [10] applied GoogLeNet to classify five diseases of tomato and achieved an accuracy of 94.33%. Durmus et al. [11] trained an AlexNet to classify ten diseases of tomato leaves and achieved an accuracy of 95.65%. Sladojevic et al. [12] developed a deep convolutional neural network (DCNN) to detect different diseases of 13 crops, with an accuracy of 91%-98%. Hu et al. [13] suggested a ResNet model to classify 59 diseases of different crops and achieved an accuracy of 85.22%. LeNet, ResNet, AlexNet, and GoogLeNet are all CNN models. These studies have verified the effectiveness of deep learning technology in crop disease image recognition. Current research is no longer limited to detecting whether a crop is healthy or unhealthy; it also needs to assign unhealthy crops to a specific disease category [14]. This means that we need more effective deep learning models, which can classify many different diseases of different crops. Models of this kind are called fine-grained classification models.
With more disease categories, high accuracy becomes more difficult to achieve. This is because the distinguishing features are generally hidden in the details of the image, such as tiny disease spots, which makes it very difficult for the models to extract the features from the image samples. Some researchers therefore apply image segmentation to obtain images of the disease spots before feature extraction. But image segmentation leads to additional overhead and prevents the recognition model from being deployed in an end-to-end way.
To solve this problem, we propose a crop leaf disease image recognition method based on bilinear residual networks (named DIR-BiRN). DIR-BiRN integrates two feature extractors through a bilinear pooling function, and the feature extractors are constructed from residual networks. The two feature extractors extract features separately, and the features are then combined by the bilinear pooling function. Compared to a single deep learning model, DIR-BiRN can extract features more accurately and completely while remaining end to end. Figure 1 shows an example of crop leaf image recognition in which a deep learning model is applied. It is essentially a classification problem, and each crop disease corresponds to a category label.
There are four main steps when applying a deep learning model to crop leaf disease image recognition:
(1) Build training set. In this step, we need to collect leaf images that belong to the different diseases and group them under the corresponding disease labels. In order to distinguish healthy leaves from diseased ones, a healthy class must be added to the dataset. Then, we need to preprocess all the images in the dataset. The preprocessing involves transforming all the images to a uniform size and cropping them to a square around the leaves, in order to highlight the region of interest (plant leaves) [12]. If the training set is unbalanced, we generally apply data augmentation methods to prevent the model from overfitting. The common data augmentation methods include image flipping, cropping, rotation, translation, and noise injection. Deep learning models can extract complex features automatically when given sufficient training samples, so they help us avoid many complex preprocessing operations.
(2) Build model. In this step, we need to construct the basic net structure of the recognition model and set the hyperparameters of the model. The model's feature extraction performance has a decisive influence on the performance of the image recognition method. So our main purpose in this step is to build a net structure that is as effective as possible at extracting the fine-grained features of crop leaf disease images.
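The data augmentation methods listed in step (1) can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of pixel rows with values in 0-255; the function names and the noise level `sigma` are hypothetical and are not taken from the paper.

```python
import random

def flip_horizontal(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate an image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def add_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip each pixel to the 0-255 range."""
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]

def augment(image, rng):
    """Return the original image plus three augmented variants."""
    return [image,
            flip_horizontal(image),
            rotate_90_clockwise(image),
            add_noise(image, 5.0, rng)]
```

Each variant keeps the disease label of the original image, so an under-represented category can be enlarged without collecting new samples.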

Related Works
There are two kinds of deep learning models applied to crop leaf disease image recognition: global models and local models. In this section, we discuss the recognition methods based on each kind in turn.
A global model extracts features directly from the whole crop leaf image. Research based on global models has tried to apply more advanced and appropriate models to obtain better recognition performance. Research [11] tested AlexNet and SqueezeNet on tomato leaf disease images and found that AlexNet had better recognition performance. Research [12] developed a deep convolutional neural network (DCNN) to detect different diseases of 13 crops, with an accuracy of 91%-98%. ... [19], VGG [20], and GoogLeNet [21]. We can find that most crop leaf disease research is based on global models. This is because a global model can be deployed in an end-to-end way, which is very convenient. It does not need complex preprocessing of the images before training and recognition. But the features of crop leaf disease images are very different from those of other images; they are usually hidden in very fine-grained disease spots, and it is generally hard for global models to extract such fine-grained features. This limits, to a certain extent, the performance of recognition methods based on global models.
In order to make up for this shortcoming of global models, some crop leaf disease image recognition research is based on local models. A local model first extracts the partial image of the disease spots from the whole image by image segmentation. Then the partial image is input to the recognition model for feature extraction. Research [22] suggested an effective method to locate the disease spots. It proposed a spatial pyramid-oriented encoder-decoder cascade convolutional neural network for crop disease leaf segmentation, which consists of a region disease detection network and a region disease segmentation network and achieves higher disease spot segmentation accuracy. Research [23,24] realized tomato leaf disease recognition based on local models. Both located the disease spots before recognition and then applied a Yolo V3 CNN model and a DNN model, respectively, for disease and pest recognition. Research [25] designed a framework for real-time apple leaf disease identification and classification by a local model. It configured MASK RCNN to detect the infected regions and then used a pretrained CNN model for feature extraction, achieving a best accuracy of 96.6%. Research [26] designed a framework for recognition of guava plant diseases. It employed ΔE color difference image segmentation to separate the areas infected by the disease, and then color (RGB, HSV) histogram and textural features were used to extract feature vectors. It achieved an accuracy of 99% in recognizing four guava fruit diseases. Because local models locate the disease spots before training and recognition, they can extract fine-grained disease features more accurately and completely, which enables more accurate crop disease recognition. But the location step leads to extra operations and overhead, which makes it very hard for local models to be deployed in an end-to-end way.
The related works are summarized in Table 1. Global model methods are very easy to use, because they can be applied or deployed in an end-to-end way and do not need additional preprocessing or overhead. But, generally, they cannot extract fine-grained disease spot features as accurately and completely as local models. Local model methods can extract fine-grained disease features more accurately and completely, but their image segmentation step leads to additional operations and overhead, so they cannot be deployed end to end. The ideal method for crop leaf disease image recognition is one that extracts the fine-grained disease features accurately and completely in an end-to-end way. For this purpose, we introduce a bilinear residual network model. The concept of the bilinear model was proposed by Lin et al. [27] in 2015. It can be applied in an end-to-end way and extracts features accurately and completely with two extractors. Based on this concept, we propose a bilinear residual network model for crop leaf disease image recognition, which integrates the advantages of global models and local models.
Mathematical Problems in Engineering

Method Overview.
In this section, we present our proposed crop leaf disease image recognition method based on bilinear residual networks (DIR-BiRN). The build training set, training, and recognizing steps of DIR-BiRN are the same as those of other methods based on global models, so we do not discuss them in this paper. The main difference lies in the build model step, for which we built a bilinear residual network. As shown in Figure 2, the DIR-BiRN model contains two residual network feature extractors. After an image is input to the model, the two extractors extract features separately. The features are integrated into a bilinear vector by a bilinear pooling function, and the bilinear vector is then input to the softmax function for disease classification. This method is related to the two-pathway hypothesis of visual processing in the human brain [27]. The hypothesis states that the human brain uses one pathway to locate an object and another pathway to recognize it [28]. The two extractors in the method can extract more diverse features from the images. The local features are integrated in a linear way by a bilinear pooling function, so the method can model local pairwise feature interactions in a translationally invariant manner [27].
DIR-BiRN is a quadruple consisting of f_RNA, f_RNB, f_BP, and f_c. Here, f_RNA and f_RNB are feature extractors constructed from residual networks, f_BP is a bilinear pooling function, and f_c is a classification function. Once image I_t is input to DIR-BiRN, f_RNA and f_RNB extract features at each location L_s, which can be represented by the functions f_RNA(I_t, L_s) and f_RNB(I_t, L_s), respectively.
The feature that DIR-BiRN extracts from image I_t at location L_s combines the two extractor outputs f_RNA(I_t, L_s) and f_RNB(I_t, L_s). The bilinear pooling function f_BP then yields an image representation vector vec_t. Finally, the classification result is obtained by inputting vec_t to the classification function f_c.
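A possible written-out form of the quadruple, the per-location feature, and the image representation vector described above is sketched below. The outer-product form in (2) is an assumption taken from the bilinear CNN formulation of Lin et al. [27], with each extractor output treated as a row vector; the name "bilinear" and the equation numbering are illustrative, and the count of 49 locations assumes a 7 * 7 feature map.

```latex
\begin{align}
\text{DIR-BiRN} &= \left(f_{RNA},\; f_{RNB},\; f_{BP},\; f_{c}\right) \tag{1}\\
\mathrm{bilinear}(I_t, L_s) &= f_{RNA}(I_t, L_s)^{\top}\, f_{RNB}(I_t, L_s) \tag{2}\\
\mathit{vec}_t &= f_{BP}\!\left(\left\{\mathrm{bilinear}(I_t, L_s)\right\}_{s=1}^{49}\right) \tag{3}
\end{align}
```

Under this reading, two 512-dimensional extractor outputs give a 512 * 512 matrix at each location before pooling.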

Base-Net.
We construct the feature extractors of DIR-BiRN from residual networks (ResNet). ResNet was proposed by He et al. [25] in 2015. ResNet is essentially a convolutional neural network model, but it introduces residual blocks into the network. With residual blocks, ResNet avoids the degradation problem caused by increasing network depth. So, it can build deeper networks and realize more accurate feature representation with higher training efficiency [29]. In recent years, many researchers [30,31] have applied ResNet to crop disease image recognition and achieved better accuracy than with some other models.
The Base-Net structure of the feature extractors in DIR-BiRN is shown in Figure 3. It has 18 layers: 17 convolutional layers and a max-pooling layer. The input images are resized to 224 * 224 with 3 color channels. In Figure 3, "conv" means convolutional kernel, and the parameter in front of it shows its size; "s" means stride, and "p" means padding. The main characteristic of residual networks is that a shortcut connection is added across every two convolution layers to form a residual block, and many residual blocks are stacked into a residual network. The shortcut connections are represented by red and blue arrows in Figure 3. They use different residual functions: blue arrows use formula (4), and red arrows use formula (5). In formula (5), w(x) is a convolutional function that downsamples x and increases its dimension. Their connection structures are shown in Figure 4. The network structure in Figure 4(a) or Figure 4(b) is called a residual block.
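The two residual functions referred to as formulas (4) and (5) can be sketched as follows. This assumes the standard ResNet formulation, where f(x) is the output of the two stacked convolution layers; the output symbol y is introduced here for illustration only.

```latex
\begin{align}
y &= f(x) + x \tag{4}\\
y &= f(x) + w(x) \tag{5}
\end{align}
```

The identity shortcut in (4) applies when x and f(x) have the same shape. The projection w(x) in (5) is needed when the block halves the spatial size and increases the number of channels, so that the two terms can still be added.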
Once a crop leaf disease image is input to the net, it is transformed into a feature map of size 512 * 7 * 7. There are a total of 7 * 7 = 49 locations in the feature map, and each location has a feature with 512 dimensions. There are two residual networks in DIR-BiRN, so we obtain two different feature maps for each image. Then, we input the two feature maps into the bilinear pooling function f_BP to get the feature vector.
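The 7 * 7 spatial size can be checked with the usual output-size arithmetic for convolution and pooling layers. The sketch below assumes the standard ResNet-18 layer parameters (kernel, stride, padding), since Figure 3 is not reproduced here; the layer names are illustrative.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) for the layers that reduce resolution,
# assuming the standard ResNet-18 configuration.
DOWNSAMPLING_LAYERS = [
    ("7*7 conv", 7, 2, 3),
    ("3*3 max-pool", 3, 2, 1),
    ("conv3_x first 3*3 conv", 3, 2, 1),
    ("conv4_x first 3*3 conv", 3, 2, 1),
    ("conv5_x first 3*3 conv", 3, 2, 1),
]

def trace_sizes(input_size=224):
    """Return the spatial size after each downsampling layer."""
    sizes = [input_size]
    for _name, kernel, stride, padding in DOWNSAMPLING_LAYERS:
        sizes.append(conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes

print(trace_sizes())  # [224, 112, 56, 28, 14, 7]
```

The 1*1 shortcut convolution with s=2 and p=0 halves the size in the same way (for example, 56 to 28), so the shortcut and the main branch of a downsampling block produce matching shapes.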

Bilinear Pooling Function.
From the two residual networks, we get two feature extraction functions f_RNA(I_t, L_s) and f_RNB(I_t, L_s) for each location L_s in image I_t. Suppose that image I_t can be represented by a matrix M_I_t, which is computed from the outputs of the two extractors. M_I_t can be reshaped to a vector vec_x. Then the feature vector vec_I_t of the image is obtained by normalizing vec_x in two steps.
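The pooling, reshaping, and normalization steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the paper's implementation: the sum of outer products over locations and the two normalization steps (signed square root followed by L2 normalization) follow the bilinear CNN formulation of Lin et al. [27], and all function names are hypothetical.

```python
import math

def bilinear_pool(features_a, features_b):
    """Sum the outer products of the two extractor outputs over all locations.

    features_a, features_b: one feature vector per location, in the same order.
    Returns a dim_a x dim_b matrix as a list of rows.
    """
    dim_a, dim_b = len(features_a[0]), len(features_b[0])
    matrix = [[0.0] * dim_b for _ in range(dim_a)]
    for fa, fb in zip(features_a, features_b):
        for i, a in enumerate(fa):
            for j, b in enumerate(fb):
                matrix[i][j] += a * b
    return matrix

def flatten(matrix):
    """Reshape the matrix into a single vector, row by row."""
    return [v for row in matrix for v in row]

def signed_sqrt(vec):
    """First normalization step (assumed): sign(v) * sqrt(|v|)."""
    return [math.copysign(math.sqrt(abs(v)), v) for v in vec]

def l2_normalize(vec, eps=1e-12):
    """Second normalization step (assumed): divide by the Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]

def bilinear_vector(features_a, features_b):
    """Full pipeline: pool, reshape, then normalize in two steps."""
    pooled = bilinear_pool(features_a, features_b)
    return l2_normalize(signed_sqrt(flatten(pooled)))
```

With 49 locations and 512-dimensional features from each extractor, the pooled matrix would have 512 * 512 entries, which is one source of the extra time and storage cost of a bilinear model.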

Classification Function.
Finally, we input vec_I_t into the classification function f_c to obtain the classification result. f_c is the softmax function given in formula (9), where z_j is the output value of node j and n is the total number of categories.
To answer RQ 1, we select the crops with at least 3 image categories for the experiment: apple, corn, grape, potato, and tomato. Each of them has 3 or more categories of images in the PlantVillage dataset (including a healthy category). For RQ 2, we combine the datasets of these 5 crops and run the experiment on the combined dataset. The basic information of the datasets used in the experiment is shown in Table 2, and image samples from the dataset are shown in Figure 5. In the experiment, we use the standard single 18-layer residual network model (ResNet-18) as the baseline method.

Performance Metric.
[Figure 4: Residual block structures, built from convolution layers, ReLU, and a shortcut connection; the shortcut computing f(x) + w(x) uses a 1*1 conv, s=2, p=0.]

Many researchers have used accuracy, precision, recall, and F1-measure as metrics to evaluate classification performance. They are defined in formulas (10)-(13). In formula (10), n is the total number of categories, A_i is the total number of images of category i, and TP_i is the number of images correctly classified as category i. In formulas (11) and (12), Precision_i and Recall_i are the precision and recall of category i, respectively. FP_i is the number of images incorrectly classified as category i, and FN_i is the number of images that belong to category i but are incorrectly classified as other categories. However, precision and recall tend to trade off against each other, so neither alone reflects the overall performance of a disease image recognition method. Thus, we also employ the F1-measure, which is the harmonic mean of precision and recall and is calculated by formula (13).
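The four metrics can be computed from a confusion matrix as in the sketch below. Two points are assumptions rather than statements from the paper: per-category precision and recall are averaged over the n categories with equal weight (macro averaging), and the F1-measure is computed from the averaged precision and recall. The function name is hypothetical.

```python
def classification_metrics(confusion):
    """Compute accuracy, precision, recall, and F1-measure.

    confusion[i][j] is the number of images of true category i
    that were predicted as category j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)  # sum of A_i over all categories
    tp = [confusion[i][i] for i in range(n)]
    fp = [sum(confusion[r][i] for r in range(n)) - tp[i] for i in range(n)]
    fn = [sum(confusion[i]) - tp[i] for i in range(n)]

    def ratio(num, den):
        # Guard against empty categories to avoid division by zero.
        return num / den if den else 0.0

    accuracy = ratio(sum(tp), total)
    precision = sum(ratio(tp[i], tp[i] + fp[i]) for i in range(n)) / n
    recall = sum(ratio(tp[i], tp[i] + fn[i]) for i in range(n)) / n
    f1 = ratio(2 * precision * recall, precision + recall)
    return accuracy, precision, recall, f1
```

For example, `classification_metrics([[8, 2], [1, 9]])` gives an accuracy of 17/20 and a recall of (8/10 + 9/10) / 2.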

Experimental Parameter Settings.
In the experiment, the batch size is set to 128, the learning rate is set to 1 * 10^-3, and the Adam optimizer is used. We used the hold-out method as the validation method in the experiment. The dataset was randomly divided into a training set and a testing set, and the training set to testing set ratio was 1 to 4. In order to prevent overfitting, we used L2 regularization in the loss function. In the experiment, we found that the models' loss had basically converged after 100 training epochs. So, the number of epochs is set to 110 in the experiment.

Experimental Results.
We trained our models for a total of 110 epochs on each dataset and recorded the model parameters, training loss, accuracy, precision, recall, and F1-measure of each epoch. Researchers usually take the model parameters with the best accuracy as the final model parameters. In our experiment, we obtain the final parameters in the same way.
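The selection of the final parameters described above amounts to keeping the per-epoch record with the highest accuracy. A minimal sketch, assuming each epoch is stored as a dictionary with hypothetical keys `epoch`, `accuracy`, and `params`:

```python
def select_final_parameters(history):
    """Return the per-epoch record with the highest accuracy.

    history: list of dicts with keys 'epoch', 'accuracy', and 'params'.
    On ties, the earliest epoch is kept. Returns None for an empty history.
    """
    best = None
    for record in history:
        if best is None or record["accuracy"] > best["accuracy"]:
            best = record
    return best
```

Keeping the earliest epoch on ties is a design choice made for this sketch; the paper does not say how ties are resolved.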

Discussion of RQ 1.
The DIR-BiRN model and the ResNet-18 model are trained on the apple, corn, grape, potato, and tomato datasets, respectively. The training loss and accuracy over the 110 epochs on the different crop datasets are shown in Figure 6. In these figures, the red line represents training loss, and the black line represents accuracy. From the figures, we can find that the DIR-BiRN model and the ResNet-18 model have similar training processes. Both have high training loss and low accuracy in the beginning epochs. As the number of epochs increases, the training loss decreases while the accuracy increases. Finally, the training loss reaches its minimum and the accuracy reaches its maximum. We then recorded the model parameters with the best accuracy as the final model parameters. The performance of the model with the final model parameters is called the model's optimal performance. To compare the performance of our approach with that of the single ResNet-18 model, we recorded their worst performance and optimal performance on the different crop datasets. The worst performance is the performance obtained with the model parameters that have the lowest accuracy. The performance results on the different crops are shown in Tables 3-7, respectively. The worst performance is obtained in the beginning epochs. From the tables, we can find that DIR-BiRN's worst performance is not as good as that of the ResNet-18 model. But, as the number of epochs increases, the performance of both models improves. After reaching its maximum, DIR-BiRN obtains a better optimal performance than the ResNet-18 model. Generally, researchers apply and deploy recognition models with the model parameters that obtained the optimal performance. From this perspective, we can suggest that DIR-BiRN has better performance than ResNet-18. The confusion matrices of the two models are shown in Figure 7.
The column labels of the confusion matrix represent the predicted category, and the row labels represent the true category of the predicted image. The values on the diagonal are the numbers of correctly predicted images. The darker the diagonal, the better the model performs [33].
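The confusion matrix layout described above (rows for the true category, columns for the predicted category) can be built with a short sketch. It assumes categories are encoded as integers starting at 0; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, num_categories):
    """Build a confusion matrix: rows are true categories, columns are predictions."""
    matrix = [[0] * num_categories for _ in range(num_categories)]
    for true_cat, predicted_cat in zip(true_labels, predicted_labels):
        matrix[true_cat][predicted_cat] += 1
    return matrix

def correct_per_category(matrix):
    """Diagonal entries: the number of correctly recognized images per category."""
    return [matrix[i][i] for i in range(len(matrix))]
```

Comparing the output of `correct_per_category` for two models, category by category, gives the kind of per-disease comparison made from Figure 7.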
From Figure 7, we can find that DIR-BiRN recognized more images correctly for 9 kinds of crop diseases (2 kinds of apple diseases, 2 kinds of corn diseases, 3 kinds of grape diseases, 1 kind of potato disease, and 1 kind of tomato disease). ResNet-18 recognized more images correctly for 3 kinds of crop diseases (1 kind of corn disease, 1 kind of grape disease, and 1 kind of potato disease). From this perspective, we can consider that the DIR-BiRN model shows better recognition performance than the ResNet-18 model.
With this, we can answer research question 1: compared with the traditional single residual network model, DIR-BiRN improves the optimal performance when classifying many different diseases of the same crop. Because researchers usually apply and deploy recognition models with the model parameters that obtained the optimal performance, we can suggest that DIR-BiRN achieves a performance improvement.

Discussion of RQ 2.
For RQ 2, we combined the apple, corn, grape, potato, and tomato datasets into a combined dataset. There are a total of 25 image categories in the combined dataset. Then we ran the DIR-BiRN model and the ResNet-18 model on the combined dataset. The training loss and accuracy over the 110 epochs are shown in Figure 8. The training process is similar to that in RQ 1. Both models have high training loss and low accuracy in the beginning epochs. As the number of epochs increases, the training loss decreases to a minimum, and the accuracy increases to a maximum. The worst and optimal performances of DIR-BiRN and ResNet-18 are shown in Table 8.
Both models' worst performance is obtained in the first epoch, and the optimal performance is obtained in the last few epochs. Compared with ResNet-18, DIR-BiRN also has a better optimal performance on the combined dataset. It improves accuracy by 0.0319 percentage points, precision by 1.0534 percentage points, recall by 1.6804 percentage points, and F1-measure by 1.3804 percentage points. Because researchers usually apply and deploy recognition models with the model parameters that obtained the optimal performance, we can suggest that, compared with the traditional single residual network model, DIR-BiRN achieves a performance improvement when classifying many different diseases of different crops.

Discussion
This paper proposed a crop leaf disease image recognition method based on bilinear residual networks (named DIR-BiRN). According to our experiments, DIR-BiRN has better recognition performance (accuracy, recall, precision, and F1-measure) than the traditional ResNet-18 model. This is because DIR-BiRN's bilinear model can extract more fine-grained features from crop leaf disease images, and these fine-grained features are very useful for crop disease recognition.
There are still some limitations in our methodology. First, DIR-BiRN's bilinear model is more complex than the single model, so it consumes more time and storage than the single-model method. Second, our experiments are performed only on leaf disease images with simple backgrounds. If our method is applied to leaf disease images with complex backgrounds, its recognition performance is likely to degrade. Finally, we tested our method on only five kinds of crops; the recognition performance of DIR-BiRN on other crops still needs to be verified.

Conclusions
To address the fine-grained classification problem in crop leaf disease image recognition, we proposed a method based on bilinear residual networks (named DIR-BiRN). It integrates two 18-layer residual network feature extractors in a bilinear way. It can extract features more accurately and completely than the single residual network model, while the model can still be deployed and applied in an end-to-end way. So it has the advantages of both global models and local models. We tested DIR-BiRN on the PlantVillage dataset. In our experiment, DIR-BiRN showed better performance (accuracy, recall, precision, and F1-measure) than the single residual network model. From the confusion matrix results, we can also find that DIR-BiRN shows better recognition performance on some crop diseases that have very small disease spots (apple scab, apple black rot, grape black rot, etc.). These experimental results indicate that our bilinear residual networks can extract more fine-grained crop disease features from the images, which enables more accurate disease recognition. In future work, we will try to integrate more different feature extractors to obtain better recognition performance and also test our method on more datasets.

Data Availability
All data included in this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.