A Fruit Tree Disease Diagnosis Model Based on Stacking Ensemble Learning

Fruit tree diseases have a great influence on agricultural production. Artificial intelligence technologies have been used to help fruit growers identify fruit tree diseases in a timely and accurate way. In this study, a dataset of 10,000 images of pear black spot, pear rust, apple mosaic, and apple rust was used to develop the diagnosis model. To achieve better performance, we developed three kinds of ensemble learning classifiers and two kinds of deep learning classifiers, and validated and tested these five models. The stacking ensemble learning classifier outperformed the other classifiers, with an accuracy of 98.05% on the validation dataset and 97.34% on the test dataset. This suggests that, with small- and middle-sized datasets, stacking ensemble learning classifiers may be used as cost-effective alternatives to deep learning models under performance and cost constraints.


Introduction
In recent years, due to global climate and environmental changes, crop disasters around the world have occurred more frequently than ever, resulting in a significant decline in the yield and quality of agricultural products, especially fruit products. For example, the loss rate of fruit yield in the United States is about 20%, and that of some other countries is as high as 50% [1]. Crop disease has been the major cause of yield loss in agricultural production, which limits the high-quality, high-efficiency, and sustainable development of the world's agriculture [1,2]. However, most farmers have not mastered efficient and effective methods to identify fruit diseases by themselves.
In the early 20th century, the traditional disease recognition methods were mainly based on biological experiments. Professionals used electron microscopes and other equipment to observe bacterial changes, together with biological methods such as enzyme-linked immunosorbent assays, DNA probe technology, and PCR technology [3][4][5]. However, those recognition methods cannot be widely practiced due to the large investment in instruments and equipment and the high cost of time and labor. Since the 1970s, a large number of traditional expert systems have been used to diagnose crop diseases. For example, PLANT/DS, an expert system, was developed to diagnose soybean diseases and insect pests [6]. In 1982, PLANT/CD was developed to diagnose corn borer pests. In the 1990s, intelligent expert systems were developed to deal with agricultural problems. Various intelligent technologies were introduced to expert systems to improve the accuracy, intelligence, and practicability of disease diagnosis. However, expert systems still rely on rule-based reasoning, which is considered difficult to maintain and evolve. In the last 10 years, machine learning, especially deep learning, has supported plant disease diagnosis based on image recognition. This paper aims to propose a machine learning-based model for fruit disease diagnosis.

Related Studies
Related studies have recently focused on image segmentation, feature extraction, and model training of diagnosis models of plant diseases. Jaisakthi et al. proposed a grape disease system, which can segment leaves from background images and segment the ill areas based on global threshold processing and semisupervision technology. The systems with classification models were, respectively, trained with support vector machine, AdaBoost, and random forest machine learning algorithms [3]. Chakraborty et al. used the Otsu thresholding algorithm and histogram equalization to preprocess images for recognition of black rot and cedar apple rust. They separated the image segmentation region of the infected part, and the accuracy of the improved multiclass SVM model was up to 96% [4]. Hossain et al. proposed a k-nearest neighbor (KNN) classifier to detect and classify black spots, anthracnose, bacterial wilt, leaf spots, and canker of various plants, which mainly depended on the extraction of color and texture features of ill leaves. The classifier was validated with a final accuracy of 96.76% [5]. To identify apple diseases, Zhang et al. extracted 38 features of color, texture, and shape of leaves and combined a genetic algorithm with correlation-based feature selection (CFS) to extract the main features. They claimed that the recognition rate based on a support vector classifier reached more than 90% [7]. Mohamed et al. carried out research to detect diseases on four kinds of ill grape leaves in four stages: image enhancement with the stretch method, image segmentation with K-means, texture feature extraction, and classification based on multi-SVM and Bayesian classifiers. The average accuracy was nearly 100% in their validation experiment [8]. For the diagnosis of four common alfalfa leaf diseases, Qin et al. extracted 129 features of texture, color, and shape based on the K-means clustering algorithm and linear discriminant analysis. After screening important features, a disease identification model was established based on SVM. The results showed that the SVM model built with the 45 most important features selected from the 129 features was the final optimal model. For this SVM model, the recognition accuracies on the training set and the testing set were 97.64% and 94.74%, respectively [9].
In recent years, deep learning has attracted the attention of agronomic experts. Because of its significant advantages in feature extraction and ease of use, deep learning technologies have effectively promoted the development of agricultural intelligent mechanical applications [10]. The related studies are mainly conducted in data enhancement and model improvement. For example, to identify five common apple leaf diseases, Jiang et al. constructed 26377 apple leaf disease samples through data enhancement and image annotation technology and proposed a deep CNN model by introducing GoogleNet Inception and Rainbow concatenation. The model achieved 78.80% in mean average precision [11]. Liu et al. proposed an architecture of a deep CNN model based on AlexNet to detect diseases of apple leaves. Using 13689 ill leaf images as the sample dataset, the recognition rate of the model reached 97.62% in the model test [12]. Based on more than 7000 pear disease images, Yang et al. established models using deep learning neural network models including VGG16, Inception V3, ResNet50, and ResNet101 to explore the relationship between key influencing factors and severity of pear disease. The recognition rate of the diagnosis models ranged from 97.67% to 99.44% [13]. To identify types of maize leaf disease, Agarwal et al. improved the model from four aspects of enhanced convolution neural network (ECNN), fusion of extended convolution layer, one-dimensional convolution layer, and ECNN motivation. They established the ECNN model and achieved better performance than AlexNet and GoogleNet in precision, recall, and accuracy [14]. Zhang et al. proposed multiscale fusion convolutional neural networks (MSF-CNNs) for segmentation of cucumber ill leaf images. The method of gradual adjustment of transfer learning was adopted to speed up training. By introducing a multilevel parallel structure and multiscale connection, multiscale features of crop ill leaf images were extracted. The final average accuracy rate was 93.12%. Compared with Fully Convolutional Networks (FCNs), SegNet, U-Net, and DenseNet, the accuracy of the proposed model was increased by 13.00%, 10.74%, 10.40%, 10.08%, and 6.40%, respectively, and the training time was reduced by 0.9 hours [15].
Ensemble learning has also been introduced to image-based crop disease diagnosis. Ensemble learning aims at constructing a powerful classifier by using simple base classifiers, and it can avoid the high training cost and large dataset demand of deep learning. For example, Rehman et al. proposed a hybrid contrast stretching method for apple ill leaves to increase the visual impact of the image, which used a pretrained CNN model for feature extraction. They achieved a 96.6% recognition rate with the ensemble subspace discriminant analysis (ESDA) classifier [16]. To identify the three disease categories of corn leaves, Bhatt et al. collected the image features with a CNN and used the boosting ensemble learning method with decision tree classifiers to train on the features from the CNN. The experiment showed that the accuracy of the model was up to 98% [17]. Azim et al. proposed a model to detect three kinds of rice leaf diseases. By removing the background, segmenting the disease area, and extracting color, shape, and texture features, they used eXtreme gradient boosting (XGBoost) to enhance the recognition performance. The result showed that an accuracy of 86.58% was achieved [18].

The Data Source.
In this study, we selected samples for four common fruit tree diseases including pear black spot, pear rust, apple mosaic, and apple rust. These diseases are the most common diseases for apple and pear trees. The data for model training and validation are from the fruit tree disease image library of the Agricultural Knowledge Service System of Chinese Academy of Agricultural Sciences (AKSS), which contains 10,000 leaf images of pear black spot, pear rust, apple mosaic, and apple rust diseases.
There are 2,500 pictures of each disease. The pictures were collected by agronomists during the fruit tree growth period. As shown in Figure 1, each leaf picture is separated from the panorama with a pure white background and a color temperature between 5200 and 5500 K. The resolution of the picture is 2816 × 2112.
We also used the Baidu image search engine (https://image.baidu.com) with the disease names as keywords to gather fruit leaf images for the model test. As a result, 500 images were finally selected into the test dataset by agronomic experts. As shown in Figure 2, the pictures for the model test vary in quality and background, which is reasonable for evaluating generalization ability.

The Feature Extraction.
Feature extraction is the process of extracting invariant features from images to solve practical problems. Before building the fruit tree disease diagnosis model, the features of ill leaves should be extracted. Theoretically, it is necessary to integrate multidisciplinary knowledge such as mathematics and physics to define the features of images. Technically, it is necessary to combine digital image processing and computer vision techniques to depict digital image features [19]. In practice, the color, shape, texture, and number of disease spots on leaves are usually used as features to recognize the plant disease.

Color Feature Extraction of Ill Leaves.
Color attributes of pictures can be represented with CMYK, HSV, RGB, bitmap, and grayscale models. In this study, RGB is used to define the color feature. RGB uses the change of the red (R), green (G), and blue (B) color channels and their superposition to express a variety of colors. As one of the most widely used color systems, the RGB system includes almost all colors that human vision can perceive. Since the color and size of the disease spots are clearly different from the healthy parts of the leaf and differ between diseases, the statistical description of RGB data contributes to the recognition of leaf diseases. We defined the following indexes to describe the color feature of fruit leaves with the RGB system.
As shown in Equation (1), L_i, the first moment of color data, denotes the general level of the color in channel i, where P is the number of pixels in each of the R, G, and B channels, i is the channel ID, and X_ij is the color brightness value of pixel j in channel i.
$$L_i = \frac{1}{P}\sum_{j=1}^{P} X_{ij} \quad (1)$$
In Equation (2), σ_i is the second moment of color data, which uses the standard deviation (Std.) value to reflect the fluctuation degree of leaf colors.
$$\sigma_i = \sqrt{\frac{1}{P}\sum_{j=1}^{P} \left(X_{ij} - L_i\right)^2} \quad (2)$$
In Equation (3), R_i denotes the range of color values in channel i, which reflects the extreme difference of colors in a channel.
$$R_i = \max_{j} X_{ij} - \min_{j} X_{ij} \quad (3)$$
In addition, since the mean value cannot objectively reflect the overall level of color in a channel when the data are not normally distributed, we took the median value M_i of channel i as a supplement to L_i.
All color features of the dataset are shown in Table 1.
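As an illustration of the four color indexes, the following minimal sketch computes the mean, standard deviation, range, and median of each RGB channel, which gives 12 color features per image. The function name `color_features` and the tiny demo image are hypothetical, and the population standard deviation (division by P) is an assumption rather than a detail confirmed by the text.

```python
import numpy as np

def color_features(rgb):
    """Return 12 color statistics for an image of shape (H, W, 3).

    Order: mean (L_i), std (sigma_i), range (R_i), median (M_i),
    each reported for the R, G, and B channels.
    """
    pixels = rgb.reshape(-1, 3).astype(np.float64)  # one row per pixel
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)  # population std, an assumption
    value_range = pixels.max(axis=0) - pixels.min(axis=0)
    median = np.median(pixels, axis=0)
    return np.concatenate([mean, std, value_range, median])

# Hypothetical 2 x 2 image used only to exercise the function.
demo = np.array([[[0, 10, 20], [10, 10, 40]],
                 [[20, 10, 60], [30, 10, 80]]], dtype=np.uint8)
features = color_features(demo)
print(features.round(2))
```

In practice the leaf pixels would probably be separated from the white background before the statistics are computed, otherwise the background dominates every channel.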

Texture Feature Extraction of Ill Leaves.
As one important visual feature of pictures, texture refers to properties inherent to the surface of an object, such as its optical properties, microgeometric features, and other closely related surface information. In this study, through the observation of ill leaves of four kinds of fruit tree diseases, we found that spots of pear black spot, pear rust, and apple rust were scattered on the surface of ill leaves, while spots of apple mosaic leaf disease were irregular and spread in a continuous way. Therefore, the texture feature is an important factor to distinguish the different ill leaves. As a powerful tool to extract texture features of pictures, the gray-level cooccurrence matrix (GLCM) statistically characterizes the cooccurrence level of gray-level pixels [19].
This kind of texture context information is adequately specified by the matrix of relative frequencies with which two neighboring resolution cells separated by a distance d occur on the image, one with gray tone i and the other with gray tone j, at the angle θ (see Equation (4)), where N is the number of gray levels. Such matrices of gray-tone spatial dependence frequencies are a function of the angular relationship between the neighboring resolution cells, as well as a function of the distance between them. θ is usually set to 0°, 45°, 90°, and 135°. Figure 3 illustrates a GLCM example with d = 1 and θ = 0°. The image has 8 gray levels. With GLCM, Haralick et al. proposed 14 indexes to describe the texture of pictures, which include angular second moment (ASM), contrast (CON), correlation, sum of squares, inverse difference moment (IDM), sum average, sum variance, sum entropy, entropy (ENT), difference variance, difference entropy, 2 information measures of correlation, and maximal correlation coefficient [20]. Due to the diversity of leaf image textures, all 14 statistical indexes are used in this study, and the texture features are condensed by dimensionality reduction before model training. Table 2 shows the calculation results of some important texture indexes.
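To make the GLCM construction concrete, the sketch below counts gray-tone pairs at distance d = 1 and angle θ = 0° and derives four of the Haralick indexes (ASM, CON, IDM, ENT). It is a simplified illustration: pairs are counted in one direction only, the entropy uses the natural logarithm, and the function names and the 4-level demo image are assumptions, not the implementation used in this study.

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Relative frequency of gray tone i at (r, c) and j at (r + dy, c + dx)."""
    counts = np.zeros((levels, levels), dtype=np.float64)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r, c], image[r2, c2]] += 1
    return counts / counts.sum()

def texture_indexes(p):
    """Compute ASM, contrast, IDM, and entropy from a normalized GLCM."""
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]  # skip empty cells so that log is defined
    return {
        "ASM": np.sum(p ** 2),
        "CON": np.sum((i - j) ** 2 * p),
        "IDM": np.sum(p / (1.0 + (i - j) ** 2)),
        "ENT": -np.sum(nonzero * np.log(nonzero)),
    }

# Hypothetical 4 x 4 image with 4 gray levels.
demo = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 2, 2, 2],
                 [2, 2, 3, 3]])
p = glcm(demo, levels=4)  # d = 1, theta = 0 degrees
indexes = texture_indexes(p)
print(indexes)
```

Other angles can be obtained by changing the offset, for example `dx=1, dy=-1` for 45° under the usual image coordinate convention.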

Shape Feature Extraction of Disease Spots.
For a typical fruit disease, the shape of the disease spots on leaves is relatively stable. However, the shape features of different disease spots are often different. Therefore, the shape features of disease spots are essential in the recognition of fruit diseases. Because the disease spots are often small and irregular, it is difficult to describe their shape with regular geometric measures, so the fractal dimension is used. At present, the methods to calculate the fractal dimension of irregular objects include the box counting method, perimeter area method, variable method, and radius method. Among them, the box counting method is popular and easy to use. It is applicable whether the object is a curve or a surface surrounded by a curve, and it has little to do with the physical nature of the object. The counting dimension value D used in the box counting method is defined in the following equation:
$$D = \lim_{r \to 0} \frac{\log N(A, r)}{\log (1/r)} \quad (5)$$
where N(A, r) is the number of square grid cells of width r that contain part of the spot, and A is the binary image matrix. Figure 4 shows a spot in the grid-like background. In practice, for ease of computation, the slope of the linear fit of log N(A, r) against log(1/r) is often used as the approximate value of D. The slope is easy to obtain by the ordinary least squares (OLS) method. In this study, the counting dimension values of different fruit diseases are shown in Table 3.
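The box counting estimate can be sketched as follows, under the common convention that N(A, r) counts the grid cells of width r containing at least one foreground pixel and that D is the least-squares slope of log N(A, r) against log(1/r). The function names, the grid widths, and the two synthetic shapes are hypothetical choices for illustration.

```python
import numpy as np

def box_count(binary, r):
    """Count r x r grid cells that contain at least one foreground pixel."""
    h, w = binary.shape
    # Pad with background so that both sides are multiples of r.
    padded = np.pad(binary, ((0, (-h) % r), (0, (-w) % r)))
    blocks = padded.reshape(padded.shape[0] // r, r, padded.shape[1] // r, r)
    return int(blocks.any(axis=(1, 3)).sum())

def box_counting_dimension(binary, sizes=(1, 2, 4, 8, 16)):
    """Estimate D as the OLS slope of log N(A, r) against log(1/r)."""
    counts = [box_count(binary, r) for r in sizes]
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(counts), 1)
    return slope

# Two sanity checks: a filled square and a straight line.
square = np.ones((64, 64), dtype=bool)
line = np.zeros((64, 64), dtype=bool)
line[32, :] = True

d_square = box_counting_dimension(square)
d_line = box_counting_dimension(line)
print(round(d_square, 3), round(d_line, 3))
```

A filled region and a straight line give slopes of 2 and 1 respectively, so irregular spot outlines are expected to fall between these two values.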

The Number Feature Extraction of Disease Spots.
For different diseases, the number of disease spots differs to some extent. By observing the ill leaf pictures, we found that (1) there are a few black spots for pear black spot and a large number of yellow spots for pear rust; (2) there is a large area of light colored patches for apple rust, and the color of the spots has obvious variability. So, we adopted SimpleBlobDetector (SBD) [21,22] to count the disease spots.
SBD is an image segmentation method based on topological and morphological theories. This algorithm is good at handling weak edge information and connecting grayscale edges. Meanwhile, the catchment basin concept effectively preserves the regional features of the image. Therefore, SBD is suitable for image segmentation. The flowchart of SBD is shown in Figure 5.
Since the color brightness of the spots varies from dark to gray, we capture white blobs and black blobs separately. The blobColor parameter is set to 255 to count white spots and 0 to count black spots. The number features of disease spots are described in Tables 4 and 5.

Data Standardization and Dimensionality Reduction.
Through the feature extraction, we got 33 features to describe the leaf pictures. These features are grouped into 12 color features, 14 texture features, 2 number features, and 5 shape features. Since the values of different features vary in different ranges, we conducted data standardization before the model training. Each feature value x is transformed by the following equation:
$$x' = \frac{x - \mu}{\sigma} \quad (6)$$
where μ is the mean of the feature samples, and σ is the standard deviation.
To simplify the diagnosis model and improve its generalization performance, we also reduced the dimensionality of the 33-feature dataset. Principal component analysis (PCA) was used to conduct the task. The PCA algorithm has one main parameter, n_components, which determines either the dimension after dimensionality reduction or the proportion of information retained after dimensionality reduction. The parameter is usually set according to experience rather than definite rules. To ensure the rationality of dimensionality reduction, n_components values from 2 to 33 were validated. The reduced datasets with different dimensions were tested by a classifier such as a logistic regression classifier. The dimensions with the best f1_score were selected. The test results are shown in Figure 6.
According to Ockham's Razor, the dataset with 6 dimensions was taken as the dataset for model training and validation. The data after dimensionality reduction are shown in Table 6.
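The dimension scan described in this subsection could look roughly like the sketch below, which standardizes the features, applies PCA with n_components from 2 to 33, and scores each setting with a logistic regression classifier. The synthetic dataset, the 3-fold cross-validation, the macro-averaged F1, and the 0.01 tolerance used to prefer a smaller dimension are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 33-feature, 4-class leaf dataset.
X, y = make_classification(n_samples=400, n_features=33, n_informative=8,
                           n_classes=4, random_state=0)

scores = {}
for k in range(2, 34):
    model = make_pipeline(StandardScaler(),          # z-score standardization
                          PCA(n_components=k),       # keep k components
                          LogisticRegression(max_iter=1000))
    scores[k] = cross_val_score(model, X, y, cv=3,
                                scoring="f1_macro").mean()

best = max(scores.values())
# Prefer the smallest dimension whose score is within 0.01 of the best,
# a hypothetical reading of the Ockham's Razor criterion.
chosen = min(k for k, s in scores.items() if s >= best - 0.01)
print(chosen, round(scores[chosen], 4))
```

Placing the scaler and PCA inside the pipeline keeps both fitted on the training folds only, which avoids leaking validation data into the preprocessing.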

Model Training and Selection
As is known in machine learning, deep learning is the most popular technology for image recognition. However, if the training dataset is small or middle-sized, the performance of a deep learning model is not guaranteed. In this study, we applied both ensemble learning and deep learning to find the best model. The two approaches are compared by the metrics defined in Section 4.1.

Model Evaluation Metrics.
We used f1_score to evaluate machine learning models. f1_score is defined in Equations (7) and (8). As a deliberately designed metric, f1_score balances the precision and recall of the model.
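For reference, the usual definition of the F1 score as the harmonic mean of precision and recall can be sketched as follows. The macro average over classes is an assumption, since the averaging mode for the four-class problem is not stated here, and the labels in the example are made up.

```python
import numpy as np

def f1_per_class(y_true, y_pred, label):
    """F1 of one class: harmonic mean of its precision and recall."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == label) & (y_pred == label))
    fp = np.sum((y_true != label) & (y_pred == label))
    fn = np.sum((y_true == label) & (y_pred != label))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = np.unique(np.concatenate([y_true, y_pred]))
    return float(np.mean([f1_per_class(y_true, y_pred, l) for l in labels]))

# Made-up labels for three classes.
y_true = ["rust", "rust", "mosaic", "mosaic", "mosaic", "spot"]
y_pred = ["rust", "mosaic", "mosaic", "mosaic", "spot", "spot"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```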

Ensemble Learning.
Ensemble learning is a powerful way to integrate many weak classifiers for better prediction. In practice, ensemble learning classifiers often perform better than a single classifier, and sometimes better than deep learning classifiers on small- and middle-sized datasets. According to the ensemble tactics, ensemble learning is divided into 3 branches: bagging-based ensemble learning, boosting-based ensemble learning, and retraining-based ensemble learning.

Bagging Ensemble Learning-Based Model Training.
In bagging ensemble learning, all base classifiers are trained concurrently, so the efficiency of the training is much higher than that of other ensemble learning algorithms. If each base classifier also samples features differently, the generalization ability is further improved. The output of the bagging ensemble learning model is usually decided by majority vote. The flowchart of the training of bagging ensemble learning is shown in Figure 7.
In this study, we chose the random forest algorithm, which has been shown to be one of the best ensemble learning algorithms [23,24]. Since random forest uses classification and regression trees (CART) and feature sampling to train base classifiers, the main hyperparameters including max_depth, max_features, min_samples_leaf, min_samples_split, and n_estimators need to be determined before the model training. We used the GridSearchCV method of Scikit-learn to optimize the hyperparameters and got the final random forest-based diagnosis model. The final parameters were determined with the following values: {"max_depth": 40, "min_samples_split": 2, "min_samples_leaf": 1, "n_estimators": 100, "max_features": 0.6}. The f1_score was 0.9249, and the training time was 38.4 minutes.
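A grid search of this kind can be sketched with Scikit-learn as below. The hyperparameter names and the reported best values come from the text; the alternative candidate values, the synthetic 6-feature dataset, the 3-fold split, and the macro-averaged F1 scoring are assumptions made so that the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 6-dimension, 4-class training data.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# Small illustrative grid around the reported best values.
param_grid = {
    "max_depth": [10, 40],
    "max_features": [0.6, 1.0],
    "min_samples_leaf": [1],
    "min_samples_split": [2],
    "n_estimators": [50, 100],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```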

Boosting Ensemble Learning-Based Model Training.
Boosting is one of the most important developments in model training methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking the majority vote or weighted mean of the sequence of classifiers thus produced. That is, after a base classifier is trained, the next base classifier is trained using the prediction errors of the previous one. The weights of the wrongly predicted samples are adjusted to improve the next classifier's accuracy. As a result, the bias of the next classifier is decreased. Generally, the final outputs of ensemble classifiers are averaged with different weights [25]. The flowchart of the training of boosting ensemble learning is shown in Figure 8. The boosting ensemble learning algorithm family has some famous members such as AdaBoost, GBDT, LightGBM, and XGBoost. Compared with AdaBoost and other algorithms, XGBoost uses the same sampling methods as random forest, which has been shown to decrease the variance effectively. We also used the GridSearchCV method to optimize the hyperparameters and got the final XGBoost-based diagnosis model. The best parameter set is as follows: {"subsample": 0.7, "learning_rate": 0.1, "max_depth": 8, "colsample_bytree": 0.5, "n_estimators": 200}. The f1_score was 0.9398, and the training time was 42.6 minutes.
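The reweighting idea behind boosting can be illustrated with a minimal discrete AdaBoost loop on a binary problem. This is not the XGBoost model trained in this study; it is a simplified stand-in that only shows how misclassified samples gain weight and how the base classifiers are combined by a weighted vote. The decision stumps, the 20 rounds, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with labels mapped to {-1, +1}.
X, y01 = make_classification(n_samples=300, n_features=6, n_informative=4,
                             n_redundant=0, random_state=0)
y = 2 * y01 - 1

weights = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this stump
    weights *= np.exp(-alpha * y * pred)    # misclassified samples gain weight
    weights /= weights.sum()
    stumps.append(stump)
    alphas.append(alpha)

def predict(X_new):
    """Weighted vote of all stumps."""
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(votes)

first_acc = np.mean(stumps[0].predict(X) == y)
ensemble_acc = np.mean(predict(X) == y)
print(round(first_acc, 3), round(ensemble_acc, 3))
```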

Retraining Ensemble Learning-Based Model Training.
Retraining ensemble learning uses the primary classifiers' outputs as inputs to train the secondary classifier. The typical algorithm is the stacked generalization (also called stacking ensemble) algorithm [26,27]. The algorithm mainly consists of primary classifiers and a secondary classifier. The number and type of primary classifiers are not limited. However, for the sake of efficiency and generalization ability, simple, classical, and heterogeneous classifiers are preferred. Figure 9 shows the flowchart of the stacking ensemble learning process.
In this study, we explored 4 classical simple classifiers as primary classifiers and random forest as the secondary classifier to create the stacking ensemble model. The 6-dimension dataset was used to train these classifiers. In the first training stage, all primary classifiers were trained by GridSearchCV to get the high-accuracy classifiers and their outputs. Then, in the second training stage, the outputs were merged as the training data to train the secondary random forest classifier. Table 7 shows the test results.
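A two-stage stacking model of this shape can be sketched with Scikit-learn's `StackingClassifier`. The random forest as secondary classifier follows the text; the particular four primary classifiers below, their settings, and the synthetic dataset are assumptions. Note also that `StackingClassifier` feeds the secondary classifier with cross-validated outputs of the primary classifiers, which may differ from the merging procedure used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 6-dimension, 4-class dataset.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical choice of four simple, heterogeneous primary classifiers.
primary = [
    ("svc", SVC()),
    ("knn", KNeighborsClassifier()),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
]
stack = StackingClassifier(
    estimators=primary,
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    cv=5,  # out-of-fold primary outputs become the secondary training data
)
stack.fit(X_train, y_train)
val_f1 = f1_score(y_val, stack.predict(X_val), average="macro")
print(round(val_f1, 4))
```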
If the primary classifiers are trained concurrently, the total training time can be estimated at about 54.8 minutes, which is the sum of the training time of the support vector classifier in the first stage and that of the random forest in the second stage.
... used to express complex features. After feature extraction, features are fed into pooling layers for feature selection and information filtering. Consequently, the high-dimension data are significantly condensed before being fed into the fully connected layer. The process is shown in Figure 10.
CNN is a family of deep learning algorithms with many famous members, including LeNet [32], AlexNet [33], ZF Net [34], GoogLeNet [35], VGGNet [36], ResNet [37], and DenseNet [38]. In this study, ResNet-101 and DenseNet-121, as two popular CNN algorithms, were selected to create diagnosis models. We conducted the model training in TensorFlow [39] on the 33-dimension dataset. Test results of the two models are shown in Table 8. The stacking ensemble-based model outperformed the other models on the small- and middle-sized dataset, and its training time (nearly 60 minutes) is acceptable. The two deep learning models also achieved good scores, but their training time was much longer than that of the stacking-based model. This suggests that even though deep learning algorithms usually perform better than other algorithms in image recognition, their performance may not match that of simple machine learning algorithms when datasets are not large and diverse enough.

Results and Discussion
To further evaluate the above models, we used the test dataset introduced in Section 3.1 to test the models. Since the test data are not used in model training, we can evaluate the generalization ability of all models. The f1_scores of the models are 93.88% (random forest), 94.65% (XGBoost), 97.34% (stacking), 95.21% (ResNet-101), and 96.27% (DenseNet-121). The stacking-based model was still the best one.
We observed the outputs of all models. For 57 out of 500 test samples, the models gave inconsistent predictions, and among these samples the stacking-based model made the most correct predictions. Table 9 shows the difference.
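The comparison of inconsistent predictions can be sketched as below: a sample counts as inconsistent when the models do not all predict the same label, and each model is then scored on that subset only. The prediction matrix and ground truth here are made-up values for five models and five samples, not the study's outputs.

```python
import numpy as np

# Rows are models, columns are test samples (hypothetical class IDs).
predictions = np.array([
    [0, 1, 2, 3, 1],
    [0, 1, 2, 2, 1],
    [0, 1, 2, 3, 1],
    [0, 1, 1, 3, 1],
    [0, 1, 2, 3, 1],
])
truth = np.array([0, 1, 2, 3, 1])

# A sample is inconsistent if any model disagrees with the first model.
inconsistent = np.any(predictions != predictions[0], axis=0)
n_inconsistent = int(inconsistent.sum())

# Correct predictions of each model on the inconsistent samples only.
correct_on_inconsistent = (
    predictions[:, inconsistent] == truth[inconsistent]
).sum(axis=1)
print(n_inconsistent, correct_on_inconsistent)
```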
We also observed the accuracy of the models on different diseases. Table 10 shows that the stacking-based model is still better than the other models. According to the results demonstrated above, the stacking ensemble-based model is selected as the final model for the diagnosis of fruit tree disease.

Conclusions and Future Studies
To automatically identify fruit tree diseases with leaf pictures, we trained machine learning models with ill leaf pictures to create the diagnosis model. Since the size of the dataset is not large enough to implement reliable deep learning models, we trained 3 kinds of ensemble learning models and compared their accuracy with that of 2 deep learning-based models. The results showed that the stacking ensemble-based model outperformed the other kinds of models. This study also suggests that when the dataset is small or middle-sized, the accuracy of deep learning models may not be satisfactory. The ensemble learning models, especially the stacking ensemble-based model, would be a highly cost-effective solution with the help of high-quality feature engineering. Some studies tried ensemble learning of deep learning classifiers and achieved high prediction accuracy [40]. However, the cost of the model training was heavily increased, while the efficiency of the model was decreased. This suggests that stacking ensemble learning classifiers may be used as cost-effective alternatives to deep learning models under performance and cost constraints.
It should be noted that the study has limitations in feature engineering and test data collection. (1) As was discussed in Section 3.2, we only tried the RGB color scheme to extract the color features and the box counting method to extract the shape features, which inevitably led to incomplete and inaccurate feature expression. (2) The test dataset used to evaluate and select the final model was limited in size and diversity, which may lead to inaccurate evaluation and choice of the best model. Therefore, in future studies, we will improve our work.

Data Availability
The training dataset was downloaded from the database (http://agri.ckcest.cn/specialtyresources/list29-1.html).

Conflicts of Interest
The authors declare no conflicts of interest.