Optimizing Pretrained Convolutional Neural Networks for Tomato Leaf Disease Detection

Vegetable and fruit plants support around 7.5 billion people across the globe and play a crucial role in sustaining life on the planet. The rapid increase in the use of chemicals such as fungicides and bactericides to curtail plant diseases is harming the agro-ecosystem. The widespread prevalence of diseases in crops reduces both the quantity and the quality of production. A quick and reliable method for early identification and diagnosis of diseases would therefore benefit farmers. In this context, our research focuses on the identification and classification of tomato leaf diseases using convolutional neural network (CNN) techniques. We consider four CNN architectures, namely, VGG-16, VGG-19, ResNet, and Inception V3, and use feature extraction and parameter-tuning to identify and classify tomato leaf diseases. We test the models on two datasets: a laboratory-based dataset and self-collected data from the field. All architectures perform better on the laboratory-based dataset than on the field-based data, with the difference across the various performance metrics in the range 10%–15%. Inception V3 is the best performing model on both datasets.


Introduction
No life is possible without plants; they provide food to all terrestrial living organisms and protect the ozone layer that is responsible for filtering the harmful UV radiation of the sun. Although plants are essential for life, their growth is threatened by a variety of diseases. Rapid recognition and diagnosis of diseases helps reduce the chances of damage to the ecosystem. In the absence of systematic disease identification, the quality and quantity of products are affected. This further affects the economy of a country [1].
The United Nations Food and Agriculture Organization (FAO) estimates that agricultural production needs to increase by 70% by 2050 to meet the world's food needs [2]. On the other hand, a rapid increase in the use of chemicals such as fungicides and bactericides to curtail diseases has been negatively affecting the agro-ecosystem. Thus, we need rapid and effective early disease detection and classification techniques to identify plant diseases and sustain the agro-ecosystem. Among the fruit plants, tomato is a part of the daily diet. Technology-oriented approaches such as image processing and deep learning provide the opportunity to develop systems for early identification of tomato plant leaf diseases. Approximately 50% of the plant's production is damaged due to several diseases [1]. Farmers identify the disease by examining the plant and making judgements based on their past experiences [3]. This method does not provide accurate results, as different farmers may have different experiences, and it lacks scientific rigour. Farmers may misclassify a disease, and a wrong treatment may cause more damage to the plant. Likewise, a domain expert's visit to the field is costly. This necessitates an automated image-based disease detection and classification mechanism that can replace the domain expert.
A number of researchers have focused on the development of automated techniques for plant disease identification using state-of-the-art techniques [4][5][6][7][8][9]. Durmuş et al. [4] used two different deep learning neural network architectures, namely, AlexNet and SqueezeNet, for automated detection of disease in tomato leaves. The authors used images from the PlantVillage dataset. They did not evaluate the performance of the neural network architectures on standard performance metrics such as F1, recall, and precision; instead, they used only the accuracy and inference time of the model. Tm et al. [5] proposed a variant of the convolutional neural network called LeNet for detection and identification of diseases in tomato leaves. The objective of the work was to identify a computationally robust technique for the underlying problem. The authors used images from the PlantVillage repository and reported an accuracy of 94%. Mohanty et al. [6] used the AlexNet and GoogLeNet deep learning architectures to develop models for classification of tomato leaf diseases. The authors used a combination of learning algorithms and various training and testing splits and reported an accuracy of 99.35% using the PlantVillage dataset. For other contributions, the reader is referred to [7][8][9]. A common problem observed in the literature is the choice of dataset. The majority of the proposed techniques used controlled datasets that contain images obtained in ideal conditions in a controlled environment. However, in the real world, it is not always possible to obtain high-quality, high-resolution images for detection and classification of tomato leaf diseases.
In this work, we evaluate convolutional neural network-based architectures, namely, VGG-16, VGG-19, ResNet, and Inception V3, for image-based detection and classification of tomato leaf diseases. Unlike previous studies, we used two types of datasets: firstly, we collected real field data from a tomato field in an uncontrolled environment and then used data augmentation to increase the number of instances; secondly, we used laboratory data collected in a controlled environment. We report the results based on various performance metrics including accuracy, recall, precision, and F1-score. Thus, our evaluation methods are more robust and representative of a real-world scenario.
The rest of the paper is organized as follows. Section 2 presents the proposed approach, including the working of the CNN models, a description of the datasets, and the performance evaluation metrics. Section 3 presents and discusses the results based on the performance evaluation metrics for feature extraction as well as parameter-tuning. Section 4 concludes the work and provides directions for future research.

Approach.
We consider four well-known convolutional neural network (CNN) architectures for identification and classification of diseases in the tomato plant leaf: VGG (VGG-16 and VGG-19), residual neural network (ResNet), and Inception V3. Although the concept of deep neural networks is not new, the availability of substantial amounts of data and affordable computational power has made them a reliable method in a variety of domains. CNNs are well known for image-based classification problems [10][11][12]. A distinguishing feature of a CNN is the convolution layer, which uses a convolution operation in place of general matrix multiplication. The layers in a typical CNN include convolution, activation, pooling, and classification [13]. The convolution layer applies learned filters to extract local features from the input and can also reduce its dimensions. The activation layer applies a nonlinear operator such as the rectified linear unit (ReLU). The pooling layer further reduces the dimensions by applying a statistical function such as MaxPool to neighboring values. After these steps, a softmax function can be applied to classify the input into one of the predetermined classes. Although CNNs are shown to achieve excellent performance on image classification problems [14], two key problems are reported in the implementation and use of CNNs. Firstly, a CNN involves a considerable number of parameters that are estimated in the training phase; secondly, the training phase itself requires a large number of input images.
Thus, designing and training CNNs from scratch is not considered an ideal solution. Instead, pretrained CNN models are used (an approach commonly known as transfer learning), and only the last few layers of the model are trained to estimate the parameters associated with those layers. Several such pretrained models are proposed in the literature, and we discuss the notable ones selected for our study.
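The layer sequence described above (convolution, activation, pooling, softmax) can be illustrated with a minimal pure-Python sketch. The toy image, the hand-picked kernel, and the `weights` of the final scoring step are all assumptions made for illustration; in a real CNN these values are learned during training, and this is not the implementation used in the study.

```python
import math

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Rectified linear unit: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size neighborhoods."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 5x5 grayscale image containing a vertical edge (illustrative data).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked 2x2 kernel that responds to a left-to-right intensity increase.
kernel = [[-1, 1], [-1, 1]]

features = max_pool(relu(conv2d_valid(image, kernel)))  # 5x5 -> 4x4 -> 2x2

# Hypothetical two-class scoring weights standing in for a trained classifier.
flat = [v for row in features for v in row]
weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
scores = [sum(w * f for w, f in zip(row, flat)) for row in weights]
probs = softmax(scores)
```

Here the 5 × 5 input shrinks to a 2 × 2 feature map, which shows how convolution without padding and pooling reduce the spatial dimensions before classification.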

VGG Net.
The pretrained model was introduced by the Visual Geometry Group (VGG) at the University of Oxford, hence the name VGG [10]. The basic working principle of VGG Net is to use a deeper network with smaller filters. The input layer of the VGG architecture is set for an image size of 224 × 224. Preprocessing involves subtraction of the mean RGB value from each pixel of the input image. Preprocessing is followed by five stacks of convolutional layers, each of which is followed by a MaxPool layer. The final MaxPool layer precedes three fully connected (FC) layers. The first two FC layers have 4096 channels each, whereas the last FC layer has 1000 channels and is followed by a softmax activation function. The VGG network has multiple variants, notably VGG-16 and VGG-19, which share the same architecture but differ in the number of layers: VGG-16 uses 16 layers, whereas VGG-19 uses 19 layers. The differentiating factor is the number of convolutional layers in the third, fourth, and fifth convolutional stacks.
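As a quick arithmetic check of the layer counts, the sketch below tallies the weight layers of the two variants. The per-stack convolution counts are taken from the configurations in the original VGG paper [10] rather than from this study, and the 224 × 224 input size is likewise the one used there; the helper names are illustrative.

```python
# Convolutional layers in each of the five stacks (VGG paper configurations).
VGG_CONV_STACKS = {
    "VGG-16": [2, 2, 3, 3, 3],
    "VGG-19": [2, 2, 4, 4, 4],
}
NUM_FC_LAYERS = 3  # two 4096-channel layers plus the 1000-channel output layer

def weight_layers(name):
    """Total number of layers with learnable weights (conv + FC)."""
    return sum(VGG_CONV_STACKS[name]) + NUM_FC_LAYERS

def spatial_size_after_pooling(input_size=224, num_pools=5):
    """Each 2x2 MaxPool with stride 2 halves the spatial size."""
    size = input_size
    for _ in range(num_pools):
        size //= 2
    return size

layer_counts = {name: weight_layers(name) for name in VGG_CONV_STACKS}
final_map_size = spatial_size_after_pooling()
```

The two variants differ only in the last three stacks, and after five pooling steps the feature map passed to the FC layers is 7 × 7.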

ResNet.
Residual network (ResNet) addressed the problem of training deep neural networks by introducing the concept of residual learning [15]. He et al. [15] highlighted that as the neural network architecture becomes deeper, degradation occurs. Degradation is the phenomenon of an increase in the training error as more layers are added to the architecture of a neural network. To solve the problem of degradation, the authors introduced the residual block. Unlike VGG, which simply stacks convolutional layers followed by a MaxPool layer, ResNet adds a shortcut connection that carries the input of a stack of convolutional layers to its output, so that the stack only has to learn a residual mapping between the two, which makes deeper networks easier to optimize.
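In the formulation of He et al. [15], a residual block outputs F(x) + x, where F is the mapping learned by the stacked layers and x is passed along a shortcut connection. The following is a minimal sketch of that idea on plain Python lists; `zero_residual` and `small_correction` are hypothetical stand-ins for a learned stack of convolutional layers.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where F = residual_fn plays the role of the layer stack."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the shortcut connection requires matching shapes")
    return [f + xi for f, xi in zip(fx, x)]

def zero_residual(x):
    """A stack that has learned nothing: the block reduces to the identity."""
    return [0.0 for _ in x]

def small_correction(x):
    """A stack that has learned a small adjustment on top of the input."""
    return [0.1 * v for v in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)
adjusted_out = residual_block(x, small_correction)
```

When the residual is zero the block passes its input through unchanged, which is one way to see why adding such blocks should not, in principle, increase the training error.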

Inception Network.
Szegedy et al. [16] extended the concept of network in network and proposed a modified CNN architecture that achieves improved performance by increasing the depth of the network while keeping the computational cost low. In contrast to VGG, Inception networks proved to be computationally efficient in terms of both computing resource utilization and the number of parameters. However, the downside of the original Inception network was its limited adaptability to new use cases. Szegedy et al. [14] refined the original Inception network model by factorizing convolutions with large filter sizes into smaller convolutions and into asymmetric convolutions. For details, the reader is referred to [14,16]. As a preprocessing step, we used histogram equalization to increase the contrast. In addition, the input images are resized to match the requirements of the individual network (for instance, for VGG, the images are resized to 224 × 224).
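The saving from factorized convolutions can be checked with a short weight count. The sketch below compares a 5 × 5 convolution with two stacked 3 × 3 convolutions, and a 3 × 3 convolution with a 1 × 3 followed by a 3 × 1 convolution. The channel width `C` is an arbitrary assumption, and bias terms are ignored.

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels

C = 64  # hypothetical number of input and output channels

single_5x5 = conv_weights(5, 5, C, C)
two_3x3 = 2 * conv_weights(3, 3, C, C)       # same 5x5 receptive field

single_3x3 = conv_weights(3, 3, C, C)
asymmetric = conv_weights(1, 3, C, C) + conv_weights(3, 1, C, C)

saving_small = 1 - two_3x3 / single_5x5      # fraction of weights saved
saving_asym = 1 - asymmetric / single_3x3
```

Under these assumptions, two 3 × 3 convolutions need 18/25 of the weights of one 5 × 5 convolution, and the asymmetric pair needs 6/9 of the weights of one 3 × 3 convolution, independent of the chosen channel width.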

Datasets.
A categorized dataset is an essential part of a quantitative assessment. Although a standard categorized laboratory-based tomato leaf disease dataset has been developed for the assessment of such systems [17], it was recorded in a controlled environment. There is no comprehensive standard field-based dataset. In this research, we collected tomato leaf data from various fields in a natural uncontrolled environment. Afterward, the data were inspected by a domain expert, who classified the images into various categories. The laboratory-based dataset contains 2364 high-resolution images of infected tomato leaves categorized into four disease types. Each class contains 591 images. For system training and evaluation, we divided this dataset into three parts for training, validation, and testing, in the ratio 70%, 20%, and 10%, respectively. A detailed summary of the laboratory-based dataset is provided in Table 1.
Collecting data from different fields of tomato crop was a challenging task. The data were collected using a cell phone camera in natural daylight conditions. The resultant dataset contains six types of infected tomato leaves. A total of 317 images were collected. These were too few for model training and evaluation, as a larger number of images is needed to train a deep learning model. Therefore, data augmentation was used to increase the number of samples in the dataset. After data augmentation, we obtained 15,216 samples for the field-based dataset. The dataset was further divided into three parts for model training (70%), validation (20%), and testing (10%). A summary of the distribution of the field-based dataset is provided in Table 2.
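A 70%/20%/10% split of a balanced dataset such as this one can be sketched as follows. The per-class (stratified) splitting, the fixed random seed, the rounding rule, and the file names are all assumptions for illustration; the paper does not state how the split was drawn, so the exact counts in Table 1 may differ.

```python
import random

def split_class(items, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle one class and cut it into train/validation/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    return (
        items[:n_train],
        items[n_train:n_train + n_val],
        items[n_train + n_val:],        # the remainder becomes the test set
    )

# Hypothetical file names: four classes with 591 images each (2364 in total).
dataset = {
    f"class_{c}": [f"class_{c}/img_{i:03d}.jpg" for i in range(591)]
    for c in range(4)
}

splits = {name: split_class(files) for name, files in dataset.items()}
train_files, val_files, test_files = splits["class_0"]
sizes = (len(train_files), len(val_files), len(test_files))
```

Splitting each class separately keeps the three parts balanced, which matters when per-class precision and recall are reported.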
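The reported counts imply 48 samples per original field image (15,216 / 317). The paper does not list the transformations used, so the sketch below is only an assumed example of how geometric augmentation multiplies a dataset: it generates the eight rotation and flip variants of a tiny image stored as a list of rows.

```python
def rotate90(img):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def dihedral_variants(img):
    """Return the 8 rotation/flip variants of `img`, including the original."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants

tiny = [[1, 2], [3, 4]]          # illustrative 2x2 "image"
variants = dihedral_variants(tiny)

# Counts reported for the field-based dataset.
originals, augmented_total = 317, 15216
samples_per_original = augmented_total // originals
```

Reaching 48 samples per image would require further transformations beyond these eight (for example, changes in scale or brightness), which are not shown here.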

Performance Evaluation Metrics.
We used accuracy, precision, recall, and F1-score as performance evaluation metrics. Because the raw counts of a confusion matrix can be misleading on their own, we report these derived metrics.

Accuracy. Accuracy (A) represents the proportion of correctly classified predictions and is calculated as follows:
\[ A = \frac{TP + TN}{TP + TN + FP + FN} \]
Note that TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively.

Precision. Precision (P) represents the proportion of positive predictions that were actually correct and is calculated as follows:
\[ P = \frac{TP}{TP + FP} \]

Recall.
Recall (R) measures the proportion of actual positives that were identified correctly and is calculated as follows:
\[ R = \frac{TP}{TP + FN} \]

F1-Score.
F1-score is defined as the harmonic mean of precision and recall and is calculated as follows:
\[ F1 = \frac{2 \times P \times R}{P + R} \]
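The four metrics can be computed directly from the confusion-matrix counts, using the usual definitions A = (TP + TN) / (TP + TN + FP + FN), P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R). The sketch below does this for a single class treated as one-vs-rest; the counts are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (e.g. a class that is never predicted).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for one disease class versus all others.
m = classification_metrics(tp=90, tn=880, fp=10, fn=20)
```

For a multi-class problem such as this one, the same function can be applied to each class in turn and the results averaged; which averaging scheme the study used is not stated.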

Results and Discussion
In this section, we report the results of the experiments performed in the study. The experiments used several pretrained neural network architectures in two ways: as feature extractors, and with the higher-dimensional layers (the last few layers) fine-tuned to learn features corresponding to the dataset. The pretrained neural network architectures used are VGG-16, VGG-19, ResNet, and Inception V3.

Results Using Feature Extraction.
In this section, results are reported for both datasets using pretrained neural networks as feature extractors. The image classification process can be divided into two parts: feature extraction is carried out by the convolutional layers of the pretrained network, and classification is performed by fully connected layers with the ReLU activation function and softmax. Table 3 presents the results on both datasets using the pretrained neural network architectures as feature extractors. Inception V3 outperformed all other pretrained models, with the highest reported accuracy on both datasets. However, classification accuracy was higher on the laboratory dataset (93.40% using Inception V3). This may be because the laboratory dataset is a standard balanced dataset curated by domain experts, whereas our field dataset was collected with a cell phone camera and is imbalanced.

Results Using Parameter-Tuning.
In this section, results are reported for both datasets by fine-tuning the parameters of the pretrained neural network architectures. The high-dimensional layers of the pretrained neural network architectures are trained to adjust their parameters to our dataset. The features of the low-dimensional layers are kept the same for both datasets. The classification task is carried out using fully connected layers with the ReLU activation function, and softmax is employed at the final layer. Table 4 presents the results on both datasets obtained by fine-tuning the pretrained neural network architectures. The Inception V3 architecture outperformed all other pretrained models, with the highest reported accuracy on both datasets. As expected, the accuracy is higher on the laboratory-based dataset. Table 5 summarizes the recall, precision, and F1-score achieved by the four models on the two datasets.
In terms of recall, as expected, all models performed better on the laboratory-based dataset than on the field-based dataset. The average difference in recall between the two datasets is 13.7%. Inception V3 is the best performing model, achieving recall scores of 0.996 and 0.906 on the laboratory-based and field-based datasets, respectively. The same trend is observed for precision and F1-score. In both instances, the score achieved on the laboratory-based dataset is superior to that achieved on the field-based dataset. In terms of precision, the difference between the scores achieved on the laboratory-based and field-based datasets is 17.9%. For F1-score, the difference is 15.8%. Figure 1 summarizes the average accuracy, recall, precision, and F1-score using the parameter-tuning technique on the laboratory-based and field-based datasets.
Several interesting observations can be drawn from the reported performance metrics. For all models, fine-tuning the parameters of a pretrained neural network architecture achieved better classification accuracy than using the architecture for feature extraction only. This observation is common, as a feature extractor trained on a different dataset may not always capture the best set of discriminative features of the images under study (tomato leaf diseases in our case). As far as accuracy among the models is concerned, Inception V3 outperformed all other pretrained models. This may be because Inception V3 uses different kernel sizes for the effective recognition of variable-sized features. Instead of simply going deeper in terms of the number of layers, it goes wider: multiple kernels of different sizes are implemented within the same layer. As expected, all models performed better on the laboratory-based dataset than on the field-based dataset, as the field data were collected in an uncontrolled environment.
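The difference between the two regimes compared in this section comes down to which layers are updated during training. The toy sketch below makes that explicit with a list of named layers; the layer names, the number of pretrained blocks, and the choice of unfreezing two blocks are hypothetical, since the paper does not state how many layers were fine-tuned. In a framework such as Keras, the same idea is expressed by setting each layer's `trainable` attribute.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    trainable: bool = True

def build_model(num_pretrained=6):
    """Toy stand-in: pretrained blocks followed by a new classifier head."""
    layers = [Layer(f"pretrained_block_{i}") for i in range(1, num_pretrained + 1)]
    layers.append(Layer("new_fc_relu"))   # fully connected layer with ReLU
    layers.append(Layer("new_softmax"))   # final softmax classification layer
    return layers

def configure(layers, unfreeze_last=0):
    """Freeze all pretrained blocks except the last `unfreeze_last` of them.

    The new classifier head always stays trainable. Returns the names of the
    layers whose parameters would be updated during training.
    """
    pretrained = [l for l in layers if l.name.startswith("pretrained")]
    cutoff = len(pretrained) - unfreeze_last
    for idx, layer in enumerate(pretrained):
        layer.trainable = idx >= cutoff
    return [l.name for l in layers if l.trainable]

# Feature extraction: only the new head is trained.
feature_extraction = configure(build_model(), unfreeze_last=0)
# Parameter-tuning: the last pretrained blocks are trained as well.
fine_tuning = configure(build_model(), unfreeze_last=2)
```

With more trainable layers, fine-tuning can adapt the later pretrained features to leaf images, at the cost of more parameters to estimate from a limited dataset.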

Conclusions
In this work, we used different pretrained convolutional neural networks for automatic detection and classification of diseases in tomato plant leaves. We considered four models, namely, VGG-19, VGG-16, ResNet, and Inception V3, and evaluated their performance on two different datasets. The first is a controlled dataset whose images were acquired in a laboratory; the second was prepared by us by collecting data from the field in natural light with a cell phone. Thus, the second dataset is representative of a real-world situation and proved to be more challenging for the pretrained neural network models. We observed that parameter-tuning yields more accurate results than feature extraction. Likewise, the average performance on the laboratory-based dataset was 10%–15% higher than on the field-based dataset. Inception V3 was the best performing model on both datasets. As these models perform less well on the field-based dataset, a natural extension of our work will be to optimize them for better performance on real-world field-based data.

Data Availability
Previously reported data (laboratory-based dataset) were used to support this study and are available at https://github.com/PrajwalaTM/tomato-leaf-disease-detection. The field-based data used to support the findings of this study are available from the corresponding author upon request.

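Figure 1: Average accuracy, recall, precision, and F1-score (performance metrics) using the parameter-tuning technique on the laboratory-based and field-based datasets.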