A Perspective View of Cotton Leaf Image Classification Using Machine Learning Algorithms in WEKA

Cotton is one of the major crops in India, and about 23% of the country's cotton is exported. The cotton yield depends on crop growth, which is affected by diseases. In this paper, cotton disease classification is performed using different machine learning algorithms. For this research, a cotton image database was created by capturing images in the field under controlled conditions. The images are segmented using a modified factorization-based active contour method. Color and texture features are extracted from the segmented images and fed to machine learning algorithms: multilayer perceptron, support vector machine, Naïve Bayes, random forest, AdaBoost, and K-nearest neighbor. The classifiers perform better with color features than with texture features; the color features alone are sufficient to classify healthy and unhealthy cotton leaf images. Among the different classifiers, the multilayer perceptron achieves the highest accuracy, nearly 96.69%.


Introduction
In India, agriculture is the main occupation, and two-thirds of the population depends on agriculture directly or indirectly.
Crop yield depends on crop growth, and diseases can reduce that yield, so identifying plant disease at an early stage helps in diagnosis and prevents unnecessary crop loss. Among the different parts of the plant, the leaf is the part that, when affected, most directly reduces crop yield. Disease can be recognized from visible symptoms, and plant pathologists can then suggest a suitable pesticide. Image processing is one of the techniques used for processing leaf images and identifying disease, and leaf disease classification has been performed for several crops, for example pomegranate [23], grape (Krithika et al.) [29], cotton [5], and maize (Panigrahi et al.) [28]. Figure 1 shows the basic image processing steps for classification.
Normally, the acquired leaf images with a natural background are filtered using a Gaussian mask. The filtered image is segmented using the modified factorization-based active contour method, and features are then extracted. Using all features for classification increases training time, so feature selection is performed [40]. Finally, the cotton leaf images are classified using the different classifiers shown in Fig. 2.
Among the different steps involved in this process, feature selection after feature extraction improves the performance of the classifier. Various color and texture features are extracted to obtain accurate classification. In the literature, many authors have surveyed the performance of different classifiers. In this paper, the performance of the classifiers is compared based on the selected features [41]. The classifiers studied are the neural network, support vector machine, AdaBoost, Naïve Bayes, and random forest.
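As an illustration of the kind of color features involved, the following is a minimal sketch in Python with NumPy. The function name `color_features` and the choice of per-channel mean and standard deviation are assumptions for illustration, not the paper's exact feature set, and the paper's actual experiments are run in WEKA; a real pipeline would also restrict the statistics to the segmented leaf pixels.

```python
import numpy as np

def color_features(image):
    """Per-channel mean and standard deviation of an RGB image (6 values).

    `image` is an H x W x 3 array. A real pipeline would compute these
    statistics only over the segmented leaf region, not the whole frame.
    """
    image = np.asarray(image, dtype=np.float64)
    means = image.mean(axis=(0, 1))   # mean of R, G, B channels
    stds = image.std(axis=(0, 1))     # std deviation of R, G, B channels
    return np.concatenate([means, stds])

# Toy example: a uniform green 4x4 patch.
patch = np.zeros((4, 4, 3))
patch[..., 1] = 200.0                 # green channel only
feats = color_features(patch)         # [0, 200, 0, 0, 0, 0]
```

Texture features (e.g., GLCM statistics) would be concatenated to this vector in the same way before feature selection.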
The different classifiers have advantages and disadvantages with respect to parameters such as training data size. Authors have applied these classifiers to leaf disease classification, especially in cotton, and the same classifiers have been used in breast cancer diagnosis [2], heart disease prediction [3], predicting economic events [4], text categorization [5], skin disease diagnosis [7], medical science [8], face recognition [9], health science [10], brain tumor detection [11], terrain classification [12], real-time facial expression recognition [13], discrimination of breast tumors in ultrasonic images [15], cancer genomics [16], skin cancer detection [19], skin lesion segmentation [20], malignant melanoma detection [23], and disease detection in pomegranate leaf and fruit [25]. Apart from surveys of leaf classification, an automatic detection system can be useful for disease identification, and many researchers have introduced different methods for disease classification.
The organization of this paper is as follows: Sect. 2 describes the material and method, Sect. 3 covers related works, Sect. 4 the methodology, and Sect. 5 the results and discussion, followed by the conclusion.

Material And Method
Database: For this study, cotton leaf images are considered. The images were captured in the cotton field under controlled conditions, with a natural background. The database consists of nearly 3000 images in two categories, healthy and diseased, which were used for training and testing. Sample images are shown in Fig. 3.

Related Works
Many researchers have published papers on leaf disease classification using classifiers such as K-Nearest Neighbor (KNN), Adaptive Boosting (AdaBoost), Support Vector Machine (SVM), random forest, the Bayes classifier, and the Artificial Neural Network (ANN). These classifiers have contributed much to the field of image processing, so we use them to evaluate classifier performance on our database. Below we give a brief description of each classifier used for the cotton database.
The neural network was introduced by Alexander Bain and William James in 1890, inspired by the neurons of the brain; since it resembles the human brain, its algorithms are patterned accordingly. The neural network can handle imperfect data and can detect all possible interactions between predictor variables, and hence it is used in regression analysis, classification, and data processing. It has numerous applications in agriculture, where there has been extensive research. Here we focus on leaf disease classification using the cotton leaf image database, covering diseases such as bacterial blight and powdery mildew. Leaf disease classification using a neural network has been carried out by several authors [7]. Its disadvantages are that it requires greater computational resources and is prone to overfitting [34]. Still, the performance of the network is good compared to other classification models. The classification accuracy depends on the features extracted to train the model and on the dataset, and it also relies on the network weights and the number of training iterations.

The Support Vector Machine (SVM) was introduced by Vapnik and colleagues at AT&T Bell Laboratories. It is used to categorize unlabeled data, and the advantage of this model is that there is less risk of overfitting. The classifier uses a hyperplane to separate the data points and is applied in many areas, such as leaf disease classification (Patil et al. [8], Adhao et al. [5]) and the medical field [9]. Its disadvantage is that training takes a long time.
The K-nearest neighbors (K-NN) classifier was introduced by Evelyn Fix and Joseph Hodges in 1951 and is used for both regression and classification. Here, k is defined by the user and can be any integer; the appropriate value of k differs by dataset and determines the classifier's accuracy. This classifier is used in many applications, such as text classification, visual recognition, the Wisconsin-Madison breast cancer diagnosis problem, classifying heart disease, predicting economic events (Imandoust et al. [4]), and text categorization (Guo et al. [1]). Later, hybrid classifiers combining KNN with other classifiers came into existence (R. G. Devi et al. [6]), improving classification accuracy. In leaf disease classification, it has been applied to grapes [38], maize [35], groundnut [32], etc.
Likewise, we have used this classifier for our database.

The random forest algorithm was introduced by Tin Kam Ho in 1995 using the random subspace method. It often gives higher accuracy than a single decision tree and is used for classification and regression. Its implementation is not complex, it is fast in operation, it provides an effective method for estimating missing data, and it maintains accuracy when a large proportion of the data is missing. Hence the classifier is used in different sectors such as banking and healthcare, and in the agriculture sector for leaf disease classification [39]. One disadvantage is that it needs more resources and computational power to build many trees and combine their outputs; since many trees must be combined, training the classifier takes longer.

The Naïve Bayes classifier is based on Bayes' theorem and is widely used in classification tasks. The name "naïve" is used because the input features are assumed to be independent of each other: changing any one feature does not affect the others. Because of this property, it has been used in many applications. Table 1 gives a brief overview of authors' contributions to leaf disease classification.

Random forest is a simple and versatile method for solving classification problems. Here the term "forest" means an ensemble of decision trees, usually trained using the bagging method as shown in Fig. 4. Bagging combines different learning models to obtain good accuracy; the classifier output is decided by majority voting over the class labels of the individual trees.
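The bagging-and-voting scheme described above can be sketched with scikit-learn's `RandomForestClassifier`. This is illustrative only: the paper's experiments use WEKA, and the synthetic data below is a stand-in for the cotton feature vectors, which are not publicly available.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for 12-dimensional leaf features.
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree is trained on a bootstrap sample (bagging); the final label
# is the majority vote over the per-tree predictions.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)
acc = forest.score(X_te, y_te)
```

Increasing `n_estimators` usually stabilizes the vote at the cost of training time, which is the trade-off noted above.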
Advantages: 1) It is easy to measure the relative importance of each feature for prediction. Disadvantages: 1) Too many decision trees lead to a slow algorithm.

Basavaiah et al. [53] introduced a model for tomato leaf disease classification by means of a random forest classifier. The dataset consists of 500 images resized to 500x500. Color histograms, local binary patterns, and Hu moments are extracted as features. A subset of 300 images is used for training and 200 images for testing. Classification is performed using a decision tree classifier and a random forest classifier; the experiment resulted in 90% and 94% accuracy, respectively.
Chaudhary et al. [54] introduced a modified random forest classifier for the multi-class groundnut leaf disease classification problem. The modified classifier combines a random forest classifier, an attribute evaluator method, and an instance filter method. To demonstrate its performance, the authors compared existing machine learning algorithms such as SVM, neural network, and logistic regression with the proposed model to check which classifier suits their dataset. An accuracy of 97.80% is achieved on five UCI machine learning repository benchmark datasets using the proposed model.

Naïve Bayes classifier
A Naive Bayes classifier [58] is based on Bayes' theorem; it is a probabilistic machine learning model used for classification tasks, as shown in Fig. 5.
The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome.
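A minimal sketch of this independence assumption in practice, using scikit-learn's `GaussianNB` (illustrative; the paper's experiments use WEKA, and the 2-D toy data below is an assumption, not the cotton features):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two toy classes described by 2-D feature vectors.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB fits a per-class Gaussian to each feature independently,
# which is exactly the "naive" independence assumption.
model = GaussianNB()
model.fit(X, y)
pred = model.predict([[1.1, 1.0], [5.1, 5.0]])  # -> [0, 1]
```

Because each feature is modeled independently, training reduces to estimating a mean and variance per feature per class, which is why Naïve Bayes is fast even on large feature sets.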

Feedforward neural networks
A feedforward network is a form of artificial neural network [57] inspired by biological neurons. Here the information passes in only one direction, forward, and never comes backward. The simplest form of feedforward network is the single-layer perceptron; another form is the multilayer perceptron. The single-layer perceptron has a single layer of output nodes, as shown in Fig. 6, to which the weighted inputs are fed directly.
A multilayer perceptron (MLP) [47] consists of multiple layers of computational units, or perceptrons, interconnected up to the output layer, as shown in Fig. 7. It uses backpropagation learning to train on the data.
MLP can solve complex problems with greater efficiency and has many applications in speech recognition and in image recognition and classification [48]. Advantages: 1. It helps in solving complex problems. 2. Adaptive learning lets the network extract patterns from imprecise data. Disadvantages:

1. It can take a long time to train on a large dataset.
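A minimal sketch of an MLP trained by backpropagation, using scikit-learn's `MLPClassifier` (illustrative; the paper's classifier is run in WEKA, and the hidden-layer size and synthetic data here are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data standing in for the extracted leaf features.
X, y = make_classification(n_samples=400, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# One hidden layer of 16 units; weights are learned by backpropagation.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=1)
mlp.fit(X_tr, y_tr)
acc = mlp.score(X_te, y_te)
```

As the text notes, accuracy depends on the learned weights and the number of training iterations; `max_iter` bounds the latter.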
Since the multilayer perceptron has many advantages, it has been widely used for leaf disease classification. Shak et al. [48] used MLP for healthy and unhealthy leaf classification; with a training sample of 90, the classifier's accuracy is 97.15%. The accuracy drops as the training sample shrinks, since the test set then becomes large compared to the training set. MLP has also been applied to watermelon leaf disease classification: Kutty et al. [49] extracted color features and fed them to the classifier, achieving an accuracy of 75.9% on 200 leaf samples.
Though MLP is extensively used in disease classification, the datasets used in prior work were simple: the leaf images had a white or black background, which makes feature extraction easy and helps the classifier stand out. In this paper, the cotton dataset has a complex background, and the performance of the classifiers is compared on it.

Adaptive Boosting (AdaBoost) classifier
AdaBoost [14, 59] was proposed by Yoav Freund and Robert Schapire in 1996 and is an iterative ensemble method, as shown in Fig. 8. It combines multiple poorly performing classifiers so that the overall accuracy is higher. The basic idea behind AdaBoost is to set the weights of classifiers and of training samples in each iteration so as to ensure correct prediction of unusual observations. Two conditions should be met by AdaBoost: 1. The classifier should be trained iteratively on differently weighted training examples.
2. In each iteration, it aims to fit these examples well by minimizing the training error.
The method normally starts from a randomly selected subset of the training data. It iteratively trains the AdaBoost model, choosing the training set based on the accuracy of the previous round. It allocates higher weight to incorrectly classified observations so that they have a higher likelihood of being classified correctly in the next iteration, and it assigns each trained classifier a weight according to its accuracy: the more accurate the classifier, the higher its weight.
This process iterates until the training data fits without error or until the specified maximum number of estimators is reached; to classify a sample, a "vote" is taken across all of the learned classifiers. Advantages: 1. It is less vulnerable to the overfitting problem. Disadvantages: 1. It is sensitive to noisy data and outliers. Subasi et al. [35] proposed an ensemble AdaBoost classifier to identify human activity using a sensor.
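A minimal sketch of this boosting loop, using scikit-learn's `AdaBoostClassifier` (illustrative; the paper's experiments use WEKA, and the synthetic data is a stand-in for the cotton features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic two-class data in place of the real feature vectors.
X, y = make_classification(n_samples=300, n_features=10, random_state=2)

# The default weak learner is a depth-1 decision stump. After each round,
# misclassified samples are re-weighted upward so later stumps focus on
# them; the final prediction is a weighted vote over all stumps.
ada = AdaBoostClassifier(n_estimators=50, random_state=2)
ada.fit(X, y)
acc = ada.score(X, y)   # training accuracy
```

Capping `n_estimators` implements the "maximum estimator number" stopping condition mentioned above.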
Activity recognition here is achieved using wearable sensors; the authors evaluated different physical activities with their proposed model and showed it performs better than others.

Support vector machine (SVM) classifier

SVM [60] is a supervised machine learning algorithm that can be used for classification as well as regression. It is formally defined by a separating hyperplane, as shown in Fig. 9: a hyperplane is the line that separates the data points. A support vector machine [17, 18] constructs, in a high- or infinite-dimensional space, a hyperplane or set of hyperplanes that can be used for classification, regression, or other tasks such as outlier detection. There can be more than one separating hyperplane; the one with the largest distance to the nearest training data point of any class (the so-called functional margin) achieves a good separation, since in general the greater the margin, the lower the classifier's generalization error. This also makes the classifier well suited to high-dimensional spaces. SVM has applications in text classification, bioinformatics, handwritten recognition, and image classification. Advantages: 1. Classification accuracy is high. 2. It works well for smaller datasets. Disadvantages: 1. Training a large dataset takes a long time.

2. It is sensitive to noise.
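A minimal sketch of the maximum-margin hyperplane, using scikit-learn's `SVC` with a linear kernel (illustrative; the paper's experiments use WEKA, and the 2-D toy data is an assumption):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters; the fitted hyperplane maximizes the
# margin to the nearest points (the support vectors).
X = np.array([[0.0, 0.0], [0.5, 0.3], [0.2, 0.4],
              [3.0, 3.0], [3.2, 2.8], [2.9, 3.1]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)
pred = clf.predict([[0.3, 0.2], [3.1, 3.0]])  # -> [0, 1]
```

Only the points nearest the boundary (`clf.support_vectors_`) determine the hyperplane, which is why SVM copes well with small datasets.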
Priya et al. [51] proposed a leaf recognition algorithm using the Support Vector Machine (SVM). Twelve features were extracted and used by the classifier. The process was carried out on the Flavia dataset and a real dataset. The authors compared the SVM classifier with the KNN classifier to show that SVM has higher accuracy and takes less training time.
Alehegn et al. [52] worked on an Ethiopian maize leaf disease dataset, and the authors claim that this research had not been carried out before. In preprocessing, RGB-to-gray conversion and image enhancement are performed to improve image quality. Texture, color, and morphological features are then extracted and fed to the classifier, giving an accuracy of 95.63%.

K-NN classifier:
It is one of the simplest supervised classification algorithms. The K-NN algorithm [61] stores all available data and classifies a new data point based on similarity, which means that new data can be conveniently assigned to a well-suited group [1] as it emerges. It can be used for classification and regression. It is often referred to as a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and operates on it at classification time.
At training time, the KNN algorithm only stores the dataset; when it receives new data, it classifies it into the group closest to the new data, as shown in Fig. 10.
K-NN works by selecting a value of k and computing the Euclidean distance to find the k nearest neighbors; a query point is assigned to the category that contains the maximum number of those neighbors. Advantages: 1. It is very simple to implement. 2. Performance is good when the training data is large. 3. There is no training time. Disadvantages: 1. The computation cost at prediction time is high.
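A minimal sketch of majority voting over the k nearest neighbors, using scikit-learn's `KNeighborsClassifier` (illustrative; the paper's experiments use WEKA, and the 1-D toy data and k = 3 are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six 1-D training points in two well-separated groups.
X = np.array([[0.0], [0.5], [1.0], [5.0], [5.5], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# k = 3: a query point takes the majority label among its three nearest
# neighbors by Euclidean distance. fit() only stores the data (lazy learner).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
pred = knn.predict([[0.7], [5.2]])  # -> [0, 1]
```

All distance computation happens at `predict` time, which is exactly the high prediction-time cost listed as a disadvantage above.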
Hossain et al. [45] proposed leaf disease classification using the KNN classifier. The Arkansas plant disease database and the Reddit plant leaf disease dataset are used for their research. The input image is converted from RGB to the L*a*b* model so that color segmentation can be performed; color features are then extracted from the segmented image and fed into the KNN classifier, and an accuracy of 76.63% is obtained.
Krithika et al. [29] presented individual grape leaf disease identification using a KNN classifier. The authors proposed tangential-direction image segmentation; color and GLCM features are extracted and fed to the KNN classifier to achieve higher accuracy.
For cotton leaf disease classification, the images must be segmented from a complex background, and removing the background is a challenging task. Background removal is treated as a segmentation step, and to achieve it we use a modified factorization-based active contour method, which recognizes the required leaf region in the image.

Results And Discussion
The experiment was carried out on the cotton leaf image dataset. The images were captured in various fields using a digital camera at a resolution of 4048x4048. Images of this size are difficult to process, so they are resized before processing. From Table 3, it can be seen that the classifiers do not all perform equally well, and that the multilayer perceptron performs best. Figure 12 shows the evaluation measures for all classifiers: accuracy, precision, recall, F-measure, and MCC.
Table 4 and Fig. 13 give the classifier performance when the 4 color features are used; the multilayer perceptron performs well relative to the other classifiers. Figure 14 and Table 5 compare the classifiers for the different feature sets fed as input, chosen to reduce the classifier's training computation time and to improve the classification accuracy. The multilayer perceptron performs best for all feature types.
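The evaluation measures reported in the tables can be computed as follows with scikit-learn's metrics module (illustrative; the label vectors below are hypothetical, not the paper's results):

```python
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

# Hypothetical ground-truth and predicted labels (1 = diseased, 0 = healthy).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)     # fraction of correct predictions
prec = precision_score(y_true, y_pred)   # TP / (TP + FP)
rec = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
mcc = matthews_corrcoef(y_true, y_pred)  # balanced measure in [-1, 1]
```

With one false positive and one false negative out of eight samples, these come out to accuracy 0.75, precision 0.75, recall 0.75, F-measure 0.75, and MCC 0.5, illustrating how MCC penalizes errors more sharply than accuracy.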

Conclusions
Leaf disease classification is an important task in the field of agriculture: disease identification helps the farmer decide what precautions to take. The classification can be performed using different machine learning algorithms, here applied to the cotton leaf database. Segmentation is performed because the cotton images are taken in the field against a complex background. From the segmented output images, color and texture features are extracted. In this paper, we show that the color features alone are sufficient to distinguish healthy from unhealthy images. The same features are fed into the WEKA tool, which helps in the analysis of the different classifiers. Comparison is performed for 4 color features, 8 texture features, and all 12 (texture and color) features; it can be observed that the color features are enough to obtain good classification accuracy. From the study, the artificial neural network's accuracy is better than that of the other classifiers: Naïve Bayes, random forest, SVM, K-NN, and AdaBoost. In the future, the work can be extended to classifying individual diseases.

Declarations
Ethics approval and consent to participate-Informed consent was obtained from all individual participants included in the study.

Consent for publication-NA
Availability of data and materials-It's not available publicly.
Competing interests-There was no conflict of interest.
Funding-No funds, grants, or other support was received.
References
1. Chen Z, et al. The Lao text classification method based on KNN. Procedia Computer Science. 2020;166:523-8.

Figure 1. Leaf disease classification.
Figure: Classifier evaluation measures for texture and color features.
Figure: Six classifiers' performance.