Wheat Seed Classification: Utilizing Ensemble Machine Learning Approach

Recognizing and authenticating wheat varieties is critical for quality evaluation in the grain supply chain, particularly for seed inspection. Recognition and verification of grains are usually carried out manually through direct visual examination. Automatic categorization techniques based on machine learning and computer vision offer fast, high-throughput alternatives. Even so, categorization at the varietal level remains a complicated process. This paper uses machine learning approaches to classify wheat seeds. The classification is based on 7 physical features: area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length. The dataset is taken from the UCI repository and has 210 instances of wheat kernels. It contains kernels from three wheat varieties, Kama, Rosa, and Canadian, with 70 kernels of each variety chosen at random for the experiment. In the first phase, K-nearest neighbor (KNN), classification and regression tree (CART), and Gaussian Naïve Bayes algorithms are implemented for classification. The results of these algorithms are then compared with an ensemble machine learning approach. The accuracies obtained for the KNN, decision tree, and Naïve Bayes classifiers are 92%, 94%, and 92%, respectively. The highest accuracy, 95%, is achieved by the ensemble classifier, in which the decision is made by hard voting.


Introduction
In many developing nations, farming is the most significant economic sector, and most tasks are carried out without modern technology. Seed categorization is usually done based on human judgment. The purification of seeds plays an important role in this process and must be improved. Manually determining the type of wheat requires expert judgment and takes time. When the seeds in a sample look very similar, distinguishing them manually becomes a challenging process [1][2][3][4].
As the grain chain evolves, a quality evaluation method for wheat crops is required [5]. The goal is a high-quality wheat product in greater quantities. Tests on seed germination are required for seed labeling [6]. Seeds also undergo a purity test, which is required to ascertain a seed sample's physical and varietal purity. The genetic integrity of the original wheat cultivar may be compromised by mechanical mixing and improper labeling [7]. Classification testing is accomplished via a taxonomic categorization method and nondestructive grain feature analysis [8,9]. A seed tester classifies seeds on two levels: the species level and, for varietal purity, the varietal level. The varietal level can be challenging because the characteristics of different kinds of wheat seeds closely resemble one another. The growth conditions may also affect the grain's properties [5]. The actual classification test procedure is still a low-throughput process, and its correctness depends on the expert's performance and cumulative experience.
Machine learning techniques are now studied in a variety of disciplines, particularly with the expansion of the Internet and the use of bigger datasets [4]. Without specific tools or automatic software procedures, it is difficult for a human operator to interpret or handle such data. Machine learning is frequently used in applications such as categorization, regression, and forecasting to meet these demands [10]. A single classifier does not give good accuracy for objects that differ only minutely in physical characteristics such as color, texture, and morphology [11]. To address this problem, an ensemble approach is used in the present work. By combining many models into a single, more reliable model, ensemble methods seek to improve predictive performance. The most common ensemble approaches are boosting, bagging, and stacking. Ensemble techniques are well suited to regression and classification, where they reduce bias and variance while increasing model accuracy. The major contributions of this paper are as follows: (1) an optimized classifier is designed for wheat seed classification using an ensemble machine learning approach with bagging; (2) the model is compared with three machine learning classifiers: the K-nearest neighbors (KNN) classifier, the decision tree classifier (CART), and Gaussian NB (NB); (3) the highest accuracy achieved is 95%, obtained with the ensemble method.

Related Work
Seed categorization using machine learning algorithms has been the subject of prior research. These studies employed a variety of classifiers and achieved a high degree of accuracy. Machine learning techniques have been successfully applied in a variety of production chains for seed and cereal classification [12][13][14]. The study in [15] shows the capability of machine vision for distinguishing shapes, sizes, and varietal types using well-trained multilayer neural network classifiers. The authors used Weka classification tools such as function, Bayes, meta, and lazy approaches to categorize the seeds. In [16], the authors proposed a fuzzy theory-based approach for recognizing wheat seed types that takes into account the features of the seed. The tabu search technique was used. In [17], the authors used an artificial neural network to classify wheat seeds based on VLC and obtained accuracies of 92.1 percent and 85.72 percent. In [18], the authors discussed morphological, color, and textural characteristics of the seed. If the difference in morphological features is very small, seed classification is very difficult. Cereal yield is determined by the number of grains per ear and the size of the grains. Counting seeds and measuring their morphometry by sight is time-consuming. As a result, different methods for effective grain morphometry using image processing techniques have been proposed [19,20].
In [21], the authors created a workstation to aid grain analysis for classification, and a video colorimetry approach is presented to help determine cereal grain color. Chickpea seed varieties were categorized based on their morphological characteristics, using 400 samples from four types: Kaka, Piroz, Ilc, and Jam [22]. From a commercial point of view, a machine vision system built on existing neural network models may be used for rice quality assessment [23]. That work uses neural networks to categorize nine separate varieties of rice. The authors use seed image acquisition to classify these varieties. They also created a method for extracting 13 morphological features, 6 color features, and 15 texture features from color photographs of seeds. Their model produced an overall classification accuracy of 92%. The k-nearest neighbors classifier requires storing the entire training set, which can be prohibitively expensive when the set is large, and several researchers have attempted to eliminate redundancy in the training set to relieve this problem [24,25].
Deep learning models have been used for plant categorization [26]. Two trends can be seen in the current state of the art. The first is linked to high-throughput phenotyping and plant identification, as evidenced by Ubbens and Stavness' work in this area [27]. The second is plant disease identification and monitoring [28,29]. In [30], the authors present several voting techniques for testing ensembles of classifiers learned using the bagging approach, with multilayer perceptrons as the base classifiers. One option for improving accuracy is to use groups of classifiers rather than individual ones. Bagging [31] and boosting [32] are two of the best-known ensemble techniques, in which many classifiers are combined to generate a single, more accurate result. In [33], the authors studied the performance of several voting techniques, with bagging used for the reconciliation model, which is a process of merging classification models. Table 1 lists various features considered in machine vision systems for food grain quality evaluation.

Methodology
The methodology adopted for this work includes dataset collection, feature identification, data augmentation, classification using the machine learning algorithms KNN, Naive Bayes, and CART, implementation of an ensemble approach with fine-tuning for better accuracy, and comparison of results.

Dataset.
In this study, the seed dataset was obtained from the UCI repository [42]. The collection contains 210 instances of wheat kernels. In addition to the class attribute, each instance contains 7 other attributes. All the samples share the same 7 characteristics (area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length). All of the characteristics are continuous. Figure 1 shows the features of the dataset, the number of classes, and the selected machine learning algorithms.
The examined set contained kernels from three wheat varieties, Kama, Rosa, and Canadian, with 70 kernels each. A soft X-ray technique was used to obtain high-quality visualization of the internal kernel structure. It is non-destructive and more affordable than more sophisticated imaging techniques such as scanning microscopy or laser technology. Images were recorded on X-ray KODAK plates measuring 13 × 18 cm. The studies used combine-harvested wheat grain originating from experimental fields at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin. In Table 2, the classes represent the 3 varieties of wheat: Kama (1), Rosa (2), and Canadian (3).
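As a minimal sketch of how records with this structure might be read, the following assumes a whitespace-separated text layout with the 7 features followed by the class label. The column names, the `parse_seed_rows` helper, and the sample rows are illustrative assumptions, not part of the original study; the class mapping follows the numbering given above.

```python
# Feature names are hypothetical identifiers chosen for this sketch.
FEATURES = [
    "area", "perimeter", "compactness", "kernel_length",
    "kernel_width", "asymmetry_coefficient", "groove_length",
]
# Class numbering as stated in the text: Kama (1), Rosa (2), Canadian (3).
VARIETIES = {1: "Kama", 2: "Rosa", 3: "Canadian"}


def parse_seed_rows(text):
    """Parse whitespace-separated rows: 7 numeric features, then a class label."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != len(FEATURES) + 1:
            raise ValueError(f"expected 8 fields, got {len(parts)}: {line!r}")
        values = [float(p) for p in parts[:-1]]
        label = int(float(parts[-1]))
        rows.append((dict(zip(FEATURES, values)), VARIETIES[label]))
    return rows


# Illustrative values only, not copied from the dataset.
SAMPLE = """
15.0 14.5 0.88 5.6 3.3 2.2 5.2 1
19.0 16.5 0.89 6.3 3.8 3.5 6.1 2
12.0 13.3 0.85 5.2 2.8 4.8 5.0 3
"""

records = parse_seed_rows(SAMPLE)
for features, variety in records:
    print(variety, features["area"])
```

Keeping the label mapping in one dictionary makes it easy to report results by variety name rather than by class number.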

Machine Learning Models
K-nearest neighbor: the k-nearest-neighbor algorithm is a classification method that predicts which of several groups a data point belongs to, based on the groups of the data points closest to it [43]. Any common metric can be used to determine the distance; the Euclidean distance is one example. We then collect the class labels of the k training instances nearest to the new point and assign the new test point the majority class among them [44].

Classification and regression trees (CART): CART is a nonparametric supervised learning method [45,46]. The objective is to build a model that predicts the value of a target variable by learning simple decision rules from the data features. A tree can be viewed as a piecewise constant approximation. CART models are easy for nonstatisticians to interpret [45,47].

Gaussian Naïve Bayes: Naive Bayes is based on Bayes decision theory. The posterior probability is calculated using the likelihood, the prior probabilities, and the evidence. The evidence is merely a normalizing scalar that ensures the posterior probabilities sum to one. The class assigned to a given test instance is the one with the highest posterior probability.

Ensemble methods: ensemble methods are strategies for increasing the accuracy of model outputs by integrating many models rather than using just one. The integrated models can substantially improve the accuracy of the outcomes, which is why ensemble techniques have grown in popularity in machine learning. Ensemble-based solutions can be surprisingly effective both when there is a very large amount of data and when there is too little. When the quantity of training data is too large to train a single classifier, the data can be deliberately divided into smaller subsets. Each subset may then be used to train a distinct classifier, and the classifiers can be merged using a suitable combination rule.
If there is not enough data, bootstrapping may be used to train alternative classifiers using distinct bootstrap samples of the data, each of which is a random sample of the data taken with replacement and handled as if it were drawn independently from the underlying distribution [48].
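The bootstrap step described above can be sketched as follows. This is an illustrative example under stated assumptions: the `bootstrap_sample` helper, the seed, and the stand-in training set of 10 integers are all hypothetical and are not taken from the study.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
training_set = list(range(10))  # stand-in for 10 training instances

# Three bootstrap samples, one per classifier that would be trained.
samples = [bootstrap_sample(training_set, rng) for _ in range(3)]
for s in samples:
    # Because sampling is with replacement, some instances repeat
    # and others are left out of a given sample.
    print(sorted(s), "distinct:", len(set(s)))
```

Each sample has the same size as the original set, so every classifier sees a comparable amount of data while the samples still differ from one another.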
Bagging is one of the oldest, most intuitive, and arguably simplest ensemble-based methods, and it performs very well. In bagging, diversity among classifiers is obtained from bootstrapped replicas of the training data: different subsets are drawn at random, with replacement, from the entire training dataset. Each subset is used to train a classifier of the same type. The individual classifiers are then combined by a simple majority vote. Boosting also produces an ensemble of classifiers by resampling the data, and these are likewise combined by majority vote. In boosting, however, resampling is used deliberately to provide the most informative training data for each subsequent classifier. We applied a hard voting classifier, which means that the predicted output class is the one that receives the most votes.
For example, suppose three classifiers each predict the wheat class of a sample from among Kama, Rosa, and Canadian, and most of them predict Kama. The final prediction is then Kama.
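The hard voting rule in this example can be sketched in a few lines. The function names and the prediction lists below are hypothetical, and the tie-breaking rule (the label seen first wins) is an assumption of this sketch rather than something specified in the text.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most classifiers.

    Ties are broken in favour of the label encountered first
    (an assumption of this sketch).
    """
    return Counter(predictions).most_common(1)[0][0]


def hard_vote_batch(per_classifier):
    """Apply hard voting sample by sample across several classifiers."""
    return [hard_vote(votes) for votes in zip(*per_classifier)]


# Hypothetical predictions from three classifiers for three samples.
knn_pred = ["Kama", "Rosa", "Canadian"]
cart_pred = ["Kama", "Rosa", "Kama"]
nb_pred = ["Rosa", "Rosa", "Canadian"]

final = hard_vote_batch([knn_pred, cart_pred, nb_pred])
print(final)  # ['Kama', 'Rosa', 'Canadian']
```

For the first sample, two of the three classifiers predict Kama, so Kama is the ensemble output even though one classifier disagrees.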

Experimental Setup.
Python 3 libraries such as NumPy, SciPy, scikit-learn, Keras, pandas, and Matplotlib are used to perform the classification with the ML models. Scikit-learn is regarded as a user-friendly and reliable machine learning library [49,50]. The package is built on NumPy, SciPy, and Matplotlib. The results of the dataset analysis, as well as the training and testing of the models on the extracted features, are presented in this section. Figure 2 shows the classification process.

Performance Evaluation
In the following, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives for a given class.

Precision: this parameter indicates how often a positive prediction made by the model is correct. A low precision implies a high number of false positives. Precision is calculated as

\[ \text{Precision} = \frac{TP}{TP + FP} \]

Recall: this parameter indicates how many of the actual positives the model identifies. A low recall means the model produced a high number of false negatives. Recall is calculated as

\[ \text{Recall} = \frac{TP}{TP + FN} \]

F1-score: the F1-score combines precision and recall. A high F1-score indicates a low number of false positives and false negatives, which implies that the model identifies the actual positives accurately and raises few false alarms. The F1-score is calculated as

\[ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
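As an illustrative sketch, the three metrics can be computed per class from lists of true and predicted labels using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two). The `per_class_metrics` helper and the label lists are hypothetical and do not reproduce the results of the study.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = Kama, 2 = Rosa, 3 = Canadian.
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 3, 2, 2, 2, 3, 3, 1]

for label in (1, 2, 3):
    p, r, f = per_class_metrics(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy example, one Kama kernel is mistaken for Canadian and one Canadian kernel for Kama, so both classes score 2/3 on all three metrics while Rosa scores 1.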

Result
This section presents the evaluation of the four ML models considered: KNN, Naive Bayes, CART, and the ensemble method. To evaluate the models, we divided the collected data into a training set (70%) and a testing set (30%). The wheat classes Kama, Rosa, and Canadian were assigned the numbers 1, 2, and 3, respectively. The results of all considered algorithms are evaluated based on recall, precision, F1-score, and accuracy. Tables 3-7 show the values of these parameters for KNN, Naive Bayes, CART, and the ensemble method, respectively. Figures 3-6 show the confusion matrices for KNN, Naive Bayes, CART, and the ensemble method, respectively. The accuracies obtained for the KNN, decision tree, and Naive Bayes classifiers are 92 percent, 94 percent, and 92 percent, respectively. The ensemble classifier, which makes decisions based on hard voting, has the best accuracy of 95 percent.
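A minimal sketch of this evaluation protocol with scikit-learn is given below. It is an assumption-laden illustration rather than the authors' code: it uses synthetic data of the same shape as the seeds dataset (3 classes, 70 samples each, 7 features), the hyperparameters (`n_neighbors=5`, the random seeds) are arbitrary choices, bagging is omitted, and the printed accuracies therefore do not reproduce the figures reported here.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the seeds data: 3 classes x 70 samples x 7 features.
rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(loc=center, scale=1.0, size=(70, 7)) for center in (0.0, 3.0, 6.0)]
)
y = np.repeat([1, 2, 3], 70)

# 70% training / 30% testing, stratified so each class keeps its share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "CART": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
    # Hard voting: the class predicted by most base classifiers wins.
    "Ensemble": VotingClassifier(
        estimators=[
            ("knn", KNeighborsClassifier(n_neighbors=5)),
            ("cart", DecisionTreeClassifier(random_state=0)),
            ("nb", GaussianNB()),
        ],
        voting="hard",
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

Stratifying the split is a design choice of this sketch; with only 70 kernels per class, it keeps the class proportions identical in the training and testing sets.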

Discussion
To implement KNN, we use the scikit-learn K-neighbors classifier. The method takes the number of neighbors as an input parameter, and the predicted wheat seed category can change as this number is varied. With this procedure, 92% accuracy is attained. The CART built with scikit-learn attains 94 percent accuracy. With the scikit-learn Gaussian NB classifier, 92% accuracy is attained. The three individual classifiers and the ensemble method are compared in Table 8. Table 9 summarizes the accuracy of the various classifiers, and Figure 7 presents the same comparison as a chart.
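To illustrate how the number of neighbors can change a KNN prediction, the following is a small pure-Python sketch. It is not the scikit-learn implementation used in this work; the `knn_predict` helper and the two-dimensional toy points are hypothetical and chosen only to make the effect visible.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k training points nearest to query."""
    # Sort training points by Euclidean distance to the query.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Toy two-feature points (illustrative only, not from the seeds dataset).
train_X = [(1.0, 1.0), (1.2, 0.8), (3.0, 3.0), (3.2, 2.9), (2.9, 3.3)]
train_y = ["Kama", "Kama", "Rosa", "Rosa", "Rosa"]
query = (1.8, 1.8)

for k in (1, 3, 5):
    print(k, knn_predict(train_X, train_y, query, k))
```

With k = 1 or k = 3 the query is labeled Kama, because its closest neighbors are the two Kama points, whereas with k = 5 all five points vote and the three Rosa points outnumber them.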

Conclusion
Machine learning approaches play a very important role in grain seed analysis and classification. The major challenge in seed classification is the very small difference between different categories of seeds. Prediction accuracy under this challenge is improved by using ensemble learning. This paper presents wheat seed classification based on seven independent features: area, perimeter, compactness, kernel length, kernel width, asymmetry coefficient, and kernel groove length. An ensemble machine learning approach with bagging and hard voting is used to best fit the classifier. Three machine learning algorithms, the K-nearest neighbors classifier (KNN), classification and regression trees (CART), and Gaussian NB (NB), are also implemented to compare the results. The results reveal that the accuracies obtained for the KNN, decision tree, and Naïve Bayes classifiers are 92%, 94%, and 92%, respectively. The highest accuracy, 95%, is achieved by the ensemble classifier, in which the decision is made by hard voting. In the future, other classification algorithms can be explored to improve accuracy.

Data Availability
Datasets related to this article can be found at "https://archive.ics.uci.edu/ml/datasets/seeds," an open-source online data repository hosted at the UCI Machine Learning Repository [42].

Conflicts of Interest
The authors declare that they have no conflicts of interest.