Identification of Dry Bean Varieties Based on Multiple Attributes Using CatBoost Machine Learning Algorithm



Introduction
Dry beans are a self-pollinated legume grown for human consumption. Beans are a significant crop on a global scale and are popular with both farmers and consumers. Dry beans account for nearly 50 percent of the grain legumes consumed directly by humans in the majority of developing countries [1]. Beans are a staple food in Sub-Saharan Africa, where they are consumed by more than 200 million people [2]. A quality control system ensures that approved seed meets national and global quality benchmarks. For the majority of food products, visual characteristics are the primary criterion used by consumers when making purchasing decisions [3]. Like other legume species, common beans show wide variation in growth patterns, physical features (size, shape, and shading), maturity, and ability to grow and adapt [4,5]. Sorting and classifying bean seeds manually is time-consuming, inefficient, and tedious, particularly when working with large production volumes. Human inspectors are usually in charge of checking raw materials, and it is difficult to standardise the inspectors' findings. These considerations reaffirm the importance of objective measurement systems. As a result, automatic grading and classification methods are required.
Recent technological advances have greatly helped researchers in this field. Computer vision systems (CVSs) are being used for quality control and have recently begun to serve as objective measurement and evaluation systems [6][7][8][9]. CVS technology, which is primarily camera and computer based, has been considered for assessing the sensory characteristics of agricultural products. Such a system consists of a light source, an image acquisition device, and computer peripherals and software. Digital repository systems make this information widely available with various attributes.
A balanced dataset is one in which each output class (or target class) is represented by an equal number of input samples. Imbalanced training data has a major negative impact on real-time performance [10]. The majority of the reported studies used a target class with an uneven distribution of observations, i.e., an imbalanced dataset.
The main contribution of this research is an unbiased ML-based multiclass classification model that identifies the dry bean variety with the best accuracy, using a balanced version of the dry bean dataset available at the UCI ML digital repository [11]. Using the preprocessed balanced dataset, the dry bean types "Dermason," "Sira," "Seker," "Cali," "Bombay," "Horoz," and "Barbunya" have been identified without discarding any features available in the dataset. Twenty-two ML algorithms have been evaluated with 10-fold cross-validation to identify the best ML multiclass dry bean classification model. To make the model more accurate, the Box-Cox transformation (BCT) was used to reduce the skewness of the dataset's attributes, bringing them close to a normal distribution.

Related Work
Kilic et al. [12] used computer vision to develop a classification system for bean varieties. The system consisted of hardware and software. The hardware was developed to capture a standard image from the samples. The software handled segmentation, morphological operations, and colour quantification of the samples. A total of 69 samples were used in their artificial neural network (ANN) model. The system's overall performance in classifying beans was 90.56 percent.
Using an infrared hyperspectral imagery method that works in the wavelength range of 390-1050 nm, Sun et al. [13] examined a quick and nondestructive method for categorising black bean variants. The primary component of the image was used to extract 16 textural and 6 morphological features by grey level co-occurrence matrix analysis. Hasan et al. [14] examined various categories of dry beans and used a deep neural network-based method to categorise them. Their approach was 93.44 percent accurate and had an F1-score of 94.57 percent when applied to a dataset of seven varieties of dry beans.
Abdulwahed et al. [15] studied five varieties of Egyptian faba-bean seeds: Giza3, Giza461, Misr1, Nobarya1, and Sakha1. Their method uses morphological features and an ANN to grade and classify the quality of Egyptian faba-bean seeds. Based on 15 physical traits of the seeds, the artificial neural networks separated faba beans into different types.
Araújo et al. [16] presented a computer-based visual inspection system for beans that used correlation-based multishape granulometry to locate each grain in an image and to determine its size and eccentricity. Their system correctly located 29,993 out of 30,000 grains, even when there were many "glued" grains in the image.
De Oliveira et al. [17] used an ANN as the transformation model and a Bayes classifier to identify coffee bean types such as whitish, cane green, green, and bluish-green. The neural network models achieved a generalisation error of 1.15 percent, and the Bayesian classifier identified all samples.
Gope and Fukai [18] assessed the capacity of the Raspberry Pi 3 system for classifying peaberries and normal beans in low-income countries. They found that, due to hardware constraints, the Raspberry Pi 3 could not complete the computation with linear support vector machines (SVMs) and k-nearest neighbors (kNNs) for large images.
Arboleda et al. [19] created a classification model for identifying coffee bean species. From 195 training images and 60 testing images, significant coffee bean morphology attributes such as bean area, perimeter, equivalent diameter, and percentage of roundness were extracted. The coffee beans were automatically classified using ANN and kNN. The ANN obtained a classification score of 96.66 percent.
Koklu and Ozkan [11] used a CVS to develop a multiclass classification of dry beans. The CVS-derived bean images were subjected to segmentation and feature extraction stages, yielding a total of 16 features (12 dimensions and 4 shape forms) from the grains. With 10-fold cross-validation, multilayer perceptron (MLP), SVM, kNN, and decision tree (DT) classification models were developed, achieving overall classification rates of 91.73 percent, 93.13 percent, 87.92 percent, and 92.52 percent, respectively. Table 1 shows the methodology and performance of various classification approaches for bean variety classification.
In this article, the proposed multiclass classification model uses a balanced dataset with 16 features and 7 varieties of dry beans. An unbalanced multiclass dataset biases ML algorithms towards the majority group. To avoid this, each dry bean type has 522 instances (522 × 7 = 3654 in total) with 16 features in the processed dataset.

Exploratory Data Analysis and Methodology
The proposed multiclass classification model is depicted in Figure 1. The model's initial stage is data preprocessing. The second stage of the model is the Box-Cox transformation, and the final stage is ML model development.

Data.
The data science process is a methodical way to address a data problem. In most scenarios, a data science project goes through five critical stages: problem definition, data processing, modelling, evaluation, and implementation. The dry bean dataset for this research was obtained from the UCI ML repository [11]. It is also available as a supplementary file with this article. The dataset contains information extracted from high-resolution camera images of 13,611 grains of seven different registered dry beans. From the grains, a total of 16 features were extracted. This study examined seven distinct varieties of dried beans, with market conditions dictating features such as aspect, shape, category, and structure. The dataset is available in .csv format for the dry bean varieties "Dermason," "Sira," "Seker," "Cali," "Bombay," "Horoz," and "Barbunya" with a total of 13,611 instances. Table 2 shows quantile and descriptive statistics for the 16 features of the dry bean dataset.

Data Preprocessing.
Preprocessing strategies improve the performance of classifiers [20]. Information extraction (IE) is the process of extracting structured content such as entities, interactions, facts, and terms, as well as other kinds of information that help the data analysis pipeline prepare the data for the study [21]. The distribution of the dry bean varieties in the dataset is shown in Figure 2. Figure 2(a) shows the percentage distribution of the seven dry bean varieties, and Figure 2(b) shows the count of each dry bean variety in the raw dataset. The dry bean type "DERMASON" has the largest share at 26.1 percent and the dry bean type "BOMBAY" the smallest at 3.84 percent. The most frequently encountered problem in data quality is the absence of feature values in some entries, so every instance has been checked for missing values. Dropping the duplicate instances reduced the dataset from 13611 to 13543 instances. Classification is a process that can be applied to structured or unstructured data. After dropping the duplicate instances, the class-wise counts of the dry bean dataset are 3546, 2636, 2027, 1860, 1630, 1322, and 522 for DERMASON, SIRA, SEKER, HOROZ, CALI, BARBUNYA, and BOMBAY, respectively. Except for the target "Class," all feature data types have been converted to float.
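The class counts reported above can be checked with a few lines of arithmetic. The following minimal sketch only restates the counts from the text; the variable names are illustrative, and the imbalance ratio (majority count divided by minority count) is an added summary statistic, not one reported in the article.

```python
# Class counts after duplicate removal, as reported in the text.
class_counts = {
    "DERMASON": 3546,
    "SIRA": 2636,
    "SEKER": 2027,
    "HOROZ": 1860,
    "CALI": 1630,
    "BARBUNYA": 1322,
    "BOMBAY": 522,
}

total = sum(class_counts.values())
# Share of each class in the deduplicated dataset, in percent.
shares = {name: 100.0 * n / total for name, n in class_counts.items()}

majority = max(class_counts, key=class_counts.get)
minority = min(class_counts, key=class_counts.get)
# Illustrative summary statistic: how many majority rows per minority row.
imbalance_ratio = class_counts[majority] / class_counts[minority]

for name, n in class_counts.items():
    print(f"{name:<9} {n:>5} {shares[name]:6.2f}%")
print("total instances:", total)
print("majority:", majority, "minority:", minority)
print("imbalance ratio:", round(imbalance_ratio, 2))
```

The total agrees with the 13543 instances stated in the text, and the gap between the largest and smallest class is what motivates the balancing step that follows.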

Creation of a Balanced Dataset.
With an unbalanced multiclass dataset, learning algorithms are biased towards the majority class. The minority class, however, is often the more significant one from a data mining perspective, as it may contain valuable information despite its rarity. When faced with such disparities, researchers should design an effective model capable of handling the bias. This is referred to as learning from unbalanced data [22]. To balance the distributions, there are methods that create new objects for the minority group (oversampling) and methods that eliminate instances from the majority group (undersampling) [23]. Creating new instances for the minority group may result in overfitting. Therefore, this article uses random undersampling to reduce the larger classes in the dry bean dataset to the size of the minority dry bean class. All dry bean types were brought uniformly to 522 instances using the random undersampling method, as can be observed in Figure 3. To develop the model, a balanced dataset with 3654 instances has been considered, in which each bean variety has 522 instances.
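Random undersampling as described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `random_undersample`, the seed, and the synthetic label list are assumptions, and only the class counts come from the text.

```python
import random
from collections import Counter, defaultdict


def random_undersample(labels, seed=0):
    """Return sorted row indices that keep the same number of rows per class.

    Every class is reduced to the size of the smallest class by sampling
    without replacement. The seed is an illustrative choice.
    """
    rows_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        rows_by_class[label].append(index)

    target_size = min(len(rows) for rows in rows_by_class.values())
    rng = random.Random(seed)

    kept = []
    for rows in rows_by_class.values():
        kept.extend(rng.sample(rows, target_size))
    return sorted(kept)


# Synthetic label column built from the class counts reported in the text.
counts = {"DERMASON": 3546, "SIRA": 2636, "SEKER": 2027, "HOROZ": 1860,
          "CALI": 1630, "BARBUNYA": 1322, "BOMBAY": 522}
labels = [name for name, n in counts.items() for _ in range(n)]

balanced_indices = random_undersample(labels)
balanced_counts = Counter(labels[i] for i in balanced_indices)
print(len(balanced_indices), dict(balanced_counts))
```

Because rows are only removed and never synthesised, no artificial instances enter the training data. The cost is that most of the majority-class rows are discarded.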
The steps followed in the creation of a balanced dataset are as follows:
(i) Step 1: The majority and minority classes in the dataset have been identified. The majority class in the preprocessed dataset is "DERMASON," and the minority class is "BOMBAY," with 3546 and 522 instances, respectively.

The BCT was applied to all of the features of the dataset to transform the skewed data towards a normal distribution. For each attribute, the figure on the left shows the distribution before BCT, and the figure on the right shows the distribution after BCT. The skewness can be found at the top right corner of the figure. The BCT [24], which transforms the skewed distribution into a normal distribution (the transformed values are no longer on the original scale), is given in equation (1):

$$
Y^{(\lambda)} = X\beta + \varepsilon, \qquad
Y^{(\lambda)} =
\begin{cases}
\dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln Y, & \lambda = 0
\end{cases}
\qquad (1)
$$

where Y represents the dependent (continuous) variable, X is the covariate matrix of the independent variables (1, x_1, x_2, ..., x_k), which includes the intercept, β is a regression coefficient vector, ε is a random error, and σ² is the variance of the random error. The maximum likelihood technique is commonly used to determine the parameter lambda (λ).
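The effect of the BCT on skewness can be illustrated with a short sketch. It assumes a single strictly positive feature and no covariates other than the intercept. A coarse grid search stands in for a proper maximum likelihood optimiser, and the log-normal sample is synthetic data rather than the dry bean dataset. All function names are illustrative.

```python
import math
import random


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def profile_log_likelihood(values, lam):
    """Log-likelihood of lambda under normality, up to an additive constant."""
    n = len(values)
    transformed = box_cox(values, lam)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + jacobian


def fit_lambda(values, grid=None):
    """Pick the lambda on a grid that maximises the profile log-likelihood."""
    if grid is None:
        grid = [i / 100.0 for i in range(-300, 301)]  # assumed search range
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


# Synthetic right-skewed feature (assumption: not taken from the dataset).
rng = random.Random(0)
feature = [rng.lognormvariate(0.0, 0.6) for _ in range(500)]

best_lambda = fit_lambda(feature)
transformed = box_cox(feature, best_lambda)
skew_before = skewness(feature)
skew_after = skewness(transformed)
print("lambda:", best_lambda)
print("skewness before:", round(skew_before, 3))
print("skewness after:", round(skew_after, 3))
```

In practice λ is fitted on the training data only and then reused for the test data, so that no information from the test set leaks into the transformation.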

Training and Test Dataset.
The training dataset is the set of data used to construct the model, and it contains known features and a target. The resulting model also needs to be validated against another dataset with known targets, called the test dataset or validation dataset. To this end, the entire known dataset can be divided into a training set and a test set [25]. The dry bean categorical classes, namely "SIRA," "BOMBAY," "DERMASON," "BARBUNYA," "HOROZ," "CALI," and "SEKER," were converted into the integers 1-7, respectively. The training and test sets have been split in an 80 : 20 ratio, with 2923 and 731 instances with 16 features, respectively.
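The label encoding and the 80 : 20 split can be sketched as below. The sketch assumes a plain shuffled split in which the test size is rounded up. That rounding reproduces the 2923 and 731 instances reported above, but the splitting routine the authors actually used is not stated. The names `label_to_int` and `train_test_indices` and the seed are illustrative.

```python
import math
import random

# Class order as listed in the text, mapped to the integers 1-7.
CLASS_ORDER = ["SIRA", "BOMBAY", "DERMASON", "BARBUNYA", "HOROZ", "CALI", "SEKER"]
label_to_int = {name: code for code, name in enumerate(CLASS_ORDER, start=1)}


def train_test_indices(n_rows, test_fraction=0.2, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_rows)  # assumption: round test size up
    return indices[n_test:], indices[:n_test]


# 3654 rows = 522 instances for each of the 7 classes in the balanced dataset.
train_idx, test_idx = train_test_indices(3654)
print(label_to_int)
print(len(train_idx), len(test_idx))
```

A stratified split, which keeps exactly the same class proportions in both parts, is a common alternative when every class must be equally represented in the test set.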

Machine Learning Algorithm (MLA) Selection.
A model built with a single method may not offer the best prediction for a specific dataset. Each machine learning technique has its own constraints, and creating a model with significant accuracy is difficult. The 22 MLAs were therefore evaluated on the balanced dataset to compare their accuracy, which helps in arriving at a better predictive model. The 10-fold cross-validation has been performed, and the mean accuracy of 19 MLAs is listed in Table 3. The following have been evaluated with 10-fold cross-validation: ensemble methods [26] such as the AdaBoost classifier, Bagging classifier, and extra trees classifier; generalised linear models [27] such as logistic regression, the passive aggressive classifier, the Ridge classifier, the stochastic gradient descent classifier, and the perceptron; Naive Bayes models [28] such as the Bernoulli and Gaussian MLAs; kNN and SVM algorithms [29]; tree-based methods [30] such as the DT classifier and extra tree classifier; discriminant analysis methods [31] such as linear and quadratic discriminant analysis; and Gaussian process MLAs. Among the 19 MLAs, the highest test accuracy was 92.69 percent (Figure 5), and the lowest accuracy of 12.77 percent was found with the Bernoulli Naive Bayes ML classifier. From the initial screening during validation, it is observed that the XGBoost, RF, and CatBoost algorithms offer greater precision. Therefore, the following sections describe the performance of these three algorithms on the balanced dry bean dataset with an 80 : 20 split and with 10-fold cross-validation.

Random Forest Algorithm.
DT modelling is an important part of RF. DTs are fitted to several samples of the original data obtained by the bootstrap method. Each bootstrap sample is drawn from the original data and has the same number of data points as the original data. The RF [32] constructs multiple DTs and merges them to produce more precise and stable predictions. The importance of node j is calculated as follows:

$$
ni_j = w_j C_j - w_{\mathrm{left}(j)} C_{\mathrm{left}(j)} - w_{\mathrm{right}(j)} C_{\mathrm{right}(j)}
$$

where C_j is the impurity value of node j, w_j is the weighted number of samples reaching node j, and left(j) and right(j) are the child nodes from the left and right splits on node j, respectively. An individual attribute's feature importance is then obtained from the importances of the nodes that split on that attribute.
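As a worked illustration of the node importance, the sketch below evaluates it for a hypothetical two-split tree. It then normalises the per-feature sums so that they add up to one, which is a common convention and is assumed here. The feature names, weights, and impurity values are invented for the example.

```python
from collections import defaultdict


def node_importance(w, c, w_left, c_left, w_right, c_right):
    """Impurity decrease at a node: w*C minus the weighted child impurities."""
    return w * c - w_left * c_left - w_right * c_right


# Hypothetical internal nodes:
# (feature, w_j, C_j, w_left, C_left, w_right, C_right)
nodes = [
    ("feature_a", 1.0, 0.50, 0.6, 0.20, 0.4, 0.10),  # root split
    ("feature_b", 0.6, 0.20, 0.3, 0.00, 0.3, 0.10),  # split of the left child
]

# Sum the node importances of all nodes that split on the same feature.
raw_importance = defaultdict(float)
for feature, *stats in nodes:
    raw_importance[feature] += node_importance(*stats)

# Normalise so that the feature importances sum to one (assumed convention).
total = sum(raw_importance.values())
feature_importance = {f: v / total for f, v in raw_importance.items()}
print(dict(raw_importance))
print(feature_importance)
```

In a forest, the per-tree feature importances are typically averaged over all trees to obtain the final ranking.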

Extreme Gradient Boost.
XGBoost [33] is a framework implementing the gradient boosting machine (GBM), a well-known algorithm for supervised learning. It is applicable to both classification and regression tasks. Let DS be the set of data containing n instances with m attributes each, $DS = \{(x_i, y_i)\}$ with $|DS| = n$ and $x_i \in \mathbb{R}^m$, and let $\hat{y}_i$ be the ensemble tree model's predicted target value, constructed using the equation

$$
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F
$$

Here K denotes the model's total number of trees and f_k denotes the model's k-th tree. Classification and Regression Trees (CART) serve as the base learner for gradient boosted trees, a popular machine learning algorithm for both classification and regression problems. F is the functional space of f, i.e., the set of feasible CARTs [34].

CatBoost implements oblivious DTs (binary trees in which the same feature is used to create the left and right splits at every node of a given level), thereby limiting the number of features split per level to one, which helps reduce prediction time. In the dry bean dataset D, every instance has m features in a vector x and a target dry bean class type y.
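The oblivious-tree structure and the additive form of the ensemble can be sketched together. The following is a toy illustration under stated assumptions: the split features, thresholds, and leaf values are invented, each tree returns a single real-valued score, and nothing here reflects the internals of the XGBoost or CatBoost libraries.

```python
def oblivious_tree_predict(x, splits, leaf_values):
    """Evaluate an oblivious tree.

    Every level applies one (feature_index, threshold) test to all nodes of
    that level, so the leaf is addressed by the bits of the test outcomes.
    """
    leaf = 0
    for feature_index, threshold in splits:
        bit = 1 if x[feature_index] > threshold else 0
        leaf = (leaf << 1) | bit
    return leaf_values[leaf]


def ensemble_predict(x, trees):
    """Additive ensemble: the prediction is the sum of the K tree outputs."""
    return sum(oblivious_tree_predict(x, splits, values) for splits, values in trees)


# Two hypothetical trees: depth 2 (4 leaves) and depth 1 (2 leaves).
trees = [
    ([(0, 0.5), (1, 2.0)], [0.1, 0.2, 0.3, 0.4]),
    ([(1, 0.0)], [-1.0, 1.0]),
]

x = [0.7, 1.0]
# Tree 1: x[0] > 0.5 gives bit 1, x[1] > 2.0 gives bit 0, so leaf 0b10 = 2.
# Tree 2: x[1] > 0.0 gives bit 1, so leaf 1.
prediction = ensemble_predict(x, trees)
print(prediction)
```

Because a tree of depth d needs only d comparisons and one table lookup per instance, evaluation is cheap. This is the prediction-time benefit mentioned above.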

CatBoost Classifier. Categorical boosting (CatBoost) is a Yandex-developed open-source boosting library.
Mathematically, the target statistic for the i-th categorical feature of the k-th instance of the dry bean dataset D can be expressed as follows:

$$
\hat{x}^{i}_{k} = \frac{\sum_{x_j \in D_k} \mathbb{1}_{\{x^{i}_{j} = x^{i}_{k}\}} \, y_j + a\,p}{\sum_{x_j \in D_k} \mathbb{1}_{\{x^{i}_{j} = x^{i}_{k}\}} + a},
\qquad D_k = \{x_j : \sigma(j) < \sigma(k)\}
$$

where a > 0. The indicator function $\mathbb{1}_{\{x^{i}_{j} = x^{i}_{k}\}}$ returns the value 1 when the i-th component of CatBoost's input vector x_j is equal to the i-th component of input vector x_k. The parameters a and p (prior) prevent underflow in the equation. σ is a random permutation.
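The ordered target statistic can be illustrated in a few lines. The sketch below is a simplified stand-in for what the CatBoost library does internally: it uses a single permutation, a toy categorical column, and binary targets, all of which are assumptions made for the example. The function name and its `order` argument are illustrative.

```python
import random


def ordered_target_statistic(categories, targets, prior, a=1.0, order=None, seed=0):
    """Encode a categorical column using only the rows that precede each row.

    For each row (visited in the permutation order) the encoding is
    (sum of earlier targets with the same category + a * prior) /
    (count of earlier rows with the same category + a).
    """
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)  # the random permutation sigma

    target_sums, counts = {}, {}
    encoded = [0.0] * n
    for index in order:
        category = categories[index]
        earlier_sum = target_sums.get(category, 0.0)
        earlier_count = counts.get(category, 0)
        encoded[index] = (earlier_sum + a * prior) / (earlier_count + a)
        # Only now does this row become visible to later rows.
        target_sums[category] = earlier_sum + targets[index]
        counts[category] = earlier_count + 1
    return encoded


# Toy data (hypothetical): one categorical column and a binary target.
categories = ["A", "B", "A", "A"]
targets = [1, 0, 0, 1]

# Identity permutation, so the result can be checked by hand.
encoded = ordered_target_statistic(categories, targets, prior=0.5, a=1.0,
                                   order=[0, 1, 2, 3])
print(encoded)  # [0.5, 0.5, 0.75, 0.5]
```

Since a row never sees its own target, the encoding does not leak the label into the feature. The first occurrence of a category falls back to the prior p.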

Results and Discussion.
The use of diverse bean varieties in dry bean cultivation inhibits the production of uniform crops. As a result, the harvested product, which contains a mixture of dry bean species, incurs economic losses. To address this issue, this study aims to distinguish the seven classes of dry beans cultivated in Turkey, as determined by the Turkish Standards Institute (TSE). The dry bean dataset has been processed through the developed model. The confusion matrices of the three MLAs, namely RF, XGBoost, and CatBoost, are shown in Figure 6. Confusion matrices enable a more detailed visualisation of results and a comparison of actual and predicted values. In Figure 6, "SIRA," "BOMBAY," "DERMASON," "BARBUNYA," "HOROZ," "CALI," and "SEKER" are denoted as 0, 1, 2, 3, 4, 5, and 6, respectively. The correctly predicted sample counts lie on the diagonal of the confusion matrix, and the misclassified instances lie off the diagonal. For example, in Figure 6(c), for the dry bean variety "SIRA," 84 test set instances were correctly identified. Seven test instances were identified as "DERMASON," two as "BARBUNYA," three as "HOROZ," two as "CALI," and two as "SEKER."
Figure 7 shows the receiver operating characteristic (ROC) curves of the RF, XGBoost, and CatBoost ML classification algorithms. The ROC curve plots the true positive rate against the false positive rate. The area under the curve (AUC) represents the degree or measure of separability, that is, the model's capability of distinguishing between dry bean classes. The CatBoost algorithm yields an AUC value of 0.99 for the "SIRA" dry bean type and an AUC value of 1 for the other dry bean types, "BOMBAY", "DERMASON", "BARBUNYA", "HOROZ", "CALI", and "SEKER".
Table 4 provides performance metrics such as precision, recall, and F1-score for the three ML algorithms, and Table 5 provides the ML model accuracy with the 80 : 20 dataset split. The accuracy of the model has been improved by about 1.49 percent using the balanced dataset and the CatBoost ML algorithm. Among the 22 MLAs tested, the CatBoost ML classifier provides the best performance. Table 6 shows the performance comparison with the existing method. The CatBoost ML classifier performs well compared to the existing method under balanced instances for seven dry bean types. CatBoost ML excels at solving classification problems with heterogeneous data.
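The way per-class metrics follow from a confusion matrix can be sketched as below. Two assumptions are made. First, rows are taken to be the actual classes and columns the predicted classes. Second, the "SIRA" row uses the counts quoted above, with the "BOMBAY" entry set to 0 because the text gives no count for it. The 3 × 3 matrix is a toy example and does not come from Figure 6.

```python
def per_class_metrics(matrix):
    """Precision, recall, and F1 per class from a square confusion matrix.

    Rows are actual classes and columns are predicted classes (assumption).
    """
    size = len(matrix)
    metrics = []
    for k in range(size):
        true_positive = matrix[k][k]
        actual_total = sum(matrix[k])
        predicted_total = sum(matrix[row][k] for row in range(size))
        recall = true_positive / actual_total if actual_total else 0.0
        precision = true_positive / predicted_total if predicted_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics


# Toy 3-class confusion matrix (hypothetical values).
toy_matrix = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
toy_metrics = per_class_metrics(toy_matrix)
print(toy_metrics)

# "SIRA" row from the text, ordered SIRA, BOMBAY, DERMASON, BARBUNYA, HOROZ,
# CALI, SEKER; the BOMBAY entry is assumed to be 0.
sira_row = [84, 0, 7, 2, 3, 2, 2]
sira_recall = sira_row[0] / sum(sira_row)
print("SIRA recall:", sira_recall)
```

Under these assumptions, 84 of 100 "SIRA" test instances are recovered, which corresponds to a recall of 0.84 for that class.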

Model Performance with Cross-Validation (CV).
The three algorithms RF, XGBoost, and CatBoost have been validated with 10-fold cross-validation, which corresponds to a 90 : 10 data split in each iteration.
In cross-validation with k folds, the original dataset is randomly subdivided into k mutually exclusive subgroups or "folds" (F_1, F_2, ..., F_k) of roughly equal size. There are k training and testing iterations. In iteration i, the test set is fold F_i, while the remaining folds are used collectively to train the model [29]. Table 7 and Figure 8 show the 10-fold cross-validation accuracy of the three MLAs. In 10-fold cross-validation, the CatBoost ML algorithm achieves the highest overall mean accuracy of 93.8 percent, with a range of 92.05 percent to 95.35 percent.
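The fold construction described above can be sketched as follows. This is a minimal illustration that only generates the index splits; model fitting is left out. The function name `k_fold_indices` and the seed are assumptions, and a plain rather than stratified partition is used.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) for k mutually exclusive folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Dealing out every k-th index gives folds whose sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 3654 rows in the balanced dataset, 10 folds.
splits = list(k_fold_indices(3654, k=10))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
print(len(splits[0][0]), len(splits[0][1]))
```

Each row appears in exactly one test fold, so the mean of the k per-fold accuracies uses every instance once for testing. Each training part holds roughly 90 percent of the data.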

Conclusion
Classification of dry bean seed varieties is critical for seed uniformity and quality assurance. Compared to human inspectors, the system has two significant advantages: it produces more accurate, reproducible, and objective sample classification, and it excludes the possibility of human inspectors misclassifying specimens. Initially, a log transformation was applied to the dry bean dataset features, but it failed to reduce the negative skewness. The BCT was therefore applied to all of the features of the dataset to transform the skewed data towards a normal distribution. A model constructed using a single method may not provide the best forecast for a given dataset. Each machine learning technique has its own set of restrictions, making it challenging to create a model with substantial accuracy. The 22 MLAs were therefore evaluated on the balanced dataset to compare their accuracy, which supports the development of a more accurate predictive model. The accuracy of the model has been improved by about 1.49 percent using the balanced dataset and the CatBoost ML algorithm. The developed models' high success rates across all metrics indicate that they are effective at classification. The overall mean accuracy on the balanced dataset is 93.8 percent for the CatBoost ML model. The results indicate that the proposed CatBoost ML classifier can be used effectively to classify dry bean varieties. Additionally, the developed framework can be applied to various kinds of dry beans from various regions. The model is developed without discarding any features from the dataset. The ML model can be upgraded further by combining ML, deep learning, and novel algorithms.

Data Availability
The dataset is available in a publicly accessible database.