Image Recognition for Garbage Classification Based on Transfer Learning and Model Fusion

Garbage is an underutilized resource, and garbage classification is one of the effective ways to make full use of it. To automate garbage classification, deep learning models are used for garbage image recognition. This paper proposes a novel garbage image recognition model, Garbage Classification Net (GCNet), based on transfer learning and model fusion. EfficientNetV2, Vision Transformer, and DenseNet each extract garbage image features, and these are combined to construct the GCNet neural network model. Data augmentation is used to expand the dataset, and the new dataset contains 41,650 garbage images. Experimental comparison with other models shows that the proposed model has good convergence, high recall and accuracy, and short recognition time.


Introduction
With the continual and rapid development of the economy, environmental pollution is becoming more serious, endangering the lives of billions of people, reducing life expectancy, and harming the growth and development of children [1].
Garbage is the primary source of pollution. "Garbage in the city" and "garbage in the countryside" are becoming more and more of an issue for towns and villages. Garbage classification is a symbol of social and environmental development.
To promote garbage classification and delivery, many cities, represented by Shanghai, have issued mandatory garbage classification laws. However, significant issues remain. For instance, residents' awareness of garbage classification is still relatively low, and many people neither understand garbage classification nor have clear standards for it.
Automated garbage classification methods can help resolve these challenges. China launched the first garbage classification system in Shanghai [2]. However, the automated garbage classification system cannot classify garbage images accurately [3]. Therefore, this paper proposes an image recognition algorithm for garbage classification based on transfer learning and model fusion.
Deep learning, the basis of the proposed algorithm, has developed rapidly in recent years owing to improvements in computational power and theoretical systems. Compared with traditional image feature extraction methods, deep learning does not require features to be extracted in advance [4]. In the era of big data, models can learn from large-scale data. Therefore, deep learning has greater learning ability, better adaptability, and a higher upper limit.
Deep learning is highly dependent on data, but different countries have different standards for domestic garbage classification, and no suitable dataset exists in China or internationally. Therefore, this paper uses transfer learning to compensate for the lack of datasets. Transfer learning is a machine learning method that takes knowledge from one domain and transfers it to another domain, enabling better learning results in the target domain.
Convolutional Neural Network (CNN), the most fundamental network for deep learning, was first proposed by LeCun and others. After continuous development, Krizhevsky [5] and others used it for the first time for classification tasks and achieved excellent results. CNN is also the most widely used deep learning algorithm in the field of computer vision (CV).
However, the types of household garbage are complex, and the distinctions between them are not clear. Ordinary CNN has difficulty learning the differences between categories and cannot complete the classification effectively. As a result, model fusion is used to improve the model's underlying feature extraction ability, which improves the model's learning ability.

Related Work
In research on garbage classification, Yang and his colleagues at Stanford University created the public TrashNet dataset. It contains 2,527 images separated into six categories: 403 cardboard, 501 glass, 410 metal, 594 paper, 482 plastic, and 137 other waste materials. Yang and Thung used a support vector machine (SVM) to perform early trials on this dataset, with an accuracy of 63% [6].
Satvilkar used the random forest (RF) algorithm to classify these images and achieved an accuracy of 62.61%. The RF algorithm is a classifier that trains on and predicts samples with multiple decision trees, each of which contributes to the final prediction [7,8].
In the era of big data, RF training can be highly parallelized over the data, which improves training speed. Later, Satvilkar experimented with the XGBoost algorithm and achieved an accuracy of 70.1% [3]. The XGBoost algorithm is an improvement on the Gradient Boosted Decision Tree (GBDT) algorithm. It is based on two integrated tree-based learning classifiers, RF and GBDT [9], and has the advantage of being less prone to overfitting.
Costa et al. used the K-nearest neighbor (KNN) algorithm to classify images and achieved an accuracy of 88.0% [10]. The KNN classification algorithm is easy to implement, has remarkable classification performance [11], and is frequently used in image classification. In the KNN algorithm, the category of an image is determined by the classes of its nearest one or more images [12]. The traditional machine learning methods described above have been around for a long time and have achieved good results in image processing. However, these methods usually consist of several independent processes, so they require a great amount of storage space for intermediate results, which makes their implementation cumbersome and unintelligent.
Many scholars now use deep learning methods to solve problems in image processing. Rabano et al. applied the MobileNet model to the TrashNet dataset with an accuracy of 87.2%. The model was successfully deployed on a Samsung Galaxy S6 Edge+ mobile phone [13]. Ruiz et al. used a combined Inception-ResNet model on TrashNet and achieved an average accuracy of 88.6% [14]. Adedeji et al. used the pretrained 50-layer residual network (ResNet-50) CNN model as the feature extractor and replaced the fully connected layer with an SVM in the later classification stage, achieving an accuracy of 87% on this dataset [15]. Aral et al. experimented with two fine-tuned models, reaching 95% with DenseNet121 and 94% with InceptionResNetV2 [16]. Ozkaya et al. compared a variety of combinations of feature-extraction networks and classifiers and found the best combination to be GoogleNet with an SVM classifier, with an accuracy of 97.86%, which is the best result on the TrashNet dataset so far [17].
There are also some nonpublic datasets. Mittal et al. created the GINI dataset of 2,561 images, used the GarbNet model, and obtained an average accuracy of 87.69% [18].
Yang et al. proposed a GarbageNet model. It uses the garbage classification dataset from the Huawei Cloud Garbage Classification Challenge, employs transfer learning, and learns noise-resistant features through a feature synthesis module. In addition, they designed a memory pool and a metric-based classifier to improve the model without retraining it. The best performance was achieved with an average accuracy of 96.96% [19].
Guo et al. investigated an algorithmic model for garbage classification based on EfficientNet. The dataset from the Huawei Artificial Intelligence Competition was used. To prevent irrelevant information in the images from affecting the training of the model, an attention mechanism was added after the EfficientNet output to emphasize or select the important information of the target object and suppress irrelevant details, enabling the model to focus on key features and better recognize the images. The final average accuracy reached 93.47% [20].
Fu et al. proposed a new transfer learning-based GNet model for garbage classification and an improved MobileNetV3 model, with an average accuracy of 92.62% [21].
In conclusion, the powerful feature representation capability of convolutional neural networks not only removes the need for manual extraction of image features in traditional image classification but also makes good use of the huge amount of image data now available, which gives it broad research significance.
However, there are still great challenges in applying convolutional neural networks to garbage classification: (1) The accuracy of convolutional neural networks relies heavily on the quality of the training dataset, but there are few public datasets, domestic or international. This hinders the application of convolutional neural networks. (2) The image background is uniform, so the algorithm's generalization cannot be demonstrated. (3) The types of household garbage are complex, and the differences between them are not obvious. Therefore, it is difficult for ordinary convolutional neural networks to learn the differences between categories and complete classification effectively.
In recent years, self-attention-based architectures, particularly Transformers, have become the preferred model for natural language processing (NLP) [22]. Inspired by this, Dosovitskiy et al. applied Transformers to images. After pretraining on Google's JFT dataset, ViT approaches or beats the state of the art on multiple image recognition benchmarks. The best model achieved an accuracy of 88.55% on ImageNet [23]. ViT does not need to rely on a CNN architecture to achieve good results on image classification tasks. This paper compares this model with others.
Mathematical Problems in Engineering
In summary, the main contributions of this paper are as follows: (1) In addition to collecting lots of datasets of existing garbage images on the Internet, this paper photographs and labels more than 2,000 garbage images for processing. Finally, this paper uses more than 40,000 garbage images and uses data augmentation techniques to enrich the dataset. (2) Using pretrained models on ImageNet through transfer learning, this paper greatly improves the identification results of classification tasks with insufficient samples. (3) Considering the problem that ordinary convolutional neural networks do not have strong generalization ability, this paper designs a network model based on model fusion, combining various pretrained models, to effectively learn the differences between garbage categories. Finally, this paper produces predicted results and completes garbage classification.

Parameter Debugging Based on Adam's Adaptive Method.
Tuning parameters is a major difficulty in deep learning. By updating the parameters once per sample, Stochastic Gradient Descent (SGD) improves overall optimization efficiency at the cost of a small loss of precision and a moderate increase in the number of iterations. The number of extra iterations is significantly smaller than the number of samples. Training usually uses a fixed learning rate, and the parameters $\theta$ are updated by gradient descent:
$$\theta_{t+1} = \theta_t - \eta\, g_t$$
where $g_t$ is the gradient and $\eta$ is the learning rate.
However, SGD has several obvious drawbacks: (1) SGD is sensitive to its parameters, so parameter initialization requires close attention. (2) It easily falls into local minima. (3) As more data become available, the training process takes longer. (4) All of the data from the training set are used for each iteration step.
In SGD, each parameter is updated with the same learning rate. In practice, however, parameters differ in importance, so different learning rates should be adapted dynamically for different parameters so that the objective function converges faster.
To update the learning rate dynamically, the Adagrad adaptive method squares the gradient of each parameter at each iteration, accumulates the squares, takes the square root of the sum, and divides the base learning rate by it. The learning rate of each parameter is thus tied to its gradient history, resulting in a separate learning rate for each parameter, which is referred to as the adaptive learning rate.
Based on gradient descent, a gradient accumulation variable $S_t$ is added:
$$S_t = S_{t-1} + g_t \odot g_t$$
where $\odot$ denotes the element-wise product. The learning rate is then adjusted by the accumulated gradient:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{S_t} + \varepsilon} \odot g_t$$
where $\varepsilon$ is a small constant introduced to maintain numerical stability. The learning rate has thus changed from a fixed learning rate to an adaptive learning rate controlled by the gradient accumulation variable.
It is easy to see that as the algorithm continues to iterate, $S_t$ grows and the overall learning rate shrinks. In general, the Adagrad adaptive method first encourages convergence and then penalizes it, becoming slower and slower. The learning rate of each element decreases (or stays unchanged) throughout the iteration process, so it is difficult to find a useful solution late in training because the learning rate is too low. To address this problem, RMSProp replaces the cumulative sum with an exponentially weighted average of the squared gradient:
$$S_t = \gamma S_{t-1} + (1 - \gamma)\, g_t \odot g_t$$
where $0 \le \gamma < 1$ is the decay rate. RMSProp and Adagrad otherwise use the same adaptive learning rate update.
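The different behavior of the two accumulators can be illustrated with a small pure-Python sketch. It assumes a constant gradient of 1.0, a base learning rate of 0.1, and a decay rate of 0.9; these values and the function names are illustrative choices, not settings taken from this paper.

```python
import math


def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Effective step size lr / (sqrt(S_t) + eps), with S_t a running sum."""
    s, sizes = 0.0, []
    for g in grads:
        s += g * g                                # S_t = S_{t-1} + g_t^2
        sizes.append(lr / (math.sqrt(s) + eps))
    return sizes


def rmsprop_step_sizes(grads, lr=0.1, gamma=0.9, eps=1e-8):
    """Same step size formula, with S_t an exponentially weighted average."""
    s, sizes = 0.0, []
    for g in grads:
        s = gamma * s + (1.0 - gamma) * g * g     # S_t = gamma*S_{t-1} + (1-gamma)*g_t^2
        sizes.append(lr / (math.sqrt(s) + eps))
    return sizes


grads = [1.0] * 100                               # assumed constant gradient
ada = adagrad_step_sizes(grads)
rms = rmsprop_step_sizes(grads)
print("Adagrad first/last step size:", ada[0], ada[-1])
print("RMSProp first/last step size:", rms[0], rms[-1])
```

Under these assumptions the Adagrad step size keeps shrinking as the sum grows, whereas the RMSProp step size settles near the base learning rate because old squared gradients are forgotten.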
Adam is essentially RMSProp with momentum terms, combining the strengths of the Adagrad and RMSProp algorithms. It dynamically adjusts the learning rate of each parameter using first-order and second-order moment estimates of the gradient.
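Adam's advantage lies mainly in the fact that, after bias correction, the learning rate in each iteration has a defined range, which makes the parameter updates relatively smooth. The formulas are as follows:
$$m_t = \mu\, m_{t-1} + (1 - \mu)\, g_t, \qquad n_t = \nu\, n_{t-1} + (1 - \nu)\, g_t \odot g_t$$
$$\hat{m}_t = \frac{m_t}{1 - \mu^t}, \qquad \hat{n}_t = \frac{n_t}{1 - \nu^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{n}_t} + \varepsilon} \odot \hat{m}_t$$
where $m_t$ and $n_t$ are the first and second moment estimates of the gradient, respectively, $\mu$ and $\nu$ are their exponential decay rates, and $\hat{m}_t$ and $\hat{n}_t$ are the bias-corrected versions of $m_t$ and $n_t$.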
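As a hedged illustration of the Adam update with bias correction, the following pure-Python sketch minimizes a one-dimensional quadratic. The objective, the hyperparameter values (learning rate 0.1, decay rates 0.9 and 0.999), and the function name `adam_minimize` are assumptions made for the example; they are not the training settings of this paper.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.1, mu=0.9, nu=0.999, eps=1e-8, steps=500):
    """Scalar Adam: moment estimates, bias correction, then the update."""
    m, n = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = mu * m + (1.0 - mu) * g           # first moment estimate
        n = nu * n + (1.0 - nu) * g * g       # second moment estimate
        m_hat = m / (1.0 - mu ** t)           # bias-corrected first moment
        n_hat = n / (1.0 - nu ** t)           # bias-corrected second moment
        theta -= lr * m_hat / (math.sqrt(n_hat) + eps)
    return theta


def quadratic_grad(theta):
    """Gradient of the assumed objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


first_step = adam_minimize(quadratic_grad, theta=0.0, steps=1)
theta_star = adam_minimize(quadratic_grad, theta=0.0)
print("after one step:", first_step)
print("after 500 steps:", theta_star)
```

With bias correction, the very first update has magnitude close to the learning rate regardless of the raw gradient scale, which is one way to read the "defined range" of the step size described above.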

Neural Network Activation Based on ReLU.
When Sigmoid is used as the activation function in deep neural network training, the gradient dispersion (vanishing gradient) phenomenon occurs, network parameters cannot be updated for a long time, and developing deeper network models becomes hard.
Therefore, ReLU is used as the activation function of the neural network.
ReLU is defined as follows:
$$\mathrm{ReLU}(x) = \max(0, x)$$
Figure 1 shows the ReLU function. The derivative of ReLU on the negative half-axis is 0. Once a neuron's activation value enters the negative half-axis, the gradient is 0, whereas positive values pass through unchanged.
This is known as unilateral inhibition, and it is closer to the biological activation model.
In addition, the derivative of the ReLU function is much faster to calculate. Its implementation is a single if-else statement, whereas the Sigmoid function requires floating-point arithmetic operations. The ReLU function is therefore considerably cheaper to compute than the Sigmoid function.
Moreover, when the input signal is strong, the difference between signals can still be preserved, so that the garbage image data can be processed centrally to obtain the image dataset.
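The contrast between the two activation functions can be sketched in a few lines of pure Python. The `chained_gradient` helper is a hypothetical simplification: it multiplies identical local derivatives across layers and ignores the weights, so it is only a rough proxy for backpropagation through a deep network.

```python
import math


def relu(x):
    """ReLU(x) = max(0, x), written as the if-else described in the text."""
    return x if x > 0 else 0.0


def relu_grad(x):
    """Derivative of ReLU: 1 on the positive half-axis, 0 on the negative."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                  # never larger than 0.25


def chained_gradient(grad_fn, x, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(x) ** depth


sigmoid_chain = chained_gradient(sigmoid_grad, 0.0, 10)
relu_chain = chained_gradient(relu_grad, 2.0, 10)
print("sigmoid, 10 layers:", sigmoid_chain)
print("ReLU, 10 layers:", relu_chain)
```

Even at its most favorable input, the Sigmoid derivative is 0.25, so the product over ten layers falls below one millionth, while the ReLU derivative stays at 1 for positive inputs.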

Evaluating Model Performance Based on Cross-Entropy Functions.
Garbage recognition is a multiclass classification problem. This paper chooses the multiclass cross-entropy function, which is the most commonly used loss in classification problems, as the loss function, and the training aim is to minimize it. The function is defined as follows:
$$L = -\sum_{i=1}^{N} y_i \log(p_i)$$
where $N$ denotes the number of categories, $y_i$ is 1 if the sample belongs to category $i$ and 0 otherwise, and $p_i$ is the predicted probability of category $i$.

Transfer Learning.
In deep learning for computer vision, the generalization ability of models is poor without a sufficiently wide range of training samples. Transfer learning uses pretrained models with minor changes to the architecture. If a deep neural network is trained with a vast quantity of data and gains knowledge in the form of "weights" in the neural network, these weights can be extracted and transferred to other deep neural networks so that those networks are not trained from scratch [24].
However, garbage types are numerous, and there are few public datasets, domestic or international. Appropriate use of transfer learning can yield positive results in this setting. This paper conducts transfer learning experiments and training-from-scratch experiments for five models: ResNet, DenseNet, EfficientNetV2, Vision Transformer, and VGGNet. These models achieve state-of-the-art performance on ImageNet for object recognition and detection [25]. The purpose is to compare the accuracy and convergence rate of transfer learning and training from scratch on garbage datasets, and to obtain experimental evidence of the role of transfer learning in garbage classification.

Model Fusion.
Model fusion is a type of model integration that is frequently utilized in Kaggle competitions.
Fusing several models can enhance final performance and helps the classifier learn the variations in characteristics across categories effectively.
There are various methods of model fusion, the most common being voting, averaging, and stacking.
Voting fusion is suitable for classification tasks. The predicted results of multiple learners are put to a vote, and the majority decides. Theoretically, for models that are independent of each other, the larger the structural differences across models, the better the voting fusion results. The averaging method is appropriate for both regression and classification problems, and it averages the predictions of several models. Averaging has the advantage of smoothing the results and thereby reducing overfitting.
The stacking method trains several base models on the original data and then combines the predictions of these base models into a new training set to train a new classification model.
In this paper, model fusion experiments are performed mainly based on averaging, using pretrained models of DenseNet, EfficientNetV2, and Vision Transformer on ImageNet.
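The difference between voting and averaging can be made concrete with a minimal pure-Python sketch. The three probability vectors below are invented for illustration and are not outputs of the models in this paper; the function names are likewise hypothetical.

```python
from collections import Counter


def majority_vote(probabilities):
    """Each model votes for its top class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


def average_fusion(probabilities):
    """Average the class probabilities across models, then take the argmax."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean


# Assumed per-model softmax outputs for one image over three classes.
preds = [
    [0.90, 0.10, 0.00],   # model A is confident in class 0
    [0.40, 0.60, 0.00],   # model B leans toward class 1
    [0.45, 0.55, 0.00],   # model C leans toward class 1
]

voted_class = majority_vote(preds)
averaged_class, mean_probs = average_fusion(preds)
print("voting picks class", voted_class)
print("averaging picks class", averaged_class, "with mean", mean_probs)
```

In this constructed case the two rules disagree: voting counts two weak preferences for class 1, while averaging lets the one confident model outweigh them. This is one reason averaging can give smoother results than hard voting.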

Garbage Classification Net.
The three base models for model fusion are DenseNet, Vision Transformer, and EfficientNetV2. First, the output of each of the three models is passed through a global average pooling layer and then regularized to prevent overfitting, yielding the feature vectors. Second, feature fusion is performed using a concatenate layer. Third, a fully connected layer is used to ensure that there are enough features. Fourth, Dropout is added to prevent overfitting. Finally, Softmax is used for classification. This new network structure is called Garbage Classification Net (GCNet) in this paper. The model structure of GCNet is shown in Figure 2.
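The fusion steps described above (pooling, concatenation, a fully connected layer, and Softmax) can be sketched in pure Python, together with the cross-entropy loss of a single sample. This is a simplified illustration and not the GCNet implementation: the backbones are replaced by random stand-in feature maps, the channel counts and class count are invented, the hidden fully connected layer and the classifier are collapsed into one dense layer, and regularization and Dropout are omitted because they are inactive at inference time.

```python
import math
import random


def global_average_pool(feature_map):
    """Reduce a C x H x W nested list to a length-C feature vector."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def dense(vector, weights, bias):
    """Fully connected layer: one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)                           # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Multiclass cross-entropy for a one-hot label."""
    return -math.log(probs[label])


def fused_forward(feature_maps, weights, bias):
    fused = []
    for fmap in feature_maps:
        fused.extend(global_average_pool(fmap))  # concatenate pooled features
    return softmax(dense(fused, weights, bias))


rng = random.Random(0)


def random_map(channels, height, width):
    """Stand-in for a backbone output; real features would come from the models."""
    return [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]


maps = [random_map(4, 3, 3), random_map(6, 2, 2), random_map(5, 3, 3)]
num_classes = 4                                  # hypothetical class count
fused_dim = 4 + 6 + 5
weights = [[rng.uniform(-0.5, 0.5) for _ in range(fused_dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

probs = fused_forward(maps, weights, bias)
loss = cross_entropy(probs, label=2)
print("class probabilities:", probs)
print("cross-entropy for label 2:", loss)
```

Concatenation keeps the features of each base model separate and lets the dense layer learn how to weight them, in contrast to averaging the final predictions.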

Experimental Configuration.
The experiments in this paper run on Windows 10 with an 11th Gen Intel(R) Core(TM) i7-11700K CPU @ 3.60 GHz, 32 GB of memory, and an NVIDIA GeForce RTX 3090 graphics card with 24 GB of memory.

Introduction to Datasets and Data Augmentation
Figure 3 shows the number of images in each dataset. The dataset is divided into three main sections:
(1) A large number of images are obtained through web crawlers. Crawling accesses web resources recursively through keywords, and the quality of the collected data is poor because of inaccurate keyword matching and other reasons. Over 30,000 images are filtered out by manually eliminating blurred images, images with heavy watermarks, and images containing multiple objects.
(2) A domestic garbage dataset released by Huawei.
(3) 2,136 images of everyday household garbage photographed by hand. The images are taken from above, from the left, and from the front of the object in a well-lit scene so that features can be extracted better during training. Figure 4 shows a collection of garbage images.
Because the garbage categories are complex, the differences between some categories are insignificant. In addition, learning data representations is the core of deep learning, and a large amount of training data is needed to make the learned model robust. Therefore, this paper expands the dataset by data augmentation.
The original images from the garbage classification training set (Figure 5) are transformed by techniques such as spatial and chromatic transformations to obtain the 10 enhanced images shown in Figure 6. The first row, from left to right, shows center crop, random crop, resize, horizontal flip, and vertical flip; the second row shows random flip, grayscale, conversion to a square image, color inversion, and γ (gamma) transform.
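A few of the listed transformations can be sketched on a tiny grayscale image stored as a nested list. This is a pure-Python illustration under stated assumptions: the 4 x 4 test image, the crop size, and the gamma value are invented, and the intensity (gamma) transform is an assumed reading of the final transform in the list. In practice an image library would be used.

```python
def center_crop(img, size):
    """Keep the central size x size window of an H x W image."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def horizontal_flip(img):
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def vertical_flip(img):
    """Reverse the row order, copying each row."""
    return [row[:] for row in img[::-1]]


def invert(img):
    """Color inversion for 8-bit intensities."""
    return [[255 - p for p in row] for row in img]


def gamma_transform(img, gamma):
    """Power-law intensity mapping; gamma > 1 darkens the midtones."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in img]


# Assumed 4 x 4 test image with intensities 0, 17, ..., 255.
image = [[(r * 4 + c) * 17 for c in range(4)] for r in range(4)]

augmented = {
    "center_crop": center_crop(image, 2),
    "horizontal_flip": horizontal_flip(image),
    "vertical_flip": vertical_flip(image),
    "invert": invert(image),
    "gamma": gamma_transform(image, 2.0),
}
for name, result in augmented.items():
    print(name, result)
```

Each transformation yields a new sample with the same label, which is how one original image can be expanded into several training images.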

Transfer Learning Experiments.
In this experiment, the network structures ResNet, DenseNet, EfficientNetV2, Vision Transformer, and VGGNet are each trained in two ways, training from scratch and transfer learning, and the garbage datasets are divided into four types. The aim is to analyze and compare the accuracy and convergence rate of transfer learning and training from scratch on the four types of garbage datasets to derive experimental results.
The PyTorch framework randomly initializes the network parameters by default, and training from scratch is simply a matter of training all the trainable layer parameters on the dataset once the network structure has been designed.
Transfer learning, by contrast, is divided into two steps: feature extraction and fine-tuning. Feature extraction loads the pretrained weights first and transfers the parameters pretrained in the source domain to the network in this paper, which not only speeds up model convergence but also provides generalization ability for better training of the model.
Fine-tuning is the process of unfreezing some or all of the trainable layers of the original network and training them at a lower learning rate on the garbage dataset of this experiment, so that the original parameters become more suitable for the garbage classification task.
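The two-step schedule can be illustrated with a deliberately small, framework-free toy. The model, data, and learning rates below are all assumptions: a single "backbone" parameter stands in for pretrained layers, a single "head" parameter stands in for the new classification layers, and the task is a simple regression. The sketch shows only the freezing and unfreezing logic, not the training code of this paper.

```python
def predict(params, x):
    """Toy model: a 'backbone' feature followed by a 'head' scaling."""
    return params["head"] * (x + params["backbone"] * x * x)


def mse(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def train(params, data, lr, epochs, frozen=()):
    """Full-batch gradient descent on the mean squared error."""
    for _ in range(epochs):
        grads = {"backbone": 0.0, "head": 0.0}
        for x, y in data:
            err = predict(params, x) - y
            grads["head"] += 2 * err * (x + params["backbone"] * x * x) / len(data)
            grads["backbone"] += 2 * err * params["head"] * x * x / len(data)
        for name, grad in grads.items():
            if name not in frozen:        # frozen parameters keep their pretrained values
                params[name] -= lr * grad
    return params


# Assumed target task; the "pretrained" backbone value 0.5 is close but not exact.
data = [(x, 2.0 * x + 1.2 * x * x) for x in (0.5, 1.0, 1.5, 2.0)]
params = {"backbone": 0.5, "head": 0.0}

# Step 1, feature extraction: train only the new head on frozen features.
train(params, data, lr=0.05, epochs=200, frozen=("backbone",))
loss_after_feature_extraction = mse(params, data)
backbone_after_phase1 = params["backbone"]

# Step 2, fine-tuning: unfreeze everything and continue at a lower learning rate.
train(params, data, lr=0.01, epochs=2000)
loss_after_fine_tuning = mse(params, data)

print("loss after feature extraction:", loss_after_feature_extraction)
print("loss after fine-tuning:", loss_after_fine_tuning)
```

In this toy, the frozen backbone limits how well the head alone can fit the target, and unfreezing it at a lower learning rate reduces the remaining error without discarding the pretrained starting point.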

Experimental Parameter Settings.
The experiment is divided into two phases: training from scratch and transfer learning. Table 1 shows the basic training parameter settings.

Feature Extraction.
In this experiment, the pretrained weights of the five networks ResNet, DenseNet, EfficientNetV2, Vision Transformer, and VGGNet on ImageNet are downloaded separately, and the network parameters of the pretrained weights are transferred to the experimental network.

Fine-Tuning.
Directly using the pretrained model for classification cannot solve the garbage image recognition problem of this experiment. To adapt the pretrained network weights to the garbage image data, the pretrained model is fine-tuned. First, this paper adjusts all layers of the model, both those of the original model and the added classification layers. Second, this paper continues training on the garbage image data so that the model better fits the garbage image classification problem. Ultimately, this improves the recognition accuracy of the model. Table 2 shows the specific details of fine-tuning training for each network model.

Experimental Results.
Figures 7 and 8 show the accuracy and loss curves for transfer learning, and Figures 9 and 10 show the accuracy and loss curves for training from scratch. As can be seen in Figures 7 and 10, convergence is slower with training from scratch, and it is difficult to train a good model on the garbage dataset in this paper because the garbage image samples are not extensive enough and the accuracy on the test set is not high. In contrast, the models under transfer learning converge faster, and each model has high accuracy on the test set. The highest test-set accuracies of training from scratch and transfer learning are shown in Table 3. A model trained from scratch cannot provide enough underlying features because the garbage image samples are not extensive enough.
In contrast, the model after transfer learning, which has powerful underlying features from its ImageNet weights, greatly improves the recognition results for classification tasks with insufficient samples.

Model Fusion Experiments.
In this experiment, multiple models are fused to extract features together. The performance of GCNet is compared to that of individual models to demonstrate its superiority.
After the transfer learning experiments, the models of DenseNet, Vision Transformer, and EfficientNetV2 pretrained on ImageNet are found to have the highest accuracy for garbage image recognition in this paper. Therefore, the three base models for GCNet are DenseNet, Vision Transformer, and EfficientNetV2. Table 4 shows the model fusion experimental parameter settings.

Conclusions
A garbage classification image recognition algorithm based on transfer learning and model fusion is proposed. The transfer learning experiments show that the models of DenseNet, Vision Transformer, and EfficientNetV2 pretrained on ImageNet work best for the garbage image dataset in this paper. They also confirm that the Transformer is suitable not only for natural language processing tasks but also for computer vision, including the garbage image recognition in this paper. Therefore, this paper uses DenseNet, Vision Transformer, and EfficientNetV2 as the base models for model fusion experiments and designs a neural network model named Garbage Classification Net suitable for garbage image recognition. The algorithm achieves its best performance of 97.54% at an acceptable inference speed, which exceeds most of the mainstream methods.
This paper also improves the generalization ability of the model by filtering and augmenting the collected and hand-photographed datasets. However, the following shortcomings still exist in this paper's research: (1) This paper creates a garbage classification dataset that provides data for training the classification model. The classification effect is good, but multitarget garbage detection is not achieved, and the target detection task needs to be improved. (2) The model still has a certain false detection rate, which needs to be reduced. (3) Other methods of model fusion, as well as other base models, could also be considered.
Based on the above issues, this paper has the following outlook: (1) Research on multitarget garbage detection will be launched, and the dataset will be expanded. (2) The model will be further optimized by adding a self-attention mechanism and modifying the model structure to achieve more accurate garbage classification through experiments, which will help to promote the further development of garbage classification. (3) This paper fuses three models, DenseNet, Vision Transformer, and EfficientNetV2; other fusion methods can be considered in subsequent research to further improve classification accuracy.

Data Availability
The dataset used to support the findings of the study can be obtained from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.