A New Efficient Classifier for Bird Classification Based on Transfer Learning



Introduction
Classification of different bird species in images is important for different areas [1,2]: (1) Environmental studies: Identifying bird species in images can assist in monitoring and analyzing birds for ecosystem research, determining changes in populations, and studying the effects of climate change and other factors on birds. (2) Conservation and biodiversity study: Recognition of bird species in images can serve to study and protect bird diversity by helping to identify different species and their migratory paths. (3) Agricultural research: Automatic bird classification can be used to identify species that may affect agriculture, for example, species that can harm agricultural crops or that participate in natural processes useful for agriculture. (4) Studying bird behavior: Image analysis can help study the behavior of different bird species, such as their migratory pathways, nesting, feeding, and other aspects of behavior. (5) Biological research: The classification of birds in images may be important for biological research aimed at understanding the evolutionary and genetic aspects of different bird species. (6) Environmental pollution monitoring: Changes in bird populations can serve as indicators of environmental pollution. Classification of birds in images can help in assessing the impact of pollutants on bird populations.
The use of deep learning and computer vision methods for automatic classification of birds makes it much easier to process a large amount of data and provide fast and accurate analysis [3,4].
As is known [5], classification is a type of problem in the area of artificial intelligence, the essence of which is to assign each object to a certain class based on its characteristics. The main purpose of classification is to teach a model to recognize regularities or patterns in data and determine which class a new, previously unseen object belongs to.
The task of classifying different species of birds in images belongs to the area of computer vision.
Computer vision is a field of science and technology that studies and develops systems that provide computers with the ability to "see" and interpret images and videos in the same way that human vision does [6,7]. The main purpose of computer vision is to give computers the ability to recognize objects, determine their characteristics, and interact with the environment based on visual data.
Problems in this area are best solved by applying deep learning methods. Deep learning is a subfield of machine learning that uses neural networks with a significant number of layers (deep architectures) to automatically learn high-level data representations. Deep learning becomes especially powerful when neural networks with many layers are used, as this allows models to automatically identify complex dependencies and abstractions in the input data [8].
Below, we analyze the current state of research on the classification of different bird species in images, highlighting the main advantages and disadvantages of each of the works considered. In Section 3, we provide a table comparing the best values of the performance indicators reported for existing models on the test data with those of the model proposed here. The preprocessing of images, namely, increasing their resolution, is described in detail in [9].
The authors of [10] introduced a research framework that explores the classification of bird species by integrating deep neural features from visual and audio data through a kernel-based fusion technique. Specifically, the deep neural features are derived from the activation values of an inner layer of a convolutional neural network (CNN). The authors employed multiple kernel learning (MKL) to fuse these features for the final classification. In experimental trials, the proposed CNN + MKL method, which incorporates both types of data, demonstrated superior performance compared to single-modal approaches, certain basic kernel combination methods, and the traditional early fusion approach.
A method for identifying birds based on visual features using convolutional neural networks was proposed in [11]. Two techniques are proposed in that paper. The first is attention-driven data enhancement, and the second is a model compression method for distilling disjointed knowledge. As a result, a fine-grained bird classification model was created, and an accuracy of 87.63% was achieved.
In [12], the authors compared three approaches to classifying different bird species in images. The first approach uses a traditional machine learning algorithm (SVM), the second uses a deep learning algorithm (the ResNet50 model), and the third uses a deep learning algorithm in combination with transfer learning (a pretrained ResNet50 model). It was concluded that the classifier based on deep learning in combination with transfer learning achieved the best results. This experiment demonstrated that traditional machine learning methods are ineffective for classifying different species of birds in images. It was also shown that combining deep learning with transfer learning gives a significant advantage over deep learning alone for this problem.
In [13], the authors investigated the use of convolutional neural networks to classify different bird species in images. Four models with different architectures based on transfer learning were used to classify images: ResNet152V2, Inception V3, DenseNet201, and MobileNetV2. The models were trained on the BIRDS 400 SPECIES dataset, containing about 50 thousand images of 400 different species of birds. ResNet152V2 and DenseNet201 had the best results. For ResNet152V2, accuracy = 95.45% and loss (categorical cross-entropy) = 0.8835. DenseNet201 achieved a slightly worse accuracy of 95.05% but a better loss of 0.6854.
In [14], the author investigated the application of deep learning methods to classify different species of birds in images. For this research, the BIRDS 400 SPECIES dataset was selected. A relatively large number of deep learning-based models, both simple and complex, were trained on the selected dataset. The experiments showed that the models based on transfer learning solved this problem best. In addition, image augmentation was applied, which had a very positive impact on the final effectiveness of the models. It was determined that the best performance was shown by the pretrained VGG19 model (with the first 17 layers frozen), which was trained on augmented training data. It achieved the best result: loss (categorical cross-entropy) = 0.1426.
In [15], the authors applied deep learning techniques combined with transfer learning to classify different bird species in images. A number of models pretrained on the ImageNet dataset were created, based on the following deep architectures: EfficientNetB0, DenseNet201, MobileNetV2, MobileNet, ResNet152V2, VGG16, and VGG19. Next, these models were customized to classify different bird species in images using the BIRDS 400 SPECIES dataset. The experiments showed that the EfficientNetB0 model coped best with this problem. After that, all models of the EfficientNet family (B0-B7) were compared in detail. It was concluded that the EfficientNetB0 model showed the best results on the test data: accuracy = 98.60%.
The research paper [16] presents an extensive examination of bird detection and species classification, employing the YOLOv5 object detection algorithm and the EfficientNetB3 deep learning model with retraining. The dataset used by the authors is the same as the one used in our study. Consequently, in Section 3, we perform a comparative analysis of the results.
The aim of this work is to develop a new classifier using deep learning methods that classifies as many different bird species as possible in images with high accuracy and efficiency.
The novelty of this paper is summarized as follows: (1) A dataset including a large number of bird species was analyzed and preprocessed in detail. (2) The optimal architecture of the model is proposed, using the transfer learning approach. (3) A new efficient algorithm has been developed to classify different bird species in images based on deep learning. (4) Through large-scale testing, it was established that the proposed algorithm made it possible to significantly improve the performance indicators of the model: loss, accuracy, precision, recall, and F1 score.
The rest of the paper is structured as follows: In Section 2, data preprocessing is described, the optimal architecture of the classification model based on deep learning is built, and model training is discussed in detail. In Section 3, the model is trained in two phases, and the test results are given. The last section contains conclusions and prospects for further research.

Dataset.
To train the model in this work, the BIRDS 525 SPECIES dataset [17] was chosen, which contains about 90 thousand images of 525 different species of birds. Each image is in color, has a size of 224 × 224 pixels, and is stored in JPG format. It is a very high-quality dataset in which there is only one bird in each image, and the bird usually occupies at least 50% of the pixels. It should also be noted that all images are original and were not created by applying augmentation techniques.
The dataset is predivided into three samples: training, validation, and test. The training set contains 84,635 images (94%), while the validation and test sets each contain 2,625 images (3%).
Next, we analyze how well the dataset is balanced. For this purpose, graphs visualizing the number of images for each species of bird were built for each sample of the dataset separately.
As can be seen in Figure 1, the training sample is somewhat imbalanced because the number of images per species ranges from 140 to 260. However, this imbalance is not critical and was ignored. At the same time, Figures 2 and 3 demonstrate that the validation and test samples are perfectly balanced, as each species of bird in these samples has exactly five images.
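The balance check described above can be sketched as follows. This is a minimal illustration, assuming the labels of each sample are available as a plain list of species names; the function name `class_balance` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def class_balance(labels):
    """Return (min_count, max_count, imbalance_ratio) for a list of class labels."""
    counts = Counter(labels)
    smallest = min(counts.values())
    largest = max(counts.values())
    return smallest, largest, largest / smallest


# Toy stand-in for a training sample: species "A" has 140 images, "B" has 260.
train_labels = ["A"] * 140 + ["B"] * 260
print(class_balance(train_labels))

# Toy stand-in for a validation sample with exactly five images per species.
val_labels = [name for name in ("A", "B", "C") for _ in range(5)]
print(class_balance(val_labels))
```

With the counts reported above, the largest class in the training sample has roughly 1.86 times as many images as the smallest (260 / 140), while the validation and test samples have a ratio of exactly 1.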
Then it was decided to apply image augmentation [18], which applies various transformations to the original images to create new, slightly modified versions. This increases the invariance of the model, making it more stable and able to classify images under more difficult conditions, and provides additional regularization by introducing randomness and diversity into the training data, thereby helping to prevent overfitting.
The following transformations were applied: (1) RandomFlip("horizontal") horizontally flips the image with a probability of 50%. (2) RandomTranslation(0.05, 0.05, fill_mode = "nearest") shifts the image vertically and horizontally by a random amount in the range of [−5%, +5%], filling the resulting empty pixels with the value of the nearest pixel from the original image. The result of applying the augmentation technique is shown in Figure 4.
The dataset for model training was also optimized by using the prefetch method [19], which allows data batches to be loaded asynchronously into memory before the model needs them. This approach helps improve training performance, especially when working with large amounts of data, which is the case in this work. Setting buffer_size = AUTOTUNE automatically determines the optimal buffer size for maximum performance.
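The idea behind prefetching can be illustrated with a plain Python analogue. This sketch is not the tf.data implementation; it only assumes that a background thread prepares items while the consumer works, and the helper name `prefetch` and the fixed `buffer_size` are illustrative choices. In tf.data, the corresponding call is `dataset.prefetch(buffer_size=tf.data.AUTOTUNE)`.

```python
import queue
import threading


def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable`, preparing up to `buffer_size` items ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    end_marker = object()  # signals that the producer has finished

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(end_marker)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the next item is ready
        if item is end_marker:
            return
        yield item


# The consumer sees the same items in the same order as without prefetching.
batches = list(prefetch(range(5), buffer_size=2))
print(batches)
```

The order of the batches is unchanged; only the time at which each batch is prepared moves ahead of the moment it is consumed.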
To apply transfer learning, we also used the ImageNet dataset [20]. It is one of the largest and best-known datasets in the field of computer vision (see Figure 5). It includes more than 14 million images that belong to more than 21 thousand classes of very different kinds. ImageNet has become popular because it covers a wide range of objects, from animals and plants to household items and vehicles. This dataset is publicly available and is provided to researchers free of charge for noncommercial use.

Model Architecture.
Next, it was necessary to build the optimal architecture of the classification model based on deep learning. This is an essential task, as it has a significant impact on the final efficiency of the model. It is also laborious, since it requires a large number of experiments.
It was decided to use a convolutional neural network architecture [21], because this deep learning method is well suited to image classification problems. The main characteristic of convolutional neural networks is the use of convolutional layers. They apply convolution operations to input images, thereby detecting local patterns (features) on the basis of which classification takes place.
Initially, we used the Inception V3 and VGG19 convolutional neural network architectures. The highest accuracies achieved on the test sample were 92% and 95%, respectively, when training with the Adam optimizer [22]. Eventually, the experiments led to the conclusion that the best solution would be to use transfer learning [23]. This is a popular approach in deep learning, the essence of which is that knowledge obtained while solving one problem is reused to solve another. That is, a model previously trained to solve one problem is reused to solve a new one. To apply this approach in this work, it is necessary to choose a ready-made convolutional neural network model that is pretrained on a suitable dataset. Based on an analysis of the recent literature [16], it was concluded that the most effective choice would be the EfficientNetB5 model [24], pretrained on the ImageNet dataset described in Section 2.1. The EfficientNetB5 model belongs to the EfficientNet family, a family of models designed for computer vision problems, including image classification. These models are characterized by high efficiency with a relatively small number of parameters. The EfficientNet family includes different versions, designated B0 to B7, where B0 is the smallest model and B7 is the largest. EfficientNetB5 differs from smaller versions such as EfficientNetB0 in having more parameters and a deeper architecture. Typically, a deeper architecture allows the model to better adapt to complex problems that require a large amount of training data.
To load the pretrained model, the keras.applications.EfficientNetB5 function was used. The following parameters were passed to it: input_shape = (224, 224, 3) determines the size of the input data of the model. include_top = False indicates that the top layer of the model (the one directly responsible for classification based on extracted features) will not be included in the loaded model. weights = "imagenet" indicates that the model will be loaded with the weights obtained during training on the ImageNet dataset. pooling = "max" indicates that the last pooling layer in the loaded model architecture will use the maximum pooling operation.
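The loading step described above might look roughly like the following sketch. The four keyword arguments are the ones listed in the text; the helper name `load_backbone` and the lazy import are illustrative choices, and calling the function requires TensorFlow and downloads the ImageNet weights.

```python
# Arguments described in the text for loading the pretrained model.
BACKBONE_KWARGS = {
    "input_shape": (224, 224, 3),  # size of the input images
    "include_top": False,          # drop the original ImageNet classifier
    "weights": "imagenet",         # start from ImageNet-pretrained weights
    "pooling": "max",              # global max pooling after the last block
}


def load_backbone():
    """Load EfficientNetB5 with the arguments above and freeze its layers."""
    # Imported lazily so that the sketch can be read and imported
    # on a machine without TensorFlow installed.
    from tensorflow import keras

    backbone = keras.applications.EfficientNetB5(**BACKBONE_KWARGS)
    # Freeze the pretrained layers so that only layers added on top are trained.
    backbone.trainable = False
    return backbone
```

With include_top = False and a pooling mode set, the loaded model outputs one feature vector per image, to which the classification layers can be attached.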
As a result, a model containing 578 layers was loaded. After the pretrained model was loaded, it was necessary to build the architecture of the final model. To do this, the pretrained model was integrated into it, and top layers were added, which directly perform the classification.
Data augmentation: it applies the augmentation technique to the images the model takes as input. This layer is active only during training, because augmentation is redundant when the model is used for inference.
Dense (fully connected layer): it is used to perform classification based on the features extracted by the convolutional layers of the pretrained model.
ReLU: it is an activation function that introduces nonlinearity into the neural network.
Batch Normalization: it is used to normalize the input data by applying a transformation that makes the average value close to 0 and the standard deviation close to 1.
Dropout: it is used to prevent overfitting of the model by randomly deactivating a certain fraction of neurons during training.
The last dense layer has the same number of neurons as the number of classes in the selected dataset and uses the softmax activation function to output the vector of probabilities that the image belongs to each class.
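To make the role of the final softmax layer concrete, here is a small standalone sketch of the softmax function. It uses three made-up scores instead of the 525 outputs of the real model, and the function name `softmax` is simply the conventional one.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical class scores produced by the last dense layer.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The predicted class is the index of the largest probability, which is always the index of the largest raw score.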

Model Training.
After the architecture of the final model was built, it was necessary to train it, that is, to find weights that are optimal for the problem addressed in this work. This stage is key, since the quality of the weights largely determines the effectiveness of the model.
To start training the model, it was necessary to choose the optimization algorithm, the loss function, and the training metrics.
Adam was chosen as the optimization algorithm. It is widely used in deep learning to train convolutional neural networks. It combines ideas from other optimization algorithms, such as stochastic gradient descent (SGD) with momentum and RMSprop, to enable efficient and adaptive updating of model weights during training. Like SGD, Adam updates the model parameters using gradients computed on mini-batches. It also adjusts the step size for each parameter separately using running averages of the previous gradients and their squares. This adaptivity helps the model converge faster and makes it less likely to get stuck in poor local minima. Adam also uses the concept of momentum, whose essence is to accumulate the history of gradients for each parameter during optimization. This history helps stabilize the optimization process, especially when the data are noisy or the gradients are unstable.
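The update rule summarized above can be written out for a single scalar parameter. This is a sketch of the standard Adam update with its usual default hyperparameters, applied to the toy objective f(x) = x² rather than to a neural network; the helper names `adam_step` and `minimize_quadratic` are illustrative.

```python
import math


def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return its new value."""
    state["t"] += 1
    # Running average of gradients (momentum) and of squared gradients.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimize_quadratic(x0=1.0, steps=200, lr=0.1):
    """Minimize f(x) = x**2, whose gradient is 2 * x, using Adam."""
    x = x0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(steps):
        x = adam_step(x, 2.0 * x, state, lr=lr)
    return x


print(minimize_quadratic())
```

On the very first step the bias-corrected ratio equals the sign of the gradient, so the parameter moves by almost exactly the learning rate regardless of the gradient's magnitude, which is what makes the step size adaptive per parameter.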
Sparse categorical cross-entropy [25] was chosen as the loss function. This function is used in deep learning to measure the difference between the probability distribution that the model predicts and the true class distribution. It is especially useful in multiclass classification problems, where each object belongs to exactly one of the possible classes. Sparse categorical cross-entropy is calculated by comparing the true class distribution with the probability distribution predicted by the model. If the predicted probabilities correspond exactly to the true distribution, the loss equals zero. Otherwise, the loss increases, indicating differences between the predictions and the true values. To track the progress of model training clearly, it was decided to use the accuracy indicator.
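As a worked illustration of this loss, the sketch below computes sparse categorical cross-entropy for integer class labels. The three-class probabilities are made up, and the small clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def sparse_categorical_crossentropy(true_class, probs, eps=1e-7):
    """Loss for one sample: negative log of the probability of the true class."""
    p = max(probs[true_class], eps)  # clip to avoid log(0)
    return -math.log(p)


def mean_loss(true_classes, batch_probs):
    """Average the per-sample losses over a batch."""
    losses = [
        sparse_categorical_crossentropy(y, p)
        for y, p in zip(true_classes, batch_probs)
    ]
    return sum(losses) / len(losses)


# The true class is 1 and the model assigns it a probability of 0.7.
print(sparse_categorical_crossentropy(1, [0.1, 0.7, 0.2]))
# A perfect prediction gives a loss of zero.
print(sparse_categorical_crossentropy(1, [0.0, 1.0, 0.0]))
```

The word "sparse" refers only to the label format: the true class is given as an integer index rather than as a one-hot vector, and the value of the loss is the same in both cases.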
It is worth noting that the batch size, that is, the number of images the model takes as input at each training step, is 32.
The callbacks for model training were defined as follows: EarlyStopping(monitor = "val_loss", patience = 12); ModelCheckpoint("model1.h5", monitor = "val_loss", save_best_only = True). EarlyStopping is used to stop the training process early if the model has stopped improving [26]. It avoids wasting expensive training resources when the monitored indicator ceases to improve over a certain number of epochs. monitor = "val_loss" indicates that the EarlyStopping callback will observe the value of the loss function on the validation sample of the dataset in order to determine whether the model has stopped improving. patience = 12 indicates that training is stopped early if the monitored value has not improved for 12 consecutive epochs after the best value was recorded.
The ModelCheckpoint callback is used to automatically save the state of the model (including weights and architecture) at certain points during training [27]. This is very useful because training on a large amount of data is expensive, and if it fails for some reason, all previously learned weights can be lost.
This callback is also useful because it makes it possible to keep only the best version of the model, overwriting the previously saved version and discarding all versions with worse values of the monitored indicator.
filepath = "model1.h5" specifies the name of the file in which the model will be saved. monitor = "val_loss" indicates that the ModelCheckpoint callback will observe the value of the loss function on the validation sample of the dataset. When this value improves, ModelCheckpoint saves the model. save_best_only = True indicates that only the best version of the model will be saved, that is, the one for which the value of the monitored indicator is the best.
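The combined behavior of the two callbacks can be mimicked without a deep learning framework. The sketch below is a simplified simulation, not the Keras implementation: it assumes that any strict decrease of the validation loss counts as an improvement, and the function name `train_with_callbacks` and the loss values are invented for illustration.

```python
def train_with_callbacks(val_losses, patience):
    """Simulate EarlyStopping and ModelCheckpoint(save_best_only=True).

    Returns (last_epoch_run, best_epoch, best_val_loss), with 0-based epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: this is where the checkpoint would be overwritten.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch, best_loss  # early stop
    return len(val_losses) - 1, best_epoch, best_loss


losses = [1.0, 0.8, 0.9, 0.85, 0.95, 0.7]
print(train_with_callbacks(losses, patience=3))
print(train_with_callbacks(losses, patience=12))
```

With a patience of 3, training stops before the late improvement at the last epoch is reached, whereas a larger patience, such as the value of 12 used here, lets training continue through such temporary plateaus.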
Then, it was necessary to proceed directly to training the model.
Since trainable = False is set for the pretrained EfficientNetB5 model, its layers are frozen during training (their weights do not change). The weights change only in the top layers that were added to the final model after the pretrained one.
The following indicators are used to evaluate the model's performance: loss, accuracy, recall, precision, and F1 score.
The following quantities are used in the formulas for these indicators:

Journal of Engineering
True Positives: the number of images that belong to a certain class and were correctly identified as this class.
False Positives: the number of images that do not belong to a particular class and were mistakenly identified as this class.
True Negatives: the number of images that do not belong to a particular class and were correctly identified as not belonging to this class.
False Negatives: the number of images that belong to a certain class and were mistakenly identified as not belonging to this class.
Accuracy: it measures the overall correctness of the model's classification. It is defined as the ratio of the number of correctly classified images to the total number of images. Accuracy is most informative for a well-balanced dataset:
$$\text{accuracy} = \frac{\text{true negatives} + \text{true positives}}{\text{true positives} + \text{false positives} + \text{true negatives} + \text{false negatives}}.$$
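The four counts and the indicators built from them can be computed for one class at a time as in the sketch below. The species names and predictions are invented, and the recall and F1 formulas used here are the standard textbook definitions rather than formulas quoted from this text.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN for one class treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    return (tn + tp) / (tp + fp + tn + fn)


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


y_true = ["owl", "owl", "eagle", "eagle", "heron", "heron"]
y_pred = ["owl", "eagle", "eagle", "eagle", "heron", "owl"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="owl")
p, r = precision(tp, fp), recall(tp, fn)
print(tp, fp, tn, fn, accuracy(tp, fp, tn, fn), p, r, f1_score(p, r))
```

For a multiclass problem such as this one, the per-class values are usually averaged over all classes to obtain a single precision, recall, and F1 score.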

Phase 1.
The model was compiled with the chosen optimization algorithm, loss function, and training metrics.
After that, training was started using the model.fit function, to which the training and validation samples of the dataset and the previously defined callbacks were passed. The maximum number of training epochs was set to 100.
Training took place in the Google Colab environment, using premium resources (A100 GPU and Colab Pro+), and took several hours. The model was trained for the maximum number of epochs (100), and the results achieved are given in Table 1.
The obtained results are additionally visualized in Figures 7 and 8. In particular, the change in the loss is shown in Figure 7, and the change in accuracy over the epochs is shown in Figure 8.
As can be seen, the accuracy of the model on the validation sample of the dataset was quite high (val_accuracy = 0.9463). This result is satisfactory, but it was decided to conduct additional experiments in order to increase the effectiveness of the model.

Phase 2.
To improve the efficiency of the model, a certain number of layers of the pretrained EfficientNetB5 model were unfrozen, that is, made trainable (not including BatchNormalization layers, because changing their weights would negatively affect the effectiveness of the model), and training was continued (the second phase of training). In total, the pretrained model contains 578 layers, and the deeper a layer is, the more complex the features it forms. It therefore makes sense to unfreeze the last layers, since they are responsible for forming high-level features that can be adapted to our task. The initial layers form low-level features, and since these are already suitable for our task, changing the weights of the initial layers would not make sense.
The experiments demonstrated that the best results were obtained by unfreezing the last 92 layers (not including BatchNormalization layers) of the pretrained model, so this number of layers was unfrozen for the second phase of training.
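A sketch of this selective unfreezing is given below. It assumes one reading of the description, namely that the last n layers are taken and the BatchNormalization layers among them are skipped; the helper `unfreeze_last_layers`, the `ToyLayer` class, and the predicate argument are hypothetical. With Keras, the predicate would typically be `lambda layer: isinstance(layer, keras.layers.BatchNormalization)`.

```python
def unfreeze_last_layers(layers, n_last, is_batch_norm):
    """Make the last `n_last` layers trainable, leaving batch-norm layers frozen.

    Returns the number of layers that were actually unfrozen.
    """
    if n_last <= 0:
        return 0
    unfrozen = 0
    for layer in layers[-n_last:]:
        if is_batch_norm(layer):
            continue  # keep normalization statistics and weights fixed
        layer.trainable = True
        unfrozen += 1
    return unfrozen


class ToyLayer:
    """Minimal stand-in for a framework layer with a `trainable` flag."""

    def __init__(self, kind):
        self.kind = kind
        self.trainable = False


toy_layers = [ToyLayer(k) for k in ("conv", "bn", "conv", "bn", "conv")]
count = unfreeze_last_layers(toy_layers, 3, lambda layer: layer.kind == "bn")
print(count, [layer.trainable for layer in toy_layers])
```

Under the alternative reading, in which 92 layers are unfrozen after excluding BatchNormalization layers from the count, the slice would have to be extended until 92 other layers had been collected.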
In this phase, the learning rate for the Adam optimization algorithm was set to 1e-5.
The callbacks for model training were also changed:
EarlyStopping(monitor = "val_loss", patience = 13)
ModelCheckpoint("model2.h5", monitor = "val_loss", save_best_only = True)
ReduceLROnPlateau(monitor = "val_loss", factor = 0.2, patience = 3)
For the EarlyStopping callback, the patience parameter was changed to 13. For ModelCheckpoint, the filepath parameter was changed to "model2.h5". For ReduceLROnPlateau, factor = 0.2 indicates that the learning rate is reduced by a factor of 5 if the monitored indicator does not improve for a certain number of epochs, and patience = 3 indicates that the learning rate is reduced after the monitored indicator has not improved for 3 consecutive epochs.
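The effect of ReduceLROnPlateau with these settings can be traced with a small simulation. This is a simplified model of the callback that ignores options such as cooldown and min_delta; the function name `reduce_lr_on_plateau` and the validation losses are made up.

```python
def reduce_lr_on_plateau(val_losses, initial_lr, factor=0.2, patience=3, min_lr=0.0):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # factor = 0.2 divides the rate by 5
                wait = 0
        history.append(lr)
    return history


# Two improving epochs, three epochs without improvement, then a new best value.
lr_history = reduce_lr_on_plateau([0.5, 0.4, 0.41, 0.42, 0.43, 0.3], initial_lr=1e-5)
print(lr_history)
```

Starting from 1e-5, the third consecutive epoch without improvement lowers the learning rate to 2e-6, and it stays at that value until the next plateau.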
The second phase of training again took place in the Google Colab environment, using premium resources (A100 GPU and Colab Pro+), and also took several hours. The model was trained for 48 epochs, and the results are given in Table 2.
Visualization of the results obtained in the second phase of model training is additionally presented in Figure 8. It was decided to stop the model training process, since the accuracy of the model on the validation sample of the dataset was very high (val_accuracy = 0.9745), and no further ways of improving it were identified.
After training was fully completed, it was necessary to evaluate the effectiveness of the model. The efficiency results obtained when the model was run on the test sample of the dataset are given in Table 3.
The error (confusion) matrix of the model on the test sample of the dataset was also visualized (see Figure 9).
Figure 10 demonstrates the operation of the model on specific images.
Comparing the model proposed in this work with those proposed in the analyzed literature (see Tables 4 and 5), we can conclude that our model has two significant advantages: (1) High efficiency of the model for the number of bird species equal to 400: the values of the efficiency indicators reported for the other models on the test data are inferior to those obtained in this work (loss = 0.0224 and accuracy = 99.86%).

Conclusions
The BIRDS 525 SPECIES dataset, which contains 525 species of birds, was selected, analyzed, and preprocessed. It was then used to train the proposed model. The optimal architecture of the model was also built using the transfer learning approach. For this purpose, the EfficientNetB5 model, which had been previously trained on the ImageNet dataset, was integrated into it.
A new efficient algorithm for the classification of different species of birds in images was developed. The model training process was carried out in two phases: (1) only the top layers were trained, and (2) the last 92 layers of the pretrained EfficientNetB5 model (not including BatchNormalization layers) and the top layers were trained. The efficiency of the model was also evaluated using the following indicators: loss, accuracy, precision, recall, and F1 score. An error matrix was visualized so that the model could be analyzed for different classes. As a result, the efficiency indicators accuracy = 98.86%, precision = 0.99, recall = 0.99, and F1 score = 0.99 were obtained. A comparative analysis of the obtained indicators with the corresponding indicators obtained by other authors was also carried out. From the analysis of the results obtained, it can be argued that the task of classifying various species of birds was solved effectively while ensuring high accuracy.
As future research, it would be important to optimize the proposed algorithm in order to accelerate model training and make a real-time solution possible [29,30].

(3) RandomRotation(0.05, fill_mode = "nearest") rotates the image by a random angle in the range of [−18°, +18°], filling the resulting empty pixels with the value of the nearest pixel from the original image. (4) RandomZoom(0.05, fill_mode = "nearest") scales the image by a random amount in the range of [−5%, +5%], filling the resulting empty pixels with the value of the nearest pixel from the original image. (5) RandomContrast(0.2) adjusts the contrast of the image randomly by the formula (x − mean) * factor + mean, where the factor is in the range of [0.8, 1.2].
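The parameter values above translate into concrete ranges as shown in the following sketch. It reproduces only the arithmetic of the rotation and contrast settings; the helper names are hypothetical, the rotation factor is interpreted as a fraction of a full turn, and the clipping of pixel values to the valid range after the contrast adjustment is not modeled.

```python
def rotation_range_degrees(factor):
    """A rotation factor is a fraction of a full turn (360 degrees)."""
    return (-factor * 360.0, factor * 360.0)


def contrast_factor_range(factor):
    """RandomContrast(f) draws the contrast factor from [1 - f, 1 + f]."""
    return (1.0 - factor, 1.0 + factor)


def adjust_contrast(pixels, contrast):
    """Apply (x - mean) * contrast + mean to a flat list of pixel values."""
    mean = sum(pixels) / len(pixels)
    return [(x - mean) * contrast + mean for x in pixels]


print(rotation_range_degrees(0.05))
print(contrast_factor_range(0.2))
print(adjust_contrast([0.0, 100.0, 200.0], 1.2))
```

A contrast factor above 1 pushes pixel values away from the mean and a factor below 1 pulls them toward it, while the mean itself is unchanged.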

Figure 2: Visualization of the number of images for each bird species in the validation sample of the dataset.

Figure 4: (a) An example from the selected dataset and (b) the same image, but after applying the augmentation technique.

Figure 5: Example of images from the ImageNet dataset.

Figure 9: Model error matrix on the test set of data.
Figure 10: Demonstration of the operation of the model on specific images. In each of the nine panels, the predicted label matches the true label: GREEN MAGPIE, ASIAN DOLLARD BIRD, AMERICAN AVOCET, COMMON GRACKLE, BALD IBIS, GOLDEN EAGLE, VULTURINE GUINEAFOWL, FAIRY PENGUIN, and BLUE THROATED TOUCANET.

Precision: it measures what fraction of all images assigned by the model to a specific class really belong to this class. It is useful in situations where it is important to avoid erroneous assignments to a particular class:
$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}.$$

Table 3: Efficiency indicators of the model run on the test sample.

Table 4: Comparison of models from the relevant literature with the model proposed in this research paper for the number of bird species equal to 400.

Table 5: Comparison of our model with the models from [16] for the BIRDS 525 SPECIES dataset: precision, recall, and F1 score.
(2) High efficiency of the model for a large number of different bird species (525) that the model can classify: accuracy = 98.86%, precision = 0.99, recall = 0.99, and F1 score = 0.99.