Satellite and Scene Image Classification Based on Transfer Learning and Fine Tuning of ResNet50

Image classification has gained a lot of attention due to its application in different computer vision tasks such as remote sensing, scene analysis, surveillance, object detection, and image retrieval. The primary goal of image classification is to assign class labels to images according to their contents. Image classification and image analysis are important in remote sensing because they are used in applied domains in both military and civil fields. Earlier approaches to remote sensing image and scene analysis are based on low-level feature representations such as color- and texture-based features. Vector of Locally Aggregated Descriptors (VLAD) and orderless Bag-of-Features (BoF) representations are examples of mid-level approaches for remote sensing image classification. Recent work on remote sensing and scene classification focuses on the use of Convolutional Neural Networks (CNNs). Given the success of CNN models, in this research we fine-tune ResNet50 by using network surgery and the creation of a network head, along with the fine-tuning of hyperparameters. The learning rate is tuned by using a linear decay learning rate scheduler known as the piecewise scheduler. To tune the optimizer hyperparameters, Stochastic Gradient Descent with Momentum (SGDM) is used together with weight and bias learn rate factors. Experiments and analysis are conducted on five different datasets, that is, UC Merced Land Use Dataset (UCM), RSSCN (the remote sensing scene classification image dataset), SIRI-WHU, Corel-1K, and Corel-1.5K. The analysis and competitive results show that the proposed model classifies images more effectively and efficiently than the state-of-the-art research.


Introduction
Image classification and analysis is an active research area, and automatic image classification has many applications in computer vision domains such as pattern recognition, image retrieval, object recognition, remote sensing, face recognition, textile image analysis, automatic disease detection, geographic mapping, and video processing [1][2][3]. In any image classification model, the primary objective is to assign class labels to images. A group of images is used as training samples, and the classification model is learned from this training dataset. After training, the test dataset is passed to the trained model to predict the class labels of its images. On the basis of these predictions, images can be arranged in a semantic and meaningful order. Selecting discriminating and unique features is always beneficial, as it can enhance the performance of any classification system [4][5][6]. In remote sensing, image classification is more challenging because objects are rotated within a view and the background is usually more complex [7]. Satellites, unmanned aerial vehicles, and aerial systems are used to capture the image datasets on which remote sensing research is evaluated [7]. According to recent reviews [8,9], there are three main approaches to classifying digital images, based on (i) low-level feature representation [10], (ii) mid-level feature representation [11][12][13][14], and (iii) Convolutional Neural Networks (CNNs) [7]. Figure 1 shows a block diagram of a CNN, which consists of multiple hierarchical layers including feature map layers, classification layers, and fully connected layers. A CNN takes an input image, processes it, and classifies it under certain categories/class labels, for example, elephant, flower, cat, and dog.
In a deep CNN, the input image is passed through a series of layers: convolution layers with certain filters (kernels), pooling layers, fully connected layers, and finally classification layers. Typically, the first layer in a CNN is a convolution layer, which generates feature maps with the help of filters [15,16]. The filters used in convolution layers can perform operations such as edge detection, blurring, and sharpening. The feature maps generated by the convolution layers are passed to the sampling (pooling) layers, which reduce the size of the input to the subsequent layers. They help to reduce the number of parameters when the input image is large.
The size is reduced in such a way that important information is preserved while unnecessary information is omitted. Then, the feature maps are converted into vectors and passed to the fully connected layers. Finally, an activation function and a classification function assign the images to their respective categories. The CNN is trained with backpropagation, which makes the classification process more efficient [8]. Figure 2 shows the different levels of remote sensing image classification, which are (i) pixel level, (ii) object level, and (iii) scene level [8]. According to the literature [8,17], the early research models for remote sensing image classification are based on the pixel level or subpixel level. The reason is the low resolution of early satellite images: the capturing devices could not produce high-resolution images, so the available information took the form of small pixels [18,19]. Due to recent advances in imaging technology, the spatial resolution of remote sensing images is increasing, and it is possible to capture visuals in a more semantic way [8]. For this reason, focusing on the pixel level is of limited benefit in satellite image classification [8]. Blaschke and Strobl [20] concluded that, for remote sensing image classification, it is more beneficial to focus on object-level classification instead of pixel-level analysis. The authors suggested that object-level analysis of remote sensing images is more efficient and semantic than the previous approaches based on pixel-level analysis. Over the last two decades, significant research has been published on object-level classification of remote sensing images [18,19]. Later, as image-capturing devices advanced further, remote sensing images came to contain many object classes [8]. In this case, the pixel-level and object-level approaches may not be sufficient.
For this reason, images are classified in a global context, and the focus of research has shifted to scene-level remote sensing image classification. Scene-level classification is considered a significant approach for representing visual information as discriminating features [8]. In the last two decades, the computer vision research community has made extensive efforts to develop discriminating features such as Scale-Invariant Feature Transform (SIFT) [21], Speeded-up Robust Features (SURF) [22], Histogram of Oriented Gradients (HOG) [23], and Maximally Stable Extremal Regions (MSER) [24]. Bag-of-Features (BoF), Spatial Pyramid Matching (SPM), and Vector of Locally Aggregated Descriptors (VLAD) are examples of simple and efficient encoding models, and they have been used in various fields of remote sensing and scene classification [25,26]. Due to the recent increase in the size and number of training images, the use of CNN models and Graphics Processing Units (GPUs) is the current research trend. The concept presented by Hinton and Salakhutdinov, based on multilayered neural networks, provided a foundation for deep learning research [27].
Comprehensive literature reviews of remote sensing image classification and recent trends in deep learning models can be found in [8,17,28,29]. According to the literature, the most popular CNN architectures are AlexNet [30], VGG network [31], Residual Network (ResNet) [32], and GoogLeNet [33]. There are 8 layers in AlexNet [30], 19 layers in VGG network, and 22 layers in GoogLeNet [34]. ResNet50 is a ResNet with 50 layers and is inspired by the idea of building deeper networks that achieve higher classification accuracy on complex tasks [35]. Usually in neural networks, when we increase the number of layers, the classification accuracy begins to degrade; this problem is handled by residual training [35]. The main contributions of this research are as follows:
(i) We fine-tuned ResNet50 by using network surgery and the creation of a network head, along with the fine-tuning of hyperparameters.
(ii) The learning rate is tuned by using a linear decay learning rate scheduler known as the piecewise scheduler. To tune the optimizer hyperparameters, Stochastic Gradient Descent with Momentum (SGDM) is used together with weight and bias learn rate factors.
(iii) Experiments and analysis are conducted on five different datasets, that is, UC Merced Land Use Dataset (UCM), RSSCN (the remote sensing scene classification image dataset), SIRI-WHU, Corel-1K (1000 images), and Corel-1.5K (1500 images). The analysis and competitive results show that the proposed model classifies images more effectively and efficiently than the state-of-the-art research.
The remainder of the paper is organized as follows: Section 2 reviews the literature and discusses relevant research on remote sensing image classification, Section 3 presents the proposed fine-tuned ResNet50 and provides details of the ResNet50 parameters, Section 4 describes the image benchmarks used to evaluate this research, Section 5 presents results, experimental values, discussion, and comparisons, and Section 6 concludes the proposed research based on fine-tuned ResNet50.

Related Work
Content-based image analysis is widely used in various applied and real-time domains of computer vision [36,37]. Classification of images according to image contents, visual appearance, and human visual perception is considered an open research problem [38]. Remote sensing image classification approaches are broadly categorized into three groups based on the type and usage of visual clues, that is, approaches based on low-level visual features, approaches based on mid-level features, and high-level feature extraction approaches [11,39]. We have selected recent state-of-the-art approaches from these categories that report results on similar image benchmarks. The earlier research on remote sensing and scene classification is built on low-level visual features [40,41]. Khalid et al. [40] reduced the semantic gap and proposed an efficient feature vector-based image representation. A histogram-based approach is used to compute the feature vector of images. The authors extracted the autocorrelogram in RGB format, followed by moment extraction. The efficiency is enhanced by applying the Discrete Wavelet Transform (DWT) at multiple resolutions, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is used to compute the codebook. Different variants of Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Decision Tree (DT) are used to classify images, and the authors presented a comprehensive comparison across these classifiers. The DBSCAN-based approach is evaluated on three publicly available datasets, that is, Corel-1K, Corel-1.5K, and Corel-5K [40]. Raja et al. [41] proposed an approach for content-based image analysis based on feature extraction from color images. The region of interest in an image is computed with the help of first-order derivatives.
Because of their closeness to human visual perception, HSV (Hue, Saturation, Value) histograms are used to represent the color space. Neural networks (NN) are used for image classification/class label assignment, and the results are reported on the Corel-1K and Corel-5K image benchmarks [41]. Desai et al. [42] proposed an image representation based on the fusion of different features. The authors selected a combination of low-level visual features, which are DWT, Edge Histogram Descriptor (EHD), Sobel operator, Moment Invariant (MI), Histogram of Oriented Gradients (HOG), and Local Binary Pattern (LBP). Different combinations of low-level visual features are evaluated to identify the most reliable image representation. According to the published results [42], a combination of low-level features with SVM outperforms all other feature combinations. Shikha et al. [43] proposed a hybrid image representation in which low-level attributes of images are computed by using a combination of color, shape, and texture. The authors computed a hybrid feature vector (HFV) by integrating these three visual attributes. A feed-forward neural network known as the Extreme Learning Machine (ELM) is trained with the HFV as input. To enhance the performance of the system, Relevance Feedback (RF) is applied to the ELM. The performance of the proposed system is evaluated on the Corel-1K, Corel-5K, Corel-10K, and GHIM-10 image benchmarks. Aslam et al. [14] proposed a late fusion of mid-level features based on the BoF model. According to the authors, late fusion of mid-level image representations can enhance the performance of an image classification model. In this research [14], the late fusion of Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG) is proposed by using the BoF representation model.
A Support Vector Machine (SVM) is applied to classify the histograms created by the late fusion of the two mid-level features. The proposed late fusion is evaluated on the Corel-1K and Corel-1.5K image benchmarks. Yu et al. [44] proposed the High-order Distance-based Multiview Stochastic Learning (HD-MSL) approach for classification. According to the authors, HD-MSL is based on feature combination, and the labeling information is computed by applying a probabilistic framework. Spatial Pyramid Matching (SPM) and the BoF model are used in various mid-level image categorization approaches. Zafar et al. [12] stated that SPM can only capture the absolute spatial distribution of visual words and is not robust to image transformations such as translation, flipping, and rotation. The discriminating power of SPM degrades if images are not well aligned. For this reason, Zafar et al. [12] proposed an image representation that computes relative spatial information based on the histogram of the Bag of Visual Words (BoVW) model. To achieve this, the authors explored the global relationship of identical visual words with the image centroid. Five image benchmarks are used for the evaluation of this research [12]. Ali et al. [11] stated that the classification accuracy of orderless BoF-based histograms suffers from the unavailability of image spatial clues.
Approaches that split images into subblocks to capture spatial clues cannot handle rotations. In remote sensing image classification, these spatial clues can increase the learning ability and classification accuracy of the trained model [11]. The authors of [11] proposed a rotation-invariant feature vector-based image representation that computes spatial clues with the help of histograms of orthogonal vectors. The results are computed on three publicly available satellite image benchmarks (SIRI-WHU, RSSCN, and AID) [11]. Figure 3 shows an example of image classification based on a CNN model. Fine-tuning is used with transfer learning to adjust the parameters of a pretrained CNN model by using a new dataset with a different number of classes.
This process is beneficial because training is done with a small learning rate and a reduced number of training epochs [7,45]. According to Petrovska et al. [7], the recent focus of image classification research is on the use of pretrained CNNs. The authors of [7] used a CNN for feature extraction and then performed training on the extracted features. They implemented transfer learning by fine-tuning pretrained CNNs. A Support Vector Machine (SVM) with Radial Basis Function (RBF) kernels is used for image classification. A linear decay learning rate scheduler and cyclical learning rates are used to tune the hyperparameters of the network, and label smoothing regularization is used to avoid overfitting. Shafaey et al. [46] explored the performance of deep learning models for remote sensing image classification. A comprehensive review is presented that considers deep learning models such as AlexNet, VGGNet, GoogLeNet, Inception-V3, and ResNet101. Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), Naïve Bayes (NB), and SVM are used for predicting the class labels, and the results are compared with the above-mentioned deep learning models. A detailed quantitative comparison of results is presented on seven publicly available datasets [46]. In other research, Zhao et al. [47] stated that the Residual Dense Network (RDN) has greater learning ability because it can utilize the information available in its convolutional layers. The authors designed an RDN based on channel-spatial attention for the classification of remote sensing images. In the first step, multilayer convolution features are fused by using residual dense blocks and, in the next step, a channel-spatial attention module is applied to enhance the effectiveness of the features. To meet the training requirements, data augmentation is applied, and classification is done with a softmax classifier.
The research of Zhao et al. [47] is evaluated on the UCM and AID image benchmarks.

Proposed Method of Research
The proposed methodology aims to enhance image classification accuracy using a CNN model. Given its robust performance, we selected the Residual Network ResNet50 for evaluation. ResNet50 is the short form of Residual Network with 50 layers. When researchers started to follow the phrase "the deeper the better" with deep learning models, they encountered some problems. The theory that "the deeper the network is, the better its performance should be" was proved wrong when a deep network with 52 layers produced worse results than networks with 20-30 layers [32]. Several explanations have been proposed for this decrease in performance, and the most widely accepted one is vanishing gradients. When the network is too deep, the gradient value shrinks toward 0, so the weights are not updated and, as a result, no learning is performed. Figure 4 shows the phenomenon of vanishing gradients.
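The shrinking of gradients with depth can be illustrated numerically. The following is a minimal sketch, not the network used in this research: it assumes a hypothetical chain of sigmoid units with unit weights and zero pre-activations (both are illustrative assumptions), and it multiplies the local derivatives that backpropagation would accumulate on the way back to the first layer.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_gradient_scale(depth: int, weight: float = 1.0,
                            pre_activation: float = 0.0) -> float:
    """Factor by which a gradient is scaled after passing back through
    `depth` stacked sigmoid units (chain rule: product of local terms).

    `weight` and `pre_activation` are illustrative assumptions, identical
    for every layer to keep the example small.
    """
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_grad(pre_activation)
    return scale


# The factor collapses toward zero as the chain gets deeper.
for depth in (5, 20, 50):
    print(f"depth={depth:2d}  gradient scale={backprop_gradient_scale(depth):.3e}")
```

Even in this best case for the sigmoid (derivative 0.25 at every layer), the factor falls geometrically with depth, which is the effect that skip connections are meant to counter.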
Deep networks face many complications, including network optimization, degradation, and, most importantly, vanishing gradients. According to the literature, fine-tuning a pretrained CNN can increase classification accuracy in the respective domain [48,49]. ResNet50 is trained on ImageNet, which consists of almost 1.2 million images, and the learned features and weights are transferred to the next task using the same pretrained network. Fine-tuning adapts the network to a new task with a different number of classes and categories. Fewer epochs are needed to train a fine-tuned network than to train the model from scratch. The motivation behind the use of pretrained networks is to improve accuracy through "transfer learning." Transfer learning is a machine learning technique that allows information learned in one domain to be transferred to similar problems in a related domain. It is recommended to use a model developed and trained for one task as the starting point for a similar task [50]. Researchers have used diverse notations to describe the concepts of transfer learning. Domain and task are its two basic concepts, and they are defined mathematically here to make the picture clearer [51]. A domain $D$ consists of two parts, that is, a feature space $\mathcal{F}$ and a marginal distribution $P(F)$ [51].
Here, $F$ represents an instance set, which is defined as $F = \{x \mid x_i \in \mathcal{F},\ i = 1, \ldots, n\}$. A task $T$ comprises a decision function $t$ and a label space $L$; that is, $T = \{L, t\}$. A source domain $D_S$ related to a source task $T_S$ is observed through a number of instance-label pairs; that is, $D_S = \{(x, y) \mid x_j \in \mathcal{F}_S,\ y_j \in L_S,\ j = 1, \ldots, q_S\}$; an observation of the target domain usually comprises unlabeled instances and/or a limited number of labeled instances.
Here, we consider observation(s) related to $m_S \in \mathbb{N}^+$ source domain(s) and task(s), that is, $\{(D_{S_k}, T_{S_k}) \mid k = 1, \ldots, m_S\}$, and observation(s) corresponding to $m_T \in \mathbb{N}^+$ target domain(s) and task(s), that is, $\{(D_{T_k}, T_{T_k}) \mid k = 1, \ldots, m_T\}$.
The Deep Neural Network (DNN) ResNet50 is fine-tuned by performing "network surgery." In the process of network surgery, the final layers of the pretrained network are removed. The layers removed from the network are "fc1000," "fc1000_softmax," and "ClassificationLayer_fc1000." These layers are then replaced with new layers, which together establish a "network head." The network head is a combination of three layers: a fully connected layer with WeightLearnRateFactor set to 20 and BiasLearnRateFactor set to 20, a new softmax layer, and finally a new classification layer.
Mathematical Problems in Engineering
The learning rate is the step size, that is, the amount by which the weights are updated at each iteration during training. It is perhaps the most important hyperparameter to tune in a neural network, and it can be altered as needed to enhance the performance of the model. The learning curve function is expressed as [52]
$$c = a\chi^{b},$$
where $c$ represents the cumulative average time (or cost) per unit, $\chi$ is the cumulative number of units produced, $a$ is the time required to produce the first unit, and $b = \log(\text{learning rate})/\log 2$. In our model, the initial learning rate is set to 0.001, and a learning rate schedule modulates how the learning rate of the optimizer changes over time [53]. When training neural network models, it is advisable to reduce the learning rate as training progresses.
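The network surgery and the learn rate factors of the new head can be sketched as follows. This is only an illustration in plain Python, not the toolbox code used in this research: the layer list is a simplified stand-in for ResNet50, the names of the new layers are hypothetical, and the factor of 1 for the transferred layers is an assumption. The one behavior taken as given is that a learn rate factor multiplies the global learning rate for the corresponding layer parameters.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


def replace_head(layers: Sequence[str], old_head: Sequence[str],
                 new_head: Sequence[str]) -> List[str]:
    """Remove the final layers `old_head` and append `new_head` instead."""
    layers, old_head = list(layers), list(old_head)
    if layers[-len(old_head):] != old_head:
        raise ValueError("the network does not end with the expected head")
    return layers[:-len(old_head)] + list(new_head)


@dataclass
class LayerRates:
    """Per-layer multipliers applied to the global learning rate."""
    name: str
    weight_learn_rate_factor: float = 1.0  # assumption: 1 for transferred layers
    bias_learn_rate_factor: float = 1.0


def effective_rates(global_lr: float, layer: LayerRates) -> Dict[str, float]:
    """Learning rates actually used for the weights and biases of a layer."""
    return {
        "weights": global_lr * layer.weight_learn_rate_factor,
        "bias": global_lr * layer.bias_learn_rate_factor,
    }


# Simplified stand-in for the pretrained network (not the full layer graph).
pretrained = ["conv1", "residual_stages", "avg_pool",
              "fc1000", "fc1000_softmax", "ClassificationLayer_fc1000"]

# Hypothetical names for the three layers of the new network head.
fine_tuned = replace_head(
    pretrained,
    old_head=["fc1000", "fc1000_softmax", "ClassificationLayer_fc1000"],
    new_head=["new_fc", "new_softmax", "new_classification"],
)

new_fc = LayerRates("new_fc", weight_learn_rate_factor=20.0,
                    bias_learn_rate_factor=20.0)
backbone = LayerRates("conv1")

print(fine_tuned)
print("new head:", effective_rates(0.001, new_fc))
print("backbone:", effective_rates(0.001, backbone))
```

With a global learning rate of 0.001, the new fully connected layer is updated roughly 20 times faster than the transferred layers, so the randomly initialized head adapts quickly while the pretrained features change slowly.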
The learning rate is reduced using a predefined schedule; in our case, we used a piecewise learning rate schedule. As the number of epochs or iterations increases, the learning rate decreases according to this schedule. The decreasing learning rate is calculated as [54]
$$\eta_n = \frac{\eta_0}{1 + d\,n},$$
where $n$ is the iteration step, $\eta_n$ is the learning rate at the $n$th step, $\eta_0$ is the initial learning rate, and $d$ is the decay rate. As learning progresses, the rule reduces the learning rate by increasing the denominator. Since $n$ is initialized at zero, 1 is added to the denominator to prevent it from being zero. We used Stochastic Gradient Descent with Momentum (SGDM) as the optimizer. Momentum helps the gradient vectors accelerate in the direction in which they are supposed to move, which speeds up convergence. In SGDM [55], the momentum gained at the $t$th iteration for the $i$th parameter is denoted $m_{t,i}$, and the hyperparameter that controls the momentum is $\beta$. SGDM is an improved version of SGD with a better convergence rate. Figure 5 shows the proposed research methodology, while Figure 6 demonstrates the process of fine-tuning.
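The schedule and the optimizer described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `decayed_learning_rate` implements the decay rule with the 1 added to the denominator, `piecewise_learning_rate` shows one common step-wise reading of a "piecewise" schedule with hypothetical drop factor and drop period, and `sgdm_step` uses one common exponential-average form of the momentum update, which may differ from the exact formulation in [55]. The quadratic objective and all numeric values are illustrative only.

```python
from typing import List, Tuple


def decayed_learning_rate(initial_lr: float, decay: float, step: int) -> float:
    """Learning rate at iteration `step`: initial_lr / (1 + decay * step)."""
    return initial_lr / (1.0 + decay * step)


def piecewise_learning_rate(initial_lr: float, drop_factor: float,
                            drop_period: int, epoch: int) -> float:
    """Step-wise alternative: multiply the rate by `drop_factor` every
    `drop_period` epochs (both values are hypothetical here)."""
    return initial_lr * drop_factor ** (epoch // drop_period)


def sgdm_step(params: List[float], grads: List[float], momentum: List[float],
              lr: float, beta: float = 0.9) -> Tuple[List[float], List[float]]:
    """One SGDM update in an assumed, commonly used form:
    m_t = beta * m_{t-1} + (1 - beta) * g_t, then theta = theta - lr * m_t."""
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grads)]
    new_params = [p - lr * m for p, m in zip(params, new_momentum)]
    return new_params, new_momentum


def minimize_quadratic(steps: int = 200) -> float:
    """Minimize f(w) = w^2 with SGDM and a decaying learning rate."""
    w, m = [5.0], [0.0]
    for n in range(steps):
        lr = decayed_learning_rate(initial_lr=0.1, decay=0.01, step=n)
        grads = [2.0 * w[0]]  # derivative of w^2
        w, m = sgdm_step(w, grads, m, lr)
    return w[0]


final_w = minimize_quadratic()
print("learning rate at steps 0, 10, 100:",
      [decayed_learning_rate(0.001, 0.01, n) for n in (0, 10, 100)])
print("final w:", final_w)
```

The decay rule lowers the rate smoothly at every iteration, whereas the step-wise variant keeps it constant between drops; which of the two matches a given training setup depends on the scheduler options actually chosen.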
The Residual Network (ResNet) solved the problems associated with deep networks by adding a new neural network building block called the Residual Block. The idea is to make the identity function easy for the network to represent, so that the output of the function equals its input. The following equation represents the identity function, which is of prime importance in solving the problem of deep architectures [32]:
$$f(x) = x.$$
By adding the input of the first layer of a block to the output of its last layer, the block is expected to learn and predict at least what it was learning before the input was added:
$$y = \mathcal{F}(x) + x.$$
The above equations are important, and they formulate the concepts of "skip connection" and identity mapping. Identity mapping is a simple concept and has no parameters. Its main function is to add the output of an earlier layer to the output of a later layer. Figure 7 shows the architecture of ResNet50 with all the layers. When $x$ and $\mathcal{F}(x)$ have the same dimensions, the process follows the same equations; however, sometimes the dimensions of $\mathcal{F}(x)$ and $x$ are not the same. In that case, a multiplication factor $W_s$ is introduced to match the dimensions across the shortcut or skip connection. The sum of the projected $x$ and $\mathcal{F}(x)$ then becomes the input of the next layer, as given by the following equation:
$$y = \mathcal{F}(x) + W_s x.$$
This equation is used when $\mathcal{F}(x)$ and $x$ have different dimensions. $W_s$ adds extra parameters to the model, which helps to avoid the dimension mismatch. With the help of ResNet, gradients can flow back to the initial layers through the skip connections without passing through all the layers. In the ResNet50 architecture, there are different groups of identical layers, and each group is distinguished by a different color in Figure 7. The curved lines represent the skip connections or identity mappings through which the input of a previous layer is passed to later layers. These skip connections are the key feature that helps ResNet to overcome the problems of degradation and vanishing gradients. The figure illustrates that the first layer is a convolution layer of size 7 × 7 with 64 kernels, followed by a 3 × 3 max pooling layer. Next come the blocks of identical layers, distinguished by different colors.
ResNet50 has 23.521 M parameters overall. The problem of mismatched input dimensions is handled by introducing two kinds of shortcut: the identity shortcut and the projection shortcut. The identity shortcut simply passes the input directly to the addition operator. The projection shortcut performs a convolution operation to make sure that the inputs to the addition operation are of the same size.
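The two kinds of shortcut can be sketched on plain vectors. This is a simplified illustration rather than ResNet50 itself: the residual branch is passed in as an arbitrary function, the projection is an ordinary matrix standing in for the convolution used in the real architecture, and all names and values are hypothetical. The ReLU after the addition follows the usual residual block design.

```python
from typing import Callable, List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product, used here as a stand-in for a projection."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def residual_block(x: Vector,
                   residual_fn: Callable[[Vector], Vector],
                   projection: Optional[Matrix] = None) -> Vector:
    """Compute relu(F(x) + shortcut(x)).

    Identity shortcut:   shortcut(x) = x        (projection is None)
    Projection shortcut: shortcut(x) = W_s x    (projection is W_s)
    """
    fx = residual_fn(x)
    shortcut = x if projection is None else matvec(projection, x)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and residual branch must have equal size")
    return relu([a + b for a, b in zip(fx, shortcut)])


x = [1.0, 2.0]

# Identity shortcut: if the residual branch outputs zeros, the block
# simply passes its (non-negative) input through unchanged.
same_size = residual_block(x, lambda v: [0.0] * len(v))

# Projection shortcut: the residual branch changes the size from 2 to 3,
# so the input is projected by a hypothetical 3 x 2 matrix W_s first.
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projected = residual_block(x, lambda v: [0.0, 0.0, 0.0], projection=w_s)

print(same_size)
print(projected)
```

Because the block only has to learn the difference from its input, a residual branch that outputs zeros already reproduces the identity mapping, which is what makes very deep stacks trainable.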
To increase the efficiency and competence of the model, the process of fine-tuning is performed. This is a critical process in which small modifications are made under careful observation to obtain better accuracy and optimization. The changes made for fine-tuning strongly affect the training process. We repeated the fine-tuning process many times to increase the accuracy of our model. Table 1 lists the parameters that affected the accuracy and performance of our model.

Dataset Description
To analyze the effectiveness of the implemented technique, diverse image classification benchmarks that are widely used in the literature have been utilized. Table 2 summarizes the total number of classes, the number of images per class, the total number of images in each benchmark, the image spatial resolution, and the image dimensions:
(i) RSSCN: the remote sensing scene classification dataset [59] comprises images gathered from Google Earth Engine and covers widespread areas. RSSCN consists of 7 classes of typical scene images with a size of 400 × 400 pixels. Figure 8 shows randomly selected samples of those classes and areas. Further description of this image benchmark can be found in [59].
(ii) SIRI-WHU: details such as image size, total number of images, images per class, and date of creation can be found in [56]. The images have a spatial resolution of 2 m and a size of 200 × 200 pixels. Figure 9 shows randomly selected images taken from each class of the SIRI-WHU dataset.
(iii) UC Merced Land Use Dataset: details such as image size, total number of images, images per class, and date of creation can be found in [57]. There are a total of 21 distinct scene categories with 100 images per class and dimensions of 256 × 256 pixels. Figure 10 shows sample images from this dataset.
(iv) Corel-1K: a subset of the Corel image dataset containing 1000 images.
(v) Corel-1.5K: a subset of the Corel image dataset [58]. The dataset comprises 1500 images organized into 15 semantic categories. Figure 12 shows randomly selected samples from each class of the dataset.

Performance Evaluation
All the experiments have been performed on an HP-ENVY-x360 with an Intel Core-i7-7500U CPU, 2.7 GHz 2.9 GHz, 16 GB RAM, 64-bit Windows 10 OS, and a 256 GB SSD as primary storage for the OS; a training : testing ratio of 70 : 30 is used for all experiments. This section provides details of the evaluation metrics used and presents a comprehensive discussion of the results. The most widely used metric for evaluating classification performance is the classification accuracy (A), defined as the number of correctly classified instances (images) divided by the total number of instances (images) in the dataset under consideration. It is mathematically expressed as
$$A = \frac{tp + tn}{tp + tn + fp + fn},$$
where $tp$ denotes true positives, $tn$ denotes true negatives, $fp$ denotes false positives, and $fn$ denotes false negatives.
Precision (P) and recall (R) are very commonly used for the performance assessment of image classification systems. Precision is the ratio of correctly classified images to the total number of classified images:
$$P = \frac{tp}{tp + fp}.$$
Here, $tp$ represents the correctly classified images and $fp$ represents the misclassified images, also known as false positives.
Recall is the ratio of correctly classified images to the total number of related images present in the database. The mathematical form of recall is
$$R = \frac{tp}{tp + fn}.$$
Here, $fn$ denotes false negatives, that is, the images that belong to the class under consideration but were misclassified by the classifier.
The F-score is the harmonic mean of precision and recall; a higher value indicates better predictive power of the system. Precision or recall alone is not adequate to evaluate the performance of a system. The F-score is expressed mathematically as
$$F = \frac{2PR}{P + R}.$$
Here, $P$ and $R$ represent precision and recall, respectively. The F-score is used to compare performance in scenarios where one approach has a higher precision but a lower recall than the approach it is compared with.
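The four metrics defined above can be computed directly from the counts of true and false positives and negatives. The following minimal sketch uses hypothetical counts chosen only for illustration; returning 0 when a denominator is zero is a convention assumed here, not something stated in this research.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correctly classified instances divided by all instances."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp: int, fp: int) -> float:
    """tp / (tp + fp); 0.0 by convention if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """tp / (tp + fn); 0.0 by convention if there are no positive instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one class of a 100-image test set.
tp, tn, fp, fn = 8, 82, 2, 8

p = precision(tp, fp)
r = recall(tp, fn)
print(f"accuracy={accuracy(tp, tn, fp, fn):.3f}  precision={p:.3f}  "
      f"recall={r:.3f}  F-score={f_score(p, r):.3f}")
```

In this example the precision is high while the recall is low, and the F-score falls between the two but closer to the lower value, which is why it is useful when the two metrics disagree.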

Results for RSSCN Image Benchmark.
The classification accuracy and performance of the proposed approach in comparison with the state-of-the-art research are shown in Table 3. Here, the proposed research based on fine-tuned ResNet50 outperforms the approaches based on mid-level handcrafted features, that is, RGSIR [12] and POVH [11], by 10.56% and 7.93%, respectively. Table 3 also shows a quantitative comparison of the proposed fine-tuned ResNet50 with methods based on deep learning architectures. The proposed research achieves the highest classification accuracy among the methods based on deep learning models, that is, AlexNet, GoogLeNet, Inception-V3, VGG-VD-16, and CaffeNet, outperforming these methods by 6.4%, 6.16%, 5%, 4.82%, and 3.75%, respectively. Figure 13 shows the precision, recall, and F-score for the RSSCN image dataset using the proposed research. The F-score is important because it balances the two metrics when either the precision or the recall is very low. The higher the F-score, the better the results, with 0 being the worst possible value and 1 being the best. A good F-score indicates good precision and recall values. The average precision, recall, and F-score for the RSSCN image benchmark are 92.74%, 92.84%, and 92.76%, respectively. Figure 14 shows the confusion matrix for the RSSCN image benchmark. The confusion matrix summarizes the performance of a classification algorithm and provides insight into how correct the predictions were and how they compare with the actual values. On the confusion matrix plot, the rows correspond to the true class and the columns correspond to the predicted class. The diagonal values correspond to correctly classified observations. The off-diagonal values indicate the incorrectly classified observations.
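The layout of the confusion matrix described above (rows for the true class, columns for the predicted class) can be made concrete with a small sketch. The class names and labels below are toy values chosen for illustration and are not taken from the RSSCN results.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(true_labels: Sequence[str], predicted_labels: Sequence[str],
                     classes: Sequence[str]) -> List[List[int]]:
    """matrix[i][j] counts images of true class i predicted as class j."""
    index = {name: k for k, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def per_class_precision_recall(matrix: List[List[int]]) -> List[Tuple[float, float]]:
    """Precision uses the column sum, recall uses the row sum of each class."""
    n = len(matrix)
    results = []
    for k in range(n):
        correct = matrix[k][k]                             # diagonal entry
        predicted_as_k = sum(matrix[r][k] for r in range(n))  # column sum
        truly_k = sum(matrix[k])                           # row sum
        p = correct / predicted_as_k if predicted_as_k else 0.0
        r = correct / truly_k if truly_k else 0.0
        results.append((p, r))
    return results


# Toy labels for three hypothetical scene classes.
classes = ["forest", "river", "field"]
truth = ["forest", "forest", "river", "river", "field", "field"]
predicted = ["forest", "river", "river", "river", "field", "forest"]

cm = confusion_matrix(truth, predicted, classes)
overall_accuracy = sum(cm[k][k] for k in range(len(classes))) / len(truth)

for name, row in zip(classes, cm):
    print(f"{name:>6}: {row}")
print("overall accuracy:", overall_accuracy)
print("per-class (precision, recall):", per_class_precision_recall(cm))
```

Averaging the per-class values gives the kind of average precision and recall reported for each benchmark, and large off-diagonal entries point to the pairs of classes that the model confuses most often.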

Results for SIRI-WHU Image Benchmark.
The experimental results for the SIRI-WHU image dataset are presented in Table 4. The overall classification accuracy of the proposed research is higher than that of the research selected for comparison. POVH [11] uses mid-level features and captures spatial attributes, which are considered very important for the classification of satellite imagery. The proposed research based on high-level features outperforms POVH by 13.89%. The proposed research is further compared against deep learning models. The proposed research based on ResNet50 surpasses the state-of-the-art deep learning models VGGNet, Inception-V3, GoogLeNet, and AlexNet by 7.43%, 5.03%, 4.73%, and 3.83%, respectively. Table 5 shows the precision, recall, and F-score for each class of the SIRI-WHU image benchmark. The average precision, recall, and F-score for the SIRI-WHU image dataset are 94.03%, 94.19%, and 94.02%, respectively. Figure 15 shows the confusion matrix for the SIRI-WHU image dataset.
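The margins quoted above are absolute differences, in percentage points, between the reported accuracies. The short Python check below recomputes them from the accuracy values listed in Table 4; the helper name `margins` is illustrative.

```python
# Accuracy values (%) as reported for the SIRI-WHU benchmark in Table 4.
reported = {
    "POVH": 80.14,
    "VGGNet": 86.6,
    "Inception-V3": 89.0,
    "GoogLeNet": 89.3,
    "AlexNet": 90.2,
}
proposed = 94.03  # fine-tuned ResNet50


def margins(proposed_acc, baselines):
    """Absolute accuracy gap (percentage points) to each baseline."""
    return {name: round(proposed_acc - acc, 2) for name, acc in baselines.items()}


gaps = margins(proposed, reported)
for name, gap in gaps.items():
    print(f"{name}: +{gap:.2f} percentage points")
```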

Results for UCM Image Benchmark.
This subsection discusses the results for the UCM image benchmark. Table 6 presents a comparison of the proposed fine-tuned ResNet50 with recently published research and deep learning models. The proposed approach based on ResNet50 achieves the highest classification accuracy among the compared methods. In [46], the authors used the Inception-V3 deep learning model, and their reported accuracy is 6.68% lower than that of the proposed research. The authors in [60] proposed an approach based on the fusion of low-level features with high-level ResNet features and used SVM as the classifier. The proposed approach achieves 3.97% higher classification accuracy than this feature fusion-based approach [60]. The proposed research outperforms AlexNet, GoogLeNet, CaffeNet, and VGG-VD-16 by 3.58%, 3.47%, 2.76%, and 2.57%, respectively. Table 7 shows the precision, recall, and F-score for each class of the UCM image benchmark. The average precision, recall, and F-score for the UCM image dataset are 97.78%, 97.83%, and 97.77%, respectively. Figure 16 shows the confusion matrix for the UCM image dataset. Most of the classes are correctly classified, and the major confusion is observed between the classes storage tanks and buildings, and between medium residential and dense residential. This is because the classes medium residential and dense residential overlap and differ in the density of structures.
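Confused class pairs such as these can be read off a confusion matrix programmatically by ranking its off-diagonal cells. The sketch below assumes rows are true classes and columns are predicted classes; the function name `top_confusions` and the counts are hypothetical and do not reproduce Figure 16.

```python
def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as
    (count, true_label, predicted_label), largest first."""
    cells = [
        (cm[i][j], labels[i], labels[j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    return sorted(cells, key=lambda c: c[0], reverse=True)[:k]


# Hypothetical counts for four UCM class names (illustration only).
labels = ["buildings", "dense residential", "medium residential", "storage tanks"]
cm = [
    [18, 1, 0, 1],
    [0, 17, 3, 0],
    [0, 2, 18, 0],
    [2, 0, 0, 18],
]
for count, true_label, predicted in top_confusions(cm, labels):
    print(f"{true_label} -> {predicted}: {count}")
```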

Results for Corel-1K Image Dataset.
Corel-1K image benchmark is the third dataset used for the experimentation in this research. Table 8 presents a comparison of the proposed research with the state-of-the-art research. The proposed research provides the highest accuracy and outperforms the state-of-the-art approaches based on mid-level and high-level features. In [43], a hybrid feature vector is created by integrating three visual attributes, that is, color, texture, and shape, and the experimental evaluation in [43] shows that this technique outperforms many related approaches based on various hybrid systems. The proposed research achieves the highest accuracy among the compared methods, outperforming Li et al. [61], Aslam et al. [14], SCNN-ELM [61], MKSVM-MIL [62], Raja et al. [41], Desai et al. [42], Yu et al. [44], and Shikha et al. [43] by 26.16%, 15.74%, 12.68%, 11.8%, 10.34%, 8.8%, 1.02%, and 0.5%, respectively. Table 9 shows the classwise performance for the Corel-1K image benchmark in terms of precision, recall, and F-score. The average precision, recall, and F-score values for the Corel-1K image benchmark are 97%, 97%, and 96.99%, respectively, which demonstrate the good prediction performance of the proposed research. Figure 17 shows the confusion matrix computed for the Corel-1K image benchmark. All classes are correctly classified except for African, Beach, and Mountain. The major confusion exists between the categories African and Beach, since similar objects can be observed in both classes.

Table 3: Classification accuracy comparison for the RSSCN image benchmark.

| Name of algorithm/model | Classification accuracy (%) |
| --- | --- |
| RGSIR [12] | 81.44 |
| POVH [11] | 84.07 |
| AlexNet [46] | 85.6 |
| GoogLeNet [39] | 85.84 |
| Inception-V3 [46] | 87 |
| VGG-VD-16 [39] | 87.18 |
| CaffeNet [39] | 88.25 |
| ResNet50 | 92 |
Table 4: Classification accuracy comparison for the SIRI-WHU image benchmark.

| Name of algorithm/model | Classification accuracy (%) |
| --- | --- |
| POVH [11] | 80.14 |
| VGGNet [46] | 86.6 |
| Inception-V3 [46] | 89 |
| GoogLeNet [46] | 89.3 |
| AlexNet [46] | 90.2 |
| ResNet50 | 94.03 |

[14] and outperforms [40] by 0.66%. Hence, it can be concluded that the proposed research based on ResNet50 provides better performance for scene classification than the related state-of-the-art research. Table 11 provides a classwise comparison of precision, recall, and F-score for the Corel-1.5K image benchmark. The average precision, recall, and F-score for the Corel-1.5K image dataset are 99.56%, 99.78%, and 99.66%, respectively. High precision indicates few false positives, and high recall indicates few false negatives. A good F-score is therefore indicative of few false positives and few false negatives, as well as of the capability of the model to correctly identify instances. An F-score of 1 is considered perfect, while an F-score of 0 indicates that the model fails completely. Figure 18 shows the confusion matrix for the Corel-1.5K image benchmark. Almost all classes are correctly classified, with only one misclassified instance in each of the categories Africa and Model.

Time Performance Analysis.
Besides classification accuracy, the time performance of the proposed system is an important parameter for determining its efficiency. Here, the time analysis is performed during testing and is based on the testing time of the complete proposed model. Figure 19 shows the time comparison for all the image datasets used for experimentation, reporting time per image, time per class, and time for the entire image dataset. From Figure 19, it can be deduced that the time required for testing the model increases as the number of images increases or as the data become more complicated. Hence, it can be concluded that the testing time grows with the size of the image dataset. Table 12 shows the time comparison of the proposed approach with the state-of-the-art research in terms of time per image for classification. The proposed approach is computationally efficient as compared to the state-of-the-art research.

Table 6: Classification accuracy comparison for the UCM image benchmark.

| Name of algorithm/model | Classification accuracy (%) |
| --- | --- |
| Inception-V3 [46] | 91.1 |
| Feature RCG SVM [60] | 93.81 |
| AlexNet [46] | 94.2 |
| GoogLeNet [39] | 94.31 |
| CaffeNet [39] | 95.02 |
| VGG-VD-16 [39] | 95.21 |
| ResNet50 | 97.78 |

Figure 16: Confusion matrix for UCM image dataset.

Table 8: Classification accuracy comparison for the Corel-1K image benchmark.

| Name of algorithm/model | Classification accuracy (%) |
| --- | --- |
| Li et al. [61] | 70.84 |
| Aslam et al. [14] | 81.26 |
| SCNN-ELM [61] | 84.32 |
| MKSVM-MIL [62] | 85.2 |
| Raja et al. [41] | 86.66 |
| Desai et al. [42] | 88.2 |
| Yu et al. [44] | 95.98 |
| Shikha et al. [43] | 96.5 |
| ResNet50 | 97 |

Table 10: Classification accuracy comparison for the Corel-1.5K image benchmark.

| Name of algorithm/model | Classification accuracy (%) |
| --- | --- |
| Aslam et al. [14] | 66.36 |
| Aslam et al. [14] | 71.69 |
| Aslam et al. [14] | 81.15 |
| Khalid et al. [40] | 98.9 |
| ResNet50 | 99.56 |
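The timing breakdown used in this subsection (time per image, time per class, and time for the entire dataset) can be measured with a small helper such as the Python sketch below. The function `time_inference`, the stand-in `dummy_classifier`, and the definition of time per class as total time divided by the number of classes are all assumptions made for illustration; the source does not specify how these quantities were computed.

```python
import time


def time_inference(classify, images, num_classes):
    """Time a classifier over a test set and report the total time,
    the mean time per image, and the total time divided by class count."""
    if not images or num_classes <= 0:
        raise ValueError("need at least one image and one class")
    start = time.perf_counter()
    for image in images:
        classify(image)
    total = time.perf_counter() - start
    return {
        "total_s": total,
        "per_image_s": total / len(images),
        "per_class_s": total / num_classes,  # assumed definition
    }


def dummy_classifier(image):
    """Hypothetical stand-in for the trained model's predict call."""
    return sum(image) % 2


# Synthetic "images": 1000 short lists of integers, 10 assumed classes.
images = [[i, i + 1, i + 2] for i in range(1000)]
report = time_inference(dummy_classifier, images, num_classes=10)
print(report)
```

Because the total is accumulated over the whole test set, it grows with the number of images, which matches the trend described for Figure 19.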

Conclusion
Remote sensing, distant perceiving, image classification, and categorization are considered challenging research areas in the field of computer vision. The recent focus of research in this domain is to explore novel deep learning models that can enhance classification accuracy. In this research article, we fine-tuned ResNet50 by using network surgery and the creation of a network head along with the fine-tuning of hyperparameters. The learning rate hyperparameter was tuned by using a linear decay learning rate scheduler known as the piecewise scheduler. To tune the optimizer hyperparameter, Stochastic Gradient Descent with Momentum (SGDM) was used together with weight and bias learn rate factors. Experiments and analysis were conducted on five different datasets, that is, UC Merced Land Use Dataset (UCM), RSSCN (the remote sensing scene classification image dataset), SIRI-WHU, Corel-1K, and Corel-1.5K. The analysis and competitive results show that our proposed image classification-based model can classify the images in a more effective and efficient manner as compared to the state-of-the-art research. The overall performance of any deep learning model depends on the availability of training samples. In the future, we aim to explore an efficient ResNet50 for cases in which fewer training samples are available. Most deep network models are trained on natural-image datasets such as ImageNet, while remote sensing images differ from natural images because they are acquired from different remote sensors. Exploring transfer learning with a combination of natural images and remote sensing images is another possible future research direction.
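To make the optimizer settings summarized above more concrete, the following pure-Python sketch shows a piecewise learning rate schedule and a single SGDM update with separate learn rate factors for weights and biases. It assumes the common piecewise convention in which the rate is multiplied by a fixed drop factor every fixed number of epochs. The function names, the drop factor, the drop period, the momentum, and the factor values are illustrative assumptions and are not the authors' settings.

```python
def piecewise_lr(initial_lr, epoch, drop_factor=0.1, drop_period=10):
    """Learning rate for a 0-based epoch: multiplied by drop_factor
    once every drop_period epochs (assumed piecewise convention)."""
    return initial_lr * drop_factor ** (epoch // drop_period)


def sgdm_step(param, grad, velocity, lr, momentum=0.9, lr_factor=1.0):
    """One SGD-with-momentum update for a scalar parameter.
    lr_factor scales the global rate for this parameter, e.g. a larger
    factor for the weights and biases of a newly created network head."""
    velocity = momentum * velocity - lr * lr_factor * grad
    return param + velocity, velocity


# Schedule at a few epochs (illustrative initial rate of 0.01).
for epoch in (0, 9, 10, 25):
    print(epoch, piecewise_lr(0.01, epoch))

# One update of a hypothetical head weight and bias with different factors.
lr = piecewise_lr(0.01, epoch=0)
w, vw = sgdm_step(param=0.5, grad=0.2, velocity=0.0, lr=lr, lr_factor=10.0)
b, vb = sgdm_step(param=0.0, grad=0.2, velocity=0.0, lr=lr, lr_factor=20.0)
print(w, b)
```

Giving the new head a larger rate factor than the pretrained layers lets it adapt quickly while the transferred features change slowly, which is the usual motivation for per-layer rate factors in fine-tuning.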

Data Availability
The details about the data used are included within this manuscript.

Conflicts of Interest
The authors declare that they have no conflicts of interest.