Remote Sensing Image Classification: A Comprehensive Review and Applications

Remote sensing is mainly used to investigate sites of dams, bridges, and pipelines, to locate construction materials, and to provide detailed geographic information. In remote sensing image analysis, images captured by satellites and drones are used to observe the surface of the Earth. The main aim of any image classification-based system is to assign semantic labels to captured images so that, using these labels, the images can be arranged in a semantic order. The semantic arrangement of images is used in various domains of digital image processing and computer vision such as remote sensing, image retrieval, object recognition, image annotation, scene analysis, content-based image analysis, and video analysis. The earlier approaches for remote sensing image analysis are based on low-level and mid-level feature extraction and representation. These techniques have shown good performance by using different feature combinations and machine learning approaches, but they relied on small-scale image datasets. Recent trends in remote sensing image analysis have shifted to the use of deep learning models. Various hybrid approaches of deep learning have shown much better results than the use of a single deep learning model. In this review article, a detailed overview of the past trends is presented, based on low-level and mid-level feature representation using traditional machine learning concepts. A summary of publicly available image benchmarks for remote sensing image analysis is also presented. A detailed summary is presented at the end of each section. An overview of the current trends in deep learning models is presented along with a detailed comparison of various hybrid approaches based on recent trends. The performance evaluation metrics are also discussed. This review article provides detailed knowledge of the existing trends in remote sensing image classification and possible future research directions.


Introduction
Deep learning and computer vision are used in various applications such as image classification, object detection in industrial production, medical image analysis, action recognition, and remote sensing [1][2][3][4]. Satellite images are considered the main source of geographic information [5], and there are many applications of satellite image analysis in the field of civil engineering such as design, construction, urban planning, and water resource management. The data obtained from satellite sources are huge and are growing exponentially; handling such large volumes of data requires efficient techniques for data extraction. Through image classification, this large number of satellite images can be arranged in a semantic order. Satellite image classification is a multilevel process that starts with extracting features from images and ends with classifying them into categories [6]. Image classification is a step-wise process that starts with designing a classification scheme for the desired images. Next, the images are preprocessed, which includes image clustering, image enhancement, scaling, and so on. In the third step, the desired areas of those images are selected and initial clusters are generated. The classification algorithm is then applied to the images, and corrective actions, also called postprocessing, are taken after the algorithm phase. The final phase is to assess the accuracy of the classification, as shown in Figure 1.
Recent research is focused on the use of mid-level features and deep learning models to build robust decision support systems for smart vehicles, the Internet of Things (IoT), and remote sensing images [7][8][9]. Remote sensing plays a significant role in acquiring geographical data on large scales, and efficient land use can be achieved through aerial images of the Earth [10]. Some classification techniques are supervised, while others are unsupervised. Similarly, with respect to parameters, there are parametric and non-parametric approaches; another type is fuzzy classification [11]; in addition, classification can be performed on a per-pixel or subpixel basis. The latest research in remote sensing image classification is moving towards hybrid approaches, where two or more techniques are combined to obtain better classification [3,12,13]. The most recent research is focused on scene-based classification. The whole remote sensing image classification process is divided into three basic groups: supervised learning, unsupervised learning, and deep learning approaches. Supervised learning techniques are further divided into distributed and statistical learning [14][15][16]. There are many types of distributed learning such as logistic regression, decision trees, support vector machine (SVM), and ensemble methods, whereas statistical learning techniques are further divided into parametric and non-parametric approaches. Similarly, different types of unsupervised learning techniques such as K-means clustering, spectral clustering, fuzzy C-means, and reinforcement learning are discussed in detail. The third group, deep learning approaches, is further divided into three categories: generative methods, hybrid methods, and discriminative methods. Deep belief network (DBN), network autoencoder, and deep Boltzmann machine (DBM) are discussed under generative methods, whereas deep neural network (DNN) and grey wolf optimization (GWO) are discussed under hybrid methods.
In discriminative methods, transfer learning, convolutional neural network (CNN), AlexNet, VGG, GoogLeNet, MobileNet, ResNet, artificial neural network (ANN), and so on are discussed, as shown in Figure 2.
There are basically three types of remote sensing image classification, namely pixel-based classification, object-based classification, and scene-based classification, and recent research is focused on scene-based classification [17,18]. Figure 3 shows that, owing to improved research, the spatial resolution of images has increased drastically [19].
There is no longer a need to classify remote sensing images on a per-pixel basis, and the research trend has shifted towards object-based classification of images. By objects of remotely sensed images, we mean semantics or scene units [20]. During the last two decades, processing the visual features of an image was a time-consuming and computationally expensive task that required a lot of effort and resources. According to the literature, scale invariant feature transform (SIFT), texture descriptors (TD), color histogram (CH), histogram of oriented gradients (HOG), and global image descriptor (GIST) [21] are hand-crafted features designed by human engineers. Later, improvements were made, and the improved Fisher kernel, spatial pyramid matching (SPM), and bag of visual words (BoVW) were introduced [22]. These encoding techniques were relatively more efficient than the earlier techniques [23].
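To make the idea of a hand-crafted descriptor concrete, the following minimal sketch computes a color histogram (CH) and compares two histograms by intersection. It is an illustration only: the function names, the choice of 4 bins per channel, and the representation of an image as a flat list of RGB tuples are assumptions made for this example and are not taken from the cited works.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def color_histogram(pixels: List[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram with bins_per_channel**3 bins."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    bin_width = 256 / bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Quantize each channel, then flatten the 3-D bin index into one index.
        ri = min(int(r / bin_width), bins_per_channel - 1)
        gi = min(int(g / bin_width), bins_per_channel - 1)
        bi = min(int(b / bin_width), bins_per_channel - 1)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity in [0, 1] between two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

A descriptor of this kind discards all spatial layout, which is one reason encodings such as SPM and BoVW were later built on top of local features.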
In simple words, supervised and unsupervised learning can be differentiated as follows: supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data. For unsupervised learning, principal component analysis (PCA), sparse coding, and K-means clustering were introduced later on. The benefit of these techniques is that they are able to learn features automatically. However, these unsupervised learning techniques did not perform well on larger datasets [24]. Owing to advances in deep learning techniques and parallel computing, remote sensing images can be classified more easily by initializing weights in the training layers so that the scene prediction in later layers becomes more accurate [25]. Many deep learning models exist in the literature, such as AlexNet, GoogLeNet, VGG, and ResNet [26]. AlexNet was proposed in 2012; it has 60M parameters and is 8 layers deep [27]. GoogLeNet was proposed in 2015; it has 4M parameters and 22 layers. It also comes under the category of spatial exploitation [28]. VGGNet was proposed in 2015; it comes in two variants, VGG16 and VGG19, which are 16 and 19 layers deep, respectively, and VGG16 has 138M parameters [26]. Later, ResNet was proposed with variants such as ResNet18, ResNet34, ResNet50, ResNet101, ResNet110, ResNet152, ResNet164, and ResNet1202; ResNet50 has 25.6M parameters [29]. The above-mentioned models come under the category of spatial exploitation. Comprehensive reviews relevant to remote sensing image classification have been published in recent years. A detailed review of multimodel remote sensing image classification is given in [30]. The remote sensing classification techniques proposed before 2016 were discussed in detail in [31]. In 2017, a detailed review of the process of remote sensing image classification was presented in [32].
The resources available for remote sensing research are also listed in [32]. In 2017, a detailed comparison of existing deep learning techniques for hyperspectral classification was given in [33], and a review of support vector machine (SVM) techniques relevant to remote sensing was presented in [34]. In [35], the AID dataset was proposed, together with a survey of remote sensing image classification methods published before 2017. A review of multiple remote sensing techniques, along with the NWPU-RESISC45 dataset, was presented in [36]. In recent years, comprehensive reviews of hyperspectral and spatial-spectral image analysis have been published [20,37]. For a detailed summary of deep learning in remote sensing applications, current challenges relevant to deep learning methods, benchmarks, and possible future research directions, readers are referred to the review articles [38][39][40]. The article is organized as follows. It begins with a basic introduction to remote sensing image classification. Section 2 is about machine learning. Section 3 contains a detailed description of CNN models and their applications. Section 4 deals with existing deep learning techniques. Section 5 discusses in detail the datasets commonly used for remote sensing image classification. Section 6 deals with unsupervised learning techniques. Section 7 is about optimization techniques. Sections 8, 9, and 10 are about feature fusion techniques. Section 11 deals with hybrid approaches. Section 12 is about performance evaluation criteria for classification. The last section presents the conclusion of this review.

Machine Learning
Machine learning (ML) is the field of computer science which incorporates both supervised and unsupervised learning techniques [41][42][43]. It covers both regression and classification problems [44]. In machine learning, a detailed dataset is constructed that covers as many of the system parameters as possible. ML is useful in scenarios where theoretical knowledge is not sufficient to make predictions [45,46]. It has a huge number of applications in many areas such as land use and land cover analysis [47], disaster management, atmospheric change, and many more [48]. ML is a subdivision of artificial intelligence (AI) [49]. ML basically designs algorithms that learn from data in order to make predictions. Many algorithms in the field of machine learning perform exceptionally well, such as support vector machine, Bayesian network, decision trees, ensemble methods, random forest, neural networks, genetic programming, and many more. ML has a huge impact on remote sensing and geosciences. It automatically extracts features from the data using statistical techniques [50,51]. Early remote sensing image classifiers were considered to be "shallow structures." To perform remote sensing classification, there exist different techniques such as decision trees, SVM, artificial neural network, bag of visual words, and many more [52][53][54].
Mathematical Problems in Engineering
Another important application of ML techniques is change detection. Images are captured through satellites or drones, and then ML techniques are applied to predict the behavior or change [55]. SVM and GA are combined to detect the change.
Supervised and unsupervised learning approaches are combined to capture the association between adjacent pixels of images. An SVM with a radial basis kernel is used, and its parameters, such as C and Ω, are optimized using a genetic algorithm (GA); this optimization increased the efficiency of the process. The authors performed experiments on the Mexico and Sardinia image datasets. The results were validated against existing results, and the proposed approach outperforms them [56]. In the early years of ML, high accuracy was achieved only on high spectral images [57]. To overcome this issue, a new 3-D approach was used in combination with spatial and spectral images. The experimentation was performed on the Pavia University (PU), Pavia Center (PC), and Kennedy Space Center (KSC) datasets, and the results show that the proposed methods achieve better accuracy with low computational cost [58]. ML has many applications in different fields such as speech recognition systems, search engines, and other AI-based applications like robotics [59]. Many ML techniques are available in the literature, such as K-means clustering and PCA for classification tasks and, for regression, SVM, decision trees, ANN, ensemble methods, random forest, and so on [60,61]. Remote sensing image classification can be performed using existing CNN methods, but they require high computational power and a large labeled dataset for better performance.
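The genetic-algorithm tuning of two SVM kernel parameters mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of [56]: the genome encodes the base-10 logarithms of C and of the kernel parameter, and `toy_fitness` is a hypothetical stand-in for the cross-validated accuracy that a real experiment would compute by training an SVM. The population size, mutation width, and search bounds are likewise arbitrary choices.

```python
import random
from typing import Callable, List, Tuple

Genome = Tuple[float, ...]  # (log10 of C, log10 of the kernel parameter)


def toy_fitness(genome: Genome) -> float:
    """Hypothetical stand-in for validation accuracy; it peaks at (1.0, -2.0)."""
    log_c, log_k = genome
    return -((log_c - 1.0) ** 2 + (log_k + 2.0) ** 2)


def genetic_search(
    fitness: Callable[[Genome], float],
    bounds: Tuple[float, float] = (-3.0, 3.0),
    pop_size: int = 30,
    generations: int = 60,
    mutation_sigma: float = 0.3,
    seed: int = 0,
) -> Tuple[Genome, List[float]]:
    """Return the best genome found and the best fitness per generation."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(v: float) -> float:
        return max(lo, min(hi, v))

    population: List[Genome] = [
        (rng.uniform(lo, hi), rng.uniform(lo, hi)) for _ in range(pop_size)
    ]
    history: List[float] = []
    for _ in range(generations):
        best = max(population, key=fitness)
        history.append(fitness(best))
        children: List[Genome] = [best]  # elitism: the best genome always survives
        while len(children) < pop_size:
            # Tournament selection: the fittest of three random individuals.
            p1 = max(rng.sample(population, 3), key=fitness)
            p2 = max(rng.sample(population, 3), key=fitness)
            alpha = rng.random()  # blend crossover weight
            child = tuple(
                clip(alpha * a + (1.0 - alpha) * b + rng.gauss(0.0, mutation_sigma))
                for a, b in zip(p1, p2)
            )
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history
```

Searching in log space is a common design choice for such parameters because useful values often span several orders of magnitude. In practice `toy_fitness` would be replaced by a function that trains and validates a classifier for the decoded parameter pair.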
There are freely available datasets, and pretrained networks can be used to obtain better accuracy.
Strategies to avoid overfitting, such as dropout, also play an important role. The training time of CNN models is quite long, but GPUs help to address this issue [62]. Remote sensing images captured from satellites are of great importance, but image clarity suffers when weather conditions are poor, which affects the feature selection part of the ML process and thus degrades performance [63]. The article described below fills this gap by using a specially designed toolbox. In the first step, gaps are filled on the basis of the spatial relationships between pixels, while the remaining gaps are filled in the second phase on the basis of the temporal dynamics of each pixel. The algorithm was tested on two datasets, Sentinel-3 SLSTR and Terra MODIS. Data were collected in different seasonal conditions. Also, the experimentation was performed on GNU GPL3 which is a public repository [64].
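The two-step idea of filling gaps first in space and then in time can be illustrated with a small sketch. It does not reproduce the toolbox of [64]; the neighbour-mean rule, the linear interpolation in time, and the use of `None` to mark cloud-obscured values are all assumptions chosen for simplicity.

```python
from typing import List, Optional

Grid = List[List[Optional[float]]]


def fill_spatial(grid: Grid) -> Grid:
    """Fill each missing cell with the mean of its valid 4-neighbours, if any."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] is not None:
                continue
            neighbours = [
                grid[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] is not None
            ]
            if neighbours:
                out[r][c] = sum(neighbours) / len(neighbours)
    return out


def fill_temporal(series: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate gaps in one pixel's time series.

    Gaps at either end copy the nearest valid observation.
    """
    valid = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    if not valid:
        return out
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out
```

A cell that has no valid neighbours in a given image stays missing after `fill_spatial`, which is the kind of remaining gap that the temporal step is meant to handle.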

Convolutional Neural Network
Convolutional neural networks (CNNs) are useful in many multimedia applications where images need to be classified without human intervention. In one study, four different deep learning models, AlexNet, VGG19, GoogLeNet, and ResNet50, were used for feature extraction. The experiments were performed on three datasets, SAT4, SAT6, and UCMD. The images for SAT4 and SAT6 were extracted from the NAIP dataset, which has around 330000 scene images from all over the US. SAT4 and SAT6 have 4 and 6 classes, respectively, with labels such as trees, grassland, barren land, building, road, and water, whereas the images for the UCMD dataset were extracted from a large dataset named USGS. UCMD has 21 classes. The training-to-testing ratio is 80:20 for the SAT datasets and 70:30 for UCMD [65,66]. Figure 4 shows the basic process of image classification for CNN. ResNet50 gives better accuracy on all three above-mentioned datasets. The accuracy is 98% on UCM, 95.8% on SAT4, and 94.1% on SAT6. Satellite image classification is a challenging task due to the variability of the images. Because of this, existing approaches are not feasible for object detection in satellite images. In another article, a new model, DeepSat V2, is proposed, which is basically an augmented version of CNN. The first phase is feature extraction, where 50 features are extracted and statistical approaches are then used for feature ranking to select the useful features. The network has 2 convolutional layers, each with a ReLU layer attached. After the convolutional layers, there is a max-pooling layer with a dropout layer at the end. This is followed by a feature concatenation layer and then by fully connected layers. The last layer is a softmax layer based on the cross-entropy loss function. The optimizer used in this model is Adadelta. All the experimentation was performed on the SAT4 and SAT6 datasets.
This proposed model achieved accuracies of 99.9% and 99.84% on SAT4 and SAT6, respectively [67].
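Several of the layer types named in the description above (ReLU, max pooling, softmax, and the cross-entropy loss) are simple enough to write out directly. The sketch below is a plain-Python illustration of these operations on small lists; it is not the DeepSat V2 implementation, and the function names and the fixed 2 × 2 pooling window are assumptions made for the example.

```python
import math
from typing import List

FeatureMap = List[List[float]]


def relu(feature_map: FeatureMap) -> FeatureMap:
    """Set negative activations to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map: FeatureMap) -> FeatureMap:
    """Non-overlapping 2 x 2 max pooling; a trailing odd row or column is dropped."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c],
                feature_map[r][c + 1],
                feature_map[r + 1][c],
                feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def softmax(logits: List[float]) -> List[float]:
    """Convert class scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Negative log-likelihood of the true class."""
    return -math.log(probs[true_class])
```

For a two-class problem with equal scores, `softmax` returns 0.5 for each class and the loss equals ln 2, which is the loss of an uninformed classifier.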

Deep Learning-Based Methods and Approaches
Satellite images have high importance in many fields of life. The article discussed here covers the available remote sensing datasets and the techniques used to classify satellite images. The existing image classification techniques can be divided into four categories: manual feature extraction, unsupervised feature extraction, supervised feature extraction, and object-based classification, as shown in Figure 5.
The dataset used for classification in the article is UCM land use, which has 21 classes and 2100 images. Experimentation was performed using AlexNet. About 10% of the images were used for training, and after eight iterations, the accuracy reached 94%. In a comparison of GoogLeNet and CaffeNet, GoogLeNet gives better accuracy, that is, 97%, on the UCM dataset, but AlexNet is almost 4 times faster. Deep learning methods perform better in image classification than other feature extraction techniques [68]. Another article is about useful methods for feature extraction using deep learning techniques. AlexNet, VGG19, GoogLeNet, and ResNet50 were used, and experimentation was performed on 3 different datasets: SAT4, SAT6, and UC Merced. The accuracy of multiple deep learning models on the UCM dataset is summarized in Table 1.
The performance of the proposed ResNet50 on SAT6 is better than that of previous models, whereas its accuracy on SAT4 is degraded. The classification accuracy of the proposed ResNet50 is 95.8% on SAT4, 94.1% on SAT6, and 98% on UCM [80]. In another article, a new CNN model known as deep convolutional neural network (DCNN) is proposed, which works in two phases. In the first phase, multiple filters are introduced to minimize variance, whereas in the second phase, the best suited hyperparameters are selected from the pool. Based on these parameters, a new DCNN model is built and experimentation is performed. The results are validated against the DeepSat model on the SAT4 and SAT6 datasets. Table 2 summarizes the accuracy of different CNN models on the RSSCN dataset. The classification accuracy of the DCNN is 98.408% on SAT4 and 96.037% on SAT6, which is better than the model used for validation [83]. In satellite image classification, scale selection is a very important task. Remote sensing image datasets are large, and it is very important to select relevant techniques for the selection process. In [84], an enhanced CNN technique is used and experimentation is performed on the WHU-RS, UC Merced, and Brazilian coffee scene datasets. Classification accuracies for all three datasets are presented. For the UCM dataset, the accuracy is better at stage 2, whereas for the WHU-RS dataset, the accuracies are measured after stage 1, stage 2, and stage 4 of image scaling. After scales 1 and 2, the accuracy improves, but after scales 3 and 4, the improvement in accuracy is very small. Land cover and land use form an important link between humans and nature; many research studies are available on one-class extraction, but there is a need to focus on multiclass classification.
In one such article, to overcome the issue of low-resolution loss, a new model, HR-net, is introduced. It was compared with Deep-Lab and U-Net. The proposed model performs better, with a test accuracy of 95.7%, a mean I/U value of 88.01%, and a kappa value of 94.55% [85]. Datasets are available for remote sensing, along with techniques used to classify satellite images. The existing image classification techniques can be divided into four categories: manual feature extraction, unsupervised feature extraction, supervised feature extraction, and object-based classification. The dataset used for classification in the article is UCM land use, which has 21 classes and 2100 images. Experimentation was performed using AlexNet. 10% of the images were used for training, and after eight iterations, the accuracy reached 94%. In a comparison of GoogLeNet and CaffeNet, GoogLeNet gives better accuracy, that is, 97%, on the UCM dataset, but AlexNet is almost 4 times faster. Deep learning methods perform better in image classification than other feature extraction techniques [86]. Xia et al. [76] used the GoogLeNet algorithm and reported a classification accuracy of 94.31% on the UCM image benchmark. Zhang et al. [77] performed scene classification using a gradient boosting random convolutional network framework and reported a classification accuracy of 94.53% on the UCM image benchmark. Zhong et al. [79] used large patch convolutional neural networks and reported a classification accuracy of 89.90% on the UCM image benchmark. The classification accuracies of different CNN models on the AID dataset are summarized in Table 3.
Using the same dataset, experimentation was performed on CaffeNet, and the reported accuracy is 95.31%. In the third run on the same dataset with a different algorithm, VGG-VD-16, the classification accuracy is 95.21% for the UCM dataset. In the second run, on the AID dataset, experimentation was performed using the GoogLeNet and Inception-V3 algorithms, and the reported accuracies are 86.39% and 93%, respectively [76]. Experimentation using ARCNet-VGG16, for scene classification with recurrent attention of VHR remote sensing images, reached an accuracy of up to 99.12% on the UCM dataset; the same experimentation on the AID dataset achieved an accuracy of 93.10% [19]. In [19], experimentation using the minimum sum coloring problem (MSCP) algorithm is reported, with a classification accuracy of 98.36% for the UCM dataset. Experimentation was again performed on the AID dataset using three algorithms, MSCP, DCNNS, and HW-CNN, with accuracies of 94.42%, 96.89%, and 96.98%, respectively. Zhu et al. [69] reported a classification accuracy of 99.76% on the UCM image benchmark. Lu et al. [73] used feature aggregation convolutional neural networks and reported a classification accuracy of 98.81% on the UCM image benchmark. Experimentation using the feature aggregation convolutional neural network (FACNN) algorithm achieved an accuracy of 99.05% on the UCM dataset [70]. Spatial frequency (SF-CNN) reached an accuracy of 99.05% on the UCM dataset. The same algorithm was used on the AID dataset and achieved an accuracy of 96.66%, whereas the FACNN algorithm on the AID dataset achieved a classification accuracy of 95.45% according to the same article [71].
In [71], a classification accuracy of 98.57% on the UCM dataset is reported for the robust space-frequency joint representation (RSFJR) algorithm. Another study reports a classification accuracy of 98.57% using the GBN algorithm on the UCM dataset [74]. In another study, the ADFF algorithm gave an accuracy of 97.53% on the UCM dataset; the same experimentation on the AID dataset achieved an accuracy of 94.75% [75]. Another study achieved an accuracy of 99.05% using the CNN-CapsNet algorithm on the UCM dataset. In another article, an accuracy of 93.81% was achieved using feature RCGSVM for the UCM dataset [22]. The AlexNet and Inception algorithms gave accuracies of 94.2% and 91.1%, respectively, on the UCM dataset. On the AID dataset, experimentation with VGG-VD-16 achieved an accuracy of 89.64% [78]. Using SCCOV on the AID dataset, an accuracy of 96.10% was achieved [89]. Using RSFJR on the AID dataset, an accuracy of 96.81% was achieved [71]. Using ResNet, another study reported an accuracy of 89.1%.
Table 2: Classification accuracy of RSSCN on different CNN models.
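The evaluation measures quoted in this section (overall accuracy, mean I/U, and the kappa value) can all be derived from a confusion matrix. The following sketch shows one way to compute them, assuming that rows index the true class and columns index the predicted class; the function names are illustrative and not tied to any cited implementation.

```python
from typing import List

Matrix = List[List[int]]  # cm[i][j]: samples of true class i predicted as class j


def overall_accuracy(cm: Matrix) -> float:
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm: Matrix) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    expected = sum(
        sum(cm[i]) * sum(cm[j][i] for j in range(n)) for i in range(n)
    ) / (total ** 2)
    return (observed - expected) / (1.0 - expected)


def mean_iou(cm: Matrix) -> float:
    """Mean intersection over union across classes that occur at least once."""
    n = len(cm)
    ious = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[j][i] for j in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

For the two-class matrix `[[40, 10], [5, 45]]`, the overall accuracy is 0.85, the chance agreement is 0.5, and kappa is therefore 0.7, which shows how kappa discounts agreement that a random labelling with the same class proportions would already achieve.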

Datasets
The details of different remote sensing datasets are described below.

5.1. SAT4 and SAT6.
The National Agriculture Imagery Program (NAIP) dataset was used to extract the images for these datasets. SAT4 consists of a total of 500,000 image patches, while SAT6 consists of 405,000 image patches, as shown in Figure 6.

Brazilian Coffee Scenes.
The dataset is taken from four counties, with images of 64 × 64 pixels. There are 600 images in each of 4 folds, while the fifth fold has 476 images. The table below summarizes the total number of classes, the number of images per class, the total number of images in the benchmark, the image spatial resolution, and the image dimensions. Figure 7 summarizes the details of the coffee dataset images and other dimensions.

RSSCN.
The remote sensing image classification dataset comprises images gathered from Google Earth Engine and covers widespread areas. RSSCN consists of 7 classes of typical scene images with a size of 400 × 400 pixels. Further details of this image benchmark are given in the dataset description table. Figure 8 shows the picture gallery of all the classes of the RSSCN dataset.

SIRI-WHU.
For details such as image size, total number of images, images per class, and date of creation, readers are referred to [22].
Table 4: Classification accuracy of SIRI-WHU on different CNN models.

UC Merced Land Use.
For details such as image size, total number of images, images per class, and date of creation, readers are referred to [65]. There are a total of 21 distinctive scene categories with 100 images per class and dimensions of 256 × 256 pixels, as shown in the dataset description table. Figure 10 shows randomly selected examples of each category included in the dataset (Table 5).

5.6. AID Dataset.
The AID dataset has 10000 images in 30 different classes. Figure 11 shows the photo gallery of the AID dataset.

DIOR Dataset.
The DIOR dataset includes 23,463 images and 192,472 object instances. Figure 12 shows the photo gallery of the DIOR dataset. Table 5 shows some of the existing datasets with image quantity and other descriptions.
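The dataset figures quoted in this section can be kept in a small record structure so that derived quantities, such as the total number of images, are computed rather than retyped. The sketch below is only an illustration: the `Benchmark` class and its field names are hypothetical, the values are the ones stated in the text above, and any quantity the text does not state is left as `None` rather than filled in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical record for one remote sensing benchmark."""
    name: str
    num_classes: Optional[int] = None
    images_per_class: Optional[int] = None
    total_images: Optional[int] = None
    image_side_px: Optional[int] = None  # side length of square images

    def resolved_total(self) -> Optional[int]:
        """Return the stated total, or derive it from classes x images per class."""
        if self.total_images is not None:
            return self.total_images
        if self.num_classes is not None and self.images_per_class is not None:
            return self.num_classes * self.images_per_class
        return None


# Values as stated in this section; unstated fields stay None.
BENCHMARKS: List[Benchmark] = [
    Benchmark("SAT4", num_classes=4, total_images=500_000),
    Benchmark("SAT6", num_classes=6, total_images=405_000),
    Benchmark("RSSCN", num_classes=7, image_side_px=400),
    Benchmark("UC Merced Land Use", num_classes=21, images_per_class=100, image_side_px=256),
    Benchmark("AID", num_classes=30, total_images=10_000),
    Benchmark("DIOR", total_images=23_463),
]
```

For UC Merced Land Use, `resolved_total()` yields 21 × 100 = 2100, which agrees with the image count quoted for that dataset elsewhere in this review.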

Unsupervised Learning Approaches
Owing to advances in space and satellite technologies, remote sensing has reached a new height [90]. With high-resolution satellites, it has become easier to perform land use and land cover surveys, detect change, recognize objects, and so on [91]. Advances in image classification techniques have also made it easier to interpret the acquired images automatically. Using these satellite data efficiently and effectively is still challenging. CNNs play an important role in this image classification process. The article discussed below presents a framework called unsupervised restricted de-convolution neural network (URDNN). The main idea behind this framework is to perform unsupervised restricted de-convolution using neural networks. It learns pixel-to-pixel, end-to-end classification and then passes the result to a CNN model for assigning labels. This reduces the issue of overfitting and underfitting, which occurs due to the large number of labeled data. The experimentation was performed on two datasets, from the Geoeye and Quick-bird sensors [92]. The results are better than those of previous models, with accuracies of 97% and 98.9%, respectively [92]. Remote sensing image classification using unsupervised deep learning techniques has also been introduced. In the first step, a CNN extracts features using unsupervised techniques. After that, the parameters of the network are trained and then passed to the classifier. The cost of computation decreases due to unsupervised learning. An SVM classifier was used in the process, and spatiospectral information was efficiently extracted with this technique.
Adding new layers to the network improves efficiency but introduces the problem of overfitting [93]. In unsupervised remote sensing image classification, the concepts of scale invariant feature transform and histogram of oriented gradients are very important [94]. An image is converted into a feature vector by encoding; compared to hand-engineered image representations, unsupervised learning techniques have achieved a new height [95]. Image features can be obtained directly from the raw pixels of an image. Gabor filters can be applied to image patches to extract features from those pixels. Bag of words (BOW) is another concept used in image classification and image retrieval [96]. To get the best results in terms of accuracy, an SVM with a non-linear kernel needs to be added. In remote sensing, both color and intensity features are important for classification, but most of the existing algorithms cannot handle both at the same time. The article discussed below addresses this issue. The authors considered a quaternion representation of color features and then proposed an unsupervised learning technique based on this quaternion concept that jointly considers color and intensity. The experimentation was performed on the UCM and Brazilian coffee datasets. The proposed model gave better accuracy than the existing techniques [97]. With the enhancement of deep learning techniques, remotely sensed images can be classified more accurately using unsupervised learning techniques. When the available labeled data samples are limited, it becomes difficult to perform image classification using supervised learning techniques [98]. Scene image classification is a hot topic these days; with these classification and analysis techniques, land cover and land use surveys, urban area planning, disaster management and planning, crop analysis, weather prediction, and so on can be performed much more easily and with high accuracy [99].
Previously, BOW was the unsupervised learning technique used for remote sensing image classification. The article discussed below presents a new technique. To overcome the issue of limited labeled data, a multilayered feature matching technique is proposed. The model uses both discriminative and generative models for learning from unlabeled data. The experimentation was performed on two datasets, UCM and coffee, and compared to other existing techniques, the proposed model, MARTA GANs, performs better, with classification accuracies of 94.86% and 89.86%, respectively [100].
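K-means clustering, listed earlier among the unsupervised techniques, is also the usual way to build the visual vocabulary for a bag-of-words representation: local feature vectors are clustered, and each cluster centre becomes one visual word. The sketch below is a minimal plain-Python version of Lloyd's algorithm under simple assumptions (squared Euclidean distance, initial centres sampled from the data with a fixed seed); it is not drawn from any of the cited articles.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(
    points: List[Point], k: int, iterations: int = 20, seed: int = 0
) -> Tuple[List[Point], List[int]]:
    """Return (centroids, labels) after at most `iterations` Lloyd updates."""
    rng = random.Random(seed)
    centroids: List[Point] = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids: List[Point] = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels
```

Once the centroids are fixed, an image can be encoded as a histogram of the labels of its local features, which is the feature vector that a classifier such as an SVM would then consume.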

Reinforcement Learning.
Reinforcement learning is the concept of training a model in which correct behavior is rewarded and undesired behavior is punished. Reinforcement learning is a subbranch of machine learning that is quite similar to unsupervised learning in that no labels are assigned to the images. In reinforcement learning, agents learn the parameters and predict the outcomes [101]. Each prediction is followed by a reward or a punishment, and this process carries on until the game ends. Reinforcement learning is mostly used in gaming and in AI and robotics, for example, to teach a robot new tricks. The subelements of reinforcement learning include the policy, the reward, the value function, and a model of the environment [102]. Reinforcement learning has achieved a new height as it is really helpful in minimizing the gap between the training loss and the evaluation metrics [103]. Image captioning is a challenging and much needed task in remote sensing. Most of the existing ML models suffer from the problem of overfitting. The article mentioned below addresses this issue by proposing a two-stage model: the first stage is for variational autoencoding, while in stage 2, reinforcement learning is introduced. The CNN is fine-tuned in stage 1, and in stage 2, the model generates image captions. Reinforcement learning is then applied to improve the accuracy of the model. The experimentation was performed on the NWPU-RESISC45 dataset. The results are far better than previously reported results. However, the problem of overfitting remains and should be addressed in future [104]. Fully polarized radar has the advantage of capturing images at any time regardless of weather conditions. Such images are useful for land cover and land use applications, crop management, forest estimation, disaster prevention, target recognition, and many more.
e article discussed below proposed a new model called deep Q network (DQN) that is basically a deep neural network model for polarized SAR image classification. e data are first preprocessed to  reduce noise and extract features. e data are then fed into deep neural network for classification purpose where the concept of reinforcement learning is introduced. e experimentation was performed on two PolSAR image datasets and the researchers claim that their model outperforms on many existing models [105].
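The reward-and-punishment loop described in this section can be illustrated with a minimal tabular Q-learning sketch. It is not the DQN of [105] or the captioning model of [104]: the toy corridor environment, the reward values, and all names (`train_q_table`, `n_states`, and so on) are assumptions chosen only for illustration.

```python
import numpy as np

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                  epsilon=0.2, seed=0):
    """Tabular Q-learning on a hypothetical corridor with the goal at the right end.

    Actions: 0 = move left, 1 = move right. Reaching the goal yields +1
    (the reward); every other step yields -0.01 (a small punishment).
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on the episode length
            # Epsilon-greedy policy: explore sometimes, otherwise act greedily.
            if rng.random() < epsilon:
                action = int(rng.integers(2))
            else:
                action = int(np.argmax(q[state]))
            next_state = max(0, state - 1) if action == 0 else min(goal, state + 1)
            done = next_state == goal
            reward = 1.0 if done else -0.01
            # Q-learning update towards the bootstrapped target.
            target = reward if done else reward + gamma * np.max(q[next_state])
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
            if done:
                break
    return q

q_table = train_q_table()
# Greedy action in every non-goal state after training.
greedy_policy = np.argmax(q_table[:-1], axis=1)
```

A DQN replaces the table `q` with a neural network that maps a state (for example, image features) to one value per action; the update rule keeps the same form.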

Optimization Techniques for Remote Sensing Image Classification
Optimization is the process of finding the input values that yield the best output of a well-defined objective function. This is a critical task for which multiple machine learning algorithms exist. In this context, optimization means minimizing the operational cost and improving accuracy. An optimization algorithm evaluates different candidate solutions until it finds the most suitable one for the problem [106]. In remote sensing image classification, the selection of features is the most crucial task, and it depends on the number of available labeled samples. Feature selection is the process of selecting the more important features out of the pool of features and excluding correlated features. In the article discussed below, a solution is proposed for this feature selection task: a stochastic method is introduced for the selection of relevant and important features. The experimentation was performed on two datasets, AVIRIS and ROSIS, and the results show that the proposed method gives better accuracy than the existing approaches [107]. Feature selection is one of the most important tasks in remote sensing image classification, and the huge amount of data and the correlated features make it difficult. To overcome this issue, a new methodology based on wavelet analysis is introduced. A threefold strategy is used in this framework: in the first phase, the resolution technique is modified; in the second phase, a 3-D discrete wavelet transform is introduced; and in the last phase, a CNN is introduced. The performance of this model is tested on three different datasets: Indian Pines, University of Pavia, and Salinas. The accuracy achieved using this model is 99.4%, 99.85%, and 99.8%, respectively [108].
In hyperspectral remote sensing, the training samples are usually limited, which makes it difficult to achieve high accuracy with conventional techniques. SVM gives better accuracy as it has good generalization and the least structural risk, and it overcomes the issues of high time consumption and poorly optimized parameters. The article discussed below uses EO-1 Hyperion data to optimize the parameters. The proposed model is tested, and the authors claim an accuracy of 91.3%, which is considerably higher than that of the existing approaches [109]. Remote sensing image classification is highly beneficial for land use and land cover applications, which is a recent research area. Existing classification methods suffer from low efficiency, and the datasets involved are usually large. To overcome this issue, a new method is proposed where the concept of extreme programming is introduced. Ensemble methods, full use of features, and deep learning methods are introduced into the proposed model. All three methods give better results in terms of classification accuracy and efficiency. The experimentation was performed on multiple datasets, and the classification accuracy depends upon the type of dataset. The optimization technique combined with deep learning outperforms the other methods [110].
There are many optimization techniques covering different aspects of the image classification task; a summary of these techniques is given in a later section.
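The idea of stochastic feature selection can be sketched as a simple random search over feature subsets. This is only an illustration under stated assumptions, not the method of [107]: the nearest-centroid scorer, the accept-if-not-worse rule, the synthetic data, and all function names are hypothetical.

```python
import numpy as np

def nearest_centroid_accuracy(X_train, y_train, X_val, y_val):
    """Score a feature subset with a nearest-centroid classifier (assumed scorer)."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_val[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[np.argmin(dists, axis=1)]
    return float(np.mean(pred == y_val))

def stochastic_feature_selection(X_train, y_train, X_val, y_val, n_iter=200, seed=0):
    """Randomly flip one feature in or out; keep the change if it does not hurt."""
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[1]
    best_mask = np.ones(n_features, dtype=bool)
    best_score = nearest_centroid_accuracy(X_train, y_train, X_val, y_val)
    for _ in range(n_iter):
        mask = best_mask.copy()
        flip = rng.integers(n_features)
        mask[flip] = ~mask[flip]
        if not mask.any():
            continue  # never evaluate an empty subset
        score = nearest_centroid_accuracy(X_train[:, mask], y_train,
                                          X_val[:, mask], y_val)
        # Accept a better subset, or an equally good but smaller one.
        if score > best_score or (score == best_score and mask.sum() < best_mask.sum()):
            best_mask, best_score = mask, score
    return best_mask, best_score

# Synthetic data: features 0 and 1 are informative, the other eight are noise.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 100)
X = rng.normal(size=(200, 10))
X[:, 0] += 3.0 * y
X[:, 1] -= 3.0 * y
idx = rng.permutation(200)
train, val = idx[:120], idx[120:]

baseline_score = nearest_centroid_accuracy(X[train], y[train], X[val], y[val])
best_mask, best_score = stochastic_feature_selection(X[train], y[train], X[val], y[val])
```

Because a candidate subset is only accepted when its validation score is at least as high as the current one, the selected subset never scores below the all-features baseline on the validation split.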

Grey Wolf Optimization. Grey wolf optimization (GWO) is a recent metaheuristic technique that mimics the leadership hierarchy of grey wolves [111]. There are four types of grey wolves, namely, alpha, beta, delta, and omega. The fittest solution is called alpha, the second best is called beta, the third is called delta, and the remaining solutions are called omega. There are three steps of hunting: searching for prey, encircling prey, and attacking prey; these three steps are implemented to obtain the optimized performance. This technique is mainly used for feature selection [112,113]. In HSI, there are many consecutive and narrow spectral bands that give information about various land covers. Due to the number of features, the time complexity increases, and selecting the best features out of the pool of features is a difficult and challenging task. The article discussed below proposed a new technique for feature selection of HSI that reduces redundant features. The fuzzy C-means algorithm is used for the decomposition of the feature subset, whereas wolf optimization and maximum entropy are used for feature selection. The experimentation was performed on three known datasets: Indian Pines, Pavia University, and Salinas. The proposed method outperforms existing techniques in terms of classification accuracy [114]. Image processing and analysis is an emerging field of computer vision. It has many different applications such as image classification, segmentation, medical imaging, image compression, and many more. Multiple algorithms exist to solve these problems, such as GA, GP, the grey wolf algorithm, the bat algorithm, and so on. The article discussed below is a review of multiple optimization techniques, their usage, and their real-world applications [115]. Grey wolf optimization is one of the recent trends under the umbrella of swarm intelligence. It is reported to perform better than other swarm intelligence techniques, and it is simpler to implement and easy to understand.
The article discussed below is a review of the grey wolf technique and its multiple applications [116]. We can summarize these optimization algorithms as follows:
(i) Grey wolf algorithm can handle large data efficiently, but it ignores smaller details which need to be addressed.
(ii) Grey wolf divides the features into four groups; in future, there should be a direction where more or less than four groups are formed.
(iii) Effectiveness of grey wolf should be checked in combination with different optimization algorithms.
(iv) There should be a focus on solving dynamic problems using GWO.
(v) Parameter tuning of GWO could also be a focus in future.
Table 6 shows the summary of some of the optimization techniques described in literature.
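The hunting steps of GWO can be sketched for a continuous minimization problem as follows. This is a minimal sketch of the commonly described GWO update (three leaders, a control parameter that shrinks from 2 towards 0), not the feature selection pipeline of [114]; the sphere objective, the bounds, and all parameter values are assumptions for illustration.

```python
import numpy as np

def grey_wolf_optimize(objective, dim, bounds, n_wolves=20, n_iter=200, seed=0):
    """Minimize `objective` with a basic grey wolf optimizer."""
    rng = np.random.default_rng(seed)
    low, high = bounds
    wolves = rng.uniform(low, high, size=(n_wolves, dim))
    best_pos, best_fit = None, np.inf
    for t in range(n_iter):
        fitness = np.array([objective(w) for w in wolves])
        order = np.argsort(fitness)
        # The three fittest wolves lead the pack; all others act as omegas.
        alpha, beta, delta = (wolves[i].copy() for i in order[:3])
        if fitness[order[0]] < best_fit:
            best_fit, best_pos = float(fitness[order[0]]), alpha.copy()
        a = 2.0 * (1.0 - t / n_iter)  # decreases linearly from 2 towards 0
        candidates = []
        for leader in (alpha, beta, delta):
            r1 = rng.random((n_wolves, dim))
            r2 = rng.random((n_wolves, dim))
            A = 2.0 * a * r1 - a                 # |A| > 1 favors searching, |A| < 1 attacking
            C = 2.0 * r2
            D = np.abs(C * leader - wolves)      # encircling distance to the leader
            candidates.append(leader - A * D)
        # Each wolf moves to the average of the three leader-guided positions.
        wolves = np.clip(np.mean(candidates, axis=0), low, high)
    # Evaluate the final positions once more.
    fitness = np.array([objective(w) for w in wolves])
    if fitness.min() < best_fit:
        best_fit, best_pos = float(fitness.min()), wolves[np.argmin(fitness)].copy()
    return best_pos, best_fit

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return float(np.sum(x ** 2))

best_pos, best_fit = grey_wolf_optimize(sphere, dim=5, bounds=(-10.0, 10.0))
```

For feature selection, the continuous positions are usually mapped to binary masks (for example, by thresholding), and the objective becomes a classification error on the selected features; that mapping is not shown here.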

Fusion of Deep Learning with Spectral Features
The classification accuracy of hyperspectral images (HSI) has increased drastically with the use of CNNs. Better performance requires a denser network, which in turn causes overfitting, degradation of accuracy, and vanishing gradients. To overcome these issues, a new framework, the hierarchical feature fusion network (HFFN), is proposed. The main idea behind this model is to fuse the outputs of all the layers, which results in an increase in accuracy. The experimentation was performed on three real HSI datasets: the AVIRIS Indian Pines image, the ROSIS-03 University of Pavia image, and the AVIRIS Salinas image. The experimental results were compared with DCNN, SVM, and DRN. The results showed that the proposed method outperforms the existing DL methods [117].
CNNs are known as the most powerful methods for hyperspectral image classification. Usually, the pooling layers and sampling locations of CNNs are fixed, so they cannot adapt the downsampling of features. A research article proposed a deformable HSI classification method. The proposed method is evaluated on two real HSI datasets, University of Pavia and Houston University, which have 12 and 15 classes, respectively. The first experiment was performed on the Pavia dataset, where training samples (45, 55, and 65) are randomly selected from each class. The results showed that the proposed method accurately classifies pixels in the near-edge regions. The second experiment was performed on the Houston dataset, where training samples (30, 40, and 50) were also randomly selected from each class. It has been observed that the proposed method performs better than other existing methods [118].
A deeper network with 9 layers, called the contextual deep CNN, is proposed. The idea behind this research is to have a model that can accurately find local contextual interactions by jointly exploiting local spatiospectral relationships of neighboring individual pixel vectors, as shown in Figure 13. In the first step, multiscale joint exploitation of the spatiospectral information is obtained through a filter bank, and the outputs are then combined in a map. The experimentation is performed on three datasets: the Indian Pines dataset, the Salinas dataset, and the University of Pavia dataset. Indian Pines dataset has 12 classes but only 8 were used as there were so many images. Pavia dataset has 16 classes, and all of them were considered for experimentation. The accuracy of Indian Pines using the proposed technique is 93.6%, that of Salinas is 95.07%, and that of Pavia is 95.97% [119].
Hyperspectral image (HSI) classification is a relatively new research area. In one article, a special CNN model is proposed that performs the desired classification with less training and fine-tuning of data. To perform this task, pixels from the same class are pulled closer together, while pixels from different classes are pushed farther apart. The experimentation is performed on three HSI datasets: Indian Pines, Pavia, and Salinas. The results were validated on AlexNet, VGG-CNN-S, and GoogLeNet. The previous accuracies were 88.45%, 85.5%, and 88.8%, respectively, whereas the proposed model gives accuracies of 96.21%, 86.46%, and 88.48%, respectively [120].
In [121], a new model, SAFF, is proposed. In the first phase, multiple labels were identified by using pretrained CNNs, and then a self-attention layer was added for channel-based and spatial-based weight assignment. At the end, SVM was used for classification. The experimentation was performed on three different datasets: (1) UC Merced Land Use Dataset having 2100 images and 21 classes; (2) Aerial Image Dataset having 10000 images and 30 classes; and (3) NWPU-RESISC45 Dataset with 31,500 images and 45 classes. The overall accuracy on the UCM dataset is 97.02%, that on the AID dataset is 90.25%, and that on the NWPU dataset is 84.38%.

Feature Fusion
Earlier work on image retrieval relied on a single technique; later, it was observed that fusing more than one technique can give better accuracy [122]. In one article, a new model, the weight feature convolutional neural network (WFCNN), is proposed that performs segmentation and extraction of information from images. The WFCNN model first performs encoding and then classification. The proposed model is trained by using the stochastic gradient descent (SGD) algorithm. The experimentation was performed on two datasets: Gaofen 6 (GF-6) images and aerial images. The results are validated against the SegNet, U-Net, and RefineNet models. The GF-6 dataset gives an accuracy of 94.13%, and the aerial image dataset gives an accuracy of 96.9% [123]. Ren et al. [124] proposed a full CNN based on multiscale feature fusion to handle class imbalance in remote sensing image classification. The authors named the proposed research model as DeepLab V3+, with a loss function based solution to sample imbalance [124]. Experimentation was performed on two datasets: Sentinel-2 and Sentinel-3. When compared with U-Net, PSPNet, and ICNet, the proposed method gives accuracy up to 97% [124]. Another article proposed a new technique where a large image is divided into small-scale images. A support vector machine (SVM) is used to divide the samples into classes. After this phase, a new module called active learning is added. The proposed model (SSFFSC-AL) performs better in terms of classification accuracy and also gives results in less time. The experimentation was performed on two datasets: Indian Pines and Pavia [125]. Feature fusion has two basic methods: local feature fusion and global feature fusion. Zhu et al. [126] discussed local and global feature fusion for scene classification of high-resolution spatial images. Li et al. [127] discussed scene image classification using a fusion strategy that integrates multilayer features from pretrained CNNs. A CNN was used for feature extraction, and then fully connected layers were used for deep feature extraction; these extracted features were fused using PCA, and classification was performed afterwards. The datasets used for the experimentation are WHU-RS and UCM, and the authors claim that they have achieved better accuracy than previously implemented classification processes. The gap identified in this article is the need to reduce computational time and to improve classification accuracy [127].
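The PCA-based fusion step can be sketched as concatenating two feature sets and projecting the result onto its leading principal components. This is a minimal sketch under stated assumptions, not the pipeline of [127]: the random arrays stand in for convolutional and fully connected features, and `pca_fuse` is a hypothetical helper.

```python
import numpy as np

def pca_fuse(features_a, features_b, n_components):
    """Concatenate two feature matrices (samples x dims) and reduce them with PCA."""
    fused = np.hstack([features_a, features_b])
    centered = fused - fused.mean(axis=0)
    # The right singular vectors of the centered data are the principal axes.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()
    return projected, float(explained)

rng = np.random.default_rng(0)
conv_features = rng.normal(size=(50, 16))  # stand-in for convolutional-layer features
fc_features = rng.normal(size=(50, 8))     # stand-in for fully connected-layer features
fused, explained_ratio = pca_fuse(conv_features, fc_features, n_components=4)
```

The fused, low-dimensional vectors would then be passed to a classifier; the choice of `n_components` trades compactness against the fraction of variance retained (`explained_ratio`).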
Yuan et al. [128] discussed scene image classification performed by global rearrangement of local features; the rearrangement of local features helped to obtain spatial information of the image. The experimentation was performed on four different datasets, UCM, WHU-RS19, Sydney, and AID, and they claimed that the performance was satisfactory. In future, there should be a focus on improving classification accuracy.
In [128], the multilayer covariance pooling technique was used for extraction of features; these features were then stacked to form a covariance matrix, and finally a support vector machine was used for classification. The experimentation was performed on the UCM, AID, and NWPU-RESISC45 datasets, and the proposed method outperforms existing methods of classification. In future, there should be an end-to-end CNN model that is able to classify with better accuracy using fewer feature maps at each layer.
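The covariance pooling idea, stacking feature maps and summarizing them by a covariance matrix, can be sketched as follows. This is an illustration only: it assumes that the feature maps from different layers have already been resized to a common spatial size, and the random arrays and the function name `covariance_pooling` are hypothetical.

```python
import numpy as np

def covariance_pooling(feature_maps):
    """Stack feature maps of shape (C_i, H, W) and pool them into a covariance matrix."""
    stacked = np.concatenate(feature_maps, axis=0)   # (C, H, W) with C = sum of C_i
    n_channels = stacked.shape[0]
    samples = stacked.reshape(n_channels, -1)        # each row: one channel over all pixels
    cov = np.cov(samples)                            # (C, C) channel covariance
    # The matrix is symmetric, so the upper triangle is enough as a descriptor.
    descriptor = cov[np.triu_indices(n_channels)]
    return cov, descriptor

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(4, 7, 7))   # stand-in for one layer's feature maps
layer_b = rng.normal(size=(6, 7, 7))   # stand-in for another layer's feature maps
cov, descriptor = covariance_pooling([layer_a, layer_b])
```

The resulting `descriptor` has C(C + 1)/2 entries and could be fed to an SVM, as in the classification stage described above.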
A research article discussed feature aggregation for learning scene classification. This model unites feature learning, aggregation, and classification into a CNN during the training process. Fine-tuning is performed to ease the training process, and it works for insufficient data as well. The experimentation was performed on three datasets: AID, UCM, and WHU-RS19. The limitation of this research is that a technique is still needed that can capture semantic information of images without cropping or resizing them [87]. Figure 14 shows the complete process of how features are extracted and the image classification process is completed. Another article presents an unsupervised feature fusion technique for training of CNN. Due to this, training becomes easier and more efficient; after that, feature fusion was performed to classify images. The experimentation was performed on the UCM and Brazilian coffee datasets, and the proposed model gives a better accuracy of 87.83%. There should be a focus on different feature fusion strategies to check their effect [73]. Table 7 summarizes the above-explained research articles.

Texture Features
Feature selection and extraction are the most important tasks in content-based image retrieval. There are two types of features: global and local. Global features include color, texture, shape, and spatial information, whereas local features carry information about image segmentation, edge detection, corners, blobs, and so on [129].
Texture features are considered to be the most powerful features of all. Among low-level visual features, the texture of an image is considered a distinguishable image representation, as it captures the most visible and noticeable patterns in an image. However, texture features cannot be used on their own. Different fusions of texture features have shown good results in different applications of remote sensing and image retrieval [130].
Along with these pros, texture feature extraction also has some cons: the complexity increases while processing and extracting texture features [131]. To overcome these issues, different texture feature extraction methods are reported in literature, such as the wavelet transform [132] and the Gabor filter [133]. Table 8 presents a detailed summary of texture features.
In [134], a new technique is proposed for classification and extraction of features from SAR images. The method is divided into three phases. In the first phase, two types of features were extracted: grey level co-occurrence matrix and Gabor filter features. In the second phase, dimensionality is reduced, and in the final phase, SVM is introduced for image classification. The experiments showed that this model gives better classification accuracy and is also good for dimensionality reduction. A SAR image dataset is used for experimentation, and the accuracy is 87.5% [134].
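A grey level co-occurrence matrix (GLCM) and a few texture statistics derived from it can be sketched as follows. This is a minimal illustration, not the feature extractor of [134]: the single offset, the three chosen statistics, and the small sample image are assumptions.

```python
import numpy as np

def glcm_features(image, levels, dx=1, dy=0):
    """Build a normalized GLCM for one pixel offset and derive texture statistics.

    `image` must contain integer grey levels in the range 0..levels-1.
    """
    glcm = np.zeros((levels, levels), dtype=float)
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                # Count how often grey level i is followed by grey level j.
                glcm[image[r, c], image[r2, c2]] += 1
    glcm /= glcm.sum()  # convert counts to joint probabilities
    i, j = np.indices(glcm.shape)
    return {
        "contrast": float(np.sum(glcm * (i - j) ** 2)),
        "energy": float(np.sum(glcm ** 2)),
        "homogeneity": float(np.sum(glcm / (1.0 + np.abs(i - j)))),
    }

sample = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 2, 2, 2],
                   [2, 2, 3, 3]])
features = glcm_features(sample, levels=4)  # horizontal neighbor, distance 1
```

In practice, several offsets and directions are usually combined, and the resulting statistics are concatenated with other descriptors (such as Gabor responses) before dimensionality reduction and classification.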
The speckle effect is a very common issue in PolSAR. To overcome this issue, a special technique is proposed that first extracts the features and then classifies them. Real PolSAR images were used for experimentation, and the results were validated against existing techniques. The authors claim an accuracy of up to 99.8% [135].
Hyperspectral sensors can now collect huge amounts of data, but it is still challenging to classify HSIs accurately. Previous research used spatiospectral classification techniques, but these were not able to classify images accurately. In this article, the author proposed a new technique to classify images that is carried out in three phases. In the first phase, feature extraction is performed; in the second phase, images are classified using a probabilistic SVM; and in the third and last phase, probabilities are calculated to find the results. The experimentation was performed on two different HSI datasets: Indian Pines and Pavia. The results showed that the classification accuracy of the proposed model is better than that of previously used techniques [136].
Kai et al. [137] extracted texture features using the Gabor method. The datasets used for the experimentation are Corel, Li, and Caltech 101, on which they reported accuracies of 83%, 88%, and 70%, respectively. The main limitation identified in this research is the increase in computational cost during feature extraction. In [138], Sajjad et al. reported that texture features could be extracted efficiently using the wavelet method. They claimed accuracies of 99%, 56%, and 35% on Corel 1K, Corel 5K, and Corel 10K, respectively. Wavelet-based texture feature extraction can increase accuracy, but the computation cost also increases as a result. In another article [139], Sajjad et al. extracted texture features using the histogram method. The experimentation was performed on the Corel 1K and Corel 5K datasets, and the classification accuracy was 87%. In 2018, it was reported that texture features extracted using the edge detection method give better accuracy, i.e., 98%. The dataset used for the experimentation is NUSWIND. The limitation of this research is the increase in computation cost.
Wang et al. [140] found that texture features extracted using the Canny edge detector give a better accuracy of 68%. The dataset used for experimentation is Corel 10K. The drawback of this research is the increase in running cost, as the number of input images was very large.
Nazir et al. [141] stated in their article that texture features extracted through the discrete wavelet transform (DWT) and the edge histogram descriptor (EHD) have better accuracy than those of other methods. The experimentation was performed on the Corel dataset. The accuracy reported in this article is 73.5%. The drawback of this research is that no machine learning methods were used for classification or extraction of features.
In [142], Thusnavis Bella and Vasuki used the ranklet transformation method for texture feature extraction. They claimed that they have increased the accuracy. The datasets used in their experimentation are Corel 5K and Corel 10K. The accuracies measured in the article are 67.4% and 67.9%, respectively. The limitation of this research is that the computation cost increased due to the many dimensions of the texture features.
Bella et al. [142] performed texture feature extraction using the grey level co-occurrence matrix (GLCM) method. The dataset used in this experiment was Corel 5K, and they achieved an accuracy of 66.9%. The computation cost is very high, as no algorithm was used in their experimentation to reduce it.
Ashraf et al. [143] claimed that they extracted texture features using the Gabor filter. The accuracy achieved in this experimentation was 79%, and the dataset was Corel 5K. The limitation of this research is also the increase in computational cost. Alsamadi et al. [144] reported that they extracted texture features using the DWT method. The dataset used in the experimentation was Corel, and they achieved an accuracy of 90%. The limitation of this research is the high computational time.
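Wavelet-based texture features, as used in several of the studies above, can be illustrated with a single-level 2-D Haar transform whose sub-band energies serve as a texture descriptor. This is a minimal sketch, not the method of [141] or [144]: the Haar wavelet, the energy descriptor, and the function names are assumptions, and sub-band naming conventions (LH versus HL) vary between libraries.

```python
import numpy as np

def haar_dwt2(image):
    """Single-level 2-D Haar DWT; returns the (LL, LH, HL, HH) sub-bands."""
    img = np.asarray(image, dtype=float)
    if img.shape[0] % 2 or img.shape[1] % 2:
        raise ValueError("image height and width must be even")
    s = np.sqrt(2.0)
    # Filter along the columns of each row (horizontal direction).
    low = (img[:, 0::2] + img[:, 1::2]) / s
    high = (img[:, 0::2] - img[:, 1::2]) / s
    # Filter along the rows (vertical direction).
    ll = (low[0::2] + low[1::2]) / s
    lh = (low[0::2] - low[1::2]) / s
    hl = (high[0::2] + high[1::2]) / s
    hh = (high[0::2] - high[1::2]) / s
    return ll, lh, hl, hh

def subband_energies(image):
    """Energy of each sub-band, usable as a 4-element texture feature vector."""
    return np.array([float(np.sum(band ** 2)) for band in haar_dwt2(image)])

rng = np.random.default_rng(0)
texture = rng.random((8, 8))                             # stand-in for a textured patch
energies = subband_energies(texture)
flat_energies = subband_energies(np.full((8, 8), 3.0))   # a patch with no texture
```

Because the Haar filters used here are orthonormal, the four energies sum to the energy of the input patch, and a constant patch places all of its energy in the LL band. Deeper decompositions repeat the transform on LL, which is one reason the computation cost grows with the number of levels.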

Hybrid Approaches
In one article, a hybrid approach combining SVM and KNN is used to accurately classify remotely sensed aerial images. First, the SVM was trained to classify images into different classes. In the testing phase, new test samples were entered, and the average distance between each test sample and each class was calculated using the distance formula. Lastly, each image is placed in the class with the minimum average distance. This process is repeated until all the images are sorted into their respective classes. The experimentation was performed on two datasets: the ALOS data of the Yitong River and PMS sensor data.
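The distance-based assignment step described above can be sketched as follows. This is an illustration under stated assumptions: only the minimum-average-distance rule is shown, the SVM stage is omitted, Euclidean distance is assumed as the distance formula, and the toy data and function name are hypothetical.

```python
import numpy as np

def min_average_distance_classify(X_train, y_train, X_test):
    """Assign each test sample to the class with the smallest average distance."""
    classes = np.unique(y_train)
    avg_dist = np.empty((X_test.shape[0], classes.size))
    for k, c in enumerate(classes):
        members = X_train[y_train == c]
        # Pairwise Euclidean distances between test samples and class members.
        d = np.linalg.norm(X_test[:, None, :] - members[None, :, :], axis=2)
        avg_dist[:, k] = d.mean(axis=1)
    return classes[np.argmin(avg_dist, axis=1)]

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                    [10.0, 10.0], [9.0, 10.0], [10.0, 9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[1.0, 1.0], [9.0, 9.0]])
predicted = min_average_distance_classify(X_train, y_train, X_test)
```

In a full hybrid, the class members could be restricted to the support vectors of the trained SVM so that fewer distances are computed; that restriction is an assumption and is not shown.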
In another article, parametric and non-parametric approaches were combined to classify remote sensing images, especially land cover and land use data. A new dataset was also proposed in this article for this purpose, which can be used in other related research. The land data were captured for both dry and wet conditions. The proposed model is a combination of ISODATA clustering and decision trees. The accuracy achieved for dry conditions is 84.54%, whereas for wet weather conditions, the accuracy is 91.10%, which is better than existing deep learning models [149].
In an article, the authors combined two algorithms, kernel-internal value fuzzy C-means clustering and multivalue C-means clustering; a comparison with conventional fuzzy C-means clustering showed that the proposed methods outperform the existing methods. They constructed a new dataset covering the LANDSAT-7 Ba Ria area and Hanoi area. The accuracies noted in this research were 98.2% and 94.13%, respectively [11].
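For reference, the conventional fuzzy C-means algorithm that serves as the baseline in this comparison can be sketched as follows. This is the standard algorithm only, not the kernel or multivalue variants of [11]; the fuzzifier value, the stopping rule, and the synthetic two-blob data are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Conventional fuzzy C-means; returns cluster centers and the membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.random((X.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)          # memberships of each sample sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)         # avoid division by zero
        # Membership update: inversely proportional to distance^(2 / (m - 1)).
        inv = dist ** (-2.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < tol
        u = new_u
        if converged:
            break
    return centers, u

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])
centers, memberships = fuzzy_c_means(points, n_clusters=2)
labels = np.argmax(memberships, axis=1)
```

Unlike hard clustering, each pixel keeps a degree of membership in every cluster, which is useful for mixed pixels in land cover imagery.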
An article explains sparse coding (SC), which is used to reduce the calculation time for feature extraction. SC is commonly used for aerial images as it performs better in this particular case. With the help of the proposed approach, the accuracy of local feature extraction is increased as compared to existing techniques. The experimentation was performed on UAS operating system data recorded for nearly 2 hours without flight interruptions. The accuracy achieved is 85.7% [150].
Another article combines two techniques: a pixel-based multilayer perceptron and a CNN. This combined algorithm is applied to a dataset obtained through aerial photography and satellite. The dataset contains images of both urban and rural lands of different land uses in Southampton. The proposed method outperforms the existing deep learning methods. The accuracies achieved with this proposed model are 90.93% for urban and 89.64% for rural lands [151].
Another article presents a hybrid approach that combines two techniques, SVM and ANN, for LULC classification of images captured through satellite. The fuzzy hierarchical clustering approach is used for classification, as shown in Figure 15. The dataset "Landsat-8 satellite images" is also proposed in this research. All the data are obtained from the lands of Hyderabad and its surroundings. The accuracies achieved in this article are 93.159% for SVM and 89.925% for ANN. The authors claim that the proposed method gave better results than existing methods [153].
Yang et al. [134] proposed an efficient classification technique for agricultural lands that is based on spatial and spectral image features. Here, a hybrid approach was used to classify healthy and non-healthy plants. Unmanned aerial vehicle (UAV) images of rice fields in Chianan Plain and Taibao City, Chiayi County, were collected. The accuracy achieved in this research is 90.67% [154]. Another article describes a hybrid approach used for classification of remote sensing images. SVM and KNN were combined in this research for better results. Two datasets were used for experimentation: dataset-1 contains "ALOS data of the Yitong River in Changchun," whereas dataset-2 contains "the ortho image of a factory region in Jiangsu Province." The accuracies achieved are 92.4% for DS-1 and 97.9% for DS-2 [155].
An article describes a disaster scenario in southern India, which was hit by a flood. The data captured in this research were 200 flooded and non-flooded images. The approach used is a combination of SVM and K-means clustering. The accuracy achieved was 92%. Bitner et al. [157] automatically extracted building footprints from multiresolution remote sensing images using a hybrid approach. The data, World View-2 imagery of Munich, Germany, were collected through satellite. The experimentation is performed by combining approaches, i.e., U-Net on top of the Caffe deep learning framework. The new hybrid technique performs better than existing techniques. The accuracy achieved is 97.4% [157]. The summary of all the hybrid approaches explained in the above section is given in Table 9.

Performance Evaluation Criteria
To evaluate the performance of classification, many measures exist in the literature [65,158]. The selection of a performance measure depends purely on the type of classification to be performed and on the type of results required.
(ii) Average precision: it can be defined as the mean of the precision values of all the related queries.
(iii) Mean average precision: it is defined as the mean of the average precision over all the relevant queries.
$$\mathrm{MAP} = \frac{1}{S} \sum_{q=1}^{S} \mathrm{AP}(q),$$
where $S$ is the number of queries and $\mathrm{AP}(q)$ is the average precision of query $q$.
(vi) F-measure: it is the harmonic mean of precision and recall.
(vii) Negative predictive value: it can be defined as the ratio of correctly labeled negative images to the total number of negatively labeled images.
$$\text{Negative predictive value} = \frac{TN}{TN + FN}.$$
(viii) Specificity: it is the ratio of correctly labeled negative images to the total number of negative images.
(ix) Accuracy: it is the ratio of all correctly labeled results, whether positive or negative, to the total number of labels that exist.
where $w$ = overall accuracy, $N_T$ = sum of all nondiagonal elements in confusion matrix, and $e_{ij}$ = total correct cells.
(xi) Mean square error: the most popular metric used for measuring the error is the mean square error (MSE). It computes the average of the squared difference between the target value and the value predicted by the model.
$$\mathrm{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - y_j' \right)^2,$$
where $N$ = the last iteration, $y_j$ = true value, and $y_j'$ = value predicted by the model.
(xii) Mean absolute error: the mean absolute error (MAE) is the average absolute difference between the actual value and the predicted value. The mathematical representation of the metric is given below.
$$\mathrm{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| y_j - y_j' \right|, \tag{11}$$
where $N$ = the last iteration, $y_j$ = true value, and $y_j'$ = value predicted by the model.
(xiii) Root mean square error: the root mean square error (RMSE) is easy to compute and gives a better idea of how well the model is performing. It is the square root of the average of the squared difference between the target value and the predicted value. Mathematically, it is written as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{j=1}^{N} \left( y_j - y_j' \right)^2}.$$
(xiv) Area under the receiver operating characteristic curve (AUROC): this metric is also known as the AUC-ROC score or curve. While computing AUROC, the true positive rate (TPR) and the false positive rate (FPR) are used. Mathematically, they are represented as follows:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$
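Several of the measures listed above can be computed directly from confusion-matrix counts and from pairs of true and predicted values. The following is a minimal sketch; precision, recall, and FPR follow their standard textbook definitions, which are assumed here, and the function names and example numbers are hypothetical.

```python
import numpy as np

def confusion_metrics(tp, fp, tn, fn):
    """Classification measures from true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate (TPR)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "negative_predictive_value": tn / (tn + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

def regression_errors(y_true, y_pred):
    """MSE, MAE, and RMSE between true and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(diff ** 2))
    return {"mse": mse, "mae": float(np.mean(np.abs(diff))), "rmse": float(np.sqrt(mse))}

def mean_average_precision(ap_per_query):
    """Mean of the per-query average precision values."""
    return float(np.mean(ap_per_query))

scores = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
errors = regression_errors([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
map_score = mean_average_precision([0.5, 1.0])
```

Plotting TPR against FPR over a range of decision thresholds gives the ROC curve, and the area under that curve is the AUROC; the threshold sweep itself is not shown.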

Conclusion and Future Directions
Remote sensing image analysis is used in various real-time applications such as monitoring of Earth, urban development, town planning, water resources engineering, providing construction requirements, and agriculture planning. Image analysis and classification is an open research problem for the research community working on remote sensing applications. Due to recent developments in imaging technology, there is an exponential increase in the number and size of multimedia contents such as videos and digital images. Due to this increase in the volume of digital images, the automatic classification of images is an open research problem for the computer vision research community. Various research models have been proposed in recent years, but there is still a research gap between human understanding and machine perception. For this reason, the research community working on remote sensing image analysis is exploring the possible research directions that can bridge this gap. The earlier approaches for remote sensing image analysis are based on low-level feature extraction and mid-level feature representation. These approaches have shown good performance on small-scale image benchmarks with limited training and testing samples. The use of discriminating feature representation with multiscale features can boost the performance of the learning model. These approaches can mainly assign single labels to images, while in the current era, it is a requirement to assign multiple labels to a single image on the basis of its contents. One of the main requirements of a deep learning model is a large-scale image benchmark that can help to train a complex deep network. The creation of a large-scale image benchmark with all possible classes of remote sensing images is one of the main requirements and an open research problem in this domain. Most of the current research models based on deep learning mainly use fine-tuning and data augmentation techniques to enhance learning. If a large-scale image benchmark is available, it will assist the learning model to learn parameters in a more effective way. The available large-scale image benchmarks are used through supervised learning, which is a time-consuming process, and such fully supervised learning models are computationally expensive. Exploring the possible learning capabilities based on unsupervised and semi-supervised learning is a possible future research direction. Deep learning models use extensive computational power for training, and most research models use GPUs for high-performance computing. Designing a deep learning model with fewer computations is also a possible research direction, and such a model can be used on a device with less computational power. The use of few-shot/zero-shot learning approaches can be explored in the field of remote sensing image classification.
Data Availability
The details about the data used are mentioned and cited within this manuscript.

Conflicts of Interest
The authors declare that they have no conflicts of interest.