Satellite Imageries for Detection of Bangladesh’s Rural and Urban Areas Using YOLOv5 and CNN

In recent years, there have been significant advancements in object identification in natural photos. However, when natural image object recognition techniques are applied directly to satellite images, the results are often unsatisfactory. This is primarily due to inherent disparities in object scale and orientation caused by the overhead viewpoint of satellite imagery. The distinguishing factors between rural and urban areas lie in the objects that cover them. Furthermore, the complex backdrop of satellite photos poses challenges in accurately extracting features, leading to the omission of small objects in many regions. The performance of object detection, which is crucial for area identification, is also affected by dense object overlap and occlusion. To address these issues, we made modifications to the generalized one-stage detector YOLOv5, specifically tailored for satellite photos. For this research, we manually collected data from Google Earth, meticulously labeling them and subsequently verifying them with human annotators. We then preprocessed the data using computer vision techniques, such as resizing and normalization. Next, we employed YOLOv5 and the transfer learning-based CNN architectures InceptionV3, DenseNet201, and Xception to compare their performances. The goal was to accurately identify rural and urban areas from remote sensing images.


Introduction
High-resolution satellite imagery is obtained by using advanced earth satellite technology to observe the surface of our planet. However, processing a large volume of satellite photos poses significant challenges for current interpretation algorithms. One of the fundamental tasks in computer vision is object detection, which involves accurately and efficiently identifying predefined objects within images. This capability finds extensive application in areas such as precision farming, urban traffic control, and various other domains [1][2][3]. The fleet of commercial satellites orbiting the Earth produces an ever-increasing amount of imagery, growing at an exponential rate. Satellite imagery serves a multitude of purposes, including agricultural crop classification [4,5], scene classification [6,7], wildlife monitoring [8,9], forest characterization [10,11], meteorological analysis [12,13], infrastructure assessment, building localization [14,15], and soil moisture estimation [16,17].
Recent advancements in segmentation and object detection tasks have been significantly facilitated by data-driven deep learning techniques. The size and quality of the training dataset have an impact on detection precision. The development of object detection has been fueled by a number of extensive and difficult natural image datasets, including PASCAL VOC and MS COCO. Nevertheless, recognizing objects in optical satellite photos remains challenging [18]. The causes are as follows. First, satellite photographs taken from a bird's eye view provide a broad imaging range with full information, in contrast to the natural images captured by ground-based cameras with horizontal views. There is an uneven distribution of foreground items and intricate background information in complex landscapes and urban settings [19]. In addition, objects in satellite pictures often exhibit varying visual appearances and optical properties due to a variety of imaging circumstances, such as perspective, illumination, and occlusion. Finally, smaller objects frequently carry less information about their appearance than larger ones, making it harder to distinguish them from the background or other nearby objects.
To address the aforementioned issues, this research focuses on improving area identification performance in satellite pictures. Detection speed also presents a substantial challenge for the detection algorithm, as region detection in satellite images often needs to occur in real time. You only look once (YOLO) neural networks can significantly enhance detection speed by merging object categorization and localization, which two-stage detectors handle separately, into a one-stage regression problem. To the best of our knowledge, YOLOv5 is the most recent version of YOLO, which demonstrates the best object detection performance on natural photos. This is because YOLOv5 utilizes the path aggregation network (PANet) and the enhanced CSPDarknet53 as the network's neck and backbone, respectively.
It is challenging to directly apply YOLOv5 to satellite photos for area recognition. In this study, we utilized transfer learning-based CNN architectures and made updates to YOLOv5 from three perspectives, listed as follows. First, due to excessive downsampling, the deep feature maps fused in the neck of YOLOv5 would lose information about tiny details. To overcome this issue, we implemented a new branch in the shallower network layer to perform the initial detection of each area. This allows us to preserve the feature information to the greatest extent possible. Second, the YOLO network is typically built on a convolutional neural network (CNN), and a CNN is primarily effective at capturing local information. A traditional transformer can compensate with its global modeling capability, but it would incur a quadratic computational cost when processing high-resolution satellite photos.
The main contributions of this study can be summarized as follows: (i) We have proposed a deep learning-based method for identifying rural and urban areas using satellite images. (ii) We generated a dataset that includes two classes, namely, rural and urban areas in Bangladesh. (iii) We conducted a comparative analysis on the same dataset using two techniques: a YOLOv5-based detection technique and a CNN-based classification technique.
The structure of the paper is as follows. Section 2 reviews the relevant work on satellite image classification methods. The methods and materials that were used are described in Section 3. The experimental analysis, including performance and results, is presented in Section 4. Section 5 concludes the article.

Related Work
Significant progress has been made in the field of satellite imagery, and several notable research studies have explored it. Some of these studies are listed as follows.
Kadhim and Abed [20] presented practical deep learning-based approaches for satellite image classification, which involved extracting features using four pretrained CNNs. The paper [21] focused on object and facility classification in high-resolution multispectral satellite imagery, utilizing a deep learning system. The system combined CNN predictions with satellite metadata through postprocessing neural networks. In another study [22], the speed and performance of modern object detection algorithms were compared on commercial EO satellite imagery datasets, specifically for oil and gas fracking wells and small cars. Article [23] examined the effective classification of aerial images using their EmergencyNet model onboard a UAV for monitoring and responding to emergencies. Pan et al. [24] introduced a paradigm for mapping a Chinese urban village in Guangzhou City using the U-net deep learning architecture. Their findings suggested that combining U-net-based deep learning with high spatial resolution satellite photos can provide valuable building information in complex urban settlements, which is crucial for urban revitalization. Yoo et al. [25] compared a CNN with an RF classifier for mapping local climate zones using bitemporal Landsat images.
Other approaches: Yang et al. [26] utilized ensemble projection (EP) to learn semi-supervised features for satellite image classification, especially in scenarios with limited labeled data and a large amount of unlabeled data. Paper [27] focused on classifying specific land cover in satellite images using the biogeography-based optimization approach. Dai and Yang [28] introduced a technique that incorporated visual attention in satellite image classification and addressed the classification task without a learning phase. Li et al. [29] investigated image cropping strategies for object detection, involving the cropping of large aerial images into uniformly sized smaller images. Their density-map guided object recognition network (DMNet) was inspired by the understanding that an image's object density map reveals the distribution of objects in terms of pixel intensity. Rahman et al. [30] employed a hierarchical clustering approach based on five specified spatial criteria to divide the 331 cities of Bangladesh into six classes using remote sensing data. Research [31] demonstrated the usefulness of satellite images for land use and land cover (LULC) analysis, as well as for analyzing the coastal dynamics of agriculture in the Bhola region (characterized by dense forests) and the Dhaka region (characterized by dense cities). Mathieu et al. [32] explored the effectiveness of object-based classifications that extract relevant ground features from images using automated image segmentation techniques.

Materials and Methods
In this section of the article, we provide a concise summary of the stages involved in data collection, preprocessing, and preparation. The next step is algorithm selection, where we study each model employed in detail. Then, we discuss the platforms, the key training parameters, and the evaluation metrics. Figure 1 provides a visual overview of the steps involved in the classification and detection process, highlighting the flow of information and the key stages.

Dataset Description.
The data collection process involved meticulous manual gathering of satellite images using Google Earth. A total of 3267 satellite images were collected from diverse regions in Bangladesh, with 1631 images representing urban areas and 1636 images representing rural areas. Separate datasets were prepared for the CNNs and YOLOv5, as shown in Table 1. For the experiment involving YOLOv5, a subset of 200 images was selected. The data collection process aimed at ensuring a comprehensive representation of the target regions and facilitating accurate analysis and evaluation.

Preprocessing.
To enhance the predictive performance of the CNN architecture, the recommended approach used in this research minimizes the number of preprocessing steps. We optimized the training process for CNN models using three standard preprocessing steps.

Resizing.
Raw collections of images usually come in different formats, which can lead to imbalanced image features. The whole dataset should therefore be unified into one structure by resizing the images to a common shape. Images of different sizes can be brought to that shape by resizing operations that increase or decrease their dimensions, which supports effective performance and reduces complexity. This dataset includes images of various resolutions and sizes. To ensure that all input images have the same dimensions, we resized all images from their original size to 224 × 224 pixels.
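
As a minimal sketch of the resizing step, the following Python code brings an image to 224 × 224 pixels. It assumes nearest-neighbor sampling and a NumPy array in height × width × channel layout; the interpolation method and library actually used are not stated in this study, and `resize_nearest` is a hypothetical helper name.

```python
import numpy as np

def resize_nearest(image: np.ndarray, out_h: int = 224, out_w: int = 224) -> np.ndarray:
    """Resize an H x W x C image to out_h x out_w with nearest-neighbor sampling."""
    in_h, in_w = image.shape[:2]
    # For every output row/column, choose the source row/column to copy from.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows[:, None], cols]

# Synthetic 480 x 640 RGB image standing in for a collected satellite tile.
tile = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
resized = resize_nearest(tile)
print(resized.shape)  # (224, 224, 3)
```

Nearest-neighbor sampling is chosen here only because it needs no extra dependencies; a smoother interpolation such as bilinear resampling would be an equally plausible choice.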

Normalization.
As part of image normalization, following ImageNet's mean subtraction process, we rescaled the pixel intensity values. We mapped the intensity values of all images from the range [0, 255] to the range [0, 1] by applying min-max normalization [33]:

$$x_{\text{norm}} = \frac{x - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$

where $x$ denotes the pixel intensity. In equation (1), $X_{\min}$ and $X_{\max}$ are the minimum and maximum intensity values of the input image, respectively.
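
The min-max rescaling described above can be sketched in a few lines of Python. This is an illustration under assumptions: the helper name `min_max_normalize` is hypothetical, and mapping a constant image to zeros is a choice made here to avoid division by zero, not something stated in the study.

```python
import numpy as np

def min_max_normalize(image: np.ndarray) -> np.ndarray:
    """Rescale pixel intensities to [0, 1] using the image's own minimum and maximum."""
    pixels = image.astype(np.float32)
    x_min, x_max = pixels.min(), pixels.max()
    if x_max == x_min:
        # A constant image has no intensity range; return zeros instead of 0/0.
        return np.zeros_like(pixels)
    return (pixels - x_min) / (x_max - x_min)

image = np.array([[0, 51], [102, 255]], dtype=np.uint8)
normalized = min_max_normalize(image)
print(normalized)  # values 0.0, 0.2, 0.4, and 1.0
```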

Augmentation.
Image augmentation is a technique used to expand the available resources within an image by generating nonduplicate regions. It involves applying various transformations to the original image, such as texture reflections, grayscale variations, adjustments in brightness levels, color contrasts, and other relevant image modifications. By introducing bounding boxes during augmentation, the accuracy of object detection can be improved, leading to the creation of synthetic data. Through operations such as image flipping and rotation, the dataset size can be significantly increased, resulting in a larger and more diverse collection of images. This process increases the number of images while preserving the integrity of important regions. In the case of 2D images, factors such as resolution and image quality are important, particularly when dealing with images that exhibit substantial disparities in size, shape, and color. Synthetic data offer considerable potential to enhance accuracy by generating images that belong to the same category.
(2) Image annotation, consisting of 200 .txt files. These files specify the exact locations of the labeled items in the corresponding images. Annotation was performed manually, and the annotated data were saved as .txt files in YOLOv5 format. The images were carefully labeled using the popular annotation application LabelImg.
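
Each line of a YOLOv5-format .txt file holds a class index followed by the box center, width, and height, all normalized by the image dimensions. The sketch below converts a pixel-space box into such a line and mirrors it for a horizontal flip, one of the augmentation operations mentioned above. The helper names, the example box, and the class indices (0 for rural, 1 for urban) are assumptions for illustration.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box to a YOLO label line with values normalized to [0, 1]."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

def flip_yolo_line_horizontally(line):
    """Mirror a YOLO label left to right so it matches a horizontally flipped image."""
    class_id, x_c, y_c, w, h = line.split()
    # Only the horizontal center changes; size and vertical position are unaffected.
    return f"{class_id} {1.0 - float(x_c):.6f} {y_c} {w} {h}"

# Assumed class indices: 0 = rural, 1 = urban.
label = to_yolo_line(1, 64, 128, 320, 512, img_w=640, img_h=640)
flipped = flip_yolo_line_horizontally(label)
print(label)    # 1 0.300000 0.500000 0.400000 0.600000
print(flipped)  # 1 0.700000 0.500000 0.400000 0.600000
```

Keeping the labels in sync with each geometric transformation is what allows flipped or rotated copies to serve as valid additional training samples for the detector.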

Selection of Algorithm.
We employed the object detection architecture YOLOv5 and three pretrained CNN models, namely, InceptionV3, DenseNet201, and Xception, for classification and compared their results. In deep learning, large amounts of data are typically needed to improve a network's predictive ability. Because our data are limited, we adopt the transfer learning [34] approach and use the pretrained weights of the selected models to improve prediction.

YOLOv5.
The network structure of YOLOv5 consists of two main sections. The first section is the main architecture, which includes the input side and the backbone. The second section is the detection architecture, comprising the neck and the prediction part [35]. YOLOv5, an object detection model, is trained on the COCO dataset, which contains 80 different classes and a total of 200,000 annotated images. The YOLO family of models, including YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, and the recent YOLOv7, is widely employed for recognition tasks. The models of the YOLOv5 family, such as YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, differ in size according to the width and depth of the BottleneckCSP module [36]. The primary function of the BottleneckCSP module is to extract features from the feature map, enabling the extraction of valuable information from the input image. In this study, the YOLOv5 model summary reported 270 layers, 7025023 parameters, 7025023 gradients, and a computational complexity of 16.0 GFLOPs. Figure 2 shows the architecture of YOLOv5 and its components.

Transfer Learning-Based Convolutional Neural Networks (CNNs).
The final step of our work involves classification using transfer learning. Deep convolutional neural networks (DCNNs) have recently attained state-of-the-art performance in a variety of high-level computer vision tasks. Convolutional neural networks (CNNs), more commonly referred to as ConvNets, are a type of feed-forward neural network that employs a series of convolutional layers, each followed by a pooling layer, to learn to extract features from input data and build a series of high-level feature maps. The proposed CNN-based classification approach has been evaluated on InceptionV3, DenseNet201, and Xception. The network structures of the selected architectures are as follows.

InceptionV3.
InceptionV3 is a deep convolutional neural network architecture that was introduced by Google. It employs the concept of "inception modules," which consist of parallel convolutional layers with different filter sizes. This allows the network to capture features at multiple scales and resolutions. InceptionV3 is often used for transfer learning due to its strong performance on image classification tasks, as shown in Figure 3. In transfer learning, the pretrained InceptionV3 model is used as a feature extractor, where the initial layers are frozen and only the final layers are fine-tuned on the target dataset. This enables the model to leverage the representations learned from a large-scale dataset, such as ImageNet, and adapt them to the specific task at hand.
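
The idea of freezing a pretrained feature extractor and training only the final layers can be illustrated without a deep learning framework. The toy sketch below is not the training code used in this study: a fixed random projection stands in for the pretrained backbone, the data are synthetic, and the learning rate and step count are arbitrary assumptions. It only shows that gradient updates reach the classification head while the frozen weights stay unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a fixed projection followed by ReLU.
frozen_weights = rng.normal(size=(8, 16))
frozen_snapshot = frozen_weights.copy()

def extract_features(x):
    return np.maximum(x @ frozen_weights, 0.0)

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

# Synthetic two-class data (0 = rural, 1 = urban is an assumed encoding).
x = rng.normal(size=(64, 8))
y = (x[:, 0] > 0).astype(int)

head = np.zeros((16, 2))        # trainable classification head
features = extract_features(x)  # backbone output; no gradient flows into it
initial_loss = cross_entropy(softmax(features @ head), y)

for _ in range(500):
    probs = softmax(features @ head)
    grad_logits = probs.copy()
    grad_logits[np.arange(len(y)), y] -= 1.0  # gradient of cross-entropy w.r.t. logits
    head -= 0.005 * features.T @ grad_logits / len(y)  # update the head only

final_loss = cross_entropy(softmax(features @ head), y)
print(initial_loss, final_loss)
```

Because the backbone is never updated, its features can be computed once and reused across steps, which is one reason this form of transfer learning is economical when labeled data are scarce.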

DenseNet201.
DenseNet201 is a deep convolutional neural network architecture that emphasizes feature reuse and alleviates the vanishing gradient problem. It introduces dense connections between layers, where each layer receives input from all preceding layers. This encourages feature reuse and leads to better gradient flow and improved information propagation throughout the network. DenseNet201 is commonly used in transfer learning scenarios, where the pretrained model is employed as a feature extractor, as illustrated in Figure 4. By freezing the initial layers and fine-tuning the later layers, DenseNet201 can effectively transfer knowledge from the source dataset to the target task, improving both training efficiency and generalization performance.
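
The dense connectivity pattern, in which each layer receives the concatenated outputs of all preceding layers, can be sketched as follows. This is a toy illustration, not the DenseNet201 implementation: each layer is a random 1 × 1 convolution with ReLU, and the block size, input channel count, and growth rate are values chosen here for the example.

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """Toy dense block operating on an H x W x C feature map."""
    features = [x]
    for _ in range(num_layers):
        # The layer input is the channel-wise concatenation of ALL earlier outputs.
        layer_input = np.concatenate(features, axis=-1)
        # A random 1x1 convolution (per-pixel linear map) stands in for a real layer.
        weights = rng.normal(size=(layer_input.shape[-1], growth_rate))
        features.append(np.maximum(layer_input @ weights, 0.0))
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 7, 64))
out = dense_block(x, num_layers=6, growth_rate=32, rng=rng)
print(out.shape)  # (7, 7, 256): 64 input channels + 6 layers x 32 new channels
```

The original input channels appear unchanged at the front of the output, which is the feature reuse the architecture relies on.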

Xception.
Xception, derived from "Extreme Inception," is an architecture that extends the Inception concept further by replacing the standard convolutional layers with depthwise separable convolutions. This factorizes the convolution operation into a depthwise convolution and a pointwise convolution, reducing the computational cost while maintaining expressive power. Xception has shown excellent performance on various image classification benchmarks, as depicted in Figure 5. In transfer learning, Xception is commonly utilized by leveraging its pretrained weights as a feature extractor. The initial layers are frozen, and only the final layers are fine-tuned on the target dataset. This approach allows Xception to transfer high-level features learned from large-scale datasets, enabling effective generalization to new tasks with limited training data.
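
The cost reduction from factorizing a convolution into depthwise and pointwise parts can be checked with simple arithmetic. The sketch below counts weights for a single layer, ignoring biases; the kernel size and channel counts are illustrative assumptions rather than values taken from the Xception architecture.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

standard = standard_conv_params(3, 128, 256)
separable = separable_conv_params(3, 128, 256)
print(standard, separable, round(standard / separable, 2))  # 294912 33920 8.69
```

For this example layer, the separable form needs roughly one ninth of the weights, which approaches the factor k × k as the number of output channels grows.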

Training Experiment Setup.
Google Colab, which provides free access to powerful GPUs with no configuration required, was used to train both the YOLOv5 and CNN models. For each class, 80% of the images were placed in the training set, while the remaining 20% were placed in the test set. The image size was set to 640 × 640 pixels as part of the YOLOv5 training parameter settings. Throughout the training procedure, we experimented with a large variety of batch sizes and numbers of epochs, all of which featured early stopping conditions. In our trial-and-error experiments, the best prediction results were obtained with a batch size of 1, a total of 100 epochs, and a learning rate of 0.01. We utilized a notebook developed by Roboflow [38] based on YOLOv5 [39] and employed pretrained COCO weights. The three types of loss, namely, box loss, objectness loss, and classification loss, are shown in Figure 6. Box loss evaluates how well the algorithm locates an object's center and how completely the predicted box covers that object. Objectness measures the probability that an object exists in the proposed region of interest. Finally, the algorithm's ability to correctly predict an object's class is reflected in its classification loss.

Figure 2: Architecture of YOLOv5 [37].
Mobile Information Systems

For all convolutional neural networks, the adaptive moment estimation (Adam) optimizer was used with learning rate η = 1e−5, β1 = 0.9, β2 = 0.999, ε = 1e−8, and a decay rate of 1e−5. Softmax is used as the activation function, and a dropout rate of 0.5 is set to prevent the model from overfitting. All models are trained for 15 epochs with a batch size of 16.
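
The per-class 80/20 division into training and test sets can be sketched as below. The file names, the random seed, and the helper name `split_per_class` are hypothetical, and rounding the training share down is an assumption, so the exact set sizes in the study may differ slightly.

```python
import random

def split_per_class(paths_by_class, train_fraction=0.8, seed=42):
    """Shuffle each class separately and split it into train and test lists."""
    rng = random.Random(seed)
    train, test = [], []
    for label, paths in sorted(paths_by_class.items()):
        shuffled = list(paths)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)  # training share, rounded down
        train += [(path, label) for path in shuffled[:cut]]
        test += [(path, label) for path in shuffled[cut:]]
    return train, test

# Hypothetical file names; the class sizes follow the dataset description.
dataset = {
    "urban": [f"urban_{i:04d}.jpg" for i in range(1631)],
    "rural": [f"rural_{i:04d}.jpg" for i in range(1636)],
}
train_set, test_set = split_per_class(dataset)
print(len(train_set), len(test_set))  # 2612 655
```

Splitting within each class keeps the urban-to-rural ratio nearly identical in both sets, which matters when the test accuracy is used to compare models.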

Evaluation Metrics.
To assess the prediction performance of the algorithms in this study, we used highly regarded evaluation metrics such as recall, precision, accuracy, F1-score, and mAP (mean average precision).
Accuracy is the ratio of the number of correctly classified cases to the total number of test images:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

Precision, also known as the positive predictive value, is the fraction of instances predicted as positive that are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$

The F1-score, or F-measure, combines precision and recall as their harmonic mean:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Recall, or sensitivity, measures the fraction of actual positive instances that are correctly classified:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

The mAP value is determined by averaging the average precision (AP) across all classes at the stated intersection over union (IoU) thresholds. AP is expressed as [40]

$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} \text{Precision}(r) \tag{6}$$
In the definitions above, true positives (TPs) are positive instances that were correctly predicted, and false negatives (FNs) are positive instances that were incorrectly predicted as negative. True negatives (TNs) are negative instances that were correctly predicted, whereas false positives (FPs) are negative instances that were mistakenly predicted as positive.
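
As a worked illustration of these metrics, the following sketch computes accuracy, precision, recall, and F1-score from confusion counts, together with an 11-point interpolated average precision. The counts and the precision-recall points are made-up example values, not results from this study, and the function names are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def eleven_point_ap(recalls, precisions):
    """Mean over r in {0, 0.1, ..., 1} of the best precision at recall >= r."""
    total = 0.0
    for i in range(11):
        r = i / 10
        candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11

metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
ap = eleven_point_ap([0.2, 0.4, 0.6, 0.8, 1.0], [1.0, 1.0, 0.75, 0.8, 0.5])
print(metrics)
print(round(ap, 4))  # 0.8364
```

Averaging such AP values over the rural and urban classes would give the mAP for one IoU threshold.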

Result Analysis and Discussion
After training the YOLOv5 model with our data, we used it to make predictions for images in our test set that had not been seen before. Figure 7 demonstrates how the algorithm can accurately identify both urban and rural areas. Table 2 displays the performance of YOLOv5 after training using different measures such as precision, recall, and mAP (mean average precision) when IOU is set to 0.5 (50%) and 0.95 (95%). A validation precision score of 0.995, a recall score of 0.999, and mAP scores of 0.995 and 0.978 for @0.5IOU and @0.95IOU, respectively, were obtained for the YOLOv5 model after evaluation. Figure 8 presents a collection of images extracted from the test set, illustrating the performance of the Xception model in accurately detecting urban and rural areas. Each image is accompanied by its corresponding actual label (urban or rural) and the target label, along with the associated confidence level. The depicted results highlight the model's ability to classify the regions correctly, as indicated by the alignment between the actual and target labels and the confidence level assigned to each prediction. This visual representation provides valuable insights into the effectiveness of the Xception model in discerning urban and rural areas based on the provided dataset.
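
The IoU threshold used in the mAP scores decides whether a predicted box counts as a correct detection. A minimal sketch of the IoU computation is given below, assuming axis-aligned boxes in (x_min, y_min, x_max, y_max) pixel coordinates; the two example boxes are invented for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

predicted = (50, 50, 250, 250)
ground_truth = (100, 100, 300, 300)
score = iou(predicted, ground_truth)
print(round(score, 3), score >= 0.5)  # 0.391 False
```

In this example the prediction would not count as a match at an IoU threshold of 0.5, even though the two boxes overlap substantially.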
The performance of three deep learning models, namely, InceptionV3, DenseNet201, and Xception, was evaluated for classifying cases into the urban and rural classes, as depicted in Figure 9 with 15 instances. These findings provide valuable insights into the accuracy and effectiveness of these models in classifying cases into the urban and rural categories. Such information is crucial for researchers and practitioners in the field of deep learning when selecting appropriate models for similar classification tasks. The performance of each architecture is individually examined to justify the performance of the proposed classification approach based on pretrained networks. Table 3 displays the accuracy of the three deep learning models, namely, InceptionV3, DenseNet201, and Xception. In terms of the ROC curve performance (Figure 10), it can be seen clearly that Xception performs better than InceptionV3 and DenseNet201.

Conclusions
This article presents the development of a dataset for the identification of rural and urban areas in Bangladesh, along with an investigation of two distinct approaches: a detection approach utilizing YOLOv5 and a classification approach employing CNNs. The principal limitation encountered in this study pertains to the restricted quantity of available images. To address this constraint, transfer learning techniques were applied, leveraging pretrained YOLOv5 and three DCNN architectures, namely, InceptionV3, DenseNet201, and Xception. The detection approach based on YOLOv5 exhibited favorable outcomes, achieving mean average precision (mAP) scores of 0.995 and 0.978 at intersection-over-union (IOU) thresholds of 0.5 and 0.95, respectively, when evaluated against the test datasets. In the classification approach, Xception emerged as the most proficient model, attaining an accuracy of 97.70%. To augment the comprehensiveness and reliability of the study, future efforts will entail an expansion of the image dataset, incorporating an increased number of images and classes. This expansion aims to facilitate more robust and precise conclusions. In addition, ensemble methods integrating alternative architectural models will be explored, with the objective of gauging their impact on overall performance. The findings presented in this research contribute to the ongoing advancement of rural and urban area identification in the context of Bangladesh, leveraging computer vision methodologies. The identified limitations and proposed avenues for further investigation establish a foundation for future research endeavors in this domain.

Data Availability
The data used in this study are available upon request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.