A Machine Learning Approach for Environmental Assessment on Air Quality and Mitigation Strategy

Air pollution has a significant impact on the environment, with consequences such as global warming and acid rain. Toxic emissions from vehicles are one of the primary sources of pollution. Assessing air pollution data is critical for helping residents locate the safest areas of a city to live in. In this work, density-based spatial clustering of applications with noise (DBSCAN), one of the most widely used clustering algorithms in machine learning, is used. It can find clusters of various sizes and shapes and can also detect outliers. DBSCAN takes two important input parameters: Epsilon (Eps) and Minimum Points (MinPts). Even slight variations in the parameter values fed to DBSCAN make a big difference to the clustering. There is therefore a need to find the Eps value in as little time as possible, and that is the goal of this work. For this purpose, a search tree technique is used to find the Eps input to the DBSCAN algorithm. Predicting air pollution is a complex task because of the dynamic and multifaceted nature of the atmosphere, including meteorological variability, local emissions and sources, data quality and availability, and emerging pollutants. Extensive experiments show that the search tree approach finds Eps more quickly and efficiently than the widely used KNN algorithm. The reduction in the time needed to find Eps becomes more significant as the dataset size increases. The input parameters are then fed to the DBSCAN algorithm to obtain the clustering results.


Introduction
In this section, an introduction to air pollution is given. The release into the air of toxins that are detrimental to humans and the atmosphere is called air pollution. It occurs when toxic or unsustainable concentrations of chemicals, such as gases, particulates, and biological molecules, are released into the environment. It has the potential to cause infections, illnesses, and even death in humans; it can also affect other living species such as animals and food crops, and degrade the natural or constructed environment. Air pollution can be caused by both man-made and natural factors. Air pollution is not constant: various compounds react with one another, resulting in ever-changing compositions of compounds.
An air pollutant is a substance in the atmosphere that can cause damage to humans and the environment [1]. Its constituents can include solid substances, water droplets, or gases. Pollutants or contaminants may be either naturally occurring or man-made. Pollutants are divided into two categories: primary and secondary. Primary contaminants are emitted directly from a source, which may be a natural process, such as volcanic activity, or a human activity. Carbon monoxide gas emitted by automobiles and sulphur dioxide emitted by factories are two such examples. Secondary contaminants are not emitted directly; they form in the air as a result of the reaction or interaction of primary contaminants. Ground-level ozone is a good example of a secondary contaminant. Some contaminants are both primary and secondary in nature, meaning they are released directly and are also produced from other primary contaminants.
Some of the air pollutants [2] are described below.
Smog and soot: These two forms of pollution are very common. Smog, also known as "ground-level ozone," is produced when contaminants from fossil fuels combine with sunlight. Soot, also called "particulate matter," is made up of small fragments of chemicals, dirt, ash, dust, or allergens that are borne in the air as gas or solids.
Sulphur dioxide (SO2): Sulphur dioxide is the largest pollutant in the air. SO2 is primarily generated by burning sulphur-bearing fossil fuels; coal and petroleum usually contain between 1% and 5% sulphur.
Nitrogen oxides (NOx): Among oil, natural gas, and coal, coal contains the highest amount of nitrogen. At high temperatures, nitrogen and oxygen in the atmosphere also react, resulting in significant emissions from transportation and power generation.
Carbon particles and noncarbon particles: The burning of biomass and of fossil fuels, for example in diesel and gasoline generators, is the predominant contributor to carbon particles, in the form of elemental carbon and low-volatility carbon compounds. Moving dust is a significant contributor of carbon-free particles. It is mostly emitted from fossil fuels.
Carbon monoxide (CO): CO is a component of vehicle exhaust that is emitted primarily as a result of incomplete combustion.
Volatile organic compounds (VOC): VOCs, found in the atmosphere and in water vapour, are made up of hydrocarbons, halogenates, and oxygen-containing compounds. Sources include leakage from pressurised systems, such as natural gas and nitrogen, and the volatilization of liquid fuel, for example from fuel tanks.
Ozone: Ozone, an active oxidant and a chief constituent of photochemical smog, is produced when nitrogen oxides react in sunlight, which is a photochemical reaction. It is formed in the atmosphere from gases released by tailpipes, smokestacks, and a variety of other sources.
Particulate matter: Particles suspended in the atmosphere are called aerosol particles or particulate matter (PM). They float in the air and stay there for weeks or even months, depending on their size. Dust is a comparatively large type of particulate matter, 10 microns or greater in size and approximately the thickness of a hair.
PM2.5: Particles with a diameter of less than 2.5 microns are called PM2.5. When such tiny particles are inhaled, they are carried into the lungs, causing health concerns.
PM10: Particles with a diameter of 10 microns or smaller are called PM10. Dust and pollen form PM10, which consists of coarse solid or liquid particles suspended in air.
Some of the other factors that affect air pollution are wind speed (WS), wind direction (WD), air temperature (AT), and sunlight [3].
Wind speed: Pollutants are likely to accumulate in calm conditions, when the wind speed is no greater than 10 kmph. Wind speeds of 15 kmph or higher favour the dispersion of pollution, which clears the air. The greater the wind speed, the greater the dispersion of pollutants and the lower their concentration in the air. But strong winds can also raise dust, a concern in dry, windy regions.
Wind direction: When the wind blows from an industrial region into a metropolitan area, the amount of pollution in the urban area is higher than when the wind blows from another direction, such as from open land. In valleys, the wind determines the direction in which pollutants move, so pollutant levels are higher downwind than upwind.
Air temperature: Air temperature influences air circulation and hence the migration of air pollution. Since the surface of the earth absorbs energy from the Sun, the air closer to the ground is warmer than the air higher in the troposphere. The warmer, lighter air at the surface rises, and the cooler, heavier air in the upper troposphere sinks.
Sunlight: Sunlight has an effect on air pollution. The Sun is a significant source of energy. Electrons are central to chemical activity and transformation. Ordinarily the electrons in a chemical substance remain in a stable state, but when the substance is exposed to radiation of the appropriate wavelength, it absorbs that radiation. This excites the electrons, which opens up a whole new range of possibilities for how the chemical can react. Photochemistry is the study of how light, through its ability to excite electrons, allows a chemical transformation to occur. The absorbed energy is quantized, so the electrons take it up in discrete steps.
Figure 1 shows the percentage of air pollutants. Chemical reactions involving oxygen and primary air contaminants, such as the nitrogen oxides produced by power plants, are triggered by sunlight and elevated temperatures, resulting in the formation of ozone. More ozone is formed when the sun is brighter and the day is hotter. Primary particles are often transformed by heat and sunlight into secondary, smaller particles, which are potentially more harmful.
Air pollution remains at ground level under high ambient pressure, causing pollutant levels to rise. Since high pressure causes a humid atmosphere, heat waves and poor air quality tend to go together. Contaminants are not filtered from the air when there are only mild winds and no moisture; instead, they accumulate just above ground level.
Predicting air pollution using machine learning techniques aligns with various objectives and motivations, leveraging the capabilities of advanced computational models. Some key objectives and motivations for employing machine learning in predicting air pollution are given here.
1.1. Enhanced Accuracy and Precision.
Machine learning models can capture complex patterns and nonlinear relationships within large datasets. Using these techniques can improve the accuracy and precision of air pollution predictions compared to traditional modeling approaches, leading to more reliable information for decision-making.
(i) Real-time and short-term prediction: Machine learning enables the development of models that can provide real-time or short-term predictions of air pollution levels.

Journal of Engineering
Machine learning-based air quality predictions can be communicated in a more accessible and engaging manner to the public. Visualizations and user-friendly interfaces can enhance public awareness and understanding of air quality issues.
In summary, the use of machine learning techniques for predicting air pollution aims to improve accuracy, enable real-time predictions, leverage diverse data sources, adapt to dynamic conditions, and provide valuable insights for decision-makers, ultimately contributing to better air quality management and public health outcomes. The paper is organized into Abstract, Introduction, Literature Survey, Design and Implementation, Results and Inferences, and References.

Literature Survey
Some of the research works on finding the input parameters for the DBSCAN algorithm are described below.
A DBSCAN-K Nearest Neighbor-Genetic Algorithm (DBSCAN-KNN-GA) is proposed by Mu et al. for running DBSCAN on datasets with different density levels, finding the Eps range and MinPts automatically [4]. The Iris, Aggregation, t5.8k, and TLC taxi trip record datasets are used to compare the proposed approach with the original DBSCAN. Instead of the original k-dist graph, which considers the distance from a point to its k-th nearest neighbor, this approach calculates the mean of the distances from a point to all of its k nearest neighbors. The mean gives a smoother curve with less noise, which makes it easier to find the thresholds between density levels. The original DBSCAN uses a global Eps and MinPts, which makes clustering inaccurate for datasets with varying density, so this paper uses the KNN algorithm to determine a range of Eps values. A genetic algorithm is then applied over this range to find the most accurate Eps and MinPts, which are fed to the DBSCAN algorithm as input parameters. Experimental results show that this algorithm is more accurate at detecting parameters automatically in datasets with varying densities. For huge datasets, however, calculating the distance matrix is expensive and the algorithm lacks efficiency. Their future work is to use parallel or distributed computing to reduce the execution time of the algorithm and increase its efficiency.
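The averaging step described above can be illustrated with a small sketch. This is not the DBSCAN-KNN-GA implementation from [4]; it only shows the idea of replacing the k-th neighbor distance by the mean distance to the k nearest neighbors. The function name `mean_knn_distances`, the brute-force neighbor search, and the toy points are assumptions made for illustration.

```python
import math

def mean_knn_distances(points, k):
    """Mean distance from each point to its k nearest neighbours, sorted ascending.

    Brute-force O(n^2) search; illustrative only.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    means = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        means.append(sum(dists[:k]) / k)
    return sorted(means)

# Three nearby points and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
curve = mean_knn_distances(points, k=2)
```

In the sorted curve, the isolated point ends up at the far right with a much larger mean distance than the others, which is the kind of jump a density-level threshold would be read from.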
An algorithm to find Eps automatically, thereby eliminating human interaction, is proposed by Akbari and Unland [5]. The original DBSCAN and the proposed algorithm are compared on artificial datasets with a normal distribution. The method uses the 3-sigma rule (an empirical statistical technique based on the standard deviation) to detect outliers and find the Eps value. MinPts is set to 4, the value conventionally used for two-dimensional datasets. The k-dist values are calculated for each point and sorted in ascending order. Border points might have a negative effect on detecting the Eps value, so the k-dist value of each border point is replaced by the k-dist value of its nearest core point. The mean and standard deviation (SD) of the newly calculated k-dist values are obtained, and Eps is calculated as Mean + 3 * SD. The time complexity of calculating the k-dist values is $O(n^2)$, and $O(n)$ time is needed to calculate the mean and SD. Eps is thus found through outlier detection on a normally distributed dataset: points lying beyond the 3-sigma value are considered outliers or noise. There is no need to find the knee, which is a challenging task. Nested clusters, which the normal DBSCAN cannot detect, can also be detected. The study focuses primarily on using an empirical rule for outlier identification in normally distributed data. Their future work aims at using Chebyshev's inequality for datasets that are not normally distributed.
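The Mean + 3 * SD rule can be sketched as follows. This is a simplified illustration rather than the algorithm of [5]: it omits the border-point replacement step, and it assumes the population standard deviation, since the text does not say which variant is used. The function names are hypothetical.

```python
import math
import statistics

def k_dist_values(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    values = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(dists[k - 1])
    return sorted(values)

def eps_three_sigma(k_dists):
    """Eps = mean + 3 * SD of the k-dist values (population SD is an assumption)."""
    mean = statistics.fmean(k_dists)
    sd = statistics.pstdev(k_dists)
    return mean + 3 * sd

# Hypothetical one-dimensional layout embedded in 2D.
k_dists = k_dist_values([(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)], k=1)
eps = eps_three_sigma(k_dists)
```

The brute-force k-dist computation is quadratic in the number of points, matching the $O(n^2)$ cost quoted above, while the mean and standard deviation take a single linear pass.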
Starczewski et al. put forth a new method for determining Eps and MinPts for DBSCAN for different kinds of clusters [6]. Artificial two-dimensional and three-dimensional datasets with different clusters, shapes, and sizes are used for the evaluation. The basis of the approach is to find sharp increases in the distances obtained with the k-dist function, which calculates the distance between every element in the dataset and its k-th nearest neighbor. The steepest increase is used to select the distance that gives the most appropriate value for Eps. Accurately determining the knee in the sorted distances is essential for choosing Eps. MinPts is selected empirically for each dataset. Based on several experiments, the formulae proposed for finding the input parameters are efficient. The number of noise points or outliers after clustering is small in the proposed approach.
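The idea of reading Eps from the steepest increase in the sorted k-dist values can be sketched as below. This is a deliberately simple heuristic chosen for illustration, not the formulae proposed in [6]; the function name and the choice to return the value just before the largest jump are assumptions.

```python
def eps_from_steepest_increase(sorted_k_dists):
    """Return the k-dist value just before the largest jump in the sorted list."""
    if len(sorted_k_dists) < 2:
        raise ValueError("need at least two k-dist values")
    # Differences between consecutive sorted values.
    jumps = [b - a for a, b in zip(sorted_k_dists, sorted_k_dists[1:])]
    # Index of the largest jump (first one on ties).
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return sorted_k_dists[i]

# Hypothetical sorted k-dist values: three dense points, two sparse ones.
eps = eps_from_steepest_increase([0.5, 0.6, 0.7, 3.0, 3.2])
```

With these values the largest jump is between 0.7 and 3.0, so the sketch returns 0.7: points whose k-dist exceeds it would be treated as lying in a sparser region.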
An algorithm to determine multiple Eps values based on the different density levels in a dataset is proposed in [7]. The evaluation is done on various datasets with uniform and with varying density levels, for different values of k, and an appropriate k value is found experimentally. Automatic generation of Eps for DBSCAN (AGED) is compared with the original DBSCAN algorithm in terms of accuracy, average silhouette width, Dunn index, and Pearson gamma. First, the algorithm computes the influence function (Euclidean distance) from each point to its k nearest neighbors. The local density of each point, which is the sum of the influences between the point and its k nearest neighbors, is found and averaged. A min-max normalization is applied to the averages, and each is placed into a bin together with values of similar average local density. If a bin contains fewer than k elements, it is considered to hold outliers. Otherwise, the average value of each bin is found and a reverse min-max normalization is applied. The values of the bins give all the possible Eps values. MinPts is chosen by trial and error. The algorithm works well for calculating multiple Eps values on datasets with both varied and nonvaried density.
Esmaelnejad et al. proposed a new algorithm for determining a suitable Eps for the DBSCAN clustering method. In [8], Eps is replaced with a new parameter named ρ (the noise ratio of the dataset). The evaluation is done on three datasets obtained from a dataset named Chameleon. This approach does not reduce the number of parameters, but ρ is easier to set than Eps, since in certain cases the programmer knows the dataset's noise ratio in advance. The method introduces a heuristic chart to find Eps. The x axis is the number of nodes and the y axis is a function given for the dataset, plotted for MinPts 6. The number of noise points is obtained by multiplying ρ by the size of the dataset. MinDist for each point is calculated using a proposed formula. If there are points p and q such that MinDist(p) ≥ MinDist(q), where p is noise but q is not, the points lying to the right of the selected x axis value (x) are the noise points. Eps is obtained as (x/√2). The introduction of the parameter ρ is advantageous because (1) in some applications, the users know the noise ratio of the dataset in advance or can compute it easily, and (2) ρ is a relative (and not absolute) measure which is dependent on the distance function. Though the noise ratio is an easily identifiable parameter, Eps is not truly eliminated, because it is only replaced by the noise ratio. The method depends primarily on the value of MinPts, and finding an optimum value of MinPts is a challenge.
A data-driven system for identifying outlier flights was proposed by Jiang et al., focusing on the landing approach of the plane, which has a higher probability of fatalities [9]. First, three checking points and the landing approach output parameters are selected and retrieved from the quick access recorder (QAR) data in the flight operational quality assurance (FOQA) station. Second, if discrete parameters are involved in the training datasets, the density-based spatial clustering of applications with noise (DBSCAN) algorithm is used to detect outliers; if they are not, a one-class support vector machine (SVM) model is applied. Lastly, aircraft that deviate from the group's characteristics, known as outlier planes, are identified. Outlier planes are filtered across the checking points at different heights, offering essential guidance for landing safety management. In their approach, DBSCAN is run with different Eps values while MinPts is kept fixed at 5, identifying the outlier planes in each case and thereby helping to reduce aviation accidents.
Hao et al. present a new approach for detecting outlier data based on the DBSCAN algorithm and the support vector data description (SVDD) model, which increases accuracy on real datasets by removing outliers [10]. To begin with, the data are clustered and noise data are removed using the DBSCAN technique. By then applying the SVDD algorithm to find the outliers in each class, the method considerably reduces the number of wrongly clustered samples in the DBSCAN output. Dismissing the DBSCAN-identified outliers in the first step and the outliers within each class of DBSCAN core points in the second step substantially increases clustering accuracy. Four parameters must be chosen in their approach: Eps, MinPts, the kernel parameter, and the penalty parameter C. These parameters have a big impact on the experimental outcomes, and the optimal parameters differ between datasets. The parameters were set manually, and finding a way to select them automatically is a challenge.
Dharni and Bnasal proposed an improved DBSCAN algorithm which works by splitting the dataset instead of using it as a whole [11]. The datasets used in the experiment are taken from the UCI repository. According to various experiments on different datasets in attribute-relation file format, run with the aid of the WEKA software, the proposed method is effective and more reliable in its clustering results. The following conclusions can be drawn from the experimental findings and analysis of the model. For large datasets, the suggested technique analyzes the clusters efficiently. Because it splits the dataset instead of running over the entire dataset, the enhanced algorithm is more scalable than the original density-based approach. Their future work is to propose an advanced DBSCAN algorithm that speeds up the model using parallel programming and finds appropriate values of Eps and MinPts for big datasets.
Ranjith et al. introduce novel anomaly detection-DBSCAN (NAD-DBSCAN), an unsupervised clustering method for detecting outliers in traffic video surveillance [12]. The method groups the trajectories of moving vehicles of different sizes and shapes. If an incident does not fit the trained model, its trajectory is considered irregular. Eps and MinPts are the parameters needed to determine the clusters for a data point dynamically. The proposed work uses the trajectories of moving objects to identify anomalies such as rule violations, accidents, haphazard driving, and other questionable events. The nonreachable feature of the DBSCAN method is used to track abnormal trajectory directions. The tests are evaluated on a small traffic dataset, and an accuracy of 68.70 percent is obtained. Their future work is the development of trajectory simplification methodologies to improve the accuracy of irregularity detection.
Nabarian et al. use DBSCAN as the clustering algorithm to detect the varying degree of investment capability across different areas [13]. The findings revealed that seven aspects of regional investment capability have a correlation of greater than 50% with investment realization in Indonesian provinces. Provinces were divided into three groups by choosing Eps as 6.8 and MinPts as 3 for potential investment, and 0.85 and 2, respectively, for realization investment. According to the cluster similarities, one approach to growing investment in Indonesia is to raise each province's or region's investment potential and capability. By analyzing the clustering outcomes obtained by DBSCAN, it was concluded that increasing investment potential has a positive effect on investment realization.
A cattle detection method using a pre-trained fast region-based convolutional neural network and the Caffe architecture is introduced by Ismail et al. [14]. Applying a clustering technique to the results of cow identification is suggested for the classification of cows. To determine which clustering approach is better at identifying herds and outliers, the researchers compared the K-means and DBSCAN algorithms. In addition, the Euclidean and Manhattan distances were both used in the experiment to study the impact of the distance metric on clustering. According to the findings, DBSCAN outperforms K-means with respect to herd trend and outlier identification in livestock tracking. Euclidean distance produces a better classification of herds with the K-means method. Manhattan distance, on the other hand, provides improved herd and outlier identification with the DBSCAN algorithm. It is believed that the choice of distance metric affects the identification of herds and outliers differently in the two clustering techniques. Finally, it is concluded that, in conjunction with Smart Farming 4.0, DBSCAN clustering is best suited for livestock tracking for local ranchers.
Ghanbarpour and Minaei propose an extension of DBSCAN (EXDBSCAN) [15]. EXDBSCAN is a DBSCAN-based approach for handling multidensity datasets. It needs the user to input only a single parameter. It can not only detect clusters of multiple densities but also identify the outliers correctly. The approach identifies clusters with various densities by using a specific Eps for each cluster, which resolves the problem that density-based clustering methods such as DBSCAN and OPTICS have with clusters of multiple densities. It employs a greedy algorithm to find a cluster's density and then extends the cluster using the discovered parameter. However, calculating a density for every cluster takes time, so the approach takes longer to run than DBSCAN, which is one of its drawbacks. Nevertheless, a comparison of its final clusters with those of two well-known techniques, DBSCAN and Make Density-Based Cluster, on certain multidensity datasets demonstrates the benefits of the EXDBSCAN approach in identifying clusters and outliers.
Du et al. suggest a novel approach for accurate local outlier identification in large datasets that combines statistical parameters with clustering-based concepts [16]. To begin with, the approach uses the 3-sigma standard to find certain density peaks in the dataset. Second, each remaining data item is assigned to the same cluster as its nearest neighbor of higher density. Finally, the local outliers of every cluster are identified using Chebyshev's inequality and density peak reachability. The outcomes indicate that the proposed approach is effective and reliable at detecting both global and local outliers. Furthermore, the approach has been shown to be more stable than traditional outlier detection approaches such as Local Outlier Factor and DBSCAN. The approach depends on only a single statistical parameter, which can be calculated without a priori domain information, and it is also tolerant to changes in that parameter, so the method is stable under a variety of conditions. The experiments are carried out on both synthetic and benchmark datasets. Their future work is to use the new local outlier detection technique to find irregular GPS trajectories of taxicab drivers. Additionally, they plan to integrate the new method into the Hadoop architecture to increase its performance further.
To address the DBSCAN algorithm's limitation of high time cost, Meng'Ao et al. introduce an updated DBSCAN method based on grid cells [17]. By splitting the data space into grid cells, it improves DBSCAN's very time-consuming region query mechanism and avoids many unnecessary query operations. The effect of the grid cell dividing approach on the algorithm is then studied; selecting the best dividing process improves the algorithm's performance. Experiments on the accuracy and time complexity of the grid-cell-based DBSCAN algorithm show higher accuracy in less time. The improved algorithm retains the benefits of the original method while using the grid to reduce the number of distance computations and increase performance. The experiments suggest that the modified algorithm performs better than the original approach.
Tang and Kim propose a new approach called DBSCAN-MP for determining the input parameters of DBSCAN [18]. In this method, every cluster can have a different Eps and MinPts. As the network environment evolves over time, they suggest a method for updating the model of normal behavior by adjusting cluster sizes or generating new clusters. The findings suggest that the proposed approach is more efficient than similar clustering techniques. It is appropriate for detecting anomalies in network traffic of various sorts. Their findings demonstrate a higher detection rate and a lower false positive rate than techniques that make the same assumptions and use the same dataset, KDD Cup 1999.
For identifying outliers in real-time monitoring trajectories, Dai et al. propose a trajectory outlier identification method using DBSCAN together with velocity entropy [19]. First, unusual subtrajectories are identified using an updated DBSCAN method that detects trajectory outliers from local features. Second, the velocity entropy of the trajectories is compared from a global perspective and the trajectory outliers are identified. The principle of trajectory confidence is suggested as a way to assess the efficacy of trajectory outlier identification, which decreases the rate of false detection in addition to increasing accuracy. Finally, an experiment is carried out, demonstrating that the system outperforms the TRACLUS method. Their future work is to identify trajectory outliers among low-confidence trajectories and to assess the irregularity of trajectories from different perspectives, in order to increase identification accuracy and extract more details from incomplete trajectories.
The noteworthy works presented above clearly illustrate the problems faced in finding the Eps and MinPts parameters and the different techniques some authors have used to tackle them. However, these works do not discuss the time taken to find the input parameters, which is an important aspect.
The AQI [20] of New Delhi, Bangalore, Kolkata, and Hyderabad has been calculated using three different techniques: support vector regression (SVR), random forest regression (RFR), and CatBoost regression (CR). Random forest regression yields lower root mean square error (RMSE) values in Bangalore (0.5674), Kolkata (0.1403), and Hyderabad (0.3826) and higher accuracy in comparison to
This study [21] delves into six years of air pollution data from 23 Indian cities, aiming to conduct an in-depth analysis and prediction of air quality. The dataset has undergone thorough preprocessing, with key features meticulously selected based on correlation analysis. Exploratory data analysis has been employed to unveil hidden patterns within the dataset, pinpointing pollutants directly influencing the air quality index. Notably, a substantial decline in almost all pollutants is evident in the pandemic year, 2020. Addressing the issue of data imbalance, a resampling technique has been applied. The study utilizes five machine learning models for air quality prediction, with the outcomes compared against standard metrics. Notably, the Gaussian Naive Bayes model attains the highest accuracy, while the Support Vector Machine model demonstrates the lowest accuracy. Rigorous evaluation and comparison of these models are conducted using established performance parameters. Among the models employed, the XGBoost model emerges as the most effective, showcasing superior performance and achieving the highest linearity between predicted and actual data.

Design and Methodology
In conclusion, the importance of existing methodologies in avoiding air pollution lies in their capacity to regulate emissions, promote cleaner technologies, encourage sustainable practices, and engage communities. Combining these methodologies with advanced prediction techniques, such as machine learning, can enhance our ability to monitor, understand, and proactively address air pollution challenges. Cluster analysis is among the most powerful unsupervised learning techniques frequently used in data mining and machine learning, aimed at finding important and prospective information in a dataset. DBSCAN is a typical representative of density-based clustering methods; it can find clusters of any shape and helps identify noisy samples. Data points that are dense and close together are grouped into a cluster. Furthermore, DBSCAN has the important benefit that it does not need prior information about the clusters. It is commonly used in many applications because it is not noise-sensitive and because it does not depend on the shape and size of the data, even for high-dimensional data. These benefits are making it an extremely popular approach [16,22]. The clustering performance relies on the values of its parameters. DBSCAN determines whether an object is a core point using two significant input parameters, Eps and MinPts. The appropriate values of MinPts and Eps can vary from one dataset to another.

Dataset Description for Air Pollution Prediction
(i) Data sources: The dataset comprises a combination of sources, including air quality monitoring stations, meteorological data from weather stations, satellite imagery, and land-use information.
(ii) Variables: The key variables in the dataset include hourly measurements of air pollutants such as PM2.5, PM10, nitrogen dioxide (NO2), sulphur dioxide (SO2), ozone (O3), and carbon monoxide (CO). Meteorological parameters include temperature, humidity, wind speed, and atmospheric pressure. Land-use data include information on urban density and green spaces.
(iii) Temporal and spatial resolution: The data are collected at hourly intervals over a span of two years (2019-2021). The spatial resolution covers a metropolitan area, with a grid of monitoring stations spaced at approximately 1-kilometer intervals.
(iv) Geographical coverage: The dataset covers the metropolitan area of Delhi city, encompassing both urban and suburban regions, with a focus on areas with significant industrial and vehicular activities.
The developed model is compared with baseline models, including a naive persistence model and a linear regression model, demonstrating a significant improvement in predictive accuracy.

Limitations and Challenges.
Limitations include potential data gaps in satellite imagery during adverse weather conditions and the inherent difficulty in predicting rare, extreme pollution events. The model may also be sensitive to changes in monitoring station locations.
Some important terms and properties of the DBSCAN algorithm are described below [10].
(i) Eps-neighborhood of a point: The Eps-neighborhood of a point m is the set of objects that lie within a radius Eps of m. Eps is also called the neighborhood radius.
(ii) Directly density-reachable: If point m is in the Eps-neighborhood of point n, and n is a core point, then m is directly density-reachable from n.
(iii) Density-reachable: A point m is density-reachable from a point n with respect to Eps and MinPts if there exists a chain of objects m_1, ..., m_z with m_1 = n and m_z = m such that m_{i+1} is directly density-reachable from m_i.
(iv) Density-connected: A point m is density-connected to a point n with respect to Eps and MinPts if there exists an object o such that both m and n are density-reachable from o.
There are three kinds of points: core points, border points, and noise [11].
(i) Core point: A point is a core point if its Eps-neighborhood includes at least MinPts objects. These are the points in the interior of a cluster.
(ii) Border point: A border point has fewer than MinPts objects within the given radius (Eps), yet it lies within the neighborhood of a core point. In other words, if a point is not a core point and lies in the Eps-neighborhood of a core point, it is called a border point.
(iii) Noise point: Noise consists of those objects which are not included in any cluster. Any point which is neither a core point nor a border point is an outlier or a noise point.
According to these definitions, a density-based cluster is a set of density-connected objects which is maximal with respect to density-reachability.
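The three kinds of points can be illustrated with a short sketch. This is a minimal illustration rather than the implementation used in this work: the function name `classify_points`, the toy coordinates, and the choice to count a point as part of its own Eps-neighborhood are assumptions made for the example.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (illustrative helper)."""
    # Eps-neighborhood of every point; the point itself is counted (assumption).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")   # not core, but inside a core point's neighborhood
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a dense group, one point on its fringe, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

With these toy values the four corner points are core points, (2, 1) is a border point, and (8, 8) is noise.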
DBSCAN determines the clusters by examining the Eps-neighborhood of each object p in the dataset. If the number of points in the Eps-neighborhood of p is at least MinPts, a cluster is created with p as a core point. This technique is applied iteratively to every object that has not yet been categorized. Figure 2 shows the three types of objects [15].
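The iterative procedure can be sketched as follows. This is a simplified, quadratic-time sketch of the standard DBSCAN expansion loop, not the code used in this work; the function name `dbscan`, the label convention (-1 for noise), and the toy points are assumptions for illustration.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    def region(i):
        # Indices of all points in the Eps-neighborhood of point i.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)        # None means "not yet categorized"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:         # not a core point: noise for now
            labels[i] = -1
            continue
        cluster += 1                     # point i is a core point: new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:          # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:     # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (8, 8), (8, 9), (9, 8), (9, 9), (20, 20)]
labels = dbscan(points, eps=1.0, min_pts=3)
print(labels)
```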
In DBSCAN, a simple yet efficient heuristic is used to determine the Eps and MinPts parameters of the smallest cluster in the dataset. The function that maps every point to the distance to its k-th nearest neighbor is called the k-distance (k-dist) function. When the data points are sorted in decreasing order of their k-dist values, the graph of this function gives some indication of the density distribution of the dataset. This graph is known as the k-dist plot. In this graph, the threshold point is the first point in the first valley; points with a k-dist value greater than the threshold are deemed to be noise, and the remaining points belong to one of the clusters. The DBSCAN algorithm flowchart is shown in Figure 3.
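The k-dist function can be sketched with a brute-force computation, in which every point is compared with every other point. The function name `k_distances` and the toy points are assumptions for illustration; the values are returned in decreasing order, as used for the k-dist plot.

```python
from math import dist

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point (brute force).
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return sorted(result, reverse=True)

# Hypothetical toy data: four points in a tight square and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
print(kd)
```

The distant point produces one large k-dist value at the start of the sorted list, while the four clustered points share the same small value. That sharp drop is the kind of feature the k-dist plot is inspected for.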

Journal of Engineering
To find the Eps parameter for DBSCAN, a k-dist plot is drawn using the search tree algorithm and the KNN algorithm. The search tree algorithm works in the following way. It constructs a binary tree, and at each node the data points are split. In the first layer, the first axis is used, and in the second layer, the second axis is used; i.e., considering that the tree starts at layer 1, the X axis is used for comparison at every odd layer of the tree and the Y axis at every even layer. The algorithm is explained in detail with the help of an example.
Suppose there are a few data points (5, 4), (2, 6), (13, 3), (8, 7), (3, 1). For simplicity, two-dimensional data are considered. As shown in Figure 4, the nodes at each layer are X aligned and Y aligned alternately. Now suppose a new data point, say (10, 2), arrives. In the first layer, since it must be X aligned, the X values of the existing node and the new data point are compared. If the latter value is greater than the former, the traversal continues to the right of the tree; otherwise, it continues to the left subtree. In this way, the exact position at which the new node must be inserted into the search tree is found.
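The insertion procedure for this example can be sketched as below. The `Node` class and `insert` function are illustrative names, and sending equal coordinate values to the right subtree is an assumption, since the text does not specify how ties are handled.

```python
class Node:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None

def insert(root, point, depth=0):
    """Insert a 2-D point, comparing X on odd layers and Y on even layers."""
    if root is None:
        return Node(point)
    axis = depth % 2      # depth 0 is layer 1 (X), depth 1 is layer 2 (Y), ...
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:                 # ties go right (assumption)
        root.right = insert(root.right, point, depth + 1)
    return root

root = None
for p in [(5, 4), (2, 6), (13, 3), (8, 7), (3, 1)]:
    root = insert(root, p)

# New point (10, 2): 10 > 5 on X goes right, then 2 < 3 on Y goes left of (13, 3).
root = insert(root, (10, 2))
print(root.right.left.point)
```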
The search tree can also be represented in an X-Y graph as shown in Figure 5. The vertical double-headed arrow represents the splitting of the X axis and the horizontal arrow represents the splitting of the Y axis. The small balls indicate the data points.
Suppose the nearest neighbor of a target data point, say (9, 4), is to be found. The tree is traversed downward using the insertion method, which finds that (8, 7) is a candidate nearest neighbor at a distance r. The line joining the target data point and the candidate nearest neighbor makes some angle with the horizontal axis, as shown in Figure 6. However, there could be a closer neighbor. This is because, while traversing down the tree, the Y value was ignored at the first node, the X value at the next node, and so on. This inconsistency is overcome while recursing back up the tree.
If the line (with a distance r′) that is perpendicular to the boundary of the section that was not visited when traversing the tree is shorter than the line with distance r, then this section might contain a point closer than the previously found candidate nearest neighbor. Figure 7 shows the recursion direction to traverse to the other section, which was not visited, to find the nearest distance to the target data point.
When the tree is recursed back, it checks for the section that was not visited. Let the distance between the unvisited section and the target data point be r′. If that distance is smaller than the best distance found so far (i.e., r′ < r), then traversing continues into that section, and the process is repeated until the condition becomes false (i.e., it stops when r′ ≥ r). During the recursion back, not every node is visited. If the height of the tree is H, then only 2H nodes will be visited, which makes it a logarithmic search with respect to the number of nodes in the tree. Since the distance to every node need not be calculated, unlike KNN, the search tree technique takes less time to find the distance to the k-nearest neighbors (where k = MinPts). Both the search tree and KNN techniques calculate the distance between each point and its k-nearest neighbors and plot the k-distances graph of distances versus sorted data points. The optimal value for Eps is the point at which the graph has the highest slope. The search tree technique works equally well for higher-dimensional data. In this case too, every node will have two branches, i.e., a binary tree will be constructed, but the number of values stored in a node could be arbitrarily large. The system architecture is illustrated in Figure 8.
In the first phase, data are collected and understood. In this work, an air pollution dataset is considered [24]. Five attributes are considered: PM2.5, AT, PM10, WS, and WD. In the data preprocessing phase, the null values are updated and the data are cleaned and scaled. Feature reduction is done using Principal Component Analysis. The attributes are reduced to two principal components (PC1 and PC2), since these account for approximately 75% of the variance. In the next stage, the input parameter Eps is found using the search tree approach and the KNN algorithm. MinPts is set to 10, which is twice the number of dimensions considered. Next, the DBSCAN algorithm is modeled, and finally the performance evaluation is done.

Results and Discussions
The following are the software and hardware requirements used for implementing the work: (i) Software: The program is developed in the Python language using Jupyter notebook. Scikit-learn is used to import the necessary packages. (ii) Hardware: 16 GB RAM, Intel i5 CPU, 64-bit OS.
The first step is to understand the data. The air pollution data set obtained has 32413 rows and 7 attributes. The head of the dataset is shown in Table 1. The attributes are described as follows:
(1) From Date: Timestamp of the start of the hour at which values were collected
(2) To Date: Timestamp of the end of the hour until which values were collected
(3) PM2.5: Particulate matter 2.5 (measured in micrograms per cubic meter, µg/m³)
(4) PM10: Particulate matter 10 (µg/m³)
(5) WS: Wind Speed (measured in meters per second)
(6) WD: Wind Direction (measured in degrees)
(7) AT: Air Temperature (measured in degrees Celsius)
The data types of the attributes are described in Table 2. "From Date" and "To Date" are of type object. The other attributes, PM2.5, PM10, WS, WD, and AT, are floating point numbers. Float64 corresponds to the float data type in Python.
The matrix visualization of missing values of the five attributes affecting air pollution in the dataset is shown in Figure 9. White lines indicate the missing values. Most of the missing values are seen in the PM10 attribute, which has the largest number of white lines. No specific pattern can be observed that would indicate a correlation between missing values; for instance, it cannot be predicted that a missing value in attribute X implies a missing value in attribute Y.
The null values are shown graphically as a bar chart in Figure 10. The bars represent the count of missing values, where the Y axis indicates the number of missing values and the X axis indicates the attributes. The numbers of missing values are 2160 for PM10, 875 for PM2.5, 589 for AT, 313 for WD, and 137 for WS.
As a next step of analysis, a complete description of the variables is derived, which provides other useful statistical information about the data. Table 3 shows the statistical description, which gives the count, mean, standard deviation, minimum value, 25% quartile value, 50% quartile value, 75% quartile value, and maximum value of each attribute in the dataset.
A visualization of how the attribute values have changed over time is shown in Figure 11. The From Date on the X axis starts from 1st January 2017 and goes up to 12th July 2021, and the Y axis represents the values of all the attributes. Since the initial data are unscaled, there are differences in the range of each of the attributes. The legend shows the color assigned to each attribute.
Because the attributes of the dataset are not on the same scale, i.e., each attribute has its own range and magnitude, the whole dataset must be standardised. A rise of 1 unit in PM2.5 is not the same as a rise of 1 unit in PM10, and vice versa. If one attribute has greater fluctuation in its data, it will impact the distance computation significantly. After scaling, every attribute has a mean of zero and a standard deviation of one. The scaled data are shown in Table 4.
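Standardisation of a single attribute can be sketched as below. The helper name `standardize` and the readings are hypothetical values chosen for illustration, not taken from the dataset; the population standard deviation is assumed.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mu, sigma = fmean(column), pstdev(column)   # population standard deviation (assumption)
    return [(x - mu) / sigma for x in column]

# Hypothetical readings on very different scales.
pm25 = [20.0, 40.0, 60.0, 80.0]
ws = [0.5, 1.0, 1.5, 2.0]

z_pm25 = standardize(pm25)
z_ws = standardize(ws)
print(z_pm25)
print(z_ws)
```

After scaling, the two attributes take identical values in this toy case, so neither dominates a distance computation merely because of its original units.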
Once the data are scaled, dimensionality reduction is done using principal component analysis (PCA). The proper number of principal components must be determined. The PCA analysis is shown in Figure 12. The number of principal components is chosen as the number of features that accounts for approximately 75% of the variance. The horizontal red line corresponds to 75% variance, and the vertical red line indicates the number of features required in the PCA, which corresponds to 1.5 features; after rounding up, it is decided that 2 principal components are required.
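The rule of picking the smallest number of components that reaches the variance threshold can be sketched as follows. The explained-variance ratios below are hypothetical illustrative values, not those obtained in this work, and `components_needed` is an assumed helper name.

```python
from itertools import accumulate

def components_needed(explained_ratio, threshold=0.75):
    """Smallest number of leading components whose cumulative variance reaches threshold."""
    for n, total in enumerate(accumulate(explained_ratio), start=1):
        if total >= threshold:
            return n
    return len(explained_ratio)

# Hypothetical per-component variance ratios for five features (illustration only).
ratios = [0.48, 0.29, 0.12, 0.07, 0.04]
n_components = components_needed(ratios)
print(n_components)
```

With these made-up ratios, one component covers 48% of the variance and two cover 77%, so two components are selected.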
The head of the data frame transformed after the creation of the two principal components is shown in Table 5. Once the data set is reduced from five features to two principal components as decided by the PCA analysis, the data are expressed in terms of the two principal components PC1 and PC2.
To find Eps, the k-dist plot is drawn using both the search tree algorithm and the KNN algorithm. The knee point, which is the point at which a sharp increase in the curve in the k-dist graph is observed, is the Eps value to be fed to DBSCAN.
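One crude way to locate such a knee programmatically is to take the k-dist value just before the largest jump in the ascending curve. This is only a sketch of the highest-slope criterion; the function name `knee_by_max_slope` and the sample values are assumptions for illustration, and in practice the knee is often read from the plot by eye.

```python
def knee_by_max_slope(k_dist_sorted):
    """Return the k-dist value just before the largest jump in an ascending curve."""
    jumps = [b - a for a, b in zip(k_dist_sorted, k_dist_sorted[1:])]
    i = max(range(len(jumps)), key=jumps.__getitem__)
    return k_dist_sorted[i]

# Hypothetical k-dist values sorted in ascending order (illustration only).
curve = sorted([0.02, 0.03, 0.03, 0.04, 0.05, 0.60, 0.70])
eps = knee_by_max_slope(curve)
print(eps)
```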

Figure 13 shows the k-dist graph of distances versus sorted data points for the search tree algorithm. The time taken to find the distances and plot the graph is 161 milliseconds. The Eps value is found to be 0.05, which is the knee of the graph. The search tree algorithm takes less time because, for each data point, distances need not be calculated to all other points.
Figure 14 shows the k-dist graph of distances versus sorted data points for the KNN algorithm. The time taken to find the distances and plot the graph is 208 milliseconds. The Eps value is found to be 0.05, which is the knee of the graph. The KNN algorithm takes more time because, for each data point, distances are calculated to all other points.
As we can see from Figure 13, the time taken to find the distances and plot the graph to find the Eps parameter using the search tree approach is less than the time taken to find Eps using the broadly used KNN algorithm shown in Figure 14.
Since the DBSCAN algorithm's performance varies widely with respect to the Eps parameter, determining Eps efficiently and in less time is very important. As the dataset size increases, the ability of the search tree approach to find Eps in less time is of great significance.
A scatter plot of the data points for PC1 and PC2 before clustering is shown in Figure 15. The Eps value found using the search tree approach and the KNN algorithm, together with MinPts, is fed to the DBSCAN algorithm to find clusters and outliers, and a scatter plot is drawn to show the clustering results.
The DBSCAN clustering result is shown as a scatter plot in Figure 16. Three clusters are formed: Cluster 0 is represented by blue points, Cluster 1 by green points, and Cluster 2 by yellow points. The purple points are the outliers. Most of the data points lie in Cluster 0, followed by Cluster 1 and then Cluster 2. The remaining points are noise points.
The accuracy of a clustering algorithm can be measured using the Silhouette score performance evaluation metric, which ranges from −1 to +1, −1 being the worst and +1 being the best score [25]. Other performance metrics for clustering algorithms are the Davies-Bouldin score, the Calinski-Harabasz score, and entropy. Entropy lies in the range [0, 1]; the higher the entropy, the poorer the clustering. The performance evaluation of the DBSCAN algorithm using these 4 metrics is shown in Table 6. According to these values, the clustering obtained with the given Eps and MinPts values gives a satisfactory performance, with all the performance metric values being fairly good in terms of clustering evaluation.
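To make the Silhouette score concrete, the following sketch computes it from its definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the coefficient is (b − a) / max(a, b). The function name `silhouette`, the decision to skip noise points (label −1), and the toy data are assumptions for illustration; library implementations may treat noise differently. At least two clusters are required.

```python
from math import dist
from statistics import fmean

def silhouette(points, labels):
    """Mean silhouette coefficient over clustered points (label -1 is skipped)."""
    clusters = {}
    for p, lab in zip(points, labels):
        if lab != -1:
            clusters.setdefault(lab, []).append(p)
    scores = []
    for lab, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)       # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(dist(p, q) for q in members) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(fmean([dist(p, q) for q in other])
                    for other_lab, other in clusters.items() if other_lab != lab)
            scores.append((b - a) / max(a, b))
    return fmean(scores)

# Hypothetical toy data: two tight pairs and one noise point.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (50, 50)]
good = silhouette(points, [0, 0, 1, 1, -1])
bad = silhouette(points, [0, 1, 0, 1, -1])
print(good, bad)
```

The well-separated labelling scores close to +1, while the labelling that mixes the two pairs scores below zero.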
The performance evaluation of the DBSCAN versus KMeans algorithms is shown in Figure 17.
The Silhouette score, Davies-Bouldin score, and Calinski-Harabasz score of DBSCAN are greater than those of KMeans, whereas the entropy of DBSCAN is lower than that of KMeans. This shows that DBSCAN has clustered the data better than the KMeans clustering algorithm. In contrast to KMeans, DBSCAN does not need the number of clusters as a parameter and detects clusters of any shape and size in addition to detecting outliers. Therefore, based on the experimental results of this work, DBSCAN has outperformed the KMeans algorithm on this dataset.

Comparative Analysis
(iii) Computational complexity: DBSCAN has a lower computational complexity but may still require tuning of parameters, which is done using hybridization.
(iv) Robustness: DBSCAN is robust to outliers and noise due to its density-based nature.
In conclusion, the choice between DBSCAN and other algorithms depends on the specific goals of the air pollution prediction task. If the primary interest is in clustering spatial patterns, DBSCAN is useful.

Conclusions
DBSCAN is a clustering algorithm which helps in applications such as clustering a dataset into different levels of air pollution. Unlike other clustering algorithms, DBSCAN can find clusters and outliers in data of any shape and size, which makes it a broadly used clustering technique. The algorithm requires two input parameters, Epsilon and Minimum Points, which are highly sensitive, in that marginal changes in these values affect the DBSCAN clustering and its performance. Finding Eps in less time becomes essential as the dataset size increases. In this work, a search tree technique and the KNN algorithm are used to find the Eps parameter. It is found that the time taken to find the Eps value using the search tree technique is less than that of the KNN algorithm. This is because, unlike KNN, the search tree technique does not visit every node to find the distances to the k-nearest neighbors, which makes it a better technique for saving time and improving efficiency. A k-dist graph is plotted, and the knee point of the k-dist plot is taken as Eps. MinPts is taken to be twice the number of dimensions. These input parameters are fed to DBSCAN and the clustering results are obtained. The outcomes of this work are early warning systems, optimized environmental policies, public health interventions, data-driven urban planning, and improved monitoring networks. The future scope includes integration with IoT and sensor networks, multi-sensor fusion, health impact assessment, and global collaboration.

Figure 1: The percentage of air pollutants.

Figure 4: Insertion of a data point into the tree.

Figure 5: X-Y graph of data points with splits.

Figure 6: X-Y graph with a target data point.

Figure 9: Matrix visualization of missing values.

Figure 11: Graph of attributes over time.

Figure 13: K-dist plot for search tree algorithm.
(i) Nature of task: If the goal is to identify clusters of pollution sources or regions with similar pollution patterns, DBSCAN is useful.
(ii) Data characteristics: If the data are noisy, and pollution sources are spatially clustered, DBSCAN might provide valuable insights.

Table 1: Head of the dataset.

Table 2: Data type of attributes.

Table 3: Statistical information of the dataset.

Table 5: Head of principal components.