Evaluation of a Depth-Based Multivariate 𝑘 -Nearest Neighbor Resampling Method with Stormwater Quality Data

A nonparametric simulation model ( 𝑘 -nearest neighbor resampling, KNNR) for water quality analysis involving geographic information is suggested to overcome the drawbacks of parametric models. Geographic information is, however, not appropriately handled in the KNNR nonparametric model. In the current study, we introduce a novel statistical notion, called a “depth function,” in the classical KNNR model to appropriately manipulate geographic information in simulating stormwater quality. An application is presented for a case study of the total suspended solids throughout the entire United States. The stormwater total suspended solids concentration data indicated that the proposed model significantly improves the simulation performance compared with the existing KNNR model.


Introduction
Human activities in urban areas create a large number of pollutants.These pollutants are carried by stormwater into inland water bodies such as streams, rivers, and lakes, endangering the local ecosystems.During the last few decades, governments and communities have developed strategies to reduce urban stormwater pollution.To meet these objectives, a number of approaches for water quality analysis and modeling such as the environmental probability plot, the box-whisker plot, and the - * model method have been developed; see [1][2][3][4][5].For instance, in 1983, the United States Environmental Protection Agency (US EPA) established the national pollutant discharge elimination system (NPDES), imposing water quality requirements for urban storm sewer systems to secure the environment around water bodies.However, stormwater quality data are inherently difficult to collect and analyze due to their uncertain nature in both the time and space domains; see [6,7].Furthermore, the modeling of stormwater quality generally involves the difficult task of organizing and processing large amounts of spatially referenced data; see [2,8].It is thus essential for modelers and decision-makers to take into consideration the uncertainty in the data.
Monte Carlo simulation (MCS) has frequently been employed in the literature to determine the uncertainty in stormwater pollutant concentrations; see [5,[9][10][11].The general MCS procedure is to fit a probability density function (pdf), for example, the log-normal distribution, to the observed data and then generate "samples" from the fitted pdf model.This traditional approach has a number of drawbacks, such as the limited number of feasible pdfs for stormwater quality data, the large effects of outliers, the limited choice of distributions (other than the normal distribution) for more than one variable, and the bias induced by the normality assumption, especially for the multivariate case; see [4].Furthermore, the limited stormwater quality records hinder the application of the traditional MCS approach, especially in stormwater management.

Mathematical Problems in Engineering
Towler et al. [4] adapted the -nearest neighbor resampling (KNNR) method (see [12][13][14]) to simulate influent concentration scenarios using the information collection rule (ICR) database of the US EPA.The KNNR method applied by Towler et al. [4] for wastewater quality simulation takes into account extensive spatial data to overcome the common limits of temporal concentration datasets.In this method, geographical information (GI) is included as variables along with the concentration variable.The results of Towler et al. [4] indicated that the KNNR model is a good alternative to the traditional parametric MCS approach for regulatory, treatment, and risk assessments regarding concentrations.
However, the way GI is handled as a general variable in Towler et al. [4] might lead to the underperformance of the KNNR model.The primary reason for this is that, as the modeling dimension increases from the insertion of the GI, the model becomes more intricate and subtle.Consequently, the model loses its focus on the water quality concentration variable.Second, the variability of the GI is quite different from the variability of the concentration variables because of the differences in their characteristics.The GI changes only spatially, whereas the pollutant concentration variable varies both temporally and spatially.Third, adding other GI variables such as the altitude and the area of the watershed is not advised in KNNR due to dimensionality problems.
To alleviate this problem, we propose a novel resampling approach based on depth functions (see [15,16]), which adapts a different GI from the target stormwater quality variable in the KNNR simulation model.The proposed algorithm involves the combination of KNNR and a depth function and is denoted as "depth-neighbor resampling" (DNR).It is tested with a stormwater quality dataset in this study.
The study is organized as follows.The background of the KNNR model of Towler et al. [4] and depth functions are presented.The overall model procedure is described in the following methodology section.In the application section, the DNR model and the existing KNNR model are applied to the stormwater quality model, and their results are compared.Finally, the conclusions and final remarks are presented.

Methodology
The KNNR is a simple analogous model for fitting the conditional distribution and then generating simulations from it in a data-driven manner.The KNNR model for water quality simulation was proposed by Towler et al. [4].The generation of an ensemble of the variable of interest () (e.g., pollutant concentration for a given month), conditioned on a feature vector y = [(1), (2), . . ., ()] of  explanatory variables, is based on the evaluation of a conditional distribution, that is, ( | y).Then, the distance between the current feature vector and the feature vectors constructed from observed data is measured, and one of the -nearest neighbors is selected.Finally, the corresponding pollutant concentration value of the selected neighbor is assigned as the simulation value.Towler et al. [4] employed three explanatory variables (i.e.,  = 3): pollutant concentration, latitude, and longitude.Note that the GI is included as explanatory variables through the latitude and longitude.
However, some drawbacks are expected, as discussed in the introduction section, when handling the GI in the form of explanatory variables.A special solution to handle the GI is proposed in the present study by employing the statistical notion of a "depth function." Starting from the half-space depth proposed by Tukey [15], a number of depth functions have been formulated in the literature; see [17][18][19][20][21][22].Although depth functions have been widely used in statistics and econometrics (see [20,23,24]), they have just recently begun to be used in the environmental and hydrological fields; see [16,25].
A depth function is a statistical notion for providing an outward ordering of points in a multivariate framework.The key properties of the depth function are (a) affine invariance, (b) maximality at the center, (c) monotonicity relative to the deepest point, and (d) vanishing at infinity; see [19].
For a given cumulative distribution function  on   ( ≥ 1), the corresponding depth function is any bounded and nonnegative function, denoted as (z; ), which provides an -based center-outward ordering of a point z in   to satisfy the properties mentioned above.A number of depth functions can be described, among which the following two are commonly used.
(a) The half-space depth (see [15]) is defined with respect to a probability, Pr, associated with a distribution  on   as   (z; ) = inf {Pr () :  a closed halfspace that contains z} (b) The Mahalanobis depth (MHD) (see [26]) is defined on the basis of the Mahalanobis distance  2 Ψ (z, ) = (z − )  Ψ −1 (z − ) between two points, z and , with respect to a positive definite matrix Ψ.The Mahalanobis depth is given by where  and Ψ are, respectively, any location and covariance measures corresponding to .
Employing the above depth functions, the basic idea of the proposed model in the current study is to convert the GI (location and longitude) into a real-valued weight vector.This process involves reducing the dimension of the GI variables.In doing so, the information is appropriately handled through the distance measurement in the KNNR model without any increase in the dimension.In other words, the elements of the GI (latitude and longitude) are mapped using a depth function.The MHD (2) is employed in the current study because it is easy to evaluate and flexible to adapt to the current situation and converted to the weight vector for the distance measurement of the KNNR model.
The overall procedure of the depth-based KNNR model denoted as "depth-neighbor resampling" (DNR) as a surrogate of ( | y) for the stormwater quality simulation () is as follows.
(1) The user-specified feature vector is defined as y  = [  ], which is known in advance for the current time .Note that the element of the feature vector is now only with the stormwater quality variable (i.e.,  = 1) unlike the one in Towler et al. [4] as y  = [  , LAT  , LON  ]  (i.e.,  = 3), where LAT and LON represent the latitude and longitude at the location where   is measured.In the current proposed model, the GI (latitude and longitude) is taken into account through the depth function shown below in step (3).
(2) The observed feature vectors of all pollutant concentration available sources are constructed as where   ( = 1, . . ., ) represents the pollutant concentration, which is the variable of interest, and  denotes all the available records of the database employed.The ICR database of the USEPA for Y DB was employed by Towler et al. [4] (see the introduction section).The employed database (Y DB ) for the current study is described in the next section.
(3) The depth function is evaluated by   (g  ; ) in (2), where  = g  , g  = [LAT  , LON  ]  at location, and Ψ is the covariance matrix of g  .Note that g  is the GI where the measurement y  was taken and g  are the GI for the measurements in Y DB .
(5) The distances between the user-specified feature vector y  = [  ] and the observed feature vectors are computed as (6) The distances   are arranged in ascending order, and the first  values are selected.Next, one of these  values is randomly assigned with the selection probability given by (1/)/ ∑  =1 1/, where  = 1, . . ., .This probability gives a higher chance of selection to the nearest neighbor and a lower chance to the farthest neighbors.Suppose the corresponding index of the selected value is assigned as  * .The number of nearest neighbors () is estimated using the heuristic approach (i.e.,  = √) with its theoretical justification; see [12,27].(7) The simulation of the stormwater quality variable can be performed by selecting the stormwater quality variable of the corresponding index as [  * ]; that is,  =   * .5) and ( 6)) with different parameter sets.(8) The steps above (all steps 1-7 except step 2) are repeated to generate the desired number of data points (  ).
To obtain   in step 4 and ( 4), a weight function (⋅) that is positive and monotonically increasing is employed.In the current study, two common weight functions are tested: the one proposed by Lin and Chen [22] and the simple linear weight function (see [16]).The weight function of Lin and Chen [22] is given by with coefficients , , and , where  defines the support of the weight function and  represents the slope of decay to zero.If  = 1, then it is equivalent to the weight function of Zuo et al. [28].The simple linear weight function mentioned in Chebana and Ouarda [16] is expressed as where  1 and  2 are the coefficients with 0 ≤  1 ≤  2 ≤ 1.
A trial-and-error approach was applied in the current study for the parameter estimation of the weight functions.The above two weight functions  LC and  CO are illustrated in Figure 1 using strategically selected coefficients to show their roles.Figure 1 [16] and other studies.Furthermore, it is evident that  1 and  2 are the lower and upper limits in  CO (see the dotted line with circles in Figure 1) beyond which the weights 0 and 1, respectively, are assigned.

Application
The stormwater quality datasets were obtained from the international stormwater best management practice (BMP) database (http://www.bmpdatabase.org/),which has been assembled since 1996 by the American society of civil engineers and the USEPA, as shown in Figure 2. The database was established to foster a better understanding of factors influencing urban stormwater quality; see [3,5].The stormwater quality data as the event mean concentration of total suspended solids (TSS) was retrieved with  ≈ 800 (see ( 4)), as shown in Table 1, because it contains a relatively large number of records (approximately 800).The latitude and longitude were also retrieved as the GI.
As mentioned in the introduction, the KNNR is a simple model based on the conditional distribution ( | y).In Towler et al. [4], the annual average of wastewater influent concentration, longitude, and latitude were used for the conditional variables y  to simulate the corresponding monthly wastewater influent concentration variables.The stormwater quality data used in the present study, however, is rarely available for a continuous full year.Therefore, we use the immediate preceding monthly TSS value as the conditional variable y  instead of the annual average as in Towler et al. [4].In other words, the conditional variable y  to simulate the stormwater quality for a certain month , denoted as   , is the stormwater quality of the preceding month (i.e., y  =  −1 ).Subsequently, Y DB in (3) consists of all the stormwater TSS data in month  − 1.In the current study, the TSS value of the current month is simulated from the proposed DNR model by treating the GI with the depth function.
Among other coefficient sets,  LC with [ = 2,  = 3, and  = 0.8] and  CO with [ 1 = 0.1 and  2 = 0.9] performed comparably well in the application of the current study.Therefore, the results using these coefficient sets for each weight function are presented.The models and parameter sets used in this application are: (1) the KNNR model with the depth function (  in (2)) and the weight function  LC in ( 5) with [ = 2,  = 3, and  = 0.8]: DNR LC ; (2) the KNNR model with the depth function (  ) and the weight function  CO in ( 6) with [ 1 = 0.1 and  2 = 0.9]: DNR CO .
Each TSS value is simulated 500 times with the three models above.Their performances are evaluated through the mean absolute log error (MALE) and the root mean square log error (RMSLE), given, respectively, as: where   is the number of simulated times (here,   = 500 is used) and    is the data generated as a surrogate of the observed data  at the th time.Note that (i) a low value of MALE or RMSLE indicates a better reproduction of the characteristics of the observed data; (ii) RMSLE is more sensitive to outliers than MALE due to the squared formulation; and (iii) the statistics in ( 7) and ( 8) are the logscale version of the mean absolute error (MAE) and the root mean square error (RMSE), respectively, because stormwater quality data are commonly analyzed with log-scaling; see [1,4].
In Figure 3, parts of the generated sequences (20 stations among  ≈ 800, as shown in Table 1 and Figure 2) are illustrated using boxplots along with the corresponding observed TSS data.The typical KNNR model often generates values that are much lower or higher than those corresponding to the observed data.The generated TSS quantities from the proposed KNNR models with depth function (DNR CO and DNR LC ) show better agreement with the observed TSS values  than the ones from the typical KNNR model.The remaining generated values (data not shown) have a behavior similar to that described above.
To check the agreement between the observed and generated data, using all the TSS data from 118 stations, as shown in Figure 4, the MALE and RMSLE in (7) and (8) are employed.These statistics are presented in Figures 5-6 and Table 2.The MALE statistics of the TSS-generated data at each month are illustrated in Figure 5.The DNR CO shows consistently better performance than KNNR during the summer and fall months.The DNR LC presents the best performance compared with the other models, except in July when DNR CO is slightly better.The overall superiority of DNR LC with the average MALE metric over all the simulated values is briefly presented in the second row of Table 1 (the smallest MALE value among the three different models is illustrated).The overall simulation performance of the three models is illustrated through RMSLE, as shown in Figure 6.It shows that the RMSLEs of KNNR and DNR CO are similar, whereas the RMSLE of DNR LC is significantly lower than the former two.The average over all the simulated values of the RMSLE (summarized in the third row of Table 1) supports the results of Figure 6, indicating that the DNR LC model has the lowest RMSLE, whereas KNNR and DNR CO lead to similar performances.This implies that the DNR LC is superior to the other models.
An extensive sensitivity analysis of the coefficients for the weight function  LC in (5) was performed with different parameter sets of  and  while fixing  = 2, as favored  in Chebana and Ouarda [16].This analysis is performed to further investigate the role of the coefficients of the weight function  LC .The analysis of the coefficients of  CO ( 1 and  2 ) is omitted because their role is obvious.The average RMSLE of each parameter set ( and ) from 500 simulations is shown in Figure 7.Note that  is attributed to the shape of the weight function and  is the location parameter beyond which the weight is assigned as the value 1.0 (see Figure 1).Figure 7 illustrates that (1) the RMSLE increases as  increases for all values of the parameter  and (2) the RMSLE decreases as  increases for high values of , whereas the decrease of RMSLE is minimal for low values of .The results indicate that  has a high impact on the performance of the model, whereas  has little impact.Therefore, it is concluded that a low value of the parameter  and a high value of the parameter  are preferred for the currently applied TSS data.

Conclusion
The uncertainty in stormwater quality data has hindered the accurate analysis and physical modeling of water quality.Fitting a parametric pdf model to stormwater quality data is sometimes unfavorable primarily due to the lack of data.As an alternative, the nonparametric KNNR model was recently applied for simulating the ensemble of pollutant concentration data.In the current study, we adapted a statistical approach involving "depth functions" to improve the simulation performance of the KNNR for stormwater quality data.Different weight functions are incorporated with the depth function.The results illustrate that the integration of the depth function in the KNNR model, with an appropriate choice of the weight function and its coefficients, leads to an improved performance in the simulation of stormwater quality data.The weight function  LC is superior to  CO in the simulation performance.
Automatic models for the identification of the optimal weight function and the estimation of its coefficients still remain to be developed.One feasible option to find the optimal weight function and its coefficients is to build an objective function to optimize (such as minimizing MALE and RMSLE) and then using an optimization technique such as a genetic algorithm (see [29]) or an adaptive metropolis algorithm (see [30]) to solve the optimization problem.

4 MathematicalFigure 2 :
Figure 2: Locations of the retrieved stations of stormwater quality.

Figure 3 :
Figure 3: Observed value (circle) and simulated data (boxplot) of log(TSS) for (a) KNNR, (b) DNR CO , and (c) DNR LC in 20 stations as shown in Figure 2. Boxes display the interquartile range (IQR) and whiskers extending to 1.5×IQR.The horizontal lines inside the boxes depict the median of the data.Data beyond the whiskers (1.5 × IQR) are indicated by a plus marker (+).Note that a circle placed inside a box indicates a good agreement between the observed and simulated data.

Figure 4 :
Figure 4: Locations of all the 118 stations of stormwater quality.

Figure 5 :Figure 6 :
Figure 5: Average of MALE for each month.Note that the lower value of MALE indicates better reproduction of the characteristics of all the observed TSS data shown in Figure 4.

Figure 7 :
Figure 7: Overall RMSLE with the different parameter sets [ and ] for the weight function  LC (5) after setting  = 2.

Table 1 :
Selected 20 stations list employed in Figures2 and 3.

Table 2 :
Overall MALE and RMSLE of the three selected nonparametric simulation models.