An Improved Estimation for Heterogeneous Datasets with Lower Detection Limits regarding Environmental Health

Analysis of environmental data with lower detection limits (LDL) using mixture models has recently gained importance. However, only a particular type of mixture models under classical estimation methods have been used in the literature. We have proposed the Bayesian analysis for the said data using mixture models. In addition, an optimal mixture distribution to model such data has been explored. The sensitivity of the proposed estimators with respect to LDL, model parameters, hyperparameters, mixing weights, loss functions, sample size, and Bayesian estimation methods has also been proposed. The optimal number of components for the mixture has also been explored. As a practical example, we analyzed two environmental datasets involving LDL. We also compared the proposed estimators with existing estimators, based on different goodness of fit criteria. The results under the proposed estimators were more convincing as compared to those using existing estimators.


Introduction
The environmental studies often encounter the exposure measurements falling below the LDL. These nondetectable observations are considered left censored observations [1,2]. Hughes [3] discussed that the difficulty in modeling the environmental concentration datasets arises when some of the measurements are below the LDL. As the proportion of censored observations may not be trivial, failure to adjust the censoring in the analysis can produce seriously biased results with inflated variances. The most convenient method to adjust the censoring is to replace the censored observation by the detection limit. However, statistical properties of such methods are obscured. As an improvement, Paxton et al. [4] proposed the iterative imputation technique to settle the censoring issue, but this method did not consider the correlated structure of the data and parametric estimates.
Some authors have proposed the standard statistical distributions to model the left censored datasets. For example, Mitra and Kundu [5] used generalized exponential distribution, Bhaumik et al. [6] employed normal distribution, Leith et al. [7] and Jin et al. [8] used log-normal distribution, and Asgharzadeh et al. [9] considered Weibull model to model the left censored data from different situations. However, varying modeling capabilities of these models to model the left censored data has been reported by Vizard et al. [10]. Further, Moulton and Halsey [11] raised several concerns over using a standard statistical model to deal with such data. They argued that these datasets may be highly skewed to make most of the standard models inappropriate for analysis. Further, there may exist an additional subpopulation of the observations falling below the detection limit making the data heterogeneous or multimodal. Following these arguments, Moulton and Halsey [11] suggested the use of gamma mixture model (in their case) to model the left censored HIV RNA dataset. Taylor et al. [12] also emphasized that in case of larger proportion of nondetectable observations, a suitable model for analysis should be explored. They proposed log-normal mixture distribution to model the datasets with LDL. Some other authors including Moulton et al. [13] and Chen et al. [14] have suggested the use of mixture models for analyzing left censored data, especially when the proportion of censoring and skewness is high in the data. The use of mixture models also allows additional variability in the distributional shape as compared to some standard models. A recent article by Feroze and Aslam [15] considered the analysis of left censored mixture of the Weibull model.
Furthermore, most of the authors employing mixture models for analyzing left censored datasets have proposed maximum likely estimation (MLE) to estimate the model parameters. But, as the mixing cause missing data, MLE is often not suitable in estimating mixture models [16]. Following this argument, Hughes [3] proposed EM algorithm assuming normal mixture model to adjust the censoring issue in MLE. But the EM algorithm can also fail when the censoring rate is high [17]. These issues in the estimation of mixture models can be handled using a Bayesian approach. With informative priors, Bayesian estimation (BE) can produce significant gains in efficiencies of the estimates as compared to classical methods [18,19]. The Bayesian estimators (BEs) can be biased but often have smaller mean squared errors (as compared to MLEs) and include additional prior information [20]. These methods are also applicable when the parameter estimates are correlated [21]. However, the Bayesian inference has not yet replaced the classical methods to model the heterogeneous environmental concentration datasets with LDL [22].
The generalized exponential (GE) distribution is very important distribution to model censored datasets. Recently, Gupta and Kundu [23] and Mitra and Kundu [5] identified that this model can be an appealing alternative to many standard models, such as gamma, Weibull, lognormal, loggamma, exponential, and Rayleigh, to analyze the censored datasets. The mixture of GE models for analyzing the censored datasets has been more recently introduced. Teng and Zhang [24] proposed the estimation of model parameters from two-component mixture of GE models (2CMGEM) via EM algorithm. Ali et al. [25] discussed the application of 2CMGEM in modeling right censored data. Mahmoud et al. [26] considered 2CMGEM for obtaining predictions from the censored data. Wang et al. [27] discussed the applicability of 2CMGEM in medical sciences. Similarly, Kazmi and Aslam [28] advocated the employment of GE mixture model to fit various right censored datasets. More recent contributions regarding the analysis of 2CMGEM can be seen from the following works. Aslam and Feroze [29] considered the analysis of 2CMGEM under progressive censoring. Kluppelberg and Seifert [30] reported some important results from the conditional distributions of 2CMGEM. Yang et al. [31] discussed important inferences about 2CMGEM. Various properties of 2CMGEM were investigated. Hence, 2CMGEM is a very relevant model to analyze the censored dataset with mixture or heterogeneous behavior.
We have explored the advantages of Bayesian methods in modeling the heterogeneous environmental concentration data with LDL. First, we have employed the Bayesian mix-ture model approach using simulated datasets. As discussed by Momoli et al. [22], these datasets represent the real data from environmental studies and provide convenience by avoiding issues with empirical comparisons. Further, to investigate the advantages of the proposed methodology in real world, two real datasets regarding environmental concentrations with LDL have been used for the analysis. The 2CMGEM has been used as candidate model to analyze the mixture behavior of these datasets. The comparison of the proposed model with other mixture models, available in the literature to analyze the datasets with LDL, has been made. The comparison among different estimation methods has been discussed. Some advantages of using the proposed methodologies have been observed.
The remaining paper has been organized as follows. Section 2 contains the estimation of model parameters using different estimation methods. In Section 3, the simulated datasets are considered for comparison of proposed estimators. In Section 4, two real examples have been used to discuss the applicability and suitability of the proposed estimators. The findings of the study have been reported in Section 5.

Material and Methods
This section contains the analytical estimation of model parameters under expectation-maximization (EM) algorithm, Lindley's approximation (LA), and Markov chain Monte Carlo (MCMC) method. In case of BE, the squared error loss function (SELF) and entropy loss function (ELF) along with an informative prior have been assumed for the estimation.
The cumulative distribution function (CDF) for the 2CMGEM is Suppose, we generate a sample of size 'n' from 2CMGEM. Further, suppose n 1 = nπ 1 and n 2 = nð1 − π 1 Þ observations are generated from first and second components of the mixture, respectively. Let X k+1 , ⋯, X n are the largest n − k observations about which we have complete information and remaining smallest 'k' observations fall below the predetermined LDL. Since we have incomplete information regarding smallest 'k' observations, they as assumed as censored. Hence, x 1ðk 1 +1Þ , ⋯, x 1n 1 and x 2ðk 2 +1Þ , 2 Computational and Mathematical Methods in Medicine ⋯, x 2n 2 are the observations with complete information from first and second component of the mixture, respectively. So, k 1 and k 2 are the starting observations censored from each component, respectively. Further, m 1 = n 1 − k 1 − 1 and m 2 = n 2 − k 2 − 1 are observed items (with complete information) from the first and second components of mixture, respectively, where x k = min ðx 1ðk 1 +1Þ , x 2ðk 2 +1Þ Þ, n = n 1 + n 2 , k = k 1 + k 2 , and n − k = m 1 + m 2 . The likelihood function for such heterogeneous dataset x = fðx 1ðk 1 +1Þ , ⋯, x 1n 1 Þ, ðx 2ðk 2 +1Þ , ⋯, x 2n 2 Þg with LDL can be written as where Ω = ðλ 1 , θ 1 , λ 2 , θ 2 , π 1 Þ. Putting the corresponding entries, we have the following. The likelihood function for the heterogeneous dataset with LDL using 2CMGEM is 2.2. Expectation-Maximization (EM) Algorithm. In case of missing observations, the EM algorithm can be used obtain the MLEs for the model parameters. We have used the EM algorithm similar to that proposed by Ruhi et al. [32]. LetΘ = ðλ 1 , λ 2 , θ 1 , θ 2 Þ be the model parameters and π 1 the mixing parameter for the 2CMGEM f r ðxÞ where r = 1, 2.
Let us denote the observed random sample from 2CMGEM by x o = ðx 1 , ⋯, x n Þ′. Further, we assume the missing data vectors as x c = ðy 1 ′ , ⋯, y n ′Þ′, where y j is a 2-dimensional indicator vector containing zero-one observation and y jr is one if the x j is from r th component of the mixture and zero elsewhere ðj = 1, 2, ⋯, n ; r = 1, 2Þ. The complete dataset can be defined as x = ðx o ′ , x c ′Þ. The iterations in any EM algorithm contain two steps. The conditional expectation of the complete data log-likelihood for the model parameters is computed in the first step called E-step. The E-step at ðz + 1Þth iteration is where R r ðx j jΘ ðzÞ r Þ is a reliability function for the r th component of the mixture. As (5) is linear iny jr , the E-step is implemented by taking the conditional expectation of Y jr given the observed data x o , where y jr denotes the observations of Y jr . Now, Ε Ω ðzÞ ðY jr jx o Þ = y ðzÞ jr . Using Bayes theorem, the posterior probabilities y ðzÞ jr are presented as Next, the M-step is carried out. In this step, (5) is maximized with respect parameters of the model to estimate Ω ðz+1Þ . Since expressions containing π r and Θ r are not related, they can be maximized independently. We introduce the Lagrange multiplier Δ, with condition ∑ 2 r=1 π r = 1:, to obtain the expression for π r . The differentiation of (5) with respect to π r is placed equal to zero as The sum of (7) over r and with∑ z k=1 y ðzÞ jr = 1, and Δ = −n gives The MLES for Θ r = ðλ r , θ r Þ, r = 1, 2 are obtained using iterative procedures. Both E-step and M-step are iterated for achieving the desired accuracy.
The steps to estimate the parameters of 2CMGEM using EM algorithm are as follows:.

Computational and Mathematical Methods in Medicine
where Under the assumption of independence, the joint prior density from (9)-(11), for the parametric set Ω = ðλ 1 , λ 2 , θ 1 , θ 2 , π 1 Þ can be given as The general form of posterior distribution for the model parameters, combining (4) and (11), is From (13), the posterior distribution for the parameters of 2CMGEM is We have used SELF and ELF for derivation of BEs.
Since the closed form expressions for BEs under the said loss functions are not available, the approximate BE methods have been used to compute the numerical results.

Lindley's Approximation (LA).
In this section, the LA has been proposed for the approximate BE for parameters of 2CMGEM. For sufficiently large sample size, Lindley [33] proposed that a ratio of integrals have the form where WðΩÞ is a function of λ 1 , λ 2 , θ 1 , θ 2 and π 1 , HðΩÞ is the logarithmic of joint prior for the parametric set Ω, and KðΩjxÞ is the log-likelihood function, can be evaluated as where the MLE of Ω is denoted by b Ω, and ω ij is the ði, jÞ th element of the inverse of the matrix f L ij g. Now, consider the log-likelihood function from (4).
Differentiating (18) with respect to models parameters and equating the results to zero, we have 4 Computational and Mathematical Methods in Medicine ∂l ∂l Here, the numerical solutions of (19) to (21) Since, the higher-order derivative form (18) are straightforward, they have not been reported. The inverse of matrix fL ij gbased on second order derivatives is where i, j = 1, 2, 3, 4, 5: ð22Þ Now, the BEs for parameters of 2CMGEM using SELF are The BEs for the model parameters using LA under ELF are 2.3.2. MCMC Method. This sub-subsection discusses the BE for parameters λ 1 , λ 2 , θ 1 , θ 2 , and π 1 by employing MCMC technique. To implement MCMC procedure, we reformatted (9) as where (25), the posterior densities can be written as where Betað:Þ and Gammað:Þ represent bBeta and gamma distributions, respectively.
The samples from the posterior density can be obtained using the following algorithm.
Step-2:replicate Step-2 m times to obtain ðπ The BEs under SELF using MCMC are

Computational and Mathematical Methods in Medicine
The BEs under ELF using MCMC are

Results and Discussions
In this section, we have generated the simulated datasets from 2CMGEM using different sample sizes, different censoring rates, different true parametric values, different hyperparameters and different mixing weights. As all these factors can have the impact on the estimation, we have considered different situations for each of these factors. As discussed by Momoli et al. [22], these datasets represent the real data from environmental studies and provide convenience by avoiding issues with empirical comparisons. Since censoring is major interest in this study, we have focused on the comparison of complete datasets with censored ones. Different estimation methods and two loss functions have been used in order to explore the best combination of estimation method and loss function in our case. The data from the 2CMGEM with LDL have been generated using the following steps. The corresponding results are given in Tables 1-5 and in Figures 1 and 2 given in the followings. The estimates under simulated samples were replicated 10,000 times, and average of the estimates has been reported. In addition, in case of the MCMC method, the estimates were replicated 11,000 times. The starting 1000 replications were not included in the estimation to remove the impact of choice of starting values for the estimation. The Mathematica and R software has been used for numerical computations. Figure 1 shows the impact of change in sample size and censoring rates on the performance of the estimates. Panel (a1) reports the average of relative estimates (AREs) for parameter (λ 1 ) using different sample sizes and under various estimation methods. Panel (a2) presents the amounts of mean square errors (MSEs) for the estimates of parameter (λ 1 ). Similarly, AREs and MSEs for the parameter (θ 1 ) can be seen in Panels (a1) and (a2), respectively. These figures elucidate that the estimates are converging to the true parametric values with increase in sample size. It is also evident from these panels (a1)-(a4) that amounts of MSEs tend to decrease with increase in the sample size. The results also advocated that the performance of the estimates using the MCMC method assuming ELF is superior as compared to their competitors. The results for the estimates of other model parameters also exhibited the similar trend, please see Table 1.
The detailed analysis of the impacts imposed by the increased censoring rate (on the estimates) is fundamental in environmental studies. For that purpose, the simulated data have been used to compare the results under different censoring rates with the case when no censoring occurs. It will provide an idea about how well our proposed estimators are capable of tackling the censoring scenarios. The right panel (c1 and c2) in Figure 1 shows these results. Panel (c1) elucidates the impact of censoring rates when sample size is 20, whereas panel (c2) presents the impacts of censoring rates when sample size is 200. The results in these panels (c1) and (c2) have been reported under MCMC method using ELF, as performance of these estimators was observed to be better than their counterparts. From these panels (c1 and c2), it can be assessed that when the sample is small (n = 20), the increased censoring rates tend to slightly inflate the MSEs for the estimates of the model parameters. However, for the sufficiently large sample size (n = 200), the MSEs for the estimates under complete data and under 30% censored data are very much comparative. Hence, the proposed estimators are efficient enough to deal with left censored environmental datasets.
The sensitivity of the proposed estimators with respect different values of hyperparameters, different mixing weights, and different true parametric values has been investigated in Figure 2. All the reported results are based on the MCMC method assuming ELF due to their supreme efficiency as compared to their competitors. The results have been compared on the basis of values of scaled MSEs (SMEs). The SMEs have been obtained by the ratio of MSEs to the corresponding true parametric values. The comparison of the estimates for complete data and 20% censored data has been considered. The left panel (a1 and a2) of the figure investigates the sensitivity of the estimates with respect to change in the hyperparameters. We have used three sets of the hyperparameters resulting in three different priors. The prior number one (P1) is the main prior which has been 6 Computational and Mathematical Methods in Medicine     assumed for the overall analysis. This prior has been defined on the basis of prior means approach. Using prior means approach the hyperparameters in the prior are chosen in such a manner that the mean of resulting prior is approximately equal to the corresponding true parametric value. The other two priors (P2 and P3) have been specified so that the respective prior means have been changed (increased and decreased) by 20% from that of P1. In panels (a1) and (a2), the x-axis represents the combination of prior and censoring rate. For example, P1 (0%) means the results under prior one (P1) with no censoring and P1 (20%) denotes results under prior one (P1) with 20% censored samples. Panel (a1) elucidates that in case of complete data, the change in prior parameters has a very little affect on the performance of the estimates. On the other hand, for 20% censored samples, the change in prior parameters has resulted in slightly inflated SMSEs. However, when sample size is increased to 200, the estimates under all three priors are quite similar and censoring has very little impact on the efficiency of the estimates. The corresponding numerical results can be seen in Tables 2 and 3.
Keeping the values for the model parameters fixed, the impact of changing the mixing parameter has been presented in panels (c1) and (c2), respectively. The x-axis line denotes Pi1: π 1 = 0:25 with no censoring; Pi2: π 1 = 0:45 with no censoring; Pi3: π 1 = 0:75 with no censoring; Pi4: π 1 = 0:25 with 20% censoring; Pi5: π 1 = 0:45 with 20% censoring; and Pi6: π 1 = 0:75 with 20% censoring. For n = 20, the increase in value of the mixing parameter imposes a positive impact on the estimation of the model parameters from the first component of the mixture, without seriously affecting the estimation for the other component of the mixture. Interestingly, for n = 200, the choice of larger true values for the mixing parameter still produce the improved estimation for the first component of the mixture.

Real Examples
We have used two real environmental datasets with LDL for illustrating the applicability of the methods proposed. The first dataset contains ammonium (NH-4) concentration (mg/L) in precipitation observed at Olympic National Park, Hoh Ranger Station (WA14). This dataset has been reported by Aboueissa [34] and was collected during January, 2009 to December, 2011 on a weekly basis. We named it as dataset-1. The dataset contains high proportion (46 out of 102) of observations which are below LDL. The observations of the dataset are <0.006, <0.006, 0.006, 0.016, <0.006, 0.015,  0.023, 0.034, 0.022, 0.007, 0.021, 0.012, <0.006, 0.021,

12
Computational and Mathematical Methods in Medicine methods have also been compared with the classical methods (MLE and EM algorithm) already used for estimating such datasets. The optimal number of components for the proposed mixture model have also been explored using different goodness-of-fit criteria and test statistics. The graphs showing goodness-of-fit for different mixture models have been presented in Figure 3. From these graphs, it can easily be assessed that 2CMGEM is more representative as compared to other models for both datasets. In case of dataset-1, 2CMGEM seems to be the best model followed by 2CMNM and 2CMGM, while 2CMLNM and 2CMLGM did not provide good fits. On the other hand for dataset-2, again 2CMGEM provides the best fits followed by 2CMLNM and 2CMGM. Conversely, 2CMLGM and 2CMNM did not provided reasonable.
The results have been given in Table 7 and Table 10, respectively. The LRT assumes model with smaller number of components under null hypothesis. Hence, in the comparison of GEM and 2CMGEM, the GEM has been assumed under null hypothesis. While, for comparison on 2CMGEM with high-order mixtures, the model under null hypothesis is 2CMGEM. From the results, it can be seen that 2CMGEM is the best model for modeling each dataset.
The graphical comparison of fits under different estimation methods has been presented in Figure 4, while the numerical comparison has been given in Tables 8  and 11. Figure 4 reveals that the estimates under the MCMC method using ELF exhibited better discription data. However, the slight departure for some points can be due to the particular sample. The departure for the estimates under traditionally used MLE and EM methods are pronounced. The numerical results in Tables 8 and 11 further justified these findings. Hence, the proposed estimators are significantly better than the traditionally used estimators.

Conclusion
The study is aimed at exploring the improved estimation for the heterogeneous environmental datasets with LDL. The earlier studies considered classical analysis (MLEs and EM algorithm) of such datasets with some specific mixture models. We have proposed Bayesian analysis for modeling such data. The employment of 2CMGEM has also been proposed, and the optimality of the proposed model has been discussed. Our results revealed significant gains in efficiencies for the proposed estimators. In addition, the proposed estimators are quite insensitive with respect to LDL, true parametric values, hyperparametric values, and mixing weights. Hence, the proposed estimators are quite convincing alternative to those existing in the literature.
The current methodology can be extended to incorporate the variance analysis, invariant analysis, and covariance analysis. The future work can also be for the cases where the data is right or doubly censored.