Predicting Personal Exposure to PM 2.5 Using Different Determinants and Machine Learning Algorithms in Two Megacities, China



Introduction
Accurate assessment of personal exposures to fine particulate matter (PM 2.5) is essential to study its health effects and provide risk assessments. Direct measurement of personal exposure to PM 2.5 via wearable monitors is currently regarded as the most accurate exposure assessment method [1,2]. However, the collection of personal exposure data is too logistically complicated and expensive for most budget-constrained, large-scale population studies. Instead, the outdoor concentrations from nearby fixed-site monitors are used as a proxy for exposure in many epidemiological studies [3][4][5]. This approximation leads to exposure misclassification because people usually spend more than 80% of their time indoors [6,7], and indoor air quality can differ substantially from that of outdoor environments. This variation is often driven by building ventilation rates and proximity to indoor sources of pollution such as cooking, heating, cleaning activities, tobacco smoking, and other domestic combustion sources [8,9].
To overcome this significant limitation, investigators have tried to develop personal exposure models that account for potential influential factors. Personal exposure surveys have shown that measured PM 2.5 concentrations can be correlated with influencing factors using statistical models that can subsequently be applied to estimate personal exposures of new subjects [10]. The statistical algorithm used in model development is one of the crucial factors influencing the overall predictive power of the model. Multiple linear regression (MLR) has been the most commonly used method for model development because of its low computational cost and the ease of interpreting its results [11,12]. However, MLR models also have disadvantages, such as the inability to capture complex and nonlinear interactions. Increased computing power has enabled the development of advanced machine learning algorithms to overcome some of the shortcomings of MLR models. To date, hundreds of machine learning algorithms have been described in the literature, including tree-based algorithms, artificial neural network (ANN) algorithms, kernel-based algorithms, and Bayesian methods [13]. Recently, machine learning algorithms have been used to accurately predict the concentrations of atmospheric pollutants, and the performance of these algorithms was generally better than that of the MLR method [14][15][16][17][18][19]. However, to the best of our knowledge, the application of machine learning algorithms to estimate personal exposure is still in its early stages [11,20,21,22,23]. The application of this approach in urban areas with a higher burden of ambient PM 2.5 pollution remains understudied [20].
Significant predictors of personal PM 2.5 exposures have been reported to be outdoor and indoor environmental concentrations, meteorological factors, personal and household characteristics, and human activities such as cooking, heating, smoking, and air conditioner and air purifier use [12,23,24,25,26,27]. However, the relative importance of these predictors varied across investigations of different population groups, regions (rural vs. urban), and atmospheric air pollution conditions. In addition to the selection of modeling algorithms, feature selection is another key process that can significantly influence model prediction performance. Excluding effective determinants of personal exposure will reduce predictive accuracy, while including redundant and irrelevant variables may lead to overfitting and decrease the generalizability of the model [28][29][30]. In addition, removing noisy features will decrease the effort associated with collecting information for these variables when the model is applied. Several methods of feature selection are available for MLR algorithms, such as best subset selection and backward and forward stepwise selection. Statisticians have also developed feature selection methods suitable for machine learning algorithms, such as recursive feature elimination (RFE), genetic algorithms, and simulated annealing [31,32]. However, these methods have not been used to develop models for estimating personal PM 2.5 exposures [11,20,21,22,23].
The elderly are among the groups most susceptible to air pollution exposure, owing to generally weaker immune systems or undiagnosed respiratory or cardiovascular health conditions [33][34][35]. However, most exposure studies conducted with elderly participants have been carried out in developed countries with relatively low ambient pollution levels. Unfortunately, the results of these studies cannot be directly extrapolated to elderly populations that suffer from exposures to high levels of PM 2.5 pollution in Chinese cities. To better characterize the exposure characteristics of this population, we conducted a repeated measurement study of outdoor-indoor-personal exposure in Beijing (BJ) and Nanjing (NJ) during 2015 and 2016. Our previous analyses showed that measured personal exposure concentrations were significantly lower than concentrations measured outdoors, confirming that using nearby outdoor PM 2.5 measurements as a direct proxy for personal exposure would inaccurately represent true exposures [12]. Therefore, a validated personal exposure prediction model should be developed, tested, and used to further investigate exposure-health effect relationships in at-risk populations. This analysis has two primary aims: (1) to explore whether the use of machine learning algorithms can improve the accuracy of exposure prediction models and (2) to identify the key variables needed for accurate prediction of elderly PM 2.5 exposures in urban areas with high background pollution levels.

Study Design and Subjects.
A detailed description of this longitudinal panel study of PM 2.5 exposure among the elderly has been reported previously [12]. Briefly, this study was conducted in urban districts of BJ and NJ during both the heating season (HS; Nov.-Mar.) and the nonheating season (NHS; Jun.-Sep.) in 2015-2016. BJ is located in the northern region of China, while NJ is in the southern region, leading to distinct climate types (BJ: temperate monsoon climate, NJ: subtropical monsoon climate). These climate differences result in different heating methods in winter (BJ: centralized heating, NJ: no centralized heating) and different behavioral patterns, including window opening behavior and air conditioning usage, all of which may influence personal exposure. Outdoor-indoor-personal PM 2.5 levels were measured simultaneously for five consecutive days in each season. The sampling periods covered both weekdays and weekends because participants generally exhibit distinct activity patterns on these days [36,37]. Previous studies have also used this sampling strategy of monitoring exposure for 3-7 consecutive days [38][39][40][41][42].
Household characteristics and personal activity factors affecting exposure levels were also collected during this time period. In each city, thirty-three healthy, nonsmoking retired adults were recruited through leaflets placed in residential communities. In BJ, 31 and 30 participants were monitored during the HS and the NHS, respectively, with 85% (28/33) of the participants completing the monitoring in both seasons. Similarly, 31 participants in NJ were monitored during each season, with 88% (29/33) taking part in both seasons. The study was approved by the Human Investigation Committee of National Institute of Environmental Health, China CDC, and all participants signed informed consent.

Measurement of PM 2.5.
Teflon sample filters were equilibrated in a chamber (Binder, Germany) with constant environmental conditions (25 ± 1 °C, 50 ± 5% RH) for a minimum of 24 hours (CN HJ 656-2013) and then weighed using a microbalance with 1 μg precision (XP6, Mettler Toledo International Inc., Switzerland) before and after sampling. Each filter (25 mm, 3.0 μm porosity polytetrafluoroethylene with support ring, Pall Corporation, Mexico) was sampled for five days, and the five-day integrated PM 2.5 mass concentration was calculated by dividing the PM 2.5 mass collected on the filter (μg) by the corresponding air sample volume (m 3 ). These filter concentrations were then used to post-correct and calibrate the corresponding real-time concentrations for each individual sample using the following equation:

$$C = C_0 \times \frac{C_{gav}}{C_{nep}}$$

where $C$ is the corrected real-time PM 2.5 concentration, $C_0$ is the raw real-time concentration from the nephelometer, $C_{gav}$ is the five-day integrated mass concentration measured by the gravimetric method, and $C_{nep}$ is the concurrent five-day mean concentration calculated using the raw real-time nephelometer data. The 24 h time-weighted PM 2.5 concentrations were calculated using these calibrated real-time data.
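The gravimetric correction described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: it assumes the usual ratio form of the correction (each raw reading scaled by the filter concentration divided by the concurrent nephelometer mean), and the function name and sample readings are hypothetical.

```python
from statistics import mean

def correct_realtime(raw, c_gav):
    """Scale raw nephelometer readings so their mean matches the filter value.

    raw   : raw real-time concentrations (ug/m3) over the filter's sampling period
    c_gav : gravimetric (filter-based) concentration (ug/m3) for the same period
    """
    c_nep = mean(raw)  # concurrent mean of the raw nephelometer data
    if c_nep <= 0:
        raise ValueError("nephelometer mean must be positive")
    factor = c_gav / c_nep
    return [c0 * factor for c0 in raw]

# Hypothetical readings, not study data.
raw_readings = [40.0, 60.0, 80.0, 20.0]
corrected = correct_realtime(raw_readings, c_gav=75.0)
```

After the correction, the mean of the real-time series equals the gravimetric concentration, while the temporal pattern of the raw signal is preserved.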
2.3. Ambient Air Quality and Meteorological Data. Ambient PM 2.5 data were retrieved from the China National Environmental Monitoring Center Network, which provides hourly PM 2.5 concentrations from local air quality monitoring stations (AQMS). The straight-line distance between each participant's address and the local AQMS was calculated. Data from the AQMS closest to each participant's address were used to produce 24 h time-weighted PM 2.5 concentrations corresponding to the sampling periods for personal exposure. In addition, meteorological data (temperature, relative humidity, atmospheric pressure, and wind speed) were obtained from government-run monitoring sites in BJ and NJ.
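Matching each home to its closest monitoring station can be sketched as follows. The text does not state how the straight-line distance was computed, so the great-circle (haversine) formula is an assumption here, and the station names and coordinates are hypothetical.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def nearest_station(home, stations):
    """Return the name of the station closest to the home coordinates."""
    return min(stations, key=lambda name: haversine_km(*home, *stations[name]))

# Hypothetical coordinates (latitude, longitude), not the study sites.
stations = {"station_a": (39.90, 116.40), "station_b": (40.00, 116.30)}
home = (39.92, 116.41)
closest = nearest_station(home, stations)
```

The hourly series of the selected station would then be averaged over each 24 h personal sampling window.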
2.4. Questionnaire and Time-Activity Diary (TAD). Prior to deployment of the sampling equipment, a standardized questionnaire was used to gather information on subjects' demographics (e.g., gender, age, and household income), home description (e.g., floors, room volume, building age, number of inhabitants, pet ownership, and primary cooking fuel), and lifestyle (e.g., window opening, cooking and cleaning frequency, and air conditioner and air purifier use), all of which potentially affect personal PM 2.5 exposures. The participants were also instructed to complete a daily TAD during sampling periods. Time-location information, as well as certain pollutant-generating activities (i.e., cooking, cleaning, and environmental tobacco smoke (ETS) exposure), was recorded in the standardized time-based diaries. A global positioning system (GPS) data logger (model BT-Q1000XT, Qstarz International, Taiwan, China) was carried by each participant to collect timestamped position data (latitude, longitude) every 10 s. The recorded GPS track was displayed in Google Maps to verify the trips manually recorded in the TADs. When inconsistencies between TAD recordings and GPS data were identified, the participant was contacted immediately for confirmation. If the inconsistencies could not be clarified with the participant, the more objective GPS data were used for microenvironment identification. Finally, potential predictors of exposure levels and patterns were extracted from the manually inspected pooled GPS-TAD data.
2.5. Quality Assurance and Quality Control. The nephelometer baseline and nominal flow rate of the MicroPEMs were calibrated before sampling and measured again at the conclusion of sampling. Filters were weighed in duplicate, and the values were averaged to obtain the final weight. The duplicate weights were required to be within 4.0 μg of each other; otherwise, the filter was reweighed. Field blanks were collected at a rate of 10% of the samples. The method detection limit (MDL) for the gravimetric method was estimated as three times the standard deviation (SD) of the field blanks divided by the nominal sample volume, and all samples greatly exceeded the MDL (4.3 μg/m 3 ). Field duplicate samples were collected for 6% of the samples. The difference between the time-weighted average PM 2.5 concentrations of duplicate samples was within 10% or 5 μg/m 3 in all cases. During HS, some real-time personal exposure data were lost because of instrument failure of unknown origin, likely related to large temperature swings and the potential for condensation within the MicroPEM. This was more frequently an issue in BJ, which has colder outdoor temperatures during HS (BJ: -8.5 °C to -7.7 °C, NJ: 4.0 °C to 10.4 °C). Additionally, some samples were stopped early at the participant's request or because of scheduling considerations. Therefore, the daily exposure calculated from the calibrated real-time measurements was considered valid only if the sample contained more than 22 h of valid data within a 24 h period. In total, 89% (271/305) and 96% (297/310) of the daily data were included in this analysis for BJ and NJ, respectively.
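The two quantitative rules in this subsection, the MDL estimate and the daily completeness criterion, can be expressed as a short sketch. The function names, blank masses, and sample volume below are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption because the text does not specify which form was used.

```python
from statistics import stdev

def method_detection_limit(blank_masses_ug, nominal_volume_m3):
    """MDL in ug/m3: three times the SD of field-blank masses over the nominal volume."""
    return 3 * stdev(blank_masses_ug) / nominal_volume_m3

def is_valid_day(valid_hours, minimum_hours=22.0):
    """A daily value is kept only if it has more than 22 h of valid real-time data."""
    return valid_hours > minimum_hours

# Hypothetical values, not study data.
mdl = method_detection_limit([1.0, 3.0, 5.0], nominal_volume_m3=3.0)
kept = [hours for hours in (24.0, 22.0, 23.5, 18.0) if is_valid_day(hours)]
```

Note that the criterion is strict: a day with exactly 22 h of valid data is excluded under this reading of "more than 22 h".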
2.6. Statistical Analysis. Five state-of-the-art machine learning algorithms were tested to identify the algorithm most effective at predicting personal PM 2.5 exposure. The selected algorithms are commonly used, rest on different underlying principles, and have been shown to have good predictive ability for estimating outdoor or indoor air quality [10,13,14]. These algorithms included ANN with a single hidden layer, random forest (RF), support vector machine with Gaussian kernels (SVM), extreme gradient boosting (XGBoost), and gradient boosting machine (GBM). The MLR algorithm served as a reference method for comparison of the results. To meet the normality requirements of MLR, all 24 h PM 2.5 concentration data were natural log-transformed. Grid search optimization was used to tune the hyperparameters for each of the machine learning algorithms. To this end, we defined a wide range of candidate values for each of the hyperparameters (Table S1). The model performance for each combination of hyperparameters was evaluated using a cross-validation (CV) method, and the combination with the best performance was selected for the final model.
All candidate predictors are listed in Table S2 and were divided into three categories according to the data source and difficulty of information acquisition: routine monitoring (including ambient concentrations and meteorological factors), basic questionnaire (including personal and household characteristics), and TAD (including time-location information and certain activities). Dummy coding, using the dummyVars function, was applied to the categorical variables because the machine learning algorithms cannot process these variables directly. A series of prediction models were developed with different sets of potential predictors, beginning with those that are easiest to collect (routine monitoring) and followed by increasingly complex data (basic questionnaire and TAD). The improvement in model performance following the inclusion of additional, more complex information was assessed by comparing the models.
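Dummy coding of a categorical predictor can be illustrated with a small stand-in written in plain Python. The authors used the dummyVars function in R; the sketch below is not that function, and the column names and levels are hypothetical. It creates one 0/1 indicator per level; whether to drop a reference level is a separate modeling choice that is not addressed here.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}.{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

# Hypothetical records, not study data.
rows = [
    {"season": "HS", "cooking_fuel": "natural_gas"},
    {"season": "NHS", "cooking_fuel": "electricity"},
]
encoded = one_hot(rows, "season")
```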
The RFE method was applied for feature selection from each set of candidate predictors for the MLR and machine learning-based models. The RFE method is a search algorithm that treats the predictors as the inputs and uses model performance as the output to be optimized. Initially, the algorithm fits the model to all predictors. Each predictor is ranked by its importance to the model. Let $S$ be an ordered sequence of candidate values for the number of predictors to retain ($S_1 > S_2 > \cdots$). At each iteration of feature selection, the $S_i$ top-ranked predictors are retained, the model is refit, and the performance is assessed. The value of $S_i$ with the best performance is determined, and the top $S_i$ predictors are used to fit the final model [31]. The method was implemented with the rfe function of the "caret" package in R (version 3.5.1).
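The search procedure described above can be sketched generically. This is a simplified illustration rather than the caret implementation: the ranking and scoring steps are passed in as functions, and the toy importance values and scoring rule are invented purely to make the example runnable. In practice, the ranking would come from a model fitted to all predictors and the score from the resampled performance of the refitted model.

```python
def recursive_feature_elimination(features, subset_sizes, rank_fn, score_fn):
    """Rank once on all predictors, then score the top-S_i predictors for each S_i."""
    ranked = rank_fn(list(features))  # most important predictor first
    scores = {}
    for size in sorted(set(subset_sizes), reverse=True):  # S_1 > S_2 > ...
        scores[size] = score_fn(ranked[:size])  # refit and assess on the retained set
    best_size = max(scores, key=scores.get)
    return ranked[:best_size], scores

# Invented importance values for illustration only (not study results).
toy_signal = {
    "ambient_pm25": 0.80,
    "outdoor_rh": 0.06,
    "temperature": 0.02,
    "cooking_hours": 0.0,
    "window_hours": 0.0,
}

def rank_by_signal(features):
    return sorted(features, key=lambda name: toy_signal[name], reverse=True)

def toy_score(kept):
    # Stand-in for cross-validated performance: reward signal, penalize model size.
    return sum(toy_signal[name] for name in kept) - 0.05 * len(kept)

selected, scores = recursive_feature_elimination(
    toy_signal, [5, 3, 2, 1], rank_by_signal, toy_score
)
```

With these invented numbers the search retains two predictors, because adding weak or irrelevant variables lowers the score, which mirrors the overfitting argument made in the Introduction.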
To better understand the relative influence of each predictor on model performance, variable importance (VI) scores were calculated and variable importance plots (VIPs) were constructed based on individual conditional expectation (ICE) curves [43][44][45]. This method quantifies VI from the flatness of the ICE curves: the flatter the curves, the lower the relative VI of the predictor of interest [44]. This analysis was performed in R (version 3.5.1) with the "vip" package.
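The idea of measuring importance by the flatness of ICE curves can be sketched as follows. This is not the vip package itself: it assumes one common flatness measure, the standard deviation of each ICE curve averaged over observations, and the prediction function, rows, and grids are hypothetical.

```python
from statistics import mean, stdev

def ice_curve(predict, row, feature, grid):
    """Predictions for one observation as a single feature sweeps over a grid."""
    return [predict({**row, feature: value}) for value in grid]

def ice_importance(predict, rows, feature, grid):
    """Average spread of the ICE curves; a flat curve contributes zero."""
    return mean(stdev(ice_curve(predict, row, feature, grid)) for row in rows)

def toy_predict(row):
    # Hypothetical fitted model: window_hours has no effect on the prediction.
    return 0.8 * row["ambient_pm25"] + 0.1 * row["outdoor_rh"]

rows = [
    {"ambient_pm25": 50.0, "outdoor_rh": 40.0, "window_hours": 2.0},
    {"ambient_pm25": 120.0, "outdoor_rh": 70.0, "window_hours": 6.0},
]
grids = {
    "ambient_pm25": [20.0, 80.0, 140.0],
    "outdoor_rh": [30.0, 50.0, 70.0],
    "window_hours": [0.0, 4.0, 8.0],
}
importance = {
    name: ice_importance(toy_predict, rows, name, grid)
    for name, grid in grids.items()
}
```

A predictor that the model ignores yields perfectly flat curves and an importance of zero, while a predictor that moves the prediction strongly across its grid receives a large score.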
A nested CV strategy was employed to evaluate the performance and generalization errors associated with the prediction models. This method overcomes the bias in performance evaluation caused by information leakage when the same data are used to tune model hyperparameters and evaluate model performance in non-nested CV [29]. The nested CV strategy contains an inner CV loop nested in an outer CV loop. The inner loop is responsible for hyperparameter tuning as mentioned above, while the outer loop is used for error estimation [46]. For our analysis, 10% of samples were used for validation in the outer loop (10-fold CV), and 20% of samples were used for validation in the inner loop (5-fold CV). Measurements from the same participant were forced into the same group in each sampling procedure, which eliminated artificial increases in goodness of fit related to repeated measurements of the same participant. The coefficient of determination (R 2 ), root mean square error (RMSE), and mean absolute error (MAE) between the measured and model-predicted values were calculated and used for model comparison.
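A minimal sketch of nested CV with participant-level grouping is given below. It is an illustration under several assumptions rather than the authors' R workflow: the learner is a one-dimensional k-nearest-neighbor regressor chosen only because it has a single hyperparameter to tune, the data are synthetic, and the fold counts are reduced (3 outer, 2 inner) to suit the small toy data set instead of the 10 and 5 folds used in the study.

```python
from math import sqrt

def group_kfold(groups, k):
    """Yield (train_idx, test_idx) pairs with each participant confined to one fold."""
    unique = sorted(set(groups))
    for fold in range(k):
        held_out = set(unique[fold::k])
        test = [i for i, g in enumerate(groups) if g in held_out]
        train = [i for i, g in enumerate(groups) if g not in held_out]
        yield train, test

def knn_predict(train_x, train_y, x, k):
    """Mean response of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def rmse(x, y, train, test, k):
    """RMSE on the test indices for a model built from the train indices."""
    tx, ty = [x[i] for i in train], [y[i] for i in train]
    errors = [(knn_predict(tx, ty, x[i], k) - y[i]) ** 2 for i in test]
    return sqrt(sum(errors) / len(errors))

def nested_cv(x, y, groups, k_grid, outer_k, inner_k):
    """Outer loop estimates error; inner loop tunes k on the training part only."""
    outer_scores = []
    for train, test in group_kfold(groups, outer_k):
        sub_groups = [groups[i] for i in train]

        def inner_error(k):
            folds = group_kfold(sub_groups, inner_k)
            return sum(
                rmse(x, y, [train[i] for i in tr], [train[i] for i in va], k)
                for tr, va in folds
            ) / inner_k

        best_k = min(k_grid, key=inner_error)  # grid search over the hyperparameter
        outer_scores.append(rmse(x, y, train, test, best_k))
    return outer_scores

# Synthetic data: 6 hypothetical participants with 3 daily measurements each.
groups = [p for p in range(6) for _ in range(3)]
x = [30.0 + 7.0 * i for i in range(18)]
y = [0.7 * v + (i % 3) for i, v in enumerate(x)]
scores = nested_cv(x, y, groups, k_grid=[1, 3], outer_k=3, inner_k=2)
```

Because the held-out participants in each outer fold never take part in tuning, the outer scores are not inflated by hyperparameter selection or by repeated measurements from the same person.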

Results
3.1. Personal Characteristics. The median participant age was 62 and 59 in BJ and NJ, respectively. All participants were nonsmokers, but exposure to ETS was recorded for 12.9% (35/271) and 27.3% (81/297) of person-days in BJ and NJ, respectively. All subjects lived in apartments, and natural ventilation was the only ventilation mode. Window opening was more prevalent in NJ than in BJ owing to differences in climate. Air purifiers were not frequently used and accounted for less than 3% (BJ: 8/271, NJ: 6/297) of monitoring person-days in both cities. Air conditioner use was recorded on 23.2% (63/271) and 16.5% (49/297) of monitoring person-days in BJ and NJ, respectively.
3.2. PM 2.5 Concentrations. Table 1 shows the summary statistics of ambient, outdoor, indoor, and personal PM 2.5 concentrations by city. Though large variations existed within each city, high levels of PM 2.5 pollution were observed in both cities. Overall, 95% of person-day measurements exceeded the World Health Organization (WHO) guideline of 15 μg/m 3 (BJ: 90%, NJ: 100%). Regional differences in PM 2.5 exposures were found. The personal PM 2.5 concentrations in NJ were statistically significantly higher than those in BJ (p < 0.001), which was consistent with the indoor and outdoor PM 2.5 measurements. Figure 1 illustrates the relationships among personal, indoor, outdoor, and ambient measurements. The residential outdoor PM 2.5 concentrations measured by MicroPEM were highly correlated with the ambient levels of the nearest AQMS, with Spearman correlation coefficients of 0.94 and 0.96 in BJ and NJ, respectively. The personal PM 2.5 exposures were most strongly related to indoor PM 2.5 , followed by outdoor and ambient measurements.

3.3. Model Performance with Different Predictors and Algorithms. Table 2 shows the nested CV results for the prediction models based on different algorithms and candidate predictors. Overall, the prediction models performed better for the data collected in BJ than in NJ. Model 1 (including only ambient PM 2.5 and meteorological factors), based on either traditional MLR or machine learning algorithms, performed well, with the CV R 2 ranging from 0.82 to 0.88 in BJ and from 0.76 to 0.80 in NJ. Model performance with different candidate predictors was then compared. The addition of variables from the basic questionnaire (model 2) and TAD data (model 3) did not improve model performance for any of the algorithms and in some instances slightly diminished model accuracy, possibly due to overfitting caused by redundant variables. For example, model 1 based on the RF algorithm has a higher CV R 2 (0.88 ± 0.10) and lower RMSE (16.3 ± 4.5 μg/m 3 ) and MAE (12.0 ± 2.4 μg/m 3 ) than the corresponding model 3 (R 2 : 0.85 ± 0.12, RMSE: 16.3 ± 5.5 μg/m 3 , MAE: 11.6 ± 2.6 μg/m 3 ) in BJ.

Compared with the traditional MLR algorithm, the machine learning-based models performed similarly or slightly better, as indicated by a higher R 2 and lower RMSE and MAE. These results also demonstrated that RF and SVM were the most effective algorithms tested. As shown in Table 2, the CV R 2 of the RF model increased by 7% (from 0.82 ± 0.13 to 0.88 ± 0.10), while RMSE decreased by 18% (from 19.8 ± 5.4 to 16.3 ± 4.5) compared to the traditional MLR approach in BJ. In addition, the lower SD of the model performance metrics suggested that the performance of the RF and SVM algorithms was more stable.

S4 illustrate the relative variable importance in predicting personal PM 2.5 exposure based on different algorithms (model 3). Across all algorithms and cities, ambient PM 2.5 was consistently the most important predictor, and its contribution was much larger than that of any other factor. However, the other variables included in the final models differed considerably between cities and algorithms. For example, outdoor relative humidity (RH) was the only variable included in all models in BJ, while it was less important in NJ, where exposure to ETS played a more important role than any other variable except ambient PM 2.5 .

Discussion
MLR models were used for reference purposes during our development of machine learning algorithms for the prediction of personal PM 2.5 exposures. The nested CV results indicate that our MLR models yielded accurate 24 h exposure estimates. The MLR approach has been used extensively for PM 2.5 exposure prediction in previous studies, but the majority of these studies have been carried out in urban areas of developed regions with low air pollution levels, such as North America and Europe [47][48][49]. Recently, more research studies have been carried out in rural areas of developing countries (e.g., Kenya, India, Lao PDR, and China) [11,21,23,27,50,51]. The predictive ability of the models included in these studies varied greatly, with CV R 2 values ranging from 0.09 to 0.76. Compared with the studies mentioned above, our MLR model displayed stronger predictive ability, as indicated by the higher nested CV R 2 values (BJ: 0.82, NJ: 0.78). This result was mainly due to two reasons. First, the personal exposure levels of our subjects covered a much broader range (BJ: 4.2-285.0 μg/m 3 , NJ: 16.4-218.9 μg/m 3 ) than those studied in developed countries. Second, ambient PM 2.5 was the dominant exposure source for our subjects, and it has been accurately monitored and included in our MLR models. In contrast to our study, strong indoor sources (e.g., solid fuel combustion, cooking fumes, and ETS) and local outdoor sources (e.g., vehicle emissions) also contributed a considerable proportion of exposure for participants in studies conducted in urban areas of developed countries [47][48][49]52] or rural areas of developing countries [11,21,23], and the influence of these sources on personal exposure was difficult to estimate accurately.
A primary aim of this analysis was to explore whether the use of machine learning algorithms could improve the accuracy of PM 2.5 exposure prediction compared to MLR methods. Our analysis found that all five machine learning algorithms we tested could provide accurate predictions, with an R 2 ranging from 0.76 to 0.88 (model 1). The RF and SVM algorithms generally performed better than our MLR models with the same candidate explanatory variables, especially in BJ. To our knowledge, only a few studies have applied machine learning algorithms to predict personal PM 2.5 exposure [11,20,21,22,23]. Among these studies, RF was the most commonly used algorithm. For example, in the Relationships of Indoor, Outdoor, and Personal Air (RIOPA) study, MLR and RF were used to predict chemical elements in 48 h personal PM 2.5 samples. Consistent with our findings, RF analysis performed better than MLR for most elements [22]. In rural Lao PDR, the mean 48 h PM 2.5 exposure concentrations for female cooks were estimated using machine learning models. These models produced an observed vs. CV predicted R 2 between 0.26 and 0.31, and the best candidate learner was RF, followed by cForest [21]. This, along with our findings, suggests that RF is a promising technology for personal exposure estimation because of its ability to uncover and harness complex variable interrelationships to produce more accurate predictions [21]. However, inconsistent results were reported in a study conducted in a rural area of Kenya. In that study, all five tested machine learning algorithms (RF, XGBoost, SVM, Rpart, and Glmnet) performed worse than MLR. The poorer machine learning model performance may be partly explained by the relatively small sample size (~50) and the failure to adopt appropriate variable selection methods [23]. Unlike the analysis presented here, the Kenya study did not adopt a variable selection method specific to machine learning algorithms but instead included the same variables as the MLR model, potentially limiting the predictive ability of the machine learning algorithms. Therefore, a suitable variable selection method is essential to improve the predictive power of models based on machine learning algorithms. In a recent study conducted in Tianjin, a heavily polluted city in northern China, a total of 117 older adults over 60 years of age were recruited and their PM 2.5 exposures measured. Four modeling techniques (time-integrated activity modeling, Monte Carlo simulation, ANN modeling, and the combined use of principal component analysis (PCA) and an ANN model) were evaluated for their ability to predict PM 2.5 exposures in this study setting. The authors found that the combined use of PCA and ANN produced the most accurate results, yielding an R 2 of 0.99 and an RMSE lower than 15 μg/m 3 , while the traditional time-weighted activity modeling showed the lowest correlation with measured values, with an R 2 of less than 0.6. The high accuracy of the model used in that study is very likely attributable to the inclusion of measured indoor PM 2.5 levels as predictors [20]. However, indoor PM 2.5 measurements were not used in our study, since only ambient measurements can be accessed easily. In addition, contrary to the results of the Tianjin study, the prediction accuracy of our ANN model was slightly lower than that of MLR, and PCA preprocessing did not improve the fit of the ANN or any other machine learning-based model.
Our comparison among models developed with different candidate predictors showed that the inclusion of variables from the basic questionnaire, and even the participant's TAD, did not improve prediction accuracy. The variable importance evaluation supported this result. This finding may be of great practical significance, as it shows that the same prediction model performance can be obtained for the elderly without the added burden of gathering those data. However, extrapolating the current results to other age groups requires caution. In our study, the majority of participants were over 60 years old, and almost all of their time was spent at home (~90%), with only a small percentage spent in transit (~3%) or in public places (~3%). It is noteworthy that their time-activity patterns differ significantly from those of other subgroups, such as office workers and school-age children. For such subgroups, factors associated with time-activity patterns, such as commuting status and exposure to indoor pollution sources in public places, might assume greater significance. A study by Rojas-Bracho et al. found that personal PM 2.5 exposures increased by 2.5 μg/m 3 for each hour spent in a motor vehicle [48]. Our real-time PM 2.5 concentration data indicate that personal exposure levels are higher than environmental background levels during cycling or walking, with a personal/outdoor ratio of approximately 1.1 [53]. Moreover, our findings highlight that individuals frequenting restaurants were exposed to elevated levels of PM 2.5 , as evidenced by considerably higher ratios of personal to outdoor PM 2.5 (BJ: 1.48, NJ: 1.37) [53]. This is consistent with previous studies conducted in Seoul [54,55]. Taken together, these observations indicate that differences in time-activity patterns may significantly influence personal exposure models for populations other than the elderly.
In previous studies, exposure to ETS was found to be another important factor affecting overall PM 2.5 exposure [47,48]. However, the ETS contribution to the prediction model is not evident in this analysis. It was reported that exposure to ETS for 1 h would increase the 24 h mean concentration of PM 2.5 exposure by about 4 μg/m 3 [47,48,56]. In our study, only 3.6% and 6.7% of participants in BJ and NJ were exposed to ETS for more than 1 h a day, which means its impact on PM 2.5 exposure levels was far smaller than that of ambient air and may be masked by the variation of ambient PM 2.5 . Cooking can lead to a sharp increase in indoor PM 2.5 levels in a short period of time and was another important contributor to PM 2.5 exposure in previous studies, especially in rural areas [21,23,48]. Chang et al. reported that cooking for 1 h increased 24 h personal exposures to PM 2.5 by about 4 μg/m 3 [47,48,57]. However, it should be noted that the magnitude of the impact cooking can have on overall exposure is also strongly affected by the type of cooking, fuel type, who is cooking (participant or other), ventilation status, and building structures [57]. This suggests that a simple variable such as cooking duration cannot accurately characterize the contribution of cooking to exposure. The TAD results from our study show that the median (P 25 , P 75 ) daily cooking duration in subjects' homes was 1.5 (1.0, 1.9) h and 1.5 (0.9, 2.1) h in BJ and NJ, respectively. Unfortunately, our questionnaires only included a cooking question related to fuel type. Natural gas was the dominant cooking fuel in both BJ and NJ. This uncertainty reduces the ability of the household cooking time variable to predict individual exposure levels. The lack of detailed information on cooking behavior and the high levels of background PM 2.5 pollution have reduced the role of cooking behavior in predicting personal exposure in our study, and future studies should attempt to collect more detailed information on cooking activities and patterns to better understand the potentially important relationship between household cooking and residential exposures.
Window opening was regarded as a predictor related to an increase in indoor and personal concentrations in previous reports [58][59][60], since window opening has a strong influence on the air exchange rate and increases penetration by permitting ambient air to enter the indoor environment. However, we did not find that the inclusion of variables describing window opening behavior (window opening time and window opening width) had a significant impact on the accuracy of our models. A potential reason is that meteorological factors (e.g., temperature and wind speed) indirectly capture window opening status to a certain extent. In fact, our data indicated that more than 50% of the total variation in window opening time can be predicted by temperature, humidity, and wind speed in BJ.
To our knowledge, this is the first study to develop prediction models for personal PM 2.5 exposure using multiple machine learning approaches in urban locations with high levels of ambient PM 2.5 pollution. This study was conducted in two Chinese megacities with uniform study design and measurement methods, and the consistent results between cities indicate that our findings are robust. However, we also note that the models in BJ and NJ did not include the same predictors, which suggests the need to develop city-specific assessment models. There were several limitations of this study. First, our study was only conducted with retired adults residing in urban areas, and as such, caution should be applied when extrapolating our results to other age groups with different time-activity patterns and to people living in rural areas who are exposed to different PM 2.5 sources. Second, the sample size is relatively small, which is not conducive to developing machine learning models, especially neural network models with complex structures. However, even with a relatively small number of training samples, the RF and SVM algorithms show advantages over the traditional MLR algorithm. Therefore, the machine learning approach shows promise for predicting personal air pollution exposures.

Conclusions
Our nested CV results showed that models containing only predictors from routine air quality and meteorological monitoring data can accurately predict the personal PM 2.5 exposures of elderly adults residing in urban areas with elevated levels of air pollution. The addition of individual and household characteristics as well as time-activity information had a limited effect on predictive ability. The comparison statistics between MLR and machine learning models for the same data set indicated that the latter algorithms have advantages over the classic MLR method even at limited training sample sizes. Our results suggest that the machine learning approach could be a promising technology for predicting personal air pollution concentrations.

Figure 1: Correlation matrices of daily ambient, outdoor, indoor, and personal PM 2.5 measurements in BJ (Beijing) and NJ (Nanjing). The upper is the Spearman correlation coefficient, the lower is the scatter plot, and the diagonal is the density distribution plot. *** p < 0.001.

Figure 2: Bar plots of relative variable importance for personal PM 2.5 exposure prediction based on calculating flatness of partial dependence plot curves in BJ (Beijing) and NJ (Nanjing).MLR: multiple linear regression; RF: random forest; SVM: support vector machine; GBM: gradient boosting machine; XGBoost: extreme gradient boosting; ANN: artificial neural network.

Table 2: Nested CV results of prediction models with different algorithms and predictors.