Machine Learning-Based CO2 Prediction for Office Room: A Pilot Study

Architecture and Planning Department, CSIR-Central Building Research Institute, Roorkee 247667, India
Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
Building Energy Efficiency Division, CSIR-Central Building Research Institute, Roorkee 247667, India
Structural Engineering Department, CSIR-Central Building Research Institute, Roorkee 247667, India
College of Computer Science and Information Technology, University of Anbar, Ramadi 31001, Iraq
Department of Hydro and Renewable Energy, Indian Institute of Technology, Roorkee 247667, India
Department of Applied Data Science, Noroff University College, Kristiansand, Norway
Department of Computer Engineering, Sungkyul University, Anyang 14097, Republic of Korea

Air Acts, the Air Quality Act of 1967, and the amendments in 1977 and 1990, all of which focused on pollution control from outdoor sources [9]. Air pollution in urban areas and big cities is primarily due to transportation. Massive investment to develop transport networks and infrastructure, along with a growing economy, is inevitable in developing countries. With the rapid growth of travel demand and better finance and investment options, vehicular air pollution is emerging and dominating other pollution sources in cities. Transportation is the prime sector contributing to air pollution in urban agglomerations, followed by industry and agriculture. This growth leads to more emissions of pollutants into the air. Polluted air has mild to severe adverse effects on all living beings, depending on the duration of exposure, the concentration of pollutants in the air, and the health status of the individual. While most people know the effects of air pollution, fewer are aware that the quality of their indoor air may be worse than that of the outside air. Around the world, an increasing proportion of the population works in office buildings. Existing research on office building IAQ has concentrated on particular concerns such as photocopier and printer emissions as well as some other indoor sources. Workplaces and offices are generally situated near busy roads and marketplaces for economic reasons. Outdoor air pollutants enter the building and raise the concentration of indoor air pollutants (IAPs). IAPs have recently been acknowledged as having an impact on human health equivalent to that of outdoor air pollution [4]. People spend most of their waking time inside different types of buildings and, for the most part, their full sleeping time in residential buildings. IAPs can be classified into three categories [10]: (i) gases, (ii) biological contaminants, and (iii) particulate matter. Adequate outdoor fresh air is necessary to ensure excellent IAQ.
If outdoor air quality is not good, it is difficult to maintain good IAQ in naturally ventilated buildings. Investigations on volatile organic compounds (VOCs), aldehydes, ammonia, particulate matter (PM), and other pollutants in office buildings have been conducted throughout the globe. BASE, IAQ-AUDIT, HOPE, AIRMEX, and OFFICAIR are some of the major milestone projects in the development of the existing knowledge on IAQ in office-built environments [11]. Some of these studies also included energy efficiency, occupant performance, and satisfaction as additional important parameters along with IAQ. Apart from these, several studies have been undertaken to forecast the ventilation performance of buildings and occupants' perceptions [12][13][14]. Low ventilation rates in houses are linked to asthma and allergic symptoms [15]. Inadequate ventilation results in sick building syndrome (SBS), while excessive outdoor air increases the energy demand of buildings to maintain indoor thermal comfort. If the nearby roads are busy, it is not easy to prevent degradation of IAQ when using natural ventilation to flush out stale air with a high concentration of unwanted gases like CO 2 and particulate matter resuspended by workers' activity. The outdoor CO 2 concentration and conditions affect the indoor CO 2 concentration, along with indoor sources such as humans and other indoor anthropogenic activities. It is also not feasible to open windows and doors in office buildings situated in or near noisy areas, as the reduced acoustic comfort hampers the concentration and performance of workers inside the building. Increased IAPs lead to health issues in building occupants.
SBS is a phenomenon in which inhabitants of a structure may experience discomfort along with a variety of health symptoms that cannot be ascribed to a single cause or sickness and which generally improve once they leave that particular building and space [16]. Apart from SBS, some occupants are affected by building-related illness (BRI), whose effects last longer than those of SBS. According to a 2010 report by the WHO [17], IAPs are the main reason behind 2.7% of all diseases globally. Additionally, a 2018 report by the WHO [18] revealed that 3.8 million people die every year from diseases attributable to poor indoor environments. The use of motorised vehicles such as heavy-duty and light-duty vehicles contributes to ambient air pollution caused by traffic activities. Carbon compounds, hydrocarbons, nitrogen oxides, sulphur oxides, particulate matter (PM) with a diameter less than 2.5 μm (PM 2.5 ) or less than 10 μm (PM 10 ), and ultrafine particles (UFP) are only a few of the pollutants released by these vehicles [19]. Through infiltration, natural ventilation, and wind, these pollutants enter the built environment and increase IAP concentrations. Additionally, pollution from indoor sources also contributes to poor IAQ conditions. Worldwide, studies have addressed ventilation strategies to reduce CO 2 and other pollutants using artificial intelligence (AI). Fuzzy logic (FL), AI, and genetic algorithms (GA) are commonly used in intelligent control modelling for enhancing IAQ. Vanus et al. [20] predicted CO 2 levels inside smart homes, considering relative humidity and temperature as inputs, using the decision tree regression method. Pantazaras et al. [21] examined the possibility of developing predictive models suited to particular regions in order to forecast future CO 2 concentrations. Kallio et al. [22] predicted CO 2 in the office environment.
Their study investigated the suitability of four ML approaches for simulating the future CO 2 concentration in an indoor office environment: multilayer perceptron, random forest, decision tree, and ridge regression. The authors found that the decision tree model was as accurate as the computationally more demanding random forest model. Khazaei et al. [23] used ML to predict the concentration of CO 2 in indoor offices. On the basis of the mean-square-error approach, the authors determined that the most accurate model was the four-steps-ahead prediction model, which had an average difference of less than 17 ppm from the actual CO 2 content in the room. Skön et al. [24] modelled the CO 2 concentration in apartment buildings using artificial neural networks, considering temperature and relative humidity as input parameters. Taheri and Razban [25] developed a dynamic indoor CO 2 model to predict CO 2 levels; their data set included temperature, relative humidity, dew point, and CO 2 . The authors compared six learning algorithms: multilayer perceptron (MLP), logistic regression (LR), gradient boosting (GB), random forest (RF), AdaBoost, and support vector machine (SVM). The MLP surpassed the other algorithms in terms of accuracy and can reliably forecast CO 2 behaviour. Mohammadshirazi et al. [26] tested four different ML methods, rolling average, random forest, gradient boosting, and long short-term memory, for the prediction of indoor concentration levels of carbon dioxide, total volatile organic compound, formaldehyde, PM 10 , PM 2.5 , PM 1 , ozone, and nitrogen dioxide. The study concluded that the best approach for forecasting indoor pollutants was consistently long short-term memory, while the optimum combinations of input factors varied depending on the pollutant of interest.
The predicted results show that the LSTM training and validation MSEs from interpolated data varied from 0.001 to 0.007 and from 0.001 to 0.003, respectively. ML models were used by Lillstrang et al. [27] to forecast indoor air quality in smart campuses. Predicting energy loads and inferring space occupancy status are critical activities that increase building energy efficiency and user comfort. The findings can be used to assess and improve the quality of sensor-based indoor data used in machine learning models, to determine whether a data set is representative enough to build a model that is robust under changing building conditions, and to determine the appropriate number of sensors per space when constructing an indoor wireless sensor network. The proposed work uses artificial neural networks (ANN) and other machine learning methods to forecast the CO 2 level inside an office building, as elevated CO 2 concentrations affect human health. Every machine learning method has its pros and cons, and the performance of individual machine learning models depends on the type and size of the data set. Various studies are available on predicting the CO 2 level inside buildings; however, an accurate mathematical model to determine the quantity of CO 2 emission is difficult to derive, as the input parameters are complex in nature. Therefore, monitoring and prediction of the CO 2 concentration inside buildings is an important aspect. The objective of this study is to address the research gaps identified from the selective literature review by using ANN and other ML methods to predict the CO 2 concentration inside the office building. A performance comparison of the different machine learning models used for predicting the CO 2 level inside the building is also presented.
The main contributions of this study are listed below:

(i) Identified the critical parameters used as input for predicting the CO 2 concentration
(ii) The identified critical parameters are the number of occupants, area per person, outdoor temperature, outer wind speed, relative humidity, and air quality index
(iii) Collected real-time CO 2 concentration data from the office building
(iv) Modelled six machine learning algorithms, namely, artificial neural network (ANN), support vector machine (SVM), decision tree (DT), Gaussian process regression (GPR), linear regression (LR), and ensemble learning (EL), and four optimized machine learning algorithms (GPR, EL, DT, and SVM) for predicting the CO 2 concentration inside the office building

The work in this research article is divided into five sections: Section 2 provides the details of data generation, data normalization, and the performance indices used to evaluate the predictions of the ML models. Section 3 describes all the machine learning (ML) approaches. The results and discussion are presented in Section 4, and the conclusion of this study is presented in Section 5.

Materials and Methods
This article studies the CO 2 concentration inside an office room arising from internal emissions, exterior transportation movement, and industrial emissions. A total of 169 data sets were used to construct the prediction models, which include the input variables temperature, relative humidity, air quality index, wind speed, occupancy, and area per person, and one output variable, carbon dioxide. The concentration of carbon dioxide inside the room is mainly affected by the occupancy, as humans themselves are an emitting source. The CO 2 concentration inside any building also depends upon the exterior environment surrounding the building. Buildings near industrial areas and busy roads are most affected by pollutants.

Data Generation.
The data used in this work was generated in a lab of CSIR. The area of the office room was approximately 24 m 2 . The office is situated on the ground floor and contains one window on the north-facing wall with a width and height of 2.5 m and 1.5 m, respectively. The office room contains two doors; the dimensions of door 1 and door 2 are 1.2 m × 2.9 m and 0.9 m × 2.0 m, respectively. The 3D diagram of the office room with the two doors, the window, and the arrangement of furniture is shown in Figure 1. The seven collected parameters are indoor carbon dioxide (CO 2 ), number of occupants (O), area per person (A), outdoor temperature (T o ), outer wind speed (W S ), relative humidity (RH), and air quality index (AQI). The data were collected six times a day. The timing of the data collection is shown in Figure 2, where the watch represents the office hours of 9 AM to 6 PM. One additional reading was recorded each day one hour after office hours, at 7 PM. The maximum observed CO 2 level inside the office room was 572 ppm, while the minimum value observed was 445 ppm. Other statistics of the collected data, such as minimum value, maximum value, mean, standard deviation, kurtosis, and skewness for the input and output database, are shown in Table 1. Figure 3 shows the distribution of all data in terms of the parameters contributing to CO 2 .

Evaluation Criteria.
For evaluating the accuracy of the ML models, seven commonly used performance indices were employed: correlation coefficient (R), mean absolute error (MAE), root mean square error (RMSE), mean square error (MSE), mean absolute percentage error (MAPE), the Nash-Sutcliffe (NS) efficiency index, and the a20-index. Equations (1)-(7) [28] represent the performance indices considered in this study.
where T is the number of samples in the data set, A s is the measured values, P s is the predicted values, and P s is the mean of the predicted values. m20 is the number of samples for which the ratio of measured to predicted values lies between 0.8 and 1.2.
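For illustration, the indices above can be computed directly from their standard definitions. The following Python sketch uses our own function and variable names with synthetic values; it assumes the a20 ratio is measured/predicted, as described above.

```python
import numpy as np

def evaluate(actual, predicted):
    """Performance indices used in this study (standard definitions;
    names and structure are our own illustrative choices)."""
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    err = a - p
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / a)) * 100
    r = np.corrcoef(a, p)[0, 1]                              # correlation coefficient R
    ns = 1 - np.sum(err ** 2) / np.sum((a - a.mean()) ** 2)  # Nash-Sutcliffe efficiency
    ratio = a / p
    a20 = np.mean((ratio >= 0.8) & (ratio <= 1.2))           # a20-index: share within ±20%
    return {"R": r, "MAE": mae, "RMSE": rmse, "MSE": mse,
            "MAPE": mape, "NS": ns, "a20": a20}
```

A perfect prediction yields R = 1, NS = 1, a20-index = 1, and zero error terms, which matches the interpretation of these indices used in the results section.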

Normalization of Selected Data.
Data normalization was performed to decrease undesirable feature scaling effects and increase computational stability. In this work, the data was normalized to the range 0 to 1 using equation (8) [29].
where y I,i is the measured value of the I th input (I = 1, 2, 3, 4, 5, 6) in the i th record (i = 1, 2, ⋯, 169), and y I,max and y I,min are the maximum and minimum values of the I th input, respectively.
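Equation (8) is standard min-max scaling, applied per input column; a minimal sketch (the function name is ours) is:

```python
import numpy as np

def min_max_normalize(x):
    """Min-max scaling to [0, 1] per column, as in equation (8):
    y_norm = (y - y_min) / (y_max - y_min)."""
    x = np.asarray(x, float)
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
```

Applied to the CO 2 column, for example, the observed minimum (445 ppm) maps to 0 and the maximum (572 ppm) maps to 1.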

Machine Learning Algorithms
Various studies in the literature have measured the concentration of CO 2 inside different types of buildings. In this study, the six ML algorithms listed above (ANN, SVM, DT, GPR, LR, and EL) are applied; in addition, optimized algorithms such as GPR, SVM, EL, and DT are also used. Each model's predictions were analysed and compared to find the most accurate model.

Artificial Neural Network (ANN).
The groundwork for the field of ANN was laid in the late 19th and early 20th centuries, mostly through multidisciplinary work in psychology, neurophysiology, and physics. This early study focused on general learning, vision, conditioning, and other ideas, rather than particular mathematical models of neuron activity. The field of neural networks has been revitalized as a result of newer advances. Many studies have been published in the previous two decades, and many different forms of ANNs have been studied. ANN models were first employed in the ecological field in the early 1990s, but they became increasingly popular in the late 1990s. Around 10^10 neurons, or computing elements, make up the human brain, interacting via a connecting network. ANNs are parallel distributed computing networks that share certain fundamental features with biological neural systems. Neurons receive a variety of signals as input (X = [x 1 , x 2 , ⋯, x n ]). Every input is given a relative weight (W = [w 1 , w 2 , ⋯, w n ]), which influences its impact. The strength of the input signal is determined by the weights, which are adjustable coefficients inside the network. The summation block, which roughly corresponds to the biological cell body, generates the neuron output signal (NET) by algebraically summing all of the weighted inputs. The basic structure of an ANN is presented in Figure 4.
Several types of ANNs have been produced over the last 10-15 years; however, two primary groups may be distinguished based on how the learning process is carried out. In supervised learning, a "teacher" tells the ANN how well it performs, or what the correct behaviour would have been, throughout the learning phase. In unsupervised learning, the ANN independently examines the properties of the data set and learns to reflect these properties in its output. Further relevant information regarding ANNs is given in the literature [30].
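As a hedged illustration of a supervised feedforward ANN of the kind used later in this study, the scikit-learn sketch below trains a single-hidden-layer perceptron on synthetic data; the neuron count, data, and names are our assumptions, not the study's configuration (the paper does not state its implementation).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(169, 6))   # six normalized inputs, 169 records
y = 445 + 127 * X.mean(axis=1)         # synthetic CO2-like target (ppm)

# Single hidden layer, as in the study; the neuron count (here 10) would be
# tuned by trial and error over the 5..20 range described in Section 4.
ann = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
ann.fit(X, y)
pred = ann.predict(X)
```

The weighted-sum-then-activation structure described above is exactly what each hidden neuron computes here.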

Support Vector Machine (SVM).
SVM is a relatively new concept in the field of environmental science. In comparison to other disciplines, researchers employing remote sensing in environmental and ecological applications adopted SVMs early, possibly due to the prompt growth of data-intensive technologies and the accompanying gap in the development of analytical tools. The application of SVMs in the environmental research domain has increased in recent years. SVMs are used in the detection of pollution, the mapping of contaminated areas and disease distributions, and air quality estimates. Indeed, whenever there is high-dimensional data and a related lack of understanding about the underlying distribution, SVMs offer tremendous potential to resolve the ensuing data processing issues. The graphical representation of SVM regression is shown in Figure 5. Support vector classification is based on a specific form of statistical learning machine, supported by Vapnik's theory. Except for the assumption that the data are identically distributed and independent, support vector classification makes no assumptions about the underlying population's distribution. Furthermore, rather than estimating error via asymptotic convergence to normality, SVMs use theorems bounding the true risk in terms of the empirical risk. As a result, even with small sample sizes, reliable estimates of the prediction error may be obtained without making any distributional assumptions. The ideal machine strikes a compromise between consistency on the training set and generalization to future data. Furthermore, SVMs allow us to avoid the degraded computing efficiency that is common in high-dimensional problems. Support vector classification is a suitable choice for the typically noisy, high-dimensional, and chaotic data encountered in environmental research because of these key features. More details of SVM can be found in [32].
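A minimal sketch of SVM regression (ε-SVR with an RBF kernel, the variant depicted in Figure 5) in scikit-learn; the hyperparameter values and synthetic data are illustrative assumptions, not those of the study.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 6))   # six normalized inputs
y = 500 + 50 * X[:, 0]                 # synthetic target

# epsilon-SVR: points within the epsilon-tube incur no loss; only points
# on or outside the tube become support vectors.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
n_sv = len(svr.support_)               # number of support vectors retained
```

Because the fitted function depends only on the support vectors, the model stays compact even in high-dimensional input spaces, which is the efficiency property noted above.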

Decision Tree (DT).
One approach to demonstrate the links between samples in classification is to display them visually in the form of a "phylogenic tree." The tree-like structure represents the feature space in a hierarchical manner and helps to create links between the data: The characteristics identify the leaves of the tree, and all branches join together at the base, just as they do in a real tree. A DT, unlike a genuine tree, is generally portrayed as growing from top to bottom. Beginning at the root and working through the branches to a leaf that identifies the class, the membership of unknown data can be determined.
A DT is also a technique for encoding a series of choices produced by applying a set of classification rules in sequential order to distinguish data. This method of classification has the benefit that, at least for small sets of rules, a graphical explanation of the set of rules is typically simple to comprehend. A decision tree is induced, or created, by discovering such rules through methodical examination of the behaviour of a set of known instances. More details of DT can be found in [33].
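A hedged sketch of tree induction for regression in scikit-learn; min_samples_leaf loosely corresponds to the "minimum leaf size" hyperparameter reported in Section 4, while surrogate splits (an option in some toolboxes) have no direct scikit-learn equivalent. Data and names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(169, 6))
y = 445 + 127 * X[:, 0]

# The tree is induced from known instances: each internal node encodes a
# threshold rule on one input, and each leaf stores a predicted CO2 value.
dt = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)
pred = dt.predict(X)
```

Reading a prediction traces the root-to-leaf path of sequential rules described above.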

Gaussian Process Regression (GPR).
GPR is a nonparametric, Bayesian approach to regression and is used frequently in the field of ML. GPR offers numerous advantages, including the capacity to deal with small data sets and to provide uncertainty assessments on predictions. "Gaussian process regression is nonparametric (i.e., not constrained by a functional form); rather than computing the probability distribution of the parameters of a single function, GPR computes the probability distribution over all admissible functions that fit the data. However, in order to calculate the posterior using the training data and compute the predictive posterior distribution on our points of interest, a prior specification (on the function space) is required." More information related to GPR can be found in [34]. Various studies on GPR are available in the literature for forecasting various parameters. GPR models have been widely used in ML applications because of their representational flexibility and inherent uncertainty measures over predictions.
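A minimal GPR sketch in scikit-learn: the RBF length scale plays the role of a kernel width (cf. the parameter S in Section 4), and WhiteKernel's noise_level that of the Gaussian noise term ε; the values and data here are illustrative assumptions, not the study's.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(60, 6))
y = 500 + 40 * np.sin(4 * X[:, 0])     # small synthetic data set

# The kernel is the prior over functions; fitting computes the posterior,
# so each prediction comes with an uncertainty estimate (std).
kernel = RBF(length_scale=0.4) + WhiteKernel(noise_level=0.07)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
mean, std = gpr.predict(X, return_std=True)
```

The per-point std output is the "inherent uncertainty measure over predictions" mentioned above, which parametric models do not provide directly.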

Linear Regression (LR).
In statistics and ML, LR is one of the most well-known and well-understood techniques. In many scientific studies, numerous variables (or measurements) are gathered for each individual or unit investigated. Regression analysis is a statistical technique for predicting the value of one (or more) variables from a set of others. Simple LR, multiple LR, and multivariate LR are the three forms of linear regression. The simple linear regression model is linear in its parameters; this model, often known as a straight-line model, is fitted using the least-squares method. When we want to model the link between one response variable and more than one regression variable, we use multiple regression analysis. When we have more than one response variable and want to model the link between these variables and a collection of regression variables, we use multivariate multiple regression analysis. For more information, the reader should refer to a statistics textbook and [35].
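A minimal multiple linear regression sketch in scikit-learn, fitted by least squares on synthetic, noiseless data (all names and values are ours):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(169, 6))   # six regressors, as in this study
y = 450 + 30 * X[:, 0] + 20 * X[:, 1]  # exactly linear synthetic target

# Multiple linear regression: one response, several regressors,
# coefficients found by ordinary least squares.
lr = LinearRegression().fit(X, y)
r2 = lr.score(X, y)                    # coefficient of determination
```

On a noiseless linear target the least-squares fit recovers the generating coefficients, so R² is 1 up to floating-point error.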
3.6. Ensemble Learning (EL). EL is the process of intentionally generating and combining numerous models, such as classifiers or experts, to tackle a specific computational intelligence issue. Ensemble learning is generally used to increase the performance of a model (classification, prediction, function approximation, etc.) or to minimize the chance of an unintentional selection of a poor one. Other uses of ensemble learning include providing a confidence level to the model's conclusion, nonstationary learning, data fusion, selecting optimal features, error correction, and incremental learning. The best-fitted model in the EL algorithm was the boosted tree, as shown in Figure 6 [36].
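As a hedged sketch of a boosted-tree ensemble (the best-fitted EL variant reported above), using scikit-learn's gradient boosting with default hyperparameters on synthetic data; the paper does not specify its boosting configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(169, 6))
y = 445 + 127 * X[:, 0]

# Boosting builds many small trees sequentially, each one fitted to the
# residual errors of the ensemble so far, then sums their contributions.
el = GradientBoostingRegressor(random_state=0).fit(X, y)
pred = el.predict(X)
```

Combining many weak trees this way is precisely the "generating and combining numerous models" idea that defines EL.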

Results and Discussion
4.1. Implementation of Machine Learning Algorithms to Predict CO 2 . For the training process of the ML algorithms, the data were split into two sets. To avoid overfitting, the distribution ratio of the two sets was set to 7 : 3; 70% (130 samples) of the data was used in the training process, and the other 30% (56 samples) was utilized as testing data, as shown in Figure 7. To validate the results of the ML algorithms, the 5-fold cross-validation method was used. In 5-fold cross-validation, the training data is further divided into 5 subsets; each subset is chosen in turn for the validation process, with the remaining 4 subsets utilized for training during the training stage.
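The splitting and validation procedure above can be sketched as follows (scikit-learn, synthetic data; exact train/test counts depend on rounding, and the model choice is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(169, 6))
y = 445 + 127 * X[:, 0]

# 70/30 hold-out split, then 5-fold cross-validation inside the
# training stage only (the test set is never seen during validation).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
cv_scores = cross_val_score(DecisionTreeRegressor(random_state=0),
                            X_tr, y_tr, cv=5)   # one score per fold
```

Each of the 5 folds serves once as the validation subset while the other 4 train the model, matching the procedure described above.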

4.2. Results of Machine Learning Models. The concentration of CO 2 inside the office room was predicted using various ML algorithms. The predicted values were compared with the actual results, and the errors were estimated based on the performance indices.

ANN Model.
A single hidden layer was investigated, with the number of neurons varied from 5 to 20, and the optimal model was discovered by trial and error. The performance parameters of the ANN model are presented in Table 2. As observed from the table, the ANN attained approximately 84.8% (R 2 = 0.84758) accuracy for the whole data set. In terms of MAE (8.24) and RMSE (10.56), the optimal feedforward ANN structure with six inputs delivered the best training outcome. Figure 8(f) shows a comparison of the experimental and predicted values of the ANN model.

GPR Model.
The width of the RBF kernel S and the Gaussian noise ε are two crucial factors that can be found by a trial-and-error procedure. S = 0.40 and ε = 0.07 are the final optimum values considered to design the optimum GPR model. In the training stage, the GPR model predicts the concentration of CO 2 practically perfectly, whereas in the testing stage there is a small difference, as shown in Figure 8(j). In the decision tree model, the fine tree provided a better fit than the medium and coarse trees. A minimum leaf size of 5 and a maximum of 10 surrogates per node were the best-suited hyperparameters. The prediction accuracy of the DT model is 88.42%, with RMSE and MAE values of 9.40 and 7.14, respectively. The comparison between predicted and experimental results is shown in Figure 8(e). The comparison between the R 2 , RMSE, and MAE indicators of the different algorithms is presented in Figure 8. The optimized GPR model's R 2 value is the highest among all models: the R 2 of the optimized GPR is 1.24% greater than GPR, 4.84% greater than optimized EL, 6.20% more than optimized DT, 10.16% more than DT, 13.30% more than ANN, 17.89% more than EL, 17.94% more than LR, 18.34% more than optimized SVM, and 18.89% more than SVM. Similarly, both the RMSE and MAE of the optimized GPR model are the lowest among all the adopted techniques, as can be seen in the pictorial representation in Figure 8.
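The "optimized" model variants tune hyperparameters such as the kernel width or minimum leaf size; as a hedged stand-in for the study's (unspecified) optimizer, a simple cross-validated grid search over the DT leaf size looks like this:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(169, 6))
y = 445 + 127 * X[:, 0]

# Each candidate leaf size is scored by 5-fold cross-validation on the
# training data; the best-scoring setting defines the "optimized" model.
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {"min_samples_leaf": [2, 5, 10, 20]}, cv=5)
grid.fit(X, y)
best = grid.best_params_["min_samples_leaf"]
```

More elaborate optimizers (e.g. Bayesian optimization) search the same space more efficiently, but the selection criterion, cross-validated error, is the same.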

Wireless Communications and Mobile Computing
The best-predicted model is the optimized GPR, with an R-value of 0.98874, R 2 of 0.977607, MSE of 17.64568, and MAE of 3.350982, and a standard deviation of 25.5432, as tabulated in Table 2. The worst prediction was from the SVM model, with an R-value of 0.89044, R 2 of 0.792883, MSE of 153.0833, RMSE of 12.37268, and MAE of 9.900226, and a standard deviation of 24.1408. The intermediate models analysed were GPR, optimized ensemble, optimized DT, DT, ANN, EL, LR, and optimized SVM, with corresponding R-values of 0.98259, 0.96447, 0.95758, 0.93714, 0.92064, 0.89592, 0.89566, and 0.89349. The comparison between experimental and predicted values for the optimized GPR, optimized EL, optimized DT, and optimized SVM is shown in Figures 8(a), 8(c), 8(d), and 8(i), respectively. The graphs of measured CO 2 values against the indoor CO 2 predicted by all the above-mentioned methods are presented in Figure 8. The a20-index of all the ML models had a value of 1. The performance of the different ML models is shown in Figure 9.
In the optimized GPR model, 96% of the data lies within an error range of 10 ppm, as shown in Figure 10(a). In the GPR model, 94.67% of the data lies within the 10 ppm range, as shown in Figure 10(b). Similarly, in the optimized EL, optimized DT, and DT models, the proportion of data lying within the 10 ppm range is 82.84%, 79.29%, and 73.37%, respectively, as shown in Figures 10(c)-10(e). In the ANN model, 70.79% of the data lies within the 10 ppm range, as shown in Figure 10(f). In the EL, LR, optimized SVM, and SVM models, the proportion of data lying within the 10 ppm range is greater than 60%, as presented in Figure 10.

Conclusion
In this study, the concentration of CO 2 inside an office room was evaluated using ANN, GPR, DT, EL, SVM, and LR algorithms along with optimized GPR, EL, DT, and SVM. A total of 169 real-time data sets were collected and used for predicting the CO 2 level, containing temperature, wind speed, air quality index, relative humidity, occupancy, and area per person as input parameters. To obtain accurate results, all the data was scaled and normalized between 0 and 1, and 70% of the data was used for training and 30% for testing, with a 5-fold cross-validation process to validate the results. It was found that the optimized GPR is quite accurate, having R, RMSE, MAE, NS, and a20-index values of 0.98874, 4.20068 ppm, 3.35098 ppm, 0.9817, and 1, respectively. This prediction model is only valid for similar input data having similar statistical properties. The proposed study can help researchers and professionals to predict the CO 2 concentration inside office buildings and its effect on individual health. In future work, efficient machine learning models with large data sets can be used to predict the concentrations of various parameters like PM 2.5 , PM 10 , NO x , SO x , and CO 2 .