Privacy-Preserving Sensing and Two-Stage Building Occupancy Prediction using Random Forest Learning

Abstract: Sensing and predicting occupancy in buildings is an important task that can lead to significant improvements in both energy efficiency and occupant comfort. Rich data streams are now available that allow machine learning algorithms to be applied to direct and indirect occupancy estimation. We evaluate ensemble models, namely random forests, on data collected from an 8x8 PIR matrix thermopile sensor with the dual goal of predicting individual cell temperature values and subsequently detecting the occupancy status. The method is evaluated on a real case study deployed in an IT Hub in Bucharest, for which we collected over three weeks of ground data, analyzed it, and used it to predict occupancy in a room. Results show a 2–4% mean absolute percentage error for the temperature prediction and > 99% accuracy for a three-class model that detects human presence. The resulting outputs can be used by predictive building control models to optimize the commands sent to various subsystems. By separating the specific deployment from the system architecture and data structure, the application can be easily translated to other usage profiles and built environment entities. Compared to vision-based systems, our solution preserves privacy, and it improves on the performance of single PIR sensing or indirect estimation.


I. INTRODUCTION
Economic and environmental constraints are placing increased emphasis on intelligent building energy management systems (BEMS), in accordance with new regulations. One of the main functions of such an intelligent system is to become occupant-aware in order to condition internal space in proportion to current and foreseen usage levels. Beyond the practical need of reducing energy consumption, occupant comfort has to be assured as part of quality of service agreements and health considerations. One salient example is the importance of temperature for cognitive performance [1], especially for children. This also extends to ensuring proper levels of carbon dioxide through energy-conscious ventilation with heat recovery. Knowing the number of occupants in a class, temperature, ventilation and air conditioning could be automatically adjusted to suit the needs of occupants. Privacy and identity have been widely debated in many experiments and solutions, since occupancy behavior should be predicted without invading privacy and, especially, without making face or body recognition possible. This makes computer vision solutions based on video footage from security cameras unsuitable for an occupancy detection system. Modern commercial buildings possess hundreds or even thousands of sensors integrated in a common system called a Building Management System (BMS). Existing buildings might not have this infrastructure, and even when they do, challenges arise in making different sensing generations communicate within the same system, as well as in installation costs that might exceed the cost of the hardware sensor nodes [2]. A wireless solution for occupancy sensing might prove feasible in many scenarios.
Despite the many opportunities in modern buildings to use the rich data streams provided by networked instrumentation, rooms are still conditioned for the maximum number of persons using two-position (on/off) control algorithms. The consumption of the Heating, Ventilation and Air Conditioning (HVAC) system therefore leaves room for improvement through an advanced solution for occupancy sensing and prediction.
Although Europe is the third largest energy consumer [3], after China and the US, the topic of building energy consumption has gained increasing attention in Europe due to the continuously rising level of urbanization and industrial development. In Europe, buildings are responsible for 40% of energy consumption [4], with 38% of them being older than 50 years and inefficient. Existing buildings therefore have the potential to save energy through renovation and the deployment of sensing infrastructure that transforms them into smart spaces. The situation is similar in the USA, where the potential for energy savings through the integration of new sensor and control devices in building energy management is highlighted by various technical reports [5], including the need for occupancy detection and estimation in buildings with multiple thermal zones and variable usage patterns. In that study, the adoption of occupancy sensors for energy management is estimated at 50% for large commercial buildings and below 10% for smaller commercial buildings across all categories: renter-occupied, owner-occupied or a combination thereof. Large buildings are considered to have a usable area of over 50000 sqft.
Given the context of energy poverty described in [3], with Romania at the top of the list for the level of energy poverty, we see a stringent need to improve the way we manage energy while maintaining thermal comfort. In this context, we position our research as a meaningful demonstration of how to incorporate non-intrusive sensing to forecast occupancy, in a less exploited location scenario: a hub for IT activities in an old building. The research addresses this topic for a deployment in a lab where children perform robotics and IT activities. More precisely, the sensor is mounted on a doorcase, and the room occupancy is estimated by finding the total number of events detected at the door and then dividing it by two, to account for entrances and exits. We do not aim to control an entire building, but to present promising, high-accuracy results for predicting occupancy in rooms used by students. Similar results could be obtained in other buildings in different locations using very simple and low-cost hardware. We encourage the application of the algorithm to other domains where time series data is collected.
The contributions of this paper are the following:
• We designed and deployed an infrared monitoring system in an IT Hub in Bucharest, with the aim of learning from historical data to predict the measured temperature and transform it into usable occupancy metrics;
• We evaluated the performance of the infrared sensing grid used in our previous deployments; to the best of our knowledge, the hardware drawbacks we report are not presented in other studies;
• We discuss a scenario implemented with the system in a laboratory where young students take programming and robotics classes, and how our methods and solution could be exploited for their benefit, especially in spaces dedicated to cognitive activities;
• We perform occupancy prediction using machine learning in a two-stage pipeline: Random Forest regression for forecasting temperature values, and Random Forest classification for presence counting.
The rest of the paper is organized as follows: in Section II, we provide a comprehensive summary of relevant work using Random Forest techniques. We dedicate Section III to explaining the setup and the objectives. We treat the topic of data analysis in Section IV, touching on the phases of collection, structure and storage, cleaning and pattern discovery. Section V then illustrates the Random Forest model with the insights for occupancy prediction. The paper concludes with remarks sketching ongoing directions for continuing the research.

II. STATE OF THE ART
Recent studies show that schedules that include occupancy patterns in buildings could reduce the reheat energy consumption by up to 38% while maintaining indoor thermal comfort [6].
The literature presents the deployment of ambient sensors to estimate occupancy in commercial and residential buildings, often in cases where thermal infrared sensors are combined with other sensing devices or mobile phones. Video cameras used as sensing infrastructure for managing occupancy in such situations are considered privacy-breaching devices, given technological advancements, data protection regulations such as GDPR, and the increasing performance of machine learning algorithms. A taxonomy on this topic, emphasizing frequently used sensing platforms and methods for detecting and counting human presence, is presented in Table 1.
From a review of occupancy detection models, considering the deployment period, space and reported accuracy, some key points were identified: many contributions highlighted that classification using Random Forests (RF) achieved high accuracy for occupancy detection compared with other algorithms [13], and that multiple parameters from different types of sensors do not necessarily play a crucial role in obtaining better accuracy. In Table 1 we present only some of the most relevant works in the domain, selected by their influence on the community and the relevance of their experiments, as well as their novelty and recent publication.
We also investigated the usage of the Random Forest algorithm in related works such as [19], where it is used for predicting parking lot occupancy. The study treats data from complex systems from a business analytics perspective. The data came from parking lot sensors of the most sustainable building in the world, which is in Amsterdam. Approximately 1.5 years of data is considered, with half an hour between samples. Among several prediction instruments, the authors chose the Random Forest model, which returned the best results for prediction 0.5 h in advance, with an error of 2.3 cars. Although the prediction was implemented very rigorously, the data was of poor quality: the authors employed data imputation and an approximation of occupancy, which introduced a distance from the ground truth.
For more general time series regression and classification tasks, the authors of [20] apply Random Forests to real-time price forecasting of energy in the New York electricity market. They tested Random Forest, artificial neural networks (ANN) and the classical autoregressive moving average (ARMA) model, showing that Random Forest has the highest accuracy, i.e., the smallest MAPE value. However, the use case is isolated from potentially important factors in price evolution, such as real-time climatic and economic data. By including these factors, the authors could check whether they are important for the forecast. Random Forest has been shown to give the best results for classification in terms of efficiency and accuracy for occupancy detection [10]. The running time drawback of the algorithm is not a concern in our application, because we do not have a large number of features.
Data-driven building models are described in [21]; these can be suitable for incorporating occupancy models as constraints to the optimization problem. A significant body of experimental data is provided by [22], allowing off-line training of quality occupancy models. Estimation of occupancy is extensively evaluated in [18], based on direct and indirect measurements modelled through Bayesian networks. Beyond direct presence detectors, occupancy is inferred using CO2 concentration, acoustic levels, and power and water consumption. In [17] a more capable PIR sensor array is used, which provides a 24x32 temperature resolution, i.e., 768 data points. This enables further analysis beyond basic occupancy detection, towards activity recognition, which can also be used to quantify subjective perceptions of thermal comfort. The current contribution builds upon previously published results concerning: lab-scale experiments using the Panasonic Grid-Eye sensor for occupancy detection [23], testing of various machine learning algorithms on simulated data [24], [25], and infrastructure for a data processing pipeline in occupancy sensing and prediction [26]. The progress is supported by improved experimental evaluation in a realistic scenario of daily usage profiles.

III. INFRASTRUCTURE AND EXPERIMENTAL DEPLOYMENT
The experimental system, composed of a Panasonic Grid-Eye development kit and an associated Raspberry Pi wireless gateway, was deployed for three weeks in an IT hub where young students take programming and robotics classes. We found this scenario very appealing: we had previously deployed our equipment in a university laboratory [23] where adults use the spaces, but young students behave differently. They enter the room faster, they often walk in groups of two, and they are much shorter than adults, which means a larger distance to the sensing grid placed on top of the doorcase.
Data is recorded at a frequency of 1 Hz, in frames of 64 temperature values in degrees Celsius, corresponding to the 64 cells of the sensing grid. Knowing all values of a frame, we could identify warm bodies passing through the door by identifying blobs over a static background temperature. This led to finding the times when the room is used. The monitored room is in an old apartment building in Bucharest, without a building management system (BMS) to enhance the scheduling. We were interested in predicting occupancy, considering that the class runs with the same number of students almost every time. The algorithm considers the values from the last two dates for each timestamp and continues learning each time it runs. This assumption is made because the room is small and the students are numerous, so good ventilation and proper temperature are important conditions for small children while learning.
We ran the experiment from 15/05/2018 to 6/06/2018, logging timestamped data in text files with comma-separated values, which were then transferred via Bluetooth wireless communication to a base station, a Raspberry Pi 3 Model B, and stored in a local database. The text file log is easily manipulated and imported into any type of database, and it can also be converted to other formats such as JSON or XML for automatic processing libraries. The raw and processed datasets are available from the authors and will be published in a dedicated online repository.
The Grid-Eye evaluation kit (AMG 8834 EIK) which we used [27] is illustrated in Figure 1. A comprehensive diagram of the physical deployment and associated workflow is illustrated in Figure 2. The right side of this figure shows a conceptual view of the physical deployment. The sensing grid is placed on the top part of the doorcase and senses the temperature at one frame per second, over a viewing angle of 60 degrees. Every frame contains 64 temperature values, which define a background and potentially higher, clustered values; these are assigned to a human detection if they satisfy the conditions to be classified as an occupant of the building. This data is transmitted via Bluetooth to the base station, to which we can connect via Wi-Fi for access to the backend IT system and integration with the control equipment. We used the board in standalone mode, without integrating it with an Arduino. The infrared image data is sent through the external I2C interface to the onboard microcontroller and then sent to the Raspberry Pi via the short-range Bluetooth module, PAN1740. The infrared sensors are packaged in an 8 mm x 11.6 mm x 4.3 mm SMD can type. The Grid-Eye evaluation kit can also communicate with a smartphone. The temperature measurement range of the infrared array sensor is between -20 and +100 degrees Celsius, with good accuracy and a frame rate of up to 10 fps. Our sampling takes place at 1 fps.

IV. DATA PRE-PROCESSING AND ANALYSIS
The logical flow of data processing for finding forecasted values is: data ingestion, outlier/anomaly identification, data preparation for the machine learning model, model training, prediction, computation of prediction metrics, and visual interpretation of the results. The main steps of the data pipeline are presented graphically in Figure 3. As an example of anomaly detection, in the first phase we noticed a spike in the last week of data collection which could have been caused at the Grid-Eye sensor level. To deal with this spike, shown in Figure 4 (average value per frame for the 64 recorded temperature values), we simply removed the anomalous value, since it was an isolated case. Had there been numerous such abnormal values, replacing the wrong values with an average value could have been an option. Temperature values were in the same range, so we did not need to perform data scaling or account for seasonal cyclicity. However, after a very fine analysis of the data values at the granularity of each grid cell, we found that one of the 64 sensors of the grid failed to report the correct temperature several times. We classified this as a hardware issue, because we identified this situation only on the same sensor each time. The number of wrong values (0 degrees Celsius) is considerably small (less than 50 times), and we replaced each of them with the average value of the frame in which that particular 0 was recorded. Averaging can sometimes hide different issues in the data, as happened in our data set for one sensor. For the case presented here, covering one class of students, we had 38 cases of value 0 in the first week, and 29 cases in each of the second and third weeks, out of 4446 records.
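The zero-replacement step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `repair_zero_readings` is hypothetical, and taking the average over the valid cells only (rather than over all 64 cells including the faulty zero) is an assumption.

```python
from statistics import mean

def repair_zero_readings(frame, bad_value=0.0):
    """Replace faulty readings in one 64-value frame with the frame average.

    Assumption: the average is computed over the valid cells only, so the
    faulty zeros do not pull the replacement value down.
    """
    valid = [t for t in frame if t != bad_value]
    if not valid:
        # Every cell failed: there is nothing to average, return a copy.
        return list(frame)
    fill = mean(valid)
    return [fill if t == bad_value else t for t in frame]

# One frame in which a single cell (index 5) dropped to 0 degrees Celsius.
frame = [24.0] * 64
frame[5] = 0.0
cleaned = repair_zero_readings(frame)
```

Keeping a count of how many cells were repaired per frame would make it easier to spot the kind of recurring single-sensor fault reported here.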
For the Random Forest algorithm predicting the next temperature values, we chose three input measurements: the actual value of each cell, indicating the temperature in °C; the corresponding value of each cell of the sensing grid at the same time one week before; and the corresponding value of each cell at the same time two weeks before. The training set consisted of 75% of the total 4446 values, representing the duration of one class, approximately 1.2 hours. The purpose is to predict the temperature of each cell of the grid for the last week, based on the values from the previous two weeks.
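As a rough illustration of how such inputs could be assembled for one cell, the sketch below aligns three weekly recordings by sample index and applies a 75/25 split. The function names are hypothetical, and the chronological (unshuffled) split is an assumption; the text does not state how the 75% was selected.

```python
def build_lagged_rows(week1, week2, week3):
    """Pair each reading of one cell with the readings taken at the same
    time of the class one and two weeks earlier.

    Each argument is a list of temperatures for the same class slot,
    sampled at 1 Hz and aligned by index. Returns rows of
    (current, one_week_before, two_weeks_before).
    """
    n = min(len(week1), len(week2), len(week3))
    return [(week3[k], week2[k], week1[k]) for k in range(n)]

def chronological_split(rows, train_fraction=0.75):
    """Keep the first part of the rows for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Toy data: eight samples per week for a single grid cell.
week1 = [23.0, 23.1, 23.2, 23.1, 23.0, 23.3, 23.4, 23.2]
week2 = [23.5, 23.6, 23.4, 23.5, 23.7, 23.6, 23.5, 23.4]
week3 = [24.0, 24.1, 24.0, 24.2, 24.1, 24.0, 24.3, 24.2]

rows = build_lagged_rows(week1, week2, week3)
train_rows, test_rows = chronological_split(rows)
```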
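To evaluate the forecast error, we use the mean absolute percentage error (MAPE), which is very common for time series, expressed as:

$$\mathrm{MAPE} = \frac{100}{n} \sum_{k=1}^{n} \left| \frac{A_k - F_k}{A_k} \right| \qquad (1)$$

where $A_k$ are the actual measured values, $F_k$ are the predictions, and $n$ is the number of samples. We ensured that there were no zero values in our data set, since the MAPE is undefined when $A_k = 0$.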
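A direct implementation of the MAPE is short enough to show in full. The sketch below uses hypothetical names and toy values; it rejects zero actual values, in line with the precaution mentioned above.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)

# Toy values in degrees Celsius (not taken from the dataset).
actual = [25.0, 26.0, 24.0, 25.0]
forecast = [24.5, 26.52, 24.0, 25.5]

error_percent = mape(actual, forecast)
accuracy_percent = 100.0 - error_percent
```

For these toy values the error is about 1.5%, so the accuracy figure derived as 100 minus the MAPE is about 98.5%.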

V. MODEL DEVELOPMENT AND EXPERIMENTAL RESULTS
After analyzing the literature and previously experimenting with other algorithms, such as linear regression and Markov chain models, we found that the Random Forest model promises fruitful results. Random Forest is an algorithm used for both regression and classification tasks. Data is randomly selected from the training set to train multiple decision trees, which thus form a 'forest'. The split rules of the decision trees are built using an attribute selection indicator. In our case, we used the Gini index as the criterion to evaluate splits in the dataset:

$$G = 1 - \sum_{i} p_i^2$$

where $p_i$ is the probability of class $i$. The aim is to have a split with a low value of this index.
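To make the split criterion concrete, the following sketch computes the Gini impurity of a set of class labels and the size-weighted impurity of a candidate split. The function names are illustrative; the weighting by child size is the usual convention for decision trees and is assumed here.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_i^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)

pure = gini_impurity([0, 0, 0])          # a single class: no impurity
mixed = gini_impurity([0, 1])            # two equally likely classes
perfect_split = split_gini([0, 0], [1, 1])
```

A pure node has impurity 0, while two equally likely classes give 0.5; a tree prefers the split with the lowest weighted value.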
The Random Forest model is related to that of k-NN, and it is based on the bagging (model averaging) approach over random samples to avoid overfitting and reduce variance. Let X be the training set and Y the set of responses, with X = x1, ..., xn and Y = y1, ..., yn respectively. For b = 1, ..., B, the algorithm samples Xb and Yb with replacement from X and Y, and trains the regression tree tb on Xb and Yb. The values for unseen samples x' are predicted by averaging the predictions of all the individual regression trees on x', expressed as:

$$\hat{f}(x') = \frac{1}{B} \sum_{b=1}^{B} t_b(x')$$

Because several trees participate in the prediction and the predicted values are obtained by voting, as shown in Figure 5, Random Forest is considered a robust and highly accurate method. The challenge here is to find a number of trees that ensures good results while keeping the time-consuming voting process manageable. We chose n = 1000 trees to participate in the voting process, after trying different options. The number of trees influences the accuracy only at the second decimal, but the running time of the algorithm increases proportionally with the number of trees. Grid search or random search methods can improve the robustness of the approach with regard to hyper-parameter tuning.
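The bagging procedure can be illustrated with a deliberately small sketch. It is not a Random Forest: each "tree" is a one-split regression stump on a single input, and there is no random feature selection. All names and the toy data are hypothetical; the point is only to show bootstrap sampling with replacement and averaging of the individual predictions.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on 1-D inputs by minimising the
    sum of squared errors (a tiny stand-in for a full regression tree)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        if not left or not right:
            continue
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) \
            + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:
        # All sampled inputs are identical: fall back to the mean response.
        m = mean(ys)
        return lambda x: m
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def fit_bagged(xs, ys, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample of (xs, ys)."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees

def predict_bagged(trees, x):
    """Average the predictions of all trees for an unseen input x."""
    return mean(tree(x) for tree in trees)

# Toy data: low temperatures for small inputs, high for large inputs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [20.0, 20.0, 20.0, 20.0, 26.0, 26.0, 26.0, 26.0]

trees = fit_bagged(xs, ys)
low = predict_bagged(trees, 1.0)
high = predict_bagged(trees, 8.0)
```

Raising `n_trees` smooths the averaged prediction further at a proportional cost in training time, which mirrors the trade-off described above.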

Fig. 5. Random Forest Prediction
The first trained model is tasked with predicting individual temperature values in the 8x8 thermal sensor matrix. In Figure 6, a sample of the predicted values obtained for the first sensor of the grid is illustrated for approximately 16 minutes. Similar behavior was observed for each sensor. Applying Random Forest to the data corresponding to each cell of the grid, we predicted the values with an average accuracy of 97.46%. The performance for each cell is represented in Figure 7, where we can observe that the forecast accuracy lies in the range between 97.1% and 98.1%. The accuracy is calculated as the difference between 100 and the MAPE value defined in equation (1). We obtained the highest value for the first cell of the grid. Data coming from this first sensor is more accurate than data coming from the cell on the last row of the grid, which views the event from further away, so some external perturbation could also have interfered. Having a forecast of the temperature from the sensor grid, we can identify the number of persons who cross the field of view of the grid and thus find the number of occupants. For a visual representation, we present in Figure 8 how a detected person looks in the Grid-Eye sensor imaging. The side color bar shows the temperature value in degrees Celsius. Starting with the image from Figure ??, we implemented the second model, based on Random Forests, to find the number of students who crossed through the door during the length of a class. Dividing by 2, to account for entrance and exit actions, we can estimate the occupancy degree. This occupancy degree could then be used in real time by the owners or facility personnel and incorporated in HVAC schedules. The classification process focuses on three occupancy detection classes: 0 (no person detected), 1 (one person detected) and 2 (two persons detected in the frame).
This corresponds to the physical space limitation for persons passing through the doorcase, while neglecting the edge case of three or more persons in the frame at the same time.
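The step from per-frame classes to an occupancy estimate can be sketched as follows. The paper only states that the number of door events is divided by 2; how consecutive frames are grouped into a single event is not specified, so the grouping rule below (one run of consecutive non-zero frames is one event, counted with the largest class seen in the run) is an assumption, and the function names are hypothetical.

```python
from itertools import groupby

def count_crossings(frame_classes):
    """Count persons crossing the door from per-frame classes (0, 1 or 2).

    Assumption: a run of consecutive non-zero frames is one crossing
    event, and the largest class in the run is the number of persons.
    """
    total = 0
    for occupied, run in groupby(frame_classes, key=lambda c: c > 0):
        if occupied:
            total += max(run)
    return total

def estimate_occupancy(frame_classes):
    """Divide the crossings by 2 to account for entrances and exits."""
    return count_crossings(frame_classes) // 2

# Toy sequence of per-frame classes at 1 Hz.
frame_classes = [0, 1, 1, 0, 2, 2, 0, 0, 1, 0, 2, 0]
crossings = count_crossings(frame_classes)
occupants = estimate_occupancy(frame_classes)
```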
Our approach to detecting human presence using the Random Forest algorithm is described in Algorithm 1. As the dataset for this phase, we considered the set of processed features obtained from the raw temperature values: active pixels, number of blobs and the size of the largest blob, as in [8]. After performing another step to find the importance of each feature in the classification process (Figure 9), we only used the number of active pixels and the number of blobs. In fact, we tested the algorithm with different numbers of trees in the forest, and for each test the importance has a different weight, but the highest ones were achieved by the first and third features. There is a relation of inverse proportionality between the number of trees and the importance of the first feature, the number of active pixels, as presented in Table 2. Based on this analysis, we used the feature extraction step described in Algorithm 2 to prepare the raw data for the Random Forest algorithm presented in Table 2. The value of 25 degrees Celsius, used for separating active cells from the background, was found by analyzing the dataset collected during our three-week use case. Considering the air conditioning, the night temperature values, and the heating during occupancy over the day, this value of 25 degrees Celsius was judged to be reasonable for defining an event as human presence temperature. As an alternative, moving average background subtraction can be implemented for more robust performance in varying conditions. With the input feature dataset containing the number of active pixels and the number of blobs, the ground truth labelling for the human presence count is performed manually.
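The three candidate features can be computed from a single frame with a simple flood fill. This is a minimal sketch under stated assumptions rather than the paper's Algorithm 2: the function name is hypothetical, a cell is treated as active when it is strictly above 25 degrees Celsius, the frame is assumed to be stored row by row, and blobs are formed from 4-connected neighbours.

```python
from collections import deque

THRESHOLD_C = 25.0  # active-cell threshold reported in the text

def extract_features(frame, threshold=THRESHOLD_C, size=8):
    """Return (active_pixels, number_of_blobs, largest_blob_size) for one
    frame of size*size temperatures stored row by row (assumed layout)."""
    active = [[frame[r * size + c] > threshold for c in range(size)]
              for r in range(size)]
    seen = [[False] * size for _ in range(size)]
    blob_sizes = []
    for r in range(size):
        for c in range(size):
            if not active[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected active neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < size and 0 <= nx < size
                            and active[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blob_sizes.append(area)
    return sum(blob_sizes), len(blob_sizes), max(blob_sizes, default=0)

# Toy frame: 22 C background with two warm regions.
frame = [22.0] * 64
for r, c in ((1, 1), (1, 2), (2, 1), (2, 2)):
    frame[r * 8 + c] = 30.0
for r, c in ((5, 6), (6, 6)):
    frame[r * 8 + c] = 29.0

features = extract_features(frame)
```

For the toy frame the result is 6 active pixels in 2 blobs, the largest covering 4 cells. Replacing the fixed threshold with a per-cell moving average background would correspond to the alternative mentioned above.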
For the human detection phase, we chose the Gini index as in the prediction phase. For a better understanding of how this algorithm works, we illustrate in Figure 10 the graph of a single tree on a small dataset of 29 samples (the number of observations in the root node). In this visualization, we kept all three features, and the output is a class: 0, 1 or 2, for no presence detected, one person and two persons, respectively. A Gini impurity of 0 for a node is perfect, because there is no chance for a randomly selected sample to be incorrectly labeled. The row labelled 'value' represents the number of samples in each class. Tested on our medium-length dataset for one IT class with the children, our algorithm used manually labeled records and reached >99% accuracy, owing to the single type of data source but also to the simplicity of the classification problem. An extended experiment should be deployed, including several rooms for a longer period. We consider our solution very practical due to the small cost of the hardware, around 90 Euros, which if wisely used could return a promising profit in terms of energy saved. Moreover, if a PIR sensor were added to activate the system only when movement occurs, event detection would be more reliable, with fewer spurious detections. Switching to a system that uses only the Grid-Eye sensor, rather than the Panasonic kit board presented in this paper, would cut the costs in half, but it would require a more demanding embedded system design to integrate the sensing grid with a battery-powered board; in addition, another testing period would be necessary.
A model closer to the ground truth could be built by enriching the data collection process with other sources. For instance, for an IT class, monitoring the power-up time of the systems could provide information about the number of users, which could lead to a degree of occupancy as in [28]. Occupancy information can be integrated into a higher-level model predictive system for HVAC control [29].

VI. CONCLUSIONS
This paper presented a system tested in a space where the occupants are elementary school students, with the aim of predicting occupancy so as to increase comfort and manage energy efficiently. The study was conducted in Bucharest in early summer and offered promising results. We have presented the lessons learnt and the findings regarding the hardware, data analysis and algorithm tuning. For a three-week period, we cleaned data collected from an infrared sensing matrix and applied the Random Forest method both for temperature time series forecasting and for occupancy counting, obtaining interesting results in terms of accuracy. We also discussed the data preparation steps, so that the prediction and classification techniques can be transferred to other situations and applied to different datasets. The importance of this paper lies in the context of finding approaches and frameworks to reduce energy consumption in old buildings, which have shown poor energy efficiency due to their lack of sensing infrastructure, age and construction materials.

ACKNOWLEDGMENT
This work is based on the experiment deployed in the IT Hub, Bucharest. We would like to express our gratitude to George Murga, Mya and Sorin Tarmure for making this possible.