Wearable Sensor-Based Location-Specific Occupancy Detection in Smart Environments

Occupancy detection enables a range of emerging smart environment applications, from opportunistic HVAC (heating, ventilation, and air-conditioning) control and effective meeting management to healthy social gathering and public event planning and organization. Smartphones and wearable sensors are ubiquitous and stay with their users almost 24 hours a day, which opens the door to a multitude of novel applications. The built-in microphone of a smartphone is a key enabler for detecting the number of people conversing with each other at an event or gathering. Other sensors, such as the accelerometer and gyroscope, help count people based on other signals such as locomotive motion. In this work, we propose a multimodal data fusion and deep learning approach that relies on the smartphone's microphone and accelerometer sensors to estimate occupancy. We first demonstrate a novel speaker estimation algorithm for people counting and then extend the proposed model using deep nets to handle large-scale fluid scenarios with unlabeled acoustic signals. We augment our occupancy detection model with a magnetometer-dependent, fingerprinting-based localization scheme to capture the size of location-specific gatherings. We also propose crowdsourcing techniques to annotate the semantic location of the occupant. We evaluate our approach in different contexts (conversational, silent, and mixed scenarios) in the presence of 10 people. Our experimental results on real-life data traces in natural settings show that our cross-modal approach achieves an average error count distance of approximately 0.53 for occupancy detection.


Introduction
Localized occupancy detection and estimation in commercial (university, office, mall, cineplex, restaurant, etc.) and residential (apartment, home, etc.) buildings at room/zone-level granularity in real time can provide meaningful insights into many smart environment applications, such as green building, social gathering, and event management. Smartphone-based participatory and citizen sensing applications hold the promise of building such applications by utilizing the various context-sensing sensors on board. Different sensors can be exploited individually or in tandem to build a variety of novel applications that satisfy the differing requirements of smart environments. For example, a potential benefit of a microphone sensor-based application is the assessment of social interaction and active engagement among a group of people by leveraging their conversational contents [1], together with speaker identification and characterization of social settings [2][3][4]. To enumerate the people in a conversational episode, such as during a social gathering, an interactive lecture session, or in a restaurant or shopping mall, various speaker-counting paradigms have been explored [5][6][7][8]. Most recent studies that extract high-level occupancy information from conversational data features assume that every user takes a turn speaking at some point. While this specific scenario is feasible, it is not realistic in general. To relax this assumption, researchers have proposed using arrays of microphone sensors, video cameras, or motion sensors for identifying microscopic occupancy information in real time [9][10][11][12], which is obtrusive in nature. We aim to move one step further by considering a more natural environment where people may spontaneously participate in or abstain from any conversation. We propose to augment an acoustic sensing-based audio inference model with a smartphone-based locomotive sensing model that operates in the absence of any conversational episode, in order to precisely capture the characteristics of a natural environment and accurately estimate the occupancy count. To further pinpoint the occupancy, we integrate a location-sensing model based on the smartphone's magnetometer sensor. In pursuit of these goals, we design a model that opportunistically exploits both the audio and motion data, from the smartphone's microphone and accelerometer sensors, respectively, to infer the number of people present in a gathering, and that derives their semantic location from the smartphone's magnetometer sensor. We also introduce a crowdsourcing model to reduce the effort of obtaining semantic location information at scale.
In particular, we propose a zero-hassle, ambient, and infrastructure-less mobile sensing (a.k.a. smartphone) based approach that exploits only the smartphone's sensors to provide significantly greater visibility into real-time occupancy and its semantic location [13,14]. The key challenge in this case is to effectively estimate the number of people in crowded and noncrowded environments, whether or not any conversational data are present. Such a hybrid sensing approach could furnish more fine-grained occupancy profiling to better serve many participatory sensing applications while saving smartphone battery power through a distributed sensing strategy. The main contributions of this paper are summarized as follows:
(i) We propose an online, acoustic sensing-based, linear-time adaptive people-counting algorithm based on real-life conversational data, which promotes a unified strategy of considering both overlapped and nonoverlapped conversational data in a natural environment. We propose to opportunistically select the minimal number of microphone sensors, which can substantially reduce the energy consumption of smartphones. Unlike existing work [6], our proposed people-counting algorithm dynamically selects the length of the audio segment.
(ii) We also propose an offline, data-driven people-counting algorithm that uses a deep neural network-based clustering approach. We optimize the deep network by learning the feature space and cluster membership jointly. We allocate clusters dynamically to determine the number of people present in a conversation. Our proposed model dynamically provides beneficial frames to the occupancy-counting module. We perform an extensive evaluation in the presence of 10 domestic users to validate our model performance.
(iii) Although the acoustic sensing-based approach holds great promise for inferring the number of occupants, it fails in the absence of any conversational data. Therefore, we propose to augment our acoustic sensing-based people-counting algorithm with a motion sensing-based counting strategy, so that counting still works in the extreme case where only one of the data sources, acoustic or locomotive, is available.
(iv) We design a magnetometer sensor-based localization technique at zone/room-level granularity to infer the location of a conversing group. We propose a novel crowdsourcing model to map the magnetic signatures of different locations and to collect a large amount of annotated location information, so that the occupancy can be tagged with its semantic location [13].

Related Work
We review the most relevant literature on smartphone-based occupancy inference in the context of conversational sensing, localization, and speaker estimation.
2.1. Speaker Sensing. Occupancy estimation is an important enabler of various applications such as HVAC (heating, ventilation, and air-conditioning) control [12, 15-17] and social interaction [18]. For example, Nikdel et al. [15] quantified energy consumption using building occupancy information. Aftab et al. [12] predicted occupancy from video sensing by applying object-tracking techniques to a scene and controlled the HVAC system in real time. In addition, various speaker-sensing algorithms based on acoustic sensing have been proposed in the recent past [19,20]. Valle [19] proposed a hybrid occupancy estimation model that combines the Gaussian mixture model (GMM) and the hidden Markov model (HMM). A large number of previous works have used the smartphone's microphone to opportunistically analyze audio for context characterization. For example, SpeakerSense [4] performs speaker identification, and SoundSense [21] classifies sounds from macro- to microcontexts. What these systems have in common is that they employ supervised speaker-learning techniques. In contrast, our model's occupancy-counting process is entirely unsupervised. Our proposed model anonymously estimates the number of people from the smartphone's acoustic and locomotive sensing, using unsupervised learning techniques to cluster different forms of acoustic signatures. For example, Ofoegbu et al. [22] built a model from the mean and covariance matrices of the linear predictive cepstral coefficients (LPCCs) of voice segments in conversations and used the Mahalanobis distance to determine whether two models belong to the same or different speakers. Iyer et al. [23] performed speaker clustering using the distance between feature vectors extracted from different speakers and then applied a modified k-means algorithm to the distance metric data. However, their experiments on occupant estimation used telephonic conversational data, where multiple participants were present and voices were frequently overlapped and intertwined with a noisy environment. Sell et al. [20] predicted the number of occupants from acoustic signals by employing the agglomerative hierarchical clustering (AHC) algorithm. Our proposed model performs speaker counting without any predefined environmental setup and collects data from natural conversation. Our proposed speaker-counting algorithm is close to [24] and [6], where smartphone-based speaker counting was proposed for a controlled scenario in which all the participants spoke actively. Xu et al. [6] used fixed-length audio segments (3 sec), each corresponding to an individual, whereas we perform this audio segmentation dynamically to increase the accuracy of occupancy inference. Xu et al. [6] also classified a few segments as undetermined, whereas our system never discards segments as undetermined, which is achieved only by employing dynamic segmentation. Therefore, our proposed audio-based occupancy inference model tackles a richer problem, where no speaker is discarded to ease the computational challenges. Crowd++ [6] combined pitch with MFCC to compute the number of people with an average error distance of 1.5 speakers. In contrast, our proposed MFCC-based model improved the average error distance by roughly a factor of two (0.76 speakers) [13,14]. However, the major disadvantage of the MFCC-based acoustic approach is that the MFCC discards a lot of the information present in the speech sound. Therefore, we need to develop a robust system that can capture discriminative speaker information and establish correlations between features. Recently, deep learning methods have become the state of the art in many applications such as object recognition [25] and speaker recognition [26]. In particular, deep learning algorithms help design robust models of audio-related signals, for tasks such as speech recognition [27] and phone recognition [28], with better accuracy.
Mobile Information Systems
The major reason behind the popularity of deep models is their capability to learn features automatically from a large data set, which reduces the reliance on handcrafted features. Deep models are capable of learning robust features from both labeled and unlabeled data. Among the different deep learning methods, the deep neural network (DNN) has recently been shown to be effective in speech recognition applications [29,30]. Milner and Hain [31] concatenated features from all the audio channels and trained a DNN on these mixed features to predict the number of speakers from audio signals. Mohamed [32] used a DNN to recognize phones from acoustic data and showed that DNN performance is superior to that of the Gaussian mixture model (GMM). However, the features were learned from the labeled information provided during the training phase. Unsupervised feature learning has also been investigated in [33], where the convolutional deep belief network (CDBN) was applied to audio data for gender detection, phone recognition, and speech identification applications. Deep learning-based clustering techniques have also been employed in [34,35], where features generated by a convolutional neural network (CNN) were fed into hierarchical clustering methods to cluster nonoverlapped utterances. Xie [35] proposed offline DNN-based clustering techniques and used the k-means clustering algorithm to initialize the cluster centroids. In contrast, our DNN-based method clusters the speakers automatically and trains the models in an unsupervised fashion. Our model dynamically determines the total number of clusters present in the given input audio stream and employs the offline DNN-based clustering strategy to achieve an average error count distance of ≈0.53, which is about 30% lower than that of our MFCC-based iterative speaker-counting approach.

Indoor Localization.
Wang et al. [36] proposed an unsupervised indoor localization approach that exploits identifiable environmental artifacts and specific signatures on single or multiple sensing dimensions, using readings from different smartphone sensors (mainly the accelerometer, compass, gyroscope, and WiFi APs). Track [37] deployed reusable beacons around the place of an event, utilized the location of the beacons in conjunction with the smartphone contact list, and applied crowdsourcing techniques to infer users' locations. Chung et al. [38] measured the geomagnetic field, which is spatially varying but temporally stable, using an array of e-compasses to infer location. However, these approaches use a number of sensors or sensor arrays for location detection, whereas our model uses only the smartphone's magnetometer sensor to infer the semantic location of a gathering at zone/room-level granularity. Subbu et al. [39] used magnetic fingerprints with the dynamic time-warping algorithm to predict location with 92% accuracy. Our model uses the standard random forest algorithm and achieves 98% accuracy in detecting the high-level semantic location of any gathering. IndoorAtlas location technology [40] utilizes anomalies of ambient magnetic fields for indoor positioning. This platform supports participatory sensing, where the crowd can contribute by war-driving the magnetic signatures of an unexplored location.

Overall System Architecture
We envision developing a minimally invasive, cost-free, and robust mobile system that counts the number of people present at any time in any environment and reveals their semantic location. Our model provides these capabilities by employing the smartphone's magnetometer, microphone, and accelerometer sensors. Our system, as shown in Figure 1, comprises two subsystems, one deployed on the smartphone and the other on the server. Using only acoustic sensing, it is not always possible to predict the correct number of occupants present in a specific location, as some people get involved in a conversation while others remain silent. For example, in a classroom scenario, while the professor lectures, some of the students participate, but the majority remain silent. In the mobile part of our proposed architecture, data sensed by the accelerometer and the magnetometer are stored in a data sink for later analysis. In our model, we propose to utilize microphone sensor-based acoustic sensing in conjunction with accelerometer sensor-based locomotive sensing for occupancy detection. For this joint collaborative sensing, the acoustic data are fed to a filter to collect the acoustic fingerprint (AFP), which consists of content-based audio. The AFPs collected from all smartphones are sent to the "estimate proximity" module residing on the server, which distinguishes the audio signals in the vicinity and determines which smartphones belong together in a single clique. Finally, the "optimum node" module elects the clique leader (the most informative smartphone) to record the audio data and notifies the other smartphones to deactivate recording so that they do not capture duplicate audio signals. It also sorts the smartphone list by audio signal strength; this list is eventually utilized by the locomotive "signature collection" module to opportunistically check on and trigger the accelerometer sensor on the smartphones [41]. The server-side architecture consists of two main logical subcomponents: (i) occupancy context model and (ii) location context model. These models together form the inference engine of our proposed semantic location-sensitive occupancy detection system.
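The leader election performed by the "optimum node" module can be illustrated with a minimal sketch. It assumes that each smartphone in a clique reports a single scalar audio signal strength and that the most informative phone is simply the one with the strongest signal; the `PhoneReport` type, its fields, and the function name are hypothetical and do not come from the system described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhoneReport:
    phone_id: str           # hypothetical device identifier
    signal_strength: float  # hypothetical audio signal strength (e.g., frame RMS)


def elect_clique_leader(clique: List[PhoneReport]) -> Tuple[PhoneReport, List[PhoneReport]]:
    """Return (leader, standby) for one clique.

    Assumption: the phone with the strongest audio signal keeps recording,
    and the remaining phones, sorted by decreasing signal strength, stop
    recording and form the list used later to trigger their accelerometers.
    """
    if not clique:
        raise ValueError("clique must contain at least one phone")
    ranked = sorted(clique, key=lambda p: p.signal_strength, reverse=True)
    return ranked[0], ranked[1:]
```

With such a split, only one microphone per clique stays active, which is the energy-saving idea behind electing a leader.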

Occupancy Context Model.
It has two submodules: acoustic context model and locomotive context model.

Acoustic Context Model (ACM).
Our acoustic context model has two independent inference modules: (i) iterative speaker count (I-SC) and (ii) deep neural network- (DNN-) enabled speaker count (DNN-SC). We employ these modules for inferring occupancy.
(i) Iterative speaker count (I-SC): this module serves as the core processor for occupancy counting. It takes the raw audio signal as input, generates the MFCC as features, and then measures the similarities between the audio frames and segments. Based on these similarity measures, it decides whether the speech segments were generated by distinct speakers or by the same speaker. It keeps track of all the segments and their identities with respect to a specific person and finally counts the total number of speakers during a conversational episode.

Design Methodology
In this section, we describe the details of our model design framework. We present an acoustic-augmented locomotive sensing model for counting the number of people present in a conversing or nonconversing natural environment. We propose a magnetometer sensor-based fingerprinting methodology to semantically localize the gathering.
4.1. Occupancy Estimation Using Acoustic Signature. We compute the total number of speakers present in a conversation using two methods: (i) iterative speaker counting (I-SC) and (ii) deep neural network- (DNN-) enabled speaker counting (DNN-SC). We discuss the details of our approach below.

Iterative Speaker Count (I-SC).
This module has three submodules: (i) preprocessing, (ii) feature extraction, and (iii) occupancy estimation. Figure 2 shows the stacked pipeline of our iterative speaker count module.
(1) Preprocessing: this module is the most basic phase of acoustic signal processing. It performs the filtering and selects the audio segment length dynamically. It finally removes the noise and silences and produces smooth conversational data, which are later passed to the feature extraction module.
(2) Feature extraction: this module is the basis for extracting all the features utilized in the speaker estimation module. It takes conversational samples and processes them through a series of data cleaning and feature extraction steps. It builds frames from the samples to calculate various features such as MFCC and pitch. These features are later used by the speaker estimation module.
(3) Speaker estimation: here we describe our iterative occupancy estimation using our proposed acoustic sensing model. We look into the specific case where all the occupants have been conversing. We first attempt to calculate the number of speakers engaged and consider three different phases to compute the number of people present. First, we propose to create dynamic segments from the raw audio data and assume that each segment belongs to an individual person. We attempt to detect every speaker change point in the entire audio signal and assign one segment to one person to increase the counting performance of our occupancy detection algorithm. A speaker change point marks the stopping point of one speaker and the starting point of another. Speaker change point detection algorithms have been investigated extensively [42][43][44]; however, detecting the speaker change point in conversational speech is a complex process because utterances can be extremely short, speaker changes may occur frequently, speakers may overlap, and the surrounding environment can be noisy. We create segments from the raw audio dynamically. Details of our preprocessing are discussed in Section 5.4.1.
Assume that the entire audio signal has $N$ segments $S_1, S_2, \ldots, S_N$ and that each segment consists of $m$ frames $F_1, F_2, \ldots, F_m$. We calculate the MFCC for each frame, so that each segment has the corresponding MFCC feature vectors $M_1, M_2, \ldots, M_m$. We also compute the pitch for each segment to infer gender in the conversational data. Segment pitches are represented as $P_1, P_2, \ldots, P_m$; the average pitch for male speakers falls between 100 and 146 Hz, whereas the female pitch lies within 188 to 221 Hz, as demonstrated in [45]. Segments whose pitch falls within the male range are marked as male, and similarly for female. These two sets are then passed to our proposed people-counting heuristic algorithm. Before checking the similarity of these male and female segments, we calculate the intrasegment cosine angle of each segment to sort both the male and female segments. Next, we check whether the similarity between two segments falls within our predefined threshold, $\theta_{th}$. If the segments are similar, we merge them into a new segment and continue by comparing the next segment with this newly created segment. If the segments are dissimilar, we move forward and pick another segment to compare with the next one. The pseudocode of our proposed people-counting heuristic is shown in Algorithm 1.
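The pitch-based gender split and the threshold-based merging described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the heuristic of Algorithm 1 itself: each segment is summarized by a single mean MFCC vector, merging is approximated by a running mean of those vectors, and the function names and the `"unknown"` gender label are hypothetical.

```python
import math


def cosine_angle(u, v):
    """Angle in degrees between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 90.0  # treat an all-zero vector as dissimilar to everything
    cos = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(cos))


def label_gender(pitch_hz):
    """Pitch ranges quoted in the text: 100-146 Hz male, 188-221 Hz female."""
    if 100.0 <= pitch_hz <= 146.0:
        return "male"
    if 188.0 <= pitch_hz <= 221.0:
        return "female"
    return "unknown"  # hypothetical bucket for pitches outside both ranges


def count_speakers(segments, theta_th):
    """segments: list of (mean_mfcc_vector, pitch_hz) pairs.

    A segment is merged into the first same-gender person whose centroid is
    within theta_th degrees; otherwise it starts a new person.
    """
    people = []  # each entry: {"gender", "centroid", "n"}
    for mfcc, pitch in segments:
        gender = label_gender(pitch)
        for person in people:
            if person["gender"] == gender and cosine_angle(person["centroid"], mfcc) < theta_th:
                n = person["n"]
                person["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(person["centroid"], mfcc)]
                person["n"] = n + 1
                break
        else:
            people.append({"gender": gender, "centroid": list(mfcc), "n": 1})
    return len(people)
```

Comparing only segments of the same gender reduces the number of similarity checks and avoids merging a male and a female segment that happen to have close MFCC vectors.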
The DNN-enabled speaker count (DNN-SC) module has four submodules: (i) preprocessing, (ii) feature extraction, (iii) gender detection, and (iv) DNN model. In the preprocessing module, raw audio signals are segmented dynamically based on the confidence score of segments, where each segment contains one speaker's voice information. Detailed segmentation is discussed in Section 5.4.1. Frames are generated from these raw audio segments, and selected frames are admitted into the feature extraction module, which extracts mel-frequency cepstral coefficients (MFCCs), zero crossing rate (ZCR), and spectral flux (SF). The DNN establishes nonlinear correlations among these features. In the gender detection module, male and female segments are differentiated using pitch information calculated with the help of the YIN [46] algorithm. Gender-specific audio features are passed into the deep neural network to count the total number of speakers present in the audio conversation data. Figure 3 shows our DNN-SC architecture. Next, we discuss the details of our frame selection algorithm and DNN-based speaker-counting algorithm.

Frame Selection Algorithm.
Frames are created from the raw audio segments, which may contain important human voice information, silence, white noise, and so on. Since we are interested only in the voice information, we need to discard unwanted frames to improve the performance of our people-counting algorithm. These unwanted (unvoiced, silence, etc.) frames can occur at any time due to the different positions of the phone or contexts of the environment. Silence or unvoiced frames have low energy levels. Energy is obtained by calculating the root mean square (RMS) value of the frame. Spectral entropy is also a good indicator of unwanted audio frames. White noise or silence has a flat spectrum and high entropy, whereas low entropy indicates human voice information. The entropy of a frame is obtained from the normalized fast Fourier transform (FFT) spectrum of the frame. We represent the spectral entropy mathematically as follows:
$$h_f = -\sum_{i=1}^{n} p_i \log p_i, \quad (1)$$
where $p_i$ is the value of the normalized FFT spectrum at the $i$th frequency bin and $n$ is the number of bins.
Our frame selection algorithm selects frames based on the RMS and entropy values calculated above. Voiced frames have high RMS values when the recorded audio is loud. However, the recorded audio may be quiet due to the microphone or the phone's position. In this case, we use entropy to admit or discard a frame. Since we are only interested in voiced frames, we focus on increasing the true positives. Therefore, we use two thresholds, the RMS threshold (rth) and the entropy threshold (hth), to admit or discard a frame. We admit a frame when either the RMS value crosses the threshold (rth) or the entropy value $h_f$ is lower than the threshold (hth). These threshold values depend on the phone, microphone, and context, and we determine them empirically from the collected audio data. The complete procedure is summarized in Algorithm 2.
Algorithm 1: People-counting heuristic.
Input: set of segments $S = \{S_1, S_2, \ldots, S_N\}$; total number of segments $N$
Output:
▷ Sort the MFCC set and keep the sorted MFCC set in the same set $M$
$PS = \{\}$ ▷ Initialize the person set, which contains similar persons in sets $PS_j$
for $i$ from 1 to $N$ do
for $j$ from $(i + 1)$ ...
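The frame admission rule of the frame selection algorithm (admit a frame when its RMS exceeds rth or its spectral entropy falls below hth) can be sketched as follows. This is a minimal illustration, not Algorithm 2 itself: it uses a naive DFT from the standard library instead of an optimized FFT, treats a silent frame as having a flat spectrum, and the threshold values in the usage note are hypothetical placeholders rather than the empirically determined ones.

```python
import cmath
import math


def rms(frame):
    """Root mean square energy of one frame of audio samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def spectral_entropy(frame):
    """Entropy of the normalized magnitude spectrum (bins 0..n/2)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        # Naive O(n^2) DFT; an FFT routine would be used in practice.
        s = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s))
    total = sum(mags)
    if total == 0.0:
        # Assumption: an all-zero (silent) frame is treated as a flat spectrum.
        return math.log(len(mags))
    probs = [m / total for m in mags]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_frames(frames, rth, hth):
    """Keep frames that are loud enough OR spectrally concentrated enough."""
    return [f for f in frames if rms(f) > rth or spectral_entropy(f) < hth]
```

For example, with hypothetical thresholds `rth=0.1` and `hth=2.0`, a quiet but tonal frame is still admitted through the entropy test, whereas a silent frame fails both tests and is discarded.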

DNN-Based Speaker Counting.
We construct a four-layer deep neural network (DNN) [29] to cluster the entire audio signal. Figure 4 shows the building block of our DNN. A DNN is a feed-forward artificial neural network that has one or more hidden layers between the input and the output layers. The restricted Boltzmann machine (RBM) is the basic building block of a DNN, and RBMs are stacked one after another to form the network. An RBM is a type of Markov random field (MRF) that has one visible layer and one hidden layer. Each layer is composed of binary stochastic units. All units of the visible layer are connected to the hidden layer units, but there are no visible-visible or hidden-hidden connections. Each hidden unit's output depends on all of the visible units, the corresponding connection weights, and a bias factor. The probability distribution function is defined using these weights and biases and the joint distribution of the visible ($v$) and hidden ($h$) state vectors. It is defined through an energy function, which in its standard form is represented as follows:
$$E(v, h; \theta) = -\sum_{i=1}^{N_v} \sum_{j=1}^{N_h} w_{ij} v_i h_j - \sum_{i=1}^{N_v} a_i v_i - \sum_{j=1}^{N_h} b_j h_j,$$
where $N_v$ and $N_h$ represent the number of visible and hidden units, respectively, $\theta = (w, b, a)$ is the model parameter, $w$ is the weight, and $a$ and $b$ are the biases of the visible and hidden units, respectively. Each RBM constructs hidden units from the given visible units and reconstructs the visible units from the constructed hidden units. The visible vector probability is defined as follows:
$$p(v; \theta) = \frac{\sum_{h} e^{-E(v, h; \theta)}}{\sum_{v'} \sum_{h} e^{-E(v', h; \theta)}}.$$
The conditional distributions of the visible and hidden layers are defined as follows:
$$p(h_j = 1 \mid v; \theta) = \sigma\Big(\sum_{i=1}^{N_v} w_{ij} v_i + b_j\Big), \qquad p(v_i = 1 \mid h; \theta) = \sigma\Big(\sum_{j=1}^{N_h} w_{ij} h_j + a_i\Big),$$
where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid function. The visible units of our first-layer RBM are Gaussian visible units [47], which take the real-valued features extracted from the audio signal. The remaining RBM layers employ rectified linear unit (ReLU) activation functions to produce the binary output. This DNN model determines the total number of speakers present in a conversation in two phases: (i) preclustering and (ii) speaker counting, where the former is responsible for preclustering the audio segments and the latter determines the appropriate number of speakers or clusters present in the provided audio data.
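For intuition, the energy and the conditional activation probabilities of an RBM with binary units can be computed as in the following sketch. It assumes the standard Bernoulli-Bernoulli formulation, with `a` the visible biases and `b` the hidden biases as in the symbol definitions above; it does not model the Gaussian visible units of the first layer, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def energy(v, h, w, a, b):
    """Standard binary RBM energy; w[i][j] links visible i to hidden j."""
    interaction = sum(w[i][j] * v[i] * h[j] for i in range(len(v)) for j in range(len(h)))
    visible_term = sum(a[i] * v[i] for i in range(len(v)))
    hidden_term = sum(b[j] * h[j] for j in range(len(h)))
    return -interaction - visible_term - hidden_term


def p_hidden_given_visible(v, w, b):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(b[j] + sum(w[i][j] * v[i] for i in range(len(v)))) for j in range(len(b))]


def p_visible_given_hidden(h, w, a):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(a[i] + sum(w[i][j] * h[j] for j in range(len(h)))) for i in range(len(a))]
```

Because there are no visible-visible or hidden-hidden connections, each conditional factorizes over units, which is why both functions can evaluate every unit independently.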
(1) Preclustering. This module combines consecutive segments that are from the same speaker. We train the DNN in a greedy layerwise manner before uniting the smaller segments into larger ones. The unlabeled audio data are used to train the model with the contrastive divergence (CD) algorithm [48], which calculates the gradient and updates the model weights as follows:
$$\Delta w_{ij} \propto \langle v_i h_j \rangle_{\mathrm{data}} - \langle v_i h_j \rangle_{1},$$
where $\langle v_i h_j \rangle_{\mathrm{data}}$ is the expectation over the training data and $\langle v_i h_j \rangle_{1}$ is the expectation calculated from the distribution of samples obtained with the Gibbs sampling method [47]. The raw audio features (MFCC, zero crossing rate (ZCR), and spectral flux) are placed side by side to form a feature vector for a raw audio segment. These feature vectors are used to train each RBM one after another in a greedy layerwise fashion to find the correlation between the features and the distinct vocal tract characteristics of the speaker. Once pretraining is completed, each raw audio segment produces a binary feature vector, which is then used to form clusters using the forward clustering method. Assume that the binary feature vector set for the audio segments is $f_{S_1}, f_{S_2}, \ldots, f_{S_N}$, where $N$ is the number of segments present in the audio signal, and that the raw audio segment set is represented as $S_1, S_2, \ldots, S_N$. In the forward clustering method, we pick the first segment's feature vector $f_{S_1}$ and calculate the cosine distance against $f_{S_2}$. If the distance is smaller than the similarity threshold $\delta_s$, we merge these two raw segments into a new raw audio segment $S_1$ and compute the cluster centroid $C_1$ by taking the average of these two segments' feature vectors. We then calculate the raw features from the merged segment again and formulate the binary feature vector with the help of the pretrained network. Next, we compare this newly computed feature vector $f_{S_1}$ with $f_{S_3}$. If these two are similar, we repeat the merging and update the centroid. Otherwise, we begin comparing $f_{S_3}$ with $f_{S_4}$ and form a new centroid. In this forward pass, we merge consecutive segments of a similar speaker to form a bigger segment, since consecutive short voice segments are highly likely to originate from the same speaker. The longer segments increase the likelihood of distinguishing the different speakers and thus improve the clustering performance. After this forward clustering, we obtain the merged segment set $S_{C_1}, S_{C_2}, \ldots, S_{C_k}$ and the corresponding cluster centroid set $C_1, C_2, \ldots, C_k$ of the inferred speakers, where $k$ represents the number of newly inferred speakers. After this preclustering, we have both longer (combined) and shorter segments, $S_1, S_2, \ldots, S_{k+m}$, where $m$ is the number of smaller nonmerged segments. We compute the pitch for each of the longer segments, so that each centroid is associated with pitch information. We denote the pitch set as $P_{S_1}, P_{S_2}, \ldots, P_{S_k}$, which helps determine the gender of the speaker. Figure 5 shows the schematic diagram of our preclustering method.
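The forward clustering pass can be sketched as follows. This is a simplified illustration under stated assumptions: the feature vectors are taken as given (no pretrained network is involved), and instead of recomputing the feature vector of a merged segment through the network, the running-average centroid is used as a stand-in for comparison with the next segment. The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0  # treat an all-zero vector as maximally dissimilar
    return 1.0 - dot / (nu * nv)


def forward_cluster(features, delta_s):
    """Merge runs of consecutive similar segments.

    Returns a list of clusters, each with the merged segment indices and the
    average (centroid) of their feature vectors.
    """
    if not features:
        return []
    clusters = []
    current = {"indices": [0], "centroid": list(features[0])}
    for idx in range(1, len(features)):
        f = features[idx]
        if cosine_distance(current["centroid"], f) < delta_s:
            n = len(current["indices"])
            current["centroid"] = [(c * n + x) / (n + 1) for c, x in zip(current["centroid"], f)]
            current["indices"].append(idx)
        else:
            clusters.append(current)
            current = {"indices": [idx], "centroid": list(f)}
    clusters.append(current)
    return clusters
```

Because only consecutive segments are compared, a speaker who talks again later in the conversation yields a separate precluster; resolving such repeats is left to the subsequent speaker-counting phase.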
(2) Speaker Counting. In this step, we count the number of people from the preclustered segments. We employ the DNN for this purpose, but it requires ground truth label information to compute the gradient of the network parameters. Since we have no label information, we use the previously computed centroid set as the initial labels for this network. We start with segment $S_i$ and pass it through the network to generate the feature vector $Z_i = (z_1, z_2, \ldots, z_l)$, where $l$ is the total number of output units. The output of each unit is calculated as follows: where $x_j$ is the output of the $j$th unit from the previous layer.
We then compute the cosine distance against all the centroids that have the same gender information as the current segment. If the cosine similarity distance ($\delta$) is less than the empirically calculated threshold, $\delta_s$, we compute the new mean centroid $C_i$ across all the similar centroids. This process is repeated for each of the segments. If any segment has a cosine similarity distance $D(C_i, S_j)$ greater than the empirically determined intraspeaker distance threshold, $\delta_s$, and less than the interspeaker distance threshold, $\delta_d$, we discard that segment. When a segment's cosine similarity distance is higher than the interspeaker threshold, $\delta_d$, we assign its feature vector as a new centroid in the network. Since these intra- and interspeaker cosine distance thresholds depend on the microphone sensitivity, we determine them from our collected samples such that the total number of false positives is reduced. We validate our model by setting $\delta_s = 16$ and $\delta_d = 31$. To optimize the network parameters, $\theta = (w, b)$, we define our network objective function, $J(\theta)$, based on the cosine similarity measure as follows: where $1\{\cdot\}$ is the indicator function, which equals one when the cosine distance satisfies $\delta \le \delta_s$ or $\delta > \delta_d$ and zero otherwise. We jointly optimize the network parameter $\theta$ and the cluster centroids using the stochastic gradient descent (SGD) algorithm. The gradient of the objective function, $J$, with respect to each unit ($z_l$) of $Z_i$ is calculated as follows: where Similarly, we calculate the gradient of the objective function, $J$, with respect to each component ($c_l$) of the centroid, $C_j$. The derivative is represented as follows: where These gradients are passed down to our DNN, and the standard backpropagation algorithm is used to optimize the network's weights, $w$, and biases, $b$. Once the training of our DNN is complete, the total number of clusters represents the total number of speakers present in a conversation.
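The three-way decision on each segment (merge into a centroid, discard, or create a new centroid) can be sketched as follows. This is only an illustration of the thresholding logic under stated assumptions: the thresholds 16 and 31 are interpreted as angles in degrees, gender filtering and the DNN training loop (objective, SGD, backpropagation) are omitted, and the feature vectors are taken as given. The function names are hypothetical.

```python
import math


def angle_deg(u, v):
    """Angular (cosine) distance between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 90.0
    cos = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(cos))


def count_speakers(segment_vectors, delta_s=16.0, delta_d=31.0):
    """Return (number_of_centroids, number_of_discarded_segments)."""
    centroids = []  # each entry: [mean_vector, number_of_merged_segments]
    discarded = 0
    for z in segment_vectors:
        if not centroids:
            centroids.append([list(z), 1])
            continue
        dists = [angle_deg(c[0], z) for c in centroids]
        best = min(range(len(dists)), key=dists.__getitem__)
        if dists[best] <= delta_s:
            # Same speaker: update the running mean of the nearest centroid.
            mean, n = centroids[best]
            centroids[best] = [[(m * n + x) / (n + 1) for m, x in zip(mean, z)], n + 1]
        elif dists[best] > delta_d:
            # Far from every known speaker: start a new centroid.
            centroids.append([list(z), 1])
        else:
            # Ambiguous zone between the two thresholds: drop the segment.
            discarded += 1
    return len(centroids), discarded
```

Discarding the segments that fall between the two thresholds trades a few lost segments for fewer spurious speakers, which matches the stated goal of reducing false positives.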

Occupancy Estimation Using Accelerometer Signature.
In this section, we discuss our locomotive sensing model, which applies in the absence of any conversational data or in a mixed environment where some people talk while others listen silently. If a smartphone is stationary for a significant amount of time, the on-board accelerometer sensor produces a steady-state signature with no variation or spikes in signal amplitude, whereas a movement generates a spike or otherwise alters the steady-state signal. To detect these abrupt changes in locomotive signal amplitude, we propose to use a change point detection-based technique [49].
Change point detection helps to find abrupt variations in the movement data stream. Our motivation in this work is to use change points to find stray movements by locating abrupt changes in the accelerometer signals. These changes help infer binary people counting (whether people are present or not). We investigated the offline Bayesian change point detection-based algorithm [49], which runs in $O(n^2)$, for inferring the occupant's presence. Let the observed accelerometer data sequence be $x_{1:N} = x_1, x_2, x_3, \ldots, x_N$, where $N$ denotes the number of data points over time $T$. We partition this data sequence into nonoverlapping regions based on run length [50]. The length of each partition, or the time since the last change point occurred, is defined as the "run length". If there are $m$ partitions, then the partition data set is denoted as $\rho_1, \rho_2, \rho_3, \ldots, \rho_m$. We also denote $x_{t_i:t_j}$ as the contiguous set of observations between times $t_i$ and $t_j$ inclusively. If the length of the current run at time $m$ is denoted by $r_m$, then it can be defined as follows:

$$r_m = \begin{cases} 0, & \text{if a change point occurs at time } m, \\ r_{m-1} + 1, & \text{otherwise.} \end{cases}$$
Change points occur at discrete time points. The conditional probability that a change point occurs at time $t_k$ after the last change point is expressed in terms of $\pi(t_m)$, the prior probability of a change point at time $t_m$, and depends on the probability distribution of the observed data sequence and the preceding change point. The change point detection algorithm computes the predictive distribution $\pi(x_{n+1} \mid x_{1:n})$ for a given run length by integrating over the posterior distribution $\pi(r_n \mid x_{1:n})$, which is computed using the following equation:

$$\pi(r_n \mid x_{1:n}) = \frac{\pi(r_n, x_{1:n})}{\pi(x_{1:n})}.$$

It also finds the joint distribution over the run length and the observed data as follows:

$$\pi(r_n, x_{1:n}) = \sum_{r_{n-1}} \pi(r_n \mid r_{n-1})\, \pi(x_n \mid r_{n-1}, x_{1:n})\, \pi(r_{n-1}, x_{1:n-1}),$$

where $\pi(x_n \mid r_{n-1}, x_{1:n})$ is the segment log-likelihood, which depends on the data $x_n^{(r)}$, and $\pi(r_n \mid r_{n-1})$ is the change point probability, which can be calculated as follows:

$$\pi(r_n \mid r_{n-1}) = \begin{cases} H_f(r_{n-1} + 1), & \text{if } r_n = 0, \\ 1 - H_f(r_{n-1} + 1), & \text{if } r_n = r_{n-1} + 1, \\ 0, & \text{otherwise,} \end{cases}$$

where the hazard function $H_f(\eta)$ is calculated using $H_f(\eta) = g(\eta) / \sum_{j=\eta}^{\infty} g(j)$.

We employ this change point technique in our locomotive sensing model to design the binary occupancy detection algorithm, which is built on the following threefold methodology. First, we calculate the a priori probability of two successive change points at a distance $d$ (run length). We use the Gaussian-based log-likelihood model [50] to compute the log-likelihood of the data in a sequence $[s, d]$ where no change point has been detected. Second, we calculate the log-likelihood for the entire signal $S[t, n]$; the log-likelihood of the data sequence $S_s[t, s]$, where no change point has occurred between $t$ and $s$; and $\pi[i, t]$, the log-likelihood that the $i$th change point occurs at time step $t$. Finally, we calculate the probability of a change point at time step $t$ by summing up the log-likelihoods for that sequence. Figure 6 presents the change points and their probabilities as detected by our proposed locomotive sensing model using the smartphone's accelerometer sensor. We filter those change points based on an empirically determined threshold probability ($\delta_{th}$) and infer the presence of the occupants based on the admitted change point sequence. We also count the number of change points in the data sequence, which gives the movement score that represents how frequently a person moves. The overall algorithm is summarized in Algorithm 3, which we name the locomotive speaker-counting (LSC) algorithm.
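The final filtering step can be sketched independently of the Bayesian machinery. The snippet below is a minimal illustration, assuming that per-sample change point probabilities have already been computed; the function name and the default threshold of 0.5 are hypothetical, since the empirically determined threshold value is not stated in the text.

```python
def infer_presence(change_point_probs, delta_th=0.5):
    """Filter change points by probability and derive presence and movement score.

    change_point_probs: sequence of change point probabilities, one per time step.
    delta_th: admission threshold (hypothetical value; the text determines it empirically).
    """
    # Keep only the time steps whose change point probability reaches the threshold.
    admitted = [t for t, p in enumerate(change_point_probs) if p >= delta_th]
    present = len(admitted) > 0    # binary occupancy indication
    movement_score = len(admitted)  # how frequently the person moved
    return present, movement_score, admitted
```

A stationary phone yields no admitted change points and therefore a negative presence indication, whereas every admitted change point increments the movement score.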

Location Estimation.
In this scenario, our goal was to explore the possibility of inferring the location at the zone/room level in different commercial and residential buildings by using only the smartphones' magnetometer sensor signals. Intuitively, this is possible because different rooms have distinct magnetic patterns that arise from their unique structures and furniture layouts. This opens up the possibility that a sophisticated machine learning technique may learn to discriminate the magnetic signatures belonging to different rooms. In our experiment, we collected the magnetic signatures of different rooms, office spaces, and lobby areas in an academic building using the smartphones' magnetometer sensor. In a room, furniture and metallic objects generally remain fixed in position and are rarely moved from one place to another. This gives us the intuition that each room has its own magnetic fingerprint, which can be utilized to detect that specific room or semantic location.
We notice that the magnetic sensor is sensitive to magnetic fluctuations indoors, especially near pillars and metallic objects. Figure 7 shows this behavior: peaks occur near pillars, elevators, and so on because pillars and elevators emit high magnetic fields. The magnetic fields produced by pillars differ for each floor because of their varying intensity levels. These intensity characteristics aid localization because each floor differs from the other levels in structure and height, which also makes it possible to infer floor-level location. From these empirical observations, we conclude that each room has its unique magnetic fingerprint. We analyzed data from different rooms in the university's Information Technology and Engineering (ITE) building for three months. Figure 8 presents this analysis, which depicts the room-specific magnetic fingerprints that help create a coarse localization model for pinpointing the semantic location of gatherings at the zone/room level.
We also note that this magnetic signal differs not only across indoor environments but also with the phone's placement. We mitigate this disturbance in two ways: (i) calibrating the magnetic signals and (ii) calculating the absolute magnitude.
During our experimentation, we observed that the magnitude yields a different fingerprint for each indoor environment. Figure 9 shows how the normalized magnitude of different rooms varies over the collected samples. Repeating this experiment over several rooms helps establish that each room exhibits a different magnitude, which may form its own fingerprint. We consider the magnitude of the magnetometer signal because it deviates only slightly across persons with distinct movements. Figure 10 illustrates this characteristic: the magnetic signature was collected from two different people in the same room, and both signals trace the same shape with almost the same magnitude.
From this empirical study, we conclude that it is difficult to estimate fine-grained indoor location in different indoor environments using the magnetic signature alone; for this reason, we also consider the mean, standard deviation, and variance of the different axes. From those feature vectors, we generate training and testing sets using the cross-validation process. We use the training set to learn indoor characteristics with different machine learning models and later use the testing set to predict location. To estimate fine-grained semantic location, we use SVM, J48, and random forest classifiers.

Crowdsourcing Magnetic Model.
We propose to use collaborative sensing, or crowdsourcing, to ease our ground truth data collection and location-mapping process. We divided the area of interest inside the ITE building into a grid of square cells (details are provided in Section 5.2). We collected data from the most frequently visited grids that had no major obstruction. While crowdsourcing the unique characteristics of a grid location, it was difficult to choose the right representation of the data because analogous magnetic signatures were prevalent across grids in different locations. As a result, it was necessary to display a set of potential locations from which the crowd would finalize the association of a semantic label with a particular observed magnetic signature pattern. Considering this, we provide the floor information for a specific signature pattern so that our crowdsourcing model enables the crowd to choose the appropriate semantic location or room on that specific floor. Nevertheless, the search space remains large because multiple rooms with similar magnetic footprints on the same floor are quite abundant. We propose a simple grid-mapping crowdsourcing model that reduces the search space by mapping the magnetic signature pattern of the point of occupancy onto the existing patterns and sorting the rooms according to the similarity measurement. Our model takes the Manhattan distance and the squared deviation of the magnetic magnitude as input parameters for the mapped grids and searches the repository of existing signature patterns.
Consider a set of cell values found from a test pattern $X = \{x_1, x_2, x_3, \ldots, x_n\}$. First, we take $x_1$ from $X$ and try to map this value onto the cell values of the existing patterns. We do not assume any prior knowledge of the organization of the cells in the test pattern. For mapping signature values, we allow a deviation of $\pm 2$, which has been determined empirically from our experiments.

We add patterns which match the similarity value of a cell to our candidate set $C$ and initialize an $n \times n$ distance matrix $M^{(i)}$ and an $n \times 1$ deviation matrix $D^{(i)}$ for each candidate $c_i$. $M^{(i)}$ records the Manhattan distances between the mapped cells in a candidate pattern $c_i$, and $D^{(i)}$ stores the squared deviation between the mapped cell values. If we find similarities in multiple cell values in a single room signature pattern, we consider each of them as an individual candidate. In the next iteration, we take the next test value, $x_2$, and perform the same operation as for $x_1$, but this time we consider only the candidates in $C$. In this iteration, if the deviation and distance matrices of a candidate $c_j$ are not updated, then we discard that candidate from the candidate set and thereby reduce the search space. We repeat the same mapping for the remaining grid values and compute the final matching candidate set $C_F$ with the corresponding distance and deviation matrices.
At this stage, it is still possible to have a large number of candidates in $C_F$. To tighten the search space, we next compute the error measurement $E(c_i)$ for each candidate and sort the candidates with respect to this value, assuming that, in an ideal conversational episode, the participants remain in close proximity. In calculating $E(c_i)$, the indices take the values $k = 1$, $l = 1$, $1 \le a \le n$, and $b = 1$.
After calculating the error measurement for each candidate, we sort $C_F$ and choose its first 10 candidates. We plot the magnetic signature patterns of these candidates together with the test pattern. The crowd then has to choose the signature pattern in which they find the test pattern. In our experiments, there were some cases where we observed an empty candidate set. In these cases, we selected the candidate set of the last iteration that was not empty. We also asked the crowd to choose the earliest signature pattern if they found a match with multiple candidates.
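The iterative pruning of the candidate set can be illustrated with a simplified sketch. It assumes that each stored pattern is a flat list of cell magnitudes, applies only the $\pm 2$ tolerance test, and falls back to the last non-empty candidate set as described above; it does not track cell positions, Manhattan distances, or the error measurement used for ranking. The function name and the room labels in the usage note are hypothetical.

```python
TOLERANCE = 2.0  # deviation of +/- 2 stated in the text


def filter_candidates(test_values, patterns, tol=TOLERANCE):
    """Prune stored signature patterns against successive test cell values.

    test_values: cell values x_1, ..., x_n of the test pattern.
    patterns: dict mapping a pattern name to its list of cell values.
    Returns the set of surviving pattern names.
    """
    candidates = set(patterns)
    for x in test_values:
        # A candidate survives if at least one of its cells matches x within tol.
        survivors = {
            name for name in candidates
            if any(abs(x - v) <= tol for v in patterns[name])
        }
        if not survivors:
            # Empty set: keep the last non-empty candidate set, as in the text.
            break
        candidates = survivors
    return candidates
```

With stored patterns `{"room_a": [40, 45, 50], "room_b": [40, 60, 70], "room_c": [10, 20, 30]}` and test values `[41, 46]`, the first value keeps `room_a` and `room_b`, and the second value leaves only `room_a`.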

System Implementation and Evaluation Results
We now discuss the detailed implementation and evaluation of our model framework.

Tools and Resources.
We used a Google Nexus-5 with a built-in microphone and three-axis accelerometer sensor for our experiments. Our entire system comprises two parts: (i) sensing and (ii) classification and clustering; the first was implemented on the Nexus-5 and the latter on the server. The application software was written in Java and utilizes the Android application programming interface (API) to sense microphone and accelerometer signals. The classification and clustering algorithms and our occupancy-counting algorithm were implemented on the server side using Python.
We use the Python-based deep learning platform TensorFlow [51] to implement our deep neural network- (DNN-) based clustering algorithm. Features are fed into the DNN in batches of 32. Our DNN comprises 4 layers: one input layer of 22 units, two hidden layers with 1024 units each, and one output layer of 512 units. In the pretraining phase, each layer was trained for 100 epochs, and in the fine-tuning phase, each layer was trained for 1000 epochs. The internal architecture of our DNN is shown in Table 1.
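The layer dimensions above can be checked with a small forward pass. The sketch below uses NumPy rather than TensorFlow so that it stays self-contained; the tanh activation, the weight initialization, and the function names are assumptions for illustration, and the pretraining and fine-tuning phases are not reproduced.

```python
import numpy as np

LAYER_SIZES = [22, 1024, 1024, 512]  # input, two hidden layers, output (from the text)
BATCH_SIZE = 32                      # batch length from the text


def init_params(sizes, seed=0):
    """Create one (weights, bias) pair per pair of consecutive layers."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, params):
    """Propagate a batch through the network; tanh is an assumed activation."""
    h = x
    for w, b in params:
        h = np.tanh(h @ w + b)
    return h


params = init_params(LAYER_SIZES)
batch = np.zeros((BATCH_SIZE, LAYER_SIZES[0]))
features = forward(batch, params)  # one 512-dimensional vector per input row
```

Each 22-dimensional input row is mapped to a 512-dimensional output vector, which is the representation compared against the cluster centroids.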

Data Collection.
Magnetic sensor signals are sensed through our Android application and stored temporarily on mobile storage. We first collected magnetic data for the training set and subsequently for the testing set. We divided the room space into small regions, each with an area of 0.5 × 0.5 m², and named each region a cell. Thus, each room forms a grid of cells. We collected data from each cell for 5 minutes in both the clockwise and counterclockwise directions to form the training set. We also maintained a fixed height (approximately 4 feet from the floor) when collecting our ferromagnetic fingerprint because the fingerprint also depends on the height.
The partial 3rd floor map along with the sample magnetic data collection path is shown in Figure 11. It shows the sample data collection path of room number 305, where the green line shows how the grid is formed and the red line shows the data collection path in both directions along the grid. We use a sampling rate of 5 Hz for the magnetometer sensor data. We implemented the acoustic sensing and collected conversational data from different places at different times in natural settings. Conversational data were collected and properly anonymized during spontaneous lab conversations among the students (without making the occupants aware of it), lab meetings, and general discussions in the lobby/corridor in the presence of a variety of surrounding noise levels. The demographic for our conversational data collection was 1-10 persons (5 females and 5 males) in the age group of 18-50 years. The acoustic data were recorded in mono at a sampling rate of 16 kHz with 16-bit pulse-code modulation (PCM).

Privacy.
One of the major concerns of smartphone-based acoustic signal processing is privacy. This concern becomes more serious when the smartphone records conversation data. Our counting algorithm determines the number of speakers in this environment in an anonymized manner. We used a text file as the cover in which our recorded audio is embedded. A secret key, known to both the sender and the recipient, is used for the embedding and extraction process. A steganographic function takes the cover file as an argument and then embeds the audio file and key to produce the stego file as output, which is sent to our server. A reverse steganographic function on our server side takes the stego file and key as parameters and produces the audio file as output.
There are different steganographic methods (e.g., LSB coding, parity coding, and phase coding), but we used the simplest one, the least significant bit (LSB) algorithm, which replaces the least significant bits of some bytes in the cover file to hide a sequence of bytes containing the hidden data. To generate the stego file, the algorithm first converts each character of the cover file into a bit stream, then converts the audio file into a bit stream, and finally replaces the LSB of each cover byte with a bit of the audio, which is the secret information. We also ensured that the size of the file was not changed during this encoding and that the method was suitable for any audio file format.
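The LSB replacement described above can be sketched in a few lines. This is a minimal illustration that operates on in-memory byte strings and omits the secret key; the function names are hypothetical, and the payload length is assumed to be known to the receiver.

```python
def embed_lsb(cover: bytes, payload: bytes) -> bytes:
    """Hide payload bits in the least significant bits of the cover bytes."""
    # Expand the payload into individual bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(cover):
        raise ValueError("cover is too small for the payload")
    stego = bytearray(cover)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the lowest bit
    return bytes(stego)


def extract_lsb(stego: bytes, n_bytes: int) -> bytes:
    """Recover n_bytes of payload from the least significant bits."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for b in stego[8 * i: 8 * i + 8]:
            value = (value << 1) | (b & 1)
        out.append(value)
    return bytes(out)
```

Because only the lowest bit of each cover byte is rewritten, the stego output has exactly the same length as the cover, which matches the size-preservation property noted above.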

Preprocessing.
In this section, we discuss the details of our preprocessing module.

Acoustic Data Preprocessing.
We process the raw audio streams to remove noise and prepare the audio data for the feature extraction module. This module is responsible for segmenting the raw audio signals to extract appropriate frames. These frames contain event information (i.e., voice, noise, and silence) that is used in further processing.
(1) Dynamic Segmentation. We create segments from the entire audio signal dynamically, assuming that each raw audio segment contains single-speaker information. We calculate the confidence score for the entire audio segment, which represents the probability of finding the pitch within a segment. We then compute the confidence score starting from a small segment (32 ms), increase the step size in successive iterations, and repeat this up to an audio segment of 10 seconds. We calculated the variance of this confidence score, and based on the lower variance associated with a specific segment, we selected that segment length as one unit of conversation.
If a segment has over 90% confidence, we consider it. As there are many audio segments with different segment lengths, we choose the segment length corresponding to a single-person unit that is associated with a higher confidence score and a greater number of audio segments with a lower segment length. Figure 12 shows the confidence scores for different segment lengths. We selected 2.72 s as the segment length instead of 3.36 s: both have a confidence score of 1, but the first segment length admitted a greater number of segments than the latter. We calculated this confidence score using the YIN [46] algorithm with nonoverlapping frames and skipped the best local estimate step.
This helps to determine in real time the unit audio segment, which depends solely on the recorded audio.
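The selection rule can be expressed as a simple ranking. The sketch below assumes that each candidate segment length has already been scored; the tuple layout, the function name, and the segment counts in the usage note are hypothetical, and the YIN-based confidence computation itself is not reproduced.

```python
def select_segment_length(candidates, min_confidence=0.9):
    """Pick the unit segment length from scored candidates.

    candidates: list of (length_seconds, confidence, n_segments) tuples.
    Only candidates above min_confidence are considered; ties on confidence
    are broken by the larger number of admitted segments, then by the
    shorter length.
    """
    eligible = [c for c in candidates if c[1] > min_confidence]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: (c[1], c[2], -c[0]))
    return best[0]
```

For instance, with candidates `[(2.72, 1.0, 40), (3.36, 1.0, 30), (1.0, 0.8, 100)]`, the rule returns 2.72: both long candidates have full confidence, but the shorter one admits more segments, and the 1.0 s candidate is rejected for low confidence.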
As the human voice ranges approximately from 300 Hz to 4000 Hz, we filter each segment to that frequency range using a band-pass filter. After filtering the raw audio, we apply the Hamming window to reduce spectral leakage while creating audio segments.
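The filtering and windowing steps can be sketched with NumPy alone. The snippet assumes an idealized FFT-mask band-pass filter, which is a stand-in for whatever filter design was actually used; the function names are hypothetical, while the 16 kHz sampling rate and the 300 to 4000 Hz band come from the text.

```python
import numpy as np

SAMPLE_RATE = 16_000            # Hz, acoustic sampling rate from the text
LOW_HZ, HIGH_HZ = 300.0, 4000.0  # voice band from the text


def bandpass_fft(segment, sample_rate=SAMPLE_RATE, low=LOW_HZ, high=HIGH_HZ):
    """Zero out spectral components outside [low, high] (idealized filter)."""
    spectrum = np.fft.rfft(segment)
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sample_rate)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=len(segment))


def apply_hamming(segment):
    """Taper the segment with a Hamming window to reduce spectral leakage."""
    return segment * np.hamming(len(segment))


# Usage: a 100 Hz hum is removed while a 1 kHz tone inside the band is kept.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
hum = np.sin(2 * np.pi * 100 * t)
tone = np.sin(2 * np.pi * 1000 * t)
filtered = bandpass_fft(hum + tone)
windowed = apply_hamming(filtered)
```

A practical implementation would more likely use a causal filter with a finite transition band; the brick-wall mask is chosen here only because it keeps the example short and easy to verify.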
(2) Framing. We create frames from the filtered audio segments using a fixed-width sliding window. Each frame has a length of 32 ms with 50% overlap. These frames are able to capture the subtle vocal characteristics of a person that are present in the sounds.

Feature Extraction.
In this section, we discuss the different features relevant to our acoustic, locomotive sensing, and localization techniques.

Magnetic Features.
For location detection, we used only the magnetometer sensor. The smartphone's magnetic sensor provides three axis values: x, y, and z. From these values, we calculated the magnitude using $m = \sqrt{x^2 + y^2 + z^2}$. We considered only the resultant magnitude to mitigate the variations in the readings of the individual axes that result from different phone positions. We also calculated the mean, variance, and standard deviation of each reading and combined those features to generate the feature vectors.
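One plausible reading of this feature construction is sketched below. It assumes that features are computed over a window of magnetometer samples and that the vector combines the mean magnitude with the per-axis mean, variance, and standard deviation; the window layout, the feature ordering, and the function name are assumptions for illustration.

```python
import numpy as np


def magnetic_features(window):
    """Build a feature vector from a window of (x, y, z) magnetometer samples.

    Returns 10 values: mean magnitude, then per-axis mean, variance,
    and standard deviation (assumed ordering).
    """
    window = np.asarray(window, dtype=float)
    # Resultant magnitude m = sqrt(x^2 + y^2 + z^2) for every sample.
    magnitude = np.sqrt(np.sum(window ** 2, axis=1))
    return np.concatenate((
        [magnitude.mean()],
        window.mean(axis=0),
        window.var(axis=0),
        window.std(axis=0),
    ))
```

Vectors of this form are what a classifier such as a random forest would consume, one vector per window and one label per room.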

Acoustic Features.
We generated four basic features which are used in speaker identification: MFCC, pitch, zero crossing rate (ZCR), and spectral flux. Each feature is described in detail in the following:
(i) MFCC is one of the most significant features used for acoustic processing. We followed these steps to compute it: (1) take the Fourier transform of (a windowed excerpt of) a signal, (2) map the powers of the spectrum obtained above onto the mel scale using triangular overlapping windows, (3) take the logs of the powers at each of the mel frequencies, and (4) finally, take the discrete cosine transform of the list of mel log powers. We excluded the first coefficient of the MFCC and then chose 20 coefficients as feature vectors. The MFCC feature computation schematic diagram is shown in Figure 13.
(ii) Pitch is defined as the lowest frequency of a periodic waveform. It is the discriminative feature between man and woman. The human voice pitch interval falls within the range of 50 Hz to 450 Hz [45]. We calculated the pitch of different segments using the YIN [46] algorithm.
(iii) Zero crossing rate (ZCR) is defined as the rate at which the signal changes its sign from positive to negative or back [52]. Human voice has both voiced and nonvoiced sounds. Nonvoiced and voiced sounds show lower or higher variations of the ZCR, respectively. Therefore, the ZCR is an important feature for counting the number of speakers. The ZCR is calculated over each frame.
(iv) Spectral flux (SF) [53] is defined as the $l_2$ norm of the spectral amplitude difference between the current frame, $F(t)$, and the previous frame, $F(t-1)$, and is mathematically represented as $SF(t) = \lVert F(t) - F(t-1) \rVert_2$. Human speech changes from voice to nonvoice rapidly and thus alters its spectral shape frequently. Spectral flux helps measure these spectral shape changes. Usually, speech has a higher SF value.
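The framing step and two of the features above can be sketched together. The snippet assumes 32 ms frames with 50% overlap at 16 kHz, a ZCR normalized by the number of adjacent sample pairs, and spectral flux computed on FFT magnitudes; the normalization choice and the function names are assumptions, since several ZCR conventions are in common use.

```python
import numpy as np

SAMPLE_RATE = 16_000                   # Hz, from the text
FRAME_LEN = SAMPLE_RATE * 32 // 1000   # 32 ms frame -> 512 samples
HOP_LEN = FRAME_LEN // 2               # 50% overlap -> 256 samples


def make_frames(signal, frame_len=FRAME_LEN, hop_len=HOP_LEN):
    """Slice a signal (at least one frame long) into overlapping frames."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([
        signal[i * hop_len: i * hop_len + frame_len] for i in range(n_frames)
    ])


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.count_nonzero(signs[1:] != signs[:-1]) / (len(frame) - 1)


def spectral_flux(frames):
    """l2 norm of the spectral amplitude difference between consecutive frames."""
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return np.linalg.norm(mags[1:] - mags[:-1], axis=1)
```

Applied to one audio segment, `make_frames` yields the frame matrix, `zero_crossing_rate` is evaluated per row, and `spectral_flux` returns one value per pair of consecutive frames.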

Locomotive Features.
We considered the magnitude of the accelerometer data as our locomotive feature in order to mitigate calibration effects.
In the error measure reported below, $f_i$ is the predicted value and $y_i$ is the actual value.

Occupancy-Counting Results.
We evaluated our opportunistic occupancy-counting algorithm in four scenarios: (i) no conversation among occupants, (ii) all occupants are conversing in a single clique, (iii) occupants are conversing in multiple cliques, and (iv) mixed conversing and nonconversing occupants.
(i) No conversation among occupants: for the first scenario, when no occupants are involved in a conversation, we used the accelerometer to count the occupancy. Each accelerometer sensor provides a binary occupancy indication based on our change point detection algorithm as discussed in Section 4.3, from which the total number of people present in the environment is computed. Figure 14 shows the total number of people successfully counted using our locomotive speaker-counting (LSC) algorithm. We note that our locomotive sensing model achieves 80% accuracy (8 out of 10 people) in predicting occupancy when most of the users carry their smartphones with them.
(ii) All occupants are conversing in a single clique: our opportunistic sensing system plays a critical role when all occupants are conversing in a single clique. Our system helps to activate a single microphone for occupancy counting and deactivate all other microphones and accelerometer sensors based on the server's feedback. Figure 15 depicts the effect of the cosine distance similarity measure on our occupancy-counting algorithm (I-SC), shown in Figure 1. We noticed that the similarity distance angle measure (in degrees) plays a pivotal role in reducing the error count of occupancy inference. In our experiments, with 3 people conversing, we found that a 15-degree similarity threshold is an appropriate choice for reducing the error count of our proposed adaptive people-counting algorithm.
We also ran experiments in an uncontrolled environment (completely in a natural setting) without imposing any restrictions on the smartphones' relative positions and distances from each other or from the server. Figure 16 reports an average error count distance of ≈0.5 with respect to different positions of the phone. We note that when the smartphone is placed on the table and two persons speak, the error count becomes zero, but when three persons start speaking, the error count becomes slightly higher due to ambient noise and overlapped conversation. Figure 17 shows the occupancy-counting results for DNN-SC for different positions of the phone. We notice that the average error count distance for DNN-SC is 0.30, which is 40% less than that of our I-SC approach, as we employ a more selective strategy to choose appropriate frames in our frame selection algorithm.
Figure 18 shows that the error count increases as the single clique leader's distance from the other occupants increases. We note that, for a 3-meter distance, the error count becomes close to two, which confirms that even for a large internal distance separation among the conversing occupants, our acoustic sensing model performs quite well.
Figure 19 shows the average error count distance for different distances of the phone from the speakers. Note that DNN-SC outperforms I-SC in this case. However, DNN-SC shows trends similar to those of I-SC as the distance of the phone from the speakers increases. Figure 20 presents the performance of our people-counting algorithm (I-SC) where users speak naturally with overlapped conversations. We observe that the average error count is 0.1 for 2 people and 1.7 for 10 people conversing together. Thus, the overall average error count is 0.76 with the number of users present varying from 2 to 10, establishing that our acoustic-based occupancy-counting algorithm performs well even in a crowded environment. Figure 21 presents the performance of our DNN-SC algorithm. We observe that the overall average error count for DNN-SC is 0.5316 with the number of speakers present varying from 2 to 10. Our DNN-SC people-counting algorithm thus improves on our I-SC occupancy-counting algorithm by 30%. In Figure 20, we notice that the performance of our I-SC algorithm decreases as the number of speakers present in a conversation increases, because of overlapping segments that span multiple speakers' voices and the limited capability of MFCC features to differentiate these speakers. In Figure 21, we observe trends similar to those of our I-SC method, but DNN-SC helps improve performance as the number of speakers increases because it can capture the hidden correlation between features.
(iii) Occupants are conversing in multiple cliques: in our third scenario, where occupants are conversing in multiple cliques (three cliques in our experiment), we deployed three microphones and accelerometer sensors, chosen based on the proximity measure from the server, to infer the occupancy.
Figure 22 shows the intragroup count in the presence of conversational data with distinct clique formation. In our experiments, the first group has 5 occupants (2 men and 3 women), the second group has 6 occupants (3 men and 3 women), and the last group has 8 occupants (4 men and 4 women). We observe that the mean error count is ≈1 even for our group-based acoustic sensing model, which attests to the promise of our occupancy detection model in different real-life scenarios.
(iv) Mixed conversing and nonconversing occupants: in our last scenario, where some people speak and some people remain silent, we propose to utilize our … + LSC), it is 0.5, more than a threefold increase in accuracy for inferring the total number of people.
From Table 2, we observe that our combined (DNN-SC + LSC) model outperforms the combined (I-SC + LSC) model by approximately 34% in total.
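The relative improvements quoted for the single-clique scenario follow directly from the average error counts reported in the text. The short check below recomputes them; the helper name is hypothetical, and only values stated in this section are used.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error count relative to a baseline."""
    return (baseline - improved) / baseline


# Phone-position experiment: I-SC ~0.5 versus DNN-SC 0.30.
position_gain = relative_reduction(0.5, 0.30)

# Varying number of speakers (2 to 10): I-SC 0.76 versus DNN-SC 0.5316.
crowd_gain = relative_reduction(0.76, 0.5316)

print(f"{position_gain:.0%}, {crowd_gain:.0%}")
```

Both values round to the 40% and 30% figures stated above.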

Location Estimation Results.
Figure 24 presents the location estimation error of an occupancy gathering using different classifiers. The random forest classifier performs best, with an average precision, recall, and F1 score of 0.98. We also validated our location model through different test cases in which we considered (i) different trajectories, (ii) different times of the day, and (iii) different rooms with a varying number of occupants.
We conducted our experiments following different trajectories, such as keeping the mobile phone on the table, following the same or reverse direction when collecting data, and, finally, collecting data randomly within a room. We noted that these different movement patterns do not greatly affect the performance of our occupancy-gathering location determination model. Figure 25 shows the errors for different movement patterns. We find that the stationary pattern shows better accuracy, while moving in the same direction gives a higher error rate. Average errors are close to 0.015, which is quite acceptable, with a minor number of false positives or false negatives.
Figure 26 depicts the varying nature of the magnetic signature during different times of the day. We observe that the location estimation of any gathering is similar across different times of a typical day. The error ranges approximately from 0.015 to 0.03 due to the global variation of weather and other magnetic factors, which makes our model effectively time invariant.
We also ran experiments on the location-sensing model with respect to different rooms on different floors of the ITE building with occupant groups of different composition and size. From Figure 27, we observe that the mean absolute error varies approximately in the range of 0.015 to 0.04, which has a negligible effect on the performance of our location-sensitive occupancy determination model. We observed some discrepancies between different subjects' data for room 321 and room 461. After investigating, we found that the discrepancies were due to unusual magnetic interference from electronic devices present while collecting data for subject II.
To evaluate our crowdsourcing model, we ran a simulation of our magnetic crowdsourcing model in the Vowpal Wabbit (VW) toolkit [54]. We implemented our mapping algorithm on the server side and then used the function active_interactor of VW to interact with the users. We showed 10 magnetic signature patterns and 1 test pattern to a user and asked the user to choose the magnetic signature pattern in which he/she finds the test pattern. Ten participants took part in the crowdsourcing, and in Figure 28, we show the overall accuracy for each participant when given 15 pattern-matching tasks. The average accuracy of obtaining the correct annotation for these 15 patterns is ≈81%, which is adequately high. Our results indicate that the probability of getting noisy labels is very low and that the crowd-annotated data can be used as input to the classifier.

Discussion and Future Work
In the current version of our work, we have assumed that people keep their smartphones in the pocket or in the hand, which might not be ideal in some cases. In the future, we plan to make our architecture more robust and independent of the smartphone's location.
The performance of our counting algorithm is not affected by TV or radio sounds, as TV and radio use different modulation techniques, which makes it easier for us to remove those external noises from the resultant audio signal. We have used source separation where significant overlap between human conversation and TV occurs. In the current implementation, the location-mapping process is independent of the classification process. In the future, we plan to develop and integrate a combined mapping and classification model. We also plan to investigate fine-grained floor-level location using smartphone barometric sensing. We further plan to investigate a more advanced opportunistic sensing model in which microphone, accelerometer, and magnetometer sensor participation is based not only on a server-based architecture but also on an intersmartphone distributed collaborative sensing approach.

Conclusion
In this paper, we presented an innovative system to infer the number of people present in a specific semantic location, which opportunistically exploits the accelerometer and microphone sensors of smartphones for people counting. We proposed an acoustic sensing-based unsupervised clustering algorithm that addresses the underlying challenges arising from naturalistic overlapped and sequential conversation to infer the occupancy in an environment. We proposed a change point detection-based locomotive sensing model to infer the number of people in the absence of any conversational episode. We implemented an opportunistic context-aware client-server-based architecture to leverage smartphones' microphone, accelerometer, and magnetometer sensors and combined our acoustic sensing with the locomotive and semantic location-sensing models to better predict location-augmented occupancy information. We also demonstrated a novel crowdsourcing model for reducing the effort of collecting location information at the zone/room level at a large scale. Our experimental results hold promise in a variety of natural settings, with an average error count distance of 0.76 in the presence of 10 users. We believe that this investigation helps to open up many new research directions in this opportunistic multimodal sensing domain.

(ii) Deep neural network- (DNN-) enabled speaker count (DNN-SC): it accepts raw audio signals, produces features such as MFCC and ZCR, and deploys the deep neural network (DNN) to infer occupancy.

Figure 1: Architectural overview of our model.

Figure 5: Schematic diagram of the preclustering method that shows how individual contiguous speech segments are combined to form clusters. Different colors represent different speakers' audio. Contiguous audio segments from the same speaker are combined into a new, bigger segment and form a new cluster. A newly formed cluster centroid is shown as a circle.

Figure 6: Magnitude of the accelerometer signal (a) and change points with probabilities of that signal (b) due to random movement patterns of a person.

Figure 16: Occupancy count over different phone positions for iterative speaker counting (I-SC).

Figure 20: Accuracy versus the number of people for I-SC.

Figure 27: Location estimation error in different rooms with different occupancy sizes.

Figure 28: Results of our magnetic crowdsourcing model.
Input: Accelerometer sensor data, data; total number of data points = n
Output: Binary speaker count
for (t from 1 …

Table 2: Comparison of average error count between Crowd++ and our model.