Identification of Partitions in a Homogeneous Activity Group Using Mobile Devices

People in public areas often appear in groups. People with homogeneous coarse-grained activities may be further divided into subgroups depending on more fine-grained behavioral differences. Automatically identifying these subgroups can benefit a variety of applications for group members. In this work, we focus on identifying such subgroups in a homogeneous activity group (i.e., a group of people who perform the same coarse-grained activity at the same time). We present a generic framework using sensors built into commodity mobile devices. Specifically, we propose a two-stage process: sensing modality selection given a coarse-grained activity, followed by multimodal clustering to identify subgroups. We develop one early fusion and one late fusion multimodal clustering algorithm. We evaluate our approaches using multiple datasets; two of them involve the same activity while the third involves a different activity. The evaluation results show that the proposed multimodal-based approaches outperform existing work that uses only a single sensing modality, and they also work in scenarios where manually selecting one sensing modality fails.


Introduction
People often appear in groups and participate in various activities in public areas. People with homogeneous coarse-grained activities may be further divided into subgroups based on more fine-grained behavioral differences. For instance, in emergency response situations such as fire evacuation, people have the same coarse-grained activity, that is, walking or running towards emergency exits. However, people may be heading for different exits at different speeds, and people who are moving together can be considered a subgroup. By monitoring these subgroups, the emergency control center can better guide people by directing each subgroup's route. Therefore, partitioning a group with the same coarse-grained activity into subgroups based on specific activity differences is very important. Similarly, tourists walk around in a park, and walking is the shared coarse-grained activity. Different walking flocks can be distinguished by the mobility patterns of the tourists; that is, people in the same subgroup should have similar direction and speed. A tour guide can easily manage the tourist group based on the walking flocks and send customized messages to different subgroups which are heading to different attractions. Another example is people watching a game. Different subsets of the audience cheer for different teams, and the subgroups can be distinguished by the specific actions they perform; that is, people supporting the same team typically perform certain gestures, such as waving hands, during the same time period when the team is performing well. Fans of the same team can be easily identified and recommended to each other as friends to share information for future games. Partitioning groups with the same coarse-grained activity into subgroups based on specific activity differences is exactly the focus of this work.
Much work has been done on group detection and activity recognition using mobile devices, but the problem at hand has not been fully addressed by existing work, as detailed in Section 2. We have been inspired by the divergence-based affiliation detection (DBAD) approach [1], which provides a framework to identify group affiliation given a sensing modality to be used for an activity. Different from the group activity recognition problem, which typically first recognizes each user's activity and then analyzes their cooperative or collaborative relationship in a group [2], the group affiliation detection problem is about identifying which users have similar behavior instead of identifying their specific activities. However, one limitation of DBAD is that only one sensing modality can be used at a time to distinguish multiple subgroups, so it cannot accurately partition the groups when behavioral differences can be observed only through multiple sensing modalities. Another limitation of DBAD is that the sensing modality has to be explicitly provided to the framework, which is not practical in many cases since it is not clear which sensing modality works best. In this work, we focus on building a generic framework that fuses multimodal sensors to identify subgroups in a homogeneous activity group. In other words, the coarse-grained activity shared by all the people is provided to the framework as prior knowledge; the framework then divides these people into subgroups based on multiple sensing modalities that are automatically determined for the given coarse-grained activity. This is also different from the group detection problem studied by some existing work [3][4][5][6], as detailed in Section 2, which fuses some manually selected sensor features to group comoving people or devices.
Fine-grained partition of groups raises several interesting challenges.
Sensing Modality Selection. Existing work has shown that sensors on the users' mobile devices produce similar signals when the users have the same fine-grained activity [7]; therefore, group affiliation can be detected by monitoring the sensor signals of the mobile devices. However, with multiple sensing modalities available, it is not clear which sensing modalities can best capture users' activity similarity. It is even harder for a generic approach since it needs to detect group affiliation under any activity. We address this issue in Section 3.

Inconsistent Window Size among Multiple Sensing Modalities.
To reduce the cost (in particular the energy consumption) of collecting and exchanging data to measure similarity between users, it is necessary to summarize the sensor data time series into aggregate sensor features. We choose to use the probability density function (PDF) as the aggregate sensor feature [1]. The length of the sensor data time series used for summarization significantly impacts similarity measurement, so we need to determine the measurement time window for each sensing modality and deal with the different time window sizes when combining the measurements of multiple sensing modalities. We address this issue in both the training phase (Section 3.3) and the testing phase (Section 4.1).

Multimodal Clustering.
Identifying groups based on the similarity measurements of multiple sensing modalities is nontrivial. Usually, we can apply clustering algorithms on the similarity graph of all users. However, since most sensing modalities are independent of each other, we cannot arbitrarily weigh each sensing modality to combine their similarity measurements into a single value. We address this issue in Sections 4.2 and 4.3.
The main contribution of this paper is that we propose approaches to address these challenges in a generic framework using two phases: phase I is sensing modality selection and phase II is multimodal clustering for group identification. The overall process is presented in Figure 1. We evaluate our approaches using both the dataset provided in DBAD and two datasets we collected. The evaluation results show that our multimodal-based approach outperforms the DBAD approach, which uses only one sensing modality, by about 10% in group affiliation accuracy. Even though 10% is not a large margin, a distinguishing feature of our approaches is that they automatically select the right sensing modalities, whereas the best sensing modality has to be explicitly provided to DBAD, which significantly limits its practicality. Further, our approaches work effectively for various activities.

Related Work
Group affiliation detection and group identification have been studied using sensor-equipped mobile devices such as smartphones. There exist several ways to identify groups, for instance, based on interactions [8], proximity [9], mobility [3][4][5][6], and activity [1,7]. Most of the existing work relies on mobility for group detection, in which individuals who have similar trajectories are considered to be in the same group. For example, GruMon [4] determines a group of individuals in a specific location who are traveling together in a crowded urban environment. The solution fuses location data of different levels of accuracy using Bluetooth or WiFi with additional data such as semantic labels and smartphone sensor data, and the system shows very promising results based on tests using real-world datasets. In this paper, we focus on activity-based group detection, in which individuals who have similar activities are considered to be in the same group. For example, [7] identifies activity groups based on crowd behavior such as queueing, clogging, and group formation. The solution involves individual activity inference, pairwise activity relatedness, and global behavior inference. Different from mobility-based group detection, tracking the location data of each individual over time is no longer a requirement. To be more specific, we define a homogeneous activity group as a group of people who perform the same coarse-grained activity at the same time; it is one type of activity-based group (in general, people in an activity-based group can have the same coarse-grained activity or different coarse-grained activities). We will use the term "activity" to represent a coarse-grained activity in the rest of the paper.
This work of identifying subgroups in a homogeneous activity group is inspired by DBAD [1]. The DBAD approach uses probability density functions (PDFs) to model sensor data. Each mobile device computes the disparity to its neighbors by computing Jeffrey's divergence between the local PDF and each neighbor's PDF. The DBAD approach has several limitations. First, only one sensing modality is used at a time and it has to be selected manually. In particular, to identify people walking in different groups, the magnitude of the accelerometer readings is manually selected to identify groups walking at different speeds, and the azimuth sensing modality obtained from the orientation sensor is manually selected to identify groups with different walking directions. However, using only the azimuth will not work when different groups of people walk in the same direction but at different speeds; using only the magnitude cannot differentiate groups with different directions. Therefore, multimodal sensing is necessary to distinguish different groups without prior knowledge of the grouping details. Second, in the DBAD experiments, wearable mobile devices are attached to the human body at fixed positions to reduce noise in the collected sensor data. This is not practical since people may put their phones in pockets or hold them in hand. It is not clear how DBAD performs when noise is present in the collected data.
In activity recognition, the first stage is often sensing modality selection (i.e., feature construction).There are many existing approaches based on mobile devices [10].In general, either based on some domain knowledge about the physical behavior involved or by making some default assumptions, a fixed set of sensing modalities is manually selected to construct the feature for a specific activity.Further, as discussed in [11], most activity recognition approaches are not generic and they often lead to solutions that are tied to the specific scenarios.Therefore, [11] proposes an algorithm which embeds feature construction into the machine learning process.However, this generic approach only works for the classification and regression problems and cannot be directly applied to the clustering problem we face in this work.

Phase I: Sensing Modality Selection
For different activities, different sets of sensing modalities may represent the most distinguishing features. The sensing modality selection process uses a training set for a given activity. The training set consists of one time series for each sensing modality on each mobile device. Each time series may have a different sampling rate and may need to be summarized in different time windows. To select the sensing modalities which can provide accurate group affiliation detection results, we first define a scoring function as a metric to find the best window size for a sensing modality and then determine whether the sensing modality is qualified for group affiliation detection.
Notations are listed at the end of the paper. The thresholds depend on the activities and sensing modalities. In this work, we determine practical values of these thresholds using our datasets for various activities. Determining the thresholds per activity is left to future work, as detailed in Section 6.

Scoring Function.
We use a probability-based approach to predict the group affiliation detection accuracy of a sensing modality $m_k$.
By summarizing $m_k$ on each mobile device over a time window as a PDF, we can compute Jeffrey's divergence [13] (which measures disparity, the opposite of similarity) between each device pair. Jeffrey's divergence between two probability distributions $\mathrm{PDF}_i$ and $\mathrm{PDF}_j$ is given by

$$D_J(\mathrm{PDF}_i \,\|\, \mathrm{PDF}_j) = \int \bigl(\mathrm{PDF}_i(x) - \mathrm{PDF}_j(x)\bigr) \ln\frac{\mathrm{PDF}_i(x)}{\mathrm{PDF}_j(x)}\, dx. \tag{1}$$

The scoring function $S(m_k)$ in (2) is defined as the conditional probability of any pair of devices among the $n$ devices in the training set being in the same group when Jeffrey's divergence between them for sensing modality $m_k$ is no larger than $\mathrm{TH}_D$:

$$S(m_k) = P\bigl(g_{i,j} = 1 \mid D_J(\mathrm{PDF}_i \,\|\, \mathrm{PDF}_j) \le \mathrm{TH}_D\bigr), \tag{2}$$

where $g_{i,j} = 1$ indicates that devices $i$ and $j$ are affiliated with the same group while $g_{i,j} = -1$ indicates no group affiliation. As discussed in [1], $\mathrm{TH}_D$ highly depends on the sensing modality being used and varies for different activities.
Using Bayes' theorem, (2) can be rewritten as

$$S(m_k) = \frac{P\bigl(D_J(\mathrm{PDF}_i \,\|\, \mathrm{PDF}_j) \le \mathrm{TH}_D \mid g_{i,j} = 1\bigr)\, P(g_{i,j} = 1)}{P\bigl(D_J(\mathrm{PDF}_i \,\|\, \mathrm{PDF}_j) \le \mathrm{TH}_D\bigr)}. \tag{3}$$

The PDF of a sensing modality can be computed using Algorithm 1, assuming the distribution function type is known for the sensing modality. For example, most sensing modalities such as 3D acceleration and 3D rotation rate can be modeled as a Gaussian distribution, and some sensing modalities such as orientation data have circular features and can be modeled as a von Mises distribution [14]. If Gaussian is the distribution function type, the parameters are the mean $\mu$ and the variance $\sigma^2$ of a vector of numerical values in a time series. If von Mises is the distribution function type, the parameters are the circular mean and the circular variance of a vector of angular values in a time series.
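As a minimal sketch of the parameter estimation described above (not the paper's Algorithm 1, whose listing is not reproduced here), the following Python functions summarize one window of samples. The function names are hypothetical, the variance is the population variance, and the circular variance is taken as one minus the mean resultant length, which is one common definition and an assumption on our part.

```python
import math

def gaussian_params(values):
    """Mean and population variance of one window of numeric samples."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    return mean, var

def circular_params(angles_rad):
    """Circular mean and circular variance of one window of angles (radians).

    Assumption: circular variance = 1 - mean resultant length.
    """
    n = len(angles_rad)
    c = sum(math.cos(a) for a in angles_rad) / n
    s = sum(math.sin(a) for a in angles_rad) / n
    mean = math.atan2(s, c)          # direction of the average unit vector
    resultant = math.hypot(c, s)     # length of the average unit vector
    return mean, 1.0 - resultant
```

The circular version matters for azimuth-like data: headings of 350 and 10 degrees average to 0 degrees, whereas the arithmetic mean would give 180 degrees.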
The computational cost of Jeffrey's divergence is related to the number of integration steps when calculating the integral in (1), and the number of integration steps can be determined based on the time series length $l$. Therefore, the time complexity of computing Jeffrey's divergence for a time series with length $l$ is about $O(l)$.
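To make the integration cost concrete, here is a hedged sketch that approximates Jeffrey's divergence, written as the integral of (p(x) - q(x)) ln(p(x)/q(x)), with the midpoint rule. The function names, the integration bounds, and the default step count are illustrative assumptions; the cost grows linearly with the number of steps.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def jeffreys_divergence(p, q, lo, hi, steps=1000):
    """Midpoint-rule estimate of the integral of (p - q) * ln(p / q) on [lo, hi].

    p and q are callables returning density values; points where either
    density underflows to zero are skipped (an assumption of this sketch).
    """
    dx = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        px, qx = p(x), q(x)
        if px > 0.0 and qx > 0.0:
            total += (px - qx) * math.log(px / qx) * dx
    return total
```

For two Gaussians with equal variance, the divergence has the closed form (difference of means)^2 / variance, which gives a convenient sanity check for the numeric estimate.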

Sensing Modality Selection.
The sensing modality selection problem is stated as follows. Given $n$ mobile devices or users in the training set, each with a set of time series $T$ (containing one time series of time-stamped data for each sensing modality under a given activity $a$), and given the scoring function $S$ to predict the group affiliation detection accuracy (i.e., the ratio of group affiliations that can be determined correctly), find the set of sensing modalities and their best window sizes that may result in an accuracy higher than the decision threshold $\mathrm{TH}_S$. Since a probability less than 0.5 means that a group affiliation is more likely to be detected incorrectly than correctly, $\mathrm{TH}_S$ should be larger than 0.5. Further, $\mathrm{TH}_S$ may vary across activities in order to choose the most significant sensing modalities, that is, those with the highest scores. The determination of $\mathrm{TH}_S$ and the most significant sensing modalities will be discussed in Section 5.
Algorithm 2 depicts how to select the candidate sensing modalities with their corresponding best window sizes which lead to a detection probability higher than $\mathrm{TH}_S$. The time complexity depends on the number of sensing modalities (constant), the number of windows (constant), the number of mobile devices $n$, and the complexity of computing Jeffrey's divergence ($O(l)$). Therefore, the overall time complexity of sensing modality selection is $O(nl)$.
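The selection step can be sketched as follows. This is an illustrative reading of the description above, not the paper's Algorithm 2: the score is estimated empirically as the fraction of same-group pairs among the training pairs whose divergence is at most the divergence threshold, and all names and data layouts are hypothetical.

```python
def score(divergences, same_group, th_d):
    """Empirical P(same group | divergence <= th_d) over labelled training pairs.

    divergences: {(i, j): Jeffrey's divergence for that device pair}
    same_group:  {(i, j): True if the pair is in the same group}
    """
    below = [pair for pair, d in divergences.items() if d <= th_d]
    if not below:
        return 0.0
    return sum(same_group[pair] for pair in below) / len(below)

def select_modalities(div_by_modality_window, same_group, th_d, th_s):
    """Return {modality: (best_window, best_score)} for scores above th_s.

    div_by_modality_window: {modality: {window_size: {(i, j): divergence}}}
    th_d: {modality: divergence threshold for that modality}
    """
    selected = {}
    for modality, by_window in div_by_modality_window.items():
        scored = [(w, score(d, same_group, th_d[modality]))
                  for w, d in by_window.items()]
        best_window, best_score = max(scored, key=lambda item: item[1])
        if best_score > th_s:
            selected[modality] = (best_window, best_score)
    return selected
```

Keeping the best score alongside the best window makes it easy to rank the selected modalities afterwards and to identify the single best one.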

Adjusting Window Size.
The sensing modality selection process identifies the best sensing modality and a few secondary sensing modalities. The window size of each candidate sensing modality is compared against that of the best sensing modality. For any candidate sensing modality, if its score recomputed with the window size of the best sensing modality is still not smaller than $\mathrm{TH}_S$, its window size is changed to that of the best sensing modality; otherwise, it keeps its original window size. The rationale is to base the multimodal fusion results mainly on the best sensing modality and to improve them by considering the secondary sensing modalities. The purpose of this window size matching is to reduce the handling of different window sizes during multimodal clustering in phase II.
Algorithm 3 depicts this process of adjusting window size. Similar to Algorithm 2, the time complexity of adjusting window size is $O(nl)$.
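A minimal sketch of the window adjustment rule follows, under the assumption that a hypothetical callback `score_at(modality, window)` recomputes the scoring function on the training set for an arbitrary window size. It is an illustration of the rule described above rather than the paper's Algorithm 3.

```python
def adjust_windows(selected, score_at, th_s):
    """Align secondary modalities to the best modality's window when possible.

    selected: {modality: (best_window, best_score)} from the selection step
    score_at: callable (modality, window) -> recomputed score (hypothetical)
    Returns (best_modality, {modality: window to use in phase II}).
    """
    best_modality = max(selected, key=lambda m: selected[m][1])
    best_window = selected[best_modality][0]
    adjusted = {}
    for modality, (window, _) in selected.items():
        if modality != best_modality and score_at(modality, best_window) >= th_s:
            adjusted[modality] = best_window   # still qualified: match the best
        else:
            adjusted[modality] = window        # keep the original window size
    return best_modality, adjusted
```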

Phase II: Group Identification Using Multimodal Clustering
Once we have determined a set of candidate sensing modalities along with their window sizes, the next step is to use the test set to identify subgroups whose members have high similarity in these sensing modalities within a homogeneous activity group. Unlike the precollected training set, the test set can be recorded in real time, and the sensor data distributions of all mobile devices can be periodically (i.e., according to the window sizes of the sensing modalities) sent to a central server in an infrastructure-based environment or collected by a sink node via data collection protocols in mobile ad hoc networks. Therefore, group identification can also be done in real time in addition to using a precollected test set. The multimodal sensor fusion-based group identification problem is in fact a multimodal clustering problem, which has commonly been treated using early fusion or late fusion [15]. Early fusion combines the sensing modalities in a specific representation before the clustering process, while late fusion first applies the clustering process to each sensing modality separately and then combines the results from each sensing modality. According to the comparison in [16], the advantage of early fusion is that it requires only one learning phase, while the disadvantage is the difficulty of combining multiple sensing modalities in a common representation. Although late fusion avoids this issue, it has other drawbacks, such as a higher learning cost, since every sensing modality requires a separate learning phase, and a potential loss of correlation in multidimensional space. We believe that early fusion may outperform late fusion in certain scenarios, but not in others. Therefore, we investigate and compare two clustering approaches: probability-based clustering for early fusion and majority voting-based clustering for late fusion.
Before we discuss the two clustering algorithms, we need to explain how to deal with different window sizes among different sensing modalities selected.

Dealing with Inconsistent Window Size.
We use the window size of the best sensing modality for group identification, so the best sensing modality delivers one pairwise group affiliation result in each time window of group identification, and the secondary sensing modalities deliver multiple or no results in such a time window. Figure 2 shows an example with time series of three candidate sensing modalities provided by a mobile device, where $T_1$ is the time series for the best sensing modality $m_1$ and the window size $w_1$ of $m_1$ is used as the group identification time window. The window size of each sensing modality is the same on all mobile devices. Therefore, by collecting the information of all sensing modalities on all mobile devices, $m_1$ delivers one pairwise group affiliation result in each of the $w_1$ windows, $m_2$ (corresponding to $T_2$) delivers one or no result, and $m_3$ (corresponding to $T_3$) delivers one or multiple results.
To determine pairwise group affiliation between a pair of mobile devices $i$ and $j$, Jeffrey's divergence is compared against the threshold $\mathrm{TH}_D$: if $D_J(\mathrm{PDF}_i \,\|\, \mathrm{PDF}_j) \le \mathrm{TH}_D$, then use the temporary result $v = 1$ to indicate positive group affiliation; otherwise, use $v = -1$ to indicate no group affiliation. Moreover, since the sensing modality $m_k$ may deliver multiple results or no result in the group identification time window $w_1$, we define the aggregated result delivered by $m_k$ in each $w_1$ window as $r_{m_k} \in \{1, 0, -1\}$, indicating whether the sum of $v$ during the window is positive, zero, or negative. This is because a positive sum implies that positive group affiliation is suggested most of the time, and vice versa. The aggregated result 0 may be caused by no result being delivered in this time window or by multiple results canceling each other out. In this case, the impact of $m_k$ on group identification does not need to be considered. Therefore, sensing modality $m_k$ is taken into account in a group identification time window only when it provides an aggregated result of 1 or −1.
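The thresholding and aggregation rule above can be written in a few lines. The function names are hypothetical; the logic follows the text directly: each divergence yields a temporary result of 1 or −1, and the aggregated result is the sign of their sum, with an empty window giving 0.

```python
def pairwise_vote(divergence, th_d):
    """Temporary result: 1 if the divergence is within the threshold, else -1."""
    return 1 if divergence <= th_d else -1

def aggregate(votes):
    """Sign of the summed temporary results in one group identification window.

    Returns 0 when no result was delivered or when the results cancel out.
    """
    total = sum(votes)
    return (total > 0) - (total < 0)
```

For example, a secondary modality with a shorter window that delivers results 1, 1, and −1 inside one group identification window contributes an aggregated result of 1.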

Early Fusion: Probability-Based Clustering.
We present an early fusion multimodal clustering approach which combines the pairwise group affiliation results delivered by all sensing modalities in each group identification time window into a single result. A common approach for early fusion is to assign weights to each sensing modality. However, it is difficult to determine the appropriate weights, either manually or using a search procedure. Moreover, our sensing modalities deliver pairwise group affiliation results with different accuracies. Intuitively, the best sensing modality should be given the highest weight in the early fusion process. If we assign a percentage as the weight to each of the sensing modalities and then sum them up, the fusion function has no physical meaning and it is even more confusing than using only the best sensing modality.
On the other hand, as discussed in Section 2, using a single sensing modality without prior knowledge of grouping details is insufficient for many scenarios, such as different groups of people walking in the same direction but at different speeds. Therefore, instead of using a single sensing modality or arbitrarily assigning weights to different sensing modalities, we use the joint probability of correct pairwise group affiliation detection as a fusion method to combine the pairwise group affiliation results delivered by all the selected sensing modalities. In a group identification time window, given a set of sensing modalities $\{m_1, \ldots, m_K\}$, each delivers a pairwise group affiliation result $r_{m_k} \in \{1, -1\}$, where $k \in \{1, \ldots, K\}$. The probability of correct pairwise group affiliation detection (i.e., the fusion function) is calculated using Bayes' theorem as

$$P(g_{i,j} = v \mid r_{m_1}, \ldots, r_{m_K}) = \frac{P(r_{m_1}, \ldots, r_{m_K} \mid g_{i,j} = v)\, P(g_{i,j} = v)}{P(r_{m_1}, \ldots, r_{m_K})}. \tag{4}$$

Further, we assume that each sensing modality delivers its pairwise group affiliation result independently, so we can rewrite (4) as

$$P(g_{i,j} = v \mid r_{m_1}, \ldots, r_{m_K}) = \frac{P(g_{i,j} = v) \prod_{k=1}^{K} P(r_{m_k} \mid g_{i,j} = v)}{\sum_{v' \in \{1, -1\}} P(g_{i,j} = v') \prod_{k=1}^{K} P(r_{m_k} \mid g_{i,j} = v')}, \tag{5}$$

where the probabilities $P(r_{m_k} \mid g_{i,j} = v)$ and $P(g_{i,j} = v)$ are computed from the training set in the same way as the calculations in Section 3.1. These precomputed probability values can be directly applied in the clustering algorithm, in which the test set is used for group identification.
Using the test set, we can compute the pairwise group affiliation probabilities $P(g_{i,j} = 1 \mid r_{m_1}, \ldots, r_{m_K})$ in each group identification time window. We use a probability threshold $\mathrm{TH}_P$ to convert the pairwise group affiliation probabilities into a binary matrix $V$ of the fused pairwise group affiliation results. The value corresponding to the mobile devices $i$ and $j$ in the matrix $V$ is denoted as $v_{i,j} \in \{1, -1\}$. If $P(g_{i,j} = 1 \mid r_{m_1}, \ldots, r_{m_K}) \ge \mathrm{TH}_P$, then $v_{i,j} = 1$; otherwise $v_{i,j} = -1$. $\mathrm{TH}_P$ may also vary for different activities, and its determination will be discussed in Section 5.

Algorithm 4
Input: test set of time series $T_1, \ldots, T_n$ on $n$ mobile devices under activity $a$, $K$ selected sensing modalities in each set of time series, probability threshold $\mathrm{TH}_P$
Output: device groups in each group identification time window
(1) Each mobile device uses its local time series to compute the PDFs for each selected sensing modality according to its window size;
(2) The server or sink node collects the PDFs from all the $n$ mobile devices once in each group identification time window and runs the following process:
(3) Initialize group affiliation matrix $V$;
(4) for each device pair $(i, j)$ do
(5)   $R \leftarrow \emptyset$;
(6)   …
Based on the group affiliation matrix, we can use existing clustering algorithms in one-dimensional space. We apply the density joint clustering algorithm (DJ-Cluster) [17], which is used by existing work on pedestrian flock detection [3], to cluster the mobile devices into different groups.
The process of the probability-based clustering approach is given in Algorithm 4. Note that a sensing modality $m_k$ is taken into account in computing the fused pairwise group affiliation result only when it provides the result $r_{m_k} \neq 0$. The time complexity depends on the number of device pairs ($O(n^2)$), the number of selected sensing modalities (constant), the computation of $r_{m_k}$ (whose complexity is the same as computing Jeffrey's divergence, i.e., $O(l)$), and the DJ-Cluster algorithm ($O(n^2)$). Therefore, the overall time complexity of the probability-based clustering algorithm is $O(n^2 l)$.
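The fusion step for a single device pair can be sketched as below. This is an illustrative reading of the independence-based fusion and the thresholding described above, not a reproduction of Algorithm 4; the dictionary layouts and function names are assumptions, and the clustering step (DJ-Cluster) is deliberately left out.

```python
def fused_probability(results, likelihood, prior):
    """P(same group | results of all modalities), assuming independent modalities.

    results:    {modality: aggregated result in {1, 0, -1}}; zeros are ignored
    likelihood: {modality: {(r, g): P(r | g)}}, estimated on the training set
    prior:      {g: P(g)} for g in {1, -1}
    """
    joint = {}
    for g in (1, -1):
        p = prior[g]
        for modality, r in results.items():
            if r != 0:                       # modality has no say in this window
                p *= likelihood[modality][(r, g)]
        joint[g] = p
    total = joint[1] + joint[-1]
    return joint[1] / total if total > 0 else prior[1]

def fuse(results, likelihood, prior, th_p):
    """Fused pairwise result: 1 if the probability reaches th_p, else -1."""
    return 1 if fused_probability(results, likelihood, prior) >= th_p else -1
```

Normalizing over both hypotheses avoids having to estimate the joint probability of the results directly; only the per-modality conditional probabilities and the prior are needed.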

Late Fusion: Majority Voting-Based Clustering.
We present a late fusion multimodal clustering approach which combines the clusters generated by each sensing modality in each group identification time window. We first use the DJ-Cluster algorithm to generate the clusters for each sensing modality separately. Similar to Algorithm 4, a sensing modality $m_k$ is taken into account in the final cluster determination for two mobile devices only when it provides the result $r_{m_k} \neq 0$. We modify the majority voting approach used in [3], where the fusion calculates the summed weight of the sensing modalities in which a pair of mobile devices is clustered into the same group. The two mobile devices are added as a cluster in the majority solution if the summed weight is larger than 50%. If one of them is already inside a solution cluster, the other one joins the same cluster instead of forming a new cluster. However, [3] simply assigns a weight of 50% to the feature which may give the best accuracy and then divides the remaining 50% among the other features. It does not search for the best weight assignment or train these weights automatically. Therefore, weight assignment remains a problem in this late fusion multimodal clustering approach. Since we already have a sensing modality selection process before the clustering process, as long as the sensing modalities are well selected, all the selected sensing modalities should play important roles in group identification. Therefore, we apply the same weight to all selected sensing modalities.
Algorithm 5 gives the process of the majority voting-based clustering approach. Similar to Algorithm 4, the time complexity of separate clustering for all the selected sensing modalities is $O(n^2 l)$. Further, the time complexity of applying majority voting on all device pairs is $O(n^2)$. Therefore, the overall time complexity of the majority voting-based clustering algorithm is $O(n^2 l)$, which is the same as that of the probability-based clustering algorithm.
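The equal-weight voting step might look like the following sketch. It is not the paper's Algorithm 5: the per-modality clusterings are assumed to be given already as pairwise "clustered together" flags, `None` marks a modality with no usable result for a pair, and merging two existing clusters when both devices already belong to different clusters is an assumption that the text does not spell out.

```python
from itertools import combinations

def majority_vote(devices, together):
    """Merge per-modality clusterings by equal-weight majority voting.

    together: {modality: {(i, j): True / False / None}} with i before j in
              `devices`; None (or a missing pair) means the modality abstains.
    Returns a list of sets, one per final cluster (singletons included).
    """
    cluster_of = {}                                   # device -> shared set object
    for i, j in combinations(devices, 2):
        votes = [t[(i, j)] for t in together.values()
                 if t.get((i, j)) is not None]
        if not votes or sum(votes) / len(votes) <= 0.5:
            continue                                  # no strict majority
        ci, cj = cluster_of.get(i), cluster_of.get(j)
        if ci is None and cj is None:
            ci = {i, j}                               # start a new cluster
        elif ci is None:
            ci = cj
            ci.add(i)                                 # i joins j's cluster
        elif cj is None:
            ci.add(j)                                 # j joins i's cluster
        elif ci is not cj:
            ci |= cj                                  # assumption: merge both
        for d in ci:
            cluster_of[d] = ci
    clusters = []
    for d in devices:
        c = cluster_of.get(d, {d})
        if c not in clusters:
            clusters.append(c)
    return clusters
```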

Performance Metrics.
Since the DBAD approach only detects pairwise group affiliation, its evaluation only considers the accuracy of pairwise group affiliation detection results. In contrast, our final results are the identified groups; therefore, we use two performance metrics, pairwise group affiliation accuracy and group membership similarity, to evaluate the intermediate and the final results, respectively. For group identification, since the groups are preconfigured and unchanged during an experiment, we determine the final groups when the grouping results are stable; that is, the groups remain unchanged for at least five group identification time windows. The group membership similarity is calculated as the average Jaccard similarity [18] between an identified group and the corresponding actual group. The pairwise group affiliation accuracy is calculated as the ratio of the correctly determined group relationships to the total number of pairwise group relationships when the final groups are identified.
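The two metrics can be sketched as follows. The text does not say how an identified group is matched to its "corresponding" actual group, so this sketch assumes the best-matching identified group is used for each actual group; the function names are hypothetical, and both groupings are assumed to cover the same set of devices.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two device sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def group_membership_similarity(identified, actual):
    """Average Jaccard similarity, matching each actual group to its best
    identified group (the matching rule is an assumption of this sketch)."""
    return sum(max(jaccard(g, h) for h in identified) for g in actual) / len(actual)

def pairwise_accuracy(identified, actual):
    """Fraction of device pairs whose same-group / different-group relation
    agrees between the identified and the actual grouping."""
    def labels(groups):
        return {d: k for k, g in enumerate(groups) for d in g}
    li, la = labels(identified), labels(actual)
    pairs = list(combinations(sorted(la), 2))
    correct = sum((li[i] == li[j]) == (la[i] == la[j]) for i, j in pairs)
    return correct / len(pairs)
```

A single misplaced device can lower the group membership similarity of two groups at once, which illustrates why this metric can fall below the pairwise accuracy.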

Datasets.
In the performance evaluation, we first use the dataset provided in DBAD [1], where the activity is people walking together. The DBAD dataset contains the sensor data obtained from 10 homogeneous Android devices, each attached to the hip of a person. The experiments are conducted with different group configurations (from 1 to 10 groups), and each experiment lasts 51 minutes. The sampling rate is about 25 Hz for each sensor. To compute the activity similarity for people walking together, we consider the following sensing modalities available in the dataset: $x$-acceleration, $y$-acceleration, $z$-acceleration, and magnitude (obtained from the 3D accelerometer); azimuth, pitch, and roll (obtained from the orientation sensor). The magnitude is the square root of the sum of squares of the 3D accelerations, and the DBAD evaluation uses it instead of the 3D acceleration measurements. There are two limitations of the DBAD dataset, as discussed in Section 2: one is that wearable mobile devices are attached to the human body at fixed positions in order to reduce noise in the collected sensor data; the other is that only one activity (i.e., people walking together) is involved. Therefore, we also collect our own datasets: one for the park scenario and one for the game scenario, as discussed in Section 1.
The park scenario has the same activity as the DBAD dataset and uses the same sampling rate, but with less controlled phone positions to allow for noisier data and with more sensing modalities to allow for consideration of multiple modalities. Since the DBAD dataset only contains accelerometer and orientation sensor data, we collect our own dataset with more motion sensors on smartphones for the same activity, in which people walk together. It contains the sensor data obtained from 8 heterogeneous smartphones (e.g., Nexus and Samsung Galaxy phones) held in hand by people walking in 3 groups for about 10 minutes. These groups have different walking directions and slightly different walking speeds. The sensors recorded are the 3D accelerometer, the 3D gyroscope, and the orientation sensor. We consider the following sensing modalities: $x$-acceleration, $y$-acceleration, and $z$-acceleration (obtained from the 3D accelerometer); $x$-rotation, $y$-rotation, and $z$-rotation (obtained from the 3D gyroscope); azimuth, pitch, and roll (obtained from the orientation sensor).
The game scenario has a different activity from the DBAD dataset (i.e., audience members waving hands for different teams), and it is used to demonstrate that our approaches are general and can handle different activities. The sampling rate is the same as in the DBAD dataset. This dataset contains the sensor data obtained from 8 heterogeneous smartphones for about 10 minutes. Each group waves their smartphones in different time periods, mimicking an audience cheering for the two competing teams in a game. The sensors recorded are the same as in the park scenario dataset.
For each dataset, we divide it into two parts: the first half as the training set for sensing modality selection and the second half as the test set for identification of subgroups within a homogeneous activity group. We implement our algorithms in Python and run Algorithms 2 and 3 on the training set and Algorithms 4 and 5 on the test set.

Results Using the DBAD Dataset.
In the training set, we set the minimum and maximum window sizes to 5 seconds and 50 seconds, respectively. The minimum window size is set according to the sampling rate of 25 Hz, so that we have more than 100 samples within each window to compute the PDF. The maximum window size cannot be too large (it should stay within a minute); otherwise it takes too long to make the grouping decision. Table 1 shows the results for each sensing modality, where the best score is the scoring function with the best window size for that sensing modality and the new score is the scoring function recalculated using the best sensing modality's best window size. As discussed in Section 3.2, the decision threshold $\mathrm{TH}_S$ should be larger than 0.5. Here we set $\mathrm{TH}_S = 0.55$; then the azimuth (window size 5 s), $x$-acceleration (window size 15 s), $y$-acceleration (window size 15 s), $z$-acceleration (window size 15 s), and magnitude (window size 15 s) are selected. Since magnitude is redundant with the 3D acceleration and yields a very similar score, we use the 3D acceleration sensing modalities in Algorithms 4 and 5 instead of magnitude. We next use the test set to evaluate Algorithms 4 and 5.
First, we consider the probability threshold TH in Algorithm 4. Similar to the decision threshold, it should also be larger than 0.5. Therefore, we vary it from 0.55 to 0.95. Figure 4(a) shows that the group membership similarity is slightly smaller than the pairwise group affiliation accuracy. This is because there exist some critical links in the graph-based clustering algorithms. If a critical link is assigned an incorrect group affiliation result, it significantly impacts the group identification results. In general, the pairwise group affiliation accuracy increases as the probability threshold increases. Using the DBAD dataset, a probability threshold of 0.85 leads to both the highest pairwise group affiliation accuracy and the highest group membership similarity. Next, we compare the results of the probability-based clustering algorithm at this threshold (0.85) with the results of using the DJ-Cluster algorithm on each single sensing modality as well as using the majority voting-based clustering algorithm on all sensing modalities.
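To make the threshold sweep concrete, the sketch below computes a pairwise accuracy for several threshold values. It assumes that the pairwise group affiliation accuracy is the fraction of device pairs whose thresholded same-group decision matches the ground truth; the function name, the matrix `prob_same`, and the labels are synthetic and purely illustrative.

```python
import numpy as np

def pairwise_affiliation_accuracy(prob_same, true_labels, threshold):
    """Fraction of unordered device pairs whose same-group decision
    (probability > threshold) agrees with the ground-truth grouping."""
    prob_same = np.asarray(prob_same, dtype=float)
    labels = np.asarray(true_labels)
    rows, cols = np.triu_indices(len(labels), k=1)  # each pair once
    predicted = prob_same[rows, cols] > threshold
    actual = labels[rows] == labels[cols]
    return float(np.mean(predicted == actual))

# Synthetic example: 4 devices in two groups of two.
true_labels = [0, 0, 1, 1]
prob_same = [
    [1.0, 0.9, 0.6, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.6, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
sweep = {th: pairwise_affiliation_accuracy(prob_same, true_labels, th)
         for th in (0.55, 0.65, 0.75, 0.85, 0.95)}
```

In this toy example a threshold that is too low accepts a spurious cross-group link, while a threshold that is too high rejects genuine links, so the accuracy peaks at an intermediate value.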
Figure 4(b) shows the pairwise group affiliation accuracy and Figure 4(c) shows the group membership similarity. We put the results of different sensing modalities together with the results of different approaches in order to compare not only the approaches with each other but also the multimodal approaches against each individual sensing modality. Note also that, since the majority voting-based clustering algorithm outputs the final clusters based on the clusters computed from each sensing modality, it does not output combined pairwise group affiliation results for all sensing modalities; for the pairwise group affiliation accuracy, we therefore compare only the probability-based approach with each single sensing modality.
In Figure 4(b), the 3D acceleration sensing modalities lead to an accuracy of around 0.6, while the azimuth from the orientation sensor leads to an accuracy of about 0.76. These results are consistent with the findings in the DBAD approach, where the azimuth delivers the best pairwise group affiliation accuracy. Beyond their findings, our sensing modality selection approach automatically selects the azimuth as the most significant sensing modality. Further, the probability-based approach leads to an accuracy of about 0.86, which shows that the multimodal-based approach outperforms the original DBAD approach, which uses a single sensing modality.
In Figure 4(c), the comparisons are similar to those in Figure 4(b). In addition, the probability-based approach outperforms the majority voting-based approach on the DBAD dataset. This is because the sensing modalities other than azimuth do not have high scores, so their contributions in the majority voting-based approach are not significant. However, the majority voting-based approach still provides a higher group membership similarity than using the 3D acceleration or the azimuth separately.
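The late fusion idea can be illustrated with a simplified sketch. It is not Algorithm 5 itself: it assumes that two devices are linked when a strict majority of the per-modality clusterings place them in the same cluster, and that the final subgroups are the connected components of those links. The function name and the example labels are hypothetical.

```python
from itertools import combinations

def majority_vote_clusters(per_modality_labels):
    """Late fusion sketch: link two devices if a strict majority of
    modalities co-cluster them, then return connected components."""
    n_mod = len(per_modality_labels)
    n_dev = len(per_modality_labels[0])
    parent = list(range(n_dev))  # union-find forest over devices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in combinations(range(n_dev), 2):
        votes = sum(labels[a] == labels[b] for labels in per_modality_labels)
        if votes * 2 > n_mod:
            parent[find(a)] = find(b)

    groups = {}
    for dev in range(n_dev):
        groups.setdefault(find(dev), []).append(dev)
    return sorted(groups.values())

# Three hypothetical modalities clustering four devices; the third
# modality misplaces device 1, but the vote recovers both subgroups.
clusters = majority_vote_clusters([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
```

Under this simplification every modality carries an equal vote, so a single informative modality can be outvoted by several weak ones, which is consistent with the behavior discussed for the DBAD dataset.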

Results Using the Park Scenario Dataset
We use the same minimum/maximum window sizes as in the DBAD training set. Table 2 shows the results, where the azimuth again leads to the best score, as in Table 1.
We also choose the decision threshold TH = 0.55, so the azimuth (window size 5 s), -acceleration (window size 15 s), and -acceleration (window size 15 s) are the selected sensing modalities. Although -acceleration is not selected here, it does not contribute significantly to the results for the DBAD dataset either. Figure 5(a) shows the results of the probability-based approach when we vary the probability threshold TH. The results again verify that the multimodal-based approaches outperform the original DBAD approach, which works with a single sensing modality. Further, unlike the controlled experiments with homogeneous phones and fixed phone positions in DBAD, our experiments are less controlled and have more uncertainty in the collected sensor data. Despite this, the results using our dataset are still promising (e.g., the group membership similarity for the probability-based approach is still above 0.8), indicating that our approaches can inherently deal with sensor data noise. This is because sensing modalities are selected in the presence of data noise.
Moreover, the results using the park scenario dataset are consistent with those using the DBAD dataset because the same activity is involved. This indicates that the same training set for the same activity may be used to test both datasets if the training set is well collected and the parameters involved in the algorithms are well studied.

Results Using the Game Scenario Dataset
Table 3 shows the results of sensing modality selection. Unlike in Tables 1 and 2, the 3D rotations lead to the highest scores. The 3D accelerations may still work, but the azimuth is not meaningful for this activity. This implies that the DBAD approach of manually selecting one single sensing modality will not work in such a scenario.
We can still choose the decision threshold TH = 0.55, so the x-acceleration, y-acceleration, z-acceleration, x-rotation, y-rotation, and z-rotation are selected. Figure 6(a) shows the results of the probability-based approach. Similar to the findings in both the DBAD test set and the park scenario test set, we can choose a probability threshold of 0.95 for the probability-based approach to compare with using each single sensing modality as well as the majority voting-based approach.
Figure 6(b) shows that the -rotation leads to a higher accuracy than every other sensing modality, and the probability-based approach leads to even higher accuracy than using only -rotation. Figure 6(c) shows a trend consistent with Figure 6(b). However, different from both Figures 4(c) and 5(c), the majority voting-based approach leads to a slightly higher group membership similarity than the probability-based approach. This is because there are several significant sensing modalities (i.e., x-rotation, y-rotation, and z-rotation) which contribute accurate results in this activity. In the activity where people walk together, only the azimuth makes a significant contribution to the final results of the multimodal-based approaches; here, in contrast, all the 3D rotations make significant contributions, and therefore majority voting is more effective.
In summary, the activity significantly impacts the sensing modality selection as well as the group identification results. This verifies our hypothesis in Section 3 that a selection process is needed to automatically select sensing modalities for different activities. In addition, the comparison of the probability-based approach and the majority voting-based approach verifies our hypothesis in Section 4 that early fusion multimodal clustering may outperform late fusion in some activities, but not always. All things considered, the approaches proposed in this work (i.e., Algorithms 2, 3, 4, and 5) are effective for various activities.

Conclusion
In this paper, we have presented a generic framework to identify subgroups in a homogeneous activity group using sensor-equipped mobile devices. We have first proposed a sensing modality selection approach given a coarse-grained activity. We have then provided an approach to deal with multiple window sizes among all the selected sensing modalities. By setting the group identification window size to that of the best sensing modality, we have further developed two multimodal clustering approaches: a probability-based approach for early fusion and a majority voting-based approach for late fusion. Finally, we have evaluated our approaches using a publicly available dataset and two others collected by ourselves. The evaluation results have shown that our framework of multimodal approaches outperforms the original DBAD approach, which works on a single sensing modality, and that the framework is effective for various activities.
Several improvements are considered for future work. First, in this framework, the activity is considered as an input to the algorithms. Although we have not yet studied the sensing modality selection training per activity, our evaluation results on different datasets with the same activity tend to be very similar, indicating that it is possible to use the same training set for an activity and test on different datasets for that activity. Second, in this work, we assume that the sensor data distributions of all mobile devices are periodically sent to a central server in an infrastructure-based environment or collected by a sink node via data collection protocols in mobile ad hoc networks. Therefore, the central server or the sink node has the complete information in the network to calculate pairwise similarities and apply clustering algorithms on the group affiliation matrix based on the pairwise similarities. In our future work, we will further consider a pure peer-to-peer environment where neighboring mobile devices exchange their sensor data distributions. Since some pairwise similarities between multihop neighbors may not be computed due to the limited number of hops of data exchange, the clustering algorithms need to be revised accordingly to work with a local partial group affiliation matrix on each mobile device. Last, we will apply Jeffrey's divergence directly to multiple sensing modalities when a practical mathematical method is available.
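For reference, Jeffrey's divergence between two discrete distributions P and Q is the symmetrized Kullback-Leibler divergence, J(P, Q) = sum_i (p_i - q_i) ln(p_i / q_i). The sketch below evaluates it for a single sensing modality, assuming the per-window PDFs are available as histograms over a shared set of bins; the function name and the smoothing constant `eps` are illustrative choices rather than details of our implementation.

```python
import numpy as np

def jeffreys_divergence(p, q, eps=1e-12):
    """Jeffrey's divergence J(P, Q) = KL(P||Q) + KL(Q||P) for two
    histograms over the same bins; eps avoids log(0) on empty bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()  # renormalize after smoothing
    q = q / q.sum()
    return float(np.sum((p - q) * np.log(p / q)))

# Identical distributions give 0; dissimilar ones give a larger value.
same = jeffreys_divergence([0.5, 0.5], [0.5, 0.5])
different = jeffreys_divergence([0.9, 0.1], [0.1, 0.9])
```

The measure is symmetric in its two arguments, so the resulting pairwise similarity between two devices does not depend on the order in which they are compared.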

Figure 2: Example time series with different window sizes.

Table 3: Sensing modality selection using game scenario dataset.