Applying Data Mining Techniques to Identify Suitable Activities

Identifying suitable physical activities is crucial for personal health management. However, a major challenge is that the factors influencing suitability are extremely complex. This study therefore proposes an approach to facilitate the construction of suitable physical activity models. In the approach, association rule mining and clustering techniques are applied to analyze personal activity-physiological information. To demonstrate how the proposed approach can be used to construct the activity models, an experiment that uses mobile devices to collect personal activity-physiological information was designed. The revealed models can be used not only to understand personal health conditions but also to provide useful information about proper and improper physical activities.


Introduction
Providing high-quality healthcare with limited medical resources is a crucial concern for medical institutes [1][2][3]. Many medical institutes integrate user records and medical information to provide health management decision support [4,5]. Many studies use wireless and sensor network technology for remote health monitoring [6,7]. The collected physiological data are submitted to a medical center for medical diagnosis [8,9]. Bauer et al. used a multilayer network to collect and analyze physiological data to provide useful information [10]. Although the collected data can be evaluated by medical personnel, providing useful information to each individual still requires significant effort [11]. Tremblay et al. proposed the architecture of a health agent that analyzes the collected physiological data to facilitate personal health management [12]. Sim et al. proposed an evidence-based approach in which expert judgment is applied to data collected from user action and medical evidence to facilitate decision-making pertaining to personal activities [13]. The challenge in the evidence-based approach is that the data used to construct personal models are collected by users, and the characteristics of the same data may differ when the sources (users) differ. To overcome this problem, Li et al. proposed a personal activity analysis process, which includes an interviewing and investigation stage to collect and verify data, an information integration stage to analyze the data, and a reflection stage to make decisions on personal health management [14]. However, the analysis process requires considerable effort to support data collection, validation, and analysis.
This study proposes an approach to the construction of activity models; in the approach, a data mining technique is applied to personal physiological data to construct personal activity models (PAMs). The proposed approach is based on modeling techniques that measure the suitability of activities and construct activity models [15,16]. Many modeling techniques can be used to measure the attributes of activities. For example, regression analysis is a simple technique that can be used to obtain the relationship (linear or exponential) among attributes. The constructed models can be used to predict subsequent values of certain attributes [17]. Multivariate analysis is another technique used to analyze the relationship among multiple variables; for example, the analysis of variance (ANOVA) can be used to test for differences among the means of variables [18].
Data mining techniques are commonly used to construct prediction models; for example, the classification technique is used to construct a decision tree that can be used to predict subsequent data. Before the modeling process, the collected data are preprocessed according to the mining technique used. Association rule mining is another mining technique that can be used to construct prediction models by using a relational dataset; the modeling process attempts to identify the association rules between the items of the dataset [19,20]. By appropriately setting the confidence and support thresholds, association rule mining can be used to identify many useful rules [21]. Clustering is another mining technique that groups collected data into several clusters according to the data characteristics. These clusters can be used to predict subsequent data [22,23].
The main advantage of the proposed approach is that the obtained PAMs can be used not only to identify suitable activities but also to facilitate medical evaluation by medical personnel. The PAMs can be constructed using real-time data collected from mobile devices. To demonstrate the performance of the proposed approach, the modeling techniques are applied in an experiment pertaining to health management. The results show that the constructed models can be used to select suitable activities. The remainder of this paper is organized as follows: the architecture of the proposed approach is introduced in Section 2, followed by the results and discussion; finally, suggestions for further studies and conclusions pertaining to the proposed approach are presented.

Methods
This study proposes an approach to construct activity models; in the approach, modeling techniques are applied to the collected physiological data. The architecture of the proposed approach is depicted in Figure 1; personal physiological data and activity data are collected using wearable devices based on a predefined schema. The collected data are preprocessed and analyzed to construct the PAMs. The obtained models provide information that can be used to identify proper activities or avoid improper ones. The proposed approach mainly involves data collection (physiological data and activity data) and activity modeling. Physiological data, such as blood pressure, heart rate, and blood glucose, can be treated as retroactive information, and they can be collected automatically using mobile devices. The activity data can be treated as proactive information and represent the activities of the user. In addition to physiological data and activity data, environmental information, such as the environmental temperature and humidity, can also be collected for the modeling process. The collected data are preprocessed according to the predefined schema and are used to construct the PAMs. The obtained models provide information for medical advice and recommendations for medical consultants. The medical recommendations are expressed as rules. The suitable activity identification (SAI) component identifies suitable activities according to the rules of medical advice and the obtained PAMs. The information can be used for selecting suitable activities. Details of the implementation of these components, including data collection, data preprocessing, activity modeling, and SAI, are described in the following subsections.

Modeling Techniques.
To facilitate data collection, a set of attributes should be defined to describe the physiological data and activities. The physiological data may be collected by different types of devices at different timestamps. The collected data are integrated and preprocessed as required by the analysis engine. Table 1 shows integrated data collected at different timestamps; $T = \{t_1, \ldots, t_n\}$ denotes the timestamps, $E = \{e^1, \ldots, e^h\}$ denotes the attributes of environmental data, and $V = \{v^1, \ldots, v^m\}$ denotes the attributes of physiological data. The variables $e_i$ and $v_i$ denote environmental data and physiological data that can be collected automatically at a specified time interval by using wearable mobile devices. The environmental data also contain information on the device location (obtained from the Global Positioning System), which indicates where the user performs activities.
The activity data contain information on the activities performed, while the physiological data contain information on the user's physiological status, which can be treated as an outcome of the performed activities. The performed activities cannot be recorded using a single term, because the characteristics of an activity may differ from person to person. For example, the characteristics of the activity "jogging" differ between two persons, and characteristics such as speed, duration, and direction (uphill or downhill) may affect the outcomes of the activities. For activity modeling, this study used the attributes of movement, such as the speed, direction, and acceleration, to represent the activities performed by users. The information can be collected using the sensors of wearable mobile devices, such as the G-sensor. In Table 1, $A = \{a^1, \ldots, a^q\}$ denotes the attributes of activity, and $a_i$ denotes the collected activity data.
The data, including the environmental, physiological, and activity data, are collected by wearable devices at different timestamps. An activity can be represented as a sequence of activity data. For example, let $D = \{a_1, \ldots, a_n\}$ denote the collected activity data and let $L = \{l_1, \ldots, l_m\}$ denote the set of activities performed by the user, where $l_j = \{a_i, \ldots, a_k\} \subset D$ denotes an activity contained in $L$. An activity can be represented as a sequence of $\{a_i\}$ (for some $i$). The algorithm used to identify the possible activities is as follows.
(1) Assume that the set of clusters $C^A = \{c^a_1, \ldots, c^a_k\}$ is obtained by applying the clustering technique to the collected activity data $D$.
(2) The elements of $L$ are considered to be represented by clusters of $C^A$; that is, each activity $l_j$ is rewritten as the sequence of clusters to which its data items belong.
(3) Use $l_j$ to identify the frequent itemsets $P = \{p_1, \ldots, p_m\}$, where $p_i \subset l_j$, such that the number of occurrences (support) of the patterns is greater than a preselected threshold.
The modeling process identifies the relationship between $P$ and the physiological information. Each activity is assigned a tag by the user for easy identification. For example, an activity can be expressed as $(\mathrm{tag}_i, t_i)$, where $\mathrm{tag}_i$ denotes the activity tag (such as jogging or climbing) and $t_i$ denotes the timestamp. The steps used to link the tags to the obtained activity patterns are shown in Figure 2; $t_d$ denotes the preselected duration and is less than the duration of possible activities. For a tag $(\mathrm{tag}_i, t_i)$, a sequence of activity clusters $q_i = \{c^a_j, \ldots, c^a_k\}$ within the time interval $[t_i, t_i + t_d]$ is selected. Each pattern $p_j$ that is satisfied is linked to $\mathrm{tag}_i$. A pattern $p_j$ is said to satisfy an activity cluster sequence $q_i$ if all elements of $p_j$ are contained in $q_i$ in order.
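The pattern identification and tag linking described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that patterns are contiguous runs of cluster labels, that support is a raw occurrence count, and that a tag is given by the index at which its time window starts. The function names `frequent_patterns`, `satisfies`, and `link_tags` are hypothetical.

```python
from collections import Counter

def frequent_patterns(cluster_seq, max_len, min_support):
    """Count contiguous runs of cluster labels (length 2..max_len) and
    keep those that occur more than min_support times."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(cluster_seq) - n + 1):
            counts[tuple(cluster_seq[i:i + n])] += 1
    return {p: c for p, c in counts.items() if c > min_support}

def satisfies(pattern, window):
    """True if all elements of pattern occur in window in the same order."""
    it = iter(window)
    return all(label in it for label in pattern)

def link_tags(tags, cluster_seq, patterns, duration):
    """tags: list of (tag, start_index). Link each satisfied pattern to the tag
    whose window [start, start + duration) contains it in order."""
    links = {}
    for tag, start in tags:
        window = cluster_seq[start:start + duration]
        for p in patterns:
            if satisfies(p, window):
                links.setdefault(p, set()).add(tag)
    return links
```

A subsequence test is used in `satisfies` because the text only requires the elements to appear "in order"; a stricter contiguous match would be an equally plausible reading.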
The main purpose of data preprocessing is to apply the clustering technique to the collected data to identify possible activities that can be linked to the physiological data. The identified activities can be used to construct PAMs.

Activity Modeling.
The main purpose of personal activity modeling is to find the relationship between activity and physiological data. The collected physiological data are grouped according to the activity transactions, such as the set of activity clusters within a time interval $t_d$. For example, Figure 3 shows an activity transaction; $\{v_i, \ldots, v_l\}$ denotes the set of physiological data based on the activity transaction and can be treated as representing the personal physiological state in the time interval $[t_i, t_i + t_d]$ (the effects of the performed activities).
For personal activity analysis, the last $k$ physiological data $\{v_{l-k}, \ldots, v_l\}$ are selected to represent the effects of the activity transaction and are summarized by their centroid. The personal activity modeling process can be divided into the following steps.
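As a small illustration of the centroid used to summarize the effects of a transaction, the sketch below averages the last few physiological readings attribute by attribute. The readings and the helper names `centroid` and `effect_centroid` are hypothetical; the default of three readings follows the value used later in the experiment.

```python
def centroid(rows):
    """Attribute-wise mean of a list of equal-length numeric tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

def effect_centroid(physio_rows, k=3):
    """Summarize a transaction's effect by the centroid of its last k readings."""
    return centroid(physio_rows[-k:])

# Hypothetical readings: (systolic pressure, diastolic pressure, heart rate)
readings = [(120, 80, 70), (130, 85, 80), (128, 82, 78), (132, 84, 82)]
effect = effect_centroid(readings, k=3)
```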
In the first step, the clustering technique is applied to the physiological data of all activity transactions to obtain physiological clusters, which indicate the effects of certain activities. Figure 3 shows the physiological clusters $C^V = \{c^v_i\}$ obtained from the collected physiological data. The purpose of this step is to distribute the collected physiological data among several clusters such that all the data contained in a cluster have similar characteristics (attributes).
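The text does not fix a particular clustering algorithm at this point (k-means is named later in the experiment), so the following is only a sketch of one common choice: Lloyd's k-means iteration with caller-supplied initial centers, which keeps the result deterministic. The function name `kmeans` and the sample points are assumptions for illustration.

```python
import math

def kmeans(points, init_centers, iters=20):
    """Plain Lloyd iteration: assign each point to its nearest center, then
    move each center to the mean of its members, until nothing changes."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(iters):
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers
```

In practice the attributes would usually be rescaled before clustering, since blood pressure, heart rate, and body temperature are measured on different scales; that step is omitted here.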
In the second step, the clustering technique is applied to the environmental data to obtain environment clusters. Since the effects of an activity may be influenced by environmental parameters, such as temperature or humidity, the environment clusters can be used to group the activity data. For an activity transaction, a sequence of environmental data, such as $\{e_i, \ldots, e_l\}$, can be selected and represented by a center $\bar{e}_j$. The clustering technique can be applied to these centers $\{\bar{e}_1, \ldots, \bar{e}_n\}$ to obtain environment clusters $C^E = \{c^e_1, \ldots, c^e_m\}$. Therefore, an activity transaction can be represented as $tr_i = (\{c^a_j, \ldots, c^a_k\}, c^e_l; c^v_p)$, where a suitability tag $s \in \{1, 0, -1\}$ can be attached to $c^v_p$. The identified transactions of activities are denoted as $TR = \{tr_1, \ldots, tr_n\}$.
In the third step, the association rule mining technique is applied to the obtained transactions $TR$ to construct PAMs. The PAMs can be expressed as a set of rules, and each rule contains antecedent and consequent items. The following algorithm is used to generate the antecedent items.
(1) Calculate the support (the number of occurrences) for each activity cluster $c^a_i \in C^A$ using the obtained transactions $TR$, and select the items with a support greater than the preselected threshold $\mathrm{supp}_{\min}$ to form the first-level large itemset $I_1$.
(2) Generate the second-level candidate itemset $I_1 \times I_1$, and select the items with a support greater than $\mathrm{supp}_{\min}$ to form the second-level large itemset $I_2$.
(3) Repeat the process until no more large itemsets can be formed. The last ($k$th-level) large itemset is denoted as $I_k$.
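The level-wise procedure above follows the familiar Apriori scheme. A minimal sketch is given below under two assumptions that the text does not state explicitly: an item is an ordered sequence of activity clusters, and a transaction supports an item when it contains the item's clusters in order. The function names are hypothetical.

```python
def contains_in_order(seq, pattern):
    """True if pattern occurs in seq as an ordered (not necessarily contiguous) subsequence."""
    it = iter(seq)
    return all(x in it for x in pattern)

def support(pattern, transactions):
    """Number of transactions that contain the pattern in order."""
    return sum(contains_in_order(t, pattern) for t in transactions)

def large_itemsets(transactions, min_support):
    """Return the large itemsets of each level as a list of lists."""
    singles = sorted({c for t in transactions for c in t})
    first = [(c,) for c in singles if support((c,), transactions) > min_support]
    levels, level = [], list(first)
    while level:
        levels.append(level)
        # Extend every large item of this level by one first-level item.
        candidates = [p + c for p in level for c in first]
        level = [p for p in candidates if support(p, transactions) > min_support]
    return levels
```

The loop terminates because a pattern longer than every transaction can never be supported, so some level is eventually empty.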
The obtained large itemset $I_k$ contains the items, such as $(c^a_i, \ldots, c^a_j)$, that appear frequently in the collected dataset. The PAMs can represent the relationship between activities and physiological clusters. The physiological clusters $C^V$ denote the consequent items, and the algorithm used to construct PAMs is as follows.
(1) Calculate the support for each physiological cluster $c^v_i \in C^V$, and select the items with support greater than $\mathrm{supp}_{\min}$ to form the large itemset $I^v$.
(2) Generate the candidate rules $I_k \times I^v$, and, for each item $r_i \in I_k \times I^v$, count the number of appearances in $TR$. The number of transactions in which the antecedent part appears is denoted as $n_a$, and the number in which both the antecedent and consequent parts appear is denoted as $n_r$. The confidence of the rule $r_i$ is calculated as $n_r / n_a$. A rule is selected if its confidence is greater than the preselected confidence threshold.
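The confidence calculation can be sketched as follows. Each transaction is modeled here as a pair of an activity cluster sequence and a physiological cluster, and an antecedent is assumed to match when its clusters appear in order; both are simplifying assumptions, and the names are hypothetical.

```python
def contains_in_order(seq, pattern):
    """True if pattern occurs in seq as an ordered subsequence."""
    it = iter(seq)
    return all(x in it for x in pattern)

def rule_confidence(antecedent, consequent, transactions):
    """Fraction of antecedent-matching transactions that also have the consequent."""
    n_a = sum(contains_in_order(acts, antecedent) for acts, _ in transactions)
    n_r = sum(
        contains_in_order(acts, antecedent) and phys == consequent
        for acts, phys in transactions
    )
    return n_r / n_a if n_a else 0.0

def select_rules(patterns, physio_clusters, transactions, min_conf):
    """Keep every candidate rule whose confidence exceeds min_conf."""
    rules = []
    for p in patterns:
        for c in physio_clusters:
            conf = rule_confidence(p, c, transactions)
            if conf > min_conf:
                rules.append((p, c, conf))
    return rules
```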
An obtained rule $r_i$ can be expressed as $\{c^a_i, \ldots, c^a_j\} \rightarrow \{c^v_k\}$, indicating that an activity $l_i$ with the pattern $\{c^a_i, \ldots, c^a_j\}$ causes physiological effect $c^v_k$. The environmental information can be used to indicate the environment in which the obtained rules can be applied, and the confidences of these rules are recalculated for each environment cluster $c^e_l$. The confidence of the suitability of the rules can be evaluated using the attached suitability tag, and the confidences of the rules are recalculated for each suitability tag $s \in S = \{1, 0, -1\}$. The rule $r_i$ with an environment cluster and a suitability tag can be expressed as $\{c^a_i, \ldots, c^a_j, [c^e_l]\} \rightarrow \{c^v_k, [s]\}$, and the generated activity rule set is denoted as $R$.

Activity Identification.
The obtained PAMs provide information that can be used to evaluate the suitability of subsequent activities. The evaluation can be performed before or during an activity, as follows. First, when applying the obtained rules to select suitable activities, the environmental data collected at the current timestamp are clustered and used to select rules. For example, in Figure 4, let $\{e_{i-k}, \ldots, e_i\}$ denote the data collected from the current environment and let it be represented by the cluster $c^e_i \in C^E$; the rules with the same environment cluster are selected, that is, those with $c^e_i = c^e_l$. $P = \{p_1, \ldots, p_m\}$ denotes the possible activity patterns.
Second, the obtained rules can be used to evaluate the suitability of the activity. Since the activity data are collected during activities, the collected data can be used to evaluate the suitability of the activity. As shown in Figure 4, let $q_i = \{c^a_{i-k}, \ldots, c^a_i\}$ denote the activity clusters of the data collected in the time interval $[t_{i-k}, t_i]$ while an activity is being performed. Since the activity is being performed by the user, new data $a_i$ are collected continuously. The collected data may contain only the first several activity clusters of certain rules ($r_j \in R$, for some $j$). Therefore, these rules can be selected as candidate rules, and the evaluation results can be derived by considering the selected rules; a rule is a candidate when the collected sequence $q_i$ contains the first several activity clusters of the rule. Let $R_1$ denote the candidate rules that are suitable (suitability tag $s = 1$), and let $R_{-1}$ denote the candidate rules that are unsuitable. Then, $R_1$ and $R_{-1}$ can be used to evaluate the suitability of the activity.
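One way to read the candidate selection is as prefix matching against the most recent observations. The sketch below treats a rule as a candidate when some suffix of the clusters observed so far equals the first several clusters of the rule's antecedent. The decision policy in `evaluate` (any unsuitable candidate wins, otherwise any suitable candidate, otherwise no verdict) is an assumption made for illustration, as are the dictionary layout and the example rules.

```python
def candidate_rules(observed, rules):
    """Rules whose antecedent starts with a suffix of the observed cluster sequence."""
    out = []
    for rule in rules:
        ante = tuple(rule["antecedent"])
        for n in range(min(len(observed), len(ante)), 0, -1):
            if tuple(observed[-n:]) == ante[:n]:
                out.append(rule)
                break
    return out

def evaluate(observed, rules):
    """Return 'suitable', 'unsuitable', or 'unknown' for the activity in progress."""
    tags = {r["suitability"] for r in candidate_rules(observed, rules)}
    if -1 in tags:
        return "unsuitable"
    if 1 in tags:
        return "suitable"
    return "unknown"

# Hypothetical rule set; the suitability values are illustrative only.
rules = [
    {"antecedent": (0, 4, 1, 3), "suitability": 1},
    {"antecedent": (0, 2, 1, 3), "suitability": 0},
    {"antecedent": (3, 3, 4), "suitability": -1},
]
status = evaluate([3, 0], rules)
```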

Results
To demonstrate the construction of the PAMs using the proposed approach, an experiment was designed. Table 2 shows the set of attributes used to collect data; the data sources column denotes the sources from which the data are collected. The environmental attributes consist of the environmental temperature (ETemp) and humidity (Humid). The physiological attributes include the blood pressure (SBP and DBP), heart rate (HB), and body temperature (BTemp). The activity attributes include the x-, y-, and z-axis values (ActX, ActY, and ActZ) of the accelerometer, and ActTag denotes the activity tag.
The activity data are collected (every second) using a mobile device. An activity tag is assigned by the user before performing the activity; for example, going uphill (U), going downhill (D), or moving on a smooth road (F). The physiological data are collected (every 5 s) using an electronic sphygmomanometer, and they are transmitted to a mobile device over a Bluetooth interface. The environmental data, including temperature and humidity, are also collected, and they are transmitted to the mobile device over a Bluetooth interface. The collected data consist of 143 data items collected over five different days. Table 3 shows an example of the collected data based on the attributes shown in Table 2. The collected data are preprocessed and used to construct PAMs.

Suitable Activity Identification.
The activity data, including ActX, ActY, and ActZ, are used to cluster the collected data. Table 4 shows the activity clusters obtained from the collected data shown in Table 2; the number of clusters is set to 5 (denoted by the numbers 0 to 4). ActTag is not used as an attribute for clustering and is intended only to help the user recognize the activity.
The clustered activities are used to form activity transactions with a transaction length of 6. The items of a new transaction are selected from the previous transaction by shifting the window to the right by one item. For example, the first transaction is selected from item 0 to item 5 (0, 2, 0, 2, 4, 3), while the next transaction is selected from item 1 to item 6 (2, 0, 2, 4, 3, 2). An activity tag, denoted as TranTag, is attached to each transaction for recognizing the transaction. TranTag can be determined by the number of activity tags appearing in the transaction. The obtained transactions can be used to name the identified activities.

In addition to the activity tags, a physiological cluster is associated with each transaction to indicate the effects of the activity transaction. The physiological clusters of the transactions are selected as follows. First, for each transaction, the last three physiological data items are selected and used to calculate a centroid. The centroid can be used to represent the effects of the transaction. Second, the obtained centroids are clustered. Table 5 shows an example of physiological clusters obtained by applying $k$-means (the number of clusters is 5) to the physiological centroids of the transactions.

The environment clusters are obtained using the collected environmental data. The centroids of these environmental data are computed and clustered. Examples of obtained environment centroids and clusters are shown in Table 6; the environmental parameters, including temperature and humidity, do not change rapidly and are grouped into two clusters (denoted as 0 and 1). The transaction tags, physiological clusters, and environment clusters are added to the transactions, giving $(\{c^a_i, \ldots, c^a_j\}, \text{TranTag}; c^v_k, [c^e_l])$. The suitability tags $s$ are assigned by medical personnel according to the physiological cluster of the transaction. Table 7 shows an example of obtained transactions, and the total number of obtained transactions is 137.
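The sliding-window construction of transactions can be sketched as below. The text says only that TranTag is determined by the number of activity tags appearing in the transaction; taking the most frequent tag is an assumption made here, and the function name and sample tags are hypothetical.

```python
from collections import Counter

def sliding_transactions(clusters, tags, length=6):
    """Form overlapping transactions by moving a window one item at a time.
    Each transaction is paired with the most frequent activity tag in its window."""
    out = []
    for i in range(len(clusters) - length + 1):
        window = tuple(clusters[i:i + length])
        tran_tag = Counter(tags[i:i + length]).most_common(1)[0][0]
        out.append((window, tran_tag))
    return out

# The first seven cluster labels quoted in the text, with hypothetical tags.
clusters = [0, 2, 0, 2, 4, 3, 2]
tags = ["U", "U", "U", "U", "U", "D", "D"]
transactions = sliding_transactions(clusters, tags, length=6)
```

With these inputs the first two windows are (0, 2, 0, 2, 4, 3) and (2, 0, 2, 4, 3, 2), matching the example in the text.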

Personal Activity Modeling.
The activity transactions with physiological clusters, $\{c^a_i, \ldots, c^a_j\} \rightarrow \{c^v_k\}$, can be used to construct PAMs. Table 8 shows the PAMs obtained using association rule mining; the minimum support is 0.04 and the minimum confidence is 80%. The activity items denote the antecedent items, while the physiological item (phy. item) represents the consequent item of the rule. Support indicates how frequently a pattern appears, while confidence denotes the accuracy of the rule. For example, rule 2 indicates that an activity with the pattern {2, 0, 0} may yield physiological cluster 0 with 7% support and 90% confidence. This pattern can further be extended to {2, 0, 0, 0}, which also yields cluster 0, with 5% support and 100% confidence.

The environment cluster $c^e$ of a transaction indicates the situation in which the activities are performed. Different environment clusters may be associated with different effects. Therefore, the environment clusters should be considered when applying a rule to predict suitable activities. The confidences of the obtained rules with environment clusters are shown in Table 9; no minimum confidence is used to filter the rules. For example, according to rule 2, an activity with the pattern $p_1$ = {2, 0, 0} may yield physiological cluster 0 with confidence 90%, and all transactions with pattern $p_1$ (100%) are contained in environment cluster 0. However, when rule 3 is applied, an activity with the pattern $p_2$ = {2, 0, 1} may yield physiological cluster 2 with confidence 86%, and only 71% of the transactions with pattern $p_2$ are contained in environment cluster 0.
To determine the suitability of activities, the physiological clusters are evaluated by medical personnel and suitability tags are attached to the transactions. In this study, the suitability tags are evaluated according to the obtained physiological centroids and clusters (as shown in Table 10). The suitability tag is 0 for physiological clusters 0 and 1, 1 for physiological clusters 2 and 3, and −1 for physiological cluster 4. An example of activity transactions with suitability tags (without minimum confidence) is shown in Table 11, in which the confidences for suitability tags −1, 0, and 1 are shown in columns 7, 8, and 9, respectively.
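The cluster-to-tag assignment stated above, and the per-tag confidences of a rule, can be sketched as follows. The mapping is taken directly from the text; the transactions, the in-order matching of antecedents, and the function names are illustrative assumptions.

```python
# Suitability tag per physiological cluster, as stated in the text.
SUITABILITY = {0: 0, 1: 0, 2: 1, 3: 1, 4: -1}

def contains_in_order(seq, pattern):
    """True if pattern occurs in seq as an ordered subsequence."""
    it = iter(seq)
    return all(x in it for x in pattern)

def tag_confidences(antecedent, transactions):
    """For transactions matching the antecedent, return the share that falls
    under each suitability tag (-1, 0, 1)."""
    matched = [phys for acts, phys in transactions if contains_in_order(acts, antecedent)]
    total = len(matched)
    return {
        tag: (sum(SUITABILITY[p] == tag for p in matched) / total if total else 0.0)
        for tag in (-1, 0, 1)
    }
```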
The confidences for suitability tags indicate the probability that the corresponding rule yields physiological clusters with that tag. For example, the confidence of suitability tag 0 for rule 2 is 90%, which means that the probability of yielding physiological cluster 0 is approximately 90%. An activity satisfying a rule with a high confidence for the suitable tag can be treated as a suitable activity for the user. Therefore, rules 1, 3, and 6 shown in Table 11 can be used to identify suitable activities. An activity satisfying the patterns of suitable rules can be identified as a suitable activity.
The TranTag associated with each transaction can be used to name the identified activities. Table 12 shows the confidences for different activity tags, including D (downhill), U (uphill), and F (smooth road). The activity tags with high confidences can be used for naming the identified activities, such as F for rules 0 and 5 and D for rule 4. The D and U activities may contain pattern {0, 1, 3}.

Suitable Activity Identification.
The obtained PAMs can be used to identify suitable activities. In this subsection, the obtained PAMs are applied to a separate test dataset to demonstrate how the proposed approach can be used to identify suitable activities. The attributes of the test dataset are displayed in Table 1. The test dataset contains 70 data items collected from different activities, including 28 U activities, 28 D activities, and 14 F activities (the dataset is available at http://mgiga.com.tw/PAM/). The test dataset is divided into 10 transactions, and each transaction (containing 7 data items) denotes an activity.
Each activity can be treated as a sequence of activity items. The performed activity items in the sequence are used to evaluate the activity (which is currently being performed), while the physiological data of the last three activities are used to validate the evaluations (the selection of the physiological data of a transaction is identical to the process presented in Section 4.1). Table 13 shows an example of a test dataset. The suitable activity identification process can be described as follows.
First, the collected activity items are clustered based on the centroids of the activity clusters of the obtained PAMs. Table 14 shows the centroids of the activity clusters obtained in Section 4.2. The norm denotes the magnitude of the centroid of each cluster and indicates the strength of the activities. The rule $r_6$ = {0, 4, 1, 3} shown in Table 11 is drawn as the dashed line in Figure 5, while the solid line depicts the rule $r_8$ = {0, 2, 1, 3}.

Second, the PAMs (shown in Table 12) are applied to the current activity clusters, and they provide information (suitability) on the current activity. For example, when the first activity item {0.19, −1.04, 13.73} is collected and classified into cluster 3, no rule shown in Table 11 can be selected. The second activity item {0.88, −0.58, 7.35} is then collected and classified into cluster 0. The activity transaction is {3, 0}, and rules 0, 1, 5, and 6 are selected; among these rules, no rule is unsuitable, while rules 1 and 6 are suitable. Therefore, the activity being performed may be a suitable activity, and the information provided to the user is "suitable." Data items 3, 4, and 5 are collected and classified into clusters 1, 0, and 3, respectively. The activity transaction is {3, 0, 1, 0, 3}, and only suitable rule 1 is selected. The activity can be evaluated as "suitable." The last two activity items are classified into clusters 3 and 4, and no further rule is selected for this activity. The test transactions are listed in Table 15; the third column denotes the prediction results obtained using the collected activity data. The fourth and fifth columns denote the applied rule and suitability confidence, respectively. Transactions 3 and 6 cannot be predicted because no activity pattern can be found in the obtained rules.
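Classifying a new activity item against the stored cluster centroids amounts to a nearest-centroid lookup, sketched below. The two activity items are those quoted in the text, but the centroid values are hypothetical placeholders, since Table 14 is not reproduced here; they were chosen only so that the example runs. The function names are likewise assumptions.

```python
import math

def norm(vec):
    """Euclidean magnitude of a vector, used as a rough measure of activity strength."""
    return math.sqrt(sum(x * x for x in vec))

def nearest_cluster(item, centroids):
    """Return the label of the centroid closest to the item (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(item, centroids[label]))

# Hypothetical centroids (x-, y-, z-axis acceleration); not the values of Table 14.
centroids = {
    0: (0.9, -0.6, 7.4),
    1: (0.1, -0.2, 9.8),
    3: (0.2, -1.0, 13.7),
    4: (1.5, -2.0, 11.5),
}
first_label = nearest_cluster((0.19, -1.04, 13.73), centroids)
second_label = nearest_cluster((0.88, -0.58, 7.35), centroids)
```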
To validate the prediction results, certain physiological data items were selected and their centroids were classified into the physiological clusters of the PAMs; the centroids of these clusters are shown in Table 16. The physiological clusters of the test transactions are shown in Table 17, which also shows that the environmental data were classified into cluster 0. The suitability tag (column 6) for each transaction was based on the physiological cluster (column 4). The prediction results (column 3) were compared with the suitability tags (column 6), and it was observed that 80% of the activities could be predicted correctly and that all suitable activities, such as transactions 1, 4, and 5, could be identified. The remaining 20% of the activity transactions, that is, transactions 3 and 6, could not be predicted, because none of the obtained rules could be applied to them. However, new transactions can be added to the activity records to update the original models so that the updated models can be used to predict subsequent activity records. The obtained PAMs are used to evaluate the activities performed, and they provide evaluation results to users. The constructed PAMs can be updated using subsequent activity data. Table 18 presents the results of test transactions obtained by applying the proposed approach to a dataset with 400 activity records to construct PAMs and then using the constructed PAMs to evaluate 138 activity records. All suitable activities were predicted correctly by the PAMs.

Conclusion
This study proposes an analysis approach for planning suitable physical activities. The approach applies association rule mining and clustering to a dataset of activity and physiological information to construct personal activity models (PAMs), which can be used to predict suitable physical activities for personal health and leisure management.
A potential limitation of this study is that the prediction models are constructed using historical activity data, and thus subsequent activities with unknown patterns cannot be predicted. Although collecting a large amount of activity data may mitigate this problem, new patterns may still occur in subsequent activities. Another limitation is that medical personnel are required to determine the suitability tags of the obtained physiological clusters before the prediction models can be constructed. Further research should apply knowledge transfer techniques to the obtained rule sets to infer new rules for subsequent activities with new activity patterns. Furthermore, similarity measures should be applied to compare obtained and newly built physiological clusters in order to reduce the effort of the tagging process.

Figure 1: Architecture of the PAM approach.

Figure 2: Linking the tags to the obtained activity patterns.

Figure 3: Clustering the collected physiological data based on timestamps.

Figure 4: Identification of suitable activities and evaluation of the current activity.

Figure 5: The visualization of rules with different activity clusters.

Table 1: Example of integrating collected data.

Table 2: The selected attributes of the experiment.

Table 3: Example of collected data.

Table 4: The activity clusters obtained from collected data.

Table 6: The environment clusters.

Table 7: The activity transactions.

Table 8: Activity models with physiological clusters.

Table 9: Confidence of activity models with environment clusters.

Table 10: The activity transactions with suitability tags.

Table 11: Confidence of activity models with suitability tags.

Table 12: The activity models with activity tags.

Table 13: Example of collected testing data.

Table 14: Centroids of the activity clusters of activity models.

Table 16: Centroids of the physiological clusters of activity models.

Table 17: The predicting results obtained by using 10 test transactions.

Table 18: The predicting results obtained by using 20 test transactions.