Empirical Validation of Objective Functions in Feature Selection Based on Acceleration Motion Segmentation Data

A recent shift in evaluation criteria, from accuracy alone to a trade-off between accuracy and time delay, has inspired multivariate energy-based approaches to motion segmentation using acceleration. The essence of multivariate approaches lies in the construction of high-dimensional energy, which requires feature subset selection from machine learning. Filter methods are preferred because they are fast; however, their poorer estimates are one of the main concerns. This paper aims at the empirical validation of three objective functions for filter approaches, namely the Fisher discriminant ratio (FDR), multiple correlation (MC), and mutual information (MI), through two successive experiments. With respect to the 63 possible subsets of 6 variables for acceleration motion segmentation, the three functions and a theoretical measure are compared with two wrappers, k-nearest neighbor and Bayes classifiers, in terms of general statistics and strongly relevant variable identification by social network analysis. Then four kinds of newly proposed multivariate energy are compared with a conventional univariate approach in terms of accuracy and time delay. Finally, it appears that MC and MI are acceptable enough to match the estimates of the two wrappers, and that multivariate approaches are justified by our analytic procedures.


Introduction
As a form of human-computer interaction, Inertial Measurement Unit (IMU) applications have been increasing markedly in number [1]. Of the related technological issues, motion segmentation using accelerometers has long been a significant problem [2][3][4][5][6]. Motion segmentation means the discrimination of motion-involved periods and is handled within various domains depending on the detection signal. In IMU applications, which generally depend on accelerometers, the process can be understood as acceleration end point detection in terms of signal processing. Since linear acceleration and angular rates from IMUs are rarely used without integration, motion segmentation is indispensable because it indicates the initial and final points of the integration, or the starting and ending points of the period of interest for processing [4,7,8].
Typical problems in motion segmentation using acceleration concern how accurately both ends can be found, and several constraints have been reported. First, measured acceleration is corrupted by gravitational acceleration, which is intractable to separate from the acceleration caused by body motion [2,8,9]. Since it is also exposed to noise whose source is body motion, such as unintentional trembles or minute motion, the estimated motion segmentation might consequently include teacher noise. Additionally, measured acceleration is concentrated in such low frequency bands (0-20 Hz) that spectral information is sparse. As a result, motion segmentation specialized for acceleration is mainly processed in the time domain [3][4][5][6]9]. Calculating the acceleration energy in the time domain raises another constraint. Sample-wise linear separation between motion and nonmotion states is formidable without modifying a multivalley structure; moreover, the time delay produced by modifying the multivalley structure is proportional to accuracy [3].
Given the proportional relation between accuracy and time delay in conventional approaches, the advent of smart devices brings a new requirement for rapid response time [9,10]. Motion segmentation, previously focused on accuracy alone, therefore requires an appropriate trade-off between accuracy and delay. To achieve maximum accuracy with minimum time delay, the use of multivariate energy together with hyper decision boundaries has been introduced as a promising alternative [9,11]. This approach reduces the time delay by skipping energy smoothing, which is the main cause of the time delay in the previous univariate approach. Instead of an explicit energy smoothing process, a shorter time delay is produced implicitly when multivariate energy vectors are generated. The loss of accuracy resulting from the reduced time delay is compensated by making the motion state decision with a nonlinear hyper decision boundary in high-dimensional space.
Consequently, accuracy depends on the separability between the data distributions of the two states represented by multivariate energy in high-dimensional space, and building optimal multivariate energy requires predicting the discriminability of each data distribution represented by variables or their multidimensional combinations. Because the performance of classifiers implementing a hyper decision boundary may well have a limit, it is important to find and identify variables that have discriminant distributions between the two states in multivariate space. In addition, pattern recognition depends so fundamentally on the statistical regularities represented by data that state separability can be used to show how well data is distributed in high-dimensional space for a given task.

Problem Description
The key cause of the given problem is the multivalley structure that commonly occurs in calculating temporal energy. Figure 1 shows the parsed acceleration signal $a(t)$ from a simple arm motion and its basic energy $|a(t)|$. There, the red dotted line represents the motion period, where nonzero values stand for the motion state. Acceleration from arm motion has a multipeaked structure that is representative of all human arm motion [3]. The energy calculation transforms the multipeaked structure into the multivalley structure at the bottom of Figure 1, which is commonly observed in various energy types [3][4][5][6][9][10][11]. In this structure, the multiple valleys prevent a linear threshold from simply discriminating motion and nonmotion states, and this phenomenon explains why energy smoothing is required. In signal processing terms, smoothing extracts the desired signal by removing multiple peaks and valleys from the original signal; in this case, it fills the valleys to make the difference between the two states clear. The main difference among algorithms lies in the techniques employed to smooth these valleys: low-pass filtering including moving average, axial information integration, inactivated interval setting, extra signal addition, and so forth [2][3][4][5][6]12].
The performance of motion segmentation algorithms is generally evaluated on the basis of accuracy; however, the time delay of algorithms has recently started to be taken into consideration [5,6,10]. A related phenomenon is explained in Figure 2, where energy is calculated by the piecewise moving variance given by Benbasat and Paradiso [3]. In this approach, the length of the sliding window is directly proportional to the size of the time delay. The graph (without time delay) at the bottom of Figure 1 is shown again at the top of Figure 2 for comparison, followed in turn by the smoothed energy with time delays of 70 ms and 150 ms, respectively. It is clearly shown that discriminating the two states with a simple threshold gets easier as the time delay increases. Theoretically, in this situation, accuracy amounts to indicating the exact motion starting and ending points; practically, however, the whole detected motion period is compared with the one given by the target label (red dotted line), and their overlap is measured by the number of successfully detected samples relative to all samples or by similarity measures between the two time series [5,6]. If accuracy is 100%, the annotated and estimated motion periods must coincide. When fluctuation in acceleration is occasionally extreme and energy smoothing is unable to flatten the valleys fully, a motion discontinuity occurs in the estimated motion period, and such a phenomenon needs to be considered a detection failure regardless of accuracy. To avoid this, energy smoothing is reinforced, thereby increasing the time delay.
Time delay in this paper results solely from algorithms, excluding computation and communication. It is determined by the length of past data stored in short-term memory and by the group delay of digital filtering, regardless of hardware enhancements. It is basic in statistical inference to make a decision based on previous data. The capacity to store previous data for processing current data is called short-term memory [13]. In signal processing, sliding windows implement this and generate a time delay proportional to the window length for the time derivative of the signal, the moving average or variance, and the digital filtering often found in algorithms. Group delay is an integrated measurement of the time delay by frequency band when the signal goes through filters. Filtering produces the group delay $-\mathrm{d}\phi/\mathrm{d}\omega$, where $\phi$ and $\omega$ represent phase shift and radian frequency, respectively [14]. The moving average is a special case of low-pass filtering and therefore generates group delay. The moving variance can be interpreted as a moving average with additional operations, since the moving average is used in its calculation. Given that motion segmentation is generally one part of a full interaction, the time delay of motion segmentation should be much less than the optimal delay of 150-200 ms reported by event-related brain potential measurements for a computer response to a user action [15,16].
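The link between window length and delay can be illustrated with a minimal Python sketch of a causal moving variance. This is not the implementation of Benbasat and Paradiso [3]; the function name `moving_variance`, the window parameter `n`, and the toy signal are assumptions made purely for illustration.

```python
from collections import deque


def moving_variance(signal, n):
    """Causal moving variance over the current sample and the n previous ones.

    Hypothetical helper: each output relies on up to n past samples held in
    short-term memory, which is the source of the window-dependent delay.
    """
    window = deque(maxlen=n + 1)
    energy = []
    for sample in signal:
        window.append(sample)
        mean = sum(window) / len(window)
        energy.append(sum((v - mean) ** 2 for v in window) / len(window))
    return energy


# Toy acceleration: rest, a motion burst that crosses zero, then rest again.
accel = [0.0] * 5 + [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0] + [0.0] * 5
raw_energy = [abs(a) for a in accel]         # has valleys (zeros) inside the burst
smooth_energy = moving_variance(accel, n=3)  # stays positive throughout the burst
```

In this toy case the absolute-value energy drops to zero inside the motion burst, whereas the windowed variance does not, at the cost of keeping three past samples.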
The requirement for minimized time delay turns the efficient energy smoothing of previous approaches into an estimate of the probability of the two states in high-dimensional space, by expanding univariate energy to multivariate energy. Borza [11] and Lim et al. [9,17] introduce motion segmentation based on this idea, but the multivariate energy and state decision methods in their approaches differ. While Borza's approach emphasizes axial integration and the difference between only two time sequences given in (2) for generating variables, Lim et al. are interested in various variables and their combinations, including time series of a certain length without axial integration, as shown in Table 1.
The interest of Lim et al. in various candidates naturally raises the question of how to choose the best combination, and feature subset selection from machine learning is consequently employed to build multivariate energy in motion segmentation.
Feature subset selection is the process of identifying and eliminating as much irrelevant and redundant information as possible [13,20,21]. Reducing the dimensionality of the data may allow learning algorithms to operate faster and more effectively; in most cases, final classification accuracy can be improved and the data can be more easily interpreted as a representation of the target concept. Filter and wrapper methods, which differ in how they evaluate feature subset candidates, are generally accepted. Filter methods are the earliest approaches to feature selection within machine learning. They use additional objective functions based on general characteristics of the data to evaluate the merit of feature subsets, whereas wrapper strategies use a learning algorithm to estimate such merit. As a result, filter methods are generally much faster than wrapper methods and, as such, are more practical for use on high-dimensional data. The rationale for wrapper approaches is that the task-dependent induction algorithm should provide a better estimate of accuracy than a separate measure with its own inductive bias. Although wrappers provide better estimates because they are tuned to the specific interaction between an induction algorithm and its training data, they tend to be much slower than filter strategies, because feature selection must be accompanied by a model selection process for the induction algorithm used.
In this study, the filter strategy is scrutinized for several reasons. Our problem is to investigate how to choose relevant variables for multivariate energy construction so as to reduce time delay while guaranteeing superior or equivalent accuracy. For the given task, an evaluation of the estimates of a few objective functions is required. Another underlying goal that can be accomplished during this investigation is the justification of a multivariate approach compared with the previous univariate approach. To achieve this, we put more emphasis on understanding the general characteristics of acceleration data than on a learning algorithm. The comprehension of the data distribution is followed by the design of a hyper decision boundary, which should be independent enough that a wider variety of applications can be expected; however, since wrapper methods are generally accepted to provide better estimates of feature subsets, the reliability and limitations of filter strategies in estimating discriminability need to be compared with those of wrapper strategies.

Experiment
With respect to handwriting acceleration, the univariate energy proposed by Benbasat and Paradiso [3] and the multivariate energy by Lim et al. [9,17] are created, and each separability estimate is measured by filter and wrapper processes. For a rigorous comparison, theoretical errors are calculated as reference data based on the conditional probability density functions of the motion and nonmotion states. A detailed explanation of the experimental conditions is provided below. Figure 3 shows the overall experiments. Throughout the experiments, the following questions are pursued:
(i) Can filter approaches estimate accurately enough to predict discriminability between motion and nonmotion states?
(ii) Can it be justified that multivariate energy guarantees superior time delay and accuracy to univariate energy?
(iii) Can the analysis of the above results offer an understanding of the underlying structure of the data distributions?
3.1. Data. A total of 294 handwriting measurements are collected with a 3D pen embedded with a three-axis accelerometer, MMA 7260Q (Freescale), from 7 subjects (4 male, 3 female), each drawing the numbers from 0 to 9 and four kinds of symbols three times. In data acquisition by the microcontroller Atmega8 (Atmel), with 10-bit quantization and 100 Hz sampling, the two least significant bits are discarded to cancel the noise effect. Samplewise motion state annotations paired with the acceleration profiles, that is, the target values, are recorded by the subjects pushing a button while drawing [22].
The collected data is finally grouped into a training set (98 pieces, 17189 samples), a validation set (98 pieces, 16728 samples), and a test set (98 pieces, 17489 samples). Since the acceleration and its paired target label are considered on a single axis, the acceleration profiles of the three axes are integrated into their samplewise mean.

Energy Generation by Univariate and Multivariate Approaches.
The univariate energy used by Benbasat and Paradiso [3] and the multivariate energy by Lim et al. [9,17] have been chosen for the investigation. In the approach by Benbasat and Paradiso, the energy is calculated by a piecewise moving variance, which combines energy calculation and smoothing. It is an upgraded version of the earlier energy calculation by absolute conversion or squared acceleration and is widely accepted as one of the baseline methods, considering that several variations of it have been created. For the multivariate approaches, the multivariate energy in Lim et al. [9,17] is mainly used for the experiment. Note that every type of energy is abbreviated as in Table 2 for clarity and simplicity hereafter. For feature subset selection and strongly relevant variable identification, the subsets LIM1∼LIM63 are employed in experiment 1, and, after the best subset is selected, it is compared with BENBASAT.n in experiment 2 (Figure 3). While LIM1∼LIM6 are the subsets consisting of a single basic variable each, LIM7∼LIM63 are the subsets composed of combinations of the basic variables.

Theoretical Measure.
To compare filter and wrapper estimates, we need a theoretical reference. We define the likelihood of each state as a conditional probability density function under two assumptions: each is Gaussian distributed, and the variables $V_i$ and $V_j$ are identically distributed for $i \neq j$. The density distribution of each state is shown in Figure 4. Under this condition, the error results from the overlap between the two states, depicted by the dark region in Figure 4, where the threshold $\mathrm{Th}$ is found from the condition leading to (7) and the coefficient of each term in (7) is named individually. Since $V$ is multivariate, for example, $V = \{V_1, V_2\}$, (5) is rewritten accordingly. Therefore, the approximate error, which is estimated from the likelihood of both states, amounts to the sum of the number of samples that belong to the motion state, depicted by the bright grey area, and the number of samples that belong to the nonmotion state, depicted by the dark grey area in Figure 5. The actual boundary lies on the line orthogonal to the line connecting the mean vectors of the two states, because the state membership of each sample is determined by the Mahalanobis distances from the mean vectors of the two Gaussian distributions. Accordingly, the error should also be estimated with this linear boundary, but we simplify it as in (9) for computational convenience, which indicates only the minimum error with highly probable occurrence.
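To make the idea of an overlap-based theoretical error concrete, the following Python sketch integrates the overlapped area of two univariate Gaussian likelihoods numerically. It is only an illustration under stated assumptions: equal priors by default, trapezoidal integration, and the hypothetical names `gaussian_pdf` and `overlap_error`. The paper itself counts samples on either side of a threshold rather than integrating densities.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate normal density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def overlap_error(mean0, std0, mean1, std1, prior0=0.5, steps=20000):
    """Integrate min(P0 * p0(x), P1 * p1(x)), the overlap of two likelihoods.

    Assumptions: equal priors unless prior0 is given, and trapezoidal
    integration over +/- 8 standard deviations around both means.
    """
    prior1 = 1.0 - prior0
    lo = min(mean0 - 8 * std0, mean1 - 8 * std1)
    hi = max(mean0 + 8 * std0, mean1 + 8 * std1)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        y = min(prior0 * gaussian_pdf(x, mean0, std0),
                prior1 * gaussian_pdf(x, mean1, std1))
        # End points get half weight in the trapezoidal rule.
        total += y if 0 < i < steps else 0.5 * y
    return total * h


# Two unit-variance states whose means are two standard deviations apart.
print(overlap_error(0.0, 1.0, 2.0, 1.0))
```

For two equal-variance states, the two densities cross midway between the means, so the overlap shrinks as the mean difference grows relative to the standard deviation.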

Feature Subset Selection: Filter Approaches.
The traditional feature subset selection process includes two steps: subset generation and subset evaluation. Since $\sum_{k=1}^{n} \binom{n}{k}$ subset candidates can be produced from $n$ variable candidates, various greedy search strategies are generally used to reduce computation. In our study, the subset candidates are fixed, so the search strategy is not considered. We concentrate only on how to evaluate each subset.
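As a quick check of the subset count for six variables, the enumeration can be sketched in Python as follows. The ordering produced by `itertools.combinations` is an assumption of this sketch and need not match the LIM7∼LIM63 numbering used in the paper.

```python
from itertools import combinations

variables = ["LIM1", "LIM2", "LIM3", "LIM4", "LIM5", "LIM6"]

# All non-empty subsets, grouped by size k = 1..6.
subsets = [
    combo
    for k in range(1, len(variables) + 1)
    for combo in combinations(variables, k)
]

print(len(subsets))  # 63 = 2**6 - 1
```

With only 63 candidates, an exhaustive evaluation of every subset is feasible, which is why no greedy search is needed here.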
Typical objective functions in filter approaches are based on distance measures, dependence measures, and information measures [18,20]. The Fisher Discriminant Ratio (FDR), or Fisher criterion, is exemplary of distance measures and is defined as the ratio of the between-class scatter $S_b$ to the within-class scatter $S_w$, computed over all instances.
A multiple correlation coefficient is a multivariate extension of the traditional correlation coefficient, which is a typical statistical technique for measuring linear dependence between variables [18][19][20]. In statistics, the multiple correlation coefficient measures how well a given variable can be predicted by a set of other variables, using the ratio of the correlation between the variable vectors $\mathbf{x}$ and the target values $\mathbf{y}$ to the correlation among the variables themselves in (15)-(18). As a result, the process of multiple correlation is equivalent to the rationale that a good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other.
Correlation is capable of measuring linear dependence only. A more powerful method, which measures nonlinear dependence, is the mutual information $I(V_i; C_j)$ in (19), given the subset candidates $V_i$ and the classes $C_j$ [18][19][20]:
$$I(V_i; C_j) = H(C_j) - H(C_j \mid V_i), \qquad (19)$$
where $H(\cdot)$ is the entropy function. Intuitively, the mutual information measures the information that $V_i$ and $C_j$ share: it measures the amount by which the uncertainty in the class, $H(C_j)$ (the prior uncertainty), is decreased by knowledge of the subset, $H(C_j \mid V_i)$ (the expected posterior uncertainty). For the high order density estimation from limited data, we apply a mixture of three Gaussian distributions.
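The reading of mutual information as prior class uncertainty minus expected posterior uncertainty can be sketched in Python for discrete values. This sketch assumes that the feature values have already been discretized (for example, by binning energy values); it does not reproduce the Gaussian mixture density estimation used in the paper, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def mutual_information(features, classes):
    """H(C) - H(C | V) for discrete feature values.

    Assumption: features are already discretized; continuous energy values
    would need binning or a density model first.
    """
    total = len(classes)
    groups = {}
    for v, c in zip(features, classes):
        groups.setdefault(v, []).append(c)
    # Expected posterior uncertainty: class entropy within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(classes) - conditional


states = [0, 0, 1, 1]                             # toy nonmotion/motion labels
print(mutual_information([0, 0, 1, 1], states))   # feature determines the state
print(mutual_information([0, 1, 0, 1], states))   # feature says nothing about it
```

A feature that fully determines the state removes all class uncertainty, while an unrelated feature removes none.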

Feature Subset Selection: Wrapper Approaches.
Of the numerous classification algorithms available, Bayes Classifiers (BCs) and k-nearest neighbor classifiers (KNNs) have been chosen because both of them, proposed relatively early, have a small number of parameters to optimize, and their competence is generally accepted well enough to regard them as a kind of filter based on recognition rates [13]. In addition, since these statistical classification algorithms have statistical models quite different from one another, each of their results helps us to comprehend the characteristics of the high-dimensional data distribution. Note that their parameters are tuned with the validation set after training, and the error rates are finally counted on the test set.
A BC is a statistically parametric classifier based on applying Bayes' theorem, such as the naïve Bayes classification given in the reference data section. Due to its assumption of strong independence between feature variables, its performance can be improved by removing redundant features. The identical conditions and data distributions given in Section 3.3 are applied, except that the data dimension is taken into account using the mean vector $\mathbf{m}_i$ given in (13) and the covariance matrix $\Sigma_i$ given by the inside summation term in (11). For multiple classes of multidimensional data, each multivariate Gaussian density is given in (20), and its second-order discriminant function is obtained by taking the natural logarithm of each side of (20) and simplifying it for classification in (21).
A KNN is representative of nonparametric methods and is a type of instance-based learning used in classification and regression [13]. In classification, the input instance is classified by a majority vote of its $k$ closest training samples (its neighbors) in the feature space, with the instance being assigned to the most common class among its $k$ nearest neighbors. If $k = 1$, the instance is simply assigned to the class of the single nearest neighbor. The density function is approximated locally, and all computation is deferred until classification. A KNN is prone to overfitting, so $k$ is chosen with an extra validation set (Figure 6).
For the normalization, note that it is meaningless to compare the scores estimated by different strategies in Figure 7.
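The majority-vote rule of a KNN can be sketched in a few lines of Python. This is a minimal illustration assuming Euclidean distance and a toy two-dimensional data set; the function name `knn_predict` and the data are hypothetical, and the value of `k` would in practice be tuned on the validation set as described above.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbors.

    Assumption: Euclidean distance; ties fall to the label counted first.
    """
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy energy vectors: label 0 = nonmotion, label 1 = motion.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_x, train_y, (0.15, 0.1), k=3))
```

Because all training samples are kept and distances are computed only at query time, the cost of classification grows with the size of the training set.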

Result and Analysis
The statistical analysis suggests that every filter shows estimates similar to those of the two wrappers with respect to the 63 individual subsets. Of the filters, MCC is evidently correlated with the two wrappers, KNN and BC, following the general trend that the closer the subset dimension gets to six, the greater the average estimates become and the smaller their variances get; the exceptions are the mean of FDR and the standard deviation of MI. Despite the dissimilar tendencies of FDR and MI in Figure 7, the correlations between the methods with respect to the 63 individual subsets in Table 3 give another insight. The significant correlations between MI and the two wrappers imply that MI merely has ill-fitting scales but tends to produce analogous scores. Likewise, FDR might be said to record scores similar to those of the two wrappers with respect to the individual subsets, considering its correlation with MCC, which is significantly correlated with the two wrappers and TM.
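The point that a measure can have ill-fitting scales and still produce analogous scores follows from the scale invariance of the correlation coefficient, as the following Python sketch shows. The scores are hypothetical values chosen for illustration and are not data from Table 3.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical scores for five subsets (not the paper's data).
wrapper_scores = [0.80, 0.85, 0.90, 0.92, 0.95]
# Same trend on a very different scale.
filter_scores = [10 * s - 3 for s in wrapper_scores]

print(pearson(wrapper_scores, filter_scores))
```

A filter whose raw scores cannot be compared directly with wrapper accuracies can therefore still be assessed by how strongly the two sets of scores correlate across subsets.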
However, it turns out that FDR tends to underestimate subsets more severely the higher their dimension is, because the distributions of the variables from LIM1 to LIM6 have much narrower mean differences compared to their variances in (11). Consequently, as the subset dimension increases, this asymmetric proportion gets worse through summing the diagonal components of the covariance matrix. Even though MCC uses a covariance matrix similar to that of FDR in (17), the covariance matrix for MCC is normalized by the product of standard deviations, and all of its elements are included to calculate the influence of each variable.
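The underestimation effect can be illustrated with a trace-ratio form of the Fisher score in Python. The trace form used here is one common definition and is an assumption of this sketch; the exact FDR expression of the paper is not reproduced, and the numbers are invented for illustration.

```python
def trace_fdr(mean_diffs, variances):
    """Trace-ratio Fisher score with constant factors omitted.

    Assumption: between-class scatter is summarized by the sum of squared
    mean differences, within-class scatter by the sum of variances
    (the diagonal of the covariance matrix).
    """
    between = sum(d * d for d in mean_diffs)
    within = sum(variances)
    return between / within


# One well-separated variable.
one_var = trace_fdr([2.0], [1.0])
# Add a variable with a narrow mean gap and a wide variance.
two_var = trace_fdr([2.0, 0.3], [1.0, 4.0])
print(one_var, two_var)
```

Adding the second variable cannot make the two states harder to separate, yet the summed variances dominate the summed mean differences and the score drops, which mirrors the behavior described above.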

Network Analysis.
General feature selection employs various heuristic greedy search strategies to find an optimal subset, but we conduct an exhaustive full search to understand the attributes of each objective function. In place of this omitted search procedure, we analyze the interrelation between the six variables LIM1∼LIM6 with a social network analysis technique, based on the same data used for the statistical analysis. As a result, this analysis reveals the underlying attributes of each measure. We regard the variables and the subsets as keywords and links, respectively, for network visualization. To begin with, we identify the affirmative influence of each variable on the discriminability estimation. After the subsets are ranked in order of their scores, we choose the 10% of all subsets with the highest scores and split these subsets into their variables. The influence of each variable is then counted by votes, because the stronger the influence of a variable is, the more frequently it appears in this selection. After the negative influences are identified in the same way from the 10% of all subsets with the lowest scores, we subtract the two vote counts for each variable and normalize the result into the range between −1 and 1. With the two selections, the network influences are analyzed by again counting the links between variables, and the links with votes below average are finally removed for clarity. Figure 8 shows the network analysis visualization of KNN, MI, FDR, BC, MCC, and TM.
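The voting step of this procedure can be sketched in Python as follows. The sketch covers only the variable votes, not the link counting, and it rests on two assumptions that the text does not specify: the selection size is the rounded 10% of the subsets (at least one), and normalization divides by the largest absolute vote difference. The function name `variable_influence` is hypothetical.

```python
from collections import Counter


def variable_influence(scores, fraction=0.10):
    """Vote-based influence of each variable, scaled to [-1, 1].

    scores maps a subset (tuple of variable names) to its estimate.
    Assumptions: selection size is round(fraction * number of subsets),
    at least 1; scaling divides by the largest absolute vote difference.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    positive = Counter(v for subset in ranked[:k] for v in subset)
    negative = Counter(v for subset in ranked[-k:] for v in subset)
    names = {v for subset in scores for v in subset}
    diff = {v: positive[v] - negative[v] for v in names}
    scale = max(abs(d) for d in diff.values()) or 1
    return {v: d / scale for v, d in diff.items()}


# Toy scores for three subsets of two hypothetical variables.
toy_scores = {("a",): 0.9, ("b",): 0.1, ("a", "b"): 0.5}
print(variable_influence(toy_scores, fraction=0.34))
```

With 63 subsets and a fraction of 10%, this rounding rule would select six subsets at each end of the ranking.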
The first similarity among them is that no measure except FDR identifies irrelevant variables, because all variables record nonnegative scores. This is interpreted to mean that each variable is strongly or weakly relevant, given that the scores of the two wrappers tend to be proportional to the subset dimension in the above statistical analysis. Another similarity lies in the network connections between variables, which stem from the linearity of each measure. It is observed that KNN resembles MI, and BC resembles MCC, with few differences in network topology, and this similarity becomes prominent when the respective variables are classified into strongly relevant, weakly relevant, and irrelevant variables based on the network analysis visualization (Table 4). Such a topological analogy also explains why, in Table 3, MI records a higher correlation with KNN than with BC, and MCC vice versa. In addition, although FDR is poor at estimating subsets with different dimensions, it appears that its significant correlation with MCC arises because FDR and MCC share the strongly relevant variables LIM2 and LIM4. Note that the subsets in Table 4 are simply transformed from each network topology of links and nodes into a description of subsets and variables.

Experiment 2.
Based on the previous analysis, in experiment 2 we validate the possibility that multivariate approaches can be a better alternative to a conventional univariate approach. The energy of BENBASAT.n, based on the piecewise variance, tends to produce improved accuracy as the size of the sliding window increases. For comparison, we propose four multivariate energy candidates by combining LIM48 = {LIM1, LIM3, LIM4, LIM5}, identified as the best subset in Table 4 when using a KNN, with the idea of a time series of length $n + 1$ taken from the previous data. LIM3.n is used as the multivariate energy basis because LIM3 is identified as the most strongly relevant variable by the KNN evaluation; it is then combined with LIM1, LIM4, and LIM5 one after another in the order of the variable influences in Figure 8(a), creating the multivariate energy candidates LIM3.n, LIM9 (LIM3.n) = {LIM1, LIM3.n}, LIM26 (LIM3.n) = {LIM1, LIM3.n, LIM4}, and LIM48 (LIM3.n) = {LIM1, LIM3.n, LIM4, LIM5}. Finally, as $n$ increases, the changes in accuracy are compared with those of BENBASAT.n, along with the time delay. In this way, we examine the potential of high-dimensional energy and simultaneously reconfirm the fidelity of the network analysis in experiment 1.
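One possible reading of a time series of length n + 1 taken from the previous data is a vector that stacks the current sample with its n predecessors. The Python sketch below follows that reading; the helper name `delay_vectors` and the handling of the first n samples are assumptions of this sketch, not details given in the paper.

```python
def delay_vectors(series, n):
    """Stack each sample with its n predecessors into an (n + 1)-dim vector.

    Hypothetical helper: the first n samples are skipped because their
    history is incomplete.
    """
    return [tuple(series[t - n:t + 1]) for t in range(n, len(series))]


energy = [1, 2, 3, 4]
print(delay_vectors(energy, 2))  # [(1, 2, 3), (2, 3, 4)]
print(delay_vectors(energy, 0))  # [(1,), (2,), (3,), (4,)]
```

Under this reading, n = 0 uses only the current sample, and each increment of n adds one dimension and one more sample of history.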
Figure 9 shows the comparison between BENBASAT.n and the four multivariate approaches in accuracy and time delay. In the KNN results, BENBASAT.14 records the best accuracy of 94.95% at $n = 14$, while LIM48 (LIM3.5) shows its best performance of 94.09% at $n = 5$. If the comparison is evaluated in terms of accuracy only, BENBASAT.14 is without doubt the best choice and the proposed multivariate energy is pointless; however, the same result takes on a contrasting significance when time delay is taken into account. When $n = 14$, the time delay caused by the size of short-term memory and the group delay amounts to 150 ms. Note that we deal only with the delay caused by the algorithm itself, excluding all delays that result from computation and communication. Although it satisfies the optimal delay condition of 150-200 ms for a computer response to a user action, a performance of 94.95% in accuracy with 150 ms in delay cannot be regarded as excellent, considering that motion segmentation is usually only one component of the entire interaction system. In contrast, the change in evaluation criteria leads us to reevaluate LIM48 (LIM3.5), with 94.09% in accuracy and 60 ms in delay. If one tries to reduce the time delay of BENBASAT.14 to as little as that of LIM48 (LIM3.5), a reduction in accuracy of about 4% has to be accepted; this is one of the benefits of LIM48 (LIM3.n), whose accuracy does not change rapidly with changes in time delay. Even at the minimum time delay of 10 ms in LIM48 (LIM3.0) at $n = 0$, its accuracy of 92.17% is remarkably high compared to the 78.37% of BENBASAT.0. The same tendency appears in the BC results in Figure 9, apart from differences in accuracy. In addition, the fact that a nonlinear KNN shows better estimates than a linear BC in Figure 9 implies that the estimate depends so much on the choice of classifier that a new nonlinear classifier might record better accuracy than a KNN, given that we employ the KNN for the comparison with filters because of its simplicity in modeling rather than its excellence in accuracy. For further investigation into the enhancement of accuracy, cutting-edge nonlinear classifiers are more likely to be used, and the details are discussed again in the conclusion. Note that the ultimate goal of this paper is not the improvement of motion segmentation performance but the validation of a few objective functions in filter strategies as replacements for wrapper approaches with larger computation.
Before finalizing the comparison, it is worth mentioning that accuracy increases as variables are added to the basis subset LIM3.n one after another. This result does not merely mean that increasing the dimension improves accuracy; rather, it implies that the improvement in accuracy depends critically on the selection of proper variables, in that Figure 9 shows that increasing the dimension up to 33 within LIM3.n alone yields limited performance improvement. Therefore, this result clearly justifies our analysis in experiment 1.
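The trade-off quoted above can be tabulated with a short Python sketch. The delay rule used here, (n + 1) sample periods at 100 Hz, is an assumption inferred from the three reported pairs (n = 0, 5, 14 giving 10, 60, and 150 ms); it is not stated as a formula in the text, and the function name is hypothetical.

```python
SAMPLING_HZ = 100  # sampling rate reported in the data section


def algorithmic_delay_ms(n, sampling_hz=SAMPLING_HZ):
    """Delay of a window holding the current sample and n previous ones.

    Assumption: delay = (n + 1) sample periods, chosen because it
    reproduces the three (n, delay) pairs quoted in the text.
    """
    return (n + 1) * 1000 / sampling_hz


# (accuracy in %, n) pairs quoted for the KNN results.
reported = {
    "BENBASAT.14": (94.95, 14),
    "LIM48 (LIM3.5)": (94.09, 5),
    "LIM48 (LIM3.0)": (92.17, 0),
}
for name, (accuracy, n) in reported.items():
    print(f"{name}: {accuracy:.2f}% at {algorithmic_delay_ms(n):.0f} ms")
```

Read this way, the multivariate candidate gives up less than one percentage point of accuracy while cutting the algorithmic delay from 150 ms to 60 ms.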

Conclusion
The goal of our study is to validate the reliability of a few objective functions to be used in finding optimal multivariate energy for motion segmentation in accelerometer applications. To achieve this goal, the Fisher discriminant ratio, multiple correlation, and mutual information are tested by comparing them with a theoretical measure and two wrappers, KNNs and BCs, in two experiments. The analysis finally enables us to answer the three questions raised during the investigation, and this study concludes with summarized answers to those questions instead of a formal conclusion.
(1) Can filter approaches estimate accurately enough to predict discriminability between motion and nonmotion states?
Of the three objective functions and the one theoretical measure we suggest, it turns out that multiple correlation, mutual information, and the theoretical measure are competent enough to replace the two wrappers. With respect to the 63 subsets found in the literature, all of them except the Fisher Discriminant Ratio are clearly and significantly correlated with the estimates produced by the two wrappers. Furthermore, the network analysis for the identification of strongly relevant variables clarifies that each function offers a similar interpretation with respect to all 63 possible subsets of the six variables. Since the distributions of the motion and nonmotion states built from the six basic acceleration variables have mean differences that are too narrow and variances that are too wide, the Fisher Discriminant Ratio tends to underestimate their separability as the dimension of the subsets increases. In addition, mutual information turns out to give estimates reliable enough to replace the wrappers, but it is unstable and varies dramatically from run to run due to the intractability of density estimation as the data dimension increases. This phenomenon comes from the computational complexity of high order density estimation using Gaussian mixture models, and we suggest computing a stable multivariate density estimate by using a variable box size over the corresponding variable space, as in [23,24], instead of the expectation-maximization algorithm.
(2) Can it be justified that multivariate energy guarantees superior time delay and accuracy to univariate energy?
In the comparison between one conventional univariate approach and our multivariate approaches, we justified the superiority of the multivariate approach. In our experiment, the univariate approach showed better accuracy than ours by only about 0.9% (94.95% versus 94.09% with a KNN), whereas our multivariate approach reduced the time delay from 150 ms to 60 ms. It is also observed that the univariate approach must accept the risk of a serious loss in accuracy to reduce its time delay, while the performance of our multivariate approaches remains within a stable range.
(3) Can the analysis of the above results offer the understanding of the underlying structure of data distributions?
Using four linear and two nonlinear measures to estimate the separability between motion and nonmotion states with acceleration data, we have concluded that the data is distributed linearly and separably, considering that multiple correlation works successfully in estimating the discriminability. Despite this linearity, since the two distributions are located too closely, the messy condition in the excessively overlapped spaces hinders linear BCs from outperforming nonlinear KNNs. The distribution of the two states varies across variables. Acceleration data without absolute conversion consists of two distributions with nearly identical means but different variances, while absolutely converted acceleration data is distributed with the two states relatively far apart from each other; as a result, linear measures tend to identify variables with absolute conversion as strongly relevant ones, and nonlinear estimators vice versa. Overall, it seems that motion segmentation using acceleration should be achieved by classifiers with a nonlinear hyper boundary, such as multilayer perceptrons or support vector machines, in preference to classifiers depending on a Mahalanobis distance kernel, such as radial basis functions or BCs, because statistical Gaussian modeling is inefficient when data lie on or near a nonlinear manifold in the data space. Modeling data that lie very close to the surface of a sphere requires only a few parameters with an appropriate model, but it requires a very large number of diagonal Gaussians or a fairly large number of full-covariance Gaussians.

Figure 4: Likelihood of each state.

Figure 5: Error approximation of multivariate data distribution.

Figure 6: Likelihood of each state.

Figure 9: Accuracy and time delay comparison between multivariate and univariate approaches.

Table 2: Univariate energy and multivariate energy.

Table 3: Correlations between each method.

Table 4: Variable evaluation by objective functions.