An Analysis of Audio Features to Develop a Human Activity Recognition Model Using Genetic Algorithms, Random Forests, and Neural Networks

This work presents a human activity recognition (HAR) model based on audio features. Using sound as an information source for HAR models is challenging because sound wave analyses generate very large amounts of data. However, feature selection techniques can reduce the amount of data required to represent an audio signal sample. The audio features analyzed include Mel-frequency cepstral coefficients (MFCC). Although MFCC are commonly used in voice and instrument recognition, their utility within HAR models had yet to be confirmed, and this work validates their usefulness. Additionally, statistical features were extracted from the audio samples to generate the proposed HAR model. The amount of information needed to build a HAR model directly affects the accuracy of the model. This problem was also tackled in the present work; our results indicate that the proposed HAR model recognizes a human activity with an accuracy of 85%. Because the model relies on a small set of features, its computational cost is low, thus allowing portable devices to identify human activities using audio as an information source.


Introduction
The capacity to recognize the activity currently being performed by either oneself or someone else is an inherent behavior of an intelligent system, which is why human activity recognition (HAR) is currently a relevant research topic [1][2][3][4][5]. There is a wide range of areas in which HAR can be applied, such as automatic surveillance, elderly care, entertainment, and residential activity support [6][7][8][9][10][11].
HAR is the principal component that makes it possible to recognize high-level human behavior and, therefore, to identify routines and social interactions. Consequently, proposals using different techniques and information sources have been previously published. Additionally, efforts have also aimed at fusing different well-known techniques and information sources in order to increase the accuracy of the systems and the coverage ratio. Some of these analyses are presented next.
Moore et al. [12] presented a framework titled ObjectSpaces that uses familiar object-oriented constructs like classes and inheritance to manage object context and allows for the classification of activities. In that work, the authors demonstrated that both familiar and previously unseen objects could be recognized using action and context information. However, there were some constraints in this proposal. For example, associating actions with objects can reduce the set of activities that can be recognized, even with a well-defined action domain, since people sometimes perform activities using unusual objects. Additionally, video is a complex signal and needs cameras deployed in the environment; therefore, two information sources are required.
An increasing number of portable devices contain multiple sensors, all capable of recording information from different sources. These devices have been used in several analyses to recognize human activities. The one presented by Lester et al. [13] proposes gathering data from different sources using one device and using a modified version of the Adaptive Boosting algorithm (AdaBoost) for feature selection. A similar work proposed the use of Hidden Markov Models (HMM) to classify certain activities [14].
Other approaches are based on the methodology and theories of social psychology, collecting audio data that can be tagged with Essential Social Interaction Predicates (ESIP). A relevant example of this technique is proposed by Lester et al. [13], who built a model named Discriminative Conditional Restricted Boltzmann Machine (DCRBM). This model combines a discriminative approach with the capabilities of the Conditional Restricted Boltzmann Machine (CRBM). It allows the discovery of actionable components from ESIP to train the DCRBM model, which can then be used to generate low-level data corresponding to the ESIP with a high degree of accuracy.
Binary silhouettes have also been used to represent different human activities. Uddin et al. proposed a system based on Generalized Discriminant Analysis on Enhanced Independent Component, obtaining features from binary silhouette information to be used with Hidden Markov Models for training and recognition [15]. Likewise, principal component analysis (PCA) [16][17][18] and independent component analysis (ICA) [19] have also been used for this purpose.
The process of feature extraction is used to represent a signal using several derived values or features intended to be informative and nonredundant. This process can easily result in very large sets of features that describe a signal. Nevertheless, not all extracted features will be useful in discerning between different kinds of signals (i.e., signals representing different activities), which is why feature selection is also needed. In order to determine which set of features could accurately classify different activities, we implemented a feature selection analysis that included a genetic algorithm, forward selection and backward elimination steps, and a random forest (RF) or neural network (NN) algorithm.
Briefly, we attempted to identify a small set of features that could accurately classify eight different activities using only features derived from audio recordings. Additionally, the accuracy of such a model was compared against models with no size restrictions and against models obtained using a different state-of-the-art methodology.

Materials and Methods
Three main tasks were performed to generate the HAR model: audio collection, feature extraction, and feature selection. Feature extraction and selection were performed using R (https://www.r-project.org/).

Data Description.
The dataset comprises seven activities commonly performed in a residential setting, plus a collection of no-activity noises: brewing coffee, cooking, using the microwave oven, taking a shower, dish washing, hand washing, teeth brushing, and no-activity noise. Sounds were recorded individually for each activity. Table 1 shows the type and a short description of each. It is worth mentioning that four of these activities share running water as a similar background sound, which adds to the complexity of the HAR problem. All recordings have been made available through the AmiDaMi research group page at http://ingsoftware.reduaz.mx/amidami/.

Recording Devices.
The devices used to record the audio clips were chosen for the differing specifications of their embedded microphones. Table 2 lists the system on chip (SoC) and operating system of each selected mobile phone, identifying the hardware and software involved in the internal audio recording and preprocessing.

Spatial Environments.
In order to cover a wide range of sounds, recordings were made in different houses, which meant different spatial environments, audio reflections, and background sounds. Different home facilities also meant different cookware, home appliances, and running water reflections. The mobile phone was placed near where the activity was being performed; the distance, an example of which is shown in Figure 1, was chosen in order to record the sounds as clearly as possible.

Metadata.
Audio clips were recorded at sample rates between 8,000 Hz and 44,100 Hz, in mono or stereo, depending on the device used. This range of sample rates ensured that most mobile phones could be included, allowing for a future expansion of the database. Table 3 summarizes the metadata for each activity in the dataset.

Data Preparation.
In this work, the audio samples received no preprocessing other than being trimmed into 10-second clips; no further audio processing was performed, in order to simplify implementation on any device.

Feature Extraction.
In order to obtain information that could potentially distinguish which activity was being performed, several features were extracted from the audio clips. As shown in [20,21], 10-second audio clips seem to be adequate for this task. However, some activities lasted longer than ten seconds, yielding recordings longer than needed. Such recordings were trimmed into as many 10-second audio clips as possible. To avoid issues between mono and stereo recordings, only the left channel of stereo recordings was used. Each 10-second clip was then converted into an integer array, where each integer represented the magnitude of the sound wave at that instant of time. Even though all clips had the same duration, the length of the arrays that represented them varied from 80,000 to 441,000 samples depending on the sample rate of the original recording.
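The trimming and channel selection described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R code; the function names `split_into_clips` and `left_channel`, and the representation of a stereo recording as a list of (left, right) pairs, are assumptions made for the example.

```python
def split_into_clips(samples, sample_rate, clip_seconds=10):
    """Cut a recording into as many full clips as possible; drop the remainder."""
    clip_len = sample_rate * clip_seconds
    n_clips = len(samples) // clip_len
    return [samples[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


def left_channel(frames):
    """Keep only the left channel of a stereo recording given as (left, right) pairs."""
    return [left for left, _right in frames]


# Synthetic 25-second stereo recording at 8,000 Hz (sample values are arbitrary).
stereo = [(i % 100, -(i % 100)) for i in range(25 * 8000)]
mono = left_channel(stereo)
clips = split_into_clips(mono, sample_rate=8000)
print(len(clips), len(clips[0]))  # 2 clips of 80,000 samples each
```

At 44,100 Hz the same function yields clips of 441,000 samples, which matches the range of array lengths reported above.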
Features that statistically describe sound waves have previously been found to be of importance in solving similar problems [22]. Thus, the 16 statistical features listed below were extracted from each sample. Mascia et al. showed that Mel-frequency cepstral coefficients (MFCC) [23] can be used to identify acoustic descriptors from environmental sound [24]; thus, MFCC were also extracted from the audio samples. To do so, each 10-second audio clip was split into ten 1-second audio clips, from each of which 12 cepstral coefficients were calculated, resulting in 120 MFCC per sample. In order to avoid the matrix that is generated during the extraction of the MFCC, the vectorization process showcased by Mascia et al. [24] was performed. In order to avoid any outlier-related problems, features were rank-normalized as described in (1), where $x_{ij}$ is the $i$th value of the feature $x_j$ and $z_{ij}$ is the $i$th value of the rank-normalized feature $z_j$. As a result, all features range between 0 and 1, with equidistant steps between each element of the array.
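A minimal sketch of rank normalization follows. It assumes one common form, in which a value of rank r among n values (ranks starting at 1) maps to (r - 1)/(n - 1); this is consistent with the stated range of 0 to 1 and equidistant steps, but it may differ from the exact expression in (1). The function name `rank_normalize` and the tie-breaking rule are assumptions.

```python
def rank_normalize(values):
    """Map values to [0, 1] by rank: smallest -> 0, largest -> 1.

    Assumption: ties are broken by original order (Python's sort is stable).
    """
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        out[idx] = rank / (n - 1)
    return out


feature = [120, -3, 9000, 15, 42]  # one feature containing an outlier (9000)
print(rank_normalize(feature))     # [0.75, 0.0, 1.0, 0.25, 0.5]
```

The outlier ends up only one step away from its neighbor, which is the property that motivates the normalization.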

Feature Selection.
In the first step of the feature selection process, a genetic algorithm called Galgo [25] reduced the size of the database by determining which features had more chances of being useful. To do so, a set of random five-feature models was evolved throughout 200 generations, during which the models mutated, reproduced, and recombined, eventually yielding a highly accurate model. Fitness was defined as the accuracy of the model in classifying the eight previously defined activities using a nearest-centroid approach and following a 3-fold train-test methodology. This whole process was repeated 300 times, resulting in 300 highly accurate five-feature models. The number of times each feature was found in these models was used to determine a feature rank, which described the potential classification capabilities of each feature.
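The procedure above can be illustrated with a small, self-contained sketch. It is not the Galgo implementation: the recombination and mutation operators, the population size, and the synthetic data are all assumptions chosen to keep the example short, and the numbers of generations and repetitions are far smaller than the 200 and 300 used in this work.

```python
import random
from collections import Counter


def centroid_accuracy(X, y, feats, folds):
    """Nearest-centroid accuracy of a feature subset, pooled over the test folds."""
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        sums, counts = {}, Counter()
        for i, row in enumerate(X):
            if i in held_out:
                continue
            total = sums.setdefault(y[i], [0.0] * len(feats))
            for k, f in enumerate(feats):
                total[k] += row[f]
            counts[y[i]] += 1
        cents = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        for i in test_idx:
            def dist(c):
                return sum((X[i][f] - cents[c][k]) ** 2 for k, f in enumerate(feats))
            correct += min(cents, key=dist) == y[i]
    return correct / len(X)


def evolve(X, y, folds, n_feats, rng, size=5, pop=12, generations=15):
    """Evolve one fixed-size feature model (hypothetical operators, not Galgo's)."""
    def fitness(model):
        return centroid_accuracy(X, y, model, folds)

    population = [rng.sample(range(n_feats), size) for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop // 2]            # the fitter half reproduces
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            child = rng.sample(sorted(set(a) | set(b)), size)   # recombination
            if rng.random() < 0.3:                              # mutation
                unused = [f for f in range(n_feats) if f not in child]
                child[rng.randrange(size)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Synthetic data: features 0 and 1 carry the class signal, the rest are noise.
rng = random.Random(0)
n, n_feats = 60, 10
y = [i % 3 for i in range(n)]
X = [[y[i] + rng.gauss(0, 0.3) if f < 2 else rng.gauss(0, 1) for f in range(n_feats)]
     for i in range(n)]
folds = [[i for i in range(n) if (i // 3) % 3 == k] for k in range(3)]

models = [evolve(X, y, folds, n_feats, rng) for _ in range(5)]
rank = Counter(f for m in models for f in m).most_common()
print(rank)  # features ordered by how often they appear in the evolved models
```

The same three folds are passed to every fitness evaluation, mirroring the fixed 3-fold train-test methodology described above.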
Based on this rank, forward selection and backward elimination were performed, defining which features were to be used in the next stage of the feature selection procedure. Forward selection is a well-known methodology used to build models at low computational cost. From the list of ranked features, this approach added one feature at a time and evaluated the performance of the resulting model. Once the last feature was added and the model with all features was evaluated, the features from the model that achieved the highest accuracy were kept, and the rest were discarded. Backward elimination was then performed to avoid redundant information and to further reduce the number of features to be used. This process consisted of removing one feature at a time and evaluating the performance of the model, starting with the final model of the forward selection procedure and removing first the least frequent feature in the 300 Galgo-generated models. If eliminating a feature did not decrease the accuracy of the model, then that feature was removed from the final model. This process was repeated until model stability was achieved. Accuracy in both the forward selection and the backward elimination procedures was measured following a 3-fold train-test methodology, using the same folds as the ones used in the genetic algorithm.
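The two steps above can be sketched as follows. The function names, the tie-breaking rule in forward selection (the smallest best model wins), and the toy `evaluate` scorer are assumptions for illustration; in this work the score would be the 3-fold accuracy.

```python
def forward_selection(ranked, evaluate):
    """Add ranked features one at a time; keep the prefix with the best score."""
    best_k, best_score = 1, float("-inf")
    for k in range(1, len(ranked) + 1):
        score = evaluate(ranked[:k])
        if score > best_score:          # ties keep the smaller model (assumption)
            best_k, best_score = k, score
    return ranked[:best_k]


def backward_elimination(selected, evaluate):
    """Drop features, least frequent first, while the score does not decrease."""
    current = list(selected)
    changed = True
    while changed and len(current) > 1:
        changed = False
        base = evaluate(current)
        for f in reversed(current):     # `selected` is ordered most frequent first
            trial = [g for g in current if g != f]
            if evaluate(trial) >= base:
                current, changed = trial, True
                break
    return current


# Toy scorer: the fraction of truly useful features present in the model.
useful = {"sd", "mfcc_3", "p95"}

def evaluate(model):
    return len(useful & set(model)) / len(useful)

ranked = ["sd", "noise_a", "mfcc_3", "p95", "noise_b"]
kept = forward_selection(ranked, evaluate)
final = backward_elimination(kept, evaluate)
print(kept)   # ['sd', 'noise_a', 'mfcc_3', 'p95']
print(final)  # ['sd', 'mfcc_3', 'p95']
```

Forward selection stops benefiting once every useful feature is in the model, and backward elimination then removes the noise feature that was carried along by its rank.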
The features selected by the backward elimination algorithm were then used to generate two HAR models, one through a random forest (RF) implementation [26] and the other through a neural network (NN) implementation. RF is a robust machine learning technique that can handle classification problems based on bagging and random feature selection [27,28]. Moreover, it allows for the calculation of the error during the model generation without having to split the data into train and test sets. This algorithm uses an out-of-bag (OOB) error, an unbiased estimate of the true prediction error: as the forest is built, each tree can be tested on the samples not used in building that tree. Breiman [29] demonstrated that estimating the OOB error has the same result as estimating the error using a test set of the same size as the training set. The NN model was obtained using 70% of the data to train the model and the rest to test it. This algorithm was analyzed because it can be easily implemented in a mobile phone, giving ubiquity to the proposed HAR solution. These models were compared with an RF- and an NN-based model that included all original features, an RF- and an NN-based model that included all MFCC features but no statistical features, and a model developed by Kabir et al. [30], which was evaluated in three different setups.
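To make the out-of-bag idea concrete, the sketch below draws the bootstrap sample that one tree would be trained on and lists the samples left out of it. It illustrates bagging only and does not build a forest; the function name `bootstrap_with_oob` and the number of simulated trees are assumptions.

```python
import random


def bootstrap_with_oob(n_samples, rng):
    """Draw one bootstrap sample (with replacement) and return it together
    with the out-of-bag indices that the sample never picked."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


rng = random.Random(42)
n_samples, n_trees = 1159, 200   # 1,159 clips as in this work; 200 trees is arbitrary
oob_sizes = [len(bootstrap_with_oob(n_samples, rng)[1]) for _ in range(n_trees)]
frac = sum(oob_sizes) / (n_trees * n_samples)
print(round(frac, 3))            # close to 1/e, i.e. roughly 37% of the samples
```

Because roughly a third of the clips are unseen by any given tree, every clip can be scored by the trees that did not train on it, which is what allows an error estimate without a separate test set.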

Results
Audio collection resulted in a total of 64 recordings and 1,159 10-second audio clips. Table 4 details the number of recordings and audio clips obtained for each activity. Since each audio clip had 16 statistical and 120 MFCC features extracted from it, the final database had a size of 1,159 × 136 elements.
Figure 2 shows how each one of the 300 Galgo-generated models evolved throughout 200 generations, yielding an average accuracy of 0.68. It can also be seen that the models had achieved stability; that is, no more generations were needed. Similarly, Figure 3 shows that the frequency with which features appeared in the 300 models had stabilized, at least for the 30 most frequent features. This means that even if more models had been generated, the rank of the most frequent features would not have changed.
The forward selection procedure selected the 35 most frequent features, and the backward elimination strategy removed 25 of them. This resulted in only 9 features with potential classification power: the trimmed mean, the standard deviation, the 95th percentile, and 6 MFCC. These features, whose heat map is shown in Figure 4, were used to generate an RF- and an NN-based HAR model. The weights of the NN-based HAR model were adjusted throughout 100 iterations, and the RF-based HAR model was adjusted using 5,000 trees.
Table 5 shows the classification accuracy achieved by each model. It can be noted that adding more features to the RF-based HAR model increased its accuracy. That is, the model with all features had the highest accuracy, followed by the model with all MFCC features and then by the model with 9 features. Conversely, the accuracy of the NN-based HAR model decreased when more features were included, with its best performance obtained when only the 9 selected features were used. Nonetheless, both 9-feature models were able to outperform the model proposed by Kabir et al. Confusion matrices that describe how each model classified each sample are shown in Tables 6, 7, 8, 9, 10, and 11. In addition, Table 5 shows that, regardless of the scenario, the RF-based models outperform all the others, including those of Kabir et al.
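For readers relating the confusion matrices to the accuracies in Table 5, the sketch below shows how an overall accuracy and per-class recall follow from a confusion matrix. The 3 × 3 matrix is made up for illustration and is not taken from Tables 6 to 11; the convention that rows hold the true class is also an assumption.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified samples (diagonal) over all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """Fraction of each true class (row) that was classified correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]


# Hypothetical matrix: rows are true classes, columns are predicted classes.
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 3, 7],
]
print(accuracy(toy))          # 0.8
print(per_class_recall(toy))  # [0.8, 0.9, 0.7]
```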

Discussion and Conclusions
The focus of this research is to find features that efficiently describe the behavior of audio signals representing activities performed by humans, in order to develop a HAR model using well-known machine learning techniques that can be used in low-power and mobile devices to provide context information. The results, presented in Section 3, allowed us to identify the following aspects, which answer the questions presented in Section 1:
(i) Mel-frequency cepstral coefficients describe the behavior of the audio signal well: we identified that MFCC accurately describe the behavior of an audio signal used to generate a HAR model. Even after the feature selection process, 6 of the 9 features in the final model are MFCC descriptors.
(ii) Statistical features are important for describing audio signals: we propose using the time evolution of the signal together with first- and second-order statistical features, which can be computed at low computational cost. This means that they can be used on low-power processors, such as new portable embedded systems (e.g., Arduino and Galileo, among others) or low-cost smart phones, and on high-end smart phones at low battery cost. The results section shows that, even though several feature selection procedures were carried out, three statistical features survived into the final feature set, meaning that they can be used for HAR at low computational cost.
(iii) The selected features can describe the behavior of the audio signal, although with some loss of fitness compared with all the features together: the aim of feature selection is to reduce the computational cost and to maximize fitness; nevertheless, the selected features cannot describe the behavior as well as the full set of features. Yet, we consider that reducing the number of features that need to be extracted from the signal is more important for enabling ubiquity in future works, given the lower computational cost and lower battery consumption when processing the audio signal.
One of the strongest points presented in this research is the use of random forest for human activity recognition targeting quasi-real-time mobile applications. The presented methodology builds a random forest for this classification; the procedure is consistent and adapts to sparsity; its rate of convergence depends only on the number of strong features and not on how many noise variables are present [31]; the complexity of the classification methodology is $\approx O(M(mn \log n))$, where $m$ is the number of features, $n$ is the number of instances, and $M$ is the number of trees. Nevertheless, once the model is trained and refined, it can be transferred to a mobile application; the application will only use the random forest structure and values, and it will not need to retrain the model on the mobile device, thereby avoiding that computational complexity. Additionally, we found a particular behavior of neural networks (NN) when only MFCC are used to describe the audio signal: in this scenario, an NN with a single hidden layer cannot adjust its weights efficiently, leading to the worst misclassification case. This is because MFCC, after percentile rank normalization, tend to have a low standard deviation, meaning a high similarity, which tends to overfit the NN; nevertheless, the use of MFCC in combination with selected statistical features produces an efficient model. These arguments allow us to conclude that the best results are obtained using a clever selection of statistical and MFCC features.
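As a rough order-of-magnitude illustration of why training is kept off the device, the values reported in this work (9 selected features, 1,159 clips, 5,000 trees) can be substituted into a training cost of the usual form $O(M\,m\,n \log n)$. The base-2 logarithm and the omission of constant factors are assumptions, so the result is only indicative:

```latex
\begin{align*}
n \log_2 n &\approx 1159 \times 10.18 \approx 1.18 \times 10^{4},\\
m \, n \log_2 n &\approx 9 \times 1.18 \times 10^{4} \approx 1.06 \times 10^{5},\\
M \, m \, n \log_2 n &\approx 5000 \times 1.06 \times 10^{5} \approx 5.3 \times 10^{8}.
\end{align*}
```

Classifying a new clip, by contrast, only requires walking each stored tree from root to leaf, which does not involve this training cost.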

Future Work
As part of future work, we propose adding more activities that are commonly performed in residential homes; additionally, we propose extracting more features and applying a clever feature selection to reduce even further the amount of data needed. The proposed future work is as follows:
(i) To study other human activities in residential homes.
(ii) To use different features with efficient computational cost.
(iii) To implement the predictive models in a mobile application for real-world deployment.
(iv) To implement other feature selection techniques for comparison and to acquire descriptive features.
Also, we plan to compare the obtained models with a second group of models obtained using a clustering technique to evaluate which can offer better results in a mobile application implementation. Once the model is built and optimized, we plan to implement the methodology in a mobile phone application. Also, it aims to check whether these applications can

Figure 1: Typical distance between mobile phone and activity.
The 16 statistical features extracted from each sample are: kurtosis of the probability distribution of the integer array; skewness of the probability distribution of the integer array; mean of the integer array; median of the integer array; standard deviation of the integer array; variance of the integer array; coefficient of variation (CV) of the probability distribution of the integer array; inverse CV; 1st, 5th, 25th, 50th, 75th, 95th, and 99th percentiles of the probability distribution of the integer array; and mean of the integer array after trimming the bottom and top 5% of elements.
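A minimal sketch of how these 16 features could be computed from one integer array is shown below. It is not the authors' R code: the function names, the population (rather than sample) form of the standard deviation, the moment-based kurtosis and skewness, and the linear-interpolation percentile are all assumptions, since several definitions of each are in common use.

```python
import statistics


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (one of several common definitions)."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def statistical_features(samples):
    """Return the 16 statistical features of one clip as a dictionary."""
    n = len(samples)
    s = sorted(samples)
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)              # population form (assumption)
    var = sd ** 2
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    cut = n * 5 // 100                           # elements trimmed from each end
    feats = {
        "kurtosis": m4 / var ** 2 if var else 0.0,
        "skewness": m3 / sd ** 3 if sd else 0.0,
        "mean": mean,
        "median": statistics.median(samples),
        "sd": sd,
        "variance": var,
        "cv": sd / mean if mean else float("inf"),
        "inverse_cv": mean / sd if sd else float("inf"),
        "trimmed_mean": statistics.fmean(s[cut:n - cut]),
    }
    for p in (1, 5, 25, 50, 75, 95, 99):
        feats[f"p{p}"] = percentile(s, p)
    return feats


feats = statistical_features(list(range(1, 101)))   # toy "clip": integers 1..100
print(len(feats), feats["median"], feats["trimmed_mean"])  # 16 50.5 50.5
```

The guards against a zero mean or zero standard deviation are a practical addition, since audio waveforms are often centered near zero.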

Figure 3: Evolution of the rank for the most frequent features.

Figure 4: Heat map of the 9 features with potential classification capabilities.

Table 1: General description of the activities.

Table 2: System on chip and operating system of the selected mobile phones.

Table 3: Audio clip metadata per activity.

Table 5: Evaluation of each model.

Table 6: Confusion matrix of the RF-based HAR model with 9 features.

Table 7: Confusion matrix of the NN-based HAR model with 9 features.

Table 8: Confusion matrix of the RF-based HAR model with all features.

Table 9: Confusion matrix of the NN-based HAR model with all features.

Table 10: Confusion matrix of the RF-based HAR model with all MFCC features.

Table 11: Confusion matrix of the NN-based HAR model with all MFCC features.