Nowadays, sound classification applications are becoming increasingly common within the scope of Wireless Acoustic Sensor Networks (WASN). However, these architectures require special considerations, such as seeking a balance between transmitted data and local processing. This article proposes an audio processing and classification scheme focused on WASN architectures. It also analyzes in detail the time efficiency of the different stages involved (from acquisition to classification). This study provides useful information that makes it possible to choose the best tradeoff between processing time and classification accuracy. The approach has been evaluated on a wide set of anuran songs recorded in their natural habitat. Among the conclusions of this work, there is an emphasis on the disparity between the classification, feature extraction, and feature construction times of the different studied techniques, all of which depend notably on the overall number of features used.
In the last few years, the number of devices focused on the monitoring and analysis of environmental parameters has grown strongly. However, sometimes the intended purpose is not related to the direct measurement of a parameter and requires the analysis of complex phenomena. An example of this is phenology, which consists of the study of periodic plant and animal life cycles and of how some events are related to seasonal and climatic variations [
Furthermore, this relationship has also been exploited in reverse, using phenological observations to predict climate evolution. A proof of this fact can be seen in some studies [
All of these studies traditionally focus on comparing different techniques of audio processing, audio feature selection, or classification. However, a WASN approach requires considering additional factors, such as execution times or the amount of information transmitted by each approach, which can seriously condition the applicability of each one.
In this sense, this paper proposes an audio processing and classification scheme focused on these kinds of architectures. Additionally, the proposal is complemented with a detailed time analysis of the different processes involved in the scheme (from the acquisition to the classification stages), providing useful information to choose the option with the best tradeoff between processing time and classification accuracy.
Specifically, this paper is organized as follows: Section
The proposed architecture is focused on a distributed solution in which the audio analysis is resolved in the distributed nodes of a WASN. This network is built on a mesh structure with dynamic routing (the network topology is described later in Section
Audio processing scheme.
The proposed WASN architecture is made up of a set of distributed nodes and a central node called the base station (see Figure
WASN architecture.
On the one hand, the base station is traditionally a standard PC equipped with a radio adapter for the WASN connection. It acts as a gateway to other network technologies and provides centralized storage and processing capacity.
On the other hand, the distributed nodes are embedded systems equipped with a wireless radio that allows them to connect with the other network elements (neighboring nodes and the base station). Due to the remote location of the nodes in a natural environment, they also require an alternative power source (e.g., solar systems), supported by batteries, to guarantee their operation in adverse environmental conditions. This fact makes power consumption a critical constraint in these nodes, requiring drastic reductions in computational and radio power consumption. However, low-power transceivers, such as those based on IEEE 802.15.4 [
Specifically, depending on user needs, different tradeoffs between the amount of transmitted information (radio consumption) and execution time (computational cost) can be established. In this sense, each network node must be able to locally characterize and classify sounds, where the lowest classification error is not the only objective. The computational requirements of each algorithm must also be considered to assess its viability on these kinds of platforms.
For this reason, in the next sections, the proposed scheme is detailed and later complemented with a comprehensive analysis of its execution performance at each audio classification stage.
As introduced above, audio feature extraction is performed frame by frame, obtaining several parameters for each frame. Later, based on these first direct features, the information set is completed with a second feature construction stage, in which new, additionally estimated information is provided. Both analyses are detailed in the next two subsections, while the classification stage is analyzed in the third.
In this work, the feature extraction of a frame follows two approaches. On the one hand, the first proposed approach consists of extracting the features defined by the MPEG-7 standard. This standard defines a sample rate of 44.1 kHz and recommends a
MPEG-7 features and their origin analysis.

Feature | Based on
---|---
Total power | Spectrogram analysis
Relevant power | Spectrogram analysis
Power centroid | Spectrogram analysis
Spectral dispersion | Spectrogram analysis
Spectrum flatness | Spectrogram analysis
Frequency of the formants (×3) | Linear prediction coding (LPC) analysis
Bandwidth of the formants (×3) | Linear prediction coding (LPC) analysis
Pitch | Harmonicity analysis
Harmonic centroid | Harmonicity analysis
Harmonic spectral deviation | Harmonicity analysis
Harmonic spectral spread | Harmonicity analysis
Harmonic spectral variation | Harmonicity analysis
Harmonicity ratio | Harmonicity analysis
Upper limit of harmonicity | Harmonicity analysis
For more details, see the MPEG-7 standard [
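As an illustration of the spectrogram-based group, the following MATLAB sketch computes simplified analogues of four of these descriptors for a single frame; the 10 ms frame length at 44.1 kHz is an assumption consistent with the timing figures reported later, and the formulas are approximations rather than the normative MPEG-7 definitions.

```matlab
% Simplified sketch of spectrogram-based features for one audio frame.
% Frame length (10 ms at 44.1 kHz) and the exact formulas are
% illustrative approximations, not the normative MPEG-7 definitions.
fs = 44100;
x  = randn(441, 1);                              % one frame (replace with real audio)
N  = numel(x);
w  = 0.54 - 0.46 * cos(2*pi*(0:N-1)' / (N-1));   % Hamming window
X  = fft(x .* w);
P  = abs(X(1:floor(N/2))).^2;                    % one-sided power spectrum
f  = (0:numel(P)-1)' * fs / N;                   % bin frequencies (Hz)

totalPower    = sum(P);                                        % total power
powerCentroid = sum(f .* P) / totalPower;                      % power centroid (Hz)
spectralDisp  = sqrt(sum((f - powerCentroid).^2 .* P) / totalPower); % dispersion
flatness      = exp(mean(log(P + eps))) / mean(P);             % spectrum flatness
```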
On the other hand, other alternatives propose an MFCC analysis for feature extraction. MFCCs are based on the sound cepstrum, obtained through homomorphic processing [
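As a rough sketch of this pipeline (power spectrum, mel filterbank, logarithm, and DCT), the following MATLAB fragment computes 13 MFCCs for one frame; the filterbank size, FFT length, and coefficient count are assumptions for illustration.

```matlab
% Minimal MFCC sketch; filterbank size and coefficient count are assumed.
fs = 44100;  x = randn(441, 1);               % one frame (replace with audio)
N = 512;  nMel = 26;  nCoef = 13;
X = abs(fft(x, N)).^2;  X = X(1:N/2+1);       % power spectrum (windowing omitted)
mel  = @(f) 2595 * log10(1 + f/700);          % Hz -> mel
imel = @(m) 700 * (10.^(m/2595) - 1);         % mel -> Hz
edges = imel(linspace(mel(0), mel(fs/2), nMel+2));   % filter band edges (Hz)
bins  = (0:N/2)' * fs / N;                    % FFT bin frequencies (Hz)
H = zeros(nMel, N/2+1);                       % triangular mel filterbank
for m = 1:nMel
    lo = edges(m);  c = edges(m+1);  hi = edges(m+2);
    H(m, :) = max(0, min((bins-lo)/(c-lo), (hi-bins)/(hi-c)))';
end
E = log(H * X + eps);                         % log mel-band energies
n = (0:nMel-1)';  k = 0:nCoef-1;              % DCT-II of the log energies
mfccs = cos(pi * (n + 0.5) * k / nMel)' * E;  % first nCoef coefficients
```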
In the previous section,
This approach corresponds to branch (a), the left branch, in Figure
This approach corresponds to branch (b), the center branch, in Figure
The last alternative corresponds to branch (c), the right branch, in Figure
In this sense, this method provides
HMM structure.
Once the different alternatives for frame characterization have been analyzed, the next step is to use these features to identify the class to which they belong (step (4) of all branches in Figure
In the final stage, step (5) of Figure
In the previous sections, different implementations or alternatives for animal sound analysis have been proposed. However, from an implementation point of view, these algorithms are not trivial and may require considerable execution time.
In this sense, an exhaustive time analysis of each stage is essential to guarantee real-time operation. Specifically, and according to the previous section, the analysis time can be divided into five stages: audio acquisition, frame feature extraction (direct frame analysis), frame or sequence feature construction (frame set or sequence analysis), feature classification of each frame, and finally the global sound classification. However, for some of them, the processing times are not static. Specifically, as described in previous sections, an animal sound can be characterized by a set of features, and several of these times scale with that set: the feature extraction time of each frame grows as the number of these parameters increases; the feature construction time of the additional information for each frame (or sequence) grows as the number of direct or additional parameters increases; and the classification time of each frame (or sequence) increases with its feature dependency. As will be addressed in Section
Considering the first three times in the former list, their sum is an important restriction in real-time audio processing applications, where this total time must always be less than the duration of the audio fragment. In this sense, this constraint makes an exhaustive comparative time study of all proposed alternatives essential, seeking the best tradeoff between the number of features and the time available.
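A minimal sketch of this budget check, assuming 10 ms frames (a duration consistent with the normalized times reported later) and hypothetical stage functions, could look as follows.

```matlab
% Real-time budget sketch: the three per-frame stages must complete
% before the next frame arrives. Stage functions are hypothetical.
frameDuration = 0.010;                           % assumed 10 ms frames
t = tic;
% feats = extractFrameFeatures(frame);           % stage 1 (hypothetical)
% built = constructFeatures(feats, history);     % stage 2 (hypothetical)
% label = classifyFrame(built, model);           % stage 3 (hypothetical)
elapsed = toc(t);
assert(elapsed < frameDuration, 'Real-time constraint violated');
```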
Moreover, although not directly related to real-time applications, the time needed to obtain the classifiers is also related to the feature space dimension. Due to this, a comparative analysis of this time could also be useful, especially in applications with a dynamic knowledge base in which the training process is repeated periodically.
In light of all of the above, these times are studied in detail in the next section. This analysis makes it possible to compare the different proposed alternatives and to identify the least computationally demanding ones.
As a testbed for the previously described strategy, 63 sound files provided by the Zoological Sound Library [
Testbed audio details.

Sound class | Sound files | Sound seconds | Pattern files | Pattern seconds | Pattern frames
---|---|---|---|---|---
 | 23 | 2,576 | 2 | 21 | 1,439
 | 10 | 415 | 1 | 29 | 248
 | 30 | 3,062 | 2 | 89 | 375
Silence/noise | — | — | — | — | 11,841
Total | 63 | 6,053 | 5 | 139 | 13,903
Furthermore, a characteristic common to all of these sounds is that they were recorded in a natural habitat with a significant presence of noise (wind, water, rain, traffic, voices, etc.), which poses an additional challenge for the classification process.
Although the whole process was designed to be finally implemented in distributed nodes, this study was carried out on a laboratory prototype equipped with an Intel® Core™ i7-4770 processor at 3.4 GHz and 8 GB of RAM. All the algorithms have been coded in MATLAB® with an implementation that does not explicitly exploit code parallelism over the different cores. However, MATLAB's default built-in multithreaded computation has been exploited.
The next sections show and discuss processing time results related to the classification of these sounds.
As mentioned in Section
Time analysis of the frame feature extraction.

Parameter type | Primary process (time, µs) | Secondary time (µs) | Total time (µs)
---|---|---|---
MPEG-7 | Spectrogram (41.33) | 2.48 | 43.80
 | | 20.23 | 61.55
 | | 9.42 | 50.75
 | | 14.01 | 55.33
 | | 52.22 | 93.55
 | LPC (1,777.92) | 0.00 | 1,777.92
 | | 0.00 | 1,777.92
 | | 0.00 | 1,777.92
 | | 5.86 | 1,783.78
 | | 8.75 | 1,786.67
 | | 1.87 | 1,779.79
 | | 2.78 | 1,780.70
 | Harmonicity (1,262.02) | 0.00 | 1,262.02
MFCC | Single process (all coefficients at once) | 44.29 | 44.29
On the other hand, the MFCC features use a single process and are all calculated at once (see Table
In summary, the extraction time for the full MPEG-7 feature set is 3.2 ms (approx. 1/3 of the frame duration), while the MFCC feature set requires 45 µs (approx. 0.5% of the frame duration).
In this sense, a reduction in the MPEG-7 feature dimensionality (a reduction in the number of features extracted) will reduce the frame feature extraction time. However, as discussed above, this time is strongly conditioned by the parameter type (or its primary processing needs), with a significant reduction being obtained when one of the primary processes becomes unnecessary. Conversely, a reduction in the MFCC feature dimensionality does not involve any reduction in this time, since all coefficients are obtained simultaneously.
Following the techniques described in Section
Time analysis of the feature construction process.

Feature construction | Feature set | Number of features | Processing time (µs) | Accuracy | Best classifier
---|---|---|---|---|---
RegDis | MFCC | 13 | 85.74 | 92.59% | Bayes
RegDis | MPEG-7 | 18 | 99.60 | 91.53% | DecTr
 | MFCC | 13 | 0.388 | 94.71% | Bayes
 | MFCC | 13 | 0.652 | 94.71% | Bayes
SW | MFCC | 13 | 10.62 | 94.71% | Bayes
SW | MPEG-7 | 18 | 14.72 | 91.53% | DecTr
HMM | MPEG-7 | 18 | 84.39 | 84.13% | —
ARIMA | MPEG-7 | 18 | 25,613.0 | 70.37% | Bayes
However, the calculation times of these parameters present a significant dependence on the number of parameters. Figure
Sliding window behavior for different number of features.
In this figure, it is easy to note that the construction time shows an approximately linear behavior. Moreover, this time also has a linear dependence on the window size (as can be clearly seen in Figure
Sliding window behavior for different window size.
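A minimal sketch of this construction follows, assuming that the window statistics appended to each frame are the per-feature mean and standard deviation (the exact statistics used in the SW variant may differ); note how the cost grows linearly with both the number of features and the window size.

```matlab
% Sliding-window (SW) construction sketch: append per-feature mean and
% standard deviation over the last W frames (assumed statistics).
F  = randn(13, 1000);                 % 13 MFCC-like features x 1000 frames
W  = 10;                              % assumed window size
[nF, nT] = size(F);
Fsw = zeros(3*nF, nT);                % original + mean + std per frame
for t = 1:nT
    w0  = max(1, t - W + 1);          % clip the window at the start
    win = F(:, w0:t);
    Fsw(:, t) = [F(:, t); mean(win, 2); std(win, 0, 2)];
end
```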
In the HMM technique, for each sequence, the feature construction consists of converting the original parameter vector (
HMM behavior for different number of features.
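A sketch of this branch under stated assumptions is shown below: k-means vector quantization into a discrete alphabet followed by Baum-Welch training, using kmeans, hmmtrain, and hmmdecode from MATLAB's Statistics and Machine Learning Toolbox; the symbol and state counts are illustrative.

```matlab
% HMM branch sketch (Statistics and Machine Learning Toolbox assumed):
% quantize frame feature vectors into symbols, then train a discrete HMM.
F = randn(1000, 18);                          % frames x MPEG-7-like features
nSymbols = 32;  nStates = 3;                  % illustrative sizes
sym = kmeans(F, nSymbols);                    % vector quantization (symbols)
TRg = rand(nStates);            TRg = TRg ./ sum(TRg, 2);  % transition guess
EMg = rand(nStates, nSymbols);  EMg = EMg ./ sum(EMg, 2);  % emission guess
[TR, EM] = hmmtrain(sym', TRg, EMg);          % Baum-Welch estimation
[~, logp] = hmmdecode(sym', TR, EM);          % sequence log-likelihood
```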
Moreover, the ARIMA analysis consists of converting the original parameter matrix (
As with the other techniques, this time also depends significantly on the number of features. Figure
ARIMA behavior for different number of features.
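The steep growth can be appreciated in the following sketch, which fits one model per feature series; it assumes the Econometrics Toolbox (arima and estimate) and illustrative (p, d, q) orders, not necessarily those used in the experiments.

```matlab
% ARIMA construction sketch (Econometrics Toolbox assumed). Each feature
% series is fitted separately, so the cost grows quickly with the number
% of features. The ARIMA(2,0,1) orders are illustrative.
F = randn(500, 18);                           % frames x features
Mdl = arima(2, 0, 1);                         % assumed model orders
coefs = cell(1, size(F, 2));
for k = 1:size(F, 2)
    Est = estimate(Mdl, F(:, k), 'Display', 'off');       % fit one series
    coefs{k} = [cell2mat(Est.AR), cell2mat(Est.MA), Est.Constant];
end
newFeatures = cell2mat(coefs);                % constructed feature vector
```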
Once the feature extraction and construction processes have been analyzed, the next step is to analyze the classification procedure based on these features.
In the first stage, only extracted (or nonsequential) features will be considered for classification purposes ((4a) or the left branch approach in Figure
Decision tree classification time versus sound duration.
In this sense, Table
Time analysis of the classification stage.

Classifier | Classification time (µs) | Normalized time (% of frame) | Speed (× real time) | Accuracy
---|---|---|---|---
MinDis | 15 | 0.15% | 690 | 58.73%
MaxLik | 1,175 | 11.75% | 9 | 86.24%
DecTr | 7 | 0.07% | 1,389 | 91.53%
kNN | 207 | 2.07% | 48 | 82.01%
SVM | 27 | 0.27% | 372 | 82.01%
LogReg | 7 | 0.07% | 1,515 | 76.72%
Neur | 8 | 0.08% | 1,333 | 75.66%
Discr | 8 | 0.08% | 1,299 | 77.78%
Bayes | 7 | 0.07% | 1,449 | 80.95%
Obviously, for real-time audio processing, this relative time must be less than 100% (or, in other words, the relative speed must be greater than 1). As shown above, all the algorithms fulfill this condition, although two of them are significantly slower: maximum likelihood and
Relative classification time per frame using
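As an illustration of how such per-frame figures can be obtained, the following sketch times a decision tree over synthetic data shaped like the testbed; fitctree and predict belong to the Statistics and Machine Learning Toolbox, and the 10 ms frame duration is inferred from the normalized times above.

```matlab
% Per-frame classification timing sketch (synthetic data; Statistics and
% Machine Learning Toolbox assumed for fitctree/predict).
X = randn(13903, 18);                     % patterns x MPEG-7-like features
Y = randi(4, 13903, 1);                   % 4 classes (3 species + noise)
tree = fitctree(X, Y);                    % train a decision tree
t = tic;  predict(tree, X);  elapsed = toc(t);
timePerFrame = elapsed / size(X, 1);      % seconds per classified frame
normalizedPc = 100 * timePerFrame / 0.010;    % % of an assumed 10 ms frame
```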
However, the classification time per frame directly depends on the number of features used (or input parameters). In this sense, Figure
Normalized classification time for different number of features.
In general, it is possible to identify an upward trend in the classification time with the number of features for most algorithms. Figure
Linear regression of classification time (results of all classifiers).
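The trend line itself is an ordinary least-squares fit, as sketched below with placeholder values (not the measured data).

```matlab
% Least-squares trend of normalized classification time vs. feature count.
% The data points below are illustrative placeholders, not measurements.
nFeat = 2:2:18;                                          % feature counts
tNorm = [0.05 0.06 0.08 0.09 0.11 0.12 0.14 0.15 0.17];  % % of frame time
p = polyfit(nFeat, tNorm, 1);             % p(1) slope, p(2) intercept
trend = polyval(p, nFeat);                % fitted regression line
```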
Another issue which has to be addressed is the effect of the number of classes on the processing times. It has no influence on feature extraction and feature construction times, as these processes precede (and are independent of) the definition of classes. However, the number of classes does potentially have an influence on classification times. To explore this topic, the original dataset has been modified by introducing additional classes (anuran species or sounds) and labeling every frame with a uniformly distributed random class (silence/noise is considered a class). It has to be underlined that these are not real classes and that their only purpose is to test the impact of the number and distribution of classes on processing times. Figure
Normalized classification time for different number of classes and its linear regression (dashed red line).
Following the scheme proposed in this paper, the next step considers the sound sequential information using frame-trend features (branch (4b) of Figure
Figure
Relative classification time per frame using sliding windows.
As in the previous analysis, all studied classifiers fulfill the time constraints required to operate in real-time mode, although the slowest are, again, maximum likelihood and
Obviously, this time also depends on the number of features, which directly depends on the configuration of the construction method (e.g., the window size for SW). Figure
Normalized classification time for different number of features (extended to constructed features).
In this sense, it is easy to note that there is a globally increasing trend in the classification time as the number of features grows. This trend is clearly shown in Figure
Linear regression of classification time (results of all classifiers extended to constructed features).
To finish the study of the classification time, the last topic to be addressed is the sound segment (frame or sequence) classification ((4c) or right branch of Figure
Figure
HMM classification time for an audio segment.

Classifier | Classification time | Classification speed (× real time) | Accuracy
---|---|---|---
HMM | 12.56 ms/s | 80 | 84.13%
HMM classification time for different durations.
HMM classification time for different number of features.
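A plausible decision rule for this stage, assuming one trained HMM per class (random stand-ins below) and a maximum-likelihood choice, is sketched next.

```matlab
% Segment scoring sketch: score a symbol sequence against per-class HMMs
% and pick the class with the highest log-likelihood. The models are
% random stand-ins for the trained per-class HMMs.
nStates = 3;  nSymbols = 32;  nClasses = 4;
sym = randi(nSymbols, 1, 200);                    % quantized segment
logp = zeros(1, nClasses);
for c = 1:nClasses
    TR = rand(nStates);            TR = TR ./ sum(TR, 2);
    EM = rand(nStates, nSymbols);  EM = EM ./ sum(EM, 2);
    [~, logp(c)] = hmmdecode(sym, TR, EM);        % class log-likelihood
end
[~, predictedClass] = max(logp);                  % winning class index
```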
Conversely, the ARIMA approach (as described above) uses the same classifiers as those applied for frame classification, now usually with an increased feature set size. Therefore, the classification time is the same as that already analyzed above and reflected in Figure
Finally, the last step is the classification of the full sound file (process (5) of Figure
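One simple way to implement this final aggregation, assuming a majority vote over the per-frame (or per-segment) decisions, which may differ from the exact rule adopted in the paper, is the following.

```matlab
% Hypothetical file-level decision: majority vote over frame labels.
frameLabels = [1 1 3 1 2 1 1];      % per-frame class indices (example)
fileLabel = mode(frameLabels);      % most frequent class wins the file
```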
After studying the temporal requirements of the three proposed stages for audio fragment classification, the next study addresses the time required to obtain each classifier. Obviously, this time is less critical than those previously studied, since this stage is not properly part of the real-time classification process. However, this study is of interest in cases where the knowledge base is dynamic or where a periodic or iterative training approach is adopted. In addition, it is true that the proposed techniques (based on a supervised classification approach) may show significant deviations in the training period (depending on the training data, the number of patterns, or their content). Nevertheless, the results can be taken as a starting point for comparing classifier generation times.
Following the structure of this paper, the first analysis is focused on the classifiers for nonsequential analysis (only a single frame is considered). Figure
Classifier generation time (for the full MPEG-7 feature set).
At first glance, these generation times depend only on the number of features. However, a deeper insight into the classification process shows that they also depend on the number of patterns and even on their values. Therefore, in order to compare how the reduction of the number of features affects these times, several trainings have been performed using different feature sets (mixing all of them) as patterns. Figure
Normalized classifier generation time for different number of features.
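A sketch of this kind of measurement, training on random feature subsets of increasing size (decision tree shown, synthetic data), follows.

```matlab
% Training-time sketch for growing feature subsets (synthetic data;
% fitctree from the Statistics and Machine Learning Toolbox assumed).
X = randn(13903, 18);  Y = randi(4, 13903, 1);
sizes = 2:2:18;  T = zeros(size(sizes));
for i = 1:numel(sizes)
    idx = randperm(18, sizes(i));             % random feature subset
    t = tic;  fitctree(X(:, idx), Y);  T(i) = toc(t);
end
```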
Additionally, the number, distribution, and proportion of classes could have a certain impact on the time required to train a classifier. To explore this issue, as previously mentioned, the original dataset has been modified by introducing additional classes (anuran species or sounds) with a random distribution and proportion. Figure
Normalized classifier generation time for different number of classes.
Now, let us focus the analysis on the cases where frame sequence information is added, that is, when some features are constructed using regional dispersion,
Classifier generation time (using the full MPEG-7 feature set and SW with a windows size of 10).
Normalized classifier generation time for different number of features (extended to constructed features).
As for nonsequential classifiers, Figure
Finally, the last concern of this analysis will be the generation times for sequential classifiers, that is, HMM and ARIMA models. In the first of these approaches (HMM), Figure
HMM classifier generation time for different number of features.
On the other hand, audio classification using ARIMA models uses the same classifiers previously considered, although with an increased feature set dimension. In this sense, the generation times of these classifiers show the same results as those analyzed above (Figure
Throughout this paper, an animal sound classification scheme for WASN has been proposed. This scheme offers different alternatives to achieve this goal, always taking into account the power consumption limitations of these kinds of platforms. In this sense, the paper is completed with a detailed comparative time study of each proposed algorithm within the scheme, making it possible to find a tradeoff between the classification accuracy and the required processing time.
From this analysis, several conclusions can be highlighted. For example, MPEG-7 feature extraction requires an important relative computational load (around 30% of the audio fragment time). Conversely, this load falls to 0.5% for the MFCC extraction time, considerably reducing the computational load. Additionally, it is easy to note that most feature construction techniques (either adding frame-trend or sequential information) require a low processing cost, ranging approximately between 1% of the frame time for regional dispersion or HMM and 0.1% for sliding windows. Conversely, ARIMA models significantly exceed this limit, with times that grow exponentially with the number of features. For the first classification stage, it is also easy to note that the classification time depends remarkably on the type of classifier and the number of parameters (as can be seen in the different comparisons). However, these requirements are also typically low (between 0.1% and 1% of the frame duration). Only in two of them (maximum likelihood and
From an implementation point of view, a first result indicates that the proposed prototype for anuran song classification is able to operate in real time, with all alternatives taking less than the audio duration. Nevertheless, some concerns have to be taken into account when this algorithm is deployed in a WASN node (typically with fewer resources). In this sense, these potential node limitations could be easily compensated for with the Digital Signal Processing (DSP) resources commonly available in modern platforms for this purpose (e.g., ARM® Cortex®-M4 processors), which would greatly reduce feature extraction times (one of the most costly phases in the MPEG-7 approach). Additionally, a reduction in the sample rate could also occasionally be possible if necessary.
The authors declare that they have no conflicts of interest.
This work has been supported by the