A Novel Multichannel Dilated Convolution Neural Network for Human Activity Recognition

A novel multichannel dilated convolution neural network is proposed to improve the accuracy of human activity recognition (HAR). The model uses a multichannel convolution structure with kernels of various sizes to extract multiscale features from high-dimensional human activity data, and it replaces the pooling layers of the traditional convolutional network with dilated convolution. The advantage is that dilated convolution captures intrinsic sequence information by expanding the receptive field of the convolution kernel without increasing the number of model parameters, while the multichannel structure extracts multiscale gait features through multiple convolution paths. An open human activity recognition dataset is used to evaluate the effectiveness of the proposed model. The experimental results show that the model achieves an accuracy of 95.49%, with the time to identify a single sample being approximately 0.34 ms on a low-end machine. These results demonstrate that the model is an efficient real-time HAR model that obtains representative features from sensor signals at low computational cost and is a promising tool for practical applications.


Introduction
Human activity recognition (HAR) is a typical multiclassification problem, which acquires and analyzes human activity-related data to identify human activity status [1,2]. It plays an essential role in people's daily life and is widely used in the fields of safety, medical care, smart home, and entertainment. Specific applications include smart home [3,4], gait analysis [5,6], security certification [7,8], health monitoring [9,10], athlete monitoring [11], and gesture recognition [12,13]. There are two main methods of human activity recognition: vision-based and sensor-based. Although the vision-based method has been extensively studied and can achieve a high recognition rate, it is limited by the high acquisition cost of the imaging device, and collecting image data is sometimes a challenge, so it is hard to meet the needs of real-life environments. With the development of smartphones and wearable sensor technologies, smart devices with built-in sensors offer low cost, portability, and good real-time performance. Therefore, HAR based on sensor signals has become the focus of research in this field.
HAR based on sensor signals includes two methods: the traditional method and the deep learning method. The traditional method needs complex preprocessing of the raw data and relies on manual experience to extract the required time-domain features [14][15][16], frequency-domain features [16][17][18][19], and other features [20,21]. These handcrafted features are shallow features, which inevitably lose some implicit key features. Deep learning methods can make up for the shortcomings of traditional methods: by learning a deep nonlinear network structure, they automatically extract the more discriminative inherent features contained in the data.
Deep learning is well known as a revolution in machine learning, especially in the fields of computer vision [22,23] and natural language processing [24,25]. In recent years, different deep learning methods have been proposed for human activity recognition based on sensor signals, including autoencoders [26], fully connected deep neural networks (DNN) [27,28], recurrent neural networks (RNN), convolutional neural networks (CNN), and hybrid deep learning models. RNN, CNN, and hybrid models are the most widely studied in HAR, and we introduce them in detail in Section 2. RNN, especially Long Short-Term Memory (LSTM), can capture the dynamic time dependence of various motions and helps to explore the pattern features [2]. However, LSTM takes a longer training time because numerous parameters need to be updated during the training process. Compared with RNN, CNN is better able to learn the crucial features contained in recursive patterns [1,29]. However, most CNNs use a single parameter setting in the convolution process, which dramatically limits the flexibility of the model. Besides, a larger convolution kernel can help to capture more information but increases the computation cost of the CNN. Dilated convolution may be an effective solution, since it dilates the receptive field of the convolution kernel without increasing the number of kernel parameters [23].
Real-time HAR is also a research hotspot in the field, and several methods have been proposed to address it [30,31]. The shortcoming of these works is that it is difficult to maintain the balance between activity recognition accuracy and running time. These challenges motivate the development of efficient recognition methods that combine high recognition accuracy with low computational complexity.
Based on current research deficiencies, this paper proposes a novel multichannel dilated convolution neural network (MDCNN). The model can get a larger receptive field to extract global features of long time series from the raw sensor data by using dilated convolution rather than the traditional convolution structure. Moreover, the proposed model uses multichannel block convolution operations with different kernel sizes to obtain combined multiscale features.
Experimental comparison shows that the proposed model effectively improves recognition accuracy and achieves real-time HAR. The rest of this paper is organized as follows: Section 2 provides related work concerning different deep learning methods for HAR. Section 3 describes the fundamentals of CNN, dilated convolution, and multichannel convolution. The framework and training process of the proposed model are introduced in Section 4. Section 5 conducts a series of experiments with the proposed model and discusses the results, while Section 6 gives the conclusion and presents our future work.

Related Work
In recent years, various deep learning methods have been proposed for sensor-based HAR. RNN can retain memory and learn sequence data to capture the inherent relationships of time-series data. Chen et al. proposed an LSTM-based method that uses three-axis accelerometer data from the public WISDM lab dataset to identify human activities with an accuracy of 92.1% [32]. Guan and Plötz developed ensembles of deep LSTM, which combine sets of diverse LSTM learners into classifier collectives [33]. The experimental results on three standard benchmarks (Opportunity, PAMAP2, and Skoda) demonstrate that ensembles of deep LSTM outperform individual LSTM networks. However, the deep LSTM takes a longer training time because numerous parameters need to be updated during the training process. To speed up learning in early training, Zhao et al. proposed an improved LSTM model, Res-Bidir-LSTM, which also guarantees the validity of information transmission through residual connections and bidirectional cells [34]. The results show that Res-Bidir-LSTM improves performance by around 4% on the public domain UCI dataset and the Opportunity dataset in comparison with previous work. Hammerla et al. explored three types of deep learning models (deep feed-forward networks (DNN), CNN, and RNN) on three lab benchmark datasets [35]. The results show that CNN performs better than the other models on prolonged activities such as walking and running.
There are two advantages to CNN: local dependence and scale invariance [1,36]. Local dependence means that the signal at the current time may be related to the signal around this point, and scale invariance means that a pattern remains recognizable when its amplitude or frequency changes [37]. Zeng et al. proposed an original CNN model for accelerometer data, in which each axis of acceleration is input to a separate convolution layer and pooling layer to extract features [36]. However, because the model considers only the acceleration data and its structure is too simple, it struggles to extract crucial features.
Ronao and Cho constructed a three-layer CNN that automatically extracts robust features from the raw data to raise the accuracy of HAR, and evaluated it on the UCI dataset and the WISDM dataset [38]. They further improved the performance of their model by using additional information from the fast Fourier transform (FFT) of the raw data. However, both the convolution and the pooling operations of the method are performed in a single channel; this single parameter setting dramatically limits the flexibility of the parameters, so the network is unable to extract efficient global and local features at multiple scales. Mohammad et al. presented multiple CNN pipelines with a structure of late fusion and bypassing connections [39]. This model can comfortably accommodate multiple sensors and signal representations such as time-domain data, FFT information, and spectrograms, achieving higher performance on six publicly available datasets. However, it is computationally expensive compared with earlier methods because it employs bypassing connections from all layers [39].
Besides single models, hybrid deep learning models that combine CNN and RNN have also been proposed in a few works. Ordóñez and Roggen proposed a generic deep framework for activity recognition based on convolutional and LSTM recurrent units [40], where CNN acts as a feature extractor and LSTM models the temporal dynamics of the extracted feature maps. However, this complex network framework suffers from low efficiency and can hardly meet real-time requirements in practical applications.
Most of the current research is carried out offline, although some works have realized HAR in real time. Inoue et al. proposed a deep recurrent neural network (DRNN) for HAR with a high recognition rate and a high throughput [30]. However, despite reducing the training time by parallel processing on the GPU, this network was still very large [41]. Cao et al. proposed a Group-based Context-aware human activity recognition (GCHAR) classification method to achieve HAR in real time, which used a hierarchical group-based classification scheme and context awareness to enhance the classification performance [31]. The results show that its training time and testing time are shorter than those of the other algorithms compared. Its classification accuracy is 94.16%, which is slightly lower than that of the deep learning algorithms. Therefore, the core of the current work is to achieve a model with high recognition accuracy and low computational complexity.

Convolutional Neural Network for HAR.
CNN is a multilayered deep network structure consisting of the input layer, the convolutional layer, the pooling layer, the fully connected layer, and the output layer. Among these layers, the alternating convolutional and pooling layers constitute the most prominent structure. Various studies in the field of computer vision have shown that a multilayer CNN structure consisting of convolutional and pooling layers can extract image features at different levels. The bottom layers of the CNN generally learn basic features such as local textures and lines of the image. As the network deepens, the model learns more and more complex features, and its recognition ability rises from identifying object contours to identifying the entire image.
In CNN, the convolution and pooling operations are performed in sequence: the output of the convolution operation is used as the input of the pooling operation, the output of the pooling layer is then used as the input of the next convolution layer, and so on, until the result is finally sent into the Softmax layer.
Considering that the sensor data belongs to a one-dimensional time series, the input of the proposed multichannel dilated convolution network is one-dimensional time-series data, so its convolution kernel adopts a one-dimensional structure. The output of each convolutional layer and pooling layer is correspondingly a one-dimensional feature vector, where the accelerometer and gyroscope time-series data inputs are expressed as

$$x = [x_1, x_2, \ldots, x_N] \tag{1}$$

where $N$ denotes the length of the time window.
In the convolutional layer, CNN uses the convolution kernel to process the input data. Each convolutional layer is connected to the data in the local receptive field of the previous layer to extract local features in that receptive field. Each particular convolution kernel extracts a distinct feature. The data obtained after the convolution operation of one convolution kernel is a feature map, so multiple feature maps, and hence multiple features, are obtained through multiple convolution kernels. The output of the convolutional layer is

$$z_i^{j} = \sigma\left(b_j + \sum_{a=1}^{l} w_a^{j}\, x_{i+a-1}\right) \tag{2}$$

where $z_i^{j}$ is the $i$-th output of the $j$-th feature map, $\sigma$ is the activation function, $b_j$ is the bias term for the $j$-th feature map, $l$ is the kernel size, and $w^{j}$ is the weight for feature map $j$. The Rectified Linear Unit (ReLU) function [42] is widely used as the activation function in deep learning because its nonlinear transformation improves the performance of a deep neural network.
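As an illustration of the convolutional layer described above, the following minimal sketch computes one feature map with a ReLU activation in plain Python. The signal values, the kernel weights, and the function names `relu` and `conv1d_feature_map` are hypothetical and chosen only for readability; padding and stride settings are assumptions (no padding, stride 1), since the text does not specify them.

```python
def relu(v):
    """Rectified Linear Unit: returns v if positive, otherwise 0."""
    return v if v > 0.0 else 0.0


def conv1d_feature_map(x, w, b):
    """Compute one feature map: z[i] = relu(b + sum_a w[a] * x[i + a]).

    x: input time series, w: kernel weights (length l), b: scalar bias.
    Assumes no padding and stride 1, so the output has
    len(x) - len(w) + 1 entries. Indices are zero-based.
    """
    l = len(w)
    return [
        relu(b + sum(w[a] * x[i + a] for a in range(l)))
        for i in range(len(x) - l + 1)
    ]


x = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]  # toy one-axis signal (hypothetical values)
w = [1.0, 0.0, -1.0]                 # toy kernel of size l = 3 (hypothetical)
feature_map = conv1d_feature_map(x, w, b=0.0)
print(feature_map)
```

Using several kernels in place of the single `w` would yield several feature maps, one per kernel, which is how a convolutional layer extracts multiple features from the same input.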
In the pooling layer, CNN aggregates the local features of a particular region to obtain scale-invariant features. The pooling operation reduces the dimension of the data being processed and the computational cost while extracting useful information.
The pooling operation used in this paper, max pooling, outputs the maximum value among a set of nearby inputs, given by

$$p_i^{j} = \max_{r \in R}\, z_{i \times T + r}^{j} \tag{3}$$

where $z^{j}$ is the $j$-th feature map produced by the convolutional layer, $R$ is the pooling size, and $T$ is the pooling stride. With the stacking of convolutional layers and pooling layers, this sparse connection method can significantly reduce the number of parameters while extracting the deep features of the input data layer by layer.

The obtained multichannel feature map information is first converted into a one-dimensional vector and then input into the Softmax layer. The converted one-dimensional vector has the form $p = [p_1, \ldots, p_I]$, where $I$ is the number of units in the last pooling layer. The number of Softmax layer neurons is consistent with the number of activity categories. The Softmax layer gets the probability distribution over the activity types, and the activity identified by the model is the one with the highest probability. The process can be expressed as

$$P(c \mid p) = \frac{\exp\left(w_c^{\top} p + b_c\right)}{\sum_{k=1}^{N_c} \exp\left(w_k^{\top} p + b_k\right)} \tag{4}$$

where $c$ is the activity class, $N_c$ is the total number of activity classes, and $w_c$ and $b_c$ denote the Softmax-layer weights and bias for class $c$. Forward propagation is performed through the above process, which gives the error values of the network.

Batch Normalization (BN) has been proposed to improve the performance of CNN [43]. The BN layer improves the data distribution during training and speeds up the training of the model. The BN layer also improves the generalization ability of the network, which helps to avoid overfitting and vanishing gradients during training [44]. Define the inputs of a hidden layer of the network as $\mu_1, \ldots, \mu_m$, where $m$ is the number of samples in the batch. First, the mean value $E(\mu)$ and variance $D(\mu)$ are computed by

$$E(\mu) = \frac{1}{m} \sum_{i=1}^{m} \mu_i \tag{5}$$

$$D(\mu) = \frac{1}{m} \sum_{i=1}^{m} \left(\mu_i - E(\mu)\right)^2 \tag{6}$$

Mathematical Problems in Engineering
Then, each dimension is normalized to $\hat{\mu}_i$, whose distribution has an expected value of 0 and a variance of 1:

$$\hat{\mu}_i = \frac{\mu_i - E(\mu)}{\sqrt{D(\mu) + \varepsilon}} \tag{7}$$

where $\varepsilon$ is a positive number close to zero. Finally, a pair of parameters $\gamma$ and $\beta$ is introduced to reconstruct and transform the data; the output $y_i$ of the BN layer is as follows:

$$y_i = \gamma \hat{\mu}_i + \beta \tag{8}$$

Parameters $\gamma$ and $\beta$ are learned along with the original model parameters.
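The batch normalization steps described above (batch mean, batch variance, normalization, then a learned scale and shift) can be sketched in a few lines of plain Python. This is a training-time illustration only; the function name `batch_norm`, the default values of the scale `gamma`, the shift `beta`, and the constant `eps`, and the toy activations are assumptions made for the example.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalar activations, then scale and shift.

    batch: list of m activations for one feature dimension.
    gamma, beta: scale and shift (learned during training in practice;
    fixed here for illustration). eps: small positive constant that
    keeps the division numerically stable.
    """
    m = len(batch)
    mean = sum(batch) / m
    var = sum((u - mean) ** 2 for u in batch) / m
    normalized = [(u - mean) / math.sqrt(var + eps) for u in batch]
    return [gamma * h + beta for h in normalized]


activations = [1.0, 2.0, 3.0, 4.0]  # toy batch (hypothetical values)
out = batch_norm(activations)
out_mean = sum(out) / len(out)
out_var = sum((v - out_mean) ** 2 for v in out) / len(out)
print(out, out_mean, out_var)
```

With `gamma = 1` and `beta = 0`, the output has a mean of approximately 0 and a variance of approximately 1, which is the normalized distribution the text refers to.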
In general, CNN derives robust features from the raw data that are invariant to translation, rotation, and scale. This results from the convolution operations of the multiple-kernel network structure, which extract the features contained in the data, and the extracted features become more abstract as the number of network layers increases. Also, owing to sparse connections and weight sharing, CNN reduces the number of parameters in model training and helps to avoid overfitting [37].
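To make the pooling and Softmax steps of this section concrete, the sketch below applies max pooling with a given pool size and stride, and converts a vector of class scores into a probability distribution. The function names, the toy feature map, and the toy scores are hypothetical; the scores stand in for the output of the last layer before Softmax.

```python
import math


def max_pool1d(z, pool_size, stride):
    """Return the maximum of each window of pool_size values, moving by stride."""
    return [
        max(z[i:i + pool_size])
        for i in range(0, len(z) - pool_size + 1, stride)
    ]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]  # toy values (hypothetical)
pooled = max_pool1d(feature_map, pool_size=2, stride=2)

scores = [2.0, 1.0, 0.1]                      # toy class scores (hypothetical)
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(pooled, probs, predicted)
```

The predicted activity is the index of the largest probability, matching the rule that the recognized activity is the class with the highest Softmax output.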

Dilated Convolution.
In the traditional CNN, the pooling operation gives the convolution kernel a larger receptive field, but it is not a strictly necessary component of CNN [40]. Meanwhile, excessive pooling operations tend to result in a large amount of information loss [23]. Dilated convolution can expand the receptive field without pooling, allowing each convolution output to contain a wide range of information, and has been applied to problems that require longer sequence dependencies, such as speech and text. The inertial sensor signal is a typical time series, so we apply dilated convolution to the human activity recognition model in this paper. The principle of dilated convolution is to insert fixed zeros, which are not adjusted during the learning process, between the elements of the original convolution kernel. This dilates the receptive field of the convolution kernel without increasing the number of kernel parameters [23]. The dilated convolution operation is a variant of the traditional convolution operation. If we denote $d$ as the dilation factor, the one-dimensional dilated convolution is

$$z[i] = \sum_{k=1}^{l} x[i + d \cdot k]\, w[k] \tag{9}$$

where $x[i]$ and $z[i]$ denote the input signal and output signal, respectively; $w[k]$ denotes the $k$-th kernel weight; $l$ denotes the size of the convolution kernel; and $d$ denotes the dilation factor. One-dimensional dilated convolution is achieved by inserting zeros between the elements of the convolution kernel. For a $1 \times k$ convolution kernel with dilation factor $d$, the size of the dilated kernel, $k_d$, can be defined as

$$k_d = k + (k - 1)(d - 1) \tag{10}$$

The convolution kernel transformed by the dilation factor $d = 3$ is shown in Figure 1.
As can be seen from Figure 1, the 1 × 3 convolution kernel becomes a 1 × 7 dilated convolution kernel after the dilation operation with the dilation factor d = 3.
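The kernel expansion described above can be checked with a short sketch. It assumes the standard relation between kernel size, dilation factor, and dilated kernel size, namely k + (k - 1)(d - 1), which reproduces the 1 × 3 to 1 × 7 expansion for d = 3. The function names and the toy input are hypothetical, and indices are zero-based, unlike the one-based notation of the text.

```python
def dilate_kernel(w, d):
    """Insert d - 1 fixed zeros between neighbouring kernel weights."""
    expanded = []
    for idx, weight in enumerate(w):
        expanded.append(weight)
        if idx < len(w) - 1:
            expanded.extend([0.0] * (d - 1))
    return expanded


def dilated_conv1d(x, w, d):
    """Dilated convolution z[i] = sum_k x[i + d*k] * w[k], without padding."""
    span = (len(w) - 1) * d + 1  # number of inputs covered by one output
    return [
        sum(x[i + d * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span + 1)
    ]


w = [1.0, 2.0, 3.0]                # toy 1 x 3 kernel (hypothetical weights)
w_dilated = dilate_kernel(w, d=3)  # seven entries, only three are trainable
x = [float(v) for v in range(10)]  # toy input signal (hypothetical values)

direct = dilated_conv1d(x, w, d=3)            # dilated convolution, 3 weights
via_zeros = dilated_conv1d(x, w_dilated, d=1) # ordinary convolution, 1 x 7 kernel
print(w_dilated, direct, via_zeros)
```

Both computations give the same output, which illustrates why the dilated kernel covers seven inputs while still holding only three trainable weights.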
The function of the convolution kernel is to identify certain features in the time series of the sensor. When a segment of the time series matches the feature that the convolution kernel identifies, then according to (9) the segment produces a larger value z in the new feature map, and the feature of the time series is thereby recognized. Figure 1 reveals the change in the receptive field of the convolution kernel after the addition of the dilated convolution. Figure 2 shows an example of dilated convolution with a three-layer convolution structure. In the third convolutional layer, the traditional CNN can only capture three inputs before and after the current point of the sensor time series. Under the same conditions, the dilated CNN can capture seven inputs before and after it. Also, the dilated CNN has the same number of parameters as the traditional CNN.
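The three-versus-seven comparison above can be reproduced with a small receptive-field calculation. The sketch assumes kernels of size 3 and, for the dilated network, dilation factors of 1, 2, and 4 in the three layers; these values are an assumption chosen because they reproduce the numbers quoted in the text, and they are not read from Figure 2. The function name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


KERNEL_SIZE = 3                       # assumed kernel size
standard = receptive_field(KERNEL_SIZE, [1, 1, 1])  # three ordinary layers
dilated = receptive_field(KERNEL_SIZE, [1, 2, 4])   # assumed dilation factors

# Inputs seen on each side of the centre position.
standard_each_side = (standard - 1) // 2
dilated_each_side = (dilated - 1) // 2

# Trainable weights per filter are kernel_size per layer in both cases.
weights_per_filter = KERNEL_SIZE * 3
print(standard, dilated, standard_each_side, dilated_each_side, weights_per_filter)
```

Under these assumptions the ordinary stack sees 7 inputs (3 on each side) and the dilated stack sees 15 inputs (7 on each side), with the same number of trainable weights.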
Because the resolution of the feature map is not reduced by a pooling layer, the dilated convolution can learn deeper essential features, thus effectively avoiding severe loss of local detail information in the sensor data. Furthermore, the convolution layers use different dilation factors to obtain convolution kernel receptive fields of various sizes and then extract multiscale activity features.

Multichannel Block Convolution Network Structure.
Although traditional CNNs use filters to capture different features of an instance [25], they perform convolution operations in a single channel, which greatly limits the flexibility of parameter settings and prevents effective extraction of global and local features on multiple scales. To enhance the robustness of the model, CNN can adopt group convolution, that is, a multichannel structure in which each channel uses a different convolution kernel size and thus extracts features at a different scale from the original sensor time series. Therefore, it can be seen as a fusion method for multiscale features. Figure 3 is the diagram of multichannel convolution.
In multichannel CNN, the convolution operations are grouped into multiple branches and carried out separately, and then the fully connected layer concatenates the feature maps of the branches. With different kernels, the features learned by large convolution kernels are more global, while the features learned by small convolution kernels better reflect local characteristics.
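A minimal sketch of the multichannel idea follows: the same input is convolved in separate branches with kernels of different sizes, and the resulting feature maps are concatenated into one feature vector. The kernel sizes (2, 4, and 6), the all-ones kernel weights, the toy input, and the function names are hypothetical and serve only to show the data flow; activations, multiple filters per branch, and dilation are omitted.

```python
def conv1d_valid(x, w):
    """Plain 1D convolution without padding, stride 1."""
    l = len(w)
    return [
        sum(w[a] * x[i + a] for a in range(l))
        for i in range(len(x) - l + 1)
    ]


def multichannel_features(x, kernels):
    """Run one convolution branch per kernel and concatenate the outputs."""
    branch_outputs = [conv1d_valid(x, w) for w in kernels]
    concatenated = [value for branch in branch_outputs for value in branch]
    return branch_outputs, concatenated


x = [float(v) for v in range(8)]              # toy signal (hypothetical)
kernels = [[1.0] * 2, [1.0] * 4, [1.0] * 6]   # small, medium, large kernels
branches, features = multichannel_features(x, kernels)
print([len(b) for b in branches], len(features))
```

The small kernel responds to short local patterns, while the large kernel summarizes a longer stretch of the signal; concatenation lets the classifier use both views at once.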

Model
Overview. The MDCNN model proposed in this paper is shown in Figure 4. The whole model comprises two parts: feature extraction and classification. The feature extraction part is composed of three dilated convolution channels, a Flatten layer, and a Concat layer, wherein the dilated convolution channels are the core of MDCNN. Firstly, the sensor data is sent to the dilated convolution channels to extract features of different scales; the three dilated convolution channels are independent of each other. Then, the Flatten layer flattens the dimensions other than the time dimension into a one-dimensional feature vector and sends it to the Concat layer, which concatenates the one-dimensional feature vectors of the dilated convolution channels. Finally, in the classification part, the Softmax layer calculates the probability distribution over the activity types from the features transmitted by the Concat layer, and the activity identified by our model is the one with the highest probability. Dilated convolution channel 1 is composed of three dilated convolution layers. The model first extracts features with a sequentially increasing receptive field through dilated convolution layers with dilation factors of 2, 3, and 4, respectively. A BN layer is connected to each of the convolution layers before activation in order to increase the rate of network learning and reduce the risk of overfitting. Finally, the previously obtained feature is flattened into the fully connected layer. The structure of dilated convolution channels 2 and 3 is similar to that of channel 1, and their convolution kernel sizes are 1 × 4 and 1 × 7, respectively.

Model Training of MDCNN.
The multichannel dilated convolution model discards the pooling layer of the traditional CNN, avoiding the reduction in feature map resolution caused by the pooling operation. The proposed model introduces the dilated convolution kernel to increase the receptive field of the convolution kernel and capture long-sequence information in the sensor time series, and the multichannel structure is able to extract multiscale features. The training and optimization of CNN depend on the loss function. The loss function calculates the error between the predicted value and the true value; the backpropagation algorithm propagates the error from the last layer to each layer of the network and updates the weights. The updated parameters continue to participate in the training, and this loop repeats until the loss function value reaches its minimum; that is, the goal of the training is reached. In this paper, the CNN model training uses a cross-entropy loss function, computed by

$$E_0 = -\frac{1}{M} \sum_{m=1}^{M} \sum_{k=1}^{N_c} y_{m,k} \log q_{m,k} \tag{11}$$

where $x_m$ is the $m$-th training sensor sample, $q_{m,k}$ is the predicted probability of class $k$ for sample $x_m$, $y_{m,k}$ is the $k$-th component of the one-hot vector that represents the label of the $m$-th sample, $M$ is the total number of samples, and $N_c$ is the total number of label classes. Large weights can cause the weight vector to get stuck in a local minimum easily since gradient descent only makes small changes to the direction of optimization.
This will eventually make it hard to explore the weight space [38]. $L_2$ regularization is a regularization method that adds an extra term to the cost function to penalize large weights. For each set of weights, the penalizing term is added to the loss function:

$$\mathrm{LOSS} = E_0 + \lambda \sum_{\theta} \theta^{2} \tag{12}$$

where $E_0$ is the loss function without $L_2$ regularization, $\lambda$ is the regularization coefficient, and $\theta$ ranges over the weights of the model. In summary, the standardized training set is input to MDCNN, and the model parameters are trained to obtain the recognition model.
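The training objective described above, cross-entropy plus an L2 penalty on the weights, can be sketched as follows. The averaging over samples and the exact form of the penalty (the regularization coefficient times the sum of squared weights) are assumptions made for this example, as are the function names, the toy labels, the toy predictions, and the toy weights.

```python
import math


def cross_entropy(y_true, q_pred):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    total = 0.0
    for y, q in zip(y_true, q_pred):
        # Only the component of the true class contributes for one-hot labels.
        total -= sum(y_k * math.log(q_k) for y_k, q_k in zip(y, q) if y_k > 0)
    return total / len(y_true)


def l2_regularized_loss(y_true, q_pred, weights, lam):
    """Cross-entropy plus lam times the sum of squared weights (assumed form)."""
    penalty = lam * sum(w * w for w in weights)
    return cross_entropy(y_true, q_pred) + penalty


y_true = [[1, 0], [0, 1]]          # toy one-hot labels (hypothetical)
q_pred = [[0.5, 0.5], [0.5, 0.5]]  # toy predicted probabilities (hypothetical)
weights = [1.0, 2.0]               # toy model weights (hypothetical)

base = cross_entropy(y_true, q_pred)
total = l2_regularized_loss(y_true, q_pred, weights, lam=0.01)
print(base, total)
```

For the uninformative predictions in this toy case the cross-entropy equals ln 2, and the penalty adds 0.01 × (1 + 4) = 0.05, so larger weights increase the loss even when the predictions are unchanged.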

Experiment Dataset.
We used the smartphone dataset (HAR dataset) [45] from the UCI Machine Learning Repository in our experiments. The dataset contains a total of 10,299 samples collected in the lab from 30 subjects between the ages of 19 and 48. The dataset includes six activities: walking, going upstairs, going downstairs, sitting, standing, and lying down. Each subject carried a smartphone to record motion data, and the recorded data are accelerometer data and gyroscope data sampled at 50 Hz. The accelerometer data are separated into total acceleration and body acceleration data, and all data are then preprocessed using a noise filter and finally split into 128 × 9 data windows with 50% overlap between consecutive windows. The dataset also offers 561 time-domain and frequency-domain features, but we do not use these features in our experiments. Figure 5 is a schematic diagram showing the structure of the 128 × 9 sensor data used in the experiment. The dataset is divided into a training set and a test set in a 7 : 3 ratio for the experiment. Table 1 describes the composition of the human activity dataset.
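The windowing described above (windows of 128 samples with 50% overlap at 50 Hz) can be illustrated with a short sketch. The dataset is distributed already segmented, so this code only shows the arithmetic of the segmentation; the function name `sliding_windows` and the toy signal of 320 samples are hypothetical.

```python
SAMPLING_RATE_HZ = 50  # sensor sampling frequency stated for the dataset


def sliding_windows(samples, window=128, overlap=0.5):
    """Split a sequence into fixed-length windows with fractional overlap."""
    step = int(window * (1 - overlap))  # 64 samples for 50% overlap
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, step)
    ]


signal = list(range(320))  # toy single-channel recording (hypothetical)
windows = sliding_windows(signal)
window_seconds = 128 / SAMPLING_RATE_HZ  # duration covered by one window
print(len(windows), window_seconds)
```

Each window therefore covers 2.56 s of motion, and consecutive windows start 64 samples (1.28 s) apart. In the real data every window holds nine such channels, giving the 128 × 9 input.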

Experiment Result.
The experimental environment of this article is a laptop with an Intel i5-8250U CPU and 8 GB of RAM. The programming language is Python 3.7, and the framework is Keras with the TensorFlow backend. To make the experimental process more efficient, the sample data was fed to the model in batches with a batch size of 32. The model used the Adam update rule to optimize the training parameters and minimize the loss, with the maximum number of training iterations set to 150. The learning rate was set to 0.0015. We trained the model, evaluated it on the test set, and obtained the classification confusion matrix in Table 2.
As can be seen from Table 2, the proposed model achieved excellent recognition results: the accuracy is 95.49%, and the precisions of walking and lying down are over 98.5%. The proposed model has a slightly lower F1 score in distinguishing between sitting and standing, mainly because both are static states. The amplitude of the signal collected by the sensor at rest is so low that the model cannot extract enough information from the sensor data to distinguish between the two types adequately. It may also be that CNN has some weaknesses in identifying static activities. The next step is to improve the model further to raise the recognition accuracy for static activities.
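For readers who wish to recompute accuracy, precision, recall, and F1 score from a confusion matrix such as Table 2, the following sketch shows the calculation. The 3 × 3 matrix below is a made-up toy example and does not reproduce the values of Table 2; rows are assumed to be true classes and columns predicted classes, and the function names are hypothetical.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def per_class_metrics(cm):
    """Return (precision, recall, f1) per class; rows = true, cols = predicted."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Toy confusion matrix (hypothetical, NOT the paper's Table 2):
# class 0 is never confused; classes 1 and 2 are confused with each other.
toy_cm = [
    [50, 0, 0],
    [0, 40, 10],
    [0, 5, 45],
]
acc = accuracy(toy_cm)
metrics = per_class_metrics(toy_cm)
print(acc, metrics)
```

In the toy matrix, the two classes that are confused with each other end up with lower F1 scores than the unambiguous class, which mirrors the sitting versus standing behaviour discussed above.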
We compare the accuracy of MDCNN with that of other algorithms in the literature; the results are shown in Table 3. Firstly, compared with traditional methods (SVM and GCHAR), our model shows a significant improvement. Traditional methods rely heavily on handcrafted features, which are shallow features that inevitably lose some implicit key features. Secondly, we compare neural network models (LSTM, CNN, and DRNN). Among these networks, CNN performs better than RNN or LSTM. CNN has advantages in feature extraction: the convolution kernels extract abstract high-level gait features layer by layer, which play a decisive role in the final classification. Compared with RNN, CNN is better able to learn the crucial features contained in recursive patterns in complex cyclic processes such as gait [35].
It can be seen from Table 3 that the proposed model achieves the highest recognition accuracy except for the CNN in [38] and the multiple CNN in [39]. These two CNN models incorporate frequency-domain features. The frequency-domain features seem to provide global information that is difficult to obtain in the automatic feature extraction process of CNN, which pays more attention to local features than to global features. With the limited kernel length of the traditional CNN, it is difficult to extract global information. After the dilated convolution structure is added to the convolution layer, the effective length of the convolution kernel increases, and the widened receptive field can capture longer context information.
The experimental results show that our model improves on the ordinary CNN model that does not rely on manual features. How to extract more global features with our model will be our future work. A recognition model also needs to consider the computation cost. The CNN in [38] and the multiple CNN in [39] have complex network frameworks, which incur expensive computational costs and can hardly meet real-time requirements in practical applications. Besides, both of them used FFT features, while the multiple CNN additionally used the spectrogram. This additional feature extraction consumes considerable time, which is a further obstacle to real-time computation. In contrast, the proposed model achieves nearly the same performance using only raw sensor data without any manual features. MDCNN implements real-time HAR, which is difficult for these two complex models. Also, its training time and testing time are superior to those of other real-time deep learning models: the training time per epoch is about 6.01 s on a laptop CPU, while DRNN took 116.39 s per epoch in a GPU environment. The complete training process of our model takes about 15 minutes, which is not negligible. However, the training process only needs to be run once in a practical application, and a device loaded with the pretrained model can identify measured data in real time. In our experiment, MDCNN completed the identification of all samples within 1 s 323 μs; that is, the time to identify each sample is 0.34 ms. Because the sensor acquires data at 50 Hz, our model is fast enough to achieve real-time HAR. This efficiency arises because CNN parallelizes well. Furthermore, the dilated convolution achieves a more efficient convolution operation under the same computational complexity.
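The real-time claim above can be checked with simple arithmetic using the figures quoted in this paper. The sketch assumes that a new window becomes available every 64 samples (windows of 128 samples with 50% overlap at 50 Hz) and that the 150 training iterations correspond to 150 epochs; both are interpretations, and the constant names are hypothetical.

```python
SAMPLING_RATE_HZ = 50      # sensor sampling frequency
WINDOW = 128               # samples per window
OVERLAP = 0.5              # fraction shared by consecutive windows
INFERENCE_MS = 0.34        # reported time to identify one sample
EPOCHS = 150               # assumed: training iterations counted as epochs
SECONDS_PER_EPOCH = 6.01   # reported training time per epoch on the CPU

# Time between the arrival of two consecutive windows, in seconds.
hop_seconds = WINDOW * (1 - OVERLAP) / SAMPLING_RATE_HZ

# How many times faster inference is than the arrival of new windows.
headroom = hop_seconds * 1000 / INFERENCE_MS

# Total training time implied by the per-epoch figure, in minutes.
training_minutes = EPOCHS * SECONDS_PER_EPOCH / 60

print(hop_seconds, headroom, training_minutes)
```

Under these assumptions a new window arrives every 1.28 s while classification takes 0.34 ms, leaving a margin of more than three orders of magnitude, and 150 epochs at 6.01 s each give roughly 15 minutes of training, consistent with the figure stated above.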
In general, the proposed model achieves real-time HAR with high recognition accuracy and low computational complexity. The model can automatically and efficiently mine the deep and highly discriminative essential features embedded in the data. More importantly, MDCNN expands the receptive field by introducing dilated convolution without increasing the number of parameters, so that the model can mine the timing dependency information in long sequences to some extent, which makes up for the defects of the traditional CNN in time-series problems.

Figure 5: Diagram of the timing structure of the human activity identification dataset (panels: total acceleration, body acceleration, and body gyroscope along the X-, Y-, and Z-axes).

Firstly, we designed an experiment to verify whether the pooling layer is necessary for the proposed model. This experiment compared the accuracy of the proposed model with that of models containing a pooling layer. The pool sizes of the two models with a pooling layer are 2 and 3, respectively. In both sets of experiments, the pooling layer was placed after the last convolution layer. The results are shown in Table 4. It can be seen from Table 4 that the proposed model achieves higher accuracy than the two models with pooling layers. Also, the accuracy of the model with the larger pool size is lower than that of the model with the smaller pool size. This is because the pooling layer reduces the amount of computation at the cost of reducing the resolution, which loses some of the information useful for classification. As the pool size increases, more information is lost, and the accuracy decreases.
Secondly, we designed a comparison experiment with different numbers of MDCNN layers to verify the validity of the dilated convolution and to analyze the influence of the network depth on the activity recognition accuracy. The experiment results are shown in Figure 6.
As can be seen from Figure 6, the recognition accuracy of MDCNN improves steadily as the number of layers increases from one to three. This is because the advantage of CNN lies in mining the nonlinear structure contained in the raw data; if the network is too shallow, it cannot make full use of the powerful fitting ability of CNN. However, the recognition accuracy of the four-layer MDCNN is lower than that of the three-layer network. This phenomenon indicates that the deep features extracted by the four-layer MDCNN do not contribute much to recognition and may even be redundant, which impairs the human activity recognition model.

Conclusion
This paper proposes an improved multichannel dilated convolution neural network (MDCNN), which requires no manual feature extraction, reduces the dependence on expert knowledge, and achieves excellent recognition results in the experiments. At the same time, MDCNN is a deep learning model that can achieve real-time HAR efficiently. By introducing dilated convolution and multichannel convolution, MDCNN mines raw sensor data more comprehensively, extracts more discriminative features, and increases the diversity of the feature set. The experiments also explored the influence of the MDCNN structure on recognition accuracy and constructed an ideal human activity recognition model. It is worth noting that MDCNN, like other deep learning models, recognizes static activities with lower accuracy than dynamic activities, which requires further improvement. The next step will be to apply MDCNN to more complex types of activity recognition.

Table 3: Comparison of accuracy with other models.

Conflicts of Interest
The authors declare that they have no conflicts of interest.