CNN-Based Multiterrain Moving Target Recognition Model for Unattended Ground Sensor Systems

In recent years, many deep learning algorithms based on seismic signals have been proposed to solve the moving target recognition problem in unattended ground sensor systems. Despite the excellent performance of these deep networks, most of them can only be deployed on cloud-based devices and cannot be deployed on low-power hardware devices due to the large network size. Second, since seismic signals are a ﬀ ected by the terrain, employing only seismic signals as reconnaissance means for unattended ground sensors cannot achieve multiterrain-type adaptability. In response, this paper proposes an MFC-TinyNet method facing a multiterrain. The method adds depthwise separable convolutional layers to the network, which e ﬀ ectively reduces the size of the network while keeping the target recognition accuracy constant, and solves the problem that the model is di ﬃ cult to deploy on low-power hardware. It also uses the Mel-frequency spectrum feature extraction method to fuse sound and seismic signals to improve the accuracy of the model ’ s moving target recognition on a multiterrain. Experiments demonstrate that the method can combine the two advantages of the small network model and multiterrain applicability.


Introduction
The use of unattended ground sensor systems to achieve detection of moving targets in intrusion alert areas is an important research direction in the field of military and civilian security protection. The detected targets usually include humans, various types of vehicles, and ultra-lowaltitude vehicles. Effective identification of moving targets is important for border control [1], protection of key facilities [2,3], and reduction of illegal hunting. Among many detection means, seismic sensors are considered to be the most effective means of reconnaissance at present because of their long detection distance, low cost, light weight, and low power consumption. However, there are some shortcomings in using seismic signals as the only means of detection. The same target will exhibit different seismic signal characteristics under different ground geological conditions [4], and this characteristic poses a great challenge for identifying moving vehicles or humans in different ground geological conditions. Compared to seismic signals propagated by the ground as a medium, sound signals propagate in the air with little distortion. In addition, because of the damping effect of grass, the seismic signal triggered by a human walking on grass is very weak but can trigger a larger sound (see Figure 1). Therefore, adding sound signals as supplementary information to seismic signals can effectively improve the accuracy of recognition.
Among the ground target recognition techniques based on seismic or acoustic signals, machine learning is considered as the best solution at present. According to the development process of machine learning, machine learning can be divided into traditional machine learning (TML) and deep learning (DL). TML methods used for moving target recognition based on ground seismic signals include but are not limited to the following: support vector machine (SVM) [5,6], k-nearest neighbour (K-NN) [7], decision tree (DT) [8], naive Bayesian algorithm (NB) [9], artificial neural networks (ANN) [10], boost [11], and Gaussian mixture model (GMM) [12]. Although TML has high computational efficiency, its classification accuracy is highly dependent on the selection of features. And it is difficult for the manually selected features to portray the nonlinear relationship between the seismic signal and the target. DL is a new method emerging in recent years, among which convolutional neural networks are most widely used in image recognition, crack diagnosis [13,14], synthetic aperture radar automatic target recognition [15][16][17], etc. Wang et al. proposed a 24-layer end-to-end convolutional neural network (CNN) model called Vib-CNN [18] to achieve the recognition of three types of moving targets (humans, wheeled vehicles, and tracked vehicles) for classification. Xu et al. proposed a parallel recurrent neural network (PRNN) to detect moving humans [19]. Experiments show that the algorithm achieves 98.4% accuracy for mobile person detection. Jin et al. used LFCC feature extraction and the CNN to validate the publicly available dataset SITEX02 with an accuracy of 92.28% [20]. Bin et al. proposed an edge-oriented intelligence approach named the compressed-aware edge convolutional neural network (CS-ECNN) [21]. The algorithm was successfully implemented for deployment on FPGAs.
Despite the excellent performance of these deep networks, there are still two problems that remain unsolved.
First, these networks have poor generalization performance for moving target recognition over multiple terrains because the seismic waves generated by moving targets are susceptible to ground patterns. Second, with the deepening and widening of neural networks, the computational complexity of deep neural network models increases dramatically and unattended ground sensors have to adopt more costly hardware solutions. In response, this paper proposes a moving target recognition method called MFC-TinyNet that balances recognition accuracy, multiterrain generalization capability, and low complexity. The method consists of two parts, the Mel-frequency spectrum feature extraction and CNN. The Mel-frequency spectrum improves the generalization capability of the model for multiple terrains through multimodal information fusion of sound and seismic signals and achieves sensor recognition of moving human and vehicles on multiple terrains of concrete roads, asphalt roads, dirt roads, and grass. The CNN includes both traditional convolutional layers and depthwise separable convolutional layers. The convolutional neural network with the addition of the depthwise separable convolutional layer can effectively reduce the size of the network while maintaining the same target recognition accuracy, solving the problem that the model size is too large to be deployed on low-power hardware devices.
The rest of this paper is structured as follows. Section 2 will present a new dataset and describe the signal acquisition device for the dataset, the signal acquisition environment, and the way that the dataset is constructed. Section 3 focuses 2 Journal of Sensors on the identification method of this paper. Section 4 shows the experimental results of the proposed method in this paper with other traditional machine learning and deep learning methods on this dataset. Section 5 will conclude this study and provide an outlook on future research.

Multitopography Dataset
The dataset used in this paper is collected by an unattended ground sensor (see Figure 2) developed by our team. The hardware equipment of this sensor and the dataset acquisition environment will be described in this section.

Unattended Ground Sensor
Hardware. The unattended ground sensor consists of three parts: sensor module, processor module, and communication module. In the process of acquiring sound and seismic signals, the unattended ground sensors communicate with the PC via USB connection. The sensor module consists of a MEMS accelerometer (seismic sensor), an audio sensor, and an AD converter. The processor module contains the storage unit, the processing unit, and the microprocessor. The microprocessor is a 32-bit high-performance STM32H743 based on the ARM core. In the moving target identification stage, the identification results will be sent to the command center through wireless communication. The communication module uses Zigbee self-organizing network communication, and the on-board communication chip model is CC2630. The composition of the sensor nodes is shown in Figure 3. The sound sensor type is a MESM microphone, and the seismic sensor type is an MESM accelerometer. Since the frequency of the seismic and sound signals triggered by the movement of vehicles and human is mainly concentrated below 500 Hz [4], the sampling frequency of both sound and vibration signals is 1015 Hz. The signal collected by the seismic sensor is the acceleration signal of vibration, the sensitivity is 900 mV/g, the full scale acceleration is ±3 g, and the maximum noise is 8 μg. The sensitivity of the sound sensor is −38 dB, the maximum sound pressure level is 130 dB, and the signal-to-noise ratio is 65 dB.

Recording Environments.
The terrain in the dataset will include four different road environments: concrete road, tarmac road, dirt road, and grassland (see Figure 4). The types of moving targets collected include humans, small tracked vehicles, and background noise (no targets). The location of the dataset collection is located in Changsha city, Hunan province.
The unattended ground sensor will record the sound and seismic signals as the target passes through the alert zone, which is a circular area centered on the unattended ground sensor. The diameter of the warning area for humans is 10 m, and the diameter of the warning area for small tracked vehicles is 15 m.

Data Preprocessing.
Data preprocessing consists of two steps: DC bias removal and dataset construction. To improve the accuracy of model classification, the DC bias generated by the acquisition instrument needs to be removed from the original signal. First, a fast Fourier transform is applied to the signal to remove the zero frequency signal. Then, the sound and seismic signals after removing the DC component are obtained using the fast Fourier inverse transform. Next is the construction of the dataset, which contains two parts: the training set and the test set. The training set is the data sample used for model fitting, and the test set is used to evaluate the generalization ability of the final model. To ensure a more effective evaluation of the generalization ability of the model, the training set and the test set are composed of sound and seismic signals collected at two different times. Table 1 shows the size of the processed dataset.
The left side of the bracket in the table is the dataset of the training datasets of this kind of data, and the right side of the bracket is the size of the test datasets of this kind of data.

MFC-TinyNet Algorithm Analysis
In this section, a ground target recognition method is presented. In order to overcome the poor generalization performance of different terrains and achieve model lightweight, MFC-TinyNet is proposed by referring to feature extraction methods in the field of speech recognition and image classification methods in computer vision. As seen in Figure 5, MFC-TinyNet takes the original sound and vibration signals as the input of the model, extracts the two-dimensional time-frequency domain feature matrices of the sound and seismic signals of the same time period using the Meier inverse spectrum feature extraction method, and then fuses the two two-dimensional feature matrices to form a threedimensional feature matrix. The feature matrix is then fed into a convolutional neural network. After the computation  The Mel-frequency spectrum is a sound feature extraction method based on the physical feature of the human ear. It is widely used in the fields of speech recognition, sound classifica-tion, music genre classification, and voiceprint recognition. As shown in Figure 6, the extraction of the Mel-frequency spectrum is divided into the following steps: pre-emphasis, shorttime Fourier transform (STFT), Mel filter bank, and logarithm [17]: (1) Pre-emphasis The transmission of the seismic signal and sound signal in the medium is accompanied by the loss of energy. When the seismic and sound signals have a certain scale, the higher the signal frequency, the more serious the energy loss during the propagation process. Therefore, the high-frequency part of the input signals is compensated to offset the loss of the highfrequency part in the propagation process. This signal processing method is called pre-emphasis. The transfer function of the pre-emphasis system is shown in equation (1) as follows: (2) Short-time Fourier transform Windowing and framing the pre-emphasis signals, then, fast Fourier transforms each frame signal to extract its amplitude spectrum. Generally, the window used is the Hamming window. The window length is 20 ms to 40 ms, and the window is moved to about half of the window length.
where f is the frequency expressed in Hertz and m is the corresponding Mel frequency. In practice, we use 50 Mel frequencies.
(4) Take the logarithm Take the logarithm of the data output by the Mel filter bank to obtain the Mel-frequency spectrum feature.
The seismic and sound signals are extracted with Melfrequency spectrum features and then will be input into a CNN to obtain classification results. Based on the performance characteristics of seismic and sound signals on the Melfrequency spectrum, a convolutional neural network named TinyNet is proposed in this paper. The network has two convolutional blocks and two fully connected layers in total, where each convolutional block includes one traditional convolutional layer, a depth-separable convolutional layer [22], and a  The size of the convolution kernel is 1 × 1 × M, and M is the number of channels in the previous layer. Therefore, the output of pointwise convolution will combine depth convolution in a weighted manner and the output image channels are the same as the pointwise convolutional FiLter channels (see Figure 7). Adding the depthwise convolution and pointwise convolutional layer to the convolutional block can significantly reduce the computational effort and network layer parameters with essentially no loss of recognition accuracy. All hidden layers use the ReLu function as the nonlinear activation function, and the output layer uses the Sigmoid activation function. To solve the overfitting problem of the network, dropout functions are added after each convolutional block and fully connected layer. The MFC-TinyNet network structure is shown in Figure 8. The network uses the cross-entropy loss as the objective function. Therefore, the loss function of the classifier is as follows: where N is the number of samples, C is the number of target types, t ij represents the probability that the i sample belongs to category j, and Θ represents a collection of all parameters.

Evaluation Metrics.
In the unattended ground sensor system, the accuracy rate, false alarm rate, and underreporting rate are usually used to evaluate the advantages and disadvantages of the model. Accuracy [23] refers to the proportion of all correctly predicted samples in the total number of samples, and its calculation formula is ACC = ð TP + TNÞ/ðTP + TN + FP + FNÞ. Among them, correctly classifying intrusion events is called true positive (TP), cor-rectly classifying noise samples is called true negative (TN), missing intrusion events are called false negatives (FN), and misjudging noise as an intrusion event is called false positive (FP). False alarm rate refers to the proportion of samples that misjudge noise as intrusion events in the total number of intrusion events. The calculation formula is FAR = FP/ðTP + TNÞ.The underreporting rate refers to the proportion of samples that misjudge the intrusion event as noise in the total number of samples. The calculation formula is UR = FN/ðTP + TNÞ.  7 Journal of Sensors trials are selected for discussion. Figure 9 shows the accuracy curves and loss curves of the train datasets and test datasets during the training process of Tiny-CNN, and Figure 10 is the confusion matrix of the MFC-TinyNet in test datasets. Through statistical calculation, the overall recognition accuracy of these three categories is 98.49%. For the results of a single-target type, the accuracy values of tracked vehicles and noise are very high, both exceeding 98%. The classification accuracy of personnel is 97.66%. The false alarm rate of the model is 0.43%, and the underreporting rate is 1.08%. Experimental results show that the algorithm has good performance on datasets. Among the misjudged samples, several human samples were misjudged as samples of environmental noise. There are also a few samples of tracked vehicles, classified as person and noise. The main reason of misjudgment is that the samples with low signal-to-noise ratio lead to fuzzy recognition.

Multitopography Dataset Experimental Results
(2) Without sound signal Only the seismic signal is used as the input of the MFC-TinyNet model, and the model is trained with 200 epochs. In order to avoid random factors, the median results of 11 repeated trials were selected for discussion. The performance of the model on the test datasets is shown in Figures 11 and  12. Its overall accuracy is 94.93%, the false alarm rate is 3.87%, and the underreporting rate is 1.20%.
Comparison With the Benchmark Methods: To verify the effectiveness of MFC-TinyNet, the classification performance of the state-of-the-art machine learning algorithms proposed by other scholars, in the field of ground-based moving target recognition based on seismic signals, is used as a comparison with the classification performance of MFC-TinyNet. These machine learning algorithms include genetic algorithm-optimized SVM (GA-SVM) [6], Vib-CNN [18], LFCC-CNN [20], and CS-ECNN [21]. Among them, GA-SVM is a traditional machine learning method that uses the genetic algorithm to optimize the penalty parameters and Gaussian kernel parameters of the Gaussian kernel support vector machine. CS-ECNNN and LFCC-CNN are non-end-to-end deep neural networks that use compressed sensing and LFCC as feature extraction methods, respectively. Vib-CNN is an end-to-end deep learning method. Each model will use multitopography datasets as the training set and test set of the model. Each model will undergo 11 experiments, and each experiment will be trained for 200 epochs. The median classification accuracy of the test set of the 11 experiments is discussed as the classification effect of the model, and the computation time of the algorithm on the test dataset is recorded. All methods are run on a desktop computer with an Intel Core i710th CPU and a GeForce RTX 2070S GPU.
The classification results of each model are shown in Table 2. The classification accuracies of GA-SVM and CS-ECNN are 74.09% and 72.09%, respectively, which is much lower than the accuracies of other deep learning algorithms, and the generalization ability for the multiterrain is poor. LFCC-CNN classification accuracy is 84.86%, which basically achieves the effect of identifying moving targets on

Conclusion and Future Work
In this paper, we propose a ground moving target recognition method based on the Mel-frequency spectrum and CNN for distinguishing environmental noise (no target), human intrusion, and tracked vehicle intrusion. And the sound and seismic signals of a set of moving targets are recorded by a self-researched UGS. The accuracy of the method for target classification reached 98.49% after experimental validation. Compared with other deep learning methods, this method has two advantages of low hardware requirements for implementation and applicability to a variety of terrain conditions. However, the proposed method in this paper is only limited to identifying the type of intrusion target and cannot determine the relative position of the intrusion target and the sensor.
Since the dataset collection site in this paper is located in an urban area, there are a large number of humans moving around the periphery of the unattended ground sensor alert area, resulting in a low signal-to-noise ratio in the dataset and a small alert radius for the unattended ground sensors. If the ambient noise and moving target signals are implemented, the alert area will be greatly enhanced. In future work, based on the difference between environmental noise and target signal in the frequency domain, the separation of environmental noise and target signal using deep clustering method will be a promising research direction. In addition, the combination of sensor arrays, generalized mutual correlation time delay estimation techniques, and deep neural networks will likely enable the localization of multiple intrusion targets.

Data Availability
The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.