A Deep Convolutional Network for Multitype Signal Detection and Classification in Spectrogram

Wideband signal detection is an important problem in wireless communication. With the rapid development of deep learning (DL), DL-based methods have been applied to wireless communication and have shown great potential. In this paper, we present a novel neural network for detecting signals and classifying signal types in wideband spectrograms. Our network uses key point estimation to locate the rough centerline of each signal region and recognize its class. Several regressions are then carried out to obtain further properties, including the local offset and the border offsets of a bounding box, which are combined to refine the location. Experimental results demonstrate that our method is more accurate than other DL-based object detection methods previously employed for the same task. In addition, our method runs considerably faster than existing methods and abandons candidate anchors, which makes it more favorable for real-time applications.


Introduction
Wireless communication plays an important role in military and civilian life owing to its flexibility and long-distance transmission capability. The fast development of wireless communication and related technologies makes the electromagnetic environment increasingly chaotic. Cognitive radio technology, which is capable of learning from and adapting to the environment, has attracted a lot of attention [1,2]. Automatic signal detection (SD) and signal classification (SC) are two important tasks in cognitive radio that have been researched for decades. Since deep learning (DL) performs well in image, speech, and natural language processing, it has also been introduced to wireless communication and has brought about great improvements.
SD in this paper specifically refers to detecting signals in the wideband. It is also the basic task of spectrum sensing [3] and blind signal separation [4,5]. Among traditional SD methods, energy detection was once the most popular technique; its algorithms can be classified as threshold-based [6-11] and non-threshold-based [12-16]. The former have lower computational requirements, but their performance is limited by the signal-to-noise ratio (SNR) and they have a high false alarm probability. The latter improve accuracy and universality at the cost of computational complexity. Recently, some DL-based methods have been applied in narrowband or wideband environments. In [17,18], a convolutional neural network (CNN) + long short-term memory network (LSTM) and a deep belief network (DBN) were adopted, respectively, to detect signals in the narrowband, with raw data and the spectral correlation function (SCF) as inputs. In [19], researchers selected the narrowbands containing signals from a wideband spectrogram by energy detection and then used a CNN to classify the wanted Morse signals. Those methods have achieved excellent detection results, but their main purpose is mostly to detect the existence of signals, without providing time or frequency information.
For the SC task, the process is generally to first extract signal features and then design a classifier for recognition. The commonly used features for SC can be classified as instantaneous time features [20], statistical features (cumulants [21] and cyclostationarity [22]), transform features (Fourier transform [23] and wavelet transform [24]), other features (constellation shape [25] and zero crossing [26]), and feature learning (raw data [27,28]). These features differ in noise robustness and complexity, and they are applicable to different signal types. The classifiers mainly include earlier threshold-based algorithms [20,23], traditional machine learning (ML) methods (support vector machine (SVM) [29,30], k-nearest neighbor (KNN) [31], and fuzzy classifier [32]), and DL methods (deep neural network (DNN) [25,33], CNN [24,27], LSTM [34], and CNN + LSTM [28]). The earlier threshold-based algorithms are fast but depend greatly on feature design and threshold selection, which require profound expert knowledge. The traditional ML methods reduce the model complexity, but they are sensitive to noise and also need elaborate feature design. DL methods learn features from raw or simply processed data and obtain good classification performance, provided that rich training data are available.
Some research on joint SD and SC has also been carried out. In [35], an algorithm based on first-order cyclostationarity was proposed for FSK and AM signals. In [36], the key spectral features of narrowband signals were extracted to train a naive Bayes classifier for the modulation classification of signals and the detection of jamming. Researchers in [37] used recurrent neural networks (RNN) to process the spectrograms of long-term signals. Recently, some researchers used the single shot multibox detector (SSD) network, a classical DL-based object detector, to achieve end-to-end SD and SC in a wideband spectrogram [38,39]. SSD can find the specific location of the signals, including the start and end time and frequency, which makes it a promising and valuable approach for further study.
Inspired by the SSD, in this paper we convert the spectrogram-based SD and SC tasks into an object detection task and exploit the advantages of DL in computer vision. The SSD and other commonly used object detection methods make predictions at the center point of an object. However, since the aspect ratio of signals is usually large and varies dramatically, a signal region often extends beyond the receptive field of one point, which can lead to incomplete predictions. In addition, the large number of candidate anchors in SSD reduces the real-time performance. Targeting these shortcomings, we construct a deep convolutional network and train it end-to-end. Since a signal region in a spectrogram is usually a horizontally long rectangle, we propose to model a signal by its centerline, and the signal type and a bounding box (BBox) are regressed from the features on the centerline. Compared to most object detection methods, our network abandons candidate anchors and uses the centerline instead of a single point to make predictions, which is more efficient and task-oriented (we explain the defects of traditional object detectors and our improvements in detail in Section 2). Experimental results show that our method has a higher detection and classification accuracy, especially for extremely long instances. Moreover, the simplicity of our network allows it to run very fast.
To summarize, the main contributions of our work are as follows: (1) we apply the idea of DL-based object detection to multitype SD and SC in wideband spectrograms, which makes it possible to detect the time and frequency location of signals and recognize their types; (2) rather than applying commonly used object detectors directly, we target the characteristics of signals in spectrograms and propose an improved convolutional network that uses centerlines to locate the signals and abandons candidate anchors, which makes our method more accurate and faster.

Related Work
We aim to accomplish multitype SD and SC in spectrograms through the idea of DL-based object detection. However, we argue that traditional detectors are not well suited to this task. The researchers in [38,39] have previously used the SSD, a commonly used detector, to perform the same task, but the results are not satisfactory, especially for extremely long instances. In this section, we explain the defects of traditional detectors for signals in spectrograms, and then we propose our centerline-based method accordingly.

Defects of Traditional DL-Based Object Detectors.
Most DL-based object detectors [40-45] achieve good performance when the objects have regular shapes and aspect ratios. Nevertheless, the signals in a spectrogram usually have extremely long shapes, and their time duration and frequency band vary dramatically between signals, which makes them quite different from general objects. Traditional detectors tend to struggle on such signals, for two main reasons: (1) Due to the limited receptive field of CNNs, one-point-based detectors [42,44,45], which use only one point or a small area to predict the box size, cannot produce complete BBoxes. (2) Many detectors generate multiple candidate anchors at each pixel in advance [40-44], but since the shapes of signals vary greatly, it is difficult to design a common group of anchors that matches the signals, and the regression on those anchors is quite time-consuming. Figure 1(a) shows a detection result of SSD [38]. The green box is the ground truth box of a signal, and the blue box is the incomplete box proposal that is predicted. SSD uses only the red point to make the prediction, and its receptive field (the region in red grids) is smaller than the ground truth box. In addition, in Figure 1(b), the candidate anchors used in [38] are drawn as yellow grids; the shape of the anchors differs greatly from that of the ground truth box, so the signal has no suitable anchor to match during training.

Signal as Centerline.
To solve the above problems, we propose to model the signal as its centerline. Since a signal region is usually a horizontally long rectangle, we first find the centerlines to locate each signal, and then the features on the centerlines are used to predict the BBox sizes and signal types. The receptive field of a centerline can easily cover the entire signal. We also abandon anchor generation and instead directly predict the offsets between the centerline and the (upper/lower) border lines, which avoids the shape mismatch of anchors and saves a lot of time. The principle of our method is visualized in Figure 2.

Data Generation
The amount and richness of the dataset are crucial for the training of deep neural networks. Since signal transmission in radio communication has a clear mathematical expression, we simulate the wideband signals programmatically, and the data generation process is introduced in this section.
We select 2FSK, 4FSK, (PSK/QAM), Morse, speech, and resident noise (RN) as the test signal types, since they often appear in the wideband. These signal types are intuitively distinguishable in spectrograms, except for MPSK and MQAM; thus we merge those two types into (PSK/QAM). If a further classification of MPSK and MQAM is needed, subsequent methods such as those in [25,38,46,47] can be introduced.

Multitype Signal Model.
The transmitted digital modulation signal can be modelled as
$$s(t) = \sum_n a_n e^{j(\omega_n t + \phi)} g(t - nT_b),$$
where $a_n$ represents the transmitted symbols, $\omega_n$ is the angular frequency, $\phi$ is the carrier initial phase, $g(t)$ is the shaping filter, and $T_b$ is the symbol period.
For the MFSK signal, it can be presented as [equation missing from the extracted text].
For the MPSK signal, it can be presented as [equation missing from the extracted text].
For the MQAM signal, it can be presented as $a_n = I_n + jQ_n$ [remainder of the equation missing from the extracted text].
For Morse, it can be presented as [equation missing from the extracted text].
The RN refers to an irrelevant signal with long duration, narrow bandwidth, and random energy changes. Here we present it as a signal with a single frequency and random amplitude change: [equation missing from the extracted text].
For speech, we modulate real-world audio to different frequencies by amplitude modulation.
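As an illustration of the general model $s(t) = \sum_n a_n e^{j(\omega_n t + \phi)} g(t - nT_b)$, the following minimal sketch samples a 2FSK signal. It is not the authors' simulator: the rectangular shaping filter, the tone frequencies, the sampling rate, and the function names `modulate` and `fsk2` are all assumptions chosen for the example.

```python
import cmath
import math


def modulate(symbols, freqs_hz, fs, samples_per_symbol, phase=0.0):
    """Sample s(t) = sum_n a_n * exp(j(omega_n t + phi)) * g(t - n T_b).

    Assumption: g(t) is a rectangular pulse lasting exactly one symbol
    period, so each sample belongs to exactly one symbol.
    """
    out = []
    for n, (a_n, f_n) in enumerate(zip(symbols, freqs_hz)):
        omega_n = 2.0 * math.pi * f_n
        for k in range(samples_per_symbol):
            t = (n * samples_per_symbol + k) / fs
            out.append(a_n * cmath.exp(1j * (omega_n * t + phase)))
    return out


def fsk2(bits, f0, f1, fs, samples_per_symbol):
    """2FSK: constant amplitude, the bit selects one of two tone frequencies."""
    freqs = [f1 if b else f0 for b in bits]
    return modulate([1.0] * len(bits), freqs, fs, samples_per_symbol)


# Hypothetical parameters, for illustration only.
signal = fsk2([0, 1, 1, 0], f0=1000.0, f1=2000.0, fs=16000.0, samples_per_symbol=32)
```

With a rectangular pulse the envelope is constant, so every sample of `signal` has unit magnitude; a practical simulator would use a band-limiting shaping filter instead.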

Wideband Spectrogram Generation.
In an actual communication environment, the received signals of most systems can be expressed as [equation missing from the extracted text]. This takes into account the effects of many real-world factors: $n_{Lo}(t)$ represents the residual carrier random walk, $n_{Clk}(t)$ represents the time deviation, $h(t)$ is the time-varying channel function, and $n_{Add}(t)$ is the additive noise.
To make the synthetic data as valuable as possible, we make the simulation comprehensive and as close to the real situation as we can. On the one hand, the pulse shaping and bit rate suitable for each modulation mode are set up, and real voice or text is modulated as the transmitted data. On the other hand, a robust channel model is employed that includes multipath fading, random frequency walk drifting, and additive white Gaussian noise (AWGN). We pass the synthetic wideband signals through the channel model to obtain the final experimental data.
To calculate the spectrograms of the wideband signals, we use the short-time Fourier transform (STFT), which is a common time-frequency analysis method. The STFT is calculated as [equation missing from the extracted text], where $s(m)$ is the sampled signal, $w(m)$ is the window function, and $P_n(\omega)$ is the final time-frequency matrix. Figure 3 presents different types of signals in the wideband. We annotate each signal with a ground truth box that is higher than its bandwidth; the influence of the ground truth box height is discussed in Section 5.2.
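The spectrogram computation can be sketched with a direct, unoptimized STFT. This is only an illustration: the Hann window, the hop size, the squared-magnitude output, and the names `hann` and `stft_power` are assumptions, since the window and parameters used in the paper are not stated in the text.

```python
import cmath
import math


def hann(n_win):
    """Periodic Hann window of length n_win (window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * m / n_win) for m in range(n_win)]


def stft_power(samples, n_win, hop):
    """Return a list of frames; each frame holds n_win power values, one per bin.

    Each frame is the squared magnitude of the DFT of one windowed segment.
    A direct O(n_win^2) DFT is used to stay dependency-free; an FFT would be
    used in practice.
    """
    window = hann(n_win)
    frames = []
    for start in range(0, len(samples) - n_win + 1, hop):
        seg = [samples[start + m] * window[m] for m in range(n_win)]
        row = []
        for k in range(n_win):
            acc = sum(seg[m] * cmath.exp(-2j * math.pi * k * m / n_win)
                      for m in range(n_win))
            row.append(abs(acc) ** 2)
        frames.append(row)
    return frames


# Hypothetical test tone: 1 kHz complex exponential sampled at 8 kHz.
fs = 8000.0
tone = [cmath.exp(2j * math.pi * 1000.0 * n / fs) for n in range(256)]
spec = stft_power(tone, n_win=64, hop=32)
peak_bin = max(range(64), key=lambda k: spec[0][k])
```

In this example the frequency resolution is `fs / n_win` = 125 Hz per bin, so the 1 kHz tone peaks at bin 8.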

Approach
The network of our approach is mainly composed of CNNs, which perform well in image recognition. CNNs learn features via nonlinear transformations arranged as a series of nested layers, each of which applies several kernels that convolve over the input. Generally, the kernels are multidimensional arrays whose values are updated by a learning algorithm [48]. In this section, we first give an overview of our network, then elaborate on its two core modules, and finally present the details of training and inference.

Overview.
The overall architecture of our method is illustrated in Figure 4 and can be divided into two main parts. First, the backbone network extracts shared feature maps for the subsequent tasks. Our backbone network is a ResNet18 with three up-convolutions, in which the features of different levels are effectively merged. Then, we adopt a shape and type expression module (STEM), which uses the shared features to predict the BBoxes and signal types. The STEM constructs a shape expression by learning geometric properties, including the centerline, local offset, and border offsets. The details of the backbone module and the STEM are presented as follows.

Backbone Module.
We use a ResNet18 with three up-convolutions as the backbone module to extract shared features; its architecture is shown in Figure 5. The input image first passes through multiple forward convolutional stages, whose structures are detailed in the dotted box on the left. Each convolutional stage contains two blocks, and each block consists of two convolutional layers and a residual connection between the input and output of the block. The residual structure mitigates the gradient propagation problems that arise when training deep networks. We introduce three transposed convolutions to upsample the output of the forward convolutions, and the output of each transposed convolution is added to that of the corresponding convolutional stage. By merging the multiscale feature maps, we can make full use of the features learned at different levels. The output feature map is 1/4 of the input image size. Batch normalization and a ReLU activation function follow the convolutional layers; they are not marked in the figure.

Shape and Type Expression Module.
The STEM is a multichannel convolutional network that can be divided into three branches. In each branch, we use two convolutional layers (3 × 3 and 1 × 1) with different numbers of channels to regress the signal property maps, which include the centerline, local offset, and border offsets.

Centerline.
The centerline is a 7-channel (6 signal types + 1 background) map that represents the pixel-wise probabilities of the different classes of centerlines. To generate the ground truth centerline maps, we compute a low-resolution equivalent $\tilde{p} = \lfloor p/R \rfloor$ for each centerline point $p$ of class $c$ and then splat all $\tilde{p}$ onto a heatmap [heatmap definition and splatting kernel missing from the extracted text], where $W$ and $H$ are the width and height of the input image, $R$ is the downsampling scale, $C$ is the number of signal types, and $\sigma$ is an object-size-adaptive standard deviation. The training objective of the centerline is the pixel-wise focal loss [equation missing from the extracted text], where $\alpha$ and $\beta$ are hyperparameters of the focal loss and $N$ is the number of centerline points in a spectrogram. We choose $\alpha = 2$ and $\beta = 4$ in all our experiments.
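The heatmap and loss equations are not available in the text above. The sketch below therefore assumes the penalty-reduced focal loss and Gaussian splatting used by keypoint-based detectors, which is consistent with the stated hyperparameters $\alpha = 2$ and $\beta = 4$ but is not confirmed by the source. It handles a single class channel, and the names `splat_centerline` and `focal_loss` are hypothetical.

```python
import math


def splat_centerline(height, width, row, x_start, x_end, sigma):
    """Ground truth heatmap for one centerline on the downsampled grid.

    Assumption: every centerline point is splatted with an isotropic
    Gaussian and overlapping Gaussians are merged by element-wise maximum.
    """
    heat = [[0.0] * width for _ in range(height)]
    for cx in range(x_start, x_end + 1):
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - cx) ** 2 + (y - row) ** 2) / (2.0 * sigma ** 2))
                heat[y][x] = max(heat[y][x], g)
    return heat


def focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Assumed penalty-reduced pixel-wise focal loss for one class channel.

    Pixels with target == 1 are positives; all others are negatives whose
    penalty is reduced by (1 - target)^beta near a centerline.
    The sum is normalized by N, the number of centerline points.
    """
    total, n_pos = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if t == 1.0:
                total += (1.0 - p) ** alpha * math.log(p)
                n_pos += 1
            else:
                total += (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return -total / max(n_pos, 1)


# Hypothetical 8 x 16 map with a centerline on row 4 spanning columns 3..12.
target = splat_centerline(8, 16, row=4, x_start=3, x_end=12, sigma=1.0)
uninformed = [[0.5] * 16 for _ in range(8)]
loss_good = focal_loss(target, target)
loss_bad = focal_loss(uninformed, target)
```

A prediction that reproduces the target yields a much smaller loss than an uninformative constant prediction, which is the behaviour the training objective relies on.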

Local Offset.
The local offset is a 1-channel map that has valid values only on the centerlines. To recover the discretization error caused by the downsampling of the backbone network, we additionally predict a vertical local offset. The local offset is added to the ordinate of the centerline when mapping the downsampled map back to the original size. The training objective of the local offset is the L1 loss at centerline points: [equation missing from the extracted text].

Border Offsets.
[The beginning of this passage is missing from the extracted text] ... the height of the predicted BBox at $p$, and the ground truth height at $p$ is $S_p$. As with the local offset training loss, the training objective of the border offsets is [equation missing from the extracted text].

BBox and Signal Type Generation.
Having obtained the predicted centerline, local offset, and border offsets at each pixel, we need to regress the final BBoxes and signal types. We set a threshold of 0.5 on the heatmap to obtain all of the positive centerline connected domains. Each connected domain corresponds to a signal, and the pixel class that appears most frequently in a domain is the predicted signal type. The horizontal minimum $x_{min}$ and maximum $x_{max}$ of a connected domain are the start time and stop time, respectively. We choose the row with the largest cumulative probability of the predicted class in each connected domain as the centerline of that domain. The local offset and border offsets are averaged over the centerline points. We can then obtain the coordinates of the lower left and upper right corners of a BBox as follows: [equation missing from the extracted text]. All of the BBoxes are regressed directly from the centerlines, without the need for redundancy-removal processes such as intersection over union (IOU)-based nonmaximum suppression (NMS). The architecture of our model is simple compared to most traditional two-stage or one-stage object detection models.
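The decoding steps described above can be sketched as follows. The connected-domain search, majority vote, centerline row selection, and offset averaging follow the text; the final corner computation is an assumption, because the corresponding equation is not available. In particular, working in downsampled coordinates, 4-connectivity, and the names `decode`, `up_off`, and `down_off` are choices made for this example only.

```python
from collections import Counter, deque


def mean_at(values, points):
    """Average a 2-D map over a list of (y, x) points."""
    return sum(values[y][x] for y, x in points) / len(points)


def decode(heat, local_off, up_off, down_off, thresh=0.5):
    """heat[c][y][x]: centerline probabilities of the signal classes.

    Returns (class, x_min, y_low, x_max, y_high) per connected domain,
    in downsampled coordinates (assumption).
    """
    n_cls, h, w = len(heat), len(heat[0]), len(heat[0][0])
    best = [[max(range(n_cls), key=lambda c: heat[c][y][x]) for x in range(w)]
            for y in range(h)]
    pos = [[heat[best[y][x]][y][x] > thresh for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not pos[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first search over one 4-connected positive domain.
            comp, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and pos[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Majority vote over pixel classes gives the signal type.
            cls = Counter(best[y][x] for y, x in comp).most_common(1)[0][0]
            x_min = min(x for _, x in comp)
            x_max = max(x for _, x in comp)
            # Centerline = row with the largest cumulative class probability.
            row_sum = {}
            for y, x in comp:
                row_sum[y] = row_sum.get(y, 0.0) + heat[cls][y][x]
            row = max(row_sum, key=row_sum.get)
            pts = [(y, x) for y, x in comp if y == row]
            # Assumed corner formula: refined ordinate plus/minus mean border offsets.
            y_c = row + mean_at(local_off, pts)
            boxes.append((cls, x_min, y_c - mean_at(down_off, pts),
                          x_max, y_c + mean_at(up_off, pts)))
    return boxes


# Hypothetical 2-class example on a 6 x 10 map.
H, W = 6, 10
heat = [[[0.0] * W for _ in range(H)] for _ in range(2)]
for x in range(2, 8):
    heat[1][2][x] = 0.9
    heat[1][3][x] = 0.6
const = lambda v: [[v] * W for _ in range(H)]
boxes = decode(heat, const(0.25), const(1.0), const(1.5))
```

Because every domain yields exactly one box, no NMS step is needed in this scheme.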

Training and Inference Details.
We train the network end-to-end with the following loss function: [equation missing from the extracted text]. The loss is a weighted sum of the three property losses. The weights $\lambda_1$, $\lambda_2$, and $\lambda_3$, which trade off among the three losses, are set to 1.0, 0.5, and 0.5 in our experiments.
To make training more efficient and effective, we apply data augmentation methods including cropping, scaling, and Gaussian noise. The details of the training and validation datasets are presented in Table 1. An Adam optimizer with a learning rate of 2e-4 is used to optimize the overall objective. We train the model with a batch size of 50 for 150 epochs.

Experiments
To the best of our knowledge, implementing multitype SD and SC directly with BBoxes in wideband spectrograms is a relatively new line of research, so there are few related methods. Zha et al. [38] and Yang et al. [39] have used the SSD for this task and compared it with other DL-based object detectors; hence we conduct comparative experiments in the same way. Specifically, we present quantitative performance results and analyze the influence of several unstable factors introduced by the channel conditions or manual processing. Moreover, we separately compare our SC performance with two traditional SC methods on narrowband signals. Experiments are conducted on the dataset generated in Section 3, and the implementation details are introduced in Section 4.4. We refer to our centerline-based network as CLN in the experiments.

Baselines.
Zha et al. [38] used the SSD to detect and classify signals in spectrograms and compared it with the Fast-RCNN [40]; their results suggest that the Fast-RCNN has a high accuracy, while the SSD is superior in speed. Here we compare our method with two traditional DL-based object detectors, SSD and Faster-RCNN (we expect the Faster-RCNN to do better than the Fast-RCNN in both accuracy and speed):
(1) SSD [42]: SSD is a representative one-stage object detection method. Its main idea is to first generate candidate anchors of different aspect ratios and scales at each pixel of the extracted feature map and then predict class scores and size adjustments for each anchor. The SSD used in our experiments is the same as that in [38], where the feature extraction CNN is a VGG-16 [49].
(2) Faster-RCNN [41]: Faster-RCNN is a representative two-stage object detection method. Faster-RCNN also generates candidate anchors in advance; the region proposal network (RPN) then predicts positive scores and adjustments for each anchor to propose regions, and finally the class predictions and a second regression of the regions are carried out by the Fast-RCNN. The feature extraction CNN of Faster-RCNN in our experiments is VGG-16.

Metrics.
For the end-to-end SD and SC performance evaluation, we compare different methods in terms of precision, speed, and model size. The precision metric is the mean Average Precision (mAP) [50], which is the mean of the APs of the different classes. AP summarizes the predicted precision and recall of one signal class at a given IOU threshold, and mAP averages AP over the classes:
$$\mathrm{mAP} = \frac{\sum_{i=1}^{\mathrm{num\_classes}} \mathrm{AP}_i}{\mathrm{num\_classes}}.$$
The speed metric is frames per second (FPS), the number of spectrograms that a model can process per second [equation missing from the extracted text]. The model size metric is the memory usage of the model parameters.
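To make the precision metric concrete, the following sketch computes AP for one class and then averages over classes. The text does not state which AP variant is used, so the all-point interpolation, the greedy matching rule, the single-image simplification, and the function names are assumptions.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def average_precision(detections, truths, iou_thresh=0.5):
    """AP of one class. detections: list of (score, box); truths: list of boxes.

    Assumptions: detections are matched greedily in descending score order,
    each ground truth can be matched once, and AP is the area under the
    precision envelope (all-point interpolation).
    """
    if not truths:
        return 0.0
    matched = [False] * len(truths)
    flags = []
    for _, box in sorted(detections, key=lambda d: -d[0]):
        best_j, best_v = -1, 0.0
        for j, gt in enumerate(truths):
            v = iou(box, gt)
            if v > best_v:
                best_j, best_v = j, v
        hit = best_j >= 0 and best_v >= iou_thresh and not matched[best_j]
        if hit:
            matched[best_j] = True
        flags.append(hit)
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(flags, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / k)
        recalls.append(tp / len(truths))
    # Precision envelope: make precision non-increasing in recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_r) * p
        prev_r = r
    return ap


def mean_ap(ap_per_class):
    """mAP = sum of per-class AP divided by the number of classes."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical example: two ground truth signals, one false alarm.
truths = [(0, 0, 10, 2), (20, 0, 30, 2)]
dets = [(0.9, (0, 0, 10, 2)), (0.8, (50, 0, 60, 2)), (0.7, (20, 0, 30, 2))]
ap = average_precision(dets, truths)
```

Raising `iou_thresh` makes the match criterion stricter, which is how mAP at different IOU thresholds is obtained.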
For the SSD and Faster-RCNN, we use the same training methods as for our network. All of the models are trained to convergence. Figure 6 shows the quantitative comparison of end-to-end SD and SC performance. CLN has the highest mAP at different IOU thresholds, the Faster-RCNN is second, and the SSD is slightly worse. The results demonstrate that our centerline-based method is more suitable for signals in spectrograms than the one-point-based methods. Figures 7 and 8 show the speed and model size, respectively, of the different methods, for the evaluation of computational complexity. The results show that CLN is significantly faster and smaller than the other two models. Our method abandons the candidate anchors and regresses the BBox properties directly from the centerlines, thus greatly reducing the number of parameters. The simple architecture of our method ensures that it runs very fast, even compared with the fast SSD method.
To further visualize and analyze the performance, Figure 9 shows some randomly selected detection results of the three methods. Figure 9(a) is the result of CLN, which is able to trace out precise BBoxes containing the whole signals and to classify the types successfully with high confidence scores. Benefitting from its two boundary regressions, the Faster-RCNN also has good detection and recognition performance in Figure 9(b), but it occasionally confuses two signals that are very close to each other (such as the two speech signals in the bottom spectrogram). In Figure 9(c), although the SSD has found the existence of signals, it fails to regress complete BBoxes, especially for the extremely long instances.
We need to emphasize that, for the SSD and Faster-RCNN models, the default aspect ratios (height/width) of the candidate anchors are too large for the signals in spectrograms. We adjust their aspect ratios to [1/2, 1/4, 1/6, 1/8] so that the ground truth boxes match more candidate anchors during input encoding. This makes the models more task-specific, but the performance of SSD is still not satisfactory. There is probably still room for improvement through further adjustment of the anchors, but that would be a cumbersome process requiring patience, whereas our method does not need anchors at all.
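The anchor mismatch can be illustrated with a small calculation. The sketch below compares a hypothetical long ground truth box with anchors of the adjusted aspect ratios, all centered at the same point. The anchor parameterization (width `scale / sqrt(r)`, height `scale * sqrt(r)`), the box size, and the scale are assumptions for illustration and are not taken from [38].

```python
import math


def centered_iou(w1, h1, w2, h2):
    """IOU of two axis-aligned boxes that share the same center."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def best_anchor_iou(gt_w, gt_h, ratios, scale):
    """Best IOU between a ground truth box and anchors of the given
    height/width ratios, all with area scale**2 (assumed parameterization)."""
    best = 0.0
    for r in ratios:
        anchor_w = scale / math.sqrt(r)
        anchor_h = scale * math.sqrt(r)
        best = max(best, centered_iou(gt_w, gt_h, anchor_w, anchor_h))
    return best


# Hypothetical signal box: 400 pixels long, 10 pixels high (ratio 1/40).
gt_w, gt_h = 400.0, 10.0
scale = math.sqrt(gt_w * gt_h)  # anchors with the same area as the box
adjusted = best_anchor_iou(gt_w, gt_h, [1 / 2, 1 / 4, 1 / 6, 1 / 8], scale)
ideal = best_anchor_iou(gt_w, gt_h, [1 / 40], scale)
```

Even with an equal-area anchor, the best IOU for this box stays below a 0.5 matching threshold with the adjusted ratios, while an anchor with the box's own ratio matches exactly.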

Sensitivity Analysis.
To evaluate the robustness of the methods, we conduct sensitivity tests on several unstable factors introduced by the channel conditions or manual processing. All of these factors influence how signals are presented in the spectrograms that serve as network inputs. Figure 10(a) shows the mAP$_{50}$ curves of the different methods versus SNR. All of the performances drop quickly when the SNR is lower than −2 dB. The CLN and Faster-RCNN always perform better than the SSD, and they can hold an mAP$_{50}$ greater than 0.5 at a low SNR. In Figures 10(b)-10(d), we draw the classification confusion matrices of the CLN at different SNRs. The results show that there are few classification errors at a high SNR and that more errors occur as the SNR decreases. The confusions are often related to similarity in bandwidth or shape, such as between 2FSK and 4FSK, RN and Morse, and speech and (PSK/QAM).
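Test data at a prescribed SNR can be produced by scaling complex white Gaussian noise to the measured signal power. The sketch below is illustrative only: the SNR definition (average signal power over average noise power across all samples) and the names `add_awgn` and `measured_snr_db` are assumptions, and the paper may define SNR over a different bandwidth.

```python
import cmath
import math
import random


def add_awgn(samples, snr_db, rng):
    """Add complex white Gaussian noise so that
    mean(|signal|^2) / mean(|noise|^2) is approximately 10^(snr_db / 10)."""
    power = sum(abs(s) ** 2 for s in samples) / len(samples)
    noise_power = power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power / 2.0)  # per real and imaginary component
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for s in samples]


def measured_snr_db(clean, noisy):
    """Empirical SNR from the clean signal and its noisy version."""
    sig = sum(abs(c) ** 2 for c in clean) / len(clean)
    noise = sum(abs(n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(sig / noise)


# Hypothetical unit-power tone degraded to -2 dB SNR.
rng = random.Random(0)
clean = [cmath.exp(2j * math.pi * 0.05 * n) for n in range(20000)]
noisy = add_awgn(clean, snr_db=-2.0, rng=rng)
snr_est = measured_snr_db(clean, noisy)
```

With enough samples the measured SNR lands close to the requested value, which gives a quick sanity check before sweeping SNR in a robustness test.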

Rayleigh Fading.
Rayleigh fading describes signals that reach the receiver through multiple paths from different directions. We simulate a Rayleigh fading channel to test the robustness of CLN. We assume that the received signal is a combination of two path reflections, that the gains of the two paths are 0 dB and −10 dB, and that the time delay between them is 1e-7 s. In addition, the maximum Doppler frequency shift (MDFS), introduced by the relative motion between the transmitter and receiver, is set to 0 Hz, 50 Hz, 100 Hz, and 500 Hz. The test results are shown in Figure 11. Although the "MDFS: 0 Hz" case has no Doppler shift, the multipath fading in the test data leads to a slight performance drop compared to the "AWGN" case. At a very low SNR, the mAP$_{50}$ values for different MDFS are very similar, indicating that SNR is the main constraint in this situation. As the SNR increases, the performances diverge and a larger MDFS yields a lower mAP$_{50}$, especially above −2 dB SNR. When the SNR exceeds 2 dB, our method can hold an mAP of approximately 0.8 at 500 Hz MDFS, which shows good robustness, and we can also expect a performance improvement from enriching the training data.

Frequency Resolution.
The frequency resolution is an important and necessary parameter of the spectrogram representation. To test the robustness to frequency resolution, we vary it from 20 Hz to 40 Hz and plot the mAP$_{50}$ curves in Figure 12. Generally, all of the methods perform best at 30 Hz, since our training frequency resolution is approximately 30.5 Hz. When the frequency resolution changes, the performance of the methods does not fluctuate noticeably. Therefore, a change of resolution within a certain range has limited impact on the detection and classification results. The reason may be that, at the tested frequency resolutions, the target signal types can still be intuitively distinguished in the spectrograms.

Height of Ground Truth Box.
During the dataset generation in Section 3.2, we annotate the signals with BBoxes whose height is larger than the bandwidth of the signals. In this test, the height of the ground truth boxes is instead annotated as the bandwidth of the signals, which decreases the aspect ratios of the ground truth boxes. To adapt to this change, we also reduce the aspect ratios of the anchors for Faster-RCNN and SSD to [1/2, 1/10, 1/15, 1/20]. The detection and classification results are presented in Figure 13. Our method is still able to predict BBoxes close to the ground truth, while the performances of the SSD and Faster-RCNN are greatly reduced, in particular through missing and redundant predictions.
Our method focuses on the centerlines of signals, which do not change with the height of the ground truth boxes; thus it only needs to predict different border offsets. For the SSD and Faster-RCNN, since the sizes and aspect ratios of BBoxes vary greatly between signals, it is difficult to design a common group of anchors, with the result that some ground truth boxes cannot be matched to any candidate anchor during input encoding. In addition, signals with a narrow bandwidth or long duration may receive repeated but non-overlapping predictions, which cannot be filtered by IOU-based NMS. Therefore, when anchor-based methods are used to detect signals in spectrograms, it is advisable to decrease the aspect ratios of the anchors and increase those of the ground truth boxes.

Separate SC Performance.
The proposed method is based on the spectrogram and classifies the types of signals as they are detected. By adjusting the types of training data, our network can classify some modulation modes and even specific signal types such as Morse and speech. To further evaluate the classification capability of our method, we conduct a series of separate SC experiments in this subsection. We first conduct classification on several common signal types to analyze which types can and cannot be classified by our method. Then, targeting the types that can be classified, we compare our method with two classic SC methods to evaluate the performance.
We select the signal types 2ASK, 4ASK, 2FSK, 4FSK, BPSK, QPSK, and 16QAM, which are all basic modulation modes used in practical communication, to test the classification performance of our method. We simulate wideband data that randomly contain the above signals using the same approach as in Section 3.2, and the confusion matrix is shown in Figure 14. The results show that 2ASK, 4ASK, 2FSK, and 4FSK can be classified well by our method, while BPSK, QPSK, and 16QAM are seriously confused. The results are in line with our expectations, since the shapes of MPSK and MQAM in spectrograms are quite similar, while those of the other signals are distinguishable. In consideration of these results, we merge the signal types BPSK, QPSK, and 16QAM into "(PSK/QAM)." To further evaluate the classification performance on 2ASK, 4ASK, 2FSK, 4FSK, and (PSK/QAM), we compare our method with two classic SC methods:
(1) High order cumulant (HOC) + SVM [51]: an ML-based algorithm using the SVM with seven HOCs as the feature vector: $C_{20}$, $C_{21}$, $C_{40}$, $C_{41}$, $C_{42}$, $C_{60}$, and $C_{63}$.
(2) IQ waveform + CNN [47]: a DL-based algorithm using a 4-layer network of two CNN layers and two fully connected layers, with the signal IQ waveform as features.
The inputs of these methods are baseband signals. Figure 15 shows the average classification accuracy of the three methods versus SNR. The results show that CLN is more robust at a low SNR, partly because the spectrogram features effectively present the characteristics of the tested signal types and partly because the DL algorithm can learn deeper and richer features. Considering the deviation of the carrier frequency estimated by CLN or by the Fourier transform in practice, we evaluate the influence of the frequency offset on classification performance. We calculate the average accuracy at 5 dB SNR versus the frequency offset normalized by the symbol sampling frequency, and the results are shown in Figure 16. An increasing frequency offset has an obvious effect on the performance of all methods, and at a large offset none of the methods remains suitable. However, our method still shows stronger robustness.
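For readers unfamiliar with the HOC features used by the first baseline, the sketch below estimates the second- and fourth-order cumulants from samples using the standard moment relations for zero-mean complex signals. It is not the implementation of [51]; the sixth-order terms $C_{60}$ and $C_{63}$ are omitted, and the function names are hypothetical.

```python
def moment(x, p, q):
    """Sample estimate of M_pq = E[x^(p-q) * conj(x)^q]."""
    return sum(v ** (p - q) * v.conjugate() ** q for v in x) / len(x)


def cumulants(x):
    """Second- and fourth-order cumulants of a zero-mean complex sequence."""
    m20, m21 = moment(x, 2, 0), moment(x, 2, 1)
    m40, m41, m42 = moment(x, 4, 0), moment(x, 4, 1), moment(x, 4, 2)
    return {
        "C20": m20,
        "C21": m21,
        "C40": m40 - 3 * m20 ** 2,
        "C41": m41 - 3 * m20 * m21,
        "C42": m42 - abs(m20) ** 2 - 2 * m21 ** 2,
    }


# Noise-free, unit-power constellations used as a sanity check.
qpsk = [complex(1, 0), complex(0, 1), complex(-1, 0), complex(0, -1)] * 8
bpsk = [complex(1, 0), complex(-1, 0)] * 16
c_qpsk = cumulants(qpsk)
c_bpsk = cumulants(bpsk)
```

For these ideal constellations the estimates reproduce the well-known theoretical values, for example $C_{40} = 1$, $C_{42} = -1$ for QPSK and $C_{40} = C_{42} = -2$ for BPSK, which is what makes such features usable for separating modulation types.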

Parameter Tuning.
We tune several important hyperparameters of our neural network, including the number of up-convolution layers, the channels of the convolutional stages and up-convolutions, and the channels of the first CNN layer in the STEM. Specifically, we tune the parameter under study while keeping the others fixed and plot the mAP$_{50}$ curves versus SNR. Figure 17 shows the performance for different numbers of up-convolution layers. The number of up-convolution layers determines the size and channels of the extracted feature map; for example, a single up-convolution layer outputs a feature map of 1/16 of the input size with 64 channels. In the backbone module, the up-convolutions merge the high-level features, which present more general information, with the low-level features. Thus, increasing the number of up-convolution layers brings in more detailed information but also more discrete noise. The results in Figure 17 show that three up-convolution layers give the best performance. Figure 18 shows the performance for different channels of the convolutional stages and up-convolutions. In the backbone module, four forward convolutional stages and three up-convolutions extract and merge the features of different levels. Increasing the channels lets the CNN learn features of more dimensions, but it also increases the space and time usage of the model. We test three channel groups. Following the principle of ensuring accuracy while keeping the model as small as possible, we prefer the setting "Channel: [16,32,64,128]." Figure 19 shows the performance for different channels of the first CNN layer in the STEM. Following the same principle, we prefer the setting "Channel: 32."

Conclusions
In this paper, we exploit the idea of object detection and present a deep convolutional network for multitype SD and SC in wideband spectrograms. We have analyzed the defects of traditional DL-based object detectors for this task and proposed a centerline-based method. Targeting the characteristics of the signals, our method first finds the centerlines of signal regions and then regresses the complete BBoxes and classifies the signal types. In the experiments, we have compared our method with other object detection methods in terms of accuracy, speed, and model size, and we have also carried out sensitivity tests on several channel conditions and manual processing factors. The results indicate that our method has a higher detection mAP with an obvious speed advantage and that it is also more robust under different conditions. In addition, a series of separate classification experiments have shown the good classification capability of our method. Consequently, the proposed method achieves outstanding SD and SC performance in spectrograms while retaining real-time capability, which makes it valuable for engineering applications.
In the future, we will enrich our dataset, including with real received signals, and we will explore more comprehensive features for signal detection and classification.

Data Availability
The simulation data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.