Supervised Contrastive Learning-Based Modulation Classification of Underwater Acoustic Communication

Modulation parameters are of great significance to underwater target recognition. However, because the underwater channel is severe and varies in time and space, most currently proposed intelligent classification networks do not work well in such highly dynamic environments. Based on supervised contrastive learning, an underwater acoustic (UWA) communication modulation classifier named UMC-SCL is proposed. First, the UMC-SCL uses a simple convolutional neural network (CNN) to identify the presence of UWA signals. Then, the UMC-SCL uses ResNet50 as an encoder and updates the network with the supervised contrastive learning loss function, which effectively uses the category information and makes the feature vector distribution of the same category more concentrated. Next, the classifier uses the feature vector output by the encoder to distinguish the final modulation categories. Finally, extensive ocean, pool, and simulation experiments are carried out to verify the performance of the UMC-SCL. Without any prior information, the average classification accuracy for MPSK and MFSK reaches 98.6% at 0 dB and is 6% higher than that of the benchmark algorithm under low SNR.


Introduction
With the development of UWA communication technology, more and more ocean applications are equipped with UWA communication equipment. Modulation classification can be used to explore the influence of ocean multipath and the Doppler effect and to assist more effectively in target identification, signal identification, interference identification, and spectrum management.
In general, conventional modulation classification algorithms can be divided into two categories: likelihood-based and feature-based methods [1]. The likelihood-based method requires a large amount of prior information and computation, which makes it unsuitable for harsh noncooperative UWA communication. In contrast, the feature-based method has gradually become the mainstream approach due to its low computational complexity and its independence from prior information.
Feature-based methods consist of two parts: feature extraction and a classifier. In [2], multiscale reverse dispersion entropy and grey relational degree features are used to improve the classification performance of ship-radiated noise. In [3][4][5], a support vector machine (SVM) is used to distinguish wireless signals. In [6], high-order cumulant features are fed into an SVM based on a mixture kernel function to classify digital signals. Wei et al. [7] use an SVM based on hybrid features, cyclostationarity, and information entropy to classify modulation types including BPSK, QPSK, 2FSK, 4FSK, and MSK. With these methods, the parameter extraction process is complicated, and the capacity is low. Even if more training data are added, the classification performance cannot always be improved [8]. In recent years, deep learning [9][10][11][12][13][14][15][16] has shown excellent performance in image feature extraction, speech recognition, and natural language processing and has been successfully applied to acoustic signal sets [17,18]. However, in the area of modulation classification, it is mainly used in electromagnetic communication.
In [9], long short-term memory (LSTM) is used to classify the modulation schemes for a distributed wireless spectrum sensing network. Li et al. [10] use the I/Q data to classify signals directly through deep neural networks (DNN). In [11,12], the adaptation of deep learning to the complex temporal signal domain is studied, and a CNN-based classifier is first proposed to solve the problem of excessive parameters in DNNs. In [13], AlexNet and GoogLeNet are used to classify the constellations of the signal samples. Huang et al. [14] introduce a novel cascaded CNN that cascades two CNN blocks to identify MPSK and MQAM hierarchically. Wang et al. [15] propose a hierarchical CNN scheme to classify higher-order QAM signals more accurately. Liu et al. [16] combine a CNN with an LSTM architecture in a DNN and increase the accuracy by 13.5% compared with the original CNN. In these classic end-to-end neural networks, cross-entropy is the most widely used loss function for updating the network weights. However, the cross-entropy loss function lacks robustness to noisy labels [19,20] and may yield poor margins [21,22], leading to reduced generalization performance. Traditional end-to-end supervised training methods focus on the final classification accuracy rather than on the quality of the features extracted from the UWA data. As a result, when the signal-to-noise ratio (SNR) becomes low, the accuracy of traditional methods drops sharply. In recent years, the renaissance of contrastive learning has led to major advances in self-supervised representation learning [23][24][25]. When no labels are available, the data are augmented by cropping and flipping, and the encoder is updated through the self-supervised loss function. Although this can alleviate the disadvantages of traditional networks to a certain extent, it cannot learn from the other samples in the same category. As a result, self-supervised contrastive learning methods are not suitable for UWA data with different SNRs.
In this paper, from the perspective of representation learning, we extract highly discriminative features through supervised contrastive learning [26] to support classification in harsh UWA channels and propose a novel classification framework named UMC-SCL. We first distinguish between valid signals and ocean noise with a simple CNN. Then, the supervised contrastive learning module learns from the valid modulation signals and updates the encoder network with the supervised contrastive learning loss function. After this module, the features of the same category are as close as possible, and the features of different categories are as far apart as possible. Therefore, classification can be achieved using only a fully connected layer. Finally, we verify the superiority of the proposed method through extensive ocean, pool, and simulation experiments and use principal component analysis (PCA) to visualize the output features for interpretability. Compared with known traditional supervised networks, the proposed method greatly improves the classification accuracy under low SNR without any prior information or parameter extraction process.
h(t) is the energy-normalized impulse response of the UWA channel, s(t) is the original signal, and n(t) is the ocean noise. a(t) is related to the SNR. Node 1, Node 2, and Node 3 communicate with each other. The listener can intercept their communication signals from the seawater. SL is the emitted sound source level, TL is the propagation loss, and NL is the background noise level [27].

UWA Data Sources.
In order to make the research result more applicable, we have constructed a complete data set through actual ocean experiments, pool experiments, and simulation experiments that are close to the reality.
2.2.1. Ocean Data. The ocean data were collected in Wuyuan Bay, Xiamen, China. As shown in Figure 2, the sound source T_x1 and the receiving hydrophone R_x1 are placed in the shallow sea near the footpath, at a depth of 5 m and with a communication distance of 60 m.
We send and receive signals at four different times of the day. During the experiment, activities such as yachts and fishing boats introduce a lot of man-made noise. In addition, dozens of plank road bridge piers between the sending and receiving ends make the reflection effect more significant.

Pool Data.
Ocean experiments are costly, and data acquisition is difficult. To enrich the dataset, we further conduct pool experiments. The pool is located in the UAC laboratory at Xiamen University. Figure 3(a) is a photo of the pool. The pool has a length of 25 m and a width of 5 m. It is divided into a deep water area (depth = 1.5 m) and a shallow water area (depth = 1.15 m). Figure 3(b) shows the distribution of the transmitter and receivers in the pool experiment. T_x is the sound source, and R_x is the hydrophone. The distances from R_x1, R_x2, and R_x3 to T_x are 3 m, 6 m, and 12 m, respectively, and the depth is 1 m. When T_x sends a signal, the sound rays are attenuated by the water and reflected by the pool wall.
Figure 4 shows the sound ray propagation in a shallow sea channel. The sound rays are reflected by the sea surface and bottom during propagation. Moreover, the speed of sound in seawater changes with temperature, salinity, and water depth, causing the sound rays to be refracted. The speed of sound can be described according to the following formula [27].
where T is the temperature in °C, S is the salinity in ppm, and Z is the depth of seawater in m.
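As a rough illustration of how temperature, salinity, and depth enter the sound speed, the sketch below uses Medwin's simplified empirical formula, a widely used approximation. It is an assumption that this matches the form taken from [27]; the function name `sound_speed` and the sample values are illustrative only, and this particular form expects salinity in parts per thousand.

```python
def sound_speed(temp_c: float, salinity_ppt: float, depth_m: float) -> float:
    """Approximate sound speed in seawater (m/s), Medwin's simplified formula.

    Assumed form (may differ from the paper's Equation (1)):
    c = 1449.2 + 4.6 T - 0.055 T^2 + 0.00029 T^3
        + (1.34 - 0.010 T)(S - 35) + 0.016 Z
    """
    t, s, z = temp_c, salinity_ppt, depth_m
    return (1449.2 + 4.6 * t - 0.055 * t ** 2 + 0.00029 * t ** 3
            + (1.34 - 0.010 * t) * (s - 35.0) + 0.016 * z)


# Illustrative values only: warm shallow water versus colder, deeper water.
c_shallow = sound_speed(temp_c=20.0, salinity_ppt=35.0, depth_m=5.0)
c_deep = sound_speed(temp_c=10.0, salinity_ppt=35.0, depth_m=100.0)
```

In this sketch the temperature term dominates, which is why the colder water at 100 m is slower than the warm surface layer despite the positive depth term; such gradients are what bend the sound rays.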
In UWA communication, the impulse response can be assessed by beam tracing for typical acoustic communication frequencies. The basic path loss of a received signal traveling through the UWA channel is given in [28], where A_0 is a scaling constant, l is the traveling distance of the sound ray, k is the spreading factor, and α is the absorption coefficient, which is closely related to the frequency of the sound waves and can be obtained from Thorp's empirical formula, in which the units of α and f are dB/km and kHz, respectively.
The impulse response of the multipath channel can be expressed as the sum of the transfer functions of the individual paths, where Γ_p, τ_p, and l_p are, respectively, the cumulative reflection coefficient of the surface and bottom, the propagation delay, and the propagation distance of the p-th path. Generally speaking, an ideal surface can be modeled by a reflection coefficient γ_s = −1, while the bottom reflection depends on the grazing angle θ_p associated with the p-th propagation path, on the nominal density ρ and speed of sound c in water (ρ = 1000 kg/m^3 and c = 1500 m/s), and on the density ρ_b and speed of sound c_b (calculated by Equation (1)) in the bottom. The propagation delay of the p-th path can be calculated simply relative to the direct path, where l_0 is the direct distance from the sender to the receiver.
To obtain a tractable, simple channel model, we examine an approximation to this function. Taking p = 0 as the reference path and H_0(f) as the transfer function corresponding to l_0, the transfer function at the receiving end can be further expressed relative to H_0(f).
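The path-loss description above can be made concrete with a short sketch. It assumes the commonly used form A(l, f) = A_0 l^k a(f)^l together with the usual statement of Thorp's empirical formula (α in dB/km, f in kHz); whether these match the exact expressions of [28] is an assumption. The function names and the default spreading factor k = 1.5 are hypothetical choices for illustration.

```python
import math


def thorp_absorption_db_per_km(f_khz: float) -> float:
    """Thorp's empirical absorption coefficient (dB/km) for f in kHz."""
    f2 = f_khz ** 2
    return (0.11 * f2 / (1.0 + f2) + 44.0 * f2 / (4100.0 + f2)
            + 2.75e-4 * f2 + 0.003)


def path_loss_db(distance_m: float, f_khz: float, k: float = 1.5) -> float:
    """Path loss relative to A_0 in dB: spreading term plus absorption term.

    Assumes 10*log10(A / A_0) = k * 10*log10(l) + l_km * alpha(f),
    with l in metres for spreading and in kilometres for absorption.
    """
    spreading = k * 10.0 * math.log10(distance_m)
    absorption = (distance_m / 1000.0) * thorp_absorption_db_per_km(f_khz)
    return spreading + absorption


# Illustrative values: a 10 kHz carrier over the 60 m ocean link distance.
alpha_10k = thorp_absorption_db_per_km(10.0)
loss_60m = path_loss_db(60.0, 10.0)
```

At such short ranges the spreading term dominates and absorption contributes well under 1 dB, so in this sketch the multipath structure rather than absorption shapes the received signal.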

Supervised Contrastive Learning-Based Modulation Classification
A large number of studies have shown that DNNs are superior to SVMs. In the field of UWA modulation classification, applications of DNNs are still scarce, and all of them use end-to-end supervised methods. However, when the SNR becomes low, their accuracy drops sharply. To address this problem, we use supervised contrastive learning to narrow the feature distance within the same category and expand the distance between different categories, so as to improve the classification accuracy of modulation schemes under low SNR.
3.1. Classification System Model. As shown in Figure 5, in Step 1, the signals received by the receiver may be useful signals or useless ocean noise. In Step 2, the input signals are recognized by a simple network of two convolutional layers and a fully connected layer. Conv11 × 32 means that the number of channels is 32 and the size of the convolutional kernels is 11 × 11. If the input signal is a useful signal, it is passed to the supervised contrastive learning module for further classification; if it is ocean noise, it is discarded. In Step 3, the supervised contrastive learning loss function is used to update the backbone network (ResNet50), which extracts features from the UWA data; the features are then fed into the classifier for classification. In this way, the influence of ocean noise can be effectively eliminated, and the classification accuracy at low SNR is significantly improved. The specific network architecture is detailed below.

Backbone Network.
The backbone network of supervised contrastive learning in this paper is ResNet50, a residual CNN with 50 layers. It uses skip connections that feed the output of one layer directly into the input of a later layer, which overcomes the low learning efficiency and stagnating accuracy that otherwise arise as the network becomes deeper. Two other important operations in the network are batch normalization and ReLU. Batch normalization converts the input data to an output distribution with a mean of 0 and a variance of 1 to speed up network optimization. ReLU is a nonlinear activation function. It sets the output of some neurons to 0, which improves sparsity and helps avoid overfitting.
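The three operations named above can be sketched in a few lines of plain Python. This is a toy illustration on a list of numbers, not the ResNet50 implementation: the learnable scale and shift of batch normalization are omitted, and `transform` is a hypothetical stand-in for the convolutional layers inside a residual block.

```python
import math


def batch_norm(values, eps=1e-5):
    """Shift and scale a batch to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values):
    """Set negative activations to 0 and keep positive ones unchanged."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Skip connection: y = ReLU(F(x) + x), with F given by `transform`."""
    fx = transform(x)
    return relu([a + b for a, b in zip(fx, x)])


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
activated = relu([-1.5, 0.0, 2.0])
# If F(x) is zero, the block reduces to ReLU(x): the input passes straight through.
passthrough = residual_block([1.0, -2.0], lambda v: [0.0 for _ in v])
```

The last line shows why skip connections ease the training of deep networks: a block that has learned nothing still forwards its input instead of degrading it.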
In a traditional supervised end-to-end CNN, as shown in Figure 6, the output of the classifier is used as the only indicator to update the network. The most widely used loss function is the cross-entropy loss, in which N is the number of samples and M is the number of label categories. If the true category of sample i is equal to c, then y_ic = 1; otherwise, y_ic = 0. p_ic is the predicted probability that sample i belongs to category c.
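With the symbols defined above, the standard mean cross-entropy is L = -(1/N) Σ_i Σ_c y_ic log(p_ic). The sketch below computes it directly; the 1/N normalization and the function name `cross_entropy` are assumptions, since the paper's exact expression is not reproduced here.

```python
import math


def cross_entropy(probs, labels):
    """Mean cross-entropy: -(1/N) * sum_i sum_c y_ic * log(p_ic).

    probs:  N rows, each holding M predicted class probabilities.
    labels: N true class indices. Because y_ic is one-hot, only the
            true-class probability of each sample contributes.
    """
    eps = 1e-12  # guards against log(0)
    n = len(labels)
    return -sum(math.log(max(p[c], eps)) for p, c in zip(probs, labels)) / n


# Illustrative predictions for two samples and three classes.
confident = cross_entropy([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]], [0, 1])
uncertain = cross_entropy([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]], [0, 1])
```

Both sets of predictions classify every sample correctly, yet the loss differs greatly. The loss only looks at classifier outputs and places no direct constraint on how the features themselves are arranged.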
3.3. Supervised Contrastive Learning. Supervised contrastive learning effectively utilizes the category label information, making feature points from the same category closer to each other than to points from different categories. Unlike self-supervised learning [24], the positive samples are other samples in the same category. As shown in Figure 7, the process is divided into two training stages. The first stage focuses on training the encoder and uses the supervised contrastive learning loss function to update it. The second stage focuses on training the classifier, using the features output by the encoder and the cross-entropy loss function for the update. In self-supervised learning, the function of the two converters is to flip or crop the input picture so that the two newly generated images can be used as positive samples. Due to the high complexity of UWA data, cropping or flipping the time domain signal would destroy its original characteristics. Since the label information is known, supervised contrastive learning takes all the samples from the same class in the batch as positive samples and compares them with the negative samples in the rest of the batch. The loss function becomes the supervised contrastive loss of [26], where i is the blind UWA data and z_i represents the feature generated by the backbone network. z_j represents a feature that comes from the same category as data i, and z_k represents a feature generated by the backbone network from a different category than data i. τ is a scalar temperature parameter larger than 0. ỹ_i is the category label of i. To update the network parameters under the constraint of the loss function, the ...
The classifier in the second stage is a simple fully connected layer. It uses the 2048-dimensional standardized feature output by the encoder to classify the modulation schemes. It should be mentioned that the parameters of the encoder are frozen in the second stage. Therefore, whether the encoder can obtain excellent features after training plays a decisive role.
Algorithm 1 describes the update process of the supervised contrastive learning.
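To make the first-stage objective concrete, the following is a minimal pure-Python sketch of a supervised contrastive loss in the style of [26], in which every other sample of the same class in the batch is a positive for the anchor. It is not the authors' Algorithm 1: the function names, the toy 2-dimensional features, and the temperature value 0.1 are hypothetical, and a real implementation would operate on batches of encoder outputs in PyTorch.

```python
import math


def normalize(v):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(features, labels, temperature=0.1):
    """Average over anchors i of -1/|P(i)| * sum_{j in P(i)} log softmax_j,
    where the softmax runs over all samples other than the anchor."""
    z = [normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
                for a in range(n) if a != i}
        max_sim = max(sims.values())  # subtract the max for numerical stability
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy batch: two tight clusters of 2-D features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered_loss = supcon_loss(feats, [0, 0, 1, 1])  # labels agree with clusters
mixed_loss = supcon_loss(feats, [0, 1, 0, 1])      # labels cut across clusters
```

The loss is small when same-class features already sit together and large when the labels cut across the clusters, which is the pressure that pulls same-category features of different SNRs toward each other during encoder training.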

Experiments and Results
In this section, the details of the experiments are explained. We also evaluate the modulation classification performance of the proposed method and compare it with existing methods. To analyze the algorithm performance more intuitively, we use PCA to visualize the features and provide interpretability for the proposed method.
Table 1 shows the parameter settings of the different modulation schemes. The ocean noise was collected in the Wuyuan Bay sea area. After passing through the ocean channel, the pool channel, and the simulation channel, data with the characteristics of multipath fading and Doppler frequency shift are obtained. On this basis, Gaussian white noise at different SNRs is superimposed on the obtained data in MATLAB. In this paper, the intraband SNR is used to evaluate the performance of the proposed algorithm; it is calculated from the sampling frequency and the signal bandwidth.

Wireless Communications and Mobile Computing
Here, F_s is the sampling frequency and B_s is the bandwidth of the signal.
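The noise-superposition step can be sketched as follows. The paper performs it in MATLAB; this Python version is only an illustration. The conversion between full-band and in-band SNR assumes that white noise is spread evenly over 0 to F_s/2, so that only the fraction 2·B_s/F_s of the noise power falls inside the signal band. That relation is an assumption and may differ from the paper's expression. The sampling rate, tone frequency, and bandwidth used here are hypothetical.

```python
import math
import random


def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power = snr_db."""
    sig_power = sum(x * x for x in signal) / len(signal)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def in_band_snr_db(full_band_snr_db, fs_hz, bandwidth_hz):
    """Assumed relation: in-band SNR = full-band SNR + 10*log10(Fs / (2*Bs))."""
    return full_band_snr_db + 10.0 * math.log10(fs_hz / (2.0 * bandwidth_hz))


rng = random.Random(0)
fs, f0 = 48000.0, 6000.0  # hypothetical sampling rate and tone frequency (Hz)
tone = [math.cos(2.0 * math.pi * f0 * n / fs) for n in range(20000)]
noisy = add_awgn(tone, snr_db=-6.0, rng=rng)

# Measure the SNR actually obtained from the added noise samples.
noise = [y - x for x, y in zip(tone, noisy)]
measured_snr_db = 10.0 * math.log10(
    sum(x * x for x in tone) / sum(e * e for e in noise))
```

Under this assumption, a signal occupying 4 kHz of a 24 kHz Nyquist band has an in-band SNR about 7.8 dB higher than its full-band SNR, which is why it matters which definition a reported SNR refers to.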
The aim of Step 2 is to distinguish the ocean noise from the useful signals. The training set consists of 2,000 ocean noise samples and 2,000 modulation signal samples with different SNRs. The corresponding test set has 800 samples per category. In Step 3, the training set for supervised contrastive learning consists of data with different SNRs after noise contamination. For each modulated signal, 550 samples are generated at every SNR from -9 dB to 9 dB in steps of 2 dB, of which 250 are used for training and 300 for testing. Therefore, the training set contains 15,000 samples with different SNRs, and the test set at each SNR contains 1,800 samples.
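The Step 3 dataset sizes quoted above follow from a short calculation, sketched here with the six modulation schemes and the SNR grid from the text; the variable names are illustrative.

```python
# SNR grid from -9 dB to 9 dB in 2 dB steps: -9, -7, ..., 7, 9.
snr_levels_db = list(range(-9, 10, 2))

num_modulations = 6             # BPSK, QPSK, 8PSK, 2FSK, 4FSK, 8FSK
train_per_class_per_snr = 250
test_per_class_per_snr = 300

samples_per_class_per_snr = train_per_class_per_snr + test_per_class_per_snr
total_train = num_modulations * len(snr_levels_db) * train_per_class_per_snr
test_per_snr = num_modulations * test_per_class_per_snr
```

The grid has 10 SNR levels, so 6 × 10 × 250 gives the 15,000 training samples and 6 × 300 gives the 1,800 test samples per SNR.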

Experimental Implements.
In the ocean and pool experiments, an NI USB-6259 Pinout capture card is used to convert the digital signal to an analog signal at the transmitter and the analog signal to a digital signal at the receiver. A JYH500A power amplifier and a Type-2692-0S2 charge amplifier are used to amplify the transmitted signal and the received signal, respectively. A WBT22-1107 transducer, which converts the analog electrical signal to an acoustic signal, is used to send and receive signals in the water. In addition, the experiments are performed on a computing server equipped with an Intel(R) Core(TM) i7-9700K 3.6 GHz CPU and an NVIDIA GeForce RTX 2060 SUPER GPU, using PyTorch, the Python programming language, CUDA 10.1, and cuDNN. The optimizer for ResNet50 is Adam, and the learning rate is 0.05, decaying to 10% of the original learning rate every 30 epochs.
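The learning-rate schedule can be sketched as a step decay. This reading assumes that the rate is multiplied by 0.1 after every 30 epochs, which is one interpretation of the description above; in PyTorch this behaviour corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=30` and `gamma=0.1`. The function name is illustrative.

```python
def step_decay_lr(epoch, base_lr=0.05, step_size=30, gamma=0.1):
    """Learning rate at a given epoch: base_lr * gamma^(epoch // step_size)."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate just before and just after each decay boundary.
schedule = {epoch: step_decay_lr(epoch) for epoch in (0, 29, 30, 59, 60)}
```

Under this assumption the rate stays at 0.05 for epochs 0 to 29, drops to 0.005 for epochs 30 to 59, and to 0.0005 from epoch 60.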

Simulation Results.
The simulation experiment is carried out under the simulated UWA channel. In the noise distinction stage, the difference between ocean noise and useful signals is obvious, especially in the frequency domain. Even when the SNR is -6 dB, the classification accuracy still reaches 100%. This shows that the simple convolutional network of Step 2 can effectively eliminate the influence of ocean noise. For Step 3, Figure 8 gives the classification accuracy of the six modulation schemes. In general, the classification accuracy of the six modulation signals increases with the SNR and reaches an average of 98.84% at 0 dB. When the SNR decreases to -6 dB, 8PSK is the most difficult to recognize, and the confusion between modulation categories is mainly concentrated on QPSK and 8PSK.

Actual Ocean and Pool Experiment Results.
Due to the difficulty and high cost of obtaining ocean data, in the practical experiments we mix pool data with the ocean data to enrich the training set, so that the trained encoder and classifier can better fit the distribution characteristics of UWA data. The result of Step 2 in the practical experiments is the same as in the simulation part above. In Step 3, using the features output by the encoder, the classification accuracy of the single fully connected layer is shown in Figure 9. For MPSK, the information is modulated in the phase, so its characteristics in the time domain are not as obvious as those of MFSK. When the SNR is -6 dB, the average accuracy of MPSK is 79.7%, while MFSK can achieve a high accuracy.
The classification results for the six modulation categories at -6 dB and 0 dB are presented as confusion matrices in Figure 10. For each modulation category, 300 tests are performed. When the SNR is -6 dB, BPSK, 2FSK, 4FSK, and 8FSK achieve high classification accuracy through supervised contrastive learning. However, since QPSK and 8PSK are relatively similar in modulation phase, they are easily confused. There are 102 QPSK samples mistaken for 8PSK and 70 8PSK samples mistaken for QPSK. When the SNR reaches 0 dB, apart from a slightly larger classification error for QPSK, the recognition accuracy of the other modulation schemes reaches almost 100%.

Accuracy Comparison.
To verify the superiority of the proposed method, its performance is compared with four relevant algorithms from recent years:
(1) Algorithm 1, based on ResNet50 using constellation density as the feature [29]
(2) Algorithm 2, based on AlexNet using a 3-channel image as the feature [13]
(3) Algorithm 3, based on VGGNet using the original gray image as the feature [30]
(4) Algorithm 4, based on SE-Net using features in the time domain, frequency domain, and time-frequency domain [31]
Figure 11 presents the average classification accuracy of the five algorithms versus SNR. The average accuracy is obtained by averaging the classification performance over the six modulation categories. As shown in Figure 11, the following observations can be made:
(1) For all five algorithms, the modulation classification performance improves with increasing SNR
(2) Except for the proposed algorithm, the other four algorithms suffer a sharp drop in classification accuracy when the SNR becomes low
(3) The proposed supervised contrastive learning algorithm adapts well to low-SNR UWA modulation signals and outperforms all other algorithms. When the SNR is -6 dB, the accuracy of the proposed method is 6% higher than that of the benchmark algorithm [29]
4.3.4. PCA for Interpretability. PCA can reduce a set of n-dimensional vectors to k dimensions through an orthogonal transformation. That is, k orthonormal basis vectors are selected, and the original n-dimensional data are represented in this basis. For high-dimensional data, the mean of the input vectors is first set to 0, and the covariance is then used to represent the correlation between vectors a and b. The m n-dimensional vectors {a_1, a_2, ..., a_m} are assembled into the matrix X.

The covariance matrix C is then computed from X. It can be seen that the diagonal of the matrix C contains the variances of the vectors, and the other elements are the covariances between different vectors. Supposing that Y = PX is the original data X projected onto the low-dimensional space, P is the transformation matrix, and D is the covariance matrix of Y, D can be expressed in terms of P and C. For the transformed low-dimensional vectors to retain as much of the original information as possible, they should be uncorrelated with each other; that is, their covariances should equal 0. Therefore, the matrix D should be a diagonal matrix. According to linear algebra, the matrix P should be the eigenvector matrix of C, with its rows arranged from top to bottom in descending order of the corresponding eigenvalues. Selecting the matrix P_k composed of the first k rows of P gives a matrix Y_k of k-dimensional vectors.
Taking k = 3, the high-dimensional features output by the network are presented in a 3-dimensional space. Figure 12 shows a 3-dimensional cross-sectional view of the feature point distributions extracted by different networks. It is easy to see that the features extracted by the supervised contrastive learning method are more discriminative and give a better classification effect under low SNR. When the SNR is -6 dB, the features extracted by ResNet50 [29] overlap. In contrast, for the proposed method, only the features of QPSK, 8PSK, and 8FSK show some overlap; the feature distributions of the other three modulation signals are concentrated and easy to distinguish. Moreover, as the SNR increases, the boundaries between the feature point distributions of different modulation schemes become clearer and clearer.
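The PCA steps described in this subsection can be written compactly in the standard textbook form below. The 1/m normalization and the column arrangement of X are assumptions and may differ in detail from the paper's own equations. Here a_i and b_i denote the i-th of m zero-mean observations of two variables, and the columns of X are the zero-mean vectors a_1, ..., a_m.

```latex
\begin{aligned}
\operatorname{Cov}(a, b) &= \frac{1}{m} \sum_{i=1}^{m} a_i b_i, \\
X &= \begin{bmatrix} a_1 & a_2 & \cdots & a_m \end{bmatrix} \in \mathbb{R}^{n \times m}, \\
C &= \frac{1}{m} X X^{\mathsf{T}}, \\
D &= \frac{1}{m} Y Y^{\mathsf{T}} = \frac{1}{m} (PX)(PX)^{\mathsf{T}} = P C P^{\mathsf{T}}, \\
Y_k &= P_k X .
\end{aligned}
```

Because C is symmetric, it has an orthonormal set of eigenvectors. Choosing these as the rows of P makes D = P C P^T diagonal with the eigenvalues on its diagonal, and keeping the k rows with the largest eigenvalues retains the directions of greatest variance.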

Conclusion
In this paper, we are the first to propose a modulation classification scheme based on supervised contrastive learning. First, the useful signals and ocean noise are distinguished in the first module. Second, the ResNet50 encoder in the supervised contrastive learning module learns from the input UWA data under the guidance of the supervised contrastive learning loss function, which is used to update the network. In this way, the distance between feature vectors of the same category but with different SNRs is minimized, and the distance between feature vectors of different categories is expanded as much as possible. Then, the classifier recognizes the modulation scheme from the features output by the encoder. Finally, the ocean, pool, and simulation experimental results verify the superiority of the proposed method. Compared with existing research, the experimental verification in this paper is more complete. The proposed method eliminates the complex parameter extraction process and does not require any prior information. When the SNR is 0 dB, the average accuracy reaches 98.6%. Compared with the benchmark algorithm, the accuracy at -6 dB is improved by 6%. Moreover, we use PCA to visualize the feature distribution, which makes the advantage of the proposed algorithm intuitive to analyze.

Data Availability
The data used to support the findings of this study were supplied by Daqing Gao under license and so cannot be made freely available. Requests for access to these data should be made to Daqing Gao (dqgao@stu.xmu.edu.cn).

Conflicts of Interest
The authors declare that they have no conflicts of interest.