A Control Packet Collision Avoidance Algorithm for Underwater Multichannel MAC Protocols via Time-Frequency Masking

Establishing high-speed and reliable underwater acoustic networks among multiple unmanned underwater vehicles (UUVs) is fundamental to realizing cooperative and intelligent control among different UUVs. However, unlike terrestrial networks, the propagation speed in an underwater acoustic network is only about 1500 m/s, which makes the design of underwater acoustic MAC protocols a major challenge. In multichannel MAC protocols, data packets and control packets are transferred through different channels, which lowers the adverse effects of the acoustic channel and has gradually become a popular topic in underwater acoustic network MAC protocol research. In this paper, we propose a control packet collision avoidance algorithm that uses time-frequency masking to deal with control packet collisions in the control channel. The algorithm exploits the sparsity of noncoherent underwater acoustic communication signals and regards collision avoidance as the separation of a mixture of communication signals from different nodes. We first measure the W-Disjoint Orthogonality of MFSK signals, and the simulation results demonstrate that there exists a time-frequency mask that can separate the source signals from the mixture of communication signals. We then present a pairwise-hydrophone separation system based on deep networks and the location information of the nodes, from which the time-frequency mask can be estimated.


Introduction
Underwater acoustic networks are the key technology for realizing cooperative and intelligent control among multi-UUVs [1,2]. However, compared with terrestrial wireless networks, the propagation velocity in underwater acoustic networks is only 1500 m/s and the available bandwidth is very limited. Moreover, time delay, Doppler spread, and noise interference cannot be avoided either. These adverse factors make MAC protocol design, a key technology of underwater acoustic networking, a major challenge and restrict the improvement of underwater acoustic networks. Presently, most researchers classify underwater acoustic network MAC protocols into three types based on the multiuser access mechanism: contention-free, contention-based, and hybrid. Figure 1 illustrates the existing underwater acoustic network MAC protocols and their categorization [3,4].
Considering the long propagation delay of the underwater acoustic channel, maintaining real-time state between adjacent nodes is difficult. Simple contention-free protocols were therefore the first to be used in underwater acoustic networks, including frequency division multiple access (FDMA), time division multiple access (TDMA), and code division multiple access (CDMA). FDMA separates the available frequency band into different subbands and allocates a specific subband to each node, which is simple and reliable; however, its low bandwidth utilization is a major disadvantage [3][4][5]. To mitigate this problem, orthogonal frequency division multiplexing (OFDM) has been introduced into FDMA; the principle is to choose a proper frequency band for communication according to the communication distance, thereby improving bandwidth utilization [6,7]. To improve channel utilization fundamentally, researchers began to study MAC protocols based on TDMA. The ST-MAC protocol, which translates the multinode time-slot assignment problem into a vertex-coloring problem, was proposed in [8]. Reference [9] puts forward the DSSS protocol, which exploits the transmission delay of the underwater acoustic channel to schedule conflict-free concurrent transmissions. The STUMP protocol, which eases the requirement of time synchronization, was presented in [10]. Different from TDMA, CDMA protocols distinguish users through pseudonoise codes, which gives high channel utilization and a simple algorithm but still cannot avoid the inherent "near-far effect."
Compared with the above-mentioned contention-free protocols, which allocate the channel beforehand, contention-based protocols, which allocate the channel according to the needs of the nodes, achieve higher channel utilization. The essence of a contention-based protocol is channel reservation: nodes reserve channel resources by exchanging handshake messages before initiating data communication [11,12]. Multichannel MAC protocols transmit handshake messages via independent channels, initiate transmissions among multiple node pairs concurrently, utilize the network bandwidth well, and reduce overhead when the network load is heavy [13][14][15]; they have therefore attracted the attention of researchers recently. The prospects of multichannel MAC protocols depend on solving the new problems they introduce, especially the collision problem in the control channel. Zhou et al. from the University of Connecticut adopt joint detection by adjacent nodes to tackle the triple hidden terminal problems typical of multichannel MAC protocols [16].
In this paper, we propose a control packet collision avoidance algorithm that uses time-frequency masking to deal with control packet collisions in the control channel. The algorithm is based on the sparsity of noncoherent underwater acoustic communication signals and regards collision avoidance as the separation of the mixture of communication signals from different nodes. The remainder of the paper is organized as follows. Section 2 briefly discusses the W-Disjoint Orthogonality and the sparsity of MFSK signals; the simulation results demonstrate that there exists a time-frequency mask that can separate the source signals from the mixture of communication signals. Section 3 outlines the proposed separation system, discusses the low-level features we used, and gives details about the deep network, including its structure and training method. Section 4 presents simulation results for the source separation system under different conditions, including different signal-to-noise ratios and bandwidth ratios.

W-Disjoint Orthogonality of the MFSK Signals
MFSK is a classic noncoherent modulation scheme that is considered robust to the complex underwater acoustic channel. Because its bandwidth efficiency is lower than that of coherent modulations such as PSK, MFSK is often not considered a good choice for the physical layer of underwater acoustic networks. However, the lower bandwidth efficiency means that the MFSK signal is sparse in the time-frequency domain. As with the speech signals discussed in [17], MFSK mixtures can therefore be separated into several sources by time-frequency masking.
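As a concrete illustration, the following sketch generates a short M-FSK burst in which only one tone is active at any instant, which is the source of the time-frequency sparsity exploited below. The parameter values (carrier, tone spacing, baud rate) are illustrative choices, not the paper's settings.

```python
import numpy as np

def mfsk_modulate(symbols, M=16, f0=200.0, spacing=40.0, fs=16000, baud=25):
    """Generate an M-FSK passband signal: each symbol selects one of M tones."""
    samples_per_symbol = fs // baud
    t = np.arange(samples_per_symbol) / fs
    return np.concatenate([
        np.cos(2 * np.pi * (f0 + s * spacing) * t) for s in symbols
    ])

# A short random 16-FSK burst; each symbol period occupies a single tone,
# so most T-F points of the spectrogram carry (almost) no energy.
rng = np.random.default_rng(0)
symbols = rng.integers(0, 16, size=8)
x = mfsk_modulate(symbols)
```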
The received signal can be seen as an MFSK mixture when control packets collide, and the sparsity of the MFSK signal in the time-frequency domain offers the potential to resolve such collisions.
In this section, we focus on the W-Disjoint Orthogonality and the sparsity of MFSK signals, showing that there exists a time-frequency mask that can separate the source signals from the MFSK mixture. We consider only MFSK modulation as the physical layer of the underwater acoustic network in this paper.
As with the model of a speech mixture, the model of the MFSK mixture can be written as

x(t) = ∑_{i=1}^{I} s_i(t),  (1)

where s_i(t) is the signal from node i. With the short-time Fourier transform (STFT), we obtain the model of the MFSK mixture in the time-frequency domain:

X(n, k) = ∑_{i=1}^{I} S_i(n, k),  (2)

where n = 1, ..., N and k = 1, ..., K are the indices of the time frame and frequency bin, respectively. Assuming the sources are W-Disjoint Orthogonal, at most one of the I node signals is nonzero at a given (n, k). To separate a node signal from the mixture X(n, k), we create a time-frequency mask for each node and apply it to the mixture to recover the original node signal. With the mask M_i for node i, which is an indicator function, the MFSK signal of node i is recovered via

Ŝ_i(n, k) = M_i(n, k) X(n, k).  (3)

For MFSK signals, however, the W-Disjoint Orthogonality assumption is not strictly satisfied; taking the sparsity of MFSK signals in the time-frequency domain into account, an approximate W-Disjoint Orthogonality holds. To measure the W-Disjoint Orthogonality of a T-F mask, we use the combined performance criteria PSR and SIR proposed by Yilmaz and Rickard in [17]:

PSR_i = ‖M_i S_i‖² / ‖S_i‖²,  (4)

SIR_i = ‖M_i S_i‖² / ‖M_i Y_i‖²,  WDO_i = PSR_i − PSR_i / SIR_i,  (5)

where Y_i = ∑_{j≠i} S_j is the interference for node i. Given the quite small probability that more than two control packets collide in the control channel, we generated a series of MFSK mixtures each containing two node signals and computed the PSR, SIR, and WDO of the T-F mask by the Monte Carlo method. The T-F mask used for source separation is

M_i(n, k) = 1 if |S_i(n, k)| > |Y_i(n, k)|, and 0 otherwise.  (6)

By the definition of W-Disjoint Orthogonality, the T-F mask approaches W-Disjoint Orthogonality as the signal becomes sparser. The sparsity of the MFSK signal is reflected by its bandwidth ratio: the lower the bandwidth ratio, the higher the sparsity.
By the conclusions of [17], a mixture can be demixed by a T-F mask when the WDO value is close to 1. According to the simulation results shown in Figures 2, 3, and 4, we conclude that a T-F mask exists that can separate the sources from the MFSK mixture with high quality when the bandwidth ratio is less than 0.5.
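The WDO measurement above can be sketched as follows, assuming the STFTs of the target source and of the summed interference are available. The mask is the indicator of (6), and WDO = PSR − PSR/SIR as in [17].

```python
import numpy as np

def wdo_measures(S, Y):
    """PSR, SIR, and WDO of the optimal binary mask (Yilmaz & Rickard).
    S: STFT of the target source; Y: STFT of the summed interference."""
    M = (np.abs(S) > np.abs(Y)).astype(float)   # indicator mask, Eq. (6)
    e_s = np.sum(np.abs(M * S) ** 2)            # retained target energy
    e_y = np.sum(np.abs(M * Y) ** 2)            # leaked interference energy
    psr = e_s / np.sum(np.abs(S) ** 2)
    sir = e_s / e_y if e_y > 0 else np.inf
    w = psr - e_y / np.sum(np.abs(S) ** 2)      # equals PSR - PSR/SIR
    return psr, sir, w
```

For perfectly disjoint sources the mask keeps all target energy and no interference, so PSR = WDO = 1.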

The Source Separation System Based on Deep Networks
In this section we outline the proposed separation system, discuss the low-level features we used, and give details about the deep networks, including their structure and training method.

Observation in Time and Frequency Domain and Low-Level Features Used in Deep Networks.
We assume that there are P hydrophones in the node and that the mixture includes MFSK signals from I nodes. The mixture received by hydrophone p is

x_p(t) = ∑_{i=1}^{I} h_{pi}(t) * s_i(t),  (7)

where h_{pi} is the channel impulse response between hydrophone p and node i, and * denotes convolution.
Then, using the short-time Fourier transform (STFT), the mixture is mapped into the time-frequency domain:

X_p(n, k) = ∑_{i=1}^{I} H_{pi}(k) S_i(n, k),  (8)

where n = 1, ..., N and k = 1, ..., K are the time frame and frequency bin indices, respectively, and X_p = F(x_p), S_i = F(s_i), H_{pi} = F(h_{pi}), with F(⋅) denoting the STFT. We use S_{i|p}(n, k) to represent the component of node i in the mixture received by hydrophone p. Thus,

X_p(n, k) = ∑_{i=1}^{I} S_{i|p}(n, k).  (9)

As shown in Section 2, the MFSK signal is approximately W-Disjoint Orthogonal when the bandwidth ratio is less than 0.5. Then, as shown in (10), the mixture X_p received by hydrophone p can be demixed using the T-F masks M_i corresponding to nodes i = 1, ..., I:

Ŝ_{i|p}(n, k) = M_i(n, k) X_p(n, k).  (10)

A natural choice is to use the orientation of the MFSK signals to estimate the corresponding T-F mask of each node with the pairwise hydrophones. The time-frequency mask is thus related to the location information of the current input signals (the channel impulse responses and the array manifold); that is,

M_i = f(h_{1i}, ..., h_{Pi}; array manifold).  (11)

Obviously, the time-frequency mask M_i corresponding to the same node remains the same for the other hydrophone outputs. Once we obtain the mask M_i for node i, we can recover the single-user communication signal of node i through the inverse STFT (ISTFT). The mixture received by the array includes I single-user signals from I nodes at different locations. The probability that a single T-F point belongs to a certain node i can be described by a Gaussian distribution with mean μ_i and variance σ_i², where μ_i and σ_i² can be interpreted as the mean and variance of the direction of arrival (DOA) of the signal coming from node i. By the central limit theorem, over all T-F points, the distribution of the mixture of I node signals can be described by a mixture of I Gaussians, namely, a Gaussian mixture model (GMM). Therefore, we can describe the mixing pattern of the signals through a GMM based on spatial features. The parameters of the GMM can be estimated from the observed dataset; the probability of each T-F point belonging to node 1, node 2, ..., node I can then be estimated, and the source signals recovered through the ISTFT.
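A minimal EM sketch of this GMM step, assuming a per-T-F-point DOA estimate is already available: the posterior responsibilities of the fitted components serve as the probabilistic time-frequency mask. The initialization and the one-dimensional DOA feature are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def gmm_tf_mask(doa, means, stds, weights, iters=50):
    """Fit a K-component 1-D GMM to per-T-F-point DOA estimates via EM and
    return p(node k | T-F point): a probabilistic time-frequency mask."""
    doa = np.asarray(doa, float)
    means, stds, weights = map(np.asarray, (means, stds, weights))
    post = None
    for _ in range(iters):
        # E-step: responsibility of each component for each T-F point
        lik = np.stack([w / s * np.exp(-0.5 * ((doa - m) / s) ** 2)
                        for m, s, w in zip(means, stds, weights)])
        post = lik / lik.sum(axis=0, keepdims=True)
        # M-step: update component means, spreads, and mixing weights
        nk = post.sum(axis=1)
        means = (post @ doa) / nk
        stds = np.sqrt((post * (doa - means[:, None]) ** 2).sum(axis=1) / nk)
        weights = nk / nk.sum()
    return post
```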

Outline of the Source Separation System.
As shown in Figure 5, the inputs to the system are the two-channel MFSK mixtures. We perform the STFT on each channel and obtain the T-F representations of the input signals, X_1(n, k) and X_2(n, k), where n = 1, ..., N and k = 1, ..., K are the time frame and frequency bin indices, respectively. The low-level features, that is, the mixing vector (MV) and the interaural level and phase differences (ILD/IPD), are derived from (12)-(13):

ILD(n, k) = 20 log_10 |X_2(n, k) / X_1(n, k)|,  (12)

IPD(n, k) = ∠(X_2(n, k) / X_1(n, k)),  (13)

where |⋅| takes the absolute value of its argument and ∠(⋅) finds the phase angle. Next, we group the low-level features into B blocks (only along the frequency bins k). Block b includes Q frequency bins ((b − 1)Q + 1, ..., bQ), where Q = K/B. We build B deep networks, each corresponding to one block, and use them to estimate the directions of arrival (DOAs) of the sources. The low-level feature vector input to the deep networks is composed of the MV, IPD, and ILD; that is, ũ(n, k) = [z^T(n, k), IPD(n, k), ILD(n, k)]^T, where z(n, k) is the mixing vector. Through unsupervised learning with sparse autoencoders [18], the deep networks extract high-level features (encoded positional information of the sources), which are used as inputs to the output layer of the networks (i.e., the softmax regression). The output of the softmax regression is a source occupation probability (i.e., the time-frequency mask) for each block of the mixtures; through the ungroup operation, T-F units in the same block are assigned the same source occupation probability. The sources can then be recovered by applying the T-F masks to the mixtures followed by the inverse STFT (ISTFT). The deep networks are pretrained using a greedy layer-wise training method.
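The low-level feature extraction can be sketched as follows for a two-channel STFT pair. The exact normalization of the mixing vector (a unit-norm observation vector per T-F point) is our assumption.

```python
import numpy as np

def low_level_features(X1, X2, eps=1e-12):
    """IPD, ILD, and the normalized mixing vector for the two-channel
    STFTs X1, X2 (complex arrays of shape [frames, bins])."""
    ratio = X2 / (X1 + eps)
    ipd = np.angle(ratio)                         # phase difference per T-F point
    ild = 20.0 * np.log10(np.abs(ratio) + eps)    # level difference in dB
    mv = np.stack([X1, X2], axis=-1)
    mv = mv / (np.linalg.norm(mv, axis=-1, keepdims=True) + eps)
    return ipd, ild, mv
```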

The Deep Networks.
As described at the beginning of this section, we group the low-level features into B blocks and build B individual deep networks with the same architecture to classify the DOAs of the current input T-F point in each block. The deep network used to estimate the T-F mask is composed of a two-layer deep autoencoder and a one-layer softmax classifier.
More specifically, we split the whole space into J ranges with respect to the hydrophones and separate the target and interferers according to the orientation ranges (DOAs with respect to the receiver node) in which they are located. We apply the softmax classifier to perform the classification task; the inputs to the classifier, that is, the high-level features a^(2), are produced by the deep autoencoder. Assuming that the position of the target in the current input T-F point remains unchanged, the deep network estimates the probability p(g = j | ũ(n, k)) that the orientation of the current input sample belongs to orientation index j. With the estimated orientation of each input T-F point (obtained by selecting the index of maximum probability), we cluster the T-F points with the same orientation index to get the probability mask and obtain the T-F mask from the probability mask through the ungroup operation. Note that each T-F point in the same block is assigned the same probability. The number of sources can also be estimated from the probability mask by using a predefined probability threshold, typically chosen as 0.2 in our experiments.
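The ungroup operation and the threshold-based source counting described above can be sketched as follows; the function names and array layouts are our own illustrative choices.

```python
import numpy as np

def ungroup(block_probs, bins_per_block):
    """Expand per-block source-occupation probabilities to a full T-F mask:
    every frequency bin in a block inherits that block's probability."""
    return np.repeat(block_probs, bins_per_block, axis=-1)

def count_sources(prob_masks, threshold=0.2):
    """Declare a node present if its mask exceeds the threshold anywhere.
    prob_masks: array of shape [nodes, frames, bins]."""
    peaks = prob_masks.reshape(prob_masks.shape[0], -1).max(axis=1)
    return int(np.sum(peaks > threshold))
```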

Deep Autoencoder.
An autoencoder is an unsupervised learning algorithm based on backpropagation. It aims to learn an approximation û of the input u. It appears to be learning a trivial identity function, but by placing constraints on the learning process, such as limiting the number of activated neurons, it discloses interesting structure in the data. Figure 6 shows the architecture of a single-layer autoencoder. The difference between a classic neural network and an autoencoder is the training objective: a classic neural network minimizes the difference between the labels of the input training data and the output of the network, whereas an autoencoder minimizes the difference between the input training data and the output of the network. As shown in Figure 6, the output of the autoencoder is û = sigm(Ŵ^(2) a^(1) + b̂^(2)) with a^(1) = sigm(W^(1) u + b^(1)), where sigm(u) = 1/(1 + exp(−u)) is the logistic function, W^(1) ∈ R^{V×D}, b^(1) ∈ R^V, Ŵ^(2) ∈ R^{D×V}, and b̂^(2) ∈ R^D; V is the number of hidden layer neurons and D is the number of input layer neurons, which equals the number of output layer neurons. W^(1) contains the weights of the connections between the input layer neurons and the hidden layer neurons; similarly, Ŵ^(2) contains the weights of the connections between the hidden layer neurons and the output layer neurons. b^(1) is the vector of bias values added to the hidden layer neurons, and b̂^(2) is the corresponding vector for the output layer neurons. Θ refers to the parameter set composed of the weights W and biases b. A neuron v is "active" when its output a_v is close to 1, that is, sigm(W^(1)_v u + b^(1)_v) ≈ 1; an "inactive" neuron has output close to 0, that is, sigm(W^(1)_v u + b^(1)_v) ≈ 0. Here W^(1)_v denotes the weights of the connections between hidden layer neuron v and the input layer neurons, that is, the vth row of the matrix W^(1), and b^(1)_v is the vth element of the vector b^(1), the bias added to hidden layer neuron v. The superscript l in b^(l), W^(l), and a^(l) denotes the lth layer of the deep network.
With the sparsity constraint, most of the neurons are assumed to be inactive. More specifically, a_v = sigm(W^(1)_v u + b^(1)_v) denotes the activation value of hidden layer unit v in the autoencoder. Over the input samples u^(t), the average activation ρ̂_v of unit v is defined as

ρ̂_v = (1/T) ∑_{t=1}^{T} a_v(u^(t)),  (14)

where T is the number of training samples and u^(t) is the tth input training sample. Next, the sparsity constraint ρ̂_v = ρ is enforced, where ρ is a parameter preset before training, typically small, such as ρ = 3 × 10^−3. To achieve the sparsity constraint, we add a penalty term to the cost function of the sparse autoencoder:

∑_{v=1}^{V} KL(ρ ‖ ρ̂_v) = ∑_{v=1}^{V} [ρ log(ρ/ρ̂_v) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_v))].  (15)

The penalty term is essentially a Kullback-Leibler (KL) divergence. The cost function J_sparse(W, b) of the sparse autoencoder can now be written as

J_sparse(W, b) = J(W, b) + β ∑_{v=1}^{V} KL(ρ ‖ ρ̂_v),  (16)

where J(W, b) is the reconstruction cost and β controls the weight of the penalty term. In our proposed system, the cost function J_sparse(W, b) is minimized using the limited-memory BFGS (L-BFGS) optimization algorithm, and the single-layer sparse autoencoder is trained using the backpropagation algorithm.
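The sparse-autoencoder cost with the KL sparsity penalty can be sketched as follows; the quadratic reconstruction term and the absence of a weight-decay term are simplifying assumptions of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_cost(W1, b1, W2, b2, U, rho=3e-3, beta=3.0):
    """Mean reconstruction error of a single-layer autoencoder plus the
    KL-divergence sparsity penalty on the average hidden activations."""
    A1 = sigmoid(U @ W1.T + b1)        # hidden activations, one row per sample
    U_hat = sigmoid(A1 @ W2.T + b2)    # reconstruction of the input
    recon = 0.5 * np.mean(np.sum((U - U_hat) ** 2, axis=1))
    rho_hat = A1.mean(axis=0)          # average activation of each hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return recon + beta * kl
```

In a full system this scalar cost (and its gradient) would be handed to an L-BFGS optimizer, as the text describes.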
After training the single-layer sparse autoencoder, we discard the output layer neurons with their weights Ŵ^(2) and bias b̂^(2) and keep only W^(1) and b^(1). The output of the hidden layer, a^(1), is used as the input samples for the next single-layer sparse autoencoder. Repeating these steps, that is, stacking the autoencoders, we can build a deep autoencoder from two or more single-layer sparse autoencoders. In our proposed system, we use two single-layer autoencoders to build the deep autoencoder. The stacking procedure is shown in the right part of Figure 7.
Many studies on deep autoencoders show that, with a deep architecture (more than one hidden layer), a deep autoencoder can build up more complex representations from the same low-level features, capture the underlying regularities of the data, and improve recognition quality. That is why we use a deep autoencoder in our proposed system.
There are, however, several difficulties associated with obtaining the optimized weights of deep autoencoders. One challenge is the presence of local optima. In particular, training a neural network by supervised learning involves solving a highly nonconvex optimization problem, that is, finding a set of network parameters (W, b) that minimize the training error ‖u − û‖². In a deep autoencoder, the optimization problem is rife with bad local optima, and training with gradient descent no longer works well. Another challenge is the "diffusion of gradients": when backpropagation is used to compute the derivatives, the gradients propagated backwards (from the output layer to the earlier layers of the network) rapidly diminish in magnitude as the depth of the network increases. As a result, the derivative of the overall cost with respect to the weights of the earlier layers is very small, and these weights change slowly under gradient descent. However, if the initial parameters are already close to the optimal values, gradient descent works well. That is the idea of "greedy layer-wise" training, in which the layers of the network are trained one by one, as shown in the left part of Figure 7.
First, we use the backpropagation algorithm to train the first sparse autoencoder (with a single hidden layer), with the inputs themselves serving as the labels. In our proposed system, the input data is the 4-dimensional feature vector ũ. As a result of the first-layer training, we obtain the parameter set Θ^(1) (i.e., W^(1) and b^(1)) of the first-layer sparse autoencoder, and a new dataset a^(1) (i.e., feature I in Figure 9), which is the output of the hidden layer neurons (the activation state of the hidden layer) under this parameter set. Next, we use a^(1) as the input to the second sparse autoencoder. After the second autoencoder is trained, we obtain its parameter set Θ^(2) and the new dataset a^(2) (i.e., feature II in Figure 9) for training the next single-layer network. We repeat these steps up to the last layer (i.e., the softmax regression in our proposed system). Finally, we obtain a pretrained deep autoencoder by stacking all the autoencoders, using Θ^(1) and Θ^(2) as its initial parameters. Feature II is the high-level feature and is used as the training dataset for the softmax regression discussed next.
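The greedy layer-wise data flow can be sketched as follows. The random stand-in trainer only illustrates the stacking; a real system would fit each layer by minimizing the sparse-autoencoder cost with L-BFGS, as the text describes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def greedy_pretrain(U, layer_sizes, train_layer):
    """Greedy layer-wise pretraining: train each encoder on the hidden
    activations of the previous layer, then keep the stacked encoders.
    `train_layer(data, n_hidden)` returns (W, b) for one trained encoder."""
    params, data = [], U
    for n_hidden in layer_sizes:
        W, b = train_layer(data, n_hidden)
        params.append((W, b))
        data = sigmoid(data @ W.T + b)   # feature dataset for the next layer
    return params, data                  # `data` is the high-level feature set

# Illustrative stand-in trainer with random encoder weights.
def random_trainer(data, n_hidden, rng=np.random.default_rng(0)):
    return rng.normal(0.0, 0.1, (n_hidden, data.shape[1])), np.zeros(n_hidden)
```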

Softmax Classifier.
In our proposed system, the softmax classifier, based on softmax regression, is used to estimate the probabilities of the current input T-F point ũ(n, k) belonging to orientation index j, with the high-level features a^(2) extracted by the deep autoencoder as inputs. Softmax regression generalizes classical logistic regression (for binary classification) to multiclass classification problems. Unlike logistic regression, the data label of softmax regression is an integer between 1 and J, where J is the number of classes. More specifically, in our proposed system, for J classes, an M-sample dataset was used to train the bth deep network:

{(a^(2)_(b,1), g_(b,1)), ..., (a^(2)_(b,m), g_(b,m)), ..., (a^(2)_(b,M), g_(b,M))},  (17)

where g_(b,m) is the label of the mth sample a^(2)_(b,m) and is set to j if a^(2)_(b,m) belongs to class j. The architecture of the softmax classifier is shown in Figure 8.
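The classifier's forward pass can be sketched as a numerically stable softmax over linear scores; the parameter matrix Theta stands in for the trained output-layer weights of one deep network and is a hypothetical name of this sketch.

```python
import numpy as np

def softmax_probs(A, Theta):
    """p(g = j | a) for each orientation class j: softmax over linear scores.
    A: high-level features [samples, features]; Theta: [classes, features]."""
    scores = A @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)
```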

Simulation Results

To evaluate the proposed system, we simulated the MFSK communication system described above, taking the single-user communication as the baseline. The sampling rate of our simulation is 16 kHz and the modulation order is 16, without any error-correction coding. The bandwidth of the MFSK modulation/demodulation is 640 Hz, and the bandwidth ratio varies from 0.625 to 0.125, corresponding to bit rates from 400 bits/s to 80 bits/s.
First, suppose that we can obtain the optimal T-F mask, that is, the mask calculated with (6). We then obtain the simulated BER as a function of SNR and bandwidth ratio and compare it with the baseline.
As shown in Tables 1 and 2, the T-F mask method attains a BER similar to the baseline when the bandwidth ratio is less than 0.417. Furthermore, at very low SNR (e.g., SNR = −20 dB), the BER with the T-F mask is even lower than that of the baseline under the same conditions. The reason is that, as can be seen from (6), some frequency points are set to zero by the T-F mask, which objectively raises the SNR of the signal and yields a lower BER. In a practical system, however, accurately estimating the T-F mask at such low SNR is highly challenging, as verified in the later simulation.
Following Section 3, we estimate the T-F mask using the orientation information of the nodes, that is, the DOA of the MFSK signals received at the node. We therefore introduce a time delay τ to simulate the pairwise-hydrophone receiver and divide the horizontal space into 37 sectors, covering −90° to +90° with a step size of 5°. We estimate the T-F mask with the source separation system described in Section 3 and compare its BER performance to the baseline under different SNRs and bandwidth ratios; the results are shown in Table 3. It can be observed that the BER performance of the proposed system is much the same as the baseline when SNR > 20 dB and the bandwidth ratio is less than 0.417, which is consistent with the result obtained with the optimal T-F mask. When SNR < 20 dB, however, the BER performance of the proposed system begins to decline, because the lower SNR of the signals causes a large error in the T-F mask estimation, which degrades the system performance.

Summary
In this paper, we address the problem of control packet collision avoidance that exists widely in multichannel MAC protocols and, on the basis of the time-frequency sparsity of the noncoherent MFSK modulation, separate the sources from the MFSK mixture caused by packet collisions using the T-F masking method. First, we characterize the sparsity of the MFSK signal by its bandwidth ratio and demonstrate the relation between the bandwidth ratio and the PSR, SIR, and WDO through simulation. Then, we build the source separation system based on deep networks and the MFSK communication model, taking single-user MFSK communication as the baseline, and compare the BER performance of the proposed system and the baseline under different SNRs and bandwidth ratios. The simulation results show that, first, the optimal T-F mask attains the same BER performance as the baseline at lower bandwidth ratios; second, the proposed system attains BER performance similar to the baseline at higher SNR; third, the BER performance of the proposed system declines rapidly at lower SNR, because low SNR leads to a greater error in the T-F mask estimation. In future work, we will adjust the structure of the deep networks to improve the performance of the proposed system under low SNR and under the multipath propagation present in the underwater channel.
As a future research topic, it is also worth exploring whether bioinspired computing models and algorithms can be applied to underwater multichannel MAC protocols, such as P systems (inspired by the structure and functioning of cells) [19,20] and evolutionary computation (motivated by Darwin's theory of evolution) [21,22].

Figure 5 :
Figure 5: The architecture of the proposed system using deep neural network based time-frequency masking for blind source separation.

Figure 7 :
Figure 7: The illustration of greedy layer-wise training and stacking. (a) is the procedure of greedy layer-wise training and (b) is the procedure of stacking of sparse autoencoders.

Table 1 :
The BER of the simulation system with the optimum T-F mask.

Table 2 :
The BER of the baseline.

Table 3 :
The BER of the proposed system.