A Fast Imbalanced Binary Classification Approach to NLOS Identification in UWB Positioning

Non-line-of-sight (NLOS) propagation is an important factor affecting the positioning accuracy of ultra-wideband (UWB) systems. To mitigate the NLOS ranging error caused by various obstacles in the UWB ranging process, some scholars have in recent years applied machine learning methods such as the support vector machine and support vector data description to the identification of NLOS signals. The identification of NLOS signals is therefore of great significance in UWB positioning. Traditional machine learning methods assume that the numbers of line-of-sight (LOS) and NLOS signal samples are balanced. In reality, however, the number of LOS signals in UWB positioning is much larger than the number of NLOS signals, so the samples are class-imbalanced. In response to this fact, we apply a fast imbalanced binary classification method based on moments (MIBC) to identify NLOS signals. The method uses the first two moments of the LOS signal samples, that is, the mean and covariance, to represent their probability distribution and then uses this distribution together with all of the small number of NLOS signal samples to build a model. The method does not depend on the number of LOS signals and is suitable for classification problems in which the numbers of LOS and NLOS signals are imbalanced. Numerical simulations verify that the method performs better than LS-SVM and SVDD.


Introduction
UWB technology is characterized by extremely narrow pulses and extremely wide bandwidth, which satisfies the positioning accuracy requirements of complex indoor environments. However, these characteristics are a double-edged sword and lead to many technical problems in practice [1][2][3][4][5], such as signal reception, multiuser interference, multipath effects, and NLOS propagation. NLOS propagation in particular has a crucial influence on high-resolution positioning systems [6,7]. First, direct-path signals may be detected incorrectly because of the multipath effects caused by NLOS propagation. Second, the propagation of UWB signals through a medium introduces additional delay, so that TOA estimates are larger than the true values, that is, the range estimates are positively biased. It is worth noting that NLOS conditions occur in many real environments, such as closed rooms, dense urban buildings, and complex environments under tree canopy. UWB positioning systems are intended to work in exactly such environments, where GPS or cellular network signals cannot reach. Therefore, it is necessary to identify NLOS signals in order to mitigate the NLOS error of the positioning system [8][9][10].
The purpose of NLOS identification is to detect the existence of NLOS conditions between the transmitter and receiver and ultimately to mitigate NLOS errors. Traditional research on NLOS identification therefore focuses on analyzing the statistical characteristics of the signals in UWB positioning. There are mainly two types of methods. The first type analyzes the statistical characteristics of TOA estimates. The method in [11] identifies NLOS from the variance of the TOA estimates and declares NLOS when the variance exceeds a set threshold. Another approach combines the error distribution of RSS estimates with the error distribution of TOA estimates and improves identification by adding this information [12]. However, the shortcomings of such methods are obvious: the statistics require many repeated measurements, which is time-consuming, and the randomness of the estimation error is relatively large. The second type of method directly analyzes the statistical characteristics of the received signals [8][9][10], that is, the channel characteristics. Typical characteristics include signal energy, pulse rise time, kurtosis, RMS delay spread, and mean excess delay (MED). Because the statistical parameters of these characteristics differ under NLOS and LOS conditions, the propagation condition can be determined by applying thresholds to them. It is also possible to combine all the characteristics, with corresponding weights, to classify the received signals.
Notably, [9] uses a nonparametric identification method that relies only on UWB waveform measurements to identify NLOS signals. When both LOS and NLOS conditions exist, a nonparametric support vector machine is used for NLOS identification, which does not require statistical characterization of the waveforms. In the later work [10], two nonparametric regression techniques were proposed that estimate the NLOS error using only the waveform of the received signal and the estimated distance. Traditional machine learning techniques achieve high prediction accuracy under the assumption that the numbers of samples in the different classes are roughly equal. They tend to ignore the minority class, because even if all test samples are assigned to the majority class, a high recognition rate is still obtained. In some cases, however, the few minority-class samples are more important than the majority-class samples, and the consequences of misclassifying them are more serious. In a real environment, most of the received signals are LOS signals, while NLOS signals are few. Our goal is therefore to identify the small number of NLOS signals, which are very important for localization in NLOS environments. This is a typical class-imbalance classification problem, which traditional machine learning methods such as LS-SVM do not solve effectively. The recent work [8] uses support vector data description (SVDD) to transform the LOS recognition problem into a one-class classification problem. The method establishes a spherical boundary with center u and radius r that contains as many LOS signals as possible while keeping its volume as small as possible, thereby separating NLOS and LOS signals. On the one hand, the computational complexity of the algorithm is still high, although it saves the time needed to train on NLOS signal samples compared with LS-SVM. On the other hand, the method also classifies other interference signals as NLOS signals, which is not appropriate, because only NLOS signals can be used for ranging and positioning in a completely NLOS environment.
From the above analysis, NLOS identification is in fact a class-imbalance classification problem, and class-imbalance learning is therefore an effective way to solve NLOS identification in UWB positioning. This paper applies a new approach, called moment-based imbalanced binary classification (MIBC), to identify NLOS signals in a UWB positioning system. Compared with the LS-SVM method, a class-imbalance classification method is more practical, and our method achieves better classification performance. Moreover, compared with SVDD, MIBC further reduces the computational burden while maintaining good classification performance and avoiding the misclassification of interference signals as NLOS signals. To the best of the authors' knowledge, this is the first time that NLOS signal identification has been treated as a class-imbalance classification problem. We use the datasets from [9] to verify the algorithm and compare it with the LS-SVM method in [9] and the SVDD method in [8]. Experimental results show that when the numbers of LOS and NLOS signals are seriously imbalanced, MIBC performs better than LS-SVM and SVDD.
The rest of this article is arranged as follows. Section 2 states the problem. Section 3 describes the proposed technique for NLOS identification. Section 4 presents simulation verification based on measured datasets from the indoor positioning competition organized by MIT. Section 5 concludes the paper.

Problem Statement
A typical UWB positioning system mainly consists of two types of nodes: anchor nodes, whose locations are known, and terminal devices, whose locations are unknown and must be determined from the surrounding anchor nodes. To describe the problem conveniently, we assume that there are M terminal devices with unknown locations and S anchor nodes with known locations participating in the positioning. Let $\mathbf{x}_m$, $m = 1, 2, \ldots, M$, denote the location of terminal device $m$, and let $\mathbf{y}_s$, $s = 1, 2, \ldots, S$, denote the location of anchor node $s$.
Under the above assumptions, the ranging error model between a device at $\mathbf{x}$ and anchor node $s$ is
$$r_s = \|\mathbf{x} - \mathbf{y}_s\| + b_s + v_s, \tag{1}$$
where $\|\mathbf{x} - \mathbf{y}_s\|$ is the actual distance between the device and the anchor node, $b_s$ is the NLOS error, and $v_s$ is the measurement error. Let $L$, with $0 \le L \le S$, denote the number of paths with NLOS errors between the device and all surrounding anchor nodes. When $L = 0$, no range estimate is positively biased by NLOS error, and $r_s = \|\mathbf{x} - \mathbf{y}_s\| + v_s$, $s = 1, \ldots, S$. When $L = S$, all paths contain the positive biases caused by NLOS errors, and (1) holds with $b_s > 0$ for every $s$.
The basic idea of this paper has two aspects. First, the waveform features of the received signal differ under LOS and NLOS conditions. These features are extracted and used to identify whether the ranging error is an NLOS error. The typical waveform features in UWB positioning have been given in [9]: the energy of the received signal, the maximum amplitude of the received signal, the rise time, the mean excess delay (MED), the root mean square delay spread (RMS-DS), and the kurtosis. Second, because the number of LOS signals in a real environment is much larger than the number of NLOS signals, the task is a typical imbalanced binary classification problem. An imbalanced binary classification method matches the actual situation better than LS-SVM classification, which assumes balanced samples. Compared with SVDD, our method not only reduces the misrecognition rate of the very important NLOS signals and achieves better classification performance, but also has lower complexity.
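The ranging model can be illustrated with a short simulation. The following Python sketch is only an illustration under stated assumptions: the function name `simulate_ranges`, the uniform distribution of the NLOS bias, the Gaussian measurement noise, and all numeric defaults are hypothetical choices, not values taken from this paper.

```python
import numpy as np

def simulate_ranges(device, anchors, nlos_mask, noise_std=0.05, bias_max=1.0, rng=None):
    """Simulate ranges r_s = ||x - y_s|| + b_s + v_s for one device.

    nlos_mask[s] is True when the link to anchor s is NLOS.
    The bias and noise distributions are assumptions made for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    anchors = np.asarray(anchors, dtype=float)
    device = np.asarray(device, dtype=float)
    # Actual distance between the device and each anchor node.
    true_dist = np.linalg.norm(anchors - device, axis=1)
    # NLOS error: non-negative, present only on NLOS links (assumed uniform).
    bias = np.where(nlos_mask, rng.uniform(0.0, bias_max, size=len(anchors)), 0.0)
    # Measurement error: zero-mean Gaussian (assumed).
    noise = rng.normal(0.0, noise_std, size=len(anchors))
    return true_dist + bias + noise, true_dist

anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
nlos_mask = np.array([False, True, False, True])   # L = 2 NLOS paths out of S = 4
ranges, true_dist = simulate_ranges([3.0, 4.0], anchors, nlos_mask,
                                    rng=np.random.default_rng(0))
print(np.round(ranges - true_dist, 3))   # per-link ranging error
```

With the noise switched off (`noise_std=0.0`), the LOS links reproduce the true distance exactly and the NLOS links are never shorter than it, which is the positive bias described above.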

Imbalanced Binary Classification for NLOS Identification
In this section, we derive the proposed method mathematically and describe how it identifies NLOS signals from the characteristics of the received signal waveform. These features are the input to the classifier. We first introduce the six signal waveform features used and then derive the new classification method in detail; the derivation supports the validity and accuracy of the method.

Classification: Mathematical Framework
In most classification tasks, the numbers of samples in the classes are rarely exactly equal, but a small difference causes no problem. However, when the class sizes are seriously imbalanced, traditional classification methods fail to meet the classification requirements [13]. Because NLOS signal identification in a UWB positioning system is a typical class-imbalance classification problem, we adopt a new formulation to solve it. In general, the number of LOS signals is much larger than that of NLOS signals, so a large number of LOS signal samples would need to be processed before the model is built. We use the first two moments of the LOS signal samples, that is, the mean and covariance, to represent their probability distribution, and use this distribution together with all of the small number of NLOS signal samples to build the model. Let $(x_i)_{i \in \{1,\ldots,n\}}$ denote a set of $n$ NLOS training samples, and let $\mu$ and $\Sigma$ denote the mean and covariance of the probability distribution of the LOS signal samples. In the following derivation, the covariance matrix $\Sigma$ is assumed to be positive definite. Our goal is to find a hyperplane $(w, b)$ such that all NLOS signal samples are correctly classified:
$$w^T x_i - b \ge 0, \quad i = 1, \ldots, n, \tag{2}$$
while the probability of correctly classifying LOS signal samples is maximized with respect to the distributions with mean $\mu$ and covariance $\Sigma$:
$$\max_{\alpha, w, b} \ \alpha \quad \text{s.t.} \quad \inf_{x \sim (\mu, \Sigma)} P\left(w^T x - b \le 0\right) \ge \alpha, \tag{3}$$
where $x \sim (\mu, \Sigma)$ refers to the class of distributions with mean $\mu$ and covariance $\Sigma$. Combining formulas (2) and (3), our aim is to find an optimal hyperplane that correctly classifies all NLOS signal samples while also correctly classifying the LOS signal samples to the greatest extent possible. From Lemma 1 in [14], the constraint in formula (3) has a geometric representation and can be converted equivalently: for $\alpha \in [0, 1)$,
$$\inf_{x \sim (\mu, \Sigma)} P\left(w^T x - b \le 0\right) \ge \alpha \iff b - w^T \mu \ge \kappa(\alpha) \sqrt{w^T \Sigma w}, \qquad \kappa(\alpha) = \sqrt{\frac{\alpha}{1 - \alpha}}.$$
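The moment-based bound used here can be checked numerically. The sketch below assumes the standard form of the bound, $b - w^T\mu \ge \kappa(\alpha)\sqrt{w^T \Sigma w}$ with $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$, and tests it on Gaussian samples; the Gaussian is only one member of the class of distributions with the given mean and covariance, and the function names and numbers are hypothetical.

```python
import numpy as np

def kappa(alpha):
    """kappa(alpha) = sqrt(alpha / (1 - alpha)); increasing on [0, 1)."""
    return np.sqrt(alpha / (1.0 - alpha))

def worst_case_threshold(w, mu, sigma, alpha):
    """Smallest b for which P(w^T x <= b) >= alpha holds for every
    distribution with mean mu and covariance sigma."""
    return float(w @ mu + kappa(alpha) * np.sqrt(w @ sigma @ w))

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
w = np.array([0.6, 0.8])
alpha = 0.9

b = worst_case_threshold(w, mu, sigma, alpha)
samples = rng.multivariate_normal(mu, sigma, size=20000)
empirical = np.mean(samples @ w <= b)   # fraction of samples on the LOS side
print(f"kappa = {kappa(alpha):.3f}, b = {b:.3f}, empirical P = {empirical:.4f}")
```

Because the bound must hold for the worst distribution in the class, the empirical probability for a Gaussian is expected to be well above the guaranteed level $\alpha$.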
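Using the conclusion of Lemma 1, the optimal hyperplane $(w, b)$ we are looking for is the optimal solution of the following optimization problem:
$$\max_{\alpha, w, b} \ \alpha \quad \text{s.t.} \quad w^T x_i - b \ge 0, \ i = 1, \ldots, n, \qquad b - w^T \mu \ge \kappa(\alpha) \sqrt{w^T \Sigma w}.$$
Considering that the function $\kappa(\alpha)$ is an increasing function on the domain $\alpha \in [0, 1)$, the above problem can be written as
$$\max_{\kappa, w, b} \ \kappa \quad \text{s.t.} \quad w^T x_i - b \ge 0, \ i = 1, \ldots, n, \qquad b - w^T \mu \ge \kappa \sqrt{w^T \Sigma w}. \tag{8}$$
The optimization problem represented by (8) is a convex optimization problem, which has an interesting geometric explanation. Figure 1 shows this geometric explanation.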
As $\kappa$ increases, the ellipse with $\mu$ as its center and the covariance matrix $\Sigma$ as its shape grows until it touches the convex hull of the positive class (the NLOS samples) at a unique intersection point $x^*$. The intersection point $x^*$ is the projection of $\mu$ on the convex set. The tangent line that passes through $x^*$ and is tangent to the ellipse is the hyperplane we are looking for.
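Next, we give the solving process of the hyperplane $(w, b)$ corresponding to formula (8). For a fixed $w$, the best threshold is $b = \min_i w^T x_i$, so $b$ can be eliminated and the constraints of (8) reduce to
$$w^T (x_i - \mu) \ge \kappa \sqrt{w^T \Sigma w}, \quad i = 1, \ldots, n.$$
These constraints are invariant to a positive scaling of $w$, so we can normalize $w$ such that $\kappa \sqrt{w^T \Sigma w} = 1$. Writing $r = \sqrt{w^T \Sigma w}$, we obtain
$$\kappa = \frac{1}{r}. \tag{10}$$
From formula (10), we can see that $\kappa$ is a monotonically decreasing function of $r$, so to obtain the maximum value of $\kappa$ we only need to solve
$$\min_{w} \ \frac{1}{2} w^T \Sigma w \quad \text{s.t.} \quad w^T (x_i - \mu) \ge 1, \quad i = 1, \ldots, n. \tag{11}$$
Formula (11) is in essence very similar to the hard-margin support vector machine and is also an extension of the one-class support vector machine [15,16]. However, there are two important differences. First, we minimize the Mahalanobis distance between the mean $\mu$ of the LOS signal samples and its projection on the convex hull of the NLOS signal samples instead of the $\ell_2$-norm of $w$. Second, we separate the NLOS signal samples from the mean of the LOS signal sample distribution, not from the origin of coordinates. As with the one-class support vector machine, the constraint in formula (11) that all NLOS signals are correctly classified is not realistic; therefore, we add slack variables $\xi_i \ge 0$ to the constraints and obtain the following convex optimization problem:
$$\min_{w, \xi} \ \frac{1}{2} w^T \Sigma w + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad w^T (x_i - \mu) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \tag{12}$$
where $C \in \mathbb{R}_+$ is the penalty factor, which balances the complexity of the model against the number of training samples that are allowed to be misclassified. Formula (12) is our method for classifying imbalanced LOS and NLOS signals; we call it the moment-based imbalanced binary classifier, or MIBC. Introduce the Lagrange multipliers $\lambda \in \mathbb{R}^n$ and $\nu \in \mathbb{R}^n$ with $\lambda_i \ge 0$ and $\nu_i \ge 0$, and define $\tilde{x}_i = x_i - \mu$. The Lagrangian function constructed from (12) is
$$\mathcal{L}(w, \xi, \lambda, \nu) = \frac{1}{2} w^T \Sigma w + C \sum_{i=1}^{n} \xi_i - \sum_{i=1}^{n} \lambda_i \left(w^T \tilde{x}_i - 1 + \xi_i\right) - \sum_{i=1}^{n} \nu_i \xi_i. \tag{13}$$
To obtain the minimum of the Lagrangian function, the partial derivatives with respect to $w$ and $\xi_i$ are set to zero:
$$w = \Sigma^{-1} \sum_{i=1}^{n} \lambda_i \tilde{x}_i, \qquad C - \lambda_i - \nu_i = 0, \quad i = 1, \ldots, n. \tag{14}$$
By substituting the above result and the constraint $C \ge \lambda_i \ge 0$ into formula (13), we obtain the dual problem of maximizing over $\lambda$:
$$\max_{\lambda} \ \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j \tilde{x}_i^T \Sigma^{-1} \tilde{x}_j \quad \text{s.t.} \quad 0 \le \lambda_i \le C, \quad i = 1, \ldots, n. \tag{15}$$
This is a quadratic optimization problem with inequality constraints; according to the Karush-Kuhn-Tucker theorem, the solution of formula (15) must satisfy the following optimality conditions (KKT conditions):
$$\lambda_i \left(w^T \tilde{x}_i - 1 + \xi_i\right) = 0, \qquad (C - \lambda_i)\, \xi_i = 0, \qquad w^T \tilde{x}_i - 1 + \xi_i \ge 0, \qquad \xi_i \ge 0, \quad i = 1, \ldots, n. \tag{16}$$
Through formulas (15) and (16), the SMO algorithm can be used to obtain the optimal solution of the dual coefficients $\lambda$.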
Substituting the optimal dual coefficients into formula (14) gives the weight vector $w$ of the optimal hyperplane; with a support vector $x_{sv} = \tilde{x}_{sv} + \mu$, the hyperplane threshold is then $b = w^T x_{sv}$. After finding the optimal hyperplane $(w, b)$, the following discriminant condition is used:
$$w^T x - b \ge 0.$$
If the condition is satisfied, the output is 1, indicating that the test sample signal is an NLOS signal; if the condition is not satisfied, the output is 0, indicating that the test sample signal is a LOS signal. When LOS signals and NLOS signals are not linearly separable in the original space, we can solve the nonlinear classification problem by introducing a modified kernel function $K(x_i, x_j)$ to map the original data to a high-dimensional space. A modified Gaussian kernel function is used in this paper. As can be seen from (15), the MIBC algorithm first needs to preprocess the LOS signal samples. The cost of calculating the covariance matrix is $O(d^3)$, and the cost of finding the optimal solution to the dual problem is $O(n^3)$. Therefore, the overall complexity of the algorithm is $O(d^3) + O(n^3)$. By contrast, the overall complexity of the SVDD method proposed in [8] is $O(N^3)$. Since the number of LOS signal samples $N$ is much larger than the number of NLOS signal samples $n$, the proposed MIBC method has lower complexity than the SVDD algorithm. In NLOS identification for a UWB indoor positioning system, the receiver classifies a large number of received LOS signals and a small number of NLOS signals.
In essence, this is an imbalanced classification problem: although the number of NLOS signals is very small, they are very important, because misclassifying NLOS signals leads to serious ranging errors. MIBC is a class-imbalance classification method that separates LOS signals from NLOS signals by finding the optimal hyperplane. To reduce the error rate of identifying NLOS signals, we use the mean and covariance matrix of the LOS signal samples together with all of the small number of NLOS signal samples to train the classifier. After training, we can determine whether a test signal is a LOS signal or an NLOS signal.
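The training procedure of this section can be sketched in a few lines of Python. This is a minimal linear sketch and not the authors' MATLAB implementation: it assumes the soft-margin form "minimize $\frac{1}{2} w^T \Sigma w + C \sum_i \xi_i$ subject to $w^T (x_i - \mu) \ge 1 - \xi_i$", and it replaces SMO with plain dual coordinate ascent, which is possible here because the dual has only box constraints. The function names, the value of `C`, and the synthetic Gaussian data are all hypothetical.

```python
import numpy as np

def fit_mibc(los, nlos, C=10.0, epochs=2000, tol=1e-10):
    """Train a linear MIBC-style classifier (illustrative sketch).

    los  : (N, d) LOS samples, used only through their mean and covariance.
    nlos : (n, d) NLOS samples, used individually.
    Returns (w, b); a sample x is labelled NLOS when w @ x - b >= 0.
    """
    mu = los.mean(axis=0)
    sigma = np.cov(los, rowvar=False)             # assumed positive definite
    x_tilde = nlos - mu                           # NLOS samples centred on the LOS mean
    sinv_xt = np.linalg.solve(sigma, x_tilde.T)   # Sigma^{-1} x_tilde^T, shape (d, n)
    K = x_tilde @ sinv_xt                         # K[i, j] = x_i~^T Sigma^{-1} x_j~
    lam = np.zeros(len(nlos))                     # dual coefficients, 0 <= lam_i <= C
    for _ in range(epochs):
        max_step = 0.0
        for i in range(len(lam)):
            grad = 1.0 - K[i] @ lam               # partial derivative of the dual objective
            new = min(max(lam[i] + grad / K[i, i], 0.0), C)
            max_step = max(max_step, abs(new - lam[i]))
            lam[i] = new
        if max_step < tol:
            break
    w = sinv_xt @ lam                             # w = Sigma^{-1} sum_i lam_i x_i~
    b = float(w @ mu + 1.0)                       # equals w @ x_sv for a margin support vector
    return w, b

def predict_nlos(w, b, x):
    """Return 1 for samples classified as NLOS and 0 for LOS."""
    return (np.atleast_2d(x) @ w - b >= 0).astype(int)

rng = np.random.default_rng(1)
los = rng.normal(0.0, 1.0, size=(500, 2))         # synthetic majority class
nlos = rng.normal(4.0, 0.5, size=(30, 2))         # synthetic minority class
w, b = fit_mibc(los, nlos)
print("LOS false-alarm rate:", predict_nlos(w, b, los).mean())
```

Only the 30 NLOS samples enter the dual problem, so its size is independent of the number of LOS samples; the LOS data contribute a single mean vector and one $d \times d$ covariance matrix.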

Identification Performance
To verify the effectiveness of the proposed MIBC method, we designed three experimental schemes based on the datasets of received signal waveforms, implemented in MATLAB 2013. To show the performance of MIBC more intuitively, we compared it with the LS-SVM method in [9] and the SVDD method in [8]. We also examined the performance of LS-SVM and MIBC for different ratios between the two classes. Finally, with the ratio of LOS to NLOS training samples fixed at 1:0.06, we examined how the classification performance changes as the sizes of both datasets are increased in this proportion. To evaluate the classification performance of the three algorithms, we used k-fold cross-validation. With k-fold cross-validation, LS-SVM is trained on k-1 groups of samples from both the LOS and NLOS signals. SVDD is trained on only k-1 groups of LOS samples; the remaining LOS group, together with an equal number of randomly selected NLOS signals, is used as validation data. MIBC is trained on k-1 groups of samples from the NLOS dataset and on the mean and covariance matrix of k-1 groups of samples from the LOS dataset.
Datasets and their processing: we adopted the datasets used in [8,9] to carry out the experiments and compared the results of the three methods. The datasets comprise 1024 sets of measurement data, of which 512 are LOS signals and 512 are NLOS signals. A log function is used to transform the feature values. Note that before using the MIBC method, the mean and covariance matrix of the features of the LOS training samples must be calculated.
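The way MIBC consumes the folds in k-fold cross-validation can be sketched as follows. This Python sketch is an illustration only: the helper names `kfold_indices` and `mibc_folds` are hypothetical, the arrays are random placeholders rather than the measured datasets, and holding out one fold of each class for validation is an assumption about the protocol.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    """Split a shuffled index range into k nearly equal folds."""
    return np.array_split(rng.permutation(n_samples), k)

def mibc_folds(los, nlos, k, rng):
    """Yield (mu, sigma, nlos_train, los_val, nlos_val) for each of the k folds.

    Training uses k-1 NLOS folds sample by sample, but only the mean and
    covariance of k-1 LOS folds; the held-out folds are used for validation.
    """
    los_folds = kfold_indices(len(los), k, rng)
    nlos_folds = kfold_indices(len(nlos), k, rng)
    for j in range(k):
        los_train = np.concatenate([f for i, f in enumerate(los_folds) if i != j])
        nlos_train = np.concatenate([f for i, f in enumerate(nlos_folds) if i != j])
        mu = los[los_train].mean(axis=0)
        sigma = np.cov(los[los_train], rowvar=False)
        yield mu, sigma, nlos[nlos_train], los[los_folds[j]], nlos[nlos_folds[j]]

rng = np.random.default_rng(0)
los = rng.normal(size=(500, 3))     # placeholder feature vectors, not measured data
nlos = rng.normal(size=(30, 3))     # ratio 1:0.06, as in the imbalanced experiments
for mu, sigma, nlos_train, los_val, nlos_val in mibc_folds(los, nlos, 10, rng):
    pass                            # train on (mu, sigma, nlos_train), score the held-out folds
print(sigma.shape, len(nlos_train), len(los_val), len(nlos_val))
```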
Experimental design: the kernel functions used in the experiments are all Gaussian radial basis function kernels with kernel parameter 5. Because recognition accuracy as used in traditional machine learning cannot evaluate the performance of an algorithm in class-imbalance learning, we follow [8] and use the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) as evaluation criteria. The ROC curve reflects the trade-off between the recognition rate of the majority class and the false recognition rate of the minority class. If one ROC curve lies to the upper left of another, the classifier corresponding to the former performs better than that corresponding to the latter. The AUC is the area under the ROC curve; the higher the AUC, the better the performance of the corresponding classifier.
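For reference, the two evaluation criteria can be computed from classifier scores as in the following Python sketch. It is a generic illustration, not the evaluation code used in the experiments; treating NLOS as the positive label (1) and assuming that no two samples share the same score are choices made here for simplicity.

```python
import numpy as np

def roc_curve(scores, labels):
    """Return (fpr, tpr) obtained by sweeping the decision threshold from high to low.

    labels: 1 for the positive class (assumed here to be NLOS), 0 for the other class.
    Tied scores are not grouped, so the sketch assumes all scores are distinct.
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels == 1)                    # true positives at each threshold
    fp = np.cumsum(labels == 0)                    # false positives at each threshold
    tpr = np.concatenate([[0.0], tp / max(tp[-1], 1)])
    fpr = np.concatenate([[0.0], fp / max(fp[-1], 1)])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
fpr, tpr = roc_curve(scores, labels)
print(auc(fpr, tpr))   # 0.75: three of the four positive/negative pairs are ranked correctly
```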
Performance: in this experiment, nine feature combinations were extracted. The datasets used by the three methods include 500 LOS signals and 500 NLOS signals; the results are given in Table 2. As can be seen from Table 2, the classification performance of all three methods is best under feature combination F. Figure 2 shows the ROC curves of the three methods using 10-fold cross-validation under feature combination F. MIBC is superior to LS-SVM in the region where the NLOS signal acceptance rate is less than 0.07, and LS-SVM is superior elsewhere. Compared with SVDD, MIBC has slightly better classification performance. For Figure 3, under feature combination F, LS-SVM and MIBC use 500 LOS samples and 30 NLOS samples, a ratio of 1:0.06, while SVDD still uses only the 500 LOS samples. The ROC curves of the three methods were again obtained with 10-fold cross-validation. MIBC is clearly better than LS-SVM and slightly better than SVDD in classification performance. Table 3 shows the AUC values of LS-SVM and MIBC for different ratios of LOS to NLOS samples.
With the ratio of LOS to NLOS samples set to 1:0.06, we randomly selected 500, 400, 300, and 200 groups of LOS samples and, according to the ratio, 30, 24, 18, and 12 groups of NLOS samples. Using 10-fold, 8-fold, 6-fold, and 4-fold cross-validation, respectively, we obtained the results shown in Figure 4. With the proportion of LOS and NLOS samples unchanged, the classification accuracy of MIBC improves as the numbers of NLOS and LOS samples are increased proportionally.

Conclusion
NLOS identification in UWB positioning is a typical class-imbalance classification problem. Traditional machine learning methods are aimed at balanced datasets and are therefore not suitable for the real scenario. In addition, there are many differences between the waveform features of NLOS and LOS signals. Based on these two points, we propose a new imbalanced binary classification method for NLOS identification in UWB positioning systems, which we call MIBC. We compare the LS-SVM, SVDD, and MIBC methods. The experimental results show that MIBC performs better on imbalanced datasets and that its complexity is lower than that of the other two methods. In the next step, we will study how to mitigate the error caused by NLOS signals, because identifying NLOS signals is only the first step toward mitigating that error.

Figure 3: The ROC curves for the best cases in the LS-SVM, SVDD, and MIBC algorithms when the ratio of LOS to NLOS samples is 1:0.06.

Figure 4: Illustration of how the performance of the MIBC detectors varies with the number of NLOS and LOS samples. Set F is utilized and the ratio of LOS to NLOS samples is 1:0.06.
Feature Selection. As mentioned in [8][9][10], we have chosen the following signal waveform characteristics. The energy attenuation of NLOS signals is much more severe than that of LOS signals because of shielding and blocking by obstacles, resulting in significant differences in the amplitude and energy of the two kinds of signals. The kurtosis reflects the peakedness of the distribution of the data: the kurtosis of LOS signals is larger, and that of NLOS signals is smaller. The mean excess delay and the root mean square delay spread reflect the delay characteristics of the multipath components; generally speaking, both are larger for NLOS signals than for LOS signals. Under the LOS condition, the rise time of the received signal is shorter, while under the NLOS condition it is longer. Table 1 lists the waveform features we used; the values of the parameters in the formulas are the same as those in [8][9][10].
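The six waveform features can be computed from a sampled waveform roughly as follows. This Python sketch uses definitions that are common in the UWB NLOS-identification literature and may differ in detail from Table 1; in particular, the rise-time thresholds `alpha=6.0` and `beta=0.6`, the function name, and the toy waveform are assumptions made for illustration.

```python
import numpy as np

def waveform_features(r, dt, noise_std, alpha=6.0, beta=0.6):
    """Six waveform features of a sampled received signal r with sample period dt.

    alpha and beta are assumed rise-time thresholds: the rise time runs from the
    first sample with |r| >= alpha * noise_std to the first with |r| >= beta * max|r|.
    If a threshold is never reached, np.argmax returns index 0 for that crossing.
    """
    a = np.abs(np.asarray(r, dtype=float))
    t = np.arange(len(a)) * dt
    energy = np.sum(a**2) * dt                     # energy of the received signal
    r_max = a.max()                                # maximum amplitude
    t_low = t[np.argmax(a >= alpha * noise_std)]
    t_high = t[np.argmax(a >= beta * r_max)]
    rise_time = t_high - t_low
    psi = a**2 / np.sum(a**2)                      # normalised power profile, sums to 1
    med = np.sum(t * psi)                          # mean excess delay
    rms_ds = np.sqrt(np.sum((t - med)**2 * psi))   # RMS delay spread
    kurtosis = np.mean((a - a.mean())**4) / a.std()**4
    return {"energy": energy, "max_amplitude": r_max, "rise_time": rise_time,
            "med": med, "rms_ds": rms_ds, "kurtosis": kurtosis}

# Toy symmetric pulse, used only to exercise the function.
features = waveform_features([0, 0, 1, 2, 4, 2, 1, 0], dt=1.0, noise_std=0.1)
print(features)
```

A log transform of these values, as used in the experiments, would be applied afterwards and is omitted from the sketch.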

Table 3: The AUC values of the two methods for different ratios of LOS to NLOS training samples.
Figure 2: The ROC curves for the best cases in the LS-SVM, SVDD, and MIBC algorithms. For LS-SVM, both LOS and NLOS signals are needed for training, 500 of each. For SVDD, only 500 LOS signals are needed for training. For MIBC, the mean and covariance matrix of the features of 500 LOS signals are calculated first, and then 500 NLOS signals are used for training.