Linear Regression Algorithm against Device Diversity for the WLAN Indoor Localization System



Introduction
In recent years, daily life has become increasingly convenient owing to the development of 5G, smart cities, and the Internet of Things [1,2], and 5G has greatly improved the network connectivity and spectrum efficiency of the Internet of Things [3]. As a major part of smart cities, location-based services (LBS) are attracting an enormous amount of attention [4,5]. Furthermore, the explosive growth of intelligent mobile devices has promoted research on LBS technology [6]. Typical LBSs include human navigation in unfamiliar environments, robot path planning and guidance, health care inside modern hospitals, location-based enhanced sensing, and entity and storage tracking and management. All of these services require reliable, accurate, and real-time localization technologies to determine the user's position before the service can begin [7].
In outdoor environments, the Global Navigation Satellite System (GNSS) is the most prominent positioning technology and provides precise positioning. It has been a great success and has brought great convenience to people's lives [8]. However, due to line-of-sight (LOS) limitations, satellite signal receivers cannot be used indoors or in areas where satellites are blocked by tall buildings. Various solutions have been proposed as alternatives to GPS for indoor environments [9]. Since the RADAR system was proposed in [10], the RSS fingerprint-based WLAN localization method has attracted the attention of many researchers due to its cost-effectiveness and availability [11,12]. After years of research and evolution, thousands of access points (APs) have already been deployed in indoor environments, such as campuses, hospitals, airports, and shopping malls, which provides great opportunities for the development of indoor location estimation services. The RSS fingerprint-based WLAN localization system can reuse the existing WLAN infrastructure in indoor environments, which significantly reduces the cost.
Typically, the RSS fingerprint-based WLAN indoor localization system contains two phases: the offline phase and the online phase [10]. In the offline phase, a number of reference points (RPs) are set in the indoor area, and researchers collect RSS values from the existing access points (APs) at every RP. A radio map is then constructed from the collected RSS values and their corresponding geographic coordinates. In the online phase, a user's location is estimated by comparing the RSS values at the current location with those in the radio map [13,14].
The structure of fingerprinting localization systems shows that RSS values are the foundation of positioning. In most existing experimental systems, researchers build the radio map using one mobile device in the offline phase, and the user's location is computed using the same device in the online phase. However, in an actual localization system, this assumption is generally invalid: different users carry different mobile devices in both the offline phase and the online phase. In the offline phase, to reduce the labor and time costs of radio map construction, the crowdsourcing method has been introduced into the indoor localization domain, which brings in a variety of distinct mobile devices [12,14,15]. In the online phase, the mobile devices used to localize the users may differ from those used to establish the radio map. The device diversity problem has been studied in depth over the last decade. The results show that its main cause is the difference in hardware performance, which can make the RSS values collected by different devices differ by more than 25 dB [16][17][18]. As a result, the device diversity problem greatly reduces the positioning accuracy of the crowdsourcing system.
Although establishing a radio map for each device yields the highest positioning accuracy, this method is clearly impractical because of the large number of devices. In [19,20], the linear regression (LR) algorithm is proposed to eliminate device diversity in the crowdsourcing WLAN indoor localization system. However, the descriptions and simulations of the LR algorithm in [19,20] are very brief. Therefore, in this paper, we discuss the proposed method in detail and present complete simulation results. More importantly, the formula for the probability of error detection of the proposed LR algorithm is derived. The device diversity problem and its adverse effects are analyzed first. Then, the LR method is applied to deal with the problem, and the device diversity is greatly diminished by the proposed algorithm. The advantages of this method are that it has low computational complexity, does not need any training period, and runs automatically without user intervention.
The main contributions of this paper are as follows.
(1) The linear relationship between the RSS data collected by different devices is proved. We obtain the RSS points by comparing the RSS vectors collected by different devices, and the slope and the intercept of the straight line determined by any two points are calculated. Since all the slopes and intercepts are equal, all the RSS points are on the same line. Therefore, the relationship of RSS data collected by different devices is linear.
(2) The fast least trimmed squares (FAST-LTS) algorithm is proposed to eliminate the device diversity problem. Since the relationship between RSS values collected by different devices is linear, all the RSS values can be mapped into the same signal space using the linear regression algorithm. Because outliers appear frequently in the collected RSS values and seriously affect the performance of the linear least squares (LLS) algorithm, the FAST-LTS algorithm is used in this paper. Simulation results verify the effectiveness of the proposed algorithm, and all the RSS data are mapped into the same signal space.
(3) We derive the probability of error detection of all fingerprints in the radio map. The derived formula shows that the probability of error detection is greater when two fingerprints are closer. Hence, the fingerprints in the set of candidate nearest neighbor fingerprints contribute most of the error and need to be dealt with carefully.
The rest of the paper is organized as follows. The related works are discussed in Section 2. In Section 3, we state the indoor localization problem and prove the linear relationship between the RSS data collected by different devices. The linear regression method is proposed to solve the device diversity problem in Section 4. Section 5 analyzes the probability of error detection in the indoor localization system. The simulation and experimental results are presented in Section 6. Finally, Section 7 concludes the paper.

Background and Related Works
To handle the device diversity problem, establishing a radio map for each device yields the highest positioning accuracy. However, this method is clearly impractical because of the large number of devices [21]. Therefore, various solutions have been proposed as alternatives.
The device diversity problem was first discussed in [21]. Haeberlen et al. collected RSS values using different mobile devices at the same time and location in the test area, and they repeated this process at different locations. Then, the linear relationship between different RSS values collected by different mobile devices was inferred and used to eliminate the differences caused by different mobile devices. However, when an unknown device was used to find the current location, how to calculate the linear regression coefficients of the device had not been solved. In addition, although the authors in [21] had found the solution to overcome the device diversity problem, the solution was not applied to any localization systems.
Since collecting labeled RSS values for each mobile device is a labor-intensive and time-consuming process, a semisupervised method is proposed in [17] to solve the device diversity problem using a small number of labeled RSS values. In that work, multiple devices are treated as multiple learning tasks. A latent feature space and a regression function are learned at the beginning, and then the signal spaces of all devices are mapped to the latent feature space by the regression function. Accordingly, the differences between devices are significantly reduced, and the positioning accuracy is greatly improved.
In [18], Tsui proposed an unsupervised learning algorithm to solve the problem in the WiFi localization system. In that work, the Pearson product-moment correlation coefficient is first used in the online phase to roughly label the RSS readings collected by an unknown device. Then, different algorithms, such as a regression algorithm and the expectation maximization algorithm, are applied to train the transformation function. In [22], another solution using an unsupervised learning algorithm is proposed to overcome the device diversity problem. In that work, a probabilistic model is built to calculate the RSS values, and kernel estimation with a wide kernel width is used to reduce the difference in probability estimates. Although these methods reported some gain in localization accuracy, the unsupervised learning algorithms do not achieve the desired effect when the sets of APs detected by different devices are different.
Kjaergaard et al. utilized hyperbolic location fingerprinting (HLF) to solve the device diversity problem [23,24]. In the training phase, the radio map is built by the signal strength ratio between two APs instead of an absolute RSS value from a single AP. Since the signal strength ratio is more stable than the absolute RSS value, the localization accuracy is significantly improved. However, the ratio term of the linear transformation function is not the only factor to be considered. If this offset component is significant in the linear relation or the set of the APs detected by different devices is different, this method is expected to fail.
In [25][26][27], three signal strength difference methods are proposed to reduce the impact of device diversity. In [25], Dong et al. used the differences between all possible AP pairs, called DIFF, to build the radio map. In this radio map, the cost of the DIFF method is $O(n^2)$ and may increase dramatically as the number of APs increases. In [26], the signal strength difference (SSD) method subtracts the RSS value of an anchor AP from the other RSS values in the fingerprint. Therefore, each fingerprint contains only $n - 1$ RSS differences, and the dimension of the SSD method is $O(n)$. As a result, the DIFF method achieves higher localization accuracy than the SSD method. Laoudias proposed the mean differential fingerprint (MDF) method in [27], which uses the mean RSS value to calculate the RSS differences that make up the fingerprint. The MDF method maintains the advantages of DIFF and SSD: it achieves positioning accuracy as high as DIFF while keeping the computational overhead similar to SSD.
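To make the three difference-based fingerprints concrete, the following is a minimal sketch rather than the implementation used in [25][26][27]. The function names, the sample readings, and the constant 8 dB offset between the two devices are all assumptions chosen for illustration.

```python
from itertools import combinations

def diff_fingerprint(rss):
    """DIFF: differences between all AP pairs, n*(n-1)/2 values."""
    return [rss[i] - rss[j] for i, j in combinations(range(len(rss)), 2)]

def ssd_fingerprint(rss, anchor=0):
    """SSD: subtract the anchor AP's RSS from every other AP, n-1 values."""
    return [v - rss[anchor] for k, v in enumerate(rss) if k != anchor]

def mdf_fingerprint(rss):
    """MDF: subtract the fingerprint's mean RSS from every AP, n values."""
    mean = sum(rss) / len(rss)
    return [v - mean for v in rss]

# Hypothetical readings (dBm) from four APs at one location.
device_a = [-40.0, -55.0, -63.0, -71.0]
# Assumption: device B differs from device A only by a constant offset.
device_b = [v - 8.0 for v in device_a]

for transform in (diff_fingerprint, ssd_fingerprint, mdf_fingerprint):
    print(transform.__name__, transform(device_a) == transform(device_b))
```

Under the constant-offset assumption, all three transforms give identical fingerprints for both devices. If the devices also differ by a gain factor (a slope other than 1), the differences are scaled by that factor and these transforms alone no longer cancel the mismatch.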
In [28,29], convolutional neural networks (CNN) are used to eliminate the device diversity problem. Cai et al. [28] proposed a device-free indoor localization system based on channel state information (CSI) in IEEE 802.11n through CNN, and the space diversity, time diversity, and frequency diversity of CSI are combined to design the more abundant localization features. In [29], the database is constructed using the magnetic pattern (MP) in the offline phase, and the location is calculated using the CNN algorithm in the online phase to eliminate the device diversity problem.
Although this is a magnetic positioning system, it can give us a lot of inspiration.
In [30][31][32][33], the LR method is used to mine the internal relationship between data and to eliminate the differences between data through linear regression. In [30], an automatic device-transparent RSS-based indoor localization system is proposed, and the linear least squares (LLS) algorithm is applied in this system. Combining the offset component and the ratio term makes the LLS algorithm a complete algorithm. Moreover, the algorithmic complexity of LLS is much lower than that of the other algorithms discussed above. Li et al. [31] presented a prototype model of a multiple-surveyor-multiple-client system in the crowdsourcing localization system. The linear regression model is applied to calibrate across participating training devices, and a geometric distribution is used to obtain a conditional likelihood that the client observes access points that were invisible in the training phase. Ye et al. [32] proposed a device calibration algorithm that fuses samples from different devices to obtain grid fingerprints, together with a two-step online positioning algorithm to localize the user's position. In [33], the FAST-LTS algorithm is proposed to deal with large data sets. With the FAST-LTS algorithm, the LTS estimator becomes available as a tool for analyzing large data sets and for detecting outliers or deviating substructures. This algorithm is well suited to the problem of device diversity. Therefore, we proposed a FAST-LTS algorithm to eliminate the device diversity problem in [19,20], which achieved good results. In this paper, we further improve this algorithm.

Problem Formulation
For the RSS fingerprint-based localization method, the localization accuracy depends greatly on the mapping between each fingerprint and its corresponding coordinates stored in the radio map. In the offline phase, the localization area is divided into a discrete grid with $n$ RPs, and $m$ APs are deployed in this area. We collect RSS values from the APs at each RP, and a fingerprint radio map is constructed that holds the RSS for $m$ APs and $n$ RPs. The system diagram of the RSS fingerprint-based WLAN indoor localization system is shown in Figure 1. In traditional WLAN positioning methods, hundreds of fingerprints are collected over time for each RP. After RSS preprocessing, a $1 \times m$ vector $x_i = (x_{i1}, x_{i2}, \cdots, x_{im})$ is generated at each RP, where $x_{ik}$ is the received signal strength measured by the training device at the $i$-th RP from the $k$-th AP. The $i$-th fingerprint $(x_i, c_i)$ is the combination of the RSS measurement $x_i$ and the coordinate $c_i = (c_{i1}, c_{i2})$ of RP $S_i$. All the fingerprints are tabulated into the radio map shown in Figure 1.
In the online phase, the user collects the RSS vector $y = (y_{j1}, y_{j2}, \cdots, y_{jm})$ at an unknown position $S_j$ with a mobile device, and the user's position is then estimated by comparing $y$ with the radio map. Usually, if $y$ is similar to the fingerprint $x_i$ in the radio map, we reason that the user's location $S_j$ must be close to $S_i$.
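The matching step described above can be sketched as a nearest-neighbor lookup in RSS space. This is a minimal illustration under assumed data: the three fingerprints, their coordinates, and the helper names `euclidean` and `locate` are hypothetical and do not come from the experiments in this paper.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two RSS vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def locate(radio_map, y):
    """Return the coordinate c_i of the fingerprint x_i closest to y."""
    best = min(radio_map, key=lambda fp: euclidean(fp[0], y))
    return best[1]

# Hypothetical radio map: (x_i, c_i) pairs for three RPs and three APs.
radio_map = [
    ([-45.0, -70.0, -80.0], (0.0, 0.0)),
    ([-60.0, -55.0, -75.0], (0.5, 0.0)),
    ([-75.0, -60.0, -50.0], (1.0, 0.0)),
]

# Online reading taken with the training device near the second RP.
same_device = [-59.0, -56.0, -74.0]
print(locate(radio_map, same_device))
```

The lookup works when the online device matches the training device. When another device reports systematically shifted or scaled RSS values, the closest fingerprint in raw RSS space may no longer be the right one, which is the problem the rest of this section formalizes.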
In an actual localization system, the mobile devices used by different users are distinct from each other. Because different mobile devices are equipped with WLAN signal receivers of different performance, they may have different signal sensing capacities and yield different RSS values. To illustrate, we used five different mobile devices to collect RSS values from a single AP at a particular location and plotted the histogram in Figure 2. Because the RSS value fluctuates strongly in the indoor environment, an accurate RSS distribution cannot be obtained if only 2-3 RSS values are collected at one RP. Therefore, we collected 100 RSS values at each RP. As shown in Figure 2, the RSS values collected by different devices can be quite different even at the same location. This directly results in erroneous location estimates if we use one device's data for training and another device's data for locating.
We define $X$ and $Y$ as the signal spaces of the radio map built by the training device and of the online RSS values collected by the localization device, respectively. Let the fingerprint $x^*$ in the radio map $X$ be the nearest neighbor of the online RSS value $y$. Because different mobile devices have different signal receiving capabilities, the RSS values collected at nearby physical locations are obviously different. Hence, a key challenge arises: how to process the RSS values collected by different devices so that $x^*$ becomes closer to $y$. Mathematically, this amounts to learning a mapping function $F$ between the two signal spaces, as in Eq. (1). By learning $F$, the radio map $X$ built by the training device can be used to localize any other device.
Next, we will explore the relationship between different RSS values collected by different mobile devices. The signal processing diagram of the mobile device is shown in Figure 3.
We suppose that the transmit power of the AP is $P_t$, and the receiving power of the mobile device is given by Eq. (2), where $P$ is the receiving power of the antenna, $P_L$ is the path loss at distance $d$ from the AP, $G_t$ is the transmit antenna gain, $G_r$ is the receiving antenna gain, and $\alpha$ is the power amplifier magnification.
Assume that $x^A$ and $x^B$ represent the RSS vectors of length $m$ collected by two distinct devices A and B at the same location and the same time. In order to obtain the mapping function in Eq. (1), we compare $x^A$ and $x^B$ and get $m$ points whose coordinates are $(P_i^A, P_i^B)$, $i = 1, 2, \cdots, m$. The RSS values collected by devices A and B from the $i$-th AP are given by Eq. (3) and Eq. (4), where the subscript $i$ represents the $i$-th AP, and the superscripts A and B represent device A and device B, respectively. Figure 4 shows three points determined by the vectors $x^A$ and $x^B$. Any two points determine a straight line, so we get 3 lines in this figure. Suppose the slopes of the 3 lines are $a_{12}$, $a_{13}$, and $a_{23}$, and the intercepts of the 3 lines are $b_{12}$, $b_{13}$, and $b_{23}$. If $a_{12} = a_{13} = a_{23}$ and $b_{12} = b_{13} = b_{23}$, all 3 points are on the same line; that is, there is a linear relationship between $x^A$ and $x^B$. Therefore, if the slopes and intercepts of the lines determined by any two of the $m$ data points are equal, all the $m$ data points are on the same line, which means the relationship between $x^A$ and $x^B$ is linear. In this paper, the relationship between $x^A$ and $x^B$ is proved using this method.

The linear equation determined by the two points $(P_i^A, P_i^B)$ and $(P_j^A, P_j^B)$ is given by Eq. (5), and its slope and intercept are given by Eq. (6) and Eq. (7). Substituting the values of Eq. (3) and Eq. (4) into Eq. (6) and Eq. (7) yields Eq. (8) and Eq. (9). As can be seen from Eq. (8) and Eq. (9), in the ideal case, the slope and intercept of the line determined by two points $(P_i^A, P_i^B)$ and $(P_j^A, P_j^B)$ depend only on the antenna gains and amplification factors of device A and device B. Therefore, the $m$ points determined by the vectors $x^A$ and $x^B$ are on the same line.
This line is given by Eq. (10), where the linear equation coefficients are $a = a_{ij}$ and $b = b_{ij}$. Assume now that the RSS vectors $x^A$ and $x^B$ are collected at the same location but at different times. In an ideal case, the transmitting power of the AP, the working status of the mobile terminal, and the path loss remain constant; therefore, the linear relationship between $x^A$ and $x^B$ is the same as Eq. (10).
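For reference, the line through two of the RSS points can be written out explicitly. This is the standard two-point form, offered here as a sketch; it is an assumption that the paper's own equations for the line, slope, and intercept take exactly this shape, although it agrees with the slope and intercept error terms $\delta_a$ and $\delta_b$ stated later in this section.

```latex
% Line through (P_i^A, P_i^B) and (P_j^A, P_j^B), assuming P_i^A \neq P_j^A
P^B = a_{ij} P^A + b_{ij}, \qquad
a_{ij} = \frac{P_i^B - P_j^B}{P_i^A - P_j^A}, \qquad
b_{ij} = P_i^B - a_{ij} P_i^A
       = \frac{P_i^A P_j^B - P_j^A P_i^B}{P_i^A - P_j^A}.
```

If every pair $(i, j)$ gives the same $a_{ij}$ and $b_{ij}$, then all $m$ points satisfy a single equation $P^B = a P^A + b$, which is the collinearity argument used above.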
In practice, the instability of the AP transmit power and the complexity of the indoor electromagnetic environment make the RSS values collected by the mobile device unstable at different times, as described by Eq. (11) to Eq. (14). Substituting the values of Eq. (11), Eq. (12), Eq. (13), and Eq. (14) into Eq. (6) gives Eq. (15). When the indoor environment approaches an ideal environment, that is, when the difference of AP transmitting power at different times $\delta_{P_t} \to 0$ and the difference of path loss at different times $\delta_{P_L} \to 0$, the slope reduces to the ideal-case value, as in Eq. (16).

Wireless Communications and Mobile Computing
Similarly, we can obtain the intercept in Eq. (17). As a result, the linear relationship between $x^A$ and $x^B$ is proved by Eq. (16) and Eq. (17).
In an actual indoor environment, the working status of the AP is unstable, and the electromagnetic environment in the room is very complicated. Therefore, the difference $\delta_{P_t}$ of the transmission power of the AP and the difference $\delta_{P_L}$ of the path loss are not equal to 0, and as a result, the linear equation coefficients are given by Eq. (18) and Eq. (19), where $\delta_a = \delta_{ij}/(P_i^A - P_j^A)$ is the error of the slope, $\delta_b = (P_i^A \delta_{APj} - P_j^A \delta_{APi})/(P_i^A - P_j^A)$ is the error of the intercept, $\delta_{ij} = \delta_{APi} - \delta_{APj}$, $\delta_{APi} = \delta_{P_t,i} + \delta_{P_L,i}$, and $\delta_{APj} = \delta_{P_t,j} + \delta_{P_L,j}$. Figure 5 shows the slope and intercept of Eq. (18) and Eq. (19). Considering the noise, we can see from Eq. (18), Eq. (19), and Figure 5 that the slope and intercept of the straight line are centered on the ideal slope and intercept and fluctuate within an error range. Therefore, the relationship between the RSS data vectors $x^A$ and $x^B$ is approximately linear.
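The behavior described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of Figure 5: the ideal coefficients `A_TRUE` and `B_TRUE`, the device-A readings, and the Gaussian model for the per-AP disturbance are all hypothetical.

```python
import random
from itertools import combinations

def pairwise_lines(pa, pb):
    """Slope and intercept of the line through every pair of points (pa[i], pb[i])."""
    lines = []
    for i, j in combinations(range(len(pa)), 2):
        a = (pb[i] - pb[j]) / (pa[i] - pa[j])
        b = (pa[i] * pb[j] - pa[j] * pb[i]) / (pa[i] - pa[j])
        lines.append((a, b))
    return lines

A_TRUE, B_TRUE = 0.9, -12.0            # hypothetical ideal slope and intercept
pa = [-40.0, -55.0, -70.0, -85.0]      # hypothetical device-A readings (dBm)
ideal_pb = [A_TRUE * p + B_TRUE for p in pa]

# Assumption: the disturbance on each device-B reading is zero-mean Gaussian, 1 dB std.
rng = random.Random(0)
noisy_pb = [p + rng.gauss(0.0, 1.0) for p in ideal_pb]

ideal = pairwise_lines(pa, ideal_pb)   # every pair gives (0.9, -12.0) up to rounding
noisy = pairwise_lines(pa, noisy_pb)   # pairs scatter around the ideal values
for a, b in noisy:
    print(round(a, 3), round(b, 2))
```

In the ideal case every pair of points returns the same slope and intercept. With the disturbance added, the pairwise values differ from one another, and pairs whose device-A readings are close together are affected most because the denominator $P_i^A - P_j^A$ is small.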
In Figure 6, the RSS values collected by five distinct devices are compared. In this figure, each point represents the RSS values collected by two distinct devices at the same RP from the same AP. For example, the top left subplot in Figure 6 shows the RSS values collected by the Lenovo laptop and the Huawei mobile device. Figure 6 verifies the results of Eq. (18), Eq. (19), and Figure 5: a linear correlation is visible between the RSS values collected by the Lenovo laptop and those collected by the other devices.
As a result, we apply the linear regression (LR) method as the mapping function in this paper. The LR model is defined by specifying how the signal space $X$ of the training device is mapped into the signal space $Y$ of the localization device.
Based on the LR method, a unique radio map can be built in the offline phase, which improves the localization accuracy of the RSS-based crowdsourcing localization system.

Linear Regression Algorithm against Device Diversity for the Crowdsourcing Fingerprint Indoor Localization System
In this section, a preprocessing procedure is used to stabilize the acquisition of RSS values at the beginning. Then, the RSS values collected by an unknown device are labeled automatically with a rough location estimation using a correlation ratio computed from the Pearson product-moment correlation coefficient. Finally, the linear regression algorithm (LR) is proposed as the mapping function to solve the device diversity problem, and the fast least trimmed squares (FAST-LTS) is applied for the LR method to provide a more robust performance.

Preprocessing of RSS Values
The first step in our work is to mitigate the RSS fluctuations caused by the complexity of the indoor environment. Typically, when building the radio map, we measure a large number of RSS values at each RP to eliminate the noise. Let $RSS_{li} = \{rss_1, rss_2, \cdots, rss_p\}$ be the set of RSS values collected at location $l$ from the $i$-th AP. When the RSS values are collected to build the radio map or to estimate the current location, all RSS vectors must have the same length. However, some APs may fail to work properly. To ensure that the RSS measurements have the same length, we fill each missing RSS value with −110 dBm and treat it as an outlier. These outliers can affect the linear regression process and produce erroneous location estimates, as shown in Figure 7. Figure 7(a) plots the linear regression function of the RSS values for two different devices, where the traditional average is used for estimation. The uncertainty in the RSS samples can be seen clearly in this figure. To achieve higher positioning accuracy, the original RSS measurements should be preprocessed prior to the localization process. Average, mode, and median are the common data preprocessing methods. Because the average takes all the RSS values into consideration and uses the data more efficiently, we average the RSS values to overcome the RSS fluctuations. However, the average is susceptible to the outliers of −110 dBm, which can seriously distort it and produce erroneous location estimates. Hence, the truncated average in Eq. (20) is used in our work to stabilize the collected RSS samples, where $I(\cdot)$ is an indicator function. There is also a special case when calculating the truncated average. Consider the situation in which only one sample reports a valid reading $rss_i = \alpha$ dBm while all the other RSS values are −110 dBm; the result of Eq. (20) is then $x_{li} = \alpha$ dBm. Therefore, when applying Eq. (20), we first calculate the ratio $t$ of normal RSS values in the collected vector and set a threshold $t_{th}$, and then we obtain the truncated average as in Eq. (21).
Figure 4: Schematic diagram of relation between data points.
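The preprocessing rule can be sketched as follows. The averaging over valid samples is inferred from the special case described above, where a single valid reading $\alpha$ yields $x_{li} = \alpha$. Two things are assumptions made for illustration: the threshold value `t_th=0.5`, and returning −110 dBm when too few samples are valid, since the text does not state what happens below the threshold.

```python
MISSING = -110.0   # dBm value used in the text to fill a missing reading

def truncated_average(samples, t_th=0.5):
    """Average only the valid samples; report MISSING if too few are valid.

    t_th is a hypothetical threshold on the ratio t of valid samples.
    """
    if not samples:
        return MISSING
    valid = [s for s in samples if s != MISSING]
    ratio = len(valid) / len(samples)      # ratio t of normal RSS values
    if not valid or ratio < t_th:
        return MISSING                     # assumed fallback, see lead-in
    return sum(valid) / len(valid)

readings = [-60.0, -62.0, MISSING, -64.0]
plain_mean = sum(readings) / len(readings)   # pulled toward -110 by the filler
robust_mean = truncated_average(readings)    # ignores the filler
mostly_missing = truncated_average([-50.0] + [MISSING] * 9)
print(plain_mean, robust_mean, mostly_missing)
```

With one filler among four samples, the plain average drops to −74 dBm while the truncated average stays at −62 dBm. When only one of ten samples is valid, the ratio check prevents that single reading from being reported as a reliable average.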

Rough Location Estimation.
After completing the preprocessing of RSS data, we use the linear regression method as the mapping function in Eq. (22), where $x_i$ is the $i$-th fingerprint in the radio map collected by the training device, $y$ is the RSS vector measured in the online phase by the localization device, and $(a_i, b_i)$ are the coefficients of the mapping function. Based on the mapping function in Eq. (22), the RSS values collected by different devices can be transformed to the same signal space. Accordingly, the device diversity problem can be solved. However, the RSS values collected in the online phase are unlabeled and cannot be processed directly using the linear regression algorithm. Therefore, the correlation ratio computed from the Pearson product-moment correlation coefficient is used to roughly label the RSS values collected in the online phase, as in [18]. It is given by Eq. (23), where $m$ is the number of APs, $y_k$ and $x_{ik}$ are the RSS values measured from the $k$-th AP, $\bar{y} = \frac{1}{m}\sum_{k=1}^{m} y_k$ is the average of the RSS values from the tracking device, and $\bar{x}_i = \frac{1}{m}\sum_{k=1}^{m} x_{ik}$ is the mean of the RSS values measured by the training device in the $i$-th fingerprint. The range of the Pearson correlation ratio is $[-1, 1]$, where 1 indicates the highest correlation.
When we get the online RSS values $y$, the Pearson correlation ratio $r$ between $y$ and every fingerprint $x_i$ in $X$ can be computed using Eq. (23). By setting a correlation coefficient threshold $r_{th}$, we obtain the set $A$ of nearest neighbor fingerprints in the radio map $X$ for $y$.
The fingerprints in $A$ have the strongest linear correlation with the online RSS values. With them, the online RSS values can be labeled roughly, and the linear regression mapping function can be obtained more accurately.
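The rough labeling step can be sketched as a threshold on the Pearson correlation. This is a minimal illustration, not the system's implementation: the radio map values, the threshold `r_th=0.9`, the helper names, and the affine relation used to synthesize the online reading are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length RSS vectors."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0   # undefined for a constant vector; treated as uncorrelated here
    return cov / (sx * sy)

def candidate_set(radio_map, y, r_th=0.9):
    """Indices of fingerprints whose correlation with y reaches the threshold."""
    return [i for i, x in enumerate(radio_map) if pearson(x, y) >= r_th]

# Hypothetical radio map: three RPs, four APs (dBm).
radio_map = [
    [-45.0, -70.0, -80.0, -62.0],
    [-60.0, -55.0, -75.0, -68.0],
    [-75.0, -60.0, -50.0, -81.0],
]
# Assumption: another device at RP 1 sees an affine version of that fingerprint.
y = [0.8 * v - 15.0 for v in radio_map[1]]
print(candidate_set(radio_map, y))
```

Because the Pearson correlation is unchanged by a positive scale factor and an offset, the online reading still correlates perfectly with the fingerprint of its own RP even though the raw RSS values differ. This is why it can label data from an unknown device before any regression coefficients are known.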

Linear Regression Algorithm against Device Diversity Problem
In Eq. (22), the important parameters $a_i$ and $b_i$ need to be computed first. Because the linear least squares (LLS) algorithm is sensitive to outliers, we use the fast least trimmed squares (FAST-LTS) algorithm to compute the parameters in Eq. (22).
Assume that the number of nearest neighbors in $A$ is $c$. The FAST-LTS solution for linear regression with intercept is given by Eq. (25), where $h = \mathrm{int}[(c + 2)/2]$, $d(i) = \|y - (a_i x_i + b_i \mathbf{1})\|$, $\|\cdot\|$ is the 2-norm of a vector, and $d(i)^2$ are the ordered squared residuals: $d(1)^2 \le d(2)^2 \le \cdots \le d(i)^2 \le \cdots \le d(c)^2$. Given the $h$-subset $H_{old}$ of all nearest neighbors, the C-step is used to compute $a_i$ and $b_i$ as follows [33]:
(1) Compute $a_{old}$ and $b_{old}$ := least squares regression estimator based on $H_{old}$
Repeating the C-step with numerous $H_{old}$ yields many sets of regression coefficients. The approximate solution is the set of coefficients corresponding to the smallest $\sum_{i=1}^{h} d(i)^2$. Using the parameters $a_i$ and $b_i$, $x_i$ can be transformed to the signal space of the online data $y$ as $x_i' = a_i x_i + b_i \mathbf{1}$, where $x_i' \in Y$. Since both $x_i'$ and $y$ belong to the same signal space, the KNN algorithm based on the RSS Euclidean distance can be used to estimate the user's location.
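The concentration idea behind FAST-LTS can be sketched in a few lines. This is a simplified illustration, not the full algorithm of [33] and not the authors' implementation. It assumes the regression is run on pooled scalar pairs (radio-map RSS, online RSS), uses random two-point starts, and fixes the number of C-steps. The data, including the three −110 dBm outliers, are hypothetical.

```python
import random

def least_squares(points):
    """Ordinary least squares fit v = a*u + b over (u, v) pairs."""
    n = len(points)
    mu = sum(u for u, _ in points) / n
    mv = sum(v for _, v in points) / n
    sxx = sum((u - mu) ** 2 for u, _ in points)
    if sxx == 0.0:
        return 0.0, mv                      # degenerate subset: horizontal line
    a = sum((u - mu) * (v - mv) for u, v in points) / sxx
    return a, mv - a * mu

def c_step(points, a, b, h):
    """One concentration step: refit on the h points with the smallest residuals."""
    ranked = sorted(points, key=lambda p: (p[1] - (a * p[0] + b)) ** 2)
    return least_squares(ranked[:h])

def trimmed_cost(points, a, b, h):
    """Sum of the h smallest squared residuals."""
    return sum(sorted((v - (a * u + b)) ** 2 for u, v in points)[:h])

def fast_lts(points, n_starts=50, n_steps=10, seed=0):
    rng = random.Random(seed)
    h = (len(points) + 2) // 2              # h = int[(c + 2) / 2], as in the text
    best = None
    for _ in range(n_starts):
        a, b = least_squares(rng.sample(points, 2))   # random two-point start
        for _ in range(n_steps):
            a, b = c_step(points, a, b, h)
        cost = trimmed_cost(points, a, b, h)
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]

# Hypothetical data: ten clean pairs on v = 0.9*u - 12 plus three -110 dBm outliers.
clean = [(float(u), 0.9 * u - 12.0) for u in range(-90, -40, 5)]
outliers = [(-50.0, -110.0), (-60.0, -110.0), (-45.0, -110.0)]
a_lts, b_lts = fast_lts(clean + outliers)
a_lls, b_lls = least_squares(clean + outliers)
print(round(a_lts, 3), round(b_lts, 3), round(a_lls, 3))
```

On this toy data, the trimmed fit recovers the slope and intercept of the clean points, while the plain least squares slope is pulled far from 0.9 by the three filler values. The recovered coefficients would then be used to map each candidate fingerprint as $a_i x_i + b_i \mathbf{1}$ before the KNN distance is computed.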

Analysis of Probability of Error Detection of the Proposed Algorithm
In this section, we analyze the probability of error detection in the crowdsourcing indoor localization system. In the offline training phase, the radio map $X$ consists of $n$ fingerprints built by the training device $D_T$. Suppose $x_1$ and $x_2$ are two fingerprints in the radio map $X$, and $y$ is the RSS vector collected by the localization device in the online phase. Assume that the fingerprint $x_1$ is the nearest neighbor of the online RSS value $y$. Based on the linear regression model, the relation between the RSS vector $y$ and the fingerprint $x_1$ can be expressed as $y = a_0 x_1 + b_0 \mathbf{1} + \varepsilon$, where $a_0$ and $b_0$ are the real linear regression coefficients between the two RSS vectors, and $\varepsilon$ is a $1 \times m$ noise vector. We suppose that $\varepsilon$ follows the Gaussian distribution $N(0, \sigma_\varepsilon^2)$, and the variance $\sigma_\varepsilon^2$ is unknown. Using the proposed LR method, the online RSS vector $y$ and the radio map $X$ can be transferred to the same signal space.
In this paper, the KNN algorithm is used to estimate the location of the online RSS vector $y$. First of all, the Euclidean distance between the online point $y$ and the fingerprints in the radio map $X$ is calculated by Eq. (31). In the KNN algorithm, we choose the fingerprints with the smallest Euclidean distance as the nearest neighbors of $y$. Assume that $k = 1$; if $d_1 < d_2$, then $x_1$ is the nearest neighbor of $y$; otherwise, $x_2$ is the nearest neighbor of $y$. We have assumed that $x_1$ in the radio map $X$ is the nearest neighbor of $y$, so an error occurs when the calculation results show that $d_1 > d_2$. Therefore, the probability of localization error is given by Eq. (32), where $NN(\cdot)$ denotes the nearest neighbor of the online data $y$.
If the nearest neighbor of the online data $y$ estimated by the localization system is correct, then $NN(y) = x_1$, and the linear regression coefficients can be calculated by Eq. (25), as in Eq. (33). Obviously, in Eq. (33), when $a = a_0$ and $b = b_0$, we get the minimizer of Eq. (30), and the linear regression coefficients between $y$ and $x_1$ are given by Eq. (34). Substituting the values of Eq. (34) into Eq. (33), we obtain Eq. (35), where $\chi_m^2$ is the chi-square distribution with $m$ degrees of freedom.
When a localization error occurs, $NN(y) = x_2$, and the linear regression coefficients between $y$ and $x_2$ are given by Eq. (36). By making use of the triangle inequality, we obtain Eq. (37). From Eq. (37), $b_2 = b_0$ is the solution at the minimizer of Eq. (36). Therefore, Eq. (36) is equivalent to Eq. (38). Because $x_1$ and $x_2$ are $m \times 1$ vectors, $x_1^T x_2 = x_2^T x_1$; therefore, Eq. (38) can be written as Eq. (39). In Eq. (39), $a_0$, $x_1$, and $x_2$ are already known. Let $A = a_0^2 \|x_1\|^2$, $B = a_0 x_1^T x_2$, and $C = \|x_2\|^2$; then Eq. (39) can be expressed as Eq. (40). From Eq. (40), we can get the solution for $a_2$ by setting the derivative of the quadratic to 0. Let $\Delta = A - 2Ba + Ca^2$; then $\partial\Delta/\partial a = 0$ yields Eq. (41) and, in turn, Eq. (42). At last, the linear regression coefficients between $y$ and $x_2$ are given by Eq. (43). Substituting the linear regression coefficients into Eq. (31), we get Eq. (44). Using Eq. (35) and Eq. (44), the probability of localization error in Eq. (32) can be computed by Eq. (45). For the right side of the inequality in Eq. (45), we set $\xi$ as in Eq. (46). Using the Cauchy-Schwarz inequality, we obtain Eq. (47). Therefore, $\xi \le 0$ in Eq. (46). For the left side of the inequality in Eq. (45), we set $\eta = s^T \varepsilon$. Since $\varepsilon$ is Gaussian, $\eta$ is also Gaussian, and the probability of error detection can be written as Eq. (51), where $\Phi(\cdot)$ is the standard normal distribution function.
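The key steps of this derivation can be written out as follows. This is a sketch of one consistent reading of the argument, under the model $y = a_0 x_1 + b_0 \mathbf{1} + \varepsilon$ with independent noise components of variance $\sigma_\varepsilon^2$. The symbol $v$ is introduced here for illustration, and it is an assumption that the paper's $s$ and $\xi$ correspond to $v$ and $-\|v\|^2/2$ up to normalization; the closed form of the paper's final expression may differ.

```latex
% Minimizer of \Delta(a) = A - 2Ba + Ca^2,
% with A = a_0^2\|x_1\|^2, B = a_0 x_1^T x_2, C = \|x_2\|^2
\frac{\partial \Delta}{\partial a} = -2B + 2Ca = 0
\;\Longrightarrow\;
a_2 = \frac{B}{C} = \frac{a_0\, x_1^{T} x_2}{\|x_2\|^{2}},
\qquad
\Delta_{\min} = A - \frac{B^{2}}{C}
  = a_0^{2}\,\frac{\|x_1\|^{2}\|x_2\|^{2} - (x_1^{T} x_2)^{2}}{\|x_2\|^{2}} \;\ge\; 0 .

% Error event, with v = a_0 x_1 - a_2 x_2 and \|v\|^2 = \Delta_{\min}
d_1 > d_2
\;\Longleftrightarrow\;
\|\varepsilon\|^{2} > \|v + \varepsilon\|^{2}
\;\Longleftrightarrow\;
v^{T}\varepsilon < -\tfrac{1}{2}\|v\|^{2},
\qquad
v^{T}\varepsilon \sim N\!\left(0,\ \sigma_\varepsilon^{2}\|v\|^{2}\right)

% Resulting pairwise error probability
P(d_1 > d_2) = \Phi\!\left(-\frac{\|v\|}{2\sigma_\varepsilon}\right).
```

The nonnegativity of $\Delta_{\min}$ is exactly the Cauchy-Schwarz inequality, which is why the right-hand side of the inequality is nonpositive. The last line also shows that the pairwise error probability grows as $\|v\|$ shrinks, that is, as the two mapped fingerprints get closer.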
When the actual nearest neighbor of the online data $y$ in the radio map $X$ is $x_1$ and the localization system wrongly detects the fingerprint $x_2$ as the nearest neighbor, the probability of error detection is given by Eq. (51). Generally, there are $n$ fingerprints in the radio map $X$. Assume that the fingerprint $x_i$ in the radio map $X$ is the nearest neighbor of the online data $y$. Then, an error occurs when any other fingerprint $x_j$ in the radio map $X$ is chosen as the nearest neighbor of $y$. Therefore, the probability of error detection is given by Eq. (52). Each standard normal distribution function $\Phi$ in Eq. (52) represents the probability of wrongly detecting the fingerprint $x_j$ as the nearest neighbor of the online data $y$ when the actual nearest neighbor is the fingerprint $x_i$. In the radio map $X$, the closer a fingerprint is to the online data $y$, the more likely the localization system is to select it as the nearest neighbor, so it contributes a larger share of the probability of error detection. Because the fingerprints in the nearest neighbor set $A$ calculated by Eq. (23) have a high probability of being selected as the nearest neighbor, these fingerprints contribute most of the probability of error detection in Eq. (52) and should be dealt with carefully. Therefore, preprocessing the RSS values makes it possible to obtain a more accurate nearest neighbor set and thus to realize the linear regression of RSS values more precisely.
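The claim that closer fingerprints contribute more error can be checked numerically. The sketch below assumes a pairwise error probability of the form $\Phi(-\|v\|/(2\sigma))$, where $v$ is the offset between the two mapped fingerprints. This form follows from the Gaussian noise model but is not guaranteed to match the paper's Eq. (51) term by term. The vector `v`, the value of `sigma`, and the function names are hypothetical.

```python
import random
from statistics import NormalDist

def pairwise_error_probability(v_norm, sigma):
    """Phi(-||v|| / (2*sigma)): chance that the wrong fingerprint looks closer."""
    return NormalDist().cdf(-v_norm / (2.0 * sigma))

def union_bound(v_norms, sigma):
    """Sum of pairwise terms over all competing fingerprints (an upper bound)."""
    return sum(pairwise_error_probability(vn, sigma) for vn in v_norms)

def simulate(v, sigma, trials=100_000, seed=0):
    """Monte Carlo estimate of P(d1 > d2) for a competitor offset by v."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        eps = [rng.gauss(0.0, sigma) for _ in v]
        d1 = sum(e * e for e in eps)                       # squared distance to the true fingerprint
        d2 = sum((vi + e) ** 2 for vi, e in zip(v, eps))   # squared distance to the competitor
        errors += d1 > d2
    return errors / trials

v = [3.0, 0.0, 4.0]      # hypothetical offset between mapped fingerprints, norm 5
sigma = 2.5              # hypothetical noise standard deviation (dB)
analytic = pairwise_error_probability(5.0, sigma)
estimate = simulate(v, sigma)
print(round(analytic, 4), round(estimate, 4))
```

The simulated error rate should land close to the analytic value for this offset. Calling `pairwise_error_probability` with a smaller norm gives a larger value, which matches the observation that fingerprints near the online reading dominate the total error.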

Experimental Results and Analysis
The effectiveness of the proposed LR method is studied through experiments and simulations in this section. Figure 8 shows the indoor localization experiment system. The localization area is a corridor 49.4 m long and 14.1 m wide, illustrated in yellow. In the offline phase, we deployed 27 access points (Linksys WRT54G) in IEEE 802.11b/g mode. As Figure 9 shows, the larger the interval between two adjacent reference points in the radio map, the lower the final positioning accuracy; however, the interval cannot be too small. Therefore, the corridor is divided into grids of 0.5 m × 0.5 m, which means the interval between any two adjacent reference points is 0.5 m. To verify the LR method, several RSS values are first collected by the test devices and used to find the candidate fingerprints in the radio map. Based on the candidate fingerprints and the measured data, the linear regression coefficients are calculated. The radio map and the online RSS values can then be mapped to the same signal space, and an accurate localization result can be obtained.

Next, we take the (Lenovo, Huawei) pair as an example to illustrate the effectiveness of the LR algorithm. For comparison, the linear regression coefficients are calculated by both the LTS algorithm and the LLS algorithm, and the resulting linear regression functions are shown in Figure 10. Compared with the result of the LTS algorithm, the linear regression function calculated by the LLS algorithm is pulled closer to the outliers, which results in a large error. The LLS algorithm is more susceptible to outliers because it treats all the measured RSS values equally, without any special treatment of the outliers. Once the linear regression functions are obtained, the signal space of the radio map can be mapped to the signal space of the online RSS data.
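The contrast between the two estimators can be sketched on synthetic data. The example below is a simplified illustration, not the authors' implementation: the LTS fit uses random two-point starts followed by concentration steps (refit on the points with the smallest squared residuals), and the trimming fraction `keep`, the true coefficients, the noise level, and the outlier pattern are all assumed values.

```python
import numpy as np

def lls_fit(x, y):
    """Ordinary linear least squares fit y ~ a*x + b using every point."""
    a, b = np.polyfit(x, y, deg=1)
    return a, b

def lts_fit(x, y, keep=0.75, n_starts=50, n_steps=20, seed=0):
    """Least trimmed squares: minimize the sum of the h smallest squared
    residuals, approximated by random starts plus concentration steps."""
    rng = np.random.default_rng(seed)
    n = len(x)
    h = int(np.ceil(keep * n))                    # points kept in the trimmed sum
    best_cost, best_ab = np.inf, (0.0, 0.0)
    for _ in range(n_starts):
        idx = rng.choice(n, size=2, replace=False)  # random two-point start
        if x[idx[0]] == x[idx[1]]:
            continue                                # skip a degenerate start
        for _ in range(n_steps):
            a, b = np.polyfit(x[idx], y[idx], deg=1)
            res2 = (y - (a * x + b)) ** 2
            idx = np.argsort(res2)[:h]              # keep the h best-fitting points
        a, b = np.polyfit(x[idx], y[idx], deg=1)
        cost = np.sort((y - (a * x + b)) ** 2)[:h].sum()
        if cost < best_cost:
            best_cost, best_ab = cost, (a, b)
    return best_ab

# Hypothetical paired RSS samples (dBm): x from one device, y from another.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-60.0, -40.0, size=40))
true_a, true_b = 0.9, -12.0
y = true_a * x + true_b + rng.normal(0.0, 1.0, size=40)
y[-6:] += 15.0                                      # six outliers at the strong-signal end

a_lls, b_lls = lls_fit(x, y)
a_lts, b_lts = lts_fit(x, y)
print("LLS slope:", a_lls, "LTS slope:", a_lts)
```

Because the trimmed objective ignores the worst-fitting quarter of the samples, the six shifted points do not influence the LTS fit, whereas they tilt the LLS line toward themselves.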
To demonstrate the effect of the linear regression algorithm more intuitively, Figure 11 compares the RSS values before and after applying the LTS algorithm. In Figure 11, the RSS values of the Huawei device range from -62 dBm to -55 dBm, and those of the Lenovo device range from -51 dBm to -41 dBm. The minimum and maximum RSS differences between the Huawei device and the Lenovo device are therefore 4 dB and 21 dB, respectively. If the radio map is built by the Lenovo device and the user's location is estimated by the Huawei device, the localization accuracy is considerably low. After the LTS algorithm is applied, the RSS values collected by the Lenovo device are transformed, and the signal distributions of the Huawei device and the Lenovo device become basically the same. As a result, the localization accuracy can be improved significantly.

After applying the LTS algorithm, the RSS values in the radio map are transformed, and the KNN algorithm (K = 3) is used to estimate the user's current location. For comparison, the LLS algorithm and the RSS ratios are also used to address the device diversity problem. The error between the estimated and true locations is expressed as a Euclidean distance. The CDF curves of the localization error of all algorithms are displayed in Figure 12. When the same device is used to build the radio map and to estimate the user's current location, we obtain the optimal result, shown as the red line in Figure 12. As the remaining curves in Figure 12 show, if the devices used in the offline phase and the online phase are different, the localization accuracy of the crowdsourcing localization system is greatly reduced. In this paper, we aim to eliminate the device diversity problem so that the localization accuracy is as close as possible to the optimal result. As Figure 12 shows, each of the algorithms improves the localization accuracy. The proposed LTS algorithm outperforms the other methods, and its localization accuracy is closest to the optimal result. Notably, the maximum localization error is reduced from 10 m to 4.5 m, and the average error is reduced from 3.72 m to 2.31 m.
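The evaluation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the KNN estimate is taken as the unweighted mean of the K nearest reference positions (the paper does not spell out the averaging rule here), and the four-point radio map, positions, and online sample are hypothetical toy values.

```python
import numpy as np

def knn_locate(radio_map, positions, y, k=3):
    """Average the positions of the k fingerprints closest to y in signal space."""
    d = np.linalg.norm(radio_map - y, axis=1)   # Euclidean distance per fingerprint
    nearest = np.argsort(d)[:k]
    return positions[nearest].mean(axis=0)

def empirical_cdf(errors):
    """Return sorted errors and the fraction of samples at or below each one."""
    e = np.sort(np.asarray(errors, dtype=float))
    p = np.arange(1, len(e) + 1) / len(e)
    return e, p

# Toy radio map: 4 reference points, 2 access points (RSS in dBm), positions in m.
radio_map = np.array([[-40.0, -70.0],
                      [-45.0, -65.0],
                      [-50.0, -60.0],
                      [-70.0, -40.0]])
positions = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [3.0, 0.0]])

y = np.array([-41.0, -69.0])                    # online sample (already mapped)
true_pos = np.array([0.1, 0.0])

est = knn_locate(radio_map, positions, y, k=3)
err = float(np.linalg.norm(est - true_pos))     # localization error in metres
e_sorted, cdf = empirical_cdf([3.0, 1.0, 2.0])
print(est, err, e_sorted, cdf)
```

Plotting `cdf` against `e_sorted` for each method produces curves of the kind compared in Figure 12.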
In Eq. (24), the correlation coefficient threshold $r_{th}$ is set to choose the nearest neighbor fingerprints in the radio map for the online RSS data. However, if all the correlation coefficients calculated by Eq. (23) are less than $r_{th}$, then $A = \emptyset$. To guarantee $A \neq \emptyset$, we choose the 10% of the fingerprints with the highest $r$ to form the candidate nearest neighbor set. The CDF curves of the localization error for different sizes of $A$ are plotted in Figure 13. As shown in Figure 13, the localization accuracy decreases as the candidate set size increases, because more fingerprints with a low probability of being the nearest neighbor are included in $A$.
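The selection rule with its fallback can be sketched as below. The exact forms of Eqs. (23) and (24) are not reproduced here, so this is an assumption-laden illustration: it uses the Pearson correlation between the online vector and each fingerprint, admits fingerprints with $r \ge r_{th}$, and falls back to the top 10% when none qualify. The function name, the threshold value, and the synthetic radio map are hypothetical.

```python
import numpy as np

def candidate_set(radio_map, y, r_th=0.9, fallback_frac=0.10):
    """Indices of fingerprints whose Pearson correlation with y is >= r_th.
    If no fingerprint qualifies, return the top fallback_frac by correlation.
    Assumes no fingerprint (and not y) is constant across access points."""
    xc = radio_map - radio_map.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    r = (xc @ yc) / (np.linalg.norm(xc, axis=1) * np.linalg.norm(yc))
    idx = np.flatnonzero(r >= r_th)
    if idx.size == 0:
        n_keep = max(1, int(np.ceil(fallback_frac * len(r))))
        idx = np.argsort(r)[::-1][:n_keep]       # highest correlations first
    return idx, r

# Hypothetical radio map: 10 fingerprints, 27 access points (RSS in dBm).
rng = np.random.default_rng(2)
radio_map = rng.uniform(-90.0, -40.0, size=(10, 27))
y = 0.9 * radio_map[3] - 12.0                    # online data from a different device

idx_thr, r = candidate_set(radio_map, y, r_th=0.999)
idx_fallback, _ = candidate_set(radio_map, y, r_th=2.0)  # unreachable threshold
print(idx_thr, idx_fallback)
```

The Pearson correlation is unchanged by a positive linear transformation of the RSS values, which is why fingerprint 3 is still recovered even though `y` has been scaled and shifted relative to it.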
The probability of error detection in Eq. (52) for all fingerprints in the radio map is shown in Figure 14. In the simulation, since the variance of the noise has no effect on the trend of the probability distribution, we set $\sigma_\varepsilon = 10$. In addition, we assume that every RSS vector collected in the online phase has an ideal nearest neighbor in the radio map, that is, $\mathrm{NN}(y) = x_i$, so we set $a_i = 1$ for all $i = 1, 2, \ldots, n$. Because $P_e(i, j)$ has no physical meaning when $i = j$, we set $P_e(i, i) = 0$ when drawing Figure 14. From Figure 14, it can be concluded that the probability of error detection is higher for fingerprints closer to the nearest neighbor than for the others, which means that these fingerprints are more likely to be wrongly detected as the nearest neighbor. Thus, the fingerprints in the nearest neighbor set in Eq. (24) contribute most of the error in Eq. (52) and should be chosen more carefully.

Conclusions
In this paper, the linear regression (LR) method is proposed to overcome the device diversity problem for the RSS fingerprint-based WLAN indoor localization system using crowdsourced data. The intuition behind this technique is that the RSS values of different devices have a linear relationship. The Pearson correlation coefficient is first used to label the RSS values with a rough location estimate, and the regression coefficients are then calculated by the LTS algorithm. With the LR algorithm, the RSS values collected by distinct devices can be shifted into the same signal space, and the device diversity problem can be solved. A theoretical analysis of the probability of error detection supports the proposed algorithm. Furthermore, we tested the proposed method in a typical office environment, and the experimental results demonstrate that it leads to significant improvements in localization accuracy.

Data Availability
The radio map data used to support the findings of this study were supplied by Liye Zhang under license and so cannot be made freely available. Requests for access to these data should be made to Liye Zhang (zhangliye@sdut.edu.cn).