Dealing with Insufficient Location Fingerprints in Wi-Fi Based Indoor Location Fingerprinting

The development of the Internet of Things has accelerated research in the indoor location fingerprinting technique, which provides value-added localization services for existing WLAN infrastructures without the need for any specialized hardware. The deployment of a fingerprinting-based localization system requires an extremely large number of received signal strength (RSS) measurements to generate a location fingerprint database. Nonetheless, this requirement can rarely be satisfied in most indoor environments. In this paper, we target one common situation, in which the collected RSS measurements are insufficient, and show the limitations of existing location fingerprinting methods in dealing with inadequate location fingerprints. We also introduce a novel method that reduces noise in the measured RSS based on maximum likelihood estimation and computes locations from inadequate location fingerprints by using the stochastic gradient descent algorithm. Our experimental results show that the proposed method achieves better localization performance even when only a small quantity of RSS measurements is available. In particular, when the number of observations at each location is small, the proposed method shows evident superiority in localization accuracy.


Introduction
With the development of the Internet of Things (IoT) and the popularization of mobile devices such as smartphones, a variety of mobile applications have changed people's lifestyles tremendously. These applications enable users to access plenty of services at any time in any place and often use location information to provide personalized experiences. The global positioning system (GPS) can achieve meter-level accuracy in outdoor environments. However, GPS works poorly inside buildings due to the signal attenuation caused by roofs, walls, and other objects. During the past decades, a variety of indoor positioning systems (IPS) have been introduced. Since wireless information access is now widely available, many of these approaches tap into wireless signals for estimating locations.
In the last couple of years, the location fingerprinting (LF) technique using existing wireless local area network (WLAN) infrastructure has been suggested for indoor areas. Location fingerprinting estimates the target location by matching online measurements of received signal strength (RSS) with the closest offline features (i.e., the location fingerprints) composed of location coordinates and respective RSS values. It is relatively simple to deploy, compared to the other wireless indoor positioning techniques using Bluetooth beacons [1] or RFID tags [2], which can achieve higher location accuracy.
To deploy a traditional LF-based indoor positioning system, the positioning server should generate the location fingerprints by performing a site survey of the RSS information from multiple access points (APs). With these fingerprints, the positioning server is able to localize a mobile device based on its RSS measurements. The site survey is extremely time-consuming and labor-intensive, which raises the cost of initiating an LF-based localization service. Furthermore, the positioning server should periodically repeat this site survey to update the fingerprints so as to control errors in the changeable Wi-Fi environment, which also raises the cost of maintaining the localization service.

Wireless Communications and Mobile Computing
Benefiting from cloud computing and big data techniques, the LF system may also be deployed without an offline site survey. Following [3], much work has been done to enable the collection of fingerprints based on crowdsourced solutions. In these solutions, users are required to continuously upload their RSS measurements to the positioning server as the training data. Meanwhile, an additional incentive mechanism is required to guarantee the number of volunteering users.
When the crowdsourced fingerprints are insufficient to deploy an LF system, the offline site survey is still required to refine the fingerprint database. Take an indoor environment as an example: some locations may never have been occupied by any volunteering user. Thus, no RSS fingerprints of these locations are generated. To ensure the functionality of the location service, the service provider may still need to perform an offline site survey at these locations.
Regardless of whether the RSS measurements are collected via a traditional site survey or a crowdsourced approach, it is a widely observed fact that sufficient RSS measurements cannot be collected (periodically, for maintaining fingerprints) in most indoor environments. Ways of collecting RSS measurements are not the focus of this paper. We are interested in the following issues, which are interesting and useful but rarely studied in existing research:
(i) At a given location, how many RSS measurements are required to generate an accurate online location or offline fingerprint?
(ii) Most importantly, when the collected RSS measurements are insufficient to generate an accurate location fingerprint database, how do we perform localization in this situation?
Although the answers to these questions may vary in different indoor environments, analyzing these issues in depth for a specific indoor application scenario is still instructive. In this paper, we propose a novel localization method which reduces noise in the measured received signal strength based on maximum likelihood estimation and estimates locations from inadequate location fingerprints by using the stochastic gradient descent algorithm. We also use an open dataset to evaluate our proposed method by comparing it with the most commonly used location fingerprinting methods and investigate the number of RSS measurements required to deploy an LF system. The results show that our proposed method can achieve better localization accuracy when only a small quantity of RSS measurements is available.
This paper is organized as follows. Section 2 surveys existing location fingerprinting methods. Section 3 introduces two major problems that arise from insufficient RSS measurements in deploying a location fingerprinting system. Section 4 describes our basic idea in solving these problems, and the detailed solution is presented in Section 5. We evaluate our proposed method in Section 6. Finally, we conclude this paper in Section 7.

Related Work
To reduce the cost of deploying an indoor localization system, many studies leverage existing Wi-Fi infrastructures and introduce location fingerprinting based on the RSS measurements of the Wi-Fi signals. The deployment of location fingerprinting systems is often divided into two phases: an offline phase, in which a site survey of the RSS from multiple APs is collected, and an online phase, in which a location is computed from the currently observed RSS measurements by using a matching algorithm.

Collecting Online and Offline RSS Measurements. At least four key factors can decide the accuracy of an LF technique.
The first is the density of the offline observing locations where RSS measurements are collected to generate fingerprints: a higher accuracy in LF requires a higher density of observing locations, which leads to a heavier workload in collecting and updating the fingerprints.
The second is the quantity of available information, including the number of RSS observations used to generate fingerprints and the number of dimensions (observed APs) in each RSS observation. Existing approaches use channel state information [4,5] or environmental information such as light [6], sound [7,8], temperature, humidity, magnetic, or pressure data to improve location accuracy. Both of these factors deal with the sufficiency of RSS measurements.

Reducing Noises and Generating Fingerprints. The third key factor deciding the accuracy of localization is the algorithm used to reduce noises. It can be used in both the offline and the online phases.
The most common way of denoising RSS measurements is to observe multiple times at the same location and average the observations so that noises are reduced. With multiple observations at the same location, one can also make sure that all observable APs are observed. The tricky part is how to deal with situations in which some APs are missed from some (but not all) observations. A common but naive approach is to simply set the RSS of unobserved APs to −100 dBm. Some other approaches assume that only APs far away from the observing location can be missed (we will show that this assumption is wrong) and set a threshold (e.g., −80 dBm) to consider only RSS measurements larger than this threshold. There are also approaches that use a complex algorithm to reduce noises [9][10][11]; however, most of these approaches require a large quantity of RSS observations at the same location. Some approaches use a lightweight machine learning method to generate limited location fingerprints [12], use variations of fingerprints such as RSS differences between every pair of APs [13], or do not need to generate location fingerprints at all [14]; however, they suffer from relatively low localization accuracy.
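The effect of the naive −100 dBm fill can be illustrated with a minimal Python sketch. The readings and helper names below are hypothetical; the sketch only contrasts two ways of averaging when an AP is missed in some, but not all, observations at one location.

```python
MISSING_DBM = -100.0  # naive placeholder value mentioned in the text

def mean_with_fill(readings, fill=MISSING_DBM):
    """Average readings, replacing missed ones (None) by a fixed fill value."""
    values = [fill if r is None else r for r in readings]
    return sum(values) / len(values)

def mean_observed_only(readings):
    """Average only the readings in which the AP was actually observed."""
    values = [r for r in readings if r is not None]
    return sum(values) / len(values) if values else None

# Hypothetical readings (dBm) to one AP at one location; None marks a miss.
readings = [-55.0, None, -57.0, None]
naive = mean_with_fill(readings)         # pulled toward -100 dBm by the fill
observed = mean_observed_only(readings)  # stays at the observed signal level
print(naive, observed)
```

With these made-up readings, the filled average drifts more than 20 dB away from the level at which the AP was actually heard, which is the kind of distortion the naive approach can introduce.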
By only reducing the measurement errors, it is still difficult to achieve a high localization accuracy. With sufficient RSS measurements, localization accuracy is mainly decided by the fourth factor.

Matching Algorithm.
The fourth key factor deciding the accuracy of localization is the matching algorithm used in the online phase, which outputs the final location by comparing the online RSS observations with the location fingerprints. To date, most LF systems mainly use, but are not limited to, the following types of matching algorithms.

Probabilistic Method.
The probabilistic method treats the matching problem as a classical classification problem. It computes the probability that the online observing location corresponds to each offline candidate location and then selects among the candidate locations based on these probabilities. The result of the localization can be either the candidate location with the highest probability or an average over all candidate locations, each weighted by its corresponding probability.
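The two ways of turning candidate probabilities into a location can be sketched as follows. This is a minimal illustration: the candidate coordinates and probabilities are hypothetical, and how the probabilities are obtained is outside the scope of the sketch.

```python
def most_probable_location(candidates, probabilities):
    """Return the candidate location with the highest probability."""
    best = max(range(len(candidates)), key=lambda i: probabilities[i])
    return candidates[best]

def probability_weighted_location(candidates, probabilities):
    """Return the average of all candidates, weighted by their probabilities."""
    total = sum(probabilities)
    x = sum(p * c[0] for c, p in zip(candidates, probabilities)) / total
    y = sum(p * c[1] for c, p in zip(candidates, probabilities)) / total
    return (x, y)

# Hypothetical candidate locations (meters) and their probabilities.
candidates = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
probabilities = [0.5, 0.25, 0.25]
print(most_probable_location(candidates, probabilities))
print(probability_weighted_location(candidates, probabilities))
```

The first variant always returns one of the surveyed locations, whereas the second can return a point between them.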

$k$-Nearest Neighbors. Based on the context information collected at the observing location, the $k$ nearest neighbors are defined as the $k$ offline candidate locations which have the most similar context information. The locations of the $k$ nearest neighbors (KNN) contribute to the result of the localization by direct averaging, or by weighted averaging in weighted KNN (WKNN). It must be taken into consideration that the context information can be of various kinds (e.g., wireless signal strength, brightness, temperature, and humidity), and the metric quantifying the distance between the vectors of context information should be carefully designed. When only wireless signal strength is used, the Euclidean distance in the signal strength space is often used as the metric. The locations which have the smallest distances to the observing location are the $k$ nearest neighbors, and the distances can be used to compute the weights in WKNN.
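The KNN and WKNN matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the fingerprint data are hypothetical, and the inverse-distance weighting (with a small `eps` to avoid division by zero) is one common choice that is assumed here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two RSS vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_locate(fingerprints, online_rss, k=3, weighted=True, eps=1e-9):
    """fingerprints: list of ((x, y), rss_vector) pairs; returns an (x, y) estimate."""
    # Keep the k fingerprints closest to the online observation in signal space.
    ranked = sorted(fingerprints, key=lambda fp: euclidean(fp[1], online_rss))[:k]
    if weighted:
        # WKNN: assumed inverse-distance weights.
        weights = [1.0 / (euclidean(rss, online_rss) + eps) for _, rss in ranked]
    else:
        # KNN: equal weights.
        weights = [1.0] * len(ranked)
    total = sum(weights)
    x = sum(w * loc[0] for w, (loc, _) in zip(weights, ranked)) / total
    y = sum(w * loc[1] for w, (loc, _) in zip(weights, ranked)) / total
    return (x, y)

# Hypothetical fingerprints: location in meters, RSS (dBm) to two APs.
FINGERPRINTS = [
    ((0.0, 0.0), [-40.0, -70.0]),
    ((10.0, 0.0), [-70.0, -40.0]),
    ((0.0, 10.0), [-60.0, -60.0]),
]
print(knn_locate(FINGERPRINTS, [-45.0, -65.0], k=2, weighted=True))
```

Note that the sketch assumes every RSS vector has a value for every AP; it does not address the missed AP problem discussed later in the paper.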

Other Machine Learning Methods.
Existing machine learning methods can be used to match the online location to the offline locations. A neural network can be trained in the offline phase; it takes as input the context information collected at the observing location, takes as target the location of the fingerprint, learns the weight matrix for each dimension of the context information, and finally outputs the localization result. The support vector machine can be used for pattern recognition with small samples, nonlinearity, and high dimensionality. The matching and localization can be accomplished by treating the location fingerprint information of candidate locations as support vectors and by performing classification and regression analysis on the context information collected at the target observing location. Other machine learning methods may also be used in location fingerprinting.

Localization without Site Survey.
The site survey in the offline phase can be extremely time-consuming and labor-intensive. Recently, many studies introduced crowdsourcing-based systems [3,[15][16][17][18][19][20] which require the users to continuously observe their RSS measurements and upload the data to the positioning server. These approaches do not require a site survey, and they do not require the floorplan. However, additional incentive mechanisms are required to attract enough participation, since users who upload their RSS measurements gain no benefit such as improved positioning accuracy but bear the privacy risk and the transmission cost. In our previous work [21], we proposed a novel indoor navigation mechanism for shopping mall environments, which requires only a few shop owners as RSS information contributors. Compared with our previous work, this work adapts the method to more general indoor location fingerprinting scenarios and also evaluates it by comparing it with existing location fingerprinting techniques. Furthermore, we do not focus on ways of collecting RSS measurements. We are only interested in the quantity of the RSS measurements, regardless of whether they are collected via a traditional site survey or a crowdsourced approach.

Problem Definition
There are many situations in real life in which we ask or are asked a question like "How can I go to?" or "Where is?" For instance, a consumer may want to find a certain shop in a shopping mall, or a patient may want to find the correct consulting room in a hospital. Nowadays, most indoor environments like the aforementioned shopping malls or hospitals have WLAN infrastructures; however, localization in these environments is still unavailable. The key reason is the cost of building and maintaining the fingerprint database. Existing techniques rely heavily on the assumption that sufficient RSS measurements can be collected, either by a site survey, which is extremely time-consuming and labor-intensive, or by a crowdsourced approach, which requires too many collaborative contributors.
Here is an example showing how much time one should spend in collecting "sufficient" RSS measurements. Consider a very tiny shopping mall with a total area of only 5,000 m² that includes all the floors. The offline observations are collected every 1 m², and at every observing location, at least 10 observations have to be collected. After each observation, a time interval of, for example, about 3 seconds is needed to obtain the next observation. Suppose one spends no time moving from one observing location to another, and the observations never fail. The surveyor then needs at least 5,000 × 10 × 3 = 150,000 seconds (i.e., 41.67 hours) to perform a site survey. If the fingerprint database needs to be updated every day, then we need more than 5 long-term employees (41.67/8 ≈ 5.2), each of whom works 8 hours a day with no weekends and no breaks. Remember that this is only for a tiny shopping mall. For large shopping malls, the workload can be incredibly heavy. Perhaps this is the reason why building owners often choose to deploy dedicated infrastructures to provide localization, rather than the "infrastructure-free" location fingerprinting.
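The arithmetic of this example can be reproduced with a short Python sketch. The numbers are those given in the text; the variable names and the rounding up to whole surveyors are choices made for this illustration.

```python
import math

area_m2 = 5000            # total floor area of the mall
cell_m2 = 1               # one observing location per square meter
obs_per_location = 10     # observations collected at each location
seconds_per_obs = 3       # waiting time between two observations
work_hours_per_day = 8    # daily working time of one surveyor

locations = area_m2 // cell_m2
total_seconds = locations * obs_per_location * seconds_per_obs
total_hours = total_seconds / 3600
# Whole surveyors needed to finish one full survey within a single working day.
surveyors = math.ceil(total_hours / work_hours_per_day)

print(total_seconds, round(total_hours, 2), surveyors)
```

Since 41.67 hours of work do not fit into five 8-hour shifts, rounding up to whole people gives six surveyors for a daily update under these idealized assumptions.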
So, our problem is: when the collected RSS measurements are not sufficient, how do we perform localization? At least the following two problems should be addressed.

Measurement Noises.
One problem arising from insufficient RSS measurements deals with noises in the RSS measurements. RSS values can change greatly across different observations, even at the same location and for the same AP, as shown in Figure 1. In this example, the standard deviation is 13.66. Without denoising the RSS measurements, no accurate fingerprints can be generated and no accurate localization can be performed. One may think of an intuitive solution: averaging different observations at neighboring locations. This idea is not always correct, as illustrated in Figure 2. Suppose three observing locations $L_1$, $L_2$, and $L_3$ are in a line, and $L_2$ lies between $L_1$ and $L_3$. The simple but incorrect solution denoises the observation at $L_2$ by weighted-averaging the observations at $L_1$ and $L_3$, and the weights can be computed from the distances $d_{12}$ and $d_{23}$. However, this denoising method is not always correct (if not always incorrect), since it relies on the wrong hypothesis that the RSS at different locations in a 2D or 3D space can be modeled by a linear function. In Figure 2, suppose the AP is located closer to $L_2$; the RSS to this AP observed at $L_2$ should then be larger than those observed at $L_1$ and $L_3$. So, the denoising method will definitely reduce the value of $RSS_2$.

Missed APs. Another problem deals with dimensional mismatches between different RSS observations in the signal space.

We use an open dataset to show how frequently an AP can be missed in an arbitrary observation. The dataset is the Mannheim/compass dataset [22], which contains Wi-Fi observations at different locations. Our experiments described in Section 6 are also based on this dataset. For a given RSS observation, the APs can be classified into the following three categories, as shown in Figure 3:
(i) Observed APs. Those are observed in the record. Conversely, the unobserved APs are those not observed in the record.
(ii) Unobservable APs. Those cannot be observed at the observing location. An AP is unobservable if no record at this location ever observed this AP. The unobservable APs must be unobserved APs, but unobserved APs may be observable.
(iii) Missed APs. Those are observable but unobserved in this record.
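The three categories can be computed mechanically from the records collected at one location. The following Python sketch assumes a hypothetical record format (a dictionary from AP identifier to RSS in dBm, containing only the APs seen in that scan); it is not the format of the Mannheim/compass dataset.

```python
def classify_aps(records, record_index, all_aps):
    """Classify APs for one record among all records taken at the same location.

    records: list of dicts mapping AP id -> RSS (dBm), one dict per scan.
    Returns (observed, missed, unobservable) as sets of AP ids.
    """
    # An AP is observable at this location if at least one record contains it.
    observable = set()
    for record in records:
        observable.update(record)
    observed = set(records[record_index])
    missed = observable - observed           # observable but absent from this record
    unobservable = set(all_aps) - observable  # never seen at this location
    return observed, missed, unobservable

# Hypothetical scans at a single location.
records = [
    {"AP1": -50, "AP2": -70},
    {"AP1": -52},
    {"AP1": -49, "AP2": -72},
]
observed, missed, unobservable = classify_aps(records, 1, ["AP1", "AP2", "AP3"])
print(observed, missed, unobservable)
```

In the second scan, AP2 is a missed AP (it appears in the other scans), while AP3 is unobservable at this location.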
The proportions of the observed APs, the unobservable APs, and the missed APs are shown in Figure 4. One interesting finding is that the probability of missing an AP is not obviously related to the averaged RSS value. According to this finding, it is not reasonable to treat the RSS to a missed AP (i.e., an unobserved but observable AP) as −100 dBm, since −100 dBm means the AP is unobservable.

Basic Idea
In the following, we present how our proposed method deals with the missed APs and the measurement noises.

Dealing with Missed APs. Consider an indoor environment as illustrated in Figure 5, with four APs and four observing locations denoted as $L_1$, $L_2$, $L_3$, and $L_4$. At each observing location, we perform an RSS observation, and within each observation, a specific AP is missed. We can further suppose that the location of an arbitrary observation is unknown and needs to be localized. In this situation, it is difficult to perform traditional localization, since these observations observe different APs. If we discard the dimensions of the missed APs to avoid dimensional mismatches, no dimensions are left in the signal space. As a result, none of these observations can be used by traditional localization techniques. Our idea to solve the missed AP problem is straightforward. The observations with missed APs cannot be directly used for location fingerprinting; however, the information within the observations is valuable, since it reveals the relationship between the relative locations of all APs and observations. Back to our example in Figure 5, AP 3 is missed in the observation at $L_2$. Our idea is to compute $RSS_3$ for $L_2$ based on the values of other RSS measurements at other locations. The RSS to all other missed APs can be computed in a similar way. If we can compute a theoretical RSS value for each of the missed APs in all observations, the dimensional mismatches can be avoided and localization can be performed. The detailed algorithm is presented in Section 5.
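The dimensional mismatch in this example can be made concrete with a small Python sketch. The RSS values are hypothetical; only the pattern of which AP is missed at which location follows the example above.

```python
# Hypothetical RSS values (dBm); each observation misses exactly one of four APs.
observations = {
    "L1": {"AP1": -48, "AP2": -61, "AP3": -70},
    "L2": {"AP1": -55, "AP2": -52, "AP4": -73},
    "L3": {"AP1": -66, "AP3": -50, "AP4": -58},
    "L4": {"AP2": -69, "AP3": -57, "AP4": -49},
}

# All APs seen in at least one observation.
all_aps = sorted(set().union(*observations.values()))
# Dimensions that remain if every missed AP is discarded from the signal space.
common_aps = sorted(set.intersection(*(set(o) for o in observations.values())))
# The AP whose RSS would have to be computed for each location.
missed = {loc: [ap for ap in all_aps if ap not in obs]
          for loc, obs in observations.items()}

print(common_aps)
print(missed)
```

No AP is shared by all four observations, so discarding mismatched dimensions leaves nothing to compare, whereas filling in one computed value per observation would yield four complete 4-dimensional vectors.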

Reducing Measurement Noises.
Our idea in denoising the RSS measurements is to some extent related to our solution to the missed AP problem. As shown in Figure 5, after we compute $RSS_4$ for $L_1$, $RSS_3$ for $L_2$, $RSS_2$ for $L_3$, and $RSS_1$ for $L_4$, we finally have 4 observations, each of which contains RSS measurements to all the 4 APs. Now, the localization can be performed by matching the RSS observation distance in the 4D signal space to the locational distance in the 2D physical space. However, the output of this localization process is far from accurate, since every RSS measurement in every observation is noisy. Without sufficient RSS measurements at the same location, traditional localization techniques cannot reduce the noises effectively.
Again, we make use of all RSS measurements to compute the relationship between the relative locations of all APs and observations. Back to our example in Figure 5, this relationship can be initially computed from the preliminary localization result. With this relationship, $RSS_1$ for $L_3$ and $RSS_1$ for $L_4$ can be used to modify $RSS_1$ for $L_1$, and every other RSS measurement can also be denoised by carefully computing the weights of the measurements to the same AP, whether taken at the same location or at other locations. Then, the newly denoised RSS measurements can be used to improve the accuracy of the previous localization and thus output more accurate locations. Our proposed method iterates between denoising the RSS measurements and refining the locations until convergence. The detailed algorithm is presented in Section 5.

Designing Details
We now introduce our proposed localization method in detail. In the Notations, we summarize the main notations introduced throughout this article.
Suppose there are $m$ APs within an area, denoted as $AP_1, AP_2, \ldots, AP_m$; the RSS information collected at one location can be described as an $m$-dimensional vector:

$$o = (RSS_1, RSS_2, \ldots, RSS_m),$$

where each dimension corresponds to the RSS information of an AP. If an AP is not observed, the RSS value is $-\infty$ and is marked as 0.
For situations when the RSS observations are insufficient, we suppose the location fingerprinting algorithm takes as input $n$ offline RSS observations at different locations and one online RSS observation and outputs the online observing location. The RSS information can be collected as a sparse matrix, denoted as

$$R = \left(RSS_{i,j}\right),$$

where $RSS_{i,j}$ is the RSS measurement at the $i$th location to the $j$th AP. We denoise the RSS information in this sparse matrix based on the following two assumptions:
(i) The value of RSS (in dBm) follows the Gaussian distribution:

$$RSS \sim \mathcal{N}(\mu, \sigma^2).$$

(ii) The signal propagation path loss varies exponentially with distance:

$$PL(d) = PL_0 + 10\gamma \lg\frac{d}{d_0} + X_\sigma,$$

where $PL_0$ is the path loss at unit distance $d_0$, $\gamma$ is the propagation path loss exponent, and $X_\sigma$ is a Gaussian random variable with 0 mean. Let $RSS_{0,j}$ denote the observed value of RSS of the $j$th AP at $d_0 = 1$ m; we can obtain the relationship between the value of the RSS and the distance by

$$RSS_{i,j} = -10\gamma \lg d_{i,j} + RSS_{0,j}.$$
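The relationship between RSS and distance can be evaluated directly. The Python sketch below assumes a path loss exponent of 3.0, which is a hypothetical value, and uses −36 dBm as the RSS at 1 m, the value reported for most APs later in this section; the function names are chosen for this illustration.

```python
import math

def rss_at_distance(d_m, rss0_dbm=-36.0, gamma=3.0):
    """Model RSS (dBm) at distance d_m (meters): RSS = -10 * gamma * lg(d) + RSS_0."""
    return rss0_dbm - 10.0 * gamma * math.log10(d_m)

def distance_for_rss(rss_dbm, rss0_dbm=-36.0, gamma=3.0):
    """Invert the model: distance (meters) at which the model predicts rss_dbm."""
    return 10.0 ** ((rss0_dbm - rss_dbm) / (10.0 * gamma))

for d in (1.0, 2.0, 5.0, 10.0, 20.0):
    print(d, round(rss_at_distance(d), 1))
```

Under these assumed parameters, every tenfold increase in distance lowers the modeled RSS by 30 dB, which illustrates why RSS cannot be interpolated linearly between neighboring locations.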
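Moreover, the relationship among the location of the $i$th observing location ($L_i$), the location of the $j$th AP ($A_j$), and the distance $d_{i,j}$ between $L_i$ and $A_j$ can be formulated as

$$d_{i,j} = \lVert L_i - A_j \rVert_2.$$

The above assumptions are also made in our previous work [21] and in many other approaches. With these assumptions, we can compute the RSS values for the missed APs and fill in the blank items in the sparse matrix of RSS by using the maximum likelihood estimate with probability density function

$$p\left(RSS_{i,j} \mid L_i, A_j, RSS_{0,j}\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{\left(RSS_{i,j} + 10\gamma \lg d_{i,j} - RSS_{0,j}\right)^2}{2\sigma^2}\right),$$

where $\gamma$ is the propagation path loss exponent and $\sigma$ is the standard deviation of the Gaussian noise. Under the independent and identical distribution hypothesis on $L$, $A$, and $RSS_0$, the maximum probability of RSS is observed as

$$\max_{L, A, RSS_0} \prod_{i,j} p\left(RSS_{i,j} \mid L_i, A_j, RSS_{0,j}\right).$$

This is equivalent to minimizing

$$\sum_{i,j} \left(RSS_{i,j} + 10\gamma \lg d_{i,j} - RSS_{0,j}\right)^2,$$

where the product and the sum run over the observed entries. We define the estimation error as

$$e_{i,j} = RSS_{i,j} + 10\gamma \lg d_{i,j} - RSS_{0,j}.$$

The process of fitting can be achieved by using the stochastic gradient descent method:

$$\theta \leftarrow \theta - \eta \frac{\partial e_{i,j}^2}{\partial \theta}, \quad \theta \in \{L_i, A_j, RSS_{0,j}\},$$

where $\eta$ is the length of step. This fitting process attempts to compute a large number of unknown values (i.e., the RSS values to the missed APs) from only a small amount of given data. As a result, the fit tends to describe the random error or noise instead of the underlying relationship between the RSS and the location information (i.e., it overfits). To address this problem, a typical solution is to use regularization, which modifies the objective function as

$$\min_{w} \sum_{i,j} e_{i,j}^2 + \Omega(w),$$

where $w$ is the weight vector and $\Omega(w)$ is the regularization term. Take L2 regularization as an example; $\Omega(w)$ can be defined as

$$\Omega(w) = \lambda \lVert w \rVert_2^2,$$

where $\lambda$ is a free parameter, which needs to be adjusted by methods like cross-validation. In our experiment, we find that, for most APs, $RSS_{0,j} = -36$ dBm. It is worth noting that it is usually difficult to use the cross-validation method, so an early exit strategy can also be used.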
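The fitting step can be illustrated with a simplified Python sketch. It is not the authors' implementation: it assumes that the offline observing locations are known, fits only the AP coordinates and the per-AP RSS at 1 m by stochastic gradient descent on the squared estimation error, and then fills the missed entries with model predictions. The path loss exponent, the height offset that keeps distances above zero, the step length, the L2 weight, the stopping tolerance, and the synthetic data are all assumptions made for this example.

```python
import math
import random

GAMMA = 3.0    # assumed path loss exponent (hypothetical value)
HEIGHT2 = 1.0  # assumed squared height offset; keeps the distance above zero

def predict(loc, ap, rss0):
    """Model RSS = RSS_0 - 10 * gamma * lg(d) for one location/AP pair."""
    d2 = (loc[0] - ap[0]) ** 2 + (loc[1] - ap[1]) ** 2 + HEIGHT2
    return rss0 - 5.0 * GAMMA * math.log10(d2)  # 10 * lg(d) == 5 * lg(d^2)

def data_loss(samples, locs, aps, rss0):
    """Sum of squared estimation errors over the observed entries."""
    return sum((r - predict(locs[i], aps[j], rss0[j])) ** 2 for i, j, r in samples)

def fit(samples, locs, n_aps, eta=0.002, lam=1e-4, epochs=300, tol=1e-6, seed=0):
    """SGD over observed entries (i, j, rss); returns (AP coordinates, RSS_0 values)."""
    rng = random.Random(seed)
    cx = sum(p[0] for p in locs) / len(locs)
    cy = sum(p[1] for p in locs) / len(locs)
    # Start every AP near the centroid of the observing locations.
    aps = [[cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1)] for _ in range(n_aps)]
    rss0 = [-36.0] * n_aps
    best = (data_loss(samples, locs, aps, rss0), [a[:] for a in aps], rss0[:])
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)
        for i, j, r in order:
            dx, dy = aps[j][0] - locs[i][0], aps[j][1] - locs[i][1]
            d2 = dx * dx + dy * dy + HEIGHT2
            e = r - predict(locs[i], aps[j], rss0[j])
            g = 10.0 * GAMMA / (math.log(10.0) * d2)
            # Gradient of e^2 plus the L2 penalty lam * ||w||^2 for each parameter.
            aps[j][0] -= eta * (2.0 * e * g * dx + 2.0 * lam * aps[j][0])
            aps[j][1] -= eta * (2.0 * e * g * dy + 2.0 * lam * aps[j][1])
            rss0[j] -= eta * (-2.0 * e + 2.0 * lam * rss0[j])
        loss = data_loss(samples, locs, aps, rss0)
        improvement = best[0] - loss
        if loss < best[0]:
            best = (loss, [a[:] for a in aps], rss0[:])
        if improvement < tol:
            break  # early exit once the loss stops improving
    return best[1], best[2]

def fill_missing(matrix, locs, aps, rss0):
    """Replace missed entries (None) in the sparse RSS matrix by model predictions."""
    return [[predict(locs[i], aps[j], rss0[j]) if v is None else v
             for j, v in enumerate(row)] for i, row in enumerate(matrix)]

# Synthetic, noiseless data: 16 known locations on a 3 m grid and 2 APs.
locs = [(3.0 * a, 3.0 * b) for a in range(4) for b in range(4)]
true_aps, true_rss0 = [(2.0, 2.0), (8.0, 7.0)], [-40.0, -40.0]
matrix = [[predict(loc, ap, r0) for ap, r0 in zip(true_aps, true_rss0)] for loc in locs]
matrix[5][1] = None   # simulate missed APs
matrix[10][0] = None
samples = [(i, j, v) for i, row in enumerate(matrix)
           for j, v in enumerate(row) if v is not None]
aps, rss0 = fit(samples, locs, n_aps=2)
filled = fill_missing(matrix, locs, aps, rss0)
```

The sketch keeps the best parameters seen so far, so the returned fit is never worse than the starting point on the observed entries. Estimating the unknown online location would add its coordinates to the set of fitted parameters in the same way.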

Evaluation
In this section, we evaluate our proposed localization method and compare it with some of the most commonly used location fingerprinting methods.
6.1. Benchmark. In the following, we detail the benchmark used in our experiments.

6.1.1. Dataset. We use an open dataset, the Mannheim/compass dataset [22], to perform our experiments. It records traces of the signal strength of 802.11 APs and contains data from both an offline training phase and an online positioning phase, in an area of about 35 meters in width and 60 meters in length. The offline fingerprinting data contains 14,300 measurement records for 130 locations (110 records each), and the online positioning data contains 5,060 measurement records for 46 locations (110 records each).
6.1.2. Compared Methods. We choose the three most commonly used location fingerprinting methods for comparison. All three methods generate the same location fingerprint database by simply averaging the observations at the same location.
(i) The weighted $k$-nearest neighbor (WKNN) method [23] is a deterministic method which computes the estimated location by weighted-averaging the fingerprint locations:

$$\hat{L} = \frac{\sum_{j} w_j L_j}{\sum_{j} w_j},$$

where $w_j$ represents the weight of the fingerprint location $L_j$. It can be computed by

$$w_j = \frac{1}{\lVert o - f_j \rVert_2},$$

where $o$ is the online RSS observation and $f_j$ is the fingerprint at $L_j$. Here, the Euclidean norm (2-norm) is used. WKNN keeps the $k$ biggest weights and sets the others to zero.
(ii) The $k$-nearest neighbor (KNN) method [24] is a simplified version of WKNN which sets the $k$ biggest weights to $1/k$ and the others to zero.
(iii) The histogram method [25] is a probabilistic method, which computes the probability that an RSS observation $o$ was observed at the location $L_j$ by using Bayes' rule:

$$p(L_j \mid o) = \frac{p(o \mid L_j)\, p(L_j)}{p(o)},$$

where $p(o)$ is a normalizing constant, the prior $p(L_j)$ is computed from $|L_j|$, the volume of $L_j$, and the likelihood $p(o \mid L_j)$ is computed from the normalized centralized histogram of the RSS values.

It is worth noting that the experiment results provide a direct answer to the questions we listed in the Introduction. From Figure 7(a), we can see that KNN and WKNN can output accurate locations (i.e., the mean localization error is about 2 m) when we can observe at least 4 RSS measurements within an area of 2.5 m² in size. From Figure 7(b), we can see that 16 RSS measurements are required for a 10 m² area. This means that, on average, one should observe at least 1.6 RSS measurements per square meter to achieve accurate localization. This is the answer to the question "How many RSS measurements are required to compute an accurate location?" under our experiment settings.
The results also show that our proposed method may be an answer to the question "How do we locate accurately with insufficient RSS measurements?" We can see that, whatever the values of the two parameters, our proposed method achieves smaller mean localization errors and smaller mean squared localization errors, especially when the number of observing locations is relatively large and the number of observations at each location is relatively small. This is reasonable: traditional methods can denoise RSS measurements at the same location, so our proposed method does not have evident superiority with a large number of observations per location and a small number of observing locations. However, with a small number of observations per location and a large number of observing locations, our proposed method can (while the compared methods cannot) address the missed AP problem and denoise the RSS measurements across different observing locations.
The performances of KNN and WKNN are nearly the same, and the histogram method performs worse than the other methods: it always has a large mean localization error and a large mean squared error. Besides, the histogram method can fail to estimate a location when the RSS measurements are not sufficient. The failure rate is shown in Figure 9.

Conclusion
This paper investigates the problems of localization arising from insufficient RSS measurements, that is, the missed AP problem and the RSS measurement noise problem. Traditional location fingerprinting methods rely on a large quantity of RSS observations at the same location, both to eventually observe all the APs so that no AP is missed from the location fingerprints and to denoise RSS measurements by averaging the observations at the same location. We propose a novel localization method which uses maximum likelihood estimation and stochastic gradient descent to estimate locations when the RSS measurements are insufficient to generate accurate location fingerprints. The results show that our proposed method can achieve better localization accuracy than the most commonly used location fingerprinting methods, such as the KNN, WKNN, and histogram methods. In particular, when the number of observations at each location is relatively small, our proposed method has evident superiority.

Figure 1: RSS collected 100 times from the same AP.

Figure 2: Denoising RSS value by averaging different observations from neighboring locations is not a good idea.

Figure 4: Proportion of the observed APs, the unobservable APs, and the missed APs. Here, "RSS" is not an observed value, but a theoretically computed value obtained by averaging the measurements in which this AP is observed.

Figure 5: Dealing with the missed AP problem.

Figure 6: Floorplan of the testing area in the dataset used for evaluation. The red dots show the locations where online RSS observations are collected, and the blue dots show the offline observing locations. Yellow dots show the locations of the APs, which we do not assume to be known in our experiments.

Figure 7: Mean localization error versus the number of observations per location for KNN, WKNN, the histogram method, and our proposed method, with different settings of the observing-location volume.

Notations
$RSS_{i,j}$: RSS measurement at the $i$th location to the $j$th AP
$L_i$: Coordinates of the $i$th observing location
$A_j$: Coordinates of the $j$th AP's location
$d_{i,j}$: Distance between $L_i$ and $A_j$
$R$: The set of all RSS measurements
$e$: The estimation error
$o$: Observation of RSS measurements
$n$: The number of observing locations
$m$: The number of APs
$c$: The number of observations per location
$v$: The volume of each observing location.

Figure 8: Mean squared localization error versus the number of observations per location for KNN, WKNN, the histogram method, and our proposed method, with different settings of the observing-location volume.

Figure 9: Failure rate versus the number of observations per location for the histogram method, with different observing-location volumes.
When the Wi-Fi scan operations are performed frequently, many APs can be missed in the RSS observations. As a result, even RSS observations from very nearby locations may observe different APs. Since the distance between different RSS observations is computed in a high-dimensional signal space where each AP is a dimension, the missed APs will cause dimensional mismatches. If the RSS measurements are not sufficient, dimensional mismatches can always occur. The dimensional mismatches can cause localization failures and errors, and we call this problem the missed AP problem.

Observation vectors in the example of Figure 5: ⟨RSS 1, RSS 2, RSS 3⟩, ⟨RSS 1, RSS 2, RSS 4⟩, ⟨RSS 1, RSS 3, RSS 4⟩, ⟨RSS 2, RSS 3, RSS 4⟩, and ⟨RSS 1, RSS 2, RSS 3, RSS 4⟩.

6.1.3. Experiment Settings. We use a program to randomly choose the RSS observations based on two parameters. The first represents the number of observations selected at each observing location (for both offline and online); the second represents the size of a cell in which one offline observing location is selected (i.e., the volume of the observing location). For example, when the number of observations is 8 and the cell size is 36 × 2.25 = 81 m², there is exactly one offline observing location in every 81 m² area, and 8 RSS observations are selected at this location. An online observing location is then randomly selected, and again 8 RSS observations are selected at this location. Using these online and offline data, the location is estimated by each of the three compared methods and by our proposed method. We let the number of observations per location take values in {1, 2, 4, 8, 16, 32, 64} and the cell size take values in {1, 4, 9, 16, 25, 36} × 2.25 m² (the distance between two nearby blue dots in Figure 6 is 1.5 m in the real world, so 2.25 m² is the minimum cell size). The experiment is performed 100 times for each pair of parameter values.
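The selection procedure of the experiment settings can be sketched in Python as follows. This is a minimal illustration under assumptions: the records are assumed to be grouped by location coordinates, cells are taken to be axis-aligned squares, and the function and variable names are hypothetical rather than taken from the authors' program.

```python
import random

def select_observations(records_by_location, n_obs, cell_side_m, rng):
    """Pick one observing location per square cell and n_obs records at it.

    records_by_location: dict mapping (x, y) in meters -> list of records.
    Returns a dict mapping each selected location to its sampled records.
    """
    # Group the locations by the square cell they fall into.
    cells = {}
    for (x, y) in records_by_location:
        key = (int(x // cell_side_m), int(y // cell_side_m))
        cells.setdefault(key, []).append((x, y))
    selected = {}
    for key in sorted(cells):
        loc = rng.choice(sorted(cells[key]))  # exactly one location per cell
        selected[loc] = rng.sample(records_by_location[loc], n_obs)
    return selected

# Hypothetical data: a 6 x 6 grid with 1.5 m spacing and 110 records per location.
grid = {(1.5 * i, 1.5 * j): list(range(110)) for i in range(6) for j in range(6)}
rng = random.Random(0)
coarse = select_observations(grid, n_obs=8, cell_side_m=9.0, rng=rng)  # 81 m^2 cells
fine = select_observations(grid, n_obs=8, cell_side_m=1.5, rng=rng)    # 2.25 m^2 cells
print(len(coarse), len(fine))
```

With 81 m² cells, the whole 9 m × 9 m toy grid contributes a single observing location with 8 records, whereas 2.25 m² cells keep every grid point.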