An Optimization Method for the Geolocation Databases of Internet Hosts Based on Machine Learning

In order to improve the accuracy and robustness of geolocation (geographic location) databases, a machine learning based method called GeoCop (Geolocation Cop) is proposed for optimizing the geolocation databases of Internet hosts. In addition to network measurement, which existing geolocation methods always rely on, our geolocation model for Internet hosts is also derived from routing policy and machine learning. After optimization with the GeoCop method, the geolocation databases of Internet hosts are less sensitive to imperfect measurement and irregular routing. In addition to three frequently used geolocation databases (IP138, QQWry, and IPcn), we build two further geolocation databases by implementing two well-known geolocation methods (the constraint-based geolocation method and the topology-based geolocation method), and we use all five as the objects to be optimized. Finally, we give a comprehensive analysis of the performance of our method. On the one hand, we use typical benchmarks to compare the performance of these databases after optimization; on the other hand, we perform statistical tests to show the improvement achieved by the GeoCop method. As presented in the comparison tables, the GeoCop method not only achieves improved accuracy and robustness but also incurs lower measurement and computation overhead.


Introduction
With the rapid development of cloud computing and cloud storage, the cloud is becoming a popular medium for storing and computing data. As data is stored on virtual machines, which are deployed on a cloud provider's infrastructure, cloud users give up direct control of their data in exchange for faster on-demand resources and shared administrative costs. Typically, they specify the QoS (Quality of Service) requirements of their outsourced data in an SLA (Service Level Agreement), including not only common QoS requirements such as delay but also location restrictions, which define regional access control for the cloud resource [1]. These aspects all depend on the geographic locations of Internet hosts. To fulfil the demands of the SLA, cloud providers should choose the proper servers to reduce the delay before an Internet host gets the required cloud resource, or decide whether this Internet host may access the cloud resource according to its geographic location [2]. The granularity of geographic location used here is always coarse, such as country, province, or city.
Cloud providers always establish a geolocation database that collects the mapping between the IP addresses of Internet hosts and their geographic locations. The geolocation information in the database is obtained either by implementing existing geolocation methods or directly from public IP geolocation databases [3]. Problems arising from both the geolocation methods and the database itself affect the accuracy of the geolocation database of Internet hosts. The geolocation methods can be divided into two types depending on their underlying methodologies for collecting geolocation information: registration-based geolocation methods and measurement-based geolocation methods [4]. The former use previously registered data to obtain the geographic locations of the respective IP addresses. In general, these methods provide accurate location information. However, in some cases, their errors are large for entire blocks of IP addresses, because their precision depends greatly on the resolution and reliability of the registered data they utilize [5]. The latter utilize active delay and topology measurements to overcome these limitations; but because of queuing delays and circuitous routes, additive noise introduces some inherent inaccuracy and unpredictability into the measurements [6][7][8][9]. Moreover, owing to the large number of Internet hosts and changes in IP assignment, it is hard to maintain and continuously update the geolocation information of Internet hosts in the database. Because of these drawbacks of the geolocation methods and the database itself, the accuracy of geolocation databases remains limited [10].
This research aims to improve the accuracy and robustness of the geolocation databases of Internet hosts and to analyze the properties of routing policy in China from the viewpoint of geographic location. In this paper, we propose a machine learning based method (GeoCop) for optimizing the geolocation databases of Internet hosts, which combines network measurement, machine learning, and routing policy to derive the geolocation model of Internet hosts. The proposed GeoCop method makes the geolocation results less sensitive to imperfect measurements and irregular routing. To demonstrate the accuracy and robustness of the GeoCop method, we not only compare the performance of the geolocation databases after optimization but also perform statistical tests to show the improvement achieved by the method. The evaluation shows that the proposed method is effective. The rest of this paper is organized as follows. Section 2 summarizes previous work on measurement-based geolocation methods and analyzes the problems of these methods. Section 3 briefly evaluates the existing geolocation databases of Internet hosts. In Section 4, the GeoCop method is described in detail. The experiments and results are presented in Section 5, and Section 6 concludes the paper.

Related Work
Measurement-based geolocation methods always leverage a set of geographically distributed landmarks with known geographic locations to geolocate the targets. These landmarks use network measurement tools to obtain various network properties, such as Internet delay and topology information [11,12]. We classify the existing measurement-based methods into two types depending on the network properties they use: delay-based geolocation methods and topology-based geolocation methods.

Delay-Based Geolocation Methods.
Most delay-based geolocation methods geolocate the target by exploiting the relationship between Internet delay and geographic distance and are differentiated only by the way they express the distance-to-delay function and triangulate the geographic location of the target [13]. IP2Geo [14] was among the first to suggest a delay-based approach to approximate the geographic distance of Internet hosts. Youn et al. [15] presented a statistical geolocation scheme. They apply kernel density estimation to delay measurements among a set of landmarks and estimate the target location by maximizing the likelihood of the distances from the target to the landmarks. Maziku et al. [16] proposed an Enhanced Learning Classifier approach for estimating the geolocation of Internet hosts with increased accuracy. They reduced the average error distance in the geolocation of Internet hosts by extracting six features from network measurements. Arif et al. [17] used bivariate kernel density estimation to approximate the joint probability distribution of distance and delay. Eriksson et al. [18] reduced IP geolocation to a machine learning classification problem and used a Naive Bayes framework to increase geolocation accuracy.
Gueye et al. [19] proposed a more mature approach called CBG (Constraint-based Geolocation), which used several delay constraints to infer the geographic location of an Internet host by a triangulation-like method. For each landmark, they used the distance-to-delay relationships between landmarks to derive a maximum distance bound for a given delay from this landmark to the target. They drew a circle centered at this landmark based on the distance bound. Then, the intersection of the circles derived from all of the landmarks formed a convex region. It was assumed that the target resided in the convex region and that the centroid of this convex region was the target location.

Topology-Based Geolocation Methods.
When geolocating the target, the topology-based geolocation methods leverage the network topology in addition to the relationship between Internet delay and geographic distance. Laki et al. [13] increased geolocation accuracy by decomposing the overall path-wise packet delay into link-wise components. Guo et al. [20] used web mining together with network measurement to geolocate IP addresses with significantly better accuracy. Tian et al. [21] performed a large-scale topology mapping and geolocation study for China's Internet. They developed a heuristic approach for clustering the interfaces in a hierarchical ISP (Internet Service Provider) and applied it to the hierarchical structure of the major ISPs in China's Internet. Shavitt and Zilberman [22] introduced a novel approach for generating a POP (Point of Presence) level map and then combined the geolocation information of all the IP addresses in a POP from the geolocation database to assign geographic locations to the POP. Biswajit et al. [23] proposed a classification-based method to improve the accuracy of geolocation for the datacenters of commercial cloud providers. Wong et al. [24] presented a novel geolocation framework called Octant for the geolocation of Internet hosts, which gained its accuracy and precision by taking advantage of both positive and negative constraints.
Katz-Bassett et al. [25] presented a topology-based geolocation method called TBG (Topology-based Geolocation) for estimating the geographic location of an arbitrary Internet host. They converted the Internet delay data, along with the network topology information, into a set of constraints for the geolocation. They first obtained the maximum distance bound based on the maximum transmission speed of packets in fiber and then further refined the region using inter-router delays along the path from the landmarks to the target. Finally, they obtained the geographic location of the target through a global optimization that minimized the average location errors for the target and the routers. Because they rely on the relationship between Internet delay and geographic distance, both the delay-based methods and the topology-based methods may fail under two conditions: (1) the delay from the target to the landmark deviates considerably from the normal value; (2) a malicious Internet host, known as an adversary, tries to disguise its geographic location by tampering with the delay measurements. In either case, the geographic location of the Internet host appears to be wrong; an instance is presented in Figure 1.

For simplicity, we assume that there are only three landmarks. The delays from the landmarks to the target are denoted by $d_1$, $d_2$, and $d_3$, and the black arcs $C_1$, $C_2$, and $C_3$ are the circles drawn by these landmarks while geolocating the target. The region enclosed by the arcs is the feasible region of the target location, and the geographic location of its centroid is the geolocation result for the target (black dot). When the delays $d_1$ and $d_2$ are increased to $d_1 + \Delta d_1$ and $d_2 + \Delta d_2$, respectively, the circles $C_1$ and $C_2$ change into $C'_1$ and $C'_2$, which are drawn as red dotted lines. The centroid of the enclosed region then moves, so the target appears at a wrong location (red dot). Consequently, more accurate and robust geolocation requires further improvement of the existing geolocation methods to offset network measurement errors.
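The effect of an inflated delay on a distance bound can be illustrated with a minimal Python sketch. It assumes that the measured delay is a round-trip time and that packets travel at roughly two thirds of the speed of light in fiber (about 200 km per millisecond); the function name and the constant are illustrative assumptions, not part of CBG or TBG as published.

```python
# Assumption: signal speed in fiber is about 2/3 of the speed of light,
# i.e. roughly 200 km per millisecond of one-way delay.
SPEED_IN_FIBER_KM_PER_MS = 200.0


def max_distance_km(rtt_ms: float) -> float:
    """Upper bound on the landmark-to-target distance for a round-trip delay."""
    one_way_ms = rtt_ms / 2.0
    return one_way_ms * SPEED_IN_FIBER_KM_PER_MS


# A clean measurement and the same measurement with 5 ms of added delay
# (queuing, circuitous routing, or deliberate tampering).
clean_bound = max_distance_km(10.0)
inflated_bound = max_distance_km(10.0 + 5.0)
print(clean_bound, inflated_bound)
```

Under these assumptions every extra millisecond of round-trip delay widens the circle around the landmark by about 100 km, which is why the centroid of the intersected region can drift to a wrong location.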

Geolocation Databases
In this section, we briefly evaluate the geolocation databases of Internet hosts that are currently available to cloud providers. Owing to the limited number of ground truth nodes, we also use cross validation to evaluate the accuracy of these geolocation databases, in addition to the usual validation based on ground truth nodes. We consider two kinds of geolocation databases in this study. The first kind consists of three existing geolocation databases (IP138, QQWry, and IPcn) [21], which are well known in the Chinese [...] where $N_{right}$ is the number of ground truth nodes whose geolocation information in the database is the same as their actual geographic locations, and $N$ is the number of ground truth nodes; the accuracy rate is thus the ratio $N_{right}/N$. Table 1 presents the accuracy rates for the five databases at the province and city granularities. As observed from Table 1, none of the accuracy rates is high enough, and the accuracy rates at the city granularity are lower than those at the province granularity.

Cross Validation.
Considering the limited number of ground truth nodes, we also use cross validation to complement the evaluation of the accuracy of these geolocation databases. For an Internet host, if the geographic locations from the five databases are the same, it is most likely that the geographic location is correct; otherwise, we have a low level of confidence in the geolocation information. We define the coverage rate $R_c$ as another criterion for the accuracy of the geolocation databases, which is the fraction of cases in which different databases have the same geolocation information for the IP addresses in the set under evaluation. The higher the coverage rate is, the more accurate the databases are. In the definition of the coverage rate, $k$ is the number of chosen comparison databases, $n$ is the total number of comparison databases, $N_{repeat}$ is the number of IPs that have the same geographic location in the different comparison databases, and $N$ is the number of all the IP addresses in the set under evaluation. Table 2 presents the coverage rates for different numbers of comparison databases. As observed from the table, none of the coverage rates is high enough, and the coverage rates at the city granularity are lower than those at the province granularity.
From the evaluation of the geolocation databases of Internet hosts, we observe that there is still a lot of room for these databases to improve. The goal of this paper is to develop a method for optimizing the geolocation databases of Internet hosts in order to improve the accuracy and robustness of geolocation for China's Internet hosts. The following section explains the process of the GeoCop method in detail.
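The idea behind the coverage rate can be sketched in Python for one chosen group of comparison databases. This is a simplified illustration only: the dictionary layout, the function name, and the sample entries are assumptions, and the sketch does not reproduce how the paper combines the different choices of comparison databases.

```python
def coverage_rate(databases, ips):
    """Fraction of IPs that every database in `databases` maps to one location.

    databases: list of dicts mapping IP address -> location (assumed layout).
    ips: the IP addresses to compare.
    """
    if not ips:
        return 0.0
    repeated = 0
    for ip in ips:
        locations = [db.get(ip) for db in databases]
        # An IP counts only if every database has an entry and all entries agree.
        if None not in locations and len(set(locations)) == 1:
            repeated += 1
    return repeated / len(ips)


# Hypothetical entries using documentation-only IP addresses.
db_a = {"192.0.2.1": "Beijing", "192.0.2.2": "Shanghai", "192.0.2.3": "Nanjing"}
db_b = {"192.0.2.1": "Beijing", "192.0.2.2": "Hangzhou", "192.0.2.3": "Nanjing"}
db_c = {"192.0.2.1": "Beijing", "192.0.2.2": "Shanghai"}
ips = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]

print(coverage_rate([db_a, db_b], ips))
print(coverage_rate([db_a, db_b, db_c], ips))
```

In this toy data, adding a third database lowers the agreement, because an IP must now match across all three sources.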

The GeoCop Method
In this section, we propose the GeoCop method, which applies machine learning to network measurement data in order to optimize the geolocation databases of Internet hosts. Section 4.1 describes the collection of the required network measurement data. Section 4.2 defines two new network measurement metrics. Section 4.3 analyzes the network measurement metrics. Section 4.4 describes the data processing for the network measurement metrics of the router IPs in the network measurement data. Section 4.5 describes the geolocation model for edge routers. Section 4.6 describes the geolocation model for Internet hosts.

Network Measurement Data Collection.
To generate the set of network measurement data, we perform traceroute measurements in China's Internet with a number of PlanetLab nodes in China. PlanetLab is a scalable and universal network measurement platform, which consists of 1315 nodes at 629 sites around the globe. The process of traceroute measurements is described as follows.
(1) Pick $N$ effective Internet hosts with known geographic locations in the geolocation database that needs to be optimized. As the targets of the traceroute measurements, the $N$ Internet hosts need to be chosen evenly throughout every city in China. In this paper, $N$ = 1,141,815, and the targets are uniformly distributed over 34 provinces and 595 cities in China.
(2) Select $M$ effective PlanetLab nodes to be landmarks according to the relationship between the number of PlanetLab nodes and the increment of routers, which is presented in Figure 2. The straight line is a least-squares linear fit to the observed values, with the absolute value of the ACC (Accuracy Correlation Coefficient) being approximately 0.79. ACC denotes the degree of fit between the fitted curve and the observed values. It is observed that the increment of IPs $\Delta$ roughly follows the linear relationship
$$\Delta = -3051.8295 \times M + 75136.5048.$$
We take the theoretical value of $M$ at which the increment is zero as the number of PlanetLab nodes in use, which is 75136.5048/3051.8295 ≈ 24.6, rounded to 25. The distribution of the used PlanetLab nodes is illustrated in Figure 3.
(3) Send traceroute requests from each landmark to all of the targets $K$ = 10 times; this results in a set of $M \times N \times K$ traceroute measurements.
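The choice of the number of landmarks in step (2) can be checked with a few lines of Python. The coefficients are the ones reported in the text; the function names are illustrative.

```python
# Coefficients of the least-squares line reported in the text.
SLOPE = -3051.8295
INTERCEPT = 75136.5048


def predicted_increment(num_landmarks: float) -> float:
    """Predicted number of newly discovered IPs for a given landmark count."""
    return SLOPE * num_landmarks + INTERCEPT


def landmarks_at_zero_increment() -> float:
    """Landmark count at which the fitted increment reaches zero."""
    return -INTERCEPT / SLOPE


m_zero = landmarks_at_zero_increment()
m_used = round(m_zero)
print(f"zero crossing at {m_zero:.2f} landmarks, rounded to {m_used}")
```

The fitted increment is still positive at 24 landmarks and negative at 25, so 25 is the first integer at which adding another landmark is no longer expected to reveal new routers.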

The Definitions of Network Measurement Metrics.
Listing all the interfaces along the routing path from the landmark to the target, traceroute is used to learn the routing path between two devices in the Internet. According to the statistics in both the dataset of traceroute measurements and the geographic locations of the targets, we introduce two new network measurement metrics called geographic location degree $D_{IP_i}$ and synchronization frequency $t^{loc_x}_{IP_i}$.

Definition 1. A measurement unit $U$ includes all the interfaces in the traceroute measurements from a landmark to a target except the landmark and the target. Each measurement unit corresponds to a destination location $loc_x$, which is the geographic location of the target:
$$U = \{IP^{1}, IP^{2}, \ldots, IP^{h}, \ldots\} \rightarrow loc_x,$$
where $IP^{h}$ denotes one of the measured interfaces on the $h$th hop of the routing path.

Definition 2. Geographic location degree $D_{IP_i}$ denotes the total number of different geographic locations that correspond to the measurement units including $IP_i$:
$$D_{IP_i} = \sum_{loc_x \in S_{loc}} \delta^{loc_x}_{IP_i},$$
where $S_{loc}$ denotes a given set of geographic locations. If there is a measurement unit that includes $IP_i$ and corresponds to $loc_x$, then the value of $\delta^{loc_x}_{IP_i}$ is one; otherwise, it is zero. Depending on the granularity of geographic location, $D_{IP_i}$ is classified into two categories: geographic location degree at the province level $D^{p}_{IP_i}$ and geographic location degree at the city level $D^{c}_{IP_i}$.

Definition 3. Synchronization frequency $t^{loc_x}_{IP_i}$ denotes the total number of measurement units that meet two conditions: (a) the unit includes the IP address $IP_i$ and (b) the geographic location of the target is $loc_x$:
$$t^{loc_x}_{IP_i} = \left| \{ U \in S_U : IP_i \in U \text{ and } U \rightarrow loc_x \} \right|,$$
where $S_U$ denotes the set of $M \times N$ measurement units constructed in Section 4.1.

Definition 4. $T^{S_{loc}}_{IP_i}$ denotes the synchronization frequency vector of $IP_i$ corresponding to a set of geographic locations $S_{loc}$:
$$T^{S_{loc}}_{IP_i} = T_{IP_i} \circ e_{S_{loc}},$$
where $\circ$ is the element-wise product, $T_{IP_i}$ denotes the synchronization frequency vector of $IP_i$, including all the synchronization frequencies related to $IP_i$, $e_{S_{loc}}$ is a vector of zeros except for the entries whose corresponding geographic locations belong to $S_{loc}$, which are ones, and $S_{loc}$ denotes a set of geographic locations meeting both the condition $t^{loc_x}_{IP_i} > 0$ and the location restriction in the SLA. Figure 4 presents an instance of the construction of $T^{S_{loc}}_{IP_i}$.
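The two metrics can be computed directly from a list of measurement units. The following Python sketch assumes each measurement unit is given as a pair of (router IPs on the path, location of the target); the data layout, the function name, and the sample addresses are illustrative assumptions.

```python
from collections import defaultdict


def location_metrics(measurement_units):
    """Compute geographic location degree and synchronization frequency.

    measurement_units: iterable of (router_ips, target_location) pairs.
    Returns (degree, sync_freq) where degree[ip] is the number of distinct
    target locations seen with that IP and sync_freq[ip][loc] is the number
    of measurement units that include the IP and end at that location.
    """
    sync_freq = defaultdict(lambda: defaultdict(int))
    for router_ips, location in measurement_units:
        for ip in set(router_ips):  # count each unit once per IP
            sync_freq[ip][location] += 1
    degree = {ip: len(per_loc) for ip, per_loc in sync_freq.items()}
    return degree, {ip: dict(per_loc) for ip, per_loc in sync_freq.items()}


# Hypothetical measurement units: 10.0.0.1 sits near the landmark,
# the other two routers sit close to the targets.
units = [
    (["10.0.0.1", "10.0.1.1"], "Beijing"),
    (["10.0.0.1", "10.0.1.1"], "Beijing"),
    (["10.0.0.1", "10.0.2.1"], "Shanghai"),
]
degree, sync_freq = location_metrics(units)
print(degree)
print(sync_freq)
```

In this toy input the router shared by all paths has a location degree of 2, while each router next to a target has a degree of 1, which is the pattern the following analysis relies on.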

The Analysis of Network Measurement Metrics.
The geographic locations of the targets in the analysis of network measurement metrics are initialized with the geolocation results of the targets in the existing geolocation database. Because the existing geolocation databases contain wrong geolocation results, the actual statistics of the network measurement metrics must differ to some extent from the theoretical values. In this section, we analyze the impact of the inaccuracy of geolocation databases on the analysis of network measurement metrics.

The Theoretical Comparison between Theoretical and Practical Values.
Taking the routing policy of China's Internet into consideration, we categorize a typical routing path into four sections, which are presented in Figure 5(a): the routers along the subpath from the landmark to the first backbone router (edge routers 1), the routers along the subpath from the first backbone router to the core backbone router (backbone routers 1), the routers along the subpath from the core backbone router to the last backbone router (backbone routers 2), and the routers along the subpath from the last backbone router to the target (edge routers 2). Theoretically, if $M \times N$ is sufficiently large, the statistics of $D_{IP_i}$ and $t^{loc_x}_{IP_i}$ associated with routers on different subpaths of the routing path lead to useful conclusions, as presented in Figure 5(b). In the first two sections, the value of $D_{IP_i}$ equals the number of all the different geographic locations where the targets are located in China, and the distribution of $t^{loc_x}_{IP_i}$ is approximately uniform. In the third section, the value of $D_{IP_i}$ equals the number of geographic locations that correspond to the measurement units including the backbone routers 2, and the distribution of $t^{loc_x}_{IP_i}$ is approximately uniform. In the last section, the value of $D_{IP_i}$ is one, and the values of $t^{loc_x}_{IP_i}$ are zero except for only one geographic location, which is one. However, because the existing geolocation databases contain wrong geolocation results, even though their accuracy rates are always above 50%, the actual statistics of $D_{IP_i}$ and $t^{loc_x}_{IP_i}$ must differ to some extent from the theoretical values, as presented in Figure 5(c).

Figure 5: The theoretical comparison of two network measurement metrics.

In the light gray area, the ratio of the largest $t^{loc_x}_{IP_i}$ to the total number of measurement units including $IP_i$ is in the range [0, 0.3]. As observed in Figure 8, for most of the router IPs under analysis, the distribution of the synchronization frequency $t^{loc_x}_{IP_i}$ conforms to a power law distribution, and the distribution for the other router IPs is approximately uniform.

The Statistical Analysis of Practical Values
Combining the theoretical and statistical analysis, we obtain the following conclusions: (a) if the distribution of the synchronization frequency $t^{loc_x}_{IP_i}$ conforms to a power law distribution, then the corresponding router IP belongs to the set of edge routers 2; (b) if the distribution of the synchronization frequency $t^{loc_x}_{IP_i}$ is approximately uniform, then the corresponding router IP belongs to the set of backbone routers 2; (c) if the values of $D_{IP_i}$ meet the conditions $D^{p}_{IP_i} = \mathrm{length}(S_{province})$ and $D^{c}_{IP_i} = \mathrm{length}(S_{city})$, then the router IP belongs to the set of edge routers 1 and backbone routers 1. In the following sections, after carrying out the data processing of the network measurement metrics, we classify the routers in order to find and geolocate the set of edge routers 2. Then we geolocate the Internet hosts according to the geolocation results of edge routers 2.

The Construction of Synchronization Frequency Matrix.
Filtering the set of backbone routers 1 and edge routers 1 out of the IP list obtained from the whole traceroute measurement dataset, we obtain a set of distinct IPs $S_{IP}$. According to the synchronization frequency $t^{loc_x}_{IP_i}$ of each router IP in $S_{IP}$, we get the synchronization frequency matrix of $S_{IP}$, denoted as $T_{m \times n}$:
$$T_{m \times n} = \left[ t^{loc_x}_{IP_i} \right], \quad i = 1, \ldots, m, \; x = 1, \ldots, n.$$
An instance of the construction of the synchronization frequency matrix is presented in Figure 9.
Figure 9 presents the routing paths between three landmarks and three targets, which result in 9 measurement units. We get the synchronization frequency $t^{loc_x}_{IP_i}$ of each IP in the measurement units, which is presented in Table 3. The related synchronization frequency matrix is then built from the entries of Table 3.
(1) $P(IP_i)$ denotes the probability that a measurement unit in the whole measurement unit set $S_U$ includes router $IP_i$, and $P(loc_x \mid IP_i)$ denotes the conditional probability of a geographic location for a given IP. By the definition of conditional probability, it follows that
$$P(loc_x \mid IP_i) = \frac{P(loc_x, IP_i)}{P(IP_i)} = \frac{t^{loc_x}_{IP_i}}{\sum_{y=1}^{n} t^{loc_y}_{IP_i}}.$$
(2) With the value of $P(loc_x \mid IP_i)$, we restate the synchronization frequency matrix $T_{m \times n}$ as the conditional probability matrix $P_{m \times n} = \left[ P(loc_x \mid IP_i) \right]$.
(3) Dividing the values in every row by the value of the first column, which is denoted as $V^{1}_{i}$, we have the matrix $P(V^{1}_{i})_{m \times n}$.
(4) Removing the first column, we get a matrix of size $m \times (n-1)$.
(5) Define the vector models for routers at the different granularities of geographic location. The vector model for the edge router IP, whose synchronization frequency theoretically conforms to a power law distribution, is defined as [0, 0, . . ., 0]. The vector model for the backbone router IP, whose synchronization frequency theoretically conforms to a uniform distribution, is defined as [1, 1, . . ., 1].
(6) Calculate the Weighted Euclidean Distance between every row and the two vector models. For a row $r$ and a vector model $v$ with weights $w_j$, the Weighted Euclidean Distance is
$$d(r, v) = \sqrt{\sum_{j=1}^{n-1} w_j \, (r_j - v_j)^2}.$$
We define $d_e$ as the distance between a row and the vector model for the edge router IP and $d_b$ as the distance between a row and the vector model for the backbone router IP. Every row corresponds to a router IP in the traceroute measurement dataset. The classification principles for the router IPs are as follows: (a) if $d_e < d_b$, that is, the row is closer to the edge router model, then the router IP belongs to the set of edge routers; (b) if $d_e > d_b$, then the router IP belongs to the set of backbone routers. The router IPs in the traceroute measurement dataset are thus classified into a set of edge routers $S_{last}$ and a set of backbone routers $S_{backbone}$.
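The classification steps can be sketched for a single router IP in Python. Two points in this sketch are assumptions rather than statements from the text: each row is sorted in descending order before dividing by its first entry (so that a power-law row approaches all zeros and a uniform row approaches all ones), and the weights default to 1 because the weighting scheme is not recoverable here. The function name is illustrative.

```python
import math


def classify_router(sync_freqs, weights=None):
    """Classify one router IP as 'edge' or 'backbone'.

    sync_freqs: synchronization frequencies of the router, one per location;
    at least one of them must be nonzero.
    """
    total = sum(sync_freqs)
    # Conditional probability of each location given this router IP.
    # Assumption: sort in descending order so the first column is the largest.
    probs = sorted((f / total for f in sync_freqs), reverse=True)
    # Divide every entry by the first column, then drop the first column.
    ratios = [p / probs[0] for p in probs[1:]]
    if weights is None:
        weights = [1.0] * len(ratios)  # assumption: equal weights
    # Weighted Euclidean distance to the edge model [0, ..., 0]
    # and to the backbone model [1, ..., 1].
    d_edge = math.sqrt(sum(w * (r - 0.0) ** 2 for w, r in zip(weights, ratios)))
    d_backbone = math.sqrt(sum(w * (r - 1.0) ** 2 for w, r in zip(weights, ratios)))
    # Ties (for example a router seen with a single location) go to 'edge'.
    return "edge" if d_edge <= d_backbone else "backbone"


print(classify_router([90, 5, 3, 2]))     # one dominant location
print(classify_router([25, 26, 24, 25]))  # nearly uniform over locations
```

A router whose measurement units almost always end at one location lands close to the all-zeros model, while a router that appears equally often for every location lands close to the all-ones model.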

The Geolocation Model for Edge Router.
The geographic location of edge router $IP_i$ is the one that maximizes the conditional probability $P(loc_x \mid IP_i)$. The geolocation model for edge routers is established as follows:
$$loc(IP_i) = \arg\max_{loc_x} P(loc_x \mid IP_i).$$

The Geolocation Model for Internet Host.
Find the set of router IPs in all the measurement units including the Internet host $IP_i$, and denote this set as $S_{route}$. The geolocation of Internet hosts is classified into two cases depending on whether $S_{last}$ (the set of edge routers) and $S_{route}$ have router IPs in common. The first is discussed in Case 1, and the second in Case 2.

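Case 1: $S_{last} \cap S_{route} \neq \emptyset$.
Remove from $S_{route}$ all the router IPs that are not included in $S_{last}$. Construct the location probability matrix $P_{k \times n}$ from the geographic location of every router IP in $S_{route}$, where row $j$ corresponds to router $IP_j$, column $x$ corresponds to location $loc_x$, and the entries are denoted $p_{j,x}$. Summing the values over all rows in a given column of the matrix, we have
$$N(loc_x) = \sum_{j=1}^{k} p_{j,x}.$$
The geolocation model for Internet host $IP_i$ is established as follows:
$$loc(IP_i) = \arg\max_{loc_x} N(loc_x).$$
The procedure of geolocation for Internet hosts in Case 1 is presented in Figure 10.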
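Case 1 can be sketched in Python as a vote among the geolocated edge routers on the paths to the host. The sketch assumes that each edge router contributes a weight of one to its own location, which is a simplification of the location probability matrix; the function name and the sample data are illustrative.

```python
from collections import Counter


def geolocate_host_case1(route_ips, edge_router_locations):
    """Geolocate a host from the edge routers on its measured paths.

    route_ips: router IPs found in the measurement units of the host.
    edge_router_locations: dict mapping edge router IP -> its location.
    Returns the location with the largest column sum, or None when the
    host shares no router with the edge router set (Case 2).
    """
    # Keep only routers that were classified and geolocated as edge routers.
    votes = Counter(
        edge_router_locations[ip] for ip in route_ips if ip in edge_router_locations
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


edge_locations = {"10.1.0.1": "Nanjing", "10.1.0.2": "Nanjing", "10.2.0.1": "Suzhou"}
print(geolocate_host_case1(["10.1.0.1", "10.1.0.2", "10.2.0.1", "10.9.9.9"], edge_locations))
print(geolocate_host_case1(["10.9.9.9"], edge_locations))
```

The second call returns None, which corresponds to the situation handled by Case 2.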

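Case 2: $S_{last} \cap S_{route} = \emptyset$.
Cluster the IPs geolocated in Case 1 according to their geographic locations. IPs in the same geographic location are put into the same cluster, and $C_{loc}$ denotes the set of IPs in a given cluster.
Based on the instantaneous delay measurements, every IP gets a delay vector $\mathbf{d}_i = (d_1, d_2, \ldots, d_M)$. Then calculate the average delay vector for each cluster $C_{loc}$:
$$\bar{\mathbf{d}}_{C_{loc}} = \frac{1}{|C_{loc}|} \sum_{IP_j \in C_{loc}} \mathbf{d}_j,$$
where $M$ and $|C_{loc}|$ are the number of landmarks and the number of IPs in the cluster $C_{loc}$, respectively. We use cosine similarity to measure the similarity between the delay vector of $IP_i$ and the average vector of $C_{loc}$:
$$\cos(\mathbf{d}_i, \bar{\mathbf{d}}_{C_{loc}}) = \frac{\mathbf{d}_i \cdot \bar{\mathbf{d}}_{C_{loc}}}{\|\mathbf{d}_i\| \, \|\bar{\mathbf{d}}_{C_{loc}}\|}.$$
The geographic location of Internet host $IP_i$ is that of the cluster whose average delay vector is most similar to $\mathbf{d}_i$, that is, the cluster with the largest cosine similarity (equivalently, the smallest angle between the two vectors). The geolocation model for Internet host $IP_i$ is established as follows:
$$loc(IP_i) = \arg\max_{loc} \cos(\mathbf{d}_i, \bar{\mathbf{d}}_{C_{loc}}).$$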
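Case 2 can be sketched in Python as a nearest-cluster search under cosine similarity. The sketch picks the cluster with the largest cosine similarity, which is the usual reading of "most similar"; the function names and the delay values are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two delay vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_vector(vectors):
    """Component-wise average of a non-empty list of delay vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def geolocate_host_case2(host_delays, clusters):
    """Return the location of the cluster most similar to the host.

    host_delays: delays from each landmark to the host.
    clusters: dict mapping location -> delay vectors of the hosts that were
    geolocated to that location in Case 1.
    """
    centroids = {loc: mean_vector(vectors) for loc, vectors in clusters.items()}
    return max(centroids, key=lambda loc: cosine_similarity(host_delays, centroids[loc]))


# Hypothetical delays (ms) from three landmarks.
clusters = {
    "Beijing": [[10, 40, 60], [12, 38, 62]],
    "Guangzhou": [[60, 35, 10], [58, 37, 12]],
}
print(geolocate_host_case2([11, 41, 59], clusters))
```

Cosine similarity compares the shape of the delay profile across landmarks rather than its magnitude, so a uniform increase in delay on all paths changes the result less than it would under a plain Euclidean distance.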


Experiments and Results
In this section, we evaluate the performance of the GeoCop method from three aspects: accuracy, robustness, and efficiency. Section 5.1 evaluates the accuracy of the GeoCop method. Section 5.2 evaluates its robustness. Section 5.3 evaluates its efficiency.

Accuracy Evaluation.
In this section, we not only use the same empirical evaluation methods as presented in Section 3 to evaluate the accuracy of these geolocation databases after optimization but also compare the improvement of the geolocation databases statistically in order to ensure the significance of the differences of performance between the geolocation databases with and without the optimization.

Empirical Evaluation.
As observed in Table 4, the accuracy rates of the geolocation databases after optimization are higher than those in Table 1, so the databases after optimization are more accurate. In the original databases, the accuracy rates at the province granularity are much higher than those at the city granularity. In the databases after optimization, the accuracy rates at the province granularity are still higher than those at the city granularity, but the differences are very small. As observed in Table 5, the coverage rates are higher than those in Table 2. These results show that the GeoCop method improves the accuracy of the geolocation databases as a whole.

Statistical Tests.
To perform the statistical tests, we divide the IP addresses under evaluation into 10 equal portions and calculate the accuracy rate of each portion. All data are then analyzed using SPSS (Statistical Package for the Social Sciences) 22.0 statistical software (IBM Corporation, Somers, NY). The significance of the differences in performance between the geolocation databases with and without optimization is tested by the paired-samples t-test and the Wilcoxon test. For all analyses, $P$ < 0.05 is considered significant. Taking the geolocation database IP138 as an example, the results of the two statistical tests are presented in Tables 6 and 7, respectively. As shown in Table 6, the value of $P$ = sig. (2-tailed) is 0.000, which is smaller than 0.05, so the paired-samples t-test rejects the null hypothesis at the 5% significance level. The difference in performance between the geolocation database IP138 with and without the optimization is significant. The average accuracy rate before optimization, which is 0.801, is smaller than that after optimization, which is 0.982. This shows that the accuracy rate after optimization is higher than that before optimization.
As shown in Table 7, the value of $P$ = sig. (2-tailed) is 0.02, which is smaller than 0.05, so the Wilcoxon test rejects the null hypothesis at the 5% significance level. The difference in performance between the geolocation database IP138 with and without the optimization is significant. The mean rank of the positive ranks, which is 5.50, is larger than that of the negative ranks, which is 0.00. This again shows that the accuracy rate after optimization is higher than that before optimization.
Owing to the limitation of space, for the following statistical tests we present only the important results in Table 8. $P_t$ and $P_w$ denote the results of the paired-samples t-test and the Wilcoxon test, respectively.
As shown in Table 8, all the values of $P$ are smaller than 0.05, so the statistical tests reject the null hypothesis at the 5% significance level. For each geolocation database, the difference in performance between the geolocation databases with and without the optimization is significant. According to the means in the results of the paired-samples t-tests and the mean ranks in the results of the Wilcoxon tests, the accuracy rates of the geolocation databases after optimization are higher than those before optimization. The statistical comparison thus supports the conclusion that the GeoCop method improves the accuracy of the geolocation databases.
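The rank summary of the Wilcoxon signed-rank test can be reproduced with a short Python sketch. The accuracy values below are hypothetical and only illustrate the computation; the sketch does not compute the $P$ value and is not a substitute for the SPSS output.

```python
def signed_rank_summary(before, after):
    """Mean rank of the positive and of the negative paired differences.

    Zero differences are dropped and tied absolute differences share
    their average rank, as in the Wilcoxon signed-rank test.
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        average_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    positive = [r for r, d in zip(ranks, diffs) if d > 0]
    negative = [r for r, d in zip(ranks, diffs) if d < 0]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(positive), mean(negative)


# Hypothetical accuracy rates of 10 portions before and after optimization.
before = [0.78, 0.79, 0.80, 0.81, 0.80, 0.82, 0.79, 0.81, 0.80, 0.81]
after = [0.98, 0.97, 0.99, 0.98, 0.98, 0.99, 0.97, 0.98, 0.99, 0.98]
print(signed_rank_summary(before, after))
```

When all 10 portions improve, the ranks 1 to 10 are all positive, so the mean positive rank is 5.5 and the mean negative rank is 0, which is consistent with the values quoted for Table 7.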

Robustness Evaluation.
In this section, we compare the GeoCop method with two existing geolocation methods (CBG and TBG) to evaluate the performance of robustness from three aspects: the dramatic increment in the delay, the accuracy rate of the geolocation database before optimization, and the landmark distribution.

Aspect 1: The Dramatic Increment in the Delay.
We evaluate the robustness of the GeoCop method in two scenarios. The first concerns the delay noise introduced by queuing delays and circuitous routes. To simulate this scenario, we add an increment ranging from 0 ms to 1 ms to the delays from the landmarks to the ground truth node. Figures 11(a) and 11(b) present the accuracy rates of the geolocation databases as a function of the delay increment. As observed in Figure 11, the accuracy rates of both CBG and TBG before optimization decrease as the delay increases, but the accuracy rates after optimization remain constant. The second scenario concerns delay-based misleading behavior by an adversary who tampers with the delay to fake its geographic location. This is realized by making the delay appear larger than the actual one. Let $d_{i,j}$ be the minimum delay between the landmark $L_i$ and the target $T_j$, and let $d_{i,k}$ be the minimum delay between the landmark $L_i$ and the forged target $T_k$. The delay we add to each traceroute measurement to $T_j$ is $\Delta = d_{i,k} - d_{i,j}$ ($d_{i,k} > d_{i,j}$), so that $d_{i,j}$ appears as $d_{i,k}$. Figures 11(c) and 11(d) present the success rates of the misleading behavior as a function of how far the adversary attempts to move the target. As presented in Figures 11(c) and 11(d), if the attempted move is small, it is difficult for the measurement-based geolocation methods before optimization to detect the adversaries. After optimization, however, the adversaries hardly ever succeed, whatever the distance is. The plots in Figure 11 support the hypothesis that the GeoCop method is robust against dramatic increments in the delay.
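The delay-forging scenario can be sketched in Python. The sketch assumes that an adversary can only add delay, so landmarks whose delay to the forged location is not larger than the true delay receive no added delay; the function name and the delay values are illustrative.

```python
def forged_delay_increments(true_min_delays, forged_min_delays):
    """Delay to add per landmark so the target resembles the forged target.

    true_min_delays: minimum delay from each landmark to the real target.
    forged_min_delays: minimum delay from each landmark to the forged target.
    An adversary can delay packets but cannot speed them up, so the
    increment is never negative.
    """
    return [
        max(0.0, forged - true)
        for true, forged in zip(true_min_delays, forged_min_delays)
    ]


# Hypothetical minimum delays (ms) from three landmarks.
true_delays = [10.0, 20.0, 30.0]
forged_delays = [15.0, 18.0, 42.0]
print(forged_delay_increments(true_delays, forged_delays))
```

For the second landmark the forged location is closer than the real one, so the adversary cannot match that delay, which is one reason why a forged position is not always fully consistent across landmarks.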

Aspect 2: The Accuracy Rate of Geolocation Database before Optimization. Figures 12(a) and 12(b) present the accuracy rate of ground truth nodes after optimization as a function of the accuracy rate of the geolocation database before optimization, at the granularity of province and city, respectively. As presented in Figure 12(a), at both the province and the city granularity, the accuracy rate of ground truth nodes after optimization remains constant while the province-level accuracy rate of the geolocation database before optimization decreases but stays above 50%; once it falls below 50%, the accuracy rate of ground truth nodes after optimization decreases sharply. As presented in Figure 12(b), the accuracy rate of ground truth nodes after optimization remains constant while the city-level accuracy rate of the geolocation database before optimization decreases but stays above 40%. Once it falls below 40%, the province-level accuracy rate of ground truth nodes after optimization still remains constant, while the city-level accuracy rate decreases sharply.

5.2.3. Aspect 3: Landmark Distribution. Figure 13 plots the accuracy rates of ground truth nodes for 10 different kinds of landmark distributions. As presented in Figure 13, the accuracy rates of both CBG and TBG before optimization vary when the landmark distribution changes, but the accuracy rates after optimization remain at a high level. This suggests that, after optimization, the measurement-based methods are not affected by the distribution of landmarks.

... 9, we can observe that all the p-values are smaller than 0.05, so the statistical tests reject the null hypothesis at the 5% significance level. For each geolocation method, the difference in performance between the geolocation databases with and without optimization is significant. According to the means in the results of the paired-samples t-tests and the mean ranks in the results of the Wilcoxon tests, we can observe the following: (1) in scenario 1 of Aspect 1 and in Aspect 3, the accuracy rates of the geolocation methods after optimization are higher than those before optimization; (2) in scenario 2 of Aspect 1, the success rates of the adversary after optimization are lower than those before optimization. This statistical comparison supports the conclusion that the GeoCop method improves the robustness of the geolocation methods.
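For readers unfamiliar with the two tests, the following Python sketch shows how the paired-samples t statistic and the Wilcoxon signed-rank sums are formed from before/after measurements. It uses only the standard library, the function names are hypothetical, and it stops short of computing p-values, which require the t distribution and the signed-rank distribution (available in statistical packages).

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired samples.

    Needs at least two pairs and a nonzero spread of differences.
    """
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation of differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1


def wilcoxon_rank_sums(before, after):
    """Return (sum of positive ranks, sum of negative ranks).

    Zero differences are dropped; tied |differences| share their average rank.
    """
    diffs = [a - b for a, b in zip(after, before) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    positive = sum(r for r, d in zip(ranks, ordered) if d > 0)
    negative = sum(r for r, d in zip(ranks, ordered) if d < 0)
    return positive, negative


# Hypothetical accuracy rates (%), not data from the paper.
before_opt = [60, 55, 70, 65, 50]
after_opt = [90, 88, 91, 89, 92]
t_value, dof = paired_t_statistic(before_opt, after_opt)
rank_sums = wilcoxon_rank_sums(before_opt, after_opt)
```

A positive t value and a rank sum concentrated on the positive side both point in the direction "after > before", which is how the means and mean ranks are read in the text.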

Measurement and Computation Overheads. Figure 14 plots the cumulative distribution function of the numbers of edge routers and Internet hosts whose geographic locations vary with time. As presented in Figure 14, the geographic locations of edge routers change less than those of Internet hosts. Consequently, if the geolocation methods adopt the GeoCop optimization, the geolocation results of Internet hosts in the geolocation databases need to be recomputed less frequently. Table 10 presents the measurement and computation overheads of different geolocation methods before and after optimization, for a single update and for multiple updates. The measurement overheads of both CBG and TBG after optimization are the same as those before optimization. The computation overheads after optimization are lower than those before optimization, and the gap widens as the number of updates increases.
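As a reminder of what a cumulative distribution plot such as Figure 14 encodes, the sketch below computes an empirical CDF from a list of observations. The function name and the sample values are hypothetical; the paper does not specify how its curves were computed.

```python
def empirical_cdf(samples):
    """Return sorted (value, fraction of samples <= value) pairs."""
    ordered = sorted(samples)
    n = len(ordered)
    points = {}
    for i, x in enumerate(ordered, start=1):
        # Later duplicates overwrite earlier ones, so each value ends up
        # with the fraction of all samples that are <= that value.
        points[x] = i / n
    return sorted(points.items())


# Hypothetical counts of location changes observed per measurement round.
changes_per_round = [3, 1, 2, 2]
cdf_points = empirical_cdf(changes_per_round)
```

Plotting the resulting pairs as a step function gives the kind of curve used to compare how often edge routers and Internet hosts change location.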

Conclusion
In this paper, a novel method based on machine learning (GeoCop) is proposed for optimizing the geolocation databases of Internet hosts in the cloud environment. The geolocation model for Internet hosts is derived from network measurement, supplemented by machine learning and routing policy, making the geolocation less prone to imperfect measurements and irregular routing. This work also includes a theoretical analysis of the drawbacks of the existing geolocation methods as well as a statistical analysis of the accuracy of the existing geolocation databases. In comparison with three frequently used geolocation databases and two well-known measurement-based methods, the performance of the GeoCop method is validated on the PlanetLab network test bed from three aspects: accuracy, robustness, and efficiency. We not only use typical benchmarks to compare the performance of the GeoCop method but also perform statistical tests to demonstrate its improvement. As presented in the comparison tables, the GeoCop method achieves improved performance in both accuracy and robustness with lower measurement and computation overheads. Future work will focus on a more robust method for the geolocation of Internet hosts and on more diverse behaviors for misleading the geographic location of an Internet host.

Figure 2: The relationship between the number of PlanetLab nodes and the increment of routers.

Figure 3: The distribution of landmarks (PlanetLab nodes) in use.

Figure 4: An instance for the construction of  loc IP  .

Figure 6: The distributions of geographic location degree.

Figure 7: The logarithmic distributions of geographic location degree.

Figure 8: The distributions of the synchronization frequency.

Figure 9: The routing paths between three landmarks and three targets.

Figure 10: The procedure of geolocation for Internet hosts in Case 1.

Figure 11: Robustness evaluation on the aspect of delay.

Figure 12: Robustness evaluation on the aspect of accuracy rate before optimization.

Figure 13: Robustness evaluation on the aspect of landmark distribution.

Table 1: Accuracy rates for the ground truth nodes.

We define $S_{\mathrm{province}} = \{\mathrm{province}_1, \ldots, \mathrm{province}_m\}$ and $S_{\mathrm{city}} = \{\mathrm{city}_1, \ldots, \mathrm{city}_n\}$ as the sets of all the geographic locations at the granularities of province and city, respectively; they are collectively referred to as $S_{\mathrm{loc}}$.

Table 2: Coverage rates for different numbers of comparison databases.

4.5. The Geolocation Model for the Edge Routers

4.5.1. Finding the Edge Routers. In this section, we classify these router IPs in  IP into two categories: backbone router 2 IPs (hereinafter referred to as backbone routers or backbone router IPs) and edge router 2 IPs (hereinafter referred to as edge routers or edge router IPs) in order to get the set of edge router IPs on the last subpath of the routing path. Let us define  as the minimum threshold of max (loc  | IP  ), which is divided into the minimum threshold at the level of province   and the minimum threshold at the level of city   depending on the granularity of geographic location. Owing to the discrepancies in both the characteristics of routing policy in different countries and the accuracies of different geolocation databases,   and   may not be the same with respect to different countries or different geolocation databases. Consider   is the number of IPs which meet the condition max (loc  | IP  ) <   ;    is the number of IPs which meet the conditions max (loc  | IP  ) <   and    = . The classification of the router IPs in  IP consists of the following steps: (1) Removing the IPs from the conditional probability matrix  × when the conditions max (loc  | IP  ) <   and max (loc  | IP  ) <   are met, we have  IPs left. (2) Sorting the values in every row from the largest to the smallest and extracting the first  columns of every row, we generate a new matrix  × . Consider where  is the total number of IPs in  IP ;    and   1 are the numbers of IPs when the value of    is 1 and , respectively;    and   1 are the numbers of IPs when the value of    is 1 and , respectively;
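Steps (1) and (2) above can be sketched in Python as a filter followed by a per-row sort and truncation. This is only an illustration under stated assumptions: the function name, the dictionary layout, the single threshold, the parameter `k`, and the sample probabilities are all hypothetical, and the text's separate province-level and city-level thresholds are collapsed into one for brevity.

```python
def filter_and_truncate(prob_matrix, threshold, k):
    """Drop low-confidence IPs, then keep each row's k largest probabilities.

    prob_matrix maps a router IP to its row of conditional probabilities,
    one entry per candidate location (hypothetical layout).
    """
    kept = {}
    for ip, row in prob_matrix.items():
        # Step (1): remove IPs whose largest probability is below threshold.
        if max(row) < threshold:
            continue
        # Step (2): sort the row from largest to smallest, keep first k.
        kept[ip] = sorted(row, reverse=True)[:k]
    return kept


# Hypothetical rows; addresses come from a documentation-only range.
matrix = {
    "192.0.2.1": [0.1, 0.7, 0.2],
    "192.0.2.2": [0.3, 0.3, 0.4],
}
reduced = filter_and_truncate(matrix, threshold=0.5, k=2)
```

Here the second address is removed because no location reaches the threshold, and the first keeps only its two most probable locations.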

Table 4: Accuracy rates for the ground truth nodes after optimization.

Table 5: Coverage rates for different numbers of comparison databases after optimization.

Table 6: The results of paired-samples t-test for the geolocation database IP138.

Table 7: The results of Wilcoxon test for the geolocation database IP138.
a After < before. b After > before. c a Wilcoxon signed ranks test. b Based on negative ranks.

Table 10: The measurement and computation overheads of different geolocation methods.