Predicting Audience Location on the Basis of the 𝑘-Nearest Neighbor Multilabel Classification

Understanding audience location information in online social networks is important for designing recommendation systems, improving information dissemination, and similar tasks. In this paper, we focus on predicting the location distribution of audiences on YouTube, and we transform this problem into a multilabel classification problem. We find three problems when the classical 𝑘-nearest neighbor based algorithm for multilabel classification (ML-𝑘NN) is used to predict location distribution. First, feature weights are not considered when measuring the similarity degree. Second, finding similar items by traversing the entire training set consumes considerable computing time. Third, the goal of ML-𝑘NN is to find the relevant labels for every sample, which differs from the goal of audience location prediction. To solve these problems, we propose methods for measuring similarity based on weight, quickly finding similar items, and ranking a specific number of labels. On the basis of these methods and ML-𝑘NN, the 𝑘-nearest neighbor based model for audience location prediction (AL-𝑘NN) is proposed. Experiments based on massive YouTube data show that the proposed model predicts the location of YouTube video audiences more accurately than the ML-𝑘NN, MLNB, and Rank-SVM methods.


Introduction
According to sociology, people often show different characteristics because of their different cultural backgrounds, customs, and traditional sociocultural environments. These factors have a direct effect on how people choose information. Studies have shown that people with similar cultural backgrounds are likely to pay attention to information with similar contents [1]. Therefore, grasping the regional background of the user in an online social network can improve the effectiveness of information dissemination. For example, Guha et al. [2] found that user location significantly affects the advertisement content placed by Google and Facebook.
Current research on user location prediction in social networks mainly focuses on the user's friends and the characteristics of information dissemination. Most studies use Facebook and Twitter as examples. On the basis of large-scale Facebook data, Backstrom et al. [3] analyze the relationship between friendship and physical distance and find a negative correlation between them: a larger distance between two users corresponds to a lower probability of them becoming friends. On the basis of this finding, they propose an algorithm to predict the physical location of a user. Their experiment shows that the accuracy of this algorithm is significantly higher than that of the IP address method.
Several studies on user location prediction employ Twitter as an example. McGee et al. [4] consider not only the user's friends but also the interaction level between users to predict user location in Twitter. They first analyze the user relationship and physical distance in Twitter and find that users who have many followers tend to be far from each other, whereas users who mention each other are separated by a short distance. They then present a model based on the decision tree for predicting user location. Rout et al. [5] hypothesize about a number of features of the user network in Twitter and transform the locating problem into a classification problem by using the support vector machine (SVM) classifier. Li et al. [6] construct a probability model for predicting the home location of a user by using the microblogs written by the user and the user's friends.
Instead of predicting the location of an individual, we focus on predicting the location distribution of an audience. We take YouTube as the example for predicting audience location because YouTube is the largest online video sharing social network in the world. YouTube supports 61 languages, is visited by more than 1 billion visitors per month, and receives 80% of its traffic from countries and regions outside the United States [7]. YouTube is the most representative and influential online social network for video sharing. Thus, the results obtained from YouTube have practical significance.
Given that obtaining the actual viewers of videos is difficult, the number of video comments is used as a proxy for the number of audiences when predicting the audience geographical position: studies have shown that YouTube video views and comments are highly correlated [8], and comments have been widely used to represent views [9,10]. The countries or regions of audiences are used to represent the audience geographical location. Therefore, the question of this study is how to predict the $n$ countries or regions with the largest number of video audiences. A traditional prediction or classification model produces only one predicted value (e.g., linear regression and decision-tree classification models assign only one label to a sample). Our goal, by contrast, is to assign a specific number of countries and regions to a video, ranked according to the number of audiences.
To this end, we transform the location distribution prediction of YouTube video audiences into a multilabel classification problem. When the classical 𝑘-nearest neighbor based algorithm for multilabel classification (ML-𝑘NN) [11] is used to predict audience location, three problems arise: (1) the differences between features are not considered in ML-𝑘NN when computing the similarity degree; (2) all objects in the training set must be traversed when seeking the 𝑘-nearest neighbors of a sample, so a large training set leads to a tremendous computing workload, whereas a small one causes misclassification; (3) the goal of the multilabel classification method does not match the problem of audience location prediction: the ML-𝑘NN method finds the relevant labels for every sample, whereas our goal is to rank a specific number of relevant labels. To solve these problems, this paper provides a method of computing similarity based on weight and presents a method for quickly finding similar videos. On the basis of these two methods and ML-𝑘NN, the 𝑘-nearest neighbor based model for audience location prediction (AL-𝑘NN) is proposed. Finally, experiments based on massive YouTube data show the performance of the proposed method.

Data Description
In this section, we describe the data collected from YouTube for the analyses and experiments in the paper. To learn the characteristics of the audience, we need information about the videos, their uploaders, and their viewers. Given that obtaining the actual viewers of videos is difficult, the commenters are used to represent viewers. The information is downloaded through the YouTube APIs, and the details are shown in Table 1. Specifically, we first download the most popular or the latest videos uploaded from different countries and regions. In this way, we obtain about 1 million video IDs. The information of the uploaders and commenters of these videos is then collected. The country or region of a user is determined from the standard two-letter ISO country and area code in the user's profile. Because videos with few commenters may not reflect the popularity of videos in every country, videos with fewer than 20 comments are excluded. Because the experiments in this paper are time-consuming, we further select the videos whose uploaders belong to the 10 countries with the largest populations of users (Table 2) and keep only the commenters who belong to these countries. As a result, a total of 144,695 videos are selected for the experiments.

Modeling Preliminaries
In this section, we first define the problem of audience location prediction and explain how this problem is transformed into a multilabel classification problem. The ML-𝑘NN multilabel classification method is then introduced.

Audience Location Prediction Problem.
In this study, the country or region of YouTube video audiences is used to represent the audience geographical position. Therefore, predicting audience position means predicting the rank of a number of countries or regions with the largest number of video audiences.
Traditional prediction or classification models cannot be applied to our question because they can only assign one label to a given sample, whereas this study needs to assign a number of ranked labels to a video. To solve this problem, we introduce multilabel classification. Below, we present how to transform the problem of audience location prediction into a multilabel classification problem.
Unlike single-label classification methods, such as SVM and the decision tree, multilabel classification allows a sample to be assigned to more than one class. Our goal is to predict a given number of countries and regions with the most viewers for each video, that is, to assign a number of top countries and regions to each video according to the audience number. Therefore, multilabel classification can be used to predict audience location: the video is treated as the sample to be classified, and each country or region is treated as a label category. The goal is to rank the countries and regions of the audiences according to the number of audiences and to choose the first $n$ countries and regions for each video. The formal description of the problem is presented as follows.
Let $X \subseteq \mathbb{R}^d$ be the sample space defined over the $d$-dimensional feature space; that is, $X$ is the set of samples (videos), and every sample has $d$ characteristic values. Let $Y = \{1, 2, \ldots, Q\}$ be the finite set of labels (countries of origin of audiences). Let $T(x) \subseteq Y$ be the first $n$ countries with the largest number of video audiences for $x \in X$. Given the training set $D = \{(x_1, Y_1), (x_2, Y_2), \ldots, (x_m, Y_m)\}$, where $Y_i = T(x_i)$ ($x_i \in X$, $Y_i \subseteq Y$), the goal of multilabel classification is to construct a classifier that can effectively predict the label set of each unknown sample; that is, the classifier can effectively select the first $n$ countries with the largest number of video audiences from the candidate countries.
The method for solving our problem is the ranking classification method. Multilabel classification based on ranking constructs, during training, a real-valued function of two arguments $f : X \times Y \to \mathbb{R}$ from the training set. For any sample $x_i$, all labels $y$ are ranked by the value of $f(x_i, y)$.
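As a small illustration of how the label set of a training video could be built from its commenters, the sketch below counts commenter countries and keeps the first $n$. The function name `top_n_countries`, the example data, and the alphabetical tie-breaking rule are assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter


def top_n_countries(commenter_countries, n):
    """Return the n countries with the most commenters, most frequent first.

    Ties are broken alphabetically so the output is deterministic
    (this tie-breaking rule is an assumption of the sketch).
    """
    counts = Counter(commenter_countries)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [country for country, _ in ranked[:n]]


# Hypothetical ISO country codes of the commenters of one video.
example_commenters = ["US", "GB", "US", "DE", "GB", "US", "FR"]
example_labels = top_n_countries(example_commenters, 2)
```

Here `example_labels` plays the role of the label set of one training sample; the ranked order is what the classifier is later asked to reproduce.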

ML-𝑘NN Method.
This paper improves the ML-𝑘NN method to predict audience location; therefore, we first introduce ML-𝑘NN. ML-𝑘NN extends the 𝑘NN algorithm to multilabel classification: it determines the label set of a sample from the labels of its 𝑘-nearest neighbors by maximizing the posterior probability. The ML-𝑘NN method is described as follows.
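For a sample $x \in X$ with label set $Y_x \subseteq Y$, let $y_x(l) = 1$ if $l$ is a label of $x$ and $y_x(l) = 0$ otherwise. Let $N(x)$ be the set of the $k$ closest neighbors of $x$, and let $C_x(l)$ be the number of these neighbors that belong to the $l$th class:

$$C_x(l) = \sum_{a \in N(x)} y_a(l), \quad l \in Y.$$

For a sample $t$ to be classified, the ML-𝑘NN method first finds its $k$ closest neighbors $N(t)$. Let $H_1^l$ denote the event that the sample has label $l$, $H_0^l$ the event that it does not, and $E_j^l$ the event that exactly $j$ of its $k$ nearest neighbors have label $l$. The value $y_t(l)$ is then computed according to $N(t)$ to predict the category of sample $t$:

$$y_t(l) = \arg\max_{b \in \{0, 1\}} P\left(H_b^l \mid E_{C_t(l)}^l\right), \quad l \in Y.$$

The equation above can be transformed by using the Bayesian rule:

$$y_t(l) = \arg\max_{b \in \{0, 1\}} P\left(H_b^l\right) P\left(E_{C_t(l)}^l \mid H_b^l\right).$$

The prior and posterior probabilities are calculated from the statistical frequencies of the neighbor categories in the training set.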
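The procedure above can be sketched in plain Python as follows. This is a minimal illustration of the standard ML-𝑘NN estimates, not the authors' implementation: it assumes Euclidean distance, a smoothing parameter `s` of 1, labels numbered from 0, and a tiny hypothetical data set; all function names are chosen for this sketch.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_indices(X, query, k, skip=None):
    """Indices of the k training samples closest to query (index `skip` is ignored)."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: (euclidean(X[i], query), i))
    return order[:k]


def mlknn_fit(X, Y, num_labels, k, s=1.0):
    """Estimate the priors and the neighbor-count statistics from the training set.

    Y[i] is the set of labels of training sample i.
    """
    m = len(X)
    prior1 = [(s + sum(1 for y in Y if l in y)) / (2 * s + m)
              for l in range(num_labels)]
    # c[l][j]: samples WITH label l whose k neighbors contain exactly j samples with l.
    # c_not[l][j]: samples WITHOUT label l whose k neighbors contain exactly j such samples.
    c = [[0] * (k + 1) for _ in range(num_labels)]
    c_not = [[0] * (k + 1) for _ in range(num_labels)]
    for i in range(m):
        neighbors = knn_indices(X, X[i], k, skip=i)
        for l in range(num_labels):
            j = sum(1 for a in neighbors if l in Y[a])
            if l in Y[i]:
                c[l][j] += 1
            else:
                c_not[l][j] += 1
    return {"X": X, "Y": Y, "k": k, "s": s,
            "prior1": prior1, "c": c, "c_not": c_not}


def mlknn_scores(model, query):
    """Posterior probability that the query has each label, via the Bayesian rule."""
    k, s = model["k"], model["s"]
    neighbors = knn_indices(model["X"], query, k)
    scores = []
    for l, p1 in enumerate(model["prior1"]):
        j = sum(1 for a in neighbors if l in model["Y"][a])
        like1 = (s + model["c"][l][j]) / (s * (k + 1) + sum(model["c"][l]))
        like0 = (s + model["c_not"][l][j]) / (s * (k + 1) + sum(model["c_not"][l]))
        scores.append(p1 * like1 / (p1 * like1 + (1 - p1) * like0))
    return scores


# Hypothetical one-dimensional data: two well-separated clusters, one label each.
X_train = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
Y_train = [{0}, {0}, {0}, {1}, {1}, {1}]
model = mlknn_fit(X_train, Y_train, num_labels=2, k=2)
scores = mlknn_scores(model, [0.05])
```

A label is predicted as relevant when its score exceeds 0.5, which is equivalent to the arg max over the two events; the scores themselves can also be used to rank labels.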

The Model of Predicting Audience Location
ML-𝑘NN, which combines 𝑘NN and the Bayesian rule to conduct multilabel classification, has been widely used. However, when ML-𝑘NN is used to predict audience location, three problems arise. To solve these problems, we first propose a method of similarity measurement based on weight and then present an algorithm for quickly finding similar videos. Finally, we replace the label selection step of ML-𝑘NN with the ranking of a specific number of relevant labels and incorporate the similarity measurement and the fast search for similar videos into ML-𝑘NN to build the AL-𝑘NN model for video audience location prediction.

The Method of Similarity Measurement Based on Weight.
For ML-𝑘NN, the quality of the similar items found for a sample directly influences the accuracy of classification. To find similar items, the distance between feature vectors, such as the Euclidean distance or the cosine of the angle between the vectors, is generally used to calculate the similarity of two samples. These methods treat all features as equally important and do not consider the weight of each feature. However, the features of YouTube videos differ in how strongly they relate to the location distribution of audiences. For example, the data analysis results show that the position of a YouTube user can be closely related to the geographical position distribution of audiences, whereas the user's gender can be completely irrelevant to it. Therefore, we propose a method of similarity measurement in which the weight of each feature is considered.
The method of similarity measurement based on weight determines the weight of each feature according to the relationship between that feature and audience location. Specifically, the local similarity of each feature is computed first. Suppose that the $i$th feature values of the feature vectors of two videos $u$ and $v$ are $u_i$ and $v_i$, respectively. If the $i$th feature is continuous, $u_i$ and $v_i$ are normalized; that is, each value is divided by the maximum value of this feature over all videos. The distance of the $i$th feature is then calculated from $u_i$ and $v_i$, and the final similarity, $\mathrm{similarity}(u, v)$, between videos $u$ and $v$ is calculated from the distances between the corresponding features of the two videos, where $w_i$ is the weight of the $i$th feature and $\sum_i w_i = 1$.
To determine the weight of each feature, the relationship between the feature and the location of video audiences should be analyzed, that is, we determine which features play key roles in the location distribution of audiences and quantify the relationship. If videos with the same value of a feature have similar audience location distributions, this feature is strongly related to the audience position of the video. Accordingly, this feature should have a large proportion in calculating the similarity degree of videos. On the basis of this idea, the specific calculation method of feature weighting is presented as follows.
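A weighted similarity of this kind can be sketched as follows. The exact distance and similarity formulas are not reproduced here, so the forms below are assumptions made only for illustration: the distance of a continuous feature is taken as the absolute difference of the max-normalized values, the distance of a categorical feature is 0 for equal values and 1 otherwise, and the similarity is the weighted sum of one minus each distance. The feature values and weights in the example are hypothetical.

```python
def feature_distance(a, b, max_value=None):
    """Local distance in [0, 1] for one feature (assumed form, see lead-in).

    Continuous features pass max_value, the maximum of the feature over all
    videos (values are assumed nonnegative); categorical features pass None.
    """
    if max_value is not None:
        return abs(a - b) / max_value
    return 0.0 if a == b else 1.0


def weighted_similarity(u, v, weights, max_values):
    """Weighted similarity of two feature vectors; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("feature weights must sum to 1")
    return sum(w * (1.0 - feature_distance(a, b, mx))
               for a, b, w, mx in zip(u, v, weights, max_values))


# Hypothetical videos described by (uploader country, uploader age, category).
video_u = ("US", 25, "music")
video_v = ("US", 40, "sports")
weights = [0.6, 0.1, 0.3]        # assumed weights, largest for the country feature
max_values = [None, 100, None]   # only the age feature is continuous
similarity_uv = weighted_similarity(video_u, video_v, weights, max_values)
```

With these assumed weights the shared uploader country dominates the result, which mirrors the intent that features closely related to audience location should contribute most.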
Given $d$ features, suppose that feature $f_i$ ($1 \le i \le d$) takes $q_i$ values. The videos are placed into different sets $V_j$ ($1 \le j \le q_i$) according to their value of this feature. If the feature is continuous, its values are divided into segments, and all videos whose values fall in the same segment are placed in one set. The videos in $V_j$ then form video pairs $(v_a, v_b)$. Suppose that the first $n$ countries with the largest number of audiences for these two videos are $(T_a, T_b)$; the similarity of these two sets is calculated for each pair. On the basis of this pair similarity, the audience similarity over all videos in $V_j$ is calculated, and the average similarity $\bar{S}_i$ of the audience countries of videos that share a value of feature $f_i$ is then obtained. The weight of feature $f_i$ is the proportion of its average similarity in the sum of the average similarities of all features:

$$w_i = \frac{\bar{S}_i}{\sum_{p=1}^{d} \bar{S}_p}.$$

The Algorithm for Quickly Finding Similar Videos.
Before proposing the search algorithm, we conduct an analysis that guides the design of an efficient algorithm. Many characteristics of online social networks show a certain degree of homogeneity. For example, Wu et al. [12] find that Twitter users always pay more attention to the same user categories. Thelwall [13] found that MySpace users show obvious homogeneity in religion, nationality, and age. Therefore, the video audience position on YouTube is assumed to also show a certain degree of homogeneity; that is, for a seed user, the videos of a closer neighbor have an audience position distribution more similar to that of the seed user's videos than the videos of a more distant neighbor. If this assumption holds, we can search only the videos of close neighbors to find similar videos, instead of searching all the neighbors.
To test this hypothesis, $\binom{N}{2}$ video pairs $(v_i, v_j)$ are formed from $N$ videos and are then placed into different video groups according to the distance between their uploaders. In the zeroth group, the distance between the uploaders of the two videos is zero; that is, the two videos are uploaded by the same user. In the first group, the distance is one; that is, the uploader of one video is a direct neighbor of the other video's uploader. In the second group, the distance is two; that is, the uploader of one video is a two-hop neighbor of the other video's uploader. The remaining groups are defined by analogy. The average similarity of the video pairs in each group is then calculated. The results are shown in Figure 1. The $x$-axis indicates the group number of the videos, and the $y$-axis is the average similarity value. Figure 1 illustrates that the similarity between video audience positions decreases as the distance between uploaders increases. For example, the average similarity of the videos in group 0 is nearly 50% higher than that in group 6. This result supports our hypothesis; that is, a shorter distance between video uploaders corresponds to a higher similarity degree of video audience position. Therefore, instead of traversing all the videos, the videos of the closer uploaders are searched first when finding similar videos.
The analysis shows that a shorter distance between a user and its neighbor leads to a higher similarity between their videos. Therefore, instead of traversing all the videos, the videos uploaded by closer neighbors are searched first when searching for the 𝑘-nearest neighbors of the seed video. Generally, online social networks have the small-world characteristic. Existing research also shows that the average path length in online social networks is about 6 [14]. Hence, searching the six-hop friends of the video uploader is as complex as traversing the whole training set, whereas the 𝑘-nearest neighbors may not be identified if only the one-hop friends of the uploader are searched. Therefore, the searching hop number $h$ is a variable in the algorithm, and this parameter should be determined according to the actual situation. At the same time, a threshold $\theta$ on the number of searched videos is set; that is, the searching process is stopped when the number of acquired videos reaches $\theta \cdot k$, and the result is determined. The algorithm is described in Algorithm 1.

Algorithm 1: Identifying the 𝑘-nearest neighbors.
Input: the seed video, the topology of the uploader's neighbors, the searching hop number $h$, and the threshold $\theta$
Output: the video set sorted by similarity degree
(1) for $i = 0$ to $h$ do
(2)   $U_i$ = {the $i$th-hop neighbors of the uploader}
(3)   $V_i$ = {the videos uploaded by the users in $U_i$}
(4)   for each video $j$ in $V_i$ do
(5)     $V_{all} = V_{all} \cup \{j\}$
(6)     if $|V_{all}| \ge \theta \cdot k$
(7)       go to line (8)
(8) Compute the similarity between each video in $V_{all}$ and the seed video
(9) Rank the videos in $V_{all}$ by their similarity
(10) Return $V_{all}$
In Algorithm 1, the neighbors of the uploader within $h$ hops are traversed (line 1). The neighbors at each hop are placed in $U_i$ (line 2), and the videos uploaded by them are placed in $V_i$ (line 3). The videos obtained at each hop are then accumulated in $V_{all}$ (lines 4 and 5). If the number of videos reaches $\theta \cdot k$, searching is halted; otherwise, the search continues with the next hop (lines 6 and 7). The similarity degree between the seed video and the other videos is calculated (line 8). Finally, the videos in $V_{all}$ are ranked and returned (lines 9 and 10).
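The hop-by-hop search of Algorithm 1 can be sketched in Python as a breadth-first traversal with an early exit. The data layout (dictionaries mapping users to neighbors and to video IDs), the parameter name `min_videos` for the stopping threshold, and the decision to exclude the seed video from the ranked output are assumptions of this sketch.

```python
def collect_candidates(seed_uploader, neighbors, videos_by_user, max_hops, min_videos):
    """Gather videos hop by hop and stop as soon as min_videos have been collected.

    neighbors: dict mapping a user to an iterable of adjacent users.
    videos_by_user: dict mapping a user to a list of video IDs.
    Hop 0 is the seed uploader itself.
    """
    visited = {seed_uploader}
    frontier = [seed_uploader]
    candidates = []
    for _ in range(max_hops + 1):
        for user in frontier:
            for video in videos_by_user.get(user, []):
                candidates.append(video)
                if len(candidates) >= min_videos:
                    return candidates
        # Build the next hop from users that have not been visited yet.
        next_frontier = []
        for user in frontier:
            for other in neighbors.get(user, ()):
                if other not in visited:
                    visited.add(other)
                    next_frontier.append(other)
        frontier = next_frontier
    return candidates


def rank_by_similarity(seed_video, candidates, similarity):
    """Sort the candidate videos by decreasing similarity to the seed video."""
    others = [video for video in candidates if video != seed_video]
    return sorted(others, key=lambda video: similarity(seed_video, video), reverse=True)


# Hypothetical friendship graph and uploads.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
uploads = {"a": ["a1", "a2"], "b": ["b1"], "c": ["c1"], "d": ["d1", "d2"]}
found = collect_candidates("a", graph, uploads, max_hops=2, min_videos=4)
toy_similarity = {"a2": 0.5, "b1": 0.9, "c1": 0.2}
ranked = rank_by_similarity("a1", found, lambda seed, video: toy_similarity[video])
```

In this example the search stops after the first hop because four videos have already been found, so the two videos of the two-hop neighbor are never scored.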

The Improved Method Based on ML-𝑘NN.
On the basis of the above-mentioned method of similarity measurement and the algorithm for finding similar videos, the ML-𝑘NN method is improved to obtain the audience location prediction method based on 𝑘-nearest neighbor classification (AL-𝑘NN). The detailed process is as follows.
(1) Calculate the prior probabilities $P(H_1^l)$ and $P(H_0^l)$ of each label $l$:

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} y_{x_i}(l)}{2s + m}, \qquad P(H_0^l) = 1 - P(H_1^l), \quad \forall l \in Y,$$

where $H_1^l$ denotes the event that the sample contains label $l$, $H_0^l$ denotes the event that the sample does not contain label $l$, $m$ is the number of training samples, and $s$ is the preset smoothing parameter. $y_x(l)$ indicates whether label $l$ belongs to the label set of sample $x$: if so, $y_x(l) = 1$; otherwise, $y_x(l) = 0$.
(2) For each training sample $x$, the video similarity measurement based on weight and the algorithm for quickly finding similar videos are executed to search for its $k$ nearest neighbors in the training set, which are placed in the set $N(x)$.
For the test sample $t$, the label membership vector is

$$C_t(l) = \sum_{a \in N(t)} y_a(l), \quad \forall l \in Y,$$

where $C_t(l)$ records the number of samples in $N(t)$ that contain label $l$. After the labels are sorted by the resulting values, the specified number of labels is assigned to the test sample.

Table 4 (rows recovered from the source): the distance between the capitals of the countries of origin of the publishers; cultural background, the cultural category of the country of origin of the publisher.
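The final step, which distinguishes this model from plain ML-𝑘NN, assigns a fixed number of top-ranked labels instead of every label judged relevant. A minimal sketch is given below; the function name `assign_top_labels`, the use of label indices, and the rule that ties go to the lower index are assumptions of this sketch.

```python
def assign_top_labels(scores, n):
    """Return the indices of the n highest-scoring labels, best first.

    Ties are broken in favor of the lower label index (an assumption of this
    sketch) so that the ranking is deterministic.
    """
    order = sorted(range(len(scores)), key=lambda label: (-scores[label], label))
    return order[:n]


# Hypothetical per-country scores for one test video.
example_scores = [0.1, 0.8, 0.3, 0.8]
top_two = assign_top_labels(example_scores, 2)
```

Because exactly `n` labels are returned for every video, the output can be compared directly with the first `n` audience countries of that video.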

Feature Selections
To use AL-𝑘NN to predict audience location, features must be extracted for the videos. This section describes the selected features, including the publisher and basic video attributes obtained from the YouTube APIs and the language, culture, and physical distance features derived from these basic attributes.

Basic Publisher Features.
The basic features related to the publisher are provided first. The user information that can be downloaded through the APIs is used as the features for prediction. The information includes gender, age, and registration time. The features are shown in Table 3.

The Extended Publisher Features.
In addition to the basic features of video uploaders obtained directly from the YouTube APIs, other relevant uploader features (e.g., cultural background, language, and uploader distance) are described in this section.
According to cultural background and geographical position [15], the 10 selected countries are divided into 3 groups. The first group is composed of European countries: Spain, France, Great Britain, Germany, Italy, and Poland. The second group is composed of North American countries: the United States and Canada. The third group is composed of Latin American countries: Mexico and Brazil. Given the linguistic complexity of the different countries, the basic principle for determining the user language is that the official language of the country or region is taken as the feature value of the user language. If no official language is designated, the most widely used language in the area is taken as the feature value. The distance between two uploaders is the distance between the capital cities of their countries. The detailed features of the uploader are shown in Table 4.

Video Content Features.
The video content features describe the relevant information of the video (Table 5). Some features, such as the number of audiences, the number of comments, and the video rating, can only be obtained after a video is uploaded, whereas we predict audience location before videos are published. Hence, only the three features that can be obtained before a video is published are selected as video content features.

Experiment of Audience Location Prediction
This section presents the performance evaluation of the algorithm for quickly finding similar videos and of the AL-𝑘NN method for predicting audience location. The data presented in Section 2 are used for the experiments, and the features in Section 5 are computed and used for the experiments. Note that the performance of AL-𝑘NN reflects the effectiveness of the similarity measurement based on weight; therefore, we do not evaluate that method separately.

Evaluating Indicator.
The common evaluating indicators of multilabel classification include Hamming Loss, One Error, Coverage, Ranking Loss, and Average Precision [8]. Among these indicators, Hamming Loss is calculated from the predicted label set, and the other four are calculated by using the real-valued function of the corresponding method.
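(1) Hamming Loss:

$$\mathrm{HammingLoss}(h) = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{|Y|} \left| h(x_i) \,\Delta\, Y_i \right|,$$

where $S$ denotes the set of test samples, $Y$ is the set of all labels, $h(x_i)$ denotes the predicted label set of test sample $x_i$, $Y_i$ is the actual label set of $x_i$, and $\Delta$ denotes the symmetric difference of the two sets. This indicator measures the degree of inconsistency between the predicted labels and the actual labels of a multilabel classifier. A smaller value indicates that the multilabel classifier has a better classification effect.
(2) One Error:

$$\mathrm{OneError}(f) = \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbb{1}\left[ \arg\max_{y \in Y} f(x_i, y) \notin Y_i \right],$$

where $S$ denotes the set of test samples and $\mathbb{1}[\pi]$ equals 1 if the predicate $\pi$ holds and 0 otherwise. This indicator describes the probability that the label with the maximal membership value is not among the actual labels. A smaller value also means that the multilabel classifier has a better classification effect.
(3) Coverage:

$$\mathrm{Coverage}(f) = \frac{1}{|S|} \sum_{i=1}^{|S|} C(x_i),$$

where $C(x_i)$ is defined as follows:

$$C(x_i) = \max_{y \in Y_i} \mathrm{rank}_f(x_i, y) - 1,$$

and $\mathrm{rank}_f(x_i, y)$ is the position of label $y$ when the labels are sorted in descending order of $f(x_i, \cdot)$. This indicator measures how far, on average, one must descend from the label with the maximal membership value in the sorted list to cover all the labels possessed by the sample. A smaller value indicates that the multilabel classifier has a better classification effect.
(4) Ranking Loss:

$$\mathrm{RankingLoss}(f) = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{|Y_i|\,|\bar{Y}_i|} \left| \left\{ (y_1, y_2) \in Y_i \times \bar{Y}_i : f(x_i, y_1) \le f(x_i, y_2) \right\} \right|,$$

where $\bar{Y}_i$ is the complement of $Y_i$ in $Y$. This indicator measures the average fraction of label pairs in which a label that does not belong to the sample is ranked at least as high as one that does. A smaller value indicates that the multilabel classifier has a better classification effect.
(5) Average Precision:

$$\mathrm{AvgPrec}(f) = \frac{1}{|S|} \sum_{i=1}^{|S|} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{\left| \left\{ y' \in Y_i : \mathrm{rank}_f(x_i, y') \le \mathrm{rank}_f(x_i, y) \right\} \right|}{\mathrm{rank}_f(x_i, y)}.$$

This indicator measures the average proportion of actual labels among the labels ranked at or above each actual label. In contrast to the four aforementioned indicators, a larger value indicates that the multilabel classifier has a better classification effect.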
These five indicators evaluate, from different angles, the performance of classifiers constructed with different multilabel classification algorithms. It is difficult for a single classifier to be optimal on all five indicators because each classifier has a different emphasis and each indicator focuses on a different aspect.
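For readers who want to reproduce these indicators, the sketch below implements their standard definitions in plain Python. It is an illustration rather than the evaluation code used in the experiments: labels are numbered from 0, ties in the scores are broken by label index, every sample is assumed to have at least one relevant and one irrelevant label, and the two test samples are hypothetical.

```python
def ranks(scores):
    """Map each label to its rank (1 = highest score; ties broken by label index)."""
    order = sorted(range(len(scores)), key=lambda label: (-scores[label], label))
    return {label: position + 1 for position, label in enumerate(order)}


def hamming_loss(predicted, actual, num_labels):
    wrong = sum(len(p ^ a) for p, a in zip(predicted, actual))
    return wrong / (len(actual) * num_labels)


def one_error(all_scores, actual):
    errors = 0
    for scores, labels in zip(all_scores, actual):
        top = max(range(len(scores)), key=lambda label: scores[label])
        errors += top not in labels
    return errors / len(actual)


def coverage(all_scores, actual):
    total = 0
    for scores, labels in zip(all_scores, actual):
        r = ranks(scores)
        total += max(r[label] for label in labels) - 1
    return total / len(actual)


def ranking_loss(all_scores, actual):
    total = 0.0
    for scores, labels in zip(all_scores, actual):
        others = [label for label in range(len(scores)) if label not in labels]
        bad = sum(1 for y in labels for z in others if scores[y] <= scores[z])
        total += bad / (len(labels) * len(others))
    return total / len(actual)


def average_precision(all_scores, actual):
    total = 0.0
    for scores, labels in zip(all_scores, actual):
        r = ranks(scores)
        per_label = sum(sum(1 for z in labels if r[z] <= r[y]) / r[y] for y in labels)
        total += per_label / len(labels)
    return total / len(actual)


# Two hypothetical test samples with four candidate labels each.
test_scores = [[0.9, 0.1, 0.6, 0.3], [0.2, 0.7, 0.4, 0.5]]
test_actual = [{0, 2}, {0, 1}]
test_predicted = [{0, 2}, {1, 3}]   # the two highest-scoring labels per sample
```

On this toy data the first sample is ranked perfectly and the second is not, so the indicators fall strictly between their best and worst values, which makes the example convenient for checking an implementation by hand.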

Performance Evaluation of the Searching Algorithm.
In this section, the performance of the searching algorithm is evaluated by comparing the algorithm proposed in this paper with the algorithm that traverses all videos in terms of the number of computations, running time, and search result accuracy.
The number of searching hops $h$ is varied from two to six to evaluate the algorithm performance. Figure 2 compares the number of computations and the running time, where the $x$-axis is the number of searching hops $h$ and the $y$-axis indicates the ratio of the number of computations of our algorithm to that of the traversing algorithm. The ratio increases with $h$ because the searching scope expands. However, the number of computations and the running time of our algorithm are significantly lower than those of the traversing algorithm when $h \le 3$. The ratio is only 27% when $h = 3$.
The effect of searching for videos with similar audience location is compared in Figure 3. The $x$-axis is the number of searching hops $h$, and the $y$-axis indicates the ratio of the number of elements in the set $A \cap B$ to the number of all the similar videos. The set $A$ is the video set obtained by using the proposed algorithm, and the set $B$ is the video set obtained by using the traversing algorithm; that is, the $y$-axis denotes the value $|A \cap B| / |B|$.
Figure 3 illustrates that the three curves almost overlap when different numbers of similar videos are selected, indicating that the proposed algorithm achieves similar search performance regardless of the number of similar videos selected. The ratio for the three different numbers of videos is relatively low only when $h = 2$, and it exceeds 80% when $h \ge 3$. Figures 2 and 3 show that, when $h = 3$, our proposed algorithm significantly reduces the number of computations while obtaining search performance close to that of the traversing algorithm. Therefore, subsequent experiments are conducted with $h = 3$.

Predicting Performance with Different Number of Neighbors.
In this section, experiments are conducted to evaluate the performance of AL-𝑘NN when the number of selected closest neighbors ($k$) varies. The first 5 countries are chosen; that is, each video is assigned 5 labels. Experiments are conducted with $k$ varying from 1 to 20, and a selection of the better results is given in Table 6. The performance differs little as $k$ varies, and no single value achieves the best performance on all indicators. After comprehensive comparison, the overall performance is relatively better when $k = 7$. Therefore, subsequent experiments are conducted with $k = 7$.

Predicting Performance with Different Number of Countries.
In this section, the performance of the classification model on the five indicators is examined as the number of countries assigned to each video varies. For each video, predicting its audience position means selecting the first $n$ countries with the largest number of audiences from the candidate countries. Here we observe the performance as $n$ changes from 1 to 8. We evaluate AL-𝑘NN by comparing it with three common multilabel classification methods: rank support vector machine (Rank-SVM) [16], multilabel naive Bayes (MLNB) [17], and ML-𝑘NN. For the predictive experiments, the videos are divided into a training set and a test set of 50% each.
The evaluation results are shown in Figure 4, where the $x$-axis is the number of countries assigned to each video and the $y$-axis indicates the predictive performance. The methods differ in performance across indicators. For example, when Hamming Loss is used, the performance of the AL-𝑘NN method is close to that of the ML-𝑘NN method, and the ML-𝑘NN method exceeds Rank-SVM. By contrast, when Coverage is used, the AL-𝑘NN method exceeds ML-𝑘NN, and the ML-𝑘NN method is close to Rank-SVM. However, considering all five indicators, the overall prediction performance of the AL-𝑘NN method is superior to that of the Rank-SVM, ML-𝑘NN, and MLNB methods. Therefore, the experiment shows that AL-𝑘NN achieves better performance in predicting audience location.

Conclusions
On the basis of ML-𝑘NN, a model for predicting audience location is proposed in this paper. The problem of predicting the audience location distribution of YouTube videos is transformed into a multilabel classification problem. First, to address the problem that feature weight is not considered when measuring the similarity degree in ML-𝑘NN, a method of measuring the video similarity degree on the basis of weight is introduced, together with a method for calculating the feature weights. Second, to address the problem that the ML-𝑘NN method takes considerable time to find similar items, an algorithm for quickly finding similar videos based on the friend relationships of video owners is proposed. Finally, based on these two methods, the ML-𝑘NN method is improved to solve the problem of audience location prediction. The experiments based on massive YouTube data show that the method introduced in this paper can more accurately predict the location of YouTube video audiences than the ML-𝑘NN, MLNB, and Rank-SVM methods.

Figure 1: Relationship between the video and distance of uploaders.

Figure 2: Performance ratio of the proposed algorithm to traversing method.

Figure 4: Prediction performance with different numbers of countries.

Table 2: Selected countries of the videos.
$E_j^l$ ($j \in \{0, 1, \ldots, k\}$) denotes the event that exactly $j$ samples among the $k$ nearest neighbors of the training sample contain label $l$. $c[j]$ denotes the number of training samples that contain label $l$ and have exactly $j$ samples with label $l$ among their $k$ nearest neighbors. $c'[j]$ denotes the number of training samples that do not contain label $l$ and have exactly $j$ samples with label $l$ among their $k$ nearest neighbors.
The $k$ nearest neighbors of the test sample $t$ are placed into $N(t)$, and the label membership vector $C_t$ is then calculated.

Table 3: Basic user features.

Table 5: Basic video features.

Table 6: Performance comparison with different $k$ values.