In recent years, the rapid progress of urban computing has produced massive volumes of data, creating both opportunities and challenges. The heterogeneity and sheer volume of these data, together with the gap between the physical and virtual worlds, make it difficult to solve practical urban computing problems quickly. In this paper, we propose a general application framework of ELM for urban computing. We present several real case studies of the framework, such as smog-related health hazard prediction and optimal retail store placement. Experiments on urban data from China show the efficiency, accuracy, and flexibility of our proposed framework.
Urban computing is the process of acquiring, integrating, and analyzing big and heterogeneous data generated by diverse sources in urban spaces, such as sensors, devices, vehicles, buildings, and humans, to tackle the major issues that cities face (e.g., air pollution, increased energy consumption, and traffic congestion) [
In practice, the urban data acquired in real situations are typically heterogeneous and voluminous. Thus, according to studies [
Extreme learning machine (ELM) [
To tackle the challenges of urban computing with ELM, we first analyze the types, sources, and structures of urban data. We standardize the form of data from different sources and fuse data across modalities as the input of ELM. Urban data can be obtained from online social media and from offline physical sensors. Online social media (tweets, user comments) carry opinions about places. These data are normally text (user ratings are numeric) and uncertain. A further complication is that the amount of such data depends heavily on the population and type of a district. For instance, it is easy to retrieve large amounts of data in a metropolis, whereas less-developed regions have small populations and, hence, relatively low social media activity. Meanwhile, offline data reflect the physical condition of a region, such as the flow of taxis and buses, traffic congestion index, real estate, POIs, road network, air quality, and meteorological elements. The amounts of these data are usually similar across regions. Nevertheless, they come from different sources and are heterogeneous. We divide the city into regions by the road network and obtain large amounts of data for each region. We formalize these data and build standard feature vectors, comprising a social view from online social media and a physical view from physical sensors, through location based feature selection. The social view and the physical view are treated as different views on the same object. We adopt the deep autoencoder method [
Furthermore, we present several case studies of our framework. For different real applications, we can easily add or remove data sources and adjust the model's parameters. We use social media, meteorological elements, and air quality to predict smog-related health hazards. These data are truly heterogeneous, while stacked ELM [
The major contributions of this paper are as follows: We propose a general application framework of ELM for urban computing. We present several case studies of our framework, such as smog-related health hazard prediction and optimal retail store placement. We evaluate our framework on real urban datasets and demonstrate its advantages in precision and learning speed.
The remainder of this paper is organized as follows. Section
Extreme learning machine (ELM) has recently attracted many researchers’ interest due to its very fast learning speed, good generalization ability, and ease of implementation [
With the rapid growth of cities, huge amounts of data can be obtained, yet data sparsity problems persist owing to limited sensor coverage, differences in human activity, and so on. L. Zhang and D. Zhang [
The data from different domains consist of multiple modalities. According to [
However, there is a lack of research on fusing social media data and physical sensor data within a general application framework for urban computing. This is mainly due to the lack of (1) systematic approaches for collecting, modeling, and analyzing such information and (2) an efficient machine learning framework that can combine features from both the social and physical views.
The urban data of a city (e.g., human mobility and the number of changes in a POI category) may indicate real conditions and trends in the city [
For instance, the tweets from social media and meteorological elements may indicate the existence of smog-related health hazards. Chen et al. [
Furthermore, human mobility combined with POIs may contribute to decisions about the placement of certain businesses. Scellato et al. [
According to [
As Figure
General application framework of ELM for urban computing.
We divide a city into disjoint blocks (
Segmented regions.
For each block, we can access the features over its neighbourhood and simply take their average. However, ignoring the location information of the surrounding features causes problems, for two reasons. (1) From the distance perspective, a selected location far from the target location may have little impact on it, and vice versa. (2) From the density perspective, if one neighbouring block contains many stores while the others contain few, that block likely has a greater influence on the target location. We therefore propose a location based feature selection (LBFS) method that calculates features while accounting for the neighbourhood's impact. Suppose we want to calculate a feature of target block
In real situations, we use all three kinds of feature selection results, so each kind of feature in a block normally yields three final feature values, forming a vector. Formally, we have
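A minimal sketch of this LBFS aggregation in Python is given below. The block representation (dicts with coordinates, a feature value, and a store density) and the inverse-distance weighting are illustrative assumptions, not the paper's exact formulation; the point is that each feature gets three values: a plain average, a distance-weighted average, and a density-weighted average.

```python
import math

def lbfs(target, neighbours, feature):
    """Location-based feature selection (sketch): aggregate one feature
    over a block's neighbourhood in three ways, yielding a 3-value vector.
    Blocks are dicts with 'x', 'y', 'density', and feature values
    (illustrative field names)."""
    vals = [b[feature] for b in neighbours]
    # 1. Plain average over the neighbourhood.
    avg = sum(vals) / len(vals)
    # 2. Distance weighting: nearer blocks influence the target more.
    dists = [math.hypot(b['x'] - target['x'], b['y'] - target['y'])
             for b in neighbours]
    w_dist = [1.0 / (d + 1e-9) for d in dists]   # avoid division by zero
    dist_avg = sum(w * v for w, v in zip(w_dist, vals)) / sum(w_dist)
    # 3. Density weighting: blocks with more stores/POIs weigh more.
    dens = [b.get('density', 1.0) for b in neighbours]
    dens_avg = sum(w * v for w, v in zip(dens, vals)) / sum(dens)
    return [avg, dist_avg, dens_avg]
```

For each block, these three values per raw feature are concatenated to form the block's feature vector.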
For each block in a city we obtain the social view and the physical view separately, each represented as a feature vector. The social view is composed of social media text, user comment text, user ratings, and so on, according to the application's requirements. The physical view is composed of physical sensor values, such as the flow of taxis and buses, traffic congestion index, real estate, POIs, road network, air quality, and meteorological elements, again depending on the application's requirements. We adopt the deep autoencoder [
Deep autoencoder of social view and physical view.
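As a rough illustration of this fusion step, the sketch below concatenates each block's social-view and physical-view vectors and trains a single-hidden-layer autoencoder, using the hidden activations as the middle-level joint representation. The architecture, hyperparameters, and training schedule here are assumptions for illustration; the paper's deep (multi-layer) autoencoder stacks several such layers.

```python
import numpy as np

def autoencoder_fuse(social, physical, hidden=8, epochs=500, lr=0.1, seed=0):
    """Cross-view fusion sketch: concatenate both views per block, train a
    one-hidden-layer autoencoder by gradient descent on reconstruction
    error, and return the hidden activations as the joint representation."""
    rng = np.random.default_rng(seed)
    X = np.hstack([social, physical])          # one row per block
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, d)); b2 = np.zeros(d)
    sigm = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        H = sigm(X @ W1 + b1)                  # encode
        Y = H @ W2 + b2                        # decode (linear output)
        err = Y - X                            # reconstruction error
        dW2 = H.T @ err / n; db2 = err.mean(0)
        dH = err @ W2.T * H * (1 - H)          # backprop through sigmoid
        dW1 = X.T @ dH / n; db1 = dH.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return sigm(X @ W1 + b1)                   # middle-level representation
```

Stacking repeats this step, feeding each layer's hidden representation into the next autoencoder.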
Practically, we obtain the middle-level feature representation of each block. For those with abundant data, we use stacked ELM [
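For readers unfamiliar with ELM, the minimal single-hidden-layer sketch below shows why training is so fast: the input weights are random and only the output weights are solved, in closed form, via the Moore-Penrose pseudoinverse. This is a generic illustration, not the stacked variant used in the paper, which applies the same idea layer by layer.

```python
import numpy as np

def elm_train(X, T, hidden=40, seed=0):
    """Basic ELM: random input weights and biases, sigmoid hidden layer,
    output weights solved in closed form (no iterative training)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, (X.shape[1], hidden))
    b = rng.uniform(-1, 1, hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # random hidden-layer outputs
    beta = np.linalg.pinv(H) @ T             # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because only `beta` is learned, training reduces to one pseudoinverse, which is the source of the learning-speed advantage reported in the experiments.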
In fact, smog is a serious hazard that affects people's health according to recent research [
Public Health Index (PHI) is the sum of total relative frequencies of smog-related health hazard phrases in the current tweets. D-PHI is an enhanced Public Health Index that includes consideration of diffusion in social networks.
Firstly, we extract both smog-related health hazard phrases and smog severity phrases. Secondly, we gather raw tweets with time and location tags from Weibo (a Twitter-like service in China). Thirdly, we calculate the daily relative frequency
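The relative-frequency step can be sketched as below, assuming PHI sums, over the extracted hazard phrases, the fraction of the day's tweets mentioning each phrase. The substring matching and the example phrases are simplifying assumptions; the actual system's phrase extraction and tokenization are not reproduced here.

```python
def public_health_index(tweets, hazard_phrases):
    """Sketch of PHI: for each smog-related health hazard phrase, its
    relative frequency is the fraction of the day's tweets mentioning it;
    PHI is the sum of these relative frequencies."""
    n = len(tweets)
    if n == 0:
        return 0.0
    phi = 0.0
    for phrase in hazard_phrases:
        hits = sum(1 for t in tweets if phrase in t)  # naive matching
        phi += hits / n
    return phi
```

D-PHI would extend this by additionally weighting each matched tweet by its diffusion (e.g., reposts) in the social network.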
Moreover, we also extract features from air quality, including both air pollution concentrations (
The optimal placement of a retail store has long been of prime importance. For example, a new restaurant set up on a street corner may attract many customers, yet it might close within months if located just a few hundred meters down the road. In this paper, the optimal retail store placement problem is a ranking problem. We calculate scores for each of the candidate areas and rank them; the top-ranked areas are the optimal regions for placement. We get the label data from Dianping score (
A strong regional economy usually indicates high demand according to recent studies [
Bus transits are slow and cheap and are mainly distributed in areas having a large number of IT and educational establishments. The price of real estate and the traffic congestion index indicate whether the facility planning is balanced. We exploit these features to uncover the implicit preferences for a neighbourhood.
Table
Details of the datasets.
Datasets | Size (M) | Sources
---|---|---
Comments | 2,523 |
Tweets | 11,023 |
Buses | 254 |
Traffic | 119 |
Real estate | 35 |
Air | 534 |
POI, business areas | 10 |
Road network | 9 |
Meteorological | 98 |
Overall, we use two main metrics for our framework: precision and efficiency. For the specific case studies, for smog-related health hazard prediction, we have the following.
For optimal retail store placement, we have the following.
Later, given the ideal discounted cumulative gain
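NDCG@k as used here can be computed as follows: the DCG of the predicted ranking of candidate areas divided by the ideal DCG obtained from the relevance labels sorted in descending order. This is the standard log2-discount formulation, sketched under the assumption that relevance labels are nonnegative scores.

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain over the top-k items of a ranking."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_rels, k):
    """NDCG@k: DCG of the predicted ranking divided by the DCG of the
    ideal (descending-relevance) ranking."""
    ideal = sorted(ranked_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_rels, k) / idcg if idcg > 0 else 0.0
```

A perfect ranking scores 1.0; misordering high-relevance areas toward the bottom lowers the score.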
For smog-related health hazard prediction, the stacked ELM is trained with 10 to 40 hidden nodes; the BP baseline is an ANN with 2 to 3 hidden layers and 8 to 15 nodes in each hidden layer. Two classic SVM regression methods, nu-SVR and epsilon-SVR, are provided by LIBSVM [
The results of smog-related health hazard prediction.
Cities | Stacked ELM | | | BP | | | nu-SVR | | | Epsilon-SVR | | | Random forest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
 | | | Time | | | Time | | | Time | | | Time | | | Time
Beijing | 0.65 | | | 0.66 | 0.79 | 15 | 0.78 | 0.69 | 9 | 0.56 | 0.53 | 9 | 0.10 | 0.15 | 7
Tianjin | 0.92 | | | 0.56 | 0.59 | 13 | 0.58 | 0.73 | 10 | 0.66 | 0.73 | 11 | 0.15 | 0.14 | 8
Shanghai | 0.53 | | | 0.55 | 0.69 | 16 | 0.55 | 0.64 | 11 | 0.53 | 0.53 | 10 | 0.12 | 0.30 | 8
Hangzhou | 0.31 | | | 0.63 | 0.65 | 10 | 0.54 | 0.55 | 9 | 0.66 | 0.83 | 9 | 0.15 | 0.34 | 7
Guangzhou | 0.53 | 0.54 | | 0.65 | 0.69 | 15 | 0.53 | 0.66 | 11 | 0.48 | 0.53 | 21 | 0.10 | 0.34 | 8
Average | 0.75 | 0.73 | | 0.56 | 0.56 | 15 | 0.59 | 0.65 | 10 | 0.59 | 0.73 | 10 | 0.20 | 0.34 | 7
For optimal retail store placement, models trained with a single city's data are used as baselines. Our datasets contain data from five cities in China. The blocks in each city have a label
Figure
NDCG, precision, and recall of @
NDCG@
Precision@
Recall@
Figure
The best average NDCG@10 results of optimal retail store placement.
Cities | Starbucks | TrueKungFu | YongheKing
---|---|---|---
Beijing | 0.743 ( | 0.643 ( | 0.725 (
Shanghai | 0.712 ( | 0.689 ( | 0.712 (
Hangzhou | 0.576 ( | 0.611 ( | 0.691 (
Guangzhou | 0.783 ( | 0.691 ( | 0.721 (
Shenzhen | 0.781 ( | 0.711 ( | 0.722 (
 | | |
Beijing | 0.752 (23) | 0.678 (20) | 0.712 (19)
Shanghai | 0.725 (25) | 0.667 (20) | 0.783 (21)
Hangzhou | 0.723 (21) | 0.575 (19) | 0.724 (15)
Guangzhou | 0.812 (22) | 0.782 (19) | 0.812 (18)
Shenzhen | 0.724 (22) | 0.784 (17) | 0.712 (12)
 | | |
Beijing | 0.753 (45) | 0.658 (40) | 0.702 (41)
Shanghai | 0.724 (44) | 0.657 (42) | 0.7283 (44)
Hangzhou | 0.725 (41) | 0.555 (40) | 0.714 (45)
Guangzhou | 0.832 (45) | 0.772 (42) | 0.512 (41)
Shenzhen | 0.722 (47) | 0.754 (42) | 0.702 (45)
 | | |
Beijing | | |
Shanghai | | | 0.783 (1)
Hangzhou | | |
Guangzhou | 0.810 (9) | |
Shenzhen | 0.780 (9) | 0.783 (5) |
NDCG, precision, and recall of @
NDCG@
Precision@
Recall@
In this paper, we proposed a general application framework of ELM for urban computing and presented several case studies. Experimental results show that our approach is applicable and efficient compared with the baselines.
In the future, we plan to apply our approach to more applications. Moreover, we would like to investigate a distributed version of our framework so that it can handle even larger volumes of data.
The authors declare that they have no competing interests.