Car-sharing is becoming an increasingly popular travel mode in China, and many companies, including vehicle manufacturers and Internet companies, are investing heavily in it. In the early stage of their business, however, most of them site car-sharing stations by experience, or randomly wherever parking space is available. This results in many stations with low operational efficiency and causes capital loss. This study uses several data sources with statistical models and machine learning algorithms to help car-sharing operators choose the optimal location of new stations and adjust the location of existing ones. We select Chengdu, which has a large car-sharing travel demand and several large car-sharing operators, as the research area and two main operators as the research objects. Instead of focusing on buffers generated around stations, we divide Chengdu into 58724 square grids of 0.5 km × 0.5 km each. We seek a model that estimates a potential travel demand value for each grid from three data sources: order data, population data, and Point of Interest (POI) data. The problem is transformed into a binary classification problem and five methods are implemented: Logistic Regression, Logistic Regression with LASSO, Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis. The selected model, Logistic Regression with LASSO, is used to estimate the probability that demand exists in each grid. With car-sharing order data from different operators, an existing order heat value is also computed for each grid. We then classify all the grids into four groups and give different suggestions on station location for each group. This study focuses on a more competitive market and identifies the factors that influence the order number. Suggestions on the optimal location of stations are given in consideration of competitors.
We hope that our research can help operators improve their business and make rational plans.
International Science & Technology Cooperation Program of China 2016YFE0102200
1. Introduction
Car-sharing is a mode that allows members to access a fleet of vehicles for short-term use without owning them. Members simply reserve a vehicle online or through a mobile app, go to the parking lot, and drive the car. They usually pay after the trip according to the travel distance and/or time [1, 2]. This kind of service allows people to avoid buying a car and spending time looking for parking [3]. Car-sharing has become an increasingly important travel mode over the past five years, especially with the rapid growth of electric vehicles since 2016. Its advantages include reducing vehicle ownership, vehicle kilometres travelled, and greenhouse gas emissions [4–6].
So far, there are three main types of car-sharing: station-based, free-floating, and peer-to-peer [7]. A station-based system requires the operator to rent parking space for stations, and the vehicles are distributed among these stations. Depending on the operation mode, station-based car-sharing takes two forms: one-way and round-trip. One-way car-sharing allows customers to return cars at any designated station regardless of where the trip started. In contrast, with round-trip car-sharing, cars must be returned to the station where the trip started. The first form provides more flexibility but causes a spatial imbalance of vehicles across stations, so operators have to relocate vehicles, which is costly. A free-floating system operates without stations. Instead, the operator defines an area within which customers can park cars in any parking lot [7, 8]. Free-floating car-sharing can also be treated as a special case of one-way car-sharing [9].
Our research focuses on one-way station-based car-sharing systems, for which station siting is a major challenge that requires many factors to be taken into consideration. A good choice can bring high efficiency, large profit, low operation cost, and an advantage over competing operators [10, 11]. However, at the beginning of their business, operators usually locate stations by experience. For example, stations in trading areas, at universities, or at airports are expected to be profitable. But cities differ in their features and in their attitudes towards car-sharing. Experience may therefore be unreliable, and relocation becomes necessary once the business is mature. What is more, some operators rely on their large funds to seize the market ahead of competitors and choose locations randomly wherever parking space is available, in which case relocation is even more necessary.
Researchers have done a lot of work on the relocation or site selection problem using various methods. The analytic hierarchy process (AHP) is the most popular of the multicriteria decision making methods [12]. One study uses AHP to solve the site selection problem for EVCARD, a car-sharing operator in Shanghai. The researchers use 15 factors as decision criteria, covering potential users, potential travel demand, potential travel purposes, and distances from existing stations. But this method relies on candidate stations and on expert scoring, which is subjective [11]. Mathematical and statistical models are also applied. Several studies use multilinear models and mixed-integer programming models to find the optimal location of stations [13–15]. Another paper introduces an intensity model that estimates demand and an imbalance model that describes the difference between pick-ups and drop-offs to find the optimal location of EVCARD stations in Shanghai; it involves usage intensity, usage imbalance, transportation information, and the built environment. These two models use a combination of Elastic Net and the adaptive Least Absolute Shrinkage and Selection Operator (LASSO) [3].
The two previous studies on EVCARD site location both focus on the Shanghai market, where there is no competitor, which makes it much easier to decide on the optimal site location. Our research focuses on a more complicated market: Chengdu, where there are more than five car-sharing operators, each with its own advantages. In Chengdu, two main car-sharing operators account for the majority of the market share. For privacy, we call them operator F and operator H. Both are station-based car-sharing operators. They differ in the number of electric vehicles and stations, vehicle models, and charging mode. We combine the potential demand heat with the existing order heat to give suggestions on site location.
2. Data
We divide Chengdu into 58724 square grids, each with an area of 0.5 km × 0.5 km. All the data we use are allocated to these grids. The data come from three sources. The first is the order data. We collect the order data from the mobile apps of the two car-sharing operators and compute the average daily order number for each car-sharing point. The time period runs from March 28, 2018, to April 17, 2018. Each order is counted once, at its starting point. For each grid, we sum the average daily order numbers of all points in that grid, separately for each of the two operators. There are 1834 grids with car-sharing stations. The summary of these grids is shown in Table 1.
Table 1: Summary of average daily orders of operator F and operator H in grids.

| Operator | Mean | Std. error | Min | Max |
|---|---|---|---|---|
| Operator F | 1.2 | 2.4 | 0 | 37.4 |
| Operator H | 2.8 | 4.2 | 0 | 49.7 |
| Total | 4.0 | 4.8 | 0 | 53.7 |
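The allocation of station-level orders to grids described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes station coordinates have already been projected to metres, and the origin `x0, y0`, the function names, and the sample stations are all hypothetical.

```python
import math
from collections import defaultdict

CELL_M = 500.0  # grid edge length in metres (0.5 km, as in the text)

def grid_index(x_m, y_m, x0=0.0, y0=0.0, cell=CELL_M):
    """Return the (column, row) index of the square cell containing a point.

    x_m, y_m are assumed to be projected coordinates in metres; x0, y0 are a
    hypothetical lower-left corner of the study area.
    """
    return (math.floor((x_m - x0) / cell), math.floor((y_m - y0) / cell))

def orders_per_grid(stations):
    """Sum average daily orders per grid, separately for each operator.

    stations: iterable of (x_m, y_m, operator, avg_daily_orders).
    Returns {grid index: {operator: summed average daily orders}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for x_m, y_m, operator, orders in stations:
        totals[grid_index(x_m, y_m)][operator] += orders
    return {cell: dict(by_op) for cell, by_op in totals.items()}

# Made-up stations for illustration only.
stations = [
    (120.0, 130.0, "F", 1.5),  # falls in cell (0, 0)
    (480.0, 20.0, "H", 2.0),   # falls in cell (0, 0)
    (510.0, 20.0, "H", 0.5),   # falls in cell (1, 0)
    (30.0, 499.0, "F", 1.0),   # falls in cell (0, 0)
]
grid_orders = orders_per_grid(stations)
```

Summing the per-operator values of a cell gives the "Total" quantity summarized in Table 1.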
The second data source is the Point of Interest (POI) information of Chengdu, collected from AMAP in July 2018. AMAP is an online map service similar to Google Maps and is widely used in China. However, AMAP uses the “GCJ-02” coordinate system, which differs from the “WGS-84” coordinate system used in Google Maps, so we transform the coordinates to ensure that the POI information is based on “WGS-84”. The total number of POIs is 860196, and they are categorized into the fourteen classes shown in the first column of Table 2. In the AMAP POI categories, auto service contains various services such as filling stations, automobile rental, and charging stations. Here we consider only one subcategory, automobile rental, since car-rental service will affect the demand for car-sharing service. Additionally, we separate transportation service into five parts: Bus Station, Underground Station, Train Station, Airport, and Parking Lots. Again we allocate these POIs to the grids; the summary is shown in Table 2.
Table 2: Summary of POI information in grids.

| Category | Abbreviation | Type | Mean | Std. error | Max | Min |
|---|---|---|---|---|---|---|
| Auto Service: Car-rental Service | CRS | Numerical | 0.0221 | 0.1836 | 7 | 0 |
| Food & Beverages | FB | Numerical | 2.4100 | 12.5445 | 373 | 0 |
| Shopping | SH | Numerical | 4.9470 | 26.9275 | 900 | 0 |
| Daily Life Service | DS | Numerical | 2.4800 | 11.7907 | 226 | 0 |
| Sports & Recreation | SR | Numerical | 0.3965 | 1.9526 | 78 | 0 |
| Medical Service | MS | Numerical | 0.6307 | 3.3228 | 148 | 0 |
| Accommodation Service | AS | Numerical | 0.3428 | 3.6490 | 251 | 0 |
| Tourist Attraction | TA | Numerical | 0.0609 | 0.4368 | 29 | 0 |
| Commercial House | CH | Numerical | 0.4638 | 2.4126 | 63 | 0 |
| Governmental Organization & Social Group | GS | Numerical | 0.4416 | 2.1701 | 78 | 0 |
| Science/Culture & Education Service | SS | Numerical | 0.5213 | 2.7930 | 117 | 0 |
| Transportation Service: Bus Station | BS | Numerical | 0.2033 | 0.6535 | 10 | 0 |
| Transportation Service: Underground Station | US | Numerical | 0.0053 | 0.0727 | 1 | 0 |
| Transportation Service: Train Station | TS | Numerical | 0.0013 | 0.0359 | 1 | 0 |
| Transportation Service: Airport | AP | Numerical | 0.0003 | 0.0184 | 1 | 0 |
| Transportation Service: Parking Lots | PL | Numerical | 0.6332 | 2.9611 | 81 | 0 |
| Finance & Insurance Service | FS | Numerical | 0.0874 | 0.5926 | 27 | 0 |
| Enterprises | EN | Numerical | 1.2750 | 7.1559 | 358 | 0 |
The last data source used in our research is the population data (POP), the Gridded Population of the World (GPW) from the Socioeconomic Data and Applications Center at NASA. It is now in its fourth version, which models population counts and densities on a continuous global raster surface. The population data are collected from population and housing censuses between 2005 and 2014 and are used to estimate the population for the years 2000, 2005, 2010, 2015, and 2020. For these years, a set of estimates adjusted to national-level historic and future population predictions from the United Nations World Population Prospects report (2015 Revision) is also provided. GPW is gridded with an output resolution of 30 arc-seconds, which is approximately 1 km at the equator. The value for each grid is not the population number in it but a scale that reflects the level of population. In our research we use the population density in 2015 and resample the rasters, together with their population density values, to our grids [16].
3. Methodology
The true value of the potential demand is unknown, since it is an abstract concept that is difficult to measure. One possible way is to survey the whole population by questionnaire, but this requires substantial human and financial resources, and the results might be biased because questionnaire answers are subjective. Another way is to use usage intensity as a proxy of demand [3] or to use the number of bookings; however, this amount can only reflect the present demand, which may be restricted by the number of car-sharing stations and vehicles. Using a specific amount to represent the potential demand will therefore cause deviation. The alternative applied in our research is simply to distinguish whether demand for car-sharing exists in a grid or not. The question thus becomes a classification problem to which classification algorithms can be applied. So far 1834 grids contain at least one car-sharing station. The distribution of order numbers in them is shown in Figure 1(a). Clearly, grids with larger order numbers can be treated as high demand. Grids with tiny order numbers are reasonably defined as no demand, since a very small order number reflects occasional demand that operators do not need to satisfy at the cost of renting parking space. Therefore, we choose the lower 20% and upper 20% of grids as the sample set (as shown in Figure 1(b)) and let the response equal “1” for demand and “0” for no demand. What is more, we would also like to know the level of demand in a grid, expressed as the probability that the grid belongs to class “1”. Classification algorithms such as k-nearest neighbour (KNN) and tree models cannot provide us with such a probability, so we choose the following methods: Logistic Regression (with and without LASSO), Naive Bayes, Linear Discriminant Analysis, and Quadratic Discriminant Analysis.
Figure 1: The distribution of average daily order numbers. (a) Histogram of average daily order numbers. (b) Cumulative probability distribution of average daily order numbers.
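The sample construction from the lower and upper 20% of grids can be sketched as below. This is a hedged illustration: the function name is hypothetical, and treating values exactly equal to a quantile as part of the sample is an assumption, since the text does not say how ties are handled.

```python
import numpy as np

def label_extremes(avg_daily_orders, lower_q=0.20, upper_q=0.80):
    """Label grids in the lower tail as 0 (no demand) and the upper tail as 1 (demand).

    Grids between the two quantiles get -1 and are left out of the sample.
    Including values equal to a quantile is an assumption of this sketch.
    """
    orders = np.asarray(avg_daily_orders, dtype=float)
    lo, hi = np.quantile(orders, [lower_q, upper_q])
    labels = np.full(orders.shape, -1, dtype=int)
    labels[orders <= lo] = 0
    labels[orders >= hi] = 1
    return labels, lo, hi

# Toy order numbers 1..10, for illustration only.
labels, lo, hi = label_extremes(np.arange(1, 11))
```

Only the grids labelled 0 or 1 would then enter the training and test sets.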
3.1. Logistic Regression with LASSO
Logistic Regression is used to model the relationship between the variables $X=(X_1,X_2,\ldots,X_p)$ and a binary response $Y$. It models the probability that $Y$ belongs to a particular class. If we use a linear regression model

$$p(X)=\Pr(Y=1\mid X)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p \tag{1}$$

to represent the probability, we might get a result that is smaller than 0 or larger than 1, which does not make sense for a probability. Therefore, in order to get an output between 0 and 1, we use the logistic function

$$p(X)=\frac{e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p}} \tag{2}$$

Then we have

$$\frac{p(X)}{1-p(X)}=e^{\beta_0+\beta_1X_1+\cdots+\beta_pX_p} \tag{3}$$

By taking the logarithm of both sides, we have

$$\log\frac{p(X)}{1-p(X)}=\beta_0+\beta_1X_1+\cdots+\beta_pX_p \tag{4}$$

We can see that the Logistic Regression model has log-odds that are linear in $X$.
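As a small numerical check of the relations above, the following sketch evaluates the logistic function and confirms that the log-odds recover the linear predictor. The coefficient and input values are made up for illustration.

```python
import numpy as np

def logistic_probability(beta0, beta, x):
    """Logistic function applied to the linear predictor beta0 + beta . x."""
    eta = beta0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-eta))

def log_odds(p):
    """Log-odds (logit) of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

# Illustrative values: the linear predictor is 0.5 + 1.0*0.3 - 2.0*0.1 = 0.6.
p = logistic_probability(0.5, [1.0, -2.0], [0.3, 0.1])
eta_back = log_odds(p)
```

Here `p` always lies strictly between 0 and 1, while `eta_back` returns the linear predictor, which is the sense in which the model is linear in the log-odds.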
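The coefficients of Logistic Regression models are usually estimated by the maximum likelihood method. The likelihood function for $N$ observations is

$$L(\beta)=\prod_{i:\,y_i=1}p(X_i)\prod_{i':\,y_{i'}=0}\bigl(1-p(X_{i'})\bigr) \tag{5}$$

where

$$\beta=(\beta_0,\beta_1,\ldots,\beta_p),\qquad X_i=(X_{i1},X_{i2},\ldots,X_{ip}) \tag{6}$$

The log-likelihood can be written as

$$l(\beta)=\sum_{i=1}^{N}\Bigl\{y_i\log p(X_i;\beta)+(1-y_i)\log\bigl(1-p(X_i;\beta)\bigr)\Bigr\}=\sum_{i=1}^{N}\Bigl\{y_i\beta^{T}X_i-\log\bigl(1+e^{\beta^{T}X_i}\bigr)\Bigr\} \tag{7}$$

where, in $\beta^{T}X_i$, a constant 1 is prepended to $X_i$ to accommodate the intercept $\beta_0$. Setting the derivatives of $l(\beta)$ with respect to $\beta$ to zero gives the maximum likelihood estimators for $\beta$ [17].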
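For reference, the derivative condition mentioned above can be written out explicitly. This is the standard result for the logistic log-likelihood, stated under the assumption that each $X_i$ is augmented with a leading 1 for the intercept:

$$\frac{\partial l(\beta)}{\partial\beta}=\sum_{i=1}^{N}X_i\bigl(y_i-p(X_i;\beta)\bigr)=0$$

These equations are nonlinear in $\beta$, so they are usually solved iteratively, for example with the Newton-Raphson method, rather than in closed form.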
The LASSO is a shrinkage method that constrains the coefficient estimates, which can significantly reduce their variance. A penalty term $\lambda\sum_{j=1}^{p}|\beta_j|$ is added to the model, where $\lambda\ge 0$ is a tuning parameter.
Therefore, for Logistic Regression with LASSO, we maximize the penalized log-likelihood function:

$$\max_{\beta}\left\{\sum_{i=1}^{N}\Bigl[y_i\beta^{T}X_i-\log\bigl(1+e^{\beta^{T}X_i}\bigr)\Bigr]-\lambda\sum_{j=1}^{p}|\beta_j|\right\} \tag{8}$$

Selecting a good value of $\lambda$ is critical since it significantly affects the coefficients. Cross-validation is applied to choose the $\lambda$ that produces the smallest cross-validation error [18].
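To make the effect of the penalty concrete, here is a minimal NumPy sketch of L1-penalized logistic regression fitted by proximal gradient descent (ISTA). It is not the solver used in this study, which the text does not name. The function names and the synthetic data are hypothetical, the intercept is left unpenalized, and the log-likelihood is divided by N, so the `lam` values here are not directly comparable to the λ reported later.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink towards zero, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_lasso_logistic(X, y, lam, n_iter=2000):
    """Minimize -(1/N) * log-likelihood + lam * sum |beta_j| by ISTA.

    The intercept beta0 is not penalized. The 1/N scaling is an assumption
    of this sketch, not something stated in the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Step size 1/L, where L = sigma_max([1, X])^2 / (4 N) bounds the
    # curvature of the averaged negative log-likelihood.
    Xa = np.hstack([np.ones((n, 1)), X])
    step = 4.0 * n / (np.linalg.norm(Xa, 2) ** 2)
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
        resid = prob - y  # gradient of the negative log-likelihood w.r.t. eta
        beta0 -= step * resid.mean()
        beta = soft_threshold(beta - step * (X.T @ resid) / n, step * lam)
    return beta0, beta

# Synthetic data: only the first of three features drives the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
p_demo = 1.0 / (1.0 + np.exp(-2.0 * X_demo[:, 0]))
y_demo = (rng.random(300) < p_demo).astype(float)

b0_small, b_small = fit_lasso_logistic(X_demo, y_demo, lam=0.05)
b0_big, b_big = fit_lasso_logistic(X_demo, y_demo, lam=10.0)
```

With a very large penalty every coefficient is shrunk exactly to zero, while a small penalty keeps the informative feature. In practice λ would be chosen by cross-validation over a grid of candidate values, as described above.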
3.2. Naive Bayes
Assume that there are $K$ classes and we wish to classify an observation $X$ into one of them. According to Bayes’ theorem,

$$\Pr(Y=k\mid X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)} \tag{9}$$

where $\Pr(Y=k\mid X=x)$ denotes the probability that $X$ belongs to class $k$; $\pi_k$ is the prior probability that a random observation is from class $k$; and $f_k(x)=\Pr(X=x\mid Y=k)$ is the probability density function of $X$ for an observation from class $k$.
For an observation $X=(x_1,x_2,\ldots,x_p)$, Naive Bayes assumes that the features are independent given the class $k$. So we have

$$f_k(x)=\prod_{j=1}^{p}f_{kj}(x_j) \tag{10}$$

where $f_{kj}(x_j)$ is the probability density function of the $j$-th feature given class $k$. Then the probability of $X$ coming from class $k$ is

$$\Pr(Y=k\mid X=x)=\frac{\pi_k f_k(x)}{\sum_{l=1}^{K}\pi_l f_l(x)}=\frac{\pi_k\prod_{j=1}^{p}f_{kj}(x_j)}{\sum_{l=1}^{K}\pi_l\prod_{j=1}^{p}f_{lj}(x_j)} \tag{11}$$

See [18].
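The posterior in (11) can be evaluated directly once a form for each $f_{kj}$ is chosen. The sketch below assumes Gaussian feature densities with strictly positive standard deviations; the function name and the numbers in the example are illustrative only.

```python
import numpy as np

def gaussian_nb_posterior(x, priors, means, stds):
    """Posterior class probabilities under Naive Bayes with Gaussian features.

    priors: shape (K,); means, stds: shape (K, p); x: shape (p,).
    Assumes every standard deviation is strictly positive.
    """
    x = np.asarray(x, dtype=float)
    priors = np.asarray(priors, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # Log of each univariate Gaussian density f_kj(x_j).
    log_dens = -0.5 * np.log(2.0 * np.pi * stds**2) - (x - means) ** 2 / (2.0 * stds**2)
    # log(pi_k) + sum_j log f_kj(x_j), i.e. the log of the numerator of (11).
    log_joint = np.log(priors) + log_dens.sum(axis=1)
    log_joint -= log_joint.max()  # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum()

# Two classes, two features; values chosen only for illustration.
post = gaussian_nb_posterior(
    x=[1.0, 1.0],
    priors=[0.5, 0.5],
    means=[[-0.4, -0.5], [0.4, 0.5]],
    stds=[[0.6, 0.6], [1.1, 1.2]],
)
```

Working in log space avoids underflow when many feature densities are multiplied together.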
3.3. Linear Discriminant Analysis
Linear Discriminant Analysis is based on Bayes’ theorem and assumes that the observation $X=(x_1,x_2,\ldots,x_p)$ is generated from a multivariate Gaussian distribution with a class-specific mean vector and a common covariance matrix. We write $X\sim N(\mu_k,\Sigma)$, where $\mu_k$ is the mean vector for class $k$ and $\Sigma$ is the covariance matrix that is the same for all classes. Then the probability density function (pdf) of $X$ from class $k$ is

$$f_k(x)=\Pr(X=x\mid Y=k)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_k)^{T}\Sigma^{-1}(x-\mu_k)\right) \tag{12}$$

The parameters of the multivariate Gaussian distributions are unknown and we need to estimate them from training data by

$$\hat{\pi}_k=\frac{N_k}{N};\qquad \hat{\mu}_k=\frac{\sum_{g_i=k}x_i}{N_k};\qquad \hat{\Sigma}=\frac{\sum_{k=1}^{K}\sum_{g_i=k}(x_i-\hat{\mu}_k)(x_i-\hat{\mu}_k)^{T}}{N-K} \tag{13}$$

where $N_k$ is the number of observations in class $k$ and $g_i$ is the label of the $i$-th observation [17].
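The estimators in (13) and the resulting posterior can be sketched in a few lines of NumPy. This is an illustrative implementation with hypothetical function names and toy data, not the code used in the study.

```python
import numpy as np

def fit_lda(X, g):
    """Estimate priors, class means, and the pooled covariance matrix as in (13)."""
    X = np.asarray(X, dtype=float)
    g = np.asarray(g)
    classes = np.unique(g)
    n, p = X.shape
    priors = np.array([(g == k).mean() for k in classes])
    means = np.array([X[g == k].mean(axis=0) for k in classes])
    pooled = np.zeros((p, p))
    for k, mu in zip(classes, means):
        d = X[g == k] - mu
        pooled += d.T @ d  # within-class scatter
    pooled /= n - len(classes)
    return classes, priors, means, pooled

def lda_posterior(x, priors, means, pooled):
    """Posterior class probabilities; the shared Gaussian constant cancels."""
    inv = np.linalg.inv(pooled)
    d = np.asarray(x, dtype=float) - means  # shape (K, p)
    log_joint = np.log(priors) - 0.5 * np.einsum("kp,pq,kq->k", d, inv, d)
    log_joint -= log_joint.max()
    post = np.exp(log_joint)
    return post / post.sum()

# Toy data: two well-separated square clusters of four points each.
X_toy = [[0, 0], [1, 0], [0, 1], [1, 1], [4, 4], [5, 4], [4, 5], [5, 5]]
g_toy = [0, 0, 0, 0, 1, 1, 1, 1]
classes, priors, means, pooled = fit_lda(X_toy, g_toy)
post_mid = lda_posterior([2.5, 2.5], priors, means, pooled)
```

A point halfway between the two class means receives equal posterior probability when the priors are equal, which is what the common covariance assumption implies.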
3.4. Quadratic Discriminant Analysis
Quadratic Discriminant Analysis differs from Linear Discriminant Analysis in that it assumes a different covariance matrix for each class. An observation $X$ from class $k$ then has the distribution $N(\mu_k,\Sigma_k)$, so the pdf becomes

$$f_k(x)=\frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k)\right) \tag{14}$$

See [17].
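For illustration, taking the logarithm of $\pi_k f_k(x)$ with the Gaussian density above and dropping the term $-\tfrac{p}{2}\log(2\pi)$, which is the same for every class, gives the standard quadratic discriminant function:

$$\delta_k(x)=-\frac{1}{2}\log|\Sigma_k|-\frac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k)+\log\pi_k$$

The posterior probability follows as $\Pr(Y=k\mid X=x)=e^{\delta_k(x)}\big/\sum_{l=1}^{K}e^{\delta_l(x)}$. Because $\Sigma_k$ differs between classes, the quadratic term in $x$ does not cancel between classes, which is why the decision boundary is quadratic rather than linear.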
4. Results and Discussion
As mentioned in Section 3, we take the lower 20% and upper 20% of grids as the sample set. The corresponding quantiles are 0.79 and 6.33, which means that grids with average daily order numbers below 0.79 or above 6.33 are chosen as samples. The sample size is 737, of which 370 samples are of no demand and 367 are of high demand. We randomly choose 500 samples (approximately 68% of the total) as the training set and use the remaining 237 as the test set. Since population density and POI information are on different scales, all variables are normalized by their mean and standard error (shown in Table 3).
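Table 3: Mean and standard error of all predictors.

| Predictor | Mean | Std. error |
|---|---|---|
| CRS | 0.3 | 0.67 |
| FB | 32.35 | 43.51 |
| SH | 57.49 | 90.96 |
| DS | 29.31 | 34.67 |
| SR | 4.93 | 6.83 |
| MS | 6.32 | 8.32 |
| AS | 5.83 | 14.78 |
| TA | 0.37 | 1.24 |
| CH | 5.12 | 7.51 |
| GS | 4.14 | 6.77 |
| SS | 7.34 | 11.35 |
| BS | 1.51 | 1.4 |
| US | 0.072 | 0.26 |
| TS | 0.008 | 0.09 |
| AP | 0.004 | 0.06 |
| PL | 8.89 | 10.09 |
| FS | 1.23 | 2.23 |
| EN | 16.24 | 33.81 |
| POP | 1882 | 5621.53 |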
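The random split and the normalization step can be sketched as follows. The function names and the random seed are hypothetical; the sample sizes (737 and 500) and the two pairs of mean and standard error (FB and PL) are taken from the text and Table 3.

```python
import numpy as np

def train_test_split_indices(n_samples, n_train, seed=0):
    """Randomly partition sample indices into a training part and a test part."""
    rng = np.random.default_rng(seed)  # the seed is an arbitrary choice
    perm = rng.permutation(n_samples)
    return perm[:n_train], perm[n_train:]

def standardize(X, mean, std):
    """Normalize each column by a fixed mean and scale, as with Table 3."""
    return (np.asarray(X, dtype=float) - mean) / std

train_idx, test_idx = train_test_split_indices(737, 500)

# FB and PL columns only: one hypothetical grid with 32.35 FB POIs and 18.98 parking lots.
z = standardize([[32.35, 18.98]],
                mean=np.array([32.35, 8.89]),
                std=np.array([43.51, 10.09]))
```

The same fixed means and scales are reused later when the model is applied to all grids, so the fitted coefficients are always applied to inputs on the same scale.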
4.1. Logistic Regression
The results of Logistic Regression are shown in Table 4. Only five variables, Food & Beverages, Medical Service, Governmental Organization & Social Group, Bus Station, and Parking Lots, are significant at the 5% significance level. The population factor is not significant. Food & Beverages positively affects the probability of existence of demand, while Medical Service and Governmental Organization & Social Group have negative effects. According to our investigation, car-sharing stations in governmental organizations usually have limited access, which means that only staff who work there can use them. Once a car is returned at these stations, it usually takes a long time to receive the next order. The negative effect of Medical Service could be attributed to limited parking space and huge parking demand: the stations might be occupied by fuel vehicles. The coefficient of Bus Station is positive, which agrees with the results in the literature [3, 19]. Parking Lots also has a positive impact on the probability of existence of demand, which implies that more parking space provides more opportunities to build car-sharing stations. However, Chen’s paper [3] concludes that more parking space leads to more private vehicle trips rather than car-sharing use.
Table 4: The results of Logistic Regression.

| Variable | Coefficient | Std. error | P value |
|---|---|---|---|
| Intercept | 0.2569 | 0.1211 | 0.0338 |
| FB | 0.9960 | 0.2404 | <0.001 |
| MS | -0.3401 | 0.1581 | 0.0315 |
| GS | -0.5480 | 0.1383 | <0.001 |
| BS | 0.2814 | 0.1307 | 0.0313 |
| PL | 1.3981 | 0.2182 | <0.001 |
4.2. Logistic Regression with LASSO
Logistic Regression with LASSO adds a penalty term to the model. The results are shown in Table 5. Since LASSO is a shrinkage method under which the coefficients of unimportant variables shrink to zero, no p-value is reported for this method; p-values are also not meaningful for biased regressions such as LASSO. The value λ = 0.01973 is the one that minimizes the model deviance. Seven variables, Food & Beverages, Sports & Recreation, Governmental Organization & Social Group, Bus Station, Train Station, Airport, and Parking Lots, have nonzero coefficients. Compared with the Logistic Regression results, Food & Beverages, Governmental Organization & Social Group, Bus Station, and Parking Lots have effects of the same sign. Medical Service is no longer selected in this model; instead, Sports & Recreation positively affects the probability of existence of demand. The reason could be that Sports & Recreation venues are mostly visited by young people, who have high acceptance and usage intensity of car-sharing [20]. Train Station and Airport are also selected in this model, which could be attributed to their function as traffic hubs that bring high exposure and a large mobile population. The population factor is again not selected.
Table 5: The results of Logistic Regression with LASSO.

| | Intercept | FB | SR | GS | BS | TS | AP | PL |
|---|---|---|---|---|---|---|---|---|
| Coefficient | 0.1318 | 0.4652 | 0.1587 | -0.2717 | 0.1641 | 0.01721 | 0.0336 | 1.0114 |
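To show how the fitted model turns POI counts into a probability, the sketch below combines the coefficients of Table 5 with the means and standard errors of Table 3. The function name is hypothetical, and treating a missing predictor as a count of zero is an assumption of this example.

```python
import math

# Coefficients from Table 5 (predictors on the normalized scale).
INTERCEPT = 0.1318
LASSO_COEF = {"FB": 0.4652, "SR": 0.1587, "GS": -0.2717, "BS": 0.1641,
              "TS": 0.01721, "AP": 0.0336, "PL": 1.0114}

# (mean, standard error) pairs from Table 3 for the selected predictors.
TABLE3 = {"FB": (32.35, 43.51), "SR": (4.93, 6.83), "GS": (4.14, 6.77),
          "BS": (1.51, 1.4), "TS": (0.008, 0.09), "AP": (0.004, 0.06),
          "PL": (8.89, 10.09)}

def demand_probability(raw_counts):
    """Probability of existence of demand for one grid from raw POI counts.

    raw_counts maps predictor abbreviations to raw counts; a missing
    predictor is treated as zero (an assumption of this sketch).
    """
    eta = INTERCEPT
    for name, coef in LASSO_COEF.items():
        mean, std = TABLE3[name]
        eta += coef * (raw_counts.get(name, 0.0) - mean) / std
    return 1.0 / (1.0 + math.exp(-eta))

# A grid whose counts equal the Table 3 means has all normalized predictors at zero.
p_at_mean = demand_probability({name: mean for name, (mean, _) in TABLE3.items()})
p_empty = demand_probability({})
```

For a grid at the Table 3 means, only the intercept contributes, so the probability is the logistic function of 0.1318.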
4.3. Linear Discriminant Analysis
As mentioned in Section 3.3, Linear Discriminant Analysis (LDA) assumes that each sample $x_i$ comes from a multivariate Gaussian distribution whose mean vector differs between classes and whose covariance matrix is common to all classes. Table 6 shows the mean vectors for the classes demand = 0 and demand = 1. The covariance matrix is shown in Appendix A. The priors for the two classes are $\pi_0=0.502$ and $\pi_1=0.498$.
Table 6: Mean vectors for classes: demand = 0 and demand = 1.

| Predictor | Class 0 | Class 1 |
|---|---|---|
| CRS | -0.2109 | 0.1552 |
| FB | -0.4108 | 0.3963 |
| SH | -0.2850 | 0.3203 |
| DS | -0.3978 | 0.4089 |
| SR | -0.3692 | 0.3401 |
| MS | -0.2867 | 0.2754 |
| AS | -0.2561 | 0.2615 |
| TA | -0.1173 | 0.0390 |
| CH | -0.3205 | 0.3431 |
| GS | -0.0866 | 0.1573 |
| SS | -0.3451 | 0.3720 |
| BS | -0.2901 | 0.2864 |
| US | -0.0778 | 0.0947 |
| TS | -0.0905 | 0.0435 |
| AP | -0.0014 | 0.0622 |
| PL | -0.4602 | 0.5125 |
| FS | -0.2755 | 0.3065 |
| EN | -0.2301 | 0.3012 |
| POP | -0.1000 | 0.1076 |
4.4. Quadratic Discriminant Analysis
The mean vectors of Quadratic Discriminant Analysis (QDA) are the same as those of LDA. However, QDA assumes that the covariance matrix differs between the two classes. The results are shown in Appendix B.
4.5. Naive Bayes
Naive Bayes assumes that the predictors are independent within each class and that each can be represented by a Gaussian distribution. The mean and standard deviation of each variable, grouped by class, are shown in Table 7. Train Station is constant within the class demand = 0, since its standard deviation there is zero. The distributions of the predictors are shown in Appendix C.
Table 7: Mean and standard deviation for the predictors grouped by class.

| Variable | Class 0 mean | Class 0 std. error | Class 1 mean | Class 1 std. error |
|---|---|---|---|---|
| CRS | -0.2109 | 0.7259 | 0.1552 | 1.1169 |
| FB | -0.4108 | 0.6177 | 0.3963 | 1.1262 |
| SH | -0.2850 | 0.6372 | 0.3203 | 1.2895 |
| DS | -0.3978 | 0.6668 | 0.4089 | 1.1068 |
| SR | -0.3692 | 0.5560 | 0.3401 | 1.0806 |
| MS | -0.2867 | 0.8778 | 0.2754 | 1.0337 |
| AS | -0.2561 | 0.3561 | 0.2615 | 1.3175 |
| TA | -0.1173 | 0.5521 | 0.0390 | 0.6665 |
| CH | -0.3205 | 0.7296 | 0.3431 | 1.1695 |
| GS | -0.0866 | 0.9128 | 0.1573 | 1.1696 |
| SS | -0.3451 | 0.4995 | 0.3720 | 1.2910 |
| BS | -0.2901 | 0.8360 | 0.2864 | 1.0089 |
| US | -0.0778 | 0.8589 | 0.0947 | 1.1439 |
| TS | -0.0905 | 0.0000 | 0.0435 | 1.2157 |
| AP | -0.0014 | 0.9907 | 0.0622 | 1.4038 |
| PL | -0.4602 | 0.6131 | 0.5125 | 1.1715 |
| FS | -0.2754 | 0.6450 | 0.3065 | 1.2305 |
| EN | -0.2301 | 0.5698 | 0.3012 | 1.3990 |
| POP | -0.1000 | 0.7764 | 0.1076 | 1.1253 |
4.6. Comparing These Five Models
To judge the performance of a classifier, the AUC value and the accuracy rate are commonly used. The AUC value is the area under the receiver operating characteristic (ROC) curve and varies between 0 and 1; the higher the value, the better the discrimination. The accuracy rate is the number of correct predictions divided by the total number of predictions. We test the models on the remaining 237 observations. The values of these two measures for the five models are shown in Table 8. All models give an AUC value between 0.8 and 0.9 and an accuracy rate between 0.65 and 0.8. The QDA model produces the lowest AUC value and accuracy rate. Logistic Regression with LASSO gives the largest AUC value and pure Logistic Regression the largest accuracy rate, so these two models perform best. LDA is close behind, with a comparable AUC value but a lower accuracy rate, while Naive Bayes is worse on both measures. Therefore, we can choose either the pure Logistic Regression or Logistic Regression with LASSO as the final model. Here we choose the Logistic Regression with LASSO model.
Table 8: AUC value and accuracy rate for the previous five models.

| Model | AUC value | Accuracy rate |
|---|---|---|
| Logistic regression | 0.8500 | 0.7722 |
| Logistic regression with LASSO | 0.8545 | 0.7637 |
| LDA | 0.8513 | 0.7553 |
| QDA | 0.8020 | 0.6835 |
| Naive Bayes | 0.8146 | 0.7215 |
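Both measures are straightforward to compute from predicted probabilities. The sketch below is a generic illustration with hypothetical function names and toy data; the 0.5 classification threshold used for the accuracy rate is an assumption, and the AUC is computed through its equivalent pairwise-ranking form rather than by integrating the ROC curve.

```python
import numpy as np

def accuracy(y_true, prob, threshold=0.5):
    """Share of correct predictions when classifying as 1 if prob > threshold."""
    y_true = np.asarray(y_true)
    pred = (np.asarray(prob, dtype=float) > threshold).astype(int)
    return float((pred == y_true).mean())

def auc(y_true, prob):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a random positive scores higher than a random
    negative, counting ties as one half.
    """
    y_true = np.asarray(y_true)
    prob = np.asarray(prob, dtype=float)
    pos = prob[y_true == 1]
    neg = prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy example: three of the four positive/negative pairs are ranked correctly.
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = auc(y_toy, p_toy)
acc_toy = accuracy(y_toy, p_toy)
```

The AUC does not depend on a threshold, whereas the accuracy rate does, which is one reason the two measures can rank models differently, as in Table 8.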
4.7. Optimizing Location of Car-Sharing Stations
For all of the 58724 square grids, we normalize the predictors with the means and standard errors in Table 3 and then apply the Logistic Regression with LASSO model. The predicted probabilities of existence of demand for these grids are plotted against the present order numbers in Figure 2. Panel (a) shows the results for all 58724 grids. Most grids have order numbers less than 10, so for clarity we plot these grids in panel (b). We define a predicted probability larger than 0.5 as high demand heat and otherwise as low demand heat. In addition, it is reasonable to consider average daily order numbers less than 1 as low order heat and otherwise as high order heat. The grids can therefore be sorted into four groups, as shown in Figure 2(b): I: high demand heat and high order heat; II: high demand heat and low order heat; III: low demand heat and low order heat; IV: low demand heat and high order heat. For stations in group III grids, operators are advised to close or remove them after investigation, such as checking the time interval between two orders. For stations in group IV, further work is required to check whether influential factors omitted here exist, since low demand heat and high order heat contradict each other. We mainly focus on groups I and II when optimizing the locations of car-sharing stations. Figure 3 shows the grids with high demand heat in Chengdu. Red refers to group I and blue refers to group II. These grids clearly concentrate in the city centre and town centres, which are also where people and businesses gather.
Figure 2: The predicted probabilities of existence of demand versus the present order numbers. (a) All 58724 square grids. (b) The grids with order numbers less than 10.
Figure 3: Grids with high demand heat in Chengdu: red for grids with high order heat; blue for grids with low order heat.
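The four-group rule can be written as a small function. The thresholds 0.5 and 1 come from the text; the function and parameter names are hypothetical.

```python
def grid_group(demand_probability, avg_daily_orders, prob_cut=0.5, order_cut=1.0):
    """Assign a grid to group I, II, III, or IV.

    High demand heat: predicted probability larger than prob_cut.
    Low order heat: average daily order number less than order_cut.
    """
    high_demand = demand_probability > prob_cut
    high_order = avg_daily_orders >= order_cut
    if high_demand:
        return "I" if high_order else "II"
    return "IV" if high_order else "III"

# Illustrative grids: (predicted probability, average daily orders).
examples = [(0.8, 3.0), (0.8, 0.2), (0.3, 0.2), (0.3, 3.0)]
groups = [grid_group(p, o) for p, o in examples]
```

A grid lying exactly on a threshold follows the wording of the text: a probability of exactly 0.5 counts as low demand heat, and an order number of exactly 1 counts as high order heat.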
For the grids in groups I and II, we can give suggestions to the two operators, operator F and operator H, on the optimal location of car-sharing stations. Three cases of grids are considered:
Case 1: grids with no operator F stations and at least one operator H station
Case 2: grids with no operator H stations and at least one operator F station
Case 3: grids with no operator H stations and no operator F stations
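The three cases above can be expressed as a simple lookup on station counts per grid. The function name is hypothetical, and returning `None` for grids where both operators are present is a choice of this sketch, since such grids fall outside the three cases.

```python
def competition_case(n_stations_f, n_stations_h):
    """Return 1, 2, or 3 for the three cases, or None if both operators are present."""
    if n_stations_f == 0 and n_stations_h >= 1:
        return 1  # only operator H is present
    if n_stations_h == 0 and n_stations_f >= 1:
        return 2  # only operator F is present
    if n_stations_f == 0 and n_stations_h == 0:
        return 3  # neither operator is present
    return None   # both present: not covered by the three cases

cases = [competition_case(0, 2), competition_case(1, 0),
         competition_case(0, 0), competition_case(1, 3)]
```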
Figures 4–6 display these three cases. In Case 1, operator H occupies most of the space in the city centre, and a majority of the grids are of high order heat, which means that operator H operates well in these grids. Only several grids are of low order heat, and operator F can consider building stations there. In Case 2, hundreds of grids of high demand heat surrounding the city centre are monopolized by operator F. However, in the north-west of Chengdu most of these grids are of low order heat, which suggests an opportunity for operator H to build stations there. Moreover, in Case 3, we find that neither operator has yet entered the town centres, where most of the grids are of high demand heat. Both operators are advised to investigate these areas and consider building stations.
Figure 4: Case 1: grids with no operator F stations and at least one operator H station.
Figure 5: Case 2: grids with no operator H stations and at least one operator F station.
Figure 6: Case 3: grids with no operator H stations and no operator F stations.
5. Conclusion
This research focuses on optimizing the location of car-sharing stations in the Chengdu market. The main approach is to estimate the potential demand and combine it with the present order numbers. Unlike previous research that applies multiple linear regression to model the demand, this study transforms the question into a binary problem: whether demand exists or not. Five classification models are then used to estimate the probability of existence of demand. Three data sources are used: average daily order numbers from operator F and operator H, POI information, and population data. From the five models we summarize as follows:
In the Logistic Regression, Food & Beverages, Bus Station, and Parking Lots have a positive effect on the probability of existence of demand, while Medical Service and Governmental Organization & Social Group have a negative effect.
The Logistic Regression with LASSO model indicates that Food & Beverages, Sports & Recreation, Bus Station, Train Station, Airport, and Parking Lots positively affect the probability of existence of demand, while Governmental Organization & Social Group has the opposite effect. This model includes the traffic hubs, which could increase the demand for car-sharing service.
The Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) models compute the prior probabilities, which are 0.502 for the class demand = 0 and 0.498 for the class demand = 1. The mean vectors and covariance matrices of the multivariate Gaussian distributions are then estimated to compute the posterior.
In the Naive Bayes model, each variable is assumed to be normally distributed, with the mean and standard deviation estimated within each class. The standard deviation of Train Station in the class demand = 0 is zero; thus it is treated as a constant.
Comparing the performance of these five models by AUC value and accuracy rate, we find that the QDA model gives the worst estimation, while Logistic Regression and Logistic Regression with LASSO perform best. LDA is a bit worse than these two models, and Naive Bayes works slightly better than QDA. Therefore we conclude that linear models work better in our case.
Logistic Regression with LASSO is chosen as the final model and used to estimate the probability of existence of demand for all grids. A predicted probability larger than 0.5 is treated as high demand heat and otherwise as low demand heat. The grids with high demand heat are concentrated in the city centre and town centres. Together with the present order numbers, four groups of grids are defined, and different suggestions on optimizing the location of car-sharing stations are given for each group. Both operator F and operator H are advised to build stations in high demand heat grids where they are absent and to close or remove some of their stations in grids with low demand heat and low order heat. Operator H is also advised to build stations in the north-west of Chengdu, where high demand exists and operator F has low operation efficiency.
However, there are still some limitations in our research.
First, Chengdu is a competitive market with more than five car-sharing operators, of which only two are considered in our research. The suggestions may be unreliable without consideration of the other operators.
Second, our research is based on 500m∗500m squared grids which are suitable in city centre but are too small for other areas since the building density is much lower in these areas.
Third, one important factor, cost of building stations, is not considered in our search. The cost always includes renting the parking space, building charging piles, and purchase electric vehicle. The definition of high order heat and low order heat in our research is simply based on the order number equalling one. However, high cost stations require high order numbers to redeem the cost. Therefore, such definition should vary with the cost of building stations.
Fourth, samples are chosen based on the lower 20% and upper 20% grids according to average daily order numbers subjectively. Other approaches can be applied such as lower 30% and upper 30% to include more observations. The definition of 4 groups is also subjective. More objective method such as clustering may be applied.
Fifth, the models we use are all based on strong assumptions. More technical classification algorithm can be applied such as bagging, boosting, random forest, and Gaussian Process.
Sixth, the order numbers only consider the one that is placed and omit the return behaviour. One station may have low pick-up orders while having high drop-off orders which is called usage imbalance. The decision on such station should be made carefully.
Seventh, variable selection could be carried out before training the classification models, since not all predictors are related to the response.
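Two of the limitations above, the choice of sampling fraction (fourth) and usage imbalance (sixth), can be made concrete with a short sketch. Everything here is an assumption for illustration: the function names, the dictionary input, and in particular the imbalance index, which the study does not define.

```python
from typing import Dict


def select_samples(avg_orders: Dict[int, float], fraction: float = 0.2) -> Dict[int, int]:
    """Label the lowest `fraction` of grids 0 and the highest `fraction` 1.

    `avg_orders` maps a grid id to its average daily order number.
    Grids in the middle of the ranking are left out of the sample.
    """
    ranked = sorted(avg_orders, key=avg_orders.get)  # ascending by order number
    k = int(len(ranked) * fraction)
    labels = {grid: 0 for grid in ranked[:k]}
    labels.update({grid: 1 for grid in ranked[len(ranked) - k:]})
    return labels


def usage_imbalance(pickups: int, dropoffs: int) -> float:
    """Hypothetical index in [-1, 1]; positive means more drop-offs than pick-ups."""
    total = pickups + dropoffs
    if total == 0:
        return 0.0
    return (dropoffs - pickups) / total


if __name__ == "__main__":
    orders = {grid: float(grid) for grid in range(10)}
    print(select_samples(orders, fraction=0.2))
    print(select_samples(orders, fraction=0.3))
    print(usage_imbalance(pickups=2, dropoffs=8))
```

Changing `fraction` from 0.2 to 0.3 shows how many more observations enter the training sample. A station with a strongly positive imbalance index would look weak if judged on pick-up orders alone.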
Furthermore, it is worth investigating the effect of adopting these suggestions by estimating the order numbers after a certain number of stations are added. One possible way is first to characterize the relationship between start and end grids and compute its intensity. Then, for a grid with newly added stations, the related grids can be found and the order number can be estimated from the intensity. Another possible way is transportation simulation, in which all transportation modes, the population, and traffic operations are considered. This would be a large and challenging task.
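As a rough illustration of the first idea, the sketch below computes an intensity for each start-end grid pair as its share of all trips and then lists the grids most strongly linked to a given grid. The definition of intensity, the function names, and the input format (a list of start and end grid ids) are assumptions; the text does not fix any of them, and the step from intensity to an estimated order number is left open.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (start grid id, end grid id)


def od_intensity(trips: List[Pair]) -> Dict[Pair, float]:
    """Share of all trips that go from each start grid to each end grid."""
    counts = Counter(trips)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def related_grids(intensity: Dict[Pair, float], grid: int, top: int = 3) -> List[int]:
    """Grids most strongly linked to `grid`, summing both trip directions."""
    scores: Counter = Counter()
    for (start, end), value in intensity.items():
        if start == grid and end != grid:
            scores[end] += value
        elif end == grid and start != grid:
            scores[start] += value
    return [g for g, _ in scores.most_common(top)]


if __name__ == "__main__":
    trips = [(1, 2), (1, 2), (2, 1), (1, 3)]
    intensity = od_intensity(trips)
    print(intensity)
    print(related_grids(intensity, grid=1))
```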
Appendix
Covariance matrices for LDA and QDA and distributions of predictors for Naive Bayes.
A. Covariance Matrix for LDA
Covariance matrix for LDA (A.1). The matrix is 19 × 19 and symmetric; rows and columns follow the order of the predictor labels CRSFBSHDSSRMSASTACHGSSSBSUSTSAPPLFSENPOP.
0.8858 0.1875 0.1750 0.2679 0.2222 0.2414 0.1375 0.0631 0.2409 0.1502 0.3285 0.1166 0.1639 0.0592 0.0482 0.2607 0.3043 0.2905 0.0820
0.1875 0.8231 0.6025 0.6610 0.5310 0.4246 0.5320 0.1824 0.4775 0.3055 0.5137 0.1779 0.0742 0.0073 -0.0635 0.4620 0.3385 0.2653 0.0698
0.1750 0.6025 1.0319 0.6486 0.4036 0.3567 0.3837 0.1518 0.4724 0.3155 0.4539 0.1889 0.0750 0.0246 -0.0471 0.4115 0.3921 0.3918 0.0058
0.2679 0.6610 0.6486 0.8333 0.4571 0.5520 0.4881 0.1294 0.5501 0.3967 0.5319 0.2660 0.0577 0.0087 -0.0495 0.5160 0.4187 0.3387 0.0869
0.2222 0.5310 0.4036 0.4571 0.7367 0.2411 0.4286 0.1848 0.3555 0.2912 0.5436 0.1570 0.0964 -0.0090 -0.0641 0.3969 0.3657 0.2782 0.0503
0.2414 0.4246 0.3567 0.5520 0.2411 0.9189 0.2059 0.0819 0.4881 0.3090 0.3164 0.2612 0.0909 -0.0559 -0.0763 0.4362 0.3097 0.0660 0.1425
0.1375 0.5320 0.3837 0.4881 0.4286 0.2059 0.9281 0.1902 0.5106 0.3763 0.5184 0.0998 -0.0049 -0.0062 -0.0436 0.4593 0.2734 0.2561 0.0507
0.0631 0.1824 0.1518 0.1294 0.1848 0.0819 0.1902 0.3742 0.1875 0.1757 0.2179 0.0614 0.0557 0.0311 -0.0271 0.1373 0.1478 0.1286 0.1054
0.2409 0.4775 0.4724 0.5501 0.3555 0.4881 0.5106 0.1875 0.9483 0.5832 0.5663 0.2546 0.0833 -0.0543 -0.0768 0.6676 0.4983 0.4179 0.1875
0.1502 0.3055 0.3155 0.3967 0.2912 0.3090 0.3763 0.1757 0.5832 1.0996 0.5148 0.2349 0.1363 -0.0450 -0.0325 0.4994 0.4082 0.3500 0.2580
0.3285 0.5137 0.4539 0.5319 0.5436 0.3164 0.5184 0.2179 0.5663 0.5148 0.9553 0.1805 0.1658 -0.0309 -0.0626 0.5865 0.5759 0.5727 0.1192
0.1166 0.1779 0.1889 0.2660 0.1570 0.2612 0.0998 0.0614 0.2546 0.2349 0.1805 0.8578 0.1096 0.0839 0.0916 0.2472 0.2037 0.0969 0.1386
0.1639 0.0742 0.0750 0.0577 0.0964 0.0909 -0.0049 0.0557 0.0833 0.1363 0.1658 0.1096 1.0220 0.0614 0.2140 0.1473 0.2617 0.2134 0.2450
0.0592 0.0073 0.0246 0.0087 -0.0090 -0.0559 -0.0062 0.0311 -0.0543 -0.0450 -0.0309 0.0839 0.0614 0.7360 0.3420 -0.0535 -0.0374 -0.0444 -0.0296
0.0482 -0.0635 -0.0471 -0.0495 -0.0641 -0.0763 -0.0436 -0.0271 -0.0768 -0.0325 -0.0626 0.0916 0.2140 0.3420 1.4740 -0.0761 -0.0486 -0.0562 -0.0353
0.2607 0.4620 0.4115 0.5160 0.3969 0.4362 0.4593 0.1373 0.6676 0.4994 0.5865 0.2472 0.1473 -0.0535 -0.0761 0.8722 0.6055 0.5274 0.1536
0.3043 0.3385 0.3921 0.4187 0.3657 0.3097 0.2734 0.1478 0.4983 0.4082 0.5759 0.2037 0.2617 -0.0374 -0.0486 0.6055 0.9628 0.5719 0.1475
0.2905 0.2653 0.3918 0.3387 0.2782 0.0660 0.2561 0.1286 0.4179 0.3500 0.5727 0.0969 0.2134 -0.0444 -0.0562 0.5274 0.5719 1.1376 0.0884
0.0820 0.0698 0.0058 0.0869 0.0503 0.1425 0.0507 0.1054 0.1875 0.2580 0.1192 0.1386 0.2450 -0.0296 -0.0353 0.1536 0.1475 0.0884 0.9332
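For reference, the way a single pooled covariance matrix such as (A.1) enters LDA can be written as the standard linear discriminant function. The symbols are assumptions of this note: $\Sigma$ is the matrix (A.1), $\mu_k$ is the predictor mean vector of class $k$, and $\pi_k$ is the prior probability of class $k$; the class means and priors are not listed in this appendix.

```latex
\delta_k(x) = x^{\top}\Sigma^{-1}\mu_k
            - \tfrac{1}{2}\,\mu_k^{\top}\Sigma^{-1}\mu_k
            + \log \pi_k ,
\qquad k \in \{0, 1\}
```

A grid with predictor vector $x$ is assigned to the class with the larger $\delta_k(x)$. Because both classes share $\Sigma$, the decision boundary is linear in $x$.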
B. Covariance Matrix for QDA
Covariance matrix of class demand = 0 for QDA (B.1). The matrix is 19 × 19 and symmetric; rows and columns follow the order of the predictor labels CRSFBSHDSSRMSASTACHGSSSBSUSTSAPPLFSENPOP.
0.5269 0.1022 0.1074 0.1434 0.0756 0.1750 0.0279 0.0235 0.1352 0.1095 0.0891 0.0545 0.1823 0.0000 -0.0149 0.1238 0.1372 0.1106 0.0325
0.1022 0.3815 0.2864 0.3132 0.2024 0.3104 0.0934 0.0973 0.2402 0.1586 0.1735 0.1690 0.0743 0.0000 -0.0180 0.2115 0.1412 0.0747 0.0766
0.1074 0.2864 0.4060 0.3553 0.1411 0.3168 0.0829 0.0665 0.2353 0.1667 0.1408 0.1992 0.0746 0.0000 -0.0170 0.1891 0.1584 0.0947 0.0394
0.1434 0.3132 0.3553 0.4446 0.1626 0.3807 0.0905 0.0475 0.2773 0.2240 0.1666 0.2357 0.0978 0.0000 -0.0245 0.2703 0.2086 0.1355 0.0968
0.0756 0.2024 0.1411 0.1626 0.3091 0.1459 0.0837 0.0917 0.1318 0.1063 0.1141 0.1151 0.0606 0.0000 -0.0129 0.1491 0.0958 0.0599 0.0318
0.1750 0.3104 0.3168 0.3807 0.1459 0.7705 0.1273 0.1043 0.3980 0.3011 0.2095 0.2984 0.1989 0.0000 -0.0297 0.3272 0.2162 0.0707 0.2009
0.0279 0.0934 0.0829 0.0905 0.0837 0.1273 0.1268 0.0540 0.1613 0.0971 0.0640 0.0328 0.0444 0.0000 -0.0087 0.1189 0.0726 0.0612 0.0250
0.0235 0.0973 0.0665 0.0475 0.0917 0.1043 0.0540 0.3048 0.0851 0.0965 0.0923 0.0529 0.0255 0.0000 -0.0115 0.0704 0.1081 0.0603 0.0400
0.1352 0.2402 0.2353 0.2773 0.1318 0.3980 0.1613 0.0851 0.5323 0.3564 0.2169 0.2282 0.1517 0.0000 -0.0176 0.3384 0.2249 0.1674 0.1744
0.1095 0.1586 0.1667 0.2240 0.1063 0.3011 0.0971 0.0965 0.3564 0.8333 0.2058 0.2000 0.1137 0.0000 -0.0052 0.2577 0.2668 0.1507 0.2842
0.0891 0.1735 0.1408 0.1666 0.1141 0.2095 0.0640 0.0923 0.2169 0.2058 0.2495 0.1331 0.1302 0.0000 -0.0023 0.2011 0.1802 0.1534 0.0675
0.0545 0.1690 0.1992 0.2357 0.1151 0.2984 0.0328 0.0529 0.2282 0.2000 0.1331 0.6989 0.0183 0.0000 0.0401 0.2117 0.1587 0.0649 0.0960
0.1823 0.0743 0.0746 0.0978 0.0606 0.1989 0.0444 0.0255 0.1517 0.1137 0.1302 0.0183 0.7378 0.0000 -0.0126 0.1546 0.1944 0.1424 0.1383
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
-0.0149 -0.0180 -0.0170 -0.0245 -0.0129 -0.0297 -0.0087 -0.0115 -0.0176 -0.0052 -0.0023 0.0401 -0.0126 0.0000 0.9814 -0.0140 -0.0173 -0.0138 -0.0147
0.1238 0.2115 0.1891 0.2703 0.1491 0.3272 0.1189 0.0704 0.3384 0.2577 0.2011 0.2117 0.1546 0.0000 -0.0140 0.3759 0.2298 0.1764 0.1529
0.1372 0.1412 0.1584 0.2086 0.0958 0.2162 0.0726 0.1081 0.2249 0.2668 0.1802 0.1587 0.1944 0.0000 -0.0173 0.2298 0.4161 0.1689 0.1655
0.1106 0.0747 0.0947 0.1355 0.0599 0.0707 0.0612 0.0603 0.1674 0.1507 0.1534 0.0649 0.1424 0.0000 -0.0138 0.1764 0.1689 0.3247 0.0348
0.0325 0.0766 0.0394 0.0968 0.0318 0.2009 0.0250 0.0400 0.1744 0.2842 0.0675 0.0960 0.1383 0.0000 -0.0147 0.1529 0.1655 0.0348 0.6028
Covariance matrix of class demand = 1 for QDA (B.2). The matrix is 19 × 19 and symmetric, with the same row and column order as (B.1).
1.2475 0.2735 0.2432 0.3935 0.3700 0.3083 0.2481 0.1031 0.3474 0.1912 0.5698 0.1793 0.1452 0.1189 0.1118 0.3986 0.4727 0.4719 0.1320
0.2735 1.2683 0.9212 1.0115 0.8624 0.5396 0.9742 0.2682 0.7168 0.4535 0.8567 0.1869 0.0741 0.0147 -0.1093 0.7146 0.5374 0.4574 0.0629
0.2432 0.9212 1.6628 0.9443 0.6683 0.3969 0.6871 0.2378 0.7115 0.4655 0.7696 0.1786 0.0755 0.0494 -0.0774 0.6356 0.6278 0.6912 -0.0281
0.3935 1.0115 0.9443 1.2251 0.7539 0.7246 0.8889 0.2120 0.8251 0.5709 0.9002 0.2965 0.0172 0.0175 -0.0748 0.7637 0.6306 0.5434 0.0770
0.3700 0.8624 0.6683 0.7539 1.1676 0.3370 0.7763 0.2787 0.5811 0.4776 0.9765 0.1993 0.1324 -0.0180 -0.1158 0.6466 0.6377 0.4981 0.0690
0.3083 0.5396 0.3969 0.7246 0.3370 1.0685 0.2852 0.0593 0.5789 0.3169 0.4241 0.2236 -0.0179 -0.1123 -0.1234 0.5460 0.4040 0.0614 0.0837
0.2481 0.9742 0.6871 0.8889 0.7763 0.2852 1.7359 0.3274 0.8627 0.6577 0.9764 0.1674 -0.0545 -0.0124 -0.0787 0.8024 0.4759 0.4525 0.0767
0.1031 0.2682 0.2378 0.2120 0.2787 0.0593 0.3274 0.4443 0.2908 0.2556 0.3444 0.0700 0.0862 0.0625 -0.0429 0.2048 0.1877 0.1974 0.1713
0.3474 0.7168 0.7115 0.8251 0.5811 0.5789 0.8627 0.2908 1.3677 0.8118 0.9185 0.2813 0.0143 -0.1091 -0.1364 0.9993 0.7739 0.6705 0.2008
0.1912 0.4535 0.4655 0.5709 0.4776 0.3169 0.6577 0.2556 0.8118 1.3680 0.8262 0.2701 0.1590 -0.0903 -0.0600 0.7430 0.5506 0.5508 0.2315
0.5698 0.8567 0.7696 0.9002 0.9765 0.4241 0.9764 0.3444 0.9185 0.8262 1.6667 0.2283 0.2015 -0.0620 -0.1234 0.9750 0.9748 0.9953 0.1714
0.1793 0.1869 0.1786 0.2965 0.1993 0.2236 0.1674 0.0700 0.2813 0.2701 0.2283 1.0180 0.2016 0.1684 0.1434 0.2830 0.2490 0.1291 0.1817
0.1452 0.0741 0.0755 0.0172 0.1324 -0.0179 -0.0545 0.0862 0.0143 0.1590 0.2015 0.2016 1.3084 0.1233 0.4424 0.1400 0.3295 0.2850 0.3526
0.1189 0.0147 0.0494 0.0175 -0.0180 -0.1123 -0.0124 0.0625 -0.1091 -0.0903 -0.0620 0.1684 0.1233 1.4780 0.6868 -0.1075 -0.0751 -0.0892 -0.0595
0.1118 -0.1093 -0.0774 -0.0748 -0.1158 -0.1234 -0.0787 -0.0429 -0.1364 -0.0600 -0.1234 0.1434 0.4424 0.6868 1.9706 -0.1388 -0.0801 -0.0989 -0.0560
0.3986 0.7146 0.6356 0.7637 0.6466 0.5460 0.8024 0.2048 0.9993 0.7430 0.9750 0.2830 0.1400 -0.1075 -0.1388 1.3725 0.9842 0.8812 0.1543
0.4727 0.5374 0.6278 0.6306 0.6377 0.4040 0.4759 0.1877 0.7739 0.5506 0.9748 0.2490 0.3295 -0.0751 -0.0801 0.9842 1.5140 0.9782 0.1293
0.4719 0.4574 0.6912 0.5434 0.4981 0.0614 0.4525 0.1974 0.6705 0.5508 0.9953 0.1291 0.2850 -0.0892 -0.0989 0.8812 0.9782 1.9571 0.1424
0.1320 0.0629 -0.0281 0.0770 0.0690 0.0837 0.0767 0.1713 0.2008 0.2315 0.1714 0.1817 0.3526 -0.0595 -0.0560 0.1543 0.1293 0.1424 1.2663
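For reference, class-specific covariance matrices such as (B.1) and (B.2) enter QDA through the standard quadratic discriminant function. The symbols are assumptions of this note: $\Sigma_0$ and $\Sigma_1$ are the matrices (B.1) and (B.2), $\mu_k$ is the predictor mean vector of class $k$, and $\pi_k$ is the prior probability of class $k$.

```latex
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
            - \tfrac{1}{2}\,(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
            + \log \pi_k ,
\qquad k \in \{0, 1\}
```

A grid is assigned to the class with the larger $\delta_k(x)$. Note that the fourteenth row and column of (B.1) are entirely zero, so $\Sigma_0$ as printed is singular, and $\Sigma_0^{-1}$ and $\log\lvert\Sigma_0\rvert$ are defined only after that predictor is dropped or the matrix is regularized. The text does not state which treatment was used.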
C. Distributions of Predictors for Naive Bayes
See Figure 7.
Figure 7: Distributions of predictors for Naive Bayes.
Data Availability
All data included in this study are available upon request from the author Yu Cheng (chengyu@shevdc.org). The data will be provided at the level of the grids defined in this research, since order data for specific stations are sensitive and private.
Disclosure
This research was presented at the “International Conference on Smart Mobility and Logistics in Future Cities”, and presentation slides containing a brief introduction to this research were shared with that conference. This manuscript is also published on the main website of our organization for internal study and communication.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding this work.
Acknowledgments
This work is supported by the International Science & Technology Cooperation Program of China under Contract no. 2016YFE0102200.