Classification of Housing Submarkets considering Human Preference: A Case Study in Shenyang, China

Housing submarkets are affected by the quality of public facilities in the neighborhood. In this article, a three-step framework is proposed to classify the housing market in Shenyang, China. The saliences of points of interest, POIs, are used to quantify the degree of activity using social media data and are merged into the density-field hotspot detector model to extract high-quality urban facilities, with the aim of exploring their influences on the housing price. The multiscale geographically weighted regression model, MGWR, is applied to investigate the relations between the housing price and attribute variables geographically and to identify multiscale impacts on the housing price with flexible bandwidths. The spatial "k"luster analysis by the tree edge removal method, SKATER, is applied to delimit urban housing submarkets based on the MGWR betas, where ten groups of housing submarkets are recognized. The proposed classification method captures the multiscale spatial heterogeneity of the housing price and the predominant influencing factors for each housing submarket in Shenyang, China. It also offers suggestions for urban planners to restrain the unbalanced development of the housing market.


Introduction
The housing market is influenced by socioeconomic diversity and the heterogeneous distribution of public facilities. Urban housing prices vary spatially because of unbalanced urban distribution and neighborhood differences related to human preference [1][2][3]. In this research, human preference is defined as the public sensibilities and perceptions used to evaluate urban situations. An urban area with high human preference reflects a high level of attention and visits, implying popularity and convenience to the public. For instance, school district houses are preferred by parents pursuing superior education for their children, and the housing price is higher within high-rank school districts. Schools are high-rank hotspots exhibiting high visit frequencies and public recognition based on human preferences. Few studies have addressed the relations between the high preference-triggered urban environment and the housing price. In general, the classification of housing submarkets focuses on housing characteristics, including location, structure, and neighborhood, without considering human preferences [4,5].
In this article, we propose a three-step framework to classify the housing submarkets considering human preference in Shenyang, China. The framework involves social media information, a hedonic model, and cluster analysis. A density-field hotspot detector model is proposed based on POI and social media data, in order to identify the hotspots of different types of human activities and evaluate the corresponding influences on the housing price. Traditional methods based on POI data are insufficient to confirm whether the site is related to distance or popularity [6]. Location-based social media data have been widely used to show the social preferences of each spot [7], where people share locations and activities using smart phones or tablets [8]. For instance, people use check-in data to show daily activities posted on Twitter or WeChat. The check-in data from the microblog of Sina Weibo record the number of visitors at specific sites or POIs in a certain period. The search volumes of the "Baidu index" (http://index.Baidu.com/) reveal the search frequencies of specific keywords (spots) in different periods (mostly on a daily basis) and regions. These social media provide data related to individual activities and an updated viewpoint to examine human spatial preferences in urban space [9,10]. According to the survey report of Chinese home buyers in 2018, more than 85 percent of house buyers are in the 20-49 age group. People in this age group are the largest group of users of smart electronic devices. Therefore, the housing submarkets are analyzed by exploring the human preferences revealed by check-in and Baidu search index data.
Hedonic models have been extensively applied to evaluate the housing value since the 1960s [11,12]. This model investigates the housing price, structure, and neighborhood condition. However, the hedonic model becomes less suitable for delineating the housing submarkets when bundles of price determinants are considered. It is necessary to explain the spatial differentials of the housing value and estimate accurate coefficients in the hedonic housing price model of the submarkets [13,14]. Spatial hedonic models are introduced to address the biased estimations caused by spatial effects, highlighting the influence of variables on the housing price in a prescribed place [15]. As a popular spatial hedonic model, the geographically weighted regression, GWR, is effective in characterizing the spatial nonstationarity and heterogeneity of the housing price, through coefficients that vary with spatial location [16]. The application of GWR in empirical housing hedonic studies has become widespread [17][18][19][20][21]. Nevertheless, only some of the relations between the housing price and its covariates vary spatially in these studies. The influence of a variable may be local or global, and treating all variables as local introduces bias into the parameter estimation [22]. The GWR model is further improved by considering both spatial stationarity and nonstationarity, namely, a semi-parametric GWR model, SGWR [23,24]. The SGWR distinguishes local and global variables; however, all of the local variables are still assumed to operate at the same bandwidth or the same spatial scale. Therefore, a multiscale geographically weighted regression model, MGWR, is used to estimate fixed and varied coefficients at different spatial scales [25,26]. The MGWR offers a more flexible option to analyze the housing price and has been successfully applied in real estate and urban studies [27,28].
In this paper, the MGWR method is selected to identify globally fixed elements and locally varied factors, characterize the spatial clustering effects of the influential factors, and analyze the corresponding determinants for the submarkets.
The factors affecting the housing price are complex and diverse. The influential factors for each house may be different, but similarities exist between houses in a prescribed area. The spatially homogeneous submarkets and the major contributing factors are detected using the coefficients of the MGWR model. Cluster analysis based on statistical geodata analyses is superior and objective for the classification of the housing submarkets. This method simultaneously considers spatial clusters of the housing structure and neighborhood, allowing examinations of spatial features at the household level [29]. The cluster algorithms for the housing submarket include factor analysis [30], principal component analysis, PCA [31], partitioning algorithms [32], K-means clustering [33], and hybrid methods of clustering algorithms, GWPCA, and DBSCAN [34]. Most of these algorithms do not consider the geographical context. Representations of the clustering pattern may be diverse when a multivariate cluster analysis method is used to analyze spatial data. Spatial locations should be considered in the classification process by setting spatial constraints or spatial weights. Since the housing market is related to urban space, the housing submarkets should be classified using spatial clustering methods based on big geodata. In addition, the classification may involve many attribute data with varied ranges, types, and measurement units. The cluster method should allow single and multiple factors to explore the underlying pattern. Spatially constrained multivariate clustering based on the SKATER algorithm is available to cluster and visualize multiple variables considering geographic locations and a cluster size constraint.
This article selects the spatially constrained multivariate clustering method to detect homogeneity and divide the housing market of Shenyang into spatially contiguous submarkets based on the coefficients of the MGWR. In comparison with K-means and DBSCAN under uneven density clustering distributions, the method uses the spatial "k"luster analysis by the tree edge removal method, SKATER, to recognize natural clusters [35]. The SKATER uses graph partitioning for efficient division of spatial entities with different densities while preserving neighborhood connectivity, balancing the computational costs and sensitivity [36]. This study aims to detect how the high preference-triggered housing facilities affect the housing price and to propose a mixed method to classify the housing submarkets based on the coefficients of the factors. Planning strategies are suggested according to the driving factors of the housing submarkets in Shenyang.

Study Area.
Shenyang is the central city of Liaoning province, serving as an important industrial center in the northeast of China (Figure 1). Shenyang is facing the challenges of industrial transition and sustainable development, similar to other heavy industry cities in the world. In 2019, the average housing price was less than 10,000 CNY/m² (Chinese yuan/m²), significantly lower than the housing prices in other large cities in China. Nevertheless, local overheating and polarization exist in the housing market. The core area of Shenyang is selected to analyze the key influential factors of the housing prices and classify the housing market, in order to optimize the distribution of urban public resources.
A total of 4,671 dwelling units of the housing transaction records, with real estate attributes from the HomeLink Real Estate website from January to December in 2017, are collected by crawler technology. An apartment may be simultaneously transacted in different agencies, resulting in differences in the data structure and repeated records. These records are not used in this study. In this manner, we randomly select 706 community samples with housing prices close to the average of each community as training samples, and 698 samples with housing prices at the median of each community as testing samples. The check-in data from the social network are limited; therefore, a short period of the trading data is used and the influence of time is neglected. The spatial pattern of the housing price in the core area of Shenyang is shown in Figure 2. The data are analyzed by the exploratory spatial data analysis of Geostatistical Analyst in ArcGIS 10.3, showing that the housing price decreases from south to north, and the high price is localized in the central zone along the east-west direction.

Housing Data Processing and Analysis.
Both the global and local Moran's I coefficients are applied to detect and measure spatial effects on the housing price. The global Moran's I value is 0.467, and the z-score is 21.6, with p value < 0.001 (Figure 3(a)). The housing price is positively autocorrelated (clustered), showing a moderately high spatial clustering pattern. With such a z-score, it is very unlikely that this pattern results from random chance. When the housing price in a neighborhood is high, the price in the surrounding areas is also high. The local indicator of spatial association, LISA, is selected to examine the variation of the spatial autocorrelation in the study area (Figure 3(b)). The high-high (HH) clustering points are concentrated along the Hunhe River in Heping District, the Golden Corridor of Youth Street in Shenhe District, and the central commercial region in Tiexi District, and are scattered around the new City Hall of Hunnan District. The low-high (LH) clustering points appear near the HH clustering points, and no high-low (HL) clustering points are observed. High housing prices show a positive spatial spillover effect. The low-low (LL) clustering points are predominantly observed in the northwestern regions. The housing prices in the related communities are essentially homogeneous.
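As an illustration of the global statistic reported above, the following minimal sketch computes Moran's I from a list of values and a spatial weight matrix. The toy prices and the binary neighbor weights are hypothetical and are not the data of this study; the function name `global_morans_i` is likewise only illustrative.

```python
def global_morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]                 # deviations from the mean
    s0 = sum(sum(row) for row in weights)            # sum of all spatial weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    total = sum(d * d for d in dev)
    return (n / s0) * (cross / total)


# Hypothetical toy data: five communities on a line; adjacent ones are neighbors.
prices = [10.0, 9.0, 8.0, 3.0, 2.0]
w = [[1.0 if abs(i - j) == 1 else 0.0 for j in range(5)] for i in range(5)]

print(global_morans_i(prices, w))                         # clustered pattern
print(global_morans_i([1.0, 9.0, 1.0, 9.0, 1.0], w))      # alternating pattern
```

A positive value indicates that similar prices tend to be neighbors, as for the clustered toy series, whereas the alternating series gives a negative value. The z-score and p value additionally require the expected value and variance of I under the null hypothesis, which are omitted here.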

POIs and Social
[Figure: map of the core area of Shenyang.]

[...] Table 1. Three categories of POIs with 5047 records, namely, the commerce centers, elementary and secondary schools, and Three-A hospitals, are eventually derived. In addition, 119,135 points of bus stops and subway stations, and 879 regional data of rivers and parks are collected. Figure 4 shows the framework of this research. Similar houses are selected as training and validation data. Social media data are used to quantify the popularity of the POIs and measure the performance of different urban facilities. The MGWR model is applied to detect local variables with spatial heterogeneity. The coefficients of these variables are then used in a spatially constrained multivariate clustering analysis to divide the housing market into submarkets. The OLS results of the validation data within the submarkets are used to verify the rationality and reliability of the proposed classification method.

POI Salience.
The urban areas attracting high levels of attention and visits are hotspots for economic, educational, or social activity. Urban planners should consider how to effectively identify these hotspots based on human preference, and how to characterize the attractiveness and assess the quality of these hotspots. The POI salience index is established based on social media data to represent the service ability of public facilities. The POI salience is calculated by public cognition from social network data to adjust the weight of the hotspot detection of POIs, as follows:

$$S_{POI} = w_a A + w_b B + w_c C \tag{1}$$

where $S_{POI}$ is the POI salience; A is the public attention calculated by Weibo and Baidu search index data; B is the level of the POI; C is the level of dependence on the distance to downtown; and $w_a$, $w_b$, and $w_c$ are the weights of A, B, and C, generally recommended as 0.4, 0.4, and 0.2, respectively [37]. All indicators of A, B, and C are normalized to a scale of 0 to 1 using min-max normalization:

$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{2}$$

where max(x) signifies the maximum value in all observed values of x, min(x) is the minimum value, and $z_i$ is the ith normalized value.
where $user_w$ is the weight of living regions. In general, local residents are more familiar with the neighborhood, since public cognitive abilities and attention are constantly improved with the accumulation of experience. The search frequencies from the local residents are more specific, where the Baidu search index and check-in data record user behaviors. Three sources of $user_w$ are derived by the Baidu search index from Shenyang city, other cities in the province, and other cities outside the province. The weights are recommended as 0.7, 0.2, and 0.1, respectively; $w_u$ and $w_i$ are recommended as 0.4 and 0.6, respectively; parameter B is related to the POI type.
The original POIs acquired from the Baidu map and check-in data have 16 categories. Some of the POIs, such as convenience stores and residential areas, are less prominent, and some of the POIs show similar functions. We choose the POIs with high influence and popularity among the public, such as hospitals and commercial centers. The 16 categories of the POIs are, therefore, reclassified into 8 categories, including transportation, education, sight, medical facility, park, commercial center, entertainment, and enterprise.
Parameter C is ranked by the distance to the city center. In Shenyang, most roads and streets are connected by several major ring roads. These ring roads almost share the same center located near the Palace Museum. The ring roads are numbered from the inner to the outer part of the city. The center degree of a POI is highest within the area of the 1st ring road and decreases with the distance to the city center.
To avoid overly small or zero weight, the weighting formula of POI follows: According to equations (1)-(4), the POI saliences are calculated and 10 items greater than 0.6 are listed in Table 1.
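The weighted combination and the min-max normalization described in this section can be sketched as follows. This is a minimal illustration, assuming the salience is the weighted sum of the three normalized indicators with the recommended weights of 0.4, 0.4, and 0.2; the indicator values are hypothetical, the function names are illustrative, and the adjustment that avoids overly small or zero weights is not reproduced.

```python
def min_max(values):
    """Rescale a list of values to the range 0 to 1 (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant input: nothing to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def poi_salience(attention, level, centrality, w_a=0.4, w_b=0.4, w_c=0.2):
    """Weighted sum of the normalized indicators A, B, and C for each POI."""
    a, b, c = min_max(attention), min_max(level), min_max(centrality)
    return [w_a * ai + w_b * bi + w_c * ci for ai, bi, ci in zip(a, b, c)]


# Hypothetical indicator values for four POIs (not taken from Table 1).
attention = [1200, 300, 50, 800]      # A: e.g., search and check-in volume
level = [3, 2, 1, 3]                  # B: e.g., rank of the facility
centrality = [4, 3, 1, 2]             # C: e.g., higher means closer to the center

scores = poi_salience(attention, level, centrality)
print(scores)
```

Because each indicator is normalized before weighting, the salience always lies between 0 and 1, which makes a threshold such as 0.6 comparable across POI types.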

Weighted Hotspot Detection.
The quantitative characteristics of POIs are a prerequisite for identifying the relations between public facilities and housing values. The distances from these POIs are calculated as explanatory variables to analyze the related housing price using MGWR.
The Kernel Density tools in ArcGIS 10.3 are used to analyze the densities of POIs. The classification of the kernel density output raster is important for the results. The common method of natural breaks is insufficient to identify the differences between hotspots, and the traditional kernel density method can hardly extract and express the hotspot peaks quantitatively [38,39]. The existing density-field detection method cannot use human preferences sufficiently; therefore, this article uses a modified method of salience-weighted density-field hotspot detection, SWDF-HD. This model is achieved by using the spatial analysis of kernel density and the local max-value extraction considering human preference. It is more effective in adjusting the weight of the POIs and deriving the spatial distributions of high visits and qualities. The salience of POIs is initially defined as the population field value of the traditional kernel density tool, in order to construct a weighted kernel density surface as the input raster for the focal statistics. The neighborhood statistics method is then used to extract the maximum value from the raster data and generate the extreme-value surface. The neighborhood statistics is the focal statistics used to calculate the local density maxima of all cells in the neighborhood. The raster calculator tool is selected to execute a map algebra calculation between the density and the extreme-value surfaces to extract the local maximum values. Finally, spatial structures of local density maxima are derived at different levels. Figure 5 shows the entire modeling process.
Not all POIs require salience adjustment to analyze the housing price. The subway or bus station, for instance, requires no adjustment of saliences. A subway or bus station with high visits is mostly visited due to its location relative to the residents' destinations, not its attractiveness. In contrast, parents prefer to purchase school district houses to provide better education, healthcare, or life convenience for children. Therefore, the SWDF-HD method is used to extract local high-quality spatial locations of primary and secondary schools, large polyclinic hospitals, and commercial centers. A total of 5047 POIs are acquired with social network media, and the impacts of the distance to these locations on the housing price are analyzed.

[Figure 4: Classification framework of housing submarkets.]
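The three steps of the hotspot detection (weighted density surface, focal maximum, and map algebra comparison) can be sketched on a small grid as follows. This is only an illustration under stated assumptions: a Gaussian kernel replaces the kernel of the ArcGIS Kernel Density tool, the neighborhood is a square window, and the POI coordinates and saliences are hypothetical. All function names are illustrative.

```python
from math import exp


def weighted_density(points, nx, ny, cell, bandwidth):
    """Salience-weighted Gaussian kernel density on an nx-by-ny grid.

    points holds (x, y, salience); the salience plays the role of the
    population field of a kernel density tool.
    """
    grid = [[0.0] * nx for _ in range(ny)]
    for r in range(ny):
        for c in range(nx):
            cx, cy = (c + 0.5) * cell, (r + 0.5) * cell      # cell center
            for x, y, s in points:
                d2 = (x - cx) ** 2 + (y - cy) ** 2
                grid[r][c] += s * exp(-d2 / (2 * bandwidth ** 2))
    return grid


def focal_max(grid, radius=1):
    """Extreme-value surface: maximum within each cell's square neighborhood."""
    ny, nx = len(grid), len(grid[0])
    out = [[0.0] * nx for _ in range(ny)]
    for r in range(ny):
        rows = range(max(0, r - radius), min(ny, r + radius + 1))
        for c in range(nx):
            cols = range(max(0, c - radius), min(nx, c + radius + 1))
            out[r][c] = max(grid[rr][cc] for rr in rows for cc in cols)
    return out


def local_maxima(grid, radius=1):
    """Map algebra step: keep cells whose density equals the focal maximum."""
    fmax = focal_max(grid, radius)
    return [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))
            if grid[r][c] > 0 and grid[r][c] == fmax[r][c]]


# Hypothetical POIs: a salient pair in one corner and a weaker POI elsewhere.
pois = [(1.5, 1.5, 0.9), (1.6, 1.4, 0.7), (7.5, 7.5, 0.3)]
density = weighted_density(pois, nx=10, ny=10, cell=1.0, bandwidth=1.0)
peaks = local_maxima(density)
print(peaks)
```

The peak values can then be classified into levels, so that a weaker hotspot such as the one at cell (7, 7) is still detected but ranked below the salient cluster.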

Multiscale Geographically Weighted Regression.
The ordinary least squares linear regression, OLS, is applied to establish the relation between an explained variable (y) and a set of explanatory variables (x), as follows:

$$y_i = \beta_0 + \sum_{j=1}^{m} \beta_j x_{ij} + \varepsilon_i$$

where $y_i$ is the response variable of the ith observation; $\beta_0$ is the intercept; $\beta_j$ is the jth parameter estimate; $x_{ij}$ is the jth independent variable of the ith observation (j = 1 to m); and $\varepsilon_i$ is the normally distributed error. β is calculated by minimizing the sum of the squared differences between the observed and predicted values, as follows:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$$

where X is the matrix of explanatory variables and Y denotes the vector composed of the n observed values of the explained variable.
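The least squares estimate can be illustrated by solving the normal equations directly. The sketch below is a minimal pure-Python version with hypothetical noise-free data; the helper `solve` is a plain Gaussian elimination written for this example, and a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]        # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                          # back substitution
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def ols(X, y):
    """Solve (X^T X) beta = X^T y; each row of X starts with 1 for the intercept."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data generated from y = 1 + 2*x1 - 3*x2 without noise.
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1 + 2 * row[1] - 3 * row[2] for row in X]
beta = ols(X, y)
print(beta)
```

With noise-free data the generating coefficients are recovered up to rounding, which is a convenient check before the same estimator is localized with spatial weights.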
GWR provides a local model to capture the spatial heterogeneity. Given n observations, where i ∈ {1, 2, ..., n} is the observation number at location $(u_i, v_i)$, the estimation of the dependent variable y for the ith observation is given by

$$y_i = \sum_{j=0}^{m} \beta_j(u_i, v_i)\, x_{ij} + \varepsilon_i$$

where j ∈ {0, 1, 2, ..., m} denotes the feature number, $y_i$ is the explained variable of the ith observation, and $\beta_j(u_i, v_i)$ denotes the jth coefficient at the given spatial coordinates $(u_i, v_i)$; $x_{ij}$ denotes the ith observed value of the jth feature and $\varepsilon_i$ signifies the error term. The SGWR allows one subset of parameters to be constant across space and another subset to vary locally, and it can be regarded as a specific case of MGWR, as follows:

$$y_i = \sum_{j} a_j x_{ij} + \sum_{l} b_l(u_i, v_i)\, x_{il} + \varepsilon_i$$

where $y_i$ is the ith observed value of the explained variable; $a_j$ is the estimate of the jth global parameter of the fixed independent variable; $b_l(u_i, v_i)$ denotes the lth local coefficient; and $\varepsilon_i$ represents the error term. Both GWR and SGWR assume that all local relations vary at the same spatial scale. The MGWR model allows the conditional relation between the response and each explanatory variable to occur at its own scale, locally or regionally. The MGWR thus relaxes this implicit assumption and identifies the relations between the dependent variable and the independent variables at various scales [40]. The MGWR model is given as follows:

$$y_i = \sum_{j=0}^{m} \beta_{bwj}(u_i, v_i)\, x_{ij} + \varepsilon_i$$

where bwj of $\beta_{bwj}$ is the bandwidth used for calibrating the jth conditional relation, $(u_i, v_i)$ is the coordinates of each location i, and $\varepsilon_i$ is the error term. The optimal bandwidth selection of MGWR is determined in a trial-and-error manner. A trial bandwidth is fitted using either GWR or SGWR and evaluated with the corrected Akaike information criterion (AICc). A suitable bandwidth is the one minimizing the AICc, as follows:

$$\mathrm{AICc} = 2n \ln(\hat{\sigma}) + n \ln(2\pi) + n\,\frac{n + \mathrm{tr}(S)}{n - 2 - \mathrm{tr}(S)}$$

where n denotes the observation number, $\hat{\sigma}$ represents the estimated standard deviation of the residuals, and tr(S) denotes the trace of the MGWR hat matrix S.
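To make the local calibration concrete, the sketch below fits a weighted regression with one covariate around a target location and evaluates the AICc from its three ingredients. It is an illustration only: a fixed Gaussian kernel is assumed, whereas the study selects an adaptive kernel; the AICc is written in the form commonly used for GWR calibration; and the coordinates and values are hypothetical.

```python
from math import exp, log, pi


def gaussian_weights(coords, target, bandwidth):
    """Gaussian kernel weights of all observations around a target location."""
    tu, tv = target
    return [exp(-0.5 * ((u - tu) ** 2 + (v - tv) ** 2) / bandwidth ** 2)
            for u, v in coords]


def local_fit(x, y, w):
    """Weighted least squares for y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def aicc(n, sigma, trace_s):
    """Corrected AIC from the sample size, residual std. dev., and tr(S)."""
    return (2 * n * log(sigma) + n * log(2 * pi)
            + n * (n + trace_s) / (n - 2 - trace_s))


# Hypothetical data: the slope is 1 in the west and 3 in the east.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

west = local_fit(x, y, gaussian_weights(coords, (1, 0), bandwidth=2.0))
east = local_fit(x, y, gaussian_weights(coords, (11, 0), bandwidth=2.0))
print(west, east)
```

A very large bandwidth gives nearly equal weights and therefore a single global slope, which is the sense in which a variable with a bandwidth close to the sample size behaves as a global variable.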

Spatially Constrained Multivariate Clustering Analysis.
The spatially constrained multivariate clustering tool in ArcGIS Pro 2.8 uses a connectivity graph and an algorithm called spatial "k"luster analysis by tree edge removal, SKATER, to construct natural clusters in space. SKATER is a graph theory method based on pruning a minimum spanning tree. A minimum spanning tree, MST, is initially constructed based on the connectivity graph representing the neighborhood relations among the geographic features. The classification method allows one or more variables as analysis fields to visualize and create clusters. As an unsupervised method, it offers a number of optional parameters that are used to calculate an optimal integer number of groups and to control cluster size constraints. We take the set of coefficients of MGWR as analysis attributes and the housing price as the cluster size constraining parameter to cluster submarkets. The MST reflects both the spatial structure and the associated field values of the features. A series of areal entities are related to a set of quantitative attributes $\{A_1, \ldots, A_n\}$. Each areal entity has an attribute vector $x = (a_1, \ldots, a_n)$, where the attribute $A_1$ takes the value $a_1$. A connected graph G = (V, L) consists of a nonempty set V of vertices and a set L of pairs of vertices termed edges. If the areal entities $s_i$ and $s_j$ are adjacent, an edge is constructed to connect the vertices $v_i$ and $v_j$. The cost d(i, j) is associated with the edge between the vertices $v_i$ and $v_j$ and is calculated based on the attribute vectors $x_i$ and $x_j$ of the areal entities i and j. A general choice is the squared Euclidean distance, as follows:

$$d(i, j) = \sum_{l=1}^{n} (x_{il} - x_{jl})^2$$

where $x_{il}$ is the lth attribute of entity i. The MST is built through Prim's algorithm [41]. The algorithm constructs a graph $G^*$, including a group of trees $T = (T_1, \ldots, T_n)$. Each of these trees is connected, but no edges or vertices are shared among them. $G^*$ includes one tree, the MST, in the first iteration. In each iteration, one edge is pruned until the specified number of clusters is obtained.
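The construction of the minimum spanning tree can be sketched as follows. This is a simple, unoptimized version of Prim's algorithm restricted to the edges of the connectivity graph, with the squared Euclidean distance between attribute vectors as the edge cost. The four entities, their single attribute, and the neighbor lists are hypothetical.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def prim_mst(attrs, neighbors):
    """Minimum spanning tree of the connectivity graph as (cost, i, j) edges."""
    n = len(attrs)
    in_tree = {0}                        # grow the tree from entity 0
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in neighbors[i]:
                if j not in in_tree:
                    cost = sq_euclidean(attrs[i], attrs[j])
                    if best is None or cost < best[0]:
                        best = (cost, i, j)
        if best is None:
            raise ValueError("connectivity graph is not connected")
        edges.append(best)
        in_tree.add(best[2])
    return edges


# Hypothetical entities on a 2 x 2 layout, one attribute each (e.g., a coefficient).
attrs = [(0.0,), (1.0,), (5.0,), (5.5,)]
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

mst = prim_mst(attrs, neighbors)
print(mst)
```

In this toy case the most expensive tree edge links the two dissimilar pairs, so removing it would leave the groups {0, 1} and {2, 3}. SKATER itself chooses the edge to prune by the partition quality rather than by the raw edge cost.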
The selected edge minimizes the dissimilarity in the resultant clusters, avoiding (if possible) singletons (clusters with only one feature). The dissimilarity is measured by the minimum sum of the intracluster squared deviations, as follows:

$$Q(\Pi) = \sum_{i=1}^{k} SSD_i$$

where Π indicates a division of the entities into k trees; Q(Π) denotes the quality of the partition Π, which needs to be minimized; and $SSD_i$ summarizes the squared deviations in region i. For a tree k, it is given as follows:

$$SSD_k = \sum_{j=1}^{m} \sum_{i=1}^{n_k} (x_{ij} - \bar{x}_j)^2$$

where $n_k$ denotes the number of spatial entities in tree k; $x_{ij}$ denotes the jth attribute of entity i; m is the number of attributes; and $\bar{x}_j$ is the mean value of the jth attribute over the entities in the tree. The Calinski-Harabasz pseudo-F-statistic is a common measure of the clustering effectiveness and helps to determine an appropriate number of clusters. It behaves as a ratio of the between-cluster variance to the within-cluster variance, as follows:

$$F = \frac{R^2 / (n_c - 1)}{(1 - R^2) / (n - n_c)}$$

where $R^2 = (SST - SSE)/SST$; SST reflects between-cluster differences and SSE reflects intracluster differences; n denotes the number of entities; and $n_c$ is the number of clusters.
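The partition quality and the pseudo-F-statistic can be evaluated for a candidate partition as sketched below. The sketch assumes that SST is the total sum of squared deviations from the overall mean and that SSE is the sum of the within-cluster squared deviations, which is the usual reading of these terms; the data and labels are hypothetical.

```python
def ssd(rows):
    """Sum of squared deviations of every attribute from its mean over rows."""
    m = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(m)]
    return sum((r[j] - means[j]) ** 2 for r in rows for j in range(m))


def partition_quality(data, labels):
    """Q: sum of the within-cluster SSD over all clusters of the partition."""
    clusters = {}
    for row, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(row)
    return sum(ssd(rows) for rows in clusters.values())


def pseudo_f(data, labels):
    """Calinski-Harabasz pseudo-F computed from R^2 = (SST - SSE) / SST."""
    n, n_c = len(data), len(set(labels))
    sst = ssd(data)                          # assumed: total sum of squares
    sse = partition_quality(data, labels)    # assumed: within-cluster sum
    r2 = (sst - sse) / sst
    return (r2 / (n_c - 1)) / ((1 - r2) / (n - n_c))


# Hypothetical one-attribute data with two obvious groups.
data = [(0.0,), (1.0,), (5.0,), (6.0,)]
good = [0, 0, 1, 1]     # groups similar entities together
bad = [0, 1, 0, 1]      # mixes the two groups

print(partition_quality(data, good), pseudo_f(data, good))
print(partition_quality(data, bad), pseudo_f(data, bad))
```

Comparing the statistic across candidate numbers of clusters is one way to choose the number of groups: a larger value indicates that the clusters are more distinct relative to their internal spread.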

Predictor Variables and Global Modeling.
In the application of the spatial hedonic price model, a linear function is applied to directly reflect the influences of different factors. The dependent variable is the housing price in prescribed communities. To avoid multicollinearity, twelve factors are selected to explain the response variable, the housing price, including the basic condition, traffic, environment, and living condition of the community. The meaning and classification of the twelve variables are shown in Table 2.
For living conditions, the distance to the nearest facilities is derived by the SWDF-HD method. The numbers of POIs related to healthcare and education are smaller than the total number of commercial facilities, such as shopping malls, restaurants, and supermarkets. Therefore, the kernel bandwidths for the three POI types are different in the estimation of the kernel density. The average edge length of the streets of Shenyang is 0.224 km, and the multihotspots of the three types are calculated with bandwidths from 0.1 km to 1 km. For commercial centers, the hotspots in the first level are over-detected below 0.2 km and under-detected above 0.6 km. Consequently, an average of 0.4 km is used as the optimal bandwidth within the range from 0.3 km to 0.6 km. For medical and educational centers, the optimal bandwidth is 0.7 km. The numbers of education hotspots and commercial facilities are 182 and 498, respectively. Figure 6 shows the preliminary data of the POIs and the hotspot maps of the hierarchical scale distribution for commercial, medical, and educational services based on the SWDF-HD model, respectively.
All data are processed using data cleaning and collinearity processing methods before estimation, so that abnormal values are eliminated. A total of 706 effective samples are collected. Table 2 also lists the parameter estimates and inference results using the ordinary least squares method, OLS. In order to compare and distinguish the dominant factors of the housing price, the response and predictor variables are standardized so that they are centered at zero and share the same range of variation. The property management fee is the most significant factor with a coefficient of 0.604, and the second one with 0.231 is the shortest distance to educational hotspots. In contrast, the shortest distance to bus stations is less important. Residents prefer high-quality living and education, and the bus system is not the primary transportation in Shenyang. The significance of the model variables is examined by a robust test and marked by stars. All [...] and p < 0.001) and Jarque-Bera statistic (1824.814 and p < 0.001) are statistically significant, indicating nonstationarity in the OLS model.
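The standardization step that makes the coefficients comparable can be sketched as a z-score transformation. The use of the population standard deviation and the sample values are assumptions made for this illustration; the text does not specify which variant is applied.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center values at zero and scale them to unit standard deviation."""
    mu, sd = mean(values), pstdev(values)    # population std. dev. (assumed)
    return [(v - mu) / sd for v in values]


# Hypothetical property management fees and unit prices for five communities.
fees = [0.5, 1.0, 1.5, 2.0, 3.0]
prices = [6200.0, 7100.0, 8000.0, 9500.0, 12000.0]

z_fees = standardize(fees)
z_prices = standardize(prices)
print(z_fees)
print(z_prices)
```

After this transformation a coefficient expresses the change of the housing price, in standard deviations, per standard deviation of the predictor, so that a coefficient of 0.604 can be compared directly with one of 0.231.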

Local Modeling with MGWR.
Twelve variables are involved in the GWR and MGWR models with the Python package MGWR [42]. Table 3 shows the performance evaluation of five models of OLS, GWR, and MGWR with fixed and adaptive spatial kernels using 12 variables, in terms of Adj. R², AIC-corrected (AICc) values, residual sum of squares, and Moran's I test for residual spatial autocorrelation. In contrast to the global model of OLS, both the GWR and MGWR models show significant improvements in Adj. R², residual sum of squares, and AICc. The MGWR shows better performance than GWR, regardless of the selected kernel function. The best performance is achieved by the MGWR model using an adaptive kernel function. It has an AICc, R², and Adj. R² of 1035.531, 0.862, and 0.820, respectively. The residual sum of squares is 102.809, Moran's index for the model residuals is −0.003, and the p value is 0.726. The MGWR model with the adaptive kernel function is, therefore, selected for estimation after comparing the results of these models. Table 4 lists the MGWR covariate-specific bandwidths, GWR bandwidths, bandwidth confidence intervals, adjusted critical t values, and Monte Carlo test for spatial variability with 95% confidence intervals. The bandwidths in the MGWR range from 43 to 705. In contrast, the optimal GWR bandwidth is 105, implying that the conditional relations with the housing price operate at different spatial scales. These scales are categorized into three groups at global, regional, and local scales. The global variables (bandwidth larger than 600) include housing age, floor area ratio, green space ratio, and the shortest distances to bus stops and subway stations, demonstrating consistent effects and no significant spatial heterogeneity across the study region. The bandwidths of the local variables (less than 50) cover about 6% of the total number of the housing sampling points. The three local variables include area and the shortest distances to the park and river. The study region covers a total area of 1,254 km², and 6% of the area is 74.7 km².
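The grouping of the covariate-specific bandwidths into scales can be expressed as a simple rule. In the sketch below, the thresholds of 50 and 600 nearest neighbors and the sample size of 706 follow the text, whereas the variable names and the middle bandwidth are placeholders rather than the values of Table 4.

```python
def scale_of(bandwidth, local_max=50, global_min=600):
    """Classify an adaptive bandwidth (number of neighbors) by spatial scale."""
    if bandwidth < local_max:
        return "local"
    if bandwidth > global_min:
        return "global"
    return "regional"


n_samples = 706

# Placeholder variables; only the extremes 43 and 705 are quoted in the text.
bandwidths = {"variable_a": 43, "variable_b": 300, "variable_c": 705}

for name, bw in bandwidths.items():
    share = bw / n_samples               # fraction of samples in each local fit
    print(name, scale_of(bw), round(share, 2))
```

The share column shows why the smallest bandwidth counts as local: each local regression for that variable borrows data from only about 6% of the sampling points.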
The remaining parameters operate at regional scales, and the spatial distributions of these parameters are relatively similar, exhibiting spatial nonstationarity at the regional scales. Table 5 summarizes the statistics of the parameter estimates using the MGWR method. The housing area, property management fee, green space ratio, and the shortest distance to commercial centers impose positive influences on the housing price, and the other parameters show negative influences. The housing price is higher with larger size, higher property management fee, more green space, and closer distance to the bus and subway stations and to high-rank hospitals and schools. In contrast, the housing price is lower in old or densely populated communities. Furthermore, the housing price decreases when the house is near commercial centers, largely due to extra noise in these areas. One of the most significant variables is the property management fee, showing that high management quality is the factor of greatest concern to residents, followed by environment and education. The shortest distance to bus stops has the lowest impact on the housing price. Although bus transportation is normally cheaper and widespread, disadvantages such as overload and traffic congestion make this kind of transportation less attractive. With the improvement of the transportation system, residents prefer to live far away from the noise of the commercial centers and the street traffic in the study region. The MGWR results are visualized to analyze the spatial differences of the individual parameter estimates. The negative parameter estimates are expressed in blue and the positive values in red. Insignificant estimates, whose absolute t-values are below the critical t-value at the 95% confidence level, are set to zero and displayed in grey. Three significant global factors are the housing age, floor area ratio, and shortest distance to subway stations, as shown in Figures 7(a)-7(c).
The same color denotes similar spatial impacts on the housing price. The cold blue colors corresponding to the house age, floor area ratio, and the shortest distance to subway stations indicate negative impacts on the housing price. According to Figure 7(a), the coefficients of the housing age vary only slightly, with the maximum value in the center and the minimum value in the west. Figure 7(b) shows that the floor area ratio in the south is dark blue, implying larger population density and building density. Figure 7(c) shows that the traffic effects in the north are less important than the effects in the central and southern areas. Figures 7(d)-7(f) show three local estimated coefficients, which are spatially diversified, corresponding to the house area and the shortest distances to park and river. According to Figure 7(d), the area shows both positive (red colors) and negative effects (blue colors) on the housing price. The local area estimates show significant positive effects in two zones. The first zone is located in the central business district, namely, Tiexi square. The second zone is the Heping District near the Hunhe River. The houses with larger areas are more popular in these zones due to better public facilities and good views. In contrast, small houses are more popular for temporary residences near Taiyuan Street, marked in blue. As the most popular commercial pedestrian street, Taiyuan Street includes a large number of stores and shopping malls. A large number of visitors make this region less suitable for long-term living, due to noise and light pollution. The blue colors of Figures 7(e) and 7(f) show negative relations between the housing price and the shortest distances to park and river. Residents in Shenyang tend to live in good ecological environments. Figure 7(e) shows the significant local parameter estimates corresponding to the shortest distance to parks. The housing price is higher near the parks of Changbaidao and Shenshuiwan located along the Hunhe River.
A zone showing a similar tendency is located in the Taiyuan commercial zone near Zhongshan Park. These areas show positive demand for urban green space. Figure 7(f) shows the significant effects of the shortest distance to river. Blue colors imply that a shorter distance to the river corresponds to a higher house price in Nanhu Street near the South canal and Wulihe Street near the Hunhe River. Figure 8 shows the regional parameter estimates for the variables. Figure 8(a) shows the coefficient distribution of the property management fee. Red colors indicate that a higher property management fee is in general related to a higher housing price, where better community management is provided. The central part of Shenyang is more sensitive to the property management fee. Most families prefer high-quality property management. Figure 8(b) shows the estimated coefficients for the nearest distance to large commercial centers. This pattern is opposite to that of the property management fee. The central sections along the Hunhe River are dark red, where a longer distance to commercial centers corresponds to a higher housing price. Residents in this zone prefer high-quality and quiet living environments. Figure 8(c) shows the coefficients of the nearest distance to school. Most communities near Tiexi square are in dark blue, where a higher housing price is related to a shorter distance. Similar trends are observed in the Taiyuan commercial zone and in Changbai and Shenshuiwan Streets of the Heping District. Based on Figure 6, three well-known, eight very popular, and nine popular educational hotspots are observed in these regions. Therefore, a shorter distance to these hotspots corresponds to higher prices. Figure 8(d) shows a different distribution for the shortest distance to hospital. In the red zone, the hospital distance is a positive factor, where the housing price is higher when the house is far away from the hospital.
According to Figure 6, the living environment possibly becomes crowded and less attractive, since high-rank hospitals are visited by many patients each day. On the contrary, the blue zone around Changbai Street has the opposite trend, where the hospital distance becomes a negative factor for the housing price. A higher housing price is expected when the house is closer to hospitals. High-quality medical resources are in demand in this zone, due to the lack of high-rank hospitals. Figure 8(d) also highlights that the coefficients of the medical resources in the northern area are larger than those in the southern area. High-rank medical resources are concentrated in the central and northern areas, and the medical resources are insufficient in the southern area. Figure 9 shows the coefficient distribution for the intercept of the housing price in the MGWR model. The intercept gives the average housing price in each region when all covariates are held constant and is regarded as the localized effect on the housing price. The positive effects are restricted to the central business district of Tiexi square, attributed to convenient public transportation and excellent educational institutions and medical facilities. Lower housing prices are localized in the old residential areas of Nanta street and the northern suburb of Shenyang.
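The masking rule used for the coefficient maps above (insignificant estimates set to zero and drawn in grey, negative in blue, positive in red) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the example values, and the critical value of 1.96 are assumptions, and MGWR software typically supplies its own corrected critical t-value per variable.

```python
import numpy as np

def mask_insignificant(betas, t_values, t_critical):
    """Set estimates to zero where |t| is below the critical t-value."""
    betas = np.asarray(betas, dtype=float)
    t_values = np.asarray(t_values, dtype=float)
    return np.where(np.abs(t_values) >= t_critical, betas, 0.0)

def map_colour(masked_betas):
    """Colour class per location: red (positive), blue (negative), grey (zero)."""
    masked_betas = np.asarray(masked_betas, dtype=float)
    colours = np.full(masked_betas.shape, "grey", dtype=object)
    colours[masked_betas > 0] = "red"
    colours[masked_betas < 0] = "blue"
    return colours

# Hypothetical local estimates and t-values for four locations.
demo_betas = [0.40, -0.30, 0.10, -0.05]
demo_t = [3.0, -2.5, 1.0, -0.4]
demo_masked = mask_insignificant(demo_betas, demo_t, t_critical=1.96)  # assumed threshold
demo_colours = map_colour(demo_masked)
```

Keeping the threshold as a parameter lets each variable use whatever critical value the fitted model reports.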

Spatial Clustering via Spatial Coefficients.
The spatially constrained multivariate clustering tool, based on the unsupervised machine learning method SKATER, is applied to conduct the clustering analysis. The MGWR results are modeled in 1 km × 1 km grid cells. Seven significant coefficients of local and regional variables with smaller MGWR bandwidths (only bandwidths below 200 are included), which share similar characteristics, are selected as the analysis fields. The housing price is set as the cluster size constraint parameter. Deriving the optimal number of clusters is the first step for unsupervised clustering algorithms. Figure 10 plots the pseudo-F-statistic against the number of clusters from 2 to 30. The plot is used to derive the key point, after which the pseudo-F-statistic starts increasing, suggesting that 10 is the optimal choice for running the clustering algorithm.
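For readers unfamiliar with the pseudo-F-statistic, the sketch below computes it in its usual Calinski-Harabasz form, the ratio of between-cluster to within-cluster variance, each divided by its degrees of freedom. It assumes this standard definition; the function name and the toy data are hypothetical, and the clustering tool used in the study reports the statistic itself.

```python
import numpy as np

def pseudo_f(X, labels):
    """Pseudo-F = (SSB / (k - 1)) / (SSW / (n - k)) for a given labelling."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]
    groups = np.unique(labels)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    grand_mean = X.mean(axis=0)
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for g in groups:
        pts = X[labels == g]
        centre = pts.mean(axis=0)
        ssb += len(pts) * ((centre - grand_mean) ** 2).sum()
        ssw += ((pts - centre) ** 2).sum()
    return (ssb / (k - 1)) / (ssw / (n - k))

# Toy one-dimensional "coefficients" with two obvious groups.
demo_X = [[0.0], [1.0], [10.0], [11.0]]
f_good = pseudo_f(demo_X, [0, 0, 1, 1])  # compact, well-separated clusters
f_bad = pseudo_f(demo_X, [0, 1, 0, 1])   # clusters that mix the two groups
```

Evaluating this statistic for each candidate number of clusters yields a curve like the one described for Figure 10.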
The global spatial pattern of the ten submarkets in Shenyang demonstrates that the housing price in the core area is higher than the price in suburban areas, and the housing price in the southern submarkets is higher than the price in the north (Figure 11(a)). Submarket 1 has a relatively high average price, covering both sides of the Hunhe River. Submarkets 2 and 3 cover the central areas of the Tiexi district and part of the Heping District, including Tiexi square and the Taiyuan commercial center. These areas include popular shopping centers, such as Star Mall, IKEA, and Taiyuan pedestrian street. Submarket 4 involves a cluster of financial services from Wulihe Street to Beiling Street termed the Golden Corridor, including the Palace Museum, Beiling park, and Zhongjie pedestrian street. Submarket 5 includes parts of the Hunnan District and the Shenhe District. The planning policy affects the classification of the housing submarkets. Submarket 6 is located in south Shenyang, and submarkets 7 and 8 are located in southwest Shenyang, within the Tiexi and Yuhong districts. Submarkets 9 and 10 cover a larger area in the north of the city. Figure 11(b) shows the relations between the MGWR coefficients of the seven local variables. The most significant factors for submarket 1 are the shortest distances to the park, hospital, and commercial center. The average housing price is higher when the distance to public parks and hospitals is short and when the houses are far away from the commercial center. Residents in these areas, mostly in good financial condition, prefer quiet and relaxing residential areas with good views, educational resources, and infrastructure. The natural surroundings become the most important factor for the housing values. For houses in submarket 2, the distance to school is clearly more important than the other factors, where the house is more expensive when the distance to schools is shorter.
The educational resource has irreplaceable effects in this region, which is opposite to submarket 9. In addition, the housing size is an important factor. Residents in submarket 2 prefer to purchase larger school district apartments with sufficient space for three or four family members. The shortest distance to the hospitals is the most sensitive factor for submarkets 3 and 7. The price of the houses close to the hotspot hospitals decreases.
The First Affiliated Hospital of China Medical University, one of the top hospitals in China, is located in this area. With a large flow of patients from different cities, short-term rental is very popular in the area surrounding the hospital; therefore, safety issues may arise. In submarket 4, the river plays a significant role in influencing the price, where the south canal runs through this region. The housing price is higher when the houses are closer to the commercial center in submarket 5. The influencing levels of the factors are equal, and no factor shows an obviously significant influence in submarkets 6 and 8. For submarket 7, the property management fee is a major factor, where residents prefer to purchase houses with a high-quality property management level. In contrast, the property management level is a less sensitive factor in submarket 10. This may be attributed to the lack of high-quality property management and inactive real estate development.
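The tree edge removal idea behind the clustering in this subsection can be illustrated with a deliberately simplified sketch: build a minimum spanning tree over contiguous cells weighted by attribute dissimilarity, then repeatedly cut the edge whose removal gives the lowest total within-cluster sum of squares. This is not the tool used in the study and it ignores the cluster size constraint; all function names and the toy data are assumptions.

```python
import numpy as np

def mst_edges(X, adjacency):
    """Prim's algorithm on the contiguity graph; weight = attribute distance."""
    n = len(X)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in adjacency[i]:
                if j not in in_tree:
                    w = np.linalg.norm(X[i] - X[j])
                    if best is None or w < best[0]:
                        best = (w, i, j)
        if best is None:
            raise ValueError("contiguity graph is not connected")
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

def component_labels(n, edges):
    """Label the connected components of the forest defined by `edges`."""
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in neighbours[u]:
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def within_ssd(X, labels):
    """Total within-cluster sum of squared deviations."""
    return sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
               for g in np.unique(labels))

def skater_sketch(X, adjacency, k):
    """Cut k - 1 tree edges, each time choosing the cut with the lowest SSD."""
    X = np.asarray(X, dtype=float)
    edges = mst_edges(X, adjacency)
    for _ in range(k - 1):
        scores = [within_ssd(X, component_labels(len(X), edges[:e] + edges[e + 1:]))
                  for e in range(len(edges))]
        edges.pop(int(np.argmin(scores)))
    return component_labels(len(X), edges)

# Six cells in a row (hypothetical): each cell touches its left/right neighbour.
demo_X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
demo_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
demo_labels = skater_sketch(demo_X, demo_adj, k=2)
```

Because clusters are subtrees of the spanning tree, every resulting cluster is spatially contiguous by construction.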

Validation.
An OLS regression is conducted for each submarket to validate the model and to check whether the regression results of most submarkets improve, with a higher R². The OLS regression R² and adjusted R² of all samples are 0.34 and 0.32, respectively. Table 6 lists the OLS regression results of each submarket. Most of the regression results improve significantly, except for submarket 2, possibly due to the unbalanced and over-concentrated samples of this submarket. The classified submarkets capture the spatial heterogeneity of the housing market effectively, while showing that the basic condition, traffic, environment, and living condition of the communities affect the housing submarkets. Coefficients with absolute values greater than 0.20 are significant and highlighted in Table 6. The coefficients of the global variables, such as the distances to bus and subway stations, have similar influences on all the housing submarkets. In contrast, local variables, such as housing area, property management fee, and the distance to parks, show large differences. For submarkets 1 to 4 in the urban areas, the housing size and the distances to the subway, park, and popular facilities all influence the housing market characteristics. Meanwhile, the high-popularity living conditions of the communities are more significant than the basic condition variables. For submarkets 5 to 10 in the suburban zones, high-quality educational and medical resources are still important factors, although the influences of living conditions are clearly reduced. The allocation of public resources strongly affects the housing values and the distribution of the submarkets.
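The validation step compares the pooled fit with per-submarket fits. A minimal sketch of that comparison follows, assuming an ordinary least squares model with an intercept and the standard formulas for R² and adjusted R². The function names and the synthetic data are hypothetical and do not reproduce Table 6.

```python
import numpy as np

def ols_r2(X, y):
    """Fit OLS with an intercept and return (R², adjusted R²)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    if n - p - 1 <= 0:
        raise ValueError("too few samples for the number of predictors")
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = ((y - design @ coef) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj_r2

def r2_by_submarket(X, y, labels):
    """Run the same regression separately within each submarket label."""
    X, y, labels = np.asarray(X, float), np.asarray(y, float), np.asarray(labels)
    return {int(g): ols_r2(X[labels == g], y[labels == g])
            for g in np.unique(labels)}

# Synthetic data only: 60 samples, 2 predictors, 3 submarkets of 20 samples.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
labels_demo = np.repeat([1, 2, 3], 20)
y_demo = 1.0 + X_demo @ np.array([0.5, -0.3]) + rng.normal(scale=0.5, size=60)
pooled = ols_r2(X_demo, y_demo)
per_submarket = r2_by_submarket(X_demo, y_demo, labels_demo)
```

Since adjusted R² penalizes the number of predictors, it can never exceed R² for the same fit, which is a useful sanity check when reading regression tables.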
According to the above analysis, the structure of the housing market in Shenyang is unbalanced. This situation needs adjustment according to the characteristics of each submarket. The submarkets in the urban center show excessive concentration of public facilities, such as the commercial centers in submarkets 2 and 3, the medical resources in submarket 2, and the educational resources in submarkets 1 and 4. Urban renewal strategies may be adopted, including rehabilitation of existing old buildings and infrastructure and decentralization of high-quality facilities. High-tech and low-carbon industrial clusters may be constructed in the northern suburban areas due to lower costs. An urban green ecological corridor and transport facilities may be planned in the north of the city, in order to increase commuter numbers, infrastructure, and land prices. Several aspects should be considered before any specific plan is carried out, such as urban development policy, financial resource allocation, and land acquisition.

Discussion
In this article, we use geo-tagged social media data to describe human preference for urban facilities. We extract geographic patterns and regularities of commercial, medical, and educational facilities based on the visiting numbers and popularity degrees of different types of POIs in the core area of Shenyang. The spatial configuration and attractiveness of the commercial centers, schools, and hospitals are examined, helping urban planners with the reallocation and equalization of public resources. Although social media data are widely used, few studies have investigated how popular perceptions of critical urban hotspots affect the housing price. We select these hotspots associated with human preferences as independent variables to analyze spatial effects on the housing price, providing new options for classifying the housing submarket. The selection of the influential factors is an important consideration in this research. The literature review shows the existence of some common factors closely related to the housing price, such as building or house age, residential area, distance to the commercial center, distance to the river, traffic condition, living facility (supermarkets or shopping facilities), education facility, and distance to the park [10,21,43]. In addition, the green space ratio, property management, and the distance to sport facilities are chosen as independent variables. Following reference [21], the selected variables are divided into four categories: property structure, basic condition, traffic condition, and living condition. Beyond the common factors, the housing area, bedroom number, decoration condition, and orientation are added as new property condition variables; the property management fee, floor area ratio, and green ratio as new basic condition variables; and the shortest distance to the movie theater as a new living condition variable.
Factors may be classified into different categories; for instance, the distance to the park may be defined as a natural environment condition or a living condition. We initially tried to involve as many variables as possible in analyzing the housing price, such as the decoration condition or orientation. However, these values are too concentrated, possibly due to data insufficiency. To avoid multicollinearity, we finally select twelve factors as independent variables to explain the housing price, covering the basic condition, traffic, environment, and living condition of the community.
We then compare our results with the results of previous studies. The house age, floor area ratio, and distance to subway stations commonly show negative effects [44,45]. The property management fee and green ratio show the opposite trend, suggesting that houses near the subway with high-level property management and an elegant inner environment are more expensive [46]. The effects of the other variables vary across space. For instance, the residential area shows spatially varying effects. People prefer to choose small houses at higher prices when the house is near high-quality schools. People pay more attention to public facilities such as schools, supermarkets, and hospitals in the inner-city area than in suburban areas [47-49]. Moreover, the characteristics of parks and rivers have different effects on the housing price, where people prefer to live in an environment with a good view of green space and the river.
To further confirm the validity of this research framework, we compare the clustering based on the MGWR coefficients with a clustering based on the original data of the housing price and the factors, for which the optimal number of clusters is derived as 3. In order to compare the results with the same number of clusters, we still choose 10 as the cluster number, as shown in Figure 12. The differences in the housing prices between the north and south areas of Shenyang are observed in both results. Submarket 1, with the highest price, is in the southeast area, where the city government has recently relocated, attracting extensive interest. Submarket 2 mostly overlaps submarkets 1, 3, and 4 in Figure 11 and has the second highest housing price. Submarket 2 is located in the center of Shenyang with sufficient public facilities, but the houses are old and the green space is limited due to poor urban planning. Public facilities in the suburban submarkets 8, 9, and 10 are underdeveloped, similar to the submarkets in the suburban areas in Figure 11. The factors shown in Figure 12 exhibit less heterogeneity, with smaller intervals between each other, than the factors in Figure 11, leading to a smaller optimal number of clusters and contiguous areas in the southwest. Therefore, the classification of the Shenyang submarkets using the MGWR coefficients is more reasonable and more distinct than the results based on the original unprocessed data.
It is worth noting that, since the social media data are limited in this research, the scope of the housing price samples is small, and the validity of the results needs to be verified with more samples in the future. We will use this framework to investigate the spatiotemporal variation of the housing submarkets and to capture the spatial variability of the clusters. In addition, only hotspots for commercial, medical, and educational services, which are considered critical factors, are analyzed. We will explore and compare the relations between different types of POI hotspots, human preferences, and housing prices. More types of geo-tagged data, such as food and service comments from online websites or taxi GPS, may also be applied to analyze urban attractiveness, emotional maps, and preferences.

Conclusions
This research presents a comprehensive data-driven three-step framework to classify housing submarkets considering human preferences in Shenyang, China. This study uses POI saliences with the density-field hotspot detector to quantify human preference for public facilities. The MGWR model is applied to investigate spatial relations between the housing price and housing characteristics at multiple scales. The SKATER method is applied to classify the housing submarkets according to the MGWR results. The proposed framework is used to identify the degree of spatial clustering of similar factors affecting the housing price. The following conclusions are summarized:
(1) Human preferences for public facilities are spatially heterogeneous and impose different influences on the housing price. Social media data are used to quantify the popularity of public facilities, in order to analyze the influential factors on the housing price in a more accurate and objective manner. High-quality public facilities significantly impact the housing values at local scales; therefore, submarket classification should focus on the effects of the weights of POIs in Shenyang.
(2) The MGWR model provides a more realistic and scalable spatial process model, allowing each variable its own suitable bandwidth. The variables of housing size and the shortest distances to park and river impose effects at local scales. In contrast, the variables of housing age, floor area ratio, green space ratio, and the shortest distances to bus and subway stations demonstrate global effects on the housing price. Among all variables, the shortest distance to the bus stop is the least important factor. The spatial heterogeneity of the neighborhood environment and high-quality public facilities is more significant in Shenyang.
(3) The SKATER cluster analysis shows advantages in identifying the spatial characteristics of the housing submarkets based on the MGWR coefficients. A significant imbalance of the housing price exists in Shenyang, where the housing price in the city center is higher than the price in suburban areas. A similar imbalance also exists between the south and north areas of Shenyang.

Data Availability
Supplementary data to this article are available from the authors upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.