Studying the Regional Cyberspace by Exploiting Internet Sequential Information Flows

The study of cyberspace faces two challenges: data shortage and model verification. This paper proposes a method to explore regional cyberspace by employing Internet sequential information flows crawled from social network platforms. Whereas previous studies use only one type of data source for analysis, the main contribution of this manuscript is a scheme that uses one kind of Internet information flow to extract cyberspace features, while relevant data collected from another network platform is used for verification. Moreover, starting from a measurement of the informatization level of a region, a modified gravity model is designed by adding the value of the informatization level to the traditional model. Then, an information association matrix based on the improved gravity model is constructed for analyzing the characteristics of cyberspace. To demonstrate the approach, Fuzhou city is taken as the regional sample in this paper. The results indicate that the proposed approach is practical for studying regional cyberspace.


Introduction
By breaking through the limits of space and time, the relationships and interactive behaviors among humans have extended from realistic geo-space to cyberspace. Cyberspace is the communication and information space created by computers; it is an abstract concept in philosophy and computer science. Its object of study is information flow, whereas realistic geo-space is based on material flows.
Research on cyberspace has received considerable attention. Internationally, research on cyberspace mainly focuses on three aspects: (1) cyberspace security; (2) access control mechanisms and communication protocols in cyberspace; (3) spatial network patterns based on sequence information in cyberspace. For example, Clark [1-7] designed trustworthy mechanisms and access control models to ensure the security of cyberspace. Chawki [8] tried to find the balance between privacy and security. Iyer [9] focused on the "smart grid" and used cryptography and key management techniques to counter some attacks on cyber security. Wechsler [10] advanced new directions for cyber security using adversarial learning and conformal prediction in order to enhance network and computing services. Slonim [11] presented a novel sequential clustering algorithm motivated by the Information Bottleneck method. Prinzie [12] tried to overcome the inability to capture sequential patterns by modeling sequential independent variables with sequence analysis methods. Mcculloch [13] and Mishra [14,15] used sequential information flow to diagnose Swiss inflation in real time. Meanwhile, Tijsseling [16] presented a variant of the Categorization-and-Learning-Module network, which is capable of categorizing sequential information with feedback.
Copeland [17] studied the effect of sequential information arrival on asset prices. Mishra [18] utilized the Sequence and Set Similarity Measure with a rough-set-based similarity upper approximation clustering algorithm to group web users based on their navigational patterns. Lottes [19] proposed a novel crop-weed classification system that relies on a fully convolutional network with an encoder-decoder structure and incorporates spatial information by considering image sequences. In mainland China, the study of cyberspace mainly focuses on two aspects: (1) the relationship between cyberspace and realistic geo-space; (2) the characteristics of cyberspace in specific areas. For instance, based on the Internet infrastructure, Wang [20-22] discussed the relationship between the Internet geographical structure and the Internet urban system of China. Bakis [23], Dong [24], and Sun [25] conducted a comprehensive analysis of the hierarchical structure and information flow patterns and revealed the spatial distribution of the Internet network structure of China. By linking micro-blog users with geography, Zhen [26], Wang [27], and Chen [28] studied the centrality of nodes in the networks and the consistency of the whole network. Although Zhang [29-32] explored some methodologies for mining the relationship between geospace and cyberspace by using information flows, it is difficult to obtain the sequential information of cyberspace in mainland China without cooperating with the data owners (usually government agencies). Therefore, researchers often obtain data from various statistical yearbooks and annual public reports. In this paper, we focus on the use of sequence information crawled from social networks to analyze the pattern and characteristics of cyberspace.
Currently, there are two main shortcomings in prior research on cyberspace: (1) many scholars conducted their studies based on only one kind of sequential information flow, which makes their conclusions less convincing; (2) when exploring the linkages among the studied regions, they often directly use the classical gravity model, which ignores the attributes of the studied regions themselves. To address the first weakness, this paper adopts a scheme that uses one kind of Internet information flow (data from Sina micro-blog) to extract cyberspace features, while relevant data collected from another network platform (Baidu Index) is used for verification. To address the second shortcoming, a method is first proposed to measure the informatization level of a region, and then the classical gravity model is improved by introducing some attributes of the studied regions themselves; finally, an information association matrix is built based on the improved gravity model. By inputting the information association matrix into network analysis tools (e.g., UCINET software) and selecting an appropriate evaluation indicator (e.g., the degree centrality of nodes), the important nodes in the network space can be detected.

Indexes (Table 1): (1) Per capita GDP; (2) The number of express deliveries; (3) Income of express business; (4) Number of Colleges and Universities; (5) Length of optical cable line; (6) Number of IPv4 addresses; (7) Comprehensive coverage rate of TV programs; (8) The number of computers used by every hundred people; (9) The number of students in Colleges and Universities.
To assess the approach, Fuzhou city, the capital of Fujian province, is taken as the study region for verification. Specifically, we first use a crawler to collect information about Sina micro-blog users, such as their registration addresses and other basic information. Then, for those users whose registration address is Fuzhou, we also collect the information in their concern lists and concerned lists to analyze their social relationships. Based on the obtained data, the intensities of active connection, passive connection, and total connection are used to study the spatial pattern of cyberspace in Fuzhou city. To make the conclusions more convincing, data from the Baidu Index is used for verification. From the data collected from the Baidu Index, we obtain the number of times that one research unit is retrieved by another. Finally, we explore some possible factors that may affect the pattern and characteristics of cyberspace.

Measurement Method of Informatization Development Level of Provinces in Mainland China
The regional informatization level is one of the most important factors that may affect the spatial pattern of cyberspace, so in this section we propose a method for measuring the informatization level of a province.
We selected indicators that reflect the regional informatization level well according to the following steps.
Step 1. Obtain 186 indicators from the China National Information Center that can be used to describe the development level of the information society.
Step 2. Use word cloud analysis tools to count the frequency of the keywords contained in the 186 indicators. Then, the 42 indicators that contain the higher-frequency keywords are retained.
Step 3. The correlation coefficient and variation coefficient of each of the 42 indicators are calculated. Then, 17 indicators with poor correlation or high redundancy are eliminated.
Step 4. The KMO test and factor analysis are carried out on the remaining 25 indicators. Finally, 9 indicators (shown in Table 1) remain and are used to evaluate the level of information development in a region.
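The screening idea in Step 3 can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `screen_indicators`, the thresholds `min_cv` and `max_abs_corr`, and the rule "drop an indicator if it barely varies or if it is almost collinear with one already kept" are all assumptions chosen only to show how a variation coefficient and a correlation coefficient can drive such a filter.

```python
import math
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean (mean must be nonzero)."""
    return pstdev(values) / abs(mean(values))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def screen_indicators(data, min_cv=0.1, max_abs_corr=0.9):
    """Return the names of the indicators that survive the screening.

    data maps an indicator name to its values over the research units.
    Both thresholds are hypothetical defaults, not values from the paper.
    """
    kept = []
    for name, values in data.items():
        if coefficient_of_variation(values) < min_cv:
            continue  # barely varies across units, so it discriminates poorly
        if any(abs(pearson(values, data[k])) > max_abs_corr for k in kept):
            continue  # nearly collinear with a kept indicator, so it is redundant
        kept.append(name)
    return kept
```

Because indicators are examined in insertion order, the first of two redundant indicators is the one retained; a different tie-breaking rule would be equally defensible.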
After that, the Standard Deviation method, CRITIC (Criteria Importance through Inter Criteria Correlation), and the Entropy Weight method are used for calculating the weight of each indicator for each province. The calculation formulas of each method are shown as follows.
(1) Standard Deviation Method. Here, $w_j$ is the weight of indicator $j$ in the indicator system, $n$ stands for the number of research units, $V_{ij}$ represents the specific value of index $j$ for research unit $i$, and $\bar{V}_j$ stands for the arithmetic mean of index $j$.
(2) CRITIC (Criteria Importance through Inter Criteria Correlation) Method. Here, $w_j$ and $n$ have the same meanings as in formula (1), and $r_{ij}$ is the correlation coefficient between indicator $i$ and indicator $j$.
(3) Entropy Weight Method. Here, $p_{ij}$ stands for the normalized value of indicator $j$, and $n$ has the same meaning as in formula (1). Then, $W_j$ will be used as the final weight of indicator $j$. The value of $W_j$ can be calculated using formula (4).
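For orientation, the forms in which these three weighting methods are commonly stated are collected below. They are given as an assumption about what formulas (1)-(3) express, not as a transcription of them. The symbol $m$ (number of indicators) and the index $k$ (a second indicator index, used to avoid a clash with the unit index $i$) are introduced here for illustration only, and $p_{ij}$ is assumed to be normalized so that $\sum_i p_{ij} = 1$.

```latex
% Standard deviation method: weight proportional to the spread of indicator j
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(V_{ij}-\bar{V}_j\right)^{2}},
\qquad
w_j^{\mathrm{SD}} = \frac{\sigma_j}{\sum_{k=1}^{m}\sigma_k}

% CRITIC: spread multiplied by the conflict with the other indicators
C_j = \sigma_j \sum_{k=1}^{m}\left(1-r_{kj}\right),
\qquad
w_j^{\mathrm{CRITIC}} = \frac{C_j}{\sum_{k=1}^{m} C_k}

% Entropy weight method: weight proportional to 1 minus the normalized entropy
e_j = -\frac{1}{\ln n}\sum_{i=1}^{n} p_{ij}\ln p_{ij},
\qquad
w_j^{\mathrm{EW}} = \frac{1-e_j}{\sum_{k=1}^{m}\left(1-e_k\right)}
```

In all three cases the weights of the $m$ indicators sum to 1, which is what allows them to be combined into a single final weight afterwards.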
Finally, the composite scores of information for each province on these indicators could be calculated according to the following formula.
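A minimal Python sketch of the weighting and scoring pipeline is given below. It uses the common textbook forms of the three methods, assumes the input values are already normalized and non-negative, and assumes that the final weight is the plain average of the three method weights and that the composite score is a weighted sum; the last two choices are assumptions, since formulas (4) and (5) are not reproduced here. All function names are hypothetical.

```python
import math


def _columns(matrix):
    """Transpose a row-major matrix (rows = provinces, columns = indicators)."""
    return [list(col) for col in zip(*matrix)]


def _std(values):
    """Population standard deviation."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))


def _pearson(x, y):
    """Pearson correlation of two non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def _normalize(raw):
    """Scale non-negative raw weights so that they sum to 1."""
    total = sum(raw)
    return [r / total for r in raw]


def std_weights(matrix):
    """Standard deviation method: weight proportional to each indicator's spread."""
    return _normalize([_std(col) for col in _columns(matrix)])


def critic_weights(matrix):
    """CRITIC: spread times the summed conflict (1 - r) with every indicator."""
    cols = _columns(matrix)
    info = [_std(cj) * sum(1 - _pearson(ck, cj) for ck in cols) for cj in cols]
    return _normalize(info)


def entropy_weights(matrix):
    """Entropy weight method: weight proportional to 1 minus normalized entropy."""
    n = len(matrix)
    divergence = []
    for col in _columns(matrix):
        total = sum(col)
        shares = [v / total for v in col]
        entropy = -sum(q * math.log(q) for q in shares if q > 0) / math.log(n)
        divergence.append(1 - entropy)
    return _normalize(divergence)


def combined_weights(matrix):
    """ASSUMPTION: the final weight is the mean of the three method weights."""
    triples = zip(std_weights(matrix), critic_weights(matrix), entropy_weights(matrix))
    return [sum(t) / 3 for t in triples]


def composite_scores(matrix, weights):
    """ASSUMPTION: composite score = weighted sum of a province's indicator values."""
    return [sum(w * v for w, v in zip(weights, row)) for row in matrix]
```

A call such as `composite_scores(matrix, combined_weights(matrix))` then yields one score per province, which can be sorted to obtain a ranking. Constant indicators must be removed beforehand, because the correlation coefficient is undefined for them.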

Construction of Information Association Matrix Based on Improved Gravity Model
The gravity model is a widely used model for measuring spatial interaction capability; its formula is shown as follows:

$$F_{ij} = K \frac{M_i M_j}{d_{ij}^{\,b}} \qquad (6)$$

where $d_{ij}$ stands for the distance between unit $i$ and unit $j$, and $K$ and $b$ represent the coefficient of gravity and the distance attenuation coefficient, respectively, while the meaning of $M_i$ ($M_j$) varies between applications. For example, if we study the intensity of communication between two regions, $M_i$ ($M_j$) can be the number of calls made by the mobile users in the two regions. When using the above model, researchers usually focus only on the connection between nodes but ignore the attributes of the nodes.
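As a small illustration, the classical gravity model in its usual textbook form (the product of two node "masses" divided by the distance raised to the attenuation coefficient) can be written as the following sketch. The function name, parameter names, and default values are hypothetical.

```python
def classical_gravity(mass_i, mass_j, distance, k=1.0, b=2.0):
    """Interaction intensity between two units under the classical gravity model.

    mass_i, mass_j: application-specific "masses" of the two units
    distance:       distance between the units (must be positive)
    k:              coefficient of gravity (default is illustrative)
    b:              distance attenuation coefficient (default is illustrative)
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * mass_i * mass_j / distance ** b
```

Note that the expression is symmetric in the two units: swapping `mass_i` and `mass_j` gives the same value, so the classical form cannot by itself express a directed or asymmetric relationship.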
Assume the following cases (Figure 1).
(1) The interactions between  and  and the interactions between  and  occur at the same time periods.
(2) The number of times that  actively interacts with  is the same as the times that  actively interacts with , such that  =  = 10.
(3) The number of times that  actively interacts with  is different from the times that  actively interacts with , such that  ≠ ,  = 100,  = 1.
(4) The distance between  and  is the same as that between  and  ( = ).
Then, if we use the classical formula of the gravity model, we would conclude that the interaction intensity between  and  is the same as that between  and . This conclusion is clearly incorrect, because it ignores the essential attributes of the research objects.
In view of the above analysis, and combined with the actual situation, we modify the model as follows, where $R_{ij}$ stands for the intensity of information flow between province $i$ and province $j$, $K$ is the gravity coefficient of the information network, $b$ is the distance attenuation factor of information space, $d_{ij}$ is the shortest road distance between province $i$ and province $j$, and $C_{ij}$ represents the intensity of network concern between province $i$ and province $j$. In this paper, we use the parameter estimation method in Wang [31,32] and set the values of $K$ and $b$ to 0.85 and 1, while the value of $d_{ij}$ can be obtained from Baidu map. We take the average number of searches between corresponding provinces from January 1, 2016 to January 1, 2017 as the value of $C_{ij}$. Then, for the sake of comparison, we standardize the value of $R_{ij}$ using formula (8).
Finally, we construct the information association matrix, in which each entry $R'_{ij}$ stands for the standardized intensity of information flow between province $i$ and province $j$. Inputting this matrix into network analysis software, we can identify the important nodes in the network.

Research Object and Experimental Data
In this section, Fuzhou city, the capital of Fujian Province in mainland China, is used as an example for model verification. The sequential information flows collected from the Sina micro-blog platform and the Baidu website are applied to study the pattern and characteristics of cyberspace in Fuzhou. As the capital of Fujian province, Fuzhou city is a hometown of overseas Chinese people. Many overseas Chinese people of Fuzhou descent are distributed all around the world. Fuzhou and Taiwan face each other across the Taiwan Strait, and people in these two regions have close links with each other. As one of the important city nodes of the EZWCC, the development of Fuzhou has received attention from the local government and also from the Chinese government.
In this section, Fuzhou is chosen as the studied region, and then the pattern and characteristics of its cyberspace are analyzed.

Principal Data Acquisition and Preprocessing. In order to analyze the cyberspace pattern of Fuzhou more accurately, two kinds of actual network information flows were chosen as the principal experimental data.

Data about Sina Micro-Blog Users. Micro-blog is a completely open network interaction platform for public participation. Cyberspace research based on micro-blog reveals the communication characteristics among people more clearly and reflects the influence of information on human relationship networks more directly. The geographical attributes of micro-blog users provide the basis for associating cyberspace with realistic geo-space.
According to the report of QuestMobile [33], the number of monthly active users (MAU) of Sina micro-blog increased by more than 45% and reached 341 million by the end of 2016. Table 2 shows that the MAU of Sina micro-blog ranks first. Considering the availability of Sina micro-blog users' location information, Sina micro-blog is chosen as the principal data source. In this paper, the information of one thousand users (OTU) is collected. These users meet the following three conditions: (1) their registration addresses are Fuzhou; (2) they are ordinary users rather than celebrities or big V (verified users who have more than 500,000 fans and use the micro-blog mainly for commercial or personal propaganda rather than for socializing); (3) they are active users who each have one hundred to five hundred fans and who actively concern one hundred to five hundred other users. According to the report of CNNIC [34], the proportions of Internet users of different ages at the end of 2016 are shown in Table 3. Accordingly, in order to make the sample more reasonable, the numbers of sampled users in these age ranges are 234, 303, 232, and 231.
For the one thousand users, we not only collect their basic information (such as IDs, nicknames, sexes, registered addresses, character signatures, birthdays, marital status, and home links), but also obtain the registered addresses of the top 100 users in the OTU's concern lists and concerned lists. Finally, 92,015 users concerned about the OTU and 55,449 users that concern the OTU are found. Among these users, there are 10,451 pairs of friends.
There are three kinds of relationships between these users: active concern, passive concern, and mutual concern. A unilateral relationship (concern or be concerned) is a weak relationship, while mutual concern is a strong relationship. If user B is concerned about user A, then the information flows from A to B.
Three indicators are used to evaluate the intensity of the network information flows between Fuzhou city and other regions: the intensity of active connection (the number of concerns, $X_a$), the intensity of passive connection (the number of be concerned, $X_b$), and the intensity of total connection (the sum of concern and be concerned, $X_a + X_b$). The meanings of the three indicators are shown in Table 4. All of the top 100 users in the OTU's concern lists and concerned lists were classified and counted according to their registered addresses and relationships. The results are shown in Table 5.
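The counting behind the three indicators can be sketched as follows. The inputs are assumed to be two flat lists holding the registered province of every user found in the sampled users' concern lists and concerned lists, respectively; the function name, the input format, and the example province values are hypothetical.

```python
from collections import Counter


def connection_intensities(concern_provinces, concerned_provinces):
    """Tally X_a, X_b, and X_a + X_b per province.

    concern_provinces:   registered province of each user in the concern lists
    concerned_provinces: registered province of each user in the concerned lists
    """
    x_a = Counter(concern_provinces)    # active connection: number of concerns
    x_b = Counter(concerned_provinces)  # passive connection: number of be concerned
    provinces = sorted(set(x_a) | set(x_b))
    return {
        p: {"X_a": x_a[p], "X_b": x_b[p], "total": x_a[p] + x_b[p]}
        for p in provinces
    }


# Hypothetical toy data, not taken from Table 5.
table = connection_intensities(
    ["Beijing", "Beijing", "Shanghai"],
    ["Beijing", "Sichuan"],
)
```

A `Counter` returns 0 for a province that appears in only one of the two lists, so every province gets all three values.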
In this paper, the network information flows between Fuzhou city and other provinces in mainland China are studied. To make $X_a$, $X_b$, and $(X_a + X_b)$ comparable, maximum value standardization was carried out by using formulas (10):

$$X_a' = \frac{X_a}{\max(X_a)}, \qquad X_b' = \frac{X_b}{\max(X_b)}, \qquad (X_a + X_b)' = \frac{X_a + X_b}{\max(X_a + X_b)} \qquad (10)$$

Because the users in the concern lists and concerned lists are dominated by Fujian province, Fujian province is analyzed separately. When the values of $\max(X_a)$, $\max(X_b)$, and $\max(X_a + X_b)$ are selected, Fujian province is excluded.
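Maximum value standardization with Fujian left out of the maximum can be sketched as below. The sketch assumes that the excluded province is also omitted from the standardized output, since it is analyzed separately; the function name and the numbers in the example are hypothetical.

```python
def max_standardize(values, exclude=("Fujian",)):
    """Divide each province's value by the maximum over the non-excluded provinces.

    values:  mapping from province name to a raw intensity (e.g. X_a)
    exclude: provinces left out of both the maximum and the output (assumption)
    """
    peak = max(v for name, v in values.items() if name not in exclude)
    return {name: v / peak for name, v in values.items() if name not in exclude}


# Hypothetical raw counts, not taken from Table 5.
standardized = max_standardize({"Fujian": 500, "Beijing": 100, "Gansu": 25})
```

With this choice every standardized value lies in (0, 1], and the province with the strongest connection outside Fujian receives exactly 1.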

4.2.2. Data about Baidu Index. The Baidu Index (https://index.baidu.com/) is one of the most important data sharing and statistical analysis platforms. It records a large amount of data on Internet users' behavior. With its help, we can obtain the number of times that a specified keyword was retrieved in different areas and at different times, which can reflect the intensity of the connection between two regions to some extent [25, 35-39]. In this paper, the indexes between Fuzhou city and all the provinces in mainland China are obtained. The results are shown in Table 6.
Verifying the intensity of passive concern with the number of passive retrievals alone, or the intensity of active concern with the number of active retrievals alone, may be subject to considerable chance variation. Therefore, the sum of passive retrievals and active retrievals was used to verify the final intensity of total connection between two research units.

Spatial Difference Analysis on the Intensity of the Network Information Flows
Cyberspace Breaks through the Limitation of Time and Geographical Space. Table 5 shows that the cyberspace of Fuzhou is very wide; the users in both the concern lists and the concerned lists are distributed across all provinces in mainland China. As Fuzhou is a famous hometown of overseas Chinese people, there were also network information flows between Fuzhou users and some users abroad.
There Were Grade Differences in Cyberspace. Taking into account the actual situation of the study region and the data selected in this paper, all provinces in mainland China are divided into seven levels by the values of $X_a$, $X_b$, and $(X_a + X_b)$, and into five levels by the values of $X_a'$, $X_b'$, and $(X_a + X_b)'$. The results are shown in Figures 2(a), 2(b), and 2(c). Note that the solid line with an arrow in Figures 2(a) and 2(b) shows the direction of concern. For example, "Fujian←Xinjiang" means that Xinjiang takes the initiative to pay attention to Fujian, which also means that Fujian is paid attention to by Xinjiang. The solid line without an arrow in Figure 2(c) represents the total intensity of attention between two units. In all three figures, the thicker the solid line, the stronger the attention.
In general, there is an obvious grading phenomenon in the intensity of connections, such that, from the east to the west, the intensities become gradually weaker.

Table 4: Three indicators and their meanings.
Indicator name: Meaning
The intensity of active connection ($X_a$): Used to measure the active connection between Fuzhou and other research units; the greater the value, the greater the impact of the research unit on Fuzhou.
The intensity of passive connection ($X_b$): Used to measure the interest of other research units in Fuzhou; the larger the value, the more willing the unit is to accept information from Fuzhou.
The intensity of total connection ($X_a + X_b$): Used to measure the intensity of the total linkage between Fuzhou and other research units; the larger the value, the greater the intensity of interaction.

Regional Embedding Still Exists in Cyberspace. First, the results of the grade division show that although cyberspace breaks through the limits of time and geographical space, geographic distance still has some influence on the spatial pattern of cyberspace. The distance attenuation phenomenon also exists in cyberspace to some extent. Second, among the research units, Fujian province has the strongest connection with Fuzhou in terms of the intensity of passive connection and total connection. Although information technology has compressed space-time distance and expanded the scope of social communication, information connections within the local domain occupy the dominant position in Fuzhou's cyberspace because of geographical proximity and social and cultural similarity.
The Information Flow in Cyberspace Is Asymmetric. "Information potential" refers to the capability of picking up, using, transmitting, formulating, aggregating, and processing information. Differences in "information potential" lead to asymmetry of information flow. Regions are more likely to establish contact with other regions that have higher "information potential." For example, Beijing, Shanghai, Zhejiang, and Jiangsu are far less concerned about Fuzhou than Fuzhou is about them.

The Pattern of Cyberspace Can Be Affected by Population Flow (Labor Input and Output).
Taking Sichuan province as an example, the distance between Sichuan and Fuzhou is much greater than that between Jiangxi and Fuzhou, but the connection intensity between Sichuan and Fuzhou is stronger than that between Jiangxi and Fuzhou. This is primarily because Sichuan is a major labor-exporting province in mainland China. Most of these laborers have moved to Fuzhou, Xiamen, Quanzhou, and some other cities in Fujian province. Population flows inevitably bring about information flows.

The Pattern of Cyberspace Has a High Correlation with Regional Economic Development Pattern.
Excluding Fujian province, strong connections occurred between Fuzhou and some economically developed areas, such as Zhejiang, Shanghai, Guangdong, Jiangsu, and Beijing. This suggests that the pattern of cyberspace is also affected by the level of economic development, primarily because the developed areas have higher "information potential." The influence of "information potential" is sometimes even greater than that of geographic distance. Taking Beijing as an example, Fuzhou is much more actively concerned about Beijing than about Fujian province.
Then, the information association matrix is inputted into the UCINET software. If there is an information connection between two provinces, there will be a connection line between them. In this way, the information spatial association network diagram at the provincial level in mainland China is generated. In this paper, degree centrality is used to evaluate the importance of nodes in the information network. The higher the degree centrality of a node, the larger the node is drawn in the graph. Finally, we get Figure 3, which shows that economically developed provinces, such as Beijing, Shanghai, Guangdong, and Zhejiang, are usually the key nodes in cyberspace. Other provinces with relatively sluggish economic development tend to contact these key nodes more frequently.
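The role of degree centrality can be illustrated with a minimal sketch that works directly on an association matrix. It is not a reproduction of UCINET's procedure: it assumes that a tie exists whenever an entry exceeds a threshold in either direction, and it uses the normalized form (number of ties divided by the number of other nodes). The function name, the threshold, and the toy matrix are hypothetical.

```python
def degree_centrality(matrix, labels, threshold=0.0):
    """Normalized degree centrality of every node in an association matrix.

    matrix:    square list of lists; matrix[i][j] is the intensity from i to j
    labels:    node names in the same order as the rows
    threshold: entries above this value count as a tie (assumption)
    """
    n = len(labels)
    centrality = {}
    for i, name in enumerate(labels):
        degree = sum(
            1
            for j in range(n)
            if j != i and (matrix[i][j] > threshold or matrix[j][i] > threshold)
        )
        centrality[name] = degree / (n - 1)
    return centrality


# Hypothetical three-node example: A is tied to both B and C.
scores = degree_centrality(
    [[0.0, 0.5, 0.2],
     [0.5, 0.0, 0.0],
     [0.2, 0.0, 0.0]],
    ["A", "B", "C"],
)
```

In this toy network node A reaches every other node and therefore scores 1.0, which is the kind of node that would be drawn largest in a diagram sized by degree centrality.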

Verifying the Effectiveness of Grade Division.
All the analysis results in Section 5.1 are based on the grade division. Therefore, in this section, another kind of data (data from the Baidu Index) is used to verify the effectiveness of the grade division.
From Tables 5 and 6, the rankings of the number of total connections and total retrievals for each province can be obtained. The detailed rankings are shown in Table 7.
Figure 4 visualizes the information in Table 7. The correlation between the two discrete curves is then calculated according to formula (11), giving a value of 87.10%.
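One common way to quantify the agreement between two ranking lists is the Pearson correlation coefficient computed on the ranks, which coincides with Spearman's rank correlation when there are no ties. The sketch below uses it as an assumed stand-in for formula (11); the function name and the example rankings are hypothetical.

```python
import math


def rank_correlation(ranks_x, ranks_y):
    """Pearson correlation of two equally long, non-constant ranking lists."""
    n = len(ranks_x)
    mean_x = sum(ranks_x) / n
    mean_y = sum(ranks_y) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(ranks_x, ranks_y))
    var_x = sum((x - mean_x) ** 2 for x in ranks_x)
    var_y = sum((y - mean_y) ** 2 for y in ranks_y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rankings of three provinces under two data sources.
r = rank_correlation([1, 2, 3], [1, 3, 2])
```

A value close to 1 means the two data sources order the provinces almost identically, while a value near 0 means the orderings are unrelated.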
Although the two kinds of data produce slight differences in some provinces' rankings, the rankings of most provinces change little. The same conclusion can be reached by comparing Figure 2(c) with Figure 5. Therefore, the conclusions obtained from the first set of data in Section 5.1 have high credibility. Note that the provinces or cities in Figure 5 are classified according to the total intensity of mutual retrieval between them and Fuzhou city.

Analysis of the Possible Influential Factors.
In this section, some influential factors which can be quantified and may have impacts on the pattern of cyberspace in Fuzhou are analyzed.
Geographic distance, the "Internet plus" index, and the regional informatization development level are considered the most likely factors to affect the pattern of cyberspace. The "Internet plus" index is an important indicator of the level of regional development. It consists of four subindexes: "Internet plus infrastructure," "Internet plus industry," "Internet plus innovation," and "Internet plus smart city." Further, these subindexes consist of 14 first-class indexes and 135 second-class indexes. Its content covers social, news, video, cloud computing, and the 19 major subindustries of the three industries. It uses Tencent users' digital economic behavior as basic data and collects data from Didi Taxi, Meituan Dianping, Jingdong Mall, Ctrip, and some other Internet companies. Because the data are very comprehensive, the index can reflect the degree to which the Internet is combined with all walks of life and the ability to utilize the Internet.
Geographic distances between Fuzhou city and other research units are measured with the help of Baidu map, the values of the "Internet plus" indexes and the rankings for each research unit are obtained from T. R. Institute [40], and the informatization development level of each research unit is calculated with the method proposed in Section 2. These factors are shown in Tables 8-10.
SPSS (Statistical Product and Service Solutions) software is used to analyze the correlation between the rankings of the three factors and the rankings of the connection intensities. The analysis results are shown in Table 11.
Table 11 shows that there are high correlations between the connection intensity and the "Internet plus" index, as well as between the connection intensity and the informatization level. The "Internet plus" index has the highest positive correlation with the intensities. The results of the correlation analysis also show that there is a certain correlation between the distance factor and the connection intensity. However, distance is no longer the primary factor. Some other factors, such as the government's economic policies and cultural similarities, can also affect the connection intensity, but they are difficult to quantify.

Conclusion
This paper points out two shortcomings of existing work in the field of cyberspace: (1) researchers conduct their studies based on only one kind of sequential information flow, which makes their conclusions less convincing; (2) studies of the cyberspace pattern based on sequential information flow usually employ the classical gravity model directly but ignore the attributes of the studied objects themselves. To overcome these weaknesses, we argue that it is necessary to use different kinds of sequential information flows to analyze the problem and that the attributes of the study objects themselves should be considered when the gravity model is used. Accordingly, we proposed a method for measuring the informatization level of a region and improved the classical gravity model by adding the score of the informatization level to the classical formula. We then constructed an information association matrix based on the improved gravity model. Finally, we took Fuzhou city as our study object and focused on its cyberspace characteristics. The experiments in this paper are conducted on two kinds of sequential information flows: data about Sina micro-blog users and data from the Baidu Index. According to our experimental results, the following conclusions can be drawn. First, cyberspace breaks through the limit of geographical distance and has a wider range of communication. Second, there is an obvious grade difference in cyberspace: from the east of China to the west, the intensities of the total connections decrease gradually. Third, the social communication mode of realistic geo-space has also been carried into cyberspace, such that local domain information still occupies the dominant position in cyberspace. Fourth, the economically developed provinces are usually the principal nodes in the network information space. Fifth, the information flows in cyberspace are asymmetric. Areas with low information potential are easily attracted by areas with higher information potential. Economically backward areas are more willing to establish active contacts with economically developed areas.
Many factors can affect the pattern of cyberspace. A comprehensive analysis of these factors would allow the characteristics and future of cyberspace to be understood more accurately.

Figure 1: The interpretation of relevant variables in the scenario assumption.

4.1. Brief Introduction of the Research Object. On May 6, 2009, the State Council of the People's Republic of China issued the "opinions on supporting Fujian province to speed up the construction of the Economic Zone on the West Coast of China (EZWCC)." As an important part of China's coastal economic zone, the EZWCC is separated from Taiwan by the Taiwan Strait and lies between the Yangtze River Delta to the north and the Pearl River Delta to the south. It occupies an important position in the layout of regional economic development. The EZWCC is composed of Fujian, Guangdong, Zhejiang, and Jiangxi provinces. Fujian province is the most closely related to Taiwan because of their geographical proximity and historical and cultural similarities. With this unique advantage, Fujian province occupies the dominant position in the EZWCC.

Figure 2: Grade classification at provincial level.

Figure 3: The information spatial association network diagram at provincial level in mainland China.

Figure 4: Correlation between total connection rankings and total retrieval rankings.

Table 1: Detailed information of the nine indicators.

Table 2: The value list of Social App in 2016.

Table 3: The proportions of Internet users of all ages.

Table 5: Classification statistics of the users at provincial level. Columns: Provinces; Concern ($X_a$); Be Concerned ($X_b$); Total ($X_a + X_b$).

Table 6: Baidu indexes between Fuzhou and all the provinces in mainland China.

Table 7: Total connection rankings and total retrieval rankings.

Table 8: Distances between Fuzhou and other provinces.

Table 9: "China Internet plus" indexes 2016 for each province.

Figure 5: Grade division based on total retrieval.

from Didi Taxi, Meituan Dianping, Jingdong Mall, Ctrip, and some other Internet companies. Because the data are very comprehensive, it can reflect the degree of combination of Internet and all walks of life and the ability of the Internet utilization.

Table 10: Composite scores and rankings for each province.

Table 11: Pearson correlation at provincial level.