Complexities in Financial Network Topological Dynamics: Modeling of Emerging and Developed Stock Markets

1Center of Cyberspace and Security, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China 2School of Management and Economics, University of Electronic Science and Technology of China, Chengdu 610054, China 3Department of Physics, University of Fribourg, Chemin du Musée 3, CH-1700 Fribourg, Switzerland 4Department of Computer Information Systems and Supply Chain Management, Walker College of Business, Appalachian State University, Boone, NC 28608, USA 5Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA 6Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China


Introduction
The visualization of networks and the study of hierarchical structures are essential to understanding complex systems such as financial markets. Thanks to the significant development of complex network science [1], quantitative methods and models have been applied to the network structure of financial markets. In financial network analysis, entities such as assets, stocks, markets, companies, and institutions are modeled as vertices, while their mutual relationships are abstracted as edges. This approach enables industry professionals and researchers to reveal hidden information embedded in the topological structures of financial networks, such as market dynamics, trading activities, and investment sentiment. This information is essential for evaluating and monitoring financial market risks, contagion, distress propagation, and market mode shifts. Financial network analysis has been used in applications such as portfolio management, trading, market regulation, stress testing, and risk management.
The USA and China are the two dominant economies, with influence over the global economy. The two economies are similar in market size. However, the US economy is well established and developed, while the Chinese economy is emerging and still developing rapidly. As the leading economic powerhouses, the health and stability of these two economies are vital for the prosperity of the world economy. During the past few years, both countries suffered a series of stock market disasters, such as the 2008 US subprime crisis and the 2007 and 2015 Chinese stock market bubble bursts. These dramatic market crashes brought widespread and long-lasting negative impacts on economies and markets. The stock markets of the two countries also differ in history, regulation, maturity, and scale. Thus, it is essential to understand the properties of these two markets using a data-driven approach. Recently, various major markets in the world have been investigated using the financial network analysis approach. However, there is still a lack of systematic studies dedicated to comparing the network structures and properties of the US and Chinese stock markets using this approach.
To understand how the two markets differ in their structures, topological properties, and market dynamics, this research investigates the markets using a dataset spanning nine years. The stock markets are modeled as multiple networks, including hierarchical trees, minimum spanning trees, planar maximally filtered graphs, and asset graphs. Their detailed topological properties are analyzed and systematically compared. Through quantitative analysis and network visualization, the results show that the two markets differ in many ways. This provides insights for regulators into the structures and dynamics of stock markets from the perspective of network science.
This paper is organized as follows. First, Section 2 gives a background on the theory of complex networks and financial network analysis and introduces the relevant complex network parameters. Then, in Section 3, the data and the methods used to construct the networks are described. Section 4 presents the network properties. In Section 5, the detailed hierarchical structures of both stock networks are investigated and compared. Finally, conclusions and discussions are presented in Section 6.

Literature Review of Financial Network Analysis
Network science has become an innovative tool widely used in studies of complex systems in a variety of engineering and scientific domains [2][3][4]. Network modeling methodologies and theoretical frameworks have produced informative and useful empirical discoveries [5]. Studying statistical properties such as the degree distribution, average path length, and clustering coefficient helps describe network topologies and the dynamics of network evolution. Furthermore, it is possible to study information spreading, network stability, and phase changes and, hopefully, to predict and control network dynamics [6]. Time series data can be modeled as networks [7,8]. For price time series [9], correlation matrices can be calculated for a group of assets [10][11][12]. From the correlation matrices, financial network analysis can be applied to construct networks for further analysis and data mining [13][14][15][16][17]. In most of the existing literature, assets are treated as vertices, while the interconnectivity relationships are modeled as pairwise edges among assets. The correlation matrices are not only important for network analysis and topological visualization but also serve as a bridge between financial network analysis and traditional finance theories. This is similar to modern portfolio theory (MPT) [18,19], which is based on the correlation relationships among assets. Network-based portfolio selection has been proposed for optimization and shown empirically to be workable [20].
Since the minimum spanning tree approach was first used in the study of stock market structure [21], financial network analysis has grown into an essential tool for financial big data. However, this fast-growing field is still at an early stage [22]. Financial network analysis provides a new perspective on evaluating market stability, market risk, shock propagation, and contagion [23][24][25]. The connectedness among assets plays a critical role in the phase transition of market contagion [26], which is similar to the tolerance properties of other, nonfinancial complex networks [27]. Further research reveals that an intermediate level of risk diversification can enhance market robustness [28]. The importance and risk contribution of companies can be identified through network analysis [29,30]. Systemic risk and stability can also be evaluated according to the topological properties of the financial network, providing implications for market regulation [22,31,32]. By investigating the clustering of assets, portfolio optimization can be achieved with better ratios of predicted to realized risk [33]. Network analysis reveals the overlapping of portfolios as one primary factor in market contagion [34]. In another approach, risk spillover networks are constructed to study the behaviors of financial institutions [35]. Instead of a single-layer approach, a multiple-layer network is built to analyze and quantify banking system risk [36]. Regression models can also take network structure into consideration as a factor in the resilience and robustness of markets [37]. This financial network approach opens interesting new possibilities and considerably enriches regression models in finance studies. Recently, there has been a thread of studies on major players in financial networks, such as 'too interconnected to fail' institutions [38], 'too central to fail' [39], and 'too big to fail' [40]. This research demonstrates that financial network analysis brings new insights to finance studies and benefits to finance practice.
With the rise of quantitative trading, the causality and lead/lag relationships revealed by financial network analysis can be particularly interesting for trading strategy design [41,42]. Many studies have reported stylized evidence that the network structure has a profound influence on asset returns [43]. Taking risk into consideration, it has been found that investing in the peripheries of financial networks might generate better risk-adjusted returns [44]. Furthermore, industry professionals may be inspired by financial network analysis to seek price movement signals for potential predictions [45].
Many major individual financial markets around the world have also been studied with the network approach, such as those of the US [62,63], China [64][65][66], Germany [67,68], the EU [69][70][71], Brazil [33,72], Italy [53,56], Korea [73], Russia [74], and Mexico [36,75]. Furthermore, there is literature focusing on cross-border global markets [76,77]. Using partial data, networks of global markets are reconstructed, and methods are compared [56]. Bayesian graphical models are applied to identify groups of countries that are major contributors to systemic risk according to the behavior of global banks [78]. For the European markets, the risk and contagion channels are studied, and the results show that the EU markets are vulnerable to risks [71]. The global stock exchange network is investigated to evaluate the attractiveness of exchanges for IPOs [79]. A recent study uses transfer entropy to study a selection of major individual stocks around the world and reveals that stocks are clustered according to their countries and industries [80]. By looking into the structure of the global financial network, it is possible to gain new insights into the international business cycle [81]. Diversification and participation are investigated for various economies [82].
Given the large number of assets in financial markets, the initial networks have a massive number of edges. By filtering out the noise, financial networks can be significantly simplified to enable advanced analyses such as principal component analysis (PCA) [76] and random matrix theory (RMT) [83,84] that further extract hidden patterns. The hierarchical tree (HT) [21,68], minimum spanning tree (MST) [85], planar maximally filtered graph (PMFG) [86], asset graph (AG) [87,88], and partial correlation network [15,89,90] are the major approaches applied in filtering financial networks. Mantegna [21] first introduced the minimum spanning tree method into the study of hierarchical structures in financial markets. With the filtered network, we can study the topological structure of a market or a portfolio. In this research, we adopt these frameworks to study the correlations and the corresponding networks of the stock markets of China and the United States and to systematically examine how the two markets behave differently.

Data and Research Methods
3.1. Indices of CSI300 and S&P500. We study the stock markets of China and the United States: the former is a typical representative of emerging countries, with a fast-growing GDP and increasing influence on the global economy, while the latter is the most established and developed economy in the world. To study the major stocks of each market, we focus on the component stocks of the major indices of the two stock markets, i.e., the China Securities Index 300 (CSI300) for the Chinese stock market and the Standard & Poor's 500 (S&P500) for the US market. Our study covers a period of nine years starting on 04/01/2007 and ending on 06/11/2015, with 2149 trading days for CSI300 and 2228 trading days for S&P500. The two markets have different numbers of trading days because they follow different trading calendars. The daily price data of the CSI300 index and all of its component stocks are retrieved from the CSMAR Solution Database of Shenzhen GTA Education Tech. Ltd. We download the daily price data of the S&P500 index and its component stocks through the Yahoo Finance service. Since not all stocks are traded on every trading date, we select only those CSI300 stocks with at least 2000 trading dates and without 100 or more consecutive nontrading dates; this selection results in a final set of 163 stocks. For S&P500, we select those stocks with at least 2100 trading dates, which gives 468 stocks. After the stock selection, we fill each nontrading date with the price on the closest available trading date. In Figures 1(a) and 1(b), we plot the daily close prices and the daily log returns of the CSI300 index over the study period from 04/01/2007 to 06/11/2015, with 2149 trading days. In Figures 1(c) and 1(d), we plot the daily close prices and the daily log returns of the S&P500 index over the same study period, with 2228 trading days. From the figures, we see that the two markets show large fluctuations over the last nine years. CSI300 experienced huge market crashes in 2008 and 2015, while S&P500 kept climbing almost continuously after the 2008 financial crisis.
3.1.1. CSI300. China has two independent stock exchanges, i.e., the Shanghai stock exchange and the Shenzhen stock exchange. Opened at the beginning of the 1990s, with only 25 years of trading history, the two exchanges have grown into important markets playing vital roles in China's financial system and economy. Among the many stock market indices, the China Securities Index 300, or CSI300, was introduced by the China Securities Index Company, Ltd. in 2005 with a base of 1000 on 31/12/2004. CSI300 includes a set of 300 stocks as index components; they have the largest market values and are actively traded on the Shanghai or Shenzhen stock exchanges. CSI300 has become a widely accepted benchmark for evaluating the behavior of the whole stock market in China, as well as a good basis for derivative products. Starting from the base of 1000 points at the end of 2004, CSI300 had reached 3793 points as of 06/11/2015 [91]. To give an image of the Chinese stock market, we plot the 2149 daily close prices and daily log returns of the CSI300 index in the study period between 04/01/2007 and 06/11/2015 in Figures 1(a) and 1(b). In the past nine years, CSI300 experienced two major market crashes, in 2007-2008 and in 2015, during which the market suffered huge losses and fluctuations. There are 163 CSI300 component stocks included in our dataset, as shown in Table 1, in which we summarize the numbers of these 163 stocks across all 20 industry sectors. As shown, all industry sectors from Agriculture to Comprehensive are represented. For convenience, we refer to these 163 stocks as CSI163 in the following parts.

The S&P500 is considered one of the best benchmarks for the US financial markets and economy. Starting from less than 100, after more than 50 years of development, the S&P500 reached 2099 on 06/11/2015 [92]. We plot the 2228 daily close prices and log returns of the S&P500 index in our study period between 04/01/2007 and 06/11/2015 in Figures 1(c) and 1(d). We can observe that the S&P500 index suffered a major crash between 2008 and 2009 and then recovered almost steadily with minor fluctuations. After the selection, there are 468 S&P500 component stocks included in our dataset, as shown in Table 2, where we summarize the numbers of these 468 stocks across all ten industry sectors. As shown, all industry sectors from Energy to Utilities are represented. For convenience, we refer to these 468 stocks as S&P468 in the following parts. Table 3 gives a summary of the two datasets, CSI163 and S&P468; CSI163 has a larger standard deviation $\sigma$ of the log returns, indicating larger fluctuations than S&P468. We use $\langle x \rangle$ to denote the average of a variable $x$ in this paper.

Price Returns and Correlations.
From the time-stamped price time series of a basket of stocks, it is possible to calculate the correlation for any pair of stocks once a time window is given. Let $P_i(t)$ be the price at time $t$ of stock $s_i$. It could be any one of the daily open, close, high, or low prices. Following most of the literature, we choose the most commonly used daily close price. To smooth the fluctuations without loss of generality, the logarithmic return of $s_i$ over the period $[t - \Delta t, t]$ is defined as

$$r_i(t) = \ln P_i(t) - \ln P_i(t - \Delta t) \qquad (1)$$

and is usually used instead of $P_i(t)$ itself. In most cases, daily log returns are used, where $\Delta t = 1$. For a stock pair $s_i$ and $s_j$, we can extract the two price time series in a time window of length $T$, i.e., with $T$ price values included in the window. The selection of $T$ is expected to meet the requirement $T/N > 1$, where $N$ is the number of stocks. In a sliding window approach, we calculate the correlation coefficient between the two return series within each window as

$$\rho_{ij} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sqrt{\left(\langle r_i^2 \rangle - \langle r_i \rangle^2\right)\left(\langle r_j^2 \rangle - \langle r_j \rangle^2\right)}} \qquad (2)$$

where $\langle \cdots \rangle$ stands for the average. The value of $\rho_{ij}$ ranges between -1 and 1. A negative value $\rho_{ij} < 0$ indicates that the two stocks fluctuate in an anticorrelated manner, i.e., one falls while the other climbs. For a positive value $\rho_{ij} > 0$, the two stocks fluctuate in a positively correlated way; in this case, they move in the same direction. If $\rho_{ij} \approx 0$, they are not correlated. If $|\rho_{ij}| \approx 1$, the two stocks are almost perfectly correlated or anticorrelated. In a stock market, stocks from the same industry are more likely to be correlated. For a portfolio of $N$ stocks $s_1, s_2, \cdots, s_N$, we can calculate all $N \times N$ correlation coefficients $\rho_{ij}$ for any $s_i$ and $s_j$. These $N^2$ values can be expressed as a correlation coefficient matrix $C$ of size $N \times N$.
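The return and correlation calculations described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names and the toy prices are invented for the example, and the correlation is assumed to be the usual Pearson coefficient computed from window averages.

```python
import math

def log_returns(prices):
    """Daily log returns r(t) = ln P(t) - ln P(t-1) of one price series."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

def mean(values):
    return sum(values) / len(values)

def correlation(ri, rj):
    """Pearson correlation of two equal-length return series (window averages)."""
    mi, mj = mean(ri), mean(rj)
    cov = mean([a * b for a, b in zip(ri, rj)]) - mi * mj
    var_i = mean([a * a for a in ri]) - mi * mi
    var_j = mean([b * b for b in rj]) - mj * mj
    return cov / math.sqrt(var_i * var_j)

def correlation_matrix(price_series):
    """N x N matrix of pairwise correlations of log returns."""
    returns = [log_returns(p) for p in price_series]
    n = len(returns)
    return [[correlation(returns[i], returns[j]) for j in range(n)] for i in range(n)]

# Toy closing prices for three hypothetical stocks (not real market data).
prices = [
    [10.0, 11.0, 10.5, 12.0, 11.5],
    [20.0, 22.0, 21.0, 24.0, 23.0],  # always twice the first series
    [5.0, 4.5, 4.8, 4.2, 4.4],       # moves against the first series
]
C = correlation_matrix(prices)
```

In this toy example the first two series have identical log returns, so their coefficient is 1 up to rounding, while the third series moves against the first and gives a negative coefficient.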
Based on the correlation matrix $C$, we can define the distance $d_{ij}$ between the stock pair $s_i$ and $s_j$ as

$$d_{ij} = \sqrt{2\,(1 - \rho_{ij})}. \qquad (3)$$

The values of $d_{ij}$ form a symmetric adjacency matrix $D$, in which there are $N(N-1)/2$ distinct off-diagonal entries. Alternatively, in an edge ranking approach, we keep only a certain number of the top edges with the strongest relationships, say $N-1$ edges. With this approach, the remaining edges are more likely to form loops among strongly connected vertices, and the resulting network is referred to as an asset graph [93].
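As a rough illustration of the distance transformation and the edge ranking idea, the following sketch assumes the commonly used correlation distance d = sqrt(2(1 - rho)), which ranges from 0 to 2, and keeps the N - 1 shortest edges as an asset graph. The helper names and the small correlation matrix are hypothetical.

```python
import math

def distance_matrix(corr):
    """Map correlations in [-1, 1] to distances in [0, 2] via sqrt(2 * (1 - rho))."""
    n = len(corr)
    # max(0.0, ...) guards against tiny negative values caused by rounding.
    return [[math.sqrt(max(0.0, 2.0 * (1.0 - corr[i][j]))) for j in range(n)]
            for i in range(n)]

def asset_graph(dist, n_edges=None):
    """Keep the n_edges shortest edges (default N - 1); loops are allowed."""
    n = len(dist)
    if n_edges is None:
        n_edges = n - 1
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    return edges[:n_edges]

# Hypothetical 3 x 3 correlation matrix.
corr = [
    [1.0, 0.8, -0.5],
    [0.8, 1.0, 0.0],
    [-0.5, 0.0, 1.0],
]
D = distance_matrix(corr)
AG = asset_graph(D)
```

Strongly correlated pairs end up close together and anticorrelated pairs far apart, so ranking by distance is the same as ranking by correlation from strongest to weakest.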

Network Filtering.
By filtering edges with a threshold approach, we may get isolated vertices or loops in the filtered network. To avoid this, tree approaches such as the minimum spanning tree (MST) can be used to remove edges while still keeping all vertices connected as a tree. The MST was first introduced to investigate the hierarchical structure of stock networks by Mantegna [21]. Many studies have used this approach since; for example, Jang et al. investigate the foreign exchange market in periods of currency crises and find that the values of the correlation coefficients decrease while the normalized tree length increases during crises [94]. Matteo et al. find that the dynamical planar maximally filtered graphs (PMFGs) preserve the same hierarchical structure as the dynamical MST and that the financial sector plays the central role in the network [47]. As an application of network analysis in portfolio management, Onnela et al. suggest that the assets of the classic Markowitz portfolio are always located on the outer leaves of the tree [88], and Pozzi et al. further suggest that it is better to invest in the peripheries of the MST of a market [44]. In [85], MST networks extracted from real correlation data are compared with those generated from artificial random models. The results reveal that the properties of the MST from real data cannot be reproduced, showing the uniqueness of real stock networks.
Based on the network $G = (V, E)$, we can extract a tree connecting all vertices with $N-1$ edges and a minimum total distance, known as the minimum spanning tree (MST) of the stocks. By using only $N-1$ edges out of the maximum $N(N-1)/2$ edges, the network is dramatically simplified, or filtered, while keeping the most important shortest edges. To extract the MST, Kruskal's algorithm is applied in three steps: (1) we rank all edges according to their distances from the shortest to the longest; (2) in each round, we add the shortest remaining edge to the MST unless it would create a loop; (3) we repeat step (2) until all vertices are connected by $N-1$ edges [95]. Bonanno et al. review the MST approach for revealing information about markets [96].
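The three steps of Kruskal's algorithm can be written compactly with a union-find structure for the loop check. The sketch below is a generic textbook version, not the authors' code; the function name and the 4 x 4 distance matrix are made up for illustration.

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on a symmetric distance matrix; returns (i, j, d) edges."""
    n = len(dist)
    parent = list(range(n))  # union-find forest used to detect loops

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Step 1: rank all edges from the shortest to the longest.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    # Steps 2 and 3: add the shortest edge that does not close a loop, repeat.
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j, d))
            if len(tree) == n - 1:
                break
    return tree

# Hypothetical distances between four stocks.
D = [
    [0.0, 0.6, 1.2, 1.4],
    [0.6, 0.0, 0.9, 1.3],
    [1.2, 0.9, 0.0, 0.7],
    [1.4, 1.3, 0.7, 0.0],
]
mst = minimum_spanning_tree(D)
```

For this toy matrix the tree consists of the edges (0, 1), (2, 3), and (1, 2); the edge (0, 2) is skipped because it would close a loop.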
Using the MST, we can construct the hierarchical tree (HT), in which the subdominant ultrametric distance $d^{<}_{ij}$ is defined as the maximum distance of an edge along the MST path between $v_i$ and $v_j$. The HT satisfies the first two rules together with a stronger third one:

$$d^{<}_{ij} \le \max\{d^{<}_{ik},\, d^{<}_{kj}\}.$$

With this ultrametric inequality, we can construct a hierarchical tree based on an MST and present a unique topological structure of the stocks [21,97]. By loosening the requirements of the MST to allow loops and cliques of up to 4 vertices while still forbidding edge crossings, as many as $3(N-2)$ edges can be retained, with all $N-1$ edges of the MST contained as a subgraph. This new network, which can be drawn on a planar surface without link crossings, is called the planar maximally filtered graph (PMFG) [47,86,[98][99][100][101]. This makes the PMFG different from the MST and lets it show richer structures of the network. The construction of the PMFG is similar to that of the MST: we first rank the edges in ascending order of distance. Then we add the shortest edges to the graph one by one while keeping the genus $g = k$, where the genus $g$ is the largest number of simple closed curves one can draw on the surface without separating it. For the case of $k = 0$, when all edges have been considered, the PMFG is obtained [86]. It has also been proved that an MST is a subgraph of a PMFG and that the numbers of 3- and 4-cliques in a PMFG are $3N - 8$ and $3N - 4$, respectively [102].
Since the PMFG contains more edges and allows loops and cliques, there is more information embedded in the PMFG than in the MST. Since its introduction into the study of the network structures of stocks, the PMFG has been used in studies of many stock markets, and more recently it has been applied in investment strategy design [44]. Based on the PMFG, a clustering approach called the directed bubble hierarchical tree (DBHT) has been proposed; it shows good performance compared with other algorithms and has also been applied to financial data [103]. It has been reported that, in a running window approach, the PMFG shows stronger long-run stability than the MST [104].
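To make the subdominant ultrametric distance concrete, the sketch below derives it from a list of MST edges. It relies on a standard observation: if the tree edges are processed from shortest to longest, the edge that first joins the clusters of two vertices is the longest edge on the tree path between them. The function name and the example tree are illustrative assumptions.

```python
def ultrametric_from_mst(n, tree_edges):
    """Subdominant ultrametric distances from MST edges given as (i, j, d)."""
    members = {v: [v] for v in range(n)}  # cluster label -> vertices in cluster
    label = list(range(n))                # vertex -> current cluster label
    U = [[0.0] * n for _ in range(n)]
    for i, j, d in sorted(tree_edges, key=lambda e: e[2]):
        a, b = label[i], label[j]
        # Every pair split across the two merging clusters gets distance d.
        for u in members[a]:
            for v in members[b]:
                U[u][v] = U[v][u] = d
        for v in members[b]:
            label[v] = a
        members[a].extend(members.pop(b))
    return U

# Hypothetical MST on four stocks: path 0 - 1 - 2 - 3.
mst = [(0, 1, 0.6), (2, 3, 0.7), (1, 2, 0.9)]
U = ultrametric_from_mst(4, mst)
```

Here the path from stock 0 to stock 3 runs over edges of length 0.6, 0.9, and 0.7, so their ultrametric distance is 0.9. Merging clusters in this order is also what produces the dendrogram of the hierarchical tree.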

Stock Network Topological Properties
A network $G = (V, E)$ is a graph composed of a set of vertices $V$ and a set of edges $E$. In a network model, the participants are represented as the vertices $V$, and the relationships among them are represented as the edges $E$. Depending on how the correlations are calculated, there are two approaches, static and dynamic. In the static approach, the correlations are calculated over the whole period using all available prices. We thus get a single static correlation matrix that describes the market regardless of the different market periods. When sliding windows are used in the dynamic approach, we get a sequence of correlation matrices. The static approach, which is the one most used in the literature, gives a static description of the structure of the market without the details of different market periods such as bear markets or bull markets. The dynamic approach, however, can reveal the evolution of market structures and behaviors, which is especially useful for comparing calm periods and crashes.
In this part, we present the topological properties of the stock networks of the two markets, CSI163 and S&P468, using the dynamic approach. To meet the requirement $T/N > 1$, we set the sliding window size to $T_{\mathrm{CSI163}} = 170$ for CSI163 and $T_{\mathrm{S\&P468}} = 500$ for S&P468. In total, there are 2149 windows for CSI163 and 2228 windows for S&P468. After calculating the log returns for both CSI163 and S&P468 using equation (1), we calculate the correlation coefficient matrices over the period between 12/09/2007 and 06/11/2015 for CSI163 and between 26/12/2008 and 06/11/2015 for S&P468 using equation (2). Based on the correlation matrices, it is straightforward to obtain the stock networks. For CSI163, we have a network of 163 vertices, and for S&P468, we have a network of 468 vertices. The edge connecting two stocks indicates how strongly the two stocks are correlated or anticorrelated. For a positive correlation coefficient, the two prices move in the same direction, while for a negative value, the two prices move in opposite directions. To map all correlation coefficients to nonnegative values that can serve as edge distances, we adopt the definition of distance in equation (3). Through this definition, all negative values are transformed into positive distance values, and the order of the values is preserved. The vertices in the networks of both markets are fixed. However, the edges vary from one sliding window to the next as the correlation coefficients change. In the following parts, we investigate how the statistical properties of both networks evolve over our study period.
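A sliding window over a return series can be enumerated as index pairs, as in the sketch below. The step of one trading day is an assumption made for the example, since the step is not stated in the text; the helper names are hypothetical.

```python
def sliding_windows(series_length, window_size, step=1):
    """Yield (start, end) index pairs of each window; end is exclusive."""
    for start in range(0, series_length - window_size + 1, step):
        yield start, start + window_size

def window_is_long_enough(window_size, n_stocks):
    """Check the requirement that the window length exceeds the number of stocks."""
    return window_size / n_stocks > 1

# Toy example: a series of 10 observations and a window of 4.
windows = list(sliding_windows(10, 4))
```

With the window sizes quoted in the text, `window_is_long_enough(170, 163)` and `window_is_long_enough(500, 468)` both hold, so each window contains more observations than there are stocks.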

Degree and Degree Distribution.
For a network of $N$ stocks, the correlation matrix has $N \times N$ entries, corresponding to up to $N(N-1)/2$ distinct edges, which is a huge number for large $N$. We therefore normally filter out the weakest edges to simplify the network. In the threshold approach, a threshold $\theta$ is used to remove any edge with $d_{ij} > \theta$. For a given network, different values of $\theta$ lead to different structures with the same vertices but different sets of remaining edges. Based on the correlation matrices, we first investigate the stock networks with different $\theta$ for both CSI163 and S&P468. In the sliding window approach, using the daily log return time series, we first calculate the correlation matrices of size 163 × 163 for CSI163 and 468 × 468 for S&P468; then we average all the correlation matrices over the study periods. With the averaged correlation matrices, we apply the edge filtering process for different $\theta$ based on equation (4). A small $\theta$ close to 0 filters out most edges, while a larger $\theta$ close to the maximum value of 2 keeps most edges. We use $\theta$ in the interval [0.1, 1.5] with a step of 0.1. We present the basic network properties for different $\theta$ between 0.1 and 1.5 in Table 4 for the CSI163 network and in Table 5 for the S&P468 network. The maximum possible numbers of edges are 13203 for CSI163 and 109278 for S&P468, respectively. For a given distance threshold $\theta$, any edge whose distance is greater than the threshold is filtered out. So with a smaller $\theta$, only a few edges remain in the network, and this results in a smaller edge density, a smaller average degree $\langle k \rangle$, and a smaller average distance $\langle d_{ij} \rangle$, minimum distance $d^{\min}_{ij}$, and maximum distance $d^{\max}_{ij}$ as well. In Figure 2, we plot the edge densities of CSI163 and S&P468 for different thresholds $\theta$ from 0.1 to 1.5. In the interval from 0.1 to 0.6, the densities of both networks are close to 0, meaning that nearly all edges are filtered out, while in the interval from 1 to 1.5, the densities are close to 1, meaning that nearly all edges are preserved. Between these two intervals, the two curves have a similar shape, rising steeply when $\theta$ lies between 0.6 and 1. This indicates that most edges have distances within this interval. A similar edge density distribution is also reported in [65]. That study shows that stock networks also demonstrate a similar transition interval.
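The threshold filtering and the edge density discussed above can be sketched as follows. The threshold is written `theta` here, an edge is kept when its distance does not exceed the threshold, and the density is the fraction of the N(N-1)/2 possible edges that remain. The function names and the toy distance matrix are assumptions for illustration.

```python
def threshold_filter(dist, theta):
    """Keep the edges (i, j) whose distance does not exceed the threshold theta."""
    n = len(dist)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] <= theta]

def max_edges(n):
    """Number of distinct vertex pairs, N(N-1)/2."""
    return n * (n - 1) // 2

def edge_density(n, edges):
    """Fraction of all possible edges that survive the filter."""
    return len(edges) / max_edges(n)

# Hypothetical distances between four stocks.
D = [
    [0.0, 0.6, 1.2, 1.4],
    [0.6, 0.0, 0.9, 1.3],
    [1.2, 0.9, 0.0, 0.7],
    [1.4, 1.3, 0.7, 0.0],
]
densities = {theta: edge_density(4, threshold_filter(D, theta))
             for theta in (0.1, 0.8, 1.5)}
```

As a consistency check, `max_edges(163)` gives 13203 and `max_edges(468)` gives 109278, which match the maximum edge counts quoted for the two networks.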
We investigate the degree distributions of both networks with different $\theta$. When $\theta$ is either too small or too large, the degree distributions are noisy, while in a narrow interval around 0.7 the distributions follow a power law. The regression fitting curve is a straight line in the plot of $\log P(k)$ against $\log k$, where $\log P(k)$ is the base-10 logarithm of the probability that a vertex has degree $k$ and $\log k$ is the base-10 logarithm of the degree. We plot the typical power law distributions in Figure 3(a) for CSI163 and Figure 3(b) for S&P468, respectively. For both distributions, we fit the log-log distribution and obtain fitting slopes of −0.9935 for CSI163 and −1.2323 for S&P468, i.e., power law exponents $\gamma = 0.9935$ and $\gamma = 1.2323$ in $P(k) \sim k^{-\gamma}$. In the plots, we use the same parameter value of 20 to calculate the probabilities for different degrees. As shown in Figure 3, a large number of vertices have small degrees, and only a few vertices have large degrees. As the vertices are stocks and the degrees are rooted in the correlations between stocks, in both the CSI163 and S&P468 networks only a few stocks are highly correlated with most of the remaining stocks. These stocks have a wider and larger influence over the whole network, while stocks with relatively small degrees are less correlated with other stocks and thus have limited influence over the network. The negative fitting slope also indicates that both the CSI163 and S&P468 networks are scale-free networks, in which a small portion of vertices have larger degrees while a large portion of vertices have smaller degrees. This agrees with previous studies [105].

Figure 3: Log-log degree distributions of the CSI163 and S&P468 networks. By using different $\theta$, we can filter out edges with larger distances. We find that not all the filtered networks demonstrate power law degree distributions. Only when $\theta$ falls within a narrow interval around 0.7 do the filtered networks follow a power law degree distribution. In Figure 3(a), for the CSI163 network, we use $\theta = 0.68$ and get a fitting line with a slope of −0.9935. In Figure 3(b), for the S&P468 network, we use $\theta = 0.75$ and get a fitting line with a slope of −1.2323. This indicates that the degree distributions of both stock markets follow a power law of the form $P(k) \sim k^{-\gamma}$, which also means the networks are scale free: a small portion of vertices have larger degrees, while a large portion of vertices have small degrees.
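A straight-line fit on log10-log10 axes, as described for the degree distributions, can be sketched with an ordinary least-squares regression. This minimal version works on the raw degree frequencies without any binning, which is a simplification relative to the procedure in the text; the function names and the synthetic distribution are invented for the example.

```python
import math
from collections import Counter

def degree_distribution(degrees):
    """Empirical probability P(k) of each positive degree k."""
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def fit_power_law(pk):
    """Least-squares line through (log10 k, log10 P(k)); returns (slope, intercept)."""
    xs = [math.log10(k) for k in pk]
    ys = [math.log10(p) for p in pk.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic distribution that follows k^(-1.5) exactly (unnormalized).
synthetic = {k: k ** -1.5 for k in (1, 2, 4, 8, 16)}
slope, intercept = fit_power_law(synthetic)
```

For the synthetic input the fitted slope recovers −1.5, so the power law exponent is the negative of the slope. On real degree data, a least-squares fit of this kind is sensitive to noise in the tail, which is one reason binning is often applied first.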

Average Clustering Coefficient.
The average clustering coefficient $\langle C \rangle$ is the average of the clustering coefficients $C_i$ of all vertices. The clustering coefficient $C_i$ indicates the transitivity around an individual vertex $v_i$, while the overall average $\langle C \rangle$ is an indication of the transitivity and density of the whole network. In Figure 4, we present the average clustering coefficient $\langle C \rangle$ of both the CSI163 and S&P468 networks in comparison with random networks. For the stock networks, $\langle C \rangle$ grows with $\theta$, since a larger $\theta$ preserves more edges, while it remains almost unchanged, with only a slight increase, in both random networks. Compared with random networks of the same sizes, 163 × 163 for CSI163 and 468 × 468 for S&P468, $\langle C \rangle$ of both stock networks is significantly larger than that of the corresponding random networks. For CSI163, $\langle C \rangle$ is on average 4.9574 times that of the random networks, with a maximum of 11.8903 times. For S&P468, the average multiple is 5.2305 and the maximum multiple is 10.4180 compared with the random networks. This shows that both stock networks are well connected, with higher transitivity than random networks. This result agrees with many previous studies.
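The local and average clustering coefficients can be computed directly from an adjacency structure. The sketch below uses the common definition (the fraction of a vertex's neighbor pairs that are themselves connected) and, as an assumed convention, assigns 0 to vertices with fewer than two neighbors. The names and the toy graph are illustrative.

```python
def adjacency(n, edges):
    """Build a dict of neighbor sets for an undirected graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj

def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are connected to each other."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here for vertices of degree 0 or 1
    links = sum(1 for a in neighbors for b in neighbors if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = adjacency(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
avg_c = average_clustering(adj)
```

In the toy graph, vertices 0 and 1 have coefficient 1, vertex 2 has 1/3, and vertex 3 has 0, so the average is 7/12.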

Average Path Length.
Unlike the clustering coefficient, which is a local property, the path length is a typical global property: for any two vertices $v_i$ and $v_j$ in a network, the number of edges along the shortest route linking the two vertices is defined as the characteristic path length $l_{ij}$ in [2]. By averaging the lengths over all possible pairs, we can calculate the average path length $\langle L \rangle$. As an indication of how well the network is connected, many real networks have a small $\langle L \rangle$ compared with random networks. In Figure 5, we plot the average path length $\langle L \rangle$ of both the CSI163 and S&P468 networks in comparison with random networks of the same sizes. The two stock networks are significantly different from the corresponding random networks of the same sizes, 163 × 163 and 468 × 468. The flat curves of $\langle L \rangle$ for the random networks stay almost unchanged with $\theta$, which is a result of the homogeneous edge distribution across the whole network. For the stock networks, in contrast, there are peaks. To the left of the peak, $\langle L \rangle$ declines as $\theta$ decreases, since when $\theta$ gets too small, most edges are filtered out, the giant network breaks into small parts, and $\langle L \rangle$ within the small parts decreases significantly. To the right of the peak, $\langle L \rangle$ gets smaller as $\theta$ increases, due to the increasing connectivity as more and more edges are preserved. This shows that the stock networks of both CSI163 and S&P468 are different from random networks.
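For an unweighted network, shortest path lengths can be found with a breadth-first search from every vertex. The sketch below averages over connected pairs only, which is an assumption; it matches the situation described above, where a heavily filtered network breaks into small parts and the average is taken within those parts. The names and the toy graph are hypothetical.

```python
from collections import deque

def bfs_lengths(adj, source):
    """Number of edges on the shortest path from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def average_path_length(adj):
    """Mean shortest path length over all connected ordered pairs of vertices."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_lengths(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy graph: a path 0 - 1 - 2 - 3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
apl = average_path_length(adj)
```

For the four-vertex path, the six pairs have lengths 1, 2, 3, 1, 2, and 1, giving an average of 10/6.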

Betweenness Centrality.
The betweenness  V  of vertex V  is defined as the number of shortest paths passing V  , which is an indication of the importance of an individual vertex in the contribution to the global connectivity.By averaging over the betweenness of all vertices, we can compare the betweenness for any two vertices.Larger betweenness means great global influence of the stock networks.This is the same to the edges.The betweenness    is the number of shortest paths passing the edge   indicating the importance of this edge for its contribution to the global connectivity.In this study, we focus on the vertex betweenness.In the calculation of the shortest paths, we can use the original distance   defined in equation (3) or simplify the network as a binary network according to  In the former, edges with different distances have different contributions to the paths, while, for the latter, all edges of nonzero distance are normalized as 1 and treated equally with great scarifying of original distance information.As shown in Figure 6, we plot the average betweenness ⟨⟩ of CSI163 and S&P468 for both binary case and weighted case under different  in Figures 6(a) and 6(b) respectively.For binary network case, all edges with positive distances are normalized as unit 1, while, in weighted networks, the original distances are directly used in the calculation of shortest paths.The shapes for binary and weighted networks are different.There are peaks for both stock networks in binary networks, while ⟨⟩ gets larger with  from almost zero to large numbers in weighted networks.
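For the binary (unweighted) case, vertex betweenness can be computed with Brandes' algorithm, sketched below. Note one assumption that goes beyond the wording above: when several shortest paths connect a pair of vertices, this standard formulation credits each intermediate vertex with its fractional share of those paths rather than a raw count. The function name and the toy graph are illustrative.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm for an undirected, unweighted graph (dict of neighbor sets)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # vertices in order of increasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back to s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2.0 for v, b in bc.items()}

# Toy graph: a path 0 - 1 - 2 - 3.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
B = betweenness(adj)
average_B = sum(B.values()) / len(B)
```

On the four-vertex path, the two inner vertices each lie on two shortest paths between other vertices and the end vertices on none, so the average betweenness is 1. The weighted case would replace the breadth-first search with a shortest path search on the distances.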
We visualize the stock networks of CSI163 and S&P468 for thresholds of 0.6, 0.7, 0.8, and 0.9 in Figures 7(a)-7(d) and 8(a)-8(d), respectively. The networks are dramatically simplified at small thresholds, while more edges are preserved at larger thresholds. As listed in Table 4, the edge density of CSI163 grows dramatically from 0.0209 at a threshold of 0.6 to 0.7994 at a threshold of 0.9, and, as shown in Table 5, the edge density of S&P468 also grows dramatically, from 0.0067 at a threshold of 0.6 to 0.4609 at a threshold of 0.9. All networks in this paper are generated using the Pajek complex network software [106].
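As a small sketch of how the edge density depends on the threshold, the code below counts the retained edges of a hypothetical distance matrix and also recomputes the growth factors implied by the densities quoted above. The function name `edge_density`, the parameter `theta`, and the matrix `toy` are assumptions for illustration; we assume an edge is kept when its distance is at most the threshold.

```python
def edge_density(dist, theta):
    """Fraction of the N(N-1)/2 possible edges whose distance is <= theta."""
    n = len(dist)
    kept = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if dist[i][j] <= theta
    )
    return kept / (n * (n - 1) / 2)

# Hypothetical distance matrix for four stocks (illustration only).
toy = [
    [0.0, 0.5, 0.9, 1.2],
    [0.5, 0.0, 0.6, 1.1],
    [0.9, 0.6, 0.0, 0.7],
    [1.2, 1.1, 0.7, 0.0],
]
print(edge_density(toy, 0.7))

# Growth factors between thresholds 0.6 and 0.9, from the densities
# reported in Tables 4 and 5.
csi_ratio = 0.7994 / 0.0209
sp_ratio = 0.4609 / 0.0067
print(round(csi_ratio, 2), round(sp_ratio, 2))
```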

Components.
A component is a subnetwork of the whole network whose vertices are connected. For a given network with N vertices, the size of a component can range from 1, for an isolated vertex, to N, when all vertices are connected. When an individual vertex is disconnected from all other vertices, it forms the smallest component by itself. When all the vertices are connected without any isolated vertices, the network is a single giant component. In a stock network, the stocks within a component are correlated with each other, while stocks belonging to different components are not correlated. The component structure of a stock network therefore has important implications for the risk management of a portfolio. Since the stocks that fall in the same component are correlated, it is a bad idea to invest mostly in stocks from the same component; we should invest in stocks from different components to diversify the risk of the whole portfolio. When the threshold is small, most edges are filtered out, leaving many vertices isolated, and we see a large number of small components emerge. As the threshold grows, more and more edges are preserved, connectivity increases, and larger components appear. In Figure 9, the properties of the components of the CSI163 and S&P468 networks under different thresholds are presented. For the two networks, the number of components (red), the maximum component size (green), and the average component size (blue) show similar patterns and change with the threshold. Critical changes are obvious for both networks in the threshold interval of about [0.3, 0.8] for CSI163 and [0.5, 1.1] for S&P468. Before the transition interval, most vertices are isolated because the whole network breaks into small components, and both the maximum and the average component sizes are small. In the transition interval, the number of components decreases while both the maximum and the average component sizes increase. After the transition interval, the three properties stay unchanged, since the giant connected component has appeared and the maximum and average sizes equal the total number of vertices. A similar transition of the component properties with the threshold is also observed in a study of a set of Chinese stocks [65], with a reported critical value of about 0.17.

Figure 6: Average betweenness of CSI163 and S&P468, calculated for both the original weighted approach and the binary simplification approach under different thresholds. The average betweenness for binary networks is different from that for weighted networks. For binary networks, the curves for CSI163 and S&P468 share a similar shape. To the left of the peak, the average betweenness gets larger as the threshold increases, because more edges and more vertices are preserved, which leads to a growing number of paths. To the right of the peak, a large connected network emerges, leading to a small average betweenness. In other words, the importance of a single vertex or edge is weakened in well-connected networks (large thresholds).
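The three component properties discussed above can be computed with a union-find structure. The following is a minimal sketch under the assumption that an edge is kept when its distance is at most the threshold and that isolated vertices count as components of size 1; `component_stats`, `theta`, and the matrix `toy` are hypothetical names and data.

```python
def component_stats(dist, theta):
    """Return (number of components, maximum size, average size) of the
    network that keeps edge (i, j) when dist[i][j] <= theta."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= theta:
                parent[find(i)] = find(j)

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    count = len(sizes)
    return count, max(sizes.values()), n / count

# Hypothetical distance matrix for four stocks (illustration only).
toy = [
    [0.0, 0.5, 0.9, 1.2],
    [0.5, 0.0, 0.6, 1.1],
    [0.9, 0.6, 0.0, 0.7],
    [1.2, 1.1, 0.7, 0.0],
]
for theta in (0.4, 0.5, 0.7):
    print(theta, component_stats(toy, theta))
```

As the threshold rises, the number of components falls from four to one while the maximum and average sizes grow to the total number of vertices, which is the transition described in the text.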
To investigate how industry sectors are connected in the stock networks, we summarize the properties of the CSI163 and S&P468 networks at a threshold of 1.0 in Tables 6 and 7, respectively. In the CSI163 network, the industry sectors are almost the same in average degree and average clustering coefficient, but differ significantly in average betweenness. The similar average degrees and clustering coefficients indicate that all sectors are connected and clustered to a similar extent, while the different average betweenness values show that the sectors contribute to the global connectivity differently. It is worth mentioning that the finance and insurance sector has the largest average clustering coefficient, 0.9829, but a relatively small average betweenness of only 118.4000. For the S&P468 network, as shown in Table 7, the Financials sector has the largest average degree, 421.7471, and the third largest average betweenness, 1297.8161, with a smaller average clustering coefficient of 0.8975, which is very different from the CSI163 network. Furthermore, Energy and Industrials have the largest values of ⟨⟩ and ⟨⟩, while Consumer Staples and Telecommunication Services have the smallest ⟨⟩ and ⟨⟩. From this, we also observe that, for the S&P468 network, the sectors with larger ⟨⟩ are likely to have smaller values of ⟨⟩ and vice versa. The findings indicate that the US market is dominated by Financials, while finance and insurance play a relatively less influential role in the Chinese stock market.

By focusing on the top stocks, it is possible to look into the details of the networks. In Table 8, for the CSI163 network at a threshold of 0.8, we present the top 10 stocks with the largest degree and the largest betweenness, ranked in descending order in the upper part and the lower part, respectively. The stock code, company name, industry name, and the values of degree, clustering coefficient, and betweenness are listed. Younger Group, a leading fashion brand in China, has the largest degree, 133, and HuDong Heavy Machinery, a major machinery manufacturer in China, has the largest betweenness, 5444. In the S&P468 network, as shown in Table 9, T. Rowe Price Group has the largest degree, 266, and Loews Corp. has the largest betweenness, 13202; both stocks are in the Financials sector. For both stock networks, the two lists based on degree and betweenness are similar. In other words, the top stocks with the largest degrees also appear as the top stocks with the largest betweenness. This indicates that degree and betweenness are consistent in describing the importance of an individual vertex, since the higher the degree of a vertex, the more likely it is to lie on shortest paths. It is worth noting that the lists based on the ranking of clustering coefficients are dramatically different from those based on degree or betweenness. As indicated in the two tables, the stocks with codes in bold appear on both top 10 lists, and in fact the rest of the stocks on one list can also be found in a similar ranking position on the other list. We can also observe that stocks belonging to the Metals & Nonmetals, Machinery, and Pharmaceuticals industries dominate the two top 10 lists in the CSI163 network, whereas for the S&P468 network, Financials, Industrials, and Materials stocks make up most of the two lists. In the Chinese stock market, an emerging market, the Industrials sector has great influence, while the Financials sector has greater influence in the US stock market, which agrees with [10]. This significant difference between the two stock markets confirms previous studies with similar results, which indicate that Industrials is the most influential of all industry sectors in the Chinese market, while the financial sector has weaker influence [107]. This is consistent with our previous results revealed in Tables 6 and 7.

Hierarchical Structures of Stock Networks
Mantegna introduced the minimum spanning tree and hierarchical clustering methods into the study of financial networks [21], in which a distance matrix is built from the correlation matrix of all stocks. Based on the distance matrix, the minimum spanning tree is extracted. Since the minimum spanning tree connects all vertices in a single spanning tree with the shortest total length, it is also possible to extract the hierarchical clustering tree from it, where the distance between two vertices is the subdominant ultrametric distance $d^{<}(i, j)$, defined as the maximum edge distance along the shortest path between the two vertices [108]. However, the subdominant ultrametric distance approach loses much edge distance information, because two vertices that are only indirectly connected on the minimum spanning tree, with a large subdominant ultrametric distance, might be directly linked in the original distance matrix. Vertices which should be clustered together might therefore be separated in a hierarchical clustering tree based on the ultrametric distance. To preserve the hierarchical structure of the minimum spanning tree while keeping more information by allowing loops and cliques, the Planar Maximally Filtered Graph (PMFG) is proposed in [86]. Based on PMFG, the influence of different sectors of CSI300 is studied, revealing that the industrial sector is the dominant part of the whole market [107]. In [109], the hierarchical tree structures of multiple industry indices in China are investigated before and after a crisis, showing the structural changes around the crisis period. A similar study of the impact of the global financial crisis on stock markets shows that the Turkish market is less influenced [110]. The authors of [11] propose to use the Kullback-Leibler distance for the filtering procedures. International real estate market networks in different countries are studied in [99], revealing that markets are clustered according to geographical locations. Instead of using the methods of [21], a typical approach is applied to extract the hierarchical structure of the German stock market in [68]. Using the industry classification information as the benchmark, the authors of [103] compare the methods used to extract the clusters in financial networks. In [111], the Neighbor-Net approach is applied, in which more distance information is used in the construction of the tree compared to hierarchical clustering.

Since a sliding window approach is utilized, with a window size of $w$ trading days in a study period of $T$ trading days, we obtain a sequence of $T - w + 1$ trading windows. Our study period from 04/01/2007 to 06/11/2015 contains 2149 trading dates for CSI163 and 2228 for S&P468. As adopted in the previous parts, we set the sliding window size to 170 for CSI163 and 500 for S&P468, which gives 1980 windows for CSI163 and 1729 windows for S&P468. For each sliding window $t$, we obtain the distance matrix $D^{t}$ of each market, where $t = 1, \cdots, N_w$ and $N_w$ denotes the number of windows. To investigate the structures of the two markets over the study period as a whole, we calculate the averaged distance matrix by averaging all elements over all sliding windows as

$$\langle D \rangle = \frac{1}{N_w} \sum_{t=1}^{N_w} D^{t}. \qquad (7)$$

With this averaged distance matrix, we construct the hierarchical trees, minimum spanning trees, and asset graphs, and we study the evolution of the properties of the minimum spanning trees and asset graphs for both CSI163 and S&P468.
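The window arithmetic and the element-wise averaging described above can be checked with a few lines. This is a minimal sketch; the function names `window_count` and `average_matrix` and the two small matrices are hypothetical and serve only as an illustration.

```python
def window_count(trading_days, window):
    """Number of sliding windows of a given size in a study period."""
    return trading_days - window + 1

def average_matrix(matrices):
    """Element-wise average of a list of equally sized square matrices."""
    n_windows = len(matrices)
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / n_windows for j in range(n)]
        for i in range(n)
    ]

# Window counts for the two datasets, from the trading days and
# window sizes stated in the text.
print(window_count(2149, 170))  # CSI163
print(window_count(2228, 500))  # S&P468

# Two hypothetical 2 x 2 distance matrices from consecutive windows.
d1 = [[0.0, 1.0], [1.0, 0.0]]
d2 = [[0.0, 3.0], [3.0, 0.0]]
print(average_matrix([d1, d2]))
```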

Hierarchical Tree.
In the study of the stock market or a portfolio, individual stocks belonging to different economic sectors behave in a correlated way. From the price information, a correlation matrix can be formed, and from the correlation matrix a distance matrix can be derived. Using the distance matrix, clustering algorithms can be applied to extract the clustering structures of the stocks. Stocks falling in the same cluster behave similarly and share correlated price fluctuations, while stocks from different clusters are less similar to each other than stocks of the same cluster. The main objective of clustering algorithms is to minimize the dissimilarity of stocks in the same cluster while maximizing the dissimilarity of stocks in different clusters. Since the dissimilarity is naturally measured by the distance, the definition of the distance between clusters is important for clustering algorithms. Four distance definitions, shown in equations (8)-(11), are used in extracting the hierarchical clustering trees. The distance between two clusters is defined as the minimum distance over all pairs as in equation (8), the maximum distance over all pairs as in equation (9), the average distance over all pairs as in equation (10), and the distance between the average centroids of the two clusters as in equation (11), respectively.
In our study, we use all four definitions of cluster distance. For the CSI163 network, we present the dendrogram hierarchical cluster trees in Figures 10(a)-10(d). For the S&P468 network, we present the trees in Figures 11(a)-11(d). In these trees, the leaf nodes are individual stocks, and the height at which two branches merge indicates the distance or dissimilarity between the two clusters or stocks. The higher they merge, the larger the distance is; similar clusters or stocks merge at a lower height. To color similar stocks, a color threshold of 0.7 times the maximum linkage distance is used, so that all similar clusters or stocks are drawn in the same color. By adjusting this color threshold, we can obtain the clusters from the dendrogram hierarchical cluster trees. As shown in the figures, different definitions produce different hierarchical cluster trees, and it is obvious that Figures 10(b) and 10(c), in which the distance between clusters is the largest distance over all pairs and the average distance over all pairs, respectively, reveal more details of the structures. A similar effect is also observed in Figures 11(b) and 11(c) for the S&P468 network. These clustering results are found to agree very well with the classifications of the stocks: the colored clusters are composed of stocks mostly from the same economic sectors. Although there are exceptions, in which stocks from different sectors are clustered together or stocks from the same sector fall in different clusters, it is still striking that stocks can be clustered in agreement with the economic sector classification using price information only. These results indicate that hierarchical cluster trees constructed from the price correlation matrix can reveal economic sectors, which has potential applications in portfolio selection and risk management.

Figures 10 and 11: Dendrogram hierarchical cluster trees of the CSI163 and S&P468 networks under the cluster distance definitions of equations (8)-(11). The color threshold is 0.7. Stocks whose linkage values are less than this threshold are drawn in the same color, one color per cluster. As shown in the figures, different distance definitions extract different dendrogram hierarchical cluster trees, so the same color threshold generates different results. Again, the largest-distance definition reveals more details of the network.
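For reference, the four cluster distances described in words above correspond to the standard single, complete, average, and centroid linkage definitions. The block below is a sketch in our own notation, which may differ from the symbols of equations (8)-(11): $C_p$ and $C_q$ denote two clusters, $d_{ij}$ is the distance between stocks $i$ and $j$, and $x_i$ is an assumed coordinate vector of stock $i$, needed only for the centroid definition.

```latex
\begin{align}
d_{\min}(C_p, C_q) &= \min_{i \in C_p,\; j \in C_q} d_{ij}, \\
d_{\max}(C_p, C_q) &= \max_{i \in C_p,\; j \in C_q} d_{ij}, \\
d_{\mathrm{avg}}(C_p, C_q) &= \frac{1}{|C_p|\,|C_q|}
    \sum_{i \in C_p} \sum_{j \in C_q} d_{ij}, \\
d_{\mathrm{cen}}(C_p, C_q) &= \lVert \bar{x}_p - \bar{x}_q \rVert,
    \qquad \bar{x}_p = \frac{1}{|C_p|} \sum_{i \in C_p} x_i .
\end{align}
```

Because the maximum and average definitions take every pair of the two clusters into account, they tend to separate clusters more sharply than the minimum definition, which is consistent with the richer structure seen in panels (b) and (c).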

Minimum Spanning Tree.
For a given undirected weighted network with N vertices, we can simplify the network by extracting its backbone, which connects all vertices with a minimum total edge length. This backbone is called the minimum spanning tree, or MST for short. Since loops are not allowed, an MST has a tree topology with N − 1 edges, which is a dramatic simplification of the original network, which may have up to N(N − 1)/2 edges. This brings great advantages to the study of stock networks by reducing noise and simplifying the computation.
A minimum spanning tree can easily be constructed from a given network using Kruskal's Algorithm [95], in which all edges are ranked in ascending order of length. Starting from the shortest edge on the ranking list, we add edges to the tree one by one, skipping any edge that would introduce a loop. When all edges have been considered, we obtain the final minimum spanning tree, which connects all N vertices with N − 1 edges of minimum total length. For a network in which all edges have distinct lengths, the extracted MST is unique. In Figure 12, we demonstrate the process of extracting the minimum spanning tree from a six-vertex network following Kruskal's Algorithm. The edges are ranked in ascending order, and we start from the shortest ones and add them to the tree, omitting the edges which might introduce loops; after considering all edges, we get a minimum spanning tree with a minimum total length. In this example, edges (3,6) and (6,4) are omitted because (3,6) might bring a loop of (3,6,1) and (6,4) might bring a loop of (6,4,2,5). Another widely used algorithm is Prim's Algorithm [112], which begins with a starting vertex and, among all edges connecting the existing tree to a vertex outside it, adds the shortest one. By repeating this greedy step, we can extract the minimum spanning tree of the given network. In this research, we apply Kruskal's Algorithm to analyze the network structures of CSI163 and S&P468.

Figure 12: The extraction of a minimum spanning tree from a six-vertex network using Kruskal's MST Algorithm. We rank all edges in ascending order of edge length. Starting from the shortest edge, we add edges to the tree while avoiding loops; after considering all edges, we get a final tree connecting all vertices with the minimum total edge length. In our example, after adding edges (1,3), (1,6), (2,4), (2,5), and (6,4), we finally extract a tree connecting vertices 3, 1, 6, 4, 2, 5 with a total length of 1.1.
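The procedure described above can be sketched as follows. This is a minimal illustration of Kruskal's Algorithm with a union-find structure; the function name `kruskal_mst` is our own, and the six-vertex edge list is hypothetical (it does not reproduce the edge lengths of Figure 12, which are not listed in the text).

```python
def kruskal_mst(n, edges):
    """Minimum spanning tree of a connected graph with vertices 0..n-1.
    `edges` is a list of (length, u, v); returns the list of tree edges."""
    parent = list(range(n))

    def find(x):
        # Root of the component containing x, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for length, u, v in sorted(edges):      # ascending order of length
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # adding the edge creates no loop
            parent[root_u] = root_v
            tree.append((length, u, v))
            if len(tree) == n - 1:
                break
    return tree

# Hypothetical six-vertex network (illustration only).
edges = [
    (0.10, 0, 2), (0.15, 0, 5), (0.20, 1, 3), (0.25, 2, 5),
    (0.30, 1, 4), (0.40, 3, 4), (0.45, 3, 5), (0.50, 0, 1),
]
mst = kruskal_mst(6, edges)
print(mst, sum(length for length, _, _ in mst))
```

In this toy network, edges (2,5) and (3,4) are skipped because their endpoints are already joined by shorter edges, so adding them would close a loop.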
To extract the minimum spanning trees of the stock networks of CSI163 and S&P468, we average all correlation matrices over the investigated time windows; the resulting MSTs are presented in Figures 13 and 14 for CSI163 and S&P468, respectively. We see that stocks of the same industry sector are clustered together in the MSTs of both CSI163 and S&P468, and this clustering effect is much more obvious for S&P468, in which the stocks are well clustered according to the industry sectors of S&P500.
We further look into the connectivity of the MSTs of both CSI163 and S&P468 and find that, after the edge filtering process, some stocks are still well connected with other stocks. These stocks are the key contributors to the connectivity of the MSTs, while most stocks are poorly connected, with a degree of only one or two. In Tables 10 and 11, we present the top 10 stocks according to their degrees in the MSTs of CSI163 and S&P468, respectively. We find that the most connected stocks of CSI163 are diverse, while, for the MST of S&P468, three Financials stocks appear in the top 10. This agrees with our other analysis showing that the Chinese stock market is much more diverse and that Financials stocks play important roles in the US market.

Planar Maximally Filtered Graph.
Since its proposal in [86], PMFG has been applied in many studies of financial networks. In [67], the authors study the PMFG networks of DAX 30 stocks. Instead of using the correlation matrix, a matrix of p-values of the Engle-Granger cointegration test is used to extract the PMFG for Chinese stocks in [98]. The stability and robustness of PMFG for 300 NYSE stocks are compared with those of MST in a running window approach, and the results reveal that PMFG is more stable than MST [104]. In [47], the authors of [104] confirm that PMFG provides stronger robustness and stability in revealing the network structures of stock markets. It has also been proven that the PMFG always contains an MST for the same distance matrix [86]. The PMFGs of the CSI163 and S&P468 networks are plotted in Figures 15 and 16, respectively. We see that PMFGs have many more edges than MSTs. Further, we use another layout to plot the two PMFGs in Figures 17 and 18, from which we find that PMFGs also produce good clusters for stocks of different industry sectors.

Asset Graph.
In the minimum spanning tree (MST), a connected tree structure linking all vertices with a minimum total edge length is extracted. The process of selecting edges to generate an MST from a distance matrix is presented in Figure 12; an MST is always a single connected tree without disconnected parts. To connect the N vertices, a total of N − 1 edges are needed, where N is the number of vertices in the original network. Clearly, the N − 1 edges of an MST are not guaranteed to have the minimum possible total length among all sets of N − 1 edges. By changing the strategy of how edges are selected and allowing disconnected parts, the asset graph (AG) approach is proposed in [87,88]. As for the MST, to generate an AG we start from the distance matrix containing all pairwise distances of the network and first rank all edges in ascending order, from the shortest to the longest. Without the requirement of keeping a connected tree, we choose the top N − 1 edges to form an AG. It has been found that AG extracts structures similar to those of MST, with a smaller normalized length and a more stable structure over time. In this section, we show the AG networks for both markets. In Figures 19 and 20, we present the AG structures for the CSI163 and S&P468 networks, respectively. Compared with the minimum spanning trees of the CSI163 and S&P468 networks in Figures 13 and 14, the AG structures are more complex than the MSTs, and there are many isolated vertices in the AGs. The connected cliques in an AG are the most correlated stocks, connected by the shortest possible edges; in other words, by connecting the most correlated stocks, AG omits the less correlated stocks. We also observe that many cliques emerge in AG, which reveals more information about the structures than MST, where no loops or cliques are allowed. AG is thus a simple but effective network simplification approach for extracting the most correlated stocks. However, the sacrifice is also obvious: as shown in Figures 19 and 20, the clustering in AG is poor compared with MST for both markets.
We have shown that AG allows isolated vertices, so not all vertices are connected in one giant tree. An AG can be generated with different numbers of edges; as the number of edges increases, the portion of isolated vertices declines. It is interesting to investigate how the number of connected vertices is related to the number of edges. In the original distance matrix, the maximum possible number of edges is N(N − 1)/2. We define the edge fraction of an AG as the ratio of the number of top edges added to the AG to the total number of possible edges. When we increase this edge fraction, more and more vertices are connected. We calculate the fraction of connected vertices as the ratio of the number of connected vertices to the total number of vertices N. We plot these results in Figures 21(a) and 21(b) for the CSI163 and S&P468 networks, respectively. As the figures show, more and more vertices are connected even when only a small fraction of the edges is included; it requires only 0.0123 and 0.0043 of the total edges for all vertices to be connected in CSI163 and S&P468, respectively. This indicates that the top edges are more effective in connecting vertices for S&P468 than for CSI163.
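The construction of an AG and the fraction of connected vertices can be sketched as follows. This is a minimal illustration under the assumption that a vertex counts as connected when it has at least one edge in the AG; the function names `asset_graph`, `connected_fraction`, and `symmetric_matrix`, as well as the five-stock distances, are hypothetical.

```python
def symmetric_matrix(n, pair_dist):
    """Build a symmetric n x n distance matrix from {(i, j): distance}."""
    dist = [[0.0] * n for _ in range(n)]
    for (i, j), d in pair_dist.items():
        dist[i][j] = dist[j][i] = d
    return dist

def asset_graph(dist, n_edges=None):
    """The n_edges shortest edges of the network (default: N - 1),
    returned as (length, i, j) in ascending order of length."""
    n = len(dist)
    if n_edges is None:
        n_edges = n - 1
    ranked = sorted(
        (dist[i][j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    return ranked[:n_edges]

def connected_fraction(n, edges):
    """Fraction of vertices that have at least one edge."""
    touched = {v for _, i, j in edges for v in (i, j)}
    return len(touched) / n

# Hypothetical distances for five stocks (illustration only): stocks 0-2
# are tightly correlated, stock 4 is far from all others.
toy = symmetric_matrix(5, {
    (0, 1): 0.20, (0, 2): 0.30, (1, 2): 0.25, (0, 3): 0.50, (1, 3): 0.60,
    (2, 3): 0.55, (0, 4): 1.20, (1, 4): 1.30, (2, 4): 1.10, (3, 4): 1.00,
})
ag = asset_graph(toy)
print(ag, connected_fraction(5, ag))
print(connected_fraction(5, asset_graph(toy, 7)))
```

With N − 1 = 4 edges, the toy AG contains the triangle formed by stocks 0, 1, and 2 and leaves stock 4 isolated, which illustrates both the cliques and the isolated vertices mentioned above; adding more edges eventually connects every vertex.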
In the previous section, all structures are extracted from the average distance matrices over the whole study periods, defined in equation (7) as $\langle D \rangle = (1/N_w) \sum_{t} D^{t}$, where $N_w$ is the number of sliding windows. In this part, we investigate the dynamic structures of the filtered networks with a focus on the AG and MST. For each sliding window at time $t$, we get a distance matrix $D^{t}$ based on the returns data on the interval $[t, t-1, \ldots, t-w+1]$, where $w$ is the length of a sliding window. For each sliding window, using the distance matrix $D^{t}$, we construct the corresponding original network, the asset graph, and the minimum spanning tree.

In Figure 22, we present the distance distributions of the original distance matrices, asset graphs, and minimum spanning trees for both CSI163 and S&P468 in the study period between 04/01/2007 and 06/11/2015. In the original network, N(N − 1)/2 edges are considered, while for the AG and MST, N − 1 edges are considered. Since the sliding window sizes are 170 for CSI163 and 500 for S&P468, we should keep in mind that each slice of the distribution is a result of the past $w$ trading dates, i.e., about half of a year for CSI163 and two years for S&P468. The shapes of these distributions are influenced by the window lengths. We choose the same set of lengths by considering the requirements of the random matrix theory approach, which we shall discuss later. Similar plots are reported in [87] in a study of 477 stocks from NYSE, a dataset similar in size to our S&P468 dataset of 468 stocks. We add more evidence by comparing the two markets of CSI163 and S&P468. In Figures 22(a) and 22(b), we observe obvious shifts of the distribution centers for both markets. Positive shifts relative to the mean distance of $\sqrt{2}$ roughly correspond to normal market periods, while negative shifts correspond to bear or collapsing market periods. This indicates that the stocks behave in a synchronized way in bad periods, which agrees with many previous studies. It also provides a potential market measurement for investors and regulators to watch how the market behavior shifts. In Figures 22(c) and 22(d), the distributions of distances of the AG for CSI163 and S&P468 are plotted. Since an AG is a subgraph of the original network composed of the N − 1 shortest edges, we expect its distribution to be shifted to the left of the center $\sqrt{2}$ compared to the original network, and this is well shown in the plots for both CSI163 and S&P468; more precisely, the distributions of AG are a zoom-in of the left tails of the original networks. The MST, as shown in Figures 22(e) and 22(f), has a relatively wider distribution, which is positively shifted compared to the AG but negatively shifted compared to the original network. We also find that, in the AG and MST networks, most of the distribution stays to the left of the center $\sqrt{2}$, which means the network is correlated on average; in other words, the network backbones of AG and MST are on average correlated and rarely anticorrelated. A potential implication is that, for the whole market, the network provides a diversified portfolio when the market is normal or in a bull state, but for the top edges in AG and MST, the network moves together with less diversification when the market falls into bear markets or crisis periods.
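The role of $\sqrt{2}$ as the center of the distance distribution can be made concrete with a short sketch. We assume here that the distance of equation (3) is the commonly used transformation $d = \sqrt{2(1 - \rho)}$ of the correlation coefficient $\rho$, which is consistent with a center of $\sqrt{2}$ for uncorrelated stocks; the function name `correlation_to_distance` is hypothetical.

```python
import math

def correlation_to_distance(rho):
    """Distance d = sqrt(2 * (1 - rho)) for a correlation rho in [-1, 1]
    (assumed form of the distance used in this paper)."""
    return math.sqrt(2.0 * (1.0 - rho))

# Perfectly correlated, uncorrelated, and perfectly anticorrelated pairs.
for rho in (1.0, 0.0, -1.0):
    print(rho, correlation_to_distance(rho))
```

Under this assumption, distances below $\sqrt{2}$ correspond to positively correlated pairs and distances above $\sqrt{2}$ to anticorrelated pairs, so a distribution that moves to the left indicates a more synchronized market.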
The distance $d_{ij}$ indicates how two stocks correlate with each other: a larger $d_{ij}$ means a smaller correlation and vice versa. For an original network at time $t$, the total distance can be introduced as

$$d^{t} = \sum_{i<j} d^{t}_{ij}, \qquad (12)$$

and the average distance for the original network can be defined as

$$\langle d^{t} \rangle = \frac{2}{N(N-1)} \sum_{i<j} d^{t}_{ij}.$$

In the same way, we can calculate the total distances for the AG and MST using equation (12) with the sum restricted to their edges; since the number of edges of the AG and MST is N − 1, we normalize their average distance as

$$\langle d^{t} \rangle = \frac{1}{N-1} \sum_{(i,j) \in E^{t}} d^{t}_{ij},$$

where $E^{t}$ denotes the edge set of the AG or MST at time $t$. To investigate the tightness of the networks, the total distance and the average distance of the original network, the AG, and the MST are examined over our study periods for both markets. We plot the results in Figure 23 for the total distance and in Figure 24 for the average distance. For each stock market, the total distance and the average distance show similar shapes. For both stock markets, the values are in the order original network > MST > AG; i.e., the original networks have the largest total and average distances, while the AG has the smallest values.
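The two normalizations described above can be sketched as follows, with the full network averaged over its N(N − 1)/2 pairs and a backbone (AG or MST) averaged over its N − 1 edges. The function names and the three-stock example are hypothetical and serve only as an illustration.

```python
def full_network_distance(dist):
    """Total and average distance over all N(N-1)/2 pairs of the matrix."""
    n = len(dist)
    total = sum(dist[i][j] for i in range(n) for j in range(i + 1, n))
    return total, total / (n * (n - 1) / 2)

def backbone_distance(edges, n):
    """Total and average distance of a backbone given as (length, i, j)
    edges, normalized by the N - 1 edges of an AG or MST."""
    total = sum(length for length, _, _ in edges)
    return total, total / (n - 1)

# Hypothetical three-stock network and a two-edge backbone.
toy = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 1.5],
    [1.0, 1.5, 0.0],
]
backbone = [(0.5, 0, 1), (1.0, 0, 2)]
print(full_network_distance(toy))
print(backbone_distance(backbone, 3))
```

Because a backbone keeps only short edges, its average distance lies below that of the full network, in line with the ordering reported above.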
By comparing the total distance plotted in Figure 23(a) and the average distance plotted in Figure 24(a) for CSI163 over the study period, we find that the six curves share similar shapes. The same similarity is observed in Figures 23(b) and 24(b) for the S&P468 network. This indicates that the AG and MST are both good backbones of the whole original market networks, and that this tracking stays robust over time. For both networks, we also find that the curve of the MST lies above that of the AG, which means the total and average distances are slightly larger in MST than in AG. Our findings agree with the results reported in [87]. Since the two stock market datasets have different numbers of stocks, we compare the average distance between the two markets. As shown in Figure 24, CSI163 is slightly sparser than S&P468, which indicates that CSI163, a developing market, is more diversified than S&P468, a developed market; this also agrees with much previous research.
In Table 12, we summarize the average, minimum, maximum, and standard deviation of the average distance of the CSI163 and S&P468 networks for three kinds of networks: original, AG, and MST. We can see that the values are in the order original > MST > AG for the average, minimum, and maximum, and that the three networks have similar standard deviations. We find that the average and the minimum for CSI163 are slightly larger than for S&P468, which indicates that stocks in CSI163 are less likely to be correlated than stocks in S&P468. To visualize the distributions of these three kinds of networks, we plot the probability density function (PDF) for the original network, AG, and MST for CSI163 and S&P468 in Figures 25(a) and 25(b), respectively. The distributions of all three networks share similar shapes but have different mean centers; as shown in the figures, the AG is located on the left, the MST in the center, and the original network on the right.

Conclusion and Discussion
In this research, we investigated the properties and models of complex network theory and its applications from a data science perspective. Using the daily close prices of two sets of stocks from CSI300 and S&P500, we constructed the correlation matrices for both the whole study periods and all sliding windows. Based on these correlation matrices, we built networks with stocks as the vertices and correlation relationships as the edges. We systematically applied network filtering methods such as the hierarchical tree, minimum spanning tree, planar maximally filtered graph, and asset graph to simplify the networks. For each filtered network, the network properties are discussed. Financial markets are complex systems, and it is important to extract useful information from the noisy background by applying methods like complex networks. We find that the two stock markets, CSI300 as an emerging market and S&P500 as a mature, well-developed market, share similar properties in many ways and also differ in many aspects. The revealed properties and robustness might provide insights into the structures and dynamics of the two stock markets for practitioners and regulators.

First, it is interesting to develop trading strategies with the information revealed by the topological networks of stocks or indices. For instance, pair trading [113][114][115][116] is a basic, market neutral strategy based on the movement of a correlated stock pair: if the spread widens, traders can short one stock and go long the other to capture the spread. One might use the information of the networks to identify the pairs and evaluate their reliability. Also, considering pairs between groups of stocks rather than only two stocks, we might use the component or cluster information revealed in the networks to build the trading groups. Furthermore, directed networks built with Granger causalities or lagged correlations might give more details on the asynchronous lead/lag relationships of stock pairs. Second, with the help of network edge filtering, we can significantly simplify the networks, but most studies focus on the topological simplification without considering the original portfolio returns. What if we consider the returns together with the topology of the networks to optimize the portfolio selection? The topological structure can tell us how diverse the portfolio is, but this is not enough to design the portfolio without return information. A possible way is to adjust the portfolio selection by considering measurements like the ratio of returns to the total distance of a portfolio, or other approaches combining both topological and return information. Also, the techniques of network modeling and analysis can enhance the ability in policy modeling and decision making. We hope this work can inspire policymakers and researchers to apply network theories in wider applications.

Figure 2: Edge densities of CSI163 and S&P468 for different thresholds from 0.1 to 1.5. The densities increase sharply from 0 to 1 as the threshold rises from 0.6 to 1.
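
The threshold filtering behind this figure can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that an edge is kept when its distance is at most the threshold, and the 4-stock matrix `toy_dist` is hypothetical.

```python
from itertools import combinations


def threshold_edges(dist, theta):
    """Keep the edges (i, j) whose distance is at most the threshold (assumed rule)."""
    n = len(dist)
    return [(i, j) for i, j in combinations(range(n), 2) if dist[i][j] <= theta]


def edge_density(dist, theta):
    """Fraction of the n(n-1)/2 possible edges that survive the threshold."""
    n = len(dist)
    max_edges = n * (n - 1) // 2
    return len(threshold_edges(dist, theta)) / max_edges


# Hypothetical 4-stock distance matrix (symmetric, zero diagonal).
toy_dist = [
    [0.0, 0.5, 0.9, 1.2],
    [0.5, 0.0, 0.7, 1.1],
    [0.9, 0.7, 0.0, 0.8],
    [1.2, 1.1, 0.8, 0.0],
]

for theta in (0.6, 0.8, 1.5):
    print(theta, edge_density(toy_dist, theta))
```

Sweeping the threshold over a grid and plotting the result would give a density curve of the kind shown in the figure.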

Figure 4: Average clustering coefficient of CSI163 and S&P468 for different thresholds. The average clustering coefficient grows with the threshold. For comparison, we plot the average clustering coefficients of random networks over the same threshold interval. For both the CSI163 and S&P468 networks, the values are significantly larger than those of random networks of the same size. This indicates that the stock markets are far from random and that the stocks are comparatively clustered.
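
For reference, the average clustering coefficient of an unweighted graph can be computed as in the sketch below. It assumes the common convention that vertices with fewer than two neighbors contribute zero; the adjacency dictionary `toy_adj` is a hypothetical example, not data from this study.

```python
def average_clustering(adj):
    """Mean local clustering coefficient; vertices with degree < 2 count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        nb = list(nbrs)
        # Count the edges that exist among the neighbors of this vertex.
        links = sum(
            1 for a in range(k) for b in range(a + 1, k) if nb[b] in adj[nb[a]]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical graph: a triangle 0-1-2 plus a pendant vertex 3 attached to 2.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(toy_adj))
```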

Figure 5: Average path lengths of CSI163 and S&P468 under different thresholds, compared with the values for random networks of the same sizes, 163 × 163 and 468 × 468. For both networks, the average path length peaks above the curve of the corresponding random network. When the threshold is small, most edges are filtered out and the whole network breaks into parts; the disconnected vertices and edges are then filtered as well. Hence, to the left of the peak, the average path length decreases as the threshold decreases, while to the right of the peak it declines steadily as the threshold increases, because more edges remain. For the random networks, the average path length stays almost unchanged because the edges are distributed homogeneously across the whole network.
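
A minimal sketch of an average path length computation is given below. It assumes hop counts (unweighted shortest paths) and averages only over mutually reachable vertex pairs, which is one way to handle the disconnected parts mentioned in the caption; the paper's exact convention is not shown in this excerpt, and `toy_path` is hypothetical.

```python
from collections import deque


def average_path_length(adj):
    """Mean BFS hop count over ordered pairs of distinct, reachable vertices."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical graph: a path 0-1-2-3 and an isolated vertex 4.
toy_path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: set()}
print(average_path_length(toy_path))
```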

Figure 7: CSI163 networks with thresholds of 0.6 (a), 0.7 (b), 0.8 (c), and 0.9 (d). The network is sparser for a small threshold and denser for a large one, because a small threshold greatly simplifies the network by filtering out most edges with larger distances. As a result, the edge density at a threshold of 0.9 is about 38.25 times that at 0.6. Different vertex colors indicate different industry sectors.

Figure 8: S&P468 networks with thresholds of 0.6 (a), 0.7 (b), 0.8 (c), and 0.9 (d). The edge density again changes dramatically with the threshold: the edge density at a threshold of 0.9 is 68.79 times that at 0.6. Different vertex colors indicate different industry sectors.

Figure 9: The component properties, namely the number of components (red), the maximum component size (green), and the average component size (blue), plotted for CSI163 (a) and S&P468 (b) at different thresholds. Both networks show a similar transition from a large number of small isolated components to a connected giant network. Before the transition interval, most edges are filtered out, leaving isolated vertices that are not correlated. After the transition interval, edges are preserved, so most vertices connect to form a single giant network in which all vertices are correlated.
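
The three component statistics in this figure can be obtained with a union-find pass over the surviving edges. The sketch below is illustrative only; it assumes that isolated vertices count as components of size one, and the function name `component_stats` is hypothetical.

```python
def component_stats(n, edges):
    """Return (number of components, largest size, average size) for n >= 1 vertices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    sizes = {}
    for v in range(n):
        root = find(v)
        sizes[root] = sizes.get(root, 0) + 1
    count = len(sizes)
    return count, max(sizes.values()), n / count


# Hypothetical 5-vertex network with one chain 0-1-2 and two isolated vertices.
print(component_stats(5, [(0, 1), (1, 2)]))
```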

Figure 10: CSI163 dendrogram hierarchical cluster trees extracted with different distance definitions: (a) smallest distance for stock pairs, equation (8); (b) largest distance for stock pairs, equation (9); (c) average distance for stock pairs, equation (10); (d) distance between centroids for clusters, equation (11). The color threshold is 0.7. All stocks whose linkage values are less than this threshold are colored with a unique color. As shown in the figures, different distance definitions extract different dendrogram hierarchical cluster trees, and the same color threshold generates different results. We see that the largest distance definition reveals more details of the network.

Figure 11: S&P468 dendrogram hierarchical cluster trees extracted with different distance definitions: (a) smallest distance for stock pairs, equation (8); (b) largest distance for stock pairs, equation (9); (c) average distance for stock pairs, equation (10); (d) distance between centroids for clusters, equation (11). The color threshold is 0.7. All stocks whose linkage values are less than this threshold are colored with a unique color. As shown in the figures, different distance definitions extract different dendrogram hierarchical cluster trees, and the same color threshold generates different results. Again, the largest distance definition reveals more details of the network.
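
Equations (8) to (10) are not reproduced in this excerpt. The sketch below therefore uses the textbook single, complete, and average linkage definitions, which correspond to the smallest, largest, and average pairwise distances between two clusters. The centroid definition is omitted because it needs vertex coordinates rather than a distance matrix. All names and the toy matrix are hypothetical.

```python
def single_linkage(dist, a, b):
    """Smallest pairwise distance between clusters a and b."""
    return min(dist[i][j] for i in a for j in b)


def complete_linkage(dist, a, b):
    """Largest pairwise distance between clusters a and b."""
    return max(dist[i][j] for i in a for j in b)


def average_linkage(dist, a, b):
    """Mean pairwise distance between clusters a and b."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


# Hypothetical 4-stock distance matrix (symmetric, zero diagonal).
toy_dist = [
    [0.0, 0.5, 0.9, 1.2],
    [0.5, 0.0, 0.7, 1.1],
    [0.9, 0.7, 0.0, 0.8],
    [1.2, 1.1, 0.8, 0.0],
]
cluster_a, cluster_b = [0, 1], [2, 3]
print(single_linkage(toy_dist, cluster_a, cluster_b))
print(complete_linkage(toy_dist, cluster_a, cluster_b))
print(average_linkage(toy_dist, cluster_a, cluster_b))
```

Agglomerative clustering repeatedly merges the two clusters with the smallest linkage distance, so the choice among these functions determines the shape of the dendrogram.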

Figure 12: The extraction of a minimum spanning tree from a six-vertex network using Kruskal's MST algorithm. We rank all edges in ascending order of edge length. Starting from the shortest edge, we add edges to the tree one by one while avoiding loops or cycles. After considering all edges, we obtain a final tree that connects all vertices with the minimum total edge length. In our example, after adding the edges (1,3), (1,6), (2,4), (2,5), and (6,4), we extract the tree 3-1-6-4-2-5 with a total length of 1.1.
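
The procedure in this caption can be sketched with a union-find structure that rejects any edge closing a cycle. The example below is a generic illustration: the four-vertex edge list and its lengths are hypothetical and do not reproduce the six-vertex network of the figure.

```python
def kruskal_mst(n, edges):
    """edges: iterable of (length, u, v) with vertices 0..n-1. Returns the tree edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for length, u, v in sorted(edges):  # shortest edges first
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge does not create a cycle
            parent[ru] = rv
            tree.append((length, u, v))
            if len(tree) == n - 1:
                break
    return tree


# Hypothetical weighted edges on 4 vertices.
toy_edges = [(0.3, 0, 1), (0.5, 1, 2), (0.4, 0, 2), (0.9, 2, 3), (1.0, 1, 3)]
tree = kruskal_mst(4, toy_edges)
print(tree, sum(length for length, _, _ in tree))
```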

Figure 13: Minimum spanning tree of CSI163. Vertices are colored to indicate different industry sectors.

Figure 14: Minimum spanning tree of S&P468. Vertices are colored to indicate different industry sectors.

Figure 17: CSI163 PMFG. We label the vertices with stock codes in (a) and industry codes in (b).

Figure 21: Percentages of connected vertices of AG against edge densities for CSI163 and S&P468 networks.
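
A minimal sketch of the quantity plotted here follows. It assumes the usual asset graph construction, in which the edges are ranked by distance and the shortest ones are kept, and it counts a vertex as connected when it has at least one edge. The function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def asset_graph(dist, m):
    """Keep the m shortest edges of the complete distance network (assumed AG rule)."""
    n = len(dist)
    ranked = sorted(combinations(range(n), 2), key=lambda e: dist[e[0]][e[1]])
    return ranked[:m]


def connected_fraction(n, edges):
    """Share of vertices that touch at least one edge."""
    touched = {v for edge in edges for v in edge}
    return len(touched) / n


# Hypothetical 4-stock distance matrix (symmetric, zero diagonal).
toy_dist = [
    [0.0, 0.5, 0.9, 1.2],
    [0.5, 0.0, 0.7, 1.1],
    [0.9, 0.7, 0.0, 0.8],
    [1.2, 1.1, 0.8, 0.0],
]
for m in (1, 2, 3):
    print(m, connected_fraction(4, asset_graph(toy_dist, m)))
```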

Figure 22: Probability distributions of all distances of the original network, the asset graph (AG), and the minimum spanning tree (MST) of CSI163 and S&P468 over the years of our study period between 04/01/2007 and 06/11/2015. The total number of edges is $N(N-1)/2$ for the original network and $N-1$ for the AG and the MST, respectively. Since the sliding window sizes are 170 for CSI163 and 500 for S&P468, the data only start after one window length.

Figure 23: Evolution of the total distances of the original network, the asset graph, and the minimum spanning tree for CSI163 and S&P468 over the study period.

Figure 24: Evolution of the average distances of the original network, the asset graph, and the minimum spanning tree for CSI163 and S&P468 over the study period.
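
The sliding-window calculation behind such a curve can be sketched as below for the original (unfiltered) network. Two assumptions are made that this excerpt does not confirm: the inputs are return series of equal length, and the distance follows the commonly used transformation $d = \sqrt{2(1 - \rho)}$ of the Pearson correlation. The toy series are hypothetical, and a series that is constant within a window would make the correlation undefined.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rolling_average_distance(returns, window):
    """Average pairwise distance of the full network in each sliding window."""
    n, length = len(returns), len(returns[0])
    averages = []
    for start in range(length - window + 1):
        dists = []
        for i in range(n):
            for j in range(i + 1, n):
                rho = pearson(
                    returns[i][start:start + window],
                    returns[j][start:start + window],
                )
                # Assumed distance; max() guards against rounding slightly above 1.
                dists.append(math.sqrt(max(0.0, 2.0 * (1.0 - rho))))
        averages.append(sum(dists) / len(dists))
    return averages


# Hypothetical return series for three stocks.
toy_returns = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [5, 4, 3, 2, 1]]
print(rolling_average_distance(toy_returns, 3))
```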

Figure 25: Probability density function (PDF) of the average distance of the original network, the asset graph, and the minimum spanning tree for CSI163 and S&P468.
3.1.2. S&P500. Compiled by Standard & Poor's in 1957, the S&P500 is an established American stock market index with more components, more risk diversification, and better reflection of the overall stock market performance than both the New York Stock Exchange (NYSE) and Nasdaq. All components are large-capitalization stocks with good liquidity, diversified across different industry sectors. The S&P500 represents major parts of the market and is

Table 1: 163 component stocks of CSI300 are included in our dataset. In this table, we list the China Securities Regulatory Commission (CSRC) industry code, sector name, and number of stocks for each industry sector of these 163 stocks. All 20 industry sectors are represented.

Table 2: 468 component stocks of S&P500 are included in the dataset. In this table, we list the Global Industry Classification Standard (GICS) code, sector name, and number of stocks for each industry sector in S&P500. All ten industry sectors are represented.
[21] different elements. It is verified that this definition satisfies the three rules of Euclidean distance: (1) $d_{ij} = 0$ if and only if $i = j$; (2) $d_{ij} = d_{ji}$; (3) $d_{ij} \le d_{ik} + d_{kj}$ [21]. Since $-1 \le \rho_{ij} \le 1$, we have $0 \le d_{ij} \le 2$. With this definition, the distance between two stocks is 2 when they are completely anticorrelated ($\rho_{ij} = -1$) and approaches 0 when they are completely positively correlated ($\rho_{ij} \to 1$). This makes it possible to compare the distances of any two pairs of stocks. Stock $i$ is represented as vertex $v_i \in V$, and $e_{ij} \in E$ represents the edge between $v_i$ and $v_j$ with a distance of $d_{ij}$.
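
The distance definition itself falls outside this excerpt. The sketch below assumes the widely used transformation $d = \sqrt{2(1 - \rho)}$ of a correlation coefficient, which is consistent with the stated range from 0 to 2 but should be checked against the paper's own equation. The function name is hypothetical.

```python
import math


def correlation_to_distance(rho):
    """Map a correlation in [-1, 1] to a distance in [0, 2] (assumed formula)."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return math.sqrt(2.0 * (1.0 - rho))


for rho in (-1.0, 0.0, 0.5, 1.0):
    print(rho, correlation_to_distance(rho))
```

Under this assumption, uncorrelated stocks sit at a distance of about 1.41, so the thresholds between 0.6 and 1 discussed in the figures select positively correlated pairs.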

Table 4: For the CSI163 network, the maximum possible number of edges for 163 vertices, the existing number of edges, the edge density, the average degree, the average distance, the minimum distance, and the maximum distance

relationship between any pair of two participants $i$ and $j$ is represented as the edge $e_{ij}$ connecting the two vertices $v_i$ and $v_j$. In this study, the following properties of financial markets are researched: (1) Degree and Degree Distribution, which describe the connectivity of vertices; (2) Clustering Coefficient, which indicates the transitivity and density of a network; (3) Average Path Length, which is a global property indicating how the network spans; (4) Betweenness Centrality, which describes the global importance or centrality of vertices or edges; (5) Components, which describe the grouping of substructures in the networks.
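
As a small illustration of the first property in this list, the degree of each vertex and the degree distribution can be derived from an edge list as sketched below. The function names and the star-shaped toy network are hypothetical.

```python
from collections import Counter


def degrees(n, edges):
    """Degree of each vertex 0..n-1 in an undirected edge list."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


def degree_distribution(deg):
    """Fraction of vertices having each observed degree."""
    counts = Counter(deg)
    n = len(deg)
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical star network: vertex 0 is linked to vertices 1, 2, and 3.
toy_deg = degrees(4, [(0, 1), (0, 2), (0, 3)])
print(toy_deg, degree_distribution(toy_deg), sum(toy_deg) / len(toy_deg))
```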

Table 5: For the S&P468 network, the maximum possible number of edges for 468 vertices, the existing number of edges, the edge density, the average degree, the average distance, the minimum distance, and the maximum distance are presented for thresholds from 0 to 1.5 in steps of 0.1.

Table 6: In this table, we list the China Securities Regulatory Commission (CSRC) industry code, the sector name, the number of stocks, the average degree, the average clustering coefficient, and the average betweenness coefficient for each industry sector of these 163 stocks. The values are calculated from the CSI163 network with a threshold of 1.0.

Table 7: In this table, we list the industry code, the sector name, the number of stocks, the average degree, the average clustering coefficient, and the average betweenness coefficient for each industry sector of the S&P468 stocks. The values are calculated from the S&P468 network with a threshold of 1.0.

Table 8: Top stocks with the highest values of degree and betweenness, ranked in descending order of each measure, at a threshold of 0.8 for the CSI163 network. Stock codes in bold indicate stocks that appear in both top-10 lists.

Table 9: Top stocks with the highest values of degree and betweenness, ranked in descending order of each measure, at a threshold of 0.8 for the S&P468 network. Stock codes in bold indicate stocks that appear in both top-10 lists.

Table 10: In this table, we present the top 10 stocks with the largest degrees in the MST of CSI163. As shown, the top 10 stocks are diverse in industry sectors, comprising 1 Wholesale & retail stock, 1 Metals & Non-metals stock, 2 Pharmaceuticals stocks, 2 Real estate stocks, 1 Finance & insurance stock, 2 Utilities stocks, and 1 Textiles & Apparel stock.

Table 11: In this table, we present the top 10 stocks with the largest degrees in the MST of S&P468. As shown, Honeywell Intl. is the most connected stock in the MST with a degree of 38. The top 10 stocks are composed of 2 Industrials stocks, 3 Financials stocks, 2 Utilities stocks, 1 Health Care stock, and 1 Consumer Discretionary stock.

Table 12: Average, minimum, maximum, and standard deviation of the average distance for the CSI163 and S&P468 networks. The values in the first row belong to the original network, those in the second row to the AG, and those in the third row to the MST.