Study of Evolution Model of China Education and Research Network

By crawling the hyperlinks under the domain name “.edu.cn”, which constitutes the China Education and Research Network, we build a complex directed network of 366,422 web pages connected by 540,755 hyperlinks. These pages form a complex directed network through self-organization. By analyzing the topology of the China Education and Research Network, we find that it differs from the general Internet in several respects. Most of the vertices have incoming links, a few vertices have outgoing links, and very few vertices have both incoming and outgoing links. The vertex degree distribution has a power-law tail. A large proportion of newly added edges connect to pages within the subnetwork they belong to, rather than to pages selected from the whole network. Based on these features, we present an evolution model of this complex directed network. The results indicate that the model reflects some of the main characteristics of the China Education and Research Network.


Introduction
Research on complex networks is developing at a brisk pace, and significant achievements have been made in recent years. Among them is the introduction of the scale-free network and related models [1][2][3][4], which made great progress in revealing the dynamic evolution of complex networks. Theoretical and empirical research on complex networks has also produced important results [5][6][7][8][9].
The China Education and Research Network (CERNET) has been in operation since 1995. More than 1000 universities and research institutes have been connected to this network so far. It has 36 regional network centers and main nodes, which are distributed among different provinces of China. The network currently has more than 1,200,000 hosts and has become the second largest Internet network in China. However, compared with the large body of research on the general Internet [10][11][12][13], only a few studies of CERNET can be found. These studies show that the features of CERNET differ from those of the general Internet, especially in structure and formation mechanism [14,15]. Hence, the study of CERNET is quite important.
We have been working on CERNET since 2005, trying to establish an evolution model of CERNET for analysis and prediction purposes [14][15][16]. However, mainly because of the large scale of CERNET and the lack of computing power at that time, adjusting the parameters of the model took a long time. As a result, the model we obtained is relatively simple and does not reflect the main features of CERNET well [16]. For example, the average shortest path length of the simulation model is only about 2.8, far from the 8.95 of the real network [17].
In this paper, the CERNET we analyze is a virtual network made up of web pages whose addresses all contain ".edu.cn". In this network, the web pages are the nodes, and the hyperlinks in these pages that point to other pages are the directed edges. This directed complex network has 366,422 nodes and 540,755 edges. We analyze the features of this network and derive an evolution model using empirical methods to reveal the formation mechanism of CERNET.
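As a minimal sketch of this construction, assuming the crawl is available as (source URL, target URL) pairs, the directed network can be built by keeping only hyperlinks whose two endpoints lie under ".edu.cn". The names `is_edu_cn` and `build_edu_graph` and the sample URLs are hypothetical and are not taken from the original study.

```python
from urllib.parse import urlparse


def is_edu_cn(url):
    """Return True if the host of `url` lies under the .edu.cn domain."""
    host = urlparse(url).hostname or ""
    return host == "edu.cn" or host.endswith(".edu.cn")


def build_edu_graph(links):
    """Build a directed graph {page: set(target pages)} from (source, target) pairs.

    Only hyperlinks whose source and target are both .edu.cn pages are kept;
    self-links are ignored (an assumption of this sketch).
    """
    graph = {}
    for src, dst in links:
        if src == dst or not (is_edu_cn(src) and is_edu_cn(dst)):
            continue
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # make sure the target is also a node
    return graph


# Hypothetical sample links, for illustration only.
sample_links = [
    ("http://www.a.edu.cn/index.html", "http://www.a.edu.cn/news.html"),
    ("http://www.a.edu.cn/index.html", "http://www.b.edu.cn/"),
    ("http://www.a.edu.cn/index.html", "http://www.example.com/"),
]
graph = build_edu_graph(sample_links)
num_nodes = len(graph)
num_edges = sum(len(targets) for targets in graph.values())
```

Storing successors in a set removes duplicate hyperlinks between the same two pages, which is one possible convention; whether the original data set counts repeated links is not stated.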
The remainder of the paper is organized as follows. The topological structure of CERNET is analyzed in Section 2. The evolution model of CERNET and a comparison between the real and simulated networks are described in Section 3. Section 4 gives the conclusions and future work.

Topological Structure of CERNET
Several features can be used to characterize a network, for example, the degree distribution, the average shortest path length, and the clustering coefficient. Among them, the degree distribution is considered to be the most important [2].
From graph theory we know that the number of edges connected to a node is the degree of that node. For a directed graph, the outdegree is the number of outgoing edges and the indegree is the number of incoming edges. Using the data we collected, we set up a database of CERNET and obtain $P_{\mathrm{out}}(k)$ and $P_{\mathrm{in}}(k)$, where $P_{\mathrm{out}}(k)$ is the probability that a page has $k$ outgoing links and $P_{\mathrm{in}}(k)$ is the probability that a page has $k$ incoming links. The formulas used to calculate these output and input probabilities are given in (1) and (2), respectively, where $K_{\mathrm{out}}$ is the maximum outdegree of the network and $K_{\mathrm{in}}$ is the maximum indegree of the network.
We plot $P_{\mathrm{out}}(k)$ and $P_{\mathrm{in}}(k)$ as functions of $k$ on double logarithmic axes, as shown in Figures 1 and 2, respectively. Linear regression is performed on the linearized data, as shown by the straight red lines in these figures. From Figure 1 we see that the tail of the outdegree distribution of CERNET follows a power law, $P_{\mathrm{out}}(k) \sim k^{-\gamma_{\mathrm{out}}}$, where $\gamma_{\mathrm{out}} = 2.48$. From Figure 2 we see that the indegree distribution also generally follows a power law, $P_{\mathrm{in}}(k) \sim k^{-\gamma_{\mathrm{in}}}$ with $\gamma_{\mathrm{in}} = 2.40$, although its tail is not very smooth. Both distributions differ greatly from the Poisson distribution predicted by the traditional theory of random graphs.
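The estimation procedure described above can be sketched as follows. The sketch assumes that the empirical probability of degree $k$ is the fraction of pages with that degree, and that the exponent is the negated least-squares slope of $\log P(k)$ against $\log k$. The function names are hypothetical, and the check uses synthetic data rather than the CERNET data. Unlike the analysis in the text, which fits the tail, this simplified version fits all points with $k \ge 1$.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Return {k: P(k)}, the fraction of nodes whose degree equals k."""
    counts = Counter(degrees)
    total = len(degrees)
    return {k: c / total for k, c in sorted(counts.items())}


def fit_power_law_exponent(dist):
    """Least-squares slope of log P(k) versus log k; returns gamma = -slope.

    Degrees k = 0 are skipped because log 0 is undefined.
    """
    points = [(math.log(k), math.log(p)) for k, p in dist.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx


# Synthetic check: an exact power law P(k) = C * k^(-2.5) on k = 1..50.
raw = {k: k ** -2.5 for k in range(1, 51)}
norm = sum(raw.values())
synthetic = {k: v / norm for k, v in raw.items()}
gamma = fit_power_law_exponent(synthetic)  # close to 2.5 for this synthetic input
```

On real data the low-degree head and the noisy high-degree tail usually pull the fitted slope in different directions, so the range of $k$ used for the regression should be chosen and reported explicitly.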
We perform a statistical analysis of these data and obtain the accumulated frequency of each degree and the corresponding ratio of that degree to the total degree in CERNET, as shown in Table 1.
From Table 1 we can see that a large number of pages have few connections, a few pages have a medium number of connections, and a tiny minority of notable pages have a large number of connections. This phenomenon is similar to the result reported by Albert et al. [1].
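One way to tabulate this kind of summary is sketched below. It assumes that "accumulated frequency" means the share of pages whose degree does not exceed a threshold, and that the degree ratio is the share of the total degree held by those pages; the exact layout of Table 1 may differ. The function name and the degree sequence are hypothetical.

```python
def cumulative_degree_table(degrees, thresholds):
    """For each threshold t, return a row (t, node_share, degree_share).

    node_share   : fraction of nodes with degree <= t
    degree_share : fraction of the total degree held by those nodes
    """
    n = len(degrees)
    total_degree = sum(degrees)
    rows = []
    for t in thresholds:
        selected = [d for d in degrees if d <= t]
        rows.append((t, len(selected) / n, sum(selected) / total_degree))
    return rows


# Hypothetical degree sequence: many small degrees, one large hub.
degrees = [1, 1, 1, 1, 1, 1, 2, 2, 10, 80]
table = cumulative_degree_table(degrees, thresholds=[1, 2, 10, 80])
```

In this toy sequence, 90% of the nodes hold only 20% of the total degree, which is the kind of concentration the paragraph above describes.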
This virtual network of CERNET is made up of subsets of web pages belonging to different universities. The number of web pages in each subset is determined by the corresponding university; the addition and deletion of pages depend entirely on the university to which the pages belong. However, we find that although the number of pages differs between universities, the subsets share some similar features. For example, the proportion of pages that have output links is less than 25% of the total number of pages in every university, while the proportion of pages that have input links is usually greater than 85%. Only a very small number of pages have both output links and input links. Hence, if each university is treated as a subnetwork, then in each subnetwork most nodes have only input edges, a few nodes have only output edges, and nodes with both input and output edges are rare. From these features we know that each university connects to other universities through a small number of pages, as shown in Table 2.
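The per-university proportions discussed above can be computed with a short routine such as the following. This is a sketch under the assumption that each subnetwork is given as a set of page identifiers together with the list of directed edges; the function name and the toy data are hypothetical.

```python
def link_profile(pages, edges):
    """Return the shares of pages with outgoing links, incoming links, and both.

    `pages` holds the page ids of one subnetwork; `edges` holds (source, target)
    pairs. Edges whose endpoint lies outside `pages` do not count for that end.
    """
    pages = set(pages)
    edges = list(edges)  # allow any iterable, since we scan it twice
    has_out = {s for s, _ in edges if s in pages}
    has_in = {t for _, t in edges if t in pages}
    n = len(pages)
    return {
        "out": len(has_out) / n,
        "in": len(has_in) / n,
        "both": len(has_out & has_in) / n,
    }


# Toy subnetwork: a0 links to a1..a8, and a9 links back to a0.
toy_pages = [f"a{i}" for i in range(10)]
toy_edges = [("a0", f"a{i}") for i in range(1, 9)] + [("a9", "a0")]
profile = link_profile(toy_pages, toy_edges)
```

For the toy data, 20% of the pages have output links, 90% have input links, and 10% have both, which mirrors the pattern reported for the universities.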

The Evolution Model of CERNET
Using the mechanisms of growth and preferential attachment, the scale-free model proposed by Barabasi et al. can to some degree explain many complicated phenomena in the real world. However, this model cannot be applied directly to CERNET. For example, every newly attached node has output edges in the scale-free model, but in the directed network of CERNET a large share of newly attached nodes have only one input edge; that is, these nodes have zero outdegree. Moreover, in the scale-free model a newly added node searches the whole network for preferred nodes to connect to, while in CERNET newly added pages generally connect to pages in the same university. Only occasionally do newly added pages connect to pages in other universities, and even then they do not search the whole of CERNET for the best pages to connect to. Based on these features of CERNET, we propose the following evolution model.
(ii) At each time step, a new node is added at random to one of the subsets of the network. There are 5 cases for the edges that are added together with the new node: (1) the new node has only one input edge; (2) the new node has only $m$ output edges; (3) the new node has one input edge and one output edge; (4) the new node has one input edge and $m - 1$ output edges; (5) the new node has one output edge and $m - 1$ input edges. Here $m \le m_{0\min}$, where $m_{0\min} = \min(m_{01}, m_{02}, \ldots, m_{0n})$ is the minimum initial number of nodes among the $n$ subsets.
(iii) With probability $p$, the new node added to the network carries one input edge. This node randomly chooses a subset and is linked from a preferentially selected node in that subset. Let $\Pi_{\mathrm{out}}(k_i)$ denote the probability that node $i$ is selected as the source node; $\Pi_{\mathrm{out}}(k_i)$ is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$.
(iv) When the new node with $m$ output edges is added to the network, there are 2 cases to consider. The probabilities of the two cases are $q_1$ and $q_2$, respectively.
(1) In the first case, the new node randomly chooses a subset and connects to a preferentially selected node in that subset. Let $\Pi_{\mathrm{in}}(k_i)$ denote the probability that node $i$ is selected as the target node; $\Pi_{\mathrm{in}}(k_i)$ is determined by $k_i^{\mathrm{in}}$, the indegree of $i$. The remaining $m - 1$ output edges are processed one at a time: each edge randomly chooses a subset to which the new node is not yet connected and connects to a preferentially selected node in that subset, until all $m - 1$ output edges are processed.
(2) In the second case, the new node again randomly chooses a subset, but this time it preferentially chooses $m - 1$ nodes in that subset and connects to them. Let $\Pi_{\mathrm{in}}(k_i)$ denote the probability that node $i$ is selected as the target node; $\Pi_{\mathrm{in}}(k_i)$ is determined by $k_i^{\mathrm{in}}$, the indegree of $i$. For the remaining output edge, the new node randomly picks a subset to which it is not yet connected and connects to a preferentially selected node in that subset.
(v) When the new node with one input edge and one output edge is added to the network, there are also 2 cases to consider. The probabilities of the two cases are $r_1$ and $r_2$, respectively.
(1) In the first case, the new node randomly chooses a subset and is linked from a preferentially selected node in that subset. The probability that node $i$ is selected as the source node is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$. The output edge of the new node randomly selects a subset to which the new node is not yet connected and connects to a preferentially selected node in it. The probability that a node $j$ is selected as the target node is determined by $k_j^{\mathrm{in}}$, the indegree of $j$.
(2) In the second case, the new node randomly chooses a subset and is linked from a preferentially selected node in that subset. The probability that node $i$ is selected as the source node is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$. The output edge of the new node stays in the same subset and connects to a preferentially selected node other than the source of the input edge. The probability that a node $j$ is selected as the target node is determined by $k_j^{\mathrm{in}}$, the indegree of $j$.
(vi) With probability $s$, the new node added to the network carries 1 input edge and $m - 1$ output edges. This node randomly chooses a subset and is linked from a preferentially selected node in that subset. The probability that node $i$ is selected as the source node is determined by $k_i^{\mathrm{out}}$, the outdegree of $i$. The $m - 1$ output edges are processed one at a time: each edge randomly chooses a subset to which the new node is not yet connected and connects to a preferentially selected node in that subset, until all $m - 1$ output edges are processed. The probability that node $j$ is selected as the target node is determined by $k_j^{\mathrm{in}}$, the indegree of $j$.
(vii) With probability $t$, the new node added to the network carries 1 output edge and $m - 1$ input edges. This node randomly chooses a subset and connects to a preferentially selected node in that subset. The probability that node $i$ is selected as the target node is determined by $k_i^{\mathrm{in}}$, the indegree of $i$. The $m - 1$ input edges are processed one at a time: each edge randomly chooses a subset to which the new node is not yet connected and is attached from a preferentially selected node in that subset, until all $m - 1$ input edges are processed. The probability that node $j$ is selected as the source node is determined by $k_j^{\mathrm{out}}$, the outdegree of $j$.
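The building block shared by all of these steps, a random choice of subset followed by a degree-weighted choice of node inside it, can be sketched as follows, together with the simplest case, step (iii). The sketch assumes selection probability proportional to the relevant degree within the chosen subset, assumes that the new node joins the same subset as its source, and falls back to a uniform choice when all candidate degrees are zero. These assumptions and all function names are hypothetical and are not taken from the source.

```python
import random


def pick_preferential(nodes, degree, rng):
    """Pick one node with probability proportional to degree[node].

    If every candidate has degree zero, fall back to a uniform choice
    (an assumption of this sketch; the source does not cover this case).
    """
    weights = [degree[v] for v in nodes]
    if sum(weights) == 0:
        return rng.choice(nodes)
    return rng.choices(nodes, weights=weights, k=1)[0]


def add_node_with_one_input_edge(subsets, out_deg, in_deg, edges, rng):
    """Step (iii): a new node joins a random subset and receives one edge
    from a source chosen preferentially by outdegree inside that subset."""
    subset = rng.choice(subsets)
    source = pick_preferential(subset, out_deg, rng)
    new_node = len(out_deg)  # node ids are 0, 1, 2, ...
    out_deg[new_node] = 0
    in_deg[new_node] = 1
    out_deg[source] += 1
    edges.append((source, new_node))
    subset.append(new_node)
    return new_node


# Two hypothetical subsets, each starting with one edge (0 -> 1 and 2 -> 3).
rng = random.Random(0)
subsets = [[0, 1], [2, 3]]
out_deg = {0: 1, 1: 0, 2: 1, 3: 0}
in_deg = {0: 0, 1: 1, 2: 0, 3: 1}
edges = [(0, 1), (2, 3)]
for _ in range(100):
    add_node_with_one_input_edge(subsets, out_deg, in_deg, edges, rng)
```

Because nodes added in step (iii) have zero outdegree, they are never selected as sources by this rule, so repeated application alone produces many pages with only input links and a few hub pages with only output links.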
The definitions of $\Pi_{\mathrm{in}}(k_i)$ and $\Pi_{\mathrm{out}}(k_i)$ are given in (3) and (4), respectively, and the relation between the different probabilities is given in (5). In (3) and (4), $N_s$ is the number of nodes in the subset to which the new edges connect. The denominator of (3) is the sum of the indegrees in that subset, and the denominator of (4) is the sum of the outdegrees in that subset.
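A form of (3) to (5) consistent with this description could be written as below. It assumes plain proportional (linear) preferential attachment inside the selected subset and assumes that the cases of steps (iii) to (vii) are exhaustive. The index $j$ runs over the $N_s$ nodes of the selected subset, and the probability symbols $p$, $q_1$, $q_2$, $r_1$, $r_2$, $s$, and $t$ for steps (iii) to (vii) are labels chosen for this sketch.

```latex
\begin{align}
\Pi_{\mathrm{in}}(k_i)  &= \frac{k_i^{\mathrm{in}}}{\sum_{j=1}^{N_s} k_j^{\mathrm{in}}},  \tag{3}\\
\Pi_{\mathrm{out}}(k_i) &= \frac{k_i^{\mathrm{out}}}{\sum_{j=1}^{N_s} k_j^{\mathrm{out}}}, \tag{4}\\
p + q_1 + q_2 + r_1 + r_2 + s + t &= 1. \tag{5}
\end{align}
```

Under this form, a node with zero indegree is never chosen as a target and a node with zero outdegree is never chosen as a source, so a variant with an additive offset in the numerator is also conceivable.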
With $m_0 = 12$, $m = 3$, and $n = 3$, we obtain the outdegree and indegree distributions of the simulated model, which are illustrated in Figures 3 and 4, respectively. Figures 5 and 6 compare the simulated data with the real data. From the comparison of the outdegree distributions we can see that the slope of the simulated data is 2.48, the same as that of the real data, but the beginning part of the simulated data cannot

Mathematical Problems in Engineering
We also compared other features of the simulated and real networks. For example, the average shortest path length is 8.95 for the real network and 7.81 for the simulated network, which is much closer to the real value than that of the model in [16].
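For reference, the average shortest path length of a directed graph can be computed with a breadth-first search from every node, as in the sketch below. It assumes that unreachable ordered pairs are skipped, since the source does not say how they were treated; the function name and the toy graph are hypothetical.

```python
from collections import deque


def average_shortest_path_length(graph):
    """Mean BFS distance over ordered pairs (u, v), u != v, with v reachable from u.

    `graph` maps each node to an iterable of its successors. Unreachable pairs
    are skipped (an assumption of this sketch).
    """
    total, pairs = 0, 0
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical directed chain a -> b -> c -> d.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
avg = average_shortest_path_length(chain)
```

For the chain, the six reachable pairs have distances 1, 2, 3, 1, 2, and 1, giving an average of 10/6. On a network with hundreds of thousands of nodes, running the search from a random sample of start nodes is a common way to keep the cost manageable.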
The main contribution of this paper is the evolution model of CERNET. The results show that the simulated model can partly reproduce the properties of this network. However, the model introduced in this paper is an idealized model, which means that only the main features of the real network are considered. With the help of rapidly growing computing power, we intend to refine this model so that it can be used in the analysis of ever larger complex networks.

Figure 1: Distribution of outdegree of real data.

Figure 2: Distribution of indegree of real data.

Table 1: The accumulated frequency and percentage of degree in CERNET.

Table 2: Degree and link features in some universities.