Diagnosing and Predicting the Earth’s Health via Ecological Network Analysis



Introduction
Ecological balance is one of the most actively studied topics in biology, environmental science, Earth science, and many related disciplines [1], especially now that industrialization has been under way for about two hundred years. To better understand how the biosphere responds to increasing pressure (e.g., population explosion, water and air pollution, climate change), a large body of research has been devoted to finding ways of alleviating the stress on the Earth. However, because ecosystems are complex, it is not easy to find a single approach that conclusively explains all the potential impacts [2] responsible for ecological fragility [3,4]. Among the various approaches, Ecological Network Analysis (ENA) [5,6] is regarded as one promising methodology for assessing the Earth's health [7].
Ecological networks can be extracted from various kinds of information, resulting in different kinds of networks. Each node represents a nation, a continent, an ocean, a habitat, or a park, and an edge is present when one node influences the other or the two influence each other, through anything from economic impact and population flow to environmental pollution, economic level, and so forth [5,8].
Since the network information is not explicitly provided, we start our research by constructing the global ecological network, extracting nodes and links from the world geography map (see Figure 1). We consider each nation or ocean as one node, and a link is present if the two countries/oceans (or a country and an ocean) are geographically neighboring. For example, the two nodes China and Russia are connected since they are neighboring countries. There is also a link between the Pacific Ocean and China since China borders the Pacific. Figure 2 shows the constructed network. Furthermore, we consider five time-dependent factors of each node that might affect the global ecological environment: population size, economic level, habitats area size, energy consumption, and air pollution.
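The construction rule described above (one node per nation or ocean, one undirected link per pair of geographic neighbours) can be sketched in a few lines of Python. This is only a minimal illustration: the pair list, the label `PACIFIC`, and the helper name `build_network` are assumptions made for the example and are not taken from the dataset used in this paper.

```python
from collections import defaultdict

# Hypothetical neighbour pairs for illustration only; the real list would be
# read off the world map in Figure 1. "PACIFIC" is a made-up label.
NEIGHBOUR_PAIRS = [
    ("CHN", "RUS"),      # China and Russia share a land border
    ("CHN", "PACIFIC"),  # China borders the Pacific Ocean
    ("RUS", "PACIFIC"),
    ("USA", "PACIFIC"),
]

def build_network(pairs):
    """Return an undirected adjacency map {node: set of neighbouring nodes}."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        if a == b:
            continue  # a node is never its own neighbour
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)

network = build_network(NEIGHBOUR_PAIRS)
```

Storing neighbours as sets keeps the graph simple (no duplicate links) and makes the degree of a node just the size of its set.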
After constructing the time-dependent ecological network, we start our research as follows.
(i) Firstly, we observe network topological properties by formulating both local features of a node and global topology of the network.
(ii) Secondly, we investigate how the dynamical factors evolve and how they affect the Earth's health.
(iii) Thirdly, we use a machine learning algorithm to identify the influential factors of the ecological network.
(iv) Fourthly, we design a spreading model to predict the Earth's health and perform a sensitivity analysis to test its robustness.
(v) Finally, we use the $k$-core decomposition method [9] to identify the influential nodes by considering factors as weights.
Many studies have adopted operations research (OR) methods, such as the minimum spanning tree (MST), to find mathematical solutions (e.g., the minimum cost/maximum flow). Hill proposed a matrix solution for counting the paths of a given length between any two nodes [10]. Finn presented three important indices for evaluating ecosystem flows, namely the Total System Through-Flow (TST), the Average Path Length of an Inflow (APL), and the Cycling Index (CI), which are widely adopted in discussing mass and energy flow mechanisms [11]. However, research from Network Science (NS) [12][13][14][15][16][17][18][19] considers the ecosystem problem from a different perspective. It not only abstracts nodes and their associated properties, but also takes into account the various kinds of interactions and functions among them. Therefore, NS-based methods, including link prediction and navigation, have been introduced in an attempt to discover the global topology by analyzing the local structure and dynamics [20,21].

Methods
In this paper, we adopt Network Science to analyze the ecological network, mainly because of its robustness and its explainable solutions in simulating both the static and dynamical properties of graphs. In this section, we describe the construction of our network model, including node and edge information, property formulation, and the dynamical evolution. In particular, we base our study on several necessary assumptions.
(i) All the data and information about the ecological network are collected from reliable statistical databases; (ii) both the properties and the topological position of a node in the network are important for evaluating its influence on the Earth's health; (iii) the spatial network has a structure and features similar to those of the real ecological network.

Static Properties.
In the constructed network, a graph $G(V, E)$ is used to describe its structure, where $V$ is the node set and $E$ is the edge set (Table 1 shows the basic information of the observed ecological network). Here, we consider four indices to analyze the static properties of the ecological network.
(i) Degree Index. The degree index [22] indicates how many nodes a nation/ocean connects to. Naturally, the degree index of node $i$ is defined as
$$k_i = \sum_{j} a_{ij}, \tag{1}$$
where $a_{ij} = 1$ if there is a link between nodes $i$ and $j$; otherwise $a_{ij} = 0$.
(ii) Betweenness Index. Betweenness [22] is defined as the number of shortest paths that pass through the target node. The larger the Betweenness is, the more important the connecting role of the target node. In a given network, the Betweenness is denoted as
$$B_i = \sum_{s \neq i \neq t} \sigma_{st}^{i}, \tag{2}$$
where $\sigma_{st}^{i} = 1$ if there is a shortest path between nodes $s$ and $t$ passing through node $i$, and $\sigma_{st}^{i} = 0$ otherwise.
(iii) Closeness Index. Closeness [22] is defined as the inverse of the distance between the target node and the other nodes. The larger the Closeness is, the closer the target node is to the rest of the network, and vice versa. In a given network, the Closeness is denoted as
$$C_i = \frac{1}{\sum_{j \neq i} d_{ij}}, \tag{3}$$
where $d_{ij}$ denotes the length of the shortest path between nodes $i$ and $j$.
(iv) $k$-Core Index. The $k$-core index [23], denoted as $k_s(i)$, is the core number of node $i$, that is, the largest value of $k$ for which a $k$-core contains that node. A $k$-core is obtained as follows: (a) remove from the graph all nodes of degree less than $k$; (b) repeat the removal, since deleting nodes lowers the degrees of their neighbours, until no further removal is possible. The remaining subgraph, if it exists, is the $k$-core. Thus, a network is organized as a set of successively enclosed $k$-cores.
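To make the indices above concrete, the following sketch computes the degree, the closeness, and the core number on a toy graph using only the Python standard library. It is an illustration under stated assumptions rather than the code used for the results in this paper: closeness is taken here as the reciprocal of the summed shortest-path lengths, the graph is assumed to be connected and undirected, and all function names are hypothetical. Betweenness is left out to keep the example short.

```python
from collections import deque

def degree_index(adj):
    """Number of neighbours of every node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness_index(adj):
    """Assumed convention: reciprocal of the summed shortest-path lengths."""
    result = {}
    for v in adj:
        total = sum(shortest_path_lengths(adj, v).values())
        result[v] = 1.0 / total if total > 0 else 0.0
    return result

def core_numbers(adj):
    """Peel nodes of degree <= k repeatedly; a node peeled at level k has core number k."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    core = {}
    k = 0
    while remaining:
        peel = [v for v, nbrs in remaining.items() if len(nbrs) <= k]
        if not peel:
            k += 1  # nothing left to remove at this level, move to the next shell
            continue
        for v in peel:
            core[v] = k
            for w in remaining.pop(v):
                if w in remaining:
                    remaining[w].discard(v)
    return core

# Toy graph: a triangle A-B-C with a pendant node D attached to A.
example = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
degrees = degree_index(example)
closeness = closeness_index(example)
cores = core_numbers(example)
```

In the toy graph the pendant node D is peeled first and receives core number 1, while the triangle survives as the 2-core.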

Dynamical Factors.
Besides the static network properties, we collect various data from the World Bank (http://data.worldbank.org/) to investigate the dynamical factors of all the nodes in the ecological network. The dataset includes population size, per capita GDP, area of land and marine, energy consumption per unit of GDP, and carbon dioxide emission of each country from year 1962 to 2011. Specifically, we observe the following five factors for each node.
(i) Population size, denoted as $\mathrm{PS}_i^t$, is the total population of node $i$ in the year $t$.
(ii) Economic level, denoted as $\mathrm{EL}_i^t$, is the per capita GDP of node $i$ in the year $t$.
(iii) Habitats area, denoted as $\mathrm{HA}_i^t$, is the total area of land and marine of node $i$ in the year $t$.
(iv) Energy consumption, denoted as $\mathrm{EC}_i^t$, is the energy consumption per unit of GDP of node $i$ in the year $t$.
(v) Air pollution, denoted as $\mathrm{AP}_i^t$, is the total amount of carbon dioxide emission of node $i$ in the year $t$.
Figure 3 shows how the five factors change from the year 1961 (HA starts from 1990 and EC from 1980 because earlier data are unavailable) to 2011. Generally, the values of all the factors increase year by year. The figure also shows that population size has a strong positive relationship with economic level. For example, China (CHN) and the USA are both among the top five nations by population size, and their average GDP also ranks high among all the 126 nations. Meanwhile, their air pollution ranks among the worst five nations, which might suggest that economic development has a negative impact on the environment. Furthermore, we list the top 20 nodes for both static properties and dynamical factors in Table 2. The Atlantic Ocean (ATO) plays the most significant role in maintaining the robustness of the ecological network because it connects the largest number of nodes (Table 3). Russia (RUS) ranks high on the network properties, with degree rank 5, Betweenness rank 2, and Closeness rank 2, but simultaneously has relatively bad air quality (rank 3). Other nations, such as China (CHN), are in a similar situation. By comparison, the USA does not rank near the top on the network structure properties but has a large population (rank 3) and a high economic level (rank 6), which might increase its impact on the global Earth's health.

Definition of the Earth's Health Index.
Inspired by the previous analyses, we consider that the Earth's health is affected not by a single factor but by the joint influence of many complicated factors. In this paper, we use the Shannon entropy [24][25][26] to integrate the impact of all these factors into a single measure of the Earth's health, denoted by EH:
$$\mathrm{EH} = -\sum_{i} \sum_{f \in F} p_i^f \log p_i^f, \tag{4}$$
where $p_i^f$ is the normalized fraction of factor $f$ for node $i$, $F$ denotes the set of all the factors defined in Section 2.2, and the final Earth's health value, EH, runs over the sum of all nodes. According to the original definition of the Shannon entropy, the larger the entropy value is, the more even the distribution is. Therefore, a large value of EH suggests a well-balanced ecological situation both among nations and among factors and hence indicates good Earth's health, and vice versa (Table 4).
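The entropy-based index can be illustrated with a short Python sketch. The normalization is an assumption made for this example only: each factor value is divided by the total of that factor over all nodes, so that the fractions of one factor sum to one. The function name `earth_health` and the toy values are likewise hypothetical.

```python
import math

def earth_health(factors):
    """Shannon-entropy index summed over nodes and factors.

    factors: {node: {factor_name: non-negative value}}.
    Assumption: each value is normalized by the factor's total over all nodes.
    """
    totals = {}
    for values in factors.values():
        for name, x in values.items():
            totals[name] = totals.get(name, 0.0) + x
    eh = 0.0
    for values in factors.values():
        for name, x in values.items():
            if x > 0 and totals[name] > 0:
                p = x / totals[name]
                eh -= p * math.log(p)  # zero entries contribute nothing
    return eh

# Two toy worlds with a single factor: evenly shared versus concentrated.
even = {"A": {"PS": 1.0}, "B": {"PS": 1.0}}
skewed = {"A": {"PS": 9.0}, "B": {"PS": 1.0}}
eh_even = earth_health(even)
eh_skewed = earth_health(skewed)
```

The evenly shared case attains the maximum entropy for two nodes, $\log 2$, whereas the concentrated case scores lower, matching the reading that a larger EH reflects a more balanced distribution.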

Identifying the Influential Factor via Machine Learning Approach

Random Forest.
We use Random Forest [27] to evaluate the importance of the factors. Random Forest is an ensemble regressor/classifier that consists of many decision trees and outputs the mode of the classes (for classification) or the mean of the values (for regression) produced by the individual trees. In this scenario, we apply the method as a regressor and use it to evaluate the importance of each feature. We choose the Random Forest model over other regression models because of the following advantages for our problem: (i) it can handle high-order variable interactions and correlated predictor variables; (ii) it can be used not only for prediction but also to assess variable importance; (iii) it can partially overcome the overfitting problem.

Base Learner: Classification and Regression Tree (CART).
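Given training vectors $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, and a label vector $y \in \mathbb{R}^l$, the decision tree recursively partitions the space such that samples with the same labels are grouped together. Let the data at node $m$ be represented by $Q$. For each candidate split $\theta = (j, t_m)$ consisting of a feature $j$ and a threshold $t_m$, partition the data into two parts:
$$Q_{\mathrm{left}}(\theta) = \{(x, y) \mid x_j \le t_m\}, \qquad Q_{\mathrm{right}}(\theta) = Q \setminus Q_{\mathrm{left}}(\theta). \tag{5}$$
The impurity at $m$ is computed using an impurity function $H$, the choice of which depends on the task being solved:
$$G(Q, \theta) = \frac{n_{\mathrm{left}}}{N_m} H(Q_{\mathrm{left}}(\theta)) + \frac{n_{\mathrm{right}}}{N_m} H(Q_{\mathrm{right}}(\theta)). \tag{6}$$
Then, select the parameters that minimize the impurity:
$$\theta^{*} = \arg\min_{\theta} G(Q, \theta). \tag{7}$$
After this, recurse for the subsets $Q_{\mathrm{left}}(\theta^{*})$ and $Q_{\mathrm{right}}(\theta^{*})$ until the maximum allowable depth is reached or the number of samples at the node, $N_m$, falls below a minimum, $N_m < N_{\min}$.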
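In a regression problem, for a node $m$ representing a region $R_m$ with $N_m$ observations, the impurity is the mean squared error around the node mean $c_m$:
$$c_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i, \tag{8}$$
$$H(Q) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - c_m)^2. \tag{9}$$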
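The split search of the base learner can be sketched as follows. The example assumes the mean squared error as the impurity function and tries every observed feature value as a threshold; the names `mse` and `best_split` are hypothetical, and a production tree would add the recursion and the stopping rules.

```python
def mse(ys):
    """Mean squared error of the targets around their mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_split(rows, targets):
    """Exhaustive search for the (feature, threshold) pair with the lowest weighted impurity.

    rows: list of feature lists; targets: list of numbers.
    Returns (impurity, feature_index, threshold), or None if no valid split exists.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        for t in sorted({row[j] for row in rows}):
            left = [y for row, y in zip(rows, targets) if row[j] <= t]
            right = [y for row, y in zip(rows, targets) if row[j] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = len(left) / n * mse(left) + len(right) / n * mse(right)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

rows = [[1.0], [2.0], [3.0], [4.0]]
targets = [0.0, 0.0, 10.0, 10.0]
split = best_split(rows, targets)
```

On this toy data the targets jump between the second and third samples, so the threshold 2.0 on feature 0 separates them perfectly and yields zero impurity.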

Construction of Random Forest.
Let the number of training cases be $N$ and let the number of variables in the regressor be $M$. We are told the number $m$ of input variables to be used to determine the decision at a node of the tree; $m$ should be much less than $M$.
(i) Choose a training set for this tree by choosing $N$ times with replacement from all $N$ available training cases (i.e., take a bootstrap sample). Use the rest of the cases to estimate the error of the tree.
(ii) For each node of the tree, randomly choose $m$ variables on which to base the decision at that node. Calculate the best split based on these $m$ variables in the training set.
(iii) Each tree is fully grown and not pruned (as may be done in constructing a normal tree classifier). For prediction, a new sample is pushed down the tree. It is assigned the value of the training sample in the terminal node it ends up in.
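The two sources of randomness in the construction above, the bootstrap sample per tree and the random subset of variables per node, can be sketched as follows. The function names, the seed, and the sizes (30 cases, 5 variables, 2 candidates per node) are assumptions chosen for illustration.

```python
import random

def bootstrap_indices(n_cases, rng):
    """Draw n_cases indices with replacement; unseen indices form the out-of-bag set."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in chosen]
    return in_bag, out_of_bag

def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices to consider at one tree node."""
    return rng.sample(range(n_features), m)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, out_of_bag = bootstrap_indices(30, rng)
features = candidate_features(5, 2, rng)
```

Because sampling is done with replacement, some cases appear several times in the bag and others not at all; the latter are the cases left over for estimating the error of the tree.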

Out-of-Bag (OOB) Evaluation.
To evaluate the Random Forest model, we use 2/3 of the data as the training set and regard the remaining 1/3 (the so-called Out-of-Bag sample) as the test set when constructing each base learner. We average the results over 50 rounds of simulation to reduce random fluctuations. For each tree $t$, the Out-of-Bag (OOB) evaluation proceeds as follows.
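(i) Consider the associated $\mathrm{OOB}_t$ sample.
(ii) Denote by $\mathrm{errOOB}_t$ the error of a single tree $t$ on this $\mathrm{OOB}_t$ sample.
(iii) Permute the values of the $j$th variable within $\mathrm{OOB}_t$ at random, and denote by $\mathrm{err}\widetilde{\mathrm{OOB}}_t^{\,j}$ the error of tree $t$ on the permuted sample.
(iv) The importance of the variable is the average increase in error over all $n_{\mathrm{tree}}$ trees:
$$\mathrm{err}(X^j) = \frac{1}{n_{\mathrm{tree}}} \sum_{t} \left( \mathrm{err}\widetilde{\mathrm{OOB}}_t^{\,j} - \mathrm{errOOB}_t \right), \tag{10}$$
where $X^j$ represents the $j$th variable of each data point. We then compute $\mathrm{err}(X^j)$ for each $j$ and normalize it. The bigger $\mathrm{err}(X^j)$ is, the more important the variable is. Figure 4 shows the $\mathrm{err}(X^j)$ result for all five features. Among all the features, the $\mathrm{err}(X^j)$ value of habitats area size is the biggest, which makes it the most important factor for the Earth's health. This agrees with reports in the public media that people, especially those living in cities, now occupy less and less space than before, resulting in comparatively much worse living conditions.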
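The permutation-based importance can be sketched independently of any particular tree implementation. In the example below each tree is represented simply by a prediction function together with the indices of its out-of-bag cases; this representation, the use of the mean squared error, and all names are assumptions made for illustration.

```python
import random

def mean_squared_error(predict, rows, targets):
    """Average squared difference between predictions and targets."""
    return sum((predict(r) - y) ** 2 for r, y in zip(rows, targets)) / len(rows)

def permutation_importance(trees, rows, targets, j, rng):
    """Average increase in out-of-bag error after shuffling feature j.

    trees: list of (predict, oob_indices) pairs, one per tree.
    rows: list of feature lists; targets: list of numbers.
    """
    total = 0.0
    for predict, oob in trees:
        oob_rows = [rows[i] for i in oob]
        oob_targets = [targets[i] for i in oob]
        base_error = mean_squared_error(predict, oob_rows, oob_targets)
        column = [r[j] for r in oob_rows]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(oob_rows, column)]
        total += mean_squared_error(predict, permuted, oob_targets) - base_error
    return total / len(trees)

# Toy check: the target equals feature 0, and feature 1 is a constant.
rows = [[float(i), 0.0] for i in range(10)]
targets = [r[0] for r in rows]
trees = [(lambda r: r[0], list(range(10)))]
rng = random.Random(1)
importance_0 = permutation_importance(trees, rows, targets, 0, rng)
importance_1 = permutation_importance(trees, rows, targets, 1, rng)
```

Shuffling a constant feature changes nothing, so its importance is exactly zero, whereas shuffling the informative feature can only increase the error of a predictor that was exact before.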

Predicting the Earth's Health in Ecological Networks.
We use a dynamic spreading model [28] to predict the Earth's health by considering the observed ecological network structure.The model runs as follows.
(i) At the initial step, each node $i$ is assigned a health value $\mathrm{EH}_i^0$, obtained by averaging its Earth's health index (defined by (4)) over the most recent five years.
(ii) We then choose 10% of the nodes as "seed" nodes and add $\Delta\mathrm{EH}_i^0$ to each seed node. $\Delta\mathrm{EH}_i^0$ is calculated by averaging the increments of EH over the most recent five years.
(iii) At each time step $t+1$, each node $i$ affects the Earth's health index of all its neighbouring nodes $j$ by an amount $\Delta\mathrm{EH}^{t+1}_{i \to j}$, the Earth's health influence from node $i$ to node $j$, which is controlled by two tunable parameters, $\alpha$ and $\beta$.
(iv) Then, the Earth's health index of node $j$ at time step $t+1$ is obtained by summing these influences over all neighbours of $j$, that is, over all nodes $i$ with $a_{ij} = 1$, where $a_{ij} = 1$ if there is a link between node $i$ and node $j$, and $a_{ij} = 0$ otherwise.
(v) Finally, the global Earth's health at time step $t+1$ is summed over all the nodes:
$$\mathrm{EH}^{t+1} = \sum_{i} \mathrm{EH}_i^{t+1}. \tag{13}$$
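The loop structure of such a spreading process can be sketched as below. The sketch deliberately leaves the influence rule as a pluggable function, because the exact functional form and its two tunable parameters are not reproduced in this example; the placeholder rule used here, and the choice to add the incoming influences to the previous value of each node, are assumptions made purely for illustration.

```python
def spread_step(adj, health, influence):
    """One synchronous update of all node health values.

    adj: {node: set of neighbours}; health: {node: current value}.
    influence(eh_i, degree_i): hypothetical stand-in for the influence that
    node i exerts on each of its neighbours.
    Assumption: incoming influences are added to the previous value.
    """
    new_health = dict(health)
    for i, nbrs in adj.items():
        delta = influence(health[i], len(nbrs))
        for j in nbrs:
            new_health[j] += delta
    return new_health

def global_health(health):
    """Global index: sum of the node values."""
    return sum(health.values())

def placeholder_influence(eh, degree):
    """Illustrative rule only: share 10% of a node's value equally among its neighbours."""
    return 0.1 * eh / degree

# Toy path graph A - B - C with equal starting values.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
health = {"A": 1.0, "B": 1.0, "C": 1.0}
after_one_step = spread_step(adj, health, placeholder_influence)
total_after = global_health(after_one_step)
```

Updating from a copy keeps the step synchronous: every influence at step $t+1$ is computed from the values at step $t$, regardless of the order in which nodes are visited.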

Performance Comparison.
To test the performance of our spreading model, we set $\alpha = 4$ and $\beta = 0.65$ to predict the global Earth's health in the computer simulation.
In addition, we compare our model with Gaussian fitting. Figure 5 shows the comparison results. Our model fits the real data better than Gaussian fitting does. The proposed EH index also shows that the Earth's health has been getting worse since 2008, which warns us that we should pay much more attention to our environment. Correspondingly, the results from Section 2.4.4 suggest a possible solution: returning more living land might be the most effective way to resolve this dilemma.

Sensitivity Analysis.
We then perform a sensitivity analysis to test the robustness of our model. We randomly delete a fraction $p$ of the links and check whether the model remains reliable. Figure 6 reports the prediction results of the model for various values of $p$. The prediction is quite robust even when a large fraction of links, 60% for instance, is removed. Therefore, we conclude that our model is reliable for predicting the Earth's health.
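The perturbation used in this test, deleting a random fraction of the links, can be sketched as follows. The function name, the seed, and the toy edge list are assumptions for illustration; the perturbed edge list would then be fed back into the prediction model.

```python
import random

def remove_links(edges, fraction, rng):
    """Return a copy of the edge list with a random fraction of links removed."""
    edges = list(edges)
    n_remove = int(round(fraction * len(edges)))
    removed = set(rng.sample(range(len(edges)), n_remove))
    return [e for idx, e in enumerate(edges) if idx not in removed]

rng = random.Random(42)  # fixed seed so the perturbation is reproducible
edges = [(i, i + 1) for i in range(10)]  # toy chain with 10 links
kept = remove_links(edges, 0.6, rng)
```

Sampling positions rather than edge values keeps the routine correct even if the same pair of nodes were listed more than once.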

Identifying the Influential Node in Ecological Networks.
Our Earth's health index is intended to diagnose and predict the global status of the Earth's health. In addition, in order to find which node (that is, which nation or ocean) plays the most important role in affecting the Earth's health, we rank the nodes by importance. We integrate the $k$-core value, $k_s(i)$ (see Section 2.1), and the Earth's health index $\mathrm{EH}_i$ to evaluate the importance of node $i$, denoted $I_i$ (equation (14)). Figure 7 illustrates the node importance $I_i$ versus the corresponding rank, where some typical nodes are marked. The oceans (with the highest rank) are indeed the key nodes in affecting the Earth's health; the USA, Russia, and China are also important nations in influencing the Earth's health. Some small nations, such as Madagascar (MDG) and Iraq (IRQ), play less important roles for the Earth's health.

Conclusion and Discussion
In this paper, we collect various data and construct a world ecological network of 145 nodes (126 nations and 19 oceans/seas) and 403 edges, with each node representing a nation or an ocean and each edge representing the geographical neighboring relationship of the corresponding two nodes. Firstly, we analyze both the topological properties and the time-dependent features of the nodes. Secondly, we propose an Earth's health index based on the Shannon entropy. Thirdly, we identify the importance of the factors by a machine learning approach (Random Forest). Fourthly, we design a spreading model to predict the Earth's health and perform a sensitivity analysis to test its robustness. Finally, we integrate the topological property ($k$-core index) and the health index to identify the influential nations in the observed ecological network.
The model results indicate that the oceans (with the highest rank) are indeed the key nodes in affecting the Earth's health. The big countries, for example, the USA, Russia, and China, are also important nations in influencing the Earth's health. Correspondingly, the results suggest a possible solution: returning more living land might be the most effective way to resolve this dilemma. The combination of topological properties and local factors leads to good performance in predicting both the good and the bad trends of the Earth's health. The model can be easily extended by considering more factors. However, our model needs empirical support from more extensive data. Also, the incremental mechanism may hinder long-term prediction.

Figure 1: Illustration of the world map.

Figure 2: The 145-node network (126 nations and 19 oceans/seas), where edges represent the geographically neighboring relationship. The positions of the nodes are drawn according to their respective centrality index, regardless of their geographical positions. The size of each node is set according to the stationary value of (14).

Figure 3: Evolution of the five factors of all nodes over time. (a) Population; (b) economic level; (c) habitats area; (d) energy consumption per GDP; (e) air pollution. In each panel, the five nations with the largest values of the corresponding property are highlighted.

Figure 6: Comparison with the original network when a fraction $p$ of the links is removed from the network, for $p$ = 0.2, 0.4, and 0.6, respectively.

Table 1: Basic statistics of the observed ecological network.

Table 2: Top 20 nodes of the static and dynamical properties. $k$ denotes the degree index, $B$ denotes the Betweenness index, $C$ denotes the Closeness index, and $k_s$ denotes the $k$-core index. PS, EL, HA, EC, and AQ denote, respectively, the population size, economic level, habitats area size, energy consumption, and air pollution in the year 2011.

Table 3: Abbreviations of ocean nodes.

Table 4: Abbreviations of nation nodes.

Randomly permute the values of $X^j$ in $\mathrm{OOB}_t$ to get a permuted sample denoted by $\widetilde{\mathrm{OOB}}_t^{\,j}$.