Predicting the Robustness of Large Real-World Social Networks Using a Machine Learning Model

Estimating network robustness through attack simulations carries a high computational cost, especially for large networks. Here, we propose a methodology by which the robustness of large real-world social networks can be predicted using machine learning models pretrained on existing datasets. We demonstrate this approach by simulating two effective node attack strategies, i.e., the recalculated degree (RD) and initial betweenness (IB) node attack strategies, and predicting network robustness using two machine learning models, multiple linear regression (MLR) and the random forest (RF) algorithm. We use the classic network robustness metric R as the model response and 8 network structural indicators (NSIs) as predictor variables, trained over a large dataset of 48 real-world social networks with up to 265,216 nodes. We found that the RF model can predict network robustness with a root mean squared error (RMSE) of 0.03 and is 30% better than the MLR model. Among the results, we found that the RD strategy has more efficacy than IB for attacking real-world social networks. Furthermore, MLR indicates that the most important factors for predicting network robustness are the scale-free exponent α and the average node degree <k>. On the contrary, the RF indicates that the degree assortativity a, the global closeness, and the average node degree <k> are the most important factors. This study shows that machine learning models can be a promising way to infer social network robustness.


Introduction
The study of social networks from a complexity science perspective has attracted much interest recently [1]. In particular, the study of dynamic processes that take place in these complex networks can have various applications. For example, the study of network robustness, i.e., the capacity of a network to hold its functionality when a proportion of nodes/edges is removed, can help attack a network efficiently or, inversely, design a more robust network structure in practice [2][3][4][5][6][7]. On the other hand, the study of epidemic processes that take place in the network can be used to spread news [8][9][10][11][12], optimize vaccination strategies [13][14][15], or define better social-distancing rules [16][17][18][19].
Apart from a few simple model networks for which analytical models can be developed [20][21][22][23][24], most studies rely on computer simulations. For example, to study a network's robustness, node/edge removal Monte Carlo simulations are usually employed. In such a process, nodes/edges are sequentially removed from the network using computer simulations. A "robustness" metric is then recorded at each step of the removal process. The most commonly used robustness metric is the largest connected component (LCC) of the remaining network [25].
The way nodes/edges are selected to be removed is called the removal strategy or attack strategy. Attack strategies can be classified into two types, initial and recalculated. For an initial attack strategy, nodes/edges are removed according to a node/edge ranking that is computed ahead of the removal simulation. In contrast, for a recalculated attack strategy, the ranking is updated after each node/edge removal [4].
Because of the sequential nature of the removal process, the node removal simulation is computationally costly, especially for recalculated strategies. For example, a simulation using the RD attack strategy has a time complexity of O(N × E), where N is the number of nodes and E is the number of edges of the network. The reason is that the node removal process has N steps, and at each step, a degree ranking is computed in a time that scales with E. For betweenness-based strategies, however, the computation of the whole network's betweenness is known to be very computationally costly, owing to the definition of node betweenness [31,32]. The most efficient known algorithm for calculating network betweenness is the Brandes algorithm [33], which has a time complexity of O(N × E). In consequence, the whole node removal process using the IB and RB (recalculated betweenness) attack strategies has a time complexity of O(N × E) and O(N² × E), respectively. Although the IB attack strategy has the same time complexity as the RD attack strategy, the RB time complexity is much higher. For illustration, in Figure 1, we present the total simulation times tIB and tRD for the attack strategies IB and RD, respectively, for all the social networks in our study (48 networks, see Section 2). In addition, we present the total simulation time tRB for the attack strategy RB for 4 networks (inset graph) as an example, as a function of the product N × E. We found a good linear relationship between tIB and tRD and N × E for all networks, as expected, and tRB is about two orders of magnitude higher than tIB and tRD for networks of equal N × E.
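The O(N × E) cost of the RD strategy comes directly from its removal loop: one degree recalculation plus one component scan per removed node. A minimal pure-Python sketch of this loop (an illustrative example, not the authors' graph-tool implementation; `adj` is assumed to be a dict mapping each node to a set of neighbours, and ties are broken deterministically here rather than at random as in the paper):

```python
from collections import deque

def largest_cc_size(adj, alive):
    """Size of the largest connected component among 'alive' nodes (BFS)."""
    seen, best = set(), 0
    for s in alive:
        if s in seen:
            continue
        comp, queue = 0, deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            comp += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, comp)
    return best

def rd_attack(adj):
    """Recalculated degree (RD) attack: remove the highest-degree surviving
    node at each step; return the LCC fraction after each removal."""
    n = len(adj)
    alive = set(adj)
    lcc = []
    while alive:
        # Recalculate degrees restricted to surviving nodes (the 'RD' step).
        target = max(alive, key=lambda u: sum(1 for v in adj[u] if v in alive))
        alive.remove(target)
        lcc.append(largest_cc_size(adj, alive) / n)
    return lcc
```

On a 5-node star, the first removal hits the hub and the LCC fraction immediately drops to 0.2, illustrating why hub-targeting strategies dismantle networks quickly.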
The simulation time can become an issue for social networks because their size can be extremely large. In fact, to our knowledge, most studies of dynamic processes on social networks that use an RB attack strategy only consider small real-world social networks of less than 100,000 nodes [7,28,30]. For very large social networks, the RB node attack strategy can take an unrealistic amount of time; therefore, RB is not suitable for large social networks on an average computer workstation. One possibility is to use the alternative betweenness-based attack strategy that requires only one betweenness calculation, namely, the initial betweenness attack strategy IB, together with recalculated strategies that use a node centrality metric that is less computationally costly. In consequence, in this work, we consider two candidate attack strategies for breaking large real-world social networks, the IB and RD attack strategies.

Besides comparative studies between different network node attack strategies, other works have focused on the relationship between network robustness and network structural indicators (NSIs). Iyer et al. [4] studied network robustness as a function of the node clustering coefficient (or node transitivity). Their research on model networks with tunable clustering coefficients demonstrates that networks with higher clustering coefficients are more robust, with the strongest effect for the node degree and node betweenness attacks [4]. Nguyen and Trang [34] studied Facebook social networks and found that networks with higher modularity Q have lower robustness to node removal. The modularity indicator Q, introduced by Newman and Girvan [35], measures how well a network breaks into communities (a community or module in a network is a well-connected group of nodes that has sparser connections with nodes outside the group).
In [29], the authors empirically analyzed how the modularity of scale-free models and real-world social networks affects their robustness and the relative efficacy of different node attack strategies. The abovementioned studies analyzed the relationship between network robustness and a single NSI.
On the other hand, machine learning (ML) is a technique that has seen a huge breakthrough in the last decade, beating state-of-the-art results in many prediction applications [36]. It initially solved technical problems in computer vision and natural language processing [37][38][39] and then expanded into many other fields such as health care, finance, manufacturing, energy, and the environment. The key characteristic of an ML model is its ability to learn nonlinear relationships between input and output without knowing them explicitly.
In this work, given such a complex relationship between network robustness and NSIs, we adopted a machine learning approach to learn this complexity. Our main contribution is the application of ML models to predict real-world social network robustness with acceptable errors. We develop ML models to predict network robustness under two main attack strategies, the IB and RD attack strategies, independently. We implemented three popular ML models: simple linear regression, multiple linear regression, and random forest models. Our results demonstrate that a data-driven method such as ML can be an efficient way to study a network's complexity.
Our work comprises three steps: (1) collect a real-world network dataset and compute NSIs; (2) run Monte Carlo node attack simulations to estimate network robustness; (3) build and evaluate a model that predicts network robustness from the NSIs. The paper is organized as follows: in Section 2, we describe our dataset of 48 real-world social networks. In Section 3, we describe the network robustness Monte Carlo simulation method and three ML models for predicting network robustness, i.e., simple and multiple linear regression (SLR and MLR, respectively) and the random forest (RF) model. Section 4 presents the main results, and finally, we discuss and conclude in Section 5.

Real-World Social Network Datasets and Robustness Estimation
Real-world social networks are downloaded from two sources: the Stanford Large Network Dataset Collection (https://snap.stanford.edu/data/) and the Network Repository social networks (https://networkrepository.com/soc.php). We select 48 social networks with a node number (N) spanning more than two orders of magnitude. The smallest network is the "Twitch user-user network of gamers who stream in Portugal" with N = 1,914, and the largest network is the "e-mail network from an EU research institution" with N = 265,216. However, the network with the largest number of edges (E) is the "BlogCatalog social blog" with E = 4,186,390. The social networks used in this study are unweighted (i.e., we do not take into account edge weights) and undirected (we do not consider edge directionality). Table 1 summarizes the 48 real-world social networks and their NSIs. Besides N and E, we also compute the following NSIs:

(i) Network density (<k>): the average node degree, i.e., the average number of edges per node.

(ii) Fitted scale-free exponent (α): we assume that all social network degree distributions follow a power law P(k) ∼ k^(−α), where k is the node degree. The exponent α is fitted using the ordinary least squares method. From this fitting, we also extract the fitting variance of α, denoted by α2.

(iii) Assortativity (a): the assortativity coefficient is the Pearson correlation coefficient of the degrees of pairs of linked nodes [40], which varies between −1 and 1. A positive value of a indicates a preferential connection between nodes of similar degree, while negative values indicate that nodes of different degrees have more chance to connect.
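The assortativity coefficient (iii) can be computed directly from its definition as a Pearson correlation over edge endpoints. A minimal sketch in pure Python (illustrative only; `edges` is assumed to be a list of undirected node pairs), counting each undirected edge in both directions so that the two endpoint degree lists are symmetric:

```python
def degree_assortativity(edges):
    """Degree assortativity: Pearson correlation of the degrees found at the
    two ends of each edge, counting each undirected edge in both orientations."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mean = sum(xs) / n            # xs and ys share the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return cov / var              # Pearson r, since both sides share 'var'
```

For a star graph every edge links the hub (high degree) to a leaf (degree 1), so the coefficient is exactly −1, the textbook disassortative extreme.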
(iv) Modularity (Q): the modularity indicator Q measures how well a network can be partitioned into subnetworks (modules or communities):

Q = (1/(2E)) Σ_{i,j} [a_ij − (k_i k_j)/(2E)] δ(c_i, c_j),

where E is the number of edges, a_ij is the element of the adjacency matrix A in row i and column j, k_i is the degree of node i, k_j is the degree of node j, c_i is the module (or community) of node i and c_j that of node j, the sum runs over all pairs of nodes i and j, and δ(x, y) is 1 if x = y and 0 otherwise [13].

(v) Global clustering coefficient (C): the global clustering coefficient is based on triplets of nodes. A triplet is three nodes connected by either two (open triplet) or three (closed triplet) undirected edges. The global clustering coefficient is the number of closed triplets (or 3× the number of triangles, because a triangle comprises 3 overlapping triplets, each centered at one of the three nodes) over the total number of triplets (both open and closed):

C = λ_closed / λ_total,

where λ_closed is the number of closed triplets and λ_total is the total number of triplets in the network. The global clustering coefficient represents the overall probability that adjacent nodes in the network are interconnected, thus forming more tightly connected modules [41].

Figure 1: Computation time of a complete Monte Carlo network node attack simulation for all studied real-world social networks (using the initial betweenness (IB) and recalculated degree (RD) attack strategies) and for 4 networks (using the recalculated betweenness (RB) strategy) as a function of the product N × E (node number N times edge number E). We found that tIB and tRD scale approximately linearly with the product N × E, while tRB scales linearly with the product N² × E (inset graph). From this result, we estimate that an RB simulation for the largest networks in our dataset would take more than 50 days on the same hardware.
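The global clustering coefficient (v) follows directly from counting triangles and wedges: λ_closed = 3 × (number of triangles) and λ_total = Σ_v k_v(k_v − 1)/2. A sketch in pure Python (illustrative; `adj` is assumed to be a dict of neighbour sets, with nodes that support `<` comparison so each triangle is counted once):

```python
def global_clustering(adj):
    """C = lambda_closed / lambda_total, with lambda_closed = 3 * triangles
    and lambda_total = sum over nodes of k*(k-1)/2 (open + closed triplets)."""
    triangles = 0
    for u in adj:
        for v in adj[u]:
            if v > u:
                # Common neighbours w > v close a triangle u < v < w exactly once.
                triangles += sum(1 for w in adj[u] & adj[v] if w > v)
    total = sum(k * (k - 1) // 2 for k in (len(adj[u]) for u in adj))
    return 3 * triangles / total if total else 0.0
```

A triangle graph gives C = 1 (every triplet closed), while a 3-node path gives C = 0 (its single triplet is open).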
(vi) Average closeness (Cl): the average of all network nodes' closeness, where the closeness (or closeness centrality) of a node is calculated from the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph [42,43]:

Cl(i) = (N − 1) / Σ_{j≠i} d(i, j),

where N is the number of nodes and d(i, j) is the length of the shortest path between nodes i and j.
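For an unweighted graph, the shortest-path sums in the closeness formula can be obtained with one breadth-first search per node. A pure-Python sketch (illustrative; assumes a connected graph given as a dict of neighbour sets, and uses the normalized form Cl(i) = (N − 1)/Σ d(i, j)):

```python
from collections import deque

def average_closeness(adj):
    """Average normalized closeness over all nodes, via one BFS per node."""
    n = len(adj)
    total = 0.0
    for s in adj:
        # BFS shortest-path lengths from s (unweighted graph).
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        ssum = sum(dist.values())
        total += (n - 1) / ssum if ssum else 0.0
    return total / n
```

On a 3-node path, the centre node has closeness 1 and each end node 2/3, so the average is 7/9.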

Network Robustness Monte Carlo Simulation.
For each network, we run two node removal processes using Monte Carlo simulations. Nodes are removed consecutively following the ranking of initial betweenness (IB) or the ranking of the recalculated degree (RD). In the case of ties, e.g., nodes with equal betweenness or degree scores, we remove one of them at random. After each node removal, we compute the relative size of the largest connected component (LCC) together with the accumulated proportion of removed nodes q. We thus obtain two curves LCC(q) corresponding to the two node removal processes, IB and RD. The whole simulation is repeated 10 times, and the final curves LCC(q) are the averaged results. In addition, we compute a single value defined as the network robustness (R), as done by Bellingeri et al. [44], namely, the area below the normalized LCC curve during the removal process:

R = (1/N) Σ_q LCC(q).

R can therefore lie between two theoretical extremes, R ≃ 0 (absolutely fragile network) and R ≃ 0.5 (absolutely robust network). We denote by RRD and RIB the network robustness against the RD and IB node attack strategies, respectively. In summary, we collect 48 real-world social networks and compute 9 NSIs for each network as inputs. In parallel, we run Monte Carlo simulations and obtain the robustness represented by two metrics, RRD and RIB; the higher they are, the more robust the network. These two metrics are the outputs for each network and will be predicted using ML models.
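Given the LCC(q) curve recorded during a removal simulation, R reduces to an average of the recorded LCC fractions. A minimal sketch (illustrative; `lcc_curve` is assumed to hold the LCC fraction after each of the N removals):

```python
def robustness_R(lcc_curve, n):
    """R = (1/N) * sum of LCC fractions after each of the N removals,
    approximating the area under the normalized LCC(q) curve."""
    return sum(lcc_curve) / n
```

For a maximally robust network, LCC decays linearly as LCC(q) ≈ 1 − q and R approaches the theoretical ceiling of 0.5; for a fragile network, LCC collapses after the first few removals and R is close to 0.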

Machine Learning Approach
This section presents the details of the SLR, MLR, and RF models.

Simple Linear Regression Model (SLR).
Linear regression is the simplest model for prediction. The SLR model between the network robustness R and an NSI x is expressed by the linear equation

R = a0 + a1 x,

where a0 is the intercept and a1 is the slope. Ordinary least squares (OLS) is applied to estimate the coefficients by minimizing an appropriate loss function [45,46]. Once the OLS process, also called the fitting process, is performed, the equation can be used to predict the robustness R of a new network from a given indicator x. In addition, we derive a t-test statistic from the OLS process with the null hypothesis H0: a1 = 0. A rejection of H0 means that there is a significant linear relationship between R and the NSI x.
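For a single predictor, the OLS estimates have a closed form: a1 is the covariance of x and R over the variance of x, and a0 follows from the means. A minimal sketch (illustrative; any statistics package would give the same result):

```python
def slr_fit(x, y):
    """Closed-form OLS for the simple linear model y = a0 + a1 * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a1 = sxy / sxx          # slope: cov(x, y) / var(x)
    a0 = my - a1 * mx       # intercept passes through the means
    return a0, a1
```

Fitting data that lies exactly on y = 1 + 2x recovers a0 = 1 and a1 = 2.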
We fit the SLR model for all NSIs listed in Table 1 excluding E, because it can be expressed in terms of two other NSIs: E = N<k>/2.

Multiple Linear Regression Model.
Multiple linear regression (MLR) is an extension of SLR to multidimensional variables x = (x1, x2, ..., xn), where x1, x2, ..., xn are NSIs. The linear equation between the network robustness R and the NSIs is as follows: R = a0 + a1 x1 + a2 x2 + · · · + an xn, where the coefficients ai are obtained from the OLS method.
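The MLR coefficients solve the normal equations (AᵀA)a = Aᵀy, where A is the design matrix with a leading column of ones for the intercept. A self-contained sketch using Gaussian elimination (illustrative only; in practice a library solver such as numpy's least squares would be used):

```python
def mlr_fit(X, y):
    """OLS for y = a0 + a1*x1 + ... + an*xn via the normal equations
    (A^T A) a = A^T y, solved by Gaussian elimination with partial pivoting."""
    rows = [[1.0] + list(r) for r in X]      # design matrix with intercept column
    p = len(rows[0])
    # Build the normal equations.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    # Forward elimination.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    # Back substitution.
    a = [0.0] * p
    for r in range(p - 1, -1, -1):
        a[r] = (b[r] - sum(A[r][k] * a[k] for k in range(r + 1, p))) / A[r][r]
    return a  # [a0, a1, ..., an]
```

Fitting data generated exactly by y = 1 + 2x1 + 3x2 recovers those coefficients.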

Random Forest Model.
The random forest (RF) belongs to the ensemble class of ML models, meaning that it aggregates the predictions of an ensemble of base ML models, here decision tree regression (DTR) models. We briefly describe the DTR in the following. A DTR starts with the root of the tree containing all samples (48 networks in our case). It then splits into two different nodes by separating samples whose value of a certain variable is higher or lower than a certain threshold value. Figure 2(a) represents a basic decision tree diagram for our dataset. The root node containing 48 networks splits into two other nodes according to whether the variable (an NSI in our case) scale-free exponent α is higher or lower than 2.5.
The DTR selects the variable and its splitting value based on information theory, specifically the concept of entropy. Entropy is a metric of the uncertainty of a node. The DTR splits a node by maximizing the information gain, which is the difference between the entropy of the initial node and the weighted total entropy of the two resulting nodes. The DTR splits successively until a stopping condition is reached, for example, when the size of the current node is smaller than 20. A terminal node is also called a leaf node. In Figure 2(a), after the first split of the root, the left child node becomes a leaf node, while the right child node continues to split into two leaf nodes.
Once the final DTR is obtained, it can be used to predict the value of a new sample as follows. The new sample is classified into one of the leaves, and its predicted value is the average value of all the training samples classified into the same leaf.
The decision tree can fit nonlinear datasets because it can split on the same NSI multiple times. However, a single decision tree is prone to overfitting, i.e., it is too sensitive to the training data and fails to generalize to new (testing) data. To address this problem, the RF model builds multiple decision trees, usually several hundred, each trained on a random draw of the data, and outputs the average prediction of all the trees, which often leads to strong predictions [47][48][49] (in this work, we implement an RF with 300 DTRs). From an RF, a "feature importance" measure can also be derived to rank the NSIs [50].
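The bootstrap-and-average idea behind the RF can be sketched with the simplest possible base learner, a one-split regression tree ("stump"). This is a deliberately simplified illustration: it uses a single feature and a variance-reduction split criterion (a common choice for regression trees), not the full 300-tree, multi-feature model used in the paper; all function names are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """One-split regression tree: pick the threshold that minimizes the
    summed squared error (variance) of the two resulting leaves."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = sse(left) + sse(right)
        if best is None or score < best[0]:
            lm = sum(left) / len(left)             # left leaf = mean of its samples
            rm = sum(right) / len(right) if right else lm
            best = (score, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def fit_forest(xs, ys, n_trees=300, seed=0):
    """Average the predictions of stumps fitted on bootstrap resamples."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # bootstrap draw with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(t(x) for t in trees) / len(trees)
```

Averaging over resamples is what tames the single tree's variance: each stump may misplace its threshold, but the ensemble prediction is stable.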

Data Preparation, Validation, and Performance Evaluation.

All NSIs can be computed from the network data, and thus, our dataset does not contain missing values. We also exclude E because of redundancy, as mentioned above. The other 8 NSIs are normalized to avoid large differences in the indicators' ranges:

x′_{i,j} = (x_{i,j} − x̄_i) / σ(x_i),

where x_{i,j} is the value of NSI i for observation (network) j and x̄_i and σ(x_i) are the mean and the standard deviation of NSI i, respectively. In a first step, we use the whole dataset to build the ML models and compare the results between models and between the two target variables. However, due to overfitting, an ML model's performance on new data is not always as good as in the training step, so in a second step we validate the models. We choose leave-one-out validation [51]. In this way, we train each of the above models 48 times: each time, the whole dataset excluding one observation is used to train the model, the model is then used to predict the target value of the remaining (hold-out) observation, and this is repeated for each of the 48 hold-out observations. The overall evaluation result is the average across all 48 regressions. Note that for the SLR model, we only consider the regression coefficients in order to analyze the dependence of the robustness metrics on each NSI. For the MLR and RF models, we analyze the prediction of the robustness metrics using four common evaluation metrics for regression problems: the root mean square error (RMSE) and the coefficient of determination (also named the explained variance ratio, R²) as analytical metrics, and the frequency distribution and the Q-Q plot of the residual errors as graphical metrics.
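The two data-preparation steps above, z-score normalization and leave-one-out splitting, can be sketched as follows (illustrative helpers, not the paper's actual pipeline):

```python
def zscore(column):
    """Normalize one NSI column to zero mean and unit standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / sd for v in column]

def leave_one_out(n):
    """Yield (train_indices, test_index) pairs, one fold per observation."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i
```

With 48 networks, `leave_one_out(48)` yields 48 folds; each model is refitted on 47 networks and scored on the held-out one, and the 48 scores are averaged.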
RMSE is the square root of the mean of the squared differences between observed and predicted data points. The RMSE has the same unit as the target feature and is generally considered the model error. A lower RMSE value represents superior prediction results. The formula of the RMSE is

RMSE = sqrt( (1/n) Σ_{j=1}^{n} (R_j − R_predicted,j)² ),

where n is the number of observations, R_j denotes the empirical (simulated) network robustness, and R_predicted,j is the predicted value of robustness for observation j. R² represents the general prediction performance of regression models: it is one minus the ratio of the remaining variance to the original variance. The formula of R² is

R² = 1 − Σ_{j=1}^{n} (R_j − R_predicted,j)² / Σ_{j=1}^{n} (R_j − R̄)²,

where n is the number of observations, R_j is the simulated robustness, R_predicted,j denotes the predicted value for observation j, and R̄ is the average of all the simulated robustness values. R² varies between 0 (the model has no prediction ability) and 1 (the model correctly predicts all values). The residual error, ε = R_empirical − R_predicted, is simply the error between the empirical (simulated) network robustness and the predicted value of robustness. The distribution histogram of ε is expected to be centered close to zero. Furthermore, a key assumption of the linear regression model is that the residual errors are independent and normally distributed, so we check whether the residuals follow a normal distribution.
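Both evaluation metrics translate directly into code (a straightforward sketch of the two formulas above):

```python
def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```

A perfect prediction gives RMSE = 0 and R² = 1; predicting the mean for every observation gives R² = 0.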
The networks are analyzed using the "graph-tool" library in Python. All data preparation, model building, and evaluation are written in Python. The hardware for the numerical simulations is a PC with an Intel i9-10850 processor and 32 GB of RAM.

Network Robustness as a Function of the NSIs and SLR T-Test.

The simulated robustness of each network, RIB and RRD, is reported in Table 2. Overall, we found that RRD is slightly smaller than RIB for most networks (43 out of 48), with averages of 0.148 vs. 0.173, respectively. This suggests that the RD strategy has more efficacy than IB for attacking real-world social networks. The largest and sparsest network, Email-EuAll (N = 265,216 and <k> = 1.58), has the smallest robustness, with RIB and RRD both equal to 0.001. In contrast, the gemsec_deezer_HR network, with N = 54,575 and <k> = 9.12, has the strongest robustness, with RIB and RRD of 0.375 and 0.338, respectively.

In Figure 3, we plot RRD and RIB as functions of the 8 independent NSIs, and we found that RRD and RIB behave similarly in all cases. The SLR unveils some significant relationships between R and the NSIs (Figure 3 and Table 3). For example, in Figure 3(a), we can see that both RRD and RIB slightly decrease with the network size N. This linear dependence of the robustness RRD and RIB on N is tested using the SLR model, and we found that it is statistically significant at a confidence level of 95% (p value < 0.05).

Figure 2: (a) An example of a decision tree: the root node containing 48 networks splits into two other nodes, Node 11 and Node 12, with 13 and 35 networks, respectively, according to the value of the scale-free exponent α. Then, Node 12 splits into two other nodes, Node 21 and Node 22, with 15 and 20 networks, respectively, according to the value of the network density <k>. We assume that at Node 11, Node 21, and Node 22, no split is possible because of a certain stopping rule, and thus, they become final leaves. In general, any NSI can be used to divide networks at any split, and the decision tree can be arbitrarily complex depending on the stopping rule. (b) An illustration of the same decision tree in the two-dimensional (α and <k>) space with its final leaves.

Interestingly, RRD and RIB do not statistically linearly depend on the network density <k>, unlike what was found previously in [4,52] (Figure 3(b) and Table 3). This contrasting observation suggests that network robustness also depends on other NSIs and that the network density alone cannot predict the whole network's robustness, as previously assumed.
Besides N, the only other NSI that shows a significant linear relationship is the modularity Q (Figure 3(f)), in the case of RRD.
However, in Figure 3, we still observe some nonlinear dependencies. For example, Figure 3(e) shows that network robustness decreases with the assortativity coefficient a when a > 0, decreasing faster when a is close to 0, while it increases with a when a < 0.
Similarly, in Figure 3(g), we found that the relationship between RRD and RIB and the global clustering coefficient C follows an inverted U-shaped pattern. We ran a two-line statistical test [53] and found that a two-line (or broken-line) regression is significantly better than a single-line fit. The breakpoint was found at C = 0.115. Both RRD and RIB linearly increase with C (at a 95% significance level) up to the breakpoint and linearly decrease with C (at a 95% significance level) beyond it. One possible explanation is that if the network is sparse, more triplets help increase the network's connectivity and thus increase its robustness. However, above a certain value (C = 0.115), more triplets may denote the presence of hubs or central nodes, which are likely to be the targets of intentional node removal strategies such as RD and IB, consequently lowering network robustness.

Machine Learning Prediction of Network Robustness.
The results of the previous section suggest that a social network's robustness depends on multiple NSIs in a highly complex, multidimensional, and nonlinear manner. To improve the model prediction, in this section, we use two multiple-variable ML models, MLR and RF, to predict network robustness.
The results of the multiple linear regression (MLR) are shown in Table 4. We found that both RIB and RRD have a positive overall linear regression coefficient with respect to α, Q, Cl, and <k> and a negative overall linear regression coefficient with respect to α2, a, C, and N. Moreover, the MLR results indicate that α, α2, and <k> are the most significant coefficients. A positive linear regression coefficient for the average node degree <k> suggests that networks are more robust when <k> is higher, all other NSIs being fixed. This result agrees with previous outcomes demonstrating that denser networks may be more resistant to attack [4,52]. However, the difference between the MLR and SLR results suggests that there is a strong correlation between <k> and other NSIs. In addition, the MLR model predicts RIB better than RRD, with an R² coefficient of 58.04% compared to 51.76%. Nevertheless, the RMSE is smaller for RRD, with a value of 0.0657 compared to 0.0709 for RIB (this is because the standard deviation of RIB is higher than that of RRD, as shown in Table 2 (bottom row)).
Because of the nonlinearity found in the previous section, we expect the regression results to improve with the RF model. Table 5 reports the regression results of the RF model. We found that R² increases to 92.24% and 91.88% for the RIB and RRD regressions, respectively. Interestingly, the RF model predicts RRD roughly as well as RIB, while MLR predicts RIB better than RRD, suggesting that RRD may follow a stronger nonlinear relationship with the NSIs than RIB. Additionally, the RMSE improved for both RIB and RRD, with values of 0.0272 and 0.0241, respectively. Interestingly, the feature importance ranking in Table 5 shows that in the RF model, the assortativity a, the global closeness, and the node number N are the most important NSIs. This result agrees with the exploratory observations shown in Figure 3 and discussed above.
In Figure 4, we compare the network robustness RIB and RRD with the predicted values given by MLR and RF using scatter plots. The scatter plots indicate that RF fits the data significantly better than MLR, with the predicted data points closer to the diagonal line y = x. Meanwhile, for the MLR regression, we still found a nonlinear dependency between the actual and predicted values; in fact, the MLR model was not able to capture the inherent nonlinear dependency in the actual data. We also analyzed the residual errors of the above regressions using frequency histograms and Q-Q plots and found that they follow a normal distribution relatively well (Figures 5-8).
Finally, we run leave-one-out regression for both the MLR and RF models in order to avoid overfitting. The results are summarized in Table 6, and the scatter plots are shown in Figure 9. We found that the prediction is less accurate than in the above "in-sample" training, with higher RMSEs for both the MLR and RF models. We obtained RMSEs of 0.0812 and 0.0760 for the RIB and RRD predictions using MLR, respectively, and RMSEs of 0.0733 and 0.0636 for the RIB and RRD predictions using RF, respectively. Even though the regression results are less accurate, because we predict a single sample that is independent of the remaining samples used for training (building the ML model), the residual errors still fit a normal distribution well, as shown in the histograms and Q-Q plots (Figures 10-13).

Figure 9: Scatter plots between the predicted value of robustness (R predicted) and the simulated value (R empirical) of the hold-out observation for the MLR (a) and RF (b) models. The model is trained using the whole dataset excluding one observation (the hold-out observation) and is used to predict the outcome of the hold-out observation.

Figure 11: Histogram of residual errors for MLR prediction of the RD strategy for the leave-one-out sample (a) and its Q-Q plot (b).

Discussion and Conclusion
In this work, we have analyzed the robustness of 48 real-world social networks with node numbers spanning more than two orders of magnitude, from 1,914 to 265,216. Using Monte Carlo simulations, we have run two commonly used node attack strategies, the IB and RD strategies, whose computation time is within our hardware capability. We found that their corresponding simulation times, tIB and tRD, scale linearly with the product of the network's node number and edge number, i.e., N × E. We also found that the two attack strategies IB and RD present similar efficacy when evaluated by the single robustness metric R, with RD slightly better than IB (the average RRD is slightly smaller than the average RIB). This suggests that for the social networks used in this study, the RD strategy is the most efficient strategy to dismantle (break down) networks, both in terms of computational cost and dismantling efficiency.
To understand how the structure of a social network determines its robustness, we investigated the relationship between the metric R and a set of network structural indicators (NSIs) from the literature. The simple linear regression (SLR) between R and the NSIs shows a low goodness of fit and is overall not able to produce significant prediction models. The low goodness of fit of the SLR would indicate that network robustness depends on the NSIs in a nonlinear manner.
To improve the fit, we developed two machine learning models to predict the two robustness metrics RIB and RRD from the combination of 8 NSIs: multiple linear regression (MLR) and the random forest (RF) model. The latter was chosen because it handles nonlinear data well and is built on a collection of base models, decision tree regressors. We found clearly that the random forest model can predict network robustness better than the multiple linear regression model. Concretely, the RF model predicts network robustness with an RMSE of 0.0272 and 0.0241 for RIB and RRD, respectively. This result is encouraging for predicting real-world social network robustness, although the error is about 16% (for RIB, the RMSE of 0.0272 compares to an average RIB of 0.173, and for RRD, the RMSE of 0.0241 compares to an average RRD of 0.148). Meanwhile, when the leave-one-out evaluation is applied, the RMSE increases to 0.0733 and 0.0636 for RIB and RRD, respectively, which is more than one-third of the average value.
Finally, MLR indicates that the most important factors for predicting robustness are the exponent α and the average node degree <k>, for both RIB and RRD. In particular, a higher value of α is correlated with higher RIB and RRD. Higher absolute values of the exponent α denote a network with fewer hub nodes (highly connected nodes) [35]. In consequence, the RD and IB attack strategies cannot find large hub nodes whose removal may disintegrate the network faster, resulting in higher values of RRD and RIB. Additionally, MLR indicates that <k> is positively related to RIB and RRD. This last outcome agrees with previous results demonstrating that networks with higher edge density may be more resistant to attack [4,52]. On the other hand, it confirms that SLR, which focuses on a single NSI, may not be able to predict the robustness of real-world social networks. Our work demonstrates that ML models can be used to predict network robustness with acceptable results. Therefore, it alleviates the need to run a full Monte Carlo simulation on a network when only an approximate robustness is needed. Meanwhile, more network datasets are expected to improve the accuracy of the ML models. This work also contributes to the understanding of the relationship between real-world social network robustness and its structural indicators. Finally, we have shown that using a data-driven approach to predict the outcome of a nonlinear and complex dynamic process, such as network robustness, is an appropriate approach [54][55][56][57][58][59][60].

Figure 13: Histogram of residual errors for RF prediction of the RD strategy for the leave-one-out sample (a) and its Q-Q plot (b).