A State-Based Sensitivity Analysis for Distinguishing the Global Importance of Predictor Variables in Artificial Neural Networks

Artificial neural networks (ANNs) are powerful empirical approaches used to model databases with a high degree of accuracy. Despite their recognition as universal approximators, many practitioners are skeptical about adopting their routine usage due to lack of model transparency. To improve the clarity of model prediction and correct the apparent lack of comprehension, researchers have utilized a variety of methodologies to extract the underlying variable relationships within ANNs, such as sensitivity analysis (SA). The theoretical basis of local SA (that predictors are independent and inputs other than variable of interest remain “fixed” at predefined values) is challenged in global SA, where, in addition to altering the attribute of interest, the remaining predictors are varied concurrently across their respective ranges. Here, a regression-based global methodology, state-based sensitivity analysis (SBSA), is proposed for measuring the importance of predictor variables upon a modeled response within ANNs. SBSA was applied to network models of a synthetic database having a defined structure and exhibiting multicollinearity. SBSA achieved the most accurate portrayal of predictor-response relationships (compared to local SA and Connected Weights Analysis), closely approximating the actual variability of the modeled system. From this, it is anticipated that skepticisms concerning the delineation of predictor influences and their uncertainty domains upon a modeled output within ANNs will be curtailed.


Introduction
Variable nonlinearity, correlation, and noise are characteristic, yet inseparable components of real-world databases. Clearly, the managing of such data conditions (both singularly and synergistically) by users, while simultaneously taking into account the large degree of complexity of natural systems, remains a considerable shortcoming of many empirical modeling efforts. A powerful approach for the modeling of nonlinear functions within a defined degree of accuracy is artificial intelligence, of which artificial neural networks (ANNs) are a core technology [1]. ANNs require little (if any) expert knowledge for their application and do not necessitate known probability distributions for variables. Multilayer feed-forward perceptrons (MLPs), the most commonly utilized ANN topology, can approximate any measurable function of an output (response) variable within a database, regardless of the nonlinear function utilized and/or the data dimensionality and variance [2].
Nonetheless, MLPs exhibit a holistic deficiency in declarative knowledge structure. Predictor-response associations are encoded incomprehensibly as weight and bias values within a multilayered topology, giving users little (or no) insight into network functionality or into the knowledge that might be extracted about the modeled process (Figure 1). This "black-box" trait remains the main constraint to the routine utilization of ANNs within predictive analytics.
Overcoming this lack of model transparency (i.e., defining input influences and their uncertainty domains upon the modeled output) has become a focus of recent machine learning research (e.g., [3][4][5][6]). Numerous (symbolic and algorithmic) approaches have been used to identify the confidence of network prediction as well as distinguish uncertainties associated with the "cause and effect" relationships between input and output variables (e.g., [7][8][9][10][11]). One such approach is sensitivity analysis (SA), with methodological perspectives categorized as either qualitative or quantitative. Qualitative SA relies upon domain "expert" opinions, where individual users define, based on (his/her) opinion, the importance of predictor variables. Notably, the relative importance of singular variables cannot be quantified due to a generalized lack of expertise or when a system's behavior is not understood a priori. Conversely, quantitative SAs are classified as either local or global approaches [12]. In local SA, the predictive uncertainty upon the modeled output is quantified by altering a single predictor across a predefined range, while remaining predictors are "fixed" at predefined values. In global SA, the singular effects of predictors upon the output are evaluated during instances when remaining inputs are varied concurrently across predefined ranges (for reviews, see [13][14][15]).
However, local SA ignores model nonlinearity as well as the lack of independence (i.e., multicollinearity) among database variables. Model linearity and independence between/among multiple input variables often are not valid assumptions; as a consequence, altering the value of one predictor variable will affect (the values of) other inputs and, in turn, impact the predictive power (and uncertainty) of a response variable. For example, assume that the meteorological and hydrological variables, ambient (air) and water temperatures, respectively, are used as predictors within an ANN attempting to model a water-quality response (e.g., water clarity arising from biotic abundance [16,17]). Because ambient and water temperatures typically would be highly correlated (yet not perfectly related due to distinct physical properties of water and air), analyzing the network's predictive uncertainty in regard to ambient temperature alone, while keeping the value of water temperature static, is neither appropriate nor logical. In this simplistic example, the assumption of independence and nonassociation between these two inputs imposes an unrealistic portrayal of model capability/uncertainty, a situation that becomes more perplexing when, in addition to correlative processes, database input/output relationships are nonlinear.
Here, a global state-based sensitivity analysis (SBSA) is proposed. For SBSA, input vectors associated with a predictor variable initially were assigned to a bin (or "state") along a distribution range where every "state" was equivalent to a predetermined variance of that distribution. Quantitative uncertainties for individual predictors upon the modeled response then were estimated as a function of dithered alterations across its data range. Importantly, the values of remaining predictors were allowed to vary concurrently across respective data ranges and in correspondence to those of the predictor of interest.

Local Sensitivity Analysis
In local SA, the effect of altering the value of a singular predictor upon the response variable(s) is evaluated, while values of other inputs remain unchanged (often fixed at values of the 1st quartile, median, or mean [18]). Typically, inputs not of interest are held at their respective mean values; as such, this methodology often is termed "sensitivity about-the-mean analysis" [14]. Local SA commonly is used within statistical/mechanistic modeling practices due to the ease of its computation and the simplicity of its comprehension [14,19]. Yet, despite the straightforwardness of its calculation and interpretation, local SA is appropriate only during instances when inputs are independent. In actuality, predictor variables typically display dependence upon and correspondence with each other (fully or in part). As such, "fixing" values of multiple predictors while examining the effect of a single "changing" variable is neither logical nor representative of real-world systems.
To illustrate the functionality of this approach, consider a feed-forward backpropagation MLP comprised of inputs ($x_1, \ldots, x_n$), multiple processing elements ($PE_1, \ldots, PE_m$) in one hidden layer (HL), and one output ($y$). The sensitivity, $\mathrm{Sen}_i$, of output $y$ with respect to input $x_i$ is defined as ([20]; refer to Figure 1(a) for notation)

$$\mathrm{Sen}_i = \frac{\partial y}{\partial x_i} = f'_{y} \sum_{j=1}^{m} w_{PE_j, y}\, f'_{PE_j}\, w_{x_i, PE_j}, \quad (1)$$

where $f'_{PE_j}$ and $f'_{y}$ are the derivative values of the HL and output activation functions and $w_{x_i, PE_j}$ and $w_{PE_j, y}$ are the weights between the input-hidden and hidden-output layers, respectively.
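The chain-rule sensitivity described above can be illustrated with a minimal sketch. The tanh hidden activation, the linear output, the weight shapes, and the function names `mlp_forward` and `local_sensitivity` are assumptions chosen for illustration; they are not taken from the study.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer MLP: tanh hidden PEs, linear output (both assumed)."""
    h = np.tanh(W1 @ x + b1)          # hidden PE activations, shape (m,)
    return float(W2 @ h + b2)         # single output y

def local_sensitivity(x, W1, b1, W2, b2):
    """dy/dx_i for every input, evaluated at the point x via the chain rule."""
    h = np.tanh(W1 @ x + b1)
    dh = 1.0 - h ** 2                 # derivative of tanh at each hidden PE
    # Linear output => its derivative is 1, so only the hidden terms remain.
    return (W2 * dh) @ W1             # shape (n_inputs,)

# Hypothetical weights for a 3-input, 4-PE network.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=4), 0.1

# "Sensitivity about the mean": evaluate at the (assumed) input means.
x_mean = np.zeros(3)
print(local_sensitivity(x_mean, W1, b1, W2, b2))
```

Because the derivative is evaluated at a single point with all other inputs held at fixed values, the result describes the network only in the neighborhood of that point.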

Global Sensitivity Analysis
In global SA, the variation of a singular input variable is considered while the values of other predictors also are allowed to vary [12]. In this manner, the comprehensive effects of predictor variables upon the modeled output(s) are assessed [21], leading to immediate advantages of global approaches (over local SA), most notably, the inclusion of predictor influences in terms of range and shape for input probability density functions and the multidimensional "averaging" of dependencies (or correlations) among multiple inputs [14]. By assessing output uncertainties over multidimensional variable space, global SA is not limited by the conditions of variable linearity or additivity within model estimations, thereby ensuring that such attributes are accounted for within derivation of predictive uncertainty functions [21,22]. Global SA affords validity to the delineation of importance for multiple predictors to a modeled output and, by yielding a range of model outcomes attributable to concurrent perturbation by a variable "suite," provides knowledge about the direct (first-order) effect of an individual predictor upon the response variable(s) as well as interaction (higher order) effects [23,24].
Diverse global SA methodologies exist (dependent upon the objective of the analysis) but can be categorized into screening, regression-based, and variance-based approaches [21,23,25,26]. Screening assesses the holistic importance of (select) groups of input variables upon model prediction uncertainty, often via cluster analysis in multidimensional scatter plots, although more advanced techniques have been proposed [27]. Because screening provides minimal, if any, quantitative information pertaining to predictor importance, it typically is utilized as an initial step by users to first sort and then remove variables having little importance to model prediction [28]. Regression approaches traditionally involve fitting predictor (independent) variables to a response (dependent) variable via parametric multiple regression in a complete or stepwise manner [21]. By utilizing standardized regression coefficients within the regression model (note that standardizing the regression coefficients parameterizes the variance of independent variables to values of one, i.e., a correlation coefficient), users can assess the number and direction of standard deviations that a dependent variable will be altered, per standard deviation changes of predictor variables. Population resampling via Monte Carlo simulation followed by regression modeling affords delineation of inconsistencies in predictor effects, while circumventing difficulties with conditions of model nonlinearity and nonadditivity [29]. Variance-based approaches evaluate predictor uncertainty upon the response variable as probability distributions, from which users partition (and ultimately quantify) the response variance attributable to an individual (or groups of) predictor(s). Variance decomposition-based methodologies commonly utilized in science and engineering applications include the analysis of variance for linear models, the Fourier amplitude sensitivity test, and Sobol [30] indices for nonlinear models [31][32][33].
Nevertheless, the computational "burden" typically associated with population resampling and variance-based algorithms limits their routine usage.
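As a small illustration of the regression-based category described above, the sketch below computes standardized regression coefficients from a Monte Carlo sample. The toy response (`3*x1 + 1*x2` plus noise), the sample size, and the function name `standardized_regression_coefficients` are hypothetical choices made for this example only.

```python
import numpy as np

def standardized_regression_coefficients(X, y):
    """Least-squares coefficients after z-scoring predictors and response."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    return beta

# Monte Carlo sample from a hypothetical model with three predictors,
# the third of which has no effect on the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=5000)

src = standardized_regression_coefficients(X, y)
print(src)  # larger magnitude = more standard deviations of y per SD of x
```

Each coefficient expresses how many standard deviations the response shifts per one-standard-deviation change in a predictor, which is what makes the values directly comparable across predictors measured in different units.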

Complementary Analyses to Sensitivity Analysis
Techniques complementing SA in identifying the interactions and/or predictive uncertainties within (trained) ANNs include Neural Interpretation Diagrams (NIDs) and Connected Weights Analysis (CWA). NIDs provide a qualitative depiction of contrasting weight strength and direction among neurons/nodes within a network's topology, whereas CWA incorporates a quantitative contribution of synaptic weights among input-HL-output layers, with predictors having greater values considered more influential in describing the behavior of the modeled system. Yet, excitatory/inhibitory influences of predictor variables are not denoted within calculations, and the extent of predictive uncertainties for input-output data ranges cannot be inferred from final values. Accordingly, explanatory interpretation of calculations by users is limited to the holistic magnitude of the impact of predictors upon the modeled response.

Neural Interpretation Diagram (NID).
Originally introduced by S. L. Özesmi and U. Özesmi [34], a NID portrays the magnitude and direction of connection (synaptic) weights between/among network neurons and nodes. Neurons/nodes are arranged in cascading layers (input-hidden-output) with synaptic weight direction/values represented as lines within network depictions (Figure 1(b)). Interaction between input neurons is identified during instances when more than one "significant" connection enters a node. Understandably, user interpretation of a NID illustrating a complex network (i.e., having numerous inputs exhibiting multicollinearity and/or multiple predictive significance with numerous PEs and HLs) can be difficult, if not impossible (e.g., [16,35,36,37]).

Connected Weights Analysis (CWA).
CWA quantifies NID portrayals by utilizing the final synaptic weight values to identify the relative share of prediction associated with model inputs [38]. For this, interacting weight quotients among neurons/nodes are summed, with the predictors having the greatest relative value (in terms of connecting weights to output) deemed the most influential in describing the holistic behavior of the modeled outcome. As an example, users can consider the aforementioned MLP (Figure 1), for which the contribution ($C$) of each input-HL-output pathway is

$$C_{x_i, PE_j} = w_{x_i, PE_j} \times w_{PE_j, y}, \quad (2)$$

where $w_{x_i, PE_j}$ and $w_{PE_j, y}$ are the weights of the input-HL and HL-output connections, respectively, for inputs $x_1, \ldots, x_n$ and processing elements $PE_1, \ldots, PE_m$. The relative contribution ($RC$) for a single input (e.g., $x_i$) then is determined from the summation of all input-HL-output contributions ($C$) as

$$RC_{x_i} = \sum_{j=1}^{m} C_{x_i, PE_j}. \quad (3)$$
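The two CWA steps (pathway products, then summation per input) can be sketched as follows. The weight values and the function name `connection_weights` are hypothetical, and ranking by absolute value is an assumption made here because the text treats CWA output as a magnitude of influence.

```python
import numpy as np

def connection_weights(W_in_hidden, w_hidden_out):
    """Sum of input-HL x HL-output weight products for each input.

    W_in_hidden  : array (n_inputs, n_hidden), input-to-hidden weights
    w_hidden_out : array (n_hidden,), hidden-to-output weights
    """
    contributions = W_in_hidden * w_hidden_out   # one product per pathway
    return contributions.sum(axis=1)             # one total per input

# Hypothetical 2-input, 2-PE network.
W = np.array([[1.0, -2.0],
              [0.5,  0.5]])
w_out = np.array([2.0, 1.0])

rc = connection_weights(W, w_out)
print(rc)                                  # [0.  1.5]
ranking = np.argsort(-np.abs(rc)) + 1      # input numbers, most influential first
print(ranking)                             # [2 1]
```

In this toy case the two pathways of the first input cancel exactly, which shows how a single summed value can hide how an input acts across its data range.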

Derivation of State-Based Sensitivity Analysis
In natural systems, a variable's numerical value lies within a predetermined interval, hereafter a variable "state," expressed from the mean ($\mu$) and standard deviation ($\sigma$) of its sample population (Figure 2). Because variable "states" are a function of distinct variances within individual (distinct) sample populations, the number of states and associated numerical ranges among multiple distributions will differ. Given this, the correspondences among variables within distinct states of their sample ranges also will differ. Local SA, where the value of predictors (besides that of the variable of interest) is fixed at their mean value, disregards this principle. For SBSA, the sensitivity of an output ($y$) corresponds to an input variable ($x_i$) within its $s$th state. For $x_i$, sensitivity ($\mathrm{Sen}_{i,s}$) becomes

$$\mathrm{Sen}_{i,s} = \frac{\partial y}{\partial x_{i,s}} \quad (4)$$

or

$$\mathrm{Sen}_{i,s} = f'_{y} \sum_{j=1}^{m} w_{PE_j, y}\, f'_{PE_j}\, w_{x_i, PE_j}, \quad (5)$$

with values for other inputs remaining fixed (to mean values) within respective states corresponding to $x_{i,s}$. By partitioning the (changing) range of values for inputs into intervals, $x_{i,s}$ becomes the value of $x_i$ belonging to the $s$th interval.
As an example, users can consider a hypothetical database (Table 1) having five input variables ($x_{1,\ldots,5}$) and one output variable ($y$). The data range for $x_1$, with mean ($\mu_1$) and standard deviation ($\sigma_1$) values of −0.2281 and 0.9144, respectively, was partitioned into twelve "states" (but any number could be utilized), with similar calculations completed for other variables. The relationships among multiple "states" for variable $x_1$ with corresponding states of other variables then were tabulated (Table 2). Calculations for the sensitivity of an output in relation to alterations of $x_1$ initiated from the center of the most negative state for $x_1$ (i.e., state three), with the value of $x_1$ increasing and values for other variables adjusted to their corresponding "state" mean value (Table 3).
Briefly, when values of $x_1$ occur within a particular state (e.g., state eight; Table 2), values for other input variables may occur within multiple, distinct states of their respective distribution (e.g., values of $x_2$ occur within state six and single occurrences in each of states seven, nine, and eleven, with other variables following suit). The value for $x_2$ would be computed as the weighted mean value for the differing states (i.e., states six, seven, nine, and eleven, with the value for state six tallied three times). Because $x_1$ was considered to be distributed normally and the associated variable states were based on the sample population mean and standard deviation, no values for $x_1$ occurred within states one, two, ten, and twelve (see Tables 2 and 3). In this manner, values for input variables, $x_2$ to $x_5$, other than the variable of interest ($x_1$) within SBSA calculations will vary considerably from those utilized in local SA, thereby resulting in distinct modeled output values (Table 4).
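The state assignment and the weighted-mean step described above can be sketched as follows. The state boundaries are an assumption: twelve equal-width states of half a standard deviation each, spanning the mean minus three to the mean plus three standard deviations, with out-of-range values folded into the end states. Using the state center as the "value" of a state, and the zero mean and unit standard deviation given to $x_2$, are likewise assumptions; all function names are hypothetical.

```python
import numpy as np

def state_edges(mu, sd, n_states=12, width_sd=0.5):
    """Boundaries of equal-width states centred on the mean (assumed layout)."""
    offsets = (np.arange(n_states + 1) - n_states / 2) * width_sd
    return mu + offsets * sd

def assign_states(values, edges):
    """State number (1-based) of each value; out-of-range values go to the end states."""
    return np.digitize(values, edges[1:-1]) + 1

def state_centers(edges):
    """Midpoint of every state."""
    return 0.5 * (edges[:-1] + edges[1:])

def weighted_state_mean(states, centers):
    """Mean of state-centre values, each state weighted by how often it occurs."""
    states = np.asarray(states)
    return float(centers[states - 1].mean())

# States for x1 using the mean and SD quoted in the text.
edges_x1 = state_edges(-0.2281, 0.9144)

# The x2 example from the text: state six three times, then states seven,
# nine, and eleven once each (x2 assumed to have mean 0 and SD 1 here).
centers_x2 = state_centers(state_edges(0.0, 1.0))
x2_value = weighted_state_mean([6, 6, 6, 7, 9, 11], centers_x2)
print(x2_value)   # 0.5
```

Under these assumptions the value assigned to $x_2$ (0.5) differs from its overall mean (0), which is the distinction between SBSA and local SA that the paragraph draws.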

Case Study
An artificial database having a defined structure, consisting of 10,000 data vectors (or "exemplars") with five predictor variables ($x_1$, $x_2$, $x_3$, $x_4$, $x_5$) and one dependent variable ($y$), was obtained from Olden et al. (University of Washington, Seattle, WA, USA; refer to [8]). The correlation structure for dependent and independent variables was fixed ($r_{y \cdot x_1} = 0.8$, $r_{y \cdot x_2} = 0.6$, $r_{y \cdot x_3} = 0.4$, $r_{y \cdot x_4} = 0.2$, and $r_{y \cdot x_5} = 0.0$), with a correspondence of 0.2 between any two predictors. In this manner, the predictive significance of independent variables within subsequent models progressively decreased from $x_1$, the most important predictor, to $x_5$, the least important. Monte Carlo simulation then was conducted to provide 50 distinct data subsets (of 150 exemplars each) from the original database.
Table 3: Resultant mean values for input variables, $x_2$ to $x_5$, in the hypothetical database when the mean value of $x_1$ lies within a distinct "state" along its distribution range.
Table 4: Resultant mean values for input ($x_2$ to $x_5$) and output ($y$) variables in the hypothetical database derived in state-based sensitivity analysis (SBSA) and local sensitivity analysis (local SA), for instances when the mean value of $x_1$ lies within a distinct "state" along its distribution range.
The network architecture (refer to Figure 1(a) for notation) was identical to that used by Olden et al. [8] in order to make study comparisons possible. Transfer functions within PEs were linear (note that a network with linear transfer functions is capable of modeling nonlinear interactions among dependent variables). Exemplars within each of the data subsets were assigned randomly for model training, cross-validation, and testing (60, 15, and 25% of subset exemplars, resp.). ANNs were trained and cross-validated prior to being applied to test exemplars. The presentation of cross-validation data concurrent with training data provided for an unbiased estimation of prediction.
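For readers without access to the original database, a stand-in with the same correlation structure can be simulated and then subsampled and partitioned as described above. This sketch makes several assumptions: the variables are drawn from a multivariate normal distribution, subsets are drawn without replacement, and the 15% and 25% shares of 150 exemplars (22.5 and 37.5) are rounded to 22 and 38. All function names are hypothetical.

```python
import numpy as np

def make_correlation_matrix(r_y=(0.8, 0.6, 0.4, 0.2, 0.0), r_xx=0.2):
    """Correlation matrix ordered as (x1..x5, y)."""
    p = len(r_y)
    R = np.full((p + 1, p + 1), r_xx, dtype=float)
    np.fill_diagonal(R, 1.0)
    R[p, :p] = r_y        # response-predictor correlations
    R[:p, p] = r_y
    return R

def simulate_database(n=10_000, seed=0):
    """Draw n exemplars from a zero-mean normal with the target correlations."""
    rng = np.random.default_rng(seed)
    R = make_correlation_matrix()
    return rng.multivariate_normal(np.zeros(len(R)), R, size=n)

def draw_subsets(data, n_subsets=50, size=150, seed=1):
    """Monte Carlo subsets, each sampled without replacement (assumed)."""
    rng = np.random.default_rng(seed)
    return [data[rng.choice(len(data), size=size, replace=False)]
            for _ in range(n_subsets)]

def split_subset(subset, seed=2):
    """Random 60/15/25 split into training, cross-validation, and testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(subset))
    n_train = round(0.60 * len(subset))
    n_cv = round(0.15 * len(subset))
    train = subset[order[:n_train]]
    cv = subset[order[n_train:n_train + n_cv]]
    test = subset[order[n_train + n_cv:]]
    return train, cv, test

data = simulate_database()
subsets = draw_subsets(data)
train, cv, test = split_subset(subsets[0])
print(np.round(np.corrcoef(data, rowvar=False)[-1, :5], 2))  # r(y, x1..x5)
print(len(train), len(cv), len(test))
```

The stated correlation structure is positive definite, so it can be sampled directly; individual 150-exemplar subsets will show sample correlations that scatter around these targets.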
For training, exemplars were repeatedly presented to a network utilizing a constant momentum rate (0.7) and varied step sizes (1.0 and 0.1 for the HL and output layer, resp.), with weights adjusted after presentation of all exemplars (or "epoch") to minimize the mean square error (MSE). The MSE was computed concurrently for the cross-validation exemplars. Training was terminated prior to the maximum of 1000 epochs if the MSE within either the training or cross-validation data began to increase (i.e., an indication that the network began to memorize the data [39,40]). Values of MSE for both training and cross-validation data subsets converged to global minimums over multiple epochs (e.g., Figure 3(a)), indicating that MLPs provided adequate estimates of prediction (e.g., Figure 3(b)). Upon applying trained networks to testing data subsets, modeled response values closely approximated simulated (measured) values (e.g., Figure 3(c)).

Determining and Evaluating SBSA.
For each simulated data subset, values of singular input variables of interest were varied across predefined states of their data range. The remaining input variables were fixed at values corresponding to the particular state that the value of the input variable of interest resided within. When the state for the variable of interest changed, values of the remaining predictors also changed to values within their respective distributions corresponding to the "new" state for the variable of interest. Following this approach for each of the 50 simulations, weighted mean values for ancillary variables were generated for singular variables of interest within each of the designated states across its distribution (e.g., Table 5). These data then were implemented within the aforementioned trained/validated ANNs.
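The procedure above can be contrasted with local SA in a short sketch. A simple known function stands in for a trained ANN, and two inputs with a correlation of 0.8 replace the five-predictor database; both are assumptions made to keep the example small. The state layout (twelve states of half a standard deviation) and the use of within-state sample means for the remaining inputs are also assumptions, and `sbsa_profile` and `local_profile` are hypothetical names.

```python
import numpy as np

def sbsa_profile(X, predict, i, n_states=12, width_sd=0.5):
    """Model output as input i moves through its occupied states.

    The other inputs are set to their mean among the exemplars whose
    input i falls in the same state.
    """
    mu, sd = X[:, i].mean(), X[:, i].std(ddof=1)
    edges = mu + (np.arange(n_states + 1) - n_states / 2) * width_sd * sd
    states = np.digitize(X[:, i], edges[1:-1]) + 1
    centers = 0.5 * (edges[:-1] + edges[1:])
    rows = []
    for s in np.unique(states):                # only states that hold data
        point = X[states == s].mean(axis=0)    # conditional means of all inputs
        point[i] = centers[s - 1]              # input of interest at state centre
        rows.append(point)
    rows = np.array(rows)
    return rows[:, i], predict(rows)

def local_profile(X, predict, i, grid):
    """Model output as input i moves over grid with other inputs at global means."""
    rows = np.tile(X.mean(axis=0), (len(grid), 1))
    rows[:, i] = grid
    return predict(rows)

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000)
predict = lambda Z: Z[:, 0] + Z[:, 1]          # stand-in for a trained network

grid, y_sbsa = sbsa_profile(X, predict, i=0)
y_local = local_profile(X, predict, i=0, grid=grid)
slope_sbsa = np.polyfit(grid, y_sbsa, 1)[0]
slope_local = np.polyfit(grid, y_local, 1)[0]
print(slope_sbsa, slope_local)
```

With correlated inputs, the state-based profile reflects both the direct effect of the first input and the accompanying movement of the second, whereas the local profile reflects the direct effect only.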
State-based sensitivity values were derived for all input variables (following (4) and (5)). The accuracy of ranking the input variable importance then was estimated via Gower's coefficient [8,41]. Briefly, for inputs ($x_{1,\ldots,5}$) within a simulation subset ($k$), the agreement ($a_{i,k}$) between the modeled and actual variable ranks was tabulated as a binary number ("1" if correct and "0" if incorrect). The accuracy in ranking inputs across 50 simulations then was derived as

$$G = \frac{1}{50 \times 5} \sum_{k=1}^{50} \sum_{i=1}^{5} a_{i,k}.$$

In this manner, the correctness in ranking predictors for each of SBSA, local SA (following (1)), and CWA (following (2) and (3)) was estimated, affording the means with which to compare methodology effectiveness (i.e., accuracy). The absolute deviation of the modeled rank from the actual rank (i.e., precision) also was tabulated. Correctness in ranking predictive significance of input variables was dissimilar among SBSA, local SA, and CWA (Figure 4), with the accuracy of SBSA (Gower's coefficient = 0.83) being twofold greater than those provided by local SA and CWA (coefficients of 0.42 and 0.43, resp.).
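The accuracy and precision tallies described above can be sketched as a simple matching computation. The four example simulations below are invented for illustration and are not results from the study; `ranking_accuracy` is a hypothetical name. As a consistency check on the reported figures, per-input accuracies of roughly 95% for three inputs and 65% for two average to 0.83.

```python
import numpy as np

def ranking_accuracy(estimated_ranks, true_ranks):
    """Share of correctly ranked inputs, plus mean absolute rank deviation.

    estimated_ranks : array (n_simulations, n_inputs)
    true_ranks      : array (n_inputs,)
    """
    est = np.asarray(estimated_ranks)
    true = np.asarray(true_ranks)
    agree = (est == true)                       # 1 if correct, 0 if incorrect
    overall = float(agree.mean())               # accuracy over all inputs and runs
    per_input = agree.mean(axis=0)              # accuracy for each input
    mean_abs_dev = np.abs(est - true).mean(axis=0)   # precision for each input
    return overall, per_input, mean_abs_dev

true_ranks = [1, 2, 3, 4, 5]
estimated = [[1, 2, 3, 4, 5],
             [1, 2, 3, 5, 4],
             [1, 2, 4, 3, 5],
             [2, 1, 3, 4, 5]]

overall, per_input, mean_abs_dev = ranking_accuracy(estimated, true_ranks)
print(overall)        # 0.7
print(per_input)      # [0.75 0.75 0.75 0.5  0.75]
```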
SBSA correctly ranked the initial three variables having the greatest predictive significance (i.e., $x_1$, $x_2$, and $x_3$) in 95% of simulations; thereafter, accuracy diminished to approximately 65% for the rankings of $x_4$ and $x_5$ (Figure 4(a)). Notably, SBSA yielded minimal deviations in estimation of rank importance across variables (Figure 4(b)). Local SA was comparable with SBSA in ranking the two most significant predictors ($x_1$, $x_2$). However, the ranking of $x_3$ correctly occurred in only 20% of simulations, with zero instances of correctness for inputs $x_4$ and $x_5$. In contrast, CWA correctly ranked $x_1$ and $x_2$ in only 84 and 52% of instances, respectively. Accuracy subsequently diminished to approximately 28% for inputs $x_4$ and $x_5$. Notwithstanding the substantial deviation in ranking $x_3$ by local SA (see Figure 4(b)), CWA produced the greatest holistic deviation of modeled rankings from actual rankings (i.e., the least precision) for predictors.

Discussion
Through delineation of inputs having the greatest impact upon the variability of modeled outcomes, SA offers an interpretable metric of how well predictor variables embody the behavior of the modeled process. Because SA is used to characterize regions of maximal variation within a model domain, it can be considered a byproduct of uncertainty analysis [42]. Local SA is easy to manipulate through minimal computational effort; as such, it has had universal, ostensibly indiscriminate application within statistical and mechanistic modeling practices spanning diverse disciplines [14,19]. Yet, local SA only provides information concerning how network outcomes may (or may not) be altered via deviation of a single predictor. Such a "one parameter at a time" approach is realistic only when the relationships between predictor and response variables are linear, and/or variable averaging is reasonable for the modeled process [13,43].
Table 5: Resultant mean values for input variables, $x_2$ to $x_5$, and the modeled output, $y$, within the case study simulation, #37 (see Figure 3), when values of $x_1$ lie within distinct "states" across its distribution (refer to Figure 2). Tabulated counts are the instances of data vectors within a distinct $x_1$ "state" (total = 150).

The proposed SBSA incorporated a global perspective where values of all input variables deviated, corresponding to the variable of interest, yet retained the computational ease and simplicity of comprehension for local SA. From this, SBSA not only affords the identification of interacting predictors (or potentially insignificant predictors) within networks, but also provides a quantitative means upon which users can base network optimization, validation, and parsimony via variable "pruning" [13,31,44]. Notably, SBSA retained the generalized correlation structure among predictors, enabling direct absolute comparison(s) of interacting predictor importance. In doing so, SBSA provided a realistic representation of the predictive influences for multiple inputs (with minimal deviation in the predictive significance of inputs; see Figure 4) while negating the unrealistic premise underlying local SA (i.e., altering one input while keeping values of other attributes "fixed").
To assess the efficiency of SBSA, results arising from an artificial database having correlated predictors were compared to those generated via local SA and CWA. SBSA provided an accurate and precise portrayal of holistic variable interactions (cf. [27,45]) and, as such, a superior approximation of database variability to those achieved via the alternative methodologies. The SBSA application was demonstrated in conjunction with artificial intelligence due to the superior capabilities of ANNs in modeling complex, nonlinear data relationships. Quantifying the importance(s) of input variables upon a modeled output is a significant problem that users must address when identifying the functionality and interpreting the predictive uncertainty of a trained network. The accounting of interaction among database variables and the robustness of SBSA for computing and subsequently ranking the predictive importance of correlated network predictors was impressive. Moreover, the ability to extend SA applications to predictor variable states where insufficient input data existed (see Table 3) was a benefit. Though originally derived for ANN applications, SBSA also is applicable for defining complex, interacting, variable relationships within alternative parametric modeling efforts (e.g., multiple linear regression).
SBSA is an extended, albeit global derivation of traditional SA and can be classified as a regression-based approach (i.e., based upon multiple linear regression of the response variable upon the predictor vector). Similar to local SA, SBSA typically requires minimal analytic effort. Yet, depending upon the number of distinct states chosen for characterization of a variable's data range, the computational complexity for SBSA can become cumbersome. As users increase the number of variable states, calculations become more complicated (and with a large number of states potentially unwieldy). Conversely, as the number of variable states decreases, SBSA becomes more manageable and increasingly approximates local SA methodology. In instances where only one variable state is recognized, SBSA will emulate local SA.
Multiple dissimilar databases typically have been utilized in studies intent on denoting differences among sensitivity methodologies (e.g., [3,4,7,18,38,46,47,48]). Olden et al. [8] utilized a similar approach to that presented here (i.e., a synthetic database where predictor-response relationships were known a priori and used to evaluate multiple analytics estimating predictor importance for a trained ANN). In evaluating the effectiveness of genetic algorithms, partial derivatives, input perturbation, the profile method, SA by forward-backward stepwise addition/elimination, and Garson's Algorithm for differentiating predictor influences, Olden et al. [8] stated that, due to disparate data structures, a user's ability to identify the "best" analytic for assessing predictor importance among studies is difficult. Nevertheless, it was concluded that CWA provided ". . . the best overall performance compared to the other approaches in terms of its accuracy (i.e. the degree of similarity between true and estimated variable ranks) and precision (i.e. the degree of variation in accuracy) . . .." Interestingly, the exact database within the aforementioned assessment was assessed here and, although the accuracy of CWA rivaled that of local SA, CWA generally displayed less precision in the rankings of predictive significance across inputs. The accuracy of SBSA was twofold greater than those of local SA and CWA, yielding minimal deviation in modeled ranking (from actual) across predictors.
A disadvantage of CWA is the inability to identify the absolute influence that a predictor variable has upon modeled outcomes. Specifically, the explanatory ability of CWA is limited to ranking the relative importance of inputs that neither are based upon variable means nor provide results for summarizing (and illustrating) predictor variability within an ANN. To accomplish detection of input-output relationships of a predictor variable across its data range via CWA, practitioners would need to separate the database into distinct subsets, via mechanical means and/or Monte Carlo resampling, which reflect a single-(or multiple-) specified range(s)/deviation(s) of predictor variability. Such a technical approach is analogous to that used in local SA where the effects of deviating predictor variables by one and two standard deviations (denoting "common" and "disturbance" variation, resp.) upon the response variable can be reported [16,49]. A similar application of CWA would require the manual "splitting" of the holistic database into multiple (representative) subsets, with derivation of separate ANNs having individualized training procedures for each data grouping.
Here, a global SA that provided a realistic insight into the significance of correlated predictor variables upon a modeled response was developed. Within its calculations, SBSA maintained predictor relationships, challenging the premise of local SA that variable attributes are linear and independent. When validated with ANN models of a synthetic database, SBSA outperformed local SA and CWA in predictive accuracy and precision. Despite the predictive capability of ANNs, most researchers place low confidence in their use as empirical models due to the lack of transparency in ascertaining predictor significance [38,50]. SBSA helps to overcome this "black-box" nature of MLPs by improving upon the identification of interacting predictor influences and the delineation of their uncertainty domains (thereby simplifying interpretation of modeled outcomes). From this, it is anticipated that user skepticism regarding ANNs as a common modeling practice will be reduced.