Investigating the pattern of correlations among large numbers of variables in large databases is a difficult task that is demanding in both computational time and capacity. The statistically oriented literature has developed a variety of methods with different power and usability, all of which, however, share a few basic problems. The most prominent are the nature of the a priori assumptions that have to be made about the data-generating process; the near impossibility of computing all the joint probabilities among the vast number of possible pairs and n-tuples that are in principle necessary to reconstruct the probability law of the underlying process; and the difficulty of organizing the output in an easily grasped, ready-to-access format for the nontechnical analyst. As a consequence of the first two weaknesses, traditional methods can become very unreliable, if not unusable, when analyzing poorly understood problems characterized by heterogeneous sets of potentially relevant variables. As a consequence of the last one, even where traditional methods manage to provide a sensible output, its statement and implications can be so intricate as to become practically useless or, even worse, easily misunderstood.
In this paper, we introduce a new methodology based on an Artificial Neural Network (ANN) architecture, the Auto Contractive Map (AutoCM) [
Previously, we approached the genetics of sporadic ALS (SALS) with artificial neural networks to identify a possible genetic background predisposing to the sporadic form. A dataset containing genetic data from 54 SALS cases and 208 controls was analyzed with three different analytical approaches: Linear Discriminant Analysis, Standard Artificial Neural Networks, and Advanced Intelligent Systems. With the latter approach, the predictive accuracy in discriminating between cases and controls reached an average of 96% (range 94.4 to 97.6). In addition, we identified seven genetic variants essential to differentiate cases from controls [
The results obtained point to the need for systems that can actually handle the complexity of the disease, instead of treating the data with reductionistic approaches unable to detect multiple genes of smaller effect predisposing to the disease.
We report here the application to the SALS dataset of a newly developed analytical approach, based on the AutoCM system and Maximally Regular Graph theory.
The idea was to test the power of this new algorithm in a medical context such as SALS, in order to shed light on the puzzling nature of the disease.
We used a previously described dataset [
Control subjects were 144 males and 67 females, aged 21 to 75 years (average 38.94).
We begin our analysis with a relatively concise but technically detailed presentation of the ANN architecture that provides the basis for all of the subsequent analysis: the Auto Contractive Map (AutoCM) [
Each layer contains an equal number of
An example of an AutoCM with
All of the connections of AutoCM may be initialized either by assigning the same constant value to each or by assigning values at random. The best practice is to initialize all the connections with the same positive value, close to zero.
The learning algorithm of AutoCM may be summarized in a sequence of four characteristic steps: (1) signal transfer from the input into the hidden layer; (2) adaptation of the values of the connections between the input and the hidden layers; (3) signal transfer from the hidden into the output layer; (4) adaptation of the values of the connections between the hidden and the output layers.
Notice that steps 2 and 3 may take place in parallel.
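The four steps can be sketched in plain Python as follows. This is a minimal sketch and not the authors' implementation: the update rules are written in a contractive form often reported for AutoCM and should be read as an assumption, as should the names `forward` and `train_autocm`, the parameters `C` (contractive factor) and `alpha` (learning coefficient), and the toy binary records used to exercise it.

```python
import random


def forward(record, v, w, C):
    """Steps 1 and 3: propagate one record to the hidden and output layers."""
    n = len(record)
    # Step 1 (assumed form): each input is contracted by its own connection v[i].
    m_h = [record[i] * (1.0 - v[i] / C) for i in range(n)]
    # Step 3 (assumed form): each hidden unit is contracted by its net input.
    net = [sum(m_h[j] * (1.0 - w[i][j] / C) for j in range(n)) for i in range(n)]
    m_t = [m_h[i] * (1.0 - net[i] / C) for i in range(n)]
    return m_h, m_t


def train_autocm(records, C=2.0, alpha=0.1, epochs=300, init=0.01, seed=0):
    """Train a toy AutoCM and return (v, w)."""
    rng = random.Random(seed)
    n = len(records[0])
    v = [init] * n                      # input-hidden connections
    w = [[init] * n for _ in range(n)]  # hidden-output connections
    for _ in range(epochs):
        order = list(range(len(records)))
        rng.shuffle(order)              # random record order at every epoch
        for r in order:
            m_s = records[r]
            m_h, m_t = forward(m_s, v, w, C)
            # Step 2 (assumed form): adapt the input-hidden connections.
            for i in range(n):
                v[i] += alpha * (m_s[i] - m_h[i]) * (1.0 - v[i] / C)
            # Step 4 (assumed form): adapt the hidden-output connections.
            for i in range(n):
                for j in range(n):
                    w[i][j] += alpha * (m_h[i] - m_t[i]) * (1.0 - w[i][j] / C) * m_h[j]
    return v, w


if __name__ == "__main__":
    toy = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
    v, w = train_autocm(toy)
    print(w[0][1], w[0][2])
```

In this toy run the first two variables always occur together and never with the third, so under the assumed rules only the connection between the first two grows away from its initial value. Both signals are computed before either set of connections is updated, which mirrors the remark that steps 2 and 3 may take place in parallel.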
We write as
In order to specify the steps 1–4 that define the AutoCM algorithm, we have to define the corresponding signal forward-transfer equations and the learning equations, as follows. Signal transfer from the input to the hidden layer:
Adaptation of the connections
Signal transfer from the hidden to the output layer:
Adaptation of the connections
First of all, we need to specify that AutoCM weights are updated at every cycle; the order in which records are selected at each epoch is random (an epoch is the number of cycles needed to present every record of the dataset); after every cycle the AutoCM is closer to its convergence point,
For this reason it is necessary to set the learning coefficient so that AutoCM can update its weights within a reasonable number of epochs, without being influenced by the random order of the records at each cycle.
Consequently, we suggest choosing the learning coefficient taking into account the contractive factor,
There are a few important peculiarities of AutoCMs [ AutoCMs are able to learn even when starting from initializations where all connections are set to the same value; that is, they do not suffer from the problem of symmetric connections. During the training process, AutoCMs always assign positive values to connections. In other words, AutoCMs do not allow for inhibitory relations among nodes, but only for different strengths of excitatory connections. AutoCMs can also learn in difficult conditions, namely, when the connections of the main diagonal of the second-layer connection matrix are removed. In the context of this kind of learning process, AutoCMs seem to reconstruct the relationship occurring between each pair of variables. Consequently, from an experimental point of view, it seems that the ranking of the connection matrix translates into the ranking of the joint probability of occurrence of each pair of variables. Once the learning process has occurred, any input vector belonging to the training set will generate a null output vector. So, the energy minimization of the training vectors is represented by a function through which the trained connections completely absorb the input training vectors. Thus, AutoCM seems to learn how to transform itself into a “dark body”. At the end of the training phase (
One can use the information embedded in the Alternatively, the matrix
Equation (
A graph is a mathematical abstraction that is useful for solving many kinds of problems. Fundamentally, a graph consists of a set of vertices, and a set of edges, where an edge is an object that connects two vertices in the graph. More precisely, a graph is a pair (
At this point, it is useful to introduce the concept of Minimum Spanning Tree (M.S.T.) [
The Minimum Spanning Tree problem is defined as follows: find an acyclic subset
From a conceptual point of view, the MST represents the energy minimization state of a structure. In fact, if we consider the atomic elements of a structure as the vertices of a graph and the strength among them as the weight of each edge linking a pair of vertices, the MST represents the minimum energy needed for all the elements of the structure to preserve their mutual coherence. In a closed system, all the components tend to minimize the overall energy. So the MST, in specific situations, can represent the most probable state toward which the system tends.
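As an illustration of how such a tree can be extracted, the following is a minimal sketch using Kruskal's algorithm with a union-find structure. It assumes that a symmetric matrix of non-negative distances among the variables has already been derived from the trained weights; the function name `minimum_spanning_tree` and the toy matrix are hypothetical. The function also returns the edges that were skipped because they would have closed a cycle, in increasing order of distance.

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on a symmetric distance matrix.

    Returns (tree_edges, skipped_edges); every edge is a tuple (i, j, d).
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Find the representative of x's component, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Candidate edges sorted from the shortest to the longest distance.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree, skipped = [], []
    for d, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j      # merge the two components
            tree.append((i, j, d))
        else:
            skipped.append((i, j, d))    # this edge would create a cycle
    return tree, skipped


if __name__ == "__main__":
    toy = [
        [0, 1, 4, 3],
        [1, 0, 2, 5],
        [4, 2, 0, 6],
        [3, 5, 6, 0],
    ]
    tree, skipped = minimum_spanning_tree(toy)
    print(tree)
    print(skipped)
```

Keeping the skipped edges is a design choice made here for convenience: they are the natural candidates when links are later added back to the tree.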
To determine the MST of an undirected graph, each edge of the graph has to be weighted. Equation (
Obviously, it is possible to use any kind of Auto-Associative ANN or any kind of Linear Auto-Associator to generate a weight matrix among the variables of an assigned dataset. But it is hard to train a two-layer Auto-Associative Back Propagation ANN with the main diagonal weights fixed (to avoid autocorrelation problems). In most cases, the Root Mean Square Error (RMSE) stops decreasing after a few epochs, especially when the orthogonality of the records is relatively high, a circumstance that is frequent when it is necessary to weight the distance among the records of the assigned dataset. In this case, it is necessary to train the transposed matrix of the dataset. Moreover, if a linear Auto-Associator is used for this purpose, all of the nonlinear associations among variables will be lost.
Therefore, AutoCM seems to be the best choice to date for computing a complete, nonlinear matrix of weights among the variables or among the records of any assigned dataset.
Now we introduce a new indicator: the degree of protection of each node in any undirected graph.
This indicator defines the rank of centrality of each node within the graph when an iterative pruning algorithm is applied. The pruning algorithm was devised and first applied as a global indicator of graph complexity by Giulia Massini at Semeion Research Center in 2006 [
Rank = 0;
Do
{
    Rank++;
    Consider_All_Nodes_with_The_Minimum_Number_of_Links ();
    Delete_These_Links ();
    Assign_a_Rank_To_All_Nodes_Without_Link (Rank);
    Update_The_New_Graph ();
    Check_Number_of_Links ();
} while at_least_a_link_is_present
The higher the rank of a node, the greater the centrality of its position within the graph. The last nodes to be pruned are also the kernel nodes of the graph. In the present paper, this algorithm is generalized to measure the global complexity of any kind of graph.
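The pruning loop can be made concrete with a short sketch. It is one possible reading of the pseudocode, under the assumption that the graph is undirected and given as a list of node pairs; the function name `prune_ranks` and the return of the number of links deleted at each cycle are illustrative choices, not part of the original algorithm description.

```python
def prune_ranks(edges):
    """Iteratively prune an undirected graph.

    Returns (ranks, removed_per_cycle): the rank assigned to each node and
    the number of links deleted at each pruning cycle.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    ranks = {}
    removed_per_cycle = []
    rank = 0
    while any(adjacency.values()):       # at least one link is present
        rank += 1
        # Nodes that currently have the minimum (non-zero) number of links.
        min_links = min(len(nb) for nb in adjacency.values() if nb)
        targets = [n for n, nb in adjacency.items() if len(nb) == min_links]
        removed = 0
        for node in targets:
            for other in list(adjacency[node]):
                adjacency[node].discard(other)
                adjacency[other].discard(node)
                removed += 1
        removed_per_cycle.append(removed)
        # Every node left without links in this cycle receives the current rank.
        for node, neighbours in adjacency.items():
            if not neighbours and node not in ranks:
                ranks[node] = rank
    return ranks, removed_per_cycle


if __name__ == "__main__":
    path = [(1, 2), (2, 3), (3, 4), (4, 5)]
    print(prune_ranks(path))
```

On a simple chain of five nodes, the two end nodes are pruned in the first cycle and the three inner nodes are left without links in the second, so the inner nodes receive the higher rank.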
The pruning algorithm can also be used to quantify the complexity of any graph. If we take
Equation (
Equation (
Equation (
Using
The
The quantity
Considering how the structure of a given graph is changed by a pruning process, it becomes natural to ask what happens to graphs, and in particular to MSTs, when one or more of their nodes are deleted. How will the graph have to be reorganized to continue to reflect as well as possible the underlying structure of relationships once one or more nodes are taken away? How will the other nodes rearrange their links, on the basis of the underlying metric and constraints, to connect to each other once again?
Define a
Each
The MST represents what we could call the “nervous system” of any dataset. In fact, summing up all of the connection strengths among all the variables, we get the total energy of that system. The MST selects only the connections that minimize this energy, that is, the only ones that are really necessary to keep the system coherent. Consequently, all the links included in the MST are fundamental but, conversely, not every “fundamental” link of the dataset needs to be in the MST. This limit is intrinsic to the nature of the MST itself: every link that gives rise to a cycle in the graph (viz., that destroys the graph’s “treeness”) is eliminated, whatever its strength and meaningfulness. To fix this shortcoming and to better capture the intrinsic complexity of a dataset, it is necessary to add more links to the MST, according to two criteria: (1) the new links have to be relevant from a quantitative point of view; (2) the new links have to be able to generate new cyclic regular microstructures, from a qualitative point of view.
Consequently, the MST tree-graph is transformed into an undirected graph with cycles. Because of the cycles, the new graph is a dynamic system, involving in its structure the time dimension. This is the reason why this new graph should provide information not only about the structure but also about the functions of the variables of the dataset.
To build the new graph, we need to proceed as follows: assume the MST structure as the starting point of the new graph; consider the sorted list of the connections skipped during the derivation of the MST; estimate the
We will call Maximally Regular Graph (MRG) the graph whose H function attains the highest value among all the graphs generated by adding back to the original MST, one by one, the missing connections previously skipped during the computation of the MST itself. Starting from (
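The search just described can be sketched generically. The sketch below assumes that the skipped connections are added back cumulatively, in their sorted order, and that the graph with the highest score is retained; the scoring function is passed in as a parameter because the H function is not implemented here. The names `maximally_regular_graph` and `toy_score` are hypothetical, and `toy_score` is only a placeholder that has nothing to do with the actual H function.

```python
def maximally_regular_graph(tree_edges, skipped_edges, score):
    """Add skipped edges back one by one and keep the best-scoring graph.

    `skipped_edges` is expected to be already sorted by relevance;
    `score` maps a list of edges to a number (a stand-in for H).
    """
    current = list(tree_edges)
    best_edges, best_score = list(current), score(current)
    for edge in skipped_edges:
        current.append(edge)             # links are added cumulatively
        value = score(current)
        if value > best_score:           # strict test keeps the earliest maximum
            best_edges, best_score = list(current), value
    return best_edges, best_score


def toy_score(edges):
    """Placeholder score that peaks at four edges (NOT the H function)."""
    return -abs(len(edges) - 4)


if __name__ == "__main__":
    tree = [(0, 1), (1, 2), (0, 3)]
    skipped = [(0, 2), (1, 3), (2, 3)]
    print(maximally_regular_graph(tree, skipped, toy_score))
```

Passing the score as a callable keeps the search separate from the definition of the complexity measure, so the same loop can be reused once a concrete implementation of H is available.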
The
The MRG calculation is also useful to define the MST compactness: the fewer the arcs skipped during the MST generation, the more representative the MST is; in other terms:
the MRG Hubness, and the number of new links added by MRG generation:
The fuzzy combination of these two indexes can express the MRG Relevance:
We divided the ALS dataset into the Cases dataset (58 records) and the Control dataset (207 records), and applied the AutoCM algorithm to each one independently. The AutoCM algorithm generates two weighted MSTs, and the Delta H function points out the key variables of the two datasets (see Figure
Three variables (APOA4_glu360his, NOS3_A_922_G, LPL_ser447term) seem to be the reason for the low complexity of the cases MST: when each one of them is removed, the MST increases its complexity, taking the same
Three variables (ADRB3_trp64arg, LIPC_C_480_T, MMP3_5A_6A) seem to be the reason for the high complexity of the control MST: when each one of them is removed, the MST decreases its complexity, taking the same
The Delta
Control | | | |
---|---|---|---|
Variables | Hub relevance | Variables | Hub relevance |
Global | 0.17193 | NOS3_C_690_T | 0.171429 |
ADRB3_trp64arg | 0.136905 | NOS3_glu298asp | 0.171429 |
LIPC_C_480_T | 0.136905 | DCP1_ins_del | 0.171429 |
MMP3_5A_6A | 0.136905 | AGTR1_A1166C | 0.171429 |
APOC3_C_641_A | 0.171429 | AGT_met235thr | 0.171429 |
APOC3_C_482_T | 0.171429 | NPPA_G664A | 0.171429 |
APOC3_T_455_C | 0.171429 | NPPA_T2238C | 0.171429 |
APOC3_C1100T | 0.171429 | ADD1_gly460trp | 0.171429 |
APOC3_C3175G | 0.171429 | SCNN1_trp493arg | 0.171429 |
APOC3_T3206G | 0.171429 | SCNN1A_ala663thr | 0.171429 |
APOE_cys112arg | 0.171429 | GNB3_C825T | 0.171429 |
APOE_arg158cys | 0.171429 | ADRB2_arg16gly | 0.171429 |
APOA4_thr347ser | 0.171429 | ADRB2_gln27glu | 0.171429 |
PPARG_pro12ala | 0.171429 | APOB_thr71ile | 0.171429 |
APOA4_glu360his | 0.171429 | F2_G20210A | 0.171429 |
LPL_T_93_G | 0.171429 | F5_arg506gln | 0.171429 |
LPL_asp9asn | 0.171429 | F7_del_ins | 0.171429 |
LPL_asn291ser | 0.171429 | F7_arg353glu | 0.171429 |
LPL_ser447term | 0.171429 | PAI_G5_G4 | 0.171429 |
PON1_met55leu | 0.171429 | PAI_G11053T | 0.171429 |
PON1_gln192arg | 0.171429 | FGB_G_455_A | 0.171429 |
PON2_ser311cys | 0.171429 | ITGA2_G873A | 0.171429 |
LDLR_NcoI_NcoI | 0.171429 | ITGB3_leu33pro | 0.171429 |
CETP_630 | 0.171429 | SELE_ser128arg | 0.171429 |
CETP_628 | 0.171429 | SELE_leu554phe | 0.171429 |
CETP_ile405val | 0.171429 | ICAM_gly214arg | 0.171429 |
LTA_thr26asn_A | 0.171429 | TNFa_G_376_A | 0.171429 |
MTHFR_C677T | 0.171429 | TNFa_G_308_A | 0.171429 |
NOS3_A_922_G | 0.171429 | TNFa_244 | 0.171429 |
TNFa_238 | 0.171429 | | |
LTA_thr26asn_B | 0.171429 | | |
Cases | | | |
Variables | Hub relevance | Variables | Hub relevance |
Global | 0.137127 | AGTR1_A1166C | 0.136905 |
APOA4_thr347ser | 0.136905 | AGT_met235thr | 0.136905 |
APOB_thr71ile | 0.136905 | NPPA_G664A | 0.136905 |
APOC3_C_641_A | 0.136905 | NPPA_T2238C | 0.136905 |
APOC3_C_482_T | 0.136905 | ADD1_gly460trp | 0.136905 |
APOC3_T_455_C | 0.136905 | SCNN1_trp493arg | 0.136905 |
APOC3_C1100T | 0.136905 | SCNN1A_ala663thr | 0.136905 |
APOC3_C3175G | 0.136905 | GNB3_C825T | 0.136905 |
APOC3_T3206G | 0.136905 | ADRB2_arg16gly | 0.136905 |
APOE_cys112arg | 0.136905 | ADRB2_gln27glu | 0.136905 |
APOE_arg158cys | 0.136905 | MMP3_5A_6A | 0.136905 |
ADRB3_trp64arg | 0.136905 | F2_G20210A | 0.136905 |
PPARG_pro12ala | 0.136905 | F5_arg506gln | 0.136905 |
LIPC_C_480_T | 0.136905 | F7_del_ins | 0.136905 |
LPL_T_93_G | 0.136905 | F7_arg353glu | 0.136905 |
LPL_asp9asn | 0.136905 | PAI_G5_G4 | 0.136905 |
LPL_asn291ser | 0.136905 | PAI_G11053T | 0.136905 |
PON1_met55leu | 0.136905 | FGB_G_455_A | 0.136905 |
PON1_gln192arg | 0.136905 | ITGA2_G873A | 0.136905 |
PON2_ser311cys | 0.136905 | ITGB3_leu33pro | 0.136905 |
LDLR_NcoI_NcoI | 0.136905 | SELE_ser128arg | 0.136905 |
CETP_630 | 0.136905 | SELE_leu554phe | 0.136905 |
CETP_628 | 0.136905 | ICAM_gly214arg | 0.136905 |
CETP_ile405val | 0.136905 | TNFa_G_376_A | 0.136905 |
LTA_thr26asn_A | 0.136905 | TNFa_G_308_A | 0.136905 |
MTHFR_C677T | 0.136905 | TNFa_244 | 0.136905 |
NOS3_C_690_T | 0.136905 | TNFa_238 | 0.136905 |
NOS3_glu298asp | 0.136905 | LTA_thr26asn_B | 0.136905 |
DCP1_ins_del | 0.136905 | APOA4_glu360his | 0.171429 |
NOS3_A_922_G | 0.171429 | | |
LPL_ser447term | 0.171429 | | |
(a) The MST of the cases dataset. The key variables of the graph are circled in blue. (b) The MST of the controls dataset. The key variables of the graph are circled in red.
If these considerations have a biological basis, the AutoCM algorithm and the Delta H function procedure have proved very capable of capturing the information hidden in medical datasets.
As a second step of this analysis, we calculated the MRG of the two datasets (see Figures
(a) The MRG of the cases dataset. The MRG connections are shown in red. (b) The MRG of the controls dataset. The MRG connections are shown in red.
Healthy physiologic function is characterized by a complex interaction of multiple control mechanisms that enable an individual to adapt to the exigencies and unpredictable changes of everyday life. The disease process appears to be marked by a progressive impairment of these mechanisms, resulting in a loss of dynamic range in physiologic function and, consequently, a reduced capacity to adapt to stress. The emerging concept is that loss of redundancy, entropy, and complexity is a hallmark of disease, and in particular of chronic diseases.
Defining and quantifying the complexity of the interactions among variables are very difficult tasks from a mathematical point of view. Complex network theory, by establishing criteria to define hubs in a given network of variables, provides a framework on which to build parameters corresponding to an increase or loss of complexity in relation to the presence or absence of a particular variable in a set of variables.
In this paper we have applied a novel methodology to establish which of the polymorphisms potentially involved in SALS occurrence play a fundamental role in protecting against the disease or in increasing vulnerability to it, by increasing or reducing the hubness of a graph encoding the many-to-many dynamic relations among genotypes.
Six genetic variants were identified which contributed differently to the complexity of the system: apolipoprotein A-IV (APOA-IV) glu360his (rs5110), nitric oxide synthase 3 (NOS3)-922A/G (rs1800779), lipoprotein lipase (LPL) ser447term (rs328), adrenergic, beta-3 receptor (ADRB3) trp64arg (rs4994), hepatic lipase (LIPC)-480C/T (rs1800588), and matrix metallopeptidase 3 (MMP3)-1171 5A/6A (rs3025058). Three of the above genes/SNPs, APOA4 glu360his, NOS3-922A/G, and LPL ser447term, represent protective factors, since their contribution to the whole complexity was as high as 0.17 (see Table 1). On the other hand, ADRB3 trp64arg, LIPC-480C/T, and MMP3-1171 5A/6A, whose hub relevance was as high as 0.13, seem to represent susceptibility factors (see Table
Among the genes/SNPs conferring risk of or protection from the disease, we noted that four are involved in lipid pathways (APOA4, LPL, LIPC, and ADRB3), while two are also involved in oxidative stress, angiogenesis, and the cellular cytoskeleton (NOS3 and MMP3).
The protective genes/SNPs identified here include the gene for apo A-IV, mapping to chromosome 11q2 and coding for a glycoprotein whose primary translation product is a 396-residue preprotein that is secreted after proteolytic processing. Although its precise function is not known, apo A-IV is a potent activator of lecithin-cholesterol acyltransferase in vitro and displays antioxidant and antiatherogenic properties in vitro; the antiatherogenic properties of apo A-IV suggest that this protein may act as an anti-inflammatory agent [
The last protective factor, the NOS3-922A/G variant, belongs to a gene located on chromosome 7q36 and coding for the cytosolic enzyme of endothelial cells, a key actor in the modulation of vascular tone through the production of nitric oxide (NO), a vasodilator agent. Constitutive NO release from microvascular endothelium seems to be responsible for preventing leukocyte margination under physiological conditions by modulating oxidative metabolism in endothelial cells. In this mechanism NO acts as an antioxidant agent, preventing the formation of iron-mediated hydroperoxide. Accumulating evidence indicates that ALS is associated with oxidative damage induced by free radicals. Enhancement of oxidative damage markers and signs of an increased compensatory response to oxidative stress were found in patients with SALS [
Considering now the vulnerability factors, LIPC-480C/T belongs to a gene located on chromosome 15q21–23 and coding for a glycoprotein involved in the metabolism of several lipoproteins. The C/T substitution at −480 of the promoter region of the gene has been shown to be significantly associated with lower lipase activity [
The ADRB3 gene has been localized to chromosome 8p12-8p11.1 and it codes for a member of the adrenergic receptor group of G-protein-coupled receptors; it is located mainly in adipose tissue and is involved in the regulation of lipolysis and thermogenesis. Some
Regarding the last risk factor, MMP3-1171 5A/6A, this belongs to a gene mapping to chromosome 11q22.3 and coding for a protein of the matrix metalloproteinase (MMP) family. MMPs, a family of zinc-dependent endoproteinases, are effector molecules in the breakdown of the blood-brain and blood-nerve barriers, and promote neural tissue invasion by leukocytes in inflammatory diseases of the central and peripheral nervous systems. Moreover, MMPs play an important role in synaptic remodeling, neuronal regeneration, and remyelination [
We know that motor neuron death in ALS is the culmination of multiple aberrant biological processes that also involve nonneuronal cells such as microglia and astrocytes; what emerges from our data is that lipid homeostasis, oxidative stress, and cellular remodelling are strictly related to ALS. We have commented above on the role of the specific variants identified here in the cellular/molecular pathways. A recent finding has been reported on how lipid molecules can induce the cytotoxic aggregation of Cu/Zn superoxide dismutase, whose gene is the major one linked to the familial and sporadic forms of the disease, under physiological conditions, suggesting that it might provide a possible mechanism for the pathogenesis of ALS [
In a first work about ALS [
In this work we pose to the scientific community a different question: which genetic polymorphisms (variables) protect or make more vulnerable the ALS patients and the control subjects?
There is no necessary intersection between these two questions: small differences in an organ at work can produce big differences in symptoms, because of the interactions with other organs. Therefore, some polymorphisms can work as the more evident symptoms of a disease without being the main cause of that disease. In the same way, the seven variables of the previous work can be optimal predictors of ALS without being the main cause of the ALS syndromes: they are useful to recognize ALS, but they are not necessarily an explanation of it.
The most predictive features of a disease are not necessarily the features that best explain the dynamics of that disease. For example, in the case of alcohol addiction, the main reason for becoming an alcoholic could be a sociopsychological condition, but the most predictive feature for determining whether someone is an alcoholic may be the functional state of his/her liver.
In the present work, using a completely new adaptive algorithm, we have tried to understand which genetic polymorphisms better explain the deep difference between Cases and Controls; in other words, how all the polymorphisms are arranged in different networks, with different links and connection strengths, in the two subsamples.
We applied here a novel methodology able to deal with complex diseases such as sporadic ALS. This new approach allowed us to identify genes/SNPs conferring susceptibility or protection to the disease; however, we were not able to discriminate which allele of the six variants identified is actually involved, owing to how the database was built. From the dataset analyzed here we extrapolated biological information coherent with possible pathogenetic pathways related to ALS. Our data demonstrate the power of this new approach, and it would be of great interest to test it on other, more complex ALS databases to obtain more information.
M. Buscema performed the statistical analysis and developed the intelligent systems. S. Penco participated in the design of the study, coordinated and drafted the paper. E. Grossi participated in the design of the study, in the statistical analysis, coordinated and drafted the paper. All authors read and approved the final paper.