
Protein-protein interaction (PPI) networks carry vital information on the organization of molecular interactions in cellular systems. The identification of functionally relevant modules in PPI networks is one of the most important applications of biological network analysis. Computational analysis is becoming an indispensable tool for understanding large-scale biomolecular interaction networks. Several types of computational methods have been developed and employed for the analysis of PPI networks. Of these, graph comparison and module detection are the two most commonly used strategies. This review summarizes the current literature on graph kernel and graph alignment methods for graph comparison, as well as module detection approaches including seed-and-extend, hierarchical clustering, optimization-based, probabilistic, and frequent subgraph methods. Herein, we provide a comprehensive review of the major algorithms employed under each theme, including our recently published frequent subgraph method for detecting functional modules shared across multiple cancer PPI networks.

Recent advances in systems biology research have generated a wealth of data on physical and genetic interactions capable of revealing relationships between biomolecules. For example, high-throughput screening methods, such as two-hybrid analysis [

The first group of algorithms of interest, graph comparison, is the process of comparing and contrasting graph-based networks in order to determine PPI network similarities or to detect common or distinct substructures (i.e., subnetworks or subgraphs). Graph-theory-based methods are therefore widely used in the comparative study of molecular interaction networks (MINs). These methods have been applied in various studies and analyzed in previous review articles. For example, in 2006, Sharan and colleagues published a review of the applications of graph comparison methods to analyze MINs [

The second group of algorithms, module detection, involves the identification of functionally important substructures within a larger PPI network, which is one of the most widely studied topics in PPI network analyses. Considering that biological interactions do not operate based on sequence homology between partners, sequence homology-based methods such as Basic Local Alignment Search Tool (BLAST) [

Graph comparison can be used to search for conserved regions that represent functional, orthologous modules across different species or biological systems. Conversely, module detection algorithms can be applied to graph alignment to find the optimal local alignment between protein networks [

Graph comparison is an important tool for understanding PPI networks. For instance, by measuring the discrepancy between PPI networks of healthy and diseased individuals it is possible to predict disease outbreak and progression [

Graphlets are small, connected, nonisomorphic induced subgraphs of a larger network. The graphlet degree distribution generalizes the ordinary degree distribution: instead of counting only the edges that touch a node, it counts the graphlets of each type in which the node participates. It has been established as a more comprehensive model than earlier graph comparison methods. In 2006, Przulj reported that agreement in graphlet degree distribution could be effectively used to compare biological networks [

Like the graphlet degree distribution method that uses graphlets to compare PPI networks, graph kernels decompose networks into subunits and use the subunit information to calculate PPI network similarities.
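The decomposition idea can be illustrated with a minimal sketch. It assumes undirected, unweighted networks and restricts the substructures to the two connected 3-node graphlets (open paths and triangles); the function names and the cosine normalization are illustrative choices, not taken from any of the published kernels discussed in this review.

```python
from itertools import combinations
from math import sqrt


def to_adj(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def graphlet3_counts(adj):
    """Count induced connected 3-node graphlets as (open paths, triangles)."""
    paths = triangles = 0
    for u, v, w in combinations(sorted(adj), 3):
        n_edges = (v in adj[u]) + (w in adj[u]) + (w in adj[v])
        if n_edges == 2:
            paths += 1
        elif n_edges == 3:
            triangles += 1
    return (paths, triangles)


def graphlet_kernel(adj_a, adj_b):
    """Cosine-normalized dot product of the two graphlet count vectors."""
    fa, fb = graphlet3_counts(adj_a), graphlet3_counts(adj_b)
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = sqrt(sum(x * x for x in fa)) * sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0


# Toy networks (hypothetical protein names).
triangle = to_adj([("A", "B"), ("B", "C"), ("A", "C")])
path = to_adj([("A", "B"), ("B", "C")])
tri_tail = to_adj([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

print(graphlet3_counts(tri_tail))
print(graphlet_kernel(triangle, tri_tail))
```

Enumerating all node triples costs cubic time in the number of nodes, which hints at why exhaustive decomposition into larger substructures becomes expensive and why sampling or other speedups are attractive for large networks.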

As proposed by Haussler, graph kernels can be viewed as special cases of R-convolution kernels [

In 2005 and 2007, Borgwardt and colleagues proposed fast algorithms for computing random walk kernels [

Given that decomposing networks into small substructures is an expensive process, many state-of-the-art graph kernels do not scale to large graphs. To address this issue, in 2009 Shervashidze and colleagues proposed a statistical approach to compare graphs based on the distribution of graphlets [

In 2007, Shervashidze and colleagues proposed another algorithm for fast computation of subtree kernels [

Graph kernels bridge the gap between graph-structured data and a large spectrum of machine-learning algorithms [

Graph alignment is the process of mapping nodes and edges between graphs such that conserved subgraphs can be identified. Graph alignment adopts a concept similar to that used for sequence alignment. However, in contrast to sequence alignment, which aligns linear sequences to identify regions of similarity, graph alignment must account for both the nodes and the connections among them. Graph alignment involves the subgraph isomorphism problem, which is known to be NP-complete. As with graph kernels, obtaining an exact solution for graph alignment is not feasible even for moderately sized graphs. Thus, most graph alignment methods resort to heuristics to reduce the cost of computation.
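To make the role of heuristics concrete, the following is a minimal sketch of a greedy pairwise alignment. It is not one of the published algorithms reviewed here: it assumes purely topological scoring, seeds the mapping with the highest-degree node of each network, and then repeatedly adds the node pair that conserves the most edges (ties broken by degree difference). All function names are hypothetical.

```python
def to_adj(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def conserved_edges(edges_a, edges_b, mapping):
    """Number of edges of network A whose mapped endpoints are an edge in B."""
    set_b = {frozenset(e) for e in edges_b}
    return sum(
        1
        for u, v in edges_a
        if u in mapping and v in mapping
        and frozenset((mapping[u], mapping[v])) in set_b
    )


def greedy_align(edges_a, edges_b):
    """Greedy one-to-one node mapping from A to B (heuristic, not optimal)."""
    adj_a, adj_b = to_adj(edges_a), to_adj(edges_b)
    seed_a = max(sorted(adj_a), key=lambda n: len(adj_a[n]))
    seed_b = max(sorted(adj_b), key=lambda n: len(adj_b[n]))
    mapping = {seed_a: seed_b}
    while True:
        used_b = set(mapping.values())
        best = None
        for u in sorted(adj_a):
            if u in mapping:
                continue
            for x in sorted(adj_b):
                if x in used_b:
                    continue
                # Edges that would be conserved if u were mapped to x.
                gain = sum(
                    1 for n in adj_a[u]
                    if n in mapping and mapping[n] in adj_b[x]
                )
                score = (gain, -abs(len(adj_a[u]) - len(adj_b[x])))
                if gain > 0 and (best is None or score > best[0]):
                    best = (score, u, x)
        if best is None:
            return mapping
        mapping[best[1]] = best[2]


# Toy networks (hypothetical proteins p* and q*).
edges_a = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"), ("p2", "p3")]
edges_b = [("q1", "q2"), ("q1", "q3"), ("q1", "q4"), ("q2", "q3"), ("q4", "q5")]

mapping = greedy_align(edges_a, edges_b)
print(mapping)
print(conserved_edges(edges_a, edges_b, mapping))
```

An exhaustive search would have to consider a factorial number of candidate mappings, whereas this greedy procedure examines only a polynomial number of node pairs per step. The price is that the result depends on the seed and the tie-breaking rule and carries no guarantee of optimality.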

Similar to sequence alignment strategies, graph alignment can be local or global. Local graph alignment matches nodes and edges to maximize the local alignment score. In 2006, Koyuturk and colleagues proposed a local alignment framework for PPI networks based on the duplication/divergence evolutionary model [

A series of NetworkBLAST algorithms have been reported for the global alignment of networks. For example, PathBLAST [

Other methods use biological features of interacting proteins for graph alignment. For example, using integer quadratic programming, Li and colleagues proposed a PPI alignment algorithm in 2006 based on similarities in both the protein sequence and network architecture [

The GHOST alignment method [

GRAAL (GRAph ALigner) is a global alignment method based solely on the network topology [

Regardless of whether the alignment is local or global, performing efficient and accurate alignments on multiple networks continues to be challenging. Nevertheless, the Graemlin algorithm was the first algorithm capable of performing scalable, multiple network alignments [

IsoRank [

As a summary of graph comparisons, we have reviewed an array of methods, from early single-feature, distance-based algorithms to the most recent multiple network alignment methods. Early distance-based algorithms are founded on strict matching of network structures and are therefore suitable only for simple network comparisons. Graph kernels compare graphs by decomposing them into substructures and comparing those substructures, an approach grounded in well-supported statistical analyses and mathematical derivations. This results in more accurate and meaningful comparisons, particularly for approximate structural similarities. However, because a kernel produces a single value as the comparison result, it cannot provide internal details of the comparison, such as the node and edge mapping. Therefore, graph kernels are most suitable for solving classification problems on small to medium sized graphs. For large graphs, graph kernels are at a disadvantage: the kernel calculations are very time consuming, and the calculated values are less informative than those for smaller networks.

To perform detailed comparisons of networks, graph alignments are preferred. Local alignments align subnetworks to maximize the local alignment score, whereas global alignments focus on maximizing the overall alignment score. While local alignments may be ambiguous, global alignments typically produce a unique mapping between nodes. In recent years, several multiple network alignment methods have been developed. Compared to pairwise alignments, multiple network alignment methods provide stronger evidence of conservation for the identified subnetworks. Graph alignments are very effective for network comparison and for identifying conserved regions in networks. However, due to the multidimensional nature and complexity of graph data, graph alignment algorithms rely on heuristics to approximate the optimal solution. The drawbacks of graph alignments are that different heuristics usually produce very different solutions and that there are no standards or benchmarks like those available for sequence alignment.

A summary of different strategies used for graph comparisons is provided in Table

A summary of graph comparison methods by the strategy employed.

Methods | Comparison strategy | Specification | References | |||
---|---|---|---|---|---|---|

Local | Global | Pairwise | Multiple | |||

MCS | Distance-based | x | x | [ | ||

Editing distance | Distance-based | x | x | [ | ||

Graphlet | Graphlet | x | x | [ | ||

Fast random walk kernel | Graph kernel | x | [ | |||

Graphlet kernel | Graph kernel | x | [ | |||

Fast subtree kernel | Graph kernel | x | [ | |||

Weighted alignment | Graph alignment | X | x | [ | ||

Substructure-based alignment | Graph alignment | X | x | [ | ||

Class of NetworkBLAST | Graph alignment | X | x | x | [ | |

Quadratic programming | Graph alignment | X | x | [ | ||

Class of GRAAL | Graph alignment | x | x | [ | ||

Class of Graemlin | Graph alignment | x | x | [ | ||

Class of IsoRank | Graph alignment | x | x | [ | ||

GHOST | Graph alignment | x | x | [ | ||

NETAL | Graph alignment | x | x | [ |

One of the most important applications of biological network analysis is the identification of functionally relevant modules in PPI networks. Similar to social networks and internet-based networks, PPI networks are conjectured to exhibit a power law degree distribution [

Seed-and-extend approaches predict functional protein modules based on the density of PPI networks. The functional modules are generally initiated from single nodes deemed as central nodes or “seeds,” and new nodes are added to “extend” the subnetworks. Different algorithms have specific metrics for determining when the subnetworks will reach convergence.
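The general seed-and-extend loop can be sketched as follows. This is a deliberately simplified illustration and not the scoring scheme of any specific published method: it assumes the seed is the highest-degree node, that a candidate neighbor is accepted only if the resulting subnetwork keeps an edge density of at least `min_density`, and that growth stops when no candidate qualifies. The names `density`, `seed_and_extend`, and `min_density` are hypothetical.

```python
def to_adj(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(nodes, adj):
    """Fraction of possible edges among `nodes` that are present."""
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = sum(len(adj[u] & nodes) for u in nodes) // 2
    return internal / (n * (n - 1) / 2)


def seed_and_extend(adj, min_density=0.8):
    """Grow one module from the highest-degree seed while density stays high."""
    seed = max(sorted(adj), key=lambda n: len(adj[n]))
    module = {seed}
    while True:
        # Candidates are neighbors of the current module not yet included.
        candidates = set().union(*(adj[u] for u in module)) - module
        best = None
        for c in sorted(candidates):
            d = density(module | {c}, adj)
            if d >= min_density and (best is None or d > best[0]):
                best = (d, c)
        if best is None:
            return module  # convergence: no neighbor keeps the module dense
        module.add(best[1])


# Toy network: a 4-clique (a, b, c, d) with a sparse tail d-e-f.
ppi = to_adj([
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
    ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f"),
])

print(sorted(seed_and_extend(ppi, min_density=0.8)))
```

The density threshold plays the role of the convergence criterion mentioned above; real methods differ mainly in how seeds are ranked and in which metric decides when extension stops.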

The first seed-and-extend approach we will review is Molecular Complex Detection (MCODE). This method detects densely connected regions in large PPI networks that may represent molecular complexes [

Both the MCODE and SPICi methods are purely based on network topology. In contrast, Maraziotis and colleagues presented a new method that discovers functional modules from weighted graphs [

Hierarchical clustering is another group of clustering algorithms widely used for biological data analysis. Hierarchical clustering methods are often applied to gene expression data to determine coexpressed genes, clusters, and outliers [

The Protein Distance Based on Interactions (PRODISTIN) method was developed based on the principle that the more interaction partners two proteins in a network share, the more likely it is that they are functionally related [

The hierarchical clustering methods discussed above are agglomerative methods in which the additions of edges are used to construct hierarchical trees. In hierarchical clustering, there is another class of methods called divisive methods that construct hierarchical trees by removing edges. Divisive methods attempt to find the least similar connected pairs of vertices from the network of interest and then remove the edges between the pairs. An example of a divisive method is Newman and colleagues’ hierarchical clustering for finding community structures in networks [
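The divisive idea described above can be sketched in a few lines. As an assumption made purely for illustration, the similarity of a connected pair is measured here by the Jaccard index of the two nodes' closed neighborhoods; the method of Newman and colleagues uses a different edge score, so this is a stand-in and not a reimplementation. The function names are hypothetical.

```python
def to_adj(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def components(adj):
    """Connected components as a list of node sets, ordered by smallest member."""
    seen, parts = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, part = [start], set()
        while stack:
            node = stack.pop()
            if node in part:
                continue
            part.add(node)
            stack.extend(adj[node] - part)
        seen |= part
        parts.append(part)
    return parts


def jaccard(adj, u, v):
    """Jaccard index of the closed neighborhoods of u and v (assumed similarity)."""
    nu, nv = adj[u] | {u}, adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def divisive_split(edges, n_parts=2):
    """Remove the least similar connected pair until n_parts components exist."""
    adj = to_adj(edges)
    while len(components(adj)) < n_parts:
        remaining = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
        if not remaining:
            break  # no edges left to remove
        u, v = min(remaining, key=lambda e: jaccard(adj, *e))
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy network: two triangles joined by the bridge c-d.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]

print(divisive_split(edges, n_parts=2))
```

Recording the order in which edges are removed, rather than stopping at a fixed number of parts, would yield the full hierarchical tree from the top down.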

Hierarchical clustering methods have primarily focused on grouping nodes. An unconventional method proposed by Ahn and colleagues in 2010 focuses on edge clustering [

In addition to seed-and-extend approaches and hierarchical clustering, module detection can also be formulated as an optimization problem. In 2004, King and colleagues completed work on predicting protein complexes via cost-based clustering [

Genes with significant changes in expression have immediate and wide interest as markers of disease and the stage of disease development, as well as markers for a variety of other cellular phenotypes [

In 2010, Zhang and colleagues introduced a new method that uses graph modularity density to detect functional modules in PPI networks [

HOTNET, published by Raphael’s lab in 2011, is another framework for de novo identification of significantly mutated subnetworks [

In recent years, probabilistic machine learning methods have been developed and successfully used in many areas of bioinformatics. Here, we review a few machine learning methods developed for network module detection. In 2011, Shi and colleagues used a “semisupervised” method for detecting protein complexes in PPI networks [

In 2008, Qi and colleagues presented a Bayesian network (BN) algorithm for detecting protein complexes from PPI networks [

Most module detection algorithms are based on either network connectivity or the density of subgraphs. In contrast, we recently developed a novel method that predicts functional modules based on the frequency of subgraphs [

In summary, we reviewed multiple methods for module detection. The performance of each of these methods varies. In Brohee and colleagues’ review paper, the performance of MCL, RNSC, and MCODE is compared in terms of robustness, sensitivity, and the results of clustering [

Among the numerous methods for module detection, different metrics are used to evaluate the weights of modules. Some metrics are based on connectivity density, as in MCODE and edge clustering, while others are based on vertex scoring, as in RNSC and jActiveModules. Seed-and-extend methods such as MCODE and SPICi assume that hub nodes always exist as the centers of modules. Such assumptions may limit the types of modules that can be discovered by seed-and-extend algorithms. On the other hand, hierarchical clustering algorithms, both agglomerative and divisive, do not effectively use the topological information of the networks. Because the distances between nodes or edges determine how the clusters are drawn, using different distance metrics in the algorithm will lead to different clustering results. Optimization methods have their limitations too. Each run of an optimization method may generate different results, depending on the initial settings. Therefore, multiple runs are required to achieve a relatively consistent result. Finally, frequency-based methods look for recurring patterns in PPI networks. These methods suffer from performance issues because frequent pattern matching involves subgraph isomorphism tests, and the subgraph isomorphism problem is known to be NP-complete.

Table

A summary of module detection methods by the strategy employed.

Methods | Module detection strategy | Specification | References | |
---|---|---|---|---|

Topological | Both | |||

MCODE | Seed-and-extend | x | [ | |

SPICi | Seed-and-extend | x | [ | |

Kernel set | Seed-and-extend | x | [ | |

PRODISTIN | Hierarchical clustering | x | [ | |

ADJW and Hall | Hierarchical clustering | x | [ | |

Divisive | Hierarchical clustering | x | [ | |

Edge clustering | Hierarchical clustering | x | [ | |

RNSC | Optimization | x | [ | |

jActiveModules | Optimization | x | [ | |

Modularity density | Optimization | x | [ | |

HOTNET | Optimization | x | [ | |

Semi-supervised | Probabilistic | x | [ | |

Bayesian network | Probabilistic | x | [ | |

MCL | Probabilistic | x | [ | |

Frequent subgraph | Frequency-based method | x | [ |

Graph comparison and module detection are two commonly used strategies for analyzing PPI networks. Among the algorithms for graph comparison, graph kernels compare graphs by decomposing them into nontrivial subunits. Similarity scores between the graphs are derived by comparing these subunits. Compared with other graph comparison methods, graph kernels have the advantage of speed, and efficiency has been the key issue addressed in their development. In contrast to graph kernels, which produce only limited information from the comparison, graph alignment provides an in-depth analysis of the mappings between graphs. Graph alignment adopts concepts from sequence alignment; the alignment scores are adjusted to reflect topological or relational information for the graphs. For the purpose of graph comparison, graph kernels are suitable for classification tasks that require high-speed computation and an intuitive measurement of distance. Graph alignment methods are suitable for determining conserved regions between PPI networks. Note that graph alignments can be local or global, and graph comparisons can be between two networks (pairwise) or among more than two networks (multiple).

Among the module detection algorithms, seed-and-extend methods identify modules by first selecting their core nodes and then expanding the core nodes with new nodes that increase the subgraph density. Hierarchical clustering creates clusters hierarchically based on distances between the clusters. Optimization-based and probabilistic approaches use mathematical derivations to determine the best scoring modules. The frequent subgraph approach searches for common and frequent substructures among PPI networks. Different methods tackle the problem from different perspectives. For example, seed-and-extend methods use connection density and neighboring network information to find heavily connected modules in PPI networks. Hierarchical clustering methods use distances between nodes or edges as the key factor for clustering. Optimization-based methods represent network divisions using mathematical models, and modules are detected through the optimal division of the network. Probabilistic methods use statistics of graph data to construct training models and to determine the state transitions of the algorithm. Finally, for frequency-based algorithms, the frequency of subgraphs becomes the key criterion for detecting modules. The choice of method for graph analysis depends on the interpretation of the problem and the perspective of the investigator tackling it. Because the methods are developed from different perspectives, they produce complementary views of graph data. Future development of graph analysis will likely focus on integrated analysis using an ensemble of methods, because no single method performs well for all types of comparisons. Because most interaction network studies require both graph comparison and module detection, such analyses will benefit from the integration of methods.

The authors declare that there is no conflict of interests regarding the publication of this paper.

This research was supported in part by grants from the National Institutes of Health [1R01GM086533-01A1 to CG] and startup funds to CG from the University of Nebraska Medical Center. The authors also thank Ms. Melody Montgomery at the UNMC Research Editorial Office for help with the professional editing of this paper.
