Existing algorithms allow us to infer phylogenetic networks from sequences (DNA, protein or binary), sets of trees, and distance matrices, but there are no methods to build them using the gene order data as an input. Here we describe several methods to build split networks from the gene order data, perform simulation studies, and use our methods for analyzing and interpreting different real gene order datasets. All proposed methods are based on intermediate data, which can be generated from genome structures under study and used as an input for network construction algorithms. Three intermediates are used: set of jackknife trees, distance matrix, and binary encoding. According to simulations and case studies, the best intermediates are jackknife trees and distance matrix (when used with Neighbor-Net algorithm). Binary encoding can also be useful, but only when the methods mentioned above cannot be used.

Gene order data gain increasing popularity in the phylogenetic community because of several advantages they have, compared with gene sequences. First, in most cases the genome structure evolves slower than DNA or protein sequence, allowing the inference about ancient events with less noise level [

As with any analysis, it is based upon several assumptions. First, every genome is represented as a permutation of homologous markers, which are usually directed. All genomes in the dataset should contain the same set of markers. Though in the majority of studies these markers are genes (hereinafter referred to as “genes”), they can be contiguous parts of a chromosome of any reasonable length. Such a permutation is traditionally represented as a sequence of signed numbers with absolute values being identifiers of elements and a sign denoting direction. Second, a set of operations on a permutation is limited to some subset of all actually possible evolutionary events. Inversions are the most common, but there can be also translocations, double-cut-and-join operations, and several other events [

In the majority of phylogenetic studies, the evolutionary relationships of taxa are represented by phylogenetic trees. Despite their usefulness for biology, a phylogenetic tree by definition is able to display only one divergence-based scenario. Real evolution, on the other hand, is not limited to diverging taxa: recombination and horizontal gene transfer occur in all major groups of organisms. Even in cases when a tree-like evolution is safe to assume, sometimes it is useful to simultaneously visualize all conflicting scenarios supported by the dataset [

An idea of the split network is based upon one crucial observation: every branch of the phylogenetic tree defines a split or bifurcation of taxa set, that is, separates it into two parts. Split is called trivial if one of its parts contains only one taxon. Such a split corresponds to a leaf branch on the phylogenetic tree. A set of splits is called compatible if it can be represented by an unrooted phylogenetic tree. Phylogenetic inference in this framework is reduced to generating a set of nontrivial splits and representing them as a graph.

A split network is a generalization of mathematical concept of a phylogenetic tree. It is able to represent both compatible and incompatible split sets. Unlike unrooted tree, a split network may use several parallel edges to represent any given split. Deletion of all edges corresponding to a split divides the network in exactly two connected components, one containing all taxa from one part of the split, and another containing taxa from the other. The edge lengths are defined by split weight, which can be of different sense depending on the algorithm used to generate a split set [

The main idea of this work is to apply existing network-building algorithms to intermediate data, which were generated according to the structure of studied genomes. We used a set of phylogenetic trees (consensus network algorithm), inversion distance matrix (split decomposition and Neighbor-Net algorithms), and binary encoding (parsimony-splits algorithm) as intermediate data. In all cases, we considered rearrangements in unichromosomal genomes consisting of directed markers and limited the evolutionary process to inversions. The latter assumption is not applied to binary encoding (see details below).

We did not filter the splits with any of the algorithms and intermediates. The aim of the additional filtering is to obtain a relatively simple set of splits (e.g., cyclic or weakly compatible). Such a set of splits corresponds to a network with simple topology, which can be easily drawn on a flat surface. However, such a simplification causes removal of some splits from the resulting network, which potentially leads to loss of important data. We decided to sacrifice simplicity of the network for the sake of its accuracy.

The data transformations, jackknife analysis, and simulations were done by custom Perl scripts, which are available upon request. All network construction algorithms are implemented in SplitsTree4 software by Huson and Bryant [

Since the function of split networks is to represent conflicting trees, the most obvious solution is to generate a bunch of trees and sum them up into a network. To obtain a set of trees, we conducted a jackknife procedure: 40% of genes were chosen randomly and removed from all permutations, preserving the order of the remaining ones. We generated 100 replicates and built phylogenetic trees using COGNAC package by Kang et al. [

Phylogenetic network was built by consensus network algorithm [

Distance matrix is a common kind of intermediate data which allows building phylogenetic trees and networks from different raw data using the same algorithms. GRAPPA package [

Binary encoding (BE) represents a set of permutations as a matrix of binary characters [

The median network algorithm is also popular in the analysis of binary sequences. However, a crucial step of the analysis is to reconstruct ancestral states. In case of BE, the algorithm does not account for actual evolution of underlying gene order, therefore the reconstruction of ancestral sequences will be incorrect. Thus, we did not use this approach in our work.

For simulations, we generated Yule trees for 10 and 20 taxa with the branch lengths sampled from Poisson distribution. The expected branch lengths

The resulting networks were compared with the true underlying tree. We assessed only a network topology, that is, a set of splits. We counted nontrivial splits that are either present in the true tree, but not in the network (false negative (FN)), or

Two values were obtained from the FN and FP counts to compare performance of the methods. First value is sensitivity, that is, probability for any split in the true tree to be included into the phylogenetic network. Another value is a positive predictive value (PPV), which has an opposite meaning: probability that a network split belongs to the true tree. We calculated neither specificity nor negative predictive value because the true negative count, which is the amount of all possible splits on given taxa set minus the amount of splits in the true tree, is always several orders of magnitude larger than that of FN and FP.

Split weights were not taken into account because their meaning is different in the network-building approaches used. This is an important flaw of the simulation procedure because once a split is present in the network, it will increase sensitivity (or decrease PPV) of the method, even if the weight of this split is vanishingly small.

We used two real datasets in the case studies. The first is Campanulaceae dataset, which is commonly used to evaluate performance of the rearrangement-based phylogenetic inference software [

Another dataset consists of six chloroplast genomes of diatoms and two genomes of diatom-derived chloroplasts of dinoflagellates. All these are circular genomes 120–130 kbp long, containing from 154 to 159 genes, including tRNAs. Each chloroplast DNA bears a long inverted repeat that contains genes for rRNAs and several proteins. Phylogenetic relationships of taxa are quite well established [

We removed one of the copies of inverted repeat from all diatom genomes, thus transforming sequences from circular to linear. Then all common genes were assigned numbers and marked with signs depending on their orientation. The resulting dataset consisted of eight permutations of 149 genes each. Both real datasets were analyzed in the same way as the simulated data.

According to simulation studies (Figure

Simulation results.

Trees with 10 taxa

Trees with 20 taxa

Split networks and reference tree for the Bacillariophyta dataset.

Parsimony splits algorithm on binary encoding

Consensus network algorithm on a set of jackknife trees

Split decomposition algorithm on the distance matrix

Neighbor-Net algorithm on the distance matrix

Phylogenetic tree built by MGR package

Split networks and reference tree for the Campanulaceae dataset.

Parsimony splits algorithm on binary encoding

Consensus network algorithm on a set of jackknife trees

Split decomposition algorithm on the distance matrix

Neighbor-Net algorithm on the distance matrix

Tree built by BADGER package (redrawn from [

The split decomposition algorithm is slightly outperformed by the consensus network approach in terms of both sensitivity and PPV. The distance matrix can be computed significantly faster than a set of trees. It took less than a minute on a desktop computer for all tests performed. Additionally, this method guarantees to produce a weakly compatible set of splits, ensuring less complicated network.

PPV of Neighbor-Net algorithm significantly decreases with increasing lengths of tree branches. On the other hand, sensitivity is similar to that of other methods.

The parsimony-splits algorithm applied to the binary encoding is different from other methods. It does not analyze permutations, but processes binary matrices built from them. This may be useful, if the evolutionary process is not assumed to be based mainly on inversions. However, in our “inversions-only” simulations, the networks built with this method have the lowest sensitivity and the second lowest PPV.

Obviously, one can never know exactly which tree is actually true for a real dataset. Therefore, by analyzing real data: one can only assess relative complexity of a network, that is, how many splits are included, and whether it is congruent with the trees obtained on the same dataset with other methods.

On Bacillariophyta dataset, we first built a phylogenetic tree using MGR package (Figure

The Neighbor-Net algorithm has produced the most complex network (Figure

Campanulaceae dataset was confirmed to be ambiguous. All methods support five relatively well-supported monophyletic clusters (marked by colors in Figure

In this study, we propose several methods to build split networks using the gene order data via generating the intermediate datasets. We used a set of jackknife trees, an inversion distance matrix, and a binary encoding of the gene order as intermediate data. The performance of these methods is shown to vary depending on input data. Furthermore, the suggested methods are different in assumptions and mathematical approaches behind them. Below we summarize

A set of jackknife trees is useful in most cases. In simulations, it performs well in terms of both sensitivity and positive predictive value. Moreover, it produces networks with bifurcation support as a split weight, which is very useful when comparing the reliability of different scenarios. Since, in the absence of additional split filtering, consensus network approach does not limit the produced split set to weakly compatible or circular, it can create networks of very complex, hard to read topology.

Networks are computationally cheaper to build with the distance matrix as an intermediate dataset. This matrix can be analyzed by Neighbor-Net and split decomposition algorithms. When comparing these two algorithms, it is necessary to take into account that PPV is much less important than sensitivity. If the data clearly support only one scenario, it would not be obscured by addition of several low-weight splits represented by barely visible edges. On the other hand, if several contradictory trees are supported, a resulting network must include splits from all of them. In this case, PPV will decrease. However, the use of a network instead of a tree is aimed at representing this contradiction.

Unlike split decomposition algorithm which generates multifurcations, Neighbor-Net tends to add a lot of low-weight splits into network. Moreover, it has slightly higher sensitivity. These two features seem to be advantageous to apply the Neighbor-Net approach. However, the split network generated via Neighbor-Net, which is always producing a circular set of splits, may lack some splits versus the network derived with split decomposition algorithm. For detailed example, see archaeal chaperonins dataset [

Analysis of binary encoded genome structures by parsimony-splits algorithm has lower sensitivity and PPV than the rest of methods. Still, it can be useful for very large datasets, when other approaches are computationally expensive. The fact that evolutionary process is not assumed to consist of some limited set of operations is also advantageous when no such set can be proposed. However, in most cases, the consensus network or Neighbor-Net approach would be more reliable.

This work was financially supported by the Program of the Presidium of the Russian Academy of Sciences “Molecular and Cell Biology” (project no. 6.9).