Constructing Phylogenetic Networks Based on the Isomorphism of Datasets

Constructing rooted phylogenetic networks from rooted phylogenetic trees has become an important problem in molecular evolution. Many methods have been proposed for this problem, and the most efficient of them, such as CASS, LNETWORK, and BIMLR, are based on the incompatibility graph. This paper studies what the methods based on the incompatibility graph have in common, the relationship between the incompatibility graph and the phylogenetic network, and the topologies of incompatibility graphs. For a topology G, we can find all the simplest datasets and construct a network for each of them. For any dataset 𝒞, we can then compute a network from the network representing the simplest dataset that is isomorphic to 𝒞. This saves time for the algorithms when constructing networks.


Introduction
The evolutionary history of species is usually represented as a (rooted) phylogenetic tree, in which each species has only one parent. In practice, evolution also involves reticulate events such as hybridizations, horizontal gene transfers, and recombinations [1][2][3][4][5], so a species may have more than one parent. Phylogenetic trees therefore cannot fully describe the evolutionary history of species. Phylogenetic networks, which generalize phylogenetic trees, can represent reticulate events. They can also represent conflicting evolutionary information that may come from different datasets or different trees [6][7][8][9].
Phylogenetic networks can be classified into unrooted [10][11][12] and rooted networks [4][13][14][15][16][17][18][19]. An unrooted phylogenetic network is an unrooted graph whose leaves are bijectively labelled by the taxa. A rooted phylogenetic network is a rooted directed acyclic graph (DAG for short) whose leaves are bijectively labelled by the taxa [20][21][22]. Rooted phylogenetic networks have been studied widely for representing the evolution of taxa, as the evolution of species is inherently directed. This paper studies relevant properties of the rooted phylogenetic networks constructed from rooted trees.
The algorithms that construct rooted phylogenetic networks from rooted phylogenetic trees are mainly classified into three types: the cluster network [17], based on the Hasse diagram; the galled network [16], based on the seed-growing algorithm; and Cass [23], Lnetwork [24], and BIMLR [25], based on the decomposition property of networks. In particular, the third type of methods (Cass, Lnetwork, and BIMLR) can construct more precise networks than the other methods. In the following, unless otherwise specified, we refer to rooted phylogenetic networks as networks.
Let X be a set of taxa. A proper subset C of X (other than ∅ and X) is called a cluster. A cluster C is trivial if |C| = 1; otherwise, it is nontrivial. Let T be a rooted phylogenetic tree on X; if there is an edge e = (u, v) in T such that the set of taxa which are descendants of v equals C, we say that T represents C. Figure 1 shows two rooted phylogenetic trees T_1 and T_2 and all nontrivial clusters represented by T_1 and T_2. Trivial clusters are not listed. Given a network N and a cluster C, keep exactly one incoming edge and disconnect all other incoming edges for each reticulate node (i.e., each node with more than one incoming edge); if there is then a tree edge e = (u, v) (i.e., v has at most one incoming edge) in N such that the set of taxa which are descendants of v equals C, we say that N represents C in the softwired sense. On the other hand, if there is a tree edge e = (u, v) in N, with all incoming edges kept, such that the set of taxa which are descendants of v equals C, we say that N represents C in the hardwired sense. The three types of methods mentioned above are based on clusters; that is, they first compute all of the clusters represented by the input trees and then construct a network representing all clusters in the softwired sense. In this process, the third type of methods (Cass, Lnetwork, and BIMLR) rely on the incompatibility graph (discussed in the following). This paper discusses the relationship between the incompatibility graphs and the constructed networks.
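The definition of the clusters represented by a tree can be illustrated with a minimal Python sketch. The nested-tuple encoding of a tree and the name `tree_clusters` are assumptions made for this illustration only; they do not come from Cass, Lnetwork, or BIMLR.

```python
def tree_clusters(tree):
    """Return the nontrivial clusters represented by a rooted tree.

    Assumed encoding: an internal node is a tuple of children,
    a leaf is any non-tuple taxon label.
    """
    clusters = set()

    def taxa_below(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        taxa = frozenset().union(*(taxa_below(child) for child in node))
        clusters.add(taxa)  # taxa below the edge entering this node
        return taxa

    all_taxa = taxa_below(tree)
    clusters.discard(all_taxa)  # a cluster must be a proper subset of X
    return {c for c in clusters if len(c) > 1}


# Hypothetical tree on X = {1, 2, 3, 4, 5}
t = (((1, 2), 3), (4, 5))
print(sorted(sorted(c) for c in tree_clusters(t)))
```

Running the sketch on several input trees and taking the union of the results gives the cluster set that the cluster-based methods start from.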

Preliminaries
A rooted phylogenetic network N = (V, E) on X is a rooted DAG whose leaves are bijectively labelled by X. The indegree of a node v ∈ V is denoted by indeg(v). A node v with indeg(v) ≥ 2 is called a reticulate node, a node v with indeg(v) ≤ 1 is called a tree node, and, in particular, the tree node with indegree 0 is the root node. The reticulation number of a network N = (V, E) is ∑_{indeg(v)>0} (indeg(v) − 1) = |E| − |V| + 1.
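As a quick check of the reticulation number formula, the following Python sketch computes the sum over indegrees and compares it with |E| − |V| + 1. The edge-list encoding and the example network are hypothetical.

```python
from collections import Counter


def reticulation_number(edges):
    """Sum of (indeg(v) - 1) over all nodes v with indeg(v) > 0."""
    indegree = Counter(head for _, head in edges)
    return sum(d - 1 for d in indegree.values())


# Hypothetical network: root r, one reticulate node h with parents a and b.
edges = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
         ("b", "h"), ("b", "y"), ("h", "z")]
nodes = {n for edge in edges for n in edge}

print(reticulation_number(edges))       # sum over indegrees
print(len(edges) - len(nodes) + 1)      # |E| - |V| + 1
```

Both expressions agree because every node except the root contributes its indegree to |E| and 1 to |V| − 1.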
Given a set of taxa X, two clusters C_1 and C_2 on X are called compatible if they are disjoint or one contains the other; that is, C_1 ∩ C_2 = ∅ or C_1 ⊆ C_2 or C_2 ⊆ C_1; otherwise, they are incompatible. Obviously, a trivial cluster is compatible with any cluster. Given two incompatible clusters C_1 and C_2, C_1 ∩ C_2 is called the incompatible taxa with respect to C_1 and C_2. A set of clusters 𝒞 on X is called compatible if its clusters are pairwise compatible; otherwise, it is incompatible. For a set of clusters 𝒞, its incompatibility graph IG(𝒞) = (V, E) is an undirected graph with node set V = 𝒞 and edge set E, where an edge connects two incompatible clusters.
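The compatibility test and the incompatibility graph translate directly into code. The sketch below is only an illustration; the function names and the example cluster set are assumptions.

```python
from itertools import combinations


def compatible(c1, c2):
    """Two clusters are compatible if they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def incompatibility_graph(clusters):
    """Return (nodes, edges) of IG(C); each edge is a pair of incompatible clusters."""
    nodes = [frozenset(c) for c in clusters]
    edges = {frozenset((a, b))
             for a, b in combinations(nodes, 2)
             if not compatible(a, b)}
    return nodes, edges


# Hypothetical cluster set: only {1, 2} and {2, 3} are incompatible.
nodes, edges = incompatibility_graph([{1, 2}, {2, 3}, {1, 2, 3}, {4, 5}])
print(len(nodes), len(edges))
```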
Given a cluster set 𝒞 on X and a subset S of X, the result of removing all elements in X \ S from each cluster in 𝒞 is called the restriction of 𝒞 to S, denoted by 𝒞|S. If S (where |S| > 1) is compatible with every cluster C ∈ 𝒞 and 𝒞|S is also compatible, then we say that S is an ST-set (Strict Tree Set) with respect to 𝒞. If no other ST-set contains S, we say that S is maximal. For a maximal ST-set S, there is a subtree constructed from the set of clusters {C | C ∈ 𝒞, C ⊂ S} ∪ S. For each maximal ST-set S with respect to 𝒞, after collapsing it into a single taxon S, the resulting set is denoted as Collapse(𝒞). For example, when {1, 2} is the only maximal ST-set, Collapse(𝒞) = {{3, 4}, {{1, 2}, 3}}. Then, the taxa of Collapse(𝒞) are {{1, 2}, 3, 4}, denoted as X(Collapse(𝒞)). A set of clusters 𝒞 is called the simplest if it has no maximal ST-set with respect to 𝒞.
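The ST-set definitions can be checked on small inputs by brute force. The following Python sketch enumerates all subsets of X, so it is exponential and meant only as an illustration of the definitions, not as the procedure used by any DC algorithm; all function names are assumptions. The input cluster set {{3, 4}, {1, 2, 3}} is likewise a hypothetical choice that yields the collapsed set quoted above.

```python
from itertools import combinations


def compatible(c1, c2):
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def is_st_set(s, clusters):
    """S (|S| > 1) is compatible with every cluster and C|S is compatible."""
    s = frozenset(s)
    if len(s) < 2 or not all(compatible(s, c) for c in clusters):
        return False
    restricted = {c & s for c in clusters if c & s}
    return all(compatible(a, b) for a, b in combinations(restricted, 2))


def maximal_st_sets(clusters, taxa):
    taxa = list(taxa)
    st_sets = [frozenset(sub)
               for k in range(2, len(taxa) + 1)
               for sub in combinations(taxa, k)
               if is_st_set(sub, clusters)]
    return [s for s in st_sets if not any(s < t for t in st_sets)]


def collapse(clusters, taxa):
    """Replace each maximal ST-set by a single taxon (the frozenset itself)."""
    maximal = maximal_st_sets(clusters, taxa)
    collapsed = set()
    for c in clusters:
        if any(c < s for s in maximal):
            continue  # cluster lies strictly inside a collapsed subtree
        new = set(c)
        for s in maximal:
            if s <= c:
                new = (new - s) | {s}
        if len(new) > 1:  # drop clusters that became trivial
            collapsed.add(frozenset(new))
    return collapsed


clusters = [frozenset({3, 4}), frozenset({1, 2, 3})]
print(maximal_st_sets(clusters, {1, 2, 3, 4}))
print(collapse(clusters, {1, 2, 3, 4}))
```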
Let 𝒞 be a set of clusters on X and let N be a network representing 𝒞. Usually, a tree edge in N can represent more than one cluster in 𝒞, and a cluster in 𝒞 can be represented by more than one tree edge in N. A mapping ε is defined from 𝒞 to the set of tree edges of N, such that ε(C) is a tree edge of N that represents C for any cluster C ∈ 𝒞. A network N is decomposable with respect to 𝒞 if there exists a mapping ε: 𝒞 → E′ (E′ is the set of tree edges of N) such that, for any two clusters C_1, C_2 ∈ 𝒞, C_1 and C_2 lie in the same connected component of the incompatibility graph IG(𝒞) if and only if the two tree edges ε(C_1) and ε(C_2) are contained in the same biconnected component of N.
Then, we also say that the network N has the decomposition property. The decomposition property allows the network to be constructed by an appropriate divide-and-conquer (DC for short) strategy; that is, the strategy first constructs a subnetwork for each connected component of the incompatibility graph and then merges all subnetworks into a whole network. The constructed network is called a DC network, and the algorithms are called DC algorithms. The paper [23] has proven that DC networks satisfy the decomposition property.
Given a set of clusters 𝒞, the DC algorithms first compute the incompatibility graph IG(𝒞); then, for each connected component of IG(𝒞), they collapse each maximal ST-set into one taxon and compute a subnetwork for the resulting set; next, they "decollapse," that is, replace each leaf labelled by a maximal ST-set by a maximal subtree; finally, they integrate those subnetworks into a final network. The paper [25] has proven that there exists a DC network for any set of clusters 𝒞. Figure 2 shows the construction process of the DC algorithms for the set of clusters in Figure 1, in which constructing the subnetwork for each connected component (i.e., Step 2) is crucial.
The Cass, Lnetwork, and BIMLR algorithms are DC algorithms, which can construct networks with fewer reticulations than other algorithms. The networks constructed by BIMLR and Lnetwork contain fewer redundant clusters, that is, clusters other than the input clusters, than those of other available methods. When constructing phylogenetic networks, BIMLR and Lnetwork are faster than Cass, and the constructed networks are more stable; that is, the difference between the networks constructed for the same dataset under different input orders is smaller than for Cass. Figure 3 shows three networks constructed by Cass for the same dataset with different input orders, while BIMLR and Lnetwork construct only one network N_1 for the dataset regardless of input order [25].
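The "divide" step of the DC strategy, splitting the clusters by connected component of the incompatibility graph, might be sketched as follows. This is only an illustration of that one step under assumed names and a hypothetical input; the subnetwork construction and merging of Cass, Lnetwork, and BIMLR are not reproduced here.

```python
from itertools import combinations


def compatible(c1, c2):
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def incompatibility_components(clusters):
    """Group clusters by connected component of the incompatibility graph."""
    clusters = [frozenset(c) for c in clusters]
    neighbours = {c: set() for c in clusters}
    for a, b in combinations(clusters, 2):
        if not compatible(a, b):
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen, components = set(), []
    for start in clusters:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search from `start`
            c = stack.pop()
            if c in component:
                continue
            component.add(c)
            stack.extend(neighbours[c] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical input: two components with two clusters each, one isolated cluster.
parts = incompatibility_components([{1, 2}, {2, 3}, {4, 5}, {5, 6}, {1, 2, 3}])
print(sorted(len(p) for p in parts))
```

Only components with more than one cluster need a subnetwork with reticulations; an isolated cluster is compatible with all others and can be placed as an ordinary tree edge.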

Topologies of Incompatibility Graphs
Figure 2: The construction process of the DC algorithms (Steps 1 to 4).
(ii) the label of v is equal to the label of f(v) for any leaf v ∈ N_1.
Given two sets of clusters 𝒞_1 on X_1 and 𝒞_2 on X_2, let 𝒞′_1 and 𝒞′_2 be the results after collapsing all maximal ST-sets of 𝒞_1 and 𝒞_2, respectively, with 𝒞′_1 on X′_1 and 𝒞′_2 on X′_2.
Definition 2. 𝒞_1 and 𝒞_2 are isomorphic if and only if there is a bijection f from X′_1 to X′_2 such that x and y are in the same cluster C′_1 ∈ 𝒞′_1 if and only if f(x) and f(y) are in the same cluster C′_2 ∈ 𝒞′_2.
By Definition 2, we have that the isomorphism of the cluster sets is an equivalence relation; that is, it is reflexive, symmetric, and transitive.
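For small collapsed cluster sets, isomorphism can be tested by trying every bijection between the taxa. The Python sketch below reads Definition 2 as "the bijection maps the clusters of one set exactly onto the clusters of the other"; this reading, the function name, and the examples are assumptions. The search is factorial in the number of taxa and is meant only for illustration.

```python
from itertools import permutations


def find_isomorphism(cs1, cs2):
    """Return a bijection (dict) mapping the clusters of cs1 onto cs2, or None.

    Both inputs are expected to be already collapsed cluster sets.
    """
    cs1 = {frozenset(c) for c in cs1}
    cs2 = {frozenset(c) for c in cs2}
    x1 = list(frozenset().union(*cs1))
    x2 = list(frozenset().union(*cs2))
    if len(x1) != len(x2) or len(cs1) != len(cs2):
        return None
    for image in permutations(x2):
        f = dict(zip(x1, image))
        if {frozenset(f[t] for t in c) for c in cs1} == cs2:
            return f
    return None


f = find_isomorphism([{1, 2}, {2, 3}], [{"a", "b"}, {"b", "c"}])
print(f)
print(find_isomorphism([{1, 3}, {1, 2}, {1, 3, 4}], [{1, 2}, {2, 3}, {3, 4}]))
```

In the first call the shared taxon 2 must be sent to the shared taxon "b"; the second pair is not isomorphic because only one of the two sets contains a cluster with three taxa.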

Lemma 3. Given a DC network N representing the set of clusters 𝒞, any maximal ST-set with respect to 𝒞 is a maximal subtree in N.
Proof. This conclusion follows directly from the construction process of DC networks.
Proof of Lemma 4. There must exist a DC network N_1 for 𝒞_1. Given a tree edge e = (u, v), the subtree rooted at v in N_1 is a maximal subtree if and only if the set of taxa S is a maximal ST-set with respect to 𝒞_1, where the taxa in S are the labels of the leaves which are descendants of v. Replace each maximal subtree of N_1 by a node, and denote the resulting network as N′_1. Obviously, N′_1 represents the set of clusters 𝒞′_1. From Definition 2, there exists a bijection f from X′_1 to X′_2 such that x and y are in the same cluster C′_1 ∈ 𝒞′_1 if and only if f(x) and f(y) are in the same cluster C′_2 ∈ 𝒞′_2.
Then, we can obtain a network N′_2 from N′_1 by replacing each taxon x in X′_1 by f(x) in X′_2. Obviously, N′_2 represents 𝒞′_2. Finally, we replace each leaf of N′_2 labelled by a maximal ST-set with respect to 𝒞_2 by a maximal subtree, and the resulting network, denoted as N_2, represents 𝒞_2.
For two isomorphic sets of clusters 𝒞_1 and 𝒞_2, let N_1 be a DC network representing 𝒞_1. Lemma 4 tells us that there is a DC network N_2 representing 𝒞_2, which can be obtained from N_1.
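The relabelling step behind Lemma 4 can be tried out on a toy example. In the Python sketch below, the network `n0` is a hypothetical network with one reticulate node, chosen so that it represents {{1, 2}, {2, 3}} in the softwired sense; it is not taken from the figures. The edge-list encoding and the function names are also assumptions.

```python
from itertools import product


def softwired_clusters(edges):
    """All sets of taxa below a tree edge, over every way of keeping
    exactly one incoming edge per reticulate node (leaf sets included)."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    leaves = {v for _, v in edges} - {u for u, _ in edges}
    reticulate = [v for v, ps in parents.items() if len(ps) > 1]
    found = set()
    for choice in product(*(parents[v] for v in reticulate)):
        kept = dict(zip(reticulate, choice))
        children = {}
        for u, v in edges:
            if kept.get(v, u) == u:  # drop the unchosen reticulate edges
                children.setdefault(u, []).append(v)

        def taxa_below(node):
            if node in leaves:
                return frozenset([node])
            return frozenset().union(
                *(taxa_below(c) for c in children.get(node, [])))

        for u, v in edges:
            if len(parents[v]) == 1:  # (u, v) is a tree edge
                found.add(taxa_below(v))
    return found


def relabel(edges, f):
    """Replace every leaf label x by f(x); other nodes are left unchanged."""
    return [(f.get(u, u), f.get(v, v)) for u, v in edges]


# Hypothetical network for {{1, 2}, {2, 3}}; "_h" is the reticulate node.
n0 = [("_r", "_u"), ("_r", "_v"), ("_u", 1), ("_u", "_h"),
      ("_v", "_h"), ("_v", 3), ("_h", 2)]
n2 = relabel(n0, {1: "x", 2: "y", 3: "z"})
print(frozenset({"x", "y"}) in softwired_clusters(n2))
print(frozenset({"y", "z"}) in softwired_clusters(n2))
```

The internal node names start with an underscore only to keep them distinct from taxon labels in this sketch.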

Lemma 5. Let ℭ be the family of all sets of clusters 𝒞 for which IG(𝒞) is a biconnected component with two nodes. Then, any one element 𝒞 ∈ ℭ is isomorphic to 𝒞_0 = {{1, 2}, {2, 3}}.
Proof. Any one element 𝒞 ∈ ℭ has two incompatible clusters. Let 𝒞_1 = {C_11, C_12} and 𝒞_2 = {C_21, C_22} be two elements of ℭ, and for i = 1, 2 write A_i = C_i1 ∩ C_i2, A_i1 = C_i1 \ A_i, and A_i2 = C_i2 \ A_i. Each one of A_11, A_1, A_12, A_21, A_2, and A_22 is a maximal ST-set if it contains more than one taxon; then, we can collapse it into one taxon, which is also denoted by itself. Denote the sets of clusters after collapsing all maximal ST-sets as 𝒞′_1 and 𝒞′_2. Obviously, there is a bijection f from X′_1 = {A_11, A_1, A_12} to X′_2 = {A_21, A_2, A_22}, and any two taxa x, y ∈ X′_1 are in the same cluster in 𝒞′_1 if and only if f(x) and f(y) are in the same cluster in 𝒞′_2. Hence, 𝒞_1 and 𝒞_2 are isomorphic. Accordingly, any set of clusters 𝒞 ∈ ℭ is isomorphic to 𝒞_0 = {{1, 2}, {2, 3}} because 𝒞_0 ∈ ℭ.
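The argument of Lemma 5 can be followed on a concrete pair of clusters. The sketch below splits two incompatible clusters into the three parts used in the proof and maps them to the taxa 1, 2, 3; the function name and the example clusters are assumptions.

```python
def collapse_pair(c1, c2):
    """Collapse two incompatible clusters into the simplest set {{1, 2}, {2, 3}}.

    Returns the mapping from the three collapsed parts to the taxa 1, 2, 3
    together with the resulting simplest cluster set.
    """
    c1, c2 = frozenset(c1), frozenset(c2)
    left, shared, right = c1 - c2, c1 & c2, c2 - c1
    if not (left and shared and right):
        raise ValueError("the two clusters are compatible")
    f = {left: 1, shared: 2, right: 3}
    simplest = {frozenset({f[left], f[shared]}),
                frozenset({f[shared], f[right]})}
    return f, simplest


f, simplest = collapse_pair({"a", "b", "c"}, {"c", "d", "e"})
print(f)
print(simplest)
```

The incompatible taxa {"c"} are mapped to taxon 2, which is the only taxon shared by the two clusters of the simplest set.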
For a cluster set 𝒞, there may be several cluster sets isomorphic to 𝒞, but only one of them is the simplest set of clusters, denoted as 𝒞_0. Let N_0 be the DC network representing 𝒞_0. Then, we can obtain a DC network representing 𝒞 from N_0. Lemmas 4 and 5 show that there is a DC network for any set of clusters whose incompatibility graph is a biconnected component with two nodes, and that it is obtained from the network N_0 (see Figure 3) representing 𝒞_0.
Proof of Lemma 6. Figure 4 shows the topology of the linear biconnected component with three nodes. Each 𝒞_i is a simplest set of clusters whose incompatibility graph has the topology in Figure 4. Next, we prove that 𝒞_i (1 ≤ i ≤ 4) are all the simplest sets of clusters for the topology in Figure 4. Any such set of clusters has three clusters, denoted as C_1, C_2, and C_3. Let A be the incompatible taxa with respect to C_1 and C_2, and let B be the incompatible taxa with respect to C_2 and C_3; then A and B have the following cases:
(i) A = B. Since there is no edge between C_1 and C_3, C_1 and C_3 are compatible; that is, C_1 ∩ C_3 = ∅, or C_1 ⊆ C_3, or C_3 ⊆ C_1. Because A ⊆ C_1 and B ⊆ C_3, we have that C_1 ∩ C_3 ≠ ∅. Therefore, C_1 ⊆ C_3 or C_3 ⊆ C_1. Then, we have the simplest set of clusters 𝒞_1 = {{1, 3}, {1, 2}, {1, 3, 4}}, and any set of clusters in this case is isomorphic to 𝒞_1.
(iii) B ⊂ A. This case is similar to case (ii). The sets of clusters are in case (ii) if and only if they are in case (iii). Hence, any set of clusters in case (iii) is isomorphic to 𝒞_2.
(iv) A ∩ B = ∅. Then, C_1 ∩ C_3 = ∅. We have that |A| = 1 and |B| = 1 in the simplest set of clusters, since A or B can be collapsed if |A| ≥ 2 or |B| ≥ 2. Assume that C_1 = {A, S_1} and C_3 = {B, S_2}, where S_1 = C_1 \ A and S_2 = C_3 \ B. We have that |S_1| = 1 and |S_2| = 1 in the simplest set of clusters, since S_1 or S_2 can be collapsed if |S_1| ≥ 2 or |S_2| ≥ 2. Then, |C_1| = 2 and |C_3| = 2 in the simplest set of clusters.
(v) A ∩ B ≠ ∅, A ⊄ B, and B ⊄ A. Let A = {A_0, A_1} and B = {A_1, B_0}, where A_0, A_1, and B_0 are not empty. We have {A_0, A_1, B_0} ⊆ C_2, and C_1 ⊆ C_3 or C_3 ⊆ C_1. If C_1 ⊆ C_3, then A ⊆ C_1 ⊆ C_3; since A ⊆ C_2 as well, A ⊆ C_2 ∩ C_3 = B, which contradicts A ⊄ B. Similarly, we get a contradiction when C_3 ⊆ C_1. Thus, there exists no set of clusters in this case.
Figure 5 shows the DC networks for the simplest sets of clusters 𝒞_1, 𝒞_2, 𝒞_3, and 𝒞_4, respectively.
Figure 5: The DC networks for all simplest cluster sets whose incompatibility graphs are topologies in Figure 4.
Proof of Lemma 7. Figure 6 shows the topology of the nonlinear biconnected component with three nodes. Here, C_1, C_2, and C_3 are the clusters, and A, B, and D are the incompatible taxa corresponding to them (A with respect to C_1 and C_2, B with respect to C_2 and C_3, and D with respect to C_1 and C_3). All cases are as follows:
(i) A = B; then, A ⊂ D or A = D;
(ii) A ⊂ B; then, A ⊂ D, and B ∩ D = A;
(iii) A ∩ B = ∅; then, A ∩ D = ∅ and B ∩ D = ∅;
(iv) A ∩ B ≠ ∅, A ⊄ B, B ⊄ A; then, A ∩ D ≠ ∅ and B ∩ D ≠ ∅.
Any set of clusters in these cases is isomorphic to one of the simplest sets of clusters. Figure 7 shows the DC networks for the simplest sets of clusters 𝒞_i (1 ≤ i ≤ 12), respectively.
Lemmas 5, 6, and 7 compute all simplest sets of clusters whose incompatibility graphs are the biconnected components with two nodes or three nodes. Figures 5 and 7 show the DC networks constructed by the BIMLR algorithm for all these simplest sets of clusters; the DC network for a set of clusters 𝒞 can then be obtained from the DC network representing the simplest set of clusters which is isomorphic to 𝒞; that is, it does not need to be constructed again. This conclusion is very important for the construction of networks.
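The case analysis for the linear topology can be explored with a small Python sketch that classifies a chain of three clusters by the relation between the two sets of incompatible taxa. The function name and the textual case labels are assumptions; the labels describe the relation itself and do not fix which of the paper's case numbers it corresponds to.

```python
def compatible(c1, c2):
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1


def classify_linear(c1, c2, c3):
    """Classify a linear chain C1 - C2 - C3 by A = C1 & C2 and B = C2 & C3."""
    c1, c2, c3 = frozenset(c1), frozenset(c2), frozenset(c3)
    if compatible(c1, c2) or compatible(c2, c3) or not compatible(c1, c3):
        raise ValueError("the incompatibility graph is not the linear chain")
    a, b = c1 & c2, c2 & c3
    if a == b:
        return "A = B"
    if a < b:
        return "A is a proper subset of B"
    if b < a:
        return "B is a proper subset of A"
    if not (a & b):
        return "A and B are disjoint"
    return "A and B overlap without containment"


print(classify_linear({1, 3}, {1, 2}, {1, 3, 4}))
print(classify_linear({1, 2}, {2, 3}, {3, 4}))
```

The first call uses the simplest set {{1, 3}, {1, 2}, {1, 3, 4}} given in the proof; the second uses a hypothetical chain whose end clusters are disjoint. Consistent with the proof, the last label cannot be returned for a valid chain, because compatibility of the two end clusters forces A and B to be nested or disjoint.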

Conclusion
This paper computes all simplest sets of clusters for the topologies of incompatibility graphs with two nodes and three nodes. We can construct the DC networks for those simplest sets of clusters and save them. When constructing a DC network for any set of clusters 𝒞, an algorithm only needs to read the DC network N_0 of the simplest set of clusters isomorphic to 𝒞 and then compute the DC network for 𝒞 from N_0 by replacing the labels of the leaves in N_0 by the taxa in 𝒞, which saves the time otherwise spent constructing the network from scratch.
We will compute the simplest sets of clusters for more topologies of incompatibility graphs in the future.
Figure 7: The DC networks for all simplest cluster sets whose incompatibility graphs are topologies in Figure 6.
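One way to organise the saved networks is a table keyed by a canonical form of the simplest cluster set, so that isomorphic inputs produce the same key. The Python sketch below is a hypothetical design, not the storage scheme of any existing tool: the canonical form is found by brute force over all labelings, the stored network is an illustrative one for {{1, 2}, {2, 3}}, and inputs are assumed to be already collapsed.

```python
from itertools import permutations


def canonical_form(clusters):
    """Smallest relabeled cluster tuple over all labelings 1..n, with its labeling."""
    clusters = [frozenset(c) for c in clusters]
    taxa = list(frozenset().union(*clusters))
    best_key, best_map = None, None
    for labels in permutations(range(1, len(taxa) + 1)):
        f = dict(zip(taxa, labels))
        key = tuple(sorted(tuple(sorted(f[t] for t in c)) for c in clusters))
        if best_key is None or key < best_key:
            best_key, best_map = key, f
    return best_key, best_map


def relabel(edges, f):
    return [(f.get(u, u), f.get(v, v)) for u, v in edges]


STORE = {}  # canonical key -> network with canonically labelled leaves


def register(simplest_clusters, network_edges):
    key, f = canonical_form(simplest_clusters)
    STORE[key] = relabel(network_edges, f)


def lookup(clusters):
    """Return a stored network relabelled with the taxa of `clusters`, or None."""
    key, g = canonical_form(clusters)
    if key not in STORE:
        return None
    inverse = {label: taxon for taxon, label in g.items()}
    return relabel(STORE[key], inverse)


# Hypothetical stored network for {{1, 2}, {2, 3}}; "_h" is the reticulate node.
register([{1, 2}, {2, 3}],
         [("_r", "_u"), ("_r", "_v"), ("_u", 1), ("_u", "_h"),
          ("_v", "_h"), ("_v", 3), ("_h", 2)])
print(lookup([{"x", "y"}, {"y", "z"}]))
```

Because both the stored set and the query are mapped to the same canonical labeling, composing one labeling with the inverse of the other gives the bijection required by Definition 2.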