Copyright © 2009 by Cold Spring Harbor Laboratory Press

Breakpoint graphs and ancestral genome reconstructions

Department of Computer Science and Engineering, University of California at San Diego, La Jolla, California 92093-0404, USA

1Corresponding author. E-mail ppevzner/at/cs.ucsd.edu; fax (858) 534-7029.

Received June 30, 2008; Accepted January 22, 2009.

Abstract

Recently completed whole-genome sequencing projects marked the transition from gene-based phylogenetic studies to phylogenomics analysis of entire genomes. We developed an algorithm, MGRA, for reconstructing ancestral genomes and used it to study the rearrangement history of seven mammalian genomes: human, chimpanzee, macaque, mouse, rat, dog, and opossum. MGRA relies on the notion of the multiple breakpoint graphs to overcome some limitations of the existing approaches to ancestral genome reconstructions. MGRA also generates the rearrangement-based characters guiding the phylogenetic tree reconstruction when the phylogeny is unknown.

The first attempts to reconstruct the genomic architecture of ancestral mammals predated the era of genomic sequencing and were based on cytogenetics approaches (Wienberg and Stanyon 1997). The rearrangement-based phylogenomic studies were pioneered by Sankoff and colleagues (Sankoff et al. 1992; Blanchette et al. 1997; Sankoff and Blanchette 1998) and were based on analyzing the breakpoint distances. Moret et al. (2001) further optimized this approach and developed the popular GRAPPA software for rearrangement analysis. MGR, another genome rearrangement tool (Bourque and Pevzner 2002), uses genomic distances instead of breakpoint distances for ancestral reconstructions. Since genomic distances lead to more accurate ancestral reconstructions (Moret et al. 2002; Tang and Moret 2003), GRAPPA has been modified for genomic distances as well. While MGR has been used in several phylogenomic studies (Bourque et al. 2005; Murphy et al. 2005; Bulazel et al. 2007; Pontius et al.
2007; Xia et al. 2007; Cardone et al. 2008; Deuve et al. 2008), both MGR and GRAPPA have limited ability to distinguish reliable from unreliable rearrangements and to address the “weak associations” problem in ancestral reconstructions (Bourque et al. 2004, 2005, 2006; Froenicke et al. 2006). Recently, Ma et al. (2006) made an important step toward reliable reconstruction of the ancestral genomes. In contrast to MGR and GRAPPA (which analyze both reliable and unreliable rearrangements), they have chosen to focus on the reliable breakpoint reconstruction in the ancestral genomes and to avoid assignments in the case of weak associations (complex breakpoints). This proved to be a valuable approach since, as it turned out, most breakpoints in the ancestral mammalian genomes can be reliably reconstructed. However, there are some limitations (discussed in Rocchi et al. 2006) that this approach has to overcome to scale for large sets of genomes. First, while the Ma et al. (2006) inferCARs algorithm assumes that the phylogeny is known, it remains a subject of enduring debates even in the case of the primate–rodent–carnivore split (which is assumed to be resolved in Ma et al. 2006). With the increase in the number of species, the reliability of the phylogeny will become even a bigger concern, thus raising the question of devising an approach that does not assume a fixed phylogeny but instead uses rearrangements as new characters for constructing phylogenetic trees (see Chaisson et al. 2006). While MGR does not assume a fixed phylogeny, its heuristically derived weak associations are less reliable. The challenge then is to integrate the reliability of inferCARs with the flexibility of MGR. Another avenue to improve inferCARs algorithms is to find out how to deal with complex breakpoints that create gaps in reconstructions. Note that the Ma et al. 
(2006) approach focuses on the reliable ancestor reconstruction rather than on the specific rearrangements that happened in the course of evolution. These are related but different problems that both can benefit from incorporating into a single computational framework. Indeed, Ma et al. (2006) consider individual breakpoints and do not distinguish between particular types of rearrangements that generated a breakpoint of interest. In reality, the reversals and translocations operate on pairs of dependent breakpoints rather than individual breakpoints. Some rearrangements (and synteny associations) cannot be inferred from the analysis of single breakpoints but become tractable via analyzing the breakpoint graph.2 As a result, while MGR constructs provably optimal scenarios in the absence of breakpoint reuse, it is not clear whether the same result holds for inferCARs. Recently, Zhao and Bourque (2007, 2009) developed the EMRAE algorithm, which reconstructs both reliable rearrangements and ancestors, thus addressing the shortcomings of both MGR (difficulty in distinguishing between reliable and putative rearrangement events) and inferCARs (ancestor reconstruction only). However, EMRAE (in contrast to MGR) does not attempt to reconstruct the phylogenetic tree and is limited to unichromosomal genomes. Below we address some limitations of MGR, EMRAE, and inferCARs by developing the Multiple Genome Rearrangements and Ancestors (MGRA) algorithm (available from http://www.cs.ucsd.edu/users/ppevzner/software.html). In particular: (1) MGRA constructs provably optimal scenarios even when there is some breakpoint reuse and when other tools do not guarantee optimality. (2) MGRA is suitable for ancestral reconstructions of multi-chromosomal genomes (in contrast to EMRAE). (3) MGRA is conceptually simpler and orders of magnitude faster than MGR. (4) MGRA is not limited to reconstructing ancestral genomes in the case of known phylogeny (like inferCARs and EMRAE). 
Instead, it can guide the rearrangement-based reconstruction of phylogenetic trees. (5) MGRA does not require prior information about the approximate lengths of the branches of the phylogenetic trees (in contrast to inferCARs). To evaluate the performance of MGRA, we compared ancestral reconstructions generated by MGRA and inferCARs. Despite the fact that MGRA and inferCARs are very different algorithms, their reconstructions turned out to be remarkably similar (98.5% of synteny associations are identical). We further analyzed some differences between MGRA, inferCARs, and the cytogenetics approach. Methods From pairwise to multiple breakpoint graphs We start with analysis of rearrangements in circular genomes (i.e., genomes consisting of circular chromosomes) and later extend it to genomes with linear chromosomes. We assume that each genome is formed by the same set of synteny blocks, which are arranged differently in different genomes. We will find it convenient to represent a chromosome formed by synteny blocks b1,…, bn as a cycle with n directed labeled edges (corresponding to blocks) alternating with n undirected unlabeled edges (connecting adjacent blocks). The directions of the edges correspond to signs (strand) of the blocks. We label the tail and head of a directed edge bi as bit and bih, respectively (Fig. 1
Let P be a genome represented as a collection of alternating black–obverse cycles (a cycle is alternating if the colors of its edges alternate). For any two black edges (x1, x2) and (y1, y2) in the genome (graph) P, we define a 2-break rearrangement (first introduced as a DCJ rearrangement in Yancopoulos et al. 2005 and recently studied in Bergeron et al. 2006 and Lin and Moret 2008) as the “replacement of these edges with either a pair of edges (x1, y1), (x2, y2), or a pair of edges (x1, y2), (x2, y1)” (Fig. 2A,B).
Let P and Q be genomes on the same set of blocks B. The (pairwise) breakpoint graph G(P, Q) is simply the superposition of genomes (graphs) P and Q (Fig. 1C). It is a graph on the vertex set V = {bt, bh | b ∈ B} with edges of three colors: obverse (connecting vertices bt and bh), black (connecting adjacent blocks in P), and green (connecting adjacent blocks in Q). The black and green edges form the black–green alternating cycles that play an important role in analyzing rearrangements (Bafna and Pevzner 1996). From now on we will ignore the obverse edges in the breakpoint graph so that it becomes simply a collection of (black–green) cycles (Fig. 1).

The 2-break distance d2(P, Q) between genomes P and Q is defined as “the minimum number of 2-breaks required to transform one genome into the other.” In contrast to the Genomic Distance Problem (Hannenhalli and Pevzner 1995; Tesler 2002a; Ozery-Flato and Shamir 2003) (for linear multi-chromosomal genomes), the 2-Break Distance Problem for circular multi-chromosomal genomes has a trivial solution (Yancopoulos et al. 2005; Alekseyev and Pevzner 2007): d2(P, Q) = b(P, Q) − c(P, Q), where b(P, Q) = |B| is the number of synteny blocks in P and Q, and c(P, Q) is the number of black–green cycles in G(P, Q).

A linear genome is a collection of linear chromosomes represented as sequences of signed synteny blocks. Each linear chromosome on n blocks is represented as a path of n directed obverse edges (encoding blocks and their direction) alternating with (n − 1) undirected black edges (connecting adjacent blocks). In addition, we introduce an extra vertex ∞ and connect it by an undirected (irregular) black edge with every vertex representing a chromosomal end (hence, the degree of vertex ∞ is twice the number of linear chromosomes). A “linear chromosome” is then an alternating path of black and obverse edges, starting and ending at the vertex ∞, and a “linear genome” is a collection of such paths.
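The formula d2(P, Q) = |B| − c(P, Q) for circular genomes can be illustrated with a minimal Python sketch. The representation is an assumption made for illustration only (it is not the MGRA implementation): a genome is a list of circular chromosomes, each a list of signed integers, and the function names `adjacencies` and `two_break_distance` are hypothetical.

```python
def adjacencies(genome):
    """Return the adjacency matching of a circular genome.

    `genome` is a list of circular chromosomes, each a list of signed
    integers (synteny blocks). Block ends are (block, "t") and (block, "h").
    """
    match = {}
    for chromosome in genome:
        n = len(chromosome)
        for i in range(n):
            x, y = chromosome[i], chromosome[(i + 1) % n]
            # We leave x through its head if x is positive, else its tail,
            # and enter y through its tail if y is positive, else its head.
            leaving = (abs(x), "h" if x > 0 else "t")
            entering = (abs(y), "t" if y > 0 else "h")
            match[leaving] = entering
            match[entering] = leaving
    return match


def two_break_distance(p, q):
    """Compute d2(P, Q) = |B| - c(P, Q) for circular genomes P and Q."""
    black, green = adjacencies(p), adjacencies(q)
    if black.keys() != green.keys():
        raise ValueError("genomes must share the same set of synteny blocks")
    seen, cycles = set(), 0
    for start in black:
        if start in seen:
            continue
        cycles += 1
        v = start
        # Walk one black-green alternating cycle, marking its vertices.
        while v not in seen:
            w = black[v]
            seen.update((v, w))
            v = green[w]
    return len(black) // 2 - cycles


p = [[1, 2, 3]]
q = [[1, -2, 3]]   # q differs from p by a single reversal of block 2
print(two_break_distance(p, q))  # prints 1
```

In this example the breakpoint graph has one trivial cycle and one cycle of length 4, so c = 2 and the distance is 3 − 2 = 1. Linear chromosomes and the vertex ∞ are deliberately left out of the sketch.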
The 2-breaks involving irregular edges model the rearrangements affecting the chromosome ends (Fig. 2C,D). Analyzing reversals, translocations, fusions, and fissions in linear genomes poses additional algorithmic challenges as compared to analyzing 2-breaks in circular genomes. However, rearrangement scenarios in linear genomes are well approximated by 2-break scenarios in circular genomes (Alekseyev 2008). Hence, we use 2-breaks as a single substitute for reversals, translocations, fusions, and fissions, admitting that 2-breaks may violate linearity of the genomes by creating circular chromosomes.

While previous rearrangement studies (e.g., MGR) were limited to analyzing the pairwise breakpoint graphs, MGRA uses multiple breakpoint graphs (Caprara 1999b), which simplify the rearrangement analysis. Let P1,…, Pk be genomes on the same set of synteny blocks B. Similarly to the pairwise breakpoint graph, the (multiple) breakpoint graph G(P1,…, Pk) is simply the superposition of genomes (graphs) P1,…, Pk on the same vertex set V = {bt, bh | b ∈ B} ∪ {∞} (Fig. 3A,B).
A vertex in the breakpoint graph is regular if it is different from ∞. Similarly, an edge is regular if both its endpoints are regular, and irregular otherwise. The edges of G(P1,…, Pk) are represented by undirected edges from the genomes P1,…, Pk of k different colors (hence, the degree of each regular vertex is k). To simplify the notation, we will use P1,…, Pk also to refer to the colors of edges in the multiple breakpoint graph, and denote the set of all colors C = {P1,…, Pk}. Furthermore, any non-empty subset of C is called a multi-color. All edges connecting vertices x and y in the (multiple) breakpoint graph form the multi-edge (x, y) of the multi-color represented by the colors of these edges [e.g., the multi-edge (eh, fh) in Fig. 3B A breakpoint in the multiple breakpoint graph G(P1, P2,…, Pk) is a vertex of the multi-degree >1. A multiple breakpoint graph without breakpoints is an identity breakpoint graph G(X,…,X) of some genome X. Alternatively, the identity breakpoint graph can be characterized as a breakpoint graph consisting of complete multi-edges (i.e., multi-edges of the multi-color C) that correspond to the synteny blocks adjacencies in X. Multiple genome rearrangement problem The key observation in studies of pairwise genome rearrangements is that every 2-break transformation of a “black” genome P into a “green” genome Q corresponds to a transformation of the breakpoint graph G (P, Q) into the identity breakpoint graph G (Q, Q) (Supplemental Fig. S21) with 2-breaks on pairs of black edges (black 2-breaks). MGR (Bourque and Pevzner 2002) implicitly applies a similar observation and attempts to come up with rearrangements that bring the multiple breakpoint graph G (P1, P2,…, Pk) closer to the identity multiple breakpoint graph G(Pi, Pi,…, Pi) for i varying from 1 to k. However, this approach does not allow one to use the internal edges of the phylogenetic tree for finding reliable rearrangements. 
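The notions of multi-edge, multi-color, and breakpoint defined above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: genomes are circular (so the vertex ∞ does not appear), each genome is a list of chromosomes given as lists of signed integers, and the names `adjacency_edges`, `multi_edges`, and `breakpoints` are hypothetical helpers rather than part of MGRA.

```python
from collections import defaultdict


def adjacency_edges(genome):
    """Yield adjacency edges (pairs of block ends) of a circular genome."""
    for chromosome in genome:
        n = len(chromosome)
        for i in range(n):
            x, y = chromosome[i], chromosome[(i + 1) % n]
            leaving = (abs(x), "h" if x > 0 else "t")
            entering = (abs(y), "t" if y > 0 else "h")
            yield frozenset((leaving, entering))


def multi_edges(genomes):
    """Map each pair of adjacent vertices to its multi-color (a frozenset)."""
    colors = defaultdict(set)
    for color, genome in genomes.items():
        for edge in adjacency_edges(genome):
            colors[edge].add(color)
    return {edge: frozenset(c) for edge, c in colors.items()}


def breakpoints(edges):
    """Return the vertices of multi-degree > 1 (distinct neighbors)."""
    neighbors = defaultdict(set)
    for edge in edges:
        x, y = tuple(edge)
        neighbors[x].add(y)
        neighbors[y].add(x)
    return {v for v, nb in neighbors.items() if len(nb) > 1}


genomes = {"P": [[1, 2, 3]], "Q": [[1, -2, 3]], "R": [[1, 2, 3]]}
graph = multi_edges(genomes)
print(sorted(breakpoints(graph)))
```

Here the multi-edge between 3h and 1t is complete (multi-color {P, Q, R}), while the four ends touched by the reversal in Q are breakpoints. When all genomes are identical, `breakpoints` returns the empty set, which matches the description of an identity breakpoint graph.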
Below we formalize the Multiple Genome Rearrangement Problem in terms of multiple breakpoint graphs. The key element of MGRA is finding a shortest transformation of the multiple breakpoint graph G(P1, P2,…, Pk) into an arbitrary identity multiple breakpoint graph G(X, X,…, X) for some a priori unknown genome X.

We first illustrate this concept with pairwise breakpoint graphs. Let G(P1, P2) → G(X, X) be an m-step transformation of G(P1, P2) into G(X, X) by either black or green 2-breaks (in contrast to the standard breakpoint graph analysis based on black 2-breaks only).5 It is easy to see that every such transformation corresponds to a transformation P1 → X → P2 that uses m black 2-breaks. Therefore, instead of searching for a shortest transformation G(P1, P2) → G(P2, P2), one can search for a shortest transformation of G(P1, P2) into any identity breakpoint graph G(X, X) without knowing X in advance.

In the case of k ≥ 2 genomes P1, P2,…, Pk, 2-breaks can be applied to multi-edges in the multiple breakpoint graph G(P1, P2,…, Pk) of as many as (2^k − 2) different multi-colors formed by proper subsets of C. However, not every series of such 2-breaks makes sense in terms of ancestral genome reconstructions. A basic property of ancestral genome reconstructions is that 2-breaks on multi-edges of multi-color Q ⊂ C can be applied only when all genomes corresponding to colors in Q are merged into a single genome. We give an alternative definition of this property as follows: A transformation (series of 2-breaks) S of the multiple breakpoint graph G(P1, P2,…, Pk) is “strict” if for any two 2-breaks ρ1 and ρ2 in S operating on multi-edges of multi-colors Q1 and Q2 with Q1 ⊂ Q2, ρ1 precedes ρ2 in S.

The Multiple Genome Rearrangement Problem (MGRP) is reformulated as follows: Given genomes P1,…, Pk, find a shortest strict series of 2-breaks that transforms the breakpoint graph G(P1,…, Pk) into an identity breakpoint graph.

Let T be an (unrooted) phylogenetic tree of the genomes P1,…, Pk (Fig.
3A). Removing a branch from T breaks it into two subtrees, each of which is induced by the set of its own leaves. A multi-color consisting of all colors (leaves) of either of these induced subtrees is called “T-consistent.” Let G be the set of all T-consistent multi-colors. Note that if a multi-color Q is T-consistent, then its complement Q̄ = C \ Q is also T-consistent. Therefore, there is a one-to-one correspondence between the pairs of complementary T-consistent multi-colors and the branches of T (Fig. 5).
When a phylogenetic tree is given, MGRA addresses a restricted version of MGRP where 2-breaks are applied only to multi-colors consistent with the phylogenetic tree. The Tree-Consistent Multiple Genome Rearrangement Problem (TCMGRP) is as follows: Given genomes P1,…, Pk at the leaves of a phylogenetic tree T, find a shortest strict series of T-consistent 2-breaks, transforming the breakpoint graph G(P1,…, Pk) into an identity breakpoint graph. Note that MGRP and TCMGRP problems in the case of three unichromosomal genomes correspond to the median problem that is NP-complete (Caprara 1999a; Tannier et al. 2008). While existence of exact polynomial algorithms for solving MGRP and TCMGRP is unlikely, we describe a heuristic approach to “eliminating” breakpoints in G(P1,…, Pk) that uses reliable rearrangements. In particular, MGRA optimally solves these problems in case of semi-independent rearrangement scenarios with some breakpoint reuses (see below). We will find it convenient to fix a branch χ of the tree T and assume that this branch contains a root X (viewed as yet another node), the precise location of which is to be determined later. The choice of X defines directions “toward” X on all branches of the tree T (Fig. 5 -consistent.” Alternatively, -consistent multi-colors can be defined as T-consistent multi-colors whose induced subtrees do not contain χ. Note that exactly one of the multi-colors in each pair of complementary T-consistent multi-colors is -consistent and it labels the starting node of the corresponding directed branch in T (except for the multi-colors corresponding to the branch χ that both are -consistent).MGRA transforms the genomes P1,…, Pk into X along the directed branches of T, using 2-breaks on -consistent multi-colors ( -consistent 2-breaks). 
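The T-consistent multi-colors to which TCMGRP restricts its 2-breaks can be enumerated directly from the tree: each branch, when removed, splits the leaves into two complementary sets. The sketch below is a minimal illustration in Python. The tree is encoded as a list of branches, its topology follows the three internal partitions named in this article (MR + DQHC, MRD + QHC, HC + MRDQ), and the internal node names `a`, `b`, `c`, `d` as well as the function name are hypothetical.

```python
def t_consistent_multicolors(edges):
    """For each branch (u, v) of an unrooted tree, return the pair of
    complementary leaf sets obtained by removing that branch."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    leaves = frozenset(n for n, nb in neighbors.items() if len(nb) == 1)
    result = {}
    for u, v in edges:
        # Collect the leaves reachable from u without crossing (u, v).
        stack, seen, side = [u], {u, v}, set()
        while stack:
            node = stack.pop()
            if node in leaves:
                side.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        side = frozenset(side)
        result[(u, v)] = (side, leaves - side)
    return result


# Six-genome tree; internal node names a, b, c, d are placeholders.
tree = [("M", "a"), ("R", "a"), ("a", "b"), ("D", "b"), ("b", "c"),
        ("Q", "c"), ("c", "d"), ("H", "d"), ("C", "d")]
partitions = t_consistent_multicolors(tree)
for branch, (q, q_bar) in partitions.items():
    print(branch, "".join(sorted(q)), "+", "".join(sorted(q_bar)))
```

Each of the nine branches yields one pair of complementary multi-colors, which reflects the one-to-one correspondence between branches and complementary pairs stated earlier.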
In terms of breakpoint graphs, MGRA eliminates breakpoints in G(P1, P2,…, Pk) with -consistent 2-breaks and transforms it into the identity breakpoint graph G(X,…, X).6 This transformation defines a reverse transformation of the genome X into the genomes P1,…, Pk by -consistent 2-breaks (such as in Fig. 3CWhile initial steps in transformation of the breakpoint graph G(P1,…, Pk) into an identity breakpoint graph usually correspond to reliable rearrangements, sooner or later one needs to use less reliable heuristic arguments in order to complete the transformation. However, sometimes it is preferable to stop after reaching a certain level of reliability even if the transformation is not complete (and the TCMGRP problem is not solved). In this case, we stop short of reconstructing the ancestral genomes since the transformation has not resulted in an identity breakpoint graph. In Supplement C, we describe an alternative method (not requiring solution of the TCMGRP problem) for reliable reconstruction of (parts of) ancestral genomes (similar to CARs from Ma et al. 2006) at internal nodes of the phylogenetic tree. Results MGRA algorithm Supplement A introduces the notion of independent (no breakpoint reuses), semi-independent (breakpoint reuses may occur only within single branches of the phylogenetic tree), and weakly independent (breakpoint reuses are limited to adjacent branches of the phylogenetic tree) rearrangements. MGRA optimally solves the MGRP problem in case of semi-independent 2-breaks and uses heuristics to move beyond the semi-independent assumption. Below we show that most 2-breaks in mammalian evolution are either independent, semi-independent, or weakly independent, resulting in reliable ancestral reconstructions. 
Cycles and paths in the breakpoint graph Visual inspection of a rather complex breakpoint graph in Figure 4 We note that the immediate result of a 2-break performed along a branch Q + in the phylogenetic tree T is a cycle of four multi-edges whose multi-colors alternate between Q and . All vertices in this cycle have multi-degree 2 and represent breakpoints that were not “reused.” Even if one of these multi-edges is used in later rearrangements, the remaining three multi-edges still form an alternating path that serves as a footprint of the 2-break. This observation motivates a search for alternating paths and cycles in the breakpoint graphs. We introduce the following definitions to analyze such cycles/paths.We define a simple vertex as a regular vertex of multi-degree 2 and a simple multi-edge as a multi-edge connecting two simple vertices. Simple multi-edges form simple cycles/paths in the breakpoint graphs, that is, cycles/paths in which multi-colors of consecutive multi-edges alternate between Q and . Simple multi-edges/paths/cycles are called “good” if their multi-colors are T-consistent.
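The definitions of simple vertices, simple multi-edges, and good multi-edges translate into a short classification routine. The Python sketch below is illustrative only: the breakpoint graph is given as a hypothetical dictionary from vertex pairs to multi-colors, the string `"inf"` stands in for the vertex ∞, and the set of T-consistent multi-colors is written out by hand for an assumed four-leaf tree ((A,B),(C,D)).

```python
INF = "inf"  # stand-in for the irregular vertex ∞


def classify_multi_edges(multi_edges, t_consistent):
    """Return (simple vertices, simple multi-edges, good multi-edges)."""
    neighbors = {}
    for x, y in multi_edges:
        neighbors.setdefault(x, set()).add(y)
        neighbors.setdefault(y, set()).add(x)
    # A simple vertex is a regular vertex of multi-degree 2.
    simple_vertices = {v for v, nb in neighbors.items()
                       if v != INF and len(nb) == 2}
    # A simple multi-edge connects two simple vertices.
    simple = {e for e in multi_edges
              if e[0] in simple_vertices and e[1] in simple_vertices}
    # A simple multi-edge is good if its multi-color is T-consistent.
    good = {e for e in simple if multi_edges[e] in t_consistent}
    return simple_vertices, simple, good


# T-consistent multi-colors of the assumed tree ((A,B),(C,D)).
t_consistent = {frozenset(s) for s in
                ("A", "B", "C", "D", "AB", "CD", "BCD", "ACD", "ABD", "ABC")}

example = {
    # A cycle alternating AB and CD: simple and good.
    ("x1", "x2"): frozenset("AB"), ("x2", "x3"): frozenset("CD"),
    ("x3", "x4"): frozenset("AB"), ("x4", "x1"): frozenset("CD"),
    # A cycle alternating AC and BD: simple but not good.
    ("y1", "y2"): frozenset("AC"), ("y2", "y3"): frozenset("BD"),
    ("y3", "y4"): frozenset("AC"), ("y4", "y1"): frozenset("BD"),
    # Multi-edges touching the irregular vertex are never simple.
    ("w", INF): frozenset("AB"), ("w", "v"): frozenset("CD"),
    ("v", INF): frozenset("AB"),
}
simple_vertices, simple, good = classify_multi_edges(example, t_consistent)
print(len(simple_vertices), len(simple), len(good))
```

Counting good multi-edges per pair of complementary multi-colors in this way gives the kind of per-branch support statistics that the article reports in Table 1A; the toy data here are not taken from that table.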
Table 1A describes the statistics of the breakpoint graph and illustrates how rearrangement analysis contributes to construction of phylogenetic trees. Indeed, all three internal branches (correct tree partitions) are supported by large numbers of good paths/cycles and good multi-edges (86 and 305 for MR + DQHC, 37 and 111 for MRD + QHC, 30 and 87 for HC + MRDQ). Each of the 32 incorrect partitions (only eight of them are shown in Table 1A) has at most one simple path/cycle and at most six simple multi-edges, an order of magnitude fewer than the non-trivial correct partitions. This observation illustrates that reconstruction of the correct tree topology is a simple exercise in this case (see Chaisson et al. 2006). This and other statistics produced by MGRA (see below) may be used to determine the phylogenetic tree rather than to assume that it is given. In contrast to Cannarozzi et al. (2007), MGRA provides a large number of certificates supporting the tree topology in Figure 5.
MGRA Stage 1: Processing good cycles and paths

Alternating cycles represent well-studied objects in the case of the pairwise breakpoint graphs. Every such cycle of length 2m is formed by (m − 1) 2-breaks (Alekseyev and Pevzner 2008) in each most parsimonious scenario.7 Therefore, there is little difference between alternating cycles in the pairwise breakpoint graphs and good cycles in the multiple breakpoint graphs: indeed, the good cycles with alternating multi-colors Q and Q̄ in the breakpoint graph model the rearrangements separating the sets of the genomes Q and Q̄ exactly in the same way as in the pairwise genome comparison. We therefore argue that such alternating cycles (and the corresponding rearrangements) can be reliably assigned to the branch Q + Q̄
in the phylogenetic tree T. This operation generalizes the notion of “good rearrangements” in MGR by extending them from cycles alternating multi-colors Pi and = {P1,…, Pi−1, Pi+1,…, Pk} to cycles alternating any complementary T-consistent multi-colors. While MGR attempts to find rearrangements bringing Pi closer to all genomes from (i.e., rearrangements on the leaf branches of the phylogenetic tree), MGRA processes reliable rearrangements on all (both leaf and internal) branches of the phylogenetic tree (c.f. Zhao and Bourque 2007).Similarly, good paths can be also assigned to branches of the phylogenetic tree by transforming them into good cycles first. Consider a good path x1, x2,…, xm consisting of (m − 1) multi-edges with T-consistent multi-colors alternating between a multi-color Q of the multi-edge (x1, x2) and its complement . We extend this path by vertices x0 and xm + 1 incident to its first and last vertices, respectively, resulting in the path p = (x0, x1, x2,…, xm + 1). If the first and the last multi-edges in this path have the same -consistent multi-color, we perform a 2-break over the multi-edges (x0, x1) and (xm, xm + 1) to transform p into a good cycle c = (x1, x2,…, xm) and a multi-edge (x0, xm + 1) (Fig. 6A,C -consistent multi-color, we remove it/them to obtain a path flanked by a -consistent multi-color that is processed (if it is longer than one edge) as above. Note that processing good cycles/paths in the breakpoint graph can create new good cycles/paths. We therefore process the good cycles/paths in an iterative fashion until no more good cycles/paths remain.9
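The path-closing step described above (a 2-break over the flanking multi-edges (x0, x1) and (xm, xm+1) that turns the extended path into a cycle plus the multi-edge (x0, xm+1)) can be sketched as follows. This is a hedged illustration, not the MGRA code: the graph is stored as one matching per color (a dictionary from vertex to neighbor), the helper names `two_break` and `matching` are hypothetical, and the two-color example is only a fragment of a breakpoint graph.

```python
def two_break(matchings, multicolor, edge1, edge2, cross=False):
    """Apply a 2-break to two multi-edges in every color of `multicolor`.

    Edges (a, b) and (c, d) are replaced by (a, c), (b, d), or by
    (a, d), (b, c) when `cross` is True.
    """
    (a, b), (c, d) = edge1, edge2
    # Both edges must be present in every color of the multi-color.
    for color in multicolor:
        m = matchings[color]
        if m.get(a) != b or m.get(c) != d:
            raise ValueError(f"edges not both present in color {color!r}")
    new_pairs = [(a, d), (b, c)] if cross else [(a, c), (b, d)]
    for color in multicolor:
        m = matchings[color]
        for u, v in new_pairs:
            m[u], m[v] = v, u


def matching(*edges):
    """Build a symmetric vertex -> neighbor dictionary from edges."""
    m = {}
    for u, v in edges:
        m[u], m[v] = v, u
    return m


# Extended path x0 -A- x1 -B- x2 -A- x3 -B- x4 -A- x5 (fragment only).
graph = {
    "A": matching(("x0", "x1"), ("x2", "x3"), ("x4", "x5")),
    "B": matching(("x1", "x2"), ("x3", "x4")),
}
# Close the path x1..x4 into a cycle with a 2-break on color A.
two_break(graph, {"A"}, ("x0", "x1"), ("x5", "x4"))
print(graph["A"]["x1"], graph["A"]["x0"])  # prints: x4 x5
```

After the operation, x1, x2, x3, x4 form a cycle whose edges alternate between colors B and A, and the leftover edge (x0, x5) carries color A. Passing a multi-color with several colors applies the same replacement in each of them, which is how a 2-break on a multi-edge is modeled in this sketch.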
Figure 7
The results of MGRA Stage 1 already reveal valuable insights about the ancestral genomes (even without MGRA Stage 2). To simplify the analysis of the Boreoeutherian ancestral reconstruction10 by MGRA Stage 1, we restrict the set of genomes to single representatives of rodents (mouse), carnivores (dog), and primates (macaque). The resulting breakpoint graph (with obverse edges shown) reveals many long unicolored paths formed by alternating obverse edges and complete multi-edges (Supplemental Fig. S10). Such paths represent parts of different human chromosomes in the reconstructed ancestor genome. We compress every such path into a single rectangular vertex as shown in Supplemental Figure S11 (top panel), resulting in a rather small graph. We further show the chromosomal associations present in this graph in Supplemental Figure S12. We emphasize that MGRA Stage 1 reveals some subtle but reliable adjacencies that other ancestral reconstruction algorithms may miss. In particular, it reveals two adjacencies that are absent in any of the extant genomes and many adjacencies that are present in only one of the extant genomes. The compressed breakpoint graph reveals only 5 complete multi-edges connecting vertices of different colors: 12 + 22, 12 + 22, 3 + 21, 4 + 8, and 14 + 15. These are exactly the same five adjacencies 12a + 22a, 12b + 22b, 3 + 21, 4a + 8p, 14 + 15 revealed in Ma et al. (2006). It also reveals the CARs corresponding to the human chromosomes 2, 2, 5, 6, 7, 8, 9, 10, 10, 11, 17, 18, and X (represented as isolated boxes in Supplemental Fig. S12), exactly the same as the ancestral chromosomes revealed by previous cytogenetics analysis (Froenicke et al. 2006) (2q, 2pq, 5, 6, 7a, 8q, 9, 10q, 11, 17, 18, X) with a single exception: The second segment from chromosome 10 is identified as an isolated chromosome by us and is tentatively assigned as 10p + 12a + 22a by Froenicke et al. (2006). However, Froenicke et al. 
(2006) acknowledged that the association of 10p and 12a is only weakly supported (indicated by a question mark in Froenicke et al. 2006).11 Our analysis also rules out the associations 1 + 22, 5 + 19, 2 + 18, 1 + 10, and 20 + 2 suggested in Murphy et al. (2005) as weak associations and later criticized by Froenicke et al. (2006) as unreliable. Supplement D further focuses on the connected component of the breakpoint graph representing the human chromosomes 7, 16, and 19, where the cytogenetics approach disagrees with Ma et al. (2006). MGRA Stage 2: Processing fair cycles and paths
Figure 7 and Q2 + ) and reveals that (1) most composite multi-edges are fair and (2) while some types of composite multi-edges are common [e.g., (M+, R+), (M+, MR+), (R+, MR+), (MR+, D+), (D+, QHC+), (MR+, QHC+)], others [e.g., (Q+, R+)] are either rare or absent. Table 2 illustrates the extremely biased statistics of composite multi-edges: The branches Q1 + and Q2 + corresponding to the multi-colors Q1 and Q2 of a composite multi-edge are likely adjacent in the phylogenetic tree (compare to the weakly independent rearrangements). Table 2 provides yet another illustration of utility of MGRA for deriving phylogenetic trees. Indeed, it reveals valuable information about the topology of the phylogenetic tree (incident edges) that can be combined with information (valid partitions) in Table 1A,B to infer the trees.
Every fair multi-edge (x, y) can be transformed into a good multi-edge by a 2-break (fair 2-break) either on multi-edges (x, x1) and (y, y1) (of multi-color Q1) or on multi-edges (x, x2) and (y, y2) (of multi-color Q2) (Fig. 6 , while in the latter case, it is transformed into a good multi-edge of color . The resulting good paths (formed by fair 2-breaks) can be further processed as described in MGRA Stage 1. An important observation is that the final result of processing a fair multi-edge does not depend on whether we start with a 2-break on Q1 or Q2 multi-color (see Fig. 6MGRA Stage 2 detects fair paths/cycles, transforms them into good paths/cycles by fair 2-breaks, and further processes the resulting good paths/cycles as in MGRA Stage 1. In some cases, fair paths in Stage 2 should be chosen with caution since the choice of fair paths may influence ancestral reconstructions in some nodes (see Supplement E). Figure 7 Reconstructing ancestral genomes After removing vertex ∞, the breakpoint graph (after MGRA Stage 2) consists of only nine connected components (Fig. 7 The simplest way to deal with such short blocks is to simply remove them from the set of input synteny blocks (Supplement J). Such removal will not significantly affect the architecture of the ancestral genomes (indeed, these blocks are well below the resolution of the cytogenetics approaches) while at the same time resolving five out of nine remaining components in the graph. Supplement F describes a different approach that attempts to find the positions and orientations of such short synteny blocks in the ancestors by processing complex breakpoints (MGRA Stage 3). We remark that processing at MGRA Stage 3 is viewed as less reliable, and the resulting associations are not considered in the proposed ancestral reconstructions (see below). Recall that a strict T-consistent rearrangement scenario uniquely defines ancestral genomes at all internal nodes of the phylogenetic tree T. 
However, because of the use of 2-breaks instead of reversals/translocations/fissions/fusions, the ancestral genomes initially obtained by MGRA may contain (a small number of) circular chromosomes. Whenever possible, MGRA linearizes them by rearranging 2-breaks in the transformation. While circular chromosomes may occasionally appear in the initial rearrangement scenario obtained by MGRA, their appearance is a result of either 2-breaks applied in the “wrong” order (that can be avoided by reordering the 2-breaks [see Pevzner 2000], or a “shortcut” in processing hurdles that can be remedied by introducing additional 2-breaks [Hannenhalli and Pevzner 1999]). MGRA eliminates possible circular chromosomes in the reconstructed genomes at the post-processing stage. We emphasize that the outcome of MGRA is the set of ancestral (linear) genomes, while the 2-break rearrangement scenario produced by MGRA is considered only as a starting point for constructing the reversals/translocations/fusions/fissions scenario. An optimal linear rearrangement scenario can be found by applying GRIMM to the ancestral genomes reconstructed by MGRA. Supplemental Figure S14 illustrates the results of ancestral genome reconstruction for chromosome X for six mammalian genomes. Supplement H shows the pairwise rearrangement distances between the ancestral and leaf genomes, following the strict T-consistent transformation constructed by MGRA and compares them to the genomic distances computed by GRIMM (Tesler 2002b). The differences between these distances are rather small, suggesting that the -consistent transformation found by MGRA is close to the most parsimonious.Benchmarking MGRA Benchmarking of the ancestral genome reconstruction algorithms may be challenging since the architecture of ancestral genomes is not known. 
While MGR, GRAPPA, inferCARs, and MGRA showed excellent performance on simulated data sets, these benchmarks were mainly designed for rearrangements generated according to the Random Breakage Model (RBM). Since MGRA improves on MGR and is guaranteed to produce optimal solutions for semi-independent scenarios, it is bound to provide even better results than MGR on such benchmarks. Supplement L compares MGRA and inferCARs on simulated data and illustrates that MGRA generates more accurate ancestral reconstructions for all choices of parameters. However, analyzing all these tools on simulated data may generate over-optimistic results since RBM does not reflect the realities of mammalian evolution (Bailey et al. 2004; van der Wind et al. 2004; Zhao et al. 2004; Murphy et al. 2005; Webber and Ponting 2005; Hinsch and Hannenhalli 2006; Ruiz-Herrera et al. 2006; Yue and Haaf 2006; Caceres et al. 2007; Gordon et al. 2007; Kikuta et al. 2007; Mehan et al. 2007). We therefore decided to analyze the differences between MGRA and inferCARs reconstructions and to further track evidence for each such difference in a case-by-case fashion. MGRA and inferCARs produce highly consistent ancestral reconstructions. For illustration purposes, we have chosen to focus on the reconstruction of the MRD ancestral genome (Fig. 5 Comparison of two inferCARs reconstructions and using MGRA to improve inferCARs ancestral reconstructions We start by comparing inferCARs with itself on two inputs: the original six mammalian genomes M, R, D, Q, H, C and the genomes M′, R′, D′, Q′, H′, C′ produced by MGRA Stage 1 (Fig. 7 Since MGRA Stage 1 processes only good cycles/paths that are unambiguously present in every optimal rearrangement scenario, one can safely assume that any optimal ancestral reconstruction should include the rearrangements performed at Stage 1. 
Therefore, running inferCARs on M, R, D, Q, H, C genomes should ideally produce the same results as running inferCARs on the “equivalent” M′, R′, D′, Q′, H′, C′ genomes. However, since inferCARs makes some greedy decisions and does not claim optimality, it is not guaranteed to produce the same results on M, R, D, Q, H, C as on M′, R′, D′, Q′, H′, C′. Any such inconsistency would point either to somewhat less reliable CARs reconstructed by inferCARs or to reliable adjacencies missed by inferCARs. Therefore, inferCARs reconstructions can potentially be improved if MGRA Stage 1 is run before inferCARs as a pre-processing step. Comparison of the reconstructed genomes MRDCARs and MRD′CARs indicates that while they share the overwhelming majority (99.0%) of reconstructed adjacencies, there are 13 adjacencies present in MRDCARs but absent in MRD′CARs and 13 adjacencies absent in MRDCARs but present in MRD′CARs (out of the 1325 reconstructed adjacencies).

Figure 8
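The comparison of two reconstructions described above amounts to comparing two sets of adjacencies. The following is a minimal sketch of that bookkeeping, not the procedure used by MGRA or inferCARs: the function names and the toy adjacencies are hypothetical, and each adjacency is assumed to be an unordered pair of synteny block ends labeled in the style of the text (for example, 658h and 652h).

```python
from typing import FrozenSet, Iterable, Set, Tuple

Adjacency = FrozenSet[str]


def adjacency_set(pairs: Iterable[Tuple[str, str]]) -> Set[Adjacency]:
    """Store each adjacency as an unordered pair, so (a, b) equals (b, a)."""
    return {frozenset(pair) for pair in pairs}


def compare_reconstructions(first: Set[Adjacency], second: Set[Adjacency]):
    """Return adjacencies shared by both sets, only in the first, only in the second."""
    return first & second, first - second, second - first


# Hypothetical toy data (not taken from the paper): block order 1 2 3 4 5
# versus the same order with block 4 reversed.
mrd_cars = adjacency_set([("1h", "2t"), ("2h", "3t"), ("3h", "4t"), ("4h", "5t")])
mrd_prime_cars = adjacency_set([("2t", "1h"), ("2h", "3t"), ("3h", "4h"), ("4t", "5t")])

shared, only_first, only_second = compare_reconstructions(mrd_cars, mrd_prime_cars)
shared_fraction = len(shared) / len(mrd_cars)
```

Storing adjacencies as unordered pairs keeps the comparison independent of the direction in which each chromosome is read.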
To resolve the conflicts between inferCARs results on equivalent inputs, we analyze each of these adjacencies [(658h, 652h), (871t, 873t), (770t, 771t), and (1014t, 1017h)] in a case-by-case fashion. For example, in the case of the (658h, 652h) adjacency, inferCARs failed to connect them, since the vertices 658h and 652h represent breakpoints of multi-degree 3 (Fig. 8).

Comparison of inferCARs and MGRA reconstructions

Supplemental Figure S19 displays the breakpoint graph of the three MRD reconstructions: MRDMGRA, MRDCARs, and MRD′CARs, and illustrates that the number of differences between MRDCARs and MRD′CARs (we consider the latter reconstruction to be more reliable) is comparable to the number of differences between MRDMGRA and MRD′CARs. Indeed, MRD′CARs differs from MRDCARs by 30 adjacencies and differs from MRDMGRA by 39 adjacencies. Since the large-scale architecture of MRDCARs was shown to be largely consistent with previous cytogenetics reconstructions (Ma et al. 2006) and since MRD′CARs (that is arguably even more reliable than MRDCARs) and MRDMGRA share at least 98.5% of all adjacencies, all these reconstructions can be viewed as largely consistent with the cytogenetics-based reconstructions. Remarkably, most differences between MGRA and inferCARs reconstructions are represented by ambiguous joins that MGRA labels as less reliable anyway (shown as dashed edges). In particular, inferCARs reports eight less reliable adjacencies as unambiguous (complete multi-edges with dashed purple edges in Supplemental Fig. S19). However, most of them correspond to micro-inversions and have minor effects on the large-scale ancestral architectures (see Supplement I for detailed comparison of MGRA and inferCARs reconstructions).
Table 3 shows the genomic distances from MRDMGRA and MRDCARs to each of the six leaf genomes and illustrates that MGRA results in a slightly more parsimonious scenario as compared to inferCARs (the total distance is 1503 for MGRA and 1518 for inferCARs).
The primate–rodent–carnivore split in mammalian evolution

Knowledge of the correct phylogeny is an important prerequisite for many comparative genomics approaches (Blanchette and Tompa 2002; Kellis et al. 2003). However, even the basic features of the mammalian phylogeny (e.g., the primate–rodent–carnivore split) remain controversial (Fig. 9).
Chaisson et al. (2006) made an attempt to analyze mammalian phylogeny using micro-rearrangements in the CFTR region representing 0.06% of mammalian genomes. However, the small size of this region and ambiguities in revealing micro-rearrangements between distant mammals made it difficult to find micro-rearrangements that can certify the deep branches of the mammalian phylogenetic tree. Cannarozzi et al. (2007) made an attempt to analyze large-scale rearrangements (as opposed to micro-rearrangements) for reconstructing the mammalian evolutionary history. Their approach, while promising, left many questions unanswered. In particular, Cannarozzi et al. (2007) discussed only reversals and ignored translocations, fusions, and fissions. Also, they computed the reversal distances using an (unpublished) greedy algorithm. Since breakpoint reuse is prominent in mammalian evolution (Pevzner and Tesler 2003), greedy approaches are unlikely to provide an adequate rearrangement scenario. Finally, Cannarozzi et al. (2007) used “distance-based” rather than “character-based” methods for computing the phylogenetic tree. It is well known that the performance of distance-based methods deteriorates in the case of the extensive breakpoint reuse typical for mammalian genomes.
Lunter (2007) criticized Cannarozzi et al. (2007) and wrote in April 2007: “It appears unjustified to continue to consider the phylogeny of primates, rodents, and canines as contentious.” Huttley et al. (2007) wrote in May 2007: “We have demonstrated with very high confidence that the rodents diverged before carnivores and primates.” (See Niimura and Nei 2007 and Huerta-Cepas et al. 2007 for other recent studies supporting the primate–carnivore clade.) We therefore argue that the rearrangement-based study of the primate–rodent–carnivore controversy is timely.

To analyze the primate–rodent–carnivore controversy, we added the opossum genome (Mikkelsen et al. 2007) to our rearrangement analysis.13 However, while the phylogenetic tree of the previously considered six mammalian genomes is well established, the position of the opossum genome in this tree is being debated (Fig. 9).

To further address the uncertainty with the opossum branch, we applied MGRA only to the non-controversial parts of the tree with the goal to find characters supporting each of two currently debated tree topologies. The debated tree topologies share (noncontroversial) HC + MRDOQ, QHC + MRDO, and MR + DOQHC branches (as well as seven leaf branches corresponding to single genomes). We refer to these branches as “confident” and consider only good and fair paths that correspond to confident branches in MGRA analysis.

To further compare the support for the primate–carnivore and the primate–rodent clades, we run MGRA to simplify this breakpoint graph. MGRA Stages 1–2 result in the breakpoint graph (Supplemental Fig. S18, top) that encodes rearrangements during mammalian radiation. While running MGRA on all seven genomes was important for simplifying the initial breakpoint graph of seven genomes, it hardly makes sense to analyze all these genomes in the complex graph in Supplemental Figure S18 (top).
Indeed, we are not interested in subtle inconsistencies between mouse–rat and human–chimpanzee–macaque genomic architectures revealed by this graph. We therefore select single representatives of the primate (macaque–human–chimpanzee ancestor), rodent (mouse–rat ancestor), and carnivore (dog) as well as the outgroup (opossum genome) to simplify the analysis (see Supplemental Fig. S18, top, for a similar analysis with the representatives corresponding to extant macaque, mouse, dog, and opossum genomes).
Table 1B allows one to analyze features supporting the primate–carnivore clade (26 multi-edges of MO + DQ multi-colors) and the primate–rodent clade (12 multi-edges of MQ + DO multi-colors). While the rearrangement-based support for the primate–carnivore clade is stronger than for the primate–rodent clade (26 vs. 12 multi-edges), one cannot exclude the possibility that some complex breakpoint reuse events skewed the statistics in Table 1B in favor of the primate–carnivore clade (see Table 4A–C). Since the elephant genome provides a better (less diverged) outgroup than the opossum genome, there is hope that the completion of the elephant sequencing project may eventually lead to the resolution of the primate–rodent–carnivore controversy.
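The tally of multi-edges supporting each candidate branch can be illustrated with a short sketch. It assumes, hypothetically, that each multi-edge is summarized by its multi-color written as a string of genome labels (M, D, Q, O for the rodent, carnivore, primate, and outgroup representatives) and that a multi-color and its complement denote the same split, so MO and DQ both count toward MO + DQ. The function names and the input list are illustrative only and do not reproduce the counts in Table 1B.

```python
from collections import Counter
from typing import FrozenSet, Iterable

# Labels for the four representatives, following the notation in the text.
GENOMES = frozenset("MDQO")


def split_of(multicolor: Iterable[str],
             genomes: FrozenSet[str] = GENOMES) -> FrozenSet[FrozenSet[str]]:
    """Map a multi-color to the split {side, complement} it induces on the genomes."""
    side = frozenset(multicolor)
    if not side or not side < genomes:
        raise ValueError("multi-color must be a non-empty proper subset of the genomes")
    return frozenset({side, genomes - side})


def tally_splits(multicolors: Iterable[str]) -> Counter:
    """Count multi-edges per split; complementary multi-colors are pooled."""
    return Counter(split_of(mc) for mc in multicolors)


# Hypothetical toy input: one multi-color per multi-edge.
toy_multi_edges = ["MO", "DQ", "MO", "MQ", "DO", "DQ"]
support = tally_splits(toy_multi_edges)

primate_carnivore = support[split_of("DQ")]  # MO + DQ split
primate_rodent = support[split_of("MQ")]     # MQ + DO split
```

Such a count is only a raw character tally; as the text notes, breakpoint reuse can bias it, so it should not be read as a statistical test.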
We re-run MGRA on the set of seven genomes, assuming the primate–carnivore topology.14 The resulting rearrangement distances as well as 2-breaks assigned to the MRO + DQHC branch (supporting the carnivore–primate split) are given in Supplement H.

Discussion

Recently, Froenicke et al. (2006) expressed a concern about some differences between the rearrangement-based and cytogenetics-based approaches to ancestral genome reconstruction. The problem is that some important insights developed by the cytogenetics community still did not find their way into the genome rearrangement tools like MGR, GRAPPA, inferCARs, and EMRAE. While MGRA started as an attempt to close this gap, we quickly realized that the problem of merging the cytogenetics-based and rearrangement-based approaches is far from being simple. First, there is still no cytogenetics-based software that can be automatically applied to genome-scale data sets to enable an unbiased comparison of two approaches on the same data set. Second, it is not clear how well the cytogenetics approach scales with increase in the resolution, for example, with 1000+ synteny blocks from Ma et al. (2006).

Despite the low resolution of the cytogenetics data, the cytogenetics-based ancestral reconstructions are very accurate as there are relatively few discrepancies between the cytogenetics-based and the recent genomics-based high-resolution reconstructions (Bourque et al. 2005; Murphy et al. 2005; Ma et al. 2006). Moreover, the discrepancies are usually attributed to some arbitrary assignments of the genomics-based MGR algorithm (Froenicke et al. 2006) rather than errors in the cytogenetics analysis. Indeed, MGR was developed for finding the most parsimonious scenario rather than finding which rearrangements in this scenario are less reliable than others.
The discrepancies between MGR and the cytogenetics-based reconstructions are likely to be a reflection of the “strength in numbers” principle rather than shortcomings of the genomics-based approaches: while the cytogenetics reconstructions are based on more than 100 known cytogenetics maps, there are still only seven completed mammalian genomic sequences suitable for the rearrangement analysis. However, even with a small increase in the number of the genomes from three to four (as in Bourque et al. 2004, 2005) to six to seven (as in Murphy et al. 2005; Ma et al. 2006), there are very few discrepancies between the cytogenetics-based and the genomics-based approaches (Rocchi et al. 2006). Despite a recent debate (Bourque et al. 2006; Froenicke et al. 2006), the cytogenetics-based and genomics-based approaches are converging and benefiting from the higher resolution of the genomics-based approaches. However, the key condition for such convergence is the availability of algorithms that improve on the existing heuristics for separating between strong and weak associations. We addressed this challenge by devising the MGRA algorithm, which remedies some limitations of the previous approaches to ancestral reconstructions. Similarly to the algorithms recently proposed by Ma et al. (2008) and Chauve and Tannier (2008) (published after this paper was submitted), MGRA focuses on accurate rather than the most parsimonious ancestral reconstructions.

Acknowledgments

We thank Jian Ma for providing us with the synteny blocks for mammalian genomes from the latest builds and for numerous thoughtful discussions. We thank Bill Murphy for a discussion about the primate–rodent–carnivore controversy, and Guillaume Bourque and Glenn Tesler for discussions on the algorithmic aspects of this project. We also thank Lutz Froenicke and Claus Kemkemer for useful comments on the cytogenetics approach and CytoAncestor. This work was supported by the Howard Hughes Professor Award.
Footnotes

2. The breakpoint graphs represent a popular technique for the rearrangement analysis since they reveal pairs of breakpoints representing footprints of the rearrangement events. See chapter 10 of Pevzner (2000) for background information on genome rearrangements and breakpoint graphs.

3. In this study, we use the term “reversal” (common in bioinformatics literature) instead of the term “inversion” (common in biology literature). For circular chromosomes, fusions and translocations are not distinguishable, that is, every fusion of circular chromosomes can be viewed as a translocation, and vice versa.

4. The detailed information about synteny blocks and assembly builds is provided in the Supplemental material. Out of 1360 synteny blocks (kindly provided by Jian Ma), three synteny blocks represent intermixed segments of the chromosome X and other chromosomes (the mouse chromosome 7 and the rat chromosomes 15 and 20). Since these blocks are short (16, 47, and 17 kb, respectively), we have discarded them to simplify the chromosome X analysis below. For better illustration of the breakpoint graphs, the vertex ∞ is shown in multiple copies as black dots, each connected by a single multi-edge to a regular vertex.

5. Switching from black rearrangements to a mixture of black and green rearrangements is a simple but powerful paradigm that proved to be useful in previous studies (Bafna and Pevzner 1998; Tannier and Sagot 2004).

6. The use of T-consistent 2-breaks here is motivated by an important property that every T-consistent transformation can be turned into a strict T-consistent transformation by changing the order of 2-breaks. Therefore, we do not directly address the strictness requirement in MGRA that first produces a T-consistent transformation of the genomes P1, P2,…, Pk into the genome X and then reorders it into a strict transformation.

7. While this representation is not unique, all these representations are equivalent (i.e., they produce the same final result).
Figure 6B

8. In the special case x_0 = x_{m+1} = ∞ and the flanking edges are of the same T-consistent multi-color, we perform a fusion 2-break as shown in Figure 6D.

9. One can prove that the topology of the resulting graph does not depend on the order in which good cycles/paths are processed.

10. We use the MRD node of the phylogenetic tree in Figure 5

11. We are not claiming that this association does not exist since it may be present in some of 100+ genomes with available cytogenetics data. However, there is no support for this association in the six mammalian genomes. We remark that Ma et al. (2006) also did not find support for this association.

12. inferCARs reconstructions slightly differ from those reported in Ma et al. (2006) since we use the synteny blocks from the latest builds of mammalian genomes (provided by Jian Ma, University of California, Santa Cruz). Similar to Ma et al. (2006) and Kemkemer et al. (2006), we ignore very short CARs blocks in both inferCARs and MGRA reconstructions to simplify the analysis (see Supplemental Table S14).

13. Adding the seventh genome increases the number of the synteny blocks to 1746 (by ~30%) but reduces the coverage of the genomes by the synteny blocks from 89% to 79%.

14. In contrast, Ma et al. (2006) assumed the primate–rodent topology.

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.082784.108.

References