- Journal List
- Bioinformatics
- PMC2718624

# Guided genome halving: hardness, heuristics and the history of the Hemiascomycetes

^{1}Department of Biology,

^{2}Department of Biochemistry, Microbiology and Immunology,

^{3}School of Information Technology and Engineering and

^{4}Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada

## Abstract

**Motivation:** Some present day species have incurred a whole genome doubling event in their evolutionary history, and this is reflected today in patterns of duplicated segments scattered throughout their chromosomes. These duplications may be used as data to ‘halve’ the genome, i.e. to reconstruct the ancestral genome at the moment of doubling, but the solution is often highly nonunique. To resolve this problem, we take account of outgroups, external reference genomes, to guide and narrow down the search.

**Results:** We improve on a previous, computationally costly, ‘brute force’ method by adapting the genome halving algorithm of El-Mabrouk and Sankoff so that it rapidly and accurately constructs an ancestor close the outgroups, prior to a local optimization heuristic. We apply this to reconstruct the predoubling ancestor of *Saccharomyces cerevisiae* and *Candida glabrata*, guided by the genomes of three other yeasts that diverged before the genome doubling event. We analyze the results in terms (1) of the minimum evolution criterion, (2) how close the genome halving result is to the final (local) minimum and (3) how close the final result is to an ancestor manually constructed by an expert with access to additional information. We also visualize the set of reconstructed ancestors using classic multidimensional scaling to see what aspects of the two doubled and three unduplicated genomes influence the differences among the reconstructions.

**Availability:** The experimental software is available on request.

**Contact:** ac.awattou@ffoknas

## 1 INTRODUCTION

Whole genome doubling (WGD) is a rare but important type of evolutionary event, often giving rise to major new lineages. In its various forms it has occurred across the eukaryotic spectrum, from the pathogenic protist *Giardia* to the ancestor of brewer's yeast, to most of the plant lineages, to several insect species, to the salmonid fishes, to amphibians and even to mammalian species.

WGD is followed, over evolutionary time, by genome rearrangement through intra- and interchromosomal movement of genetic material. The phylogenetic study of synteny, gene order and chromosomal evolution becomes blocked because of the extraordinarily high rates of paralogy in the species descended from the WGD compared to sister species that diverged before the WGD. If we could infer the ancestral genome that underwent the WGD, this difficulty would be resolved. Thus the genome halving problem is to reconstruct the ancestral genome on the basis of a decomposition of the present day genome into a set of apparently duplicated blocks of genes or DNA sequence dispersed among the chromosomes.^{1} A linear-time algorithm to find the ancestral genome that minimizes the genomic distance to the present day genome has been available for some time (El-Mabrouk and Sankoff, 2003; El-Mabrouk *et al.*, 1999). Unfortunately, the solution to the combinatorial optimization problem is not always directly interpretable as a solution to the evolutionary biology problem. First, the algorithmic result suffers from severe non-uniqueness. Second, in common with most methods of inferring history, we have no direct way to verify if the answer is correct. Our goal is to counteract these problems, first by guiding the reconstruction by one or more reference, or outgroup, genomes and second, by checking our results for a particular dataset against an ancestor genome manually reconstructed by an expert.

If our guided reconstruction method were to be feasible and accurate it could have wide application. One or more descendants of a WGD event co-occur with unduplicated sister species in many phylogenies. This is most prevalent among plants where, for example, the poplars and willows descend from a common WGD, while the closely related eurosid angiosperms like papaya diverged before this event, but it also occurs in yeast, where brewer's yeast and several sister species share an origin in an ancestral WGD, while other closely related species have earlier divergence dates, in fish, where the salmonid species like trout and salmon originate in a WGD event after diverging from the related osmerid fish, in mammals, where some genera of viscacha rodents share a WGD history while their relationship with very similar octodontids predates this. In protists, the important pathogen genus *Giardia* has undergone a form of WGD, while the related enteromonad parasites have not, though this may be due to a post-WGD loss rather than an early divergence. This very partial list of examples emphasizes species whose genomes have been sequenced or for which serious sequencing projects are underway or are being actively promoted.

We first explored the idea of guided reconstruction for the ancestral doubled genome of the maize (*Zea mays*) genome, with the rice (*Oryza sativa*) and sorghum (*Sorghum bicolor*) genomes^{2} as outgroups (Zheng *et al.*, 2006). Our strategy was to generate all the 1.5×10^{6} solutions to the genome halving problem for the maize genome, and to identify the subset, containing 10–20 solutions that have a minimum rearrangement distance with the rice (or sorghum) genome. We followed this with a local improvement heuristic searching outside the immediate set of optimal halving solutions to find the genome *A* that minimizes the sum of the distance between the doubled form *A*⊕*A* and present day maize plus the distance between rice (or sorghum) and the predoubling form *A*.

While this approach was feasible with the 34 doubled blocks in maize, present in one copy in each outgroup, the heuristic search step was time consuming, given that the starting points were relatively far from optimal. Then we attempted to reconstruct the ancient doubled yeast genome from which *Saccharomyces cerevisiae* is descended, guided simultaneously by both of the undoubled outgroup genomes *Ashbya gossypii* and *Kluyveromyces waltii* (Sankoff *et al.*, 2007). In these data the number of doubled genes we used was an order of magnitude greater than the number of blocks in the cereals data, and the number of solutions to the halving problem astronomical. It is not feasible to exhaustively search the halving solutions to find those that are closest to the outgroups, to say nothing of the heuristic search step. Instead we tried working with a sample of halving solutions, hoping to generate at least one initialization leading to a good solution. It was not clear, however, how large the sample should be, or how to validate the results, since the local optima found in that study remained fairly far apart, as measured by genomic distance.

The facts that halving guided by a single outgroup involves only two genomes, and that both of its component parts, halving and distance calculation, are basically linear time, suggests that this problem might be susceptible to a polynomial-time analysis, in contrast to problems such as the ‘median problem’ for three or more genomes, which are NP-hard (Bryant, 1998; Caprara, 2003; Pe'er and Shamir, 1998). We dispose of this hope at the outset, by showing that the simplest problem of halving guided by one outgroup is NP-hard.

Nevertheless, in the ensuing sections, we seek to replace the ‘brute force’ approach of generating unconstrained halving solutions first, i.e. before taking into consideration the outgroup genome(s). Instead, we inject all pertinent information derivable from the outgroup(s) into the halving algorithm, influencing hitherto arbitrary choices in that algorithm so that the halving solution is guided towards the outgroup(s).

We analyze data on two yeasts descended from the same doubling event, *S.cerevisiae* and *Candida glabrata*, to try to reconstruct the original doubled genome. Three related outgroup species are currently available in the Yeast Gene Order Browser (YGOB, Byrne and Wolfe, 2005): *A.gossypii*, *K.waltii* and *Kluyveromyces lactis.* YGOB also furnishes an estimate of the ancestral doubled genome painstakingly reconstructed by Jonathan Gordon on the basis of multiple sources of information.

Our new algorithm greatly improves the accuracy of our results, while drastically reducing the computational effort, both in generating halving solutions and in the local optimization search. We compare this new approach to the sampling approach, with and without the local optimization step, from the viewpoints of the objective function value obtained and computing time. We apply our method to all combinations of the two descendants of the doubled ancestor and four single genomes, the three species already mentioned plus Gordon's manually reconstructed ancestor.

We also use data-analytic methods to compare our inferred predoubling genomes to each other and to the Gordon construct.

## 2 PROBLEM STATEMENT

Although the idea of guided genome halving is not difficult, the prerequisite for understanding the analysis is to have some knowledge of standard genome rearrangement problems, namely genomic distance, genome halving and genome median. We can only sketch these in Sections 2.2, 2.3 and 2.4 before enunciating the guided genome halving (GGH) problems in Section 2.5. In Section 3, we discuss the algorithms for these problems.

### 2.1 Genomes and rearrangement operations

A genome *G* is represented by a set of strings (called {*g*_{{11},..,*g*_{1n1,..,gχ1,..,gχnχ}, where *n*=*n*_{1}+···+*n*_{χ} and {|*g*_{..}|}={1, …,*n*}; i.e. each integer *i*∈{1, …,*n*}, representing a gene or other marker, appears exactly once in the genome and may have either positive or negative polarity. The biologically motivated operations generally include^{3} inversions (implying as well change of sign, i.e. change of strand) of chromosomal segments, e.g. *h*_{1} ···*h*_{u} ···*h*_{v} ···*h*_{m}→*h*_{1} ···−*h*_{v} ···−*h*_{u} ···*h*_{m}, and reciprocal translocations, e.g. *h*_{1} ···*h*_{u} ···*h*_{l}, *k*_{1} ···*k*_{v} ···*k*_{m}→*h*_{1} ···*k*_{v} ···*k*_{m}, *k*_{1} ···*h*_{u} ···*h*_{l}.

### 2.2 Genomic distance

*The genome rearrangement distance d*(*G*, *H*) is defined to be the minimum number of operations necessary to convert one genome *G* into another *H*.

*The breakpoint distance*—We say there is a ‘shared adjacency’ if the signed integer *g*_{i, j+1} immediately follows *g*_{i, j} on a chromosome in *H* as well as on the *i*-th chromosome in *G*, or if −*g*_{ij} follows −*g*_{i, j+1} in *H*. There are also shared adjacencies if *g*_{i1} or −*g*_{ini} are first terms on chromosomes in *H* or if *g*_{ini} or −*g*_{i1} are last terms on chromosomes in *H*. Then if *G* and *H* have the same number of chromosomes χ, the breakpoint distance *d*_{B}(*G*,*H*) is defined to be *n*+χ—the number of shared adjacencies.

### 2.3 Genome halving

Let *T* be a genome consisting of ψ chromosomes and 2*n* genes , dispersed in any order on the chromosomes. For each *i*, we call and ‘duplicates’, but there is no particular property distinguishing all elements of the set of in common from all those in the set of A potential ‘doubled ancestor’ of *T* is written *A*^{′}⊕*A*^{′′}, and consists of 2χ chromosomes, where some half (χ) of the chromosomes, symbolized by the *A*^{′}, contains exactly one of or for each *i*=1, …,*n*. The remaining χ chromosomes, symbolized by the *A*′′, are each identical to one in the first half, in that where appears on a chromosome in the *A*^{′}, appears on the corresponding chromosome in *A*′′, and where appears in *A*^{′}, appears in *A*′′. We define *A* to be either of the two halves of *A*^{′}⊕*A*^{′′}, where the superscript (1) or (2) is suppressed from each or . These χ chromosomes, and the *n* genes they contain, *a*_{1},…,*a*_{n} constitute a potential ‘doubled ancestor’ of *T*.

*The genome halving problem for T is to find an A for which some d*(*A*^{′}⊕*A*^{′′}, *T*) is minimal.

### 2.4 The median problem

Let *P*,*Q* and *R* be three genomes on the same set of *n* genes. *The rearrangement median problem is to find a genome M such that d*(*P*,*M*)+*d*(*Q*,*M*)+*d*(*R*,*M*) *is minimal*. *The breakpoint median problem is to find a genome M such that d*_{B}(*P*,*M*)+*d*_{B}(*Q*,*M*)+*d*_{B}(*R*,*M*) *is minimal.*

### 2.5 Guided genome halving

As in Section 2.3, let *T* be genome consisting of ψ chromosomes and 2*n* genes , dispersed in any order on the chromosomes, where for each *i*, genes and are duplicates. Any genome *R* is a reference or outgroup genome for *T* if it contains the *n* genes *a*_{1},…,*a*_{n}.

There are a number of different formulations possible for GGH, depending on the genomic distance used, and the number of outgroups doing the guiding. Here we will study the cases of one outgroup (Zheng *et al.*, 2006) and two outgroups (Sankoff *et al.*, 2007), using the genomic distance *d* defined in Section 2.2, and we will also analyze the complexity of the one outgroup problem in the context of the breakpoint distance *d*_{B}.

*Let R be a reference genome for T*. *The GGH problem with one outgroup is to find (an estimated ancestral) genome A such that some d*(*R*,*A*)+*d*(*A*^{′}⊕*A*^{′′},*T*) *is minimal*. *Let R*_{1} *and R*_{2} *be two reference genomes for T*. *The GGH problem with two outgroups is to find A and a median genome M such that some d*(*R*_{1},*M*)+*d*(*R*_{2},*M*)+*d*(*A*,*M*)+*d*(*A*^{′}⊕*A*^{′′},*T*) is minimal.

## 3 ALGORITHMS FOR GENOME DISTANCE, GENOME HALVING AND THE GENOME MEDIAN

### 3.1 Distance

Rearrangement algorithms (Tesler, 2002) can be formulated in terms of the bi-colored ‘breakpoint graph’, where each end (either 5^{′} or 3^{′}) of a gene in genome *G* is represented by a vertex joined by a black edge to the vertex for adjoining end of the adjacent gene, and these same ends, represented by the same 2*n* vertices in the graph, are joined by gray edges determined by the adjacencies in genome *H*. In addition, if *G* has χ chromosomes, assuming without loss of generality that this is at least as many as *H*, each vertex representing a first or last term of some chromosome in *G* only is connected by a black edge to an individual ‘cap’, or dummy, vertex so that there are 2*n*+2χ vertices in all. The breakpoint graphs necessarily consist of disjoint alternating color cycles and/or paths, and it can be shown that, with some rare, easily identifiable exceptions, *d*(*G*,*H*)=*n*+χ−*c*−Π, where *c* is the number of cycles and Π the number of paths terminating in at least one cap. Calculating the distance can be done in time linear in *n*.

The actual operations, *d*(*G*,*H*) in number, may be reconstructed by successively choosing certain large cycles and paths in the breakpoint graph to split into two, corresponding to a reversal or translocation, until there are *n*−χ cycles each made up of two vertices, a black edge and a gray edge, and 2χ paths each containing one cap and one chromosome-terminating gene vertex connected by a black edge. This requires somewhat more than linear time.

The breakpoint distance *d*_{B} is easily calculated by storing all adjacencies of *G* as it is input, and verifying for each *g*_{ij} as it is encountered when *H* is input, whether its successor is *g*_{i, j+1}.

### 3.2 Halving

In the rearrangement distance algorithm, construction of the breakpoint graph is an easy step. The genome halving algorithms (El-Mabrouk and Sankoff, 2003) also make use of the breakpoint graph, but the problem here is the more difficult one of building the breakpoint graph where one of the genomes (the doubled ancestor *A*^{′}⊕*A*^{′′}) is unknown. This is done by segregating the vertices of the graph in a natural way into subsets, such that the vertices of all cycles must fall within a single subset, and then constructing these cycles in an optimal way within each subset so that the black edges correspond to the structure of the known genome *T* and the gray edges define the adjacencies of *A*^{′}⊕*A*^{′′}.

As a first step each gene *a* in a doubled descendant is replaced by a pair of vertices (*a*_{t},*a*_{h}) or (*a*_{h},*a*_{t}) depending if the DNA is read from left to right or right to left. The duplicate of gene *a*=(*a*_{t},*a*_{h}) is written ā=(ā_{t},ā_{h}).

Following this, for each pair of neighbouring genes, say (*a*_{t},*a*_{h}) and (*b*_{h},*b*_{t}), the two adjacent vertices *a*_{h} and *b*_{h} are linked by a black edge, denoted {*a*_{h},*b*_{h}} in the notation of Bergeron *et al.* (2006). For a vertex at the end of a chromosome, say *b*_{t}, it generates a virtual edge of form {*b*_{t},end}. Note that the use of ‘end’ instead of ‘cap’ reflects a somewhat different book keeping for the beginnings and ends of chromosome in the halving algorithm compared to the distance algorithm in Section 3.1.

The edges thus constructed are then partitioned into *natural graphs* according to the following principle: If an edge {*x*,*y*} belongs to a natural graph, then so does some edge of form and some edge of form . If a natural graph has an even number of edges, as on the left of Figure 1, it can be shown that in all optimal ancestral doubled genomes, if a gray edge, representing two adjacent vertices in the ancestor, has a vertex in this natural graph, then it necessarily connects to another vertex in the same natural graph. For natural graphs with an odd number of edges, which cannot be completed by adding pairs of edges, there are one or more ways of grouping them pairwise into *supernatural graphs*, as on the right of Figure 1. An optimal doubled ancestor exists such that if a gray edge has a vertex in this supernatural graph, then it connects to another vertex within the same supernatural graph. Thus the supernatural graphs may be completed one at a time.

*x*,

*y*,

*z*vertices and

*a*,

*b*,

*c*vertices, respectively, combined into one supernatural graph so that three pairs of gray edges may be

**...**

An important detail in this construction is that before a gray edge is added during the completion of a supernatural graph, it must be checked to see that it would not inadvertently result in a circular chromosome. This involves inspection within this supernatural graph only. Key to the linear worst-case complexity of the halving algorithm is that this check may be made in constant time.

Along with the multiplicity of solutions caused by different possible constructions of supernatural graphs, within such graphs and within the natural graphs, there may be many ways of drawing the gray edges. Without repeating here the lengthy details of the halving algorithm, it suffices to note that these alternate ways can be generated by choosing one of the vertices within each supernatural graph as a starting point.

### 3.3 Median

Unlike the genomic distances and genome halving, which can all be calculated in linear time, the genome median problem, based either on *d* or *d*_{B}, is NP-hard (Bryant, 1998; Caprara, 2003; Pe'er and Shamir, 1998). The heuristics (Bourque and Pevzner, 2002; http://www.cs.unm.edu/ moret/GRAPPA/) commonly used to analyze the problem search for reversals that will move a genome towards the other two. This is iterated as often as possible; otherwise one of the genomes is moved towards only one of the others without prejudicing its distance to the third, and the algorithm stops when all three genomes become identical. These algorithms become prohibitively costly with moderate *n*.

## 4 PREVIOUS WORK ON GGH

### 4.1 Guided genome halving with one outgroup

Consider *T* and a related unduplicated genome *R* with genes orthologous to *a*_{1},…,*a*_{n}. Our problem is to find an unduplicated genome *A* that minimizes, for some *A*^{′}⊕*A*^{′′},

Our solution in (Zheng *et al.* 2006), as on the left of Figure 2, is to generate the set **S** of genome halving solutions, then to focus on the subset *X* ∈**S**′⊂**S** where *d*(*R*,*X*) is minimized.

*T*, with one (

*R*) or two (

*R*

_{1},

*R*

_{2}) unduplicated outgroups. The double circles represent two copies of potential ancestral genomes, including solutions to the genome halving in

**S**, and those on best trajectories between

**S**and

**...**

We then minimize *D*(*T*,*R*) by seeking heuristically for *A* along any trajectory between an element *X* ∈**S**′ and the outgroups. First, each possible genome on one or more trajectories between *X* and *R* is examined in turn to see if it that decreases *D*(*T*,*R*). If so, it is taken as the current best value of *X*. When no better *X* is found for any starting point in **S**′ the current value is taken to be *A*.

In our experience, any more comprehensive search becomes computationally very costly, and very rarely finds a better solution.

When **S**′ is so large that an exhaustive search for a local minimum becomes computationally too costly, or when it is too costly to generate all of **S** in order to find **S**′, we may resort to sampling **S**. In defining the gray edges in the supernatural graphs of Section 3.2, we generally have several choices at some of the steps. By randomizing this choice, we are effectively choosing a random sample of *X* ∈**S**.

### 4.2 Guided genome halving with two outgroups

With reference to the right of Figure 2, consider *T* and two unduplicated genomes *R*_{1} and *R*_{2} with genes orthologous to *a*_{1},…,*a*_{n}. Our problem here is to find a genome *A* and a median genome *M* for *A*,*R*_{1} and *R*_{2} that minimize

for some *A*^{′}⊕*A*^{′′}. Our solution in Sankoff *et al.* (2007), as on the right of Figure 2, was to generate the set **S** of solutions of the genome halving problem, then to focus on the subset *X* ∈**S**′⊂**S** where *d*(*R*_{1},*M*)+*d*(*R*_{2},*M*)+*d*(*X*,*M*) is minimized, using our own implementation of the median heuristics mentioned in Section 3.3. Then we sought the *A* minimizing *D*(*T*,*R*_{1},*R*_{2}), heuristically, along all trajectories between all elements *X* ∈**S**′ and *M*(*X*).

## 5 COMPLEXITY

We prove that GGH for one outgroup under the breakpoint distance *d*_{B} is NP-hard, using a reduction from the Breakpoint Median Problem. The latter is NP-hard, both for unichromosomal (Bryant, 1998) and multichromosomal genomes (E.Tannier, personal communication).

We convert the breakpoint median problem on *P*, *Q* and *R*, three diploid genomes with the same genes, into an instance of GGH:

- Construct genome
*P*_{1}by appending superscript ‘1’ to the symbol for each gene in genome*P*. - Construct genome
*Q*_{2}by appending superscript ‘2’ to the symbol for each gene in genome*Q*. - Let
*T*=*P*_{1}⊕*Q*_{2}. We will treat*T*as a doubling descendant. Superscripts ‘1’ and ‘2’ distinguish the two copies of a gene. - Define an instance of GGH based on the doubling descendant
*T*and the diploid outgroup*R*.

We prove that the solution of GGH for genomes *T* and *R* is also the solution of Breakpoint Median Problem on genomes *P*, *Q* and *R*:

Given any assignment of ‘1’ and ‘2’ superscripts to the pairs of genes in *T*, a solution for GGH minimizes

where *A*^{′} is a genome with one copy of each gene, labeled ‘1’ or ‘2’, and *A*′′ is the same as *A*^{′} with all the ‘1’ and ‘2’ superscripts interchanged. *A* is the same genome without superscripts.

#### LEMMA 1. —

*If we construct genome A*_{1} *by appending superscript* ‘*1*’ *to each gene in genome A*, *and A*_{2} *by appending superscript* ‘*2*’ *to each gene in genome A*, *then*

#### PROOF. —

Genomes *A*^{′} and *A*′′ form a solution to GGH. The sum *d*_{B}(*A*^{′}⊕*A*^{′′},*T*)+*d*_{B}(*A*,*R*) is minimized. Therefore

Due to the construction of the genome *T*, each pair of adjacent elements in *T* must have the same superscript. This implies that for every adjacency that *A*^{′}⊕*A*^{′′} has in common with genome *T*, the two adjacent terms must have same superscript too. Genome *A*_{1}⊕*A*_{2} contains all these common adjacencies, which implies

Thus *d*_{B}(*A*_{1}⊕*A*_{2},*T*)=*d*_{B}(*A*^{′}⊕*A*^{′′},*T*). If *A*^{′} and *A*′′ form a solution of GGH, then *A*_{1} and *A*_{2} also constitute a solution with the same breakpoint distance. ▪

#### LEMMA 2. —

*The breakpoint distance d*_{B}(*A*_{1}⊕*A*_{2},*T*)=*d*_{B}(*A*,*P*)+*d*_{B}(*A*,*Q*).

#### PROOF. —

We constructed *T*=*P*_{1}⊕*Q*_{2}. The adjacencies in common between *A*_{1}⊕*A*_{2} and *T* can be divided into two kinds:

- the common adjacencies between
*A*_{1}and*P*_{1}and - the common adjacencies between
*A*_{2}and*Q*_{2}.

Therefore *d*_{B}(*A*_{1}⊕*A*_{2},*T*)=*d*_{B}(*A*_{1},*P*_{1})+*d*_{B}(*A*_{2},*Q*_{2}). Trivially, i.e. by simply ignoring superscripts, *d*_{B}(*A*_{1},*P*_{1})=*d*_{B}(*A*,*P*) and *d*_{B}(*A*_{2},*Q*_{2})=*d*_{B}(*A*,*Q*) ▪

#### THEOREM 1. —

*Genome A*, *the solution of GGH for T and C*, *is also the solution of the Breakpoint Median Problem on genomes P*, *Q and R*.

#### PROOF. —

From Lemma 2, *d*_{B}(*A*_{1}⊕*A*_{2},*T*)=*d*_{B}(*A*,*P*)+*d*_{B}(*A*,*Q*). Thus

There cannot be any other genome *A** such that *d*_{B}(*A**,*P*)+*d*_{B}(*A**,*Q*)+*d*_{B}(*A**,*R*)<*d*_{B}(*A*,*P*)+*d*_{B}(*A*,*Q*)+*d*_{B}(*A*,*R*), because this *A** would then have the property that

a contradiction. Therefore genome *G* is the solution of the Breakpoint Median Problem on *P*, *Q* and *R*. ▪

Assuming the Breakpoint Median Problem for four genomes *L*,*P*,*Q* and *R* were also NP-hard, although we are not aware of any explicit proof, we could use the same method employed above to show that GGH with two outgroups is hard under the *d*_{B} distance.

We do not yet have corresponding proofs that GGH is NP-hard under the rearrangement distance *d*, but this is almost certainly the case since the breakpoint distance easier to compute than rearrangement distance, even though they are both *O*(*n*). Note that the Reversals Median Problem for three or more (unichromosomal) genomes is NP-hard (Caprara, 2003).

## 6 THE NEW ALGORITHMS

The key idea in our improvement on the brute force algorithms is to combine information from both *T* and the outgroups in constructing the ancestor. It is important to take advantage of the common structure in *T* and the outgroups as early as possible, before it can be destroyed in the course of construction. To this end, we drop the practice of completing all the gray edges in one supernatural graph before starting another. We simply look for elements of common structure and add gray edges accordingly, making sure at each step that no circular chromosomes are inadvertently created. It is still necessary to construct the supernatural graphs at the outset, both for the check against circular chromosomes and for technical reasons we omit here, having to do with chromosome ends.

Our approach requires only slight modifications from the context of a single outgroup to that of two outgroups. For that reason, we present a single algorithm for both, with the modifications for two outgroups in square brackets. Indeed, this presentation is suggestive of a generalization to three or more outgroups.

### 6.1 Paths

By ‘path’ we mean any connected succession of black and gray edges in a breakpoint graph, starting and terminating with a black edge. We represent each path by an unordered pair (*u*,*v*)=(*v*,*u*) consisting of its current endpoints, though we keep track of all its vertices and edges. Initially, each black edge in *T* is a path, as is each black edge in *R* [or in each of *R*_{1} and *R*_{2}].

### 6.2 Pathgroups

A pathgroup Γ is an ordered triple [quadruple] of paths, two in *T* and one in *R* [one each in outgroups *R*_{1} and *R*_{2}], where one endpoint of one of the paths in *T* is the duplicate of one endpoint of the other path in *T* and both are orthologous to one of the endpoints of the path in *R* [*R*_{1} and *R*_{2}]. The other endpoints may be duplicates or orthologs to each other, or not. For the special case where the duplicates are end vertices, and the supernatural graph containing it has four end nodes, then the members of a pair of duplicate dummies must originate in different (odd length) natural graphs.

### 6.3 The algorithms

In adding pairs of gray edges to connect duplicate pairs of terms in the breakpoint graph of *T* versus *X*^{′}⊕*X*^{′′} (which is being constructed) our approach is basically greedy, but with an important look-ahead. We can distinguish six priority levels among potential gray edges, i.e. potential adjacencies in the ancestor. Recall that in constructing the ancestor *X* to be close to the outgroups, such that *X*^{′}⊕*X*^{′′} is simultaneously close to *T*, we must create as many cycles as possible in the breakpoint graphs between *X* and the outgroups and in the breakpoint graph of *X*^{′}⊕*X*^{′′} versus *T*.

- Adding two gray edges would create two cycles in the breakpoint graph defined by
*T*and*X*′⊕*X*′′, by closing two paths, as on the top of Figure 3. When this possibility exists, it must be realized, since it is an obligatory choice in any genome halving algorithm. It may or may not create cycles in the breakpoint graph comparison of*X*with the outgroups. - Adding two gray edges would create three cycles, one for
*T*and one for each of two outgroups. - Adding two gray edges would create two cycles, one for
*T*and one for one outgroup, as in the middle of Figure 3. - Adding two gray edges would create one cycle for
*T*but none for the outgroups. It would, however, create a higher priority pathgroup, e.g., Figure 3, bottom. - Adding two gray edges would create a cycle in the
*T*versus*X*^{′}⊕*X*^{′′}comparison, but none for the outgroups, nor would it create any higher priority pathgroup. - Each remaining path terminates in duplicate terms, which cannot be connected to form a cycle, since in
*X*^{′}⊕*X*^{′′}these must be on different (and identical) chromosomes. In supernatural graphs containing such paths, there is always another path and adding two gray edges between the endpoints of the two paths can create a cycle.

In not completing each supernatural graph before moving on to another, we lose the advantage in (El-Mabrouk and Sankoff, 2003) of a constant time check against creating circular chromosomes. The worst case becomes a linear time check. In practice, this is a small liability, because the worst case scenario is seldom realized.

Once the GGH algorithm is terminated, we undertake the local search described in Sections 4.1 and 4.2 to see if we can improve *X* by allowing it to move out of **S** on a trajectory towards *R*.

## 7 GENOME DOUBLING IN YEAST

Wolfe and Shields (1997) discovered an ancient genome doubling in the ancestry of *S.cerevisiae* in 1997 after this organism became the first eukaryote to have its genome sequenced (Goffeau *et al.*, 1996). According to Kurtzman and Robnett (2003), the recently sequenced *C.glabrata* (Dujon *et al.*, 2004) shares this doubled ancestor. We extracted data from YGOB (Yeast Genome Browser) (Byrne and Wolfe, 2005), on the orders and orientation of the 600 genes (300 pairs) identified as duplicates in both genomes.

The YGOB (Byrne and Wolfe, 2005) contains complete gene orders and orthology identification among the five yeast species depicted in Figure 4: the two descendents of the above-mentioned ancient genome duplication event, *S.cerevisiae* and *C.glabrata*, and three species that diverged before this event, *A.gossypii*, *K.waltii* and *K.lactis.* For the ancient tetraploids, YGOB includes a reconstruction of the ancestral genome. We abbreviate these six genomes as SC, CG, AG, KW, KL and A*, respectively. In addition, we construct an ancestral doubled descendant V lying on a shortest rearrangement trajectory from SC to CG, satisfying the criterion that its halving distance is minimal (Zheng *et al.*, 2007b). We take the ancestor A* as ‘ground truth’ and see how close we can approach it using the sampling method and the guided halving method, with various combinations of doubling descendants and unduplicated genomes.

## 8 RESULTS

Table 1 compares the results, before and after local optimization, of the guided halving algorithm and the sampling approach on 12 pairs of genomes, the three doubling descendants SC, CG and V, each versus the four unduplicated genomes AG, KL, KW and A*. Recall that V and A* are themselves analytical constructs, the former representing the most recent common ancestor of SC and CG, and the latter the ancestral genome at the moment of doubling.

The first observations are methodological. In all 12 cases guided halving results in an *X* closer to *R* than in any of 2000 samples of unrestricted halving. If computing time were no obstacle, the sampling method would be exhaustive and exact, and hence always at least as good as guided halving. The fact that none of the 12 analyses produced a ‘lucky’ sample as good as or better than GGH, suggests that we would need a sample size of 25 000 at the very least, and perhaps one or more orders of magnitude larger, to bring the accuracy of sampling method to the level of guided halving, but this would require thousands of hours or more for our entire dataset versus less than 30 min with guided halving.

The fact that the results of the sampling method are improved by local searching, usually substantially, in all 12 cases, whereas guided halving produces genomes already at or very close to a minimum (albeit possibly local) of the objective function, is another measure of the superior performance of the latter.

Note that aside from the three cases where the ground truth ancestor *A** plays the role of outgroup, this genome is not directly involved in the analysis. It is of great interest, then, from the biological viewpoint, that in all cases, guided halving produces an ancestor *A* closer to *A** than the sampling method. Moreover, when using *A** as an outgroup for the halving of SC, the analysis reconstructs something very close to *A**, i.e. where *d*(*A*,*A**) is only 5. This attests to the internal coherence of the method: the SC evidence was predominant in the original construction of *A** (Byrne and Wolfe, 2005).

Turning to the case of two outgroups, we first point out that the sampling approach becomes infeasible when even a moderate number of analyses are undertaken. This is due to the relatively lengthy time (sometimes more than 2 h) required to compute the median cost, i.e., the sum of the three distances, from *R*_{1},*R*_{2} and the inferred ancestor *X*, to the median. (The halving algorithm alone, and even guided halving, never takes more than 2 or 3 min.) This is not an obstacle to the guided halving method because the median need to be calculated just once, instead of the thousands of times for the sampling approach. Table 2 shows the result of halving guided by two outgroups, using all combinations of two of AG, KL and KW versus each of SC, CG and V.

In general, we note no advantage of using two outgroups over one, in that *d*(*A*,*A**) with two outgroups is not as good as *d*(*A*,*A**) for the better of the two used alone. The exception is the comparison of KL and AG with V. Thus it seems, at least with these data, that the more remote outgroup contributes little more than noise to the reconstruction guided by the closer outgroup. This result may be due to the great discrepancy in the phylogenetic divergence between the doubled genomes and KW compared to the divergence between the former and AG or KL, and may not carry over to other datasets.

Two observations: first, the improvement due to local search is relatively small, though larger than guided halving with one outgroup. Second, though our analyses did find some *A* outside of **S** that minimized *D*(*T*,*R*_{1},*R*_{2}), in each such case there was also a solution (the one entered in Table 2) with *A* ∈**S**.

To investigate to what extent differences between the doubling descendants and among the outgroups are reflected in the reconstructed ancestor genome *A*, we undertook Gower's principal coordinates analysis (Gower, 1966) of the 21 versions of *A* described in Tables 1 and and2,2, as well as *A** itself. We used the implementation of this analysis available as *cmdscale* in the R environment (R Development Core Team, 2007), applied to the 22 × 22 genomic distance matrix.

Figure 5 depicts the results of a 3D principal coordinates analysis. We note first that the first two dimensions basically distinguish among the doubling descendants, first classifying SC and V together versus CG, and then distinguishing between SC and V. The third dimension distinguishes between the genomes in which KW was the outgroup and those in which only AG and/or KL were outgroups. As we would expect, all the genomes with *A** as the outgroup or as one of two outgroups, are closer to the ‘true’ ancestor *A** than when some other outgroup is used instead. Nevertheless, other outgroups, such as AG, also help guide the reconstruction to fairly close approximations of *A**. On the other hand, constructions guided by CG are all very far from *A**, and those involving KW tend to be somewhat farther than those guided by AG and KL. The latter observation is consistent with the known highly rearranged nature of CG, and with the relatively distant evolutionary relationship between KW and *A**, as can be seen in Figure 4.

## 9 DISCUSSION

We have focused on the two main concerns of genome halving, the multiplicity and the diversity of solutions, and the difficulty of assessing the accuracy of the results with real data. GGH was previously shown to drastically reduce the non-uniqueness inherent in unrestricted halving. This is carried further by GGH, which achieves much greater accuracy with much less computational effort.

An important indication of the precision of the reconstruction is its ability with some of the data to come very close to the manually reconstructed ancestor A.

Nevertheless, these results remind us of the uncertainties inherent in historical reconstruction. Some of this is possibly due to the ‘noise’ of mistaken paralogy identification, especially in highly rearranged genomes such as *C.glabrata*. Future work will attempt to attenuate this noise using the techniques of Zheng *et al.* (2007a) and Choi *et al.* (2007).

The significance of halving results depends on what proportion of the doubling descendant *T* is and can be identified as duplicated genes. Our analysis does not attempt to situate the ancestors of genes present in only one copy in *T*, and these will often form the majority. Ongoing work exploits the syntenic relationships between these genes, the duplicated ones, and their orthologs in the outgroups.

## ACKNOWLEDGEMENTS

We thank Ken Wolfe, Kevin Byrne and Jonathan Gordon for encouragement and for valuable information. We also thank Howard Bussey, Eric Tannier and Robert Warren for helpful discussions. D.S. holds the Canada Research Chair in Mathematical Genomics.

*Funding:* Research supported in part by a grant to D.S. from the Natural Sciences and Engineering Research Council of Canada (NSERC).

*Conflict of Interest:* none declared.

## Footnotes

^{1}Sequence analysis tools for dating duplication events are not pertinent to this problem since all pairs of duplicates in the doubled genome were generated at the same historical moment.

^{2}All cereals underwent earlier WGD event(s), but the effects of these can be filtered out on the basis of greater sequence divergence.

See Yancopoulos *et al.* (2005) for a more general inventory.

## REFERENCES

- Bergeron A, et al. A unifying view of genome rearrangements. In. In: Bücher P, Moret BME, editors. Algorithms in Bioinformatics. Proceedings of WABI 2006. Lecture Notes in Computer Science. Vol. 4175. Springer, Berlin: 2006. pp. 163–173.
- Bourque G, Pevzner P. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 2002;12:26–36. [PMC free article] [PubMed]
- Bryant D. Technical Report CRM–2579. Montreal, Canada: Centre de recherches mathématiques, Université de Montréal; 1998. The complexity of the breakpoint median problem.
- Byrne KP, Wolfe KH. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res. 2005;15:1456–61. [PMC free article] [PubMed]
- Caprara A. The reversals median problem. INFORMS J. Comput. 2003;15:93–113.
- Choi V, et al. Algorithms for the extraction of synteny blocks from comparative maps. In. In: Giancarlo R, Hannenhalli S, editors. Proceedings of the WABI 2007 Workshop on Algorithms in Bioinformatics. Lecture Notes in Bioinformatics 4645. Heidelberg: Springer; 2007. pp. 277–288.
- Dujon B, et al. Genome evolution in yeasts. Nature. 2004;430:35–44. [PubMed]
- El-Mabrouk N, Sankoff D. The reconstruction of doubled genomes. SIAM J. Comput. 2003;32:754–92.
- El-Mabrouk N, et al. Reconstructing the pre-doubling genome. In. In: Istrail S, Pevzner P, Waterman M, editors. Proceedings of the Third Annual International Conference on Computational Molecular Biology (RECOMB 99) New York: ACM Press; 1999. pp. 154–163.
- Goffeau A, et al. Life with 6000 genes. Science. 1996;275:1051–1052.
- Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53:325–328.
- GRAPPA (Genome Rearrangements Analysis under Parsimony and Other Phylogenetic Algorithms.) [(Date last accessed May 2008)]; Available at http://www.cs.unm.edu/~moret/GRAPPA/
- Kurtzman CP, Robnett CJ. Phylogenetic relationships among yeasts of the ‘Saccharomyces complex’ determined from multigene sequence analyses. FEMS Yeast Res. 2003;3:417–32. [PubMed]
- Pe'er I, Shamir R. The median problems for breakpoints are NP- complete. [Date last accessed May 2008];Electronic Colloquium on Computational Complexity Technical Report 98-071. 1998 Available at http://www.eccc.uni-trier.de/eccc.
- R Development Core Team. R: A language and environment for statistical computing. [Date last accessed May 2008];R Foundation for Statistical Computing. 2007 Available at http://www.R-project.org.
- Sankoff D, et al. Polyploids, genome halving and phylogeny. Bioinformatics. 2007;23:i433–i439. [PubMed]
- Tesler G. Efficient algorithms for multichromosomal genome rearrangements. J. Comput. Syst. Sci. 2002;65:587–609.
- Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997;387:708–713. [PubMed]
- Yancopoulos S, et al. Efficient sorting of genomic permutations by translocation, inversion, and block interchange. Bioinformatics. 2005;21:3340–3346. [PubMed]
- Zheng C, et al. Genome halving with an outgroup. Evol. Bioinform. 2006;2:319–326. [PMC free article] [PubMed]
- Zheng C, et al. Removing noise and ambiguities from comparative maps in rearrangement analysis. Trans. Comput. Biol. Bioinform. 2007a;4:515–522. [PubMed]
- Zheng C, et al. Parts of the problem of polyploids in rearrangement phylogeny. In. In: Tesler G, Durand D, editors. Proceedings of the RECOMB 2007 Workshop on Comparative Genomics. Lecture Notes in Computer Science 4751. New York: Springer; 2007b. pp. 162–176.

**Oxford University Press**

## Formats:

- Article |
- PubReader |
- ePub (beta) |
- PDF (351K) |
- Citation

- Genome halving with an outgroup.[Evol Bioinform Online. 2007]
*Zheng C, Zhu Q, Sankoff D.**Evol Bioinform Online. 2007 Feb 21; 2:295-302. Epub 2007 Feb 21.* - Gene order evolution and paleopolyploidy in hemiascomycete yeasts.[Proc Natl Acad Sci U S A. 2002]
*Wong S, Butler G, Wolfe KH.**Proc Natl Acad Sci U S A. 2002 Jul 9; 99(14):9272-7. Epub 2002 Jul 1.* - Theoretical and practical advances in genome halving.[Bioinformatics. 2005]
*Yin P, Hartemink AJ.**Bioinformatics. 2005 Apr 1; 21(7):869-79. Epub 2004 Oct 28.* - Yeast genome evolution--the origin of the species.[Yeast. 2007]
*Scannell DR, Butler G, Wolfe KH.**Yeast. 2007 Nov; 24(11):929-42.* - Inferring ancestral gene order.[Methods Mol Biol. 2008]
*Catchen JM, Conery JS, Postlethwait JH.**Methods Mol Biol. 2008; 452:365-83.*

- On the Complexity of Rearrangement Problems under the Breakpoint Distance[Journal of Computational Biology. 2014]
*Kováč J.**Journal of Computational Biology. 2014 Jan 1; 21(1)1-15* - A flexible ancestral genome reconstruction method based on gapped adjacencies[BMC Bioinformatics. ]
*Gagnon Y, Blanchette M, El-Mabrouk N.**BMC Bioinformatics. 13(Suppl 19)S4* - Fractionation, rearrangement and subgenome dominance[Bioinformatics. 2012]
*Sankoff D, Zheng C.**Bioinformatics. 2012 Sep 15; 28(18)i402-i408* - Multichromosomal median and halving problems under different genomic distances[BMC Bioinformatics. ]
*Tannier E, Zheng C, Sankoff D.**BMC Bioinformatics. 10120* - Reconstructing the History of Yeast Genomes[PLoS Genetics. 2009]
*Sankoff D.**PLoS Genetics. 2009 May; 5(5)e1000483*

- Guided genome halving: hardness, heuristics and the history of the Hemiascomycet...Guided genome halving: hardness, heuristics and the history of the HemiascomycetesBioinformatics. 2008 Jul 1; 24(13)i96

Your browsing activity is empty.

Activity recording is turned off.

See more...