![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||
Copyright © 2008 L. Hamel et al. Unsupervised Learning in Detection of Gene Transfer 1Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI 02881, USA 2Department of Molecular and Cell Biology, College of Liberal Arts and Sciences, University of Connecticut, CT 06269, USA 3Department of Biochemistry and Molecular Biology, Dalhousie University, Halifax, NS B3H 1X5, Canada *L. Hamel: Email: hamel/at/cs.uri.edu Recommended by Daniel Howard Received September 13, 2007; Accepted December 14, 2007. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract The tree representation as a model for organismal evolution has been in use since before Darwin. However, with the recent unprecedented access to biomolecular data, it has been discovered that, especially in the microbial world, individual genes making up the genome of an organism give rise to different and sometimes conflicting evolutionary tree topologies. This discovery calls into question the notion of a single evolutionary tree for an organism and gives rise to the notion of an evolutionary consensus tree based on the evolutionary patterns of the majority of genes in a genome embedded in a network of gene histories. Here, we discuss an approach to the analysis of genomic data of multiple genomes using bipartition spectral analysis and unsupervised learning. An interesting observation is that genes within genomes that have evolutionary tree topologies, which are in substantial conflict with the evolutionary consensus tree of an organism, point to possible horizontal gene transfer events which often delineate significant evolutionary events. 1. INTRODUCTION Evolutionary history of species is now inferred from the evolutionary
histories of their genomes. Genomes can be viewed as a collection of genes and
whole genome evolution is concluded from the evolution of individual genes. If
the majority of genes followed the same evolutionary history, supertree
approaches can be used to calculate a majority consensus tree. However, evolutionary trees of individual
genes can differ from the majority [1], and in this case, the consensus tree is
embedded in a network represented by the histories of the different genes. Evolutionary tree topologies of genes that
conflict with the consensus tree are strong indicators of horizontal gene
transfer events. Given this, it is clear that organismal evolution cannot be
inferred from studying the evolution of just a few genes but must be inferred
from studying as many (orthologous) genes as possible. To construct and evaluate an evolutionary consensus tree based on
multiple genes for a set of genomes, it is advisable to construct all possible
evolutionary tree topologies for these genomes and measure the support of each
topology by the (orthologous) genes within the genomes. Unfortunately,
evaluating all possible tree topologies is computationally intractable for any
but a very small set of genomes, since the number of possible tree topologies
grows factorially with the number of participating genomes. An approach based
on the spectral analysis of genomic data using bipartitions [2, 3] allows the
inference of consensus trees from smaller quanta of phylogenetic information,
side stepping some of the difficult computational issues. Table 1 shows the
number of possible trees versus the number of possible bipartitions given a
fixed set of genomes. With n taxa there are (2n − 5)!/[2(n−3) (n − 3)!] different
unrooted tree topologies. The number of possible nontrivial bipartitions for n
taxa is given by the formula 2(n−1) − n − 1, and it grows much slower with an
increasing number of species than the number of different trees. We refer to
the approach based on bipartitions as spectral genome analysis.
It is worth noting that when a single tree is calculated from the
combination of all genes, including genes that were horizontally transferred,
the topology of the resulting tree might not represent the plurality of gene
histories. Therefore, a detailed
analysis of the evolutionary histories of the participating genes is of
interest. The techniques outlined here
support this kind of analysis. In spectral genome analysis, each set of orthologous genes (a gene
family) is associated with a particular set of bipartitions (its spectrum) that define its evolutionary tree. Thus, we can envision a gene family as a point in
the space spanned by all possible bipartitions of a set of genomes. Here, we apply unsupervised learning in the
form of self-organizing maps [4] to this space and obtain a visual
representation of clusters of gene families with similar spectra. The spectra
of the gene families within a particular cluster allow us to infer the
consensus tree for that cluster. It is now possible to investigate whether the
consensus tree topologies of the clusters are compatible or conflicting with
the overall consensus tree. If a cluster of gene families is discovered that
conflicts with the consensus tree topology, then this is a strong indication
for a horizontal gene transfer event. The advantage of this approach is that we
not only see a distinction between consensus and conflicting trees, but that we
can detect trends of agreement between the conflicting genes. This additional
insight might provide biological clues as to the nature of the origin of these
genes. Unsupervised
learning has been used in genomic analyses before (e.g., [5]). However, our
approach seems to be novel in that we do not apply unsupervised learning
directly to DNA sequence data but instead analyze the much more abstract
representation of the genomic data in the form of bipartitions. We have
constructed a web service called Gene Phylogeny eXplorer (GPX, http://bioinformatics.cs.uri.edu/gpx)
that supports spectral genome analysis [6]. 2. MATERIALS AND METHODS 2.1. Spectral analysis of evolutionary trees Given n entities, there are 2n−1 − 1 different ways to
assign the entities to two different nonempty sets. That is, there are 2n−1 − 1 different bipartitions of n entities including trivial bipartitions. An
(unrooted) tree can be viewed as a model of the evolutionary relationships
between n entities or taxa such as species, genes, molecules, and so forth.
Trees and bipartitions are related as follows.
Each edge in a tree can be seen as dividing the tree into a bipartition:
the leaf nodes that can be reached from one end of the edge form one set of
taxa and the leaf nodes that can be reached from the other end of the edge form
the other set of taxa. A binary tree
with n leaf nodes has exactly 2n − 3 edges.
Thus, an evolutionary tree relating n taxa gives rise to 2n − 3
bipartitions. It is easy to see that 2n − 3 < 2n−1 − 1, that is, the number of bipartitions defined by an evolutionary
binary tree of n taxa is much smaller than the number of possible bipartitions
of n entities. Trivial bipartitions, which is bipartitions where one
of the partitions is a singleton set, do not contain any phylogenetic
information. Thus, given n entities, there are 2(n−1) − n − 1 different nontrivial
bipartitions. However, in an unrooted binary tree with n leaf nodes there are n − 3
interior edges and therefore n − 3nontrivial bipartitions. An interior edge is
an edge that is not incident to a leaf node of a tree. It is evident that n − 3 < 2n−1 − n − 1, that is, the number of nontrivial bipartitions generated by a
tree is much smaller than the number of possible nontrivial bipartitions. Let tn be an evolutionary tree over n taxa, then we
define the bipartitions of tn as the spectrum of tn, denoted as S (tn). It is
convenient to adopt a vector notation for the spectrum S (tn) = (b1,…,bn2−1) = (0, 1, 1, 0,…,0, 0), where bk denotes bipartition kwith 1 < k < 2n−1 − 1.
Here, bk = 1 if the spectrum of the tree includes bipartition bk and bk = 0 otherwise. Note that the vector notation is a
representation over all possible bipartitions. Given this, we can now refer to
a bipartition space and we can readily see that a spectrum of a particular
evolutionary tree tn represents the coordinates of a point in that space. In
our case, where the tree represents the evolutionary relationship between
orthologous genes in n genomes, we often refer to the spectrum as the gene
family spectrum and therefore a gene family is denoted by a point in bipartition
space. Figure 1(a)
By further generalizing and interpreting the values in the spectrum
vectors as arbitrary real numbers, as we will do in what follows when we assign confidence values
to bipartitions, a bipartition space can be viewed as a 2n−1 − 1 dimensional real
vector space. An interesting consequence of this is that we can now measure the
difference between spectra as the Euclidean distance between the two corresponding
spectrum points in a bipartition space.
Let t1, t2, and t3 be three different evolutionary trees of n taxa and
let S (t1), S (t2) and S (t3) be the respective spectra, then we say that S (t2)
is more similar to S (t1) than S (t3) if ||S (t1) − S (t2)|| < ||S (t1) − S (t3)||,
here the operator || || denotes the Euclidean distance between two
points in bipartition space.2.2. Representation of bipartitions Let A be a set of n elements, and b is a bipartition
defined on a set A. Each bipartition b splits a set A into two subsets m and
its complement mC, such that A = m mC.We say that two bipartitions are compatible if there
exists a tree whose spectrum includes both bipartitions. We say that two
bipartitions are conflicting if they cannot appear in the same spectrum. In set
notation, two bipartitions are compatible if a set (either m or mC) of one
bipartition is a subset of one of the sets of the second bipartition; or, in
other words, bipartitions b1 and b2 are compatible if and only if one of four
possible conditions is satisfied:
To handle bipartitions computationally in an efficient
way, we can represent them effectively as binary masks. Figure 2(a)
Given our binary representation of bipartitions, there
is a simple computation to test for compatibility between bipartitions. We say that two bipartitions are compatible
if the following returns true:
” operator represents the bitwise OR operation, the “~” operator represents
the bitwise negation, the “||” operator represents the logical OR operation,
and “= =” represents
the bitwise equality operator. Given the
two masks from Figures 1(b)
On the other hand, the bipartitions 11001 and 10011
are conflicting. 2.3. Consensus trees It is customary to compute confidence values for the
edges in an evolutionary tree via bootstrapping [7]. The computed tree represents a consensus tree
over the bootstrap samples. The confidence values are typically chosen between
0 and 100. With this, a bipartition
derived from a particular edge in the bootstrap consensus tree inherits the
confidence value of that edge. This
allows us to refine our spectrum vector notation, for example, S (tn) = (0, 67, 85, 0,…,15, 0) ,
where tn is now a bootstrapped consensus tree and the values in the vector
represent the confidence values for the individual bipartitions. Figure 3(a)
By computing a consensus tree on the bootstrap samples, it is possible
to introduce biases due to the fact that phylogenies that do not agree with the
plurality are suppressed. This is particularly critical in our case where the
biases of this kind of computation might compound during an analysis. A
different approach that avoids computing a consensus tree too early in an
analysis is by taking advantage of the spectra of the bootstrap samples. Before
we can describe this construction, we need to define what we mean by an average
spectrum. Given m spectra, S1,…,Sm, in a bipartition space of n taxa, we define
the average spectrum Sa as
The summation of spectra is
well defined as vector additions in bipartition space and the multiplication of
a scalar and a vector simply scales the components of the vector. The bootstrap approach can be
summarized as follows.
In step 3, we could multiply the average spectrum by
100 to make it compatible with the traditional bootstrap confidence values. A
consequence of this approach is that the average spectrum is no longer
guaranteed to represent a phylogenetic tree due to possible bipartition
conflicts and this represents an extension of our definition of spectrum above that did not admit
any conflicts. However, even in this extended definition of a spectrum we can
retrieve a consensus tree from the average spectrum Sa as follows:
An interesting application of this is the construction
of a consensus tree of multiple spectra in a bipartition space. If we interpret
the spectra S1,…,Sm as a cluster in bipartition space, then the average spectrum
can be viewed as the centroid of that cluster. The following constructs a centroid
consensus tree of m given spectra, S1,…,Sm.
Note that this is essentially the same algorithm as above with the
exception that the spectra, S1,…, Sm are not bootstrapped samples but arbitrary points in some
bipartition space. 2.4. Unsupervised learning in bipartition space Self-organizing maps [4] were introduced by Kohonen in 1982 and can be
viewed as tools to visualize structure in high-dimensional data. Self-organizing maps are considered members
of the class of unsupervised machine learning algorithms, since they do not
require a predefined concept but will learn the structure of a target domain
without supervision. Typically, a self-organizing map consists of a rectangular grid of
processing units. Multidimensional observations are represented as vectors.
Each processing unit in the self-organizing map also consists of a vector
called a reference vector or reference model. In our case, the multidimensional
observations are spectra, where the number of possible bipartitions given n
taxa governs the dimensions of the spectra. The dimensions of processing
elements of the map match the dimensionality of the observations. The goal of the map is to assign values to the reference models on the
map in such a way that all observations can be represented on the map with the
smallest possible error. However, the map is constructed under constraints in
the sense that the reference models cannot take on arbitrary values but are
subject to a smoothing function called the neighborhood function. During training the values of the reference
models on the map become ordered so that similar reference models are close to
each other on the map and dissimilar ones are further apart from each
other. This implies that similar
observations will be mapped to similar regions on the map. Often reference
models are referred to as centroids since they typically describe regions of
observations with large similarities. The
training of the map is carried out by a sequential process, where t = 1, 2,…
is the step index. For each observation x(t)
at time t, we first identify the index c of some reference model which
represents the best match in terms of Euclidean distance by the condition
Here, the index i ranges over all reference models on the map. The quantity mi(t) refers
to the reference model at position i on the map at time step t. Next, all reference models on the map are
updated with the following regression rule where model index c is the reference
model index as computed above:
Here, hci
is the neighborhood function that is defined as follows:
An advantage of self-organizing maps is that they have an appealing
visual representation. That is, the
structure of the input domain is graphically represented as a 2-dimensional
map. Figure 4
Each square in the map represents a reference model. The shading of the
map represents the level of quantization or mapping error for the map. Light
shading represents a small quantization error; that is, the reference models in
those areas match the observations very closely. Dark shading represents a
large quantization error; that is, there is a poor match between reference
models and observations. Contiguous areas of low quantization error represent
clusters of similar entities. Figure 4 Here, we
make use of this ability of self-organizing maps to visualize high-dimensional
spaces in order to visualize similarities and dissimilarities of
high-dimensional tree spectra. We would
expect points in bipartition space that represent similar spectra to map close
together on the visualization and vice versa.
Once we have identified clusters of spectra, we can proceed to compute
consensus trees for those clusters.
Furthermore, we can now compare the trees calculated from individual
clusters to the overall consensus tree, and we can investigate whether there
exists substantial conflict between the bipartitions of various clusters. Furthermore,
the clusters that result from this unsupervised learning allow the biologist to
detect trends in the evolutionary histories of the participating genes which
might provide insight into events such as horizontal transfers of individual
genes or whole metabolic pathways. The fact that the spectra of individual gene
families can be visualized as consensus trees and that it is possible to
compute the average of several selected spectra and the corresponding majority
consensus tree on the fly distinguishes
our approach from other spectral approaches (e.g., [3, 8]). 2.5. The construction of gene families One of the insights of recent evolutionary biology is that
it is not sufficient to use one or a few genes to infer phylogenetic
relationships among species. Therefore, we propose to use as many genes as
possible in our analysis based on the notion of a gene family. A gene family is
a collection of genes from different genomes that are related to each other and
share a common ancestor. In general, a gene family may include both orthologs
and paralogs [9]. Here, we consider only
sets of putatively orthologous genes where each species contributes only one
gene into a family. The evolutionary history of an individual gene family is a
phylogenetic tree. We select common gene families based on reciprocal
best BLAST [10] hit criteria [11] with relaxation (see below). The reciprocal
best BLAST hit method requires strong conservative relationships among the
orthologs so that if a gene from species 1 selects a gene from species 2 as the best hit when
performing a BLAST search with genome 1 against genome 2, then the gene 2 must
in turn select gene 1 as the best hit when genome 2 is searched against genome
1. The requirement of reciprocity is
very strict and often fails in the presence of paralogs. To select more
orthologous sets, we relax the criteria of strict reciprocity by allowing a
fixed number of broken connections. The gene families are aligned with Clustalw version
1.83 using default parameters [12]. For each family, 100 bootstrapped
replicates are generated and evaluated with the Phyml program [13] using the
JTT model, four relative substitution rate categories, and an estimated shape
parameter for the gamma distribution describing among site rate variation. All 100 generated trees are split into their corresponding bipartition
spectra and corresponding bootstrap support values are assigned to each
bipartition by calculating how many times each bipartition is present in a
family (the bootstrap procedure discussed in detail above). The result of these
calculations is a spectrum for each gene family. Observe that trees calculated
from individual bootstrap samples contain edges that are not part of a majority
consensus tree, that is, the spectrum for a gene family can contain
bipartitions that conflict with other bipartitions in the spectrum. For our purposes, this is important since it
prevents information loss and avoids bias during our analyses. We can now use the machinery developed above to
investigate the consensus tree of the collection of gene families and whether
there exist spectra that have a significant conflict with the overall consensus
tree. 3. APPLICATION OF GPX GPX, a tool based on the techniques developed above
supports an active, investigation-style analysis where the user can interact
with the visualization. The user is able
to select centroids on the map and investigate consensus trees and conflicting
bipartitions in the respective spectra.
A detailed description of an experiment using GPX appears in [6]. In a first experiment, we analyzed 123 gene
families of 14 archaea species. We found that sets of gene families exhibited
substantial conflict with the overall organismal consensus tree corroborating
findings of frequent gene transfers between organisms sharing the same or
similar ecological niches [14, 15]. In
the consensus over all 123 gene families, the representative of the Methanosarcinales
(Methanosarcina acetivorans) grouped with the Haloarchaea (Haloarcula marismortui and Halobacterium salinarum) as expected from the analysis of
ribosomal RNAs and enzymes involved in transcription and translation [16, 17]. Two clusters of gene families were recognized
that strongly supported a conflicting bipartiton that places the homolog from Methanosarcina
acetivorans with Archaeoglobus fulgidus. For one of these clusters, the
relationships among the other archaea remained otherwise compatible with the
consensus, suggesting gene transfer events between the ancestors of Methanosarcina
and Archaeoglobus. However, in case of
the second cluster formed by a single gene family, prolyl tRNA synthetases
(prolylRS), the Haloarchaea grouped at the base of the euryarchaeota. This placement suggests that the ancestor of
the Haloarchaea might have acquired this enzyme from outside the archaeal
domain, a finding that was corraborated through more detailed phylogenetic
analysis (Gogarten, unpublished). While
the haloarchaeal prolylRS are more similar to bacterial than to archaeal
homologs, database searches did not identify any sequence from an extant
organism that is specifically related to the haloarchaeal prolyl tRNA
synthetases. The donor of the haloarchaeal
prolylRS is not a member of any of the bacterial or archaeal phyla that have
prolylRS sequences in the current nonredundant or environmental databases;
possibly the lineage that donated this enzyme has gone extinct as a distinct
lineage, and only those genes that were donated to other lineages in the past
survived into the presence [18]. These results were obtained by means of an
originally developed interactive tool [6], which combines computationally
expensive analysis of complex data with convenient visual representation of
phylogenetic information. 4. CONCLUSIONS We developed a comparative genomic analysis technique
based on bipartition spectra and unsupervised learning. We have incorporated the techniques developed
here into a web-based tool and have used this tool successfully in a set of
analyses. The tool allows the user to reconstruct the evolutionary history
shared by the plurality of gene family histories present within a collection of
genomes; gene families with histories that are in conflict with the plurality
are detected, and families which share conflicting histories can be recognized,
thereby facilitating the discovery of major “highways of gene
sharing“ [15]. Bipartition spectrum analysis is not restricted to the
SOM algorithm, other clustering algorithms, such as principal component analysis
(PCA) [19] and local linear embedding (LLE) [20], can be applied to the
analysis of large data sets. A new algorithm, generative topographic mapping (GTM)
[21], displays maps similar to SOM but uses an expectation maximization (EM)
algorithm instead of relying on neural network convergence. An alternative-to-traditional
PCA is kernel PCA [22]. This algorithm is based on support vector machines,
which allows it to easily deal with very wide datasets. ISOMAP [23] is an
algorithm similar to LLE but distinguishes itself from LLE in that there is no
need to solve a set of linear equations. To make comparative genomic studies a
reality, we need to be able to include large numbers of genomes. This implies that we need to be able to
handle large amounts of data. Future
efforts will revolve around scaling up methodologies to include as many species
as possible and testing different clustering algorithms for extraction of
important phylogenetic information. Acknowledgments This work was in part supported through the NASA Applied Information System Research Program (NNG04GP90G). O. Zhaxybayeva was supported through a Postdoctoral Fellowship from the Canadian Institute of Health Research.
References 1. Gogarten JP, Doolittle WF, Lawrence JG. Prokaryotic evolution in light of gene transfer. Molecular Biology and Evolution. 2002;19(12):2226–2238. [PubMed] 2. Hendy MD, Penny D. Spectral analysis of phylogenetic data. Journal of Classification. 1993;10:5–24. 3. Zhaxybayeva O, Lapierre P, Gogarten JP. Genome mosaicism and organismal lineages. Trends in Genetics. 2004;20(5):254–260. [PubMed] 4. Kohonen T. Self-Organizing Maps. 3rd edition. Berlin, Germany: Springer; 2001. 5. Abe T, Sugawara H, Kinouchi M, Kanaya S, Ikemura T. Novel phylogenetic studies of genomic sequence fragments derived from uncultured microbe mixtures in environmental and clinical samples. DNA Research. 2005;12(5):281–290. [PubMed] 6. Nahar N, Poptsova MS, Hamel L, Gogarten JP. GPX: a tool for the exploration and visualization of genome evolution. In Proceedings of the 7th IEEE International Symposium on Bioinformatics & Bioengineering (BIBE '07); October 2007; Boston, Mass, USA. pp. 1338–1342. 7. Felsenstein J. Confidence limits on phylogenies: an approach using the bootstrap. Evolution. 1985;39(4):783–791. 8. Lento GM, Hickson RE, Chambers GK, Penny D. Use of spectral analysis to test hypotheses on the origin of pinnipeds. Molecular Biology and Evolution. 1995;12(1):28–52. [PubMed] 9. Fitch WM. Homology: a personal view on some of the problems. Trends in Genetics. 2000;16(5):227–231. [PubMed] 10. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of Molecular Biology. 1990;215:403–410. [PubMed] 11. Zhaxybayeva O, Gogarten JP. Boostrap, Bayesian probability and maximum likelihood mapping: exploring new tools for comparative genome analyses. BMC Genomics. 2002;3:4. [PubMed] 12. Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research. 1994;22(22):4673–4680. [PubMed] 13. Guindon S, Gascuel O. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology. 2003;52(5):696–704. [PubMed] 14. Jain R, Rivera MC, Moore JE, Lake JA. Horizontal gene transfer accelerates genome innovation and evolution. Molecular Biology and Evolution. 2003;20(10):1598–1602. [PubMed] 15. Beiko RG, Harlow TJ, Ragan MA. Highways of gene sharing in prokaryotes. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(40):14332–14337. [PubMed] 16. Brochier C, Forterre P, Gribaldo S. An emerging phylogenetic core of Archaea: phylogenies of transcription and translation machineries converge following addition of new genome sequences. BMC Evolutionary Biology. 2005;5:36. [PubMed] 17. Woese CR. Bacterial evolution. Microbiology and Molecular Biology Reviews. 1987;51:221–271. 18. Zhaxybayeva O, Gogarten JP. Cladogenesis, coalescence and the evolution of the three domains of life. Trends in Genetics. 2004;20(4):182–187. [PubMed] 19. Jolliffe I. Principal Component Analysis. Berlin, Germany: Springer; 2002. 20. Saul LK, Roweis ST. Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research. 2004;4(2):119–155. 21. Bishop CM, Svensen M, Williams CKI. GTM: the generative topographic mapping. Neural Computation. 1998;10(1):215–234. 22. Scholkopf B, Smola A, Muller KR. Advances in Kernel Methods-Support Vector Learning. Cambridge, Mass, USA: MIT press; 1999. Kernel principal component analysis; pp. 327–352. 23. Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–2323. [PubMed] |
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||
Mol Biol Evol. 2002 Dec; 19(12):2226-38.
[Mol Biol Evol. 2002]Trends Genet. 2004 May; 20(5):254-60.
[Trends Genet. 2004]DNA Res. 2005; 12(5):281-90.
[DNA Res. 2005]Trends Genet. 2004 May; 20(5):254-60.
[Trends Genet. 2004]Mol Biol Evol. 1995 Jan; 12(1):28-52.
[Mol Biol Evol. 1995]Trends Genet. 2000 May; 16(5):227-31.
[Trends Genet. 2000]J Mol Biol. 1990 Oct 5; 215(3):403-10.
[J Mol Biol. 1990]BMC Genomics. 2002; 3(1):4.
[BMC Genomics. 2002]Nucleic Acids Res. 1994 Nov 11; 22(22):4673-80.
[Nucleic Acids Res. 1994]Syst Biol. 2003 Oct; 52(5):696-704.
[Syst Biol. 2003]Mol Biol Evol. 2003 Oct; 20(10):1598-602.
[Mol Biol Evol. 2003]Proc Natl Acad Sci U S A. 2005 Oct 4; 102(40):14332-7.
[Proc Natl Acad Sci U S A. 2005]BMC Evol Biol. 2005 Jun 2; 5(1):36.
[BMC Evol Biol. 2005]Trends Genet. 2004 Apr; 20(4):182-7.
[Trends Genet. 2004]Proc Natl Acad Sci U S A. 2005 Oct 4; 102(40):14332-7.
[Proc Natl Acad Sci U S A. 2005]Science. 2000 Dec 22; 290(5500):2319-23.
[Science. 2000]