Copyright © 2006, Cold Spring Harbor Laboratory Press

Phylogenomic analysis reveals bees and wasps (Hymenoptera) at the base of the radiation of Holometabolous insects

1Abteilung für Evolutionsgenetik, Institut für Genetik, Universität zu Köln, Köln 50674, Germany; 2Human Genome Sequencing Centre, Baylor College of Medicine, Houston, Texas 77002, USA; 3Department of Biology, University of Rochester, New York 14627, USA; 4The Institute for Genomic Research, Rockville, Maryland 20850, USA; 5Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, United Kingdom; 6European Molecular Biology Laboratory, 69012 Heidelberg, Germany

7Corresponding author. E-mail m.j.lercher/at/bath.ac.uk; fax 44-1225-386779.

Received November 3, 2005; Accepted April 3, 2006.

Freely available online through the Genome Research Open Access option.

Abstract

Comparative studies require knowledge of the evolutionary relationships between taxa. However, neither morphological nor paleontological data have been able to unequivocally resolve the major groups of holometabolous insects so far. Here, we utilize emerging genome projects to assemble and analyze a data set of 185 nuclear genes, resulting in a fully resolved phylogeny of the major insect model species. Contrary to the most widely accepted phylogenetic hypothesis, bees and wasps (Hymenoptera) are basal to the other major holometabolous orders, beetles (Coleoptera), moths (Lepidoptera), and flies (Diptera). We validate our results by meticulous examination of potential confounding factors. Phylogenomic approaches are thus able to resolve long-standing questions about the phylogeny of insects.

The four major orders of holometabolous insects (Hymenoptera, Coleoptera, Lepidoptera, and Diptera) encompass over 45% of all known animal species (Hammond 1992).
While analyses based on morphological characters (Kristensen 1999), individual molecular markers (such as ribosomal RNA; Whiting 2002b), or mitochondrial DNA sequences (Castro and Dowton 2005) have confirmed the monophyly of these orders, they have been unable to elucidate most of the interordinal relationships with sufficient confidence. A close relationship between Diptera (flies) and Lepidoptera (moths) within the long-recognized Mecopterida assemblage is generally recovered. However, the affinities of Coleoptera (beetles) and more particularly of Hymenoptera (wasps and bees) (Castro and Dowton 2005) remain elusive. In the most widely accepted phylogenetic hypothesis (Kristensen 1999; Whiting 2002b), preference is given to a sister-group relationship between Hymenoptera and Mecopterida, while Coleoptera are placed at a more basal position as a sister group to the Neuropterida, another long-recognized assemblage.

To resolve the phylogenetic relationships of the major holometabolous orders, we adopt a phylogenomic approach, utilizing a large number of nuclear genes to maximize phylogenetic signal over noise (Eisen and Fraser 2003; Rokas et al. 2003; Delsuc et al. 2005; DeSalle 2005; Philippe et al. 2005a). Such approaches, based on the simultaneous analysis of a large number of nuclear genes, have already been shown to be a promising route to understanding deep metazoan relationships (Dopazo and Dopazo 2005; Philippe et al. 2005b). Here, we demonstrate that these methods are also able to resolve long-standing questions about the phylogeny of insects. Using EST sequences to obtain phylogenomic data sets has proven fruitful, e.g., in the analysis of Eukaryota (Philippe et al. 2004), Amoebae (Bapteste et al. 2002), and Coleoptera relationships (Hughes et al. 2006). The use of EST sequences in phylogenomic studies of insects was suggested earlier (Theodorides et al. 2002), but sufficient data to answer the questions addressed here have only recently become available.
Our analysis focuses on six holometabolous model species for which large-scale sequencing projects are available or in progress. These encompass two dipterans (the fruit fly Drosophila melanogaster and the mosquito Anopheles gambiae), one lepidopteran (the silk moth Bombyx mori), one coleopteran (the flour beetle Tribolium castaneum), and two hymenopterans (the honey bee Apis mellifera, and the sibling parasitic wasp species Nasonia vitripennis and Nasonia giraulti). We further include one orthopteran (the grasshopper Locusta migratoria) and one hemipteran (the pea aphid Acyrthosiphon pisum), both of which are uncontested outgroups to the holometabolous insects based on morphological and molecular markers (Boudreaux 1979; Hennig 1981; Kristensen 1991; Wheeler et al. 2001).

Results

Candidate orthologous clusters were assembled from known or predicted genes based on a stringent sequence similarity criterion, and were then manually curated to ensure orthology (see Methods). After removing ambiguously aligned regions, we assembled the remaining sequences into a concatenated alignment of 33,809 amino acid positions from 185 nuclear genes. As expected, most genes included here perform housekeeping functions (see Table S1 of the Supplemental information for a list of genes). This data set supported the topology in Figure 1
Two common sources of error in phylogenetic reconstructions are compositional biases (Foster and Hickey 1999) and long-branch attraction (Felsenstein 1978). Two of our species, the pea aphid Acyrthosiphon pisum and the honey bee Apis mellifera, have strongly AT-biased genomes. This is reflected in an overrepresentation of amino acids encoded by AT-rich codons (Table 2), confirmed by statistical analysis (pairwise χ² tests, Table 3). Removing the outlier species Acyrthosiphon pisum and Apis mellifera from the data set resulted in the same well-supported topology (Table 1).
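The pairwise composition test can be illustrated with a short Python sketch. It follows the statistic given in the Methods, χ²(m,n) = Σᵢ (f_mi − f_ni)² / (f_mi + f_ni), restricted to the ten amino acids FYMINK/GARP (hence 9 degrees of freedom). The function name and the toy sequences are assumptions made for illustration; this is not the authors' Perl implementation. The raw-count form assumes both sequences have the same length, as they do in a gap-free concatenated alignment.

```python
from collections import Counter

# Amino acids biased in the GC content of their codons, as listed in the Methods:
# FYMINK (AT-rich codons) and GARP (GC-rich codons).
BIASED_AMINO_ACIDS = "FYMINKGARP"


def composition_chi2(seq_m, seq_n, alphabet=BIASED_AMINO_ACIDS):
    """Return sum_i (f_mi - f_ni)^2 / (f_mi + f_ni) over the given amino acids.

    seq_m and seq_n are the concatenated, gap-free aligned sequences of two
    species (hypothetical inputs; equal length is assumed).
    """
    counts_m = Counter(seq_m)
    counts_n = Counter(seq_n)
    chi2 = 0.0
    for amino_acid in alphabet:
        f_m = counts_m[amino_acid]
        f_n = counts_n[amino_acid]
        if f_m + f_n > 0:  # skip amino acids absent from both sequences
            chi2 += (f_m - f_n) ** 2 / (f_m + f_n)
    return chi2


if __name__ == "__main__":
    # Toy example: an "AT-rich" composition against a "GC-rich" one.
    print(composition_chi2("FFKKNNGA", "GGAAPPFK"))
```

Under these assumptions, a large value for a species pair indicates that the two species differ in their use of the codon-biased amino acids. The value would be compared against a χ² distribution with 9 degrees of freedom.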
Because long branches in the ingroup are restricted to Diptera and Lepidoptera, whose relative positions are uncontested (Kristensen 1999; Whiting 2002b), long-branch attraction is unlikely to have influenced the topology. However, even though the branches of hymenopterans and Tribolium are short, it is still conceivable that genes with particular substitution rates in these species may have biased the phylogeny. Exclusion of such genes on the basis of a relative rate test (Tajima 1993) did not change the tree (Table 1). Improper outgroup choice can potentially influence the inferred rooting of the ingroup. While our two outgroup species differ profoundly in evolutionary rate and amino acid composition, the results remained unchanged when using either Locusta migratoria or Acyrthosiphon pisum individually as the outgroup (Table 1).

Discussion

The interordinal relationships among holometabolous insect orders had previously proven to be notoriously difficult to resolve. However, most researchers assumed a basal split between two super-orders, the Coleoptera–Neuropterida (including the beetles) and the Hymenoptera–Mecopterida (including wasps, flies, and moths) (Kristensen 1999; Whiting 2002b). The tree presented in Figure 1

In the present framework, the position of Neuropterida could not be assessed. Neither previous molecular phylogenies nor morphological characters allow settlement of this issue; in particular, wing structure features have been argued to support a sister-group relationship of Neuropterida with either Coleoptera (Hornschemeyer 2002) or Mecopterida (Kukalová-Peck and Lawrence 2004). The morphological characters used to support the traditional Holometabola phylogeny should certainly be reanalyzed in the light of the relationships presented here.

Why was the basal position of hymenopterans not discovered in previous molecular phylogenetic studies?
A plausible explanation is the limited resolving power of single molecules when radiations are old or compressed in time (Rokas et al. 2005). Because the phylogenetic split in question occurred at least 275 million years ago (Mya) (Ponomarenko 2002; Rasnitsyn 2002), analyses based on a single molecule (e.g., 18S rRNA) did not provide sufficient resolution (Whiting 2002a). While 60% of the 185 protein alignments analyzed here were better explained by the tree in Figure 1

Previous studies based on the simultaneous analysis of many proteins also failed to recover the topology in Figure 1

The interordinal relationships presented in Figure 1

Methods

Sequence data
Drosophila melanogaster (Adams et al. 2000), Anopheles gambiae (Holt et al. 2002), and Apis mellifera peptides were obtained from Ensembl (www.ensembl.org). Bombyx mori (Mita et al. 2003), Locusta migratoria (Kang et al. 2004), Tribolium castaneum, and Acyrthosiphon pisum mRNA sequences were downloaded from NCBI (ftp.ncbi.nlm.nih.gov). Nasonia vitripennis and Nasonia giraulti EST data were generated by authors J.H.W. and H.T. All nucleotide data sets were cleaned of vector, mitochondrial, and bacterial contamination using SeqClean (available from www.tigr.org/tdb/tgi/software/) before being assembled into nonredundant contigs with CAP3 using default settings (Huang and Madan 1999). All nucleotide data sets were then searched against all Drosophila melanogaster proteins using BLASTx. The reading frame of the best hit was assumed to be the correct reading frame. We then chose the longest run of peptides uninterrupted by a stop codon as the peptide corresponding to each nucleotide contig.

Identification of orthologs

We performed BLASTp searches of all proteome pairs. Orthologs were selected based on reciprocal best BLAST hits (Tatusov et al. 1997) using an E-value cut-off of 10⁻²⁵. A group of sequences with exactly one member in each species (including either one or both Nasonia species) was accepted as a candidate orthologous family if each sequence had each of the other family sequences as the best BLASTp hit in the respective proteome. This requirement of all-against-all reciprocal best hits is very stringent, and thus gives good confidence in the inferred orthology. Multiple sequence alignments were performed with MUSCLE (Edgar 2004) using default settings. Resulting alignments were then manually curated to ensure completeness and consistency. Poorly conserved families or clusters potentially containing paralogous sequences were discarded. For the sibling Nasonia species, when orthologous sequences were available for both species, the longer one was chosen.
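The all-against-all reciprocal best hit criterion can be made concrete with a minimal Python sketch. The `best_hit` dictionary layout and the function name are hypothetical, and it is assumed that the E-value cut-off has already been applied when the table was built from the BLASTp output. The special handling of the two Nasonia species is left out, and this is not the authors' Perl code.

```python
def reciprocal_best_hit_families(best_hit, species):
    """Return candidate orthologous families with one member per species.

    best_hit maps (species, sequence_id) to a dict {other_species: best_id},
    i.e. the best BLASTp hit of that sequence in each other proteome
    (hypothetical layout; hits failing the E-value cut-off are absent).
    A family is accepted only if every member has every other member as its
    best hit in the respective proteome.
    """
    families = []
    seed_species = species[0]
    for (query_species, query_id), hits in best_hit.items():
        if query_species != seed_species:
            continue
        # Build a candidate family from the seed sequence's best hits.
        family = {seed_species: query_id}
        for other in species[1:]:
            if other in hits:
                family[other] = hits[other]
        if len(family) != len(species):
            continue  # no hit in at least one proteome
        # Require reciprocity for every ordered pair of family members.
        consistent = all(
            best_hit.get((a, family[a]), {}).get(b) == family[b]
            for a in species
            for b in species
            if a != b
        )
        if consistent:
            families.append(family)
    return families


if __name__ == "__main__":
    toy_species = ["fly", "bee", "beetle"]
    toy_hits = {
        ("fly", "f1"): {"bee": "b1", "beetle": "t1"},
        ("bee", "b1"): {"fly": "f1", "beetle": "t1"},
        ("beetle", "t1"): {"fly": "f1", "bee": "b1"},
    }
    print(reciprocal_best_hit_families(toy_hits, toy_species))
```

Checking every ordered pair, rather than only hits to and from a reference species, is what makes the criterion stringent: a single non-reciprocal hit anywhere in the family rejects it.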
Alignments were then purged of unreliably aligned positions as well as gaps with Gblocks (Castresana 2000) using highly stringent block had to be conserved and where blocks smaller than 20 amino acids were discarded. We concatenated the final set of 185 nuclear sequences for phylogenetic analysis, resulting in an eight-species alignment of 33,809 amino acid positions. The list of genes included in our analysis is available as Supplemental Table S1.

Phylogenetic reconstructions

We first analyzed the data in a maximum likelihood framework, using phyML (Guindon and Gascuel 2003) under an empirical model of amino acid substitutions (Jones et al. 1992), allowing for substitution rate variation among sites with a gamma distribution (four rate categories). Branch support values in Table 1 are from analysis of 1000 bootstrap replicates. Using alternative models of amino acid evolution (WAG, Whelan and Goldman 2001; and VT, Bapteste et al. 2002) led to the same well-supported tree. Additionally, we estimated the tree in a Bayesian framework, using MrBayes (Huelsenbeck and Ronquist 2001) and employing the same model of sequence evolution as above. We ran four independent searches, each starting from a random tree and sampling every tenth tree over 100,000 generations. Each run had equilibrated after fewer than 1000 generations; thus, the first 100 trees were disregarded as burn-in. The independent runs consistently resulted in the same topology; posterior probabilities (Table 1) were calculated from all sampled trees across independent runs. Finally, we also analyzed 1000 bootstrap replicates under a maximum parsimony criterion, using PROTPARS from the PHYLIP package (Felsenstein 2004).

Gene bootstrap resampling

To test if the obtained tree was dominated by one or a few disparate genes, we performed maximum likelihood analyses of 1000 bootstrap data sets obtained from the resampling of complete genes (Nei et al. 2001).
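As an illustration of resampling complete genes, the following Python sketch builds one bootstrap data set by drawing whole gene alignments with replacement and concatenating them per species. The data layout (a list of dictionaries mapping species name to aligned sequence) and the function name are assumptions made for this example; the tree inference step that would follow is not shown.

```python
import random


def gene_bootstrap_replicate(gene_alignments, rng):
    """Build one gene-level bootstrap data set.

    gene_alignments is a list with one dict per gene, mapping species name to
    that species' aligned sequence (hypothetical layout; all genes are assumed
    to contain the same species). Genes are drawn with replacement, so some
    appear more than once in a replicate and others not at all.
    """
    n_genes = len(gene_alignments)
    sampled = rng.choices(gene_alignments, k=n_genes)
    species_names = list(gene_alignments[0])
    # Concatenate the sampled gene alignments species by species.
    return {
        name: "".join(gene[name] for gene in sampled)
        for name in species_names
    }


if __name__ == "__main__":
    toy_genes = [
        {"Apis": "MKL", "Tribolium": "MKV"},
        {"Apis": "GA", "Tribolium": "GS"},
        {"Apis": "PNNF", "Tribolium": "PNDF"},
    ]
    replicate = gene_bootstrap_replicate(toy_genes, random.Random(1))
    print(replicate)
```

Because whole genes are the resampling unit, the columns of each gene stay together. Support values from such replicates therefore indicate whether a clade depends on a few individual genes rather than on the data set as a whole.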
For each replicate, we drew 185 gene alignments from the full curated data set described above. Because genes were not removed from the pool after being chosen, each bootstrap data set contained some gene alignments more than once, while others were missing altogether; this is analogous to the widely used bootstrap strategy based on individual amino acid sites. For each replicate, alignments were then concatenated and analyzed with phyML as above.

Relative rate tests

To determine if the relative position of Coleoptera and Hymenoptera was caused by rate variation among these orders, we performed a relative rate test (Tajima 1993) between Nasonia and Tribolium castaneum. We restricted this analysis and the following tree reconstruction to sites that were identical between Locusta migratoria and Acyrthosiphon pisum, and where the ancestral state could thus be inferred reliably. To be conservative, we removed all genes for which χ² > 1.64, i.e., P < 0.2 (1 degree of freedom) (Kumar and Hedges 1998). Excluding Apis mellifera, we concatenated the remaining 114 genes into an alignment of 16,495 amino acid positions. Maximum likelihood bootstrap analysis was performed as above.

Phylogenetic analysis of individual proteins

We also analyzed each individual protein alignment with the maximum likelihood method as described above. To analyze each protein's support for the tree in Figure 1

Compositional heterogeneity among species pairs was assessed with a χ² test, where $\chi^2_{mn} = \sum_i (f_{mi} - f_{ni})^2 / (f_{mi} + f_{ni})$, with $f_{mi}$ the total number of amino acids of type $i$ in the concatenated sequence for species $m$. The values in Table 3 are based on those amino acids that are biased in the GC content of their codons (FYMINK/GARP, see also Table 2; 9 degrees of freedom). Qualitatively very similar results are obtained when using all amino acids (data not shown). All phylogenomic analyses and tests were implemented in Perl scripts, which are available in the Supplemental material.

Acknowledgments

J.H.W. thanks Wayne Hunter (USDA, ARS) and Phat Dang (USDA, ARS) for construction of the Nasonia EST libraries. We wish to thank Shannon K. McWeeney and Csaba Pal for helpful discussions. J.S. and D.T. acknowledge support through grants from the HFSPO and the DFG. M.J.L. acknowledges financial support from the Royal Society and the DFG. The Nasonia EST project of J.H.W. and H.T. was supported by the 21st Century Research & Technology Fund.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5204306.

References