• We are sorry, but NCBI web applications do not support your browser and may not function properly. More information
Logo of geneticsGeneticsCurrent IssueInformation for AuthorsEditorial BoardSubscribeSubmit a Manuscript
Genetics. Sep 2008; 180(1): 391–408.
PMCID: PMC2535690

Sequencing and Comparative Analysis of a Conserved Syntenic Segment in the Solanaceae


Comparative genomics is a powerful tool for gaining insight into genomic function and evolution. However, in plants, sequence data that would enable detailed comparisons of both coding and noncoding regions have been limited in availability. Here we report the generation and analysis of sequences for an unduplicated conserved syntenic segment (CSS) in the genomes of five members of the agriculturally important plant family Solanaceae. This CSS includes a 105-kb region of tomato chromosome 2 and orthologous regions of the potato, eggplant, pepper, and petunia genomes. With a total neutral divergence of 0.73–0.78 substitutions/site, these sequences are similar enough that most noncoding regions can be aligned, yet divergent enough to be informative about evolutionary dynamics and selective pressures. The CSS contains 17 distinct genes with generally conserved order and orientation, but with numerous small-scale differences between species. Our analysis indicates that the last common ancestor of these species lived ~27–36 million years ago, that more than one-third of short genomic segments (5–15 bp) are under selection, and that more than two-thirds of selected bases fall in noncoding regions. In addition, we identify genes under positive selection and analyze hundreds of conserved noncoding elements. This analysis provides a window into 30 million years of plant evolution in the absence of polyploidization.

GENOME sequences are now rarely studied in isolation, but instead are examined alongside their neighbors on the tree of life. Most animal species of primary research importance in genetics—including human, mouse, Drosophila melanogaster, and Caenorhabditis elegans—now belong to whole “sequenced clades,” consisting of at least half a dozen and in some cases more than two dozen sequenced species (e.g., Lindblad-Toh et al. 2005; Rhesus Macaque Genome Sequencing and Analysis Consortium 2007; Clark et al. 2007; Miller et al. 2007; Stark et al. 2007) (http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/CaenorhabditisSEQ.pdf). The same is true of the model yeast Saccharomyces cerevisiae (Cliften et al. 2003; Kellis et al. 2003). The species within each of these clades are evolutionarily close enough that noncoding as well as coding sequences can be aligned, yet distant enough that genomic comparisons reveal clear signatures of natural selection. In addition, the generally similar physiology, behavior, and genetics of the organisms within each clade help to facilitate comparative analyses. Comparative genomic analyses of sequenced clades have, among other things, allowed for the identification of new genes, regulatory elements, noncoding RNAs, and conserved sequences of unknown function (e.g., Guigó et al. 2003; Kellis et al. 2003; Bejerano et al. 2004; Siepel et al. 2007; Stark et al. 2007); shed light on duplication and rearrangement histories (Murphy et al. 2005; Jiang et al. 2007); produced refined phylogenies (Thomas et al. 2003; Murphy et al. 2007); and enabled the detection of rapidly evolving coding and noncoding sequences (Clark et al. 2003; Pollard et al. 2006).

In plants, however, comparable sequenced clades have not yet emerged. The main embryophytic (land-plant) species that have been fully sequenced—Arabidopsis thaliana (Arabidopsis Genome Initiative 2000), Oryza sativa (Goff et al. 2002; Yu et al. 2002), Medicago truncatula (Cannon et al. 2006), and Populus trichocarpa (Tuskan et al. 2006)—have been selected primarily for their individual importance as model species or agricultural crops, rather than for their value in comparative genomics. These genomes are sufficiently distant from one another that they generally do not align outside of coding regions. In addition, each genome has been considerably scrambled with respect to the others by millions of years of rearrangement, duplication, insertion, and deletion, further complicating comparative analyses. Consequently, with a few exceptions (Inada et al. 2003; Ma and Bennetzen 2004; Haberer et al. 2006; Freeling et al. 2007; Thomas et al. 2007), comparative genomic studies of plants have largely focused on content of protein-coding genes and repetitive elements (Ku et al. 2000; Quiros et al. 2001; Song et al. 2002; Ilic et al. 2003), rather than on the kind of detailed analysis of orthologous functional elements that has been possible in animals.

Moreover, comparative studies of plant genomes so far have largely dealt with species that have experienced recent whole-genome duplications (WGDs) (Ku et al. 2000; Quiros et al. 2001; Song et al. 2002; Ilic et al. 2003; Zhu et al. 2003). These studies have revealed striking differences between species in genome organization, perhaps induced by the massive genetic redundancy created by WGD (Lynch and Conery 2003; Semon and Wolfe 2007). However, they leave open the question of how plant genomes evolve in the absence of WGD, and they complicate comparisons with animal genomes, in which WGD is much less common (Otto and Whitton 2000). Furthermore, WGD creates additional challenges in comparative genomics, by producing dramatic differences in genome size and number of genes, many-to-many relationships among orthologs, and frequent disruptions in synteny.

The Solanaceae are highly important among flowering plants that have diversified in the absence of WGD. The Solanaceae family comprises >3000 species, including aquatic plants, desert dwellers, trees, ornamentals, and familiar crops such as tomato, potato, and pepper. It ranks third among plant families in economic importance, it is the most valuable in terms of vegetable crops, and it includes important model systems for fruit development (tomato and pepper), tuber development (potato), plant defense (tomato and tobacco), and anthocyanin pigments (petunia). Despite their great phenotypic diversity, all Solanaceae derived ~40 million years ago (MYA) from an ancestral diploid with x = 12 chromosomes, and nearly all family members have maintained this chromosome number (Wikstrom et al. 2001; Wu et al. 2006). Moreover, members of the related family Rubiaceae (coffee family) are also diploid with x = 11 or x = 12, implying that any WGD in the history of the Solanaceae and the Rubiaceae occurred before their divergence ~85 MYA (F. Wu and S. D. Tanksley, unpublished data). Comparative genomics of the Solanaceae has been an active area of research for two decades (Tanksley et al. 1988). In addition, an international project is underway to sequence the full euchromatic portion of the tomato genome (http://www.sgn.cornell.edu/about/tomato_sequencing.pl), and a well-developed bioinformatics infrastructure with strong support for comparative analyses is available (Mueller et al. 2005). While several recent studies have included comparative analyses of solanaceous ESTs (Van Der Hoeven et al. 2002; Ronning et al. 2003; Blanc and Wolfe 2004a; Rensink et al. 2005), so far a large-scale comparative analysis of genomic sequences in the Solanaceae has not been possible.

Here we report a comparative analysis of a conserved syntenic segment (CSS) in the genomes of five Solanaceae species. We compare the previously sequenced 105-kb ovate-containing region from chromosome 2 of tomato (Ku et al. 2000) with newly sequenced orthologous regions of the potato, eggplant, pepper, and petunia genomes. This CSS is present as a single copy in all five species, and it contains 17 distinct genes with mostly conserved order and orientation. However, its general conservation is punctuated by numerous small-scale differences, due to nucleotide substitutions, insertions and deletions, tandem duplications of individual genes, inversions, and transpositions. Our detailed comparison of these sequences provides new insights into the evolutionary history of an important group of plants, the evolutionary dynamics of plant genomes in the absence of WGD, and the selective pressures experienced by both coding and noncoding functional elements.


Identification and sequencing of BACs:

The starting point for the study was a gene-rich BAC from the long arm of tomato chromosome 2 known to contain the ovate locus (BAC19, LE_HBa0106H06) (Ku et al. 2000). The 16 predicted genes from the BAC were first tested for copy number in the tomato genome via genomic Southern hybridization. Hybridization was carried out at 60° overnight using probes labeled with 32P and washed in 2× SSC for 20 min and in 1× SSC for 10 min. A hybridization probe for each gene was based on a single exon or two nearby exons (≥350 bp; see supplemental Table S1). The majority of the genes in the BAC were shown to be single copy by virtue of hybridization to a single restriction fragment in digests of at least one restriction enzyme. Several of these single-copy probes were genetically mapped on the high-density tomato genetic map (Frary et al. 2005) (http://www.sgn.cornell.edu/). All mapped to the expected position on chromosome 2.

Five of the single-copy probes (for genes 1, 5, 8, 15, and 17; see supplemental Table S2) were then used to screen BAC libraries for potato (Solanum bulbocastanum) (Song et al. 2000), eggplant (S. melongena cv. “black eggplant”) (J. Vrebalov and J. Giovannoni, unpublished data), petunia (Petunia inflata) (McCubbin et al. 2000), and pepper (Capsicum annuum) (J. Vrebalov and J. Giovannoni, unpublished data). Positive BACs were confirmed by Southern hybridization on HindIII-digested BAC DNA and further selected using three additional probes (for genes 3, 12, and 14; supplemental Table S2) for maximum gene overlap with the tomato BAC. All probes were confirmed to be present in single copy in potato, eggplant, pepper, and petunia by Southern hybridization with genomic sequences digested by more than two restriction enzymes. On the basis of these results, a single BAC from each species, hybridizing to the maximum number of tomato gene probes, was selected for further analysis: tomato (106H06), potato (027I18), eggplant (077N19), pepper (215H17), and petunia (126I14) (Figure 1). These BACs were, respectively, 135, 105, 122, 106, and 139 kb in size.

Figure 1.
Conserved syntenic segment (CSS) in five species of Solanaceae. The sequenced segments of the potato, tomato, pepper, eggplant, and petunia genomes are shown alongside corresponding regions of the Arabidopsis (At) genome. All annotated genes and several ...

Each BAC clone was shotgun sequenced to 10× coverage and assembled using Phred and Phrap with default parameters. Gaps and low-quality regions were finished with sequences from PCR products to obtain final assemblies with minimum quality scores of 25. These were checked by BAC end sequences and by comparing virtual (electronic) and empirical (lab) restriction digests using HindIII, EcoRI, and BamHI.

Alignment and annotation of BAC sequences:

A multiple alignment for the entire CSS was constructed from the assembled sequences using BLASTZ (Schwartz et al. 2003) and the threaded blockset aligner (TBA) (Blanchette et al. 2004). Before processing by TBA, pairwise BLASTZ alignments were filtered by the University of California, Santa Cruz (UCSC), alignment “chaining” and “netting” pipeline (Kent et al. 2003), which uses conserved synteny to help ensure that orthologous sequences are aligned. The chains, nets, and multiple alignment were displayed and manually inspected in a “Solanaceae Genome Browser” based on the UCSC platform (http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?db=sol1).

Ab initio gene predictions were obtained for each BAC, using four different computational gene-finding programs—FGENESH (Solovyev et al. 1994), GenemarkHMM (Borodovsky and McIninch 1993), Genscan+ (Burge and Karlin 1997), and GlimmerM (Salzberg et al. 1998) (Arabidopsis training data set; Wortman et al. 2003). Independently, each BAC was screened against a large Solanaceae EST database (239,593 tomato ESTs, 134,365 potato ESTs, 3181 eggplant ESTs, and 20,738 pepper ESTs; http://www.sgn.cornell.edu/) and against the Arabidopsis proteome. An initial set of gene annotations was defined with the requirement that each gene be supported by at least two computational gene predictions, at least one solanaceous EST (with >95% identity over 80% sequence length if from the same species or BLASTN E-value <10−10 if from another species), or at least one Arabidopsis protein (tBLASTX E-value <10−10). These candidate gene structures were then evaluated for cross-species support, using the clean_genes program [part of the PHAST package (Siepel et al. 2005)], and were inspected manually in the Solanaceae Genome Browser alongside the multiple alignments, EST alignments, alignments of full-length mRNA sequences from GenBank, and protein alignments. This inspection turned up two apparent pseudogenes in tomato (both derived from gene 10; see results) and allowed for some minor refinements in the positions of splice sites, but otherwise supported the candidate predictions. Putative functions for genes were assigned on the basis of Arabidopsis homologs and predicted domains, where available (Table 1).

Summary of genes in the CSS

Repeat elements and low-complexity sequences within all sequences were soft masked using TRF (Benson 1999) and RepeatMasker (http://www.repeatmasker.org). A custom RepeatMasker library was produced by concatenating repeat elements from the Solanaceae Genomics Network Repeat Database (http://www.sgn.cornell.edu), TIGR plant repeats (http://www.tigr.org/tdb/e2k1/plant.repeats), Munich Information Center for Protein Sequence (MIPS) plant repeats (http://mips.gsf.de/proj/plant/webapp/recat), and plant repeats within the RepeatMasker library. The annotated repeats are displayed in the RepeatMasker and Simple Repeats tracks in the Solanaceae Genome Browser.

Alignments of orthologous coding regions:

For the 12 multispecies genes (Table 2), multiple alignments of orthologous protein-coding DNA sequences were extracted from the CSS-wide multiple alignment by concatenating the segments corresponding to the exons of each gene, as defined by the tomato gene models. Manual inspection suggested that no realignment was needed. For use in the analyses based on codon models, a hand-curated version of these alignments was created without frameshifts or stop codons. These alignments were truncated at frame-shifting insertions and deletions (indels) or premature stops near the 3′ ends of genes (Table 2), and any out-of-frame sequences between compensatory frame-shifting indels were masked by replacing them with N's. For the estimation of dates of divergence, the closest Arabidopsis ortholog of each gene was incorporated into these alignments (excluding gene 17), and for the analysis of gene trees, all paralogs and putative homologs in Arabidopsis and rice (based on TBLASTX matches and data reported by Ku et al. 2000) were added. These expanded alignments were created by aligning predicted peptide sequences with T_Coffee (Notredame et al. 2000) and then reverse translating to DNA sequences. They were also truncated at premature stop codons as necessary.

Summary of comparative analysis of multispecies genes

Phylogenetic analysis:

Gene trees were estimated by maximum likelihood using PhyML (Guindon and Gascuel 2003). In all cases the inferred tree topology was consistent with the species phylogeny shown in Figures 3 and and4,4, which is in agreement with previous phylogenetic studies of the Solanaceae (Olmstead and Palmer 1997; Olmstead et al. 1999). In an initial analysis, the two petunia copies of gene 12 grouped together, as expected, but the two pepper copies of gene 7 did not. However, a follow-up analysis using codon models supported a topology for gene 7 in which the two pepper genes grouped together (Figure 7A), as expected from the copy numbers of this gene in the different species.

Figure 3.
Rates of evolution in neutrally evolving and conserved sequences. Branch lengths were estimated from (A) synonymous and (B) nonsynonymous substitutions in 14,961 coding sites, on the basis of the codon model of Yang et al. (1998) (branch lengths are given ...
Figure 4.
Estimated dates of divergence for the five Solanaceae species. This tree was estimated from the coding regions of 12 genes in the conserved syntenic segment for which cross-species data are available (see Table 2), assuming a codon model, a molecular ...
Figure 7.
Maximum-likelihood phylogenies for genes that show tandem duplications. (A) Phylogeny for gene 7, which is duplicated in pepper. (B) Phylogeny for gene 12, which is duplicated in petunia. Labels above branches indicate branchwise estimates of ω. ...

Maximum-likelihood estimates of dN, dS, and ω = dN/dS were obtained using the codeml program (Yang 1997), with F3 × 4 codon frequencies, equal amino acid distances (aaDist = 0), a single ω across sites and across branches (model = 0, NSsites = 0), and the tree topology of Figure 3. Estimates were obtained separately for each of the multispecies genes, using the hand-curated alignments of coding regions, and for a pooled data set in which all alignments were concatenated. Fourfold degenerate (4D) sites were extracted using msa_view (from PHAST) and substitution rates for these sites were estimated using phyloFit (also from PHAST) with the general reversible (REV) model (Tavaré 1986). For each type of site, only sequences with three or more aligned orthologous sequences, including tomato, were included in the analysis.

Dates of divergence were estimated by applying codeml as above, but assuming a global molecular clock (clock = 1). The “fossil calibration” feature (Yang and Yoder 2003; see the “Global and Local Clocks” section of the PAML manual) was used to fix the Arabidopsis/Solanaceae (AS) divergence at the estimated dates of 110, 120, and 130 MYA, and the other divergence times were then estimated by maximum likelihood. The standard errors of the estimates were small compared to the uncertainty in the AS divergence date and were therefore ignored. The data do not strictly support the hypothesis of a global clock (likelihood-ratio test, LRT P = 7 × 10−15), but the branch length estimates were not dramatically altered by the assumption of a clock, and violations of this assumption are not expected to have a dramatic effect on the estimated dates.

The codeml program was also used to perform LRTs for positive selection, based on the branch-site model of Yang and Nielsen (2002) (model = 2, NSsites = 2). A separate LRT was performed for each gene and each branch of the tree by running codeml twice: once with fix_omega = 0 (alternative model) and once with fix_omega = 1, omega = 1 (null model). Nominal P-values were computed by assuming that twice the difference of the log likelihoods of these two models should have a null distribution that is a 50:50 mixture of a χ2-distribution and a point mass at zero (Zhang et al. 2005). These P-values were then corrected for multiple comparisons, using the method of Benjamini and Hochberg (1995).

Proportion of nucleotide sites under selection:

Conservation scores were produced using the program phastOdds (from PHAST) in sliding windows of 5, 10, and 15 bp. PhastOdds computes log-odds scores for each base, comparing a phylogenetic model of conserved evolution with a model of nonconserved evolution. The scores were averaged within windows. Specifically, the score Si for a window of size d and radius r = [left floor]d/2[right floor], centered at position i, was computed as

equation M1

where Xj is the jth column of the multiple alignment, ψc is a phylogenetic model for conserved sites, ψn is a phylogenetic model for nonconserved sites, and P(Xj | ψx) is computed by Felsenstein's pruning algorithm (Felsenstein 1981). (A similar scoring procedure is described in more detail by Siepel et al. 2005.) The models ψc and ψn were estimated using the phastCons program with the REV model (see below). An alternative analysis in which ψc was estimated from coding exons and ψn was estimated from fourfold degenerate sites to estimate ψn produced nearly identical results. Alignment gaps were treated as missing data. Neutral scores were obtained by concatenating alignment columns from 4D sites into a pseudoalignment, randomly permuting them, and then applying phastOdds to this alignment.

The complete distribution of conservation scores, fall, was modeled as a mixture of neutral and selected components, fall(S) = πnfn(S) + πsfs(S), with mixture coefficients πn and πs (0 ≤ πn, πs ≤ 1, πn + πs = 1). The distributions fall and fn were obtained by Gaussian kernel density estimation from the sets of all scores and of neutral scores, respectively, excluding sites with bases from fewer than three species. The density function in R was used with kernel = “gaussian,” bandwidth (bw) of 0.15, 0.20, or 0.25, and n (the number of points) of 1024. The lower bound for πs was then estimated as πs = 1 − minS[fall(S)/fn(S)], as described by Chiaromonte et al. (2003). In this minimization, only scores S with fall > δ and fn(S) > δ (for small positive δ) were considered, to avoid distortion from regions of sparse data. Values of δ between 0.0001 and 0.01 produced essentially identical results. To estimate confidence intervals, bootstrap resampling both of all sites and of neutral sites was performed. Kernel density estimation and estimation of πs were performed for each of 1000 samples, and the 0.025 and 0.975 quantiles of the estimates of πs were taken as 95% confidence intervals. Estimates of πs were converted to estimates of γ by multiplying them by the fraction of bases that were aligned in three or more species (here 0.726), under the conservative assumption that unaligned bases are not under selection.

The posterior probability that each window Wi with score Si is under selection was computed as

equation M2

where Zi is a random variable equal to 1 if Wi is under selection and equal to 0 otherwise (Chiaromonte et al. 2003). On the basis of the gene models and EST/mRNA data, each base was assigned to one of four annotation classes (see Figure 6), and each window was assigned to the class containing the largest number of its bases. These posterior probabilities were then used to compute expected fractions of windows that are under selection within each class and expected fractions of all selected windows that come from each class.

Figure 6.
Selection on sites of various annotation classes. (A) Fractions of all 10-bp windows that are under selection in sites annotated as belonging to coding (CDS), other transcribed, intronic, and intergenic regions. (B) Breakdown of all windows under selection ...

To test the sensitivity of the analysis to the alignment methods, all steps were repeated using an alternative alignment constructed by the Pecan program (B. Paten, K. Beal and E. Birney, unpublished data; http://www.ebi.ac.uk/~bjp/pecan/). This analysis produced very similar results, but slightly higher estimates of γ (see supplemental material).

Coding indels:

A history of indels was reconstructed by parsimony from the alignments of orthologous coding regions, using the program indelHistory (from PHAST). The inferred events were classified as insertions on particular branches of the phylogeny, deletions on particular branches, or ambiguous indels (with no outgroup data). Normalized indel rates for each gene were computed by dividing the estimated number of indels in that gene by the product of its length and the total neutral (dS) branch length of the phylogeny of the available species (with dS values as shown in Figure 3A). This normalization corrects for gene length (more indels are expected in longer genes), branch length (more indels are expected on longer branches), and differences in the sets of species represented at each gene (more indels can be observed where there are more data). The normalized indel rates were multiplied by 100 and expressed in units of indels per 100 neutral substitutions. A similar normalization was used to compare indel rates on different branches of the tree.

Identification and characterization of conserved elements:

Conserved elements were identified with phastCons, after tuning the parameters γ and ω to obtain 60% coverage of the annotated coding regions by conserved elements (see Siepel et al. 2005). All parameters were estimated from the data (including substitution model parameters, the branch lengths of the tree, and the scaling parameter ρ). The expected minimum length of a detectable conserved element was estimated as described by Siepel et al. (2005). Elements under lineage-specific selection were identified with DLESS and significance was assessed with phyloP (Siepel et al. 2006). Predictions with P ≥ 0.05 were discarded. The model estimated from 4D sites was used as the neutral model, and the tuning parameters were set to their default values. All phastCons and DLESS elements were analyzed with RNAz (Washietl et al. 2005), searched against the RFAM database (Griffiths-Jones et al. 2005) with INFERNAL (Eddy 2002), and examined for pre-miRNA and snoRNA structures with RNAmicro (Hertel and Stadler 2006) and snoReport (Hertel et al. 2008), respectively.

Known binding-site sequences from solanaceous plants were collected from the TRANSFAC (Wingender et al. 1996) and PLACE (Higo et al. 1999) databases, as well as from various sources in the primary literature. These included 17 transcription factors (TFs) with three or more independent sites. Position-specific score matrices (PSSMs) were derived for these 17 TFs by standard methods (supplemental Figure S4). The noncoding portion of the tomato genome was scanned for significant matches to each of these PSSMs, by computing log-odds scores with respect to a third-order Markov background model (estimated from all noncoding regions) and retaining all predictions with empirical P < 1.5 × 10−4 (as assessed by simulation from the background model). These predictions are displayed in the “Motif Predictions” track of the Solanaceae Browser.


Sequences, annotations, alignments, and genome browser:

BACs corresponding to a CSS from tomato chromosome 2 were isolated from tomato, potato, eggplant, pepper, and petunia and were sequenced (Figure 1). Gene annotations for each sequenced BAC were prepared by a combination of computational and manual methods, and a multiple alignment of all sequences was constructed, using methods that exploited the conserved synteny of the region to ensure that orthologous sequences were aligned (see materials and methods). Because the tomato sequence is the most complete and best annotated (due to reasonably extensive mRNA and EST data for tomato), it was selected as the reference sequence for the multiple alignment and was used as the main source of gene annotations. The BAC sequences were also annotated with the positions of transposable elements, simple sequence repeats, conserved elements, known regulatory motifs, and other features, as discussed below. Nearly all coding bases, and most noncoding bases, are aligned in the region (supplemental material; supplemental Table S3). The sequences, alignments, and annotations are displayed in a publicly available Solanaceae Genome Browser based on the UCSC platform (Kent et al. 2002) (Figure 2; http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?db=sol1).

Figure 2.
The Solanaceae Genome Browser. (A) The 105-kb region from the tomato genome (the reference for the browser) contains 15 genes and two pseudogenes (labeled tomato.10pa and tomato.10pb). Tan bars represent alignment nets, showing large blocks of conserved ...

Conservation of gene content, order, and structure:

The CSS contains 17 distinct genes, which are generally well conserved across species (Figure 1 and Table 1). However, small-scale duplications and losses have resulted in some differences in gene content. For example, Gene 9 is found in the same position and orientation in potato, eggplant, and petunia but is absent in tomato and pepper (Figure 1). An examination of the phylogenetic tree of these species (Figure 3) indicates that this gene must have been lost independently in both the tomato and the pepper lineages.

Gene 10 has the same position and orientation in all species but tomato. In its place in the tomato genome are two apparent pseudogenes, both aligning to portions of gene 10 from the other species. One pseudogene (tomato.10p) has what appears to be the ancestral position and orientation, while the other (tomato.10p′) is found in the same orientation but ~3 kb upstream. Moreover, these two copies have a large (~500-bp) region of similarity, suggesting that tomato.10p′ arose from the ancestral gene by a partially duplicative transposition. On the basis of the degree of divergence of these two sequences (~12%), this event appears to have occurred soon after the separation of tomato and potato, ~6 MYA. It is possible that this rearrangement is an example of transposon-mediated exon shuffling, as observed in grass genomes (Bennetzen 2007), but no known transposon was identified in the immediate vicinity.

Two other genes have apparently undergone gene expansion via tandem duplication—gene 7, which is present in two adjacent copies in pepper, and gene 12, which is present in two adjacent copies in petunia (Figure 1). In both cases, the duplicate copies are present in the same orientation, consistent with the mechanisms of tandem duplication. These duplications are lineage specific, and phylogenetic trees estimated by maximum likelihood suggest that they occurred relatively recently (see below). Both copies of both genes have intact open reading frames.

Gene order, like gene content, is largely conserved within the CSS. The only major exception is that the order and the orientation of genes 15 and 16 are reversed in petunia relative to tomato, potato, and eggplant—apparently the result of an inversion of ~20 kb. The petunia inversion is likely to be a derived condition as the tomato/potato/eggplant configuration for these genes is shared with Arabidopsis (Figures 1 and and2).2). There is also a much smaller inversion of ~800 bp in potato (Figure 2). Despite these differences, the CSS shows considerably more conservation than previously observed in comparisons of plant genomes (e.g., Ku et al. 2000; Song et al. 2002; Ilic et al. 2003), perhaps in part due to the absence of WGD in the evolution of the Solanaceae (see discussion).

On the basis of the multiple alignment for the region, we analyzed the open reading frames (ORFs) and exon–intron structures of orthologous genes within the CSS, focusing on the 12 genes that were present in tomato and sequenced in at least two other species (henceforth, the multispecies genes; Table 2). Seven of the 12 multispecies genes have well-conserved ORFs and exon–intron structures, with aligned start/stop codons and splice sites, no premature stop codons, and no frameshift indels (Table 2). Only one gene, gene 14, displays extensive disruptions to its ORF, primarily owing to a long (748 bp) CT-rich low-complexity region near its 3′ end, which contains several frame-shifting indels. The remaining four genes show minor changes in ORFs or structure, including compensatory frameshift indels, shifts in splice-site position, and frameshift indels near their 3′ ends that cause changes in stop codon position. Thus, the ORFs and exon–intron structures in this region have generally been fairly well conserved through the evolution of the Solanaceae, but more than a third of genes do show differences among species.

Rates of neutral substitution:

We used synonymous substitutions within the coding regions of the multispecies genes to estimate rates of neutral substitution in the Solanaceae. For this analysis, we pooled data from all 12 multispecies genes (Table 2), but excluded regions in which gene structure or reading frame was not conserved across species. The tree topology of Figure 3 was assumed, on the basis of previous studies and our own analysis (materials and methods). Using the codon model of Yang et al. (1998), we obtained estimates of dS for each branch that summed to 0.78 synonymous substitutions per synonymous site (Figure 3A). By extracting just the 4D sites (which have no effect on the encoded amino acid) and estimating branch lengths under the REV model of nucleotide evolution (Tavaré 1986), we arrived at a similar estimate of 0.73 substitutions/site (Figure 3C). The total neutral divergence of these Solanaceae genomes, therefore, is comparable to the neutral divergence within each of the groups of mammalian, Drosophila, Caenorhabditis, and Saccharomyces genomes that have recently been widely analyzed (e.g., Siepel et al. 2005). Indeed, it is nearly identical to the neutral divergence of the six eutherian mammalian genomes (human, chimpanzee, macaque, mouse, rat, and dog) that have been completely sequenced at present (0.83 substitions/site, as estimated from 4D sites in the ENCODE regions; Margulies et al. 2007). Thus, many of the analytical methods and computational tools that have been developed for comparative genomics of mammals should be well suited for these plant genomes.

Dates of divergence:

While there have been several detailed studies of the phylogeny of the Solanaceae (Olmstead and Palmer 1992; Olmstead et al. 1999), there are, to our knowledge, no published estimates of absolute dates of divergence throughout the family. However, good estimates are available for the date of the Arabidopsis/Solanaceae (rosid/asterid) divergence. By conditioning on these dates and assuming a molecular clock within the Solanaceae, we were able to obtain an approximate timescale for our five-species phylogeny.

In recent studies of divergence dates throughout the angiosperms, the AS divergence was estimated at 125 ± 5 MYA (Wikstrom et al. 2001) and ~115 MYA (Magallon and Sanderson 2005). These studies were based on nonparametric rate smoothing and/or penalized likelihood methods, with multiple dates from the fossil record as calibration points. Earlier estimates of the AS divergence ranged from 112 to 156 MYA (Yang et al. 1999), but the fossil record for tricolpate pollen (present only in eudicots, which include rosids and asterids) strongly suggests that the AS divergence occurred ≤125 MYA (Bell et al. 2005). Here we assume a fairly inclusive range of 110–130 MYA for the AS divergence and use an intermediate value of 120 MYA as a point estimate.

We incorporated the closest Arabidopsis ortholog for each gene (Figure 1) into the multiple alignments for 11 of the multispecies genes (excluding gene 17, which has two equally distant orthologs in Arabidopsis). Phylogenies estimated separately for each gene strongly suggested that the identified Arabidopsis genes were truly orthologs and not more distant “outparalogs” (Sonnhammer and Koonin 2002), because they showed fairly consistent levels of divergence from the Solanaceae in all cases. On the basis of a pooled data set including all genes, we estimated dates of divergence on the basis of the method of Yang and Yoder (2003), assuming a codon model and a molecular clock. The AS divergence was fixed at the point estimate of 120 MYA, the lower limit of 110 MYA, or the upper limit of 130 MYA. This analysis yielded estimates of 6.2 MYA (range 5.1–7.3 MYA) for the divergence of tomato and potato, 13.7 MYA (11.6–16.0 MYA) for tomato and eggplant, 19.1 MYA (16.2–22.2 MYA) for tomato and pepper, and 31.2 MYA (26.9–35.9 MYA) for tomato and petunia (Figure 4; Table 3). Thus, these species appear to have diversified over the course of ~30 MYA, with major splits occurring in intervals of ~5–10 MY. For several reasons—including the dependency on clock assumptions, the relatively small number of genes considered, and the possibility of misidentification of Arabidopsis orthologs—the estimated dates should be considered rough approximations, but they do provide a general idea of the timeline for the diversification of the Solanaceae (see discussion).

Estimated ages of most recent common ancestors with tomato (MYA)

Proportion of nucleotide sites under selection:

We estimated the proportion of nucleotides in the CSS that have been subject to natural selection during the evolution of the Solanaceae (here denoted γ) by the mixture-decomposition method that was used in the comparative analysis of the human and mouse genomes (Mouse Genome Sequencing Consortium 2002). This method requires only a conservation score for each nucleotide and the identification of a subset of nucleotides that are believed to be neutrally evolving. We used phylogenetic conservation scores computed by the phastOdds program in sliding windows of between 5 and 15 bp (see materials and methods) and treated the 1827 4D sites from the 12 multispecies genes as neutrally evolving (Table 2). The method decomposes the distribution of all conservation scores into a neutral component and a nonneutral, or “selected,” component, using the designated neutral sites as a guide. An estimate of γ can be obtained from the estimated mixture coefficient for the selected component of the distribution, by conservatively assuming any unaligned bases are not under selection (materials and methods). The method is essentially nonparametric and does not require any assumptions about the selected distribution (which is likely to be complex). However, it yields only an estimate of a lower bound for γ.

With 10-bp windows, this analysis yielded a lower-bound estimate of γ = 0.337 (bootstrapping 95% C.I., 0.303–0.373; Figure 5 and supplemental Table S4). Estimates were slightly lower with 5-bp windows (0.322; 95% C.I., 0.283–0.364) and slightly higher with 15-bp windows (0.375; 95% C.I., 0.355–0.395), probably because the power to detect selection increases with window size. They also depended slightly on a smoothing parameter for the score distributions. (The values given here are based on an intermediate smoothing bandwidth of 0.20; see supplemental Table S4 for full results.) Nevertheless, all estimates were between ~0.3 and ~0.4, suggesting that at least about a third of the CSS has evolved under natural selection. Only ~18% of bases in the region fall in protein-coding regions, so at least roughly half of selected bases must have noncoding functions. Interestingly, ~5% of windows under selection—or ~1.5% of all windows—have lower, rather than higher, conservation scores than would be expected under neutrality (left peak of solid line in Figure 5). These windows may have been subject to positive selection during the evolution of the Solanaceae. Our estimates of γ are expected to be conservative as lower bounds, although elevated rates of substitution in 4D sites could conceivably produce an upward bias (see discussion).

Figure 5.
Decomposition of distribution of conservation scores into neutral and selected components. The shaded line represents the full distribution (fall) in the CSS, the dashed solid line represents the contribution of neutrally evolving sites (πnfn ...

To gain insight into the functional roles of the selected bases, we partitioned the sites in the CSS into four classes—coding (CDS), other transcribed (predominantly UTR), intronic, and intergenic sites—and examined the score distribution within each class. These classes of sites show clear differences in their scores, with coding sites being most conserved, intergenic sites being least conserved, and the other classes displaying intermediate levels of conservation (supplemental Figure S2), consistent with observations in other species. Assuming 10-bp windows, we estimate that at least 60% of coding windows, 39% of other transcribed and intronic windows, and 23% of intergenic windows are under selection (Figure 6A). By this analysis, only about one-third (32.9%) of the sites under selection come from coding regions, while 38.5% come from intergenic regions and 25.5% come from intronic regions. The remaining 3.1% come from other transcribed regions (Figure 6B).

Evolution of protein-coding sequences:

Synonymous and nonsynonymous substitutions:

From the pooled set of coding regions used to estimate dS (see above) we estimated dN = 0.16 nonsynonymous substitutions per nonsynonymous site (Figure 3B) and a ratio of ω = dN/dS = 0.21, indicating moderate purifying selection (e.g., Nielsen 2005). This estimate of ω is somewhat larger than genomewide estimates for Drosophila, Caenorhabditis, and Saccharomyces genomes (~0.06–0.11; Kellis et al. 2003; Stein et al. 2003; Clark et al. 2007). It is also larger than estimates for nonprimate mammals (0.10–0.14), but it is similar to estimates for hominids (0.17–0.25) (Rhesus Macaque Genome Sequencing and Analysis Consortium 2007; Kosiol et al. 2008). Relatively few genomewide estimates are available for plants, but Tuskan et al. (2006) have reported ω = 0.40 on the basis of single-nucleotide polymorphisms in Populus, and Yu et al. (2005) have reported average KA/KS estimates (which are roughly comparable to ω-estimates) of 0.35 for the indica and japonica subspecies of O. sativa, with higher values in duplicated regions (0.720 in segmental duplications). Earlier estimates, based on smaller data sets, included KA/KS = 0.14 for A. thaliana and Brassica rapa ssp. pekinensis orthologs (Tiffin and Hahn 2002) (although this estimate may reflect a downward bias from the use of EST data; see, e.g., Wright et al. 2004), and KA/KS = 0.20 for duplicate genes in Arabidopsis (Zhang et al. 2002). Overall, these Solanaceae genes appear to be under somewhat weaker purifying selection than the genes of most animals, but under stronger purifying selection than the genes of at least some other plants. It is possible that the absence of a recent whole-genome duplication in the Solanaceae has contributed to the reduced estimates of ω in comparison with Populus and O. sativa.

Separate estimates of ω per gene indicate that most genes have ω < 0.2, suggesting moderate to strong purifying selection, but a few have values closer to 1, suggesting relaxation of constraint (Table 2). Gene 11, with the largest estimate of ω (0.53), also displays several differences between species in its ORF and exon–intron structure. We also tested all genes for positive selection, both across all branches and on individual branches of the phylogeny, using LRT methods (Yang and Nielsen 2002). Most genes (including gene 11) showed no significant evidence of positive selection, but evidence was detected for genes 7 and 12, both of which have undergone tandem duplications (Figure 7). This is consistent with Yu et al.'s (2005) finding of significantly elevated ω-estimates for tandemly duplicated genes and with observed enrichments for positive selection among duplicated genes (e.g., Rhesus Macaque Genome Sequencing and Analysis Consortium 2007). It should be noted that interlocus gene conversion between duplicate copies of genes can complicate analyses of positive selection.

Insertions and deletions:

Using a maximum parsimony-based method (materials and methods), we reconstructed an indel history (history of insertion and deletion events along the branches of the phylogeny) for each of the 12 multispecies genes. A total of 158 coding indels were identified, or an average of ~13 per gene. However, there was considerable variation in indel number, with 6 genes contributing ≤5 indels each, and 2 genes (genes 11 and 14) responsible for more than half of all observed indels (Table 2). Because different numbers of species are represented at each gene and the genes have different lengths, indel rates were expressed in normalized units of indel events per 100 neutral substitutions (indels/hns) (materials and methods). The average normalized rate was estimated to be 1.7 indels/hns, with a range of 0–5.8 indels/hns per gene. The excess of indels in gene 14 is most likely explained by replication slippage in the 748-bp CT-rich low-complexity region near its 3′ end. Gene 11, despite its large number of indels, does not show a substantial elevation in its normalized indel rate (1.9 indels/hns).

No correlation was observed between ω and normalized indel rates (R2 = 0.001). This finding, together with the excess of indels in the low-complexity region of gene 14 and in several simple sequence repeats, suggests that mutation (e.g., replication slippage), rather than selection, may be responsible for most of the variance in indel rates. As expected, the vast majority (78%) of the observed indels have lengths that are exact multiples of three and therefore leave the open reading frames of their genes intact. The remainder generally occur near the 3′ ends of genes or co-occur with compensatory indels that maintain the reading frame. No significant differences were observed in the indel rates on individual branches of the tree.

Almost two-thirds of the observed indels (100/158) could be identified, on the basis of outgroup data, as either insertions or deletions, with deletions outnumbering insertions by ~3:2 (Table 2). This preference for deletions over insertions is consistent with genomewide observations in mammals (Cooper et al. 2004; Rat Genome Sequencing Project Consortium 2004) and plants (Ma and Bennetzen 2004), but is not as pronounced as what has been observed in Drosophila (Petrov et al. 1996). The average sizes of insertions and deletions were similar (6.7-bp insertions, 7.8-bp deletions). Five of seven phylogenetically informative indels (i.e., shared by between 2 and n − 2 species in a region with n ≥ 4 aligned species) in nonrepetitive regions supported the inferred tree topology (Figure 3). The remaining two indels occurred in exon 3 of gene 12 and supported a phylogeny in which eggplant and pepper are grouped as sister taxa (supplemental Figure S3). These two indels might reflect homoplasy events or evolutionary scenarios—such as incomplete lineage sorting or duplication followed by loss—that would cause this exon's phylogeny to differ from the species tree.

Repeat elements:

The BAC sequences from all five species were screened for transposable elements (TEs), simple sequence repeats, and low-complexity sequences using TRF (Benson 1999) and RepeatMasker (http://www.repeatmasker.org), with a custom RepeatMasker library assembled from several databases (materials and methods). The estimated repeat content ranged from 9.6% of bases in eggplant to 23.0% in potato (supplemental Table S6). In all cases, TEs accounted for the majority of repeats, covering between 5.8% (eggplant) and 16.4% (potato) of bases. By applying the same methods to four BACs identified as being in euchromatin by Wang et al. (2006), we obtained estimates of repeat content ranging from 11.5 to 23.0% (median 12.8%), suggesting that the CSS is fairly typical of the euchromatic portions of Solanaceae genomes in this respect. Notably, the overall fractions of repetitive bases in these genomes are considerably higher, because they consist predominantly of repeat-rich heterochromatin (~75% of the tomato genome is estimated to be heterochromatin; Wang et al. 2006). On the basis of 4300 randomly sheared sequences, we estimate the overall repeat content of the tomato genome to be ~56% (L. Mueller, unpublished data), which is comparable to reported values for grass genomes—e.g., 58–66% in maize (Messing et al. 2004; Haberer et al. 2005) and 35% in rice (International Rice Genome Sequencing Project 2005).

We observed a strong depletion of TEs in introns relative to intergenic regions (P < 2.6 × 10−22, Fisher's exact test), probably because intronic TEs tend to disrupt transcription or splicing and hence are eliminated by natural selection. This depletion was observed in all species except eggplant, where an unusually low repeat content suggested the possibility of poor detection sensitivity. As in maize and rice (Haberer et al. 2005), LTR transposons were the most abundant among the annotated classes of repeats, except in tomato, where DNA transposons were more abundant (supplemental Table S6).

Conserved elements:

The mixture decomposition analysis above provided bulk statistical estimates of fractions of the genome under selection, but did not identify specific bases of interest. Therefore, the program phastCons (Siepel et al. 2005) was used to predict specific genomic elements that are significantly more conserved across species than would be expected under neutral evolution and hence are likely to be under purifying selection (Figure 2). The program was calibrated by requiring 60% coverage of coding regions by conserved elements, as estimated above. The predicted conserved elements covered 20.2% of all bases, including 27.9% of transcribed noncoding, 16.7% of intronic, and 8.3% of intergenic bases (supplemental Table S7). These fractions are somewhat lower than those predicted by the mixture decomposition method, most likely because many selected bases are not detectable by phastCons, due to short element length, weak selective pressure, or positive rather than negative selection. (For this data set, the expected minimum length of a detectable element is ~10 bp, assuming all bases within the element obey phastCons's model of conserved evolution.)

The predicted elements that fell outside of annotated protein-coding genes were identified as conserved noncoding sequences (CNSs; Figure 2B). These CNSs covered 11.2% of all noncoding bases, substantially more than the ~2% found in an analysis of rice and maize (using somewhat different methods) (Inada et al. 2003). As in animal genomes, they were shorter and less variable in length than conserved coding sequences (median lengths 26 and 60 bp), indicating possible cis-regulatory function. They were similar in length to CNSs recently identified in Arabidopsis (median length 24 bp; Thomas et al. 2007). The CNSs were considerably more conserved than nonsynonymous sites in protein-coding genes, with substitution rates ~15% lower (Figure 3D), but this difference may simply reflect the fact that stronger conservation is required to detect shorter elements.

In addition, elements showing evidence of lineage-specific differences in selective pressure were identified using the program DLESS (Siepel et al. 2006). The vast majority of the 97 identified elements were predicted to be under purifying selection on all branches of the phylogeny and corresponded closely with phastCons predictions. However, 3 elements were significantly diverged in one species yet conserved in the others (supplemental Table S8). These elements were predicted by DLESS as lineage-specific “losses”—that is, elements under purifying selection that had experienced relaxation of constraint on a particular branch of the phylogenetic tree. However, they could instead be subject to lineage-specific positive selection, which the program does not explicitly model. One element overlapped an exon of gene 11 (Figure 2D), another overlapped an exon of gene 12, and the third fell in an intron of gene 12. Notably, 2 of the 12 multispecies genes showed evidence of lineage-specific selection. No significant evidence of positive selection could be found in the protein-coding regions of these elements, using the branch-site LRT of Yang and Nielsen (2002), and no motif or other functional evidence could be associated with the noncoding elements. Regardless, these elements show striking cross-species differences at the sequence level and, as potential determinants of phenotypic differences between species, they may be worthy of future study.

Using RNAz (Washietl et al. 2005), the elements identified by phastCons and DLESS were examined for evidence of RNA secondary structure. There were 36 distinct elements (71 including overlapping elements from the two sets) with at least one moderately high-scoring predicted structure (probability >50%). Based on an estimated false-positive rate of 37%, an expected 23 of these 36 elements are true positives. Just under one-fourth of the predictions were in introns or intergenic regions, while the remainder overlapped coding regions, on either the sense or the antisense strand. Only 11% (8/71) of predicted structures showed significant similarity to known structures in the RFAM database (Griffiths-Jones et al. 2005) (supplemental Table S9), but this could in part reflect a dearth of plant RNAs in the database. An example of a high-scoring structure with a significant match in the database is shown in Figure 8. In addition to the RNAz predictions, we found five predicted C/D box snoRNAs using snoReport (Hertel et al. 2008), two of which matched known snoRNA structures in RFAM. We also searched for miRNA precursors using RNAmicro (Hertel and Stadler 2006) but found none. Interestingly, we observed a significant enrichment for 3′ exons among the exons of multi-exon genes that had overlapping RNAz-predicted structures (P = 0.045, Fisher's exact test; predictions on sense strand only). Overall, 7 of 23 exons with overlapping structures were last exons, and most of these structures did not extend substantially into the 3′-UTR. This is consistent with data indicating that a large proportion of human and mouse stop codons fall in RNA loop structures, which perhaps contribute to efficient termination of translation (Shabalina et al. 2006).

Figure 8.
Predicted RNA structure for a conserved element overlapping gene 15. (A) Diagram of secondary structure. (B) Multiple sequence alignment, with parentheses indicating paired bases and dots indicating unpaired bases. In B, red indicates columns with complete ...

We scanned the noncoding regions of the CSS for matches to 16 known Solanaceae regulatory motifs (supplemental Figure S4), using position-specific score matrices (PSSMs) and stringent significance thresholds (materials and methods). In some cases, predicted binding sites corresponded closely to CNSs identified by phastCons (Figure 2C). However, of 480 predicted binding sites, only 26 overlapped CNSs, and none of the motifs were significantly enriched or depleted for matches in CNSs (supplemental Table S10). Thus, these regulatory elements appear not to be maintained (in position or sequence) by strong purifying selection. However, weak detection power for binding sites and for conserved sequences that are short in length or subject to weak selection may contribute to these observations.


Comparative genomics has a venerable history in plants, going back at least 20 years (Bonierbale et al. 1988; Tanksley et al. 1988; Gale and Devos 1998). However, sequence data that would allow detailed comparisons of both coding and noncoding regions in plant species at moderate evolutionary distances have been slow to emerge. In this article, we take a step toward addressing this deficiency by presenting an analysis of orthologous genomic sequences in plants similar to the ones that have been so valuable in recent functional and evolutionary studies of animal genomes. Our analysis of an unduplicated CSS from five species in the family Solanaceae reveals, among other things, that the CSS is generally well conserved but has numerous small-scale differences between species, that at least about a third of the region is under selection, that it contains more selected bases in noncoding than in coding regions, and that most of its genes have experienced moderate to strong purifying selection, but two recently duplicated genes show evidence of positive selection.

With evidence for only two gene-gain and two gene-loss events since the divergence of the Solanaceae, the CSS shows greater conservation of gene content than has been observed in other comparative analyses of plant genomes. For example, CSSs in the dicot species P. trichocarpa, A. thaliana, and M. truncatula display frequencies of gene loss in at least one species of ≥40% (Kevei et al. 2005). Similarly high rates of gene loss have been observed in comparisons of tomato and Arabidopsis (Ku et al. 2000) and of the grass species maize, sorghum, and rice (Ilic et al. 2003). Notably, these other comparisons have all included species that have undergone relatively recent WGDs (e.g., maize or Arabidopsis), and the species with the most recent WGD has typically experienced the most gene loss (Ilic et al. 2003). Thus, it seems likely that the increased conservation of gene content observed in the Solanaceae is at least in part due to the absence of a recent WGD, as would be predicted by theory (Walsh 1995; Lynch and Conery 2003). Further support for this conjecture is provided by the high rate of gene loss in the six regions of the Arabidopsis genome homologous to the CSS (Figure 1), which evidently arose from duplications subsequent to the Arabidopsis/Solanaceae divergence. [Three of these segments—At2, At4, and At5-B—exhibit some conserved synteny and can be traced back to the α-WGD in Arabidopsis (Bowers et al. 2003).] Fifteen of the Solanaceae genes have at least one identifiable Arabidopsis homolog in these regions, and 8 have two or more homologs. Remarkably, in 7 of these 8 cases, all but one homolog has been pseudogenized. Altogether, more than half of the Arabidopsis genes in these regions have been lost, and these gene-loss events have been accompanied by a large number of rearrangement and insertion events. The degree of conservation in the Solanaceae is strikingly high by contrast.

We estimate that the five Solanaceae species studied here all derive from a most recent common ancestor that lived ~30 MYA. This would imply that these species diversified well after the breakup of the Gondwana supercontinent, because most major continental separations are estimated to have occurred between 180 and 105 MYA (McLoughlin 2001). Therefore, our estimates support a hypothesis of long-distance—rather than intra-Gondwanan—seed dispersal to explain the worldwide distribution of the Solanaceae (D'Arcy 1991; Olmstead and Palmer 1992, 1997). They are consistent with Olmstead and Palmer's (1997) proposal of a New World origin and initial diversification of the Solanum, followed by dispersal to and further diversification in Africa, Australia, and other parts of the world. However, certain genera of Solanaceae, such as Schizanthus and Schwenkia, are believed to have branched off somewhat earlier than petunia (Olmstead and Palmer 1992; Olmstead et al. 1999), and our data do not help in dating these events.

Our estimate of 5.1–7.3 MYA for the tomato/potato divergence is substantially older than a recent estimate of 1.6–3.3 MYA, based on a large collection of ESTs (Blanc and Wolfe 2004b). This difference appears to stem primarily from the assumption in that study of an absolute rate of 1.5 × 10−8 synonymous substitutions per site per year, as estimated for the Adh and Chs loci in Arabidopsis (Koch et al. 2000). If this substitution rate were assumed for our data, the AS divergence would be estimated at only 76.5 MYA, in conflict with most other evidence. Thus, it appears that Koch et al.'s estimated substitution rate is too high, at least for the Solanaceae. By instead conditioning on the best available estimated dates for the AS divergence, we estimate a substitution rate in the Solanaceae of only 9.6 × 10−9 (8.3 × 10−9–1.1 × 10−8) synonymous substitutions per site per year, ~1.6 times lower than Koch et al.'s estimate. Interestingly, Tuskan et al. (2006) observed a similar, but more dramatic, rate slowdown in Populus compared with Koch et al.'s estimate for Arabidopsis. It is worth stressing that our methods assume a molecular clock not only within the Solanaceae but also between Arabidopsis and the Solanaceae. If substitution rates in the Solanaceae are indeed lower than those in Arabidopsis, then we will have underestimated the divergence times within the Solanaceae. Such a difference in rates would imply that the root of the phylogeny in Figure 4 should slide toward the Solanaceae, which would have the effect of pulling the Solanaceae divergence events back in time. Any underestimation of the neutral distance between Arabidopsis and the Solanaceae due to saturation of synonymous sites would have a similar effect on the estimated divergence times. For these reasons, we believe the tomato/potato divergence occurred at least ~5–7 MYA.

The proportion of nucleotide sites in a genome that are subject to natural selection (γ) is a key quantity in understanding how genomes evolve and in evaluating the completeness of functional annotations. There has been considerable interest in estimating this quantity from comparative sequence data, especially for mammalian (e.g., Shabalina et al. 2001; Mouse Genome Sequencing Consortium 2002; Cooper et al. 2004; Smith et al. 2004; Lindblad-Toh et al. 2005), Drosophila (Bergman and Kreitman 2001; Halligan et al. 2004; Andolfatto 2005), and Caenorhabditis (Shabalina and Kondrashov 1999) genomes. However, we know of no published estimate of this quantity for plant species. In this study, we find evidence that at least about a third of bases within the CSS are under selection and that more than two-thirds of selected bases fall in noncoding regions. Therefore, like mammalian and Drosophila genomes (Mouse Genome Sequencing Consortium 2002; Chiaromonte et al. 2003; Andolfatto 2005), this region appears to contain a large amount of genomic “dark matter”—that is, functional DNA that is invisible to current methods for genome annotation. It is not known how representative the CSS is of the entire Solanaceae genomes, but it is worth noting that the gene density of the region is fairly typical for euchromatin. The CSS has one gene per 7 kb compared with one gene per ~6.7 kb in the euchromatic portion of the tomato genome (Wang et al. 2006). Values of γ in the gene-poor heterochromatin are likely to be considerably lower. Notably, because they are based completely on between-species divergence, our estimates should primarily reflect ancient selection—operating on the scale of millions of years—rather than more recent, possibly human-induced selection.

Several sources of possible error—such as lineage-specific selection, (negative) selection in so-called neutral sites, and failure to align divergent sequences—would tend to produce underestimates, rather than overestimates, of γ. However, spuriously high measurements of the neutral rate of substitution (i.e., spuriously low neutral conservation scores) could lead to an upward bias in γ. Such a bias could result from choosing neutral sites that were actually evolving faster than average or from alignments of nonorthologous neutral bases. Our use of 4D sites (which fall in easy-to-align coding regions) and our synteny-based alignment methods make it unlikely that alignment error in neutral sites has a major effect on γ. We have also repeated the analysis with an alternative alignment and arrived at very similar results (supplemental material), providing further evidence that alignment error is not a dominant factor in the analysis. In addition, 4D sites have generally been found to evolve somewhat more slowly, rather than faster, than other neutral sites (e.g., Hardison et al. 2003). Indeed, we find that a phylogeny estimated from 1915 sites in ancestral repeats in the CSS has >10% greater total branch length than the one estimated from 4D sites (data not shown). This suggests that forces that may produce elevated rates of substitution in 4D sites, such as CpG methylation (Hobolth et al. 2006) or genetic draft, are probably more than offset by weak negative selection on synonymous sites. Another possible source of bias is that the phastOdds scores—which depend on estimated models of neutral and conserved evolution—are influenced by differences between 4D and other sites in base composition, branch length proportions, and substitution patterns. However, two quite different estimation schemes for these models, based on different sets of sites, led to very similar estimates of γ (materials and methods), suggesting that the mixture decomposition is being driven by conservation and not by ancillary properties. Taken together, these lines of evidence suggest that our estimate of equation M3 is conservative as a lower bound for the fraction of bases under selection in the CSS.

Our ability to address major questions about the evolution of the Solanaceae and of other plant species in this study has clearly been limited by having sequence data for only one ~100-kb region. The completion of the tomato genome sequence, and new sequence data for other Solanaceae species and for coffee, will allow for more definitive answers to these and many other questions. Nevertheless, this study demonstrates the potential of comparative sequence analysis to provide insight into the evolutionary forces that have shaped modern-day plant genomes.


We thank Rod Wing for assistance in BAC sequencing and T. Vinar, L. Mueller, J. S. Pedersen, and Y. Xu for assistance in the analysis. This work was partially supported by National Science Foundation grants DBI-0116076 (A.S.) and 0421634 (S.T.) and National Institutes of Health training grant T32 GM007617-27 (A.D.).


Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. AF273333 and EF517791EF5177914.


  • Andolfatto, P., 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437 1149–1152. [PubMed]
  • Arabidopsis Genome Initiative, 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408 796–815. [PubMed]
  • Bejerano, G., M. Pheasant, I. Makunin, S. Stephen, W. Kent et al., 2004. Ultraconserved elements in the human genome. Science 304 1321–1325. [PubMed]
  • Bell, C. D., D. E. Soltis and P. S. Soltis, 2005. The age of the angiosperms: a molecular timescale without a clock. Evol. Int. J. Org. Evol. 59 1245–1258. [PubMed]
  • Benjamini, Y., and Y. Hochberg, 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57 289–300.
  • Bennetzen, J. L., 2007. Patterns in grass genome evolution. Curr. Opin. Plant Biol. 10 176–181. [PubMed]
  • Benson, G., 1999. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27 573–580. [PMC free article] [PubMed]
  • Bergman, C. M., and M. Kreitman, 2001. Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res. 11 1335–1345. [PubMed]
  • Blanc, G., and K. H. Wolfe, 2004. a Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16 1679–1691. [PMC free article] [PubMed]
  • Blanc, G., and K. H. Wolfe, 2004. b Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16 1667–1678. [PMC free article] [PubMed]
  • Blanchette, M., W. J. Kent, C. Riemer, L. Elnitski, A. F. A. Smit et al., 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14 708–715. [PMC free article] [PubMed]
  • Bonierbale, M., R. Plaisted and S. Tanksley, 1988. RFLP maps based on a common set of clones reveal modes of chromosomal evolution in potato and tomato. Genetics 120 1095–1103. [PMC free article] [PubMed]
  • Borodovsky, M., and J. McIninch, 1993. GeneMark: parallel gene recognition for both DNA strands. Comput. Chem. 17 123–133.
  • Bowers, J. E., B. A. Chapman, J. Rong and A. H. Paterson, 2003. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422 433–438. [PubMed]
  • Burge, C., and S. Karlin, 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268 78–94. [PubMed]
  • Cannon, S., L. Sterck, S. Rombauts, S. Sato, F. Cheung et al., 2006. Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes. Proc. Natl. Acad. Sci. USA 103 14959–14964. [PMC free article] [PubMed]
  • Chiaromonte, F., R. J. Weber, K. M. Roskin, M. Diekhans, W. J. Kent et al., 2003. The share of human genomic DNA under selection estimated from human-mouse genomic alignments. Cold Spring Harbor Symp. Quant. Biol. 68 245–254. [PubMed]
  • Clark, A., M. Eisen, D. Smith, C. Bergman, B. Oliver et al., 2007. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450 203–218. [PubMed]
  • Clark, A. G., S. Glanowski, R. Nielsen, P. D. Thomas, A. Kejariwal et al., 2003. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302 1960–1963. [PubMed]
  • Cliften, P., P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton et al., 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301 71–76. [PubMed]
  • Cooper, G. M., M. Brudno, E. A. Stone, I. Dubchak, S. Batzoglou et al., 2004. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res. 14 539–548. [PMC free article] [PubMed]
  • D'Arcy, W. G., 1991. The Solanaceae since 1976, with a review of its biogeography, pp. 75–138 in Solanaceae III: Taxonomy, Chemistry, Evolution, edited by J. G. Hawkes, R. N. Lester, M. Nee and N. Estrada. Royal Botanic Gardens, Kew, UK.
  • Eddy, S. R., 2002. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinform. 3 18. [PMC free article] [PubMed]
  • Felsenstein, J., 1981. Evolutionary trees from DNA sequences. J. Mol. Evol. 17 368–376. [PubMed]
  • Frary, A., Y. Xu, J. Liu, S. Mitchell, E. Tedeschi et al., 2005. Development of a set of PCR-based anchor markers encompassing the tomato genome and evaluation of their usefulness for genetics and breeding experiments. Theor. Appl. Genet. 111 291–312. [PubMed]
  • Freeling, M., L. Rapaka, E. Lyons, B. Pedersen and B. Thomas, 2007. G-boxes, bigfoot genes, and environmental response: characterization of intragenomic conserved noncoding sequences in Arabidopsis. Plant Cell 19 1441–1457. [PMC free article] [PubMed]
  • Gale, M., and K. Devos, 1998. Plant comparative genetics after 10 years. Science 282 656–659. [PubMed]
  • Goff, S. A., D. Ricke, T.-H. Lan, G. Presting, R. Wang et al., 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296 92–100. [PubMed]
  • Griffiths-Jones, S., S. Moxon, M. Marshall, A. Khanna, S. R. Eddy et al., 2005. Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33 121–124. [PMC free article] [PubMed]
  • Guigó, R., E. T. Dermitzakis, P. Agarwal, C. P. Ponting, G. Parra et al., 2003. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl. Acad. Sci. USA 100 1140–1145. [PMC free article] [PubMed]
  • Guindon, S., and O. Gascuel, 2003. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52 696–704. [PubMed]
  • Haberer, G., S. Young, A. K. Bharti, H. Gundlach, C. Raymond et al., 2005. Structure and architecture of the maize genome. Plant Physiol. 139 1612–1624. [PMC free article] [PubMed]
  • Haberer, G., M. T. Mader, P. Kosarev, M. Spannagl, L. Yang et al., 2006. Large-scale cis-element detection by analysis of correlated expression and sequence conservation between Arabidopsis and Brassica oleracea. Plant Physiol. 142 1589–1602. [PMC free article] [PubMed]
  • Halligan, D., A. Eyre-Walker, P. Andolfatto and P. Keightley, 2004. Patterns of evolutionary constraints in intronic and intergenic DNA of Drosophila. Genome Res. 14 273–279. [PMC free article] [PubMed]
  • Hardison, R. C., K. M. Roskin, S. Yang, M. Diekhans, W. J. Kent et al., 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13 13–26. [PMC free article] [PubMed]
  • Hertel, J., and P. F. Stadler, 2006. Hairpins in a haystack: recognizing microRNA precursors in comparative genomics data. Bioinformatics 22 197–202. [PubMed]
  • Hertel, J., I. L. Hofacker and P. F. Stadler, 2008. SnoReport: computational identification of snoRNAs with unknown targets. Bioinformatics 24(2): 158–164. [PubMed]
  • Higo, K., Y. Ugawa, M. Iwamoto and T. Korenaga, 1999. Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res. 27 297–300. [PMC free article] [PubMed]
  • Hobolth, A., R. Nielsen, Y. Wang, F. Wu and S. Tanksley, 2006. CpG + CpNpG analysis of protein-coding sequences from tomato. Mol. Biol. Evol. 23 1318. [PubMed]
  • Ilic, K., P. J. SanMiguel and J. L. Bennetzen, 2003. A complex history of rearrangement in an orthologous region of the maize, sorghum, and rice genomes. Proc. Natl. Acad. Sci. USA 100 12265–12270. [PMC free article] [PubMed]
  • Inada, D., A. Bashir, C. Lee, B. Thomas, C. Ko et al., 2003. Conserved noncoding sequences in the grasses. Genome Res. 13 2030–2041. [PMC free article] [PubMed]
  • International Rice Genome Sequencing Project, 2005. The map-based sequence of the rice genome. Nature 436 793–800. [PubMed]
  • Jiang, Z., H. Tang, M. Ventura, M. Cardone, T. Marques-Bonet et al., 2007. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nat. Genet. 39 1361–1368. [PubMed]
  • Kawamoto, H., T. Morita, A. Shimizu, T. Inada and H. Aiba, 2005. Implication of membrane localization of target mRNA in the action of a small RNA: mechanism of post-transcriptional regulation of glucose transporter in Escherichia coli. Genes Dev. 19 328–338. [PMC free article] [PubMed]
  • Kellis, M., N. Patterson, M. Endrizzi, B. Birren and E. S. Lander, 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423 241–254. [PubMed]
  • Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle et al., 2002. The human genome browser at UCSC. Genome Res. 12 996–1006. [PMC free article] [PubMed]
  • Kent, W. J., R. Baertsch, A. Hinrichs, W. Miller and D. Haussler, 2003. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100 11484–11489. [PMC free article] [PubMed]
  • Kevei, Z., A. Seres, A. Kereszt, P. Kalo, P. Kiss et al., 2005. Significant microsynteny with new evolutionary highlights is detected between Arabidopsis and legume model plants despite the lack of macrosynteny. Mol. Genet. Genomics 274 644–657. [PubMed]
  • Koch, M. A., B. Haubold and T. Mitchell-Olds, 2000. Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Mol. Biol. Evol. 17 1483–1498. [PubMed]
  • Kosiol, C., T. Vinar, R. R. da Fonseca, M. J. Hubisz, C. D. Bustamante et al., 2008. Patterns of positive selection in six mammalian genomes. PLoS Genet. 4(8): e1000144. [PMC free article] [PubMed]
  • Ku, H. M., T. Vision, J. Liu and S. D. Tanksley, 2000. Comparing sequenced segments of the tomato and Arabidopsis genomes: large-scale duplication followed by selective gene loss creates a network of synteny. Proc. Natl. Acad. Sci. USA 97 9121–9126. [PMC free article] [PubMed]
  • Lindblad-Toh, K., C. M. Wade, T. S. Mikkelsen, E. K. Karlsson, D. B. Jaffe et al., 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438 803–819. [PubMed]
  • Lynch, M., and J. S. Conery, 2003. The evolutionary demography of duplicate genes. J. Struct. Funct. Genomics 3 35–44. [PubMed]
  • Ma, J., and J. L. Bennetzen, 2004. Rapid recent growth and divergence of rice nuclear genomes. Proc. Natl. Acad. Sci. USA 101 12404–12410. [PMC free article] [PubMed]
  • Magallon, S. A., and M. J. Sanderson, 2005. Angiosperm divergence times: the effect of genes, codon positions, and time constraints. Evol. Int. J. Org. Evol. 59 1653–1670. [PubMed]
  • Margulies, E. H., G. M. Cooper, G. Asimenos, D. J. Thomas, C. N. Dewey et al., 2007. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17 760–774. [PMC free article] [PubMed]
  • McCubbin, A. G., C. Zuniga and T. Kao, 2000. Construction of a binary bacterial artificial chromosome library of Petunia inflata and the isolation of large genomic fragments linked to the self-incompatibility (S-) locus. Genome 43 820–826. [PubMed]
  • McLoughlin, S., 2001. The breakup history of gondwana and its impact on pre-cenozoic floristic provincialism. Aust. J. Bot. 49 271–300.
  • Messing, J., A. Bharti, W. Karlowski, H. Gundlach, H. Kim et al., 2004. Sequence composition and genome organization of maize. Proc. Natl. Acad. Sci. USA 101 14349–14354. [PMC free article] [PubMed]
  • Miller, W., K. Rosenbloom, R. Hardison, M. Hou, J. Taylor et al., 2007. 28-Way vertebrate alignment and conservation track in the UCSC Genome Browser. Genome Res. 17 1797–1808. [PMC free article] [PubMed]
  • Mouse Genome Sequencing Consortium, 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420 520–562. [PubMed]
  • Mueller, L., T. Solow, N. Taylor, B. Skwarecki, R. Buels et al., 2005. The SOL Genomics Network: a comparative resource for Solanaceae biology and beyond. Plant Physiol. 138 1310–1317. [PMC free article] [PubMed]
  • Murphy, W., D. Larkin, A. Everts-van der Wind, G. Bourque, G. Tesler et al., 2005. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309 613–617. [PubMed]
  • Murphy, W. J., T. H. Pringle, T. A. Crider, M. S. Springer and W. Miller, 2007. Using genomic data to unravel the root of the placental mammal phylogeny. Genome Res. 17 413–421. [PMC free article] [PubMed]
  • Nielsen, R., 2005. Molecular signatures of natural selection. Annu. Rev. Genet. 39 197–218. [PubMed]
  • Notredame, C., D. Higgins and J. Heringa, 2000. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 205–217. [PubMed]
  • Olmstead, R. G., and J. D. Palmer, 1992. A chloroplast DNA phylogeny of the Solanaceae: subfamilial relationships and character evolution. Ann. MO Bot. Gard. 79 346–360.
  • Olmstead, R. G., and J. D. Palmer, 1997. Implications for the phylogeny, classification, and biogeography of Solanum from cpDNA restriction site variation. Syst. Bot. 22 19–29.
  • Olmstead, R. G., J. A. Sweere, R. E. Spangler, L. Bohs and J. D. Palmer, 1999. Phylogeny and provisional classification of the Solanaceae based on chloroplast DNA, pp. 111–137 in Solanaceae IV, Advances in Biology and Utilization, edited by M. Nee, D. E. Symon, J. P. Jessup and J. G. Hawkes. Royal Botanic Gardens, Kew, UK.
  • Otto, S., and J. Whitton, 2000. Polyploid incidence and evolution. Annu. Rev. Genet. 34 401–437. [PubMed]
  • Petrov, D. A., E. R. Lozovskaya and D. L. Hartl, 1996. High intrinsic rate of DNA loss in Drosophila. Nature 384 346–349. [PubMed]
  • Pollard, K. S., S. R. Salama, N. Lambert, M.-A. Lambot, S. Coppens et al., 2006. An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443 167–172. [PubMed]
  • Quiros, C. F., F. Grellet, J. Sadowski, T. Suzuki, G. Li et al., 2001. Arabidopsis and Brassica comparative genomics: sequence, structure and gene content in the ABI-Rps2-Ck1 chromosomal segment and related regions. Genetics 157 1321–1330. [PMC free article] [PubMed]
  • Rat Genome Sequencing Project Consortium, 2004. Genome sequence of the brown Norway rat yields insights into mammalian evolution. Nature 428 493–521. [PubMed]
  • Rensink, W., Y. Lee, J. Liu, S. Iobst, S. Ouyang et al., 2005. Comparative analyses of six solanaceous transcriptomes reveal a high degree of sequence conservation and species-specific transcripts. BMC Genomics 6 124. [PMC free article] [PubMed]
  • Rhesus Macaque Genome Sequencing and Analysis Consortium, 2007. Evolutionary and biomedical insights from the rhesus macaque genome. Science 13 222–234. [PubMed]
  • Ronning, C., S. Stegalkina, R. Ascenzi, O. Bougri, A. Hart et al., 2003. Comparative analyses of potato expressed sequence tag libraries. Plant Physiol. 131 419–429. [PMC free article] [PubMed]
  • Salzberg, S. L., A. L. Delcher, S. Kasif and O. White, 1998. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 26 544–548. [PMC free article] [PubMed]
  • Schwartz, S., W. J. Kent, A. Smit, Z. Zhang, R. Baertsch et al., 2003. Human-mouse alignments with BLASTZ. Genome Res. 13 103–107. [PMC free article] [PubMed]
  • Semon, M., and K. Wolfe, 2007. Consequences of genome duplication. Curr. Opin. Genet. Dev. 17 505–512. [PubMed]
  • Shabalina, S. A., and A. S. Kondrashov, 1999. Pattern of selective constraint in C. elegans and C. briggsae genomes. Genet. Res. 74 23–30. [PubMed]
  • Shabalina, S. A., A. Y. Ogurtsov, V. A. Kondrashov and A. S. Kondrashov, 2001. Selective constraint in intergenic regions of human and mouse genomes. Trends Genet. 17 373–376. [PubMed]
  • Shabalina, S. A., A. Y. Ogurtsov and N. A. Spiridonov, 2006. A periodic pattern of mRNA secondary structure created by the genetic code. Nucleic Acids Res. 34 2428–2437. [PMC free article] [PubMed]
  • Siepel, A., G. Bejerano, J. S. Pedersen, A. S. Hinrichs, M. Hou et al., 2005. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15 1034–1050. [PMC free article] [PubMed]
  • Siepel, A., K. Pollard and D. Haussler, 2006. New methods for detecting lineage-specific selection, pp. 190–205 in Proceedings of the 10th International Conference on Research in Computational Molecular Biology (Lecture Notes in Computer Science, Vol. 3909), edited by A. Apostolico, C. Guerra, S. Istrail, P. Pevzner and M. Waterman. Springer, New York.
  • Siepel, A., M. Diekhans, B. Brejov, L. Langton, M. Stevens et al., 2007. Targeted discovery of novel human exons by comparative genomics. Genome Res. 17 1763–1773. [PMC free article] [PubMed]
  • Smith, N. G. C., M. Brandstrom and H. Ellegren, 2004. Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Genomics 84 806–813. [PubMed]
  • Solovyev, V. V., A. A. Salamov and C. B. Lawrence, 1994. Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 22 5156–5163. [PMC free article] [PubMed]
  • Song, J., F. Dong and J. Jiang, 2000. Construction of a bacterial artificial chromosome (BAC) library for potato molecular cytogenetics research. Genome 43 199–204. [PubMed]
  • Song, R., V. Llaca and J. Messing, 2002. Mosaic organization of orthologous sequences in grass genomes. Genome Res. 12 1549–1555. [PMC free article] [PubMed]
  • Sonnhammer, E., and E. Koonin, 2002. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 18 619–620. [PubMed]
  • Stark, A., M. Lin, P. Kheradpour, J. Pedersen, L. Parts et al., 2007. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450 219–232. [PMC free article] [PubMed]
  • Stein, L. D., Z. Bao, D. Blasiar, T. Blumenthal, M. R. Brent et al., 2003. The genome sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol. 1 E45. [PMC free article] [PubMed]
  • Tanksley, S., R. Bernatzky, N. Lapitan and J. Prince, 1988. Conservation of gene repertoire but not gene order in pepper and tomato. Proc. Natl. Acad. Sci. USA 85 6419–6423. [PMC free article] [PubMed]
  • Tavaré, S., 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci. 17 57–86.
  • Thomas, B. C., L. Rapaka, E. Lyons, B. Pedersen and M. Freeling, 2007. Arabidopsis intragenomic conserved noncoding sequence. Proc. Natl. Acad. Sci. USA 104 3348–3353. [PMC free article] [PubMed]
  • Thomas, J. W., J. W. Touchman, R. W. Blakesley, G. G. Bouffard, S. M. Beckstrom-Sternberg et al., 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424 788–793. [PubMed]
  • Tiffin, P., and M. Hahn, 2002. Coding sequence divergence between two closely related plant species: Arabidopsis thaliana and Brassica rapa ssp. pekinensis. J. Mol. Evol. 54 746–753. [PubMed]
  • Tuskan, G. A., S. Difazio, S. Jansson, J. Bohlmann, I. Grigoriev et al., 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313 1596–1604. [PubMed]
  • Van der Hoeven, R., C. Ronning, J. Giovannoni, G. Martin and S. Tanksley, 2002. Deductions about the number, organization, and evolution of genes in the tomato genome based on analysis of a large expressed sequence tag collection and selective genomic sequencing. Plant Cell 14 1441–1456. [PMC free article] [PubMed]
  • Vanderpool, C. K., and S. Gottesman, 2004. Involvement of a novel transcriptional activator and small RNA in post-transcriptional regulation of the glucose phosphoenolpyruvate phosphotransferase system. Mol. Microbiol. 54 1076–1089. [PubMed]
  • Walsh, J. B., 1995. How often do duplicated genes evolve new functions? Genetics 139 421–428. [PMC free article] [PubMed]
  • Wang, Y., X. Tang, Z. Cheng, L. Mueller, J. Giovannoni et al., 2006. Euchromatin and pericentromeric heterochromatin: comparative composition in the tomato genome. Genetics 172 2529–2540. [PMC free article] [PubMed]
  • Washietl, S., I. L. Hofacker, M. Lukasser, A. Huttenhofer and P. F. Stadler, 2005. Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome. Nat. Biotechnol. 23 1383–1390. [PubMed]
  • Wikstrom, N., V. Savolainen and M. W. Chase, 2001. Evolution of the angiosperms: calibrating the family tree. Proc. Biol. Sci. 268 2211–2220. [PMC free article] [PubMed]
  • Wingender, E., P. Dietze, H. Karas and R. Knuppel, 1996. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 24 238–241. [PMC free article] [PubMed]
  • Wortman, J. R., B. J. Haas, L. I. Hannick, R. K. J. Smith, R. Maiti et al., 2003. Annotation of the Arabidopsis genome. Plant Physiol. 132 461–468. [PMC free article] [PubMed]
  • Wright, S., C. Yau, M. Looseley and B. Meyers, 2004. Effects of gene expression on molecular evolution in Arabidopsis thaliana and Arabidopsis lyrata. Mol. Biol. Evol. 21 1719–1726. [PubMed]
  • Wu, F., L. A. Mueller, D. Crouzillat, V. Petiard and S. D. Tanksley, 2006. Combining bioinformatics and phylogenetics to identify large sets of single-copy orthologous genes (COSII) for comparative, evolutionary and systematic studies: a test case in the euasterid plant clade. Genetics 174 1407–1420. [PMC free article] [PubMed]
  • Yang, Y. W., K. N. Lai, P. Y. Tai and W. H. Li, 1999. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J. Mol. Evol. 48 597–604. [PubMed]
  • Yang, Z., 1997. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13 555–556. [PubMed]
  • Yang, Z., and R. Nielsen, 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19 908–917. [PubMed]
  • Yang, Z., and A. D. Yoder, 2003. Comparison of likelihood and Bayesian methods for estimating divergence times using multiple gene loci and calibration points, with application to a radiation of cute-looking mouse lemur species. Syst. Biol. 52 705–716. [PubMed]
  • Yang, Z., R. Nielsen and M. Hasegawa, 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15 1600–1611. [PubMed]
  • Yu, J., S. Hu, J. Wang, G. K.-S. Wong, S. Li et al., 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296 79–92. [PubMed]
  • Yu, J., J. Wang, W. Lin, S. Li, H. Li et al., 2005. The genomes of Oryza sativa: a history of duplications. PLoS Biol. 3 e38. [PMC free article] [PubMed]
  • Zhang, J., R. Nielsen and Z. Yang, 2005. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22 2472–2479. [PubMed]
  • Zhang, L., T. Vision and B. Gaut, 2002. Patterns of nucleotide substitution among simultaneously duplicated gene pairs in Arabidopsis thaliana. Mol. Biol. Evol. 19 1464–1473. [PubMed]
  • Zhu, H., D.-J. Kim, J.-M. Baek, H.-K. Choi, L. C. Ellis et al., 2003. Syntenic relationships between Medicago truncatula and Arabidopsis reveal extensive divergence of genome organization. Plant Physiol. 131 1018–1026. [PMC free article] [PubMed]

Articles from Genetics are provided here courtesy of Genetics Society of America
PubReader format: click here to try


Related citations in PubMed

See reviews...See all...

Cited by other articles in PMC

See all...


Recent Activity

Your browsing activity is empty.

Activity recording is turned off.

Turn recording back on

See more...