Copyright © 2008, Cold Spring Harbor Laboratory Press

Multigenome DNA sequence conservation identifies Hox cis-regulatory elements

1 Division of Biology, California Institute of Technology, Pasadena, California 91125, USA; 2 Howard Hughes Medical Institute, California Institute of Technology, Pasadena, California 91125, USA

3 Corresponding authors. E-mail woldb/at/caltech.edu; fax (626) 395-5750. E-mail pws/at/caltech.edu; fax (626) 568-8012.

Received August 26, 2008; Accepted September 17, 2008.

Abstract

To learn how well ungapped sequence comparisons of multiple species can predict cis-regulatory elements in Caenorhabditis elegans, we made such predictions across the large, complex ceh-13/lin-39 locus and tested them transgenically. We also examined how prediction quality varied with different genomes and parameters in our comparisons. Specifically, we sequenced ~0.5% of the C. brenneri and C. sp. 3 PS1010 genomes, and compared five Caenorhabditis genomes (C. elegans, C. briggsae, C. brenneri, C. remanei, and C. sp. 3 PS1010) to find regulatory elements in 22.8 kb of noncoding sequence from the ceh-13/lin-39 Hox subcluster. We developed the MUSSA program to find ungapped DNA sequences with N-way transitive conservation, applied it to the ceh-13/lin-39 locus, and transgenically assayed 21 regions with both high and low degrees of conservation. This identified 10 functional regulatory elements whose activities matched known ceh-13/lin-39 expression, with 100% specificity and a 77% recovery rate. One element was so well conserved that a similar mouse Hox cluster sequence recapitulated the native nematode expression pattern when tested in worms. Our findings suggest that ungapped sequence comparisons can predict regulatory elements genome-wide.

Despite knowledge of entire genome sequences, discovering cis-regulatory DNA elements remains surprisingly inefficient.
In animal genomes, cis-regulatory elements are located unpredictably around or within the genes they regulate (Woolfe et al. 2005; Davidson 2006; Pennacchio et al. 2006; Engström et al. 2007). These elements, when dissected further, often prove to be composed of individual transcription factor binding sites that are often very loosely defined (Sandelin et al. 2004). Transgenic analysis in vivo is the most definitive way to show that a sequence is regulatory, but it is also the most time consuming and expensive. It is therefore desirable to use other criteria, such as preferential sequence conservation, to identify regions most likely to be functional. To evaluate a strategy for phylogenetic footprinting using four other Caenorhabditis species, we dissected the cis-regulatory structure of a Hox cluster in the nematode Caenorhabditis elegans (Fig. 1A).
If two or more species are evolutionarily close enough to show common development and physiology, their genomes are expected to share an underlying gene regulatory network driven by cis-regulatory elements with conserved sequences of several hundred base pairs (Tagle et al. 1988; Davidson 2006; Brown et al. 2007; Li et al. 2007). Within a functional cis-regulatory element, individual transcription-factor binding sites are generally short (~6–20 bp) with statistical preferences, not strict requirements, for specific bases (Sandelin et al. 2004). Statistical over-representation of such motifs has been useful for identifying transcription-factor binding sites common to coregulated genes in C. elegans (Ao et al. 2004; Gaudet et al. 2004; Wenick and Hobert 2004; Pauli et al. 2006; Etchberger et al. 2007; McGhee et al. 2007; Zhao et al. 2007). However, this approach requires a known set of coregulated genes, a limitation that cross-species genomic comparison methods do not have. The simplest genomic comparison method is all-against-all matching of ungapped sequence windows, which is well suited for finding cis-regulatory elements under selective pressure against insertions and deletions (Brown et al. 2002; Cameron et al. 2005). This kind of comparison reveals orientation-independent, one-to-many, and many-to-many relationships, all of which are possible for conserved cis-regulatory sequences, yet invisible in standard global alignments. While ungapped comparisons can highlight regulatory regions, they are not expected to resolve individual transcription-factor binding sites within them. However, different prediction biases from sequence conservation versus statistical over-representation can complement one another (Wang and Stormo 2003; Bigelow et al. 2004; Tompa et al. 2005; Chen et al. 2006). 
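The need for more than two genomes rests on how often short perfect matches arise by chance. As a rough check, assuming independent and equiprobable bases and counting only same-strand window pairs (both idealizations, and the function name is ours), the expected number of perfect 6-bp matches between two unrelated 100-bp segments can be sketched as:

```python
def expected_chance_matches(len_a: int, len_b: int, k: int,
                            p_base: float = 0.25) -> float:
    """Expected number of perfect k-bp matches between two random sequences.

    Assumes independent, equiprobable bases and same-strand comparison only;
    real noncoding DNA violates both assumptions to some degree.
    """
    windows_a = len_a - k + 1   # possible window starts in the first sequence
    windows_b = len_b - k + 1   # possible window starts in the second sequence
    # Each window pair matches perfectly with probability p_base ** k.
    return windows_a * windows_b * p_base ** k


if __name__ == "__main__":
    # 95 * 95 window pairs, each matching with probability 1/4096
    print(round(expected_chance_matches(100, 100, 6), 2))
```

Under these assumptions the estimate is 9025/4096, or about 2.2 matches per pair of segments, which is why a single pairwise match of this length carries little evidence of selection.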
Since purely random pairing of unrelated 100-bp DNA segments typically yields two perfect 6-bp matches (Dickinson 1991), comparing three or more species should identify sequences under selective pressure with greater accuracy than comparing only two (Boffelli et al. 2004; Sinha et al. 2004; Eddy 2005; Stone et al. 2005). This has recently been done for budding yeasts (Cliften et al. 2003; Kellis et al. 2003), Drosophila (Stark et al. 2007), and vertebrates (Krek et al. 2005; Xie et al. 2005, 2007; Pennacchio et al. 2006; McGaughey et al. 2008). Vertebrates have many conserved sequences that may be regulatory, but most have unknown functions (Bejerano et al. 2004; Boffelli et al. 2004; Ovcharenko et al. 2005; Ahituv et al. 2007) that are difficult to test in all cell types throughout the life cycle, especially in mammals. The nematode Caenorhabditis elegans has a compact genome (100 Mb, ~27,000 genes) and body (~1000 somatic cells in adults), which should allow candidate regulatory elements to be tested for function throughout development and across all cell types (Sulston and Horvitz 1977; Kimble and Hirsh 1979; Hillier et al. 2005). Although C. elegans is the most familiar Caenorhabditis species, others are available for multispecies genomic comparisons (Fig. 1B).

ceh-13 and lin-39 are a linked pair of Hox genes, orthologous to labial/HOXA1 and Sex combs reduced/HOXA5. Hox genes, an ancient class of developmental control genes, pose a special challenge to cis-regulatory analysis because they are not regulated as isolated loci. Instead, they are found throughout bilateria as conserved multigene clusters encoding paralogous transcription factors that are crucial for development, and that are expressed in complex spatiotemporal patterns requiring intricate transcriptional regulation (Garcia-Fernandez 2005; Lemons and McGinnis 2006). Hox genes not only function similarly in disparate animal phyla, but may also be regulated similarly (Malicki et al. 1992; Frasch et al.
1995; Popperl et al. 1995; Haerry and Gehring 1997; Streit et al. 2002; Garcia-Fernandez 2005), although few cis-regulatory elements shared by Hox clusters of different phyla have actually been found (Haerry and Gehring 1997; Streit et al. 2002). Nematodes have only a single set of Hox genes. Several megabases of DNA and numerous non-Hox genes separate the C. elegans Hox cluster into three subclusters of two genes each: ceh-13/lin-39, mab-5/egl-5, and nob-1/php-3 (Supplemental Fig. S1) (Aboobaker and Blaxter 2003). This differs from most vertebrate genomes, which have four or five versions of a single large, unfragmented Hox gene cluster (Lemons and McGinnis 2006). Some Hox genes have been lost in the C. elegans lineage, but all those present have vertebrate and arthropod orthologs (Clark et al. 1993; Maloof and Kenyon 1998; Aboobaker and Blaxter 2003; Stoyanov et al. 2003; Wagmaister et al. 2006). Cis-regulation is almost certainly confined within each C. elegans subcluster: The ceh-13/lin-39 subcluster is thus a natural experiment, in which two genes represent a cluster of vertebrate orthologs (Lemons and McGinnis 2006). The ceh-13/lin-39 subcluster is vital for much anterior and mid-body development in C. elegans, but deciphering its cis-regulation has been difficult and remains incomplete. It is large by C. elegans standards, with almost 20 kb of intergenic DNA encoding only a single microRNA gene. ceh-13 is required for both embryonic and postembryonic development; null ceh-13 mutations are lethal (Brunschwig et al. 1999). In the embryo, ceh-13 is expressed in the A, D, E, and MS lineages and is required for normal gastrulation (Wittmann et al. 1997). Two upstream regulatory sites have been reported to drive expression in the embryo, one of which also acts in the male tail (Streit et al. 2002; Stoyanov et al. 2003). 
Cis-regulation of post-embryonic ceh-13 expression, which includes the anterior dorsal hypodermis, anterior bodywall muscle, and ventral nerve cord (Brunschwig et al. 1999), is not yet well understood, especially in tissues where it is coexpressed with lin-39. While lin-39 is dispensable for viability, it is required for normal vulval development, migration of the QR and QL neuroblasts, muscle formation, and specification of VC neurons (Burglin and Ruvkun 1993; Clark et al. 1993; Wang et al. 1993; Clandinin et al. 1997; Grant et al. 2000; McKay et al. 2003). A recent study of the lin-39 promoter delimited several elements to ~300 bp by generating many transgenic reporter strains without using comparative genomics information; one of these elements was critical for vulval expression (Wagmaister et al. 2006).

Our working hypothesis is that the complex expression of the ceh-13/lin-39 locus arises from the summed actions of independent conserved cis-regulatory elements. We have dissected ceh-13/lin-39 cis-regulation through comparative genomics, and thus defined parameters likely to be useful for genome-wide analyses. This revealed several known and new regulatory elements, including one with functional similarity in mammalian Hox clusters.

Results

DNA sequencing

To enable comparisons to C. elegans, 1.1 Mb of genomic sequences from C. brenneri and C. sp. 3 PS1010 were sequenced and assembled (Supplemental Tables S1, S2). This comprised ~0.5% of each genome, assuming genome sizes roughly equal to C. elegans. The primary DNA sequence data were generally well assembled; the exception was a set of C. brenneri clones covering the mab-5/egl-5 intergenic region, which may have suffered from high polymorphism found in gonochoristic Caenorhabditis species (Graustein et al. 2002).

Sequence comparison

We used MUSSA (multi-species sequence analysis; http://mussa.caltech.edu) to find preferentially conserved sequences.
MUSSA is an N-way sequence comparison algorithm, generalized from Family Relations (Brown et al. 2002), which integrates similarities among three or more genomes (see Methods). It compares, via sliding window, every frame in each participating sequence with every frame in all other sequences, allowing users to choose a window size and threshold of conservation for ungapped sequence matches (here called “MUSSA matches”). MUSSA produces an orientation-independent map of all one-to-one, one-to-many, and many-to-many transitive matches (Fig. 2).
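The pairwise step of such a comparison can be illustrated with a brute-force sketch. This is not the MUSSA implementation (which is written in C++); the function names and the tuple layout are our own assumptions, and the code simply scores every ungapped window pair between two sequences on both strands and keeps those at or above a chosen identity threshold.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def identities(a: str, b: str) -> int:
    """Count positions at which two equal-length windows agree."""
    return sum(x == y for x, y in zip(a, b))


def window_matches(seq_a: str, seq_b: str, window: int, threshold: int):
    """List (start_a, start_b, strand, score) for all ungapped window pairs
    sharing at least `threshold` identities out of `window` positions.

    start_b is always reported in forward-strand coordinates of seq_b.
    """
    hits = []
    rc_b = reverse_complement(seq_b)
    n_b = len(seq_b)
    for i in range(len(seq_a) - window + 1):
        win_a = seq_a[i:i + window]
        for j in range(n_b - window + 1):
            score = identities(win_a, seq_b[j:j + window])
            if score >= threshold:
                hits.append((i, j, "+", score))
            score_rc = identities(win_a, rc_b[j:j + window])
            if score_rc >= threshold:
                # Window j on the reverse strand covers forward positions
                # [n_b - window - j, n_b - j).
                hits.append((i, n_b - window - j, "-", score_rc))
    return hits


if __name__ == "__main__":
    print(window_matches("GGGACGTTCGGG", "TTACGTTCTT", window=6, threshold=6))
```

Because every window is compared with every other window, repeated or inverted elements appear as one-to-many or reverse-strand hits rather than being forced into a single collinear alignment. The cost grows with the product of the sequence lengths, so a sketch like this is only practical for short loci.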
A number of parallel lines from visualizing MUSSA matches (at a given threshold of conservation) identified domains of similarity between the sequences, indicating the uniqueness and colinearity of potential regulatory elements (Fig. 2).

We initially performed two-way comparisons using a 30-bp window size, which minimized cross-hatched noise and had been useful in comparing mammalian genomes (T. De Buysscher, unpubl.). In principle, the threshold which gives P ≤ 0.05 for spurious matches in a 30-bp window should be 19/30 identities in 1 kb of completely random sequence (Brown 2006). Since nonconserved sequence is not actually random, the real P-value must be larger. For thresholds of ≤21/30, we found that cross-hatched connections marred the readout (Fig. 2B).

Three-way comparison of ceh-13/lin-39 sequences from C. elegans, C. briggsae, and C. brenneri with 30-bp windows identified several conserved regions (Fig. 2A).

Cis-regulatory elements operating during development are typically composed of multiple binding sites arrayed over several hundred base pairs (Davidson 2006; Li et al. 2007). We expected that not all of these binding sites would be preserved as ungapped sequence blocks. To ensure that our comparison parameters did not omit functional sequences from transgenic assays, we buffered each MUSSA match with 200 bp of flanking DNA on each side. Aligned features located close to each other were grouped into single regions for testing. In this manner, 11 different regions (N1–N11) were predicted to be functional (Fig. 3A).
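The buffering and grouping step can be sketched as interval arithmetic. The text does not state how close two features had to be in order to be grouped, so the sketch below assumes, purely for illustration, that matches are merged whenever their 200-bp-padded intervals overlap or touch; the function name and the half-open coordinate convention are also our assumptions.

```python
def flank_and_merge(matches, flank=200, seq_len=None):
    """Pad each conserved match with flanking DNA and merge overlapping regions.

    matches : iterable of (start, end) half-open intervals on one sequence
    flank   : bases added on each side of every match
    seq_len : optional sequence length used to clip the right-hand edge
    """
    padded = []
    for start, end in matches:
        lo = max(0, start - flank)
        hi = end + flank if seq_len is None else min(seq_len, end + flank)
        padded.append((lo, hi))
    padded.sort()

    regions = []
    for lo, hi in padded:
        if regions and lo <= regions[-1][1]:
            # Overlaps or abuts the previous region: extend it.
            prev_lo, prev_hi = regions[-1]
            regions[-1] = (prev_lo, max(prev_hi, hi))
        else:
            regions.append((lo, hi))
    return regions


if __name__ == "__main__":
    # Three hypothetical 15-bp matches; the first two fall into one region.
    print(flank_and_merge([(1000, 1015), (1300, 1315), (5000, 5015)]))
```

With these hypothetical coordinates, the two matches 285 bp apart collapse into a single candidate region, while the distant third match is tested separately.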
Four of the 11 conserved regions corresponded to sequences previously shown to have some function. Region N8 corresponds precisely to the microRNA mir-231 and its upstream promoter. mir-231 is expressed from embryonic through adult stages, but its biological role is unknown (Lim et al. 2003). Region N3 drives larval ventral nerve cord expression (pJW8) (Wagmaister et al. 2006); region N9 drives embryonic expression (enh450) (Streit et al. 2002); and a region including element N10 drives larval and male tail expression (271-bp enhancer) (Stoyanov et al. 2003). Because our comparison rediscovered elements of the ceh-13/lin-39 subcluster previously shown to be important, it seemed likely that the newly defined blocks of similarity would also have biological activities.

Expression in C. elegans

We tested nine of the 11 strongly conserved regions, and all 10 intervening weakly conserved regions, for their ability to positively regulate expression; their repressor activity (if any) was not assayed. We did not retest the previously characterized N8 and N10, but did retest N3 and N9 to show that our assays reproduced published expression patterns in our reporter system (a Δpes-10 basal promoter driving nuclear-localized GFP with an unc-54 3′ untranslated region [UTR]). Background expression from the reporter is described in the Supplemental material, as are experiments showing that different basal promoters gave identical expression patterns in elements that were retested. Most conserved regions drove expression in specific cell types (Table 1). In all cases, the described expression pattern was reproducible in multiple independent lines. Despite some spatial and temporal overlap, the expression patterns for each region were unique.
The intronic element N1 drove expression in vulval muscle, starting during the L4 larval stage and continuing through the adult (Fig. 4A).
Potential regulatory sequences were found for both ceh-13 and lin-39. For conserved regions closer to ceh-13 (N9 and N11), observed patterns agreed well with expected ones (Wittmann et al. 1997; Brunschwig et al. 1999; Streit et al. 2002). Expression of lin-39 in the bodywall muscles, intestine, and central body region has been described and was reproduced, for the most part, by conserved regions closer to lin-39: N1–N4 and N7 (Clark et al. 1993; Wang et al. 1993; Maloof and Kenyon 1998; McKay et al. 2003). Furthermore, expression in the anterior midbody is predicted for both transcription factors, meaning that regions N2–N4 could be acting on both genes. Published patterns for both ceh-13 and lin-39 may be incomplete, which would account for observed activities beyond those expected.

Each region drove a different expression pattern. The fusion of a large region (W2) that included both N7 and N9 drove expression in both anterior and posterior bodywall muscle, a simple summation of N7 (strictly posterior) and N9 (strictly anterior) expression patterns (Fig. 3A).

We then asked what regulatory activities, if any, resided in the less-conserved regions between our conserved elements. Four of the 10 less-conserved regions (I0, I1, I4, and I8) yielded expression apart from the expected background. Region I0 drove expression in the ventral posterior coelomocytes (Fig. 4I).

Testing for sequence necessity

Our DNA regions from the ceh-13/lin-39 Hox subcluster contained not only blocks of ungapped sequence similarity, but also nonconserved sequences in which they were embedded. While these regions clearly drove expression in transgenic worms, our initial survey did not test whether the small conserved matches within them were crucial for regulatory activity. We therefore assayed in vivo constructs derived from some of the most highly conserved regions (N1, N2, N3, and N7; Supplemental Tables S3, S4), in which we mutated the MUSSA match in C. elegans.
For N7, mutating the MUSSA match completely eliminated expression in the posterior bodywall muscle, showing the match to be needed for regulation (Fig. 5).
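As described in the Methods, conserved sequences were mutated by reversal rather than by reverse complementation. The distinction can be made concrete with a short sketch; the example site and the function names are hypothetical and are not taken from the constructs used here.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_only(site: str) -> str:
    """Reverse the base order on the same strand (the mutation strategy)."""
    return site[::-1]


def reverse_complement(site: str) -> str:
    """Reverse complement: the same double-stranded site, opposite orientation."""
    return site.translate(COMPLEMENT)[::-1]


def mutate_by_reversal(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] with its reversal, leaving the flanks untouched."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


if __name__ == "__main__":
    site = "AGGTCA"                      # hypothetical conserved match
    print(reverse_only(site))            # same bases, different sequence
    print(reverse_complement(site))      # what the mutation deliberately avoids
    print(mutate_by_reversal("CCCAGGTCATTT", 3, 9))
```

Reversal keeps the length and base composition of the element constant, so any loss of expression is more plausibly attributed to the lost sequence than to altered spacing or base content, whereas reverse complementation would merely present the same site in the opposite orientation.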
Ultraconserved elements

Hox clusters are evolutionarily ancient, sharing a common origin for all bilaterians (Garcia-Fernandez 2005; Lemons and McGinnis 2006), meaning that some cis-regulatory elements in C. elegans ceh-13/lin-39 might be conserved in other bilaterian phyla (Haerry and Gehring 1997; Streit et al. 2002). The following Hox clusters were searched for any possible MUSSA matches to our conserved elements: the single Hox clusters of Drosophila melanogaster, Aedes aegypti (mosquito), Anopheles gambiae (mosquito), Apis mellifera (honey bee), Branchiostoma floridae (lancelet), Capitella sp. I (polychaete worm), Helobdella robusta (leech), Lottia gigantea (snail), Schistosoma mansoni (trematode), Schmidtea mediterranea (flatworm), and Tribolium castaneum (beetle); the four Hox clusters of mouse and human; and the seven Hox clusters of zebrafish. In each of these genomes we found several matches of uncertain significance. We therefore searched orthologous Hox regions for recurrent patterns of MUSSA matches (Fig. 6A).
In both mouse and human, N3 and N7-like MUSSA matches were paired with each other in the HOXA cluster near the ceh-13 and lin-39 orthologs, HOXA1 and HOXA5, respectively. Scans of the HOXA clusters in dog, opossum, platypus, and frog also revealed this pairing (Fig. 6A).

To test whether the interphylum similarities revealed functional sequences, we cloned a 700-bp region of mouse Hox genomic DNA centered on the mouse N3-like MUSSA match and a 650-bp region centered on the N7-like MUSSA match, each containing local sequence conserved among mammals. We assayed both regions in C. elegans transgenes. The mouse N3-like region drove almost the same expression pattern as the C. elegans N3 region (Fig. 6D).

If N3's similarities between nematodes and vertebrates result from common descent, N3-like matches should exist in other animal phyla. We found co-occurrence of two top-scoring MEME motifs and a MUSSA match in the nematodes, vertebrates, B. floridae, Capitella sp. I, H. robusta, and S. mansoni (Supplemental Fig. S3B; Supplemental material). MUSSA comparison of N3-like sequences in nematodes, vertebrates, and B. floridae yielded a 70% match, while a comparison of nematodes, vertebrates, S. mansoni, and H. robusta yielded a 65% match (Supplemental Fig. S3B). These matches encompass deuterostomes, ecdysozoa, and lophotrochozoa: all of the major divisions of bilateria. Thus, we interpret the N3 site to be evolutionarily conserved rather than convergent.

Threshold revision

Having had some success with our initial parameters for ungapped sequence comparison, we then adjusted them empirically and retested them computationally against well-characterized genes in the hope of optimizing our parameters for genome-wide analysis. Initially, nine of the 11 regions (82%) identified by conservation gave expression, while three of the 10 less conserved regions (30%) gave expression; this was promising, but left room for possible improvement.
When we tried lower thresholds or smaller windows, MUSSA found matches in some regions that had previously given no hits despite having regulatory activity (and that we had originally classified as false negatives). We therefore optimized the parameter settings and genome combination to achieve the best yield of functional elements while keeping false positives to a minimum (Fig. 2D–F).
The intervening regions were often much larger than any conserved region. For instance, region I4 was 4.2 kb; however, the subsection of I4 sufficient to drive expression was 1.6 kb (38% of I4) (Wagmaister et al. 2006). Likewise, region I8 was 2.9 kb, but expression could be recapitulated with only 0.7 kb within it (24% of I8) (Streit et al. 2002). Thus, the density of regulatory regions within nonconserved sequences is probably even lower than our data indicate (Fig. 3B).

To test whether the revised parameters are useful outside the Hox cluster, we analyzed the previously described C. elegans genes hlh-1, myo-2, myo-3, and unc-54 (Okkema et al. 1993; Krause et al. 1994). These were chosen for analysis because their promoter dissections had been screened for expression across all tissues, unlike most studies, which identify positive expression in a specific tissue but do not screen for negative activity across other tissues. Using our strict 15-bp threshold and technique of including 200 bp of flanking DNA, all known regulatory elements of the myosin genes myo-2, myo-3, and unc-54 (Okkema et al. 1993) were identified with no false positives (Supplemental Fig. S7). For the hlh-1 locus, two of four regulatory sites (Krause et al. 1994) were recovered at a lower threshold. Therefore, MUSSA predictions were accurate at some non-Hox loci, but as in the Hox locus itself, some functional elements could not be identified this way.

Discussion

This study found four known and seven new cis-regulatory elements in the ceh-13/lin-39 Hox subcluster of C. elegans, using ungapped sequence conservation across four genomes and verification by transgenic analyses. Remarkably, one conserved element’s mouse counterpart recapitulated the native nematode expression pattern.
The observed expression patterns generally paralleled those found by prior antibody staining and expression from the parental undissected promoters, suggesting that the union of these cis-regulatory elements drives the entire endogenous expression pattern, and that we have identified most cis-regulatory regions of ceh-13/lin-39 (Clark et al. 1993; Wang et al. 1993; Wittmann et al. 1997; Maloof and Kenyon 1998; Brunschwig et al. 1999; Streit et al. 2002; McKay et al. 2003). For ceh-13/lin-39, our first parameters for sequence conservation worked well, even though we later improved them empirically. They identified 11 possible elements, of which nine showed function experimentally, leaving two false positives—a threefold enrichment for functional regulatory elements compared with simple, unselected tiling. With revised parameters, 100% of the computationally identified elements were functional. For these nematode sequences, we found that MUSSA predicted function with highest reliability and resolution when we used windows of 15 bp. Smaller windows gave noisier alignments with poor resolution, while larger windows tended to miss shorter conserved sequences with regulatory activities. These parameters correctly rediscovered regulatory regions in other well-characterized genes, but made some errors, suggesting additional possible refinements as functional data becomes available at other loci. However, we do not expect that this method, used on its own, will discover all elements. We also expect parameters to change when the set of compared genomes is changed, as we have already found. For instance, the conserved regions for vertebrate Hox sequences (e.g., the N3-like mouse region) were much longer than in nematodes, and could be detected at a lower MUSSA threshold with a larger window size. Such differences in sequence conservation might arise from different rates and types of mutations, or from altered selection pressures. 
Our aim was to efficiently predict new elements with bona fide biological activity, accepting that this runs the risk of missing some regulatory regions. Nevertheless, correctly identifying even two-thirds of all C. elegans regulatory elements with a low false-positive rate, as we did prior to refinement, could significantly advance our knowledge of the worm regulatory genome. Recent uses of sequence constraint in vertebrates have been less sensitive in finding regulatory elements, perhaps because vertebrates undergo qualitatively different regulation (Pennacchio et al. 2006; McGaughey et al. 2008), although there are many differences, both biological and methodological, between their studies and this one. Only a representative subset of regulatory sites are needed to derive refined, genome-wide motifs in C. elegans, as we did with N2-1 (Supplemental material), which can then be statistically correlated with traits of their neighboring genes (Wenick and Hobert 2004; Mortazavi et al. 2006; Etchberger et al. 2007). If a given regulatory element is mutated or fragmented in some species, comparing it with different sets of related species can still allow detection of that element. Such regulatory mutations are known to be responsible for subtle evolutionary changes in the salt resistance and excretory canal phenotypes of C. elegans, which have diverged from the ancestral phenotypes retained in C. briggsae and C. brenneri (Wang and Chamberlin 2004). The most striking difference in conservation we observed was between Elegans group species and the outlying C. sp. 3 PS1010. Four-way comparison of C. elegans, C. briggsae, C. brenneri, and C. remanei predicted the most regulatory elements, many of which could only be detected in C. sp. 3 PS1010 with much lower and noisier thresholds. Although all regions identified with C. sp. 3 PS1010 drove expression, there was no added benefit from this comparison; rather, it increased the false-negative rate. 
Similarly, neither lin-3 nor lin-11 in C. sp. 3 PS1010 had the organization or the sequence motifs of the genes in the Elegans group species (Supplemental Fig. S8; Supplemental Table S5). Additional Caenorhabditis genomic sequences should clarify which parts of the C. elegans genome encode species- or group-specific traits. The regulatory organization of the ceh-13/lin-39 locus appears to be modular, with each regulatory element functioning independently in transgenes: The expression output of two elements on a single DNA fragment (N7 and N9 on W2) or of four coinjected elements (N1, N2, N3, and N7) matched the sum of their individual activities. Nevertheless, the linear order of conserved elements across the ceh-13/lin-39 locus has been conserved between the different Caenorhabditis species, including the relatively distant C. sp. 3 PS1010, suggesting that element order is under selective pressure. Among the elements, there is also potential for some functional redundancy, as has been noted in mammals (e.g., Ahituv et al. 2007). ceh-13, for example, is expressed in the larval ventral nerve cord (Brunschwig et al. 1999) and three different elements drive expression there. Multiple regulatory elements distributed throughout large introns and flanking sequences control many metazoan genes expressed in complex spatiotemporal patterns (Woolfe et al. 2005; Davidson 2006; Pennacchio et al. 2006) and ceh-13/lin-39 follows this trend. Only two of the nine expressing regions were located within the proximal 2-kb promoter sequences of ceh-13 or lin-39, and four were in lin-39 introns. We did not assay for the effect that these regions had on ceh-13, lin-39, or mir-231 expression. Other examples of distal elements in C. elegans include remote regulation of ceh-10 and osm-9 (Colbert et al. 1997; Wenick and Hobert 2004). Conservation analysis helped define elements without inadvertently splitting them, a hazard in blind deletion analysis. 
Moreover, it may have freed elements from inhibitory sequences, as we found that some large segments were less active when assayed than their subdomains. The entire second intron of lin-39 yielded no expression in a prior study (Wagmaister et al. 2006), but we identified four different active cis-regulatory elements (N1, N2, I0, and I1) by subdividing the region. One possibility is that poorly conserved DNA separating ceh-13/lin-39 elements harbors hidden regulatory functions that our assay misses, such as repression. The basal promoter construct we used to screen for in vivo enhancer activity is not expected to detect isolated transcriptional silencers or insulators. This could explain moderately conserved but inactive regions, as might enhancers dependent on untested culture conditions or promoter-specific interactions with regulatory elements (Wenick and Hobert 2004; Etchberger et al. 2007). Although large regions can be split into smaller functional components (such as the W2 region dividing into N7, N8, and N9, and the lin-39 intron dividing into N1, N2, I0, and I1), further dissection of functional elements might simply disrupt them, yielding weak and variable expression. This has been observed for ceh-13 male tail expression when multiple sites within N10 were mutated (V. Wegewitz and A. Streit, pers. comm.). Biologically relevant sequence motifs often appear in or near the best-conserved regions, even if the MUSSA matches themselves are not essential for regulatory activity. For instance, two conserved MUSSA matches <200-bp apart identify the element N9; but a known motif that is not part of either conserved window is located next to them, and is necessary for proper regulatory function (Supplemental Fig. S9A). In four of five mutageneses, changing just one conserved feature had little effect, which is consistent with functional redundancy often seen in multi-site regulatory elements. 
Our assays used injected transgenes, for which multiple copies generally exist of a cloned reporter (Mello and Fire 1995); this might have provided a relaxed context for gene expression, tolerating the loss of “redundant” sites actually required in vivo. A site that subtly controls the quantity or spatiotemporal pattern of gene activity could easily lack an observable impact on GFP expression. Thus, it is important to test not only conserved sequences for regulatory activity, but the sequences near them. The apparent conservation of N3 and N7 regions across phyla suggests that they predate the divergence of bilateria. Although mouse N7 was not active in the cross-phylum assay, the mouse N3-like region was strikingly positive and contains a potentially autoregulatory Hox/Pbx binding site. To test regulatory elements for functional conservation between different animal phyla, Drosophila enhancers and promoters have been compared with those of C. elegans and mammals: This generally involved isolating an enhancer or promoter with a known expression pattern in a donor organism, and testing it transgenically for similar expression in a second, distantly related organism (Malicki et al. 1992; Frasch et al. 1995; Popperl et al. 1995; Haerry and Gehring 1997; Streit et al. 2002; Ruvinsky and Ruvkun 2003). With nematode and mouse N3 regions, we instead tested the donor enhancer for activity equivalent to that already defined for its ortholog in the recipient species. This provides an alternative for comparisons over very long evolutionary distances, across which anatomical similarities may not be obvious. Moreover, additional MEME motifs, one of which may have been independently identified in mammals (as LM115 and LM171 of Xie et al. [2007]) (Supplemental Results), are shared by the vertebrate and nematode sequences. Based on these in vivo data and computational analyses, we consider N3 a pan-phyletic regulatory sequence. 
Such sequences may be rare, and only present in the most ancient regulatory loci, such as the ParaHox or NK clusters (Garcia-Fernandez 2005).

Methods

General methods and strains

We obtained Caenorhabditis elegans, C. brenneri CB5161, and C. sp. 3 PS1010 from the CGC strain collection and cultured them on OP50 at 20°C, using methods standard for C. elegans (Sulston and Hodgkin 1988). unc-119(ed4) hermaphrodites were microinjected with a mixture of 60 ng/μL unc-119 vector, 12 ng/μL unpurified fusion product, and either 100 ng/μL pBluescript or 100 ng/μL digested genomic DNA to generate transgenic animals (Mello and Fire 1995; Kelly et al. 1997). All noted expression patterns were observed in two or more independent transgenic lines. In nonexpressing lines, at least 16 hermaphrodites from three independent lines (each line driving background GFP to guarantee GFP’s functionality) were observed at each stage (early embryos, late embryos, L1–L4 larvae, young adults, and mature adults) with 100× magnification; males and dauers were observed for some, but not all, reporter lines.

DNA preparation

DNA was prepared by standard methods (Sulston and Hodgkin 1988). pEpiFos-5 (Epicentre), based on pBeloBAC11 (Birren et al. 1999), was used as the fosmid library vector. Fosmid sequences were shotgun sequenced and assembled into contigs by the Department of Energy’s Joint Genome Institute at Walnut Creek (http://www.jgi.doe.gov/sequencing/protocols).

Sequence analysis

Sequence contigs from JGI were initially linked by BLASTN (Korf et al. 2003) and then merged with the revseq and megamerger functions of EMBOSS (Olson 2002). Our C. brenneri data had 22 genomic contigs, totaling 680,633 nucleotides (Supplemental Table S1). Our C. sp. 3 PS1010 data had seven genomic contigs, totaling 417,129 nucleotides (Supplemental Table S2). Gene predictions were made with Twinscan 3.5 running in single-species mode with C. elegans parameters (Wei et al.
2005); predicted protein sequences were extracted with BioPerl (Stajich et al. 2002). C. brenneri and C. sp. 3 PS1010 protein sequences were tested for orthology against one another and against the protein-coding gene sets of C. elegans, C. briggsae, and C. remanei (from the WS170 release of WormBase) with OrthoMCL 1.3 (Li et al. 2003). Inferred ortholog groups were considered specific (i.e., unique) if they contained only one C. elegans gene, and only one gene from either C. briggsae or C. remanei. Our C. brenneri contigs encode 141 predicted proteins of ≥100 residues in length, of which 88 have unique C. elegans orthologs (Supplemental Table S1). Our C. sp. 3 PS1010 contigs encode 86 predicted ≥100-residue proteins, 68 with C. elegans orthologs (Supplemental Table S2).

SVG genomic sequence images were generated by GBrowse for nematodes and vertebrates at the WormBase (http://www.wormbase.org) and UCSC Genome Browser (http://genome.ucsc.edu) websites.

MUSSA (multiple species sequence analysis) (http://mussa.caltech.edu), a program written in C++ with a Python-controlled user interface, was used to identify evolutionarily conserved sequences. MUSSA uses N-way transitivity (all-against-all) so that only windows passing the selected similarity threshold across all species are reported as alignments. No sequences were repeat-masked in the comparisons performed here, though use of MUSSA in other phyla may benefit from masking as a preprocessing step (T. De Buysscher, D. Trout, and B.J. Wold, unpubl.). For regulatory element dissection in the ceh-13/lin-39 cluster, published sequences from C. elegans, C. briggsae, and C. remanei (http://www.wormbase.org) were used with novel sequences from C. brenneri and C. sp. 3 PS1010. The mab-5/egl-5 Hox cluster comparisons used sequences from C. elegans, C. briggsae, and C. remanei.
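As an illustration of the N-way transitive, ungapped window comparison described above, the following is a minimal Python sketch. It is not MUSSA's code: the function names (`matches`, `windows`, `pairwise_hits`, `nway_transitive`), the brute-force search, and the restriction to the forward strand are all assumptions made for this example. A tuple of windows, one per sequence, is reported only if every pair of windows in it shares at least `threshold` identical positions.

```python
from itertools import combinations


def matches(a, b):
    """Count identical positions between two equal-length strings (no gaps)."""
    return sum(x == y for x, y in zip(a, b))


def windows(seq, width):
    """Yield (start, window) for every ungapped window of the given width."""
    for i in range(len(seq) - width + 1):
        yield i, seq[i:i + width]


def pairwise_hits(seq_a, seq_b, width, threshold):
    """Window-start pairs (i, j) whose windows share >= threshold identities."""
    return {
        (i, j)
        for i, wa in windows(seq_a, width)
        for j, wb in windows(seq_b, width)
        if matches(wa, wb) >= threshold
    }


def nway_transitive(seqs, width, threshold):
    """Return tuples of window starts, one per sequence, in which every
    pair of windows passes the threshold (all-against-all).

    Hypothetical helper for illustration; forward strand only."""
    n = len(seqs)
    hits = {
        (a, b): pairwise_hits(seqs[a], seqs[b], width, threshold)
        for a, b in combinations(range(n), 2)
    }
    # Start from every window in the first sequence, then extend each
    # candidate tuple one sequence at a time, keeping an extension only
    # if the new window matches all windows already in the tuple.
    paths = [(i,) for i in range(len(seqs[0]) - width + 1)]
    for b in range(1, n):
        paths = [
            p + (j,)
            for p in paths
            for j in range(len(seqs[b]) - width + 1)
            if all((p[a], j) in hits[(a, b)] for a in range(b))
        ]
    return paths


# Toy sequences (invented for this example) sharing the 8-mer ACGTACGT.
toy = ["AAACGTACGTTT", "GGACGTACGTCC", "TTTTACGTACGT"]
print(nway_transitive(toy, width=8, threshold=8))
```

Lowering `threshold` below `width` tolerates mismatches within a window; because every pair must pass, adding a more divergent genome can only remove reported windows, never add them.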
Additional comparisons with non-nematodes used sequences from all of each organism’s available Hox clusters (http://www.ensembl.org; http://genome.ucsc.edu; http://www.genedb.org/genedb/smansoni; http://racerx00.tamu.edu; and http://genome.jgi-psf.org). Known regulatory regions of non-Hox genes were linked from C. elegans to other species using MUSSA.

MEME

Multiple EM for Motif Elicitation (MEME) v3.5.4 was used to identify nonaligned motifs shared by different animal phyla (http://meme.sdsc.edu/meme) (Bailey and Elkan 1994). MEME motifs from the N3 element were tested for similarities to previously published genomic motifs by examining two 14-nt human sequences with up to two mismatches against JASPAR CNE (Bryne et al. 2007; Xie et al. 2007).

Transgene design and construction

PCR fusions were generated using standard protocols, essentially as in Hobert (2002). Genomic DNA and the cosmids R13A5 and C07H6 (from A. Fraser and R. Shownkeen at the Sanger Institute) were used as sequence templates. The Fire Lab Vector pPD107.94 was used as the template for the Δpes-10 4X-NLS eGFP LacZ unc-54 sequence (Mello and Fire 1995). The Fire Lab Vector pPD95.75 was used as the template for the “promoterless” eGFP unc-54 sequence (Etchberger and Hobert 2008), used as a control in four constructs to demonstrate identical expression patterns under different basal promoters. Mutation primers were used to mutate target sites in plasmids. The mutated and sequenced enhancers were fused to Fire Lab Vector pPD122.53, where GFP was replaced with YFP, to give a Δpes-10 4X-NLS YFP unc-54. GFP was replaced with CFP for unmutated controls. We mutated conserved sequences by reversal, not reverse complementation; such reversal maintained the base content, but was expected to destroy any sequence-specific binding of transcription factors. Complete methods are described in the Supplemental material.

Acknowledgments

We dedicate this study to the memory of E.B.
Lewis, who pioneered the analysis of Hox clusters at Caltech. We thank C.T. Brown for discussions, N. Mullaney for work on an early version of MUSSA, E. Moon for aid in fosmid library construction, and E. Rubin and his colleagues at the DOE JGI for fosmid sequencing and assembly. We thank L.R. Baugh, C.T. Brown, C. Dalal, J. Green, M. Kato, K. Kiontke, A. Mortazavi, A. Seah, and B. Williams for comments on the manuscript. Some nematode strains used in this work were provided by the Caenorhabditis Genetics Center, which is funded by the NIH National Center for Research Resources (NCRR). Unpublished metazoan genomic sequences were generously provided by the DOE JGI and GeneDB. This work was supported by grants from DOE to B.J.W. and P.W.S., from NASA to B.J.W., from NIH to B.J.W., and from the HHMI, with which P.W.S. is an Investigator.

Footnotes

[Supplemental material is available online at www.genome.org. The sequence data from this study have been submitted to GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) under accession nos. FJ362353–FJ36238.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.085472.108.

References