Copyright © 2005, Cold Spring Harbor Laboratory Press

De novo discovery of a tissue-specific gene regulatory module in a chordate

1 Departments of Pathology and Genetics, Stanford University Medical Center, Stanford, California 94305-5324, USA
2 Departments of Statistics and Biostatistics, Stanford University, Stanford, California 94305, USA
3 Department of Zoology, Graduate School of Science, Kyoto University, Sakyo-ku, Kyoto 606-8502, Japan
4 Corresponding author. E-mail arend/at/stanford.edu; fax (650) 725-4905.

Received April 22, 2005; Accepted August 8, 2005.

Abstract

We engage the experimental and computational challenges of de novo regulatory module discovery in a complex and largely unstudied metazoan genome. Our analysis is based on the comprehensive characterization of regulatory elements of 20 muscle genes in the chordate, Ciona savignyi. Three independent types of data we generate contribute to the characterization of a muscle-specific regulatory module: (1) Positive elements (PEs), short sequences sufficient for strong muscle expression that are identified in a high-resolution in vivo analysis; (2) CisModules (CMs), candidate regulatory modules defined by clusters of overrepresented motifs predicted de novo; and (3) Conserved elements (CEs), short noncoding sequences of strong conservation between C. savignyi and C. intestinalis. We estimate the accuracy of the computational predictions by an analysis of the intersection of these data. As final biological validation of the discovered muscle regulatory module, we implement a novel algorithm to search the genome for instances of the module and identify seven novel enhancers.

Characterization of the regulatory logic underlying development and differentiation of multicellular animals remains one of the most formidable challenges in contemporary genomics.
High-throughput experiments that use expression arrays and related tools can describe global patterns of gene regulation during development and serve as a basis for discovering regulatory hierarchies (see Furlong et al. 2001; Kim et al. 2001; Montalta-He et al. 2002; Gaudet et al. 2004; Schroeder et al. 2004). A complementary approach, high-resolution structure-function studies of individual regulatory regions, can reveal on a gene-by-gene basis exactly which noncoding portions of a locus are sufficient or necessary for proper regulation. Such studies will ultimately be necessary to obtain a full understanding of the genomic control of development, and there is considerable interest in how computational predictions could enhance their efficiency. We therefore set out to assess the effectiveness of computational predictions and to estimate their sensitivity and specificity, by comparing results from computational analyses against the activities of a large number of regulatory constructs assayed in the ascidian chordate, Ciona (Corbo et al. 1997; Johnson et al. 2004). The culmination of these analyses was a regulatory module characterized in sufficient detail that we were able to obtain genome-wide module predictions and experimentally verify a subset of these as enhancers. Ciona is uniquely suited for structure-function and computational analyses of gene regulation. Draft genome sequences for C. savignyi (Vinson et al. 2005) and C. intestinalis (Dehal et al. 2002) as well as a sizeable EST sequence and in situ hybridization databases for C. intestinalis (Satou et al. 2002) are available. A small genome (180 Mb) with a number of expressed genes (15,000) similar to that of Drosophila (Dehal et al. 2002) ensures that the search space for noncoding functional elements is smaller than in vertebrates. 
Considerable functional conservation despite the large evolutionary distance between the Ciona species allows discovery of functional sequence elements by comparative sequence analyses (Johnson et al. 2004). Finally, electroporation of reporter constructs into developing embryos facilitates efficient in vivo expression analyses (see Corbo et al. 1997; Bertrand et al. 2003; Johnson et al. 2004).

We use a combination of experimental and computational approaches to discover and characterize a regulatory module common to 20 Ciona muscle genes. We chose muscle for several reasons. First, the proteins encoded by many muscle-specific genes, most notably those of the muscle fiber strand, are sufficiently conserved that their identification in the Ciona genome by similarity searches is unambiguous. Second, muscle is an easily recognized tissue in the ascidian larva, allowing efficient quantification of expression patterns. Third, the cellular interactions of muscle proteins likely require tight coregulation of their expression, enhancing the likelihood that they are under functionally similar regulatory control that would facilitate identification of a shared regulatory module.

Previous genome-wide computational searches for tissue-specific regulatory elements have been carried out in metazoan model organisms with comprehensive genome annotations and substantial prior knowledge of motifs and regulatory regions from decades of experimentation (see Berman et al. 2002, 2004; Gaudet et al. 2004; Wenick and Hobert 2004). By contrast, in our study we discovered a tissue-specific regulatory module de novo without prior knowledge of the motifs it contains, and structure-function studies carried out independently allowed us to estimate the predictive power of the module. We then searched for instances of the module on a genome-wide scale and validated the predictions in vivo.
The success of this study raises the possibility of genome-wide searches for enhancers with more complex expression patterns, as well as computational searches for tissue-specific elements in the human genome.

Results

Initial characterization of 20 muscle-specific regulatory regions

We selected 20 genes for detailed experimental characterization of regulatory sequences. The genes were chosen based on their strong similarity to proteins known to be involved in muscle physiology or structure (Fig. 1a).
For each gene, we built an initial reporter construct in which a putative native promoter, a putative start codon, and a varying amount of endogenous protein-coding sequence are fused in-frame to LacZ (see Methods) (Johnson et al. 2004). Expression is therefore driven by endogenous tissue-specific elements and a native promoter. All constructs demonstrate consistent expression in the larval tail muscle, with a median of 74% of the animals staining (Fig. 1a). While the in situ hybridizations show that the genes are expressed specifically in the tail muscle, our initial reporter constructs often result in expression in other tissues (Fig. 1a).

High-resolution in vivo analysis of 20 muscle-specific regulatory regions

By using the same in vivo reporter assay, we then identified regions within the initial constructs that are sufficient for muscle expression. In subsequent analyses, these subregions will be compared to computational predictions. Most of the constructs that we tested were truncations and/or deletions of the initial native promoter constructs (Fig. 1a). Some of these shortened constructs drive ectopic expression patterns that were not found in the initial native promoter constructs. For example, constructs of four genes (TnT, TPM2, MA-like, and MBP) show moderate (~5%) reproducible expression in the secondary notochord (see Fig. 1e).

Translation of in vivo results into binary data for subsequent analyses

To facilitate subsequent analyses in which we compare the functional data to computational predictions, we devised a method to translate the functional data into a binary data set. The result of this transformation is that every base in the original regulatory regions is either part of a functional element or not. We define a positive element (PE) as the shortest sufficient and non-overlapping sequence that drives strong expression in muscle (Table 1).
Only constructs that give a positive result are considered; negative results are not considered because reporter constructs may be nonfunctional for a variety of reasons unrelated to the function of positive regulatory elements (e.g., disruption of the transcription start site or spurious introduction of negative regulatory elements). This does not exclude the possibility that functional sequences reside outside of PEs, nor does it suggest that every base within the PE is necessary for function. Note that the set of PEs (Table 1) greatly overlaps with, but is not identical to, the “shortest constructs” (cf. Fig. 1a). To clarify the logic for identification of PEs, we present detailed functional data for three of the 20 genes (Fig. 2).
Identification and motif composition of a muscle regulatory module

Embedded within the upstream sequences of our muscle genes must be regulatory sequences that contain binding sites (motifs) for tissue-specific transcriptional activator proteins. Clusters of single motifs are often indicative of regulatory sequences (Markstein et al. 2002), as are clusters of several distinct motifs (see Berman et al. 2002). Accordingly, by reasoning that it is likely that more than one motif contributes to muscle-specific activation of target genes, we chose to use CisModule (Zhou and Wong 2004) to identify likely regulatory sequences (see Methods). CisModule implements Bayesian analysis to discover clusters of overrepresented motifs in a suite of coregulated sequences. These motif clusters (modules) consist of up to K distinct motifs that occur in spatial proximity and are predicted de novo from the input sequences (we used K = 4; see Methods). Instances of the predicted cis-module (CM) are then annotated in the input sequences. In all subsequent analyses, we will disregard those CMs that are >50% contained within exons, since CisModule predictions within exons may be spurious. All of our initial functional regulatory regions have at least one statistically significant CM; five genes have two highly significant modules, and two genes have three (Table 2; Fig. 3). CisModule outputs position-specific scoring matrices (PSSMs) for four motifs that are not overrepresented in our background sequences (Fig. 4a).
To ascertain which of these CM motifs are likely to be functional, as opposed to false positives, we determined their abundance in all predicted CMs versus CMs that overlap PEs at >50% (Fig. 4b).

Sensitivity and specificity of module predictions

As the CMs were predicted independently from the functional analyses, we asked to what extent the predictions are correct by analyzing overlaps between CMs and PEs. The purpose of this analysis is twofold: (1) to validate or refute the module on the basis of the functional data; and (2) if the module is validated, to assess sensitivity and specificity of the CisModule predictions. We again turn to the three representative loci (Fig. 2). Across all 18 PE-containing loci (Fig. 3
In order to further define the predictive success of CMs, we converted the data into base pair counts (Fig. 5b). One caveat to these calculations is that we do not have perfect experimental ascertainment of PEs at the base pair level. Some bases in PEs are certainly not functional, and other bases outside of PEs are likely functional. Thus, while our experimentally determined PEs are enriched for truly functional bases, the exact numbers are subject to uncertainty. Nonetheless, the main conclusions are supported by the overlap analyses both at the element level (Fig. 5a

Analysis of evolutionarily conserved regions within PEs

Every PE contains at least one highly conserved element (CE), defined here as at least 20 bp of at least 75% identity between the two Cionas (Fig. 2). Those CEs that do not reside within PEs may have a variety of functions. Some may be exons missed due to incomplete annotation of the two genomes. Others may have regulatory roles other than activator functions, such as repressors or insulators, which would not be reliably detected by our assays. Repressor functions are particularly attractive candidates, as we did occasionally observe reproducible ectopic expression by constructs carrying internal deletions or truncations (Fig. 1). We converted the CE data into base pair counts to estimate sensitivity and specificity, in analogy to, and for comparison with, the CM predictions. Sensitivity is here the number of PE bases contained within CEs divided by the total number of bases in PEs (Fig. 5d).

Search of the genome for conserved muscle modules reveals novel muscle enhancers

We next wanted to use the new information of the muscle-specific module to find enhancers on a genome-wide scale. Our previous analyses demonstrated that CEs and CMs have a correspondence with PEs.
To estimate the potential predictive power of our computational methods, we calculated the positive predictive value (PPV) (Sokal and Rohlf 1995), defined as True Positives/All Predictions. The PPV is 67.2% for bases that are both CE and CM; 55% for bases that are CM but not CE; and 29.2% for bases that are CE but not CM. Given this strong PPV of combining CisModule predictions and sequence conservation, we set out to search genome-wide for instances of the regulatory module that overlap with conserved elements. To this end, we devised and implemented a novel algorithm (CisModScan) that identifies candidate clusters of given motifs on a genome-wide scale (see Methods).

By using CisModScan, we searched the C. savignyi genome for clusters of Mf1, Mf2, and/or Mf3. We found 1183 predictions that contained at least two of the three motifs. We aligned all of these to their orthologous genomic regions in C. intestinalis. Many of the regulatory module predictions contained more than half of their nucleotides in C. intestinalis predicted exons (664/1183, or 56%). Of the remaining module predictions, 52%, or 269/519, have at least one highly conserved element. Of the predictions with at least one highly conserved element, a median 46% of the bases were contained within conserved elements (http://mendel.stanford.edu/supplementarydata/johnson2005/).

We chose 23 module predictions that contained at least one conserved element and were located <2 kb 5′ or 3′ to a predicted first exon. We assayed the function of these sequences with the same heterologous Brachyury promoter that was used for the initial structure-function studies. We found seven novel enhancers, each with distinct expression patterns (Fig. 6).
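The base-pair bookkeeping behind the PPV and sensitivity figures in this section can be illustrated with a short sketch. This is not the authors' code: the interval coordinates are invented toy values, the helper names (`bases`, `ppv`, `sensitivity`) are hypothetical, and intervals are assumed to be half-open and 0-based.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based (assumed convention)


def bases(intervals: Iterable[Interval]) -> Set[int]:
    """Expand intervals into the set of base positions they cover."""
    covered: Set[int] = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def ppv(predicted: Set[int], functional: Set[int]) -> float:
    """True positives / all predictions, counted in bases."""
    return len(predicted & functional) / len(predicted) if predicted else 0.0


def sensitivity(predicted: Set[int], functional: Set[int]) -> float:
    """Functional bases recovered by the prediction / all functional bases."""
    return len(predicted & functional) / len(functional) if functional else 0.0


# Toy annotations for one hypothetical region (not data from the study).
pe = bases([(100, 300)])              # positive element
cm = bases([(150, 350)])              # CisModule prediction
ce = bases([(250, 400), (600, 650)])  # conserved elements

ppv_ce_and_cm = ppv(cm & ce, pe)   # bases that are both CE and CM
ppv_cm_only = ppv(cm - ce, pe)     # bases that are CM but not CE
ppv_ce_only = ppv(ce - cm, pe)     # bases that are CE but not CM
cm_sensitivity = sensitivity(cm, pe)

# The genome-wide fractions quoted in the text follow from simple division.
exonic_pct = round(100 * 664 / 1183)            # 56
conserved_pct = round(100 * 269 / (1183 - 664)) # 52
```

Representing each annotation as a set of positions keeps the three PPV classes (CE and CM, CM only, CE only) down to plain set operations, at the cost of memory on long sequences; an interval-arithmetic version would scale better.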
Discussion

A number of previous studies have correctly predicted regulatory sites using conservation (Ghanem et al. 2003; Kellis et al. 2003; Johnson et al. 2004), de novo computational motif prediction (GuhaThakurta et al. 2004; Kusakabe et al. 2004), or conservation of predicted motifs (Wenick and Hobert 2004). Other studies have predicted regulatory modules on a genome-wide scale (Berman et al. 2004; Gaudet et al. 2004; Schroeder et al. 2004). However, our work is unique in leveraging a combination of these approaches to characterize a tissue-specific regulatory module de novo in a metazoan, and then use the knowledge of this module to conduct a genome-wide enhancer search and validate a subset of the enhancer predictions in vivo. Our success rate was similar to a genome-wide search for clusters of some of the most intensely studied binding sites in all of developmental biology (Berman et al. 2004). Our search for tissue-specific enhancers is also significantly more powerful than are less directed methods, such as a screen of random DNA fragments. For example, a recent functional screen of 138 random C. intestinalis genomic DNA fragments (average size, 1.7 kb; ~240 kb total) (Harafuji et al. 2002) yielded only five tissue-specific enhancers. In contrast, we identified seven novel enhancers in ~8 kb of tested DNA.

Metazoan gene regulation is inherently complex, since proper expression patterns often depend upon activators and repressors, interactions of tissue-specific elements with basal promoters, and other functional sequences that are often specific to each individual locus. We clearly have not described all functional elements contained within the 20 original genes or within the positive enhancers from the genome-wide search.
The complexity of gene regulatory structures is underscored by the variation exhibited by the seven new enhancers and the initial high-resolution screen for PEs: Five enhancers drive expression outside of the embryonic tail muscle, although most of the enhancers express strongly in muscle. In addition, certain constructs from the initial high-resolution screen for PEs exhibit ectopic expression not only in the notochord (a tissue that shares pre- and post-gastrulation cell lineages with muscle) (Nishida, 1987) but also in tissues embryologically unrelated to muscle, such as the central nervous system, ectoderm, or endodermal strand. We conclude that many of the endogenous loci from which we obtained regulatory regions in this study also contain important repressor elements that fine-tune the expression pattern to the appropriate locations in the developing embryo. The success rate for the genome-wide search, at seven positives out of 23 tested, is lower than the predictive value of CMs that contain at least one CE in the original 20 native promoter constructs. One reason for the discrepancy may be that false-positive rates are simply higher for computational searches on a genome-scale, which are inherently more complex than are searches within a moderate number of 5′ regions. Another limitation of the genome-wide scan is the necessity to use a heterologous promoter, which requires the candidate regulatory region to have true enhancer activity in order to give expression: In dissecting the 20 regulatory regions, we often observed elements that were sufficient for expression under the native promoter but were not sufficient under a heterologous promoter (cf. Supplemental Fig. 2b, clones 157, 527, and 528). 
Given these limitations and complexities, the success rate of the genome-wide search is quite satisfactory, and the discovery of seven novel enhancers with a variety of complex expression patterns underscores that the type of approach we chose will be generally viable.

In the future, painstaking in vivo structure-function studies will be crucial to unraveling the complexity of metazoan gene regulation. Within the initial 20 regulatory regions, there remains significant opportunity for discovery of novel repressors and other functional sequences, as well as for more detailed analysis of regulatory element evolution. On a genome-wide scale, we have only scratched the surface of a complex regulatory network, so future work might validate a larger set of predictions from our genome-wide search. Finally, with future refinements and increases in predictive power, our approach to de novo discovery of modules and to the combined computational and experimental validation will be applicable to systems other than muscle and to organisms other than Ciona.

Methods

Ascidian electroporation, in situ hybridization, and handling

Electroporations were conducted as reported previously (Corbo et al. 1997) with a BioRad GenePulser II or a custom electroporator (R. Zeller, pers. comm.) set at 2000 μF and 20 Ω. In situ hybridization was carried out according to standard protocols (Satou et al. 2002) with probes derived from the Ghost EST collection (http://ghost.zool.kyoto-u.ac.jp/indexr1.html). Photographs of all in situ hybridizations and electroporations are available at http://mendel.stanford.edu/supplementarydata/johnson2005/.

Initial construct generation and mutagenesis

We used tailed-end PCR to amplify C. savignyi genomic DNA (http://www.broad.mit.edu/annotation/ciona/) for our initial functional regulatory constructs. We used the djmcs.lacZ plasmid for all native fusion constructs (Johnson et al. 2004), and either pCES (Harafuji et al. 2002) or a Brachyury (Bra) basal promoter plasmid (Bertrand et al. 2003) for the heterologous promoter constructs (the Brachyury plasmid was made available during the course of data collection and, in our hands, has fewer false negatives). Truncations and deletions were carried out as reported previously (Johnson et al. 2004). All clones were verified by sequencing and restriction digest, as reported previously (Johnson et al. 2004). Sequences for all constructs and primers are available at http://mendel.stanford.edu/supplementarydata/johnson2005/.

MLAGAN alignments, conserved element detection, and sequence analysis

Alignments were constructed as reported previously (Johnson et al. 2004). Orthology of three gene families was not clear due to recent gene duplications and/or large sequence gaps in the C. intestinalis assembly (http://genome.jgi-psf.org/ciona4/ciona4.home.html). In these instances, we constructed all pairwise interspecies alignments and chose the alignment with the highest percentage of bases in highly conserved elements. To find conserved elements, we used custom Perl scripts to scan alignments for 20-bp windows that contain at least 75% identity, and then expanded these windows until identity dropped below 75%. Detailed functional and computational annotation of each regulatory region is available at http://mendel.stanford.edu/supplementarydata/johnson2005/.

Regulatory module identification

The CisModule algorithm (Zhou and Wong 2004) was used to identify candidate modules within the functionally annotated sequences. CisModule identifies a specified number (K) of motifs that are overrepresented in a set of sequences and that occur in clusters of a specified length (l), and outputs module predictions and a PSSM for each motif.
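The conserved-element scan described under sequence analysis above (20-bp windows of at least 75% identity, expanded until identity drops below 75%) might be sketched as follows. This is an illustrative Python reimplementation under stated assumptions, not the authors' Perl scripts: the function name `conserved_elements` is hypothetical, the input is assumed to be two equal-length alignment rows with `-` for gaps, gap columns count as mismatches, seeds are extended rightward only, and overlapping elements are merged.

```python
from typing import List, Tuple


def conserved_elements(a: str, b: str, window: int = 20,
                       min_identity: float = 0.75) -> List[Tuple[int, int]]:
    """Return half-open alignment-column intervals of conserved elements."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # 1 where both rows carry the same non-gap base, else 0.
    match = [int(x == y and x != "-") for x, y in zip(a.upper(), b.upper())]
    # Prefix sums give the match count of any interval in constant time.
    prefix = [0]
    for m in match:
        prefix.append(prefix[-1] + m)

    def identity(start: int, end: int) -> float:
        return (prefix[end] - prefix[start]) / (end - start)

    elements: List[Tuple[int, int]] = []
    start = 0
    while start + window <= len(match):
        end = start + window
        if identity(start, end) >= min_identity:
            # Grow the seed rightward while the whole element stays conserved.
            while end < len(match) and identity(start, end + 1) >= min_identity:
                end += 1
            if elements and start <= elements[-1][1]:
                # Overlapping or adjacent to the previous element: merge.
                elements[-1] = (elements[-1][0], max(end, elements[-1][1]))
            else:
                elements.append((start, end))
        start += 1
    return elements


# Toy alignment: two identical 30-column blocks around a 30-column mismatch run.
row_a = "A" * 30 + "C" * 30 + "A" * 30
row_b = "A" * 30 + "G" * 30 + "A" * 30
toy_elements = conserved_elements(row_a, row_b)
```

Because the extension criterion is applied to the element as a whole, an element can absorb a short run of mismatched columns at its edge, as the toy alignment shows; a stricter variant could trim terminal mismatches afterwards.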
Our input sequences included all of the original construct sequences of the 20 genes, plus 5′ regions from 42 genes whose transcripts show muscle-specific expression (http://ghost.zool.kyoto-u.ac.jp/indexr1.html and http://mendel.stanford.edu/supplementarydata/johnson2005/; we also ran CisModule on the 20 genes dissected here, with identical results). For the latter genes, we included 1 kb upstream of the predicted promoter. In some instances we added conserved sequences between 1 kb and 2 kb, since distal conserved sequences may contain regulatory elements. For our analyses, we used the output from a run with 62 genes as the greater number of genes offers greater sequence depth, and therefore higher quality, for the PSSMs. Our experimental results suggested that muscle regulatory elements are ~200 bp, so we ran CisModule with l = 200. We ran CisModule with K = 3, 4, and 5. At K = 5, CisModule returned only four significant motifs, so we used a run with K = 4 for our analyses. As a negative control, we ran CisModule with 59 random intergenic regions from the C. intestinalis assembly (http://mendel.stanford.edu/supplementarydata/johnson2005/). None of the four motifs identified in the muscle-specific genes occurred in the background set. We used Weblogo (http://weblogo.berkeley.edu) for visualization of motifs. We did not include exonic CisModule predictions in any of our analyses.

Genome-wide module scan

To perform a genome-wide scan for motif clusters, we implemented a novel optimization algorithm (CisModScan, http://www.stanford.edu/group/wonglab/software.html) based on the hierarchical mixture model (HMx model) used in CisModule (Zhou and Wong 2004). A module model is defined by the module length $l$, the prior probability $r$ of starting a new module, and $K$ distinct motifs with frequencies $q_k$ ($k = 1, 2, \ldots, K$). We denote the input sequence by $X = x_1 x_2 \cdots x_L = x_{[1,L]}$ and the corresponding module locations by $Y = y_1 y_2 \cdots y_L = y_{[1,L]}$, where $L$ is the full sequence length.
Using dynamic programming, we find the optimal $Y^*$ that maximizes $P(X, Y \mid \Psi)$, where $\Psi$ denotes all the model parameters. The joint probability of the optimal module locations up to position $n$ ($n = 1, 2, \ldots, L$) is given by [Supplemental Research Data]
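The recursion itself is given only in the supplemental data and is not reproduced here. As a generic illustration of how dynamic programming can pick optimal, non-overlapping module locations, the sketch below uses a deliberately simplified model that is an assumption of this example and not the CisModScan algorithm: every background base contributes log(1 - r), every module start contributes log r plus a precomputed log-likelihood ratio `module_llr[i]` for a module of fixed length starting at position i, and modules may not overlap. The function name and scores are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def best_module_starts(module_llr: Sequence[float], length: int,
                       r: float) -> Tuple[float, List[int]]:
    """Viterbi-style search for the best set of non-overlapping modules.

    module_llr[i] is an assumed, precomputed log-likelihood ratio (module
    versus background) for a module of `length` bases starting at 0-based i.
    Returns the optimal log score and the sorted module start positions.
    """
    if length < 1 or not 0.0 < r < 1.0:
        raise ValueError("need length >= 1 and 0 < r < 1")
    total = len(module_llr)
    log_r, log_not_r = math.log(r), math.log(1.0 - r)
    best = [0.0] * (total + 1)          # best[n]: optimal score of first n bases
    ends_module = [False] * (total + 1)  # True if a module ends exactly at n
    for n in range(1, total + 1):
        best[n] = best[n - 1] + log_not_r  # base n-1 treated as background
        if n >= length:
            start = n - length
            candidate = best[start] + log_r + module_llr[start]
            if candidate > best[n]:
                best[n], ends_module[n] = candidate, True
    # Trace back from the end of the sequence to recover module starts.
    starts: List[int] = []
    n = total
    while n > 0:
        if ends_module[n]:
            starts.append(n - length)
            n -= length
        else:
            n -= 1
    return best[total], starts[::-1]


# Toy scores (invented): strong evidence at position 2, weaker at position 3.
toy_llr = [0.0, 0.0, 20.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
toy_score, toy_starts = best_module_starts(toy_llr, length=3, r=0.01)
```

In the toy input the weaker candidate at position 3 overlaps the stronger one at position 2, so only position 2 is reported; the table `best` plays the role of the "optimal score up to position n" quantity that the text refers to.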
Acknowledgments

Kerrin Small of the Sidow Laboratory provided a nonredundant list of Ciona savignyi supercontigs. Midori Hosobuchi offered critical comments on the manuscript. Thanks to Patrick Lemaire for kindly providing the Brachyury heterologous promoter construct. The laboratories of Julie Baker, Stuart Kim, and Rick Myers provided helpful insights that guided the progress of this work. Thanks to Vy Le, Bob Zeller, David Kroodsma, and Polly Fordyce for assistance with building our custom electroporator.

Notes

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.4062605. Article published online before print in September 2005.

Footnotes

[Supplemental material is available online at www.genome.org. The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: K. Small and P. Lemaire.]