Copyright © 2007, Cold Spring Harbor Laboratory Press

Genomic regulatory blocks underlie extensive microsynteny conservation in insects

1 Computational Biology Unit, Bergen Center for Computational Science, University of Bergen, Bergen 5008, Norway; 2 Sars Centre for Marine Molecular Biology, University of Bergen, Bergen 5008, Norway; 3 Program for Genomics and Bioinformatics, Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm 17177, Sweden; 4 Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, British Columbia V5Z 4H4, Canada; 5 Genetics Graduate Program, University of British Columbia, Vancouver, British Columbia V6T 1Z3, Canada

6 Corresponding author. E-mail boris.lenhard/at/bccs.uib.no; fax 47-55584295.

Received May 3, 2007; Accepted July 12, 2007.

Freely available online through the Genome Research Open Access option.

Abstract

Insect genomes contain larger blocks of conserved gene order (microsynteny) than would be expected under a random breakage model of chromosome evolution. We present evidence that microsynteny has been retained to keep large arrays of highly conserved noncoding elements (HCNEs) intact. These arrays span key developmental regulatory genes, forming genomic regulatory blocks (GRBs). We recently described GRBs in vertebrates, where most HCNEs function as enhancers and HCNE arrays specify complex expression programs of their target genes. Here we present a comparison of five Drosophila genomes showing that HCNE density peaks centrally in large synteny blocks containing multiple genes. Besides developmental regulators that are likely targets of HCNE enhancers, HCNE arrays often span unrelated neighboring genes. We describe differences in core promoters between the target genes and the unrelated genes that offer an explanation for the differences in their responsiveness to enhancers.
We show examples of a striking correspondence between boundaries of synteny blocks, HCNE arrays, and Polycomb binding regions, confirming that the synteny blocks correspond to regulatory domains. Although few noncoding elements are highly conserved between Drosophila and the malaria mosquito Anopheles gambiae, we find that A. gambiae regions orthologous to Drosophila GRBs contain an equivalent distribution of noncoding elements highly conserved in the yellow fever mosquito Aëdes aegypti and coincide with regions of ancient microsynteny between Drosophila and mosquitoes. The structural and functional equivalence between insect and vertebrate GRBs marks them as an ancient feature of metazoan genomes and as a key to future studies of development and gene regulation.

Long-range cis-regulation in vertebrates has recently been the focus of much attention, driven by the genome-wide discovery of highly conserved noncoding elements (HCNEs) found to span the loci of developmental regulatory genes. After a series of observations of high levels of conservation of individual developmental enhancers, whole-genome comparisons revealed an abundance of HCNEs that tend to cluster along chromosomes. The clusters most often coincide with genes encoding developmental and differentiation-related transcription factors. Many HCNEs have been characterized as long-range enhancers, first in studies of individual genes (Gottgens et al. 2000; Sumiyama and Ruddle 2003; Kimura-Yoshida et al. 2004; Milewski et al. 2004), followed by systematic studies in zebrafish, Xenopus, and mouse (de la Calle-Mustienes et al. 2005; Shin et al. 2005; Woolfe et al. 2005; Pennacchio et al. 2006). Genome-wide analyses of HCNE sequences have detected several overrepresented motifs that are believed to be associated with context-specific enhancer activity (Bailey et al. 2006; Pennacchio et al. 2007).
The emerging model is that an array of HCNEs defines a region of regulatory inputs of its target gene(s), and that the full complement of those inputs results in the actual expression pattern of the gene (Kimura-Yoshida et al. 2004; de la Calle-Mustienes et al. 2005; Woolfe et al. 2005; Pennacchio et al. 2006). It is plausible to speculate that the genes with the most complex spatiotemporal expression should have more complex regulatory inputs. This is in full agreement with the finding that the targets of the most elaborate arrays of HCNEs are genes encoding developmental regulators and genes for proteins that regulate axonal guidance and related processes in the central nervous system (Lindblad-Toh et al. 2005).

Many HCNE arrays span large gene-free regions, so-called “gene deserts,” around their target genes (Sandelin et al. 2004). However, very often the regions spanned by HCNEs contain genes whose biological functions and expression patterns are unrelated to those of the presumptive target genes. These unrelated genes, which we refer to as “bystander genes,” are independent of the regulatory input of HCNE arrays, but the pressure to maintain HCNE arrays has kept bystander and target genes together for hundreds of millions of years (Kikuta et al. 2007). We termed the HCNE-spanned regions containing such genes “genomic regulatory blocks” (GRBs) and found GRBs to correspond to the longest regions of conserved gene order across vertebrate genomes. In this paper, we use the term “microsynteny conservation” to denote the preservation of close proximity among genes through evolution, and we refer to chromosomal regions that have been largely maintained in evolution as “synteny blocks” (Zdobnov et al. 2002; Pevzner and Tesler 2003).

The fruit fly Drosophila melanogaster (Dmel) has been used for a century as a model organism for studies of genetics, animal development, behavior, and many other aspects of biology.
It is remarkable that most developmental regulatory genes in the fly have conserved orthologs in vertebrates, often with analogous functions (Carroll 2005), and that many of these genes are associated with HCNEs in both flies and vertebrates (Glazov et al. 2005; Vavouri et al. 2007). Although insect HCNEs have not been studied as extensively as vertebrate HCNEs, the trends described are similar, strongly suggesting that most HCNEs function as developmental regulatory elements in vertebrates and insects alike. In both vertebrate and insect genomes, most bases that are conserved above neutral evolution rates appear to be noncoding (Siepel et al. 2005). More than 20,000 intronic and intergenic elements are perfectly conserved over at least 50 bp between Dmel and the closely related D. pseudoobscura (Dpse), and most abundant in the vicinity of developmental transcription factor genes (Glazov et al. 2005). A recent search for HCNEs conserved between Dmel and the more distantly related D. virilis (Dvir) revealed several elements that coincide with characterized developmental enhancers (Papatsenko et al. 2006). Regions of conserved microsynteny have been found between Dmel and the malaria mosquito Anopheles gambiae (Agam) although these organisms diverged ~250 million years ago (Zdobnov et al. 2002). A recent comparison of 12 insect genomes demonstrated microsynteny conservation among more distantly related insects (Zdobnov and Bork 2007). This comparison also showed that the distribution of insect synteny block lengths is incompatible with a model where genes have been randomly shuffled in evolution, and would be better explained by the existence of rearrangement hotspots—regions that have been shuffled more than others in evolution. The same trend has been observed in comparisons of mammalian genomes (Kent et al. 2003; Pevzner and Tesler 2003; Murphy et al. 2005). 
In vertebrates, conserved microsynteny can at least in part be explained by the occurrence of GRBs (Kikuta et al. 2007).

In this study, we present evidence for the existence of GRBs in insects and their functional equivalence to those in vertebrates. We have identified 6779 HCNEs shared among five different Drosophila species, demonstrating that fly genomes contain an extensive core repertoire of HCNEs. We show that an equivalent organization can be observed in orthologous mosquito loci through comparisons of the genome sequences of Anopheles gambiae and Aëdes aegypti, and that the maintenance of HCNE clusters is likely to underlie preservation of microsynteny between flies and mosquitoes. The regions of HCNE arrays and microsynteny conservation also contain unrelated genes, probably in a similar way to bystander genes in vertebrate GRBs (Kikuta et al. 2007). We provide genome-wide evidence that these genes generally differ from target genes in their type of core promoter, which might for the first time explain on a genome-wide level why bystander genes do not specifically respond to long-range regulation in the region. Finally, we report a striking correspondence between Polycomb binding regions and several Drosophila GRBs, and discuss the occurrence of GRBs as an ancient and fundamental feature of metazoan genomes.

Results

We identified HCNEs in pairwise alignments between the euchromatic genome sequences of D. melanogaster (Dmel) and four other Drosophila species, D. ananassae (Dana), D. pseudoobscura (Dpse), D. virilis (Dvir), and D. mojavensis (Dmoj), selected based on the state of their genome assemblies, availability of whole-genome sequence alignments to Dmel, and phylogenetic relationships (Supplemental Fig. S1). We required HCNEs to be conserved at 98% identity over at least 50 bp in all four pairwise comparisons.
To focus on elements that are most likely to function in regulation of transcription, we discarded elements that partially or entirely overlapped exons (Bejerano et al. 2004; Glazov et al. 2005; Woolfe et al. 2005; Bailey et al. 2006). There were 6779 HCNEs, with a median size of 59 bp and a maximum of 157 bp. Consistent with earlier observations for flies (Glazov et al. 2005), nematodes (Vavouri et al. 2007), and vertebrates (Bejerano et al. 2004; Sandelin et al. 2004), we found regions of high HCNE density to be strongly enriched for genes encoding developmental transcriptional regulators (Supplemental Table S1).

Highly conserved noncoding elements are enriched in large synteny blocks

To study the distribution of HCNEs with respect to regions of microsynteny, we identified synteny blocks conserved among all five fly genomes as described in Methods. None of the four species that we compared to Dmel has a finished genome assembly. Nevertheless, our results indicate that reliable synteny blocks can be constructed because most of the sequence is in very large scaffolds. Although the synteny blocks included few scaffolds, they spanned 76% of the Dmel euchromatic sequence (Supplemental Table S2). We distinguish between the span of a synteny block, which we define as the entire genomic region between the extreme borders of the block, and its coverage, meaning the reciprocally aligned, syntenic bases in the block. Of the HCNEs, 94% were entirely spanned by synteny blocks, and 86% had at least 98% of their sequence covered by synteny blocks.

We wished to compare the coverage of HCNE sequence by synteny blocks to the coverage of coding sequence (CDS) while controlling for the fact that the latter is less conserved overall. We therefore identified the bases in the Dmel sequence that were aligned in a reciprocal-best manner in all four pairwise genome comparisons (reciprocally best aligned [RA] sequence), and measured the fraction of them that was covered by synteny blocks.
Remarkably, 90% of RA-HCNE sequence was covered by synteny blocks, compared to only 75% of RA-CDS. RA-HCNE sequence was enriched in large synteny blocks compared to RA-CDS (Fig. 1A).

HCNE arrays are centrally positioned in large synteny blocks that span multiple genes

We identified 164 peaks of HCNE density on Dmel chromosomes 2, 3, and X by first using a Gaussian kernel to compute local HCNE density at positions spaced 1 kb throughout the euchromatic sequence, and then locating peaks in the resulting density distribution. Many peaks of HCNE density are contained within single synteny blocks and are centrally positioned within those blocks (Fig. 1B,C).

HCNE-associated genes are in large blocks of conserved microsynteny between fly and mosquito

The ct locus (Fig. 2A)
To investigate whether maintained fly-mosquito microsynteny at the ct locus could be explained by a selective pressure to keep the HCNE cluster intact, we searched for HCNEs conserved between Agam and the yellow fever mosquito Aëdes aegypti (Aaeg) at the ct locus in mosquitoes. Indeed, there is a distinct island of mosquito-specific HCNEs confined to the region of the fly-mosquito synteny block (Fig. 2A).

To quantitatively assess whether genes regulated by HCNE arrays are more likely to be in large regions of microsynteny between Dmel and Agam, we constructed synteny blocks between the two genomes, using a more relaxed approach than among the Drosophila species because of the large evolutionary distance between flies and mosquitoes (see Methods). We then measured the span of Dmel–Agam synteny blocks around Dmel genes from several categories, including genes in HCNE-dense regions and genes annotated with Gene Ontology (GO) biological process terms that have been found to be associated with genes spanned by HCNE arrays (GO terms “multicellular organismal development” and “regulation of transcription, DNA-dependent;” see Supplemental Table S1 and Glazov et al. 2005). There was a tendency for genes in the HCNE-related categories to be within more extensive blocks of synteny than other types of genes (Fig. 2C).

HCNE-associated genes have specific types of core promoters

Data from this and earlier work suggest a model where insect and vertebrate HCNE arrays represent clusters of enhancers that specify expression programs for only a small subset of the genes that they span. How enhancer activity is specifically directed toward certain genes at HCNE-spanned loci is unknown. It has been demonstrated that enhancers can selectively target certain promoters (Li and Noll 1994; Merli et al. 1996) and that this selectivity may be facilitated by the occurrence of different core promoter types (Ohtsuki et al. 1998; Butler and Kadonaga 2001).
A recent investigation of core promoters in Dmel classified them into five major types based on motif content: TATA box followed by initiator (TATA/Inr), initiator followed by downstream promoter element (Inr/DPE), Motif 6 followed by Motif 1 (Motif 1/6), DNA replication element (DRE), and promoters containing only initiator, but none of the other elements (Inr only) (Ohler 2006). Based on these observations, the author designed a program (McPromoter) that predicts core promoters in the Dmel genome with high accuracy and classifies them as one of the five types. Hypothesizing that enhancers in HCNE arrays may target specific genes within “striking distance” on the basis of their core promoter architecture, we used the genome-wide McPromoter predictions to investigate core promoter properties of likely target genes.

Of 81 developmental transcriptional regulators located in HCNE-dense regions, 56 have a promoter prediction close to one or more annotated transcription start sites. Of these 56 genes, 53 (95%) are associated with a prediction of a type containing an Inr motif (Inr only, Inr/DPE, or TATA/Inr; see Table 1). For comparison, only 39% of all 5824 genes assigned a promoter prediction have a prediction with an Inr motif. The enrichment is strongest for genes with Inr-only core promoters (P = 0.005, compared to Inr/DPE enrichment, by Fisher’s exact test). For examples of genes with different core promoter types, see Figures 2
To further explore the association between core promoter types and gene functions, we performed a systematic search for enrichment of different GO annotations within each of the five core promoter classes (Fig. 3A).
To explore gene expression correlations among genes with different core promoter types, we used a published tiling array data set consisting of gene expression measurements across the Dmel genome at 12 time points during the 24 h of embryonic development (Manak et al. 2006). Consistent with a housekeeping nature of genes with DRE or Motif 1/6 core promoters, we found that randomly selected gene pairs from these sets often have highly correlated expression profiles, unlike gene pairs from the other sets (Fig. 3B).

HCNE arrays mark regulatory domains maintained in evolution

While the data presented here suggest that the need to maintain HCNE clusters is a major reason for microsynteny conservation in insects, other reasons for microsynteny conservation exist. A genome-wide comparison of Dmel–Dpse synteny blocks to changes in gene expression throughout the Dmel life cycle suggested that microsynteny is preserved at some loci in order to maintain coregulation of neighboring genes (Stolc et al. 2004; see also erratum at http://bussemaker.bio.columbia.edu/papers/Science2004/).

Further evidence for the existence of large regulatory domains in Drosophila genomes comes from genome-wide mapping of Polycomb binding sites in embryonic cell lines, where Polycomb was found to bind large regions, preferentially around developmental regulators (Schwartz et al. 2006; Tolhuis et al. 2006). Similar findings have been reported for human embryonic stem cells, where the Polycomb repressive complex 2 subunit SUZ12 shows a strong tendency to bind across developmental transcription factor genes and around HCNEs (Lee et al. 2006). We inspected the Dmel Polycomb binding regions determined by Tolhuis et al. (2006) and noted an association with HCNEs, as expected. Tolhuis and colleagues interrogated ~30% of the Dmel genome and found that 10% of the interrogated sequence corresponds to large Polycomb binding regions (Pc domains).
HCNE sequence is more than twofold enriched in these Pc domains: 114 kb of the sequence interrogated by Tolhuis and colleagues corresponds to HCNEs, and 23% of this HCNE sequence is within Pc domains. The association of HCNEs with Pc domains is significant ($P < 10^{-5}$; Wilcoxon test) when one compares the density of HCNEs in Pc domains to the density of HCNEs in regions randomly sampled from the part of the genome interrogated by Tolhuis and colleagues and with similar size distribution as the Pc domains. Interestingly, we also found a very good agreement between the boundaries of synteny blocks, HCNE clusters, and Pc domains at a number of loci, including the three shown in Figures 2B

Discussion

Experimental evidence for long-range regulation and GRBs in Drosophila

Genomic regulatory blocks (GRBs) are regions containing long-range regulatory elements that have been interlocked in cis with their target genes as well as unrelated genes (Kikuta et al. 2007). We show here that this concept also applies to insect genomes. In the zebrafish genome, GRBs were discovered through enhancer detection events where the reporter insertion was close to or in a bystander gene, yet recapitulated the expression pattern of the target gene further away (Kikuta et al. 2007). Since enhancer detection has been performed extensively in Drosophila, we searched for examples of such insertions near bystander genes in the literature. Such insertions can be used to support the notion that regulatory elements form GRBs and thereby conserve microsynteny. The most striking example we found is the E32 enhancer detection line, which represents an insertion in the 5′ untranslated region of out at first (Merli et al. 1996). The insertion replicates part of the expression pattern of decapentaplegic (dpp), a developmental regulatory gene located 33 kb away. The region between dpp and the insertion contains a gene desert with HCNEs (Fig. 5).

Regulatory HCNE arrays are a fundamental feature of metazoan genomes

Most target genes in Drosophila GRBs appear to be developmental regulatory genes that have well-conserved vertebrate orthologs spanned by equivalent arrays of HCNEs (Sandelin et al. 2004). In addition to noncoding conservation and the types of genes they contain, other parallels between GRBs in insects and vertebrates are evident. They often harbor relatively long regions devoid of genes (gene deserts; Ovcharenko et al. 2005) and are characterized by microsynteny conserved deep in evolution (Kikuta et al. 2007; this work). Our demonstration of similarly organized HCNE arrays at orthologous Drosophila and Anopheles loci (where gene order has been partially preserved) reveals that microsynteny conservation, while constrained by regulatory elements, can outlive the sequence conservation of those elements.

The match between synteny blocks, HCNE arrays, and experimentally determined Polycomb binding regions in Drosophila is striking and supports the notion that these features are signatures of GRBs. In vertebrates, Polycomb group proteins are also preferentially found at the loci of developmental regulatory genes (Boyer et al. 2006; Lee et al. 2006), were shown to bind to evolutionarily conserved CpG islands that overlap large portions of developmental regulatory genes (Tanay et al. 2007), and directly control CpG methylation (Vire et al. 2006). Even though insects do not have genome methylation or CpG islands, one can speculate that Polycomb binding regions in Drosophila are functionally equivalent to conserved CpG islands in mammals. At present, it is unknown whether those regions in insects have any specific sequence properties analogous to CpG islands. Together with a recent demonstration of the presence of HCNE clusters in nematode genomes of the genus Caenorhabditis (Vavouri et al.
2007), our findings indicate that arrays of HCNEs are central to developmental regulation of most, if not all, Metazoa. The association of HCNEs with orthologous genes among nematodes, insects, and vertebrates (Kikuta et al. 2007; Vavouri et al. 2007) suggests that long-range regulation and clusters/arrays of HCNEs are an ancient property of metazoan genomes. The role of HCNEs in constraining microsynteny has not yet been explored beyond vertebrates and insects, however.

Responsiveness of genes to long-range enhancers

The apparent unresponsiveness of bystander genes to long-range enhancers in GRBs remains mysterious. Distance does not seem to be crucial for enhancer action (Nobrega et al. 2003; Ellingsen et al. 2005). In the study mentioned above (Merli et al. 1996), the Drosophila gene out at first does not normally react to dpp enhancers but did so after exchanging its promoter with a dpp promoter. Thus, one explanation for enhancer specificity could be differential responsiveness of core promoters to enhancers (Smale 2001). In mammals, different types of core promoters have been clearly shown to be related to different modes of regulation (Carninci et al. 2006). In Drosophila, a recent study classified many known promoter regions into a number of different subtypes according to the principal motif (or combinations thereof) they contain (Ohler 2006). In this work we have shown that this classification discriminates between developmental genes (Inr with or without DPE), housekeeping genes (DRE or Motif 1/6), and tissue-specific genes (TATA). Based on these results, we speculate that it is the Inr-type of promoters without TATA boxes that are most likely to respond to long-range regulation. Indeed, inspection of dozens of Drosophila GRBs strongly supports the hypothesis that nonresponsive bystander genes, with expression patterns unrelated to the target gene in the same region, have core promoters of the DRE or Motif 1/6 types.
In this way, Ohler’s classification of Drosophila core promoters is more powerful than that for vertebrate promoters made by Carninci et al. (2006); in vertebrates, we still do not know the fundamental difference between core promoters for housekeeping and developmental regulatory genes, which both seem to have CpG island core promoters, most without TATA boxes and with “broad”-type transcription start regions.

While borders of GRBs can be identified as synteny block boundaries by comparative genomics, it is still unclear how the cellular machinery recognizes those borders. Some regulatory domains are known to be delimited by insulator elements, which bind proteins that block the reach of enhancers or inhibit the spread of repressed chromatin (Valenzuela and Kamakaka 2006). Recent studies have revealed an abundance of putative insulator elements bound by the enhancer-blocking protein CTCF in mammalian genomes, and predicted a similar number of binding sites in Tetraodon (Kim et al. 2007; Xie et al. 2007). Human CTCF is functionally conserved in Drosophila, where several other enhancer-blocking proteins also are known (Moon et al. 2005). It will be interesting to see whether insulator elements are present at the borders of vertebrate and Drosophila GRBs.

Conclusions

The evidence presented in this paper establishes GRBs as a fundamental property of metazoan genomes. The long distances of regulatory elements from their developmental regulatory target genes will have to be taken into account in future studies of these genes and their regulatory networks. Additionally, these findings provide guidelines for designing enhancer trap experiments and their interpretation, including an informed choice of core promoter type for enhancer trap constructs.

Methods

Sequences and annotations

We used the following genome assemblies: Dmel release 4 (Berkeley Drosophila Genome Project); Dpse release 1.03 (Baylor HGSC); Dana, Dvir, and Dmoj Aug.
2005 (Agencourt); Agam MOZ2 (The International Anopheles Genome Project); Aaeg AaegL1 (The Broad Institute and TIGR), and A. mellifera Amel_2.0 (Baylor HGSC). We obtained Aaeg sequences from Ensembl (Hubbard et al. 2007; http://www.ensembl.org), and the other genome sequences, pairwise chained BLASTZ alignments between them, and annotations from the UCSC Genome Browser Database (Kuhn et al. 2007; http://genome.ucsc.edu). We used FlyBase v. 4.3 gene and CDS annotations (Crosby et al. 2007; http://flybase.org) and Dmel GO annotations (rev. 1.93) from http://www.geneontology.org.

HCNE detection

We identified elements highly conserved among flies by scanning pairwise BLASTZ net whole-genome alignments (Kent et al. 2003) between Dmel and each of the other four Drosophila species for regions with at least 98% identity over 50 alignment columns. Highly conserved elements were merged if they overlapped on the Dmel assembly. We discarded elements whose Dmel coordinates overlapped with any exon in FlyBase 4.3 genes, RefSeq genes, Dmel cDNA sequences from GenBank, or GENSCAN predictions. Remaining elements from each pairwise comparison were intersected based on their Dmel coordinates, to obtain elements conserved among all five species. Such elements spanning at least 50 bp of Dmel sequence were considered fly HCNEs.

To detect mosquito HCNEs at selected Agam loci, we identified homologous Aaeg contigs by inspecting translated BLAT alignments in Ensembl v. 42–43 (Hubbard et al. 2007). We aligned Agam and Aaeg sequences with Shuffle-LAGAN v. 2.0 (Brudno et al. 2003) with default settings and used the resulting alignments to detect HCNEs as described for flies above, but using a lower identity threshold (80%) and removing elements that overlapped exons by comparing with the following UCSC Genome Browser database annotations on the Agam assembly: Ensembl genes, Agam cDNAs from GenBank, aligned Dmel proteins, and GENSCAN predictions.
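The window scan described above (at least 98% identity over 50 alignment columns, with overlapping hits merged) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `conserved_regions`, its parameters, and the representation of a pairwise alignment as two equal-length gapped strings are assumptions made for the example, and exon filtering and the five-species intersection are left out.

```python
def conserved_regions(seq_a, seq_b, window=50, min_identity_pct=98):
    """Return merged (start, end) alignment-column ranges, end exclusive,
    covered by windows of `window` columns with >= min_identity_pct identity.

    seq_a and seq_b are the two rows of a gapped pairwise alignment
    (hypothetical input format; '-' marks a gap).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    # 1 for an identical, ungapped column; 0 for a mismatch or a gap
    match = [int(a == b and a != "-")
             for a, b in zip(seq_a.upper(), seq_b.upper())]
    regions = []
    matches_in_window = sum(match[:window])
    for start in range(len(match) - window + 1):
        if start > 0:
            # slide the window one column to the right
            matches_in_window += match[start + window - 1] - match[start - 1]
        # integer comparison avoids floating-point rounding at the threshold
        if matches_in_window * 100 >= min_identity_pct * window:
            end = start + window
            if regions and start < regions[-1][1]:
                regions[-1][1] = end          # overlaps previous hit: merge
            else:
                regions.append([start, end])  # start a new element
    return [tuple(r) for r in regions]
```

With the default settings a window may contain at most one mismatched or gapped column. The lower 80% threshold used for the mosquito comparison would correspond to `min_identity_pct=80` in this sketch.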
To assess conservation of Drosophila HCNEs in Agam, we used a BLASTZ net alignment from Dmel to Agam.

Computation of feature densities and density peak detection

For images of loci, we computed HCNE densities by a sliding-window approach (Fig. 2).

Identification of synteny blocks and RA sequence among flies

To identify synteny blocks, we made use of the utilities and C functions in the UCSC Genome Browser source package (http://genome.ucsc.edu/FAQ/FAQlicense). Starting from pairwise chained BLASTZ alignments (chains) between the Dmel genome and each of the four other genomes, we constructed pairwise net alignments (nets) by running the program chainNet with option -minSpace=1. chainNet filters a set of chains to retain only the best alignment for each position in one of the genomes (Kent et al. 2003). The chainNet algorithm tends to prioritize large chains and therefore its output is suitable for identifying synteny blocks. For each of the four pairwise genome comparisons, we constructed two sets of nets (one from the perspective of each genome), and used them to filter the chains into a set of reciprocal-best chains (rb-chains) that only contain alignment columns included in the nets for both genomes. To find the bases in the Dmel sequence that were aligned in a reciprocal-best manner in all four pairwise genome comparisons (RA sequence), we identified the Dmel bases that were in ungapped blocks (i.e., were aligned to some base) in all four sets of rb-chains.

We constructed pairwise synteny blocks from rb-chains in three steps: (1) Rb-chains were split at gaps that spanned nets if, within the gap, nets for either genome contained at least 10 kb in ungapped blocks. We used nets to split rb-chains because they include alignments that are not reciprocal-best, thus allowing us to capture synteny breaks caused, for example, by species-specific duplications. Only rb-chains that contained ≥10 kb in ungapped blocks after this step were retained.
(2) We classified regions spanned by multiple (nested) rb-chains as being outside synteny blocks, and truncated nested rb-chains accordingly. Again, rb-chains containing <10 kb in ungapped blocks were discarded. (3) To avoid artificial synteny breaks due to failure to link scaffolds together in any of the non-Dmel assemblies, we joined rb-chains that were nearest neighbors along the same Dmel chromosome arm, but on different scaffolds in the non-Dmel assembly, unless the gap between the rb-chains in either genome contained nets with at least 10 kb of sequence in ungapped blocks (i.e., the same criterion as used to split chains in step 1 above). The set of rb-chains after this third step constituted our pairwise synteny blocks. Although joining of chains may overestimate synteny in pairwise comparisons, any such effects should be minimal after pairwise synteny blocks are intersected into five-way synteny blocks.

We created five-way synteny blocks by intersecting the pairwise synteny blocks based on their coordinates on the Dmel assembly: Any two Dmel bases were assigned to the same five-way synteny block if, and only if, they were part of the same synteny block in each of the pairwise comparisons. We discarded five-way synteny blocks that did not contain at least 10 kb in ungapped alignments across all pairwise synteny blocks.

Analysis of Dmel–Agam synteny

To identify Dmel–Agam synteny blocks, we first computed reciprocal-best BLASTZ net alignments between Dmel and Agam as described for fly comparisons above. We then constructed a graph where two alignments (nodes) were connected if separated by ≤100 kb in both genomes (not considering strand, to allow local inversions within synteny blocks). We considered each connected component in the graph to be one synteny block. The threshold of 100 kb is arbitrary; we tested several values in the range 0–300 kb with similar results.
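The graph construction for Dmel–Agam synteny blocks can be illustrated with a small union-find sketch. It is an assumption-laden example rather than the authors' code: the tuple layout of an alignment, the function names `interval_gap` and `synteny_blocks`, the requirement that connected alignments lie on the same chromosome pair, and the definition of separation as the distance between interval ends are all choices made here for illustration.

```python
from collections import defaultdict


def interval_gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals on one sequence; 0 if they overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def synteny_blocks(alignments, max_gap=100_000):
    """Group alignments into connected components (synteny blocks).

    alignments: list of (chrom1, start1, end1, chrom2, start2, end2) tuples
    (hypothetical format). Strand is ignored, so local inversions do not
    break a block. Returns lists of alignment indices, one list per block.
    """
    parent = list(range(len(alignments)))

    def find(i):
        # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # simple all-pairs comparison; adequate for a sketch, quadratic in input size
    for i, a in enumerate(alignments):
        for j in range(i + 1, len(alignments)):
            b = alignments[j]
            if (a[0] == b[0] and a[3] == b[3]
                    and interval_gap(a[1], a[2], b[1], b[2]) <= max_gap
                    and interval_gap(a[4], a[5], b[4], b[5]) <= max_gap):
                parent[find(i)] = find(j)  # edge: merge the two components

    blocks = defaultdict(list)
    for i in range(len(alignments)):
        blocks[find(i)].append(i)
    return sorted(blocks.values())
```

Because blocks are connected components, two alignments more than `max_gap` apart can still end up in the same block when intermediate alignments link them. Varying `max_gap` over 0–300 kb corresponds to the range of thresholds the text reports testing.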
Considering all protein-coding FlyBase genes, we assigned a gene to a synteny block if that gene had a transcript with at least 25% of its CDS aligned to the syntenic Agam locus. Genes that belonged to multiple blocks according to this rule were excluded.

Core promoter analysis

We assigned a McPromoter prediction (Ohler 2006) to a FlyBase transcript if it was within 250 bp upstream of the annotated start site of the transcript or within the noncoding part of its first exon. In rare cases where multiple promoter predictions satisfied these criteria, the prediction closest to the annotated start site was chosen. For illustrated loci, core promoter assignments to genes were reviewed and changed if available transcript data motivated modifications to FlyBase gene models.

Expression analysis

To assign expression values to genes, we processed FlyBase gene models as follows. Because the expression signals from the tiling array study (Manak et al. 2006) are not strand-specific, we masked parts of exons that overlapped exons on the other genomic strand. We disregarded any gene that had more than half of its total exon sequence masked. For each remaining gene $i$, we computed its maximum transfrag coverage $c^{\max}_i = \max_j c_{ij}$, where $c_{ij}$ is the number of unmasked exon bases covered by transfrags for gene $i$ at time point $j$. Any gene $i$ with $c^{\max}_i$ ≥70% of its unmasked exon sequence was considered expressed (a similar criterion was used in the original analysis of the data; Manak et al. 2006); other genes were assigned an expression value of 0 for all time points. If two expressed genes (annotated on the same strand) shared unmasked exon sequence, only the gene with highest $c^{\max}$ was considered further, because we were not interested in comparing expression profiles between genes that share the same transcriptional unit. Each retained gene was then, for each time point, assigned an expression value equal to the median signal over its unmasked exon sequence.
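The per-gene rule above (maximum transfrag coverage across time points, a 70% threshold, then the median signal per time point) can be written out as a short sketch. The function names and input layout are hypothetical, and the sketch covers a single gene only; masking of overlapping exons and the handling of genes that share exon sequence are not modelled.

```python
from statistics import median


def is_expressed(covered_bases, unmasked_len, min_percent=70):
    """covered_bases[j]: unmasked exon bases covered by transfrags at time j.

    A gene counts as expressed if its maximum coverage over all time points
    reaches min_percent of its unmasked exon length.
    """
    c_max = max(covered_bases)
    # integer arithmetic keeps the threshold comparison exact
    return c_max * 100 >= min_percent * unmasked_len


def expression_profile(signals, covered_bases, unmasked_len):
    """signals[j]: array signal values over unmasked exon sequence at time j.

    Returns one expression value per time point: the median signal for an
    expressed gene, or 0 at every time point otherwise.
    """
    if not is_expressed(covered_bases, unmasked_len):
        return [0.0] * len(signals)
    return [median(s) for s in signals]
```

For example, a gene with 10 unmasked exon bases and coverage of 3, 8, and 5 bases at three time points has a maximum coverage of 8 bases (80%) and is treated as expressed at all three time points.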
Only genes that showed at least a twofold difference in expression values between at least two time points were used in comparisons of expression profiles.

Acknowledgments

This work was supported by the Functional Genomics Programme (FUGE) of the Research Council of Norway and a core grant from the Sars Centre. S.H.S. is supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and is an MSFHR senior trainee. We thank David Fredman, Fernando Casares, and Wyeth Wasserman for useful discussions, and Agencourt for permission to use the Dana, Dvir, and Dmoj genome assemblies.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6669607

References