Genome Res. Feb 2009; 19(2): 255–265.
PMCID: PMC2652207

Genome-wide detection and analysis of hippocampus core promoters using DeepCAGE


Finding and characterizing mRNAs, their transcription start sites (TSS), and their associated promoters is a major focus in post-genome biology. Mammalian cells have at least 5–10 magnitudes more TSS than previously believed, and deeper sequencing is necessary to detect all active promoters in a given tissue. Here, we present a new method for high-throughput sequencing of 5′ cDNA tags—DeepCAGE: merging the Cap Analysis of Gene Expression method with ultra-high-throughput sequence technology. We apply DeepCAGE to characterize 1.4 million sequenced TSS from mouse hippocampus and reveal a wealth of novel core promoters that are preferentially used in hippocampus: This is the most comprehensive promoter data set for any tissue to date. Using these data, we present evidence indicating a key role for the Arnt2 transcription factor in hippocampus gene regulation. DeepCAGE can also detect promoters used only in a small subset of cells within the complex tissue.

Transcription initiation is one of the most fundamental cellular processes. The identification of transcription start sites (TSS) leads to the detection of the associated core promoters. Historically, precise definition of TSS has been laborious and addressed one gene at a time. Therefore, few genes have had their start sites mapped in detail. We, and others, have presented techniques that can identify TSS on a genome-wide scale—typically, by generating full-length cDNAs and then sequencing short tags at their 5′ ends (Ng et al. 2005; Kodzius et al. 2006). The largest study to date using such methods (Carninci et al. 2006) was carried out by taking advantage of the cap analysis of gene expression (CAGE) technology (Kodzius et al. 2006). In the FANTOM3 project, tag libraries were sequenced from 22 tissues with an average ~48,500 tags per library in mouse. This study gave new insights into how transcription initiation works, as reviewed (Muller et al. 2007; Sandelin et al. 2007), and suggested that mammalian cells have many more core promoters than previously appreciated. In a previous study (Gustincich et al. 2006), we have also shown that mouse brain regions have a higher number of active TSS than other tissues, presumably leading to a higher diversity of distinct transcripts in these tissues. Since the average mammalian cell is estimated to express at least 350,000 mRNAs (Jackson et al. 2000) and the brain is a highly complex tissue with many distinct regions, which in themselves have high cellular heterogeneity, it is evident that the promoters sampled so far are just skimming the surface of the brain transcriptional complexity.

Among the different regions of the brain, the hippocampal formation (hippocampus, dentate gyrus, and subiculum) has been the subject of intense studies due to its essential role in the formation of new episodic and long-term memories (Bird and Burgess 2008). In particular, the hippocampus has been a major experimental system to unveil the role of synaptic plasticity. The hippocampus has also been implicated in a number of neurological and psychiatric disorders including epilepsy and Alzheimer's disease. Additionally, the subgranular zone of the dentate gyrus is one of the sites of adult neurogenesis (Zhao et al. 2008).

The study of the hippocampal formation has relied on its distinctive and readily identifiable structure at both gross and histological levels (Paxinos 2004). The hippocampus is divided into fields CA1–CA3 comprising the pyramidal cell layer, where the pyramidal cells are present, and a heterogeneous group of diverse GABAergic interneurons (Parra et al. 1998; Maccaferri and Lacaille 2003). The dentate gyrus comprises the molecular, principal, and polymorphic layers where the granule cells are the principal cells and the pyramidal basket cells are the most prominent class of interneurons. In a few cases, the use of cell-type-specific knockout mice for ligands or receptors of neuroactive molecules has made it possible to integrate physiological, anatomical, and molecular data from synapses to understand animal behavior (Tsien et al. 1996; Nakazawa et al. 2004). However, our understanding of the physiological organization of the hippocampus has been hampered by difficulties in describing the complete repertoire of neuronal cell types and their properties (Parra et al. 1998).

Thus, the identification of specific promoters that initiate transcription and drive gene expression in hippocampus, or even in specific subsets of cells within the tissue, will be important for developing new lines of mice with cell- or tissue-specific gene isoforms labeled and/or knocked out, or by the use of neuronal cell ablation (Watanabe et al. 1998; Tonegawa et al. 2003). The identification of novel promoters, particularly those that drive expression in a small number of neurons, may also lead to the description of rare neuronal types that have not been previously characterized.

To this end, we present DeepCAGE, a method combining CAGE technology with a high-throughput tag sequencer, the GS20 sequencer (454 Life Sciences [Roche]). We use DeepCAGE to create a comprehensive resource of hippocampal TSS and core promoters. Since with this method we can reach an unprecedented sampling depth (2 million tags, of which 1.4 million can be unambiguously mapped to the genome), we explored this data set to find promoters preferentially used in the hippocampus, their effects on the proteome, correlation with spatial expression data, and to understand the transcriptional regulatory program in the hippocampus.

Results


DeepCAGE sequencing

We adapted the CAGE method (Shiraki et al. 2003; Kodzius et al. 2006) to the 454 Life Sciences (Roche) GS20 sequencer as described in Methods (Fig. 1). Briefly, total RNA from pooled C57B6/J mouse hippocampi was purified and used as a template in the first-strand cDNA reaction primed by random primers to capture both the poly(A)+ and poly(A)− RNA species. To extend cDNA synthesis through GC-rich regions in the 5′ UTR, we carried out the reverse transcription reaction at high temperature in the presence of trehalose and sorbitol (Carninci et al. 2002). cDNAs reaching the cap site were then selected by cap-trapping. They were then ligated to a linker carrying a recognition site for the class-IIs restriction endonuclease MmeI immediately adjacent to the start of the cDNAs, which corresponds to the 5′-end of the original RNAs. This linker was used to prime second-strand cDNA synthesis. Subsequently, MmeI digestion cleaved 20–21 bp within the double-stranded cDNA, releasing CAGE tags. After ligation of a second linker to the 3′ end opened by MmeI digestion, CAGE tags were PCR amplified, purified, and further amplified before restriction and concatenation for direct sequencing (see Methods). The DeepCAGE technology does not require cloning in bacteria.

Figure 1.
Preparation of DeepCAGE cDNA libraries. cDNA is produced with reverse transcriptase using random priming to maximize chances to reach the cap sites and to include non-polyadenylated RNAs. The cap site is biotinylated, followed by cleavage of single-strand ...

The key step for direct sequencing on the 454 device is the introduction of specific primer sites at the ends of the concatamers by mixing the CAGE tags at a ratio of 20:1 with a mixture of the 454 linkers “A” and “B.” Since these linkers (Margulies et al. 2005) can ligate DNA only on one side, they terminate the concatenation reaction and provide ends suitable for sequencing when A and B appear on the opposite sides of the concatamer, regardless of the orientation of the CAGE tags and insert size, which was optimized to be ~500 bp.

After a first test run, we produced two large-scale reactions, achieving in total ~2 × 10⁶ CAGE tags. After sequencing, tags were mapped to the mouse genome (mm8 assembly) using an algorithm based on our previous studies (see Methods); in total, 1.4 × 10⁶ tags map with high stringency (see Supplemental Table S1). The same mapping protocol was applied to all other CAGE tag libraries that are part of the data set used in this study. All CAGE tags have been submitted to the DNA Data Bank of Japan (DDBJ) under accession numbers AGAAA0000001-AGAAA0552486.

Data sets and resource preparation

Similar to most other high-throughput genomic technologies (The ENCODE Project Consortium 2007), CAGE data are most useful together with other data sets. In this study, we compare the hippocampus CAGE data to seven other CAGE data sets, each corresponding to a different tissue and with varying sequencing depth, including three brain tissues: visual cortex, somatosensory cortex, and cerebellum (see Supplemental Table S1). We also use the FANTOM3 cDNA set for associating promoters with genes and gene annotation (Carninci et al. 2005). As a resource for the community, the CAGE data sets (including “tissue-specific” promoter sets discussed below) prepared here are freely available as data tracks and sequence files at http://people.binf.ku.dk/albin/supplementary_data/hcamp/, from where they can be directly uploaded to the UCSC browser for visualization, or downloaded for analysis by power users. Additional statistics such as the fraction of tags mapping to known 5′-ends are shown in Supplemental Figure S1.

Exploration of tissue preferences for core promoters

As shown previously, for analysis of core promoters, it is helpful to group CAGE tags that map close to each other on the genome. Using the method of Carninci et al. (2006), tags from any tissue were grouped into a tag cluster if their genome mapping coordinates overlap on the same strand.

To explore the overall tissue preference of the CAGE clusters, we selected all CAGE tag clusters that have more than 30 tags per million (TPM), when counting all tissues. For clarity, TPM normalization is commonly used in tag-based studies and can be described as normalizing all tag counts so that the total count of mapped tags within a library equals 10⁶ tags. The reason for this conservative cutoff is that a certain number of tags are needed to assess tissue distributions. For simplicity, we refer to these clusters as core promoters in this study, in the same way as in Carninci et al. (2006) and Sandelin et al. (2007).
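
As a hypothetical worked example of TPM normalization within a single library (the tag count of 42 is illustrative and not taken from the data), a tag cluster with 42 mapped tags in a library of 1.4 × 10⁶ mapped tags would have

```latex
\mathrm{TPM} = \frac{42}{1.4 \times 10^{6}} \times 10^{6} = 30
```

Such a cluster would sit exactly at the cutoff and would not pass it on this library alone, since the selection requires more than 30 TPM.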

This analysis identified 18,948 core promoters. To explore both what fraction of promoters are expressed primarily in one or a subset of tissues and what tissues have similar promoter usage, we hierarchically clustered these core promoters in terms of their expression and visualized the results as a heatmap (Fig. 2A; Eisen et al. 1998). We observe that:

  1. The brain tissues cluster together in terms of promoter usage; in particular, visual and somatosensory cortex have the most similar usage, with few CAGE tag clusters being preferentially used in only one of these tissues.
  2. The cortex tissues have substantial “smearing.” Many promoters are used, but they are also shared between at least two tissues. This is a property seen also for macrophage and lung tissues, which might be due to the large number of macrophage cells present in lung tissue.
  3. Conversely, there is a large set of promoters that are used mostly in the hippocampus: This promoter set has very little smearing, a feature shared with less complex tissues such as liver. The cerebellum is somewhere in between the hippocampus and the cortex tissues in terms of smearing and specificity.

Figure 2.
Exploration and validation of identified core promoters. (A) Exploration of tissue usage in all core promoters having more than 30 tags per million using hierarchical clustering, with CAGE tag expression data from the actual core promoters. Preferential ...

Note that this analysis only captures general properties of the sets—there are always individual exceptions to any “rules” implied, but the goal is to see the overall tendencies. The results from this data exploration suggest that both the hippocampus and cerebellum have more core promoters that are biased toward the tissue in question than other sampled brain tissues. We explore these in the next section.

Definition of preferentially expressed promoters

We then identified the core promoters that are significantly biased toward individual tissues. Historically, in molecular biology literature, promoters with this property are often called tissue specific: We will avoid this term here since it implies exclusive expression in a given tissue, which is impossible to ascertain since we cannot sample all tissues with infinite depth. We can only assess the properties of the tissues sampled, and the goal is to find tag clusters that are preferentially used in one of the sampled set of tissues compared to the rest of tissues. We refer to such clusters as preferentially expressed promoters (PEPs), used in the following context: A liver PEP is a promoter that is preferentially used in liver tissue. We term a core promoter preferentially expressed for a certain tissue if (1) the number of tags from this tissue is greater than the sum of all other tags from other tissues, normalized for library size; and (2) this overrepresentation is statistically significant (see Methods).
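
The two criteria above can be sketched in code. This is a minimal illustration, not the authors' implementation: the significance test used in the study is described in its Methods and is not reproduced here, so the one-sided binomial test, the `alpha` threshold, and all function names below are assumptions chosen only to make the example concrete.

```python
from math import comb


def tpm(count, library_size):
    """Normalize a raw tag count to tags per million for one library."""
    return count / library_size * 1_000_000


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); exact sum, suitable for small n only."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def is_pep(counts, library_sizes, tissue, alpha=0.01):
    """Return True if the cluster looks preferentially expressed in `tissue`.

    counts:        raw tag counts of one cluster, keyed by tissue
    library_sizes: total mapped tags per tissue library
    alpha:         hypothetical significance threshold (an assumption)
    """
    # Criterion 1: normalized tags from this tissue exceed the sum of all others.
    norm = {t: tpm(counts.get(t, 0), size) for t, size in library_sizes.items()}
    others = sum(v for t, v in norm.items() if t != tissue)
    if norm[tissue] <= others:
        return False
    # Criterion 2 (assumed test): under the null, each tag of the cluster comes
    # from `tissue` with probability equal to that library's share of all tags.
    n = sum(counts.get(t, 0) for t in library_sizes)
    p = library_sizes[tissue] / sum(library_sizes.values())
    return binomial_upper_tail(counts.get(tissue, 0), n, p) < alpha


if __name__ == "__main__":
    sizes = {"hippocampus": 1_400_000, "liver": 200_000, "cerebellum": 250_000}
    cluster = {"hippocampus": 90, "liver": 2, "cerebellum": 5}
    print(is_pep(cluster, sizes, "hippocampus"))
```

Criterion 1 is evaluated first because it is cheap and, by construction, at most one tissue can satisfy it for a given cluster.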

Using this method, out of the 18,948 core promoters assessed, 6536 (34%) are preferentially used in a single tissue. Hippocampus, cerebellum, and liver have the most PEPs, while visual and somatosensory cortex have the fewest (Fig. 2B). Interestingly, only 8% of the hippocampus PEPs analyzed have only hippocampus tags, although this fraction depends on the expression cutoff (30 TPM) used to select tag clusters in the first place: Smaller cutoffs give a larger number of promoters detected only in hippocampus (Supplemental Fig. S2).

As noted previously (Gustincich et al. 2006), promoters used preferentially in brain generally have multiple TSS and higher CpG content compared to promoters used preferentially in other tissues, which often have a single peak distribution of TSS, governed by a TATA-box (Sandelin et al. 2007). We found that hippocampus PEPs share the properties of the other brain PEPs in this regard—broader, CpG-rich promoters with fewer TATA patterns (Supplemental Figs. S3–S5). We also note that for promoters that are used strongly in many brain tissues, the hippocampus tag usage at the nucleotide level generally correlates well with tag distributions from the other brain tissues (Supplemental Figs. S6 and S7). There are exceptions to this; promoters where the tag distribution shape differs between tissues have been explored previously by Kawaji et al. (2006).

We then assessed where the PEPs from the various tissues were located in terms of overlap with known genes (see Methods). Figure 2B shows that the hippocampus, apart from having the largest number of PEPs overall, also has the greatest number of PEPs located in intronic and intergenic space. This indicates that there are many strong promoters preferentially expressed in the hippocampus that have no known corresponding gene. This observation is not simply a sample size effect, as the cerebellum has almost as many PEPs with just ~18% of the sequencing depth of the hippocampus.

RACE validates distal upstream promoters

While PEPs falling within genes can be considered candidate alternative promoters for the same gene, PEPs in intergenic space are more likely to be the promoters of novel transcripts. We selected 10 intergenic hippocampus PEPs for RACE validation. Importantly, the selection was not in any way based on additional information that would be a validation in itself, such as EST sequences. Out of the 10 cases, eight had a PCR product, and the sequenced product validated a hippocampus PEP in six of these cases. Of the failed cases, one had supporting evidence from other sources (overlap of 5′-ends of spliced ESTs) (data not shown). The outcome is comparable to that of RACE validation of intergenic transcribed regions from tiling array data in the ENCODE project (50%–70% success rate) (The ENCODE Project Consortium 2007); however, this should not be viewed as a true sensitivity measure of DeepCAGE, for three principal reasons: (1) Even in perfect circumstances, RACE does not have perfect sensitivity; (2) as we are focusing on novel core promoters, we do not know the exon structure of the downstream product, which makes primer design nontrivial; and (3) many of the promoters have high GC content, which makes amplification challenging. Nevertheless, these results show that promoters inferred by DeepCAGE can be detected by other methods, as shown previously in extensive validation experiments of the original FANTOM3 CAGE study (see Supplemental Material of Carninci et al. 2006).

In the annotation process, we noticed a considerable number of cases in which intergenic hippocampus PEPs were located relatively close to the 5′-end of a known gene. These cases are likely novel alternative upstream promoters. An example is shown in Figure 2C: CAGE and RACE data show a novel hippocampus PEP upstream of the mouse Bai3 (brain-specific angiogenesis inhibitor 3) gene. An extreme case is shown in Figure 2D: CAGE identifies a hippocampus PEP that is upstream of the Arpc5 (actin-related protein 2/3 complex, subunit 5) gene, but on the other strand, forming a bidirectional promoter. RACE validation as well as human orthologous transcripts and EST evidence show that the novel promoter is likely a distal upstream alternative promoter of the Rgl1 (ral guanine nucleotide dissociation stimulator-like 1) gene, whose RefSeq-annotated start site (which is also a hippocampus PEP) (data not shown) is a remarkable ~141 kb downstream from the novel promoter. Therefore, while we focused on intergenic promoters to find novel transcripts, we often identified novel promoters that provide new ways to regulate the transcription of known genes.

Brain tissues use different alternative promoters within the same gene

As shown in Figure 2B, the majority of hippocampus PEPs are located inside genes, overlapping annotated exons. It is likely that many of these are alternative promoters for the same gene, since many of them are supported by full-length cDNAs (see examples in Figs. 3 and 4). Alternative promoters are interesting for three reasons: (1) They allow a gene to have multiple, distinct, regulatory inputs; (2) alternative promoter locations can affect the protein content of the gene product similarly to alternative splicing; and (3) it is important for molecular approaches in neurobiology to selectively knock down gene isoforms that are preferentially used in a given tissue.

Figure 3.
Alternative promoters preferentially expressed in different brain tissues. (A) The Venn diagram shows the number of genes having at least one preferentially expressed promoter (PEP) from any of the four sampled brain tissues, or any combination PEPs of ...
Figure 4.
Examples of changes of domain content for genes by use of hippocampus PEPs. Hippocampus preferentially expressed promoter (PEP) locations are shown as red triangles. Locations of predicted protein domains are shown as colored blocks (note that domains ...

We first identified all genes containing one or more PEPs from hippocampus, somatosensory cortex, visual cortex, and cerebellum inside exons. Then, we counted the number of genes with multiple distinct PEPs from the different tissues (Fig. 3A). The Dlgap1 gene (guanylate kinase-associated protein [GKAP] or SAPAP [synapse-associated protein 90-postsynaptic density-95-associated protein]) (Fig. 3B) is exceptional since it has four core promoters that are preferentially used in hippocampus, somatosensory cortex, visual cortex, and cerebellum, respectively. All of these PEPs overlap corresponding 5′-ends from full-length cDNAs (Fig. 3B); in this case, the CAGE data verify these 5′-ends and assign tissue expression constraints. Dlgap1 is a scaffolding postsynaptic density protein at excitatory synapses; it contains 14-amino-acid repeats at the N terminus that are involved in protein–protein interactions and are affected by different promoter usage (Kim et al. 1997; Romorini et al. 2004). The CAGE data indicate that all the PEPs are upstream of these repeats except for the cerebellum PEP, indicating that cerebellum transcripts do not include the repeats. Thus, the selection of alternative promoters has in this case a clear functional consequence.

We then sought to systematically identify potential changes in protein domain composition caused by usage of hippocampus PEPs. Using cDNA data, we predicted protein domains to genomic positions and determined in how many cases a hippocampus PEP falls within a gene but downstream from a protein domain within the same gene, which then would give a protein product that is lacking the domain in question. Using conservative criteria (see Methods), we found 50 such genes (see Supplemental Material). Three examples (Pclo, Bai1, and Myo10), showing dramatic protein domain content diversity, are shown in Figure 4.
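
The positional test described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the tuple layout for domains, and the rule that a domain counts as lost only when it lies entirely upstream of the PEP are all hypothetical, and the conservative criteria actually used are described in the Methods rather than here.

```python
def domains_lost(pep_pos, strand, domains):
    """List domains that a transcript starting at `pep_pos` would lack.

    pep_pos: genomic coordinate of the PEP (transcription start)
    strand:  "+" or "-" strand of the gene
    domains: iterable of (name, start, end) genomic intervals, start <= end,
             for protein domains projected onto the genome
    A domain is reported only if it lies entirely upstream of the PEP in
    transcript orientation (an assumed, deliberately strict rule).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    lost = []
    for name, start, end in domains:
        if strand == "+":
            upstream = end < pep_pos      # domain ends before the PEP
        else:
            upstream = start > pep_pos    # minus strand: upstream is higher coordinates
        if upstream:
            lost.append(name)
    return lost


if __name__ == "__main__":
    example = [("domain_A", 100, 200), ("domain_B", 500, 600)]
    print(domains_lost(300, "+", example))  # ['domain_A']
    print(domains_lost(300, "-", example))  # ['domain_B']
```

Handling the strand explicitly matters because "downstream from a protein domain" refers to transcript orientation, which is reversed relative to genomic coordinates for genes on the minus strand.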

Transcription factor binding sites analysis on specific core promoters

An advantage of CAGE is that tags give high-resolution mappings of active TSS, which can be used to pinpoint core promoters for computational sequence analyses (Wasserman and Sandelin 2004). We first analyzed the −1000 to +200 region surrounding PEPs from the tissues in Supplemental Table S1 for significantly overrepresented motif matches from the JASPAR database (Vlieghe et al. 2006). Our results are largely consistent with previous studies of promoters used primarily in single tissues; for instance, homeobox motifs are overrepresented in embryo PEPs, ETS motifs in macrophage PEPs, and so on (Supplemental Tables S3 and S4). Since CAGE data may also be interpreted as promoter usage (the number of tags mapped to a locus), we investigated which transcription factor genes are strongly expressed in hippocampus, and whether their predicted transcription factor binding sites show a clear preference for the promoters that are preferentially used in the same tissue.
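
Extracting the −1000 to +200 region requires strand-aware coordinate arithmetic, sketched below. The function name and the convention of adding offsets directly to the TSS coordinate are assumptions for illustration; the exact coordinate convention used in the study is not stated in this section.

```python
def promoter_window(tss, strand, upstream=1000, downstream=200):
    """Return (start, end) genomic coordinates of the region around a TSS.

    tss:        genomic coordinate of the transcription start site
    strand:     "+" or "-"
    upstream:   bases to include 5' of the TSS (default 1000)
    downstream: bases to include 3' of the TSS (default 200)
    The returned start is always <= end, whatever the strand.
    """
    if strand == "+":
        return (tss - upstream, tss + downstream)
    if strand == "-":
        # On the minus strand, upstream of the TSS means higher coordinates.
        return (tss - downstream, tss + upstream)
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(promoter_window(5000, "+"))  # (4000, 5200)
    print(promoter_window(5000, "-"))  # (4800, 6000)
```

A sequence extracted for a minus-strand window would additionally need to be reverse-complemented before motif scanning, which is left out of this sketch.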

Figure 5A plots the fraction of hippocampus tags in transcription factor genes compared to other tissues versus the overall hippocampus expression of the same genes (CAGE TPMs) (see Methods). Only a handful of transcription factor genes stand out as very highly expressed in hippocampus, including Arnt2, Sp3, and Aes, and only some of these have a clear preference for hippocampus. We compared these highly expressed transcription factors with mouse in situ hybridization experiments from the Allen Brain Atlas (Lein et al. 2007, see image 5B-I). Overall, there is a high correspondence between both preferential expression in hippocampus and overall strength of expression between CAGE and the in situ hybridization data. The general pattern is that in situ data show high correlation with the CAGE data if the number of CAGE tags is high, while the gene is sometimes not visibly expressed in situ if the CAGE tag count is low; this is likely because a certain number of transcripts are necessary to get a visible signal in in situ hybridization, whereas CAGE technology may sample smaller numbers of transcripts in a cell.

Figure 5.
Transcription factor genes with preferential expression in hippocampus. (A) The relation between expression strength (number of hippocampus CAGE tags/million) vs. the “tissue specificity” (fraction of hippocampus tags vs. all brain tags, ...

For most of the highly expressed transcription factors, we have no corresponding computational model for how they bind DNA. However, since many transcription factors from the same structural class bind similar target sequences (Sandelin and Wasserman 2004), observed overrepresentation of hits with a given model might be due to binding sites from a different factor with similar binding preferences. As an example, the predicted sites for the well-studied bHLH-PAS Arnt gene are overrepresented in hippocampus PEPs (Supplemental Table S3), but the Arnt gene is expressed at low levels in the whole brain as measured by both CAGE and in situ data (data not shown). Interestingly, its paralog Arnt2 is primarily expressed in brain. According to the CAGE data, Arnt2 is highly, and preferentially, expressed in hippocampus (Fig. 5A). Furthermore, in situ images of Arnt2 confirm a distinctive expression of Arnt2 in the CA1 region of the hippocampus (Fig. 5B). This leads to the hypothesis that the Arnt predicted sites are, in fact, sites for Arnt2, which would make Arnt2 a major factor in hippocampus transcription regulation.

Promoters used in restricted cell types in hippocampus

We have focused above on relatively strong promoters having more than 30 TPM in order to study the distribution of tags from different tissues in a statistically valid way. The significance of transcripts that are present in a tissue with low frequency has been met with suspicion, and the observations were often labeled either as methodological or transcriptional noise. Although both of these are still a possibility, we have explored the expression properties of known genes having substantially less than 30 TPM by analyzing their spatial expression patterns using in situ hybridization data. In Figure 6, we compare the number of tags hitting a known gene and the corresponding in situ images. Note that in almost all these cases, the tags hit the annotated 5′-end of the gene (Supplemental Fig. S8).

Figure 6.
CAGE identifies promoter activity from small subpopulations of hippocampal cells. Examples of correspondence between CAGE tags and signal detected by in situ hybridization, ordered from relatively high expression (from the top left quadrant), expressed ...

From Figure 6, we note that rare tags frequently identify genes whose expression is restricted only to a reduced and well-defined cell population within the tissue, which may have specific physiological roles. Within these cells, the gene is highly expressed, but since the CAGE experiment is performed on all the cells in the tissue, the total signal is averaged out. Development of next-generation technologies to prepare CAGE libraries from individual neuronal populations will further address this variability. Nevertheless, the ability of DeepCAGE to detect transcripts used in only a handful of cells in a complex tissue shows the utility of the method.

Discussion


We are on the verge of a new era, in which sequencing technology can be used to infer biological function on a comprehensive scale, and the power of sequencing centers will be available to normal laboratories. Here, we have modified the CAGE protocol for the 454 Life Sciences instrument and demonstrated the usefulness of deep sequencing to discover new promoters in complex tissues.

We have identified a large number of core promoters that are preferentially used within hippocampus. Our results indicate that of the tissues we tested, the hippocampus has the largest number of such promoters, closely followed by cerebellum. These results may be due to two different factors: cell type diversity of the tissue and sequencing depth. The cerebellum is one of the least complex among brain tissues, while the hippocampus and the cortex tissues are very complex, with a plethora of different cell types. This is also shown in Figure 6, where it is evident that small, distinct cell populations within the hippocampus express a given gene, while most other cells do not. The methods we use to measure transcription cannot quantitatively measure the diversity of cells within a tissue, nor, consequently, the transcription dynamics within single cells. This strongly motivates further developments to assess the expression rates of genes within smaller cell populations. Such approaches, in combination with the in situ data from the Allen Brain Atlas, may result in a new molecular taxonomy of the different types of hippocampal cells, providing a framework for a complete description of the components of the hippocampal cell network (Gray et al. 2004; Ma 2006; Sugino et al. 2006).

The subset of novel hippocampus promoters that are not overlapping known transcripts could indicate noncoding RNAs that have as yet not been sampled in mouse; Mercer et al. (2008) showed the existence of other long noncoding RNAs that are expressed primarily in brain. On the other hand, we also find that many of the “novel” promoters, in fact, are new upstream promoters for known genes.

However, most of these novel promoters fall within known genes, and we find that many genes have different core promoters that are used preferentially by different brain tissues, which may give partially different RNAs and protein products. By extension, identification of cell-type-specific alternative promoters for genes encoding proteins responsible for neuronal and synaptic activity (channels, receptors, etc.) may provide increased specificity for drug treatments for epilepsy and other hippocampal-related neuropsychiatric disorders. Although this approach may seem far from our current technology, the use of antigene RNAs (agRNA) or peptide nucleic acids (agPNA) as well as of Locked Nucleic Acids (LNA) that target specific promoters has been proposed and demonstrated in vitro (Janowski et al. 2005a,b). This is particularly relevant since promising new strategies for delivery of nucleic-acid-based modifiers of gene expression into the brain have recently been demonstrated (Kumar et al. 2007). To this end, the data set we present here, enabled by DeepCAGE, is to date the most comprehensive brain-centric promoter-exploration resource.

Methods


Preparation and sequencing of CAGE libraries

The preparation of the CAGE library is adapted from Shiraki et al. (2003) and Kodzius et al. (2006), to work with the 454 Life Sciences sequencer. The schema is represented in Figure 1. A detailed protocol of the CAGE library preparation, starting from trizol-extracted RNAs, is available in the Supplemental Material.

Once the CAGE library is prepared, we test various bead-to-library ratios, usually using an excess of DNA over beads (1:4 to 1:16 ratio beads:DNA) in the 454 GS20 protocol. During the calibration of the instrument, small-scale runs (1/8 of small kit runs) are used to find the best DNA/beads ratio, followed by one or more runs of 454 large-scale sequencing kits (further details at http://www.454.com/).

In silico mapping of CAGE tags

Sequenced CAGE tags were mapped to mouse chromosomes and the mitochondrial genome (Genome build: mm8) using the BLAST/Vmatch alignment programs, and the longest full-matched (meaning no mismatches in the middle) positions were selected. These tags were referred to as “single-mapped” tags. Tags that map to multiple locations on the genome (with the same length) were called “multi-mapped,” and tags that did not map (mapped <18 bp long) were called “unmapped.” These multi-mapped and unmapped tags were passed to the rescue stage to increase the number of “single-mapped” tags (see Supplemental Material), since many promoters share identical subsequences (Faulkner et al. 2008). Rescued tags were incorporated into the single-mapped tag collection, and other tags were discarded. In the rest of the analysis, we use only the single-mapped tags; note that the same mapping procedure was applied to all CAGE libraries in the study.
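
The three mapping categories can be illustrated with a small classifier. This is a sketch under assumptions: the input is a hypothetical list of alignment hits that have already been filtered to full matches, the names are invented for illustration, and the rescue stage described in the Supplemental Material is not modeled.

```python
MIN_MATCH_LENGTH = 18  # tags matching fewer than 18 bp are treated as unmapped


def classify_tag(hits):
    """Classify one tag from its alignment hits.

    hits: list of (chrom, strand, position, match_length) tuples, assumed to
          contain only full matches (no internal mismatches)
    Returns (category, best_hit), where best_hit is set only for
    single-mapped tags.
    """
    usable = [h for h in hits if h[3] >= MIN_MATCH_LENGTH]
    if not usable:
        return ("unmapped", None)
    longest = max(h[3] for h in usable)
    best = [h for h in usable if h[3] == longest]
    if len(best) == 1:
        return ("single-mapped", best[0])
    # Several locations tie at the longest match length.
    return ("multi-mapped", None)


if __name__ == "__main__":
    print(classify_tag([("chr1", "+", 1000, 20), ("chr2", "-", 500, 19)]))
    print(classify_tag([("chr1", "+", 1000, 20), ("chr2", "-", 500, 20)]))
    print(classify_tag([("chr1", "+", 1000, 17)]))
```

Selecting the longest match first means a tag with one 20-bp hit and one 19-bp hit is still single-mapped; only ties at the longest length make a tag multi-mapped.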

Mouse hippocampus RNA preparation and 5′-RACE PCR validation of target intergenic core promoters

Adult C57/Bl wild-type mice (n = 5) were sacrificed by CO2 inhalation, and hippocampal regions were rapidly dissected and snap-frozen in liquid nitrogen. Total RNA was extracted with TRIzol reagent (Invitrogen) following the manufacturer's protocol; the RNA sample was treated with DNase (Ambion), aliquoted in RNase free LoBind tubes (Eppendorf), and stored at −80°C.

RACE-ready cDNA was obtained with the GeneRacer kit (Invitrogen) following the manufacturer's protocol with no modifications, starting from 5 μg of hippocampus total RNA. 5′-RACE was carried out using Platinum Taq DNA polymerase High Fidelity (Invitrogen); each PCR product was cloned into a TOPO TA vector (Invitrogen) and transformed into OneShot Top10 chemically competent Escherichia coli cells. Five colonies from each plate were selected for growth, DNA extraction (DNA Mini Kit; QIAGEN), and sequencing.

Oligonucleotide primers for the validation of intergenic core promoters were hand-designed according to guidelines from the GeneRacer kit manual and checked with PerlPrimer for possible primer-dimer formation. Primers used to validate the core promoters are shown in Supplemental Table S1.

Generation of tag clusters

A tag cluster (TC) was defined as the maximal set of tags on the same strand in which every 5′-end lies <20 bp from its closest neighbor. We chose 20 bp because it is approximately the length of a CAGE tag, and thus we know with certainty that the transcript starting at the tag's 5′-end spans at least this region. This is the same definition as used in Carninci et al. (2006).
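The clustering rule amounts to single-linkage grouping of sorted 5′-end positions on each strand. The sketch below is an illustration under assumptions, not the code used in the study: the tag layout (chromosome, strand, position) and the function name `tag_clusters` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_GAP = 20  # 5' ends closer than this many bp join the same cluster

# Hypothetical tag layout: (chromosome, strand, 5'-end position).
Tag = Tuple[str, str, int]


def tag_clusters(tags: Iterable[Tag]) -> List[Tuple[str, str, List[int]]]:
    """Group tags into clusters of neighbors <20 bp apart on the same strand."""
    by_strand: Dict[Tuple[str, str], List[int]] = defaultdict(list)
    for chrom, strand, pos in tags:
        by_strand[(chrom, strand)].append(pos)

    clusters = []
    for (chrom, strand), positions in sorted(by_strand.items()):
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] < MAX_GAP:
                current.append(pos)  # close enough to the previous 5' end
            else:
                clusters.append((chrom, strand, current))
                current = [pos]  # gap of 20 bp or more starts a new cluster
        clusters.append((chrom, strand, current))
    return clusters
```

Because positions are sorted, comparing each tag only with its predecessor is enough: a chain of tags spaced 19 bp apart forms one cluster, however long the chain is.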

Exploration of tissue preferences of tag clusters

We first normalized the expression of each TC to TPMs for each tissue:

$$n_{c,i} = \frac{r_{c,i}}{r_i} \times 10^6, \qquad n_c = \sum_i n_{c,i}$$

where $r_{c,i}$ is the number of non-normalized tags in cluster c for tissue i, $r_i$ is the total number of non-normalized tags for tissue i, $n_{c,i}$ is the TPM of cluster c in tissue i, and $n_c$ is the total TPM for the cluster. Only TCs with $n_c$ > 30 TPM were considered. We then normalized the total expression of each such tag cluster to sum to 1:

$$t_{c,i} = \frac{n_{c,i}}{n_c}$$

where $t_{c,i}$ is the normalized contribution of tissue i to cluster c. We then hierarchically clustered the set of promoters in terms of expression in each tissue (the $t_{c,i}$ values) using Euclidean distance as the distance measure and complete linkage as the clustering method (the defaults of the dist() and hclust() functions in R, respectively). The reordering of columns (tissues) and rows (tag clusters) was visualized using the heatmap.2() function in the gplots R package.
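As an illustration of the normalization, the following Python sketch computes the per-tissue contributions for one tag cluster. It assumes the usual tags-per-million scaling (raw count divided by library size, times 10^6); the function name `tissue_contributions` and the dictionary inputs are hypothetical, and the hierarchical clustering step, which the study performed in R, is not reproduced here.

```python
from typing import Dict, Optional

MIN_TPM = 30.0  # clusters with a total TPM at or below this are skipped


def tissue_contributions(
    raw_counts: Dict[str, int], library_sizes: Dict[str, int]
) -> Optional[Dict[str, float]]:
    """Return the share of each tissue in one cluster's total TPM.

    raw_counts maps tissue -> raw tags in this cluster;
    library_sizes maps tissue -> total raw tags in that tissue's library.
    Returns None when the cluster's total TPM does not exceed MIN_TPM.
    """
    # Assumed TPM scaling: tags per million tags in the library.
    tpm = {t: 1e6 * raw_counts.get(t, 0) / size for t, size in library_sizes.items()}
    total = sum(tpm.values())
    if total <= MIN_TPM:
        return None
    # Contributions sum to 1 across tissues.
    return {t: v / total for t, v in tpm.items()}
```

For a cluster with 90 tags in one library and 10 in another, each library holding one million tags, the contributions are 0.9 and 0.1.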

Generation of preferentially expressed promoters (PEPs)

To call a TC c preferentially expressed in a tissue i, we considered:

  1. The $t_{c,i}$ values as defined above. We required one such value to be >0.5, since with this cutoff a TC can be preferentially expressed in at most one tissue.
  2. An assessment of whether this overrepresentation was significant, that is, unlikely to have arisen from random sampling from the underlying tags. This can be expressed as a binomial overrepresentation test. We required that the TC in question presented a P-value <0.05 in a one-tailed binomial overrepresentation test (R function: binom.test).
  3. We only assessed core promoters with $n_c$ > 30 TPM. The tag number constraint is not strictly necessary; we introduced the additional constraint to reduce the number of statistical tests (as tests with few tags will always be insignificant) and to focus on strong promoters.
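The three criteria can be combined as in the Python sketch below. The source does not state the null probability of the binomial test, so the parameter `p0` is an assumption: one plausible choice is the fraction of all sequenced tags that come from the tissue in question. The function names are hypothetical, and the exact tail sum stands in for R's binom.test with a one-sided alternative.

```python
from math import comb


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def is_pep(
    t_ci: float,  # normalized contribution of tissue i to cluster c
    n_c: float,   # total TPM of cluster c
    k: int,       # raw tags of cluster c observed in tissue i
    n: int,       # raw tags of cluster c over all tissues
    p0: float,    # ASSUMED null probability that a tag comes from tissue i
    t_cut: float = 0.5,
    tpm_cut: float = 30.0,
    alpha: float = 0.05,
) -> bool:
    """Apply the contribution, expression, and significance criteria."""
    if n_c <= tpm_cut or t_ci <= t_cut:
        return False
    return binom_upper_tail(k, n, p0) < alpha
```

The exact sum is adequate for small tag counts; for clusters with thousands of tags a library routine is the safer choice, since the intermediate terms can exceed floating-point range.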

Mapping PEPs to genes and introns

PEPs were considered to belong to a gene if they had at least one tag on the same strand within the boundary of its transcript (using the RIKEN cDNA database), including a 50-bp slack at the 5′-end of the gene. If a PEP had no such overlap, it was considered intergenic. PEPs belonging to genes were further classified as exonic if the PEP overlapped an exon, and as intronic otherwise.
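A minimal Python sketch of this assignment follows. The `Gene` record, the closed genomic coordinates, and the function name `classify_pep` are assumptions made for illustration; the 50-bp slack is applied strand-aware, that is, before the start on the plus strand and after the end on the minus strand.

```python
from dataclasses import dataclass
from typing import List, Tuple

SLACK = 50  # bp added at the 5' end of each gene


@dataclass
class Gene:
    """Hypothetical gene record with closed genomic coordinates."""
    chrom: str
    strand: str
    start: int  # leftmost genomic coordinate
    end: int    # rightmost genomic coordinate
    exons: List[Tuple[int, int]]


def classify_pep(chrom: str, strand: str, tag_positions: List[int], genes: List[Gene]) -> str:
    """Return 'exonic', 'intronic', or 'intergenic' for one PEP."""
    lo, hi = min(tag_positions), max(tag_positions)
    in_gene = False
    for g in genes:
        if g.chrom != chrom or g.strand != strand:
            continue
        # The 5' end is the left edge on '+' and the right edge on '-'.
        g_lo = g.start - SLACK if g.strand == "+" else g.start
        g_hi = g.end + SLACK if g.strand == "-" else g.end
        if any(g_lo <= p <= g_hi for p in tag_positions):
            in_gene = True
            # Exonic if the PEP span overlaps any exon of this gene.
            if any(lo <= e_end and e_start <= hi for e_start, e_end in g.exons):
                return "exonic"
    return "intronic" if in_gene else "intergenic"
```

Under a literal reading of the rule, a PEP lying entirely within the 50-bp upstream slack overlaps no exon and is therefore labeled intronic; the sketch keeps that behavior rather than inventing a separate class.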

Domain annotation and PEPs

Domains were annotated using RIKEN cDNA annotation (corresponding to Interpro domain locations). To determine whether transcription initiation at the hippocampus PEPs changed the domain product, we used the gene mappings from above. Then we checked whether any domain in a gene containing a hippocampus PEP was upstream of this PEP and downstream from the annotated transcription start site. Usage of this PEP would result in the domain being lost and consequently in a different protein product.
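The positional check is strand-dependent, which the following Python sketch makes explicit. It assumes that the domain is given as a single genomic interval and that "upstream of the PEP" means the whole domain lies 5′ of it; the function name `domain_lost` and its arguments are hypothetical.

```python
def domain_lost(strand: str, annotated_tss: int, pep_pos: int,
                dom_start: int, dom_end: int) -> bool:
    """True if the domain lies between the annotated TSS and the PEP.

    Coordinates are genomic with dom_start <= dom_end. On the minus
    strand, 'downstream of the TSS' means smaller coordinates.
    """
    if strand == "+":
        return annotated_tss <= dom_start and dom_end < pep_pos
    return pep_pos < dom_start and dom_end <= annotated_tss
```

A transcript initiated at such a PEP would not include the domain-encoding region, which is the situation described above as loss of the domain.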

TFBS overrepresentation analysis

We searched all sequence sets with the JASPAR matrices (Vlieghe et al. 2006) using the ASAP tool (Marstrand et al. 2008) with the following settings: a uniform background model, a pseudo-count of 1, and a threshold value of 0.7 relative to the matrix-specific scoring range. For all matrices, we calculated a P-value representing the overrepresentation using the binomial test as described in van Helden et al. (1998). For the tables, the P-value threshold is <0.01. As a background set, we used all core promoters with more than 30 TPM.
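To illustrate what a threshold relative to the matrix-specific scoring range means, the Python sketch below scores sequence windows with a log-odds matrix and keeps those whose score lies at or above 70% of the way from the minimum to the maximum attainable score. This is not ASAP's implementation: the way the pseudo-count is spread over the four bases is an assumption, only the forward strand is scanned, and all names are hypothetical.

```python
from math import log2
from typing import Dict, List, Tuple

BASES = "ACGT"
BACKGROUND = 0.25     # uniform background model
PSEUDOCOUNT = 1.0     # ASSUMED to be spread over bases by background frequency
REL_THRESHOLD = 0.7   # fraction of the matrix-specific score range


def log_odds(counts: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Convert count columns (base -> count) into log-odds columns."""
    pwm = []
    for col in counts:
        total = sum(col[b] for b in BASES) + PSEUDOCOUNT
        pwm.append({
            b: log2(((col[b] + PSEUDOCOUNT * BACKGROUND) / total) / BACKGROUND)
            for b in BASES
        })
    return pwm


def hits(seq: str, pwm: List[Dict[str, float]]) -> List[Tuple[int, float]]:
    """Return (offset, relative_score) for windows at or above the threshold."""
    lo = sum(min(col.values()) for col in pwm)  # lowest attainable score
    hi = sum(max(col.values()) for col in pwm)  # highest attainable score
    width = len(pwm)
    found = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows with ambiguous bases
        score = sum(col[ch] for col, ch in zip(pwm, window))
        rel = (score - lo) / (hi - lo)
        if rel >= REL_THRESHOLD:
            found.append((i, rel))
    return found
```

The relative score makes one cutoff comparable across matrices of different width and information content, which an absolute log-odds cutoff would not be.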

In situ comparison

For comparing expression and tissue preference of transcription factors between CAGE and in situ experiments available from the Allen Brain Atlas (Lein et al. 2007), we calculated the hippocampus strength versus tissue preference T for each transcription factor gene (using the RIKEN TF database [Kanamori et al. 2004]):

equation image

where we sum over all TCs c that are within the boundary of the gene of interest, and all the brain tissues i. $n_{c,i}$ is the TPM count for the respective TC and tissue. This was then visually compared with corresponding in situ images, downloaded from http://www.brain-map.org/.


P.C. and Y.H. are supported by the National Project on Protein Structural and Functional Analysis from MEXT and the National Project on Genome Network Analysis and the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government. P.C. is also supported by a grant by the EU 6th Framework Program (NFG project). Authors affiliated with the Bioinformatics Centre are supported by a grant from the Novo Nordisk Foundation. The European Research Council has provided financial support to A.S. under the EU 7th Framework Programme (FP7/2007-2013)/ERC grant agreement 204135. S.G. is supported by a career development grant from “The Giovanni Armenise-Harvard Foundation.” We thank Susan Sunkin for help with in situ images and scientific discussion, Kazuho Ikeo and Toshitsugi Okayama for developing and sharing the CAGE mapping procedures before publication, and Akira Hasegawa for technical support.


[Supplemental material is available online at www.genome.org. CAGE tag sequences have been submitted to the DNA Data Bank of Japan (http://www.ddbj.nig.ac.jp/) under accession nos. AGAAA0000001-AGAAA0552486. Processed CAGE data sets are freely available at http://people.binf.ku.dk/albin/supplementary_data/hcamp/.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.084541.108.


  • Bird C.M., Burgess N. The hippocampus and memory: Insights from spatial processing. Nat. Rev. Neurosci. 2008;9:182–194. [PubMed]
  • Carninci P., Shiraki T., Mizuno Y., Muramatsu M., Hayashizaki Y. Extra-long first-strand cDNA synthesis. Biotechniques. 2002;32:984–985. [PubMed]
  • Carninci P., Kasukawa T., Katayama S., Gough J., Frith M.C., Maeda N., Oyama R., Ravasi T., Lenhard B., Wells C., et al. The transcriptional landscape of the mammalian genome. Science. 2005;309:1559–1563. [PubMed]
  • Carninci P., Sandelin A., Lenhard B., Katayama S., Shimokawa K., Ponjavic J., Semple C.A., Taylor M.S., Engstrom P.G., Frith M.C., et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 2006;38:626–635. [PubMed]
  • Eisen M.B., Spellman P.T., Brown P.O., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 1998;95:14863–14868. [PMC free article] [PubMed]
  • The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. [PMC free article] [PubMed]
  • Faulkner G.J., Forrest A.R., Chalk A.M., Schroder K., Hayashizaki Y., Carninci P., Hume D.A., Grimmond S.M. A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics. 2008;91:281–288. [PubMed]
  • Gray P.A., Fu H., Luo P., Zhao Q., Yu J., Ferrari A., Tenzen T., Yuk D.I., Tsung E.F., Cai Z., et al. Mouse brain organization revealed through direct genome-scale TF expression analysis. Science. 2004;306:2255–2257. [PubMed]
  • Gustincich S., Sandelin A., Plessy C., Katayama S., Simone R., Lazarevic D., Hayashizaki Y., Carninci P. The complexity of the mammalian transcriptome. J. Physiol. 2006;575:321–332. [PMC free article] [PubMed]
  • Jackson D.A., Pombo A., Iborra F. The balance sheet for transcription: An analysis of nuclear RNA metabolism in mammalian cells. FASEB J. 2000;14:242–254. [PubMed]
  • Janowski B.A., Huffman K.E., Schwartz J.C., Ram R., Hardy D., Shames D.S., Minna J.D., Corey D.R. Inhibiting gene expression at transcription start sites in chromosomal DNA with antigene RNAs. Nat. Chem. Biol. 2005a;1:216–222. [PubMed]
  • Janowski B.A., Kaihatsu K., Huffman K.E., Schwartz J.C., Ram R., Hardy D., Mendelson C.R., Corey D.R. Inhibiting transcription of chromosomal DNA with antigene peptide nucleic acids. Nat. Chem. Biol. 2005b;1:210–215. [PubMed]
  • Kanamori M., Konno H., Osato N., Kawai J., Hayashizaki Y., Suzuki H. A genome-wide and nonredundant mouse transcription factor database. Biochem. Biophys. Res. Commun. 2004;322:787–793. [PubMed]
  • Kaur B., Brat D.J., Devi N.S., Van Meir E.G. Vasculostatin, a proteolytic fragment of brain angiogenesis inhibitor 1, is an antiangiogenic and antitumorigenic factor. Oncogene. 2005;24:3632–3642. [PubMed]
  • Kawaji H., Frith M.C., Katayama S., Sandelin A., Kai C., Kawai J., Carninci P., Hayashizaki Y. Dynamic usage of transcription start sites within core promoters. Genome Biol. 2006;7:R118. doi: 10.1186/gb-2006-7-12-r118. [PMC free article] [PubMed] [Cross Ref]
  • Kim E., Naisbitt S., Hsueh Y.P., Rao A., Rothschild A., Craig A.M., Sheng M. GKAP, a novel synaptic protein that interacts with the guanylate kinase-like domain of the PSD-95/SAP90 family of channel clustering molecules. J. Cell Biol. 1997;136:669–678. [PMC free article] [PubMed]
  • Kodzius R., Kojima M., Nishiyori H., Nakamura M., Fukuda S., Tagami M., Sasaki D., Imamura K., Kai C., Harbers M., et al. CAGE: Cap analysis of gene expression. Nat. Methods. 2006;3:211–222. [PubMed]
  • Kumar P., Wu H., McBride J.L., Jung K.E., Kim M.H., Davidson B.L., Lee S.K., Shankar P., Manjunath N. Transvascular delivery of small interfering RNA to the central nervous system. Nature. 2007;448:39–43. [PubMed]
  • Lein E.S., Hawrylycz M.J., Ao N., Ayres M., Bensinger A., Bernard A., Boe A.F., Boguski M.S., Brockway K.S., Byrnes E.J., et al. Genome-wide atlas of gene expression in the adult mouse brain. Nature. 2007;445:168–176. [PubMed]
  • Ma Q. Transcriptional regulation of neuronal phenotype in mammals. J. Physiol. 2006;575:379–387. [PMC free article] [PubMed]
  • Maccaferri G., Lacaille J.C. Interneuron diversity series: Hippocampal interneuron classifications–making things as simple as possible, not simpler. Trends Neurosci. 2003;26:564–571. [PubMed]
  • Margulies M., Egholm M., Altman W.E., Attiya S., Bader J.S., Bemben L.A., Berka J., Braverman M.S., Chen Y.J., Chen Z., et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. [PMC free article] [PubMed]
  • Marstrand T.T., Frellsen J., Moltke I., Thiim M., Valen E., Retelska D., Krogh A. ASAP: A framework for overrepresentation statistics for transcription factor binding sites. PLoS One. 2008;3:e1623. doi: 10.1371/journal.pone.0001623. [PMC free article] [PubMed] [Cross Ref]
  • Mercer T.R., Dinger M.E., Sunkin S.M., Mehler M.F., Mattick J.S. Specific expression of long noncoding RNAs in the mouse brain. Proc. Natl. Acad. Sci. 2008;105:716–721. [PMC free article] [PubMed]
  • Muller F., Demeny M.A., Tora L. New problems in RNA polymerase II transcription initiation: Matching the diversity of core promoters with a variety of promoter recognition factors. J. Biol. Chem. 2007;282:14685–14689. [PubMed]
  • Nakazawa K., McHugh T.J., Wilson M.A., Tonegawa S. NMDA receptors, place cells and hippocampal spatial memory. Nat. Rev. Neurosci. 2004;5:361–372. [PubMed]
  • Ng P., Wei C.L., Sung W.K., Chiu K.P., Lipovich L., Ang C.C., Gupta S., Shahab A., Ridwan A., Wong C.H., et al. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat. Methods. 2005;2:105–111. [PubMed]
  • Parra P., Gulyas A.I., Miles R. How many subtypes of inhibitory cells in the hippocampus? Neuron. 1998;20:983–993. [PubMed]
  • Paxinos G. The rat nervous system. Elsevier Academic Press; San Diego, CA: 2004.
  • Romorini S., Piccoli G., Jiang M., Grossano P., Tonna N., Passafaro M., Zhang M., Sala C. A functional role of postsynaptic density-95-guanylate kinase-associated protein complex in regulating Shank assembly and stability to synapses. J. Neurosci. 2004;24:9391–9404. [PubMed]
  • Sandelin A., Wasserman W.W. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J. Mol. Biol. 2004;338:207–215. [PubMed]
  • Sandelin A., Carninci P., Lenhard B., Ponjavic J., Hayashizaki Y., Hume D.A. Mammalian RNA polymerase II core promoters: Insights from genome-wide studies. Nat. Rev. Genet. 2007;8:424–436. [PubMed]
  • Shepherd G.M. The synaptic organization of the brain. Oxford University Press; Oxford, UK: 2003.
  • Shiraki T., Kondo S., Katayama S., Waki K., Kasukawa T., Kawaji H., Kodzius R., Watahiki A., Nakamura M., Arakawa T., et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. 2003;100:15776–15781. [PMC free article] [PubMed]
  • Sousa A.D., Berg J.S., Robertson B.W., Meeker R.B., Cheney R.E. Myo10 in brain: Developmental regulation, identification of a headless isoform and dynamics in neurons. J. Cell Sci. 2006;119:184–194. [PubMed]
  • Sugino K., Hempel C.M., Miller M.N., Hattox A.M., Shapiro P., Wu C., Huang Z.J., Nelson S.B. Molecular taxonomy of major neuronal classes in the adult mouse forebrain. Nat. Neurosci. 2006;9:99–107. [PubMed]
  • Tonegawa S., Nakazawa K., Wilson M.A. Genetic neuroscience of mammalian learning and memory. Philos. Trans. R. Soc. Lond. B Biol. Sci. 2003;358:787–795. [PMC free article] [PubMed]
  • Tsien J.Z., Huerta P.T., Tonegawa S. The essential role of hippocampal CA1 NMDA receptor-dependent synaptic plasticity in spatial memory. Cell. 1996;87:1327–1338. [PubMed]
  • van Helden J., Andre B., Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol. 1998;281:827–842. [PubMed]
  • Vlieghe D., Sandelin A., De Bleser P.J., Vleminckx K., Wasserman W.W., van Roy F., Lenhard B. A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 2006;34:D95–D97. [PMC free article] [PubMed]
  • Wasserman W.W., Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat. Rev. Genet. 2004;5:276–287. [PubMed]
  • Watanabe D., Inokawa H., Hashimoto K., Suzuki N., Kano M., Shigemoto R., Hirano T., Toyama K., Kaneko S., Yokoi M., et al. Ablation of cerebellar Golgi cells disrupts synaptic integration involving GABA inhibition and NMDA receptor activation in motor coordination. Cell. 1998;95:17–27. [PubMed]
  • Zhao C., Deng W., Gage F.H. Mechanisms and functional implications of adult neurogenesis. Cell. 2008;132:645–660. [PubMed]

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press