J Physiol. Sep 1, 2006; 575(Pt 2): 321–332.
Published online Jul 20, 2006. doi:  10.1113/jphysiol.2006.115568
PMCID: PMC1819450

The complexity of the mammalian transcriptome

Abstract

A comprehensive understanding of protein and regulatory networks is strictly dependent on the complete description of the transcriptome of cells. After the determination of the genome sequence of several mammalian species, gene identification is based on in silico predictions followed by evidence of transcription. Conservative estimates suggest that there are about 20 000 protein-encoding genes in the mammalian genome. In the last few years the combination of full-length cDNA cloning, cap-analysis gene expression (CAGE) tag sequencing and tiling array experiments has unveiled unexpected additional complexities in the transcriptome. Here we describe the current view of the mammalian transcriptome, focusing on transcript diversity, the growing non-coding RNA world, the organization of transcriptional units in the genome and promoter structures. In-depth analysis of the brain transcriptome has been challenging due to the cellular complexity of this organ. Here we present a computational analysis of CAGE data from different regions of the central nervous system, suggesting distinctive mechanisms of brain-specific transcription.

The classic image of a mammalian transcriptome is composed of a large assembly of spliced mRNAs, each structured with a capped 5′ end, a 5′ untranslated region (5′ UTR), a coding sequence (CDS), a 3′ untranslated region (3′ UTR) and a polyA tail. This view is currently challenged. The striking conservation of the number of genes found in organisms as diverse as Caenorhabditis elegans, Mus musculus and Homo sapiens suggests that the increase in organismal complexity arises from additional mechanisms and cannot be explained by the limited number of mammalian protein-encoding genes (Lander et al. 2001; Waterston et al. 2002; Mattick, 2003).

Complexity can be created at multiple levels. For instance, the presence of alternative splice forms may create functional diversity in protein isoforms presenting different domain combinations.

At the same time, transcriptional diversity can be augmented by use of alternative gene starts and thereby potentially different regulatory mechanisms for different forms of the gene. A large diversity of 5′ and 3′ ends may also represent the structural basis for the regulation of mRNA turnover, translation and subcellular localization.

Recent data that stem from technological advances in transcriptome analysis prove that mammalian cells have additional layers of complexity that cannot simply be explained by splicing (Carninci et al. 2005; Katayama et al. 2005).

First and foremost, a large portion of transcripts do not encode proteins, particularly in the organisms we consider the most complex (mammals, including primates), since the ratio of non-protein-encoding to protein-encoding RNA progressively increases from bacteria to primates. Even when applying the most conservative bioinformatic definition of non-coding RNA (ncRNA), thousands of spliced transcripts do not contain any significant CDS. Historically, non-coding transcripts were interpreted as cloning artifacts or truncated molecules. In recent years, a multitude of evidence to the contrary has been presented. Aside from in-depth validations of specific cases, we can now identify the same non-coding RNAs in multiple cDNA libraries, and the promoters of such transcripts are generally more conserved over evolution than the promoters of protein-encoding RNAs. These new transcripts join the well-established members of the ncRNA world, such as rRNAs and tRNAs, together with other newcomers such as small nucleolar RNAs (snoRNAs) and microRNAs.

Interestingly, the transcriptome seems to double its size when considering the polyA− fraction of RNA (i.e. the set of transcripts lacking a polyA tail). Furthermore, many transcripts seem to be localized only in the nucleus (Cheng et al. 2005). The biological role of these RNAs is still unclear, although some of them may be involved in the rearrangement of chromatin status.

Last but not least, it has been observed that a marked increase in transcriptional complexity is correlated with the organization of the transcriptional units in the genome. There are genomic regions that are highly enriched for transcripts (transcriptional forests, TFs), which may be subject to shared epigenetic regulatory control over larger regions; conversely, regions exist that are devoid of transcripts (transcriptional deserts, TDs). Most importantly, the reciprocal organization of genes suggests the existence of clusters of sense–antisense pairs whose regulation seems to be coordinated (Katayama et al. 2005).

Here, we review our current understanding of the transcriptome structure. In this context, the study of gene expression in the nervous system is particularly challenging. The extreme cellular heterogeneity has significantly impaired expression profile experiments. Furthermore, the complexity of the neural networks in the central nervous system is grossly underestimated (Masland & Raviola, 2000). The systematic analysis of neuronal diversity with techniques such as the Golgi method has proved the existence of a large variety of morphological types of neurons: as many as 100 different types of interneurons may be present in each layer of the neocortex. Initial works on gene expression profiles of identified neuronal cell types have been used to direct electrophysiological experiments or classify neuronal types (Gustincich et al. 2004; Sugino et al. 2006). However, this approach is still in its infancy.

As a first analysis of CAGE data in brain libraries, we present here the CAGE tags distribution in representative areas of the central nervous system.

Innovative approaches to describe the mammalian transcriptome

The completion of the sequencing of many genomes has provided the most fundamental information needed for the analysis of the transcriptome and corresponding regulatory sequences. Gene discovery has historically relied on the combination of in silico predictions together with experimental evidence of transcription. Conservative estimates suggest that there are about 20 000 protein-encoding genes in the mammalian genome (Lander et al. 2001; Waterston et al. 2002).

In recent years different technological approaches have been used to determine the regions of the genome that are transcribed, unveiling unexpected additional complexities. The combination of (i) full-length cDNA cloning, (ii) CAGE (cap-analysis gene expression) and other libraries of short sequence tags and (iii) tiling arrays has modified our current view of the transcriptome. Here we present a brief review of each of these efforts and their results.

(i) Full-length cDNA cloning

The RIKEN Mouse Gene Encyclopedia project is the largest and most comprehensive effort to isolate and sequence novel full-length mouse cDNAs (Okazaki et al. 2002; Carninci et al. 2003, 2005; Gustincich et al. 2003; Hayashizaki & Carninci, 2006). The combination of full-length cDNA enrichment, subtraction, sequencing and bioinformatic analysis culminated in the FANTOM3 cDNA clone collection, composed of about 103 000 entirely sequenced cDNA clones synthesized from most mouse tissues in a total of 237 libraries. These data were then integrated with all the available mouse mRNA data from the public sequence databases (56 000). During this work, new definitions aimed at describing the transcriptome were introduced. These are purely computational and unequivocal, which is essential for the study of the entities of the transcriptome.

A transcriptional unit (TU) identifies a group of mRNAs whose exons overlap by at least one nucleotide in the genome and have the same orientation. Similarly, a transcriptional framework (TK) is composed of mRNAs that share common expressed regions, splicing events, transcription start sites (TSSs) or transcription termination sites (TTSs). Thus, TKs are contained within TUs, and mRNAs within a TK are expected to be functionally related.
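The TU definition above lends itself to a short computational sketch. The Python below is only an illustration, not the FANTOM3 pipeline: the input layout (dictionaries with `id`, `chrom`, `strand` and half-open `exons` intervals) and the function name `group_into_tus` are assumptions introduced here. Transcripts are linked whenever any of their exons share at least one nucleotide on the same strand, and linked transcripts are merged transitively.

```python
from collections import defaultdict


def group_into_tus(transcripts):
    """Group transcripts into transcriptional units (TUs).

    Each transcript is a dict with keys 'id', 'chrom', 'strand' and
    'exons' (a list of half-open (start, end) intervals). This input
    layout is hypothetical and chosen only for the example.
    """
    parent = {t["id"]: t["id"] for t in transcripts}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Collect exons per chromosome and strand: only same-strand exons can link.
    exons_by_key = defaultdict(list)
    for t in transcripts:
        for start, end in t["exons"]:
            exons_by_key[(t["chrom"], t["strand"])].append((start, end, t["id"]))

    # Sweep over sorted exons; an exon starting before the current merged
    # interval ends shares at least one nucleotide with an earlier exon.
    for exons in exons_by_key.values():
        exons.sort()
        cur_end, cur_id = None, None
        for start, end, tid in exons:
            if cur_end is not None and start < cur_end:
                union(cur_id, tid)
                cur_end = max(cur_end, end)
            else:
                cur_end, cur_id = end, tid

    tus = defaultdict(list)
    for t in transcripts:
        tus[find(t["id"])].append(t["id"])
    return sorted(sorted(members) for members in tus.values())


example = [
    {"id": "a", "chrom": "chr1", "strand": "+", "exons": [(100, 200), (300, 400)]},
    {"id": "b", "chrom": "chr1", "strand": "+", "exons": [(350, 450)]},
    {"id": "c", "chrom": "chr1", "strand": "-", "exons": [(120, 180)]},
    {"id": "d", "chrom": "chr1", "strand": "+", "exons": [(200, 250)]},
]
print(group_into_tus(example))  # [['a', 'b'], ['c'], ['d']]
```

In this toy example, transcript `d` lies in an intron of `a` and therefore forms its own TU, and `c` is kept separate because it is on the opposite strand.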

By the analysis of the cDNA clones and tags (such as CAGE tags) among the RIKEN libraries, at least 181 000 independent transcripts have been identified in the transcriptome. Although this estimate is conservative because multiple evidence for both TSS and TTS were required, the number of distinct transcripts is at least one order of magnitude larger than the estimated 22 000 mammalian protein-encoding genes.

(ii) CAGE

Although TSSs can be inferred from the 5′ end of full-length cDNAs and 5′-ESTs (expressed sequence tags), the depth of coverage was limited and a more systematic approach was needed. Tag-based libraries allowed cloning and high-throughput sequencing of multiple concatenated short fragments derived from starts and/or ends of cDNAs. These libraries provided higher resolution of the borders of transcripts, including a description of TSSs and TTSs. Using the CAGE technique, a systematic approach to 5′ end analysis was undertaken (Carninci et al. 2006).

In this technique, after selection of a capped, full-length cDNA, a fragment of 20–21 bp at the 5′ end was isolated by using the class IIS restriction enzyme MmeI. The enrichment over non-capped molecules has been calculated to be about 330-fold. These fragments were then concatenated by ligation and cloned into a plasmid vector for large scale sequencing. Up to 15 of the 20 bp long CAGE tags/clone were ligated and up to 1 million tags were cost-effectively sequenced in a single experiment (Shiraki et al. 2003; Harbers & Carninci, 2005; Kodzius et al. 2006). CAGE tag sequences were then mapped with BLAST against the genome. About 60% of tags could be mapped confidently to unique positions. Tags with ambiguous mappings often corresponded to transcribed repeats. TSSs were thus identified by mapping CAGE tags onto unique genomic regions.

As when assessing cDNA mappings, new terminology had to be introduced to analyse CAGE-defined TSSs. We define a CAGE-tag starting site (CTSS) as a genomic position which serves as the 5′ edge for one or several CAGE tags. Tag clusters (TCs) are then produced by grouping overlapping tags on the same strand. TCs are defined by a start and end position, a count and a distribution of these tags. Figure 1A shows an example of TC distribution for calcium–calmodulin-dependent protein kinase II, delta (Camk2d), a gene expressed in visual cortex.
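The CTSS and TC definitions can be made concrete with a minimal sketch. It assumes that mapped tags are available as `(chrom, strand, start, end)` tuples with half-open coordinates; this representation, the function name `build_tag_clusters` and the dictionary layout of the result are illustrative choices, not the format used by the FANTOM3 resources.

```python
from collections import Counter, defaultdict


def build_tag_clusters(tags):
    """Group overlapping same-strand CAGE tags into tag clusters (TCs).

    `tags` is an iterable of (chrom, strand, start, end) tuples with
    half-open coordinates (a hypothetical layout for this sketch).
    Each TC records its start, end, tag count and the distribution of
    CTSS positions (5' edges) among its tags.
    """
    by_strand = defaultdict(list)
    for chrom, strand, start, end in tags:
        by_strand[(chrom, strand)].append((start, end))

    clusters = []
    for (chrom, strand), spans in sorted(by_strand.items()):
        spans.sort()
        current = None
        for start, end in spans:
            # The 5' edge is the leftmost base on '+', the rightmost on '-'.
            five_prime = start if strand == "+" else end - 1
            if current is not None and start < current["end"]:
                current["end"] = max(current["end"], end)
            else:
                current = {
                    "chrom": chrom, "strand": strand,
                    "start": start, "end": end,
                    "count": 0, "ctss": Counter(),
                }
                clusters.append(current)
            current["count"] += 1
            current["ctss"][five_prime] += 1
    return clusters


tags = [
    ("chr1", "+", 100, 120), ("chr1", "+", 100, 120), ("chr1", "+", 105, 125),
    ("chr1", "+", 500, 520), ("chr1", "-", 100, 120),
]
for tc in build_tag_clusters(tags):
    print(tc["strand"], tc["start"], tc["end"], tc["count"], dict(tc["ctss"]))
```

The first three tags collapse into one TC with two distinct CTSSs, while the distant tag and the opposite-strand tag each form their own cluster.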

Figure 1
Example of integration of CAGE and cDNA information

A total of 729 504 mouse CTSS and 593 290 TCs have been identified by sequencing 145 CAGE libraries. From independent libraries 159 075 mouse TCs were defined by two or more CAGE tags, and at least 236 000 of them were defined when one includes RIKEN ESTs and full-length cDNAs. Several lines of evidence proved that CAGE identifies genuine transcription start sites, including: statistical analysis of reproducibility, experimental validation by primer extension, correlation with published sites of TATA-binding protein-associated binding protein 1 identified by chromatin immunoprecipitation (ChIP) as well as conservation of start site architecture between orthologous mouse and human genes. Importantly, the CAGE TSS data also constitute a quantitative profiling of relative promoter use across many tissues and cell types, which in the near future will allow linking gene expression with controlling promoter elements, thus leading to the deciphering of transcriptional networks (Nilsson et al. 2006).

(iii) Tiling arrays

The development of whole genome tiling arrays (Kapranov et al. 2002, 2005; Kampa et al. 2004; Bertone et al. 2004; Johnson et al. 2005; Cheng et al. 2005) has provided a new perspective on the number and extent of transcripts. This technique is designed to assay transcription at regular intervals of the genome.

A tiling chip experiment examines whether RNA is transcribed, as detected using oligonucleotide probes spaced on average every 5–35 bp along the non-repeated part of the genome sequence. Unlike standard microarrays, this analysis is independent of previous annotations, allowing the identification of novel transcribed regions. Tiling arrays can also determine the internal structure of transcripts and so define the presence of introns and exons. The array-detected regions of transcription are called transcriptionally active regions (TARs) or transcribed fragments (transfrags).

Tiling array analysis of human chromosomes 21 and 22 has suggested that the number of detectable transcribed exons expressed in at least 1 of the 11 cell lines tested was 10-fold larger than the number of currently annotated exons. Of the tiling probes, 26.5% were positive in at least one cell line, although only 2.6% of the probes were located within well-annotated protein-encoding genes. Furthermore, most of the novel transcripts were detected in only 1 of the 11 cell lines, suggesting that transcription is frequently cell- or condition-restricted (Kampa et al. 2004).

As tiling arrays are less accurate at detecting precise transcript edges (because probes only partially overlap edges), an analysis using both tiling arrays and tag-based sequencing approaches such as CAGE may be beneficial. In a pilot study, the comparison of tiling array data with CAGE tag libraries from human HepG2 cells has shown that: (i) 74% and 92% of the HepG2 CAGE tags map within 100 and 1000 bp, respectively, of transfrags, suggesting a remarkable correspondence between the two approaches; (ii) many of these transcripts were present at low levels, because 41 974 of the 65 501 TSSs that mapped within 1 kb upstream of known mRNAs or 5′-ESTs were present as single tags.

By measuring the length of genomic sequences contained within full length cDNA clones boundaries and between 5′ TSS and 3′ TTS identified by sequence tags, it has been estimated that at least 62.5% of the mouse genome is transcribed, representing 18 461 TFs. A part of these TFs has been analysed as interacting ‘chains’ of transcripts, which show conservation between mouse and human and are candidates for sharing common transcriptional control mechanisms (Engstrom et al. 2006).

Therefore cDNA cloning, tag sequencing and tiling data agree that a substantially larger extent of the genome is transcribed than commonly thought. Furthermore, they suggest that the majority of transcriptional events in a cell are rare and that transcription is frequently cell- or condition-restricted. This also means that although the majority of the genome can be transcribed, no single cell will transcribe more than a fraction of the total transcriptome.

Transcripts diversity and alternative splicing

In the mouse, numerous (78 393) different splicing variants have been identified, such that 65% of TUs contain multiple splice forms (Zavolan et al. 2003; Carninci et al. 2005). On average, a defined 3′ end has 1.32 start sites, while on average 1.83 3′ ends were identified for each 5′ end. Five-prime end tagging analysis suggested the existence in the mouse of 236 000 different TSSs and 153 000 TTSs. Considering that only 44 000 transcriptional units were detected within the above mouse transcriptome analysis, many TUs have large TSS-covered regions and TTS variability (Carninci et al. 2005). Accordingly, a comprehensive long-SAGE analysis of the mouse identified 3.3 alternative 3′ ends per locus (Siddiqui et al. 2005).

Alternative TSSs are important for driving the expression of different mRNA isoforms (Carninci et al. 2006) in different contexts. When tag-derived TSSs are clustered into 70 supergroups based on their CAGE-determined level of expression, different TSSs belonging to the same TU almost invariably fall into different expression supergroups, suggesting that they are subject to different control logic. Furthermore, 58% of protein-encoding transcriptional units had two or more alternative promoters based on the presence of non-overlapping tag clusters.

Since 27 motif families with six or more nucleotides were statistically overrepresented within 120 bp of the polyadenylation site of individual transcripts, variation in the 3′ end may suggest differential control of RNA localization and stability (Carninci et al. 2006).

Furthermore, there is a total of 32 129 protein-encoding TKs on the genome, of which 19 197 have only a single protein splice form and 2525 have an alternative non-coding splice variant. Of the 12 932 TKs that show variation in splicing, 8365 had a different SCOP (structural classification of proteins) domain composition depending on splice form (Carninci et al. 2005).

Protein isoforms may also be targeted to multiple subcellular locations. The differential use of endoplasmic reticulum signal peptides and transmembrane domains is common: 573 TUs present variations in their use of signal peptides, 1527 in their use of transmembrane domains and 615 generated protein isoforms from distinct membrane organization classes (Davis et al. 2006).

Thus, splicing controls domain content and subcellular localization.

ncRNA gene discovery

ncRNA in the FANTOM3 collection

Relatively few entirely novel protein-encoding genes have been discovered either by cDNA cloning or by validation of tiling array signals. In the mouse, 34 030 FANTOM3 cDNAs lack any protein-encoding sequence and are annotated as non-protein-coding RNA. Additional support from CAGE tags or ESTs is available for 3652 ncRNAs. The ncRNA world contains the majority of the novel TUs (7183), and so far non-coding TUs (23 218) outnumber protein-encoding TUs (20 929) (Carninci et al. 2005). About 63% of the mouse ncRNAs are spliced and their expression has been proven (Carninci et al. 2005), which suggests that they are not derived from genomic contamination during cloning procedures (Ravasi et al. 2006).

Analysis of 2680 full-length non-coding RNAs shows that, although cross species conservation of the transcribed region is weak (Pang et al. 2006), the conservation of their promoters is much higher and extends further than in coding mRNAs (5 kb versus 500 bp) (Carninci et al. 2005).

Interestingly, many ncRNAs appear to start from initiation sites in 3′ untranslated regions of protein-encoding loci (Babak et al. 2005; Pang et al. 2006).

Non-polyadenylated and nuclear RNA

Using high-density tiling arrays for 10 human chromosomes, Cheng and colleagues extended their analysis to cell compartments, using polyA+ and polyA− fractions of both cytoplasmic and nuclear RNA from HepG2 cells, revealing a new transcriptional world (Cheng et al. 2005). Of all transcribed sequences, they found that 43.7% were non-polyadenylated while 36.9% were bimorphic (present in both polyadenylated and non-polyadenylated forms). These transcripts would have been missed by procedures that use polyA for transcript purification. Furthermore, 41.7% of all of the RNA transcripts were confined to the nucleus.

The total amount of detectable RNAs, including novel nuclear and cytoplasmic transcripts, exceeded by one order of magnitude the whole fraction of protein-encoding annotated exons. Interestingly, about 25% of the nuclear-restricted polyA fraction was defined as intergenic with respect to known protein-encoding genes, while a large part was enriched in intronic sequences (57%).

The presence of intronic sequences can be expected as a by-product of splicing. However, intronic sequences may have additional functions, such as splicing regulation, and this could explain the slow degradation and the selective export of some introns in the cytoplasmic fraction (Mattick, 2003; Cheng et al. 2005). Furthermore, there are some well known ncRNAs that never leave the nucleus, such as Xist and Air (Mattick, 2004, 2005).

A review dedicated to the role of ncRNA in the brain is included in this issue (Mehler & Mattick, 2006).

Genomic organization of transcriptional units

Transcription start sites within transcriptional units

The analysis of CAGE tag distribution within the full-length cDNA sequences unveiled unexpected patterns: 13 767 of the 20 639 mouse protein-encoding transcriptional units (67%) were supported by tags within 20 nucleotides of the reported 5′ end of a full-length cDNA (Carninci et al. 2005, 2006).

However, there was a significant increase in CAGE tag incidence in the 3′ UTR of some protein-encoding transcripts. These TSSs had a distinct GGG sequence motif coinciding with the TSS. There was a strikingly conserved region located at position +40 to +90 relative to the TSS. Interestingly, when 3′ UTR transcription was detected in tail-to-tail mapping pairs of transcripts, downstream genes on the opposite strand were located much closer than expected. In such cases, transcripts initiated in the 3′ UTR might regulate downstream genes using a sense–antisense (S–AS) mechanism or might protect the transcript from being regulated through an S–AS mechanism by RNAs coming from the gene on the opposite strand.

A third set of 34 229 CAGE tags mapped into genomic regions that contain exons on the same strand. These TSSs may generate transcripts that either are truncated or have no protein product. This type of TSS was enriched in genes transcribed by promoters having TATA-boxes and sharp starting sites in the major 5′ promoter (i.e. the 5′ end of the known gene). Conversely, they were less frequent in genes that contain a major promoter with broadly distributed TSSs and CpG islands (see below for a description of different promoter types; CpG islands are regions enriched for the CG dinucleotides). Although the function of these TSSs is unclear, this difference may reflect different types of chromatin regulation during the transcription from these two types of promoter.

Global sense–antisense (S–AS) transcription

Detection of potential S–AS pairs has increased the extent of the transcriptome. While 15–20% of protein-encoding genes have AS transcription (Okazaki et al. 2002; Kiyosawa et al. 2003; Yelin et al. 2003; Werner & Berdal, 2005), SAGE indicates that at least 50% of all transcripts have a corresponding antisense transcript in the brain alone (Siddiqui et al. 2005), and CAGE data show that up to 72% of the TUs exhibit S–AS transcription. However, the expression level of antisense ncRNAs may still be underestimated. Although most mouse CAGE libraries were oligo-dT primed, a small set of random-primed CAGE libraries indicates even larger antisense transcriptional levels (Katayama et al. 2005), in agreement with observations that antisense transcripts are poorly polyadenylated (Kiyosawa et al. 2005).

S–AS pairs are often differentially expressed across tissues and conditions, are particularly overrepresented in imprinted genes (>85% of imprinted TUs) and show a preference for certain types of genes as grouped by GO terms (e.g. extracellular proteins show reduced S–AS transcription), suggesting that they are specifically regulated and unlikely to represent transcriptional noise (Katayama et al. 2005).

Although various studies (Lehner et al. 2002; Yelin et al. 2003; Chen et al. 2004, 2005) suggest that the most common S–AS transcripts overlap as tail-to-tail (or convergent), CAGE tags suggest that head-to-head overlap (or divergent, including promoter regions) is more frequent than converging antisense (Katayama et al. 2005).

S–AS transcripts regulate each other's expression, both negatively and positively, in 30% of the cases where experimental validation by perturbation has been carried out. The existence of both negative and positive correlations of S–AS levels suggests that there are mechanisms that differ from the siRNA-like model of action.

5′-Ends and promoter structures

Using CAGE data, two main types of promoters were identified.

(i) Single dominant peak class (SP) promoters, where the majority of tags are concentrated to no more than four consecutive start positions, giving a single dominant TSS. This class of promoters is generally associated with TATA-boxes, and it corresponds to the text-book view of promoters. Importantly, only a minor fraction (< 25%) of promoters belong to this class. These promoters are mostly enriched in tissue-specific transcripts (Carninci et al. 2006). Figure 2A shows an example of SP promoter of a TC associated with expression in the cerebellum (calbindin; Calb1).

Figure 2
Differential promoter usage in brain tissues

(ii) General broad distribution (BR) promoters have broad distribution of TSSs generally spread over 100 nt (Carninci et al. 2006). They are strongly associated with CpG islands and are GC rich. In 90% of cases, TATA-independent transcription occurred within a CpG island. They show an overrepresentation of Sp1 sites. Figure 2B shows an example of BR promoter of a TC associated with expression in cerebellum (Pleckstrin homology domain-containing, family B (evectins) member 2; Plekhb2).

Other promoter architectures occur frequently, although they can be described as subtypes of the broad class or hybrids between broad and sharp class (Carninci et al. 2006).
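The distinction between the two main classes can be illustrated with a deliberately simplified sketch. It applies only the criterion stated above for SP promoters (the majority of tags falling within no more than four consecutive start positions) and labels everything else BR. The threshold of more than 50% for "majority", the function name and the input format (a mapping from CTSS position to tag count for one TC) are assumptions made for this example; the published classification uses additional classes and criteria that are not reproduced here.

```python
def classify_promoter_shape(ctss_counts, window=4, majority=0.5):
    """Label a tag cluster as 'SP' (single dominant peak) or 'BR' (broad).

    `ctss_counts` maps genomic position -> number of CAGE tags starting
    there. A cluster is called SP when more than `majority` of its tags
    start within `window` consecutive positions. Both parameters are
    illustrative assumptions.
    """
    total = sum(ctss_counts.values())
    if total == 0:
        raise ValueError("tag cluster contains no tags")

    # It is enough to test windows that begin at an occupied position:
    # any window can be shifted right to its first occupied position
    # without losing tags.
    best = 0
    for pos in ctss_counts:
        in_window = sum(ctss_counts.get(q, 0) for q in range(pos, pos + window))
        best = max(best, in_window)

    return "SP" if best / total > majority else "BR"


sharp = {1000: 80, 1001: 10, 1040: 10}          # most tags at one position
broad = {pos: 1 for pos in range(0, 100, 5)}    # tags spread over ~100 nt
print(classify_promoter_shape(sharp), classify_promoter_shape(broad))
```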

Interestingly, the TATA-containing promoters had a lower nucleotide substitution rate than the other promoter types when comparing mouse and human promoters. This is probably due to specific sequence constraints in these promoters, in particular the spacing between the TATA box and the initiator, which was found to be a pyrimidine–purine (Py–Pu) dinucleotide (Carninci et al. 2006).

In contrast, CpG-enriched promoters generally have more substitutions. Since CpG promoters have many TSSs and thus many initiator Py–Pu dinucleotides, a single mutation would have a smaller effect; thus, the mutation of these sequences may have acted as a fine-tuning of promoter activity during evolution.

Unexpectedly, the CAGE data revealed that in some promoters, the transcription start sites in mouse and human are not at locations that are orthologous at the sequence level (Frith et al. 2006); in other words, evolutionary turnover of transcription start sites exists. This finding indicates that there is an unexpectedly large amount of plasticity in core promoters, as drastic changes of TSS locations can be tolerated while retaining function even in critical genes.

The CAGE TSS data constitute a quantitative profiling of relative promoter use across many tissues and cell types. A total of 159 075 TCs were hierarchically clustered according to their normalized expression values in parts per million. Ubiquitous transcripts were associated with a broad TSS and CpG islands, whereas tightly regulated transcripts were associated with sharper distinct TSSs and TATA-box promoters (Carninci et al. 2006).
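Normalization to parts per million can be sketched as follows. The example assumes that raw tag counts per TC are available for a single library as a dictionary; the function name and identifiers are hypothetical, and the subsequent hierarchical clustering step is not shown.

```python
def tags_per_million(tc_counts):
    """Convert raw tag counts per TC in one library to tags per million.

    `tc_counts` maps a TC identifier to its raw tag count in a single
    CAGE library (hypothetical identifiers). Dividing by the library
    total makes libraries of different sequencing depth comparable.
    """
    total = sum(tc_counts.values())
    if total == 0:
        raise ValueError("library contains no tags")
    return {tc: count * 1_000_000 / total for tc, count in tc_counts.items()}


# A library of 200 000 tags: a TC with 50 tags corresponds to 250 ppm.
library = {"tc1": 50, "tc2": 150, "other": 199_800}
print(tags_per_million(library)["tc1"])  # 250.0
```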

CAGE data analysis in the brain

It is important to realize that at present, the currently available transcriptome data (CAGE, cDNA, tiling arrays) have not been analysed in-depth, and many discoveries are probably hidden within the deluge of data. Here, we present an analysis of the transcriptome in brain which exemplifies the extent of novel discoveries that can be made with publicly available CAGE data.

Among the 145 mouse CAGE libraries, several are derived from specific brain regions. Here we analyse TC distributions in libraries from visual cortex, somatosensory cortex, cerebellum and liver as a control (Fig. 1B).

These datasets are comparable in terms of sequencing depth without prior normalizations, since there are about 200 000 tags for each tissue that are grouped into an average of almost 50 000 TC, mapping to 17 000 TUs. Visual cortex and somatosensory cortex were chosen because of the high complexity of their neuronal populations. It has been estimated that these cortical regions are comprised of as many as hundreds of different cell types. We chose cerebellum for representing a less complex area of the brain to allow for detection of TUs that are expressed at relatively low levels. Liver was chosen as a control due to the relatively low complexity in terms of cell types and the divergent function of liver and brain cells.

Transcripts diversity and alternative splicing

Considering visual cortex as a reference, we determined how many TUs are expressed in both visual cortex and each other tissue. These are defined as the number of TUs having at least one TC from visual cortex and at least one TC from the other tissue. Of the 48 386 TUs expressed in visual cortex, 8814 are co-expressed in somatosensory cortex, 8736 in cerebellum and 6749 in liver. Looking only at the 104 073 TCs having at least 10 tags, we assessed the number of tissue-specific TCs. When considering highly reliable tissue-specific TCs (P < 0.0001, one-sided Fisher's exact test), we found 2009 brain-specific TCs (1.9%) and 1330 liver-specific TCs (1.2%). This is far higher than expected by chance (10.4 cases = 0.0001 × 104 073), and shows a larger transcriptional complexity of brain tissues even for highly expressed genes.
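A one-sided Fisher's exact test for tissue specificity can be sketched with the standard library alone. The layout of the contingency table is an assumption, since it is not spelled out in the text: here a TC contributes `a` tags to the tissue of interest (library size `n1`) and `b` tags to the pooled comparison libraries (size `n2`), and the P value is the hypergeometric probability of observing at least `a` tags in the tissue of interest. The function names are introduced for illustration only.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_one_sided(a, n1, b, n2):
    """One-sided Fisher's exact test for enrichment in library 1.

    a, b   : tags of one TC in library 1 and in the comparison set
    n1, n2 : total tags in library 1 and in the comparison set
    Returns P(X >= a) under the hypergeometric null that the a + b tags
    are distributed between the libraries in proportion to their sizes.
    """
    tc_total = a + b
    log_denom = _log_comb(n1 + n2, tc_total)
    p = 0.0
    for k in range(a, min(tc_total, n1) + 1):
        if tc_total - k > n2:
            continue  # impossible table
        p += exp(_log_comb(n1, k) + _log_comb(n2, tc_total - k) - log_denom)
    return min(p, 1.0)


def expected_by_chance(alpha, n_tests):
    """Number of TCs expected to pass the threshold under the null."""
    return alpha * n_tests


print(fisher_one_sided(30, 200_000, 2, 600_000) < 0.0001)
print(round(expected_by_chance(0.0001, 104_073), 1))  # 10.4
```

The second line of output reproduces the expectation quoted in the text: with a threshold of 0.0001 and 104 073 TCs tested, about 10.4 TCs would pass by chance alone.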

Furthermore, using the same thresholds as above, we demonstrated that a large portion of TCs is also specific for different brain regions. While liver has 1330 specifically expressed TCs, cerebellum has 1061, visual cortex 558 and somatosensory cortex 449. Therefore cerebellum expression is almost as different from visual cortex and somatosensory cortex as from liver. Overall, the number of brain-specific TUs is not as high as expected, although much higher than the number expected by chance. This might be because many TUs in the brain are expressed at very low levels or are present only in subsets of cell populations and are thus excluded from the analysis.

We then investigated whether TUs might have one TC up-regulated in one tissue and another one down-regulated in another tissue, using the digital expression approach (P < 0.01) (Audic & Claverie, 1997). First, we chose TUs with at least two highly used promoters with robust expression in two CAGE libraries. Their numbers were, respectively, 374 (visual cortex versus somatosensory cortex), 446 (visual cortex versus cerebellum) and 249 (visual cortex versus liver). Then, among these we identified promoters that are differentially used in different tissues: 8 (4 TUs) for visual cortex versus somatosensory cortex, 105 for visual cortex versus cerebellum and 111 for visual cortex versus liver, representing 1%, 11% and 20%, respectively, of analysed TUs.
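The digital expression test of Audic & Claverie (1997) gives the probability of observing y tags in a second library of size N2 given x tags in a first library of size N1:

p(y | x) = (N2/N1)^y (x + y)! / [x! y! (1 + N2/N1)^(x + y + 1)]

The sketch below implements this distribution and a one-sided P value taken as the smaller of the two cumulative tails. That tail convention, the function names and the example counts are assumptions made for illustration; the exact settings used for the analysis above are not stated.

```python
from math import exp, lgamma, log


def ac_log_pmf(y, x, n1, n2):
    """Log of the Audic-Claverie probability p(y | x).

    x, y   : tag counts of the same TC in libraries 1 and 2
    n1, n2 : total tag counts of libraries 1 and 2
    """
    ratio = n2 / n1
    return (y * log(ratio)
            + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
            - (x + y + 1) * log(1 + ratio))


def ac_p_value(x, y, n1, n2):
    """One-sided P value: the smaller of P(Y <= y | x) and P(Y >= y | x)."""
    pmf_y = exp(ac_log_pmf(y, x, n1, n2))
    lower = sum(exp(ac_log_pmf(k, x, n1, n2)) for k in range(y + 1))
    upper = 1.0 - (lower - pmf_y)
    return min(lower, upper, 1.0)


# Hypothetical counts: a TC with 40 tags in one library and 12 in another
# library of equal depth.
p = ac_p_value(40, 12, 200_000, 200_000)
print(p < 0.01)
```

Applied per TC to each pair of libraries, a TU would be flagged when one of its TCs is significantly higher in the first tissue while another is significantly higher in the second.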

In approximately half of the cases, this represents variations at the level of the first exon. Other examples show new features discovered by the FANTOM3 consortium, such as transcription initiation in the 3′ UTR or in internal exons. Some false positives arose from the computational definition of TUs, which may bridge two genes sharing a few base pairs, or borrow promoters from intronic repeats. Lastly, differential expression was detected in a very active area that defies our concept of TUs. The development of deeply sequenced CAGE libraries (on the order of a million tags) will give the statistical power necessary to clarify the expression dynamics of these exotic TCs. In Fig. 2C and D the vesicular glutamate transporter 1 represents an example of differential promoter usage in visual cortex (Fig. 2C) and cerebellum (Fig. 2D). The list of the differentially used TCs is available as Supplementary material, which provides links to visualize them in the FANTOM3 genomic viewer.

Global sense–antisense (S–AS) transcription

As an example of S–AS pairs that map to the same genomic locus and are present in brain-related libraries, we identified Gnb5, the beta subunit of the heterotrimeric guanine nucleotide-binding proteins (Fig. 3). Alternative spliced variants encoding different isoforms exist and this variation is conserved between human and mouse. In FANTOM3 mouse CAGE samples, the promoter of the short isoforms is dominant in several brain samples.

Figure 3
Examples of S–AS expression in the nervous system

Antisense transcripts of mouse Gnb5 were isolated from RIKEN full-length cDNA libraries made from olfactory epithelia RNAs. The longest ORF of this antisense transcript encodes 77 amino acids, and the transcript was classified as ncRNA. This AS transcript tends to be co-expressed in several brain samples. Most importantly, the expression levels of S and AS are independently regulated.

Transcription start sites within transcriptional units

As described above, not all CAGE tags map to known 5′ transcript edges. We analysed the distribution of all CAGE tags within the TU boundaries for the different tissues, with special emphasis on the S–AS content. Figure 4 shows the distribution of CAGE tags into first exons (Fig. 4A), inner exons (Fig. 4B) and last exons (Fig. 4C) for S and AS in the different regions. Although no large differences are observed, it may be of note that cerebellum has the highest level of TSSs in the last exon. This also holds when cerebellum is compared with all other tissues analysed, as observed in Carninci et al. (2006).
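The binning of CAGE tags by exon position and orientation can be illustrated with a short sketch. The data layout here (exons as inclusive coordinate pairs, strands as "+" or "-", the `classify_tag` helper and the example coordinates) is assumed for illustration only and is not the representation used in the FANTOM3 analysis.

```python
from collections import Counter


def classify_tag(tag_pos, tag_strand, exons, tu_strand):
    """Return (orientation, region) for one CAGE tag.

    orientation is "S" if the tag lies on the TU strand, else "AS".
    region is "first_exon", "inner_exon", "last_exon" or "intron".
    A single-exon TU reports its only exon as "first_exon".
    """
    ordered = sorted(exons)
    if tu_strand == "-":
        ordered.reverse()  # put exons in order of transcription
    orientation = "S" if tag_strand == tu_strand else "AS"
    for index, (start, end) in enumerate(ordered):
        if start <= tag_pos <= end:
            if index == 0:
                return orientation, "first_exon"
            if index == len(ordered) - 1:
                return orientation, "last_exon"
            return orientation, "inner_exon"
    return orientation, "intron"


# Hypothetical TU on the plus strand with three exons.
exons = [(100, 200), (300, 400), (500, 600)]
tags = [(150, "+"), (350, "+"), (550, "-"), (560, "+"), (250, "+")]
counts = Counter(classify_tag(pos, strand, exons, "+") for pos, strand in tags)
for key, value in sorted(counts.items()):
    print(key, value)
```

Dividing each count by the total number of tags in a library would give per-tissue fractions of the kind compared across panels of a figure.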

Figure 4
In-depth analysis of TSS properties in brain tissues compared to liver

5′ Ends and promoter structures

We then compared, for each tissue, the transcription start site distribution of the tissue-specific TCs (P < 0.01, one-sided Fisher's exact test) with the remaining TCs expressed in the same tissue. We also identified those promoters specific for the pooled brain tissues. We then analysed the occurrence of TATA boxes and CpG islands around the promoters (Fig. 4D–F).
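A one-sided Fisher's exact test for tissue specificity can be sketched as below. The 2 × 2 layout (tags of one TC versus all other tags, in one tissue versus the remaining tissues) is an assumption about how such a test might be set up, and the function name and counts are hypothetical.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the table
        [[a, b],   a: TC tags in the tissue,    b: TC tags elsewhere
         [c, d]]   c: other tags in the tissue, d: other tags elsewhere
    A small value suggests the TC is enriched in the tissue."""
    row1 = a + b          # all tags of this TC
    col1 = a + c          # all tags of this tissue
    total = a + b + c + d
    log_denominator = log_comb(total, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += exp(log_comb(row1, k)
                 + log_comb(total - row1, col1 - k)
                 - log_denominator)
    return min(p, 1.0)


# Hypothetical TC: 40 tags in a 100 000-tag tissue library,
# 2 tags in 200 000 tags pooled from the other tissues.
p_value = fisher_one_sided(40, 2, 99_960, 199_998)
print(p_value < 0.01)
```

The sum runs only over the tags of a single TC, so the test stays cheap even for libraries of hundreds of thousands of tags.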

Interestingly, we found a pronounced difference among liver-specific, pooled-brain-specific and regional brain-specific promoters.

Data on liver-specific promoters confirmed the typical structure as presented in Carninci et al. (2006). These promoters have a larger fraction of TATA-box-containing promoters and correspondingly a lower CpG content. Furthermore, twice as many liver-specific promoters are of the SP type compared with non-specific liver and brain-specific TCs. Brain-specific TCs are surprisingly enriched in CpG islands (see Fig. 4E) when compared with other tissue-specific transcripts. Given the nature of the broad, CpG-enriched promoters, it is likely that brain-specific transcription is regulated at the epigenetic level much more frequently than tissue-specific expression in other organs.
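To make the two promoter features concrete, the sketch below scores a sequence for CpG-island character and scans for a TATA box upstream of a TSS. It assumes the commonly used thresholds of length at least 200 bp, GC fraction above 0.5 and observed/expected CpG above 0.6, and a simple TATAWAW consensus searched between −40 and −20. These criteria, the function names and the sequences are illustrative assumptions; the source does not state which definitions were applied, and real analyses typically use position weight matrices.

```python
import re


def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, GC and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and obs_exp > min_ratio


def has_tata_box(seq, tss_index, upstream=-40, downstream=-20):
    """Search for TATAWAW starting between `upstream` and `downstream`
    relative to the TSS (0-based index of the TSS in `seq`)."""
    start = max(0, tss_index + upstream)
    end = max(0, tss_index + downstream) + 7  # let a motif start at the edge
    return re.search(r"TATA[AT]A[AT]", seq[start:end].upper()) is not None


# Hypothetical sequences: a CpG-rich stretch and a promoter with a TATA box
# starting 30 bp upstream of a TSS at index 100.
cpg_rich = "CG" * 150
tata_promoter = "G" * 70 + "TATAAAA" + "G" * 23 + "A" + "G" * 20
print(looks_like_cpg_island(cpg_rich), has_tata_box(tata_promoter, 100))
```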

We also examined the conservation of promoters for each tissue and analysed net-alignments compared with those in the human genome (−300 to +50 of representative positions). Although we expected less conservation in brain-specific promoters, since they contain CpG islands that are fast evolving, brain-specific promoters had significantly higher sequence identity over the −300 to +50 promoter region compared with liver-specific promoters (P (liver versus cerebellum) = 0.0098; P (liver versus somatosensory cortex) = 1.32 × 10⁻⁷; P (liver versus visual cortex) = 2.41 × 10⁻⁸; Wilcoxon one-sided test), showing one additional exception: these promoters are not under positive evolutionary selection. It is unclear why the brain has different RNA expression regulatory mechanisms. We can speculate that the broad, CpG promoters may be advantageous for brain expression because they allow fine-tuning of transcription regulation (mutations in TATA-box promoters lead to an on–off situation), and that this type of regulation is beneficial to maintain low expression levels of brain-specific genes.
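A one-sided comparison of promoter sequence identities between two tissues can be sketched with a rank-sum test. This example assumes that the Wilcoxon test was applied to two independent samples (the rank-sum, or Mann-Whitney, form) and uses a normal approximation with tie and continuity corrections. The function name and the identity values are hypothetical, not data from the study.

```python
from math import erfc, sqrt


def rank_sum_one_sided(a, b):
    """P value for the alternative that values in `a` tend to be larger
    than values in `b` (Wilcoxon rank-sum, normal approximation)."""
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        return 1.0
    labelled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and labelled[j][0] == labelled[i][0]:
            j += 1
        size = j - i
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        in_a = sum(1 for k in range(i, j) if labelled[k][1] == 0)
        rank_sum_a += average_rank * in_a
        tie_term += size ** 3 - size
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0
    z = (u_a - mean_u - 0.5) / sqrt(var_u)  # continuity correction
    return 0.5 * erfc(z / sqrt(2.0))        # upper tail of the normal


# Hypothetical per-promoter identities over the -300 to +50 window.
brain = [0.80 + 0.01 * k for k in range(10)]
liver = [0.60 + 0.01 * k for k in range(10)]
print(rank_sum_one_sided(brain, liver) < 0.01)
```

A rank-based test is a natural choice here because sequence identities are bounded and need not be normally distributed.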

Conclusions

We show here that there are basic differences in the mechanisms that neurons use to control transcription. This highlights that there are key basic regulatory differences in different tissues and that lessons on transcription regulation learned by studying cell culture may not be transferred directly to brain cells. We can foresee that, with further development of the novel methodologies, more peculiarities of basic transcriptional control mechanisms in the brain will appear.

Regardless of tissue, novel technologies are now revealing an unprecedented number of transcripts that have escaped our observation for years. Experimental and statistical analyses show that these RNA transcripts are real and functional. We believe this is still the tip of the iceberg. There are many more dimensions of the transcriptome that await proper exploration, such as (i) the complete characterization of the nuclear and polyA fraction of the RNA population, (ii) the extensive repertoire of short RNA, including but not limited to miRNAs, and (iii) other RNA fractions which contain poorly characterized transcripts.

In the future, this analysis will be useful for the identification of the transcriptome of individual neurons in different physiological conditions. Further developments of the ‘one thousand dollar genome’ will provide a component of the technological platform to afford sequencing of many more genomes and correlated transcriptomes from the same individuals.

Acknowledgments

We are grateful to the other members of the RIKEN GERG group for preparation of primary data. We thank Mrs Harumi Uruma for secretarial assistance. This work was supported by a grant from the 6th EU framework for Neuro-Functional Genomics (NFG) to P.C. and S.G. and from the Presidential Research Grant for Intersystem Collaboration of RIKEN to P.C. S.G. was also supported by The Giovanni Armenise-Harvard Foundation.

Supplemental material

The online version of this paper can be accessed at doi: 10.1113/jphysiol.2006.115568 and contains supplemental material, available at http://jp.physoc.org/cgi/content/full/jphysiol.2006.115568/DC1/1 and http://jp.physoc.org/cgi/content/full/jphysiol.2006.115568/DC1/2.

TUs containing at least two differentially expressed TCs with opposite behaviours. Comparisons were made between visual cortex and cerebellum (Supplementary Table 1), or visual cortex and liver (Supplementary Table 2), using the digital expression method from Audic & Claverie (1997).

This material can also be found as part of the full-text HTML version available from http://www.blackwell-synergy.com

References

  • Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res. 1997;7:986–995. [PubMed]
  • Babak T, Blencowe BJ, Hughes TR. A systematic search for new mammalian noncoding RNAs indicates little conserved intergenic transcription. BMC Genomics. 2005;6:104. [PMC free article] [PubMed]
  • Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, et al. Global identification of human transcribed sequences with genome tiling arrays. Science. 2004;306:2242–2246. [PubMed]
  • Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, et al. FANTOM Consortium; RIKEN Genome Exploration Research Group and Genome Science Group (Genome Network Project Core Group). The transcriptional landscape of the mammalian genome. Science. 2005;309:1559–1563. [PubMed]
  • Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. 2006;38:626–635. [PubMed]
  • Carninci P, Waki K, Shiraki T, Konno H, Shibata K, Itoh M, et al. Targeting a complex transcriptome: the construction of the mouse full-length cDNA encyclopedia. Genome Res. 2003;13:1273–1289. [PMC free article] [PubMed]
  • Chen J, Sun M, Hurst LD, Carmichael GG, Rowley JD. Genome-wide analysis of coordinate expression and evolution of human cis-encoded sense-antisense transcripts. Trends Genet. 2005;21:326–329. [PubMed]
  • Chen J, Sun M, Kent WJ, Huang X, Xie H, Wang W, et al. Over 20% of human transcripts might form sense-antisense pairs. Nucl Acids Res. 2004;32:4812–4820. [PMC free article] [PubMed]
  • Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, et al. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science. 2005;308:1149–1154. [PubMed]
  • Davis MJ, Hanson KA, Clark F, Fink JL, Zhang F, Kasukawa T, et al. Differential use of signal peptides and membrane domains is a common occurrence in the protein output of transcriptional units. PLoS Genet. 2006;2:e46. [PMC free article] [PubMed]
  • Engstrom PG, Suzuki H, Ninomiya N, Akalin A, Sessa L, Lavorgna G, et al. Complex loci in human and mouse genomes. PLoS Genet. 2006;2:e47. [PMC free article] [PubMed]
  • Frith MC, Ponjavic J, Fredman D, Kai C, Kawai J, Carninci P, et al. Evolutionary turnover of mammalian transcription start sites. Genome Res. 2006;16:713–722. [PMC free article] [PubMed]
  • Gustincich S, Arakawa Y, Batalov S, Beisel KW, Bono H, Carninci P, et al. Analysis of the mouse transcriptome for genes involved in the function of the central nervous system. Genome Res. 2003;13:1395–1401. [PMC free article] [PubMed]
  • Gustincich S, Contini M, Gariboldi M, Puopolo M, Kadota K, Bono H, et al. Gene discovery in genetically labeled single neurons of the retina. Proc Natl Acad Sci U S A. 2004;101:5069–5074. [PMC free article] [PubMed]
  • Harbers M, Carninci P. Tag-based approaches for transcriptome research and genome annotation. Nat Meth. 2005;2:495–502. [PubMed]
  • Hayashizaki Y, Carninci P. Genome network and FANTOM3: assessing the complexity of the transcriptome. PLoS Genet. 2006;2:e63. [PMC free article] [PubMed]
  • Johnson JM, Edwards S, Shoemaker D, Schadt EE. Dark matter in the genome: evidence of widespread transcription detected by microarray tiling experiments. Trends Genet. 2005;21:93–102. [PubMed]
  • Kampa D, Cheng J, Kapranov P, Yamanaka M, Brubaker S, Cawley S, et al. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 2004;14:331–342. [PMC free article] [PubMed]
  • Kapranov P, Cawley SE, Drenkow J, Bekiranov S, Strausberg RL, Fodor SP, Gingeras TR. Large-scale transcriptional activity in chromosomes 21 and 22. Science. 2002;296:916–919. [PubMed]
  • Kapranov P, Drenkow J, Cheng J, Long J, Helt G, Dike S, Gingeras TR. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 2005;15:987–997. [PMC free article] [PubMed]
  • Katayama S, Tomaru Y, Kasukawa T, Waki K, Nakanishi M, Nakamura M, et al. RIKEN Genome Exploration Research Group; Genome Science Group (Genome Network Project Core Group); FANTOM Consortium. Antisense transcription in the mammalian transcriptome. Science. 2005;309:1564–1566. [PubMed]
  • Kim TH, Barrera LO, Zheng M, Qu C, Singer MA, Richmond TA, et al. A high-resolution map of active promoters in the human genome. Nature. 2005;436:876–880. [PMC free article] [PubMed]
  • Kiyosawa H, Mise N, Iwase S, Hayashizaki Y, Abe K. Disclosing hidden transcripts: mouse natural sense-antisense transcripts tend to be poly(A) negative and nuclear localized. Genome Res. 2005;15:463–474. [PMC free article] [PubMed]
  • Kiyosawa H, Yamanaka I, Osato N, Kondo S, Hayashizaki Y. RIKEN GER Group, GSL and Members. Antisense transcripts with FANTOM2 clone set and their implications for gene regulation. Genome Res. 2003;13:1324–1334. [PMC free article] [PubMed]
  • Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M, et al. CAGE: cap analysis of gene expression. Nat Meth. 2006;3:211–222. [PubMed]
  • Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921. [PubMed]
  • Lehner B, Williams G, Campbell RD, Sanderson CM. Antisense transcripts in the human genome. Trends Genet. 2002;18:63–65. [PubMed]
  • Masland RH, Raviola E. Confronting complexity: strategies for understanding the microcircuitry of the retina. Annu Rev Neurosci. 2000;23:249–284. [PubMed]
  • Mattick JS. Challenging the dogma: the hidden layer of non-protein-coding RNAs in complex organisms. Bioessays. 2003;25:930–939. [PubMed]
  • Mattick JS. RNA regulation: a new genetics? Nat Rev Genet. 2004;5:316–323. [PubMed]
  • Mattick JS. The functional genomics of noncoding RNA. Science. 2005;309:1527–1528. [PubMed]
  • Mehler MF, Mattick JS. Non-coding RNA in the nervous system. J Physiol. 2006 [PMC free article] [PubMed]
  • Nilsson R, Bajic VB, Suzuki H, di Bernardo D, Bjorkegren J, Katayama S, et al. Transcriptional network dynamics in macrophage activation. Genomics. 2006;88:133–142. [PubMed]
  • Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, et al. FANTOM Consortium; RIKEN Genome Exploration Research Group Phase I & II Team. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature. 2002;420:563–573. [PubMed]
  • Pang KC, Frith MC, Mattick JS. Rapid evolution of noncoding RNAs: lack of conservation does not mean lack of function. Trends Genet. 2006;22:1–5. [PubMed]
  • Ravasi T, Suzuki H, Pang KC, Katayama S, Furuno M, Okunishi R, et al. Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res. 2006;16:11–19. [PMC free article] [PubMed]
  • Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A. 2003;100:15776–15781. [PMC free article] [PubMed]
  • Siddiqui AS, Khattra J, Delaney AD, Zhao Y, Astell C, Asano J, et al. A mouse atlas of gene expression: large-scale digital gene-expression profiles from precisely defined developing C57BL/6J mouse tissues and cells. Proc Natl Acad Sci U S A. 2005;102:18485–18490. [PMC free article] [PubMed]
  • Sugino K, Hempel CM, Miller MN, Hattox AM, Shapiro P, Wu C, et al. Molecular taxonomy of major neuronal classes in the adult mouse forebrain. Nat Neurosci. 2006;9:99–107. [PubMed]
  • Tropea D, Kreiman G, Lyckman A, Mukherjee S, Yu H, Horng S, Sur M. Gene expression changes and molecular pathways mediating activity-dependent plasticity in visual cortex. Nat Neurosci. 2006;9:660–668. [PubMed]
  • Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, et al. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. [PubMed]
  • Werner A, Berdal A. Natural antisense transcripts: sound or silence? Physiol Genomics. 2005;23:125–131. [PubMed]
  • Yelin R, Dahary D, Sorek R, Levanon EY, Goldstein O, Shoshan A, et al. Widespread occurrence of antisense transcription in the human genome. Nat Biotechnol. 2003;21:379–386. [PubMed]
  • Zavolan M, Kondo S, Schonbach C, Adachi J, Hume DA, Hayashizaki Y, Gaasterland T. RIKEN GER Group; GSL Members. Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 2003;13:1290–1300. [PMC free article] [PubMed]

Articles from The Journal of Physiology are provided here courtesy of The Physiological Society