Plant Physiol. 2009 Mar; 149(3): 1316–1324.
PMCID: PMC2649393

Highly Diversified Molecular Evolution of Downstream Transcription Start Sites in Rice and Arabidopsis1,[W][OA]


Alternative usage of transcription start sites (TSSs) is one of the key mechanisms to generate gene variation in eukaryotes. Here, we show diversified molecular evolution of TSSs in remotely related flowering plants, rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana), by comprehensive analyses of large collections of full-length cDNAs and genome sequences. We determined 45,917 representative TSSs within 23,445 loci of rice and 35,313 TSSs within 16,964 loci of Arabidopsis, about two TSSs per locus in either species. The nucleotide features around TSSs displayed distinct patterns when the most upstream TSSs were compared with downstream TSSs. We found that CG-skew and AT-skew were clearly different between upstream and downstream TSSs, and that this difference was commonly observed in rice and Arabidopsis. Relative entropy analysis revealed that the most upstream TSSs had retained canonical cis elements, whereas downstream TSSs showed atypical nucleotide features. Expression patterns were distinguishable between upstream and downstream TSSs. These results indicate that plant TSSs were generally diversified in downstream regions, resulting in the development of new gene expression patterns. Furthermore, our comparative analysis of TSS variation between the species showed a positive correlation between TSS number and gene conservation. Rice and Arabidopsis might have evolved novel TSSs in an independent manner, which led to diversification of these two species.

While a complete genome sequence enables us to estimate the total number of genes in an organism, transcriptional activities of genes can be verified by either tiling array analyses or mapping ESTs and cDNAs onto a genome (Suzuki et al., 2001a; Yamada et al., 2003; Halasz et al., 2006; Li et al., 2007). The complicated gene structure of eukaryotes, which includes alternative transcripts, hampers precise computational predictions of exon-intron boundaries. Discoveries of alternative variants of genes have, therefore, been accomplished by the experimental verification of transcripts (Kim et al., 2005; Blencowe, 2006; Chen et al., 2007). Findings from a wide variety of alternative transcripts in higher eukaryotes led to the concept that the number of transcript variants, rather than the total number of genes, would better reflect the biological complexity of an organism (Brett et al., 2002). Therefore, to understand the relationship between genes and their functions, it is necessary to study transcript variation. Alternative transcripts are mainly generated by two mechanisms: alternative splicing and alternative usage of transcription start sites (TSSs). Both mechanisms are known to play important roles in tissue-specific gene expression and functional variation, which have significant impact on biological processes (Landry et al., 2003; Iida and Go, 2006). Recent large-scale sequencing projects have produced a considerable number of 5′-end sequences of full-length cDNAs (FLcDNAs) from rice (Oryza sativa subsp. japonica; Satoh et al., 2007) and from Arabidopsis (Arabidopsis thaliana; Seki et al., 2002; Alexandrov et al., 2006). Therefore, in this paper, we attempt to elucidate the biological significance and evolution of alternative TSSs in plants.

Past studies of TSS variation have focused mainly on mammals and fungi. For example, in human, millions of 5′-end sequences of FLcDNAs were used to determine 269,774 TSSs, from which 30,964 TSS clusters of 14,628 genes were obtained (Kimura et al., 2006). It was shown that alternative promoters and the resultant alternative usage of first exons had created a large number of transcript variants (Kimura et al., 2006). In yeast, 5′-end sequences of FLcDNAs were mapped on the genome sequence, and numerous TSSs were also clearly determined. Over 90% of the analyzed yeast loci had more than two transcript variants derived from different TSSs (Miura et al., 2006). These results indicated that TSS variation could be observed widely in animals and fungi. However, a comparison of promoters between human and mouse revealed that of 5,463 genes, which contained putative alternative promoters, only 807 were evolutionarily conserved (Tsuritani et al., 2007). In addition, another study of Cap Analysis of Gene Expression data sets found that TSSs of the orthologous genes did not always reside at the equivalent locations in the human and mouse genomes (Frith et al., 2006). These observations have suggested flexibility and rapid turnover of TSSs during evolution. Despite the large amount of TSS information in animals and fungi, there is a paucity of TSS studies in plants. Therefore, analysis of TSSs in the plant lineage could add to knowledge about the evolution of TSSs in eukaryotes.

Recent progress of genome and transcriptome sequencing in rice and Arabidopsis gives us an opportunity to investigate TSS variation in higher plants. RIKEN and Ceres have released over 200,000 5′-end sequences obtained from Arabidopsis FLcDNA clones (Seki et al., 2002; Alexandrov et al., 2006). In addition, more than 580,000 5′- or 3′-end sequences of rice FLcDNAs have become available (Satoh et al., 2007). This wealth of sequence information allows us to conduct identification and comparative analyses of TSSs in more than 10,000 loci of these plants. Previous studies have indicated increased CG-skew around TSSs in both plants and fungi but not in animals (Fujimori et al., 2005) and that the skewed prevalence of adenine at TSSs and of cytosine at −1 bp of TSSs were common characteristics of rice and Arabidopsis (Alexandrov et al., 2006). Yamamoto et al. (2007a, 2007b) conducted a comprehensive analysis of promoter regions to detect frequently observed octamers derived from TATA box, Y patch, and CpG and reported that different octamers could be used for different gene expression mechanisms.

In this study, we identified TSSs comprehensively by mapping transcripts to the genomes of rice and Arabidopsis and compared nucleotide signals around the TSSs. To our knowledge, this is the first attempt at a large-scale TSS comparison between higher plants based on FLcDNA sequences. We also conducted a comparative analysis of TSS variation and gene conservation to elucidate how TSSs and genes have changed over the course of plant evolution.


RESULTS

Identification and Clustering of TSSs

We mapped rice FLcDNAs and their 5′-end sequences onto the rice genome to identify TSSs by a previously described method (The Rice Annotation Project, 2007). As shown in Table I, 32,644 FLcDNAs and 173,881 5′-end sequences of rice, which were successfully mapped onto the rice genome sequence (International Rice Genome Sequencing Project build 4.0), were used for further analyses. Likewise, for Arabidopsis, 21,859 FLcDNAs and 255,302 5′-end sequences were mapped onto the Arabidopsis genome sequence. As a result, we identified 87,782 and 93,726 nonredundant TSSs in rice and Arabidopsis, respectively. After we removed possible read-through transcripts and TSSs in nonannotated genomic positions (for details, see “Materials and Methods”), 80,743 rice transcripts corresponded to 23,445 loci and 91,919 Arabidopsis transcripts corresponded to 16,964 loci. The average numbers of transcripts per locus were 3.44 in rice and 5.42 in Arabidopsis.

Table I.
Number of transcript sequences mapped to the genomes

In addition to alternative TSSs, it is known that a TSS can fluctuate in animal and fungus genomes for biological reasons (Carninci et al., 2006; Kimura et al., 2006; Miura et al., 2006). A TSS may not be determined on a single site of a genome sequence; hence, the TSS number inferred from transcript mapping could be overestimated if all fluctuating TSSs are counted. Taking this possibility into account, we decided to define regions of TSSs by clustering closely located TSSs. When we calculated the distance between TSSs of the same locus, 79.8% and 86.9% of all distances between TSS pairs were less than 100 bp in rice and Arabidopsis, respectively (Supplemental Fig. S1). Since these small fluctuations can be caused either by biological processes or by experimental errors, we first checked the accuracy of 5′-end sequencing. A large number of 5′-end sequences that were obtained from the same clones were determined by two independent experiments: complete sequencing of the FLcDNA and partial 5′-end sequencing. If there are experimental errors, the TSS positions should vary even though they are derived from the same clone. We evaluated the TSS positions in 15,381 FLcDNAs of rice and in 19,865 FLcDNAs of Arabidopsis. Our results showed that approximately 10% to 20% of TSSs of the same clones were not identical (Table II; Supplemental Fig. S2), and 4.8% and 4.5% of TSS pairs were more than 5 bp apart in rice and Arabidopsis, respectively. Note that the distance should be 0, because the sequences were determined using the same clone. Therefore, not only biological fluctuations but also these experimental errors should be diminished by clustering closely mapped TSSs.
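The same-clone consistency check described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `discordance_summary`, the input layout (pairs of genomic coordinates), and the toy coordinates are all assumptions made for the example.

```python
def discordance_summary(position_pairs, tolerance=5):
    """Summarize disagreement between two 5'-end calls made for the same clone.

    position_pairs: iterable of (flcdna_pos, five_prime_read_pos) coordinates.
    Returns (fraction of pairs that are not identical,
             fraction of pairs more than `tolerance` bp apart).
    """
    distances = [abs(a - b) for a, b in position_pairs]
    if not distances:
        raise ValueError("no clone pairs supplied")
    n = len(distances)
    not_identical = sum(d > 0 for d in distances) / n
    far_apart = sum(d > tolerance for d in distances) / n
    return not_identical, far_apart


# Toy coordinates, invented for illustration (not data from the study).
toy_pairs = [(1000, 1000), (2000, 2003), (3000, 3000), (4000, 4012)]
print(discordance_summary(toy_pairs))
```

Because both reads come from one clone, any nonzero distance in such a comparison reflects sequencing or alignment error rather than biological fluctuation.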

Table II.
Comparison of aligned 5′-positions between FLcDNAs and 5′-end sequences obtained from the same clones

To date, several criteria have been used for TSS clustering. For example, to analyze Cap Analysis of Gene Expression data, tags of 20- or 21-bp overlaps were clustered (Carninci et al., 2006), while Kimura et al. (2006) adopted a 500-bp interval between distinct TSSs, and, in yeast, a fixed 100-bp interval was used for clustering (Miura et al., 2006). In this study, we first estimated appropriate interval sizes (for details, see “Materials and Methods”), and they were determined to be 21 bp for rice and 27 bp for Arabidopsis. The TSSs were clustered by the single-linkage clustering method with these thresholds. As a result, the maximum cluster sizes were 133 and 193 bp in length, and the average sizes were 4.2 and 8.7 bp in rice and Arabidopsis, respectively. If we excluded single-member clusters, the average sizes were 14.0 bp in rice and 22.3 bp in Arabidopsis. Finally, we could define 45,917 TSS clusters within 23,445 loci in rice and 35,313 TSS clusters within 16,964 loci in Arabidopsis. On average, each locus had 1.96 and 2.08 TSS clusters in rice and Arabidopsis, respectively. The medians of the distances between TSSs were 149 bp in rice and 184 bp in Arabidopsis. The distribution of the average TSS distances in a locus showed similar patterns in the two species (Supplemental Fig. S3).
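For positions on a line, single-linkage clustering reduces to splitting the sorted positions wherever the gap between neighbors exceeds the threshold. The sketch below shows that idea; the function name, the choice of "less than or equal to" for the threshold comparison, and the toy positions are assumptions, not details taken from the study.

```python
def single_linkage_clusters(positions, threshold):
    """Cluster 1D positions: a gap larger than `threshold` starts a new cluster."""
    clusters = []
    for pos in sorted(set(positions)):
        if clusters and pos - clusters[-1][-1] <= threshold:
            clusters[-1].append(pos)   # close enough to the previous member
        else:
            clusters.append([pos])     # gap too wide: open a new cluster
    return clusters


# Toy 5'-end positions within one locus (invented for illustration).
toy_positions = [100, 115, 130, 200, 205, 400]
toy_clusters = single_linkage_clusters(toy_positions, threshold=21)
print(toy_clusters)
```

In the toy data, positions 100 and 130 are 30 bp apart yet fall in one cluster because 115 links them. This chaining is why a cluster can span more than the threshold, as with the maximum cluster sizes of 133 and 193 bp reported above.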

Although hundreds of thousands of sequences were evaluated, the number of transcripts might not be saturated and there might be TSS variants missing in the cDNA libraries used. In fact, 50% and 58% of the clusters were determined by only one transcript in rice and Arabidopsis, respectively. This result suggests that more TSSs might be detected if we further collect cDNA clones. Thus, our estimates of the numbers of TSS clusters should be taken as lower limits, and there is likely to be more variation of TSSs than observed in this study.

Prominent Nucleotide Features around TSSs

As described in “Materials and Methods,” we defined a representative TSS in each TSS cluster for further analyses. We refer to these representatives simply as TSSs, unless otherwise noted. When there are two or more TSSs in a locus, cis elements of a downstream TSS may overlap with a transcribed region of an upstream TSS. In fact, 18.9% of downstream TSSs of rice were located within the protein-coding regions of transcripts initiated from upstream TSSs and resulted in truncated open reading frames (ORFs). The nucleotide compositions around the downstream TSSs might be distinct from those of the upstream TSSs, because the functional constraints of the transcribed regions of the upstream TSSs should create nucleotide biases. To assess this possibility, we separated the TSS data sets into the most upstream TSSs and the remaining downstream TSSs and analyzed their nucleotide features, such as CG-skew and AT-skew. When there was one TSS in a locus, it was included in the upstream TSS data set.

First, we observed a strong peak of CG-skew in the upstream TSS data sets, whereas the downstream TSSs showed considerably reduced CG-skew in the two plants (Fig. 1, A and B). In downstream TSSs, the peaks were weakened and shifted slightly upstream. Second, we investigated AT-skew around TSSs. Our data clearly showed that the AT-skew was significantly biased around the TSSs, with a similar tendency in rice and Arabidopsis (Fig. 1, C and D). In addition, the overall distributions of the AT-skew were quite different between the upstream and downstream TSSs, in a similar manner in both species. It seems that AT-skew is a much clearer indicator of the nucleotide signals around TSSs than CG-skew. Third, in both upstream and downstream TSSs, similar patterns of GC contents were observed (Fig. 2). Clear peaks and drastic changes around TSSs were seen in all cases. However, TATA-like signals around −35 to −25 bp upstream from the TSSs were significantly diminished in downstream TSSs. Last, the relative entropy display of the nucleotide compositions clearly showed strong signals of TATA-like motifs around the −35 to −25 bp regions of upstream TSSs, but these signals disappeared in downstream TSSs (Fig. 3). When this analysis was repeated with the loci of a single TSS excluded, we obtained essentially the same results (Supplemental Fig. S4). We found a skewed appearance of C at −1 bp and of A/G at TSSs, which was also weaker in downstream TSSs. In previous reports, the TATA box of rice and Arabidopsis was frequently detected in conjunction with the Y patch motif, which is a stretch of C/Ts and is located −100 to −1 bp upstream from the TSS (Yamamoto et al., 2007a, 2007b). In our relative entropy analysis, signals of C/Ts were detected between TSSs and TATA-like motifs in the −25 to −1 bp upstream regions of rice but not of Arabidopsis. Similarly, in the downstream TSSs, the signals corresponding to the Y patch motif were clear only in rice.

Figure 1.
CG-skew around upstream and downstream TSSs of rice (A) and Arabidopsis (B) and AT-skew around upstream and downstream TSSs of rice (C) and Arabidopsis (D). The vertical axis indicates the skew values, and the horizontal axis indicates relative positions ...
Figure 2.
GC contents around TSSs of rice (A) and Arabidopsis (B). The vertical axis indicates the GC contents, and the horizontal axis indicates relative positions from the TSS. The thick line denotes the GC content of the most upstream TSSs, and the thin line ...
Figure 3.
Relative entropy of nucleotides around rice upstream TSSs (A), rice downstream TSSs (B), Arabidopsis upstream TSSs (C), and Arabidopsis downstream TSSs (D). The vertical axis indicates the relative entropy of individual nucleotide sites, and the horizontal ...

These comparisons of the nucleotide signals around TSSs between the upstream and downstream data sets suggested that in rice and Arabidopsis downstream transcription might be differently regulated from upstream transcription, which had canonical cis elements such as the TATA box. Previous studies have defined TSSs as being distinct if they were separated by over 500 bp, so that the downstream transcription signals do not heavily overlap with the upstream transcripts (Kimura et al., 2006; Tsuritani et al., 2007). To examine the possibility that the weakened downstream signals were caused by overlapping upstream transcripts, we reanalyzed nucleotide signals of downstream TSSs that were located more than 500 bp away from any upstream TSSs. We observed decreased nucleotide signals at the same level as those of all downstream TSSs (data not shown). In addition, this tendency was not different in protein-coding regions, untranslated regions, and introns (Supplemental Fig. S5). These results suggest that the weakened nucleotide signal might be due to different transcriptional signals rather than to overlapping transcription with the upstream TSSs.

Relationship between TSS Diversification and Gene Expression Patterns

It is expected that use of alternative TSSs is related to gene expression patterns in differentiated tissues or in response to specific conditions (Macknight et al., 2002; Lee et al., 2006; Szecsi et al., 2006). We examined whether TSS diversification is correlated with patterns of alternative gene expression, using information from the rice transcript library. To exclude ambiguity and experimental errors, we focused on loci where TSSs were determined by more than five transcripts. As a result, in 1,012 (48.5%) of 2,088 loci that had exactly two TSSs, the TSSs were supported by cDNAs that were obtained from different libraries. Hence, as expected, alternative TSSs are likely to result in variations of gene expression patterns. For example, there are two TSSs in a locus named Os08g0199300 and annotated as “similar to YyaF/YCHF TRANSFAC/OBG family small GTPase plus RNA binding domain TGS” (Fig. 4). The downstream TSS of this locus started in the third exon of the upstream transcript. Thus, the downstream ORF was predicted to be a truncated form. While the upstream transcripts were collected from various libraries, such as callus, flower, and shoot, the downstream transcripts were found only in one library, designated as “1 week after flowering ear” (Table III). As another example, the locus Os01g0303200, in which a hypothetical protein was predicted, had two TSSs with different expression patterns (Supplemental Fig. S6). Transcripts from the downstream TSS, which showed no coding potential, had been derived only from the library of “Leaf (9 leaf stage),” and no transcripts of the upstream TSS had been derived from this library (Supplemental Table S1). These results support the idea that TSS diversification contributed to the variations of gene expression patterns. The Arabidopsis ortholog, AT1G35220, had similar but slightly different TSSs (see “Discussion”).

Figure 4.
Examples of alternative TSSs: Os08g0199300 annotated as “similar to YyaF/YCHF TRANSFAC/OBG family small GTPase plus RNA binding domain TGS.” Boxes connected by lines denote exons. The dotted line corresponds to an unsequenced cDNA region. ...
Table III.
Library information and the number of cDNA clones obtained from the rice locus, Os08g0199300

TSS Diversity Correlated with Protein Sequence Evolution

To elucidate the evolutionary significance of TSS diversification, we used two approaches to analyze the relationship between the numbers of TSSs per locus and the evolutionary conservation of protein sequences. We determined orthologs between rice and Arabidopsis by reciprocal best hits of BLASTP searches and calculated protein identity between the orthologs. We found a positive correlation between the number of rice TSSs and protein identity (Fig. 5), and a similar tendency was observed in Arabidopsis TSSs (Supplemental Fig. S7). These results suggest that a locus encoding evolutionarily conserved proteins had acquired more TSSs than one encoding diverged proteins.
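The reciprocal-best-hit criterion mentioned above can be illustrated with a short Python sketch. It assumes the BLASTP results have already been reduced to a score per (query, subject) pair; the function names, the use of bit scores for ranking, the tie handling (first hit seen wins), and the protein identifiers are all hypothetical.

```python
def best_hits(scores):
    """scores: dict mapping (query, subject) -> score. Returns query -> best subject."""
    best = {}
    for (query, subject), score in scores.items():
        # Keep the highest-scoring subject; on ties the first one seen is kept.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical identifiers and scores, for illustration only.
rice_vs_ath = {("riceA", "athX"): 300.0, ("riceA", "athY"): 120.0,
               ("riceB", "athY"): 200.0}
ath_vs_rice = {("athX", "riceA"): 295.0, ("athY", "riceA"): 130.0,
               ("athY", "riceB"): 90.0}
orthologs = reciprocal_best_hits(rice_vs_ath, ath_vs_rice)
print(orthologs)
```

In this toy case `riceB` has `athY` as its best hit, but `athY` prefers `riceA`, so only the `riceA`/`athX` pair is reported.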

Figure 5.
Relationship between TSS number and amino acid identity of orthologs between rice and Arabidopsis. The TSS number is that of the rice loci.

Next, we searched the UniProtKB database for homologous sequences of the rice and Arabidopsis proteins and classified them into four groups by their level of conservation. The ratio of conserved protein groups increased as the number of TSSs per locus grew (Supplemental Fig. S8). However, if the cDNA collection was insufficient, the number of TSSs of poorly expressed genes might be underestimated. To exclude the possibility of insufficient sampling of cDNAs, we used TSSs that were supported by five or more transcripts and confirmed that the same tendency was observed (data not shown). Therefore, highly variable TSSs seemed to be prevalent in conserved protein-coding genes of either rice or Arabidopsis.


DISCUSSION

To cluster TSSs that fluctuated for biological or experimental reasons, we used a threshold interval of 21 bp for rice and 27 bp for Arabidopsis. Since the resultant average sizes of the TSS clusters (4.2 bp in rice and 8.7 bp in Arabidopsis) were much smaller than these threshold intervals, it seems that fluctuating TSSs were clustered effectively and that there was little excessive clustering. The longer average size of Arabidopsis TSS clusters may be due to experimental errors, as we observed more discrepancies in Arabidopsis sequences obtained from the same clone compared with those of rice (Table II).

As each locus had on average two TSS clusters in either species, this TSS variation is likely to have contributed substantially to transcript diversity in these species. Indeed, TSS variants of several genes are known to be responsible for different expression patterns (Landry et al., 2003; Iida and Go, 2006). In this study, our large-scale analysis revealed that, in about half of the loci that had two TSSs, the TSSs had been obtained from different libraries (Table III; Supplemental Table S1). This observation is consistent with our finding that the nucleotide signatures are distinguishable between the upstream and downstream TSSs, as canonical signals, such as the TATA box motif, were clearly detected in the upstream TSSs but were considerably diminished in the downstream TSSs. These results suggest that transcription from the upstream TSSs is, at least in part, under a common regulatory mechanism, while the downstream TSSs are generally regulated by specialized systems (Fig. 6), which should lead to highly differentiated expression patterns.

Figure 6.
A model of TSS diversification in the course of flowering plant evolution.

In addition to CG-skew, which characterizes plant and yeast TSSs (Fujimori et al., 2005), AT-skew was found to be another strong indicator of TSSs. It is of particular interest that the distributions of AT-skew were nearly identical between rice and Arabidopsis. The sharp contrast of the AT-skew patterns between the upstream and downstream TSSs also supports the aforementioned idea that TSS variations are related to expression differences. A possible application of this clear AT-skew is that, since the AT-skew has been conserved between these remotely related plant species, one may consider a generalized method by which TSSs can be predicted from newly sequenced genomic DNA of plants. Because plants and fungi share common nucleotide features around TSSs (Fujimori et al., 2005), the animal machinery might have evolved independently.

A reason for the weak signals of downstream TSSs appears to be overlap with upstream protein-coding regions. Since protein-coding regions are under functional constraints, the nucleotide compositions and genomic positions of cis elements will be affected. For example, the TATA box frequently contains TAA, which is a stop codon and may prematurely terminate translation. The medians of the distances between TSSs were relatively small, 149 and 184 bp in rice and Arabidopsis, respectively, so that it was possible that the signals overlapping the upstream protein-coding region remained generally weak. However, even though we used only downstream TSSs separated from upstream TSSs by more than 500 bp, the signals were almost identical to those of all downstream TSSs (data not shown). Therefore, we concluded that the distinct signals of the downstream TSSs were not necessarily due to upstream coding regions but that they are intrinsic to the nature of the downstream TSSs. We should note that the downstream TSSs might produce a truncated protein whose function is deteriorated or lost. Thus, regulation by alternative TSS usage may be achieved in a loss-of-function manner, which is suggested to be of evolutionary importance (Oda et al., 2002; Tanaka et al., 2005). On the contrary, if a new TSS is generated in the upstream region, it would affect the downstream canonical transcriptional signals. Therefore, upstream TSS generation might have been suppressed during evolution (Fig. 6). Our hypothesis is that plants have generally retained an upstream “genuine” TSS with the TATA box and created downstream diversity. This seems to be in contrast with the observation that, in humans, the TATA box was used for tissue-specific expression while ubiquitously expressed genes are dependent on CpG islands (Suzuki et al., 2001b; Carninci et al., 2006), suggesting that plants and animals independently evolved their basic transcription regulation machineries.

It is intuitively plausible that, if protein sequences are highly diverged because of relaxed functional constraint, regulation of their expression becomes concordantly variable. However, our analyses revealed that the number of TSSs increased proportionally to the sequence conservation in both rice and Arabidopsis. Although it was expected that the gene function affected the number of TSSs, our functional categorization of the proteins by the Gene Ontology hierarchy showed no significant correlation between gene function and the number of TSSs (Supplemental Fig. S9). Since highly conserved proteins generally play essential roles, are used in a variety of tissues, and are regulated by complex processes, elaborate transcriptional regulation to control several TSSs might be required. Intriguingly, the TSSs that we identified were not necessarily conserved between rice and Arabidopsis. As shown in Figure 4, B and C, both rice and Arabidopsis used the same upstream TSSs in this locus, whereas the downstream TSSs obviously differed between these orthologs. Likewise, in humans and mice, merely one-fourth of the promoter regions between orthologs were conserved (Wasserman et al., 2000; Tsuritani et al., 2007). Therefore, TSS variation seems to be unstable in the course of evolution, and this variation should contribute to biodiversity among a wide range of species.


CONCLUSION

We determined TSSs in rice and Arabidopsis by large-scale computation and found that both species have, on average, about two TSSs per locus. The nucleotide signals around TSSs were similar in these two plants, while they were quite different between the upstream and downstream TSSs. A positive correlation between TSS numbers and gene conservation was also observed. This study provides insight into the diversified transcriptional variation that is likely to have contributed to the evolution of plant species.


MATERIALS AND METHODS

Genome and cDNA Sequences

We used FLcDNAs and their 5′-end sequences for TSS determination (Supplemental Table S2). The FLcDNAs and 5′-end sequences of rice (Oryza sativa; Kikuchi et al., 2003; Satoh et al., 2007) and the FLcDNAs and 5′-end sequences of Arabidopsis (Arabidopsis thaliana; Seki et al., 2002; Alexandrov et al., 2006) were retrieved from the GenBank/EMBL/DDBJ DNA databases. In addition, the Arabidopsis FLcDNAs sequenced by RIKEN were downloaded from the RIKEN Arabidopsis Genome Encyclopedia (http://rarge.gsc.riken.jp/archives/rafl/sequence/; Sakurai et al., 2005). The library information of the rice FLcDNA clones, which was derived from 41 different libraries including unknown resources, was provided by Dr. S. Kikuchi (personal communication). For the rice genome sequence, the International Rice Genome Sequencing Project genome sequence build 4 was used (http://rgp.dna.affrc.go.jp/IRGSP/download.html). The Arabidopsis genome sequence was downloaded from the National Center for Biotechnology Information's FTP site (ftp://ftp.ncbi.nih.gov/genomes/) as of August 13, 2004. ORFs and annotation data of rice were downloaded from the RAP-DB (http://rapdb.dna.affrc.go.jp/; The Rice Annotation Project, 2008). ORFs of Arabidopsis were retrieved from The Arabidopsis Information Resource (TAIR) 7 annotation data (http://www.arabidopsis.org/) as of June 19, 2007 (Rhee et al., 2003).

cDNA Mapping to Genome Sequences

Positions of transcripts on the genome sequences were determined by methods described previously (The Rice Annotation Project, 2007). We used 5′-end positions aligned by the est2genome program with the following options: gap open penalty, 8; mismatch penalty, 6 (Rice et al., 2000). Since the cDNA data sets included redundant sequences, in which the same clone had been determined both as a full-length sequence and as a 5′-EST, we used only the full-length cDNAs for such clones. We noticed that approximately 5% of the mapped transcripts contained an unaligned 5′-region of 7 bp or more, which was possibly derived from remaining vector sequences. These unaligned regions were discarded from our analyses. We found that 764 RAP loci included nonoverlapping transcripts, which might be due to transcriptional read-through. These read-through candidates were not used in this study, because read-through transcripts lead to overestimation of alternative TSSs. Because 1,807 5′-end sequences of Arabidopsis did not correspond to any TAIR protein-coding regions, they were eliminated from our data sets.

Clustering of 5′-End Positions

We clustered 5′-end positions that fluctuated for biological or for experimental reasons. To determine an appropriate threshold for the distance between 5′-end positions to be clustered, the relationship between the distance and the total number of clusters was examined (Supplemental Fig. S10). The cluster number decreased gradually and monotonically as the distance increased. We adopted the threshold distance at which the rate of decrease in the total number of clusters was less than 1%: 21 bp for rice and 27 bp for Arabidopsis. Juxtaposed 5′-end positions within the threshold distance were clustered by the single-linkage clustering method. In each cluster, a single representative TSS was selected in the following order: (1) supported by a full-length sequence, (2) supported by the most clones, and (3) the most upstream 5′-TSS.
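The threshold scan and the choice of a representative TSS described above can be sketched as follows. Several points are assumptions made for this illustration rather than details stated in the text: the "rate of decrease" is read as the relative drop in cluster count when the distance grows by 1 bp, the three selection criteria are read as successive tie-breakers, coordinates are taken to be on the plus strand (smaller means more upstream), and all names and toy values are hypothetical.

```python
def count_clusters(positions_by_locus, threshold):
    """Total number of single-linkage clusters over all loci for one threshold."""
    total = 0
    for positions in positions_by_locus.values():
        ordered = sorted(set(positions))
        if ordered:
            # One cluster, plus one more for every gap wider than the threshold.
            total += 1 + sum(b - a > threshold for a, b in zip(ordered, ordered[1:]))
    return total


def choose_threshold(positions_by_locus, max_distance=100, min_rate=0.01):
    """Smallest distance d at which the step from d-1 to d removes < min_rate of clusters."""
    previous = count_clusters(positions_by_locus, 0)
    for d in range(1, max_distance + 1):
        current = count_clusters(positions_by_locus, d)
        if previous and (previous - current) / previous < min_rate:
            return d
        previous = current
    return None


def pick_representative(candidates):
    """candidates: dicts with 'pos', 'n_clones', 'has_full_length' (plus strand assumed)."""
    return min(
        candidates,
        key=lambda c: (not c["has_full_length"], -c["n_clones"], c["pos"]),
    )


# Toy inputs, invented for illustration.
toy_loci = {"locusA": [0, 1, 2, 50], "locusB": [0, 3]}
toy_cluster = [
    {"pos": 105, "n_clones": 3, "has_full_length": False},
    {"pos": 110, "n_clones": 2, "has_full_length": True},
    {"pos": 100, "n_clones": 2, "has_full_length": True},
]
print(choose_threshold(toy_loci), pick_representative(toy_cluster)["pos"])
```

With real data the cluster count falls gradually, so the scan stops where the curve flattens. The toy input is far too small to show that shape and only exercises the logic.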

Calculation of CG-Skew and AT-Skew

We extracted genomic sequences that spanned the −250 to +350 bp region around each TSS. When an ambiguous nucleotide denoted by N existed in a sequence file, the sequence was eliminated from our analysis. CG-skew values [= (C − G)/(C + G)] were computed in a sliding window of 100 bp with 1-bp steps, where C stands for the total number of cytosines in the window and G stands for the total number of guanines. The position of a window in Figure 1 is represented by the 51st nucleotide of the window. Likewise, AT-skew values were calculated for adenines (A) and thymines (T).
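A minimal Python sketch of the sliding-window skew calculation is given below. The formula (C − G)/(C + G), the 100-bp window, and the 1-bp step come from the text; the function names, the rolling-count implementation, the handling of windows that contain neither base (reported as `None`), and the toy sequence are assumptions for illustration.

```python
def skew_profile(sequence, first, second, window=100):
    """(first - second) / (first + second) in each window, sliding by 1 bp."""
    seq = sequence.upper()
    if "N" in seq:
        raise ValueError("sequences with ambiguous bases are excluded")
    if len(seq) < window:
        return []
    n_first = seq[:window].count(first)
    n_second = seq[:window].count(second)
    values = []
    for start in range(len(seq) - window + 1):
        if start > 0:
            # Update the counts for the base leaving and the base entering.
            leaving, entering = seq[start - 1], seq[start + window - 1]
            n_first += (entering == first) - (leaving == first)
            n_second += (entering == second) - (leaving == second)
        total = n_first + n_second
        values.append((n_first - n_second) / total if total else None)
    return values


def window_center(start, window=100):
    """0-based index of the nucleotide that represents a window (the 51st of 100)."""
    return start + window // 2


# Toy sequence: every 100-bp window holds 75 C and 25 G.
toy_seq = "CCCG" * 30
cg_skew = skew_profile(toy_seq, "C", "G")
at_skew = skew_profile(toy_seq, "A", "T")
print(len(cg_skew), cg_skew[0], at_skew[0])
```

Calling `skew_profile(seq, "A", "T")` gives the AT-skew in the same way, and `window_center` maps a window start to the plotted position.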

Calculation of the Relative Entropy at a Nucleotide Site

We represented nucleotide biases by relative entropy, modifying a previously reported method (Schneider and Stephens, 1990; Crooks et al., 2004). The relative entropy (R) at a particular nucleotide position is:

$$R = \sum_{n \in \{A, T, G, C\}} p_n \log_2 \frac{p_n}{p_g}$$

where p_n is the observed frequency of nucleotide n (A, T, G, or C) at the position and p_g is the genomic frequency of n. Previous studies have assumed random occurrence of the nucleotide in a background distribution, so that p_g was 0.250 for any n, but the GC contents of rice and Arabidopsis were 43.6% and 36.0%, respectively, which clearly deviated from 50%. Therefore, for example, p_g for the adenine of rice was set to 0.282 [= (1 − 0.436)/2], assuming that A and T, or G and C, distributed equally in either DNA strand. The height (H_n) of each nucleotide n at a particular position in Figure 3 was determined by multiplying the relative entropy by the frequency of that nucleotide (Schneider and Stephens, 1990), as follows:

$$H_n = p_n R$$
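The calculation in this section can be sketched in Python as below. The sketch assumes the relative entropy takes the usual Kullback-Leibler form, the sum over nucleotides of p_n log2(p_n / p_g), with the logarithm in base 2 so that the result is in bits; that base, the function names, and the toy site frequencies are assumptions. The background frequencies follow the text: the GC content is split equally between G and C, and the remainder equally between A and T.

```python
import math


def background_frequencies(gc_content):
    """Genomic nucleotide frequencies from GC content, with A = T and G = C."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return {"A": at, "T": at, "G": gc, "C": gc}


def relative_entropy(observed, background):
    """Sum of p_n * log2(p_n / p_g); nucleotides with p_n = 0 contribute nothing."""
    return sum(
        p * math.log2(p / background[n]) for n, p in observed.items() if p > 0
    )


def letter_heights(observed, background):
    """Height of each nucleotide: its frequency times the site's relative entropy."""
    r = relative_entropy(observed, background)
    return {n: p * r for n, p in observed.items()}


rice_background = background_frequencies(0.436)   # GC content of rice from the text
toy_site = {"A": 0.7, "T": 0.1, "G": 0.1, "C": 0.1}  # invented site frequencies
print(round(rice_background["A"], 3))
print(letter_heights(toy_site, rice_background))
```

A site whose composition equals the background has zero relative entropy, so only positions that depart from the genomic composition produce visible signal.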

Sequence Analysis of Orthologs

The rice protein set we used was compared with the Arabidopsis protein set of TAIR. Homologs and orthologs were determined as described elsewhere (The Rice Annotation Project, 2007). Homologous sequences of other organisms were identified by BLASTP searches against UniProtKB (release 10.2) downloaded as of April 9, 2007 (The UniProt Consortium, 2007). We adopted an E value of less than 10^−4 as the threshold. On the basis of the taxonomic groups to which the organisms of the homologs belonged, we categorized the rice and Arabidopsis proteins into (1) Oryzeae/Brassicaceae, (2) Liliopsidae/Eudicotyledons, (3) Viridiplantae, and (4) nonplant organisms (including fungi, animals, and prokaryotes).
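One way to read the four-way categorization is that each protein is assigned to the broadest taxonomic level at which a homolog passing the E-value threshold was found. The sketch below implements that reading only. The assignment rule, the level labels, the function name, and the treatment of proteins without any hit (returned as `None`) are assumptions, not details given in the text.

```python
# Levels ordered from narrowest to broadest; the labels are hypothetical shorthand
# for (1) Oryzeae/Brassicaceae, (2) Liliopsidae/Eudicotyledons,
# (3) Viridiplantae, and (4) nonplant organisms.
LEVELS = ["tribe_or_family", "class", "viridiplantae", "nonplant"]


def conservation_group(homolog_levels):
    """Return the broadest level among the homologs of one protein, or None."""
    ranks = [LEVELS.index(level) for level in homolog_levels]
    return LEVELS[max(ranks)] if ranks else None


# Toy inputs, invented for illustration.
print(conservation_group(["tribe_or_family", "class"]))
print(conservation_group(["viridiplantae", "nonplant", "class"]))
print(conservation_group([]))
```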

Supplemental Data

The following materials are available in the online version of this article.

  • Supplemental Figure S1. Distribution of distances between intralocus TSSs.
  • Supplemental Figure S2. Distribution of the distances of aligned 5′-positions between FLcDNAs and 5′-ESTs obtained from the same clones.
  • Supplemental Figure S3. Distribution of average TSS distances of a locus in rice and Arabidopsis.
  • Supplemental Figure S4. Relative entropy of nucleotides around the loci of a single TSS and loci with multiple TSSs.
  • Supplemental Figure S5. Relative entropy of nucleotides around downstream TSSs in rice.
  • Supplemental Figure S6. Examples of alternative TSSs.
  • Supplemental Figure S7. Relationship between the TSS number and amino acid identity in Arabidopsis.
  • Supplemental Figure S8. Evolutionary conservation and TSS occurrence frequency per locus.
  • Supplemental Figure S9. Relationship between TSS numbers in a locus and Gene Ontology categories.
  • Supplemental Figure S10. Definition of threshold distance.
  • Supplemental Table S1. Library information of the number of cDNA clones from Os01g0303200.
  • Supplemental Table S2. Number of transcript sequences used in this study.
  • Supplemental Information S1. Evaluation of TSSs of two FLcDNA sets obtained from independent cloning methods.


We thank H. Numa and H. Sakai for their suggestions; S. Kikuchi, M. Seki, and T. Sakurai for providing information about FLcDNA clones; the Rice Annotation Project members for rice genome annotation data; and Y.Y. Yamamoto for helpful discussions.


1 This work was supported by the Ministry of Agriculture, Forestry, and Fisheries of Japan (Integrated Research Project for Plant, Insect, and Animal Using Genome Technology grant no. GD–1002 to T.T., T.I., and K.O.K. and Genomics for Agricultural Innovation grant no. GIR–1001 to T.T. and T.I.).

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Takeshi Itoh (taitoh@affrc.go.jp).

[W] The online version of this article contains Web-only data.

[OA] Open Access articles can be viewed online without a subscription.



  • Alexandrov NN, Troukhan ME, Brover VV, Tatarinova T, Flavell RB, Feldmann KA (2006) Features of Arabidopsis genes and genome discovered using full-length cDNAs. Plant Mol Biol 60: 69–85
  • Blencowe BJ (2006) Alternative splicing: new insights from global analyses. Cell 126: 37–47
  • Brett D, Pospisil H, Valcarcel J, Reich J, Bork P (2002) Alternative splicing and genome complexity. Nat Genet 30: 29–30
  • Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, et al (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 38: 626–635
  • Chen FC, Wang SS, Chaw SM, Huang YT, Chuang TJ (2007) Plant Gene and Alternatively Spliced Variant Annotator: a plant genome annotation pipeline for rice gene and alternatively spliced variant identification with cross-species expressed sequence tag conservation from seven plant species. Plant Physiol 143: 1086–1095
  • Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: a sequence logo generator. Genome Res 14: 1188–1190
  • Frith MC, Ponjavic J, Fredman D, Kai C, Kawai J, Carninci P, Hayashizaki Y, Sandelin A (2006) Evolutionary turnover of mammalian transcription start sites. Genome Res 16: 713–722
  • Fujimori S, Washio T, Tomita M (2005) GC-compositional strand bias around transcription start sites in plants and fungi. BMC Genomics 6: 26
  • Halasz G, van Batenburg MF, Perusse J, Hua S, Lu XJ, White KP, Bussemaker HJ (2006) Detecting transcriptionally active regions using genomic tiling arrays. Genome Biol 7: R59
  • Iida K, Go M (2006) Survey of conserved alternative splicing events of mRNAs encoding SR proteins in land plants. Mol Biol Evol 23: 1085–1094
  • Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, et al (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301: 376–379
  • Kim N, Shin S, Lee S (2005) ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res 15: 566–576
  • Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, Yamamoto J, Sekine M, Tsuritani K, Wakaguri H, et al (2006) Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res 16: 55–65
  • Landry JR, Mager DL, Wilhelm BT (2003) Complex controls: the role of alternative promoters in mammalian genomes. Trends Genet 19: 640–648
  • Lee JR, Jang HH, Park JH, Jung JH, Lee SS, Park SK, Chi YH, Moon JC, Lee YM, Kim SY, et al (2006) Cloning of two splice variants of the rice PTS1 receptor, OsPex5pL and OsPex5pS, and their functional characterization using pex5-deficient yeast and Arabidopsis. Plant J 47: 457–466
  • Li L, Wang X, Sasidharan R, Stolc V, Deng W, He H, Korbel J, Chen X, Tongprasit W, Ronald P, et al (2007) Global identification and characterization of transcriptionally active regions in the rice genome. PLoS One 2: e294
  • Macknight R, Duroux M, Laurie R, Dijkwel P, Simpson G, Dean C (2002) Functional significance of the alternative transcript processing of the Arabidopsis floral promoter FCA. Plant Cell 14: 877–888
  • Miura F, Kawaguchi N, Sese J, Toyoda A, Hattori M, Morishita S, Ito T (2006) A large-scale full-length cDNA analysis to explore the budding yeast transcriptome. Proc Natl Acad Sci USA 103: 17846–17851
  • Oda M, Satta Y, Takenaka O, Takahata N (2002) Loss of urate oxidase activity in hominoids and its evolutionary implications. Mol Biol Evol 19: 640–653
  • Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 31: 224–228
  • Rice P, Longden I, Bleasby A (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16: 276–277
  • Sakurai T, Satou M, Akiyama K, Iida K, Seki M, Kuromori T, Ito T, Konagaya A, Toyoda T, Shinozaki K (2005) RARGE: a large-scale database of RIKEN Arabidopsis resources ranging from transcriptome to phenome. Nucleic Acids Res 33: D647–D650
  • Satoh K, Doi K, Nagata T, Kishimoto N, Suzuki K, Otomo Y, Kawai J, Nakamura M, Hirozane-Kishikawa T, Kanagawa S, et al (2007) Gene organization in rice revealed by full-length cDNA mapping and gene expression analysis through microarray. PLoS One 2: e1235
  • Schneider TD, Stephens RM (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18: 6097–6100
  • Seki M, Narusaka M, Kamiya A, Ishida J, Satou M, Sakurai T, Nakajima M, Enju A, Akiyama K, Oono Y, et al (2002) Functional annotation of a full-length Arabidopsis cDNA collection. Science 296: 141–145
  • Suzuki Y, Taira H, Tsunoda T, Mizushima-Sugano J, Sese J, Hata H, Ota T, Isogai T, Tanaka T, Morishita S, et al (2001a) Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep 2: 388–393
  • Suzuki Y, Tsunoda T, Sese J, Taira H, Mizushima-Sugano J, Hata H, Ota T, Isogai T, Tanaka T, Nakamura Y, et al (2001b) Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res 11: 677–684
  • Szecsi J, Joly C, Bordji K, Varaud E, Cock JM, Dumas C, Bendahmane M (2006) BIGPETALp, a bHLH transcription factor is involved in the control of Arabidopsis petal size. EMBO J 25: 3912–3920
  • Tanaka T, Tateno Y, Gojobori T (2005) Evolution of vitamin B6 (pyridoxine) metabolism by gain and loss of genes. Mol Biol Evol 22: 243–250
  • The Rice Annotation Project (2007) Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana. Genome Res 17: 175–183
  • The Rice Annotation Project (2008) The Rice Annotation Project Database (RAP-DB): 2008 update. Nucleic Acids Res 36: D1028–D1033
  • The UniProt Consortium (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res 35: D193–D197
  • Tsuritani K, Irie T, Yamashita R, Sakakibara Y, Wakaguri H, Kanai A, Mizushima-Sugano J, Sugano S, Nakai K, Suzuki Y (2007) Distinct class of putative “non-conserved” promoters in humans: comparative studies of alternative promoters of human and mouse genes. Genome Res 17: 1005–1014
  • Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE (2000) Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26: 225–228
  • Yamada K, Lim J, Dale JM, Chen H, Shinn P, Palm CJ, Southwick AM, Wu HC, Kim C, Nguyen M, et al (2003) Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302: 842–846
  • Yamamoto YY, Ichida H, Abe T, Suzuki Y, Sugano S, Obokata J (2007a) Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res 35: 6219–6226
  • Yamamoto YY, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, Abe T (2007b) Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics 8: 67

Articles from Plant Physiology are provided here courtesy of American Society of Plant Biologists