Bioinformatics. Jan 15, 2013; 29(2): 273–274.
Published online Nov 21, 2012. doi:  10.1093/bioinformatics/bts678
PMCID: PMC3546801

Intron-centric estimation of alternative splicing from RNA-seq data

Dmitri D. Pervouchine (1,2,3,*), David G. Knowles (1,2) and Roderic Guigó (1,2)

Abstract

Motivation: Novel technologies have brought in unprecedented amounts of high-throughput sequencing data along with great challenges in their analysis and interpretation. The percent-spliced-in (PSI, $\psi$) metric estimates the incidence of single-exon–skipping events and can be computed directly by counting reads that align to known or predicted splice junctions. However, the majority of human splicing events are more complex than single-exon skipping.

Results: In this short report, we present a framework that generalizes the $\psi$ metric to arbitrary classes of splicing events. We change the view from exon centric to intron centric and split the value of $\psi$ into two indices, $\psi_5$ and $\psi_3$, measuring the rate of splicing at the 5′ and 3′ end of the intron, respectively. The advantage of having two separate indices is that they deconvolute two distinct elementary acts of the splicing reaction. The completeness of splicing index is decomposed in a similar way. This framework is implemented as bam2ssj, a BAM-file–processing pipeline for strand-specific counting of reads that align to splice junctions or overlap with splice sites. It can be used as a consistent protocol for quantifying splice junctions from RNA-seq data because no such standard procedure currently exists.

Availability: The C++ code of bam2ssj is open source and is available at https://github.com/pervouchine/bam2ssj

Contact: dp@crg.eu

1 INTRODUCTION

One major challenge in the analysis of high-throughput RNA sequencing data is to disentangle relative abundances of alternatively spliced transcripts. Many existing quantification methods do so by using considerations of likelihood, parsimony and optimality to obtain a consolidated view of cDNA fragments that map to a given transcriptional unit (Katz et al., 2010; Montgomery et al., 2010; Trapnell et al., 2012). The advantage of such integrative approaches is that they provide robust estimators for transcript abundance by reducing sampling errors, as they effectively consider samples of larger size. On the other hand, because all the reads from the same transcriptional unit are combined into one master model, there is no guarantee that the inclusion or exclusion of a specific exon is estimated independently of co-occurring splicing events (Katz et al., 2010; Pan et al., 2008).

The quantification of alternatively spliced isoforms based on the $\psi$ metric captures more accurately the local information related to splicing of each particular exon (Katz et al., 2010). We follow Kakaradov et al. (2012) in considering only the reads that align to splice junctions (Fig. 1) and ignoring the reads that align to exon bodies (position-specific read counts are not considered). $\psi$ is defined as

$$\psi = \frac{\mathrm{inc}}{\mathrm{inc} + 2\,\mathrm{exc}}, \tag{1}$$

where $\mathrm{inc}$ and $\mathrm{exc}$ are the numbers of reads supporting exon inclusion and exon exclusion, respectively, and the factor of two in the denominator accounts for the fact that there are twice as many mappable positions for reads supporting exon inclusion as exon exclusion. Equation (1) defines an unbiased estimator for the fraction of mRNAs that represent the inclusion isoform under the assumption that splice-junction reads are distributed evenly. $\psi$ can also be derived from the expression values of whole isoforms, for instance, as the abundance of the inclusion isoform as a fraction of the total abundance. However, the non-uniform read coverage, not only between but also within transcripts, generally degrades such estimates (Kakaradov et al., 2012).
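Equation (1) can be evaluated directly from junction read counts. The following is a minimal sketch, not part of bam2ssj; the function name `psi` and the example counts are hypothetical and serve only to illustrate the factor of two in the denominator.

```python
def psi(inc, exc):
    """Percent-spliced-in from junction read counts, following Equation (1).

    inc: reads on the two junctions that support exon inclusion (summed)
    exc: reads on the single junction that supports exon exclusion
    """
    denominator = inc + 2 * exc
    if denominator == 0:
        return None  # no junction reads at all: the estimate is undefined
    return inc / denominator


# Hypothetical counts: 40 + 40 inclusion reads and 10 exclusion reads.
example = psi(40 + 40, 10)
print(example)
```

Returning `None` when no junction reads are observed is a design choice of this sketch; a real pipeline might instead require a minimum read count before reporting a value.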

Fig. 1.
The percent-spliced-in (PSI, $\psi$) metric is defined as the number of reads supporting exon inclusion ($\mathrm{inc}$) as the fraction of the combined number of reads supporting inclusion and exclusion ($\mathrm{exc}$). The exon of interest is shown in gray. Only reads that span to the ...

The $\psi$ metric can be generalized beyond the class of single-exon–skipping events by counting inclusion and exclusion reads regardless of exon adjacency (Fig. 1, dashed arcs). Although this definition helps to reduce the undercoverage bias by taking into account splice junctions that are not present in the reference annotation, it often assigns misleading values to the $\psi$ metric, for instance, in the case of multiple-exon skipping, where the amount of support for exon exclusion does not reflect the true splicing rate of each individual intron.

2 APPROACH

In this work, we change the view from exon centric to intron centric. Each intron is defined uniquely by the combination of its 5′-splice site ($D$, donor) and 3′-splice site ($A$, acceptor). Denote by $n(D, A)$ the number of reads aligning to the splice junction spanning from $D$ to $A$ (Fig. 2) and define

$$\psi_5(D, A) = \frac{n(D, A)}{\sum_{A'} n(D, A')}, \qquad \psi_3(D, A) = \frac{n(D, A)}{\sum_{D'} n(D', A)}, \tag{2}$$

where $D'$ and $A'$ run over all donor and acceptor sites, respectively, within the given genomic annotation set. Because $D'$ could be $D$ and $A'$ could be $A$, both $\psi_5$ and $\psi_3$ are real numbers from $0$ to $1$. The value of $\psi_5(D, A)$ can be regarded as an estimator for the conditional probability of splicing from $D$ to $A$, i.e. the fraction of transcripts in which the intron $D$ to $A$ is spliced, relative to the number of transcripts in which $D$ is used as a splice site. Similarly, $\psi_3(D, A)$ is the relative frequency of $D$-to-$A$ splicing with respect to the splicing events in which $A$ is used.
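The two indices of Equation (2) can be computed in one pass over a table of junction counts. The sketch below is illustrative only: the function name `splicing_indices`, the dictionary layout and the site labels `D1`, `A1`, etc. are assumptions made for this example and do not reflect the output format of bam2ssj.

```python
from collections import defaultdict


def splicing_indices(junction_counts):
    """Map each (donor, acceptor) pair to its (psi5, psi3) values.

    junction_counts: dict {(donor, acceptor): number of junction reads}
    """
    donor_total = defaultdict(int)     # reads spliced from each donor to any acceptor
    acceptor_total = defaultdict(int)  # reads spliced into each acceptor from any donor
    for (donor, acceptor), n in junction_counts.items():
        donor_total[donor] += n
        acceptor_total[acceptor] += n

    indices = {}
    for (donor, acceptor), n in junction_counts.items():
        psi5 = n / donor_total[donor] if donor_total[donor] else None
        psi3 = n / acceptor_total[acceptor] if acceptor_total[acceptor] else None
        indices[(donor, acceptor)] = (psi5, psi3)
    return indices


# Hypothetical single-exon-skipping locus: two inclusion junctions and one skip.
counts = {("D1", "A1"): 40, ("D2", "A2"): 40, ("D1", "A2"): 10}
for junction, (psi5, psi3) in splicing_indices(counts).items():
    print(junction, psi5, psi3)
```

Because each junction contributes to the totals of its own donor and acceptor, every reported index lies between 0 and 1, as stated above.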

Fig. 2.
Left: the 5′-splicing index, $\psi_5$, is the number of reads supporting the splicing event from $D$ to $A$ relative to the combined number of reads supporting splicing from $D$ to any acceptor site $A'$. Right: the 3′-splicing index, $\psi_3$, is the number of reads ...

In the particular case of single-exon skipping (Fig. 1), the values of $\psi$, $\psi_5$ and $\psi_3$ are related as follows. Denote the upstream and downstream introns of the highlighted exon by $(D, A')$ and $(D', A)$, respectively. Let $x = \psi_5(D, A')$ and $y = \psi_3(D', A)$. Then, $x = a/(a + c)$, $y = b/(b + c)$ and $\psi = (a + b)/(a + b + 2c)$, where $a = n(D, A')$, $b = n(D', A)$ and $c = n(D, A)$. Assuming uniform read coverage across the gene ($a \approx b$), we get $x \approx y$ and, therefore,

$$\psi \approx \frac{x + y}{2}. \tag{3}$$

That is, in the particular case of single-exon skipping, the value of $\psi$ is equal to the average of $\psi_5(D, A')$ and $\psi_3(D', A)$ given that the read coverage is reasonably uniform. If $a$ and $b$ differ significantly, the contribution of $x$ and $y$ to $\psi$ is given by the weight factors $(a + c)/(a + b + 2c)$ and $(b + c)/(a + b + 2c)$, which follows from $a + b = x(a + c) + y(b + c)$.
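The relation between the exon-centric metric and the two intron-centric indices can be checked numerically. In this sketch, a and b are taken to be the read counts on the upstream and downstream inclusion junctions and c the count on the skipping junction; these names, the function `compare_psi` and the example counts are assumptions made for illustration.

```python
def compare_psi(a, b, c):
    """Compare exon-centric PSI with the plain and weighted averages of the
    5'-index of the upstream intron (x) and the 3'-index of the downstream
    intron (y) for a single-exon-skipping event."""
    x = a / (a + c)  # 5'-splicing index of the upstream intron
    y = b / (b + c)  # 3'-splicing index of the downstream intron
    total = a + b + 2 * c
    weight_x = (a + c) / total
    weight_y = (b + c) / total
    return {
        "psi": (a + b) / total,
        "plain_average": (x + y) / 2,
        "weighted_average": weight_x * x + weight_y * y,
    }


# Uniform coverage (a == b): all three values agree.
print(compare_psi(40, 40, 10))
# Uneven coverage (a much larger than b): only the weighted average matches psi.
print(compare_psi(90, 10, 10))
```

With uneven coverage the plain average drifts away from the exon-centric value, which is why the weight factors matter when the two inclusion junctions are covered very differently.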

Similarly, the completeness of splicing index (Tilgner et al., 2012) is split into two indices, $\theta_5$ and $\theta_3$, where

$$\theta_5(D) = \frac{\sum_{A'} n(D, A')}{\sum_{A'} n(D, A') + r(D)}, \qquad \theta_3(A) = \frac{\sum_{D'} n(D', A)}{\sum_{D'} n(D', A) + r(A)}, \tag{4}$$

and $r(S)$ denotes the number of genomic reads (reads mapped uniquely to the genomic sequence) overlapping the splice site $S$. Note that $\theta_5$ depends only on $D$ and $\theta_3$ depends only on $A$. The values of $\theta_5$ and $\theta_3$ are unbiased estimators for the absolute frequency of splice site usage, i.e. the proportion of transcripts in which $D$ (or $A$) is used as a splice site, among all transcripts containing the splice site $D$ (or $A$).
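The completeness-of-splicing indices of Equation (4) combine the junction totals of a splice site with the number of unspliced (genomic) reads that overlap it. The sketch below is a hedged illustration: the function name `completeness_indices`, the two input dictionaries and the site labels are hypothetical, and sites with no genomic reads recorded are assumed to have a count of zero.

```python
from collections import defaultdict


def completeness_indices(junction_counts, genomic_counts):
    """Return (theta5, theta3): per-donor and per-acceptor completeness of splicing.

    junction_counts: dict {(donor, acceptor): junction reads}
    genomic_counts:  dict {splice_site: unspliced reads overlapping the site}
    """
    donor_total = defaultdict(int)
    acceptor_total = defaultdict(int)
    for (donor, acceptor), n in junction_counts.items():
        donor_total[donor] += n
        acceptor_total[acceptor] += n

    def ratio(spliced, unspliced):
        total = spliced + unspliced
        return spliced / total if total else None

    # Sites absent from genomic_counts are treated as having zero genomic reads.
    theta5 = {d: ratio(s, genomic_counts.get(d, 0)) for d, s in donor_total.items()}
    theta3 = {a: ratio(s, genomic_counts.get(a, 0)) for a, s in acceptor_total.items()}
    return theta5, theta3


junctions = {("D1", "A1"): 40, ("D2", "A2"): 40, ("D1", "A2"): 10}
genomic = {"D1": 50, "A1": 10}
theta5, theta3 = completeness_indices(junctions, genomic)
print(theta5, theta3)
```

Each index depends on a single splice site, so the results are keyed by donor or by acceptor rather than by donor-acceptor pair.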

3 METHODS

To compute $\psi_5$, $\psi_3$, $\theta_5$ and $\theta_3$ for a given donor–acceptor pair, one needs to know five integers, $n(D, A)$, $\sum_{A'} n(D, A')$, $\sum_{D'} n(D', A)$, $r(D)$ and $r(A)$, of which only the first one depends on both $D$ and $A$, while the rest have a single argument. We developed bam2ssj, a pipeline for counting these five integers directly from BAM input. bam2ssj is implemented in C++ and depends on SAMtools (Li et al., 2009). The input consists of (i) a sorted BAM file containing reads that align uniquely to the genome or to splice junctions and (ii) a sorted GTF file containing the coordinates of exon boundaries. Each time the CIGAR string (Li et al., 2009) contains $x$M$y$N$z$M, where $x$, $y$ and $z$ are positive integers, the counter corresponding to the splice junction defined by $y$N is incremented. One mapped read may span several splice junctions and increment several counters. If the CIGAR string does not contain the $x$M$y$N$z$M pattern, the read is classified as genomic and increments $r(S)$ for every splice site $S$ it overlaps. Position-specific counts (Kakaradov et al., 2012) are implemented as a stand-alone utility that is not included in the current distribution. Importantly, bam2ssj counts reads that align to splice junctions in a strand-specific way, i.e. $n(D, A)$, $\sum_{A'} n(D, A')$, $\sum_{D'} n(D', A)$, $r(D)$ and $r(A)$ are reported for the correct (annotated) and incorrect (opposite to annotated) strand. We leave further processing of these counts by Equations (2)–(4) to the user.
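The CIGAR-based junction detection described above can be sketched in a few lines. This is not the bam2ssj source: it is a simplified Python illustration that works on a plain CIGAR string and a 1-based leftmost alignment position, ignores strand and the GTF annotation, and reports each junction as the first and last intronic base, which is a coordinate convention assumed here. The function name `junctions_from_cigar` is hypothetical.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # operations that advance the reference position


def junctions_from_cigar(pos, cigar):
    """Return (intron_start, intron_end) for every N operation flanked by M on both sides.

    pos: 1-based leftmost reference position of the alignment
    """
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    junctions = []
    ref = pos
    for i, (length, op) in enumerate(ops):
        flanked = 0 < i < len(ops) - 1 and ops[i - 1][1] == "M" and ops[i + 1][1] == "M"
        if op == "N" and flanked:
            junctions.append((ref, ref + length - 1))
        if op in REFERENCE_CONSUMING:
            ref += length
    return junctions


# Hypothetical alignments: one read spanning two junctions, one unspliced read.
reads = [(100, "20M500N30M200N26M"), (100, "76M"), (105, "15M500N61M")]
junction_counter = Counter()
for pos, cigar in reads:
    junction_counter.update(junctions_from_cigar(pos, cigar))
print(junction_counter)
```

A read with no qualifying N operation yields an empty list; in the procedure described above such a read would instead be counted as genomic for every splice site it overlaps, a step this sketch omits because it requires the annotation.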

4 RESULTS AND DISCUSSION

We validated bam2ssj by counting reads aligning to splice junctions in the whole-cell polyadenylated fraction of Cold Spring Harbor Long RNA-seq data (http://genome.ucsc.edu/ENCODE/). In total, 8 558 231 343 mapped reads were analyzed in 404 min (~350 000 reads/sec). 1 184 553 724 reads align to splice junctions, of which ~1% align to the opposite strand. 1 699 718 327 reads overlap annotated splice junctions, of which ~5% map to the opposite strand. The values of $n(D, A)$ coincide with those reported by ENCODE in 98.9% of cases (1 163 251 008 reads); all discrepancies were due to the ambiguity of CIGAR translation in the mapper’s output. Because RNA-seq data are increasingly processed into the compact BAM form, we propose that bam2ssj be used as a standard operating procedure for counting splice junction reads.

Funding: Grants BIO2011-26205 and CSD2007-00050 Consolider, Ministerio de Educación y Ciencia (Spain).

Conflict of Interest: none declared.

REFERENCES

  • Kakaradov B, et al. Challenges in estimating percent inclusion of alternatively spliced junctions from RNA-seq data. BMC Bioinformatics. 2012;13(Suppl. 6):S11.
  • Katz Y, et al. Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods. 2010;7:1009–1015.
  • Li H, et al. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079.
  • Montgomery S, et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature. 2010;464:773–777.
  • Pan Q, et al. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008;40:1413–1415.
  • Tilgner H, et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res. 2012;22:1616–1625.
  • Trapnell C, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 2012;7:562–578.

Articles from Bioinformatics are provided here courtesy of Oxford University Press