Copyright © 2009 The Author(s)

A hierarchical Bayesian model for comparing transcriptomes at the individual transcript isoform level

1Howard Hughes Medical Institute, University of California, Los Angeles, Los Angeles, CA 90095 and 2Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA

*To whom correspondence should be addressed. Tel: +1 213 740 2143; Fax: +1 213 740 8631; Email: liang.chen@usc.edu

Received December 15, 2008; Revised April 13, 2009; Accepted April 14, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The complexity of mammalian transcriptomes is compounded by alternative splicing which allows one gene to produce multiple transcript isoforms. However, transcriptome comparison has been limited to differential analysis at the gene level instead of the individual transcript isoform level. High-throughput sequencing technologies and high-resolution tiling arrays provide an unprecedented opportunity to compare transcriptomes at the level of individual splice variants. However, sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. Here we propose a hierarchical Bayesian model, BASIS (Bayesian Analysis of Splicing IsoformS), to infer the differential expression level of each transcript isoform in response to two conditions. A latent variable was introduced to perform direct statistical selection of differentially expressed isoforms. Model parameters were inferred based on an ergodic Markov chain generated by our Gibbs sampler.
BASIS has the ability to borrow information across different probes (or positions) from the same genes and different genes. BASIS can handle the heteroskedasticity of probe intensity or sequence read coverage. We applied BASIS to a human tiling-array data set and a mouse RNA-seq data set. Some of the predictions were validated by quantitative real-time RT–PCR experiments. INTRODUCTION It has been estimated that more than 90% of human genes are alternatively spliced (1,2). Multiple transcript isoforms produced from a single gene can lead to protein isoforms with distinct functions (3). Alternative splicing (AS) is widely involved in different physiological and pathological processes. Different tissues exhibit different AS patterns, and malfunctions in AS regulatory factors result in various developmental defects (4–8). Abnormal mRNA splicing contributes to many human diseases (9–11). Identifying differentially expressed distinct transcript isoforms is crucial to understanding transcriptional and post-transcriptional regulation of various processes. Thus, there is an urgent need to study the differences between transcriptomes at the individual transcript isoform level. Although full-length cDNA sequencing is a potential approach, it is expensive and labor-intensive, making transcriptome comparison an elusive goal. High-resolution tiling arrays and high-throughput sequencing technologies (e.g. RNA-seq) provide an unprecedented opportunity to compare transcriptomes at the individual splice variant level. Probes of tiling array are fixed, whereas RNA-seq provides a collection of randomly distributed reads. In the two platforms, each transcript is represented by many probes or covered by a large number of sequence reads. However, a short probe (~25-mer for Affymetrix chips or ~60-mer for NimbleGen chips) matches only a small portion of the transcript sequence. 
Thus, the probe intensity may not represent the expression level of a single transcript, but rather a family of splice variants. RNA-seq faces the same challenges due to short sequence reads (~35-mer for Illumina Solexa and Applied Biosystems SOLiD, ~200-mer for Roche 454 Life Sciences). Although junction reads are useful for identifying AS events, their low coverage limits statistical power, especially for low-abundance transcripts. In addition, some junction reads are not specific to one transcript isoform, but to a group of transcript isoforms. Novel data analysis methods are needed to fully utilize these high-throughput techniques for inferring transcriptome differences at the individual transcript isoform level, as AS is one of the major means of expanding genome information.

Transcriptome comparison at the individual transcript isoform level must jointly consider probes or sequence reads belonging to the same gene, because many genes have multiple alternatively spliced regions and many transcript isoforms do not contain any sequence positions or exon–exon junctions which exclusively appear in these isoforms. Instead, the uniqueness of these transcript isoforms is reflected by the uniqueness of exon combinations. If a nucleotide position (or exon junction) appears exclusively in one isoform but not in others, we call this position an isoform-specific position (or isoform-specific exon junction). Through the analysis (see details in Supplementary Data S1), we found that about 42% of human transcripts exhibit no isoform-specific positions, and about 57% have ≤ 50 base pair (bp) of isoform-specific positions. Approximately 66% of human multi-exon transcripts have no isoform-specific exon junctions. Among mouse transcripts, roughly 39% have no isoform-specific sequence positions, and about 57% have ≤ 50 bp of isoform-specific positions. Approximately 70% of the mouse multi-exon transcripts exhibit no isoform-specific exon junctions.
The distribution of the number of isoform-specific positions and isoform-specific exon junctions in a transcript is shown in Supplementary Figure S1. Another complication confronted during the analysis of differential isoform expression is that contiguous splicing choices cannot be directly obtained from either the high-resolution microarray data or the high-throughput sequencing data. Here we introduce an approach to comparing transcriptomes at the individual transcript isoform level. This was achieved by developing a hierarchical Bayesian model based on transcript splicing patterns assembled from public databases and high-resolution tiling-array or high-throughput sequencing data (specifically RNA-seq). We call this model BASIS (Bayesian Analysis of Splicing IsoformS). BASIS has the ability to borrow information across different probes (or positions) from the same and different genes when making statistical inferences. Differentially expressed transcript isoforms can be directly inferred from the model by introducing a latent variable and accounting for the heteroskedasticity of probe intensity or sequence read coverage. The usefulness of BASIS is illustrated by its application to a human tiling-array data set to compare HeLa and HepG2 cell lines (12) and to a mouse RNA-seq data set to compare brain, liver and muscle tissues (13). MATERIALS AND METHODS Hierarchical Bayesian model (BASIS) For each probe i that appears in at least one transcript isoform of gene g, consider the linear model:
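Consistent with the additive description given in the Results (the signal at a probe is the sum over the isoforms that contain it), the linear model for the difference between the two conditions can be sketched as follows. The indicator notation x_gij (entry of the probe arrangement matrix Xg for probe i and isoform j) and the isoform count J_g are notation assumed here for illustration.

```latex
\Delta Y_{gi} \;=\; \sum_{j=1}^{J_g} x_{gij}\,\Delta\beta_{gj} \;+\; \Delta\epsilon_{gi},
\qquad
\Delta\epsilon_{gi} \sim N(0,\ \delta_m) \quad \text{if probe } i \text{ falls into bin } m .
```

Here ΔYgi is the intensity (or read coverage) difference at probe i, Δβgj is the expression difference of isoform j, and δm is the bin-specific variance.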
A hierarchical Bayesian model is constructed as:
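A prior structure consistent with the hyperparameters named in this section (τ, ψ, p, v and λ) and with the variable-selection approach of George et al. (14) can be sketched as below. This is an assumed reconstruction in spike-and-slab form; in particular, the inverse-gamma parameterization for δm is an assumption and may differ from the authors' exact specification.

```latex
\begin{aligned}
\Delta\beta_{gj} \mid \gamma_{gj} &\sim (1-\gamma_{gj})\,N(0,\ \tau_{gj}) \;+\; \gamma_{gj}\,N(0,\ \psi_{gj}), \qquad \tau_{gj} \ll \psi_{gj},\\
\gamma_{gj} &\sim \mathrm{Bernoulli}(p),\\
\delta_m &\sim \mathrm{Inverse\text{-}Gamma}\!\left(\tfrac{v}{2},\ \tfrac{v\lambda}{2}\right).
\end{aligned}
```

Under this reading, γgj = 1 marks isoform j of gene g as differentially expressed, and γgj = 0 confines Δβgj to a narrow distribution around zero.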
The Gibbs sampler was used to generate a Markov chain and the posterior probabilities of Δβ, δ and γ were estimated from the chain. The variance parameter δm[0] was initialized to be the mean of intensity sum (y1+y2) for probes or positions in bin m. γ[0] was initialized as (1, …, 1)T. The Gibbs sampler at the k-th iteration proceeds as follows:
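The iteration can be illustrated with a minimal Python sketch of a spike-and-slab Gibbs sampler for a single gene. It is not the BASIS implementation: the function name `gibbs_ssvs`, the starting value for δ, and the flat-prior limit (v = 0) used for the variance draw are assumptions made for illustration, and the three conditional draws are the standard ones for a normal linear model with a two-component normal prior.

```python
import numpy as np

def gibbs_ssvs(X, dy, bins, tau, psi, p=0.5, n_iter=1500, burn_in=500, seed=0):
    """Hypothetical sketch of a spike-and-slab Gibbs sampler (not the BASIS code).

    X    : (n_probes, n_isoforms) 0/1 probe arrangement matrix
    dy   : (n_probes,) signal differences between the two conditions
    bins : (n_probes,) integer bin label per probe; every bin must be non-empty
    tau, psi : spike and slab prior variances of the isoform differences
    """
    rng = np.random.default_rng(seed)
    n, J = X.shape
    n_bins = int(bins.max()) + 1
    delta = np.full(n_bins, float(np.var(dy)))  # assumed start value
    gamma = np.ones(J, dtype=int)               # start from (1, ..., 1)
    gamma_sum = np.zeros(J)
    beta_sum = np.zeros(J)
    for k in range(n_iter):
        # 1. Draw isoform differences given the indicators and bin variances.
        w = 1.0 / delta[bins]
        prior_var = np.where(gamma == 1, psi, tau)
        cov = np.linalg.inv((X.T * w) @ X + np.diag(1.0 / prior_var))
        cov = 0.5 * (cov + cov.T)  # enforce symmetry against round-off
        beta = rng.multivariate_normal(cov @ (X.T @ (w * dy)), cov)
        # 2. Draw each bin variance from its inverse-gamma conditional (v = 0).
        resid = dy - X @ beta
        for m in range(n_bins):
            r = resid[bins == m]
            delta[m] = 0.5 * np.sum(r ** 2) / rng.gamma(0.5 * r.size)
        # 3. Draw each indicator from its Bernoulli conditional (slab vs spike).
        log_odds = (np.log(p) - np.log1p(-p) + 0.5 * np.log(tau / psi)
                    + 0.5 * beta ** 2 * (1.0 / tau - 1.0 / psi))
        prob_slab = 1.0 / (1.0 + np.exp(-np.clip(log_odds, -700, 700)))
        gamma = (rng.random(J) < prob_slab).astype(int)
        if k >= burn_in:
            gamma_sum += gamma
            beta_sum += beta
    kept = n_iter - burn_in
    return gamma_sum / kept, beta_sum / kept  # posterior means of gamma and beta
```

The returned posterior mean of γ is the quantity that a 0.5 cut-off would be applied to; a genome-wide version would share the δ draws across all genes rather than within one.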
For the choice of hyperparameters τ and ψ, we adopt a semi-automatic approach proposed by George et al. (14). In this approach, τgj and ψgj were selected by considering the prior odds of excluding an isoform from the model and a t-statistic threshold of including an isoform in the model. The quantity √(ψgj/τgj) is the ratio of the heights of N(0, τgj) and N(0, ψgj) at 0. Therefore √(ψgj/τgj) can be interpreted as the prior odds that transcript isoform j is declared as a non-differentially expressed transcript when Δβgj is very close to zero. We also consider the marginal densities of the least squares estimator of Δβgj under the two prior components, N(0, σΔβgj2 + τgj) and N(0, σΔβgj2 + ψgj), where σΔβgj2 is the variance of the least squares estimator. The intersection point of these two marginal densities is denoted as tgj σΔβgj so that the density of N(0, σΔβgj2 + ψgj) will be larger than the density of N(0, σΔβgj2 + τgj) if and only if the absolute value of the least squares estimator exceeds tgj σΔβgj. Therefore tgj can be interpreted as a t-statistic threshold of whether transcript isoform j should be declared as a differentially expressed transcript. Through simple calculation, it can be shown that tgj is a function of ψgj/τgj and σΔβgj2/τgj. Specifically, we chose τgj and ψgj relative to the standard error of the least squares estimator, a setting suggested by George et al. (14). It indicates that the prior odds that transcript isoform j is declared as a non-differentially expressed transcript when Δβgj is very close to zero is 100, and the t-statistic threshold tgj for the marginal density of the least squares estimator is about 2.17. Hyperparameter v = 0 (and any λ) and p = 0.5 were used to represent ignorance as suggested (14–16).

To study the robustness of BASIS to initial values and bin size, in the real data analysis, four Markov chains were generated according to four different settings: (I) Hyperparameters were chosen as described above.
We divided the probes (or positions) into 100 bins and δm was initialized as the mean of intensity sum (y1+y2) for probes or positions in bin m; (II) the same as (I) except that we used 20 bins; (III) the same as (I) except that we used 500 bins; (IV) the same as (I) except that we used 100 as the initial value for each δm. A total of 10 000 burn-in iterations followed by 40 000 iterations were generated to estimate the posterior probabilities.

To identify differentially expressed transcript isoforms, we used the median model decision rule (17) that includes variables with posterior probability Pr(γ = 1|data) larger than 0.5. Thus, transcript isoforms with posterior mean of γ larger than 0.5 were declared as differentially expressed. If the posterior mean of Δβgj for the differentially expressed transcript is positive, the isoform was declared to be up-regulated in HeLa for the comparison between HeLa and HepG2, or to be up-regulated in brain for the comparison between brain and liver or the comparison between brain and muscle. Otherwise, the isoform was declared to be down-regulated in HeLa or brain. The lists of genes and their isoforms analyzed can be found in Supplementary Tables S1–3. The differentially expressed isoforms can be found in Supplementary Tables S4–6. BASIS can be downloaded at http://www-rcf.usc.edu/~liangche/software.html.

Simulations

A total of 100 genes were simulated. Nine of them were simulated to have five transcript isoforms and some transcript isoforms were simulated to be differentially expressed. The other 91 genes were created by random draws from the real data and simulated to have no differentially expressed isoforms. The probe arrangements of the five isoforms for the nine differentially expressed genes were simulated as:
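A minimal sketch of such an arrangement, assuming that each block of 50 consecutive probes is absent from exactly one isoform and present in the other four. The total of 250 probes and the helper name `block_arrangement` are assumptions for illustration, not values taken from the original matrix.

```python
import numpy as np

def block_arrangement(n_isoforms=5, probes_per_block=50):
    """Hypothetical helper: probe block k is absent from isoform k only."""
    absent = np.kron(np.eye(n_isoforms, dtype=int),
                     np.ones((probes_per_block, 1), dtype=int))
    return 1 - absent  # rows are probes, columns are isoforms

E = block_arrangement()
# Row 0 (probe 1) is present in isoforms 2-5 only; row 50 (probe 51) in 1 and 3-5.
```

Under this assumed layout no probe is specific to a single isoform, yet the five columns are linearly independent, so the isoform-level differences remain estimable from the joint pattern.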
Probes 1–50 appear in isoforms 2–5; probes 51–100 appear in isoforms 1 and 3–5; and so on. The matrix E = {eij} was used as matrix Xg (g = 1, …, 9) in BASIS for the nine genes. ΔYgi was simulated as ΔYgi = Σj eij Δβgj + Δϵgi, where Δϵgi follows a normal distribution with mean 0 and variance δm, which is determined by the bin number for the probe. The choices of Δβg and δm are discussed as follows.
For the other 91 genes, we randomly selected the X matrix from the human data. They were simulated to have no differentially expressed isoforms (i.e. ΔYgi was simulated as Δϵgi because Δ βg = 0 for g = 10, …, 100). In total, there were 28 132 probes and 368 transcript isoforms. These probes were randomly assigned to 100 bins. For the m-th bin, the variance δm was simulated as m. About 1000 simulations were performed. For each simulation, we used a burn-in of 1000 iterations, followed by 4000 iterations. Hyperparameters were chosen as described before. Different thresholds for the posterior mean of γ were used to declare transcript isoforms as differentially expressed. The power and the false-positive rates were calculated as average values from those 1000 simulations. Instead of using the purely simulated matrix E [shown in (*)] for the nine differentially expressed genes, we also randomly selected a probe arrangement matrix from genes with five isoforms in the human data and used the matrix as E to simulate ΔY's for the differentially expressed genes as described. The other 91 genes without any differentially expressed isoforms were the same as described above. We performed 1000 simulations to calculate the average power when the average false positive rate is 0.005. Then, we repeated these procedures 100 times. Thus we tested 100 different matrix E's for the nine genes with differentially expressed isoforms. Tiling array data preprocessing Non-redundant transcript isoform information of human genes was downloaded from the AS and Transcript Diversity database (18) (http://www.ebi.ac.uk/astd/, release 1.1, names begin with ‘TRAN’) and the Ensembl Genome Browser (http://www.ensembl.org/index.html, release 50, names begin with ‘ENST’). Expression levels of these transcripts were from the whole-genome tiling arrays in which the human genome is split into 91 chips at 5-bp resolution (as measured from the central positions of adjacent 25-mer oligonucleotides) (12). 
We considered the expression data for HeLa and HepG2 cell lines in the cytosol. For each cell line, there were about three replicates. Those RNAs were polyadenylated and longer than 200 nt. The probe coordinates were from the NCBI version 35. The UCSC liftover tool (http://genome.ucsc.edu/) was used to convert the coordinates between version 35 and version 36. The probes mapped to intergenic regions were used as background probes to train the sequence-specific model that considers the composition of the nucleotides at each position of a 25-mer probe (19,20). Thus:
Probes with intensity level larger than 4 in at least one cell line were counted as qualified probes. We removed transcripts without any qualified probe. In other words, if a column of X is equal to vector 0, the corresponding transcript isoform was considered not expressed and was therefore removed. Recall that X is the probe arrangement matrix, in which each row represents a qualified probe and each column represents an isoform. Among the 35 351 annotated genes (141 295 transcripts), 29 085 genes have at least two qualified probes. Among the 29 085 genes, 3197 do not have enough qualified probes to distinguish different transcript isoforms. The case of not having enough qualified probes results in identical columns in matrix X (e.g. Xj = Xk where Xj is the j-th column and Xk is the k-th column). For example, considering gene A and gene B, each of which has three isoforms, their exon arrangements are:
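The column checks described here can be sketched in Python. The three matrices below are hypothetical stand-ins invented for this illustration, not the arrangements of gene A and gene B from the original figure, and the helper names are assumptions. The rank test is one plausible reading of what makes a gene un-identifiable even when all columns are distinct.

```python
import numpy as np

def identical_column_groups(X):
    """Return groups of isoform indices whose columns in X are identical."""
    groups = {}
    for j in range(X.shape[1]):
        groups.setdefault(tuple(X[:, j]), []).append(j)
    return [cols for cols in groups.values() if len(cols) > 1]

def is_identifiable(X):
    """True when the columns of X are linearly independent."""
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

# Hypothetical gene with three distinguishable isoforms.
gene_a = np.array([[1, 1, 1],
                   [1, 1, 0],
                   [0, 1, 1]])
# Hypothetical gene whose first two isoforms share every qualified probe.
gene_b = np.array([[1, 1, 1],
                   [1, 1, 0],
                   [1, 1, 0]])
# Hypothetical gene with distinct columns that are still linearly dependent.
gene_c = np.array([[1, 0, 1, 0],
                   [1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [0, 1, 0, 1]])
```

In `gene_c` the first two columns sum to the same vector as the last two, so no amount of data on these probes can separate the four isoform-level differences.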
RNA-seq data preprocessing

Non-redundant transcript isoform information of mouse genes was downloaded from the ASTD and Ensembl databases. Expression levels of these transcripts were from the high-throughput RNA-seq data for adult mouse brain, liver and muscle (13). For each tissue, there were two replicates. Uniquely mapped sequence reads from two replicates were pooled together and mapped to genes. The number of reads mapped to a position was treated as the read coverage over that position. The read coverage was multiplied by a constant to make the total number of reads equivalent for the three tissues (28 million). We compared brain with liver and brain with muscle. Positions with read coverage larger than 4 in at least one tissue were counted as qualified positions. For the comparison between brain and liver, among the 28 129 annotated genes (110 857 transcripts), 49% of them have fewer than two qualified positions. About 10% of them do not have enough qualified positions to distinguish different transcript isoforms and this cannot be explained by the lack of expression for these isoforms. In addition, 1% of them are un-identifiable. For the comparison between brain and muscle, 49% of them have fewer than two qualified positions. About 11% of them were removed because some isoforms cannot be distinguished and this cannot be explained by the fact that these isoforms are not expressed. Another 1% of them are un-identifiable. The details of the prescreening procedures can be found in Figure 2.

RNA preparation and qRT–PCR

Adult C57BL mouse brain, liver and muscle tissues were dissected and quickly submerged in Trizol (Invitrogen, CA) followed by immediate tissue homogenization. Total RNA samples were prepared according to the manufacturer's protocol (Invitrogen, CA). Cytosolic RNA samples of HeLa and HepG2 cells were generous gifts from Gingeras's group, the original authors of the tiling array data (12).
RNA were treated with RQ1 RNase-free DNase I (Roche Applied Science) at 1 U/μg RNA and reverse transcription was done as described previously (21). Real-time RT–PCR was performed as previously described (22) using SYBR Green Supermix on a Bio-Rad iQ5 thermocycler for 40 cycles at 60°C annealing temperature. Primers are listed in Supplementary Table S7. Each primer pair amplifies only one amplicon and the identity of RT–PCR product was confirmed by direct sequencing. Relative mRNA levels between brain and liver (brain/liver ratio) or between brain and muscle (brain/muscle ratio) were first normalized by geometric averaging of multiple internal control genes (including Gapdh, Sdha and mRps18a) (23) and then quantified using ΔΔCt method. Relative mRNA levels between HeLa and HepG2 cells were first normalized by geometric average of three internal control genes (HPRT1, RPLP0 and SDHA) and then quantified using ΔΔCt method. RESULTS In BASIS, for gene g, the probe intensity (or read coverage over each position) is modeled as the sum of the intensity of transcript isoforms containing this probe (or position):
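A small noise-free worked example of this additive assumption, using an invented gene with three isoforms and four probes (all numbers are hypothetical). Because the per-condition model is linear, subtracting the two conditions gives the same linear form in the isoform differences.

```python
import numpy as np

# Hypothetical probe arrangement: rows are probes, columns are isoforms.
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
beta_cond1 = np.array([5.0, 2.0, 1.0])  # isoform levels, condition 1
beta_cond2 = np.array([3.0, 2.0, 4.0])  # isoform levels, condition 2

y1 = X @ beta_cond1   # probe signal = sum over isoforms containing the probe
y2 = X @ beta_cond2
delta_y = y1 - y2     # equals X @ (beta_cond1 - beta_cond2)

# Without noise, least squares recovers the isoform-level differences.
delta_beta, *_ = np.linalg.lstsq(X, delta_y, rcond=None)
```

Every probe in this example changes between conditions, although the second isoform does not change at all, which is why probe-level comparison alone cannot identify the differentially expressed isoforms.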
Heteroskedasticity of probe intensity and sequence read coverage

Microarray noise has been shown to be scale dependent (24). Similarly, for RNA-seq data, the noise associated with read coverage over each position is proportional to the mean (Figure 3). In addition, the log ratio of probe intensity under two conditions cannot be modeled as the sum of isoform differences at the log scale, because the logarithm of a sum of isoform intensities is not the sum of their logarithms. To handle the heteroskedasticity and concomitantly maintain the valid linear isoform combination assumption, we divided all of the probes (or positions) across the whole genome into bins according to their intensity values (or read coverage) under two conditions. Probes (or positions) with similar intensity values have similar variances, and different variance parameters were specified for different bins. Thus, Δϵgi ~ N(0, δm) if probe i (or position i) falls into bin m. The large number of probes (or positions) in each bin provided a more stable variance estimate for Δϵgi than that estimated from very few experimental replicates (e.g. two or three replicates) of a single probe. We therefore borrowed strength across probes from different genes.
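The binning step can be sketched as follows. Equal-count bins ordered by the intensity sum y1 + y2 are an assumption made here (the text states only that probes are grouped by their values under the two conditions), and the function names are hypothetical. The second helper mirrors the initialization of δm as the mean intensity sum within each bin.

```python
import numpy as np

def assign_bins(y1, y2, n_bins=100):
    """Assign probes to equal-count bins ordered by intensity sum (assumed rule).

    Requires at least n_bins probes so that no bin is empty.
    """
    total = np.asarray(y1, dtype=float) + np.asarray(y2, dtype=float)
    order = np.argsort(total, kind="stable")
    bins = np.empty(total.size, dtype=int)
    bins[order] = np.arange(total.size) * n_bins // total.size
    return bins

def initial_delta(y1, y2, bins, n_bins):
    """Start value of each bin variance: mean of y1 + y2 within the bin."""
    total = np.asarray(y1, dtype=float) + np.asarray(y2, dtype=float)
    return np.array([total[bins == m].mean() for m in range(n_bins)])
```

With many thousands of probes per bin genome-wide, each δm is informed by far more observations than the two or three replicates available for any single probe.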
Power analysis

We studied the statistical power of BASIS, particularly when the isoform information was incomplete. A total of 100 genes were simulated each time. Nine genes were simulated to have five potential transcript isoforms, and some transcript isoforms were differentially expressed. The other 91 genes were simulated to have no differentially expressed transcript isoforms. Different scenarios in terms of Δβgj and the completeness of the transcript isoform information were examined. The details of the simulation settings can be found in ‘Materials and Methods’ section. Table 1 compares the power of BASIS and the least squares fit when the total false-positive rate was controlled at 0.005. A particular probe arrangement matrix E [shown in (*)] was used for this study. The power of BASIS is 0.76 when the false-positive rate is 0.005. This demonstrates that BASIS can correctly identify most of the differentially expressed isoforms (13.68 out of 18 on average) and also correctly declare non-differentially expressed isoforms (348.25 out of 350 on average). Note that the additional 91 genes have no differentially expressed isoforms, so there are a total of 350 non-differentially expressed isoforms. BASIS has a much larger statistical power than the least squares fit (0.76 versus 0.31). This is due to the fact that errors for probes of the same genes are heteroskedastic, and BASIS takes this into account. We also separately calculated the power and the false-positive rate for genes 1–9. For example, gene 1 has five transcript isoforms, two of which are differentially expressed. Thus, the total number of positive instances is 2 and the total number of negative instances is 3 when we calculate the power and false-positive rate for gene 1. The settings for genes 2 and 3 are similar to those for gene 1, except for the differential signals. When the differential signal increases (Δβgj from 1.8 to 2.4), the performance of the model improves (power from 0.74 to 0.96).
When the information for one differentially expressed isoform is lacking (i.e. there is no annotation about the transcript isoforms, but it exists in cells and is differentially expressed), the inferences for other isoforms are still reliable (genes 4–6). The worst situation is when the differential signal for the missing isoform is very high (Δβgj = 2.4 for isoform 4 of gene 9) and this isoform demonstrates a high correlation with other known isoforms (the correlation between isoforms 4 and 3 is about 0.63). The false-positive rate can be as high as 0.4. The results demonstrate that when the AS information is incomplete and the missing transcript isoform that has not been annotated is actually differentially expressed, the model still performs well if the missing transcript isoform has a reasonably low correlation with other known transcripts or the differential signal is low.
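The power and false-positive rates reported in this section follow from simple counts, as the sketch below illustrates. The function name and the toy inputs are hypothetical; the two reported figures are recomputed from the averages quoted in the text (13.68 of 18 differentially expressed isoforms detected, 348.25 of 350 non-differentially expressed isoforms correctly left out).

```python
import numpy as np

def power_and_fpr(post_gamma, is_de, threshold=0.5):
    """Power and false-positive rate for a cut-off on the posterior mean of gamma."""
    post_gamma = np.asarray(post_gamma, dtype=float)
    is_de = np.asarray(is_de, dtype=bool)
    called = post_gamma > threshold
    power = np.sum(called & is_de) / np.sum(is_de)
    fpr = np.sum(called & ~is_de) / np.sum(~is_de)
    return float(power), float(fpr)

# Averages quoted in the text for the overall comparison.
power_reported = 13.68 / 18            # fraction of true isoforms detected
fpr_reported = (350 - 348.25) / 350    # fraction of null isoforms called
```

Sweeping `threshold` over a grid and averaging over simulations would trace out the trade-off from which a fixed false-positive rate such as 0.005 is read off.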
Besides the purely simulated probe arrangement matrix E [shown in (*)] for genes with differentially expressed isoforms, we also tested another 100 different probe arrangement matrices E randomly drawn from the real data (genes in the human data with five isoforms). For each matrix E, the same simulation settings as described in the ‘Materials and Methods’ section were used: nine genes with differentially expressed isoforms were simulated, together with another 91 non-differentially expressed genes. The overall power of BASIS and the least squares fit for the 100 genes was calculated based on 1000 simulations for each E. As shown in Figure 4
HeLa and HepG2 tiling-array data analysis The array data was obtained from Kapranov et al. (12), who profiled the cytosolic polyadenylated [poly(A)+] RNAs in HeLa and HepG2 cell lines using whole-genome 5-bp resolution tiling arrays. The known or predicted human transcript isoform splicing patterns were obtained from the ASTD and Ensembl databases. After the preprocessing to remove unexpressed genes etc., a total of 110 528 transcripts (26 808 genes) were considered in BASIS. Overall, 11 854 transcripts were differentially expressed between HeLa and HepG2 cells. About 8851 transcripts were up-regulated in HeLa cells, and the remaining 3003 transcripts were up-regulated in HepG2 cells. These differentially expressed transcripts belong to 9191 genes, indicating that some genes have more than one differentially expressed transcript isoform. Specifically, 1892 genes have more than one differentially expressed transcript isoform. More interestingly, 789 exhibited at least one up-regulated isoform and at least one down-regulated isoform in HeLa compared to HepG2 cells. These have been summarized in the workflow Figure 2 The convergence of the chain was evaluated by tracing the variance estimate δm for each bin m. All of the variance estimate δm passed the Geweke's diagnostic, the Raftery and Lewis's diagnostic, and the Heidelberger and Welch's convergence diagnostic implemented in the R package ‘coda’ (25). Using the posterior mean of Δβ, we calculated the residual of each probe (Δϵ). Residuals falling in the same bin should be approximately normally distributed, with a mean of 0 and variance equal to the estimated variance δm for this bin. The residual Q–Q plots (Supplementary Figure S2) show that the distributions of residuals were similar to those expected. Mouse brain, liver and muscle RNA-seq data analysis The RNA-seq data was obtained from Mortazavi et al. 
(13), who used Solexa high-throughput sequencing to quantify the poly(A)+ RNA in adult mouse brain, liver and muscle. The sequence read coverage at nucleotide resolution was normalized across different tissues such that the total number of reads was equivalent. Similarly, the known or predicted mouse transcript isoform splicing patterns were obtained from the ASTD and Ensembl databases. For the comparison between brain and liver, 35 715 transcripts were differentially expressed. About 21 188 transcripts were up-regulated in brain, and the others were up-regulated in liver. These transcripts correspond to 10 771 genes. About 7699 genes have more than one differentially expressed transcript isoform. Among them, 5711 exhibited at least one up-regulated isoform and at least one down-regulated isoform in brain compared to liver. For the comparison between brain and muscle, 34 126 transcripts belonging to 10 554 genes were differentially expressed. About 19 851 of the transcripts were up-regulated in brain and the others were up-regulated in muscle. Among these differentially expressed genes, 7392 have more than one differentially expressed transcript isoforms and 5498 of them exhibited at least one up-regulated isoform and at least one down-regulated isoform in brain compared to muscle. The above results have also been summarized in the workflow Figure 2 The convergence of the chain was further evaluated by tracing the variance estimate δm for each bin. All of the variance estimate δm passed the Geweke's diagnostic, the Raftery and Lewis's diagnostic, and the Heidelberger and Welch's convergence diagnostic. The residual Q-Q plots (Supplementary Figures S3 and S4) show that the residuals in the same bin were approximately normally distributed, with a mean of 0 and variance equal to the variance estimated for this bin. Using the junction reads as an independent data resource, we evaluated the performance of BASIS. 
We mapped the splice-spanning reads to the transcript isoforms considered in BASIS. About 3732 transcript isoforms have at least one sequence read over their isoform-specific splice junctions in brain and (or) liver. For the comparison between brain and muscle, the number is 3679. Isoform-specific splice junction means that no other transcript isoforms contain the same junction. Such transcripts were designated as ‘present’ in tissues. This is a stringent criterion, as many of the truly present transcripts may not have any isoform-specific splice junctions (Supplementary Figure S1) or may not have any reads over their isoform-specific splice junctions owing to the low abundance. We declared the transcripts with junction read difference larger than four as differentially expressed. As shown in Figure 5
Robustness of BASIS to bin size and initial value specifications

The hyperparameters of BASIS were chosen by a semi-automatic approach or chosen as non-informative values to represent ignorance as described in Materials and Methods section. We also studied the robustness of BASIS for different bin sizes and initial values. Four Markov chains were generated according to different bin sizes or different initial values (see details in Materials and Methods section). For the tiling-array data, among the declared differentially expressed transcripts, 95% of them can be detected by all of the chains. For the RNA-seq data, 91% and 88% of them can be detected by all of the chains for the brain-liver comparison and the brain-muscle comparison respectively. Specifically, for different bin sizes, the overlap among the three scenarios (20, 100, 500 bins) is about 89–95%. The results are not exactly the same because the strength borrowed from probes (or positions) differs with bin size. For different initial values, the overlap among results is about 98–99%. These results suggest that the inference is robust to the choice of bin size and initial values.

Experimental validation

To further examine the prediction power of BASIS, we subsequently performed real time RT–PCR experiments to assay transcript isoforms’ relative expression levels between adult mouse brain and liver, between adult mouse brain and muscle, and between HeLa and HepG2 cells. We were particularly interested in genes whose isoforms show distinct differential expression patterns between the two conditions. For example, one transcript isoform is up-regulated in brain relative to liver, whereas another transcript isoform of the same gene is down-regulated or is not differentially expressed. For each tested transcript isoform, we designed one of the two PCR primers from the isoform-specific exonic region or exon junction that exclusively represents the isoform.
For the RNA-seq data, we randomly tested the relative expression levels of 14 transcript isoforms between mouse brain and liver (Figure 6
For the tiling-array data, we randomly tested 12 transcript isoforms in HeLa and HepG2 cells (Figure 6). The RT–PCR products of the tested transcripts in different tissues/cells were examined by agarose gel electrophoresis (Figure 7).
DISCUSSION Because AS dramatically increases the complexity of eukaryotic transcriptomes, two transcriptomes can be precisely compared only through the expression level of each isoform, but not individual probes or exons, to more accurately deduce gene expression regulation. In this article, we proposed a hierarchical Bayesian model (BASIS) to identify splicing isoforms that are differentially expressed between two conditions. BASIS integrates known splicing information to fully utilize high-density tiling-array or high-throughput RNA-Seq data. BASIS jointly considers all probes (or positions) targeting the same gene to infer the differential expression level. Sequence read coverage or probe intensity at each position may represent a family of splice variants instead of one single isoform. As shown in Supplementary Figure S1, many transcript isoforms do not contain any isoform-specific sequence positions or isoform-specific exon–exon junctions. Individual probe intensities or sequence reads may not provide direct evidence to distinguish differentially expressed transcript isoforms. BASIS tackles this problem by allocating the intensity of each probe (or sequence read coverage) to multiple transcript isoforms and integrating multiple probe intensities (or sequence read coverage values) for the same gene. Another advantage of jointly considering probes is that a superior signal-to-noise estimate can be achieved by utilizing information from every probe (or sequence read). If the expression intensity is compared probe by probe between two conditions, the high noise level of an individual probe would make the comparison less reliable. However, if we consider the joint behavior of all probes targeting the same gene, the results become much more reliable. In addition, inferences at the transcript isoform level instead of the probe level deliver a more biologically interpretable result. 
Second, BASIS accounts for the heteroskedasticity of probe intensity or sequence read coverage and has much higher statistical power than the least squares fit. We gathered together all of the probes (or read coverage over positions) from different genes and divided them into 100 bins. Probes (or positions) within the same bin share the same variance. Therefore, strength could be borrowed across genes in estimating the variance in probe intensity (or the variance in read coverage). This is particularly crucial when there are only a few replicates for each tiling-array or RNA-seq experiment. The approach of binning probes to calculate stable estimates of variances has also been used by Johnson et al. (19). In addition, BASIS can be extended to handle the ‘large p and small n’ issue. When the number of potential transcript isoforms is larger than the number of data points available, BASIS maintains the flexibility in statistical inference, whereas the traditional least squares fit requires the number of potential transcript isoforms to be smaller than the number of probes (or positions). Empirical and hierarchical Bayesian approaches have been applied to gene-level microarray analyses in which each gene is represented by one probe and information is borrowed across different genes (26,27).

Third, the latent variable γ was introduced into BASIS in order to perform variable selection. In many biological conditions, only a portion of the transcript isoforms is expressed. The latent variable can directly identify the transcript isoforms of interest and leads to an interpretable model. Simulation studies show that BASIS has an approximately 2-fold increase in power compared to the least squares fit (Table 1 and Figure 4).

The predicted tissue-specific transcript isoforms have functional significance. For example, Slc25a25 is a type of calcium binding mitochondrial ATP-Mg/Pi transporter (29,30).
Interestingly, its rat ortholog was found to be expressed much more highly in liver than in brain (31), whereas its human homolog was shown to be expressed much more highly in brain than in liver (29). Such a discrepancy may not be due to species variation. It is more likely due to the tissue-specific expression of alternatively spliced variants, as Mashima et al. (31) used a probe specific to the liver isoform (TRAN00000157033), consistent with our real time RT–PCR results (Figure 6).

BASIS focuses on the direct detection of differentially expressed transcript isoforms between two conditions. Several groups have developed algorithms to estimate transcript abundance, but not the difference in abundance. The difference in isoform abundances can be conveniently modeled as a normally distributed variable. This is based on the fact that the difference between two normally distributed variables remains normal (for tiling-array data) and the difference between two Poisson-distributed variables is approximately normal (for sequencing data). The Q–Q plots in Supplementary Figures S2–S4 confirm that the normal distribution is a valid assumption.

Shai et al. (33) developed the GenASAP algorithm to infer the expression levels of transcript isoforms including or excluding a cassette exon. This was designed specifically for a custom microarray in which an exon-skipping event is represented by three exon body probes and three junction probes (33). If a gene has more than one alternative exon, and consequently more than two transcript isoforms, GenASAP cannot distinguish among isoforms that all include the tested cassette exon, nor can it distinguish among isoforms that all exclude the tested cassette exon. In contrast, BASIS can deal with genes with more than two transcript isoforms. Shai et al.
(33) used a truncated normal distribution (β ≥ 0) to satisfy the non-negative constraint on isoform abundance and, because the exact posterior cannot be computed, maximized a lower bound of the log likelihood instead of the log likelihood itself during their variational EM learning. Such a normal approximation may be inappropriate for RNA-seq data. BASIS focuses on the difference in isoform abundances, and the normal approximation for the difference is valid for both the tiling-array data and the RNA-seq data. In GenASAP, the calculation based on the lower bound of the log likelihood may introduce bias in the estimation of isoform abundance. In addition, GenASAP provides no direct statistical inference for differential expression patterns, and genes are tested separately. In contrast, BASIS performs direct inference on the differentially expressed isoforms and borrows information from different genes.

Anton et al. (34) proposed the SPACE algorithm to predict the structures and the abundances of transcript isoforms from microarray data. A ‘non-negative matrix factorization’ method was applied to handle the non-negative constraints. The numerical approximation involving non-negative constraints is a computation-intensive task, especially when thousands of genes are considered in a single study. In addition, SPACE has no direct statistical inference for the comparison of two transcriptomes. Anton et al. (34) provided the MATLAB code for SPACE. We therefore used SPACE to predict transcript isoform abundances and carried out differential studies by comparing the isoform abundances between two conditions. We performed simulation studies to compare BASIS and SPACE. BASIS has a much higher statistical power than SPACE given the same false positive rate (e.g. 0.87 versus 0.04 at a false positive rate of 0.06; see Supplementary Data S1 and Supplementary Table S8 for details).
The low power of SPACE may be due to the fact that SPACE assumes that the gene structure is unknown and that only two experiments (or conditions) were considered. Anton et al. (34) reported that the estimation of isoform structure and abundance depends on the number of experiments. When there are only a few experiments, the estimation error tends to be high. In contrast, BASIS utilizes the known isoform structure and borrows information across different genes. It works well even when there are only two experiments (or conditions).

BASIS focuses on the direct inference of differentially expressed transcript isoforms. The Markov chains generated by the Gibbs sampler converged very quickly and, in theory, the empirical distributions of the hidden variables based on those homogeneous ergodic Markov chains will converge to the actual posterior probabilities (14).

BASIS can handle both microarray data and RNA-seq data. For the RNA-seq data, we used the uniquely mapped reads for each gene and ignored the multireads that can be mapped to multiple positions in the mouse genome. Inclusion and proportionate allocation of multireads have been reported to impact RNA quantification (13). In the present study, we focused on the differential expression patterns of transcript isoforms. Excluding or including the multireads under both conditions has only a small effect on the final results.

An isoform-specific exonic region is needed to accurately assay the expression level of a transcript isoform by real time RT–PCR. Because many transcripts are unique in their exon combinations rather than in isoform-specific exon positions (Supplementary Figure S1), the number of transcripts one can directly test is significantly reduced. Novel experimental techniques are needed in the future to solve this problem. However, through simulation studies, we found that the power of BASIS is not related to the percentage of isoform-specific positions (P-value for the correlation = 0.29).
Therefore, the real time RT–PCR validation results on transcripts with isoform-specific positions can still be treated as a fair evaluation of BASIS. We also noted that about 1% of the genes in our data are unidentifiable because the columns of X are perfectly collinear (i.e. a column Xj is a linear combination of the other columns). For those genes, additional information from other types of experiments is required to infer the differentially expressed isoforms.

FUNDING National Institutes of Health [P50 HG 002790]; the American Federation for Aging Research Grant. Funding for open access charge: University of Southern California.

Conflict of interest statement. None declared.

Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS We thank Michael Waterman (USC) for critical reading of this manuscript and Hongyu Zhao (Yale) for the scientific critiques during the manuscript preparation. We would also like to acknowledge Thomas Gingeras’ group (Affymetrix) for providing the cytosolic RNA of HeLa and HepG2 cell lines. We would like to specially thank Doug Black (UCLA) for his scientific advice on the study and generous support for the validation experiments.

REFERENCES
1. Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes. Nature. 2008;456:470–476.
2. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008;40:1413–1415.
3. Black DL. Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell. 2000;103:367–370.
4. Ding JH, Xu X, Yang D, Chu PH, Dalton ND, Ye Z, Yeakley JM, Cheng H, Xiao RP, Ross J, et al. Dilated cardiomyopathy caused by tissue-specific ablation of SC35 in the heart. EMBO J. 2004;23:885–896.
5. Jumaa H, Wei G, Nielsen PJ. Blastocyst formation is blocked in mouse embryos lacking the splicing factor SRp20. Curr. Biol. 1999;9:899–902.
6. Xu X, Yang D, Ding JH, Wang W, Chu PH, Dalton ND, Wang HY, Bermingham JR, Jr, Ye Z, Liu F, et al. ASF/SF2-regulated CaMKIIdelta alternative splicing temporally reprograms excitation-contraction coupling in cardiac muscle. Cell. 2005;120:59–72.
7. Jensen KB, Dredge BK, Stefani G, Zhong R, Buckanovich RJ, Okano HJ, Yang YY, Darnell RB. Nova-1 regulates neuron-specific alternative splicing and is essential for neuronal viability. Neuron. 2000;25:359–371.
8. Kanadia RN, Johnstone KA, Mankodi A, Lungu C, Thornton CA, Esson D, Timmers AM, Hauswirth WW, Swanson MS. A muscleblind knockout model for myotonic dystrophy. Science. 2003;302:1978–1980.
9. Faustino NA, Cooper TA. Pre-mRNA splicing and human disease. Genes Dev. 2003;17:419–437.
10. Garcia-Blanco MA, Baraniak AP, Lasda EL. Alternative splicing in disease and therapy. Nat. Biotechnol. 2004;22:535–546.
11. Blencowe BJ. Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. Trends Biochem. Sci. 2000;25:106–110.
12. Kapranov P, Cheng J, Dike S, Nix DA, Duttagupta R, Willingham AT, Stadler PF, Hertel J, Hackermuller J, Hofacker IL, et al. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science. 2007;316:1484–1488.
13. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods. 2008;5:621–628.
14. George EI, McCulloch RE. Variable selection via Gibbs sampling. J. Am. Stat. Assoc. 1993;88:881–889.
15. Chipman H, George EI, McCulloch RE. The practical implementation of Bayesian model selection. IMS Lect. Notes Monogr. Ser. 2001;38:67–131.
16. George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat. Sinica. 1997;7:339–373.
17. Barbieri MM, Berger JO. Optimal predictive model selection. Ann. Stat. 2004;32:870–897.
18. Stamm S, Riethoven JJ, Le Texier V, Gopalakrishnan C, Kumanduri V, Tang Y, Barbosa-Morais NL, Thanaraj TA. ASD: a bioinformatics resource on alternative splicing. Nucleic Acids Res. 2006;34:D46–D55.
19. Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS. Model-based analysis of tiling-arrays for ChIP-chip. Proc. Natl Acad. Sci. USA. 2006;103:12457–12462.
20. Kapur K, Xing Y, Ouyang Z, Wong WH. Exon arrays provide accurate assessments of gene expression. Genome Biol. 2007;8:R82.
21. Chen L, Zheng S. Identify alternative splicing events based on position-specific evolutionary conservation. PLoS ONE. 2008;3:e2806.
22. Boutz PL, Chawla G, Stoilov P, Black DL. MicroRNAs regulate the expression of the alternative splicing factor nPTB during muscle development. Genes Dev. 2007;21:71–84.
23. Vandesompele J, De Preter K, Pattyn F, Poppe B, Van Roy N, De Paepe A, Speleman F. Accurate normalization of real-time quantitative RT-PCR data by geometric averaging of multiple internal control genes. Genome Biol. 2002;3:RESEARCH0034.
24. Rocke DM, Durbin B. A model for measurement error for gene expression arrays. J. Comput. Biol. 2001;8:557–569.
25. Plummer M, Best N, Cowles K, Vines K. CODA: convergence diagnosis and output analysis for MCMC. R News. 2006;6:7–11.
26. Lonnstedt I, Speed T. Replicated microarray data. Stat. Sinica. 2002;12:31–46.
27. Nott DJ, Yu ZM, Chan E, Cotsapas C, Cowley MJ, Pulvers J, Williams R, Little P. Hierarchical Bayes variable selection and microarray experiments. J. Multivariate Anal. 2007;98:852–872.
28. Xing Y, Yu T, Wu YN, Roy M, Kim J, Lee C. An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs. Nucleic Acids Res. 2006;34:3150–3160.
29. Fiermonte G, De Leonardis F, Todisco S, Palmieri L, Lasorsa FM, Palmieri F. Identification of the human mitochondrial ATP-Mg/Pi transporter. Bba-Bioenergetics. 2004;1658:191.
30. del Arco A, Satrustegui J. Identification of a novel human subfamily of mitochondrial carriers with calcium-binding domains. J. Biol. Chem. 2004;279:24701–24713.
31. Mashima H, Ueda N, Ohno H, Suzuki J, Omata M. A novel mitochondrial Ca2+-dependent solute carrier in the liver identified by mRNA differential display. Gastroenterology. 2003;124:A127.
32. Mariottini P, Shah ZH, Toivonen JM, Bagni C, Spelbrink JN, Amaldi F, Jacobs HT. Expression of the gene for mitoribosomal protein S12 is controlled in human cells at the levels of transcription, RNA splicing, and translation. J. Biol. Chem. 1999;274:31853–31862.
33. Shai O, Morris QD, Blencowe BJ, Frey BJ. Inferring global levels of alternative splicing isoforms using a generative model of microarray data. Bioinformatics. 2006;22:606–613.
34. Anton MA, Gorostiaga D, Guruceaga E, Segura V, Carmona-Saez P, Pascual-Montano A, Pio R, Montuenga LM, Rubio A. SPACE: an algorithm to predict and quantify alternatively spliced isoforms using microarrays. Genome Biol. 2008;9:R46.