Cancer Inform. 2011; 10: 205–215.
Published online Aug 1, 2011. doi:  10.4137/CIN.S7473
PMCID: PMC3153162

On Differential Gene Expression Using RNA-Seq Data

Abstract

Motivation:

RNA-Seq is a novel technology that provides read counts of RNA fragments in each gene, including the mapped positions of each read within each gene. Besides many other applications it can be used to detect differentially expressed genes. Most published methods collapse the position-level read data into a single gene-specific expression measurement. Statistical inference proceeds by modeling these gene-level expression measurements.

Results:

We present a Bayesian method of calling differential expression (BM-DE) that directly models the position-level read counts. We demonstrate the potential advantage of the BM-DE method compared to existing approaches that rely on gene-level aggregate data. An important additional feature of the proposed approach is that BM-DE can be used to analyze RNA-Seq data from experiments without biological replicates. This becomes possible since the approach works with multiple position-level read counts for each gene. We demonstrate the importance of modeling for position-level read counts with a yeast data set and a simulation study.

Availability:

A public domain R package is available from http://odin.mdacc.tmc.edu/~ylji/BMDE/.

Keywords: clustering, false discovery rate, mixture models, next-generation sequencing

1. Introduction

1.1. RNA-Seq experiments

RNA-Seq is a high-throughput sequencing technology that has recently emerged as a popular methodology to measure gene expression with high accuracy. It generates millions of short reads of mRNA or cDNA. The short reads are mapped to the genome, resulting in a sequence of read counts at millions of genomic positions.1,2 RNA-Seq exhibits a high level of reproducibility,1 and mitigates many limitations of microarrays.3 Consequently, RNA-Seq enables researchers to investigate more complex aspects of the transcriptome, such as allele-specific expression and the discovery of novel promoters and isoforms,4 and to develop new approaches to old but fundamental biological questions. An example of the latter is the identification of differentially expressed genes between two conditions.

RNA-Seq experiments produce data on millions of short reads. The data report the base sequence of the reads and the positions on the genome to which the reads are mapped. Most current methods collapse the position-level read counts into a single gene-level summary, such as the number of reads that map per kilobase of exon model per million mapped reads (RPKM), or simply the sum of all the read counts across positions within each gene. Simple hypothesis testing based on the gene-level summaries implements inference on differential gene expression under two biological conditions. We take a different approach. We start by modeling the position-level read counts within a gene. This enables us to account for position outliers among position-level counts. We demonstrate that failure to identify and downweight outliers can bias gene-level summaries. Our hierarchical modeling approach then proceeds with borrowing information across genes for the inference of differential expression. Furthermore, we specifically model systemic biases such as total RNA amount in each experiment to achieve better accuracy in calling differentially expressed genes.

Ji and Liu5 illustrated how inference with a Bayesian hierarchical model can improve statistical inference for high-throughput experiments. They also highlighted that borrowing information across loci through a hierarchical model can improve statistical inference even in the case without biological replicates. We follow this advice and propose a Bayesian hierarchical model that effectively utilizes position-level information and accounts for the variability in all the position-level read counts mapped to each gene. We demonstrate the superior performance of the BM-DE method, even when the RNA-Seq data are generated from experiments without biological replicates. Due to the still elevated cost of RNA-Seq, many studies are carried out without replicates. In such experiments, only one biological sample is prepared per condition for a single run of RNA sequencing. We show that the BM-DE method reduces the number of false positive findings. Note that this does not imply that the BM-DE method can account for the biological variation in such experiments. This is impossible without replicates. We recommend using sound and efficient experimental designs6 with biological replicates for RNA-Seq experiments. For existing data, some without replicates, the proposed BM-DE approach can be used to increase the precision of calling differentially expressed genes.

1.2. Inference for RNA-Seq data

RNA-Seq data are usually normalized across libraries to adjust for different total read counts by lanes or by samples. In early work, researchers simply used cumulative counts, summing up read counts across positions, followed by minor normalization to account for gene length and the total number of reads.7 Recently, more sophisticated normalization methods were proposed. For example, see Robinson and Oshlack8 and Balwierz et al.9

With the single expression summary per gene per condition, most statistical modeling and inference for differential expression has been based on classical hypothesis testing, such as Fisher’s exact test, likelihood ratio tests, or t-tests. For example, Marioni et al10 modeled read counts with a Poisson distribution, and used a likelihood ratio test to identify differentially expressed genes. Similar to Marioni et al,10 Wang et al1 used a Poisson distribution to test differential expression for experiments without biological replicates. Robinson and Smyth11 developed a negative binomial model to account for the variation across replicate samples. They estimated a common dispersion using all tags, and shrank the dispersions of individual tags toward the estimated common dispersion, in a manner similar to an empirical Bayes approach. edgeR12 implements this model for application to RNA-Seq data. Bullard et al13 compared the performance of various hypothesis tests, and found poor performance of the t-test, in particular for genes with low counts. They also studied biases introduced by gene length and the normalization procedure. They observed that the t-test tends to yield significant test statistics more frequently for longer genes. This is due to the dependence of the estimated standard error on the mean read counts.

Oshlack and Wakefield14 further investigated the transcript length bias in RNA-Seq data for differential expression. They illustrated that the standard approaches that use aggregate read counts for each gene in differential expression are subject to significant bias, and that a simple adjustment, dividing by the transcript length, does not entirely remove this bias. Young et al15 accounted for the transcript length bias in RNA-Seq data, and developed a statistical model for gene ontology analysis.

Bayesian approaches for differential expression in RNA-Seq data have been developed by many researchers, such as Anders and Huber,16 Hoen et al,3 Taub17 and Wu et al.18 Wu et al18 took an empirical Bayes approach to detect differential expression for RNA-Seq data when biological replicates are not available. They developed a hierarchical model with aggregate counts at the gene level to estimate the log fold change in gene expression, and mitigated the limitation of experiments without replicates by borrowing strength across all genes.

Differently from the previous approaches, the methods proposed in Jian and Wong,19 Salzman et al,20 and Li et al21 used models to estimate gene expression at the isoform level. Oshlack et al4 provided a broad review on current research in preprocessing RNA-Seq data and identifying differentially expressed genes.

In this paper, we propose a novel method for the inference on differential gene expression with three distinct features:

  • We explicitly model the read count at each genomic position within a gene. The proposed model can reduce the false positive rate by accounting for the dispersion in the position-specific counts. As another desirable consequence of position-level modeling the length bias disappears. We show significant improvements over existing models that only use gene-level summaries.
  • The proposed method does not require prior normalization of the mapped read counts. Instead we simultaneously carry out the normalization and the inference on differential expression.
  • We borrow strength across genes in a hierarchical model. Thus, the detection of differentially expressed genes is informed by the expression measurements in the entire data set.

A related important feature is that borrowing strength across genes in the hierarchical model allows meaningful model-based inference without replicates, if desired.

Section 2 describes the proposed Bayesian model. Section 3 reports the data analysis for the yeast data. Section 4 describes a small simulation study. The last section concludes with a final discussion. The manuscript and R programs with a simple example are available at http://odin.mdacc.tmc.edu/~ylji.

2. Probability Model

RNA-Seq data contain millions of read counts, with each read mapped to a genomic position within a gene. Such count data can be easily assembled from the standard output of upstream read alignment, eg, using SOAP or BOWTIE.22 We consider counts, nij and mij, of mapped reads starting at position j of gene i under two different experimental conditions, 0 and 1, respectively. Here i = 1, …, I and j = 1, …, Ji. Let Nij = nij + mij denote the total count over the two conditions at position j of gene i. For ad-hoc inference about differential expression we may consider the empirical fraction rij = nij/Nij as the position-level ratio, or ri = Σj nij / Σj Nij as the gene-level ratio. The proposed model-based inference improves on these empirical estimates by modeling the position-level read counts.

To start, we characterize sampling variation as binomial sampling. Conditional on the total count Nij, we assume nij ~ Bin(Nij, pij), independently across positions j. Therefore, pij represents the true proportion of the read count under condition 0 relative to the total read count under both conditions at location j of gene i. One could use rij as an empirical estimate of pij. For example, a value of rij = 0.5 implies that the observed numbers of reads mapped into position j of gene i are the same across the two conditions. Typically, most rij’s cluster around a particular value representing a relative expression level of gene i. Often the data includes some outliers closer to 0 or 1, due to random noise. One of our modeling aims is to downweigh these outliers in quantifying the gene expression.
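The sensitivity of the gene-level ratio to a single outlying position can be seen in a small numerical sketch. The counts below are hypothetical and chosen only for illustration; the function names are not part of the BM-DE package.

```python
def position_ratios(n, m):
    """Position-level ratios r_ij = n_ij / N_ij, with N_ij = n_ij + m_ij."""
    return [nj / (nj + mj) for nj, mj in zip(n, m) if nj + mj > 0]


def gene_ratio(n, m):
    """Gene-level ratio r_i = sum_j n_ij / sum_j N_ij."""
    return sum(n) / (sum(n) + sum(m))


# Hypothetical counts for one gene: eight balanced positions plus one outlier.
n = [5, 4, 6, 5, 5, 4, 6, 5, 40]  # reads under condition 0
m = [5, 6, 4, 5, 5, 6, 4, 5, 0]   # reads under condition 1

r_all = gene_ratio(n, m)                # includes the outlying last position
r_trimmed = gene_ratio(n[:-1], m[:-1])  # same gene without that position
print(round(r_all, 3), round(r_trimmed, 3))  # prints 0.667 0.5
```

In this toy example one position with rij = 1 moves the aggregate ratio from 0.5 to about 0.67, even though the other eight positions are balanced on average.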

To this end, we introduce a mechanism to downweight outlying pij in the inference for differential expression. We achieve this by introducing a latent indicator wij for each position, with wij = 0 representing an outlier at position j. We assume that pij follows a mixture of beta distributions (Ji et al.23):

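$$
p_{ij} \mid w_{ij}, \alpha_i, \beta_i \overset{\text{indep.}}{\sim}
\begin{cases}
\mathrm{Be}(\alpha_i, \beta_i) & \text{if } w_{ij} = 1,\\
\mathrm{Be}(1/2, 1/2) & \text{if } w_{ij} = 0,
\end{cases}
$$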

where Be(a, b) represents a beta distribution with mean a/(a + b). When wij = 0 the j-th position is an outlier, and the expected ratio is given a Be(1/2, 1/2) prior, which assigns most probability mass close to 0 or 1. We assume wij follows a Bernoulli distribution with probability $\pi^w_i$, ie, $w_{ij} \sim \mathrm{Ber}(\pi^w_i)$, so that $1 - \pi^w_i$ represents a gene-specific proportion of outliers. The parameters (αi, βi) characterize the expression of gene i, excluding the outliers. This formal accounting for outliers in the mixture robustifies inference in critical ways. Later, in the application to a yeast RNA-Seq data set, we will show that failure to downweight such outliers could even flip the reported inference on differential expression for some genes (Fig. 6).
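To make the role of the indicator wij concrete, the following sketch computes Pr(wij = 1 | nij, Nij, αi, βi, πw) after integrating out pij, which turns each mixture component into a beta-binomial likelihood. This is an illustration derived from the model as stated above, with all other parameters treated as known; it is not taken from the BM-DE package, and the function names are hypothetical.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_betabinom_kernel(n, N, a, b):
    """log of B(n + a, N - n + b) / B(a, b).

    The binomial coefficient is omitted because it is the same for both
    mixture components and cancels in the posterior ratio.
    """
    return log_beta(n + a, N - n + b) - log_beta(a, b)


def prob_not_outlier(n, N, alpha, beta, pi_w):
    """Pr(w_ij = 1 | n_ij, N_ij, alpha_i, beta_i, pi_w), with p_ij integrated out."""
    f1 = exp(log_betabinom_kernel(n, N, alpha, beta))  # regular component
    f0 = exp(log_betabinom_kernel(n, N, 0.5, 0.5))     # outlier component
    return pi_w * f1 / (pi_w * f1 + (1.0 - pi_w) * f0)


# Hypothetical gene with Be(10, 10), ie, ratios concentrated around 0.5.
print(prob_not_outlier(5, 10, 10.0, 10.0, 0.95))   # balanced position
print(prob_not_outlier(10, 10, 10.0, 10.0, 0.95))  # all reads from condition 0
```

A position whose reads all come from one condition receives a much smaller weight than a balanced position, which is the downweighting effect described in the text.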

Figure 6.
Same as Figure 5 for three genes for which inference based on ri and p̂i disagree. The genes are marked as rectangles in Figure 4a. Many of the positions are imputed to be possible outliers, and thus downweighted in the inference.

We reparameterize αi and βi for easier interpretation and computation. We follow Robert and Rousseau,24 and let ηi = log(αi + βi) and ξi = log(αi/βi). Note that ξi is the logit of the mean αi/(αi + βi) of the beta distribution. In the (ξi, ηi) parametrization an unusually large or small value of ξi indicates differential expression, whereas ηi allows for varying levels of heterogeneity across genes. This interpretation leaves ξi as the main parameter of interest. Figure 3(b) shows the posterior means of all ξi for a yeast RNA-Seq data set (see Section 3). While the cloud in the middle represents the majority of nondifferentially expressed genes, the genes with values ξi outside the cloud are those with differential expression. We use a mixture of normal distributions for ξi to formalize the notion of differential expression. That is,

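$$
\xi_i \mid \bar{\xi}, s_\xi^2 \overset{\text{iid}}{\sim} \pi^\lambda_0 \, \mathrm{N}(\bar{\xi}, s_\xi^2) + \pi^\lambda_{-1} \, \mathrm{N}(\bar{\xi} - \delta_{-1}, s_\xi^2) + \pi^\lambda_1 \, \mathrm{N}(\bar{\xi} + \delta_1, s_\xi^2). \tag{1}
$$
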
Figure 3.
Posterior probability of differential expression, p̂i = Pr(λi ≠ 0 | data) (panel a) and the posterior mean of relative gene expression over the two conditions, ξ̂i = E(ξi | data) (panel b).

We introduce a latent trinary indicator λi ∈ {0, −1, 1} to represent normal, under-, and over-expression, and rewrite the mixture model (1) as a hierarchical model

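$$
\xi_i \mid \lambda_i, \bar{\xi} \overset{\text{iid}}{\sim} \mathrm{N}(\bar{\xi} + \lambda_i \delta_{\lambda_i}, s_\xi^2), \qquad \Pr(\lambda_i = l) = \pi^\lambda_l, \quad l = -1, 0, 1.
$$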

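We complete the model with priors for $\pi^w = (\pi^w_1, \ldots, \pi^w_I)$, $\pi^\lambda = (\pi^\lambda_{-1}, \pi^\lambda_0, \pi^\lambda_1)$, δ−1, δ1 and $s_\xi^2$. We use a beta distribution $\pi^w_i \sim \mathrm{Be}(a^w, b^w)$, independently across i, a Dirichlet prior $\pi^\lambda \sim \mathrm{Dir}(a_{-1}, a_0, a_1)$, and a gamma prior for $s_\xi^2$ with parameters $(a_s, b_s)$. Finally, we use independent gamma priors $\delta_l \sim \mathrm{Ga}(a^\delta_l, b^\delta_l)$, l = −1, 1, and $\pi(\bar{\xi}) \propto 1$.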
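As an illustration of how the trinary indicator interacts with the mixture, the sketch below evaluates Pr(λi = l | ξi, ·) for a given value of ξi, with the remaining parameters held fixed. In the full model ξi is itself latent, so this is only one conditional step under stated assumptions, not the complete posterior computation; the function names and numerical values are hypothetical.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return exp(-((x - mean) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)


def lambda_conditional(xi, xi_bar, s2, delta_minus, delta_plus, pi_lambda):
    """Pr(lambda_i = l | xi_i, xi_bar, s2, deltas, pi_lambda) for l in (-1, 0, 1).

    pi_lambda is a dict mapping each l to its mixture weight.
    """
    means = {-1: xi_bar - delta_minus, 0: xi_bar, 1: xi_bar + delta_plus}
    weights = {l: pi_lambda[l] * normal_pdf(xi, means[l], s2) for l in (-1, 0, 1)}
    total = sum(weights.values())
    return {l: w / total for l, w in weights.items()}


# Hypothetical parameter values chosen for illustration only.
weights = {-1: 0.025, 0: 0.95, 1: 0.025}
print(lambda_conditional(0.0, 0.0, 0.1, 1.0, 1.0, weights))
print(lambda_conditional(1.5, 0.0, 0.1, 1.0, 1.0, weights))
```

A value of ξi near the overall mean is assigned to λi = 0 with high probability, while a value well above it is assigned to λi = 1 despite the small prior weight of that component.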

The hyperprior distribution on ξ̄ allows for imbalance between the overall counts under the two conditions.

In contrast to fixing ξ̄, for example, at ξ̄ = 0.5, the hierarchical extension with the hyperprior allows for a systematic bias (such as different sequencing depth) across the two conditions. Using possibly different δ−1 and δ1 allows for varying deviation from the mean ξ̄ for over- versus under-expressed genes. For simplicity, we fix ηi in the analysis for the yeast data. If a prior on ηi were desired, one could easily extend the model accordingly, using, for example, the prior model from Robert and Rousseau.24 The model is summarized in Figure 1.
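The reparametrization ηi = log(αi + βi), ξi = log(αi/βi) used in this section can be inverted in closed form. The short sketch below, with hypothetical function names, converts between the two parametrizations and may help when reading the later sections.

```python
from math import exp, log


def to_xi_eta(alpha, beta):
    """Map (alpha, beta) to (xi, eta) = (log(alpha / beta), log(alpha + beta))."""
    return log(alpha / beta), log(alpha + beta)


def to_alpha_beta(xi, eta):
    """Inverse map: alpha + beta = exp(eta), alpha / (alpha + beta) = logit^{-1}(xi)."""
    total = exp(eta)
    mean = 1.0 / (1.0 + exp(-xi))  # inverse logit of xi
    return total * mean, total * (1.0 - mean)


xi, eta = to_xi_eta(3.0, 1.0)
print(to_alpha_beta(xi, eta))  # recovers (3.0, 1.0) up to rounding error
```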

Figure 1.
Hierarchical model for RNA-Seq data.

3. Yeast Data Analysis

3.1. Data

We illustrate the proposed approach with an RNA-Seq data set from Ingolia et al.25 Specifically, mRNA was extracted from yeast, Saccharomyces cerevisiae strain BY4741, in rich growth medium (YEPD medium) and poor growth medium (amino acid starvation). The goal of the experiment was to identify genes that are differentially expressed between these two biologic conditions. The sequences of short reads were produced using an Illumina Genome Analyzer II. The short reads were mapped using the SOAP method (Li et al.26). The data set consists of counts under two different conditions for 1,285 genes.

We considered I = 1,089 genes having Ji ≥ 5 positions for analysis and discarded the remaining 196 for lack of information. The read counts of those 1,089 genes under the two growth conditions, $\sum_{j=1}^{J_i} n_{ij}$ and $\sum_{j=1}^{J_i} m_{ij}$, range from 1 to 9,334 and from 0 to 14,150, respectively. Figure 2 shows histograms of Ji (panel a) and $\sum_{j=1}^{J_i} N_{ij}$ (panel b) on a logarithmic scale (base 10). Overall, genes have many positions with non-zero counts, and reads per position are small.

Figure 2.
Histogram of the number of non-zero count positions (Ji, i = 1, …, I) (panel a) and total counts over the two conditions, $\sum_{j=1}^{J_i} N_{ij}$ (panel b), i = 1, …, I, on the logarithmic scale with base 10.

3.2. Markov chain Monte Carlo simulations

We estimated and fixed ηi as follows. First, we find αi and βi such that αi/(αi + βi) = ri and αiβi/{(αi + βi)²(αi + βi + 1)} = var(rij), the sample variance of the rij. We then fix ηi = log(αi + βi). We expect that about 5% of all genes are differentially expressed and that about 5% of all positions are outliers. We therefore set (aw, bw) = (19, 1), (a−1, a0, a1) = (1, 38, 1), $(a^\delta_{-1}, b^\delta_{-1}) = (5, 0.11)$, $(a^\delta_1, b^\delta_1) = (5, 0.12)$, and (as, bs) = (3, 0.09). We implemented posterior inference using Markov chain Monte Carlo (MCMC) posterior simulations for the proposed model. The implementation is a standard Gibbs sampling algorithm, using Metropolis-Hastings transition probabilities with random walk proposals when the complete conditional posterior distribution is not available for efficient random variate generation. We ran the MCMC simulation by iterating over all complete conditionals for 4,500 iterations, discarding the first 500 iterations as burn-in.
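The moment-matching step for ηi has a closed-form solution: matching the beta mean to ri and the beta variance to the sample variance v of the rij gives αi + βi = ri(1 − ri)/v − 1. The sketch below implements this under the assumption that the unbiased sample variance is used; the function name and the example ratios are hypothetical.

```python
from math import log


def moment_match_eta(r_positions, r_gene):
    """Return (alpha, beta, eta) whose beta mean is r_gene and whose beta
    variance equals the sample variance of the position-level ratios."""
    k = len(r_positions)
    mean_r = sum(r_positions) / k
    var_r = sum((r - mean_r) ** 2 for r in r_positions) / (k - 1)
    if var_r <= 0.0:
        raise ValueError("position-level ratios show no variation")
    total = r_gene * (1.0 - r_gene) / var_r - 1.0  # alpha + beta
    if total <= 0.0:
        raise ValueError("sample variance too large for a beta distribution")
    return r_gene * total, (1.0 - r_gene) * total, log(total)


# Hypothetical position-level ratios for one gene.
alpha, beta, eta = moment_match_eta([0.4, 0.5, 0.6, 0.5], 0.5)
print(alpha, beta, eta)
```

For these ratios the sample variance is 0.02/3, so αi + βi = 0.25/(0.02/3) − 1 = 36.5 and αi = βi = 18.25.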

3.3. Results

Figure 3(a) plots the posterior probabilities of differential expression, p̂i = Pr(λi ≠ 0 | data). Some genes report very large posterior probabilities p̂i. Figure 3(b) plots the posterior means ξ̂i = E(ξi | data). The three dashed horizontal lines mark the posterior means of (ξ̄ + δ1), ξ̄, and (ξ̄ − δ−1), respectively. The genes close to or outside the boundary of the lower and upper dashed lines are reported as differentially expressed.

Figure 4a plots the marginal posterior probabilities p̂i against the empirical estimate ri of relative expression. The plot illustrates that p̂i agrees with the ad-hoc estimates ri for most genes. But there are some genes where p̂i disagrees with (we would argue, improves upon) ad-hoc inference with ri. In the next two figures we explore possible reasons for this. Figures 5 and 6 present summaries for some selected genes to illustrate agreement and disagreement of ri and p̂i. In both figures, the plots in the first column show Nij (circle) and nij (cross) along positions. The second column plots rij along positions. The dashed line indicates the posterior mean ξ̂i, and the dotted line shows the empirical estimate ri. The line for ξ̂i is plotted at logit⁻¹(ξ̂i) to map to the unit scale. The third column plots the posterior probability ŵij = Pr(wij = 1 | data) along positions.

Figure 4.
Posterior probabilities p̂i = Pr(λi ≠ 0 | data) plotted against ri (panel a). The triangles and squares indicate genes for which posterior inference agrees (triangles) and disagrees (squares) with the inference based on ...
Figure 5.
Inference summaries for three genes for which inferences based on ri and p̂i agree. The three genes are marked as a triangle in Figure 4a. The first column shows nij (crosses) and Nij (circles). The second column plots rij. The dotted ...

Comparison of the two figures explains the observed discrepancies in ri and p̂i. The large ri in Figure 6 are due to outliers in rij, including some positions with small total read counts Nij. In contrast, under the posterior inference, many of the ŵij are imputed with relatively smaller values, leading to a downweighting of the corresponding rij in the inference for the gene-specific indicators λi for differential expression, and thus for p̂i. Except for these few outliers, most rij’s are aligned around a value close to 0.5, indicating nondifferential expression. In other words, while ri is very sensitive to outliers, the model-based estimate downweights outliers, as desired.

The computation of posterior probabilities p̂i = Pr(λi ≠ 0 | data) is only half the desired inference. We still need to decide which genes should be reported as differentially expressed. We use a decision rule based on flagging genes with p̂i > κ for some threshold κ. We fix the threshold κ by setting a bound on the false discovery rate (FDR).27 Figure 4b summarizes the FDR implied by decision rules of reporting the genes with highest probability of differential expression. For $\overline{\mathrm{FDR}} \le 0.10$ the rule reports 46 differentially expressed genes. The rule corresponds to a threshold κ = 0.618.
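One common way to turn posterior probabilities into a reporting rule with a bound on the posterior expected FDR is to rank genes by p̂i and report the largest set whose average value of 1 − p̂i stays below the bound. The sketch below illustrates this construction; it is written under that assumption and is not taken from the BM-DE package. The function name and the example probabilities are hypothetical.

```python
def fdr_threshold(post_probs, bound=0.10):
    """Largest reporting set whose posterior expected FDR is at most `bound`.

    Returns (number_reported, smallest_reported_probability); the second
    entry is None when no gene can be reported.
    """
    ranked = sorted(post_probs, reverse=True)
    best_k, expected_false = 0, 0.0
    for k, p in enumerate(ranked, start=1):
        expected_false += 1.0 - p          # expected number of false discoveries
        if expected_false / k <= bound:    # posterior expected FDR of the top k
            best_k = k
    cutoff = ranked[best_k - 1] if best_k else None
    return best_k, cutoff


# Hypothetical posterior probabilities for six genes.
print(fdr_threshold([0.99, 0.95, 0.9, 0.6, 0.2, 0.1]))  # prints (3, 0.9)
```

Because the genes are ranked by decreasing p̂i, the running average of 1 − p̂i is nondecreasing, so the reported set is always a top-k set and corresponds to a single threshold on p̂i.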

4. Simulation

We carry out a simulation study to further examine the proposed model. The study investigates the performance of our method in the case where genes have many positions with nonzero counts. In the study, we assume small within-gene variabilities in the read counts and large across-gene variabilities. We achieve this by centering ηi around a small value and allowing a relatively large variance for ξi in our model.

Since the primary goal is inference on ξi, we fix ηi at their simulation truth. We place priors on the remaining parameters, $(\bar{\xi}, s_\xi^2, \pi^w, \pi^\lambda, \delta_{-1}, \delta_1)$, as described in Section 2.

We compare model-based estimates with the simulation truth, and compare the inference under the proposed model to that under two methods: (1) the Analysis of Sequence Counts (ASC) proposed by Wu et al18 and (2) the MA-plot-based method with random sampling model (DEGseq) proposed in Wang et al.28

In the ASC, Wu et al model the aggregate read count for each gene under each condition as a binomial random variate, given the total read count summing over all the genes at each condition. The expected proportions in the binomial are compared between the two conditions for each gene. They use δ to denote the difference between the logarithms of the proportions and λ as the sum of the two log proportions. They propose unimodal prior distributions for δ and λ and compute the posterior probability P(|δi| > Δ0 | data), where δi is the log fold change in gene expression of gene i, and Δ0 is a pre-defined threshold for biological significance. In DEGseq, Wang et al. define Mi = log2(C0i) − log2(C1i) and Ai = (log2(C0i) + log2(C1i))/2, where $C_{0i} = \sum_{j=1}^{J_i} n_{ij}$ and $C_{1i} = \sum_{j=1}^{J_i} m_{ij}$. They assume that given Ai = a, Mi approximately follows a normal distribution with mean and variance,

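$$
\mathrm{E}(M_i \mid A_i = a) = \log_2(C_{0\cdot}) - \log_2(C_{1\cdot}), \qquad
\mathrm{Var}(M_i \mid A_i = a) = \frac{4\left(1 - \sqrt{2^{2a}/(C_{0\cdot} C_{1\cdot})}\right)(\log_2 e)^2}{(C_{0\cdot} + C_{1\cdot})\sqrt{2^{2a}/(C_{0\cdot} C_{1\cdot})}},
$$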

where $C_{\kappa\cdot} = \sum_{i=1}^{I} C_{\kappa i}$ for κ = 0, 1. Inference on differential gene expression is then formalized with a z-test. In this simulation study, normalization for DEGseq and ASC is not necessary since ξ̄ is set at 0.
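The z-statistic implied by these moments can be sketched as follows. The code follows the mean and variance as described in this section, reading the variance with square roots around 2^{2a}/(C0· C1·); it is an illustration of the formula, not the implementation in the DEGseq package, and it assumes strictly positive gene-level counts. The function name is hypothetical.

```python
from math import e, log2, sqrt


def ma_z_score(c0_gene, c1_gene, c0_total, c1_total):
    """z-statistic for M_i given A_i under the random sampling model."""
    m = log2(c0_gene) - log2(c1_gene)
    a = (log2(c0_gene) + log2(c1_gene)) / 2.0
    mean_m = log2(c0_total) - log2(c1_total)
    root = sqrt(2.0 ** (2.0 * a) / (c0_total * c1_total))
    var_m = 4.0 * (1.0 - root) * log2(e) ** 2 / ((c0_total + c1_total) * root)
    return (m - mean_m) / sqrt(var_m)


# Hypothetical counts: equal library sizes, gene counts of 200 versus 100.
print(ma_z_score(200, 100, 10**6, 10**6))
```

Since 2^{2a} equals the product of the two gene-level counts, the quantity under the square root is the geometric mean of the gene's share of reads in the two libraries, so the variance shrinks as the aggregate count grows. This is the mechanism behind the gene-length effect discussed later in this section.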

We simulate a sample of I = 1,200 genes. For half of the genes we assumed Ji = 300 recorded positions per gene, and for the other half we use Ji = 100. We let λi = −1 or 1 for 150 genes and λi = 0 for the remaining 450 genes. Given λi, we generate $\eta_i \sim \mathrm{N}(\bar{\eta}, s_\eta^2)$ and $\xi_i \sim \mathrm{N}(\bar{\xi} + \lambda_i \delta_{\lambda_i}, s_\xi^2)$, with $\bar{\eta} = 2$, $s_\eta^2 = 0.25^2$, ξ̄ = 0, $s_\xi^2 = 0.1$, and δ−1 = δ1 = 1. We let wij = 0 or 1 independently with probabilities 0.05 and 0.95, respectively. Conditional on wij = 1 or 0, we respectively generate pij from either a Be(αi, βi) or Be(1/2, 1/2) prior, where αi = exp(ηi) exp(ξi)/(1 + exp(ξi)) and βi = exp(ηi)/(1 + exp(ξi)). Finally, we generate Nij ~ Ga(1.5, 1/1.5) (rounded up to the nearest integer), and nij ~ Bin(Nij, pij), independently. We then proceed to estimate ξi and P(λi ≠ 0 | data) conditional on Nij and nij under the proposed model.
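The data-generating mechanism for a single gene can be sketched with the Python standard library as follows. Two points are assumptions rather than statements from the text: the second parameter of Ga(1.5, 1/1.5) is read as a rate (so the scale is 1.5), and the variance of ηi is read as 0.25². The function name and default arguments are hypothetical.

```python
import random
from math import ceil, exp


def simulate_gene(lam, n_positions, rng, eta_bar=2.0, s_eta=0.25,
                  xi_bar=0.0, s2_xi=0.1, delta=1.0, prob_regular=0.95):
    """Simulate (xi_i, n_i, N_i) for one gene with indicator lam in {-1, 0, 1}."""
    eta = rng.gauss(eta_bar, s_eta)
    xi = rng.gauss(xi_bar + lam * delta, s2_xi ** 0.5)
    alpha = exp(eta) * exp(xi) / (1.0 + exp(xi))
    beta = exp(eta) / (1.0 + exp(xi))
    n, N = [], []
    for _ in range(n_positions):
        regular = rng.random() < prob_regular  # w_ij = 1 with probability 0.95
        p = rng.betavariate(alpha, beta) if regular else rng.betavariate(0.5, 0.5)
        # Assumption: Ga(1.5, 1/1.5) has shape 1.5 and rate 1/1.5 (scale 1.5).
        total = max(1, ceil(rng.gammavariate(1.5, 1.5)))
        count = sum(rng.random() < p for _ in range(total))  # Bin(total, p) draw
        n.append(count)
        N.append(total)
    return xi, n, N


rng = random.Random(0)
xi, n, N = simulate_gene(lam=1, n_positions=100, rng=rng)
print(xi, sum(n) / sum(N))
```

Repeating the call over genes with the desired values of λi and Ji yields a data set of the kind described above.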

The receiver operating characteristic (ROC) curve is commonly used to compare methods for classification problems. We assume a decision rule that reports genes with posterior probabilities P(λi ≠ 0 | data) and P(|δi| > Δ0 | data) (in the cases of the proposed approach and ASC, respectively) or P-value (in the case of DEGseq) beyond a threshold, where we set Δ0 = 1.8. The ROC curve plots the true positive rate against the false positive rate as a parametric curve indexed by the threshold. Figure 7 shows the three ROC curves. The ROC curve for the proposed method compares favorably against the alternatives. It demonstrates the limitations of ASC. We believe that this is due to the strong assumptions on the shape of the priors of δ and λ. The simulation truth is that the mean expression of the genes is generated from a mixture of three distributions, which does not agree with the unimodal assumptions of the ASC model.
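For readers who wish to reproduce this kind of comparison, an ROC curve can be traced by sorting genes by their score and lowering the threshold one gene at a time. The sketch below is a generic construction with hypothetical function names; ties between scores are not treated specially. For a method that reports P-values, a score such as 1 − P can be used so that larger values indicate stronger evidence.

```python
def roc_points(scores, is_de):
    """(false positive rate, true positive rate) pairs as the threshold decreases."""
    n_pos = sum(1 for flag in is_de if flag)
    n_neg = len(is_de) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if is_de[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) pairs sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical scores for four genes, the first two truly differentially expressed.
pts = roc_points([0.9, 0.8, 0.3, 0.1], [True, True, False, False])
print(pts, area_under_curve(pts))
```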

Figure 7.
ROC curves for identification of differential gene expression under the proposed method (black solid line), the DEGseq (red dotted line) proposed by Wang et al28 and the ASC (blue dashed line) proposed by Wu et al18 in the simulation study.

Regarding the performance of DEGseq, we note that longer genes tend to have larger aggregate counts across positions. Therefore, DEGseq is more likely to declare long genes with small effects as differentially expressed, since its estimated standard deviation inherently depends on the mean counts. Specifically, we observe that DEGseq tends to produce smaller p-values for nondifferentially expressed genes with Ji = 300 than for those with Ji = 100 due to the gene-length bias (see Fig. 8a). On the other hand, the proposed model accounts for position-specific variability while more information on relative gene expression accumulates as the number of positions within a gene increases (see Fig. 8b). Therefore, the proposed method tends to produce smaller posterior probabilities of differential expression for non-differentially expressed genes with Ji = 300 than for those with Ji = 100. This, coupled with vague position-specific information, leads to superior performance of the proposed method for longer genes. This has significant implications for statistical inference of differential expression using RNA-Seq data. Since RNA-Seq experiments produce many non-zero count positions within a gene, and many reads per position, RNA-Seq data enable us to model variability among expression levels at positions within the same gene, and incorporating this variability into a model improves the resulting inference.

Figure 8.
Boxplots of P value under DEGseq (panel a) and p̂i under the proposed method (panel b) by the number of positions within a gene, Ji = 100 or 300.

We note that if both Nij and Ji are small, modeling the position-level read counts does not significantly improve inference. Also, if there is little variation across position-level counts, then the loss of information under aggregation remains negligible. We found that for cases where short reads are mapped to a small number of positions, DEGseq performs well (results not shown). However, such situations are atypical for large-scale RNA-Seq experiments, which usually yield very noisy data.

5. Discussion

We proposed a Bayesian model-based approach for inference with RNA-Seq data. We introduced a hierarchical structure to model the position-level count data. We demonstrate through a simulation study and the analysis of a yeast experiment that the model effectively downweights outlying observations at the position level and obtains more robust estimates of gene expression.

The model provides a promising framework for further development of statistical models for RNA-Seq data. One possible extension is to relax the parametric assumption for ξi. By removing the restriction to a specific parametric family of distributions, one could further robustify inference about gene expression levels. Another important extension is to incorporate dependence across genes. In the current model we assumed that the ξi are independently and identically distributed. One may achieve more precise estimates and formal inference about dependence structure by generalizing the model to allow for dependence of ξi across genes. One could build on available prior information to construct informative priors for dependence at the level of the indicators λi. For a model with gene-level indicators similar to the λi used in our model, this is carried out in Telesca et al.29 The binary nature of λi greatly simplifies general modeling of dependence structure. For a recent discussion of models for dependent gene expression see, for example, Stingo et al30 or Jones et al,31 and references therein. Both references use a model-based Bayesian approach as in this paper.

Finally, while the model was specifically developed for experiments comparing two conditions without biologic replicates, simple modifications would allow its use for experiments with replicates or experiments with multiple conditions. The proposed model can be extended for experiments with replicates by replacing the binomial sampling model for nij by a model for counts across replicates. For experiments with multiple conditions, one may consider a multinomial likelihood with a Dirichlet prior.

Acknowledgments

Yuan Ji and Peter Müller’s research is partially supported by NIH R01 CA132897. Shoudan Liang’s research is supported by NIH K25 CA123344.

Footnotes

Disclosure

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

References

1. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009b;10:57–63. [PMC free article] [PubMed]
2. Li J, Jiang H, Wong WH. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biology. 2010b;11 [PMC free article] [PubMed]
3. Hoen PAC, Ariyurek Y, Thygesen HH, et al. Deep sequencing-based expression analysis shows major advances in robustness, resolution and inter-lab portability over five microarray platforms. Nucleic Acids Research. 2008;36:e141. [PMC free article] [PubMed]
4. Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biology. 2010;11:12. [PMC free article] [PubMed]
5. Ji H, Liu XS. Analyzing 'omics data using hierarchical models. Nature Biotechnology. 2010 [PMC free article] [PubMed]
6. Auer PL, Doerge RW. Statistical design and analysis of RNA sequencing data. The Genetics Society of America. 2010;185:405–16. [PMC free article] [PubMed]
7. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5:621–8. [PubMed]
8. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010;11:3. [PMC free article] [PubMed]
9. Balwierz PJ, Carninci P, Daub CO, et al. Methods for analyzing deep sequencing expression data: constructing the human and mouse promoterome with deepCAGE data. Genome Biology. 2009;10:7. [PMC free article] [PubMed]
10. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research. 2008;18:1509–17. [PMC free article] [PubMed]
11. Robinson MD, Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics. 2007;23(21):2881–7. [PubMed]
12. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139–40. [PMC free article] [PubMed]
13. Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010;11 [PMC free article] [PubMed]
14. Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biology Direct. 2010;4 [PMC free article] [PubMed]
15. Young MD, Wakefield MJ, Smyth GK, Oshlack A. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biology. 2010;11:2. [PMC free article] [PubMed]
16. Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11:10. [PMC free article] [PubMed]
17. Taub MA. Analysis of high-throughput biological data: some statistical problems in RNA-seq and mouse genotyping. 2009. Ph.D. thesis. Department of Statistics, UC Berkeley.
18. Wu Z, Jenkins BD, Rynearson TA, et al. Empirical Bayes analysis of sequencing-based transcriptional profiling without replicates. BMC Bioinformatics. 2010;11 [PMC free article] [PubMed]
19. Jian H, Wong WH. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics. 2009;25:1026–32. [PMC free article] [PubMed]
20. Salzman J, Jiang H, Wong WH. Division of Statistics. Stanford University; 2010. Statistical modeling of RNA-Seq data. Tech. rep.
21. Li B, Ruotti V, Stewart R, Thomson J, Dewey C. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010a;26:493–500. [PMC free article] [PubMed]
22. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009;10:R25. [PMC free article] [PubMed]
23. Ji Y, Wu C, Liu P, Coombes K. Applications of beta-mixture models in bioinformatics. Bioinformatics. 2005;21(9):2118–22. [PubMed]
24. Robert CP, Rousseau J. A Mixture Approach to Bayesian Goodness of Fit. Les cahiers du CEREMADE. 2004 (2002–9).
25. Ingolia N, Ghaemmaghami S, Newman J, Weissman J. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009;324(5924):218–23. [PMC free article] [PubMed]
26. Li R, Li Y, Kristiansen K, Wang J. SOAP: short oligonucleotide alignment program. Bioinformatics. 2008 [PubMed]
27. Newton MA, Noueiry A, Sarkar D, Ahlquist P. Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics. 2004;5:155–76. [PubMed]
28. Wang L, Feng Z, Wang X, Wang X, Zhang X. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2009a;26(1):136–8. [PubMed]
29. Telesca D, Müller P, Parmigiani G, Freedman RS. Harvard University; 2010. Modeling dependent gene expression. Tech. rep.
30. Stingo F, Chen Y, Tadesse M, Vannucci M. Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Annals of Applied Statistics. 2011 [PMC free article] [PubMed]
31. Jones B, Carvalho C, Dobra A, Hans C, Carter C, West M. Experiments in stochastic computation for high-dimensional graphical models. Statistical Science. 2004;20:388–400.

Articles from Cancer Informatics are provided here courtesy of Libertas Academica