Cancer Inform, v.10; 2011 (PMC3153162)

# On Differential Gene Expression Using RNA-Seq Data

^{1}Department of Biostatistics, UT M.D. Anderson Cancer Center Houston, Texas, USA

^{2}Department of Bioinformatics and Computational Biology, UT M.D. Anderson Cancer Center Houston, Texas, USA

^{3}Department of Mathematics, UT Austin, Austin, Texas, USA

## Abstract

### Motivation:

RNA-Seq is a novel technology that provides read counts of RNA fragments in each gene, including the mapped positions of each read within each gene. Besides many other applications it can be used to detect differentially expressed genes. Most published methods collapse the position-level read data into a single gene-specific expression measurement. Statistical inference proceeds by modeling these gene-level expression measurements.

### Results:

We present a Bayesian method of calling differential expression (BM-DE) that directly models the position-level read counts. We demonstrate the potential advantage of the BM-DE method compared to existing approaches that rely on gene-level aggregate data. An important additional feature of the proposed approach is that BM-DE can be used to analyze RNA-Seq data from experiments without biological replicates. This becomes possible since the approach works with multiple position-level read counts for each gene. We demonstrate the importance of modeling for position-level read counts with a yeast data set and a simulation study.

### Availability:

A public domain R package is available from http://odin.mdacc.tmc.edu/~ylji/BMDE/.

**Keywords:** clustering, false discovery rate, mixture models, next-generation sequencing

## 1. Introduction

### 1.1. RNA-Seq experiments

RNA-Seq is a high-throughput sequencing technology that has recently emerged as a popular methodology to measure gene expression with high accuracy. It generates millions of short reads of mRNA or cDNA. The short reads are mapped to the genome, resulting in a sequence of read counts at millions of genomic positions.^{1,2} RNA-Seq exhibits a high level of reproducibility,^{1} and mitigates many limitations of microarrays.^{3} Consequently, RNA-Seq enables researchers to investigate more complex aspects of the transcriptome, such as allele-specific expression and the discovery of novel promoters and isoforms,^{4} and to develop new approaches to old but fundamental biological questions. An example of the latter is the identification of differentially expressed genes between two conditions.

RNA-Seq experiments produce data on millions of short reads. The data report the base sequence of the reads and the positions on the genome to which the reads are mapped. Most current methods collapse the position-level read counts into a single gene-level summary, such as the number of reads that map per kilobase of exon model per million mapped reads (RPKM), or simply the sum of all the read counts across positions within each gene. Simple hypothesis testing based on the gene-level summaries implements inference on differential gene expression under two biological conditions. We take a different approach. We start by modeling the position-level read counts within a gene. This enables us to account for position outliers among position-level counts. We demonstrate that failure to identify and downweight outliers can bias gene-level summaries. Our hierarchical modeling approach then proceeds with borrowing information across genes for the inference of differential expression. Furthermore, we specifically model systemic biases such as total RNA amount in each experiment to achieve better accuracy in calling differentially expressed genes.

Ji and Liu^{5} illustrated how inference with a Bayesian hierarchical model can improve statistical inference for high-throughput experiments. They also highlighted that borrowing information across loci through a hierarchical model can improve statistical inference even in the case without biological replicates. We follow this advice and propose a Bayesian hierarchical model that effectively utilizes position-level information and accounts for the variability in all the position-level read counts mapped to each gene. We demonstrate the superior performance of the BM-DE method, even when the RNA-Seq data are generated from experiments without biological replicates. Because RNA-Seq is still costly, many studies are carried out without replicates. In such experiments, only one biological sample is prepared per condition for a single run of RNA sequencing. We show that the BM-DE method reduces the number of false positive findings. Note that this does not imply that the BM-DE method can account for the biological variation in such experiments. This is impossible without replicates. We recommend using sound and efficient experimental designs^{6} with biological replicates for RNA-Seq experiments. For existing data, some without replicates, the proposed BM-DE approach can be used to increase the precision of calling differentially expressed genes.

### 1.2. Inference for RNA-Seq data

RNA-Seq data are usually normalized across libraries to adjust for different total read counts by lanes or by samples. In early work, researchers simply used cumulative counts, summing up read counts across positions, followed by minor normalization to account for gene length and the total number of reads.^{7} Recently, more sophisticated normalization methods were proposed. For example, see Robinson and Oshlack^{8} and Balwierz et al.^{9}

With the single expression summary per gene per condition, most statistical modeling and inference for differential expression has been based on classical hypothesis testing, such as Fisher’s exact test, likelihood ratio tests, or *t*-tests. For example, Marioni et al^{10} modeled read counts with a Poisson distribution, and used a likelihood ratio test to identify differentially expressed genes. Similar to Marioni et al,^{10} Wang et al^{1} used a Poisson distribution to test differential expression for experiments without biological replicates. Robinson and Smyth^{11} developed a negative binomial model to account for the variation across replicate samples. They estimated a common dispersion using all tags, and shrank the dispersions of individual tags toward the estimated common dispersion, similar to an empirical Bayes approach. edgeR^{12} implemented the model for application to RNA-Seq data. Bullard et al^{13} compared the performance of various hypothesis tests, and found poor performance of the *t*-test, in particular for genes with low counts. They also studied biases introduced by gene length and the normalization procedure. They observed that the *t*-test tends to yield significant test statistics more frequently for longer genes. This is due to the dependence of the estimated standard error on the mean read counts.

Oshlack and Wakefield^{14} further investigated the transcript length bias in RNA-Seq data for differential expression. They illustrated that the standard approaches that use aggregate read counts for each gene in differential expression are subject to significant bias, and that a simple adjustment, dividing by the transcript length, does not entirely remove this bias. Young et al^{15} accounted for the transcript length bias in RNA-Seq data, and developed a statistical model for gene ontology analysis.

Bayesian approaches for differential expression in RNA-Seq data have been developed by many researchers, such as Anders and Huber,^{16} Hoen et al,^{3} Taub^{17} and Wu et al.^{18} Wu et al^{18} took an empirical Bayes approach to detect differential expression for RNA-Seq data when biological replicates are not available. They developed a hierarchical model with aggregate counts at the gene level to estimate the log fold change in gene expression, and mitigated the limitation of experiments without replicates by borrowing strength across all genes.

In contrast to the previous approaches, the methods proposed in Jiang and Wong,^{19} Salzman et al,^{20} and Li et al^{21} used models to estimate gene expression at the isoform level. Oshlack et al^{4} provided a broad review of current research in preprocessing RNA-Seq data and identifying differentially expressed genes.

In this paper, we propose a novel method for the inference on differential gene expression with three distinct features:

- We explicitly model the read count at each genomic position within a gene. The proposed model can reduce the false positive rate by accounting for the dispersion in the position-specific counts. As another desirable consequence of position-level modeling the length bias disappears. We show significant improvements over existing models that only use gene-level summaries.
- The proposed method does not require prior normalization of the mapped read counts. Instead we simultaneously carry out the normalization and the inference on differential expression.
- We borrow strength across genes in a hierarchical model. Thus, the detection of differentially expressed genes is informed by the expression measurements in the entire data set.

A related important feature is that borrowing strength across genes in the hierarchical model allows meaningful model-based inference without replicates, if desired.

Section 2 describes the proposed Bayesian model. Section 3 reports the data analysis for the yeast data. Section 4 describes a small simulation study. The last section concludes with a final discussion. The manuscript and R programs with a simple example are available at http://odin.mdacc.tmc.edu/~ylji.

## 2. Probability Model

RNA-Seq data contain millions of read counts, with each read mapped to a genomic position within a gene. Such count data can be easily assembled from the standard output of upstream read alignment, eg, using SOAP or BOWTIE.^{22} We consider counts, $n_{ij}$ and $m_{ij}$, of mapped reads starting at position $j$ of gene $i$ under two different experimental conditions, 0 and 1, respectively. Here $i = 1, \ldots, I$ and $j = 1, \ldots, J_i$. Let $N_{ij} = n_{ij} + m_{ij}$ denote the total count over the two conditions at position $j$ of gene $i$. For ad-hoc inference about differential expression we may consider the empirical fraction $r_{ij} = n_{ij}/N_{ij}$ as the position-level ratio, or $r_i = \sum_j n_{ij} / \sum_j N_{ij}$ as the gene-level ratio. The proposed model-based inference improves on these empirical estimates by modeling the position-level read counts.

To start, we characterize sampling variation as binomial sampling. Conditional on the total count $N_{ij}$, we assume $n_{ij} \sim \text{Bin}(N_{ij}, p_{ij})$, independently across positions $j$. Therefore, $p_{ij}$ represents the true proportion of the read count under condition 0 relative to the total read count under both conditions at location $j$ of gene $i$. One could use $r_{ij}$ as an empirical estimate of $p_{ij}$. For example, a value of $r_{ij} = 0.5$ implies that the observed numbers of reads mapped to position $j$ of gene $i$ are the same across the two conditions. Typically, most $r_{ij}$’s cluster around a particular value representing a relative expression level of gene $i$. Often the data include some outliers closer to 0 or 1, due to random noise. One of our modeling aims is to downweight these outliers in quantifying the gene expression.
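As a small illustration of the two empirical summaries defined above, the following sketch computes the position-level ratios $r_{ij}$ and the gene-level ratio $r_i$ for one gene. The counts are hypothetical and chosen only to show how a single outlying position can pull $r_i$ away from the value around which most $r_{ij}$ cluster; the function names are not part of the BM-DE package.

```python
def position_ratios(n, m):
    """Return r_ij = n_ij / N_ij for every position with N_ij = n_ij + m_ij > 0."""
    return [nij / (nij + mij) for nij, mij in zip(n, m) if nij + mij > 0]


def gene_ratio(n, m):
    """Return r_i = sum_j n_ij / sum_j N_ij."""
    return sum(n) / (sum(n) + sum(m))


# Hypothetical counts for one gene: five balanced positions and one outlier.
n = [3, 2, 4, 3, 2, 40]  # reads under condition 0
m = [3, 2, 4, 3, 2, 0]   # reads under condition 1

r_ij = position_ratios(n, m)  # five values of 0.5 and one value of 1.0
r_i = gene_ratio(n, m)        # 54 / 68, well above 0.5
```

Here five of the six positions are perfectly balanced, yet the aggregate ratio is close to 0.8 because the outlying position carries most of the reads.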

To this end, we introduce a mechanism to downweight outlying $p_{ij}$ in the inference for differential expression. We achieve this by introducing a latent indicator $w_{ij}$ for each position, with $w_{ij} = 0$ representing an outlier at position $j$. We assume that $p_{ij}$ follows a mixture of beta distributions (Ji et al.^{23}),

$$p_{ij} \mid w_{ij} \sim \begin{cases} \text{Be}(\alpha_i, \beta_i) & \text{if } w_{ij} = 1, \\ \text{Be}(1/2, 1/2) & \text{if } w_{ij} = 0, \end{cases}$$

where Be($a, b$) represents a beta distribution with mean $a/(a + b)$. When $w_{ij} = 0$ the $j$-th position is an outlier, and the expected ratio is given a Be(1/2, 1/2) prior, which assigns most probability mass close to 0 or 1. We assume $w_{ij}$ follows a Bernoulli distribution with probability $\pi_i^w$, ie, $w_{ij} \sim \text{Ber}(\pi_i^w)$, so that $1 - \pi_i^w$ represents a gene-specific proportion of outliers. The parameters $(\alpha_i, \beta_i)$ characterize the expression of gene $i$, excluding the outliers. This formal accounting for outliers in the mixture robustifies inference in critical ways. Later, in the application to a yeast RNA-Seq data set, we will show that failure to downweight such outliers could even flip the reported inference on differential expression for some genes (Fig. 6).
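To make the role of the indicator $w_{ij}$ concrete, the sketch below computes the conditional probability that a position is not an outlier, given its counts and fixed values of $(\alpha_i, \beta_i, \pi_i^w)$. It integrates $p_{ij}$ out analytically, which gives a beta-binomial marginal likelihood for each mixture component. This marginalization is an assumption made for illustration; the text does not say whether the authors' sampler integrates out $p_{ij}$ or samples it, and the parameter values in the usage lines are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_beta_binomial(n, N, a, b):
    """log P(n | N) when n ~ Bin(N, p) and p ~ Be(a, b), with p integrated out."""
    log_choose = lgamma(N + 1) - lgamma(n + 1) - lgamma(N - n + 1)
    return log_choose + log_beta(n + a, N - n + b) - log_beta(a, b)


def prob_not_outlier(n, N, alpha, beta, pi_w):
    """Pr(w_ij = 1 | n_ij, N_ij) for fixed alpha_i, beta_i and pi_i^w."""
    log_regular = log(pi_w) + log_beta_binomial(n, N, alpha, beta)
    log_outlier = log(1.0 - pi_w) + log_beta_binomial(n, N, 0.5, 0.5)
    top = max(log_regular, log_outlier)  # subtract the maximum for stability
    regular = exp(log_regular - top)
    outlier = exp(log_outlier - top)
    return regular / (regular + outlier)


# Hypothetical gene centred at a ratio of 0.5 with 95% regular positions.
balanced = prob_not_outlier(n=5, N=10, alpha=20.0, beta=20.0, pi_w=0.95)
extreme = prob_not_outlier(n=10, N=10, alpha=20.0, beta=20.0, pi_w=0.95)
```

A balanced position keeps a probability close to 1 of belonging to the gene-specific component, whereas a position with all ten reads under one condition is mostly attributed to the Be(1/2, 1/2) outlier component.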

We reparameterize $\alpha_i$ and $\beta_i$ for easier interpretation and computation. We follow Robert and Rousseau,^{24} and let $\eta_i = \log(\alpha_i + \beta_i)$ and $\xi_i = \log(\alpha_i/\beta_i)$. Note that $\xi_i$ is the logit of the mean $\alpha_i/(\alpha_i + \beta_i)$ of the beta distribution. In the $(\xi_i, \eta_i)$ parametrization an unusually large or small value of $\xi_i$ indicates differential expression, whereas $\eta_i$ allows for varying levels of heterogeneity across genes. This interpretation leaves $\xi_i$ as the main parameter of interest. Figure 3(b) shows the posterior means of all $\xi_i$ for a yeast RNA-Seq data set (see Section 3). While the cloud in the middle represents the majority of nondifferentially expressed genes, the genes with values $\xi_i$ outside the cloud are those with differential expression. We use a mixture of normal distributions for $\xi_i$ to formalize the notion of differential expression. That is,

$$\xi_i \sim \pi_{-1}^{\lambda}\, N(\bar{\xi} - \delta_{-1}, s_{\xi}^{2}) + \pi_{0}^{\lambda}\, N(\bar{\xi}, s_{\xi}^{2}) + \pi_{1}^{\lambda}\, N(\bar{\xi} + \delta_{1}, s_{\xi}^{2}). \tag{1}$$

Figure 3: Posterior probabilities of differential expression, $\widehat{p}_i = \Pr(\lambda_i \neq 0 \mid \text{data})$ (panel a), and the posterior mean of relative gene expression over the two conditions, $\widehat{\xi}_i = E(\xi_i \mid \text{data})$ (panel b).

We introduce a latent trinary indicator $\lambda_i \in \{0, -1, 1\}$ to represent normal, under-, and over-expression, and rewrite the mixture model (1) as a hierarchical model

$$\xi_i \mid \lambda_i \sim N(\bar{\xi} + \lambda_i \delta_{\lambda_i}, s_{\xi}^{2}), \qquad \Pr(\lambda_i = l) = \pi_{l}^{\lambda}, \quad l = -1, 0, 1.$$
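The reparameterization from $(\alpha_i, \beta_i)$ to $(\xi_i, \eta_i)$ described above is a one-to-one map, and a short sketch may help when moving between the two scales. The function names are illustrative only.

```python
from math import exp, log


def to_xi_eta(alpha, beta):
    """Map (alpha, beta) to xi = log(alpha / beta) and eta = log(alpha + beta)."""
    return log(alpha / beta), log(alpha + beta)


def to_alpha_beta(xi, eta):
    """Invert the map: the beta mean is logit^{-1}(xi) and alpha + beta = exp(eta)."""
    mean = 1.0 / (1.0 + exp(-xi))
    total = exp(eta)
    return total * mean, total * (1.0 - mean)


# xi = 0 corresponds to a beta mean of 0.5, that is, equal expected proportions.
alpha, beta = to_alpha_beta(xi=0.0, eta=log(40.0))
xi, eta = to_xi_eta(alpha, beta)
```

Because $\xi_i$ is the logit of the beta mean, a shift of $\xi_i$ away from $\bar{\xi}$ changes the expected ratio, while $\eta_i$ only controls how tightly the $p_{ij}$ concentrate around that ratio.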

We complete the model with priors for $\pi^{w} = (\pi_{1}^{w}, \ldots, \pi_{I}^{w})$, $\pi^{\lambda} = (\pi_{-1}^{\lambda}, \pi_{0}^{\lambda}, \pi_{1}^{\lambda})$, $\delta_{-1}$, $\delta_{1}$, and $s_{\xi}^{2}$. We use a beta distribution $\pi_{i}^{w} \sim \text{Be}(a_w, b_w)$, independently across $i$, a Dirichlet prior $\pi^{\lambda} \sim \text{Dir}(a_{-1}, a_{0}, a_{1})$, and a gamma prior $s_{\xi}^{-2} \sim \text{Ga}(a_s, b_s)$. Finally, we use independent gamma priors $\delta_{l} \sim \text{Ga}(a_{l}^{\delta}, b_{l}^{\delta})$, $l = -1, 1$, and $\pi(\bar{\xi}) \propto 1$.

The hyperprior distribution on $\bar{\xi}$ allows for imbalance between the overall counts under the two conditions.

In contrast to fixing $\bar{\xi}$, for example, at $\bar{\xi} = 0.5$, the hierarchical extension with the hyperprior allows for a systematic bias (such as different sequencing depth) across the two conditions. Using possibly different $\delta_{-1}$ and $\delta_{1}$ allows for varying deviation from the mean $\bar{\xi}$ for over- versus under-expressed genes. For simplicity, we fix $\eta_i$ in the analysis for the yeast data. If a prior on $\eta_i$ were desired, one could easily extend the model accordingly, using, for example, the prior model from Robert and Rousseau.^{24} The model is summarized in Figure 1.

## 3. Yeast Data Analysis

### 3.1. Data

We illustrate the proposed approach with an RNA-Seq data set from Ingolia et al.^{25} Specifically, mRNA was extracted from yeast, Saccharomyces cerevisiae strain BY4741, in rich growth medium (YEPD medium) and poor growth medium (amino acid starvation). The goal of the experiment was to identify genes that are differentially expressed between these two biologic conditions. The sequences of short reads were produced using an Illumina Genome Analyzer II. The short reads were mapped using the SOAP method of Li et al.^{26} The data set consists of counts under two different conditions for 1,285 genes.

We considered $I$ = 1,089 genes having $J_i \geq 5$ positions for analysis and discarded the remaining 196 for lack of information. The read counts of those 1,089 genes under the two growth conditions, $\sum_{j=1}^{J_i} n_{ij}$ and $\sum_{j=1}^{J_i} m_{ij}$, range from 1 to 9,334 and from 0 to 14,150, respectively. Figure 2 shows histograms of $J_i$ (panel a) and $\sum_{j=1}^{J_i} N_{ij}$ (panel b) on a logarithmic scale (base 10). Overall, genes have many positions with non-zero counts, and reads per position are small.

### 3.2. Markov chain Monte Carlo simulations

We estimated and fixed $\eta_i$ as follows. First, we find $\widehat{\alpha}_i$ and $\widehat{\beta}_i$ such that $\widehat{\alpha}_i/(\widehat{\alpha}_i + \widehat{\beta}_i) = r_i$ and $\widehat{\alpha}_i \widehat{\beta}_i / \{(\widehat{\alpha}_i + \widehat{\beta}_i)^{2} (\widehat{\alpha}_i + \widehat{\beta}_i + 1)\} = \text{var}(r_{ij})$, the sample variance of the $r_{ij}$. We fix $\eta_i = \log(\widehat{\alpha}_i + \widehat{\beta}_i)$. We expect that about 5% of all genes are differentially expressed and that about 5% of all positions are outliers. We therefore set $(a_w, b_w) = (19, 1)$, $(a_{-1}, a_{0}, a_{1}) = (1, 38, 1)$, $(a_{-1}^{\delta}, b_{-1}^{\delta}) = (5, 0.11)$, $(a_{1}^{\delta}, b_{1}^{\delta}) = (5, 0.12)$, and $(a_s, b_s) = (3, 0.09)$. We implemented posterior inference using Markov chain Monte Carlo (MCMC) posterior simulations for the proposed model. The implementation is a standard Gibbs sampling algorithm using Metropolis-Hastings transition probabilities with random walk proposals when the complete conditional posterior distribution is not available for efficient random variate generation. We ran the MCMC simulation by iterating over all complete conditionals for 4,500 iterations, discarding the first 500 iterations as burn-in.
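The moment-matching step used to fix $\eta_i$ has a closed-form solution, which the following sketch spells out. Writing $s = \widehat{\alpha}_i + \widehat{\beta}_i$, the beta variance is $r_i(1 - r_i)/(s + 1)$, so $s = r_i(1 - r_i)/\text{var}(r_{ij}) - 1$. The function name and the error handling are illustrative assumptions, not taken from the authors' R code.

```python
from math import log


def moment_match(mean, var):
    """Return (alpha, beta, eta) of the beta distribution with the given mean and variance.

    Uses s = alpha + beta = mean * (1 - mean) / var - 1 and eta = log(s).
    A valid solution requires 0 < mean < 1 and 0 < var < mean * (1 - mean).
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    s = mean * (1.0 - mean) / var - 1.0
    if var <= 0.0 or s <= 0.0:
        raise ValueError("variance is incompatible with a beta distribution")
    return mean * s, (1.0 - mean) * s, log(s)


# Hypothetical gene with gene-level ratio 0.5 and sample variance 0.01 of the r_ij.
alpha_hat, beta_hat, eta = moment_match(mean=0.5, var=0.01)
```

With these inputs the matched distribution is Be(12, 12), so $\eta_i = \log 24$. Genes whose $r_{ij}$ vary more than a beta distribution allows have no solution and would need separate handling.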

### 3.3. Results

Figure 3(a) plots the posterior probabilities of differential expression, $\widehat{p}_i = \Pr(\lambda_i \neq 0 \mid \text{data})$. Some genes report very large posterior probabilities $\widehat{p}_i$. Figure 3(b) plots the posterior means $\widehat{\xi}_i = E(\xi_i \mid \text{data})$. The three dashed horizontal lines mark the posterior means of $(\bar{\xi} + \delta_{1})$, $\bar{\xi}$, and $(\bar{\xi} - \delta_{-1})$, respectively. The genes close to or outside the boundary of the lower and upper dashed lines are reported as differentially expressed.

Figure 4a plots the marginal posterior probabilities $\widehat{p}_i$ against the empirical estimate $r_i$ of relative expression. The plot illustrates that $\widehat{p}_i$ agrees with the ad-hoc estimates $r_i$ for most genes. But there are some genes where $\widehat{p}_i$ disagrees with (we would argue, improves upon) ad-hoc inference with $r_i$. In the next two figures we explore possible reasons for this. Figures 5 and 6 present summaries for some selected genes to illustrate agreement and disagreement of $r_i$ and $\widehat{p}_i$. In both figures, the plots in the first column show $N_{ij}$ (circle) and $n_{ij}$ (cross) along positions. The second column plots $r_{ij}$ along positions. The dashed line indicates the posterior mean $\widehat{\xi}_i$, and the dotted line shows the empirical estimate $r_i$. The line for $\widehat{\xi}_i$ is plotted at $\text{logit}^{-1}(\widehat{\xi}_i)$ to map to the unit scale. The third column plots the posterior probability $\widehat{w}_{ij} = \Pr(w_{ij} = 1 \mid \text{data})$ along positions.

Figure 4: $\widehat{p}_i = \Pr(\lambda_i \neq 0 \mid \text{data})$ plotted against $r_i$ (panel a). The triangles and squares indicate genes for which posterior inference agrees (triangles) and disagrees (squares) with the inference based on ...

Comparison of the two figures explains the observed discrepancies in $r_i$ and $\widehat{p}_i$. The large $r_i$ in Figure 6 are due to outliers in $r_{ij}$, including some positions with small total read counts $N_{ij}$. In contrast, under the posterior inference, many of the $\widehat{w}_{ij}$ are imputed with relatively smaller values, leading to a downweighting of the corresponding $r_{ij}$ in the inference for the gene-specific indicators $\lambda_i$ for differential expression, and thus for $\widehat{p}_i$. Except for these few outliers, most $r_{ij}$’s are aligned around a value close to 0.5, indicating nondifferential expression. In other words, while $r_i$ is very sensitive to outliers, the model-based estimate downweights outliers, as desired.

The computation of posterior probabilities $\widehat{p}_i = \Pr(\lambda_i \neq 0 \mid \text{data})$ is only half the desired inference. We still need to decide which genes should be reported as differentially expressed. We use a decision rule based on flagging genes with $\widehat{p}_i > \kappa$ for some threshold $\kappa$. We fix the threshold $\kappa$ by setting a bound on the false discovery rate (FDR).^{27} Figure 4b summarizes the FDR implied by decision rules of reporting the genes with highest probability of differential expression. For $\overline{FDR} \leq 0.10$ the rule reports 46 differentially expressed genes. The rule corresponds to a threshold $\kappa = 0.618$.
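The thresholding rule can be sketched as follows. The text does not give the formula for $\overline{FDR}$, so the sketch assumes the usual posterior expected FDR, namely the average of $1 - \widehat{p}_i$ over the reported genes. The function name and the example probabilities are hypothetical.

```python
def select_by_fdr(p_hat, bound):
    """Report the largest top-ranked gene set whose posterior expected FDR is <= bound.

    The posterior expected FDR of a reported set is taken to be the mean of
    (1 - p_hat[i]) over that set (an assumption; see the lead-in).
    Returns the reported gene indices and the attained expected FDR.
    """
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i], reverse=True)
    selected, expected_false, fdr = [], 0.0, 0.0
    for i in order:
        new_false = expected_false + (1.0 - p_hat[i])
        new_fdr = new_false / (len(selected) + 1)
        # Genes are added in decreasing order of p_hat, so the running mean
        # of (1 - p_hat) never decreases and we can stop at the first violation.
        if new_fdr > bound:
            break
        expected_false, fdr = new_false, new_fdr
        selected.append(i)
    return selected, fdr


# Hypothetical posterior probabilities of differential expression.
p_hat = [0.99, 0.95, 0.90, 0.60, 0.20, 0.05]
reported, fdr = select_by_fdr(p_hat, bound=0.10)
```

In this toy example the first three genes are reported, and any threshold $\kappa$ between 0.60 and 0.90 yields the same set. Adding the fourth gene would raise the expected FDR to 0.14, above the bound.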

## 4. Simulation

We carry out a simulation study to further examine the proposed model. The study investigates the performance of our method in the case where genes have many positions with nonzero counts. In the study, we assume small within-gene variabilities in the read counts and large across-gene variabilities. We achieve this by centering $\eta_i$ around a small value and allowing a relatively large variance for $\xi_i$ in our model.

Since the primary goal is inference on $\xi_i$, we fix $\eta_i$ at their simulation truth. We place priors on the remaining parameters, $(\bar{\xi}, s_{\xi}^{2}, \pi^{w}, \pi^{\lambda}, \delta_{-1}, \delta_{1})$, as described in Section 2.

We compare model-based estimates with the simulation truth, and compare the inference under the proposed model to that under two methods: (1) the Analysis of Sequence Counts (ASC) proposed by Wu et al^{18} and (2) the MA-plot-based method with random sampling model (DEGseq) proposed in Wang et al.^{1}

In the ASC, Wu et al model the aggregate read count for each gene under each condition as a binomial random variate, given the total read count summing over all the genes at each condition. The expected proportions in the binomial are compared between the two conditions for each gene. They use $\delta$ to denote the difference between the logarithms of the proportions and $\lambda$ as the sum of the two log proportions. They propose unimodal prior distributions for $\delta$ and $\lambda$ and compute the posterior probability $P(|\delta_i| > \Delta_0 \mid \text{data})$, where $\delta_i$ is the log fold change in gene expression of gene $i$, and $\Delta_0$ is a pre-defined threshold for biological significance. In DEGseq, Wang et al. define $M_i = \log_2(C_{0i}) - \log_2(C_{1i})$ and $A_i = (\log_2(C_{0i}) + \log_2(C_{1i}))/2$, where $C_{0i} = \sum_{j=1}^{J_i} n_{ij}$ and $C_{1i} = \sum_{j=1}^{J_i} m_{ij}$. They assume that given $A_i = a$, $M_i$ approximately follows a normal distribution with a mean and variance that are expressed in terms of the totals $C_{\kappa \cdot} = \sum_{i=1}^{I} C_{\kappa i}$ for $\kappa = 0, 1$. Inference on differential gene expression is then formalized with a z-test. For this simulation study, a normalization for DEGseq and ASC is not necessary since $\bar{\xi}$ is set at 0.
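For reference, the gene-level $M_i$ and $A_i$ statistics defined above can be computed from position-level counts as in the sketch below. Only the definitions given in the text are implemented; the normal approximation and the z-test of DEGseq are not reproduced. The function name and counts are hypothetical, and genes with a zero aggregate count under either condition are assumed to be excluded beforehand.

```python
from math import log2


def ma_values(n, m):
    """Return (M_i, A_i) from position-level counts n_ij and m_ij of one gene.

    C_0i and C_1i are the aggregate counts under conditions 0 and 1;
    both must be positive for the logarithms to exist.
    """
    c0, c1 = sum(n), sum(m)
    if c0 <= 0 or c1 <= 0:
        raise ValueError("aggregate counts must be positive")
    return log2(c0) - log2(c1), (log2(c0) + log2(c1)) / 2.0


# Hypothetical gene with aggregate counts 8 and 2.
M, A = ma_values(n=[3, 5], m=[1, 1])
```

Both statistics depend on the counts only through the aggregates $C_{0i}$ and $C_{1i}$, so any position-level structure, including outlying positions, is discarded at this step.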

We simulate a sample of $I$ = 1,200 genes. For half of the genes we assume $J_i = 300$ recorded positions per gene, and for the other half we use $J_i = 100$. We let $\lambda_i = -1$ or 1 for 150 genes and $\lambda_i = 0$ for the remaining 450 genes. Given $\lambda_i$, we generate $\eta_i \sim N(\bar{\eta}, s_{\eta}^{2})$ and $\xi_i \sim N(\bar{\xi} + \lambda_i \delta_{\lambda_i}, s_{\xi}^{2})$, with $\bar{\eta} = 2$, $s_{\eta}^{2} = 0.25^{2}$, $\bar{\xi} = 0$, $s_{\xi}^{2} = 0.1$, and $\delta_{-1} = \delta_{1} = 1$. We let $w_{ij} = 0$ or 1 independently with probabilities 0.05 and 0.95, respectively. Conditional on $w_{ij} = 1$ or 0, we respectively generate $p_{ij}$ from either a $\text{Be}(\alpha_i, \beta_i)$ or a $\text{Be}(1/2, 1/2)$ distribution, where $\alpha_i = \exp(\eta_i)\exp(\xi_i)/(1 + \exp(\xi_i))$ and $\beta_i = \exp(\eta_i)/(1 + \exp(\xi_i))$. Finally, we generate $N_{ij} \sim \text{Ga}(1.5, 1/1.5)$ (rounded up to the nearest integer), and $n_{ij} \sim \text{Bin}(N_{ij}, p_{ij})$, independently. We then proceed to estimate $\xi_i$ and $P(\lambda_i \neq 0 \mid \text{data})$ conditional on $N_{ij}$ and $n_{ij}$ under the proposed model.
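The data-generating steps for a single gene can be sketched as follows, using only the Python standard library. Two points are assumptions rather than statements from the text: the second argument of Ga(1.5, 1/1.5) is read as a rate (so the scale passed to `gammavariate` is 1.5), and the function and argument names are invented for illustration. The sketch does not decide how the 150 differentially expressed genes are divided between $\lambda_i = -1$ and $\lambda_i = 1$, since the text does not say.

```python
import math
import random


def simulate_gene(J, lam, rng, eta_bar=2.0, s_eta=0.25, xi_bar=0.0,
                  s2_xi=0.1, delta=1.0, prob_regular=0.95):
    """Simulate counts (n_ij, N_ij), j = 1..J, for one gene with indicator lam."""
    eta = rng.gauss(eta_bar, s_eta)
    xi = rng.gauss(xi_bar + lam * delta, math.sqrt(s2_xi))
    mean = 1.0 / (1.0 + math.exp(-xi))
    alpha = math.exp(eta) * mean
    beta = math.exp(eta) * (1.0 - mean)
    n, N = [], []
    for _ in range(J):
        regular = rng.random() < prob_regular  # w_ij = 1 with probability 0.95
        if regular:
            p = rng.betavariate(alpha, beta)
        else:
            p = rng.betavariate(0.5, 0.5)      # outlier component
        # Assumption: Ga(1.5, 1/1.5) is shape 1.5 and rate 1/1.5, i.e. scale 1.5.
        total = max(1, math.ceil(rng.gammavariate(1.5, 1.5)))
        # Binomial draw as a sum of Bernoulli trials.
        count = sum(rng.random() < p for _ in range(total))
        n.append(count)
        N.append(total)
    return n, N


rng = random.Random(2011)
n_sim, N_sim = simulate_gene(J=100, lam=1, rng=rng)
```

Repeating the call over genes with $J_i = 300$ or $J_i = 100$ and the chosen $\lambda_i$ would produce a data set of the kind described above.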

The receiver operating characteristic (ROC) curve is commonly used to select an optimal method for classification problems. We assume a decision rule that reports genes with posterior probabilities, $P(\lambda_i \neq 0 \mid \text{data})$ and $P(|\delta_i| > \Delta_0 \mid \text{data})$ (in the cases of the proposed approach and ASC), or $P$-value (in the case of DEGseq), beyond a threshold, where we set $\Delta_0 = 1.8$. The ROC curve plots the true positive rate against the false positive rate as a parametric curve indexed by the threshold. Figure 7 shows the three ROC curves. The ROC curve for the proposed method compares favorably against the alternatives. It demonstrates the limitations of ASC. We believe that this is due to the strong assumptions on the shape of the priors of $\delta$ and $\lambda$. The simulation truth is that the mean expression of the genes is generated from a mixture of three distributions, which does not agree with the unimodal assumptions of the ASC model.

Figure 7: ROC curves for ...^{28} and the ASC (blue dashed line) proposed by Wu et al^{18} in the simulation study.
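An ROC curve of the kind compared above can be traced by sweeping the reporting threshold over the observed scores. The sketch below is a generic construction, not the authors' code. It assumes that larger scores indicate stronger evidence of differential expression, so a $P$-value would first be converted, for example to one minus the $P$-value. The names and the toy scores are hypothetical.

```python
def roc_points(scores, truth):
    """Return (false positive rate, true positive rate) pairs as the threshold decreases.

    scores: larger values mean stronger evidence of differential expression.
    truth:  True for genes that are differentially expressed in the simulation truth.
    Both classes must be present.
    """
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    ranked = sorted(zip(scores, truth), key=lambda pair: pair[0], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for k, (score, is_de) in enumerate(ranked):
        if is_de:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last gene of a group of tied scores.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: four genes, two of them truly differentially expressed.
curve = roc_points([0.9, 0.8, 0.7, 0.3], [True, False, True, False])
auc = area_under_curve(curve)
```

The area under the curve gives a single-number summary when the curves of several methods need to be ranked.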

Regarding the performance of DEGseq, we note that longer genes tend to have larger aggregate counts across positions. Therefore, DEGseq is more likely to declare long genes with small effects as differentially expressed, since its estimated standard deviation inherently depends on the mean counts. Specifically, we observe that DEGseq tends to produce smaller $p$-values for nondifferentially expressed genes with $J_i = 300$ than for those with $J_i = 100$ due to the gene-length bias (see Fig. 8a). On the other hand, the proposed model accounts for position-specific variability while more information on relative gene expression gets accumulated as the number of positions within a gene increases (see Fig. 8b). Therefore, the proposed method tends to produce smaller posterior probability of differential expression for non-differentially expressed genes with $J_i = 300$ than for those with $J_i = 100$. This, coupled with vague position-specific information, leads to superior performance of the proposed method for longer genes. This has significant implications for statistical inference of differential expression using RNA-Seq data. Since RNA-Seq experiments produce many non-zero count positions within a gene, and many reads per position, the RNA-Seq data enable us to model variability among expression levels on positions within the same gene, and incorporating it into a model improves the resulting inference.

Figure 8: $P$ value under DEGseq (panel a) and $\widehat{p}_i$ under the proposed method (panel b) by the number of positions within a gene, $J_i$ = 100 or 300.

We note that if both $N_{ij}$ and $J_i$ are small, modeling the position-level read counts does not significantly improve inference. Also, if there is little variation across position-level counts, then the loss of information under aggregation remains negligible. We found that for cases where short reads are mapped to a small number of positions, DEGseq performs well (results not shown). However, such situations are atypical for large-scale RNA-Seq experiments with usually very noisy data.

## 5. Discussion

We proposed a Bayesian model-based approach for inference with RNA-Seq data. We introduced a hierarchical structure to model the position-level count data. We demonstrate through a simulation study and the analysis of a yeast experiment that the model effectively downweights outlying observations at the position level and obtains more robust estimates of gene expression.

The model provides a promising framework for further development of statistical models for RNA-Seq data. One possible extension is to relax the parametric assumption for $\xi_i$. By removing the restriction to a specific parametric family of distributions, one could further robustify inference about gene expression levels. Another important extension is to incorporate dependence across genes. In the current model we assumed that the $\xi_i$ are independently and identically distributed. One may achieve more precise estimates and formal inference about dependence structure by generalizing the model to allow for dependence of $\xi_i$ across genes. One could build on available prior information to construct informative priors for dependence at the level of the indicators $\lambda_i$. For a model with indicators at the gene level similar to the $\lambda_i$ used in our model, this has been carried out previously.^{29} The discrete nature of $\lambda_i$ greatly simplifies general modeling of dependence structure. For a recent discussion of models for dependent gene expression see, for example, Stingo et al^{30} or Jones et al,^{31} and references therein. Both references use a model-based Bayesian approach as in this paper.

Finally, while the model was specifically developed for experiments comparing two conditions without biologic replicates, simple modifications would allow its use for experiments with replicates or experiments with multiple conditions. The proposed model can be extended for experiments with replicates by replacing the binomial sampling model for $n_{ij}$ by a model for counts across replicates. For experiments with multiple conditions, one may consider a multinomial likelihood with a Dirichlet prior.

## Acknowledgments

Yuan Ji and Peter Müller’s research is partially supported by NIH R01 CA132897. Shoudan Liang’s research is supported by NIH K25 CA123344.

## Footnotes

**Disclosure**

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

