DifferentialRegulation: a Bayesian hierarchical approach to identify differentially regulated genes

Summary Motivation: Although transcriptomics data is typically used to analyse mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor-) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g., healthy vs. diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, i.e., reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. Results: Here, we present DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks where our approach shows good performance, in terms of sensitivity and error control, versus state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. Availability and implementation: DifferentialRegulation is distributed as a Bioconductor R package.

In fact, existing single-cell analysis tools either assign all ambiguous reads as spliced, for example, velocyto (La Manno et al., 2018), or acknowledge the difficulty and quantify the result of the ambiguous reads as a stand-alone count matrix (i.e., a), and leave the task of solving the ambiguity to the users, such as alevin (Srivastava et al., 2019) and alevin-fry (He et al., 2022).The difficulty of ambiguous read allocation comes from both biological and computational aspects.First, s and u mRNA share large genomic regions (e.g., exons), while scRNA-seq reads are characterized by short fragments, which are often compatible with both splice variants of genes.Second, most ambiguous reads can be allocated to spliced and unspliced genes equally well, with no decisive evidence for disambiguating those reads.Third, while read coverage is approximately uniform in bulk RNA-seq reads (i.e., the location of reads in a transcript can be assumed to be roughly random, hence uniform, across the transcript), this is not the case for single-cell protocols.At present, we still lack accurate methods that model the underlying mechanism of ambiguous reads, and address the respective uncertainty in a probabilistic way.

MCMC scheme
We alternately sample parameters and latent states from their conditional distributions, following a Metropoliswithin-Gibbs sampling scheme.In particular, we sample: • hyper-parameters δ|π from a Metropolis sampler, with an adaptive random walk proposal; • parameters ρ|X from Dirichlet Gibbs samplers; • hierarchical parameters π|X, δ from Dirichlet Gibbs samplers; • latent states X|Z, π from multinomial Gibbs samplers.
Below, we describe the prior specifications, and the details of each sampling scheme.

δ|π Metropolis sampling
For convenience, the δ parameters are sampled after applying a logarithmic transformation.The posterior distribution of log(δ) is defined as: p(log(δ)|π) ∝ p(π|log(δ))p(log(δ)), where p(log(δ)) is the informative prior for log(δ) (defined below).Instead, p(π|log(δ)), in the bulk case, is the product of the T beta densities defined in (2.3) (main paper), while, in the single-cell model, is the product of the G Dirichlet densities defined in (2.6) (main paper).
We use an empirical Bayes approach to formulate an informative prior for the hyper-parameters.In particular, we randomly select 1,000 genes/transcripts, and, for each one of them, we use DRIMSeq (Nowicka and Robinson, 2016) to obtain an initial estimate of the following parameters: log(δ + ), log(δ S ), and (in the single-cell model) log(δ U ).For each parameter, we compute the mean and standard-deviation of its 1,000 estimated values (1 per selected gene): µ + and σ + for log(δ + ), µ S and σ S for log(δ S ), and (in the single-cell model) µ U and σ U for log(δ U ).
These estimates are used to formulate the following informative prior distributions for the hyper-parameters: Hyper-parameters are sampled, in the logarithmic scale, from a Metropolis sampler; in particular, we sample parameters (log(δ S ), log(δ U ), log(δ A )), (or simply (log(δ S ), log(δ U )) in the bulk model), and, to obtain their prior distributions, apply the change of variable via the Jacobian transformation (Murphy, 2012).
Values are proposed based on an adaptive random walk (Haario et al., 2001).In the initial 200 iterations, we propose values from a normal distribution independently for each hyper-parameter.After 200 iterations, a covariance matrix of the posterior chains is computed, and hyper-parameters are proposed from a multivariate normal distribution, with this covariance matrix.The covariance matrix is then re-updated at 300, 400, 500 iterations, and at the burn-in (which is set to be at least 500 iterations).

ρ|X Gibbs sampling
Below, we express densities in terms of Y , which is directly derived from X: in the bulk model, for each transcript t, and sample i, Y Similarly, in the single-cell model, for gene g and sample i, we have that Y We assume a weakly informative Dirichlet prior distribution for ρ, ρ i ∼ Dir (1, . . ., 1) , for i = 1, . . ., N. (1) The posterior distribution of ρ i can be obtained as: is the density of the multinomial distribution, reported in the main paper in formula (2.1) for the bulk model, and (2.4) for the single-cell model.Given the conjugacy of the Dirichlet and multinomial distributions, the posterior distribution is again Dirichlet distributed, and, in the bulk model, it is equal to: In the single-cell model, G (the number of genes) replaces T (the number of transcripts).

π|X, δ Gibbs sampling
In the bulk model, for each sample i and transcript t, the posterior distribution of π Si can be obtained as: Si δ (t) , where p X is the density of the Binomial distribution in (2.2) (main paper), and p π (t) Si δ (t) is the conjugate Beta hierarchical prior distribution in (2.3) (main paper).These choices lead to the following Beta posterior distribution for π , for t = 1, . . .T, i = 1, . . .N. (3) Similarly, in the single-cell model, for each sample i and gene g, the posterior distribution of π (g) i can be calculated as: p π g) , where p X is the density of the multinomial distribution in (2.5) (main paper), and p π g) is the conjugate Dirichlet hierarchical prior distribution in (2.6) (main paper).These results lead to the following Dirichlet posterior distribution for π Ai + δ (g) A , for g = 1, . . .G, i = 1, . . ., N. (4)

X|Z, π Gibbs sampling
Assume that a total of J equivalence classes (EC) are available.Each EC is defined by two elements, which for the j-th EC are W j and V j : the former contains the list of gene(s)/transcript(s) compatible with the j-th equivalence class, while the latter indicates the splice status of the corresponding gene(s)/transcript(s) (i.e., s or u for bulk data, and s, u or a for single-cell data).Each EC is also associated to a count, z j for the j-th class, indicating the number of (bulk or single-cell) RNA-seq reads compatible with it.
For each EC, we aim to allocate its counts to the gene(s)/transcript(s), and respective splice status, compatible with it.
For simplicity, below we consider the bulk model; the single-cell model is a natural extension of it.We define U ij as the contribution (i.e., counts) of the j-th EC to the spliced and unspliced versions of transcript t in sample i.
Assume that the j-th EC is compatible with n j transcripts.The counts of the j-th EC from the i-th sample are allocated according to the following multinomial distribution: where , W j = w 1 , . . ., w n j , and For every sample, i = 1, . . ., N , once the counts of all J ECs have been allocated, we can recover the overall s and u abundance of each transcript, t = 1, . . ., T , by adding elements across ECs: U ij ; then, the overall transcript-level counts can be retrieved as U i .In order to accelerate the runtime of our method, latent states are only sampled every 10 iterations (i.e., undersampling scheme).

Wald-type test for the single-cell model
Given two groups of samples, A and B, we aim to test if parameter π S , π U , and π A change between groups.We consider the difference of these parameters across groups as: ω = (ω S , ω U ), with ω S = A π S − B π S , and Under the null hypothesis, π S and π U (and consequently π A ) do not change across groups; therefore we test the following system of hypotheses: H 0 : ω S = ω U = 0; H 1 : otherwise.
We approximate the posterior distribution of ω with a bivariate normal (Gelman et al., 1995): ω ∼N (ω, Σω ), with parameters ω and Σω inferred from the posterior chains.Given this normal approximation, we use a bivariate Wald test (Li et al., 1991)

Simulation details
In the main paper (Section 2.5), we illustrate how, in both bulk and single-cell simulations, starting from a real dataset as anchor data, we obtain the counts for each transcript (bulk data) or gene (single-cell data).Then, we artificially introduce a differential effect in a randomly selected subset of the genes/transcripts.Finally, starting from these counts, in order to introduce mapping uncertainty, we use read-level simulators (RSEM (Li and Dewey, 2011) for bulk data and minnow (Sarkar et al., 2019) for single-cell data) to simulate read-level data.Below, we explain how the differential effects are introduced.
To generate differentially regulated (DR) genes/transcripts, we first define a list of genes/transcripts that are eligible for being chosen as DR.In particular, a gene/transcript is eligible when, if it was selected as DR, it would lead to a difference, between groups A and B, in the (group-level) relative abundance of unspliced reads, πU , of at least 0.2 (i.e., B πU − A πU ≥ 0.2).Since DR is introduced by inverting spliced and unspliced reads in one group, this condition can be easily checked, in the data where DR has not been introduced, as: This condition prevents us from simulating negligible differential effects (i.e., B πU − A πU < 0.2), which would be uninteresting.Then, from the list of eligible genes/transcripts, we randomly select 2,000 transcripts (bulk simulation), or 20% of the genes (single-cell simulation).For each one of the selected genes/transcripts, we randomly select the group (A or B) where we will alter count data; this ensures that, on average, half of the modifications will affect group A, and half group B. Finally, for each gene, in each sample belonging to the selected group, we invert their spliced (s) and unspliced (u) counts.This introduces a difference in the relative abundance of s and u reads between groups.
In our DGE simulation, additionally to DR genes/transcripts (see above for DR simulation design), we also introduce a nuisance DGE effect.To this aim, we randomly sample 2,000 (bulk simulation) or 20% of the genes (single-cell data).For each gene/transcript, we randomly sample a fold change F C between conditions, and a group to alter: the gene counts in the selected group are then multiplied by F C (and rounded to the closest integer).As for the DR effect, randomly sampling the group ensures that, on average, changes will equally affect the two groups.In the simulations with average fold change 3, 6 and 9, F C is sampled as F C = 2 + Z, F C = 5 + Z and F C = 8 + Z, respectively, with Z ∼ exp(µ = 1), where exp(µ) is the exponential random variable with mean µ.This generates random fold changes with variance 1, and mean 3, 6 or 9.
In our DAS simulation (bulk data only), after simulating DR transcripts (details above), we generate a nuisance DAS effect.To do that, we first randomly sample 2,000 genes; in order to introduce DAS, for each selected gene, we randomly sample a group: in the selected group, we invert the abundance (i.e., counts) of the two most abundant transcripts in the gene.This is coherent with the design of previous simulation studies involving DAS (Tiberi and Robinson, 2020;Soneson et al., 2016).Note that DAS is only introduced in bulk simulations, because its simulation requires transcript-level resolution, which is not available in scRNA-seq protocols.
In the batch simulation study, we generate two DR effects: one between groups (of interest), and one between batches (nuisance).First, we specify a design for batches: see Supplementary Tables 1 and 2 for the designs of the bulk and single-cell data, respectively.Then, we generate a DR effect between groups (see details above), and a second DR effect between batches.Clearly, the two DR effects are based on distinct sets of randomly selected genes/transcripts.In each simulation, we simulate the same number of DR genes/transcripts between groups (difference of interest) and between batches (nuisance effect): 2,000 transcripts for the bulk data, and 20% of the genes for the single-cell data.
In the dropout simulation study (single-cell data only), we generate three simulations where, for each gene in every cell type, we only allow for 1, 5 and 10% of the cells to have non-zero abundance.In particular, for each gene and cell type, we add the abundance of s and u reads across all cells of the specific cell type; then, we randomly allocate these s (and u) overall counts across a randomly selected fraction (1, 5 or 10%) of genes.
In this way the overall s and u abundances, of each gene in every cell type, are unchanged.Finally, starting from this dataset, we introduce a DR effect between groups, as in the main DR simulation (details above), by inverting the abundance of s and u reads in 20% of the genes (selected at random).
The null simulation study is trivially obtained by simulating no differential effects of any kind between groups.
Finally, note that, although BRIE2 can account for the sample-to-sample variability, we found that including sample information led to a major decrease in power (Supplementary Figure 15); therefore, in all results, we considered the results obtained by fitting BRIE2 with group as the only covariate in the model.
2 Supplementary Tables Supplementary Table 6: Bulk real data analysis.Number of interesting genes present among the top 1,000 results returned by every method."DiffReg" refers to DifferentialRegulation; "mouse pancreas development" and "ZNF800" indicate the 110 genes associated to the term "mouse pancreas development", and the 3 genes associated to "Znf800", respectively; "overall" gathers all 112 genes from the previous two lists.

1 : 2 :Supplementary Figure 3 :Supplementary Figure 4 :
Bulk simulation stud with DR + DGE.Top row: ROC curves; i.e., false positive rate (FPR) vs. true positive rate (TPR).Bottom row: false positive (FP) results among top detections (topN).All simulations have differentially regulated and differentially expressed genes, with average fold-change (FC) of 3 (left panel), 6 (middle panel) or 9 (right panel).Results from the bulk simulations, with and without batch effects.Top row: ROC curves; i.e., false positive rate (FPR) vs. true positive rate (TPR).Bottom row: false positive (FP) results among top detections (topN).Left panel (DR): simulation with differential regulation only; right panel (DR + batch): simulation with differential regulation and batch effect.ROC curves for the main bulk simulations, stratified by overall transcript abundance: false positive rate (FPR) vs. true positive rate (TPR).Top row: low abundance (bottom 33% of abundant transcripts); middle row: medium abundance (middle 33% of abundance); bottom row: high abundance (top 33% of abundant transcripts).Left column: DR only simulation; middle column: DR and DGE simulation (average fold-change of 3); right column: DR and DAS simulation.False positive (FP) results among top detections (topN) for the main bulk simulations, stratified by overall transcript abundance: false positive rate (FPR) vs. true positive rate (TPR).Top row: low abundance (bottom 33% of abundant transcripts); middle row: medium abundance (middle 33% of abundance); bottom row: high abundance (top 33% of abundant transcripts).Left column: DR only simulation; middle column: DR and DGE simulation (average fold-change of 3); right column: DR and DAS simulation.

5 : 8 :
Histogram of p, the probability that group B is up-regulated (while 1 − p is the probability that B is downregulated), in the null simulation studies.Left: bulk simulation; right: single-cell simulation.Figure 6: Single-cell simulation study with DR + DGE.Top row: ROC curves; i.e., false positive rate (FPR) vs. true positive rate (TPR).Bottom row: false positive (FP) results among top detections (topN).All simulations have differentially regulated and differentially expressed genes, with average fold-change (FC) of 3 (left panel), 6 (middle panel) or 9 (right panel).Single-cell simulation study with various fractions of zero-abundant cells.Top row: ROC curves; i.e., false positive rate (FPR) vs. true positive rate (TPR).Bottom row: false positive (FP) results among top detections (topN).All simulations have differentially regulated genes only (no DGE or DAS), with 90 (left column), 95 (middle column) and 99% (right column) of zero cells (i.e., gene expression equal to 0).The average gene abundance is the same across simulations.Results from the single-cell simulations, with and without batch effects.Top row: ROC curves; i.e., false positive rate (FPR) vs. true positive rate (TPR).Bottom row: false positive (FP) results among top detections (topN).Left panel (DR): simulation with differential regulation only; right panel (DR + batch): simulation with differential regulation and batch effect.Supplementary Figure 9: ROC curves for the main scRNA-seq simulations, stratified by overall transcript abundance: false positive rate (FPR) vs. true positive rate (TPR).Top row: low abundance (bottom 33% of abundant transcripts); middle row: medium abundance (middle 33% of abundance); bottom row: high abundance (top 33% of abundant transcripts).Left column: DR only simulation; right column: DR and DGE simulation (average fold-change of 3).
Supplementary Figure11: Traceplot for the posterior distribution of πU (i.e., the key parameter of inference) for the 20 most significant transcripts of the bulk real data analysis (results 1-10).The dashed vertical grey line dentotes the burn-in (excluded from the analyses).

Table 1 :
Design of the simulation study with batch effects; bulk data.

Table 2 :
Design of the simulation study with batch effects; single-cell data.
Supplementary Table3: Percentage of false positive results in the null simulation study (bulk data); i.e., raw p-values below thresholds 0.1, 0.05, and 0.01.For DifferentialRegulation, we considered a false positive detection when p, the probability of up-regulation, was close to 0 or 1: p < 0.05 or p > 0.95 (for threshold 0.1), p < 0.025 or p > 0.975 (for threshold 0.05), and p < 0.005 or p > 0.995 (for threshold 0.01).

Table 4 :
Percentage of false positive results in the null simulation study (single-cell data); i.e., raw p-values below thresholds 0.1, 0.05, and 0.01.For DifferentialRegulation, we considered a false positive detection when p, the probability of up-regulation, was close to 0 or 1: p < 0.05 or p > 0.95 (for threshold 0.1), p < 0.025 or p > 0.975 (for threshold 0.05), and p < 0.005 or p > 0.995 (for threshold 0.01).

Table 5 :
Number of cells available, across all six brain organoids, for each cell type, in the real data analysis study.