NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data

Abstract Summary We present NewWave, a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA sequencing data. To achieve scalability, NewWave uses mini-batch optimization and can work with out-of-memory data, enabling users to analyze datasets with millions of cells. Availability and implementation NewWave is implemented as an open-source R package available through the Bioconductor project at https://bioconductor.org/packages/NewWave/ Supplementary information Supplementary data are available at Bioinformatics online.


Model specification
NewWave assumes a negative binomial distribution for the expression data. Given n samples and J genes across all samples and denoting by Y i j the count of gene j in cell i, the likelihood function for the observed count y i j is f NB (y i j ; µ i j , θ j ) = Γ(y i j + θ j ) Γ(y i j + 1)Γ(θ j ) θ j θ j + µ i j θ j µ i j µ i j + θ j y i j (1) in this parametrization the variance is We specify the following regression model: where X is a matrix with dimension n × M where M is the number of cell-level covariates. It typically contains a set of dummy variables that specify the batch of the samples and by default includes an intercept. V is a matrix with dimension J × L where L is the number of gene-level covariates. It typically only contains an intercept, used as library size normalization. W is a matrix with dimension n × K, where K is the number of latent factors, and it contains the low-rank representation of the cells. The parameters α, β, and γ are the regression coefficients for W, X, and W, respectively.
We note that (3) is a special case of the model of Risso et al. (2018) and we refer the reader to that paper for additional details.

Parameter estimation
The log-likelihood function to be maximized is The estimation of the parameters is done with a penalized approach for β, α, W, γ and not penalized for θ: where Pen(·) is a regularization term to reduce overfitting and improve numerical stability (see Risso et al., 2018, for details). Our iterative estimation procedure follows closely that of Risso et al. (2018), considering the special case of no zero inflation.
Special attention must be paid to the estimation of the dispersion parameters θ j . These can be estimated in a gene-wise fashion, or assuming a common dispersion parameter θ j = θ for all the genes.
This choice yields two different implementations. When the user wants to estimate a common dispersion parameter, our implementation takes advantage of the mini-batch strategy outlined below. However, the computation cannot leverage parallel computations, since only one value needs to be estimated. On the other hand, when the user requires gene-wise dispersion parameters, the computations can proceed in parallel for each gene, but the mini-batch strategy is not effective, as it slows down computations.

Mini-batches
Even when the mini batch approach is chosen, the first iteration will be done using all the observation; this method shows better performance in terms of time needed to converge. If the common dispersion approach is chosen, starting from the second iteration, only a random subset b of size m of observations is used at each iteration to estimate the parameter. Note that the parameter estimate is assumed to describe the variation across all observations, even those that do not belong to b.
At each iteration, the estimated parameter is optimal for b, but there is no guarantee of its optimality on the full data. Hence, the value of the parameter estimate is updated only if it leads to an increase in the log-likelihood function in the full data.
The optimization of the cell-and gene-specific parameters is performed with a mini-batch approach after a first iteration that uses all data. When parallel computing is used, each child process uses only m/c observations where m is the number of observations in the mini-batch and c is the number of child processes.
2 Details on the benchmark 2.1 Compared strategies Figure 1B shows a benchmark of the different estimation strategies implemented in the NewWave package, compared to the original ZINB-WaVE implementation, both with and without zero inflation.
The estimation strategies in NewWave are • Default (Common dispersion, no mini-batches). A single dispersion parameter, common to all genes, is estimated. All cells and genes are used for the estimation of gene-and cell-specific parameters.
• Common dispersion + mini-batch. A single dispersion parameter, common to all genes, is estimated. Mini-batches of 10% of the number of cells and 100 genes are used for the estimation of gene-and cell-specific parameters, respectively.
• Gene-wise dispersion + mini-batch. Gene-wise dispersion parameters are estimated. Mini-batches of 10% of the number of cells and 100 genes are used for the estimation of gene-and cell-specific parameters, respectively.
Note that since our preprocessing includes a preliminary step that selects the top 1,000 most variable genes, the genes in the minibatches are a random sample of these. Figure 1D shows a benchmark of different specifications of the NewWave model: • hdf5 10cores : NewWave with minibatch and genewise dispersion on a out-of-memory dataset (HDF5 format) using 10 cores.
• matrix 10cores : NewWave with minibatch and genewise dispersion on a dense in-memory matrix using 10 cores.
• PCA 10cores : Exact PCA on a out-of-memory dataset (HDF5 format) using 10 cores. We used exact PCA rather than randomized PCA as our preliminary results suggested that the former was faster.

Datasets used
We used three publicly available datasets for our benchmark, namely the BICCN dataset, the 10X Brain dataset, and the RNA mixture dataset.
BICCN data. We created the first dataset starting from the mouse primary motor cortex datasets generated by the BRAIN Initiative Cell Census Network (BICCN) (Yao et al., 2021), which we refer to as the BICCN dataset.
The raw data was generated as part of the BICCN consortium and can be downloaded from http://data.nemoarchive.org/biccn/lab/zeng/transcriptome/.
The known batch effect present in this dataset is due to the difference between platforms, namely 10X Chromium single-cell sequencing v2 and v3 and 10X Chromium single-nucleus sequencing.
We selected the 1,000 most variable genes after correcting for the batch effects using the mutual nearest neighbor method (?), as implemented in the fastMNN function of the batchelor Bioconductor package (v. 1.3.14). All the methods were applied to the raw counts of these 1,000 genes.
10X Brain data. The second dataset is the 1.3 million brain cell single-cell RNA-seq (scRNA-seq) data set generated by 10X Genomics.
The dataset is available in dense HDF5 format as part of the TENxBrainData Bioconductor package at https://bioconductor.org/packages/TENxBrainData. We selected the 1,000 most variable genes on the entire dataset.
RNA mixture data This dataset is made of 636 samples obtained from a mixture of RNA from three different cell lines (see Tian et al., 2019, for details). Samples were obtained using two different technologies that represent the batch effect.

Evaluation strategy
Here, we briefly describe the two metrics used to evaluate the accuracy of the compared methods, namely the Adjusted Rand Index (ARI) and the Akaike Information Criterion (AIC).

Adjusted Rand Index (ARI)
The ARI is a similarity measure between two partitions (usually a prediction and a ground truth). It is the adjusted version of the Rand Index (RI) (Hubert and Arabie, 1985). RI considers all pairs of samples and counts pairs that are assigned in the same or different clusters respectively in the predicted and true clusterings. Given a set S and two partitions of S , partition X composed of r subsets and partition Y composed of s subsets we define the following: • a, the number of pairs of elements in S that are in the same subset in X and in the same subset in Y • b, the number of pairs of elements in S that are in different subsets in X and in different subsets in Y • c, the number of pairs of elements in S that are in the same subset in X and in different subsets in Y • d, the number of pairs of elements in S that are in different subsets in X and in the same subset in Y Then the RI is defined as: The raw RI is then "adjusted for chance" into the ARI score using the following formula: The expected value is estimated using a permutation approach where the number and size of clusters within a clustering are fixed, and all random clusterings are generated by shuffling the elements between the fixed clusters.
The ARI has a value close to 0 for random labeling and 1 when the clusterings are identical (up to a permutation).

Akaike Information Criterion (AIC)
The AIC is a mathematical method for evaluating how well a model fits the data. It is used to compare different models and determine which one is the best fit for the data. The AIC can be interpreted as the value of the log-likelihood "adjusted" for the number of parameters of the model. In our case, the AIC can be written as: where k = J(M + K + 1) + n(L + K) is the total number of parameters andβ,γ,Ŵ, α, andθ are the estimates of the parameters of the model.
The AIC is a relative measure: its value cannot be interpreted in absolute terms, but can only be used to compare models. In particular, the lower the AIC the better the model. The best model according to AIC is the one that explains the greatest amount of variation using the fewest possible parameters.

Batch effect reduction
In this section we report additional results on the ability of NewWave to reduce batch effects. To do so we employ two datasets: the BICCN dataset, which is a complex real dataset, but lacks ground truth, and the RNA mixture, which has the advantage of providing ground truth labels.

BICCN dataset
Supplementary Figure 1 shows a UMAP representation of the full BICCN data color-coded by batch (panel A, B) and putative cell type (panel C, D).
Supplementary Figure 1 shows that NewWave reduces batch effects and leads to a better clustering of cell types compared to PCA.

RNA mixture dataset
The analysis of the RNA mixture data leads to similar conclusions. Briefly, the RNA mixture data is derived from seven mixtures of three cell lines (HCC827, H1975 and H2228) in different proportions, sequenced with two different technologies. Here, the single-cell protocol (CEL-seq2 vs SORT-seq) is considered a batch effect and the seven mixtures as the biological groups of interest.
Supplementary Figure 2 shows that NewWave is able to correct the batch effect generated from the different technologies used. In particular, the two batches are well mixed (panel A) and the proportions are correctly ordered in the UMAP plot (panel C).
In order to quantify the batch effect removal we performed a cluster analysis using the Louvain algorithm. The partition obtained after applying NewWave results in an ARI of 0.60 while the partition obtained with the PCA workflow results in an ARI of 0.50.   Figure 1: Comparison between NewWave and PCA in the BICCN dataset. A. UMAP following NewWave dimensionality reduction and color coded by batch. B. UMAP following PCA dimensionality reduction and color coded by batch. C. UMAP following NewWave dimensionality reduction and color coded by putative cell type. D. UMAP following PCA dimensionality reduction and color coded by putative cell type. Figure 2: Comparison between NewWave and PCA in the RNA mixture dataset. A. UMAP following NewWave dimensionality reduction and color coded by batch. B. UMAP following PCA dimensionality reduction and color coded by batch. C. UMAP following NewWave dimensionality reduction and color coded by putative cell type. D. UMAP following PCA dimensionality reduction and color coded by putative cell type.