Single-cell quantification of a broad RNA spectrum reveals unique noncoding patterns associated with cell types and states

Significance Each individual cell transcribes nearly 85% of the genome. However, when it comes to the analysis of cellular RNA, most studies are still only looking at the 3% that correspond to protein-coding transcripts. The role and abundance of the remaining RNAs across individual cells remain largely unknown. In the present study we describe Smart-seq-total, an RNA-sequencing method that delivers a broad picture of the total cellular RNA content. Using Smart-seq-total, we analyzed the content of hundreds of human and mouse cells and showed that the noncoding RNA content of cells significantly differs across cell types and dynamically changes throughout the vital processes of a cell, such as cell cycle and cell differentiation.

The ability to interrogate total RNA content of single cells would enable better mapping of the transcriptional logic behind emerging cell types and states. However, current single-cell RNA-sequencing (RNA-seq) methods are unable to simultaneously monitor all forms of RNA transcripts at the single-cell level, and thus deliver only a partial snapshot of the cellular RNAome. Here we describe Smartseq-total, a method capable of assaying a broad spectrum of coding and noncoding RNA from a single cell. Smart-seq-total does not require splitting the RNA content of a cell and allows the incorporation of unique molecular identifiers into short and long RNA molecules for absolute quantification. It outperforms current poly(A)-independent total RNA-seq protocols by capturing transcripts of a broad size range, thus enabling simultaneous analysis of protein-coding, long-noncoding, microRNA, and other noncoding RNA transcripts from single cells. We used Smart-seq-total to analyze the total RNAome of human primary fibroblasts, HEK293T, and MCF7 cells, as well as that of induced murine embryonic stem cells differentiated into embryoid bodies. By analyzing the coexpression patterns of both noncoding RNA and mRNA from the same cell, we were able to discover new roles of noncoding RNA throughout essential processes, such as cell cycle and lineage commitment during embryonic development. Moreover, we show that independent classes of short-noncoding RNA can be used to determine cell-type identity.
single-cell RNA-seq j noncoding RNA j cell cycle j differentiation E fforts in characterizing transcriptional states of single cells have so far mostly focused on protein-coding RNA (1)(2)(3)(4). However, a growing number of studies indicate that noncoding RNAs (ncRNAs), are actively involved in cell function and specialization (5)(6)(7)(8). Importantly, compared to the coding RNA, which is transcribed from only ∼1 to 2% of the genome, ncRNA constitutes a major fraction of all cellular transcripts and covers ∼70% of the genomic content (9). The role of these transcripts in shaping different cell types and states remain poorly understood.
Several groups have developed techniques to measure ncRNA in single cells (10)(11)(12)(13)(14)(15). The respective methods, however, are designed to target only a subset of noncoding transcripts, which are either short (∼18 to 200 nt; e.g., microRNA) (11,16), long (>200 nt, e.g., long ncRNA [lncRNA] or circular RNA [circRNA]) (10,14,17,18), or limited to specific types of RNA molecules, such as miRNA-mRNA pairs, for example (10,12). None of the existing methods are able to simultaneously quantify all RNA types within a cell. This limits the ability to map the regulatory connection between coding and noncoding transcripts within a cell and motivates the need for the development of novel singlecell technologies capable of assaying both poly(A) + and poly(A) À RNA, irrespective of transcript length.
In the present study we describe Smart-seq-total, a scalable one-pot method designed to capture both coding and noncoding transcripts regardless of their length. Inspired by the widely used Smart-seq2 protocol (19), this method harnesses the template-switching capability of MMLV reverse transcriptase to generate full-length cDNA with high yield and quality. Smartseq-total captures nonpolyadenylated RNA through templateindependent addition of polyA tails and further oligo-dT priming of all cellular transcripts. Because all RNA molecules are reverse transcribed using oligo-dT, they can also be tagged with unique molecular identifiers (UMIs) at that step. Therefore, Smart-seqtotal simultaneously quantifies the levels of mRNA alongside other RNA types in the same cell, which permits: 1) the annotation of cell types and states based on mRNA and integration of this data with the existent single-cell RNA-sequencing (scRNA-seq) datasets, and 2) the discovery of noncoding regulatory patterns of the respective states.

Results
Smart-seq-total relies on the ability of Escherichia coli poly(A) polymerase to add adenine tails to the 3 0 prime of RNA molecules. Total polyadenylated RNA is then reverse-transcribed using anchored oligo dT, in the presence of the template switch oligo (TSO) (20) (Fig. 1A). Compared to previous studies that explored similar approaches to construct libraries from total RNAs (21,22), Smart-seq-total utilizes an optimized version of the TSO (19), specifically engineered to be rapidly eliminated from the reaction through enzymatic digestion directly following the reverse transcription. This allows us to remove the "contaminant" constructs, originating from polyA-tailing and mispriming of TSO, which otherwise dominate the resulting Significance Each individual cell transcribes nearly 85% of the genome. However, when it comes to the analysis of cellular RNA, most studies are still only looking at the 3% that correspond to protein-coding transcripts. The role and abundance of the remaining RNAs across individual cells remain largely unknown. In the present study we describe Smart-seq-total, an RNA-sequencing method that delivers a broad picture of the total cellular RNA content. Using Smart-seq-total, we analyzed the content of hundreds of human and mouse cells and showed that the noncoding RNA content of cells significantly differs across cell types and dynamically changes throughout the vital processes of a cell, such as cell cycle and cell differentiation.
sequencing library and render the short RNA transcripts undetectable (SI Appendix, Fig. S1 A and B). Additionally, we employ a CRISPR-mediated removal of overrepresented sequences, which allows us to eliminate the majority of the sequences corresponding to ribosomal RNA from the final library in a singlepool reaction (23) (targeting rRNA regions and other abundant RNA molecules) (SI Appendix, Fig. S1 A and B and Table S1).
When applied to single HEK293T cells, Smart-seq-total identified the full complement of mRNA as well as a broad spectrum of ncRNA genes, such as small nucleolar RNA (snoRNA), small Cajal body-specific RNA (scaRNA), histone RNA, and lncRNA. The majority of these molecules endogenously lack poly(A) tails and thus cannot be captured through a direct polyA-priming employed by Smart-seq2 (1) or other popular scRNA-seq methods (SI Appendix, Fig. S1 C and D). Among other ncRNA detected by Smart-seq-total are transfer RNAs (tRNAs) and mature miRNAs (SI Appendix, Fig. S1 C and D). To facilitate the quantitative comparison across RNA biotypes within a cell, we also implemented a UMI-compatible version of Smart-seq-total (v2) (SI Appendix, Fig. S2 A-C). Like Smart-seq3, Smart-seq-total (v2) does not require a clean-up step after reverse transcription and thus is also potentially subject to random UMI incorporation during cDNA amplification (24). We show, however, that under the current protocol conditions these events are negligible (SI Appendix, Fig. S2 D-F). Both versions of Smart-seq-total are compatible with the Tn5-based fragmentation of cDNA in order to sequence full-length transcript coverage (SI Appendix, Fig. S3).
The sensitivity of Smart-seq-total estimated based on external RNA control consortium (ERCC) capture is comparable to Smart-seq2 (SI Appendix, Fig. S3 E and F) (25). At the same time, Smart-seq-total detects a broader spectrum of RNA types than previous single-cell approaches and furthermore allows the incorporation of UMIs for absolute quantitation into both short and long RNA molecules (SI Appendix, Figs. S1 and S4).
To demonstrate the scalability of the method, we sequenced total RNA from individual human primary dermal fibroblasts (n = 277), HEK293T (n = 245), and MCF7 (n = 90) cells sorted in 384-well plates and processed in one-tenth of the standard Smart-seq2 volume (i.e., cells are sorted in 0.3 μL of lysis buffer and RNA is reverse-transcribed in 1 μL) (Materials and Methods) (19). We sequenced the libraries to obtain one to two million reads per cell, mapped, and counted the reads using the reference that contains both coding and noncoding transcripts (including miRNA, snoRNA, small nuclear RNA [snRNA], tRNA, and so forth). Within all three cell types, we identified a broad spectrum of transcripts, such as mRNA, miRNA, lncRNA, and snoRNA in each profiled cell ( Fig. 1B and SI Appendix, Fig. S5). We found metazoan RNA7SK and RN7SL1, which are involved in regulation of transcription and translation, respectively (annotated as "miscellaneous RNA"-type [mis-cRNA] in the GENCODE database) to be the most abundant in our data comprising together ∼40% of all mapped reads ( Fig. 1B and SI Appendix, Fig. S6). We demonstrate that, if desired, these molecules could also be depleted from the A B D C Fig. 1. Smart-seq-total performance. (A) Schematic comparison of Smart-seq2 and Smart-seq-total pipelines. Following cell lysis, total cellular RNA is polyadenylated, primed with anchored oligodT, and reverse transcribed in a presence of the custom degradable TSO. After reverse transcription, TSO is enzymatically cleaved, single-stranded cDNA is amplified and cleaned up. Amplified cDNA is then either tagmented or directly indexed, pooled, and depleted from ribosomal sequences using DASH (23). The resulting indexed libraries are then pooled and sequenced on Illumina platform. (B) Distribution of mapped reads across RNA types in human primary fibroblasts, HEK293T, and MCF7 cells. Percentage of total reads mapped to each RNA type. miscRNA class is additionally split into RN7SK, RN7SL, and other miscRNA categories. (C) Examples of coding and noncoding marker genes for each cell type. Top exemplary markers per biotype computed among cell types using Wilcoxon rank sum test. RNY1 belongs to miscRNA, SCARNA23 and SCARNA20 to scaRNA, MT-TD to mitochondrial tRNA class. (D) t-SNE plots of three profiled human cell types generated using indicated subset of genes. From top to bottom: protein coding, lncRNA, miRNA, and other small ncRNA (include snoRNA, snRNA, scaRNA, scRNA, and miscRNA). We have excluded histone coding genes from the protein coding (polyA+) set, since a large fraction of these RNAs are known to lack polyA tails (60).
sequencing libraries at the rRNA depletion step using dedicated CRISPR guides (SI Appendix, Fig. S1 A and B and Table S1). Among cell-type-specific transcripts, we found well-characterized marker genes for either fibroblasts (COL1A2, FN1, MEG3), HEK293T (CKB, AMOT, HEY1), or MCF7 cells (KRT8, TFF1) ( Fig. 1C and SI Appendix, Fig. S8A), as well as transcripts that belong to various types of ncRNA, such as microRNA, snoRNA, and lncRNA ( Fig. 1C and SI Appendix, Fig. S8). For example, we found high levels of MIR222 in fibroblasts while we could not detect it in MCF7 cells. We also observed that oncogenic miRNA cluster MIR17HG is specific to HEK293T cells, while not found in either fibroblasts or MCF7 cells. In contrast, MCF7-specific transcripts include lncRNA, such as LINC00052, as well as snoRNA, such as SNORD71 and SNORD104.
Given the observed differences in the levels of ncRNA across profiled cells, we next asked whether ncRNA alone could be used to distinguish cell types. To answer this question, we performed principal component analysis (PCA) followed by the dimensionality reduction through t-distributed stochastic neighbor embedding (t-SNE) on the genes corresponding to one or multiple ncRNA types. Evaluation of the similarity between cells in two-dimensional space revealed that, in addition to lncRNA (26), miRNA taken alone separate the investigated cell types into three distinct clusters. Combining snoRNA, scaRNA, snRNA, and tRNA together allowed us to achieve similar results (Fig. 1D). While the exact function of individual snoRNA and scaRNA remains largely uncharacterized, these RNAs are believed to play a vital role in posttranscriptional and posttranslation control (27)(28)(29). Here we show that their abundance is also celltype specific (SI Appendix, Fig. S8A).
After binning all the cells according to cell-cycle phases (30), we observed that in addition to cell-type-dependent differences in ncRNA, the abundance of certain noncoding transcripts also changed throughout the cell cycle ( Fig. 2A). In agreement with previous bulk studies, which suggested the involvement or miRNA in cell-cycle regulation (31, 32), we found that levels of a subset of miRNAs in a cell dynamically change through the cell cycle, peaking at either S, G2M, or G1 phase ( Fig. 2A). For example, our data showed that the levels of MIR16-2 in fibroblasts are high during the S phase and later gradually decrease during G2M and G1 phases (SI Appendix, Fig. S9). The opposite holds true for MIR222, in both fibroblasts and HEK293T cells, which is more abundant during cell proliferation (G1) and decays during DNA replication (S) and cell division (G2M) phases ( Fig. 2A and SI Appendix, Fig. S10). Among miRNAs more abundant during G2M phase, we identified MIR27A, MIR103A2, and MIR877 (SI Appendix, Figs. S9-S11). In addition to miRNA, a large number of lncRNA, snRNA, scaRNA, snoRNA, and miscRNA were also up-regulated (log 2 fold-change [FC] > 1, adjusted P < 0.01) during the G2M phase ( Fig. 2A and SI Appendix, Figs. S9-S11). Given the active role of these RNA types in splicing and ribosome biogenesis, we suggest that they are produced by the cell in response to a rapid demand for protein synthesis and cell growth during the G2M phase.
To further link the observed ncRNA dynamics with the expression of well-characterized cell-cycle mRNA markers, we searched for coregulated coding and noncoding genes throughout the cell cycle. We identified 24 clusters comprised of coexpressed coding and noncoding genes specific to either one or multiple cell types (Fig. 2B, SI Appendix, Fig. S12, and Dataset S1). Two of these mixed-gene clusters (33 genes up-regulated in the S phase and 53 genes up-regulated in the G2M phase) showed identical patterns in all three profiled cell types. Interestingly, both clusters are marked by landmark cell-cycle genes-such as CCNA2, MCM6, and TOP2A-but also include miRNAs, lncRNAs, and snRNAs previously unknown to follow a distinct expression pattern upon transition between phases.
Histone RNA is another type of mainly nonpolyadenylated RNA, which we observed to be strongly correlated with the cell cycle. Consistent with prior studies (33,34), histone RNA levels sharply rise during the S phase in all three profiled cell types (Fig. 2C). The ability to capture nonpolyadenylated histones also has a strong impact on cell clustering, by introducing a cell-cycle bias. Particularly, histones drive the separation of each cell type into two distinct populations (SI Appendix, Fig. S13A), marked by increased levels of the majority of histone genes during the DNA replication phase (SI Appendix, Fig. S13B).
In addition to being expressed in a cell-cycle-dependent manner, we also identified several histones to be cell-type specific. For example, HIST1H4L is expressed in fibroblasts but absent in HEK293T and MCF7 cells, while HIST1H1B is absent in HEK293T cells while present in the other two cell types (Fig.  2D). Given the importance of histones in establishing and maintaining a distinct chromatin landscape of a cell, we anticipate that the ability to measure corresponding transcripts could be valuable for predicting the epigenetic state of a cell.
In principle, Smart-seq-total is designed to broadly quantify total cellular RNA content. However, we also show that short, less abundant molecules, such as miRNAs (35), can be sizeselected from a UMI-tagged and indexed Smart-seq-total library for further in-depth analysis (Fig. 2E). In the example of 24 HEK293T cells we demonstrate that this size-based enrichment strategy yields results comparable to the state-of-the-art singlecell small RNA-seq method (11) in terms of the number and type of miRNAs detected per cell ( Fig. 2F and SI Appendix, Fig. S14 A-C). Among the most abundant miRNAs in profiled HEK293T cells, we identified multiple members of the three conserved paralog clusters-miR-17/92, miR106a/363, and miR-10b/25 (36)-as well as various members of the let-7 miRNA family (SI Appendix). We noted that levels of mature miRNAs generally correlated better within a cluster rather than across different clusters (SI Appendix, Fig. S14 D-F). However, the abundance of mature forms largely varied across cluster members. Specifically, we found that the levels of two mature miRNAs, miR-92a-3p and miR-25-3p, were several folds higher than those of any other cluster member (Fig. 2G). Selective retention of miR-92a was previously observed at the tissue level in vivo (36) and has been attributed to differential posttranscriptional processing of cluster members (37,38). Our data indicate that the phenomenon of selective miRNA retention can be observed at the single-cell level and that it extends beyond the 17/92 cluster.
While short RNAs are selected from a fully constructed UMIindexed Smart-seq-total library, the remaining total library can be used to quantify mRNA counterpart of the same cells. We applied this strategy to investigate the relationship of the retained miR-92a-3p and miR-25-3p with the rest of the transcriptome at the physiological cell state. Specifically, we generated a matching miRNA-mRNA profile for ∼300 HEK 293T cells by sequencing the complete Smart-seq-total library, as well as its small RNA fraction (SI Appendix). We used the obtained data to perform a pair-wise correlation of miR-92a-3p and miR-25-3p with the rest of the genes across profiled cells. Among the most correlated and anticorrelated genes (Pearson's r <À0.5 or > 0.5, adjusted P < 0.01), we identified several validated targets of the respective miRNAs ( Fig. 2H and SI Appendix, Fig. S15A). Interestingly, we also found that several other genes, which are neither known nor predicted to directly interact with the respective miRNAs, correlate with either miR-92a-3p, miR-25-3p or both ([abs(r)] > 0.5, adjusted P value <0.01) (SI Appendix, coanalysis align with those findings and indicate that miR-92a-3p and miR-25-3p miRNA levels are linked to the expression of a metabolic gene set at the physiological cell state. Finally, we sought to understand whether the unique noncoding signature acquired by different cell types is established during early stages of cell development, and if so, how dynamic it is with respect to cellular transcriptome. To address this question, we referred to an in vitro model of early lineage commitment: the differentiation of pluripotent stem cells into embryoid bodies (EBs) (40). The role of ncRNA in maintaining stem cell pluripotency and lineage commitment has been demonstrated previously through bulk experiments (41,42). Thus, we hypothesized that applying Smart-seq-total to single cells at different stages of EB formation would allow us to identify coexpressed coding and noncoding transcripts within emerging lineages. As such, we analyzed the RNAome of primed pluripotent stem cells and that of individual cells obtained from dissociated EB at days 4, 8, and 12 of culture (∼200 cells per each time point, 913 cells in total) (Fig. 3A). Consistent with previous studies (43), the number of coding genes expressed by pluripotent stem cells was also higher compared to differentiated progenitors (SI Appendix, Fig. S16A). This was also the case for several ncRNA types, such as lncRNA, miRNA, and scaRNA (SI Appendix, Fig. S16B). Specifically, we observed that the levels of certain snoRNAs (such as Snord17, Snora23, and Snord87), scaR-NAs (such as Scarna13 and Scarna6), lncRNAs (Platr3, Lncenc1, Snhg9, Gm31659, and so forth), and miRNAs (Mir92-2, Mir302b, and Mir19b-2) go down (log 2 FC > 1, adjusted P < 0. 01) after cells exit pluripotency (Fig. 3B). In contrast, we also identified that the levels of several lncRNAs (Tug1, Meg3, Lockd) and miRNAs (Mir298, Mir351, Mir370) increase with differentiation (Fig. 3B).
We next used PAGA (47) to infer a developmental trajectory and compute pseudotime coordinates for each cell in our dataset (Materials and Methods and Fig. 3 D and E). Aligning cells in pseudotime within each lineage further confirmed the existence of expression gradient within different RNA types (Fig. 3F). Furthermore, we found that the majority of identified variable noncoding transcripts were germ-layer specific. Examples of such transcripts include Mir2137, Mir320, Gm49024, and Gm38708 in ectoderm; Mir351, Mir370, and Meg3 in mesoderm; as well as Neat1 in endoderm. Mir296 and Mir298 were expressed in both mesoderm and endoderm but were absent in ectoderm (Fig. 3G, SI Appendix, Fig. S18B, and Dataset S3).
Finally, to understand the relationship between mRNA and ncRNA genes, we performed a pairwise correlation analysis of gene expression across all sampled cells. This analysis showed that germ layer-specific miRNAs are correlated with gene sets associated with the proliferation of certain cell lineages (SI Appendix, Fig. S19A and Dataset S4). For example, miR-370 positively correlated with genes involved in the regulation of nervous system development and miR-351 correlated with genes associated with smooth muscle cell migration and osteoblasts differentiation (SI Appendix, Fig. S19). In addition, the expression of ∼50% of identified histone-coding genes correlated with the expression of other protein-coding genes (Spearman's ρ > 0.5) (SI Appendix, Fig. S19). Overall, we found that multiple ncRNAs from all assayed RNA types (e.g., miRNA, snoRNA, snRNA, and so forth) are either positively or negatively correlated with the expression of protein-coding genes (SI Appendix, Fig. S19). Most of these ncRNAs represent putative uncharacterized regulators of lineage commitment.

Discussion
Altogether, Smart-seq-total enables an unbiased exploration of a broad spectrum of coding RNA and ncRNA transcripts in individual cells. Current limitations of Smart-seq-total are: 1) the inability to assay circRNA and 2) the loss of the endogenous polyadenylation status of transcripts. Further modifications to Smartseq-total can include the selection for a specific transcript length (short vs. long) and depletion of a wider range of overrepresented RNA. We anticipate that Smart-seq-total will facilitate the identification of noncoding regulatory patterns and their functional roles in regulating cellular functions and shaping cellular identity. This could also shift the current protein-centered view of gene regulation toward comprehensive maps featuring both protein and RNA regulators.
Cells were stained with calcein-AM and ethidium homodimer-1 (LIVE/ DEAD Viability/Cytotoxicity Kit, ThermoFisher, L3224) following the manufacturer's protocol and individual live cells were sorted in 384-well lysis plates using SONY sorter (SH800S) with a 100-μm nozzle chip. Plates were spun down and stored at À80°C immediately after sorting.
To perform the library preparation in 96-well plates, we followed the above-described protocol, except that all volumes were scaled up 10 times.
To generate libraries through tagmentation, we followed previously described procedure (49).
Library Pooling, Ribosomal Sequence Digestion, and Sequencing. After library preparation, wells of each library plate were pooled using a Mosquito liquid handler (TTP Labtech). If pooling tagmented and nontagmented libraries, the two types of libraries were mixed in ∼1:1 molar ratio. Pooling was followed by a purification with 0.8× AMPure beads (Fisher, A63881). Ribosomal reads were digested using DASH (depletion of abundant sequences by hybridization), as described previously (23) (see SI Appendix, Supplementary Protocol for details). Briefly, the guides designed to target 45S rRNA and other abundant sequences (SI Appendix, Table S1) were combined with tracer RNA and assembled with Cas9 protein in 2:1 ratio. The assembled complexes were incubated with the sequencing library in 1× Cas9 buffer (SI Appendix, Table S1) for 1 h at 37°C. Following rRNA sequence digestion, Cas9 was inactivated through incubation with proteinase K for 15 min at 50°C. The library was then purified twice, first using 1.2× AMPure beads to DNA ratio. then using 2% Pippin Prep gels (Sage Sciences) in the 210-to 600-bp range.
Data Processing. Sequences from the NovaSeq were de-multiplexed using bcl2fastq v2. 19.0.316. The analysis of Smart-seq-total v1 and Smart-seq-total v2 data were carried out as described in GitHub Smart-seq-total page (https:// github.com/aisakova/smart-seq-total/blob/master/README.md). Briefly, for Smart-seq-total v1, reads were trimmed from polyA tails using cutadapt v1.18 with the following parameters: -m 18 -j 4 -a AAAAAAAAAA -a TTTTTTTTTT. Reads were then aligned to the human (GRCh38) or mouse (GRCm38) genomes using STAR v2.7.0d (50) with the following parameters: -outFilterMismatch NoverLmax 0.05 -outFilterMatchNmin 18 -outFilterMatchNminOverLread 0 -outFilterScoreMinOverLread 0 -outMultimapperOrder Random. Reads mapping to multiple locations were assigned either to a location with the best mapping score or, in the case of equal multimapping score, to the genomic location randomly chosen as "primary." Transcripts were counted using featureCounts v1.6.1 (51) with the following parameters: -M -primary -s 1. GENCODE v32 and GENCODE M23 (52) annotations were used for human and mouse reads, respectively. tRNA was quantified using high-confidence gene set obtained from GtRNA (53). To account for multimappers, "primary" alignment reported by STAR was counted. For miRNA and tRNA, all reads mapping to arms or the stem loop were summed to quantify the expression at the gene level.
Comparison of Smart-seq2 and Smart-seq-total. HEK293T cells were sorted in 96-well plates containing 3 μL of lysis buffer (as described above). The reaction volumes for Smart-seq-total were scaled up 10 times compared to 384-plate format (i.e., RNA from each cell was polyadenylated in 5 μL, reverse-transcribed in 15 μL, and cDNA was preamplified in 15 μL total volume). We retrieved Smart-seq2 data from Picelli et al. (19) (GSE49321). Smart-seq2 and Smartseq-total reads were mapped using STAR and counted using featureCounts, as described above. Comparisons between protocols in SI Appendix, Fig. S1C were generated on depth-normalized libraries, using 2.5 million randomly selected reads per adaptor-trimmed library (or all reads for libraries that had less than 2.5 million reads) to compute expression levels (counts per million, cpm).

Unsupervised Clustering and Dimensionality Reduction Analysis of Human Cell
Types. Standard procedures for filtering, variable gene selection, dimensionality reduction, and clustering were performed using the Seurat package v3.1.4 (54). Cells with fewer than 2,000 detected genes and those with more than two Mio reads were excluded from the analysis. Counts were log-normalized for each cell using the natural logarithm of 1 + cpm. Variable genes were selected based on overdispersion analysis and projected onto a low-dimensional subspace using PCA. The number of PCs was selected on the basis of inspection of the plot of variance explained. Cells were visualized using a two-dimensional t-SNE of the PC-projected data. Dimensionality reduction parameters for t-SNE (resolution and number of PCs) were adjusted on a per cell type and per biotype basis and can be viewed in the Rmd files available on GitHub. Cells were assigned a cell cycle score using Seurat's CellCycleScoring() function using cell cycle markers described in Tirosh et al. (30).
Clustering of Coding and Noncoding Genes. Clusters of coding and noncoding genes shown in Fig. 2B were computed and visualized using the DEGreport R package (55). The top 250 marker genes for each cell-cycle phase and all noncoding genes with average expression ln(cpm+1) > 0.05 in at least one phase were used for this analysis. Gene-expression values were normalized using variance stabilizing transformation (56) before clustering. Further details of the analysis can be viewed in the Rmd files available on GitHub.

Isakova et al.
Single-cell quantification of a broad RNA spectrum reveals unique noncoding patterns associated with cell types and states PNAS j 7 of 9 https://doi.org/10.1073/pnas.2113568118 Correlation between Mature miRNAs and mRNA. Total RNA fraction and a corresponding small RNA fraction of a Smart-seq-total v2 library were sequenced separately on NextSeq 500 sequencer. Sequencing data from both libraries was processed as described above (also see https://github.com/ aisakova/smart-seq-total/blob/master/README.md). Small RNA and total RNA count tables were normalized and log-transformed separately. Pearson coefficients were computed for all pair-wise correlations between miRNAs and all other genes detected in ∼300 HEK293T cells [present in at least 30 cells, ln(cpm+1) > 1]. Predicted and validated targets were annotated using multiMiR pipeline (scanning predictions from various databases, such as TargetScan, mirTarBase, miRanda, and so forth) (57). Predicted targets were limited to those listed in two or more external databases. Validated targets with "weak" evidence were excluded from the analysis [(abs(r)) >0.5, adjusted P value <0.01] (SI Appendix, Fig. S15A). GO was performed on ∼100 to 200 top correlated (r > 0.3, adjusted P < 0.01) or ∼100 anticorrelated (r < À0.3, adjusted P < 0.01) genes. The 10 GO terms with lowest P value in Kolmogorov-Smirnov test were used (SI Appendix, Fig. S15B) Preprocessing and Clustering of mESCs. Standard procedures for filtering, variable gene selection, dimensionality reduction, and clustering were performed using the Seurat package v3. 1.4 (54). Cells with fewer than 1,000 detected genes and those with more than two Mio reads were excluded from the analysis. Counts were log-normalized for each cell using log 1 p(counts) and 1e4 scale factor. Variable genes were projected onto a lowdimensional subspace using PC analysis. The number of PCs was selected on the basis of inspection of the variance explained plot. A shared nearestneighbor graph was constructed on the basis of the Euclidean distance in the low-dimensional subspace spanned by the top PCs. Cells were visualized using the uniform manifold approximation and projection (UMAP) algorithm (58) of the PC-projected data. Clusters were annotated based on the expression of known marker genes corresponding to one of the three germ layers. Cells were assigned a cell-cycle score using Seurat's CellCycleScoring() function and cell-cycle markers described in Tirosh et al. (30) Developmental Trajectory Inference of EB Differentiation. Developmental trajectory of mESC differentiation was inferred using PAGA through dynoverse wrapper (59). Pseudotime coordinates computed from the trajectory were appended to Seurat object and further used to generate Fig. 3 C-F.
Correlation between Coding and Noncoding RNA Levels. Spearman coefficients were computed for all pair-wise correlations of expressed genes [average ln(cpm+1) > 2 across all cells]. The resulting matrix was subset to only mRNA:ncRNA correlations. Pairs with Spearman's ρ > 0.5 were used to generate a chord diagram shown in SI Appendix, Fig. S17A and pairs with Spearman's ρ < À0.5 were used to generate a chord diagram shown in SI Appendix, Fig. S17B.