LinRace: single cell lineage reconstruction using paired lineage barcode and gene expression data

Understanding how single cells divide and differentiate into different cell types in developed organs is one of the major tasks of developmental and stem cell biology. Recently, lineage tracing technology using CRISPR/Cas9 genome editing have enabled simultaneous readouts of gene expressions and lineage barcodes in single cells, which allows for the reconstruction of the cell division tree, and even the detection of cell types and differentiation trajectories at the whole organism level. While most state-of-the-art methods for lineage reconstruction utilize only the lineage barcode data, methods that incorporate gene expression data are emerging, aiming to improve the accuracy of lineage reconstruction. However, effectively incorporating the gene expression data requires a reasonable model on how gene expression data changes along generations of divisions. Here, we present LinRace (Lineage Reconstruction with asymmetric cell division model), a method that integrates the lineage barcode and gene expression data using the asymmetric cell division model and infers cell lineage under a framework combining Neighbor Joining and maximum-likelihood heuristics. On both simulated and real data, LinRace outputs more accurate cell division trees than existing methods for lineage reconstruction. Moreover, LinRace can output the cell states (cell types) of ancestral cells, which is rarely performed with existing lineage reconstruction methods. The information on ancestral cells can be used to analyze how a progenitor cell generates a large population of cells with various functionalities. LinRace is available at: https://github.com/ZhangLabGT/LinRace.


26
Understanding how cells divide and differentiate into various cell types is a fundamental problem in de-27 velopmental biology. Lineage tracing technology which traces cell divisions using a "recorder" is the most 28 widely used technique to study the developmental histories of cells, while traditional lineage tracing tech-29 nologies can only work with a limited number of cells with low resolution [16]. Recently, sequencing-based 30 lineage tracing methods (e.g. using CRISPR/Cas9 genome editing) have enabled the simultaneous recording 31 of the clonal relationships of single cells alongside the transcriptomes [34] for up to thousands of cells. Such 32 methods utilize lineage recorders, which are exogenous DNA sequences integrated into the genome. Even 33 though there are different ways of designing the lineage recorder [1,2,3,13,20,25,31,32], the common idea is 34 to introduce changes at the target sites (the location to which the Cas9 protein binds to induce mutations) 35 on the lineage recorder which accumulates through generations of cell divisions. Finally, the recorders are 36 sequenced together with the transcriptome of every single cell, resulting in the "lineage barcode" data. Each 37 target site corresponds to a character in the barcode. The barcode data are used to reconstruct the cell 38 division tree, which is also called the cell lineage tree in this paper. The reconstructed cell division tree can 39 shed light on the developmental process that can not be directly measured. represented by the mutation rate parameter, may not be optimal for reconstructing the cell division tree. 51 Is it shown that the distribution of mutations across the barcode is not uniform, but rather biased towards 52 certain target sites [29,24]. The mutation rate, which represents the efficiency of mutations being induced in 53 the barcode, has a major impact on the potential to successfully inferring the cell division lineages from the 54 data. However, current experimental technologies do not guarantee that the mutations occur at a rate that 55 allows the tracing of every cell division event. using barcode data [9,24,29]. Moreover, due to the short barcode length and dropouts, a number of cells can 65 have the same barcode, and the reconstructed lineage trees tend to have low depth (maximum path length 66 from the root to a leaf node) and few internal nodes (with some nodes having large degrees) even using a per-67 fect method. More recently, methods that combine lineage barcode and gene expression data are proposed, 68 aiming to further improve the accuracy of cell lineage reconstruction. LinTIMaT [37] develops a combined 69 likelihood function and uses a local search framework to search for the tree with the maximum likelihood. 70 Integrating paired gene expression to infer the cell lineage tree can potentially refine the reconstruction, 71 however, despite that the paired data should theoretically provide more information than barcodes alone, 72 LinTIMaT did not beat methods that use only the lineage barcode data according to previous comparisons 73 on synthetic datasets [24].

75
The key to combining the lineage barcode and gene expression data is to model the relationship between 76 the two types of data, i.e., how the gene expression of cells changes along with the barcode data during cell simple assumption is to assume that cells that have similar transcriptomes should locate close in the cell 79 division tree, i.e., they should have similar barcodes. This is the assumption used in LinTIMaT. However, in 80 a few recent papers that present paired single-cell lineage barcode and gene expression data, it is observed 81 that, in a tree reconstructed using the barcode data, although a proportion of cells with the same cell state 82 located in the same subtree, it is remarkable that some cells of the same cell state are located in different 83 subtrees, and the same subtree can have multiple cell types [3,25]. We call this phenomenon the "partial 84 consistency between transcriptome similarity and barcode similarity".

86
The asymmetric cell division model has been shown to be able to account for the "partial consistency 87 between transcriptome similarity and barcode similarity" [24]. It is commonly considered that cells can divide 88 in a symmetric or asymmetric manner [15,21]. A symmetric cell division gives rise to daughter cells with the 89 same cell state as the parent cells. During an asymmetric cell division, one daughter cell keeps the parent's 90 cell state, and the other one differentiates into a future cell state according to the cell state tree. For a given 91 cell, there is a probability with which it divides asymmetrically, termed as the asymmetric division rate, 92 denoted by p a . It has been shown that this probabilistic asymmetric cell division model leads to realistic 93 paired single-cell lineage barcode and gene expression data.

95
To address these problems, we present LinRace, a new method that combines the lineage barcode and gene 96 expression data to infer cell division trees, based on a joint Neighbor Joining (NJ) and maximum-likelihood 97 framework. The asymmetric cell division model is used in LinRace to infer the states of ancestral cells and 98 to calculate the likelihood function, thus incorporating the relationship between lineage barcode and gene 99 expression data in a realistic way. On both simulated and real datasets, LinRace consistently outperforms 100 the state-of-the-art methods according to multiple metrics. We show that the use of gene expression data in 101 LinRace helps to improve the lineage tree reconstruction accuracy compared to methods that use only the 102 lineage barcode data (Cassiopeia and DCLEAR). We also show that LinRace achieves better performances 103 than the existing method that also uses gene expression data (LinTIMaT) while improving computational 104 efficiency. We also propose a new metric, reconstruction potential, to analyze how LinRace performs compared 105 to a theoretically "perfect" algorithm given the same input data. Moreover, we also demonstrate that when 106 applied to large-scale real datasets, LinRace uncovers more detailed local lineage structures compared to 107 LinTIMaT, as well as ancestral state information and state-lineage relationships that are consistent with 108 observations from real data. First, from the barcode data of all the cells, we extract the unique barcodes and build a tree of these 123 barcodes ( Fig. 1 Step A) using the neighbor joining [28] tree reconstruction method. We then obtain a 124 tree where each leaf node represents a unique barcode and can correspond to multiple cells with the same 125 barcode. We denote this tree by T 0 . Then we use the single-cell gene expression data to refine the tree T 0 . 126 We consider that there exists an underlying cell state transition mechanism that can be represented by a cell 127 state tree, following which different cell types emerge during the cell division processes. This cell state tree 128 can be inferred from the single cell gene expression data using trajectory inference methods [27,33,35] ( Fig. 1   129 Step B). The cell state tree along with the asymmetric cell division model together model the relationship one unique barcode and potentially many cells, we use a novel likelihood function (Methods) to find the 134 best bifurcating tree (which means every internal node has exactly two children nodes) that maximizes this 135 likelihood ( Fig. 1 Step C). This likelihood is based on the asymmetric cell division model and the cell state 136 tree. Finally, we attach the subtrees inferred in Step C to the lineage tree inferred from the barcode data 137 (T 0 ) to get the complete lineage of single cells ( Fig. 1 Step D). We first test LinRace's tree reconstruction performance against baseline methods using simulated data. We   Other parameters used for LinRace are specified in Methods.

166
As the states of cells and the cell state tree inferred in Fig. 1 Step B are used to infer the states of 167 ancestral cells and calculate the likelihood ( Fig. 1 Step C), the accuracy of the inferred cell states and cell 168 state tree can affect the lineage tree reconstruction accuracy of LinRace. With the ground truth cell states 169 and cell state tree provided by TedSim, we can investigate the effect of cell state and cell state tree inference 170 on the final performance of LinRace. To do this, we run LinRace in two modes: (1) using the ground truth 171 cell states and the cell state tree, this mode is denoted as LinRace-TST (True State Tree); (2) using inferred 172 cell states and the cell state tree, and this mode is denoted as LinRace-IST (Inferred State Tree). We use 173 Slingshot [33] to infer cell states and the cell state tree from the gene expression data. As Slingshot does not 174 infer the direction of trajectories, it is a common practice to assign a root cell so that we obtain a directed 175 trajectory. We randomly selected a cell from the root cell state in the true cell state tree to provide this cell 176 to Slingshot as the root cell. To evaluate the quality of reconstructed lineage trees, we use the RF (Robinson-Foulds) distance [26] 179 and Nye Similarity score [22] to quantify the distance or similarity between the reconstructed lineage tree 180 and the true lineage tree (Methods). For RF distance, lower is better, and for the Nye Similarity, higher is   When the barcode data quality is good, that is, with no dropouts and with optimal mutation rate (from 210 0.1 to 0.2), DCLEAR-kmer, which uses only barcode data, achieves comparable or slightly better ( Fig. 2a) 211 performance as LinRace. However, we emphasize that current lineage barcode data is far from this quality, 212 thus incorporating the gene expression data in an appropriate way is necessary. We also observed similar   Fig. 2b, we can see that dropouts not only affect the absolute RF distance 225 values for LinRace but also the relative values between reconstruction potential and RF distance of LinRace.

226
LinRace is able to get close to the reconstruction potential when the barcode data is good (µ = 0.05 − 0.15 227 without dropouts), which means that the number of correctly inferred splits of LinRace is on par with the 228 number of detectable splits on the true tree. The RF distance of LinRace can sometimes (µ = 0.1 without 229 dropouts) exceed the reconstruction potential, which can be potentially caused by local structures inferred 230 from the gene expression data.   LinRace runs more local searches on larger subtrees, which will increase its running time.    We use a zebrafish brain dataset from scGESTAULT[25] and reconstruct the cell lineage using LinRace 300 and other baseline methods ( Fig. 4a and Supplementary Fig. 7). We are not able to provide quantitative 301 accuracy of the reconstructed trees as there does not exist a ground true lineage tree. Instead, we compare 302 the resolutions of the trees using the depth and number of internal nodes of the reconstructed trees. From 303 Supplementary Fig. 7a, we see that the LinRace reconstructed lineage tree has more depth and inferred 304 internal nodes than the Cassiopeia tree and LinTIMaT tree. For DCLEAR which also infers a bifurcating 305 tree, the local splits between cells with the same barcode is randomly decided which does not provide any 306 biological insights, and its number of internal nodes is almost the same as that of LinRace but its depth is 307 much higher means that the DCLEAR tree is much less balanced than the LinRace Tree (Supplementary 308 Fig. 7).  Fig. 4b). We can also observe the case where the progenitor cell divides symmetrically to keep the sustainable 316 activity of neurogenesis (Subtree 1 in Fig. 4b).  Fig. 4b, respectively).  2.5 LinRace helps to answer the sources of diverse cell types in the mouse embryo data 333 We applied LinRace to an early mouse embryo dataset from Chan et al [3] to infer the cell division tree.

334
The mouse embryo dataset contains 9707 cells of 34 annotated cell types from the authors (Fig. 5a). On this 335 dataset, we not only analyze how the cell types are distributed over the cell division tree but also ask if any 336 lineage signature exists in a cell's gene expression profile.

338
We cut the reconstructed tree at depth 26 and obtained subtrees with their roots at a distance of 26 to 339 the root of the complete tree (Fig. 5b). First, we investigate the cell type composition of the subtrees. The 340 cell type composition of 7 subtrees with at least 10 leaves is shown in (Fig. 5c). Each subtree consists of 341 multiple cell types (colors are consistent with the color legend of Fig. 5a), and the same cell type appears in 342 different subtrees. 343 Next, we take the cells that belong to the same cell type (Fore/Midbrain) and look for differences that 345 potentially exist between cells from different lineages but with the same cell type. We visualize the Fore/Mid-346 brain cells labeled by their clonal IDs (same as the subtree IDs, Fig. 5d). In the UMAP space, cells from  In this paper, we present LinRace, an integrated method that combines the lineage barcode and gene ex-      In LinRace, we utilize a Hamming distance-based NJ method to infer the "backbone lineage tree". Since and then Neighbor Joining is applied to the K × M matrix to get the lineage tree of unique barcodes (Fig. 1). 425 We call this tree the reconstructed lineage backbone, or tree backbone (denoted by T 0 ). To calculate the likelihood of a candidate tree based on the gene expression profiles of the leaves, we first 428 need to infer the states of cells at ancestral nodes using the states of leaf nodes and the cell state tree 429 that are inferred during Fig. 1 Step B. The inference of ancestral cell state follows the rule of asymmetric  Fig. 1 Step C follow these two rules given the cell state tree learned from Step B. This ancestral state 435 inference process allows one cell to divide into two cells with different cell states. If two daughter cell states 436 belong to the same differentiation path (a path from the root state to a leaf state), the parent cell will be 437 the same cell state as the earlier cell state between the two. Cells with identical barcode form a star tree in the initial NJ tree T 0 (See Fig. 1). LinRace utilizes the gene 440 expression data to learn the bifurcating tree topology of these cells. The learned bifurcating trees, termed 441 Gene Expression Subtrees (GES), are then attached to the tree backbone to yield the full cell lineage tree. 442 We design a maximum likelihood scoring function and local search strategy to find the GES.
where e = (e 1 , e 2 ) represents an edge on the tree graph from node e 1 to e 2 , S denotes the cell state assign-455 ments of all cells, and S e1 and S e2 denotes the cell states of the two cells e 1 and e 2 connected by e ∈ E.

457
The state transitions of cells' gene expressions are governed by an underlying developmental cell state 458 tree, which is inferred from the gene expression data using Slingshot ( Fig. 1 Step B). The cell state tree 459 guides the cells to differentiate into certain future cell states irreversibly. For any two states on the cell state 460 tree, there can exist at most one path(a sequence of connected, directed edges) that links these two states.

461
Therefore, when the states of a pair of ancestor-descendant, denoted as (S e1 , S e2 ) between two cells (e 1 , e 2 ), 462 the transfer probability P (S e2 |S e1 ) is calculated as follows: where D(S e1 , S e2 ) represents the graph geodesic from S e1 to S e2 on the cell state tree. If D(S e1 , S e2 ) = + inf, 464 a penalty of −50 is applied to the log likelihood. The probability for every distinct state transition is given 465 as follows: The asymmetric division likelihood considers the asymmetric divisions in the lineage tree and it is defined ). Then, the 480 transition probability between any two cells can be calculated as follows: and the neighbor distance likelihood can be calculated as: Finally, the total likelihood is calculated as: where λ 1 and λ 2 are hyperparameters.

485
To find the best GES based on our likelihood function, we utilize hill-climbing local search to search 486 in the tree space ( Fig. 1 Step C). In order to propose a new tree from a current tree, we adopted random to the new one. Repeat the process for every iteration until no better tree can be found within a number 496 of iterations, and the current tree is identified as a local optimal tree. We use a random restart technique 497 after a local optimal tree is found, and try to find as many local optimal as possible within the number of   Therefore, we also introduce another metric, Nye Similarity score [22], which can be regarded as an 544 extended version of RF distance. Nye Similarity also compares each split of edge on the two trees, but 545 instead of giving a binary score(0 or 1 for RF distance), it calculates a score based on the similarity between 546 the two splits. Considering each edge e in a tree T , the score s(e i , e j ) is for any pair of edges (e i , e j ), e i ∈ T 1 547 and e j ∈ T 2 , is obtained by the partition of the leaf nodes. Considering the set of all leaf nodes L, we have: where P e il , P eir are the two disjoint subsets of L by the edge e i , and P e jl , P ejr are the two disjoint subsets 549 by the edge e j . Then, for r, s = l, r, the number of leaf nodes shared by the partition can be given as, 550 a rs = |P eir ∩ P ejs | |P eir ∪ P ejs | For a pair of splits (e i , e j ), the score s(e i , e j ) is then defined by 551 s(i, j) = max{min{a ll , a rr }, min{a rl , a lr }} Finally, the Nye Similarity score is derived by maximizing the quantity: where f (e i ) is an alignment of edges for the two trees. The quantity is maximized by finding the best 553 alignment of edges f . Here is a summary Major parameters and their settings for LinRace are:

560
-Number of genes kept after filtering: we keep the top 100 highly variable genes for simulated data and ...

561
-Maximum number of iterations for each local search is 500 in all results.

562
-Asymmetric division rate p a is set to 0.8 in all tests

563
Running LinRace-IST also involves the steps of identifying cell states and inferring the cell state tree. 564 k-means clustering method is used on the scRNA-seq data to identify cell states and Slingshot is applied to 565 infer the cell state tree, based on the identified cell states. When using simulated data, the number of clusters 566 for k-means is set to 7. The same number of clusters is used for the C. elegans data. For the scGESTAULT 567 dataset (ZF1-F3), the number of clusters for k-means is set to 20. For the mouse embryo data, we are using 568 annotated cluster ids from the paper.  For DCLEAR, we used the "kmer" mode and set the k−mer length to 2 for all datasets, which is the 577 default value for this parameter. We also used default values of other parameters for DCLEAR.

579
For Cassiopeia, Cassiopeia-greedy and Cassiopeia-hybrid is used for 1024 cells. For the Cassiopeia-hybrid, 580 we set the convergence time limit for the ILP solver to be 1000, the maximum potential graph layer size to 581 be 500, and set a cell cutoff between the top greedy solver and the bottom ILP solver to be 20. For the pa-582 rameter indel priors (which represents the probabilities of particular indels occurring), we have left it empty.  reconstruction potential Q r is given as follows:  Fig. 1. Overview of LinRace.
Step A For barcode data, we extract the unique barcodes, and then perform Neighbor Joining to obtain the tree backbone, where each leaf represent a unique barcode shared by some cells.
Step B For the gene expression data, we use kmeans on PCA reduced dimensions to infer the cell states, and then use Slingshot[33] to infer the state trajectories which is then used to infer the ancestral states.
Step C For each group of cells of the same barcode, we use a maximum likelihood + local search framework to find the subtree topologies of the cells.
Step D The final output tree is obtained by combining the subtrees at their specific leaves on the tree backbone. Nodes in the trees that are illustrated by square represent cell states (or cell types), and those illustrated using circles represent cells, and the color of each circle node represent the state of the cell.

Fig. 2.
Benchmarking results for lineage reconstruction methods on TedSim simulated datasets. a Comparisons of LinRace (LinRace-IST and LinRace-TST) and other methods on TedSim simulated datasets using RF distance and Nye similarity. Both RF distance and Nye similarity have the range of [0, 1]. For both RF distance, lower is better, and for Nye similarity, higher values indicate the better performance. The detailed descriptions for simulation and method settings can be found in Methods. b Comparison of the reconstruction potential and LinRace-TST performances.
The data used to calculate the reconstruction potential and the RF distance of LinRace-TST reconstructed trees are the same data from a. Fig. 3. Results for tree reconstruction methods on real C.elegans datasets. a The exact true lineage of C.elegans at L1 larvae stage. There are 363 profiled cells on the tree while the original L1 larvae lineage has 668 cells. The tree pruning is done using the ape package in R and the clonal relations between cells are preserved. The tree is visualized using iTOL [18]. b 2-D PCA visualization of the gene expression of the C.elegans data. After PCA, the first 20 PCs are used for kmeans clustering and Slingshot to determine cell states and state trajectories (cell state tree). Detailed description of the processing steps for the gene expression data can be found in the Supplementary Note 2. c Benchmarking result on the C.elegans dataset with simulated barcodes. The methods are tested for varying mutation rate and dropout effect. Two metrics, RF distance and Nye similarities are used for the benchmark. .a LinRace reconstructed tree. The outer ring represent major cell type assignments and the inner colors on edges show detailed intermediate cell type assignments. The statistics of the reconstructed trees for different methods are also shown, and the max depth means the maximum total edge length to go from the root to any leaf cell. b A detailed look of two GES in LinRace reconstructed tree with inferred ancestral states. In the left GES, Subtree 1 shows a subtree of progenitor's selfrenewal and subtree 2 shows a subtree of differentiated Forebrain population. Black dashed circles denote asymmetric divisions of progenitor cells. The annotated leaf cell states are from the original data paper, and the ancestral states of all hidden nodes are inferred using LinRace. Detailed description of the processing steps for the gene expression data are in the Supplementary Note 2. Only subtrees have more than 20 leaves are labelled and cells that do not belong to these subtrees are colored black and labelled as "0".