Alignment of spatial transcriptomics data using diffeomorphic metric mapping

Spatial transcriptomics (ST) technologies enable high-throughput gene expression characterization within thin tissue sections. However, comparing spatial observations across sections, samples, and technologies remains challenging. To address this challenge, we developed STalign to align ST datasets in a manner that accounts for partially matched tissue sections and other local non-linear distortions using diffeomorphic metric mapping. We apply STalign to align ST datasets within and across technologies as well as to align ST datasets to a 3D common coordinate framework. We show that STalign achieves high gene expression and cell-type correspondence across matched spatial locations that is significantly improved over landmark-based affine alignments. Applying STalign to align ST datasets of the mouse brain to the 3D common coordinate framework from the Allen Brain Atlas, we highlight how STalign can be used to lift over brain region annotations and enable the interrogation of compositional heterogeneity across anatomical structures. STalign is available as an open-source Python toolkit at https://github.com/JEFworks-Lab/STalign and as supplementary software with additional documentation and tutorials available at https://jef.works/STalign.

time and A acts on φ₁(x) through matrix–vector multiplication in homogeneous coordinates. The optimal (v, A) is computed by minimizing an objective function that is the sum of a regularization term, R(v), and a matching term, M(Aφ₁ · I, J). The relative weights of the regularization term and matching term can be tuned with σ_R and σ_M. The regularization term controls spatial smoothness. In this term, we optimize over vₜ, t ∈ [0, 1], noting that if v is constrained to be a smooth function, the φ₁ constructed from v is guaranteed to be diffeomorphic. The matching term incorporates a Gaussian mixture model (GMM) to estimate matching, background, and artifact components of the image to account for missing tissue, such as from partial tissue matches or tears. Additionally, the matching term contains an image contrast function f to account for differences due to variations in cell density and/or imaging modalities. To solve for all parameters in each term, steepest gradient descent is performed over a user-specified number of epochs.

STalign enables alignment of single-cell resolution ST datasets within technologies

As a proof of concept, we first applied STalign to align two single-cell resolution ST datasets from the same technology. Specifically, we aligned, in a pairwise manner at matched locations, ST data from 9 full coronal slices of the adult mouse brain representing 3 biological replicates spanning 3 different locations with respect to bregma assayed by MERFISH (Methods). Inherent local spatial dissimilarities between slices, due to biological variability and further exacerbated by technical variation as well as tears and distortions sustained in the data acquisition process, often render affine transformations such as rotations and translations insufficient for alignment.
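Schematically, the optimization introduced above combines the two terms as follows (the notation here is a sketch consistent with the LDDMM literature, not the exact form from the Online Methods; the GMM weighting in particular is simplified):

```latex
E(v, A) \;=\; \underbrace{\frac{1}{2\sigma_R^2}\int_0^1 \lVert L v_t \rVert^2 \, dt}_{\text{regularization } R(v)}
\;+\; \underbrace{\frac{1}{2\sigma_M^2}\sum_i w_i \,\bigl( f\bigl((A\varphi_1 \cdot I)(x_i)\bigr) - J(x_i) \bigr)^2}_{\text{matching } M}
```

where L is a differential operator penalizing rough velocity fields, f is the image contrast function, and the weights w_i are the GMM posterior probabilities that pixel x_i belongs to the matching (rather than background or artifact) component.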
To evaluate the performance of STalign, we first evaluated the spatial proximity of manually identified structural landmarks between the source and target ST datasets, expecting the landmarks to be closer together after alignment. We manually placed 12 to 13 landmarks that could be reproducibly identified (Supp Fig 1, Supp Table 1). To establish a supervised affine transformation for comparison with STalign, we solved for the affine transformation that minimized the error between these landmarks using least squares. We then compared the positions of the corresponding landmarks after both the supervised affine alignment and the STalign alignment using root-mean-square error (RMSE). When the supervised affine transformations were used for alignment, the RMSE was 202 ± 17.1 µm, 170 ± 3.47 µm, and 266 ± 6.65 µm for biological replicates of each slice location, respectively. When STalign, based on an LDDMM transformation model, was used for alignment, the RMSE was 113 ± 10.5 µm, 169 ± 4.53 µm, and 175 ± 5.47 µm for biological replicates of each slice location, respectively. STalign was thus able to consistently reduce the RMSE between landmarks after alignment compared to an affine transformation, suggestive of higher alignment accuracy.

Given the ambiguity of where landmarks may be manually reproducibly placed and their inability to evaluate alignment performance for the entire ST dataset, we next took advantage of the available gene expression measurements to further evaluate the performance of STalign. Because of the highly prototypic spatial organization of the brain, we expect high gene expression correspondence across matched spatial locations after alignment. We focused our evaluation on one pair of ST datasets of coronal slices from matched locations (Methods).
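The supervised affine baseline described above can be sketched in a few lines of numpy. The function names here are illustrative, not part of the STalign API, but the least-squares fit in homogeneous coordinates is the standard construction:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine map taking src landmarks onto dst landmarks.

    src, dst: (N, 2) arrays of corresponding landmark coordinates.
    Solves dst ~ [src, 1] @ P for the (3, 2) parameter matrix P
    (2x2 linear part stacked on a translation row).
    """
    n = src.shape[0]
    H = np.hstack([src, np.ones((n, 1))])      # homogeneous coordinates
    P, *_ = np.linalg.lstsq(H, dst, rcond=None)
    return P

def apply_affine(P, pts):
    """Apply a fitted (3, 2) affine parameter matrix to (N, 2) points."""
    return np.hstack([pts, np.ones((pts.shape[0], 1))]) @ P

def rmse(a, b):
    """Root-mean-square distance between corresponding points."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))
```

Applying `fit_affine` to the source landmarks and then comparing `rmse(apply_affine(P, src), dst)` against the RMSE of the STalign-transformed landmarks reproduces the kind of per-slice comparison reported above.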
[bioRxiv preprint, this version posted August 19, 2023; doi: https://doi.org/10.1101/2023.04.11.534630; made available under a CC-BY 4.0 International license (not certified by peer review)]

We visually confirm that alignment results in a high degree of spatial gene expression correspondence (Fig 2a, Supp Fig 2a). To further quantify this spatial gene expression correspondence, we evaluated the gene expression magnitudes at matched spatial locations across the aligned ST datasets. Specifically, we aggregated cells into pixels on a 200 µm grid to accommodate the differing numbers of cells across slices and then quantified gene expression magnitude correspondence at spatially matched 200 µm pixels using cosine similarity (Fig 2b-c, Supp Fig 2b). For a good alignment, we would expect a high cosine similarity approaching 1, particularly for spatially patterned genes. To identify such spatially patterned genes, we applied MERINGUE⁸ to identify 457 genes with highly significant spatial autocorrelation (Methods). For these genes, we observe a high spatial correspondence after alignment, as captured by the high median cosine similarity of 0.73. In contrast, for the remaining 192 non-spatially patterned genes, we visually confirm as well as quantify the general lack of spatial correspondence (Fig 2d-f, Supp Fig 3a-b). We note that these non-spatially patterned genes are enriched in negative control blanks (57%), which do not encode any specific gene but instead represent noise, such that we would not expect spatial correspondence even after alignment. Further, we observe a low median cosine similarity of 0.21 across non-spatially patterned genes that is significantly lower than for spatially patterned genes (Wilcoxon rank-sum test p-value < 2.2e-16).
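The pixel-based evaluation described above can be sketched as follows. This is a minimal version assuming a square 200 µm grid anchored at a shared origin; the function names are illustrative rather than part of the STalign API:

```python
import numpy as np

def grid_expression(pos, counts, grid=200.0, origin=(0.0, 0.0)):
    """Aggregate per-cell transcript counts of one gene into grid pixels.

    pos: (N, 2) cell coordinates in µm; counts: length-N counts.
    Returns a dict mapping (i, j) pixel indices to summed counts.
    """
    ij = np.floor((pos - np.asarray(origin)) / grid).astype(int)
    out = {}
    for key, c in zip(map(tuple, ij), counts):
        out[key] = out.get(key, 0.0) + float(c)
    return out

def cosine_at_matched_pixels(a, b):
    """Cosine similarity of two pixel->count dicts over the union of pixels."""
    keys = sorted(set(a) | set(b))
    va = np.array([a.get(k, 0.0) for k in keys])
    vb = np.array([b.get(k, 0.0) for k in keys])
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom > 0 else 0.0
```

Computing this per gene over the target and aligned source yields the distribution of cosine similarities summarized above, where a well-aligned spatially patterned gene approaches 1.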
We next compare the alignment achieved with STalign to the alignment from a supervised affine transformation based on our previously manually placed landmarks (Supp Fig 4a, Methods). We visually confirm that a supervised affine alignment results in a lower degree of spatial gene expression correspondence than alignment by STalign (Supp Fig 4b). We again evaluate the performance of the supervised affine transformation using a pixel-based cosine similarity quantification (Supp Fig 4c). We find that for spatially patterned genes, the cosine similarity is consistently higher, with a mean difference of 0.09, for the alignment by STalign compared to the supervised affine alignment (Supp Fig 4d). In contrast, for non-spatially patterned genes, the cosine similarity is more comparable, with a mean difference of 0.02 for the alignment by STalign compared to supervised affine (Supp Fig 4e). This greater improvement in spatial gene expression correspondence for the alignment achieved with STalign compared to the supervised affine transformation for spatially patterned genes suggests that modeling non-linearity in alignment with approaches like STalign can achieve higher alignment accuracy than linear alignment approaches.

Figure 2. Evaluation of STalign based on spatial gene expression correspondence. a. Correspondence of gene expression spatial organization between the target and aligned source for select spatially patterned genes. b. Transcript counts in the target compared to the aligned source at matched pixels for select genes: Gabbr2, Gpr6, and Cckar. c. Distribution of cosine similarities between transcript counts in target versus aligned source at matched pixels for 457 spatially patterned genes with select genes marked. d. Spatial pattern of expression for a select non-spatially patterned gene.

We next applied STalign to align a previously analyzed single-cell resolution ST dataset of a full coronal slice of the adult mouse brain assayed by MERFISH to a multi-cellular pixel resolution ST dataset of an analogous hemi-brain slice assayed by Visium (Fig 3a). As such, in addition to being from different ST technologies, these two ST datasets further represent partially matched tissue sections. Because of this partial matching, we incorporated manually placed landmarks to initialize the alignment as well as to further help steer our gradient descent towards an appropriate solution (Online Methods). For the Visium dataset, we leveraged a corresponding registered single-cell resolution hematoxylin and eosin (H&E) staining image obtained from the same tissue section for the alignment (Methods). To evaluate the performance of this alignment, we again take advantage of the available gene expression measurements. Due to partially matched tissue sections, we restricted downstream comparisons to tissue regions STalign assessed with a matching probability > 0.85 (Methods). We again visually confirm that the spatial alignment results in high spatial gene expression correspondence, albeit at differing resolutions across the two technologies (Fig 3b, Supp Fig 5a). To further quantify this spatial gene expression correspondence, we evaluated the gene expression magnitudes at matched spatial locations across the aligned tissue sections for the 415 genes with non-zero expression in both ST datasets.
We evaluated these genes for spatial autocorrelation on the Visium data to identify 227 spatially patterned genes and 188 non-spatially patterned genes (Methods). Due to the resolution differences between the two technologies, to ensure appropriate comparisons, we used the positions of the Visium spots to aggregate MERFISH cells into matched-resolution pseudospots. Likewise, to control for detection efficiency differences between the two technologies, we performed the same counts-per-million normalization on the Visium spot gene expression measurements and the aggregated MERFISH pseudospot gene expression measurements (Fig 3c, Supp Fig 5b). We again evaluated gene expression correspondence at spatially matched spots using cosine similarity and observed a median cosine similarity of 0.55 across spatially patterned genes (Fig 3d) and a median cosine similarity of 0.06 across non-spatially patterned genes (Supp Fig 6). We note that this gene expression correspondence after spatial alignment is lower than what was previously observed within technologies, most likely due to variation in detection efficiency across technologies, in addition to variation in tissue preservation, rather than poor spatial alignment. While MERFISH detects targeted genes at high sensitivity, Visium enables untargeted transcriptome-wide profiling, though sensitivity for individual genes may be lower⁹. Likewise, while the MERFISH dataset was generated with fresh frozen tissue, the Visium dataset was generated with FFPE-preserved tissue.
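A minimal sketch of the pseudospot aggregation and counts-per-million normalization described above. The 27.5 µm spot radius (half the 55 µm Visium spot diameter), the brute-force distance check, and the function names are assumptions for illustration, not the exact Methods implementation:

```python
import numpy as np

def aggregate_to_pseudospots(cell_pos, cell_counts, spot_pos, radius=27.5):
    """Sum single-cell counts into pseudospots centered at Visium spot positions.

    cell_pos: (N, 2) aligned cell coordinates in µm; cell_counts: (N, G)
    gene count matrix; spot_pos: (S, 2) spot centers; radius in µm.
    Returns an (S, G) pseudospot count matrix.
    """
    pseudo = np.zeros((spot_pos.shape[0], cell_counts.shape[1]))
    for s in range(spot_pos.shape[0]):
        d = np.linalg.norm(cell_pos - spot_pos[s], axis=1)
        pseudo[s] = cell_counts[d <= radius].sum(axis=0)
    return pseudo

def cpm(counts):
    """Counts-per-million normalization per spot (row)."""
    totals = counts.sum(axis=1, keepdims=True).astype(float)
    totals[totals == 0] = 1.0  # avoid division by zero for empty spots
    return counts / totals * 1e6
```

Applying `cpm` to both the Visium spot matrix and the MERFISH pseudospot matrix puts the two technologies on the same per-spot scale before the cosine similarity comparison.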
Still, we anticipate that while sensitivity to specific genes may vary across technologies and with different tissue preservation techniques, the underlying cell-types should be consistent. Therefore, we sought to evaluate the performance of our alignment based on cell-type spatial correspondence. To identify putative cell-types, we performed transcriptional clustering analysis on the single-cell resolution MERFISH data (Supp Fig 7a) and deconvolution analysis¹⁰ on the multi-cellular pixel-resolution Visium data (Fig 4a, Methods). We matched cell-types based on transcriptional similarity between cell clusters and deconvolved cell-types (Supp Fig 7b). Indeed, we visually observe high spatial correspondence across matched cell-types (Fig 4a-b). We evaluated the proportional correspondence of cell-types at aligned spot and pseudospot spatial locations by cosine similarity and observed a high median cosine similarity of 0.75 across cell-types (Fig 4c-…).

Tissues are inherently 3-dimensional (3D), and tissue sections are subject to distortions in 3D as well as 2D. As such, a more precise spatial alignment of 2D tissue sections must accommodate this 3D distortion. The underlying mathematical framework for STalign is amenable to alignment in 2D as well as 3D (Online Methods). We thus applied STalign to align ST datasets to a 3D common coordinate framework (CCF).
Specifically, we applied STalign to align 9 ST datasets of the adult mouse brain assayed by MERFISH to a 50 µm resolution 3D adult mouse brain CCF established by the Allen Brain Atlas¹¹ (Methods, Fig 5a). We note that such a 3D alignment can accommodate deformations in and out of 2D planes (Fig 5b). In the construction of the Allen Brain Atlas CCF, brain regions were delineated based on several features, such as cellular architecture, differential gene expression, and functional properties, via modalities such as histological stains, in situ hybridization, and connectivity experiments, to generate a set of reference brain region annotations¹¹. By aligning to this CCF, we can lift over these annotations to each cell (Fig 5c, Supp Fig 8a), enabling further evaluation of variations in gene expression and cell-type composition within and across these annotated brain regions.
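Once cells are mapped into CCF coordinates, the lift-over itself reduces to indexing the annotation volume at each cell's nearest voxel. A sketch, assuming the 50 µm annotation volume has been loaded as a 3D numpy integer array (e.g. from the `annotation_50.nrrd` file) with the origin and axis order given here as assumptions:

```python
import numpy as np

def lift_over_annotations(points_um, annot, voxel_um=50.0, origin_um=(0.0, 0.0, 0.0)):
    """Assign an atlas region ID to each aligned cell by nearest-voxel lookup.

    points_um: (N, 3) cell coordinates after mapping into CCF space (µm);
    annot: 3D integer array of region IDs (e.g. the Allen 50 µm annotation
    volume); voxel_um / origin_um describe the assumed voxel grid.
    Cells falling outside the volume receive label 0 (background).
    """
    idx = np.round((points_um - np.asarray(origin_um)) / voxel_um).astype(int)
    labels = np.zeros(len(points_um), dtype=annot.dtype)
    inside = np.all((idx >= 0) & (idx < np.array(annot.shape)), axis=1)
    ii = idx[inside]
    labels[inside] = annot[ii[:, 0], ii[:, 1], ii[:, 2]]
    return labels
```

Grouping cells by the returned region IDs is what enables the per-region gene enrichment and cell-type composition analyses that follow.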

Cell-type composition difference between paired brain regions from two MERFISH replicates. The x axis represents cell-type composition difference within matched brain structures annotated by STalign across replicates, and the y axis represents cell-type composition difference between STalign-annotated regions and size-matched random brain regions.

To assess the performance of our atlas alignment and lift-over annotations, we first confirmed the enrichment of genes within certain brain regions. Numerous previous studies have shown that some brain regions can be demarcated based on the expression of particular genes¹²,¹³. We used these characteristic gene expression patterns to evaluate whether the brain regions lifted over from the Allen Brain Atlas CCF by STalign indeed contain expression of known marker genes. Consistent with previous studies, we found Grm2 to be visually primarily enriched in the dentate gyrus brain region¹⁴, Sstr2 to be enriched in the cerebral cortical layers 5 and 6 brain region¹⁵, and Gpr161 to be enriched in the CA1 brain region¹⁶ (Fig 5d), which was consistent across replicates (Supp Fig 8b).

Next, we took a more agnostic approach to assess the performance of our atlas alignment and lift-over annotations by evaluating the consistency of cell-type compositional heterogeneity within brain structures across replicates. To identify cell-types, we performed unified transcriptional clustering analysis on these 9 ST datasets to identify transcriptionally distinct cell clusters and annotated them as cell-types based on known differentially expressed marker genes (Methods, Fig 5e-f, Supp Fig 9a). Many brain regions are known to have a characteristic cell-type distribution¹⁷⁻¹⁹.
Consistent with previous studies²⁰, we observed cell-types to be spatially and compositionally variable across brain regions (Fig 5c, Fig 5e). We visually confirmed that this spatial and compositional variability is consistent across replicates (Supp Fig 8a, Supp Fig 9b). To further quantify this consistency, for each brain region, we evaluated whether its cell-type composition was more similar between replicates than compared to a randomly demarcated brain region of matched size (Methods). For an accurate atlas alignment, we would expect the lifted-over brain region annotations to be more similar in cell-type composition across replicates, particularly for brain structures with distinct cell-type compositions, as compared to random brain regions of matched size. Indeed, we found that in 93% of evaluated brain structures (131/141), the cell-type composition was significantly more similar (paired t-test p-value = 6.805e-121) between replicates than compared to a random brain region of matched size (Fig 5g). For the 7% (10/141) of brain regions that were less similar across replicates, we found that the number of cells in these brain regions was significantly lower (Wilcoxon rank-sum test p-value = 0.002) than in other brain regions (Supp Fig 10a). Notably, 60% of these brain regions had a minimum width of under 50 µm, including both compact and long, thin structures (Supp Fig 10b), highlighting potential limitations with respect to alignment accuracy of such structures at this given resolution of alignment.
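The per-region comparison described above can be sketched as a proportion vector per region plus a difference metric; total-variation distance is one assumed choice here (the Methods define the exact metric), with the paired t-test then applied across regions to the matched-versus-random differences:

```python
import numpy as np

def composition(labels, cell_types):
    """Cell-type proportion vector for the cells of one brain region.

    labels: array of per-cell cell-type labels; cell_types: fixed ordering
    of all cell-types so proportion vectors are comparable across regions.
    """
    counts = np.array([np.sum(labels == t) for t in cell_types], dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else counts

def composition_difference(p, q):
    """Total-variation distance between two proportion vectors
    (0 = identical composition, 1 = completely disjoint)."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())
```

For each region, `composition_difference` between replicates (x axis of Fig 5g) is compared against the difference to a size-matched random region (y axis); an accurate lift-over places most regions below the diagonal.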
Finally, we also sought to assess the performance of our atlas alignment and lift-over annotations by evaluating cell-type compositions within and beyond annotated brain region boundaries (Methods). Specifically, we compared the entropy of each brain region based on the region's cell-type composition to its entropy if the boundaries of the region were expanded (Fig 5h). Again, because of the characteristic cell-type distributions within brain regions, in which one or a few cell-types predominate, we would expect accurate lifted-over brain region annotations to exhibit entropies that are comparatively lower than if the boundaries of these regions were expanded, as more cell-types would be incorporated into the region and entropy would increase. We therefore expanded the brain structures lifted over by STalign by 100 nearest neighbors (NN), or approximately 100 µm, and evaluated the change in entropy. We performed the same analysis on randomly demarcated brain regions of matched size, which were also expanded by 100 NN to account for increases in entropy due to the incorporation of more cells. We found that the entropies for the original brain region annotations lifted over by STalign were significantly lower (paired t-test p-value = 8.6e-18) than for the expanded regions. In contrast, the entropies for random brain regions were not significantly lower (paired t-test p-value = 0.12) for the expanded regions (Supp Fig 11).
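The entropy analysis can be sketched as follows, with a brute-force k-nearest-neighbour expansion standing in for whatever neighbour structure the Methods use; the function names are illustrative:

```python
import numpy as np

def shannon_entropy(labels):
    """Shannon entropy (nats) of the cell-type composition of a region."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def expand_region(region_idx, all_pos, k=100):
    """Expand a region by its k nearest outside cells (brute force).

    region_idx: indices of cells currently in the region;
    all_pos: (N, 2) positions of all cells. Returns expanded indices.
    """
    region = set(region_idx.tolist())
    outside = np.array([i for i in range(len(all_pos)) if i not in region])
    # distance from each outside cell to its closest region cell
    d = np.min(
        np.linalg.norm(
            all_pos[outside][:, None, :] - all_pos[region_idx][None, :, :], axis=2
        ),
        axis=1,
    )
    nearest = outside[np.argsort(d)[:k]]
    return np.concatenate([region_idx, nearest])
```

For an accurately annotated region dominated by a few cell-types, `shannon_entropy` over the expanded indices should exceed the entropy of the original region, which is the comparison behind Fig 5h.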
Taken together, these results demonstrate that STalign can align ST datasets to a 3D CCF to consistently lift over atlas annotations that recapitulate the unique gene expression and cell-type composition within brain regions.

STalign applicable to diverse tissues profiled by diverse ST technologies

STalign relies on variation in cell densities that generally forms visible structures that can be used for alignment. As we have shown, alignment across samples and animals is possible for tissues with highly prototypic structures such as the brain. We further highlight the applicability of STalign to the diverse ST technologies that can assay this tissue by demonstrating that we can apply STalign to achieve structural correspondence for coronal slices of the adult mouse brain.

Here, we presented STalign, which builds on advancements in LDDMM to perform alignment of ST datasets in a pairwise manner within ST technologies, across ST technologies, as well as to a 3D common coordinate system. We have shown that STalign achieves high accuracy based on the spatial proximity of manually identified shared landmarks as well as gene expression and cell-type correspondence at matched spatial locations after alignment. We note that, based on these metrics, STalign outperforms affine transformations alone, highlighting the utility of local, non-linear transformations in alignment. STalign can further accommodate partially matched tissue sections, where one tissue section may be a fraction of another. We further apply STalign to align ST datasets to a 3D CCF to enable automated lift-over of CCF annotations, such as brain regions, in a scalable manner. We confirm that lifted-over brain region annotations identify cells that express expected genes for a variety of brain regions. We also show that brain region annotations lifted over by STalign exhibit consistent cell-type compositions across replicates and within boundaries compared to random brain regions matched in size.

We anticipate that future applications of STalign to ST data, particularly across ST technologies, will enable cross-technology comparisons as well as cross-technology integration through spatial alignment. In particular, aligning ST data for similar tissues across different ST technology platforms may allow us to better interrogate platform-specific differences and strengths. Given that different ST technologies currently generally prioritize either resolution or genome-wide capabilities, we may wish to apply different ST technologies on serial sections to leverage their unique strengths to characterize matched spatial locations. With atlasing efforts like the Human BioMolecular Atlas Program and others producing 3D CCFs²⁵, application of STalign to align ST data to such CCFs to enable automated lift-over of atlas structural annotations will facilitate standardization and unification of biological insights regarding annotated structures. Likewise, STalign complements gene-expression-based approaches for sample alignment²⁶ by focusing on real space rather than a higher-order transcriptomic manifold.
We further anticipate that future applications of STalign to ST data from structurally matched tissues in case-control settings will enhance the throughput for yielding meaningful comparisons regarding gene expression and cell-type distributions in space, as evidenced by recent applications of ST technologies to characterize spatially resolved age-related²⁷ and injury-related²⁸ gene expression variation.

As ST technologies continue to evolve, we anticipate STalign will continue to be applicable due to our use of rasterization to convert the positions of single cells into an image with specified resolution. The runtime of each iteration of the STalign alignment algorithm scales with the number of pixels in this image. For most evaluated datasets, we find that STalign is generally able to converge on an optimal alignment within a few minutes to a few hours, depending on the number of pixels, the number of iterations, and other system variables (Methods, Supp Table 2). In contrast, other alignment algorithms generally scale in memory and runtime with the number of spatially resolved measurements (spots or cells)¹,², which will likely make them computationally untenable as ST technologies evolve to increase the number of spatially resolved measurements that can be assayed. Overall, we anticipate that the ability for users to choose the rasterization resolution, and therefore the number of pixels in the rasterized image, will allow STalign to maintain its utility for larger datasets.
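The rasterization step this scaling argument rests on can be sketched as a 2D histogram of cell positions at a user-chosen pixel size. STalign additionally smooths the result with a Gaussian kernel; this minimal numpy version omits the blur, and the function name is illustrative:

```python
import numpy as np

def rasterize(pos, dx=50.0):
    """Rasterize single-cell positions into a cell-density image.

    pos: (N, 2) cell coordinates in µm; dx: pixel size in µm. A larger dx
    yields fewer pixels, and per-iteration runtime scales with pixel count,
    not the number of cells.
    Returns (image, (min_corner, max_corner)) describing the pixel grid.
    """
    mins = pos.min(axis=0)
    ij = np.floor((pos - mins) / dx).astype(int)
    shape = ij.max(axis=0) + 1
    img = np.zeros(shape)
    np.add.at(img, (ij[:, 0], ij[:, 1]), 1.0)  # accumulate one count per cell
    return img, (mins, mins + shape * dx)
```

Coarsening `dx` shrinks the image quadratically, which is why the rasterization resolution, rather than the number of assayed cells, governs STalign's runtime.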
Further, as STalign is based on an LDDMM transformation model for alignment, it inherits the same limitations. As LDDMM relies on optimization using gradient descent, the resulting alignment solution may converge on local minima. Strategies to guide the optimization away from potential local minima may be applied in the future. Likewise, the more different the source and target are for alignment, particularly for partially matching sections, the more important the initialization will be for this optimization. As we have shown, landmark points may be used to guide the initialization of an orientation and scaling for alignment. In addition, LDDMM enforces an inverse consistency constraint such that every observation in the target must have some correspondence in the source, in a manner that cannot accommodate holes or other topological differences in the tissue through the deformation alone⁷. As such, when performing alignments, we advise choosing the more complete tissue section as the source, because our Gaussian mixture modeling for accommodating partially matched tissues and other artifacts applies to the target image intensity only.

Still, alignment accuracy at the resolution of single cells is limited by the fact that there is generally no one-to-one correspondence between cells across samples, particularly for complex tissues. As such, accuracy can typically only be expected up to a "mesoscopic scale" at which it is reasonable to define cell density²⁹. As we have shown, this presents challenges particularly in aligning thin structures.
While STalign currently uses an isotropic (Gaussian) kernel to estimate cell densities, future work considering non-isotropic kernels may improve accuracy for these thin structures. However, generally, our choice of kernel will inherently bias our alignment towards accuracy at a certain structural scale. Likewise, although we focused here on aligning based on cell densities, STalign and the underlying LDDMM framework can also be applied to align using cellular features such as gene expression magnitude, reduced-dimensional representations of gene expression such as principal components, or cell-type annotations, which may improve the accuracy of alignment for regions with homogeneous cell density but heterogeneous gene expression and cell-type composition. However, integration of such features in the alignment process necessitates orthogonal means of performance evaluation beyond the correspondences in gene expression magnitude and cell-type proportions that we have used here. By aligning based on cell densities, we do not require shared gene expression quantifications or unified cell-type annotations, potentially enhancing flexibility and providing opportunities for integrating across other data modalities for which spatially resolved single-cell resolution information is available, such as other spatial omics data, in the future.
Overall, we anticipate that moving forward, STalign will help provide a unified mathematical framework for ST data alignment to enable integration and downstream analyses requiring spatial alignment, revealing new insights regarding transcriptomic differences between different tissue structures and across various physiological axes.

(https://singlecell.broadinstitute.org/single_cell/study/SCP1830/spatial-atlas-of-molecular-[…]pes-and-aav-accessibility-across-the-whole-mouse-brain)

Developing heart data for samples CN73_E1 and CN73_E2 were downloaded from the Human Developmental Cell Atlas (https://hdca-sweden.scilifelab.se/a-study-on-human-heart-).

The 50 µm resolution 3D adult mouse brain CCF was obtained from the Allen Brain Atlas website (https://download.alleninstitute.org/informatics-archive/current-release/mouse_ccf/annotation/ccf_2017/annotation_50.nrrd).

Application of STalign
To align MERFISH datasets, we applied STalign in a pairwise manner across replicates for sections from matched locations with respect to bregma, rasterized at a 50µm resolution, and iterated over 1000 epochs, with the following changes to default parameters (sigmaM: 0.2).

To align a MERFISH dataset to a Visium dataset, we applied STalign with MERFISH Slice 2 Replicate 3, rasterized at a 50µm resolution, as the source and the high-resolution Visium hematoxylin and eosin (H&E) staining image as the target. We utilized the landmark points stored in Merfish_S2_R3_points.npy and tissue_hires_image_points.npy as inputs pointsI and pointsJ. We iterated for 200 epochs with the following changes to default parameters (sigmaP: …).

To align MERFISH to the Allen CCF, we applied STalign using the 3D reconstructed Nissl image from the Allen CCF atlas as a source, and each of our 9 MERFISH images as a target.

To align Xenium datasets, we applied STalign with Xenium Breast Cancer Replicate 1 as the source and with Xenium Breast Cancer Replicate 2 as the target, rasterized at 30µm resolution. We placed a set of 3 manually chosen landmark points to compute an initial affine transformation.

To identify cell-types in the Visium data, we applied STdeconvolve on a corpus of 838 genes after filtering out lowly expressed genes (<100 copies), genes present in <5% of spots, and genes present in >95% of spots, and restricting to significantly over-dispersed genes with alpha = 1e-16 to obtain a corpus of <1000 genes, resulting in 16 deconvolved cell-types.
To generate randomly demarcated brain regions, a random number generator (random.randint) defined the x, y coordinate of the center of the random region, and the random region was composed of the N closest points to the center, where N is the number of cells in the brain region. A slice/replicate with random regions was constructed for all slice/replicates with STalign-annotated regions, and the number of cells (N) was the same for STalign and randomly demarcated regions.

To compare cell-type compositions, each region was represented by a cell-type vector composed of the proportion of each cell type in the region (29×1 vector). We calculated the Euclidean distance between cell-type vectors of the same region across replicates in Slice 2 using the regions annotated by STalign. The Euclidean distance was also computed across replicates in Slice 2 using randomly demarcated brain regions and STalign brain regions. The Euclidean distances of both groups were compared using a paired t-test. 431 data points for each group were used, comparing replicate 1 to replicate 2, replicate 2 to replicate 3, and replicate 3 to replicate 1.
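The cell-type vector comparison described above can be sketched on toy data (illustrative only; the real analysis used 29 cell types from the MERFISH annotations):

```python
import numpy as np

def celltype_vector(labels, all_types):
    """Proportion of each cell type among the cells of a region."""
    counts = np.array([(labels == t).sum() for t in all_types], dtype=float)
    return counts / counts.sum()

types = np.array(["A", "B", "C"])
region_rep1 = np.array(["A", "A", "B", "C", "C", "C"])   # region in replicate 1
region_rep2 = np.array(["A", "B", "B", "C", "C", "C"])   # same region, replicate 2

v1 = celltype_vector(region_rep1, types)   # [2/6, 1/6, 3/6]
v2 = celltype_vector(region_rep2, types)   # [1/6, 2/6, 3/6]
dist = np.linalg.norm(v1 - v2)             # Euclidean distance between compositions
assert abs(dist - np.sqrt(2) / 6) < 1e-12
```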

To evaluate annotated brain region boundaries, brain regions were expanded using k-nearest neighbors (k=100) with the 'ball tree' algorithm for each region and each replicate in Slice 2 (sklearn.neighbors.NearestNeighbors). The procedure was conducted for STalign-annotated brain regions and randomly demarcated brain regions. Shannon's entropy was evaluated for STalign-annotated and randomly demarcated brain regions that were expanded by 100 nearest neighbors. Paired t-tests were used to compute p-values between original and expanded brain regions for STalign and random groups. Effect size was computed as a difference in the means of the compared distributions. PP plots were used to visualize normality, and we used a Gaussian fit with R>0.8 and a variance ratio less than 4 to confirm normality and equal variances. 431 data points for each group were used.

To evaluate regions for which the Euclidean distance between two STalign regions was greater than the distance between random and STalign regions, we calculated the number of cells and Shannon's entropy of each region and tested for significance using a Wilcoxon rank-sum test due to the small sample size.
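Shannon's entropy of a region's cell-type composition can be sketched as follows (illustrative toy regions, not the actual MERFISH data; a boundary region that absorbs mixed neighbors after k-nearest-neighbor expansion would show increased entropy):

```python
import numpy as np

def shannon_entropy(labels):
    """Shannon's entropy (in nats) of the cell-type composition of a region."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

uniform = np.array(["A", "B", "C", "D"] * 25)   # maximally mixed region
pure = np.array(["A"] * 100)                    # homogeneous region
assert abs(shannon_entropy(uniform) - np.log(4)) < 1e-12
assert shannon_entropy(pure) == 0.0
```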
1 Methods

1.1 Cell density data model
For single-cell resolution spatially resolved transcriptomics data, we model the point sets of detected cells in the framework of varifold measures [1]. While the theory extends to more complex spaces of features, here we focus on image varifolds [2] by utilizing the locations of cells only, termed the marginal space measure (after marginalizing out features other than spatial location) as defined in [3].
Briefly, these space measures are weighted sums of Dirac δ distributions

$$\rho \doteq \sum_{i=1}^{N} w_i^{\rho}\,\delta_{x_i^{\rho}},$$

where $x_i^{\rho} \in \mathbb{R}^D$ stores the spatial coordinate of the ith out of N cells, and $w_i^{\rho} \in \mathbb{R}$ stores its weight. In this work, cell positional data is two dimensional, so D = 2, and with some abuse of notation we sometimes write (x, y) instead of x.
We aim to evaluate the similarity between two single-cell resolution spatially resolved transcriptomics datasets, which we call a source and a target. Note that other commonly used terms for source are template, atlas, or moving image, while another commonly used term for target is fixed image. To compute distances between datasets, we embed their corresponding space measures ρ^S and ρ^T respectively in the dual of a Reproducing Kernel Hilbert Space V* and use the standard operator norm (see for example [4]). For some choice of kernel function k, the norm squared is

$$\|\rho^S - \rho^T\|_{V^*}^2 = \sum_{i,j} w_i^{\rho^S} w_j^{\rho^S} k(x_i^{\rho^S}, x_j^{\rho^S}) - 2\sum_{i,j} w_i^{\rho^S} w_j^{\rho^T} k(x_i^{\rho^S}, x_j^{\rho^T}) + \sum_{i,j} w_i^{\rho^T} w_j^{\rho^T} k(x_i^{\rho^T}, x_j^{\rho^T}). \quad (1)$$

Here we chose k as a Gaussian with $k(x, y) = \exp(-|x - y|^2 / (2\sigma^2))$, where |·| denotes the standard Euclidean norm and σ is a user-specified kernel width parameter. The variables $w_i^{\rho^S}, w_j^{\rho^S}$ and $x_i^{\rho^S}, x_j^{\rho^S}$ correspond to the weights and spatial coordinates of the ith and jth cells in the source, while $w_i^{\rho^T}, w_j^{\rho^T}$ and $x_i^{\rho^T}, x_j^{\rho^T}$ correspond to the weights and spatial coordinates of the cells in the target. For simplicity, in the main paper we write $(x^{\rho^S}, y^{\rho^S})$ and $(x^{\rho^T}, y^{\rho^T})$ for source and target points respectively.
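For concreteness, this double-sum norm can be computed directly in a few lines of NumPy (an illustrative sketch, not STalign's implementation; all names and values here are hypothetical):

```python
import numpy as np

def varifold_norm_sq(xS, wS, xT, wT, sigma=50.0):
    """Squared RKHS-dual distance between two weighted point sets, using a
    Gaussian kernel k(x, y) = exp(-|x - y|^2 / (2 sigma^2)).
    O(N^2) double sums; rasterization makes this cheaper for large N."""
    def k_sum(xa, wa, xb, wb):
        # pairwise squared distances between rows of xa and rows of xb
        d2 = ((xa[:, None, :] - xb[None, :, :]) ** 2).sum(-1)
        return (wa[:, None] * wb[None, :] * np.exp(-d2 / (2 * sigma**2))).sum()
    return k_sum(xS, wS, xS, wS) - 2 * k_sum(xS, wS, xT, wT) + k_sum(xT, wT, xT, wT)

rng = np.random.default_rng(0)
xS = rng.uniform(0, 1000, size=(200, 2))    # source cell coordinates (e.g. microns)
wS = np.ones(200)                           # weights initialized to 1
xT = xS + rng.normal(0, 5, size=(200, 2))   # jittered copy as a "target"
wT = np.ones(200)

assert varifold_norm_sq(xS, wS, xS, wS) < 1e-8   # identical datasets: distance 0
assert varifold_norm_sq(xS, wS, xT, wT) > 0      # perturbed dataset: positive distance
```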
In STalign, we initialize weights to 1, though applying nonlinear deformations will modify these weights as discussed below in section 1.4.

1.2 Rasterization
Since computation of this norm is of quadratic complexity in the number of points, we turn to a more efficient representation for computing optimal transformations through rasterization. We can reduce the complexity of our calculations significantly by approximating our space measures through sampling a density signal on a regular grid (known as rasterization), rather than keeping a list of points and weights.
By defining k^{1/2} such that k^{1/2} * k^{1/2} = k (where * refers to convolution), the above expression for norm squared (1) can be written as

$$\|\rho^S - \rho^T\|_{V^*}^2 = \|k^{1/2} * \rho^S - k^{1/2} * \rho^T\|_2^2 = \|I^S - I^T\|_2^2.$$

Note that when k is a radially symmetric Gaussian, k^{1/2} is also a radially symmetric Gaussian but with half the variance. Here we have defined the smooth density function

$$I(x) \doteq [k^{1/2} * \rho](x) = \sum_{i=1}^{N} w_i^{\rho}\, k^{1/2}(x - x_i^{\rho}),$$

and ∥·∥_2 is the L² norm on functions. Due to the smoothness introduced by k^{1/2}, these functions can be accurately discretized by sampling them on a uniform pixel grid at a resolution rate defined by the user and comparable in size to σ.
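The factorization k^{1/2} * k^{1/2} = k can be checked numerically in 1D (an illustrative sketch; the normalization constant shown is specific to the 1D case):

```python
import numpy as np

sigma = 1.0
x = np.linspace(-10, 10, 2001)          # fine symmetric grid, dx = 0.01
dx = x[1] - x[0]

k = np.exp(-x**2 / (2 * sigma**2))      # Gaussian kernel k with width sigma
c = (np.pi * sigma**2 / 2) ** -0.25     # 1D constant so that k_half * k_half = k
k_half = c * np.exp(-x**2 / sigma**2)   # Gaussian with half the variance

# discrete convolution approximates the continuous one on this fine grid
conv = np.convolve(k_half, k_half, mode="same") * dx
assert np.max(np.abs(conv - k)) < 1e-6
```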
Without rasterization, evaluating the function I at a given point would involve a sum over every x i , an order N complexity operation. After rasterization the function I can be evaluated at any point in order 1 complexity using bilinear interpolation. This allows the norm to be evaluated by summing over a pixel grid in order P complexity (where P is the number of pixels), rather than a double sum over the points x i in order N 2 complexity.
For example, MERFISH Slice 2 Replicate 3 has 85,958 cells, and the rasterized dataset has 336 × 256 = 86,016 pixels. The naive approach would involve 7,388,777,764 terms in the sum (pairs of cells), whereas in the rasterization approach there are only 86,016 terms in the sum (pixels). This is an approximately 86,000-fold increase in efficiency, which occurs for each iteration of optimization, ignoring the negligible time cost of rasterization itself, which occurs only once at the start of registration.
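The agreement between the O(N²) double sum and the O(P) rasterized L² norm can be verified numerically on a small example (an illustrative NumPy sketch, not STalign's implementation; point sets and parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 30.0                                    # kernel width (e.g. microns)
xS = rng.uniform(100, 400, size=(40, 2))        # source cells, weights all 1
xT = rng.uniform(100, 400, size=(40, 2))        # target cells, weights all 1

def direct_norm_sq(xa, xb):
    """O(N^2) double-sum norm squared with Gaussian kernel."""
    def ksum(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma**2)).sum()
    return ksum(xa, xa) - 2 * ksum(xa, xb) + ksum(xb, xb)

# rasterize: sample I = k_half * rho on a regular grid with padding
h = sigma / 4                                   # pixel size comparable to sigma
g = np.arange(100 - 4 * sigma, 400 + 4 * sigma, h)
gx, gy = np.meshgrid(g, g, indexing="ij")
grid = np.stack([gx.ravel(), gy.ravel()], -1)

def rasterize(pts):
    c = np.sqrt(2 / (np.pi * sigma**2))         # 2D constant so k_half*k_half = k
    d2 = ((grid[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    return (c * np.exp(-d2 / sigma**2)).sum(-1)

raster_norm_sq = ((rasterize(xS) - rasterize(xT)) ** 2).sum() * h**2
direct = direct_norm_sq(xS, xT)
assert abs(raster_norm_sq - direct) / direct < 0.02   # close agreement
```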
In this section we showed how a rasterized image I can be produced from a list of cell locations, in a manner compatible with the theory of varifolds. However, our registration algorithm can be performed with any standard rasterized image type. For example, in the main manuscript we show examples where I^T is a red-green-blue image corresponding to a brightfield microscopy image of H&E-stained tissue. How such images of different contrast profiles are handled is described in section 1.5.

1.3 Diffeomorphic transformation model
We estimate alignments between two rasterized datasets by applying a transformation ϕ : R^D → R^D, with ϕ(x) ≐ Aφ_1(x), the composition of two transformations: a diffeomorphism (a smooth differentiable transformation with a smooth differentiable inverse) φ_1 : R^D → R^D generated in the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework [5], and an affine transformation A (i.e., a 3×3 matrix in homogeneous coordinates whose upper left 2×2 block is a linear transform and upper right 2×1 block is a translation vector). In this notation, Aφ_1(x) denotes matrix multiplication of the matrix A and the vector φ_1(x) in homogeneous coordinates.
In the LDDMM framework a diffeomorphism is generated by integrating a time-varying velocity field v_t over time t ∈ [0, 1], by solving the ordinary differential equation

$$\frac{d}{dt}\varphi_t(x) = v_t(\varphi_t(x)) \quad (4)$$

with initial condition φ_0 = id. For identifying alignments, we optimize over v_t rather than φ_1 directly, and to emphasize this dependence we use the superscript notation φ^v in the main text. Similarly, we use ϕ^{A,v} to emphasize the dependence of ϕ on both A and v_t. As long as v_t is a smooth function of space, φ_1^v is guaranteed to be diffeomorphic. We enforce this through regularization as described below in section 1.5.
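A minimal sketch of generating a deformation by integrating a velocity field, using forward Euler for simplicity (STalign's actual integrator and parameters may differ, and the velocity field shown here is hypothetical):

```python
import numpy as np

def flow(points, v, nt=50):
    """Integrate d(phi)/dt = v(phi) with phi_0 = id by forward Euler steps."""
    phi = points.copy()
    dt = 1.0 / nt
    for _ in range(nt):
        phi = phi + dt * v(phi)
    return phi

# a smooth, spatially varying velocity field (hypothetical example)
v = lambda p: 0.3 * np.sin(p / 50.0)

pts = np.random.default_rng(2).uniform(0, 100, size=(100, 2))
fwd = flow(pts, v)                      # phi_1 applied to the points
back = flow(fwd, lambda p: -v(p))       # flowing -v approximately inverts phi_1
assert np.max(np.abs(back - pts)) < 0.1
```

For a time-independent field, the time-1 flow of −v inverts the time-1 flow of v exactly in the continuum; the small residual here is Euler discretization error.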
While this section described how we parameterize our transformations, next we need to describe how they act to deform our datasets, in order to use them in an optimization problem.

1.4 Action of transformations on datasets
The action of a transformation ϕ on a space measure dataset ρ moves the spatial coordinate of each cell and adjusts the weight of each cell based on the transformation's Jacobian determinant:

$$\phi \cdot \rho \doteq \sum_{i=1}^{N} |d\phi(x_i^{\rho})|\, w_i^{\rho}\, \delta_{\phi(x_i^{\rho})}, \quad (5)$$

where dϕ(x) denotes the matrix of partial derivatives of the map ϕ at the point x, and |·| applied to a matrix represents its determinant.
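For an affine map the Jacobian is constant, so the measure action reduces to moving the points and scaling all weights by |det A| (an illustrative sketch with hypothetical values):

```python
import numpy as np

# Measure action of an affine transform phi(x) = A x + b on a weighted point set:
# positions move, and each weight is scaled by |det d(phi)| = |det A|.
A = np.array([[1.2, 0.1],
              [0.0, 0.9]])              # linear part of the affine transform
b = np.array([10.0, -5.0])              # translation

x = np.random.default_rng(3).uniform(0, 100, size=(50, 2))   # cell coordinates
w = np.ones(50)                                              # weights initialized to 1

x_new = x @ A.T + b
w_new = w * abs(np.linalg.det(A))       # det A = 1.2 * 0.9 - 0.1 * 0 = 1.08

assert np.allclose(w_new, 1.08)
```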
We note that the standard image action [ϕ · I](x) = I(ϕ^{-1}(x)) has been well studied theoretically (as a left group action), computationally (in terms of its efficient implementation through interpolation), and application-wise (in terms of its use in a variety of image registration platforms, e.g. [5]). This image action for continuous image functions is not appropriate for space datasets, and therefore the image action does not match the measure action defined in (5). However, in the dense tissue limit, the continuous image action is consistent with the measure action of (5), as proven in [3]. Since the applications shown here provide a dense approximation, this consistency motivates us to leverage the continuous image action for its computational advantages. We write the image action as follows:

$$[\phi \cdot I](x) \doteq I(\phi^{-1}(x)) \approx [k^{1/2} * (\phi \cdot \rho)](x).$$

The approximate equality is accurate in our examples when k^{1/2} is narrow relative to the smoothness of v_t and wide relative to the spacing between cells. The approximate equality would be exact if k^{1/2} were a Dirac delta function. If cells are too far apart, a larger value of σ could be chosen. For a particularly sparse set of cells, a different method that does not include rasterization would be more appropriate, for example measure matching [6].

Online Methods Figure 1: Rasterization followed by deformation with the image action (left), versus naively deforming the positions of cells followed by rasterization (right). Note the intensity changes that occur in regions of high deformation.
With this formulation, deformations can be applied to smooth density images I using interpolation in order P (number of pixels) complexity. Online Methods Figure 1 illustrates the importance of the Jacobian factor. Without including this factor, transforming a density image alters its brightness, which is typically undesirable: a bigger organ tends to have more cells with the same cell density, rather than the same number of cells with a lower density.
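The density interpretation can be sketched with a uniform scaling: under the image action, density values are preserved while the total mass grows with area, matching the intuition that a bigger organ has more cells at the same density (illustrative NumPy, hypothetical values):

```python
import numpy as np

# Image action [phi . I](x) = I(phi^{-1}(x)) for a uniform scaling phi(x) = s*x.
s = 2.0
n, h = 200, 1.0
xx, yy = np.meshgrid(np.arange(n) * h, np.arange(n) * h, indexing="ij")
I = np.exp(-((xx - 50) ** 2 + (yy - 50) ** 2) / (2 * 10.0**2))   # a density blob

# I(phi^{-1}(x)) = I(x / s), evaluated analytically on the same grid
I_def = np.exp(-((xx / s - 50) ** 2 + (yy / s - 50) ** 2) / (2 * 10.0**2))

mass_before = I.sum() * h**2
mass_after = I_def.sum() * h**2
assert abs(mass_after / mass_before - s**2) < 0.01   # mass (cell count) scales by s^2
assert abs(I_def.max() - I.max()) < 1e-12            # peak density is unchanged
```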

1.5 Image registration
We compute a spatial alignment between two ST datasets by minimizing the sum of two objective functions, a regularization term R and a matching term M_θ, which we define below:

$$E(A, v) = R(v) + M_\theta(\phi^{A,v} \cdot I^S, I^T).$$

Note that the computation of I is described in section 1.2, the parameterization of ϕ^{A,v} is described in section 1.3, and the action ϕ^{A,v} · I^S is described in section 1.4. Recall that ϕ depends on both the velocity field v and the affine transform A.

Following the LDDMM framework, we regularize our diffeomorphism via

$$R(v) = \frac{1}{2\sigma_R^2} \int_0^1 \int_{\mathbb{R}^D} |(\mathrm{id} - a^2 \Delta)^p v_t(x)|^2 \, dx \, dt,$$

where id is the identity operator and σ_R² is a user-tunable parameter that adjusts the balance between matching accuracy and regularization: large values correspond to less regularization and higher matching accuracy, and small values correspond to more regularization and lower matching accuracy. Δ is the Laplacian, a is a constant with units of length that controls spatial smoothness, and p = 2 is a power that must be large enough to guarantee that results are diffeomorphisms [7]. Note that small values of a may lead to overfitting of noise, whereas large values of a may lead to low accuracy. In practice, we chose a value of a based on the spatial smoothness of deformations that we believe to be realistic. We then consider several values of σ_R² (starting with a value provided by one of our online examples) and choose the one that achieves a reasonable balance between regularization and accuracy.
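A discretized version of this regularizer can be sketched with a finite-difference Laplacian (illustrative only, with periodic boundaries and hypothetical parameter values; not STalign's implementation):

```python
import numpy as np

def R(v, a=20.0, sigmaR=1.0, h=10.0, dt=0.1):
    """Discretized R(v): sum over time of |(id - a^2 Laplacian)^p v_t|^2, p = 2,
    using a 5-point periodic finite-difference Laplacian on a pixel grid."""
    total = 0.0
    for vt in v:                                   # vt: (ny, nx, 2) field at time t
        Lv = vt.copy()
        for _ in range(2):                         # apply (id - a^2 Lap) twice, p = 2
            lap = (np.roll(Lv, 1, 0) + np.roll(Lv, -1, 0)
                   + np.roll(Lv, 1, 1) + np.roll(Lv, -1, 1) - 4 * Lv) / h**2
            Lv = Lv - a**2 * lap
        total += (Lv ** 2).sum() * h**2 * dt       # spatial and temporal quadrature
    return total / (2 * sigmaR**2)

v = np.zeros((10, 32, 32, 2))                      # 10 time steps of a 2D field
assert R(v) == 0.0                                 # the identity flow costs nothing
v[:, 16, 16, 0] = 1.0                              # a sharp (rough) perturbation
assert R(v) > R(0.5 * v) > 0.0                     # larger/rougher fields cost more
```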
Our matching takes the form of [8]:

$$M_\theta(\phi^{A,v} \cdot I^S, I^T) = \frac{1}{2\sigma_M^2} \int W_M(x)\, \big| f_\theta([\phi^{A,v} \cdot I^S](x)) - I^T(x) \big|^2 \, dx,$$

where σ_M² is a user-tunable parameter that describes the amount of noise in our imaging data (see the description of Gaussian mixture modeling below), and W_M is a spatial weight defined below.
Note that I^T need not correspond to a smooth density image as defined in section 1.2. For example, we include the case where it is a red-green-blue image corresponding to an H&E stain.
The function f_θ is a transformation of image contrast with unknown parameters θ. We use a polynomial for f_θ, in which case the minimizing parameters θ can be found exactly by solving a weighted least squares problem. The purpose of this transformation is to model differences in contrast between images from the same modality due to calibration issues, as well as contrast/color differences between different modalities. In this work we found that first-order polynomials were sufficient for accurate image registration. In other work in neuroimaging we have used third-order polynomials, which have enough degrees of freedom to map the intensity of gray matter, white matter, and background to arbitrary intensities [8]. Because there are many more pixels than degrees of freedom, it is unlikely that these polynomials will overfit the observed data I^T. However, depending on the initialization of transformation parameters, overfitting is possible: if the tissue in I^S and I^T does not overlap at all, the parameters θ may be estimated so as to zero out imaging information and transform I^S into a constant function that looks like background only.
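Fitting a first-order polynomial f_θ by weighted least squares can be sketched as follows (illustrative NumPy with synthetic intensities; not STalign's implementation):

```python
import numpy as np

# First-order polynomial contrast transform f_theta(i) = theta0 + theta1 * i,
# fit by weighted least squares (weights W from the matching component).
rng = np.random.default_rng(4)
src = rng.uniform(0, 1, 1000)                        # deformed source intensities
tgt = 2.0 * src + 0.5 + rng.normal(0, 0.01, 1000)    # target with different contrast
W = np.ones(1000)                                    # matching weights (all matched)

B = np.stack([np.ones_like(src), src], axis=1)       # design matrix [1, i]
WB = B * W[:, None]
theta = np.linalg.solve(B.T @ WB, WB.T @ tgt)        # weighted normal equations

assert abs(theta[0] - 0.5) < 0.01 and abs(theta[1] - 2.0) < 0.05
```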
The term W_M is a positive weight that represents the probability that a given pixel in the target image can be matched accurately to one in the source image. For example, if tissue is missing in the target image but not the source image, pixels in the region of missing tissue would receive a small weight. Similarly, if the target image included a signal not present in the source (e.g., a bright fluorescence signal), those pixels would also receive a small weight.
To optimize E, we alternate between updating W_M with Gaussian mixture modeling and jointly updating (θ, ϕ^{A,v}) with gradient-based methods, using an expectation maximization algorithm as discussed in [8]. Briefly, we use 3 classes in our Gaussian mixture model: pixels to be matched, background, and artifact. Each is modeled as a Gaussian random variable with an unknown mean (optionally the mean can be assumed known and specified as an input parameter), a known variance (specified as an input parameter), and an unknown prior probability. While the means for background and artifact are constant values, the mean for pixels to be matched is equal to f_θ([ϕ^{A,v} · I^S](x)) and is a function of space. Parameters are estimated by standard Gaussian mixture modeling techniques, and W_M(x) is computed as the posterior probability that the pixel at x belongs to the "pixels to be matched" class. If g(x, μ, σ²) is a multivariate normal with mean μ and covariance σ² times the identity, and π_i are the prior probabilities for each class (i ∈ {matching, background, artifact}), then

$$W_M(x) = \frac{\pi_M\, g(I^T(x), f_\theta([\phi^{A,v} \cdot I^S](x)), \sigma_M^2)}{\pi_M\, g(I^T(x), f_\theta([\phi^{A,v} \cdot I^S](x)), \sigma_M^2) + \pi_B\, g(I^T(x), \mu_B, \sigma_B^2) + \pi_A\, g(I^T(x), \mu_A, \sigma_A^2)}.$$

In the above expression, note that the denominator shows a mixture of three Gaussians, and the numerator shows the first class in the mixture. In our code we also define W_B and W_A, which are the posterior probabilities that a pixel belongs to the background or artifact classes. They are defined with the same denominator as W_M, but with numerators corresponding to the mixture component for their class.
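The posterior weight W_M can be sketched directly from the mixture components (illustrative values, scalar intensities for simplicity):

```python
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def W_M(I_T, f_I_S, priors, mu_B, mu_A, var_M, var_B, var_A):
    """Posterior probability that each pixel belongs to the matching class;
    the matching-class mean is the contrast-transformed deformed source."""
    pM = priors[0] * gaussian(I_T, f_I_S, var_M)
    pB = priors[1] * gaussian(I_T, mu_B, var_B)
    pA = priors[2] * gaussian(I_T, mu_A, var_A)
    return pM / (pM + pB + pA)

I_T = np.array([0.52, 0.05, 0.95])      # target pixel intensities
f_I_S = np.array([0.50, 0.50, 0.50])    # transformed deformed source at same pixels
w = W_M(I_T, f_I_S, priors=[1/3, 1/3, 1/3],
        mu_B=0.0, mu_A=1.0, var_M=0.01, var_B=0.01, var_A=0.01)
assert w[0] > 0.9                   # matches the source prediction: kept
assert w[1] < 0.1 and w[2] < 0.1    # near background/artifact means: down-weighted
```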
Recall that updating ϕ corresponds to updating the affine transformation matrix A and the velocity field v_t, which generates the deformation φ_1 via (4).
After solving for the optimal transformation parameters A and v_t, a transformation and its inverse are constructed by solving (4) sampled on a regular grid, using semi-Lagrangian techniques [9]. With ϕ(x) = Aφ_1^v(x) and ϕ^{-1}(x) = φ_1^{-1,v}(A^{-1}x) computed, cell locations x_i^{ρ^S} in the source image can be mapped into the target by calculating ϕ(x_i^{ρ^S}) through linear interpolation. Similarly, a point x_i^{ρ^T} in the target image can be mapped to the source by calculating ϕ^{-1}(x_i^{ρ^T}) through linear interpolation. For improved robustness, our software allows users to input pairs of corresponding points in the source and target images. These points can be used either to initialize the affine transformation A through least squares (steering our gradient-based solution toward an appropriate local minimum in this challenging nonconvex optimization problem), or to drive the optimization problem itself by modifying our objective function to be

$$E(A, v) = R(v) + M_\theta(\phi^{A,v} \cdot I^S, I^T) + P(\phi^{A,v}(X^S), X^T), \quad \text{where} \quad P(\phi^{A,v}(X^S), X^T) = \frac{1}{2\sigma_P^2} \sum_{i=1}^{N} |\phi^{A,v}(X_i^S) - X_i^T|^2,$$

and X_i^S and X_i^T are the ith of N corresponding points in the source and target respectively, and σ_P² is a user-tunable parameter that adjusts the balance between matching corresponding landmark points, matching images, and regularization: large values correspond to less accurate point matching and small values correspond to more accurate point matching. Landmark-based optimization in the LDDMM framework has been studied extensively (see for example [10]).
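Initializing the affine transformation A from corresponding landmark points by least squares can be sketched as follows (illustrative NumPy; the landmark values are synthetic):

```python
import numpy as np

def affine_from_landmarks(XS, XT):
    """Least-squares affine A (homogeneous 3x3) mapping source landmarks toward
    target landmarks; usable to initialize a gradient-based optimization."""
    n = XS.shape[0]
    H = np.hstack([XS, np.ones((n, 1))])          # homogeneous source points
    M, *_ = np.linalg.lstsq(H, XT, rcond=None)    # (3, 2): [linear part; translation]
    A = np.eye(3)
    A[:2, :] = M.T
    return A

rng = np.random.default_rng(5)
XS = rng.uniform(0, 100, size=(5, 2))             # e.g. 5 manually placed landmarks
A_true = np.array([[0.9, 0.2, 5.0],
                   [-0.1, 1.1, -3.0],
                   [0.0, 0.0, 1.0]])
XT = (np.hstack([XS, np.ones((5, 1))]) @ A_true.T)[:, :2]   # exact correspondences

A_est = affine_from_landmarks(XS, XT)
assert np.allclose(A_est, A_true, atol=1e-8)      # recovers the generating affine
```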

1.6 3D to 2D alignment
In addition to aligning spatially resolved transcriptomics datasets in which the cell positional information is 2D, we registered the 3D reconstructed Allen common coordinate framework (CCF) atlas (source) to each of the 9 MERFISH datasets (target). The image transformation is similar to the alignment discussed in section 1.5, with a few exceptions. It is important to note that all transformations are performed on the source 3D atlas. Since the 50 µm Nissl-stained Allen Brain Atlas CCF v3 was used as the source image, rasterization is not applied to the atlas. The affine transformation A for the 3D-to-2D alignment is a 4×4 matrix in homogeneous coordinates. The space of dense 3D images in the orbit of the atlas is defined via diffeomorphisms

$$\phi : x = (x_1, x_2, x_3) \in \mathbb{R}^3 \mapsto \phi(x) = (\phi_1(x), \phi_2(x), \phi_3(x)) \in \mathbb{R}^3. \quad (17)$$

The diffeomorphism ϕ ∈ D acts on the atlas to generate the orbit of imagery I ∈ I, I = ϕ · I_atlas.
The velocity field v_t is still defined by (4), but now v_t(x) ∈ R³. The image I^S in the matching term M represents the transformed source atlas evaluated at z = 0, to enable comparison in the same dimension between the source and the target images.