Pathogenic variants damage cell composition and single cell transcription in cardiomyopathies

Pathogenic variants in genes that cause dilated (DCM) and arrhythmogenic cardiomyopathies (ACM) convey high risks for the development of heart failure through unknown mechanisms. Using single nucleus RNA sequencing (snRNAseq), we characterized the transcriptome of 880,000 nuclei from 18 control and 61 failing, non-ischemic human hearts with pathogenic variants in DCM and ACM genes or idiopathic disease. We performed genotype-stratified analyses of the ventricular cell lineages and transcriptional states. The resultant DCM and ACM ventricular cell atlas demonstrated distinct right and left ventricular responses, highlighting genotype-associated pathways, intercellular interactions, and differential gene expression at single cell resolution. Together these data illuminate both shared and distinct cellular and molecular architectures of human heart failure and suggest candidate therapeutic targets.


Data reporting
Data objects with the raw counts matrices and annotation are available via cellxgene through the Human Cell Atlas (HCA) Data Coordination Platform (DCP). Raw data for all samples, including the five major genotypes (LMNA, RMB20, TTN, PKP2 and PVneg) and rare genotypes (BAG3,DES,DSP,FKTN,TNNC1,TNNT2,TPMI,PLN), are available through the European Genome Archive. All of our data can be explored at https://cellxgene.cziscience.com/collections/e75342a8-0f3b-4ec5-8ee1-245a23e0f7cb/private. All methods refer to the analyses of the five major genotypes unless differently stated.

Ethics statement
Clinical details on cardiac tissues are provided on Table S1. Control hearts were obtained from unused transplant organ donations, including 12 previously described samples (4) Control heart samples were collected as previously described (4). Disease samples were collected from cardiomyopathy patients with heart failure, prior to mechanical support (n=15) or at the time of heart transplantation (n=31). All samples were full-thickness myocardial specimens from the LV and RV free walls, and the LV apex, and interventricular septum. Regions with large epicardial fat deposits or macroscopic areas of high fibrosis were intentionally excluded. When full-thickness apical cores were obtained, other regions were not available. Additional details are provided in Fig. S2A and Table S1. Thorough sample collection details have been described previously (89).

Patient genotyping
Genomic DNA from patient samples from the Bad Oeynhausen Heart Center was isolated from blood using the High Pure PCR Preparation Kit ® (Roche Diagnostics GmbH) or Genomic DNA Extraction Kit (Qiagen). DNA was prepared for gene enrichment re-sequencing on a MiSeq ® sequencing system using the TruSight TM Rapid Capture Sample Preparation Kit (Illumina). DNA was screened for pathogenic or likely pathogenic variants using the TruSight TM Cardio (174 genes) or the TruSight TM Cardiomyopathy (46 genes) Sequencing Panel (Illumina). In a subset of samples, whole exome or genome sequencing was performed (Table S1) and analyzed for cardiomyopathy genes.
Variants were annotated using VariantStudio TM v.3.0 (Illumina) and classified according to the recommendations of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG) (91).
Genomic DNA from patients enrolled at the Imperial College was isolated from blood or heart tissues using the Genomic DNA Extraction Kit (Qiagen) and processed for cardiomyopathy panel sequencing or WES using Illumina NovaSeq instruments. Panel sequencing and bioinformatic analysis was performed as previously described (96). Reads were aligned to hg38 (GRCh38) using BWA-MEM. Variants were annotated using Ensembl VEP v99 (97), and CardioClassifier (98). Rare variants (MAF<0.001) in known disease genes were considered for a potential causal role in disease and interpreted using the ACMG-AMP variant interpretation framework (91).
Single nuclei isolation of cardiac samples and processing on the 10X Genomics platform Samples were processed independent of disease, genotype, or control status. Nuclei isolation and library preparation was performed at the Harvard Medical School, Imperial College London, and the Max-Delbrück Center for Molecular Medicine as previously reported (4, 99). Isolated nuclei from flash-frozen tissue were visually inspected under the microscope to assess nuclei integrity and manually or automatically counted using a Countess II (Life |Technologies). Nuclei suspension was loaded on the Chromium Controller (10X Genomics) with targeted nuclei recovery of 5,000-10,000 per reaction (Table S2).
3′ gene expression libraries were prepared according to the manufacturer's instructions of the v3 Chromium Single Cell Reagent Kits (10X Genomics). Quality control of final library cDNA was done using Bioanalyzer High Sensitivity DNA Analysis (Agilent) and the KAPA Library Quantification kit. Libraries were sequenced on an Illumina HighSeq 4000 or NovaSeq with a targeted read number of 30,000-50,000 reads per nucleus (Table S2).
Data pre-processing and transcriptome mapping Bcl files were converted to Fastq files by using bcl2fastq. Each sample was mapped to the human reference genome GRCh38 with a modified pre-mRNA gtf file of Ensembl release Ens84 (100) using the CellRanger suite (v.3.0.1) with specifications provided in the DCMheart github repository. Reads mapping within exonic and intronic regions were counted. Mapping quality was assessed using the cellranger summary statistics. Reads overlapping multiple sequence features have been discarded. Two libraries (Table S2) exceeded the read number per nucleus of 200,000 and were downsampled to 100,000 reads per nucleus using the DropletUtils R package on the molecule_info.h5 file (101).

Count data processing
Empty droplets (identified by Emptydrops, implemented in the CellRanger workflow) were removed, samples were assembled into an AnnData object by concatenating the filtered_feature_bc_matrix.h5, and metadata information was added.
No significant differences were identified in cell-type abundances or expression profiles between free-wall (FW), the apical core (AP) and septal (S) samples and thus these were merged and denoted as LV as previously described (4). After read count normalization and log-transformation, highly variable genes were selected. Effects of percentage of mitochondrial genes and total counts per nucleus were regressed out and values were scaled to unit variance. Principal components were computed and elbow plots were used to define the appropriate number of principal components for neighbor graph construction. Prior to manifold construction using UMAP, selected principal components were harmonized by using R Harmony (106) or Python Harmonypy with "Patient" as batch key (Table   S2). Nuclei were clustered using the network-based Louvain and Leiden algorithms (107, 108). Differential expressed genes per cluster were calculated using the Wilcoxon rank sum test. Clusters with high similarity in differentially expressed genes were merged.
Nuclei were classified for cell type or denoted unassigned (Fig. 1C). Subsequently nuclei were subclustered to identify cell-states. Despite prior doublet removal we identified during subclustering some droplets with chimeric transcriptional profiles. Whether these may represent real biology, background RNA noise (soup), or multiplets is unclear. We labeled these clustered "nuclei" as unassigned (Fig. 1C, n= 52981 "nuclei"; 6%). Additional filters are provided in scripts deposited at the DCMheart github. Lymphocytes were annotated initially by merging scRNAseq (4) with snRNAseq data using the scVI framework v0.9.0 to improve marker identification and annotation (109).
Subsequently scRNAseq data were then removed for further downstream analyses.

Differential gene expression and variance analysis
Differentially expressed genes (DEG) per cell type and state were calculated using the implemented Wilcoxon rank-sum-test. Only genes with mean expression (log transformed and library size normalized counts) >0.0125 in the control and genotype group were tested for differential expression. Genes were called as differentially expressed with FDR<5% and |log2FC|>0.5. DEGs for rare cell states (>5 nuclei in at least 3 patients) have not been computed. For comparison of genotype specific effects, pseudobulks for each cell type and cell state per LV and RV sample were computed. Differential gene expression analysis on pseudobulk expression values were performed using edgeR, v3.28.1 (110, 111).
We assessed differences in the variability across cells between patients with a specific PV or PVneg and controls, in each cell type and each anatomical region separately. Because the variance of UMI counts is strongly dependent on the mean expression level, we first applied a variance stabilizing transformation (112) similar to scTransform (113) on the 1000 most highly variable genes. We used the Pearson residuals of a negative binomial regression with explanatory variables: total read count of each cell and the fraction of mitochondrial reads. Next, we computed the variance of the Pearson residuals for each gene in each patient, anatomical region and cell type. Finally, we compared the variance for each gene between patients with a specific mutation (or PVneg) against the control group using a Wilcoxon rank sum test and corrected across all comparisons made using the Benjamini Hochberg method.

Differential abundance analysis
Compositional data analyses were performed to determine genotype specific differences in cell type abundances, excluding unassigned nuclei. Analyses of cell counts per cell type and cell counts per cell state within each cell type were performed separately in each anatomical region (LV and RV). To account for the compositional nature of the data count data were transformed using the centered log ratio (CLR) transformation. Counts of zero were assumed to be due to insufficiently deep sampling and therefore imputed using the method of multiplicative replacement (114). To assess statistical differences between groups of samples, for example all patients with a specific genotype vs. controls, a linear model of the CLR values was estimated as a function of the grouping encoded as an indicator variable and a t-test was performed to determine the significance of the regression coefficient. Differential abundance of all cell types or states in each anatomical region was assessed separately between patients of each genotype and controls. In addition, all DCM patients were compared to controls. For the analysis of cell states within each cell type, only cell state counts assigned to each cell type were considered, effectively normalizing all cell states within a cell type to 100%. Samples with less than 10 cells per cell type were excluded from the cell state analysis.
Before analyzing differential abundance between samples of different genotypes or diagnosis, we assessed whether samples from different anatomical regions show compositional differences using the region as a grouping variable in the model described above. The comparison of FW and AP showed no significant abundance changes (all FDR>5%, Fig. S2D). Therefore AP and FW were merged. Next, we also compared the merged AP and FW to SP. As no significant abundance changes were observed (all FDR>5%, Fig. S2E) AP, FW and SP were merged into LV.
We also employed CLR transformed cell type abundance to consider sex specific differences of cell type abundance, separately for LVs and RVs, between DCM (10 females, 29 males) and controls (7 females, 11 males).
As explanatory variables we included the additive terms phenotype (control=0, DCM=1) and sex (female=0, male=1) as well as an interaction term (DCM and male = 1, others = 0). We tested whether the interaction term was different from 0, which would indicate a significant (FDR < 5%) sex specific difference. Using the same approach, we compared sex specific cell abundances between LMNA and control hearts, as only this genotype had sufficient samples from males (n=7) and female (n=5) patients.
In addition to CLR values, abundance differences were also reported as differences of mean percentages between groups (while using statistical significance from the CLR analysis). The proportional changes in mean percentages of control and disease samples were reported (Figs. 1D, S4, S8, S12, S17, S21, S26, S32, S35, S38).
Positive values indicate higher abundance in the disease group. In addition to CLR values, log ratios of abundances for cell type (or state) pairs between genotypes and controls were ascertained and reported. CLR values are normalized to the geometric mean of all abundance values. For a more intuitive interpretation of the differential abundance results, we complemented the CLR analysis with an analysis of all pairwise cell type (respectively cell state) ratios. Based on the imputed abundances used for the CLR analysis we also computed differences in log ratios of counts of cell type c1 and cell type c2 between groups of patients as assessed in the CLR analysis. Specifically we tested whether log (c1/c2) in group 1 was equal to log (c1/c2) in group 2 using a t-test. All P-values were adjusted for multiple testing using the Benjamini and Hochberg method and only significant results are shown.

GOterm and pathway enrichment
GOterm enrichment analysis was performed using the web-tool Gprofiler2 with default settings (115).

Cell-cell interaction and differential connectome analysis
Cell-cell interactions between assigned cell states in LVs and RVs were inferred using CellChat (version 1.1.0 using the cell-cell interaction database) (75). The cellchat database is accessible at http://www.cellchat.org/cellchatdb/. Analyses were performed using the log-transformed normalized gene counts with default parameters, and population size was accounted for when calculating communication probabilities. Data from controls and genotypes were compared to identify significant changes.
Interaction heatmaps were generated using the table produced by the rankNet() function comparing each genotype to controls, which produced aggregated communication probabilities across all cell states for each signaling pathway and p-values. To address for multiple testing p-values were Bonferroni-adjusted using the p.adjust() function in the R stats package. The log2(fold change) in aggregated communication probability of genotype vs control was calculated and plotted on a heatmap using heatmap.2() from the gplots package (https://CRAN.Rproject.org/package=gplots). Heatmap color scales depict 0.05 sized intervals, from -14 to 14.
Circle plots (Fig. 6C) showing pathway specific changes in interaction strength between cell types were generated by aggregating communication probabilities per cell state, while subsetting for a specified pathway using the aggregateNet() function. Communication probabilities were then aggregated for cell states of the same cell type using the mergeInteractions() function and plotted using netVisual_diffInteraction(). Chord plots showing cell state interactions for specific signaling pathways were generated using the netVisual_aggregate() function in CellChat.

Masson trichrome staining and collagen quantification via hydroxyproline
Control and disease LV and RV tissue were fresh-frozen in isopentane (ThermoFisher) at -80°C and OCT (VWR) embedded. Sections were cut (10 μm thickness) using a microtome, placed onto slides and processed using a standard Masson trichrome staining protocol. Extracellular matrix was quantified within 50 mg of LV and RV (Table S15) tissue, by measuring the quantity of hydroxyproline, a common amino acid in mammalian collagens, as described (119). Slides were imaged using one of three microscope platforms. The LSM710 confocal microscope (Zeiss, North American samples) was used with 25x or 40x oil immersion objectives (1.3 oil, DIC III). The SP8 confocal microscope (Leica, for samples obtained and processed in Germany) was used with a NA 1.4 63x oil immersion objective. The LSM780 confocal microscope (Zeiss, samples from the Imperial College) was used with 20x (0.8 NA) or 40x oil immersion objectives (1.3 NA). Spectral bleed-through was corrected using Fiji or Zeiss' Zen Black software according to the manufacturer's manual (120, 121). Cell segmentation and transcript quantification/spot detection for cardiomyocytes was performed using Multiple-Choice Microscopy (MCMICRO, Fig. 2C) (122). Cells were segmented using a custom trained instance segmentation model, Cypository (123), based on the MaskRCNN resnet50 architecture (124) and pretrained on the COCO dataset (125). Briefly, images were manually annotated with two classes -cell membrane stained with wheat germ agglutinin and background consisting of all other areas. Annotated data was split into training, validation and test images in a 0.72:0.18:0.1 split. Training was performed with stochastic gradient descent with a learning rate of 0.005, momentum of 0.9 and weight decay of 0.0005. Cypository was deployed using MCMICRO (122) which is an end-to-end nextflow-based image analysis pipeline for tissue images. After cells were segmented as label masks, spot detection of RNA was performed in S3segmenter (126) by convolving a Laplacian of Gaussian (LoG) kernel over the image and identifying local maxima that had responses above a threshold.
The number of spots per channel and mean intensity were then quantified on a single cell basis. In order to quantify SMYD1 transcripts per CMs, control (n=2) and disease samples were analyzed. At least five images per sample were taken at different regions with 64 to 114 CMs per image. Other RNAscope quantifications (Figs. 2D, S6D) was performed manually in controls and disease samples using the H-score as described by ACDBio (ACDBiotechne).

Immunofluorescence Staining
Fresh LVs obtained from controls (n=5) or patients with PVs (n=5) were embedded in OCT compound, and sectioned at 5 μm thickness using a cryotome and mounted on Superfrost Plus slides. Slides were fixed in 4% paraformaldehyde solution and permeabilized with 0.1% Triton X-100 (Sigma-Aldrich). After several washing steps, slides were incubated in TrueBlack Ⓡ Lipofuscin Autofluorescence Quencher (Biotium), then blocked in 4% Bovine Serum Albumin (BSA). Slides were incubated in primary antibodies for SMYD1 (Abcam, ab181372) and Cardiac Troponin T (cTnT) to identify cardiomyocytes (Invitrogen, MA5-12960) overnight at 4°C. Next, slides were incubated with anti-rabbit Alexa Fluor™ 568 (Invitrogen, Wavelength: 578/603 nm), and anti-mouse Alexa Fluor™ 647 coverslipped. Slides were imaged using the SP5 laser scanning confocal microscope (Leica) with a 100x immersion objective (1.4 NA). Spectral bleed-through was corrected using Fiji or Zeiss' Zen Black software according to the manufacturer's manual (120, 121). SMYD1 quantification per CMs (cTnT staining) was performed by measuring the integrated density using Fiji, where a minimum of ten transmural images were quantified with a total of 97 to 174 CMs per sample.
Selection of GWAS candidate loci and assessment of expression in cell types Table S65 provides the candidate genes residing in 15 previously identified DCM GWAS loci. Differential gene expression results were obtained from the edgeR analysis. For analyses of GWAS genes we applied a strict fold change cutoff of |log2FC|>1 for disease compared to controls, and FDR cutoff of 5%. In addition, we removed signals derived from ambient RNA, identified as transcripts from the top 30 genes with cell type specific expression and high technical noise for each cell type.
Cell type specific expression was defined by first computing the average UMI count per gene, per nucleus, per cell type. Next, these averages per gene were normalized to sum to 100% across cell types. Finally, genes were termed as specifically expressed if a cell type's average UMI fraction was >85%. To assess expression above background levels, background droplets that do not contain nuclei were identified by the cellranger pipeline.

Construction of graph attention network models
Cell-types were split into separate anndata files followed by library-size normalization and logtransformation per nucleus (barcode). Highly variable genes (HVGs) were selected based on mean expression and dispersion. Effects of percentage of mitochondrial genes and total counts per nucleus were regressed out and values were scaled to unit variance. kNN-neighbor graphs were computed (sc.pp.neighbors) on harmonized (Patient as batch key) PCs per cell-type as edge input for graph attention (GAT) models, using classification model structure described above.
In order to recognize genotype specific transcriptional patterns, an aggregated GAT model was built consisting of the individual cell-type specific GAT models, where each model was trained per cell-type and validated on test data (not included in training). Model performance was evaluated by using an inductive learning policy using a leave-one-out cross-validation approach (LOO-CV) .
Training was stopped at the point when performance on the validation dataset did not improve to avoid overfitting (early stopping procedure). Subsequently, the genotype probability was assigned based on the transcriptional signature of each nucleus from one patient that was left out of the training, a process that was replicated to encompass all patients. Probabilities across all nuclei per cell-type were aggregated to obtain the genotypelikelihood for each patient (Fig. 6D). The final classification model was restricted to highly abundant cell types (CM, FB, EC and myeloid cells) in LV. Each cell type per genotype was given a weight, which was obtained by multiplying the true positive (TP) and 1-false positive (FP) rate.
GAT takes node (nuclei) features (X, anndata.X) and the adjacency matrix (A) of the nodes as input features (Fig.   S53). = { 1 , 2 , 3 , … , }, ∈ where N is the number of nodes, and F is the number of HVGs for each node.
In the first step the input files were forwarded to the GAT layer. As output, the learned graph representation was received (H'). Secondly, layer normalization (LN) and Exponential Linear Unit activation function (ELU) was applied to the learned graph representation. LN is defined as and the ELU activation function is Next, a self-attention layer (F') was applied (127) and the procedures from the previous step (LN and ELU) were repeated. Hidden features were extracted from the graph representation. H' and output from F' were concatenated to feed into the second GAT layer. Finally, a log-softmax function for multiclass classification was executed.

( ) = ( ∑ )
For multiclass classification a negative log-likelihood was used as a loss function.
The individual GAT layers were built as described in (128); in brief: 1) Apply a linear transformation -Weighted matrix W to the feature vectors of the nodes.
where W is an earnable weight matrix and ℎ a lower layer embedding.
( ) = ( ) ℎ ( ) 2) Attention Coefficients determine the relative importance of neighboring features to each other. These were calculated using the formula.
where K denotes the number of independent attention maps used.
To derive the final prediction per patient, an aggregation procedure was used in which median prediction scores per cell-type per patient were multiplied with a pre-computed weight matrix. Prediction scores per cell-type equals the relative abundance of nuclei assigned to a genotype. The weight matrices were created from previously computed confusion matrices. The aggregation procedure was defined as, where P denotes the probability per genotype, ̂ the probability of a cell-type to be derived from a patient with a certain genotype. W is the pre-computed weight matrix. n equals the number of cell-types included in the aggregation procedure. The weighted aggregation procedure yielded more robust results than mean value per class, as errors per classifier were compensated.

Alternative modeling
For alternative modeling, six different models were used, and accuracy and F1 macro on cross validation was compared. An example for fibroblasts is below (Table S71). Model performance was measured on the cell-type level. For input data we used the gene expression data matrix (anndata.X) and meta information about patients (such as age, gender). Hyperparameters for a) Random Forest, b) XGBOOST and c) KNN classifiers were selected according to GridsearchCV. The training procedure was based on cross validation with patient stratification, to avoid overfitting of patient specific transcriptional patterns. The best result by alternative methods was obtained by the Random Forest Classifier with an accuracy of 0.39 and F1 macro score 0.15. In addition to three classical machine learning approaches, three approaches based on neural networks were compared. i) Feed Forward Neural Network (FFNN) on the count matrix with three hidden layers obtained a lower performance on validation data. ii) SCANVI (single-cell ANnotation using Variational Inference), developed for single-cell annotation using variational inference (129), was outperformed by GAT in the unbiased genotype classification task. iii) An additional FFNN on graph embeddings was applied. The neighbors' information from the graph (considering first and second neighbors order) and edge quantity were extracted using a graph neural network embedding and used as input features. This approach performed better than other neural networks and the classical machine learning approaches, however, was still outperformed by GAT (accuracy and F1 macro).

Figure S1: Quality assessment of snRNAseq data (A)
Infographic depicts the 78 individuals studied (women, pink; men, blue; binned by age range), including controls (unused donor hearts), patients with pathogenic variants (PV pos) or patients with unknown disease etiology (PVneg).
Further clinical details are provided in the metadata sheet (Table S1)                      The table of genes used is provided in Table SEC2. Nuclei with a score lower than 0.3 for both processes were considered as unscored and nuclei with a MET score higher than an EMT score were considered as undergoing MET.  Table SEC2. Nuclei with a score lower than 0.1 were unscored and nuclei with a greater positive than negative score were denoted as undergoing cell death.                  (Table S65)

Figure S42: Representation of cell-cell interactions for the IGF pathway in LVs and RVs with different genotypes
Chord plots of significant (adjusted p-value≤0.05) cell-cell communications depict the differentially regulated insulin growth factor (IGF) pathway and interactions in disease LVs and RVs. The line thickness denotes interaction strength of signals from sending and receiving cell types, with color (orange, increased; blue, decreased) scaled from zero to maximum in diseased versus controls. Arrows indicate directionality.

Figure S43: Representation of cell-cell interactions for the BMP pathway in LVs and RVs with different genotype
Chord plots of significant (adjusted p-value≤0.05) cell-cell communications depict the differentially regulated bone morphogenic protein (BMP) pathway and interactions in disease LVs and RVs. The lines thickness denotes interaction strength of signals from sending and receiving cell types, with color (orange, increased; blue, decreased) scaled from zero to maximum in diseased versus controls. Arrows indicate directionality.

Figure S44: Representation of cell-cell interactions for the EDN pathway in LMNA LVs and PKP2 RVs
Chord plots of significant (adjusted p-value≤0.05) cell-cell communications depict the differentially regulated neuregulin (NRG) pathway and interactions in disease LVs and RVs. The lines thickness denotes interaction strength of signals from sending and receiving cell types, with color (orange, increased; blue, decreased) scaled from zero to maximum in diseased versus controls. Arrows indicate directionality.  Color of the dots represents the probability of communication for NRG receptor-ligands pairs (x-axis). The specific ligand, expressed by EC7.0, and receptor, expressed by the respective receiving cell-state is shown (y-axis).  (presented in Fig. S7E), studied as in (A).   Fig. 2J), studied as in (A). (C) PLN LV (presented in Fig. 5D), studied as in (A).