Contextual AI models for single-cell protein biology

Understanding protein function and developing molecular therapies require deciphering the cell types in which proteins act as well as the interactions between proteins. However, modeling protein interactions across biological contexts remains challenging for existing algorithms. Here we introduce PINNACLE, a geometric deep learning approach that generates context-aware protein representations. Leveraging a multiorgan single-cell atlas, PINNACLE learns on contextualized protein interaction networks to produce 394,760 protein representations from 156 cell type contexts across 24 tissues. PINNACLE’s embedding space reflects cellular and tissue organization, enabling zero-shot retrieval of the tissue hierarchy. Pretrained protein representations can be adapted for downstream tasks: enhancing 3D structure-based representations for resolving immuno-oncological protein interactions, and investigating drugs’ effects across cell types. PINNACLE outperforms state-of-the-art models in nominating therapeutic targets for rheumatoid arthritis and inflammatory bowel diseases and pinpoints cell type contexts with higher predictive capability than context-free models. PINNACLE’s ability to adjust its outputs on the basis of the context in which it operates paves the way for large-scale context-specific predictions in biology.

Dataset. We extract human housekeeping genes from the Housekeeping and Reference Transcript Atlas (https://housekeeping.unicamp.br/) [1] and marker genes from the human gold standard T lymphocyte-specific protein functional networks from HumanBase (https://hb.flatironinstitute.org/) (accessed on November 20th, 2023) [2]. From HumanBase, only edges of level C1 (i.e., tissue-specific) are kept. The nodes corresponding to these edges are considered to be marker genes for cell types in the family of T lymphocytes. The lists of marker and housekeeping genes do not overlap, as we remove any overlapping housekeeping genes from the list of marker genes.
Analysis. We compare embedding similarities of a marker (orange) or housekeeping (gray) gene's contextualized protein representation (from PINNACLE) across different cell type contexts. For each marker or housekeeping gene, its cell type-specific protein representations are compared in similar contexts (i.e., between different T lymphocyte cell types; a total of 10 T lymphocyte cell types) or different contexts (i.e., between a T lymphocyte cell type and a non-immune cell type; a total of 115 non-immune cell types). We perform the two-sample Kolmogorov-Smirnov test (via ks_2samp from scipy).
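The comparison described in this analysis can be sketched as follows. This is an illustration only, not the authors' code: the toy embeddings, the context names, and the helper names (cosine_similarity, similar_context_scores, different_context_scores, ks_statistic) are hypothetical, and cosine similarity is assumed as the embedding similarity measure. The source reports using ks_2samp from scipy, which also returns a p-value; the pure-Python function here computes only the two-sample KS statistic.

```python
import bisect
import math
from itertools import combinations, product


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similar_context_scores(embeddings, contexts):
    """Similarities of one gene's representations between all pairs of contexts in one group."""
    return [cosine_similarity(embeddings[a], embeddings[b])
            for a, b in combinations(contexts, 2)]


def different_context_scores(embeddings, contexts_a, contexts_b):
    """Similarities of one gene's representations between contexts of two different groups."""
    return [cosine_similarity(embeddings[a], embeddings[b])
            for a, b in product(contexts_a, contexts_b)]


def ks_statistic(sample_1, sample_2):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    xs_1, xs_2 = sorted(sample_1), sorted(sample_2)
    d_max = 0.0
    for x in xs_1 + xs_2:
        cdf_1 = bisect.bisect_right(xs_1, x) / len(xs_1)
        cdf_2 = bisect.bisect_right(xs_2, x) / len(xs_2)
        d_max = max(d_max, abs(cdf_1 - cdf_2))
    return d_max


# Hypothetical 2-D representations of one gene in five cell type contexts.
gene_embeddings = {
    "cd4_t_cell": [1.0, 0.1],
    "cd8_t_cell": [0.9, 0.2],
    "regulatory_t_cell": [1.0, 0.0],
    "hepatocyte": [0.1, 1.0],
    "fibroblast": [0.0, 1.0],
}
t_cells = ["cd4_t_cell", "cd8_t_cell", "regulatory_t_cell"]
non_immune = ["hepatocyte", "fibroblast"]

similar = similar_context_scores(gene_embeddings, t_cells)
different = different_context_scores(gene_embeddings, t_cells, non_immune)
print(ks_statistic(similar, different))
```

In the analysis itself, the two samples passed to the test would be similarity distributions pooled over many genes (for example, housekeeping versus marker genes in similar contexts) rather than the scores of a single toy gene.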
Results. Although PINNACLE learns protein representations using context-aware protein, cell type, and tissue networks alone, it effectively captures protein functions. We analyze the embedding similarities of contextualized protein representations for marker and housekeeping genes across cell type contexts. For each T lymphocyte marker or housekeeping gene, we compare its cell type-specific protein representations in similar contexts (i.e., between different T lymphocyte cell types) and in different contexts (i.e., between a T lymphocyte cell type and a non-immune cell type). Housekeeping genes exhibit higher embedding similarity in similar contexts than marker genes (Supplementary Figure S5; p-value = 3.2 × 10^−14). This result aligns with the expectation that housekeeping genes maintain shared functions across these cell types.
Housekeeping genes in different contexts also show higher embedding similarity than marker genes (Supplementary Figure S5; p-value = 1.0 × 10^−91), reflecting their consistent functions across non-immune cell types. Conversely, marker genes in similar contexts display higher embedding similarity than those in different contexts (Supplementary Figure S5; p-value = 3.1 × 10^−26), consistent with their specificity to T lymphocyte cell types. That is, their protein representations are more similar within T lymphocyte contexts than between T lymphocyte and non-immune cell type contexts. These analyses suggest that the protein embedding regions in PINNACLE are organized according to cellular contexts, potentially capturing subtle nuances not explicitly included in the training dataset or the model itself. This encompasses the possibility of cell type-dependent roles for proteins, a complexity that can enhance our understanding of protein functions across different biological contexts. Such insights warrant further investigation into proteins with context-specific and non-specific functions.
Results. We benchmark our contextualized protein representations (structure-free) and contextualized structure-based protein representations against two null distributions and four context-free approaches. We show that randomly sampling pairs of proteins from different cell type contexts, padded (no 3D structure; score gap −0.0431) or concatenated with the structure-based protein representations (score gap −0.0356), cannot produce the score gap observed in the contextualized protein representations (PINNACLE without 3D structure) or in the contextualized structure-based protein representations (PINNACLE with 3D structure) (Supplementary Figure S7). Similarly, context-free protein representations cannot predict intercellular communication (i.e., protein interactions between different cell types). This is demonstrated using context-free protein representations generated by a graph attention neural network [3] on the global reference protein interaction network (i.e., GAT), padded (no 3D structure; score gap −0.1319) and concatenated with the structure-based protein representations (score gap −0.0486), and context-free protein representations generated by BIONIC [4], a graph convolutional neural network designed for multi-modal network integration, padded (score gap 0.0046) and concatenated with the structure-based protein representations (score gap 0.0043). Our benchmarking results suggest that incorporating context can improve 3D structure prediction of protein interactions.
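To make the score gap concrete, the following minimal sketch computes it on toy vectors. It assumes that the score of a protein pair is the cosine similarity of the two representations (as stated in the caption of Figure S7) and that the gap is the mean binding score minus the mean non-binding score; the latter definition, the function names, and the vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def score_gap(binding_pairs, non_binding_pairs):
    """Mean score of binding pairs minus mean score of non-binding pairs (assumed definition)."""
    binding = [cosine_similarity(u, v) for u, v in binding_pairs]
    non_binding = [cosine_similarity(u, v) for u, v in non_binding_pairs]
    return sum(binding) / len(binding) - sum(non_binding) / len(non_binding)


# Hypothetical pairs of protein representations.
binding_pairs = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 3.0])]
non_binding_pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 2.0], [1.0, 0.0])]

print(score_gap(binding_pairs, non_binding_pairs))
```

Under this definition, a positive gap means binding pairs score higher than non-binding pairs, whereas a gap near zero or below (as reported for the random-context and context-free baselines) means the representation does not separate the two groups.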
Unlike approaches that generate cell embeddings to advance cell-level downstream tasks, such as batch correction and cell type annotation [5][6][7], PINNACLE generates protein representations across cell types for precise protein-level prediction at cell type resolution. PINNACLE learns embeddings of cell types and tissues as a means to inject cellular and tissue organization (via the metagraph) into the unified protein embedding space. To enable cell-level characterization, PINNACLE can easily be extended to learn cell (rather than cell type) embeddings.

Table S5: Ablation studies to interrogate the contribution of the metagraph. The first row consists of results from the complete model. The remaining three rows show results from three types of ablations: removing cell-type-to-cell-type relationships (i.e., shuffling the cell type nodes' identities), removing tissue-to-tissue relationships (i.e., shuffling the tissue nodes' identities), and removing the metagraph (i.e., setting the weight of the metagraph-related terms in the loss function to zero). The performance metrics evaluate the models' ability to capture cell type and tissue organization in the embedding space. The second column is the correlation between tissue embedding distance (computed using the model's tissue representations) and tissue ontology distance; we expect a positive correlation. The third column is the correlation between tissue embedding distance and tissue ontology distance among the tissue leaf nodes of the metagraph; we expect a positive correlation. The fourth column is the correlation between tissue embedding distance and fraction of overlapping cell types; we expect a strong negative correlation. All Spearman correlation statistical tests are two-sided.

Table S6: Data split of downstream tasks. Sizes of the train, validation, and test datasets for the rheumatoid arthritis (RA) PINNACLE model and inflammatory bowel disease (IBD) PINNACLE model. The numeric value outside the parentheses represents the number of protein representations across cell type contexts, and the numeric value inside the parentheses represents the number of unique protein identities. The numbers represent both positive (label = 1) and negative (label = 0) proteins. Note that the validation dataset is sampled from the train dataset, which is fixed, at each run of the model. The numbers for train and validation datasets (columns 3-4) are from seed 1.

[Table S6 column headers recovered from extraction: Dataset | Type of protein target | Proteins in train dataset (unique); remaining columns and table body not recovered]
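The correlation metrics described for Table S5 are Spearman correlations between pairwise tissue embedding distances and a reference quantity (tissue ontology distance or fraction of overlapping cell types). A minimal pure-Python sketch of such a correlation is given here; the function names and the toy distances are hypothetical, and the sketch computes the coefficient only, not the two-sided test.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for four tissue pairs.
embedding_distance = [0.10, 0.40, 0.35, 0.80]
ontology_distance = [1, 2, 3, 4]

print(spearman(embedding_distance, ontology_distance))
```

A positive coefficient for ontology distance and a negative coefficient for the fraction of overlapping cell types would match the expectations stated in the caption.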

Figure S1: Network properties of the metagraph and cell type specific protein interaction networks. (a-b) Degree distributions of the metagraph and cell type specific protein interaction (PPI) networks. (a) Degree distributions of the metagraph (composed of cell type-cell type, cell type-tissue, and tissue-tissue edges), tissue-tissue graph, and cell type-cell type graph. The median, maximum, and minimum degrees for the metagraph are 24, 169, 1; for the tissue-tissue graph are 2, 15, 1; and for the cell type-cell type graph are 24, 157, 4. (b) Distribution of the median node degree of each cell type specific PPI network. The median, maximum, and minimum of median node degree across cell type specific PPI networks are 6, 11, and 3, respectively. (c-d) Enrichment analysis of ligand-receptor interactions in the cell type specific PPI networks. We utilize CellPhoneDB [8] to predict interactions between cell types in our metagraph by identifying significantly expressed ligand-receptor (LR) interactions between pairs of cell types in our dataset. (c) Shown is a histogram of the number of significant LR interactions per cell type specific PPI network predicted by CellPhoneDB. (d) We hypothesize that the predicted LR interactions are enriched in our cell type specific PPI networks. To quantify the enrichment of LR interactions, we calculate the fraction of LR interactions where the corresponding ligand and receptor proteins are activated in the cell type pair (i.e., for an LR interaction identified between cell types A and B, the ligand protein is activated in cell type A's PPI network and the receptor protein is activated in cell type B's PPI network). We compare the fraction of LR pairs that are activated in our cell type specific PPI networks against the fraction of LR pairs that are activated in null distribution PPI networks. For each cell type specific PPI network, we generate 100 null distribution PPI networks by sampling the same number of nodes with a similar degree distribution [9]. Degree distribution is preserved by binning nodes such that there are at least 100 nodes in each bin, and nodes are then randomly sampled within the appropriate degree interval [9]. We find that our cell type specific PPI networks have a significantly higher fraction of ligand-receptor pairs activated (0.47 ± 0.12) than the null distribution PPI networks (0.04 ± 0.04); n = 2,020 pairs of cell type specific PPI networks, of which 20 are pairs of real cell type specific PPI networks and 2,000 are pairs of null cell type specific PPI networks. Note that the ligand-receptor interactions considered in both analyses are those where the genes corresponding to the ligands and receptors are known. However, this does not factor into our construction of the edges/interactions between cell types (CCI). The bounds of the box show the quartiles of the data, the center indicates the median value of the data, and the whiskers represent the farthest data point within 1.5 × IQR.
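The degree-preserving null sampling and the enrichment fraction described in panel (d) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: "activated" is taken to mean that the protein is a node of the cell type's PPI network, leftover high-degree nodes are merged into the last bin, and all function and variable names are hypothetical.

```python
import random
from collections import defaultdict


def build_degree_bins(degrees, min_bin_size=100):
    """Group nodes into consecutive degree intervals with at least min_bin_size nodes each."""
    nodes_by_degree = defaultdict(list)
    for node, degree in degrees.items():
        nodes_by_degree[degree].append(node)
    bins, current = [], []
    for degree in sorted(nodes_by_degree):
        current.extend(nodes_by_degree[degree])
        if len(current) >= min_bin_size:
            bins.append(current)
            current = []
    if current:
        # Assumption: leftover high-degree nodes join the last bin (or form the only bin).
        if bins:
            bins[-1].extend(current)
        else:
            bins.append(current)
    return bins


def sample_null_nodes(network_nodes, bins, rng):
    """Sample as many nodes as the real network, bin by bin, from the reference network."""
    bin_of = {node: index for index, members in enumerate(bins) for node in members}
    counts = defaultdict(int)
    for node in network_nodes:
        counts[bin_of[node]] += 1
    sampled = set()
    for index, count in counts.items():
        sampled.update(rng.sample(bins[index], count))
    return sampled


def fraction_lr_activated(lr_pairs, nodes_a, nodes_b):
    """Fraction of (ligand, receptor) pairs with ligand in network A and receptor in network B."""
    if not lr_pairs:
        return 0.0
    hits = sum(1 for ligand, receptor in lr_pairs
               if ligand in nodes_a and receptor in nodes_b)
    return hits / len(lr_pairs)


# Toy reference network: ten proteins with degrees 1, 1, 2, 2, ..., 5, 5.
reference_degrees = {f"p{i}": i // 2 + 1 for i in range(10)}
bins = build_degree_bins(reference_degrees, min_bin_size=3)
null_nodes = sample_null_nodes({"p0", "p5", "p9"}, bins, random.Random(0))
print(sorted(null_nodes))
```

Repeating sample_null_nodes 100 times per cell type specific network and evaluating fraction_lr_activated on each pair of null networks would give a null distribution of the kind compared against the real networks in panel (d).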

Figure S2: Sensitivity analysis of network construction. To examine whether cell types with fewer cells are poorly represented in our networks, we construct networks after subsampling equal numbers of cells per cell type. We compare our finalized networks (no subsampling of cells) against approaches that subsample 100, 200, and 300 cells. We find that our approach yields networks that are maximally similar to the global reference network yet maintain specificity to cell type context. (a) Edge and (b) node Jaccard similarity of a cell type specific PPIN to the global reference PPIN. (c-f) Distribution of edge Jaccard similarity between PPINs constructed by (c) our finalized approach and subsampling (d) 100, (e) 200, and (f) 300 cells. (g-j) Distribution of node Jaccard similarity between PPINs constructed by (g) our finalized approach and subsampling (h) 100, (i) 200, and (j) 300 cells.
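For reference, the edge and node Jaccard similarities used in this figure can be computed as in the following sketch. The toy networks and function names are hypothetical, and edges are assumed to be undirected.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets are considered identical.
    return len(a & b) / len(union) if union else 1.0


def undirected_edges(edges):
    """Represent each edge as a frozenset so that (u, v) and (v, u) compare equal."""
    return {frozenset(edge) for edge in edges}


def nodes_of(edges):
    """Collect the nodes that appear in a list of edges."""
    return {node for edge in edges for node in edge}


# Hypothetical global reference PPIN and one cell type specific PPIN.
global_edges = [("A", "B"), ("B", "C"), ("C", "D")]
cell_type_edges = [("B", "A"), ("C", "D")]

edge_similarity = jaccard(undirected_edges(global_edges), undirected_edges(cell_type_edges))
node_similarity = jaccard(nodes_of(global_edges), nodes_of(cell_type_edges))
print(edge_similarity, node_similarity)
```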

Figure S7: Benchmarking context-free and contextualized 3D structure protein representations. Shown are binding and non-binding scores (i.e., cosine similarity) of proteins when using only 3D structure-based protein representations (p-value = 0.2121; n = 22 pairwise comparisons between 2 binding and 20 non-binding pairs), PINNACLE's contextualized protein representations (without 3D structural information; p-value = 0.0299; n = 7,956 pairwise computations between 180 binding and 7,776 non-binding pairs), contextualized structure-based protein representations (p-value < 10^−5; n = 7,956 pairwise computations between 180 binding and 7,776 non-binding pairs), and baseline models. The baseline models are random context only (i.e., randomly sampling pairs of PINNACLE's protein representations from different cell type contexts; p-value = 1.0; n = 7,956 pairwise computations between 180 "binding" and 7,776 "non-binding" pairs), concatenating random context protein representations with 3D structure-based protein representations (p-value = 1.0; n = 7,956 pairwise computations between 180 "binding" and 7,776 "non-binding" pairs), GAT only (i.e., context-free protein representations generated by a graph attention neural network [3] on the global reference interactome; p-value = 0.6939; n = 22 pairwise comparisons between 2 binding and 20 non-binding pairs), concatenating GAT protein representations with 3D structure-based protein representations (p-value = 0.5706; n = 22 pairwise comparisons between 2 binding and 20 non-binding pairs), BIONIC only (i.e., context-free protein representations generated by BIONIC [4], a graph convolutional neural network designed for multi-modal network integration; p-value = 0.4556; n = 22 pairwise comparisons between 2 binding and 20 non-binding pairs), and concatenating BIONIC protein representations with 3D structure-based protein representations (p-value = 0.2797; n = 22 pairwise comparisons between 2 binding and 20 non-binding pairs). Note that all protein representations have consistent dimensions (328 = 200 structure-based protein representation + 128 context-aware/-free protein representation) to ensure that they are comparable. The protein representations without 3D structure are padded with 0's (i.e., null 3D structure-based protein representation). The significance of the score gaps between binding and non-binding proteins is measured using a one-sided non-parametric permutation test. Data are represented as mean values with error bars indicating a 95% confidence interval.
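The padding scheme and the significance test mentioned in this caption can be illustrated with the following sketch. Several details are assumptions not stated in the source: the structure-based part is placed before the context part, the test statistic is the difference in mean scores, the p-value uses the common add-one correction, and the function names and toy scores are hypothetical.

```python
import random

STRUCTURE_DIM = 200  # structure-based protein representation
CONTEXT_DIM = 128    # context-aware or context-free protein representation


def combine(context_vec, structure_vec=None):
    """Return a 328-d vector; a missing structure part is replaced by zeros (padding)."""
    if structure_vec is None:
        structure_vec = [0.0] * STRUCTURE_DIM
    if len(structure_vec) != STRUCTURE_DIM or len(context_vec) != CONTEXT_DIM:
        raise ValueError("unexpected representation size")
    # Assumption: structure part first, context part second.
    return list(structure_vec) + list(context_vec)


def mean(values):
    return sum(values) / len(values)


def one_sided_permutation_test(binding_scores, non_binding_scores,
                               n_permutations=2000, seed=0):
    """Test whether binding scores exceed non-binding scores by shuffling group labels."""
    rng = random.Random(seed)
    observed = mean(binding_scores) - mean(non_binding_scores)
    pooled = list(binding_scores) + list(non_binding_scores)
    n_binding = len(binding_scores)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = mean(pooled[:n_binding]) - mean(pooled[n_binding:])
        if gap >= observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical cosine-similarity scores.
binding = [0.90, 0.80, 0.95, 0.85]
non_binding = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.12, 0.18]

gap, p_value = one_sided_permutation_test(binding, non_binding)
print(gap, p_value)
print(len(combine([0.5] * CONTEXT_DIM)))
```

Because the test is one-sided, a representation whose binding scores are no higher than its non-binding scores yields a p-value close to 1, which is consistent with the values reported for the random-context baselines.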