SMAP is a pipeline for sample matching in proteogenomics

Nat Commun. 2022 Feb 8;13(1):744. doi: 10.1038/s41467-022-28411-8.

Abstract

The integration of genomics and proteomics data (proteogenomics) holds the promise of furthering the in-depth understanding of human disease. However, sample mix-up is a pervasive problem in proteogenomics because of the complexity of sample processing. Here, we present a pipeline for Sample Matching in Proteogenomics (SMAP) to verify sample identity and ensure data integrity. SMAP infers sample-dependent protein-coding variants from quantitative mass spectrometry (MS) and aligns the MS-based proteomic samples with genomic samples using two discriminant scores. Theoretical analysis with simulated data indicates that SMAP is capable of uniquely matching proteomic and genomic samples when ≥20% of the genotypes of individual samples are available. When SMAP was applied to a large-scale dataset generated by the PsychENCODE BrainGVEX project, 54 samples (19%) were corrected. The correction was further confirmed by ribosome profiling and chromatin sequencing (ATAC-seq) data from the same set of samples. Our results demonstrate that SMAP is an effective tool for sample verification in a large-scale MS-based proteogenomics study. SMAP is publicly available at https://github.com/UND-Wanglab/SMAP, and a web-based version can be accessed at https://smap.shinyapps.io/smap/.
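
The matching step described above can be illustrated with a minimal sketch. The abstract does not define the two discriminant scores, so the two used here (a genotype concordance score and its gap to the runner-up) are assumptions for illustration, as are the function names and the toy genotypes. This is not SMAP's implementation.

```python
from typing import Dict, List, Optional, Tuple

# Genotype coded as alternate-allele count (0, 1, 2); None means not observed.
Genotype = Optional[int]


def concordance(inferred: List[Genotype], reference: List[Genotype]) -> float:
    """Fraction of variants observed in both profiles whose genotypes agree."""
    shared = [(a, b) for a, b in zip(inferred, reference)
              if a is not None and b is not None]
    if not shared:
        return 0.0
    return sum(a == b for a, b in shared) / len(shared)


def match_sample(
    inferred: List[Genotype],
    genomic: Dict[str, List[Genotype]],
) -> Tuple[str, float, float]:
    """Return (best genomic sample, its concordance, gap to the runner-up).

    Hypothetical scoring: a high concordance together with a large gap
    suggests a unique match.
    """
    ranked = sorted(
        ((concordance(inferred, g), name) for name, g in genomic.items()),
        reverse=True,
    )
    best_score, best_name = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    return best_name, best_score, best_score - runner_up


# Toy data: a proteomic sample labeled "G1" whose MS-inferred genotypes
# agree better with genomic sample "G2", hinting at a possible mix-up.
genomic = {
    "G1": [0, 1, 2, 0, 1],
    "G2": [2, 1, 0, 0, 1],
    "G3": [1, 1, 1, 2, 0],
}
inferred = [2, None, 0, 0, 1]  # one variant not covered by MS

result = match_sample(inferred, genomic)
print(result)
```

In this toy case the labeled sample is not the best match, which is the kind of discrepancy a sample-matching pipeline would flag for correction. Missing genotypes are skipped rather than penalized, reflecting the point that only a fraction of genotypes may be recoverable from MS data.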

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Chromatin Immunoprecipitation Sequencing
  • Data Analysis
  • Datasets as Topic*
  • Female
  • Humans
  • Male
  • Mass Spectrometry / methods
  • Mass Spectrometry / statistics & numerical data
  • Proteogenomics / methods*
  • Proteogenomics / statistics & numerical data
  • RNA-Seq
  • Software
  • Whole Genome Sequencing