OCMA: Fast, Memory-Efficient Factorization of Prohibitively Large Relationship Matrices

G3 (Bethesda). 2019 Jan 9;9(1):13-19. doi: 10.1534/g3.118.200908.

Abstract

Matrices representing genetic relatedness among individuals (i.e., Genomic Relationship Matrices, GRMs) play a central role in genetic analysis. The eigen-decomposition of GRMs (or its alternative, which computes only the top singular values directly from genotype matrices) is a necessary step for many analyses, including estimation of SNP-heritability, Principal Component Analysis (PCA), and genomic prediction. However, the GRMs and genotype matrices provided by modern biobanks are too large to be held in main memory. To accommodate current and future "bigger-data", we develop a disk-based tool, Out-of-Core Matrices Analyzer (OCMA), which uses state-of-the-art computational techniques to perform eigen-decomposition and Singular Value Decomposition (SVD) efficiently. By integrating memory mapping (mmap) with the latest matrix factorization libraries, OCMA is both fast and memory-efficient. To demonstrate its performance, we test OCMA on a personal computer. For full eigen-decomposition, it solves an ordinary GRM (N = 10,000) in 55 sec. For SVD, a commonly used and faster alternative to full eigen-decomposition in genomic analyses, OCMA computes the top 200 singular values (SVs) in half an hour, the top 2,000 SVs in 0.95 hr, and all 5,000 SVs in 1.77 hr for a very large genotype matrix (N = 1,000,000, M = 5,000) on the same personal computer. OCMA also supports multi-threading when running on a desktop or an HPC cluster. OCMA can thus alleviate the computing bottleneck of classical analyses on large genomic matrices and make it possible to scale up current and emerging analytical methods to big genomics data using lightweight computing resources.
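
The out-of-core idea described above can be illustrated with a minimal Python sketch. It is not OCMA's implementation, whose internals the abstract does not describe. It assumes a tall matrix with N rows and M columns stored on disk as raw row-major float64 values, maps it with NumPy's `memmap`, and accumulates the small M x M Gram matrix chunk by chunk, so that only one block of rows is resident in memory at a time. The function name `top_singular_values`, the file layout, and the `chunk_rows` parameter are all hypothetical choices made for illustration.

```python
import os
import tempfile

import numpy as np


def top_singular_values(path, n_rows, n_cols, k, chunk_rows=1024):
    """Return the k largest singular values of an on-disk matrix.

    Assumption (illustrative only): the file holds n_rows * n_cols
    float64 values in row-major order.
    """
    # Memory-map the file: pages are read from disk only when accessed.
    X = np.memmap(path, dtype=np.float64, mode="r", shape=(n_rows, n_cols))

    # Accumulate X^T X (n_cols x n_cols) one block of rows at a time.
    gram = np.zeros((n_cols, n_cols), dtype=np.float64)
    for start in range(0, n_rows, chunk_rows):
        block = np.asarray(X[start:start + chunk_rows])
        gram += block.T @ block

    # Eigenvalues of X^T X are the squared singular values of X.
    eigvals = np.linalg.eigvalsh(gram)      # returned in ascending order
    top = eigvals[::-1][:k]                 # k largest, descending
    return np.sqrt(np.clip(top, 0.0, None))  # guard against tiny negatives


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, k = 2000, 20, 5
    # Toy genotype-like matrix with entries 0, 1, or 2.
    data = rng.integers(0, 3, size=(n, m)).astype(np.float64)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "geno.bin")
        data.tofile(path)
        print(top_singular_values(path, n, m, k, chunk_rows=256))
```

Working through the Gram matrix keeps the in-memory footprint at roughly M x M plus one chunk, which suits the N much larger than M case quoted above. It also squares the condition number of the problem, so a production tool may well prefer a different factorization route.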

Keywords: Eigen decomposition; Gene mapping; Genetic matrices; Genomic selection; Genotype-based phenotype prediction; Memory virtualization; Singular value decomposition.

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Animals
  • Biological Specimen Banks / trends
  • Breeding
  • Computer Simulation
  • Genome / genetics*
  • Genomics*
  • Genotype
  • Humans
  • Models, Genetic*
  • Polymorphism, Single Nucleotide / genetics
  • Principal Component Analysis
  • Software

Associated data

  • figshare/10.25387/g3.7384973