ALFA: Allele Frequency Aggregator



ALFA at a glance:

  • The aim is to provide allele frequency from more than 1 million subjects by adding 100-200K new subjects available in dbGaP with each ALFA quarterly release.
  • The initial release of ~100 thousand subjects included allele counts and frequencies for 447 million rs sites, including 4 million novel ones, aggregated from 551 billion genotypes.
  • The dbGaP studies include chip array, exome, and genomic sequencing data with subjects from 12 diverse populations including European, African, Asian, Latin American, and others.
  • The data will be integrated with the regular dbSNP build release, with RS accessions assigned to variants, and made available by web, FTP, and API.

Background

The NCBI database of Genotypes and Phenotypes (dbGaP) contains the results of over 1,200 studies that have investigated the interaction of genotype and phenotype. The database has over two million subjects and hundreds of millions of variants, along with thousands of phenotypes and molecular assay data. This unprecedented volume and variety of data promises huge opportunities to identify genetic factors that influence health and disease. NIH has recently lifted the restriction on Genomic Summary Results (GSR) access for responsible sharing and use of the data. To fulfill this updated GSR policy and to promote research toward identifying genetic variants that contribute to health and disease, NCBI developed the Allele Frequency Aggregator (ALFA) pipeline to compute allele frequencies for variants in dbGaP across approved unrestricted studies and to provide the data as open access to the public through dbSNP. The goal of the ALFA project is to make frequency data from over 1M dbGaP subjects open access in future releases, to facilitate the discovery and interpretation of common and rare variants that have biological impact or cause disease. Toward that goal, over 925K dbGaP subjects with genotype data have been analyzed using GRAF-pop as candidates for the ALFA project, pending study approval and processing.

Build Summary

Release  Version         Date
1        20200227123210  March 10, 2020
2        20201027095038  January 6, 2021

Data Generation


Data from selected studies are harmonized and normalized. Using existing dbSNP and dbGaP curation and semi-automatic pipelines, data from GWAS chip array genotyping or from direct sequencing of exomes and whole genomes undergo QA/QC and are transformed to standard VCF format. These VCF files are the input to a pipeline that converts variants to SPDI notation and normalizes them using VOCA in order to aggregate, remap, and cluster them to existing dbSNP rs records or assign new ones (Holmes et al.). Allele frequencies are then computed.
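The final step, computing frequency from aggregated counts, can be illustrated with a minimal Python sketch. This is not the ALFA pipeline itself: the function name, the dictionary layout, and the example counts are hypothetical, and the sketch assumes alleles have already been normalized so that the same allele string means the same thing in every study.

```python
from collections import Counter


def aggregate_allele_frequency(study_counts):
    """Sum per-study allele counts for one variant and derive frequencies.

    study_counts: iterable of {allele: count} dicts, one per study
                  (hypothetical layout, for illustration only).
    Returns (total_allele_count, {allele: frequency}).
    """
    totals = Counter()
    for counts in study_counts:
        totals.update(counts)  # adds counts allele by allele

    allele_number = sum(totals.values())
    if allele_number == 0:
        return 0, {}
    return allele_number, {a: c / allele_number for a, c in totals.items()}


# Made-up counts for a single variant observed in two studies.
study_a = {"A": 180, "T": 20}
study_b = {"A": 270, "T": 30}
total, freqs = aggregate_allele_frequency([study_a, study_b])
print(total, freqs)
```

Summing counts before dividing weights each study by its sample size, which is generally preferable to averaging per-study frequencies.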

Populations

Sample ancestries are validated using GRAF-pop and assigned to 12 major populations including European, Hispanic, African, Asian, and others (Jin et al., 2019).

Data QC

We do our best to ensure that the data released is of the highest quality, complete, accurate, and useful. However, because we did not generate the original submitted data from dbGaP that were used as input for this project, and because the processing required to make the data useful is complex, we cannot be liable for omissions or inaccuracies. Please see the release summary with QC report (coming soon) for more details.

Data Excluded by QC:

  • Variants with call rate < 95%

  • Subjects with call rate < 95%
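As an illustration of the two call-rate filters above, the following Python sketch computes call rate for a list of genotype calls and applies the 95% cutoff. The function names and the use of "./." as the missing-genotype marker are assumptions made for this example; the actual ALFA QC code is not shown on this page.

```python
def call_rate(genotypes, missing="./."):
    """Fraction of genotype calls that are not missing."""
    if not genotypes:
        return 0.0
    called = sum(1 for g in genotypes if g != missing)
    return called / len(genotypes)


def passes_call_rate(genotypes, threshold=0.95):
    """Keep a variant (or subject) unless its call rate is below the threshold."""
    return call_rate(genotypes) >= threshold


# Hypothetical calls for one variant across 100 subjects.
good_variant = ["0/1"] * 96 + ["./."] * 4   # call rate 0.96
poor_variant = ["0/0"] * 94 + ["./."] * 6   # call rate 0.94
print(passes_call_rate(good_variant), passes_call_rate(poor_variant))
```

The same function serves both filters: pass one variant's calls across all subjects for the variant filter, or one subject's calls across all variants for the subject filter.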

Data Excluded by QC, awaiting fixes from the original dbGaP submitters (may be included in future releases):

  • Array datasets with conflicting subjects or markers between the marker manifest and reported genotype

  • Datasets with incorrect or flipped allele orientation

  • Datasets where the frequency of the Ancestry Informative Markers (AIMs) tested is inconsistent with 1000 Genomes for the whole study or for a particular population. A dataset is excluded if the percentage of tested AIMs that are outliers (absolute allele frequency difference > 0.15) exceeds 0.3% for the whole study or 0.1% for a population (see details).

  • Datasets where polymorphic SNPs are recorded as monomorphic

  • Datasets suspected of having errors due to chip array design

  • Datasets with various systematic errors that do not appear to be random
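The AIMs consistency rule in the list above can be sketched in Python as follows. This is a simplified reading of the stated thresholds (0.15 frequency difference; 0.3% of markers for a whole study, 0.1% for a population), not the actual QC implementation. The function names, the dictionary inputs, and the choice to compare only markers present in both sets are assumptions.

```python
def aim_outlier_fraction(study_freqs, reference_freqs, max_diff=0.15):
    """Fraction of shared AIMs whose frequency differs from the reference by more than max_diff."""
    shared = [m for m in study_freqs if m in reference_freqs]
    if not shared:
        raise ValueError("no AIMs shared between study and reference")
    outliers = sum(
        1 for m in shared if abs(study_freqs[m] - reference_freqs[m]) > max_diff
    )
    return outliers / len(shared)


def dataset_fails_aim_check(study_freqs, reference_freqs, scope="study"):
    """Apply the outlier limit: 0.3% for a whole study, 0.1% for one population."""
    limit = {"study": 0.003, "population": 0.001}[scope]
    return aim_outlier_fraction(study_freqs, reference_freqs) > limit


# Made-up data: 1000 markers, two of which deviate strongly from the reference.
reference = {f"aim{i}": 0.5 for i in range(1000)}
study = dict(reference)
study["aim0"] = 0.9
study["aim1"] = 0.9
print(dataset_fails_aim_check(study, reference, scope="study"))       # 0.2% outliers
print(dataset_fails_aim_check(study, reference, scope="population"))
```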

Terms of Use

Please see the Terms of Use applied to dbGaP frequency data and NCBI standard disclaimer.

Data Release Cycle

ALFA imports new studies and regenerates the data on a quarterly basis for release with each dbSNP build. We anticipate adding 100-200K new dbGaP subjects per release. Novel variants will be assigned RS numbers and the frequency data will be integrated with dbSNP regular release products (Entrez search, RefSNP report, API, Sequence Viewer, Variation Viewer, and FTP JSON and VCF files).

Interim ALFA releases, such as the initial release, provide more frequent updates but report ALFA allele frequency only for existing RS on the RefSNP page. Separate ALFA-specific download files are provided that include both existing RS and novel variants. Novel variants from interim releases are also available by API position search (see Data Access below).

Users can subscribe to the mailing list to get data release and update announcements.

Data Access

RefSNP Web Page

Access the RefSNP page using the rs number. Allele frequencies from ALFA and other projects are reported in the "Frequency" tab.


Example: rs334

FTP Download

All ALFA dbGaP variants, including novel ones not yet in dbSNP, are available in VCF format.

Track Hub

An ALFA track hub is now publicly available with the UCSC Genome Browser. It can be accessed with this link.

Also, a track hub definition file can be used to add ALFA tracks to a personalized Genome Browser or Genome Data Viewer (GDV). An example with NCBI GDV can be found here.

API Queries

All ALFA dbGaP variants, including novel ones not yet in dbSNP, are available through the NCBI Variation Services API, which supports three queries:

See tutorials below for Python examples.
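As a rough illustration of what an API query might look like, the Python sketch below builds a frequency request URL for an rs number and, optionally, fetches the JSON response. The base URL and the `/refsnp/{rsid}/frequency` path are assumptions based on the public NCBI Variation Services API and should be checked against the current API documentation; the function names are hypothetical.

```python
import json
import urllib.request

# Assumed base URL of the NCBI Variation Services API; verify before use.
API_BASE = "https://api.ncbi.nlm.nih.gov/variation/v0"


def frequency_url(rsid):
    """Build the (assumed) frequency endpoint URL for an rs number such as 'rs334' or 334."""
    text = str(rsid).strip().lower()
    if text.startswith("rs"):
        text = text[2:]
    if not text.isdigit():
        raise ValueError(f"not a valid rs number: {rsid!r}")
    return f"{API_BASE}/refsnp/{text}/frequency"


def fetch_frequency(rsid, timeout=30):
    """Fetch and decode the JSON frequency record (requires network access)."""
    with urllib.request.urlopen(frequency_url(rsid), timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; call fetch_frequency("rs334") to query the service.
url = frequency_url("rs334")
print(url)
```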

Enhanced Search and Filtering Features

More search and filtering features have been added to the NCBI search page to make use of the ALFA frequency data.

ALFA Reporting on Entrez SNP

On the SNP search result page, if an RS has ALFA frequency information, it is displayed along with a link to the Frequency tab on the SNP RS page.


Search Filtering with ALFA

A user can also filter the search results by ALFA frequency. On the left side of the search result page, a 'by-ALFA' filter is available under Validation Status.


Advanced Search with ALFA Population

With the SNP Advanced Search Builder, a user can search for RS by ALFA frequency in a specific population. The user first selects a population from the dropdown list and then provides a range for the minor allele frequency.


Tutorials

Presentations

  • NCBI Minute: ALFA Webinar materials and video.
  • ASHG 2019 Collab
  • ASHG 2019 Platform talk
  • Human Population Genetic Data at NCBI (Video)
  • New Variation Services for Normalizing, Remapping, and Annotating Variants (Video)
  • ASHG 2020 Collab: Introduction to and tutorial for using ALFA, the Allele Frequency Aggregator, at the National Center for Biotechnology Information (NCBI) (Video).

Citing this Project

We're planning on submitting a resource manuscript about the ALFA project later this year. For now, please cite this project website using the MLA-style reference below.

L. Phan, Y. Jin, H. Zhang, W. Qiang, E. Shekhtman, D. Shao, D. Revoe, R. Villamarin, E. Ivanchenko, M. Kimura, Z. Y. Wang, L. Hao, N. Sharopova, M. Bihan, A. Sturcke, M. Lee, N. Popova, W. Wu, C. Bastiani, M. Ward, J. B. Holmes, V. Lyoshin, K. Kaur, E. Moyer, M. Feolo, and B. L. Kattman. "ALFA: Allele Frequency Aggregator." National Center for Biotechnology Information, U.S. National Library of Medicine, 10 Mar. 2020, www.ncbi.nlm.nih.gov/snp/docs/gsr/alfa/.

Contact

Please send your comments and suggestions to snp-admin@ncbi.nlm.nih.gov


Last updated: 2021-05-05T12:40:41Z