Human Variation Sets in VCF Format

This document describes two sets of human variation files in VCF format:

Table 1 summarizes the files that are generated and how often they are updated, gives a brief overview of their content, and lists the location of the files for variations mapped to the most recent builds of GRCh37 and GRCh38. The file names in the first column of Table 1 are linked to more detailed descriptions of each file.

Please note that we use VCF version 4.1.

See Related Documentation

What's new in VCF for dbSNP Build150 (April, 2017 release)
 

b150 General Information and File Access

  • Access human_9606_b150_GRCh37p13 and human_9606_b150_GRCh38p2 directly from ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/. Scroll down to human_9606 to see available builds.
  • The definition of "common" is based on at least one population out of more than 26 major populations.
  • Human b150 supports both the GRCh38p2 and GRCh37p13 assemblies, since we map the rs in b150 to both.

b150 Updates to ClinVar VCF file sets

dbSNP RefSNP data now includes allele frequency data from 1000 Genomes, ExAC, and GO-ESP along with HapMap and other submitted populations. Below are the links to descriptions for the populations used to generate allele frequency data:

Variations Omitted from the VCF Files:

  •   Variations listed as microsatellites
  •   Named variations (i.e. variations without sequence definition)
  •   Variations not mapped on assembled chromosomes of the reference genome (currently GRCh38), independent of the patch version.
  •   Variations mapped to more than one location on the reference genome (weight > 1).
     
Table 1. Summary of VCF files

File Naming Pattern or Subdirectory: All
Update Frequency: once per dbSNP build
Content: All human rs except those variations omitted as listed above, without restriction by clinical significance. This file pairs with the All_papu file below.
Location:
GRCh37: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/
GRCh38: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p2/

File Naming Pattern or Subdirectory: All_papu
Update Frequency: once per dbSNP build
Content: All human variations found in the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized/unplaced contigs (papu).
Location:
GRCh37: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/
GRCh38: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p2/

File Naming Pattern or Subdirectory: common_all
Update Frequency: once per dbSNP build
Content: The subset of 00-All categorized as common (minor allele frequency >= 0.01 in at least one of 26 major populations, with at least two unrelated individuals having the minor allele) as described below.
Location:
GRCh37: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/
GRCh38: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh38p2/

File Naming Pattern or Subdirectory: common_and_clinical
Update Frequency: weekly
Content: An inventory of "common" germline variations with evidence of medical interest. Only common_all.vcf.gz records of possible medical interest are reported in common_and_clinical.vcf.gz.
Location:
GRCh37: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/
GRCh38: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

File Naming Pattern or Subdirectory: common_no_known_medical_impact
Update Frequency: weekly
Content: Any rs in common_all.vcf.gz not known to have a pathogenicity rating above the threshold as detailed below.
Location:
GRCh37: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/
GRCh38: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

File Naming Pattern or Subdirectory: clinvar
Update Frequency: weekly
Content: Any rs of 00-All with at least one submission via a clinical channel.
Location:
GRCh37: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/
GRCh38: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

File Naming Pattern or Subdirectory: clinvar_papu
Update Frequency: weekly(?)
Content: Any rs of 00-All with at least one submission via a clinical channel that is found in the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized/unplaced contigs (papu).
Location:
GRCh37: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/
GRCh38: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/



FILE CONTENT NOTICE

The "common_no_known_medical_impact.vcf.gz" file and the "clinvar.vcf.gz" file are NOT mutually exclusive because common variants asserted to be non-pathogenic and obtained through clinical channels appear in both the "clinvar.vcf.gz" file and the "common_no_known_medical_impact.vcf.gz" file. In other words, some records for non-pathogenic variations submitted through clinical channels may be common enough to be listed in the "common_no_known_medical_impact.vcf.gz" file.

FILE UPDATE SCHEDULE

  • Weekly updates occur on Thursday
  • The dbSNP human build is released in June and November

Directory Contents
 

Data Organization Note:

When multiple alleles are present, the following organizational rules apply to all VCF files:

  • The data for each INFO tag is presented in the order that the alleles appear in the variant entry.  The data for the primary allele is shown first, followed by the data for each alternate allele in the same order that the alternate alleles are presented.
  • The INFO tag values for each allele are separated by a semi-colon " ; "
  • If a single INFO tag has multiple values for a particular allele, each value for that allele is separated by a comma " , "

When multiple alleles are present, the following organizational rule applies to all VCF files EXCEPT clinvar.vcf.gz:

  • If a single INFO tag has no data available for a particular allele, then a dot " . " placeholder represents the value of the INFO tag for that allele so as to maintain data order.

When multiple alleles are present, the following organizational rule applies to the clinvar.vcf.gz file only:

If a variant's INFO tag has no data available for a particular allele, the data will be omitted. Data order is maintained in the clinvar.vcf.gz file by the CLINALLE tag since CLINALLE provides an ordered list of the alleles described by the clinical (CLN*) INFO tags that follow it.  A user can match the ordered list of alleles in the CLINALLE tag to their corresponding clinical data in the other clinical (CLN*) INFO tags because these clinical data are listed in the same order that the CLINALLE tag lists the alleles. See the example in variation FAQ number 8 for more details about how the CLINALLE tag allows for data matching. 
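The matching described above can be sketched in a few lines of Python. This is a minimal illustration, not NCBI code: it assumes the CLINALLE values have already been split into integers that index the record's alleles (0 for REF, 1 for the first ALT, and so on, with -1 meaning no matching allele), and that the values of one clinical (CLN*) tag have been split into a list in the same order. The function name and the sample record are hypothetical.

```python
def match_clinical_data(ref, alts, clinalle, clinical_values):
    """Pair each allele named by CLINALLE with its clinical value.

    ref, alts       -- REF allele and list of ALT alleles from the record
    clinalle        -- ordered allele indices (assumed: 0 = REF, 1 = first ALT,
                       -1 = no matching allele)
    clinical_values -- values of one CLN* tag, in the same order as clinalle
    """
    if len(clinalle) != len(clinical_values):
        raise ValueError("CLINALLE and the clinical tag must have the same length")
    alleles = [ref] + list(alts)
    matched = {}
    for index, value in zip(clinalle, clinical_values):
        if index < 0:
            continue  # assumed convention: no REF/ALT allele matches this entry
        matched[alleles[index]] = value
    return matched


# Hypothetical record: REF=G, ALT=A,C,T; clinical data exists only for A and T,
# so allele C is simply omitted rather than given a "." placeholder.
result = match_clinical_data("G", ["A", "C", "T"], [1, 3], ["5", "2"])
print(result)
```

Because alleles without data are omitted rather than padded, the position of a value in a clinical tag only has meaning relative to CLINALLE, which is why the lookup goes through the allele index.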

 

The VCF Files in the Directory of Human Variation Sets Include the Following:

00-All.vcf.gz

This file is a comprehensive report of short human variations formatted in VCF. It does not include genotypes, population specific allele frequencies, or any information regarding clinical significance. It also does not include microsatellites or named variations (i.e. variations without sequence definition).

File Updates: This file is updated once per build.

All_papu.vcf.gz

This file is an inventory of all human variations found in the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized/unplaced contigs (papu). We also release companion files for clinical data, named with a “papu” extension to support reporting on these additional, non-primary chromosome locations.

File Updates: This file is updated once per build.

common_all.vcf.gz

This file is an inventory of all "common" human variations that fall within the scope of VCF processing. The "common" category is based on germline origin and a minor allele frequency (MAF) of >=0.01 in at least one major population, with at least two unrelated individuals having the minor allele. This file may contain variations that happen to be both common and have evidence of medical interest. The definition of common may be based on any one of more than 26 major populations.

Note: The populations used to calculate allele frequency may not include the population you are studying.  

Important: an allele shown to be "common" in one of the 26 major populations used for this directory may not be common in all populations.

File Updates: This file is updated once per build.

common_and_clinical.vcf.gz

This file is an inventory of all "common" germline variations that have evidence of medical interest. To create this inventory, only those records in the common_all.vcf.gz whose CLINSIG values are 4 and greater (records of possible medical interest) are reported in common_and_clinical.vcf.gz. The resulting file therefore contains records that are both common and may have medical interest.

Note: A small percentage of records in "common_and_clinical.vcf.gz" are marked as "suspect". We suspect these records to be false positives, caused by artifacts from a paralogous sequence in the genome or by evidence suggesting sequencing error or computational artifacts. You can find these records by looking for the "SSR" tag in the information header of the VCF file.

Important: an allele shown to be "common" in one of the 26 major populations used for this directory may not be common in all populations.

File Updates: This file is updated weekly.  Older versions of the "common_and_clinical" file will have the date (in yyyymmdd format) appended to the end of the file name, while the "common_and_clinical" without the appended date will point to the most recent version of the file. Since updates to "common_and_clinical.vcf.gz" will capture changes to variations submitted to NCBI through clinical channels, users should verify that they are using the most recent version.

clinvar.vcf.gz

This file contains variations submitted through clinical channels. The variations contained in this file are therefore a mixture of variations asserted to be pathogenic as well as those known to be non-pathogenic (see Note below). The user should note that any variant may have different assertions regarding clinical significance and that this file will contain only those that are the most "pathogenic".

File Updates: This file is updated weekly.

clinvar_papu.vcf.gz

This file is an inventory of all human clinical variations found in the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized/unplaced contigs (papu).

File Updates: This file is updated once per build.

Variation records included in this file fall into the following clinical significance (CLINSIG) categories:

Table 2. Values Assigned for Clinical Significance and the Files in which Variations with these Values may be Found

Significance | CLINSIG | Variations with this value are included in the All.vcf file. When consistent with allele frequency, variations are also in the common_all file.
unknown (VUS) | 0 | common_no_known_medical_impact.vcf.gz [1], clinvar.vcf.gz
untested (not provided) | 1 | common_no_known_medical_impact.vcf.gz [1], clinvar.vcf.gz
non-pathogenic (benign) | 2 | common_no_known_medical_impact.vcf.gz [1], clinvar.vcf.gz
probably non-pathogenic | 3 | common_no_known_medical_impact.vcf.gz [1], clinvar.vcf.gz
probably pathogenic | 4 | clinvar.vcf.gz
pathogenic | 5 | clinvar.vcf.gz
affecting drug response | 6 | clinvar.vcf.gz
affecting histocompatibility | 7 | clinvar.vcf.gz
other [2] | 255 | clinvar.vcf.gz

[1] Variations with this value for clinical significance are in this file only if the allele frequency is greater than the stated threshold.

[2] Variations for which there is not yet an enumerated clinical significance class. These variations are grouped in a clinical significance class called "other", which includes:

  • Variations that are found only in somatic cells, with or without a known trait or phenotype. If a variant's source is not asserted during submission, we assume that the source of the variant is germline. Variants submitted with the clinical phrase (clinic_phrase) tag set to "cancer" are reported as somatic.
  • Somatic or germline variations that are disease risk factors
  • Somatic or germline variations that protect against a disease state (protective variants)
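The mapping in Table 2 can be written down as a small lookup, which is handy when checking which clinical file a record should appear in. The sketch below is only an illustration of the table as printed here: the names CLINSIG_LABELS and clinical_files are hypothetical, and the is_common argument stands in for "allele frequency greater than the stated threshold".

```python
# CLINSIG values and their meanings, copied from Table 2.
CLINSIG_LABELS = {
    0: "unknown (VUS)",
    1: "untested (not provided)",
    2: "non-pathogenic (benign)",
    3: "probably non-pathogenic",
    4: "probably pathogenic",
    5: "pathogenic",
    6: "affecting drug response",
    7: "affecting histocompatibility",
    255: "other",
}


def clinical_files(clinsig, is_common):
    """Return the Table 2 files in which a record with this CLINSIG may appear."""
    if clinsig not in CLINSIG_LABELS:
        raise ValueError(f"unrecognized CLINSIG value: {clinsig}")
    files = ["clinvar.vcf.gz"]
    # Values 0-3 also appear in the common_no_known_medical_impact file,
    # but only when the allele frequency meets the "common" threshold.
    if is_common and clinsig <= 3:
        files.insert(0, "common_no_known_medical_impact.vcf.gz")
    return files


print(CLINSIG_LABELS[2], clinical_files(2, is_common=True))
print(CLINSIG_LABELS[5], clinical_files(5, is_common=True))
```

Note that 255 ("other") is deliberately excluded by the `clinsig <= 3` test, so it is reported in clinvar.vcf.gz only, as in the table.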

Note: The "common_no_known_medical_impact.vcf.gz" file and the "clinvar.vcf.gz" file are not mutually exclusive since some variants asserted to be non-pathogenic that were obtained through clinical channels appear in both the "clinvar.vcf.gz" file and the "common_no_known_medical_impact.vcf.gz" file.  In other words,  non-pathogenic variations submitted through clinical channels are marked as non-pathogenic and may have allele frequencies consistent with being reported as common.

Note: A small percentage of records in "clinvar.vcf.gz" are marked as "suspect" because they are suspected to be false positives, caused by artifacts from a paralogous sequence in the genome or by evidence suggesting sequencing error or computational artifacts. You can find these records by looking for the "SSR" tag in the information header of the VCF file.

File Updates:  This file is updated weekly.  Older versions of the "clinvar.vcf.gz" file will have the date (in yyyymmdd format) appended to the end of the file name, while the most recent version will have a symlink called "-latest" appended to the end of the file name that points to the most recent file. Since updates to "clinvar.vcf.gz" will capture changes to variations submitted to NCBI through clinical channels, users should verify that they are using the most recent version.
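One way to check that you are working with the most recent dated copy is to compare the yyyymmdd stamps in a directory listing. The following is a rough sketch under assumptions: it supposes dated files are named like clinvar_yyyymmdd.vcf.gz, and the listing shown is made up for illustration rather than taken from the FTP site.

```python
import re
from datetime import datetime


def latest_dated_file(names, prefix="clinvar"):
    """Return the name with the newest yyyymmdd stamp, or None if none match.

    Assumes dated files look like '<prefix>_yyyymmdd.vcf.gz'.
    """
    pattern = re.compile(r"^" + re.escape(prefix) + r"_(\d{8})\.vcf\.gz$")
    dated = []
    for name in names:
        match = pattern.match(name)
        if match is None:
            continue  # undated file, index file, or a different prefix
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d").date()
        except ValueError:
            continue  # eight digits that do not form a real date
        dated.append((stamp, name))
    return max(dated)[1] if dated else None


# Hypothetical directory listing.
listing = [
    "clinvar.vcf.gz",
    "clinvar_20170301.vcf.gz",
    "clinvar_20170404.vcf.gz",
    "clinvar_20170404.vcf.gz.tbi",
]
newest = latest_dated_file(listing)
print(newest)
```

Parsing the stamp as a real date, rather than comparing strings, also rejects malformed names such as an impossible month or day.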

common_no_known_medical_impact.vcf.gz

This file is an inventory of all "common" germline variations that fall within the scope of VCF processing.  To create this inventory, variation records of probable medical interest (records in clinvar.vcf.gz with CLINSIG values of 4 and above) are removed from the “common” variation records file (common_all.vcf.gz).

Note: The file common_no_known_medical_impact.vcf.gz was created to provide users with an up-to-date report of common alleles not known to cause clinical phenotypes. This file can be used to subtract variants (filter) from a set of variant calls, thereby narrowing the list of variations that might warrant further evaluation for clinical significance. Should you wish to filter polymorphisms out of your whole genome/exome sequencing results, use the "common_no_known_medical_impact" file.
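The subtraction described in this note can be sketched with plain Python. This is a simplified illustration, not a recommended pipeline: it matches variants on exact (chromosome, position, REF, ALT) values, so it assumes both files use the same chromosome naming and allele representation. The function names and the sample records (including the rs numbers) are hypothetical.

```python
def variant_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    keys = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def subtract_common(call_lines, common_keys):
    """Keep headers and any call with at least one ALT allele not in common_keys."""
    kept = []
    for line in call_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if any((chrom, int(pos), ref, a) not in common_keys for a in alt.split(",")):
            kept.append(line)
    return kept


# Hypothetical contents of the common file and of a user's call set.
common_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t100\trs1\tG\tA\n",
    "1\t200\trs2\tC\tT,G\n",
]
call_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t100\t.\tG\tA\n",   # common: filtered out
    "1\t150\t.\tT\tC\n",   # not in the common file: kept
    "1\t200\t.\tC\tA\n",   # same site, different allele: kept
]
common_keys = variant_keys(common_lines)
kept = subtract_common(call_lines, common_keys)
print("".join(kept))
```

For real files, the lines could be read with `gzip.open(path, "rt")` from the standard library; loading the whole common file into a set is a simplification that may need replacing for memory reasons.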

Note: The "common_no_known_medical_impact.vcf.gz" file and the "clinvar.vcf.gz" file are not mutually exclusive because some variants asserted to be non-pathogenic that were obtained through clinical channels appear in both the "clinvar.vcf.gz" file and the "common_no_known_medical_impact.vcf.gz" file.  Records for non-pathogenic variations that were submitted through clinical channels are marked as non-pathogenic and have allele frequencies consistent with a non-pathogenic status.

File Updates:  This file is updated weekly.  Older versions of the "common_no_known_medical_impact.vcf.gz" file will have the date (in yyyymmdd format) appended to the end of the file name, while the most recent version will have a symlink called "00-latest" appended to the end of the file name that points to the most recent file. Since updates to "common_no_known_medical_impact.vcf.gz" will capture changes to variations submitted to NCBI through clinical channels, users should verify that they are using the most recent version.


General Notes for All Files

  1. Minor Allele Frequency (MAF) is the allele frequency for the 2nd most frequently seen allele.  For example, consider a variation with alleles and allele frequencies as follows:
        
         Reference Allele = G; frequency = 0.600
         Alternate Allele = C; frequency = 0.399
         Alternate Allele = T; frequency = 0.001

    Based on the MAF guideline mentioned above, the minor allele is "C", so the minor allele frequency (MAF) is 0.399.  Allele "T" with frequency 0.001 is considered a rare allele rather than a minor allele.
     
  2. The Minor Allele Frequency (MAF) for 1000 Genomes populations was calculated using genotype data from the 1000 Genomes Project [phase 1] total population of 2504 individuals, and is called the "1000G MAF" or "GMAF". You can find the data used for this computation in: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/
     
  3. The criteria used to consider a variation "common":
    1. Variant has germline origin.
    2. Variant has a minor allele frequency (MAF) of >=0.01 in at least one major population, with at least two unrelated individuals having the minor allele.
    3. MAF was computed with founder genotypes only. That is, if a variant's minor allele was observed only in a parent and its child, the variant is not considered "common".

  4. The *.tbi files in the ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF directory are created with Tabix for use with SAMtools.  See details at: https://samtools.sourceforge.net/.  The command options for Tabix are located at: https://samtools.sourceforge.net/tabix.shtml
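The MAF definition in note 1 and the "common" criteria in note 3 can be expressed as a short sketch. The function names and the shape of the population input are assumptions made for this illustration; the threshold of 0.01 and the requirement of at least two unrelated individuals come from the notes above.

```python
def minor_allele(frequencies):
    """Return (allele, frequency) for the 2nd most frequently seen allele."""
    if len(frequencies) < 2:
        raise ValueError("at least two alleles are needed to define a minor allele")
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    return ranked[1]


def is_common(germline, populations, threshold=0.01):
    """Apply the 'common' criteria from note 3.

    populations -- list of (maf, unrelated_carriers) pairs, one per major
                   population, computed from founder genotypes only
    """
    if not germline:
        return False
    return any(maf >= threshold and carriers >= 2 for maf, carriers in populations)


# The example from note 1: G is the major allele, C the minor allele, T is rare.
allele, maf = minor_allele({"G": 0.600, "C": 0.399, "T": 0.001})
print(allele, maf)

# Hypothetical populations: only the second meets both conditions.
print(is_common(True, [(0.004, 3), (0.02, 5)]))
```

A variant whose minor allele reaches the threshold in a population but is carried by only one unrelated individual fails the check, which mirrors the founder-genotype rule in criterion 3.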

 

Information Contained in the "common_all.vcf.gz" and "common_no_known_medical_impact.vcf.gz" VCF File Headers

The VCF headers for the "common_all.vcf.gz", "common_no_known_medical_impact" and "clinvar_yyyymmdd" files are similar to the standard VCF headers, but contain the following:

  1. The INFO tags dbSNP_POP_IDS, dbSNP_LOC_POP_IDS, and dbSNP_POP_HANDLES all contain information that allows you to retrieve the submitter ID and the local population ID for the population in question.
    Each of these tags has the following comma separated fields (in order): numeric SNP population IDs; local population ID; submitter identifier (handle). Since the order of these fields is consistent for each tag, you can use the position of the values relative to each other to determine the local population ID and submitter handle for a particular population ID.
     
  2. The “Info” column contains an additional field called “POPFREQ”.  This field contains the frequency information for each population ID.  The frequency information provided in this field follows the format:  pid(ns/na):f(c1/c2)[|f(c1/c2)], where:

    pid = population ID

    na = the number of chromosomes in which alleles were observed. This is usually the sample count multiplied by 2 (one for each chromosome)

    ns = the number of samples

    f = the minor allele frequency (MAF). This is actually the frequency reported for the alternate (ALT) allele, rather than strictly for the 2nd most frequently seen allele. If f > 0.5, then the genome (reference) allele is the minor allele. An example is rs3091274 on chr 1, where all frequencies are > 0.5.

    c1 = the number of occurrences of the minor allele for the population.  Each sample that is homozygous for the minor allele contributes 2 occurrences, each heterozygous sample contributes 1, and every other sample contributes 0.  Example: if the population contains 3 heterozygous samples that have the allele, 1 homozygous sample that has the allele, and 12 samples that don't have the allele, then c1 = 5 (3 alleles + 2 alleles + 0 alleles).

    c2 = sample count for the allele (c1 minus homozygous count).  Example: if the population contains 3 heterozygous samples that have the allele, 1 homozygous sample that has the allele, and 12 samples that don't have the allele, so that c1 = 5 (see c1 example above), then c2 = 4 (that is, 5 - 1), since you subtract the homozygous count (in this case 1) from the allele count (c1).

    f(c1/c2) represents the f, c1 and c2 values for the first alternate allele listed in the ALT column

    [|f(c1/c2)] represents additional instances of f(c1/c2) that may follow a vertical bar.  These additional instances of f(c1/c2) provide the f, c1 and c2 values for alternate alleles when there is more than one alternate allele listed in the ALT column.

    Example 1: In this example, we will use contents of the AAM_GENO_PANEL.vcf.gz file. A description and composition of the population in this file is available at https://www.ncbi.nlm.nih.gov/SNP/snp_viewTable.cgi?pop=4446.

    The POPFREQ value for rs12121577 given in the AAM_GENO_PANEL.vcf file is: POPFREQ=248:124:0.0887096774193548(22/22).  The reference allele given in AAM_GENO_PANEL.vcf for rs12121577 is "C" and the alternate allele is "G". Remember, the values for f, c1 and c2 are those for the alternate (and in this case, minor) allele "G". You can see these features in a sample of the AAM_GENO_PANEL.vcf file available on the web-based poly_clin_readme page.

    Using the POPFREQ format statement: na:ns:f(c1/c2)[|f(c1/c2)],  the value of each variable for each alternate allele in the POPFREQ statement for rs12121577 is the following:

    na=248

    ns=124

    f=.0887096774193548 (The refSNP page for rs12121577 shows the MAF as .0891 in the Population Diversity section)

    c1=22

    c2=22

    You can find the frequency of the reference allele ("C") by subtracting the frequency of the alternate allele (.0887096774193548) from 1: 1-.0887096774193548=0.911290323

    Example 2: In this example, we again use contents of the AAM_GENO_PANEL.vcf.gz file, but examine rs11121815 instead.

    The POPFREQ value for rs11121815 given in the AAM_GENO_PANEL.vcf file is: POPFREQ=248:124:0(0/)|0.491935483870968(122/92)|0(0/)  The reference allele given in AAM_GENO_PANEL.vcf  for rs11121815 is "T" and the alternate alleles are "A", "C" and "G". You can see these features in a sample of the AAM_GENO_PANEL.vcf file available on the web-based poly_clin_readme page.

    Using the POPFREQ format statement: na:ns:f(c1/c2)[|f(c1/c2)], you can determine the value of each variable for each alternate allele in the POPFREQ statement for rs11121815 :

    na=248

    ns=124

    For alternate allele "A":

    f=0

    c1=0

    c2=0

    For alternate allele "C"

    f=0.491935483870968 (The refSNP page for rs11121815 shows the MAF as .492 in the Population Diversity section)

    c1=122

    c2=92

    For alternate allele "G":

    f=0

    c1=0

    c2=0

    You can find the frequency of the reference allele ("T") by subtracting the MAF  (0.491935483870968) from 1:
    1 - 0.491935483870968 = 0.508064516
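The two worked examples can be reproduced with a small parser. This is a sketch under assumptions rather than a reference implementation: it reads values in the na:ns:f(c1/c2)[|f(c1/c2)] form used in the examples, treats an empty c2 (as in "0(0/)") as 0, and uses hypothetical function names. It also recomputes c1 and c2 from genotype counts as defined above.

```python
import re

# One per-allele entry such as "0.0887096774193548(22/22)" or "0(0/)".
ENTRY_RE = re.compile(r"^([0-9.eE+-]+)\((\d+)/(\d*)\)$")


def parse_popfreq(value, alts):
    """Parse 'na:ns:f(c1/c2)[|f(c1/c2)]' into (na, ns, {alt: {...}})."""
    if value.startswith("POPFREQ="):
        value = value[len("POPFREQ="):]
    na_text, ns_text, per_allele = value.split(":", 2)
    entries = per_allele.split("|")
    if len(entries) != len(alts):
        raise ValueError("expected one f(c1/c2) entry per ALT allele")
    parsed = {}
    for alt, entry in zip(alts, entries):
        match = ENTRY_RE.match(entry)
        if match is None:
            raise ValueError(f"unrecognized entry: {entry!r}")
        c2_text = match.group(3)
        parsed[alt] = {
            "f": float(match.group(1)),
            "c1": int(match.group(2)),
            "c2": int(c2_text) if c2_text else 0,  # empty c2 read as 0 (assumption)
        }
    return int(na_text), int(ns_text), parsed


def allele_counts(heterozygous, homozygous):
    """Return (c1, c2): allele occurrences and number of samples with the allele."""
    c1 = heterozygous + 2 * homozygous
    c2 = c1 - homozygous
    return c1, c2


# Example 2 above: rs11121815, REF=T, ALT=A,C,G.
na, ns, freqs = parse_popfreq(
    "POPFREQ=248:124:0(0/)|0.491935483870968(122/92)|0(0/)", ["A", "C", "G"]
)
ref_frequency = 1 - sum(entry["f"] for entry in freqs.values())
print(na, ns, freqs["C"], round(ref_frequency, 9))
print(allele_counts(heterozygous=3, homozygous=1))
```

In both examples f equals c1 divided by na (22/248 and 122/248), which gives a quick consistency check on a parsed record.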

Disclaimer: Assertions about the phenotypic effects of variants are provided by multiple sources, have different levels of experimental support, and may conflict. NCBI does not independently verify assertions and cannot endorse their accuracy. Information obtained through this resource is not a substitute for professional genetic counseling and is not intended for use as the basis of medical decision making.

Please contact snp-admin@ncbi.nlm.nih.gov if you have any questions or comments.


Last updated: 2017-04-20T15:32:46-04:00