NCBI's Variation Resources: Frequently Asked Questions

This FAQ covers general aspects of NCBI's variation databases. If you do not find your answer on this page, please check the resource-specific FAQs listed below:

Resource            FAQ Scope
dbSNP               An archive of questions asked by users of dbSNP and dbMHC.
dbVar               NCBI's database of large, structural variation.
ClinVar             Aggregate of information about sequence variation and its relationship to human health.
Variation Reporter  Search NCBI's variation data for matches to your variant calls. Supports consequence prediction for novel variants.
Variation Viewer    Search, filter, browse and view variations in sequence graphics.

Table of Contents

  1. Where do I find all common variations in dbSNP that are not known to have medical impact?
  2. Is there a document describing the VCF files for common variants and for variants with medical impact?
  3. Do the ClinVar files contain all variations known to have medical impact?
  4. Do the ClinVar files in this directory (ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/) include variants that have a highly significant association to common disease?
  5. How does NCBI calculate clinical significance?
  6. Are some variants included in both the "clinvar" and "common_no_known_medical_impact" files?
  7. How do I find all missense, nonsense, frameshift and splice site variations in VCF files?
  8. How often does dbSNP release updates for human?
  9. If a variant has multiple alleles, how are the attributes for those alleles represented in the VCF INFO tags?
  10. If a single refSNP contains multiple alleles with different clinical significance values, how is clinical significance determined for a refSNP cluster?
  11. How do I find allele frequencies for human variants?

Where do I find all common variations in dbSNP that are not known to have medical impact?

The records for these variations can be found in the common_no_known_medical_impact.vcf.gz file, which is located in ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/. Please note the specific sense in which "common" is used in this context.

Is there a document describing the VCF files for common variants and for variants with medical impact?

Yes. Please see the Human Variation Sets in VCF Format document. 

Do the ClinVar files contain all variations known to have medical impact?

No. ClinVar is based on direct submission, and not all variants known to cause or be associated with medical disorders have been submitted to ClinVar or mapped to genomic coordinates. You should not interpret the absence of medical information for a variant in a ClinVar file as a lack of medical or functional effect for that variation. You are encouraged to submit data or contact us if you identify an error in our processing.

Do the ClinVar VCF variant files in this directory (ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/) include variants that have a highly significant association to common disease?

Yes. ClinVar has accessioned allelic variant records from OMIM. Some of these variants are risk alleles or susceptibility loci identified by genome-wide association studies, and are reported with a clinical significance of 'other'. ClinVar has not systematically accessioned variants with highly significant associations to disease, such as those reported in the Phenotype-Genotype Integrator (PheGenI) or NHGRI's Catalog of Published Genome-Wide Association Studies, but some are included in OMIM's allelic variant records.

How does NCBI calculate clinical significance?

The values of clinical significance in the ClinVar files are not determined by NCBI; they are reported as provided by the submitter. For complete documentation of ClinVar's processing of clinical significance, please see the document Representation of clinical significance in ClinVar and other variation resources at NCBI.

Are some variants included in both the "clinvar" and "common_no_known_medical_impact" files?

Yes. Some variants asserted to be non-pathogenic that were obtained through clinical channels appear in both the "clinvar" file and the "common_no_known_medical_impact" file.  Records for non-pathogenic variations that were submitted through clinical channels are marked as non-pathogenic and may have allele frequencies consistent with a non-pathogenic status.

How do I find all missense, nonsense, frameshift and splice site variations in VCF files?

You can identify alleles that affect a protein product using the values in the INFO column. Search for missense, nonsense, frameshift and splice site variations in the INFO section of VCF records using the appropriate VCF INFO tag:

  • NSM: (non-synonymous) missense
  • NSN: (non-synonymous) nonsense
  • NSF: (non-synonymous) frameshift
  • ASS: splice site acceptor
  • DSS: splice site donor
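
As a minimal sketch of this search, assuming the tags appear as bare, semicolon-separated flags in the eighth (INFO) column, as they do in the example records in this FAQ, the Python below reports records that carry any of the five tags. The function name and the sample lines are illustrative only and are not part of any NCBI tool.

```python
PROTEIN_EFFECT_TAGS = {"NSM", "NSN", "NSF", "ASS", "DSS"}

def find_protein_effect_records(lines, wanted=PROTEIN_EFFECT_TAGS):
    """Yield (chrom, pos, id, matched_tags) for records carrying any wanted flag."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF data line
        chrom, pos, rsid = fields[0], fields[1], fields[2]
        # Flags are INFO entries without "="; key=value entries are ignored here.
        flags = {e for e in fields[7].split(";") if e and "=" not in e}
        matched = sorted(flags & set(wanted))
        if matched:
            yield chrom, int(pos), rsid, matched

sample = [
    "##fileformat=VCFv4.0\n",
    "19\t54627246\trs119475042\tG\tA,C\t.\t.\tRSPOS=54627246;VC=SNV;PM;S3D;NSM;REF;ASP\n",
    # Made-up record with no protein-effect flag, for contrast.
    "1\t100\trs0\tA\tG\t.\t.\tRSPOS=100;VC=SNV;ASP\n",
]
hits = list(find_protein_effect_records(sample))
```

To scan a downloaded file rather than a list, the same function can be given an open text handle, for example gzip.open(path, "rt") from the Python standard library. Note that this sketch matches at the record level; it does not say which allele of a multi-allelic record carries the effect.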

How often does dbSNP release updates for human?

dbSNP does not update information continuously. For human, there are two major releases per year, termed builds, in June and November. Data submitted via the clinical channel, however, are updated monthly, on the first Thursday of the month.
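
If you schedule downloads around the monthly clinical-channel update, the date of the first Thursday can be computed as in the sketch below. The function name is illustrative, and the code only does calendar arithmetic; it does not check whether an update was actually published on that date.

```python
import datetime

def first_thursday(year, month):
    """Return the date of the first Thursday in the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday() counts Monday as 0, so Thursday is 3.
    offset = (3 - first.weekday()) % 7
    return first + datetime.timedelta(days=offset)
```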

If a variant has multiple alleles, how are the attributes for those alleles represented in the VCF INFO tags?

When multiple alleles are present, the following organizational rules apply to all VCF files:

  • The data for each INFO tag is presented in the order that the alleles appear in the variant entry.  The data for the primary allele is shown first, followed by the data for each alternate allele in the same order that the alternate alleles are presented.
  • The INFO tag values for each allele are separated by a semicolon " ; "
  • If a single INFO tag has multiple values for a particular allele, each value for that allele is separated by a comma " , "

When multiple alleles are present, the following organizational rule applies to all VCF files EXCEPT clinvar.vcf.gz:

  • If a single INFO tag has no data available for a particular allele, then a dot " . " placeholder represents the value of the INFO tag for that allele so as to maintain data order.

When multiple alleles are present, the following organizational rules apply to the clinvar.vcf.gz file only:

  • If a variant's INFO tag has no data available for a particular allele, the data will be omitted. 
  • Data order is maintained in the clinvar.vcf.gz file by the CLNALLE tag since CLNALLE provides an ordered list of the alleles described by the clinical (CLN*) INFO tags that follow it.  A user can match the ordered list of alleles in the CLNALLE tag to their corresponding clinical data in the other clinical (CLN*) INFO tags because these clinical data are listed in the same order that the CLNALLE tag lists the alleles.  See the example (below) for more details about how the CLNALLE tag allows for data matching.
Example 1:
15 48796042 rs140603 G A,C . . RSPOS=48796042;RV;GMAF=0.0027;dbSNPBuildID=78;SSR=0;SAO=1;VP=050368000b05050517100100;
                               GENEINFO=FBN1:2200;WGT=1;VC=SNV;PM;PMC;S3D;SLO;NSM;REF;SYN;ASP;VLD;G5;HD;GNO;KGPhase1;
                               KGPROD;OTHERKG;PH3;LSD;CLNALLE=1,2;
                               CLNHGVS=NC_000015.9:g.48796042G>A,NC_000015.9:g.48796042G>C;CLNSRC=Correlagen,Correlagen;
                               CLNORIGIN=1,1;CLNSIG=2,4;CLNDBN=Marfan Syndrome, Marfan Syndrome
 
In the above example, the CLNALLE tag = 1,2; where 1 = First alternate allele ("A"), and 2 = Second alternate allele ("C"), indicating that only the first and second alternate alleles will be described in the clinical (CLN*) tags that follow -- the reference allele ("G") will have no description in the CLN* tags that follow since the CLNALLE tag does not include a value for the reference allele (CLNALLE=0 for reference allele). 
 
The clinical (CLN*) tags that follow the CLNALLE tag will each contain two separate values separated by commas:  the first value describes the first alternate allele ("A"), and the second value describes the second alternate allele ("C").  So, in the case of the CLNHGVS tag in this example, the first value of "NC_000015.9:g.48796042G>A" applies to the first alternate allele ("A") and the second value of "NC_000015.9:g.48796042G>C" applies to the second alternate allele ("C").  In the case of the CLNSIG tag, the first value of "2" (non-pathogenic) applies to the first alternate allele ("A"), while the second value of "4" (probable pathogenic) applies to the second alternate allele ("C").
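
The matching described for Example 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not NCBI's own code: it assumes the per-allele values in each CLN* tag are separated by commas, as in Example 1, and it skips any tag whose number of values does not line up with CLNALLE (for instance, a disease name that itself contains a comma). The function name is hypothetical.

```python
def clinical_data_by_allele(ref, alt, info):
    """Map each allele listed in CLNALLE to its values in the other CLN* tags."""
    alleles = [ref] + alt.split(",")  # index 0 is REF, 1 is the first ALT, ...
    tags = {}
    for entry in info.split(";"):
        entry = entry.strip()
        if "=" in entry:
            key, value = entry.split("=", 1)
            tags[key] = value
    if "CLNALLE" not in tags:
        return {}
    indexes = [int(i) for i in tags["CLNALLE"].split(",")]
    result = {}
    for key, value in tags.items():
        if not key.startswith("CLN") or key == "CLNALLE":
            continue
        values = [v.strip() for v in value.split(",")]
        if len(values) != len(indexes):
            continue  # value count does not line up with CLNALLE; leave unmatched
        for index, v in zip(indexes, values):
            if 0 <= index < len(alleles):
                result.setdefault(alleles[index], {})[key] = v
    return result

# INFO column of Example 1, shortened to a few non-clinical tags plus the CLN* tags.
example_info = (
    "RSPOS=48796042;GMAF=0.0027;VC=SNV;NSM;CLNALLE=1,2;"
    "CLNHGVS=NC_000015.9:g.48796042G>A,NC_000015.9:g.48796042G>C;"
    "CLNSRC=Correlagen,Correlagen;CLNORIGIN=1,1;CLNSIG=2,4;"
    "CLNDBN=Marfan Syndrome, Marfan Syndrome"
)
by_allele = clinical_data_by_allele("G", "A,C", example_info)
```

Here by_allele["A"]["CLNSIG"] holds the value reported for the "A" allele and by_allele["C"]["CLNSIG"] the value for the "C" allele; the reference allele "G" has no entry because CLNALLE does not list 0.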
 
There is an exception to the organizational rule for multiple alleles described in example 1.  There will be some VCF records where multiple alleles are listed for ALT, but not all of the alleles listed will have INFO tag data available.  This omission of data occurs when additional information for the alleles is unavailable due to historic reporting of refSNP properties at the refSNP level and not at the allele change level:
 
Example 2:
19 54627246 rs119475042 G A,C . . RSPOS=54627246;dbSNPBuildID=132;SSR=0;SAO=1;VP=050260000a05000002110100;
                                  GENEINFO=PRPF31:26121;WGT=1;VC=SNV;PM;S3D;NSM;REF;ASP;OTHERKG;LSD;OM;
                                  CLNALLE=1;CLNHGVS=NC_000019.9:g.54627246G>A;CLNSRC=OMIM Allelic Variant;CLNORIGIN=1; 
                                  CLNSRCID=606419.0002;CLNSIG=5;CLNCUI=C1838601;CLNDBN=Retinitis Pigmentosa 11;
                                  CLNACC=SCV000024781.1
In the example above, two ALT alleles are listed -- "A" and "C", but the INFO tags for the record only show one value each (e.g. SAO=1; WGT=1; CLNALLE=1; CLNSRCID=606419.0002).  The INFO tag values in such cases apply only to the allele change reported by the "CLNHGVS" tag, which in this case is NC_000019.9:g.54627246G>A -- the "A" allele.

In this example then, the single values reported by the INFO tags listed in the record apply to alternate allele "A" only.
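
To flag records like Example 2 programmatically, one option is to compare the ALT column against the allele indexes listed in CLNALLE. The sketch below is a hypothetical helper, not part of any NCBI tool, and it assumes CLNALLE uses 1 for the first ALT allele, 2 for the second, and so on, as described for Example 1.

```python
def alt_alleles_without_clinical_data(alt, info):
    """Return the ALT alleles that CLNALLE does not list."""
    described = set()
    for entry in info.split(";"):
        entry = entry.strip()
        if entry.startswith("CLNALLE="):
            described = {int(i) for i in entry.split("=", 1)[1].split(",")}
    # ALT alleles are numbered from 1; 0 would refer to the reference allele.
    return [a for n, a in enumerate(alt.split(","), start=1) if n not in described]

# ALT and a shortened INFO column from Example 2.
undescribed = alt_alleles_without_clinical_data(
    "A,C",
    "RSPOS=54627246;VC=SNV;NSM;CLNALLE=1;CLNHGVS=NC_000019.9:g.54627246G>A;CLNSIG=5",
)
```

For Example 2 this leaves only the "C" allele in the returned list, which is the allele that has no clinical data in the record.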
 
 

If a single refSNP contains multiple alleles with different clinical significance values, how is clinical significance determined for a refSNP cluster?

We report the value of the most pathogenic allele in the cluster. In order of decreasing pathogenic severity, the values are:

  • Pathogenic
  • Likely pathogenic
  • histocompatibility
  • drug-response
  • other
  • Uncertain significance
  • unknown\untested
  • Likely benign
  • Benign
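
The ranking above can be expressed as a small lookup. The sketch below is only an illustration of the rule, assuming the clinical significance of each allele is already available as one of the labels in the list (compared case-insensitively); it is not NCBI's implementation, and the names are hypothetical.

```python
# Labels copied from the list above, most pathogenic first.
SEVERITY_ORDER = [
    "Pathogenic",
    "Likely pathogenic",
    "histocompatibility",
    "drug-response",
    "other",
    "Uncertain significance",
    r"unknown\untested",
    "Likely benign",
    "Benign",
]
RANK = {label.lower(): position for position, label in enumerate(SEVERITY_ORDER)}

def cluster_significance(allele_values):
    """Return the most pathogenic value among the alleles of one refSNP cluster."""
    # A lower rank means more pathogenic; an unrecognized label raises KeyError.
    return min(allele_values, key=lambda value: RANK[value.lower()])
```

For example, a cluster whose alleles are reported as "Benign", "other" and "Likely pathogenic" would be summarized as "Likely pathogenic". An empty list raises ValueError, since there is no allele to report.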

How do I find allele frequencies for human variants?

Large-scale public projects that determine allele frequency (namely 1000 Genomes and GO-ESP) are based on alignment to the GRCh37 assembly. For that reason, allele frequencies in VCF files are reported only on that assembly, accessed by these equivalent paths.

Based on these data, ClinVar also reports frequency information in its full XML reports as a property of the allele itself, namely in Attribute elements of Type "GlobalMinorAlleleFrequency" or "AlleleFrequency". The source of each datum is reported in the element as well.

Frequency data are also available for download from the 1000 Genomes browser and Variation Viewer.

Note: The 1000 Genomes site also provides guidance on finding the allele frequency of a variant: http://www.1000genomes.org/faq/how-can-i-get-allele-frequency-my-variant


Last updated: 2015-09-25T13:12:11-04:00