ClinVar Variations in VCF Format

This document describes the ClinVar set of human variations in the VCF format.

The files report on human variations with clinical assertions that have been mapped to assemblies GRCh37 and GRCh38. The files are provided at the ClinVar FTP repository ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/.

ClinVar Files

ClinVar VCF files currently represent all human variants with precise endpoints that have been reported to ClinVar.

These files exclude:

  • Variants with imprecise endpoints, such as those identified by microarray. We plan to develop a second set of VCF files to represent these variants.
  • Other variants at the same location that are registered in dbSNP but do not have an assertion in ClinVar.
  • Variants that cannot be localized on the genome, such as variants reported in the literature with legacy nomenclature that cannot be confirmed.

ClinVar VCF files are allele-specific - each row represents a single allele at that position, rather than one row per rs number as in the dbSNP VCF files.
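
Because rows are allele-specific, the same position can appear on more than one row. The following is a minimal sketch of grouping rows by position, assuming tab-separated data lines in the standard VCF column order (CHROM, POS, ID, REF, ALT, ...). The helper name group_alleles_by_position is hypothetical, and the rows are invented placeholders, not real ClinVar records.

```python
from collections import defaultdict

def group_alleles_by_position(lines):
    """Map (CHROM, POS) to the list of (ID, REF, ALT) rows found there."""
    by_position = defaultdict(list)
    for line in lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        by_position[(chrom, int(pos))].append((var_id, ref, alt))
    return dict(by_position)

# Invented placeholder rows (NOT real ClinVar data): two alleles share 1:1000.
example_rows = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t111\tA\tG\t.\t.\tALLELEID=1",
    "1\t1000\t222\tA\tT\t.\t.\tALLELEID=2",
    "1\t2000\t333\tC\tT\t.\t.\tALLELEID=3",
]
grouped = group_alleles_by_position(example_rows)
```

In this sketch, each allele at a shared position keeps its own row and its own ID, which is the behavior described above.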

ClinVar provides VCF files for both GRCh37 and GRCh38.

Note that we use VCF version 4.1.  

Table 1 below summarizes the files generated by ClinVar, with a brief overview of their content.

Each file in the table is built per assembly, as part of ClinVar’s monthly release and is located in the respective directory:

Files using the first version of the ClinVar VCF format (1.0) are archived in the following directories:

Files using the new version of the ClinVar VCF format (2.0) are archived in the following directories:

Table 1. Summary of ClinVar VCF files

File Name  Update Frequency  Description

clinvar_vcf.GRCh37.vcf.gz  Monthly  ClinVar variants with precise endpoints. Each variant is represented by a single location in the reference assembly for GRCh37. This file pairs with the file clinvar_vcf.GRCh37_papu.vcf.gz below.

clinvar_vcf.GRCh37_papu.vcf.gz  Monthly  ClinVar variants with precise endpoints. Each variant is represented by all other mapped locations in GRCh37 not reported in the file above. This includes the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized or unplaced contigs (papu). This file pairs with the file clinvar_vcf.GRCh37.vcf.gz above.

clinvar_vcf.GRCh38.vcf.gz  Monthly  ClinVar variants with precise endpoints. Each variant is represented by a single location in the reference assembly for GRCh38. This file pairs with the file clinvar_vcf.GRCh38_papu.vcf.gz below.

clinvar_vcf.GRCh38_papu.vcf.gz  Monthly  ClinVar variants with precise endpoints. Each variant is represented by all other mapped locations in GRCh38 not reported in the file above. This includes the pseudoautosomal region (PAR), alternate loci, patch sequences and unlocalized or unplaced contigs (papu). This file pairs with the file clinvar_vcf.GRCh38.vcf.gz above.
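
The files listed in Table 1 are gzip-compressed text, so they can be read line by line without unpacking them first. The following is a minimal sketch that uses only the Python standard library. It assumes the file has already been downloaded to a local path, and iter_vcf_records is a hypothetical helper name.

```python
import gzip

def iter_vcf_records(path):
    """Yield the tab-separated fields of each data line in a gzipped VCF."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):  # meta-information and column header
                continue
            line = line.rstrip("\n")
            if line:  # ignore blank lines
                yield line.split("\t")

# Usage (assumes a local copy of one of the files named in Table 1):
# for fields in iter_vcf_records("clinvar_vcf.GRCh38.vcf.gz"):
#     chrom, pos, var_id, ref, alt = fields[:5]
```

A generator keeps memory use flat regardless of file size; a dedicated VCF library may be preferable when full header parsing or validation is needed.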

Notes on content

  • The ID column (column 3) reports the ClinVar Variation ID.
  • ClinVar accepts all IUPAC ambiguity codes for nucleotides. However, the VCF specification (https://samtools.github.io/hts-specs/VCFv4.2.pdf) only allows ambiguity code N. Thus ClinVar XML retains the actual ambiguous bases, but all ambiguous values are converted to N in the VCF files.
  • Interpretations may be made on a single variant or a set of variants, such as a haplotype. Variants that have only been interpreted as part of a set of variants (i.e. no direct interpretation for the variant itself) are considered "included" variants. The VCF files include both variants with a direct interpretation and included variants. Included variants do not have an associated disease (CLNDN, CLNDISDB) or a clinical significance (CLNSIG). Instead, three tags are specific to included variants: CLNDNINCL, CLNDISDBINCL, and CLNSIGINCL (see below).
  • Data reported in the INFO tags is aggregated by Variation ID. INFO tags that are retained from the old format are CLNDN, CLNDISDB, CLNSIG, GENEINFO, RS, SSR.
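
The conversion of ambiguity codes to N described in the notes above can be sketched as follows. This illustrates the rule only and is not ClinVar's own code; the function name mask_ambiguity_codes is hypothetical, and the set of codes is the standard IUPAC nucleotide ambiguity alphabet other than N.

```python
# IUPAC nucleotide ambiguity codes other than N (each stands for 2 or 3 bases).
AMBIGUITY_CODES = set("RYSWKMBDHV")

def mask_ambiguity_codes(allele):
    """Replace every IUPAC ambiguity code in an allele string with N."""
    return "".join(
        "N" if base.upper() in AMBIGUITY_CODES else base
        for base in allele
    )

masked = mask_ambiguity_codes("ACRT")  # R (A or G) becomes N
```

Unambiguous bases and existing N characters pass through unchanged, so the function can be applied to any allele string.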

Table 2. VCF INFO tags in ClinVar VCF files

INFO Tag  Comment

AF_ESP, AF_EXAC, AF_TGP  Allele frequency is reported in three tags, one for each source of data.

  • AF_ESP reports allele frequency from GO-ESP
  • AF_EXAC reports allele frequency from the ExAC Consortium
  • AF_TGP reports allele frequency from the 1000 Genomes Project

ALLELEID  The ClinVar Allele ID for the variant

CLNDNINCL  Used only for “included” variants. ClinVar's preferred disease name for an interpretation for a haplotype or genotype that includes this variant

CLNDISDBINCL  Used only for “included” variants. The database name and identifier for the disease name for an interpretation for a haplotype or genotype that includes this variant. Multiple values are separated by a pipe (|)

CLNHGVS  The top-level genomic HGVS expression for the variant. This may be on an accession for the primary assembly or for an alternate locus

CLNSIGINCL  Used only for “included” variants. The clinical significance of a haplotype or genotype that includes this variant. It is reported as pairs of Variation ID for the haplotype or genotype and the corresponding clinical significance

CLNVI  Identifiers for the variant in other databases, e.g. OMIM Allelic variant IDs

CLNVC  The type of variation, using terms from Sequence Ontology

CLNVCSO  The Sequence Ontology identifier for the type of variation

MC  The predicted molecular consequence of the variant. It is reported as pairs of the Sequence Ontology (SO) identifier and the molecular consequence term joined by a vertical bar. Multiple values are separated by a comma. This tag replaces ASS, DSS, INT, NSF, NSM, NSN, R3, R5, SYN, U3, and U5 in the old format
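
The MC layout described above (SO identifier and term joined by a vertical bar, multiple values separated by commas) can be unpacked with plain string operations. The following is a minimal sketch, assuming the INFO column uses the standard VCF convention of semicolon-separated TAG=value entries. The helper names are hypothetical, and the example INFO string is invented for illustration rather than taken from a ClinVar file.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict of tag -> raw string value."""
    info = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # flag tags carry no value
    return info

def parse_mc(mc_value):
    """Split an MC value into (SO identifier, consequence term) pairs."""
    pairs = []
    for item in mc_value.split(","):
        so_id, _, term = item.partition("|")
        pairs.append((so_id, term))
    return pairs

# Invented INFO string for illustration (NOT a real ClinVar record).
example_info = (
    "ALLELEID=1;"
    "MC=SO:0001583|missense_variant,SO:0001627|intron_variant"
)
info = parse_info(example_info)
consequences = parse_mc(info["MC"])
```

The sketch keeps all values as raw strings; converting numeric tags such as the AF_* frequencies to floats is left to the caller.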

See also NCBI Variation Resources Frequently Asked Questions.


Last updated: 2018-02-20T16:29:31Z