Guide to using files from the ftp site or accessed via e-utilities

Use cases

Background

See also our Introduction page

The unit of record accessioned by ClinVar is the assessment made by the submitter about the relationship between variants and phenotypes, with supporting evidence. Thus ClinVar maintains standardized information about variants, diagnostic terms, phenotypic measures, methods of data acquisition, methods of data assessment, and evidence supporting the existence of the variant-phenotype relationship based on family studies, analyses of populations, or functional assays. ClinVar archives each submission (the accession beginning with SCV), aggregates data by variant/phenotype pairs and accessions that relationship (RCV), adds information such as database identifiers and locations on multiple assemblies (RCV only), and also aggregates data by simple alleles or complex alleles (VariationID). The complete extraction of data in ClinVar is reported on the first Thursday of each month as ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz . These records represent the aggregation of data by variant and interpreted phenotype combinations.

Complete extractions are also reported weekly, usually on Monday, in the path ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/weekly_release/. These files are retained until the next monthly release.

Each record (ClinVarSet element) in the full release contains one ReferenceClinVarAssertion element, which is based on the aggregation of data received by ClinVar (ClinVarAssertion element) and data added by NCBI's processing, such as rs#, equivalent HGVS expressions, identifiers for disorders, allele frequencies, etc. The ReferenceClinVarAssertion has a versioned accession starting with RCV. The data received from the submitter (ClinVarAssertion) has a versioned accession starting with SCV, and packages the data as submitted (usually after transformation from a spreadsheet to XML). The evidence is represented in //ObservedIn elements. Not all of the data in the ObservedIn elements from each ClinVarAssertion/SCV are aggregated into the ReferenceClinVarAssertion/RCV, so the XML path //ClinVarAssertion//ObservedIn should be used to extract all submitted evidence.
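
As a minimal sketch of the paths described above, the following Python uses the standard library's xml.etree.ElementTree on a hand-written fragment. The fragment is illustrative and is not a real record: the ClinVarAccession child element, its Acc and Version attributes, and the contents of ObservedIn are assumptions about the layout and should be checked against the sample XML before use.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily abbreviated fragment. Accessions and contents are made up,
# and the ClinVarAccession element with Acc/Version attributes is an assumption.
SAMPLE = """
<ReleaseSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000000001" Version="2" Type="RCV"/>
      <ObservedIn><Sample><Origin>germline</Origin></Sample></ObservedIn>
    </ReferenceClinVarAssertion>
    <ClinVarAssertion>
      <ClinVarAccession Acc="SCV000000001" Version="1" Type="SCV"/>
      <ObservedIn><Sample><Origin>germline</Origin></Sample></ObservedIn>
    </ClinVarAssertion>
    <ClinVarAssertion>
      <ClinVarAccession Acc="SCV000000002" Version="3" Type="SCV"/>
      <ObservedIn><Sample><Origin>de novo</Origin></Sample></ObservedIn>
      <ObservedIn><Sample><Origin>germline</Origin></Sample></ObservedIn>
    </ClinVarAssertion>
  </ClinVarSet>
</ReleaseSet>
"""


def versioned_accession(assertion):
    """Return 'ACCESSION.VERSION' for an RCV or SCV assertion element."""
    acc = assertion.find("ClinVarAccession")
    return f'{acc.get("Acc")}.{acc.get("Version")}'


def submitted_evidence(clinvar_set):
    """Return the RCV accession and the ObservedIn elements of each SCV.

    Equivalent in spirit to //ClinVarAssertion//ObservedIn, evaluated per record.
    """
    rcv = versioned_accession(clinvar_set.find("ReferenceClinVarAssertion"))
    per_scv = {}
    for scv in clinvar_set.findall("ClinVarAssertion"):
        per_scv[versioned_accession(scv)] = scv.findall(".//ObservedIn")
    return rcv, per_scv


root = ET.fromstring(SAMPLE)
for record in root.iter("ClinVarSet"):
    rcv, per_scv = submitted_evidence(record)
    for scv, observed in per_scv.items():
        print(rcv, scv, len(observed))
```

The full release is large, so a streaming parser such as ET.iterparse, clearing each ClinVarSet after it is processed, is likely a better fit than loading the whole file at once.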

ClinVar also reports data in other formats and with different organizing principles (e.g., by variant instead of by variant-phenotype assessment).

Definition of simple and complex alleles represented in ClinVar

There are multiple files that maintain information defining simple and complex alleles. The most detailed and comprehensive file is the full release (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), but for a tabular, comprehensive subset of the data, ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz also lists alleles and their placements on GRCh37 and GRCh38.

Alleles without genomic coordinates

Some of the records in ClinVar report alleles that have not been mapped to genomic coordinates. There are several causes of this, ranked below in decreasing order of frequency.

  • ClinVar processes allelic variant records on behalf of OMIM, and until recently, most allelic variant records did not contain a computable sequence definition of the allele. ClinVar does not have the resources to research all of these to define the molecular basis of the report, but staff continue to try to reduce the number of gaps.
  • Variants originally defined based on analyses of cDNAs, without evidence of the genomic basis of the sequence change. A frequent example of these would be reports of exon loss, without evidence of genomic deletion or single nucleotide changes affecting splice junctions.
  • Submissions accepted from non-OMIM sources without having validated the sequence definition. Most of these records are old, and ClinVar is continuing to remove these gaps.
  • Database maintenance errors

Reporting genomic coordinates of alleles

All reports of the location of a variant are currently based on offset 1 (that is, coordinates are 1-based). Please note, however, that the convention for reporting location in the XML (SequenceLocation element) and tab-delimited files is consistent with HGVS expressions, while the location in the VCF meets the VCF standard. Thus, for single nucleotide variants not in repeat regions, the location based on VCF or HGVS will match on nucleotide position, but for insertions, duplications, and deletions in repeat regions, the different conventions of left justification (VCF) and right justification (HGVS) will make it appear that the definitions of the locations of the alleles are inconsistent among the files.
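
To make the justification difference concrete, here is a small sketch that shifts a deletion within a repeat to its leftmost and rightmost equivalent positions. The reference string and the function name shift_deletion are made up for illustration; the sketch handles deletions only and does not model the anchor base that a VCF record places before the deleted sequence.

```python
def shift_deletion(ref, start, end, direction):
    """Shift a deletion to its leftmost or rightmost equivalent position.

    ref is the reference sequence; start and end are 1-based, inclusive
    coordinates of the deleted bases. Returns the shifted (start, end).
    """
    s, e = start - 1, end - 1  # convert to 0-based, inclusive
    if direction == "left":
        # Moving one base left is equivalent when the base entering the
        # deleted interval matches the base leaving it.
        while s > 0 and ref[s - 1] == ref[e]:
            s -= 1
            e -= 1
    elif direction == "right":
        while e + 1 < len(ref) and ref[s] == ref[e + 1]:
            s += 1
            e += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return s + 1, e + 1


# Made-up reference with a CA repeat at positions 2-7: G CA CA CA T
REF = "GCACACAT"

# The same two-base deletion, described from the middle repeat unit (4-5).
print(shift_deletion(REF, 4, 5, "left"))   # (2, 3): left justified
print(shift_deletion(REF, 4, 5, "right"))  # (6, 7): right justified
```

Both results describe the same sequence change, which is why positions for such alleles can differ between the VCF and the HGVS-consistent files.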

ClinVar release XML

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz

Released on the first Thursday of the month.

The archive

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/

Example

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/sample_xml/RCV000077146.xml

Description

The path //ReferenceClinVarAssertion/MeasureSet contains all the metadata ClinVar has accumulated about any haplotype. Each simple allele in the set is described in the path //ReferenceClinVarAssertion/MeasureSet/Measure. It includes multiple HGVS expressions, sequence locations with reference and alternate alleles at those locations, identifiers from freely accessible public databases such as dbSNP, dbVar, OMIM, and LSDBs, and gene relationships. In other words, this element contains the result of ClinVar's standardization of data from multiple submitters. At present, the sequence locations are reported to be consistent with the HGVS expression. In the future, locations consistent with VCF standards will be added.

Note: Because of the RCV record structure (one report per variant/phenotype pair), the same /MeasureSet may be reported in more than one RCV. Each will have the same ID value: <MeasureSet Type="Variant" ID="VariantID here">
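
Because one MeasureSet can appear under several RCVs, it can be useful to index RCV accessions by the MeasureSet ID. The sketch below does this on a hand-written fragment; the accessions and IDs are made up, and the ClinVarAccession element with its Acc attribute is an assumption about the layout.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written fragment: two RCVs share MeasureSet ID 101. All values are made up.
SAMPLE = """
<ReleaseSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000000001" Version="1"/>
      <MeasureSet Type="Variant" ID="101"><Measure/></MeasureSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000000002" Version="1"/>
      <MeasureSet Type="Variant" ID="101"><Measure/></MeasureSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
  <ClinVarSet>
    <ReferenceClinVarAssertion>
      <ClinVarAccession Acc="RCV000000003" Version="1"/>
      <MeasureSet Type="Variant" ID="202"><Measure/></MeasureSet>
    </ReferenceClinVarAssertion>
  </ClinVarSet>
</ReleaseSet>
"""


def rcvs_by_measure_set(root):
    """Map each MeasureSet ID to the RCV accessions that report it."""
    index = defaultdict(list)
    for ref in root.iter("ReferenceClinVarAssertion"):
        measure_set = ref.find("MeasureSet")
        accession = ref.find("ClinVarAccession").get("Acc")
        index[measure_set.get("ID")].append(accession)
    return dict(index)


print(rcvs_by_measure_set(ET.fromstring(SAMPLE)))
```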

variant_summary

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz

Released on the first Thursday of the month.

The archive

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/README

This tab-delimited file is also comprehensive with respect to the variants represented in ClinVar, but it reports only selected metadata about each. It is useful for capturing some attributes, but not the alternate descriptions. Locations in this file are consistent with HGVS expressions.
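
A tab-delimited file of this kind can be read with Python's csv module. The sketch below runs on a small in-memory sample rather than the real file; the column names (#AlleleID, Name, Assembly, Chromosome, Start, Stop) and the rows are assumptions for illustration, so confirm the actual header against the README before relying on them.

```python
import csv
import io

# Made-up rows with an assumed subset of columns; check the README for the real header.
SAMPLE = (
    "#AlleleID\tName\tAssembly\tChromosome\tStart\tStop\n"
    "1\texample variant A\tGRCh37\t1\t100\t100\n"
    "1\texample variant A\tGRCh38\t1\t150\t150\n"
    "2\texample variant B\tGRCh38\t2\t200\t205\n"
)


def placements(handle, assembly):
    """Yield (allele_id, chromosome, start, stop) for rows on one assembly."""
    # QUOTE_NONE keeps quote characters inside names from being interpreted.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["Assembly"] == assembly:
            yield row["#AlleleID"], row["Chromosome"], int(row["Start"]), int(row["Stop"])


for record in placements(io.StringIO(SAMPLE), "GRCh38"):
    print(record)
```

For a downloaded copy, the same function should work with a handle from gzip.open(path, "rt", newline=""). Rows without a genomic placement may not carry integer coordinates, so the int() conversions may need guarding.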

The Entrez document summary

ClinVar is maintained in NCBI's Entrez system, and thus data can be accessed by Entrez's programming utilities. ClinVar's document summary, accessed by the esummary command, is structured around the VariantID, not each variant-phenotype relationship. For annotated examples of the XML that is generated from an esummary request, please refer to

Locations in these reports are consistent with HGVS expressions.

Current views in XML and JSON formats:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=9&retmode=json

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=9&retmode=XML

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=1904&retmode=json

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=1904&retmode=xml
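
The example URLs above share one pattern, which can be assembled programmatically. The sketch below only builds the URL and makes no request; the function name esummary_url is made up, and passing several IDs as one comma-separated id value is an assumption to verify against the E-utilities documentation.

```python
from urllib.parse import urlencode

# Base address taken from the example URLs in this guide.
ESUMMARY = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(ids, retmode="json", db="clinvar"):
    """Build an esummary URL for one or more ClinVar IDs."""
    query = {
        "db": db,
        "id": ",".join(str(i) for i in ids),
        "retmode": retmode,
    }
    # safe="," keeps the comma-separated ID list readable in the URL.
    return f"{ESUMMARY}?{urlencode(query, safe=',')}"


print(esummary_url([9]))
print(esummary_url([9, 1904], retmode="xml"))
```

The resulting URL can be fetched with urllib.request.urlopen or any HTTP client; the structure of the response is not covered by this sketch.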

VCF

At present, ClinVar's VCF file is limited to records that have been assigned an rs#. Thus this file is not comprehensive, because it lacks representation of the structural variants in dbVar, the records that have not been mapped to the genome, and records that are in the process of being assigned an rs# by dbSNP.

The directories

GRCh37/hg19: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/

GRCh38/hg38: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/

Please refer to http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/ for documentation of dbSNP's VCF files for human.

Definition of genotypes represented in ClinVar

There are few submissions to ClinVar that represent assertions about simple or complex genotypes. At present, these are represented only in the full XML and document summary files.

//MeasureSet/@Type="CompoundHeterozygote"

Current assessment of alleles independent of name of the trait

Both the Entrez document summary (and the detailed web page) and the variant_summary, being organized by the variant, report clinical significance based only on the variant and not the variant-phenotype relationship. Data from all submitters are aggregated and reported in the ClinicalSignificance elements according to the rules documented here. Aggregate data, such as review status and clinical significance, will differ from the ClinVar release XML (next section) because the units of aggregation differ.

Current assessment of alleles based on the name of the trait

In the ClinVar release XML, the path //ReferenceClinVarAssertion/ClinicalSignificance reports the clinical significance of a MeasureSet (Variants) relative to a TraitSet (Conditions) by aggregating submitted clinical significance values from each //ClinVarAssertion. Clinical significance in the VCF is also based on aggregating data by the variation-phenotype relationship. Aggregate data, such as review status and clinical significance, will differ from the web page, the Entrez document summary, and the variant_summary (previous section) because the units of aggregation differ.
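
The effect of the unit of aggregation can be seen with a toy example. The sketch below does not implement ClinVar's aggregation rules; it only collects the submitted values that fall into each unit, using made-up submissions, to show why a per-variant report and a per-variant/condition report can differ.

```python
from collections import defaultdict

# Hypothetical submissions: (SCV accession, variant ID, condition, submitted significance)
SUBMISSIONS = [
    ("SCV000000001", "101", "condition A", "Pathogenic"),
    ("SCV000000002", "101", "condition A", "Pathogenic"),
    ("SCV000000003", "101", "condition B", "Uncertain significance"),
]


def group_values(submissions, key):
    """Collect the set of submitted significance values per aggregation unit."""
    groups = defaultdict(set)
    for _scv, variant, condition, significance in submissions:
        groups[key(variant, condition)].add(significance)
    return dict(groups)


# Unit = variant (as in the document summary and variant_summary).
by_variant = group_values(SUBMISSIONS, lambda variant, condition: variant)

# Unit = variant/condition pair (as in the RCV records of the release XML).
by_pair = group_values(SUBMISSIONS, lambda variant, condition: (variant, condition))

print(by_variant)
print(by_pair)
```

Grouped by variant, the single unit holds two different submitted values; grouped by variant/condition pair, each unit holds one.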

Evidence about individuals or families assessed

Detailed descriptions of the individuals tested, and of the segregation of an allele with a phenotype, are represented only in the ClinVar release XML. Some of the data are aggregated in //ReferenceClinVarAssertion/ObservedIn, but the details from each submitter are presented in each //ClinVarAssertion/ObservedIn element.

Phenotypic details about individuals assessed

Detailed descriptions of phenotypes observed in individuals tested are represented only in the ClinVar release XML. Some of the data are aggregated in //ReferenceClinVarAssertion/ObservedIn/TraitSet, but the details from each submitter are presented in each //ClinVarAssertion/ObservedIn/TraitSet element.

Summary of cross-references

Two files in the tab-delimited directory report cross-references between ClinVar's AlleleID and VariationID and other databases. ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/var_citations.txt reports PubMed identifiers, along with ids from dbSNP and dbVar. ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/cross_references.txt provides identifiers for dbSNP and dbVar, and when those were last modified.


Last updated: 2018-12-21T18:49:32Z