Identifiers in ClinVar

ClinVar archives submissions that interpret the effect of a single variant or set of variants on phenotype. It also archives data aggregated from those submissions. These archives are assigned accession numbers.

To support the function of aggregating data, key concepts in the submission are extracted, assigned identifiers, and enriched with content from other databases.

This document

  • provides detailed information about ClinVar's accessions, identifiers, and relationships established with records in other public databases
  • enumerates where each type of identifier is reported in ClinVar's information products
  • gives examples of using some of these identifiers

Accession numbers

ClinVar assigns versioned accession numbers to its records. Each accession number consists of 3 letters followed by 9 numerals. The letters are either SCV (think of it as a submission to ClinVar) or RCV (a reference ClinVar record). Each accession also carries a version number. The version is incremented when a submitter updates a record, or when the contents of a reference record change because SCV accessions on which it is based are added, updated, or deleted.
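The accession pattern described above can be checked mechanically. The following is a minimal sketch, not an official ClinVar tool: the function name is_clinvar_accession is hypothetical, and the optional ".N" version suffix (for example SCV000020145.1) is an assumption about how an accession and its version are written together.

```shell
# Succeeds (exit status 0) when $1 looks like a ClinVar accession:
# the letters SCV or RCV, exactly 9 digits, and optionally ".<version>".
# The ".<version>" suffix form is an assumption, not taken from this document.
is_clinvar_accession() {
  printf '%s\n' "$1" | grep -Eq '^(SCV|RCV)[0-9]{9}(\.[0-9]+)?$'
}

# Example use:
if is_clinvar_accession "SCV000020145"; then
  echo "looks like a ClinVar accession"
fi
```

A check like this only confirms the shape of the string; it does not confirm that the accession exists in ClinVar.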

SCV

Web display

'SCV' refers to the first 3 letters of the accession number assigned to a submission to ClinVar, e.g. SCV000020145. If you submit a query to ClinVar based on that accession number, you are directed automatically to the page specific to the Variation ID generated from that submission. The accession number and version are displayed in the Assertions and evidence details section. At present, there is no web display in ClinVar specific to an SCV record.

XML releases

In our full XML releases (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), the content specific to each SCV is reported in a //ClinVarSet/ClinVarAssertion element, with the accession number reported as /ReleaseSet/ClinVarSet/ClinVarAssertion/ClinVarAccession/@Acc and the version as /ReleaseSet/ClinVarSet/ClinVarAssertion/ClinVarAccession/@Version. The XML release represents content frozen on the Sunday preceding the first Thursday of each month.
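The freeze rule stated above is simple date arithmetic. The sketch below computes, for a given year and month, the Sunday preceding the first Thursday of that month. The function name freeze_date is hypothetical, and the result only illustrates the stated rule; it is not a record of when any particular release was actually frozen.

```shell
# Print (as YYYY-MM-DD) the Sunday preceding the first Thursday of month $2 in year $1.
# Uses only awk integer arithmetic, so it does not depend on GNU or BSD date options.
freeze_date() {
  awk -v y="$1" -v m="$2" 'BEGIN {
    y += 0; m += 0                                   # normalise input such as "05"
    split("0 3 2 5 0 3 5 1 4 6 2 4", t, " ")         # month offsets (Sakamoto method)
    yy = (m < 3) ? y - 1 : y
    # weekday of the 1st of the month, 0 = Sunday ... 6 = Saturday
    dow = (yy + int(yy / 4) - int(yy / 100) + int(yy / 400) + t[m] + 1) % 7
    thu = 1 + (4 - dow + 7) % 7                      # day of month of the first Thursday
    sun = thu - 4                                    # the Sunday before it
    if (sun < 1) {                                   # that Sunday is in the previous month
      m--
      if (m < 1) { m = 12; y-- }
      split("31 28 31 30 31 30 31 31 30 31 30 31", len, " ")
      if (m == 2 && y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) len[2] = 29
      sun += len[m]
    }
    printf "%04d-%02d-%02d\n", y, m, sun
  }'
}

freeze_date 2018 5    # prints 2018-04-29
```

Note that the computed Sunday often falls in the month before the release month, as in the example.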

Reports in the tab_delimited directory

There are two files in the tab-delimited directory on our FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/) that contain information specific to submitted data.

  • submission_summary.txt.gz
  • summary_of_conflicting_interpretations.txt

submission_summary provides an overview of interpretation, phenotypes, observations, and methods reported in the current version of each submission.

summary_of_conflicting_interpretations reports all pairwise differences in interpretation of a variant, without regard to phenotype.

VCF files

The SCV accession is not reported in ClinVar's VCF files.

RCV

Web display

'RCV' refers to the first 3 letters of the accession calculated by ClinVar to aggregate information from all submissions interpreting the same phenotype relative to the same variant or set of variants. If you submit a query to ClinVar based on an RCV accession, e.g. RCV000009910, you are directed to the page specific to that record. The Assertion and evidence details section of this record lists all the supporting submissions with their SCV accessions.

XML releases

In our full XML releases (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/), the content specific to each RCV is reported in a //ClinVarSet/ReferenceClinVarAssertion element, with the accession reported as /ReleaseSet/ClinVarSet/ReferenceClinVarAssertion/ClinVarAccession/@Acc and the version as /ReleaseSet/ClinVarSet/ReferenceClinVarAssertion/ClinVarAccession/@Version. The XML release represents content frozen on the Sunday preceding the first Thursday of each month.

Reports in the tab_delimited directory

submission_summary mentions RCV accessions and versions in comments supplied by submitters (Description column).  The references to RCV are not verified by ClinVar.

summary_of_conflicting_interpretations mentions RCV accessions and versions in comments supplied by submitters (Submitter1_Description, Submitter2_Description).  The references to RCV accessions are not verified by ClinVar.
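Because these RCV mentions sit in free-text comments, one way to collect them is to search for the accession pattern itself. The following is a rough sketch under stated assumptions: the function name extract_rcv_mentions is hypothetical, the optional ".N" version suffix is an assumed notation, and grep -o (supported by GNU and BSD grep, but not required by POSIX) is assumed to be available. It scans every column of its input, not only the description columns, and like the files themselves it does not verify that the accessions exist.

```shell
# Read text on stdin and print each distinct RCV accession mentioned in it,
# with its ".<version>" suffix when one is present (assumed notation).
extract_rcv_mentions() {
  grep -oE 'RCV[0-9]{9}(\.[0-9]+)?' | sort -u
}

# Example use on a downloaded report:
#   gzip -dc submission_summary.txt.gz | extract_rcv_mentions
```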

VCF files

The RCV accessions that aggregate information about the variants represented in ClinVar are reported in ClinVar's VCF file.

Identifiers specific to ClinVar

Variation ID

ClinVar assigns a unique integer identifier to each set of variants described in submissions. The majority of ClinVar's submissions interpret a single variant, but a Variation ID is assigned even if there is only one variant in the set. There are two subclasses of Variation IDs:

  • those being interpreted directly (interpreted)
  • those being interpreted only in the context of a set of variants (included)

The majority of Variation IDs in ClinVar are interpreted, meaning that they were the focus of a submission, with the clinical significance of that variant provided as an interpretation. However, there are submissions that describe a compound heterozygote, a haplotype, or a diplotype, for which ClinVar has not independently received a submission interpreting the effect of each individual variant. The individual variants that define the complex set, but have no direct interpretation, are represented by Variation IDs of the 'included' class.

Note the example https://www.ncbi.nlm.nih.gov/clinvar/variation/561/ . The Variation ID is 561. None of the Allele IDs 38381, 38382, or 15600 has a submission to ClinVar directly interpreting that variant, so each of those is "included", and the Variation IDs assigned to each individual variant (242756, 242755, and 242821, respectively) are being phased into ClinVar's public reports. The standard for reporting the correspondences is ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz

Note: If for any reason the description of Variation ID 561 were changed from the set of 3 simple variants, as it is now, to one deletion/insertion event spanning 8 base pairs, the Variation ID would be retained, because it describes the same final sequence at that location, but a single, different Allele ID would be assigned to the deletion/insertion.

Web display

The Variation ID is used to anchor ClinVar's display specific to a set of variants. Take, for example, Variation ID 561.

  • The value 561 in the URL https://www.ncbi.nlm.nih.gov/clinvar/variation/561/ is the Variation ID.
  • The value 561 has been assigned to the set of 3 distinct simple variants reported in the Allele(s) section.
  • The Variation ID is also displayed explicitly on the Variation Report.

XML releases

The ClinVarFullRelease files describe the interpreted set of variants in the element //MeasureSet. The attribute @ID is the Variation ID.

Reports in the tab_delimited directory

There are multiple reports in the tab-delimited directory that reference the Variation ID.  The column containing those values is clearly labeled.  The file reporting the relationships between VariationID and AlleleID is variation_allele.txt.gz.

  • hgvs4variation.txt.gz
  • submission_summary.txt.gz
  • var_citations.txt
  • variation_allele.txt.gz
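Because the identifier columns are labeled, a report can be filtered by column label instead of by column position. The following is a minimal sketch, assuming that each file is tab-delimited and has a header line naming its columns, possibly prefixed with "#". The function name rows_where is hypothetical, and the label VariationID used in the example is an assumption; check the header of the file you are reading for the exact label.

```shell
# Read a tab-delimited report on stdin and print the rows whose column
# labelled $1 has the value $2. The header is the first line that contains
# the label as a whole field; a leading "#" on that line is ignored.
# Lines before the header are skipped. Prints nothing if the label is absent.
rows_where() {
  awk -F '\t' -v name="$1" -v value="$2" '
    col == 0 {
      line = $0
      sub(/^#/, "", line)
      n = split(line, h, "\t")
      for (i = 1; i <= n; i++) if (h[i] == name) col = i
      next
    }
    $col == value
  '
}

# Example use (column label assumed):
#   gzip -dc variation_allele.txt.gz | rows_where VariationID 561
```

Looking the column up by label keeps a script working if columns are added or reordered in a later release.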

VCF files

The current VCF file does not reference the Variation ID. ClinVar is working on a major revision to the VCF reports that will be focused on the Variation ID.

Allele ID

A unique integer identifier, the Allele ID, is assigned to each individual variant in ClinVar. The numbering systems for the Allele ID and the Variation ID described above overlap, so the same integer may be in use both as an Allele ID and as an unrelated Variation ID. It is therefore important to note the context of any integer identifier.

XML releases

The ClinVarFullRelease files describe each individual variant in the element //Measure. The attribute @ID is the Allele ID.

Reports in the tab_delimited directory

There are multiple reports in the tab-delimited directory that reference the Allele ID. The column containing the Allele ID is clearly labeled. The file reporting the relationships between VariationID and AlleleID is variation_allele.txt.gz.

  • allele_gene.txt
  • cross_references.txt
  • hgvs4variation.txt.gz
  • variant_summary.txt.gz
  • var_citations.txt
  • variation_allele.txt.gz

VCF files

The Allele ID is not reported in the VCF files.

Relationships between Variation ID and Allele ID

A Variation ID represents one or more Allele IDs. Conversely, any Allele ID may be a component of one or more sets of discrete variants (that is, of one or more Variation IDs). For example, one submission may interpret a single nucleotide variant (Allele ID n), which is assigned Variation ID a, while a different submission interprets that same single nucleotide variant (Allele ID n) in combination with a different single nucleotide variant (Allele ID m), the combination being assigned Variation ID b. Allele ID n is then a component of both Variation ID a and Variation ID b. The standard for reporting the correspondence between Variation ID and Allele ID is ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variation_allele.txt.gz.

To find all Allele IDs represented by a Variation ID, try something like

zcat variation_allele.txt.gz | awk '$1==561 {print}'

which finds all the lines where the value in the first column (Variation ID) is 561, and reports that 561 corresponds to an interpreted (yes in column 4) Haplotype, with the Allele IDs in the 3rd column.

561     Haplotype       15600   yes
561     Haplotype       38381   yes
561     Haplotype       38382   yes

To find all Variation IDs represented by an Allele ID, try something like

zcat variation_allele.txt.gz | awk '$3==15600{print}'

which finds all the lines where the value in the third column (Allele ID) is 15600, and reports that 15600 is part of an interpreted haplotype (yes in column 4), and is also represented by Variation ID 242821, which has not been interpreted (no in column 4).

561     Haplotype       15600   yes
242821  Variant         15600   no
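The same file can also be summarized to find every Allele ID that belongs to more than one Variation ID. The following is a minimal sketch, assuming the column order shown in the output above (Variation ID in column 1, Allele ID in column 3) and tab-delimited input; the function name shared_alleles is hypothetical.

```shell
# Read variation_allele-style rows on stdin and print, for each Allele ID
# (column 3) that appears under more than one Variation ID (column 1),
# the Allele ID and the number of distinct Variation IDs that include it.
shared_alleles() {
  awk -F '\t' '
    /^#/ { next }                             # skip header or comment lines, if any
    !seen[$3 SUBSEP $1]++ { count[$3]++ }     # count each (allele, variation) pair once
    END { for (a in count) if (count[a] > 1) print a "\t" count[a] }
  ' | sort -n
}

# Example use:
#   gzip -dc variation_allele.txt.gz | shared_alleles
```

For the rows shown above, Allele ID 15600 would be reported with a count of 2, because it appears under both Variation ID 561 and Variation ID 242821.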

Other identifiers in ClinVar's XML releases

ClinVar's XML also reports integer @ID values for multiple elements other than MeasureSet and Measure, e.g. //Trait/@ID. These values correspond to the unique keys used in the relational database tables that ClinVar uses to represent the data. At present, these values can be used to identify the same element from one report to another, but ClinVar does not consider them public identifiers and reserves the right to alter the numbering system.

Identifiers specific to other NCBI resources

ClinVar maintains identifiers that point to records in other NCBI resources. These include Bookshelf, dbSNP, dbVar, Gene, MedGen (as CUIs), PubMed, and PubMed Central.

  • In the XML, these are reported in the XRef element.
  • In the tab-delimited directories, these are reported in
    • cross_references.txt
    • var_citations.txt

Identifiers specific to resources outside of NCBI

ClinVar also maintains identifiers that point to records in resources outside of NCBI.

  • In the XML, these are reported in the XRef element.
  • In the tab-delimited directories, these are reported in
    • cross_references.txt
    • var_citations.txt

In other FTP paths (see ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/README)

  • gene_condition_source_id

Last updated: 2018-05-31T12:14:53Z