FAQ about using ClinVar and understanding its data display

Please use this FAQ for questions about the submission process.

How to find data and configure the displays

  1. How do I search ClinVar efficiently?
  2. What groups of records does ClinVar precalculate?
  3. How can I modify the position sorting so that the representation is in gene-specific order?
  4. How can I retrieve batches of data from ClinVar?
  5. ClinVar doesn't have much data for my variant of interest. Where can I find more information?
  6. When I query ClinVar, sometimes the counts for the clinical significance filters don't match the data in the results table. What does this mean?

Using the web display

  1. What do SCV and RCV mean?
  2. Why are there different assertions of clinical significance for the same variant?
  3. Why are there multiple RCV accessions in ClinVar for the same variant?
  4. I'm interested in a variant that was reported with the condition "not specified". What does that mean?
  5. What does it mean if a ClinVar record is "classified by single submitter" but has more than one submission?
  6. When a variant may lie in multiple genes, what does ClinVar report?
  7. A variant in ClinVar is described as a 1-nt deletion in the genomic DNA, but a 60-nt deletion in the mRNA. Is that an error?
  8. I followed a link to the Breast Cancer Information Core (BIC), but it did not work. What do I do?
  9. A variant in ClinVar is described as "suspect". What does this mean?

Data sources and processing

  1. How do I find out what data sources are included in ClinVar?
  2. Why don't ClinVar records include HGMD identifiers?
  3. Does ClinVar represent variation that has not been localized on the genome?
  4. Is a new version number assigned for any change in an SCV record?
  5. When there are multiple RefSeqs for a gene, does ClinVar select a subset to use for reporting?
  6. What is ClinVar's convention for representing the location of variation with length differences when there are multiple options, left or right justified?

Reports

  1. Why doesn't the VCF file contain all the data in the XML file?
  2. Where can I find statistics about the number of ClinVar submissions?

Other

  1. How should I refer to a ClinVar record in written reports?
  2. How should I reference ClinVar?
  3. I thought all variants in ClinVar were based on medical relevance. Why does ClinVar have a record for a variant derived from chemical mutagenesis?

How to find data and configure the displays

How do I search ClinVar efficiently?

ClinVar can be searched with terms like:

  • gene symbols, e.g. PTEN
  • HGVS expressions, e.g. "NM_000314.4:c.395G>T" (use quotation marks)
  • protein changes, e.g. G132V
  • phenotypes, e.g. PTEN hamartoma tumor syndrome
  • submitters, e.g. Invitae

Note that by default, searching uses the exact search terms provided; for example, searching for "Noonan" finds records that include the word Noonan but does not find records with the word "Noonan's". Consider a wildcard search like "Noonan*" if you want to expand your search. Also by default, ClinVar queries search all fields of data. More information on how to narrow your query by searching particular fields is available in the ClinVar help document. If you have favorite queries that you run periodically, you can log in to MyNCBI and save your searches. Saved searches can be run on the fly, or you can receive regular email updates with the results of the search.
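
For scripted searches, the same kinds of terms can be sent to NCBI's E-utilities. The following is a minimal sketch that only builds an ESearch URL for the clinvar database and does not fetch anything. The helper name build_clinvar_search_url and the default retmax value are assumptions made for illustration, not part of ClinVar's documentation.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_clinvar_search_url(term, retmax=20):
    """Return an ESearch URL that queries the ClinVar database for `term`.

    `build_clinvar_search_url` is a hypothetical helper; `retmax` (the number
    of record IDs to return) is set to an arbitrary default here.
    """
    params = {"db": "clinvar", "term": term, "retmax": retmax}
    # urlencode percent-encodes characters such as "*", "[" and "]".
    return ESEARCH + "?" + urlencode(params)


# Wildcard search, to also match words such as "Noonan's".
print(build_clinvar_search_url("Noonan*"))

# Search restricted to the gene field rather than all fields.
print(build_clinvar_search_url("PTEN[gene]"))
```

The resulting URL can be opened in a browser or passed to any HTTP client. NCBI's usage guidelines for E-utilities (request rate, API keys) apply to real requests.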

What groups of records does ClinVar precalculate?

Recurrent concepts in ClinVar are captured in what are termed properties in NCBI's Entrez system. These properties are created to facilitate finding data characterized by standard values. Some of these properties are exposed as the filters you see on the result set, but there are many more. You can scan the names of properties and the counts of records with each property by using the advanced query option: http://www.ncbi.nlm.nih.gov/clinvar/advanced/.

  • Select Properties from the Builder/All fields menu.
  • Click on Show index list to the right of that menu.

Descriptions of each property, with sample queries, are provided in this document.

How can I modify the position sorting so that the representation is in gene-specific order?

When a gene is on the negative strand, the location of a variation relative to the gene sorts in opposite order to the location on the genome. The position column on ClinVar's tabular display is the chromosome location, so the ordering seems counterintuitive. At present we do not provide a method to sort in the opposite order.

How can I retrieve batches of data from ClinVar?

ClinVar does not currently support a batch query interface, but there are several approaches that might still meet your needs:

Approaches to process batches of data

Use case: Variant-specific data for a list of genes

Possible solutions:

  • Query ClinVar by listing the genes using the Boolean OR, and download the results interactively or using E-utilities, e.g.
    http://www.ncbi.nlm.nih.gov/clinvar?term=spred1[gene]%20OR%20shoc2[gene]%20OR%20raf1[gene]%20OR%20ptpn11[gene]
  • Download the file ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz and process it to extract gene-specific lines.
  • Download the full data extract ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xml/ClinVarFullRelease_00-latest.xml.gz and process it to extract gene-specific records.

Use case: Variant-specific data for a list of conditions

Possible solutions:

  • Use the same approaches as for genes, but use instead a list of identifiers such as MIM numbers or Concept UIDs (CUI) from MedGen. The file variant_summary.txt reports identifiers for phenotypes, but not their names.
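
The gene-list approaches above can be sketched in Python as follows. This is an illustrative sketch, not an official client. The function names are hypothetical, and the code assumes that the tab-delimited file has a header row with a column named GeneSymbol and that multiple symbols in that column are separated by semicolons. Check both assumptions against the file you download.

```python
import csv
import io
from urllib.parse import quote


def gene_or_query(genes):
    """Join gene symbols into a Boolean OR query limited to the [gene] field."""
    return " OR ".join(f"{gene}[gene]" for gene in genes)


def clinvar_query_url(genes):
    """Build a ClinVar web query URL for a list of genes (hypothetical helper)."""
    # Keep the square brackets readable; spaces become %20.
    return "http://www.ncbi.nlm.nih.gov/clinvar?term=" + quote(gene_or_query(genes), safe="[]")


def filter_by_gene(lines, genes, column="GeneSymbol"):
    """Yield rows of a tab-delimited table whose gene column names a wanted gene.

    `lines` is any iterable of text lines that starts with a header row.
    The column name and the ";" separator are assumptions about the file layout.
    """
    wanted = {gene.upper() for gene in genes}
    for row in csv.DictReader(lines, delimiter="\t"):
        symbols = {symbol.strip().upper() for symbol in row[column].split(";")}
        if symbols & wanted:
            yield row


# Small in-memory stand-in for the downloaded file.
sample = io.StringIO(
    "#AlleleID\tGeneSymbol\tName\n"
    "1\tRAF1\tvariantA\n"
    "2\tPTEN\tvariantB\n"
    "3\tSHOC2\tvariantC\n"
)
for row in filter_by_gene(sample, ["raf1", "shoc2"]):
    print(row["GeneSymbol"], row["Name"])
```

For the real file, the same function can be given a handle opened with gzip.open(path, "rt") so that the download does not need to be decompressed first.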

ClinVar doesn't have much data for my variant of interest. Where can I find more information?

If ClinVar can tell you are searching for a variant, the search results include this text:

You may also find information on this variant by searching: All NCBI Databases, Google

with links to search other NCBI databases or Google for that variant. Also note that submissions are not necessarily comprehensive; the fact that a laboratory has not submitted a variant does not necessarily mean that they have not seen the variant.

When I query ClinVar, sometimes the counts for the clinical significance filters don't match the data in the results table. What does this mean?

The clinical significance filters are restricted to variants reported as germline. This restriction may be lifted soon when filters for germline, somatic and de novo variants are made available.

Using the web display

What do SCV and RCV mean?

Each individual submission to ClinVar, defined by the combination of variant, condition, and submitter, receives an accession number with the prefix SCV ("submission to ClinVar"). The accession number is also versioned, so that updates the submitter makes to their record are tracked. ClinVar aggregates SCV records for the same combination of variant and condition into a "reference ClinVar" record, which receives an accession number with the prefix RCV. The RCV record can be viewed on the web by clicking the "See supporting ClinVar records" link on the right side of the variation page.

Why are there different assertions of clinical significance for the same variant?

ClinVar is an archive for assertions of clinical significance made by our submitters. If multiple groups have reported different values for clinical significance for the same variant, we report that there is a conflict and show all of the submitted values for clinical significance. These records have "conflicting data from submitters" as the clinical significance and the review status; the review status is also represented as 0/4 stars to indicate the lack of certainty in the interpretation of the variant. ClinVar does not arbitrate and resolve these conflicts. However, if we have a submission for that variant from an expert panel or a professional society, the assertion made by the expert panel or professional society is displayed and different interpretations from other submitters are not reported as conflicts.

Why are there multiple RCV accessions in ClinVar for the same variant?

An accession in ClinVar is based on a variant-phenotype combination, not the variant alone.  This representation was selected so that distinct accessions could be assigned to variants that result in distinct disorders.  Each submission is assigned an accession of the format SCV000000000.0 and versioned if the submitter updates a record (e.g. SCV000000001.1 would be updated to SCV000000001.2). Each unique combination of variant-phenotype relationship is aggregated into a ClinVar record with an accession of the format RCV000000000.0. In these early days of ClinVar, when there is less consensus about how to describe the clinical condition that results from variation, we recognize there are often multiple RCV accessions assigned to the same variant. We anticipate this will change over time as expert panels review the data and decide how to describe the phenotype.  When updates result in matching phenotypes, the RCV accessions will be merged.
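
Because both accession types share the same prefix, number and version layout, they are straightforward to parse in scripts. The sketch below assumes the nine-digit form shown above (SCV000000000.0, RCV000000000.0). The Accession type and the parse_accession function are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple


class Accession(NamedTuple):
    kind: str      # "SCV" (a submission) or "RCV" (an aggregate record)
    number: int    # numeric part of the accession
    version: int   # version after the dot


# Assumes the nine-digit layout shown in the text, e.g. SCV000000001.2.
_ACCESSION = re.compile(r"^(SCV|RCV)(\d{9})\.(\d+)$")


def parse_accession(text):
    """Split a versioned ClinVar accession into its parts, or raise ValueError."""
    match = _ACCESSION.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned SCV/RCV accession: {text!r}")
    kind, number, version = match.groups()
    return Accession(kind, int(number), int(version))


old = parse_accession("SCV000000001.1")
new = parse_accession("SCV000000001.2")
# Same submission, later version.
print(old.number == new.number and new.version > old.version)
```

Keeping the version as a separate integer makes it simple to check whether two citations refer to the same record at different points in time.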

ClinVar's current default web display is variation-centric rather than organized by variation-phenotype combinations.  ClinVar provides an overview comparing these displays.

I'm interested in a variant that was reported with the condition "not specified". What does that mean?

Some submitters want to report that a variant is benign with respect to a specific condition, which leaves the possibility that it is clinically relevant for a different condition. Other submitters want to report that the variant is generally benign, in that it does not appear to cause any genetic disorder that should be observable because it is highly penetrant. ClinVar, in collaboration with members of the ClinGen project, requests that submitters provide "not specified" as the condition to indicate that they are not specifying any single condition but rather that the variant is generally benign. Use of this term for this kind of submission may be re-evaluated in the future.

What does it mean if a ClinVar record is "classified by single submitter" but has more than one submission?

The review status, such as "classified by single submitter", is based on submissions in which a clinical significance was provided. Some submissions to ClinVar lack an explicit statement of clinical significance. These are included in the number of submissions for a variant, but they do not contribute to the variant's review status, which considers whether or not the variant was classified.

When a variant may lie in multiple genes, what does ClinVar report?

There are several situations in which a variation may be considered to have a relationship to more than one gene. The reported gene or genes, and the preferred designations, are selected as follows:

Submitted as: location on a cDNA, with or without identifying the gene
Reported as: the gene as submitted or calculated from the cDNA, and the preferred name as calculated from the cDNA reference standard for that gene

Submitted as: genomic location covering multiple non-overlapping genes, with no gene specified
Reported as: all calculated genes in the region, based on the most recent NCBI annotation release. The preferred designation is a genomic HGVS expression without a gene symbol.

Submitted as: genomic location covering multiple overlapping genes, including those with shared exons
Reported as: all genes are reported, but the preferred description is based on selection of the RefSeq that corresponds to an exonic location. If the variation is in an exon of more than one gene, then the preferred description will not include a gene symbol, and will be based only on the genomic location.

A variant in ClinVar is described as a 1-nt deletion in the genomic DNA, but a 60-nt deletion in the mRNA. Is that an error?

This variant may represent a 1-nt deletion in genomic DNA that results in exon skipping, and therefore a larger deletion in the mRNA.

I followed a link to the Breast Cancer Information Core (BIC), but it did not work. What do I do?

To display data from the Breast Cancer Information Core (BIC), you must be a member and be logged in. Registration is available online.

In addition, we have discovered that some URLs to BIC contain spaces, and when ClinVar provides such a URL, NHGRI's website sometimes encodes the space twice. If you are already logged in, the space will not be encoded twice and the URL from ClinVar should function as expected.

The URL https://research.nhgri.nih.gov/projects/bic/Member/cgi-bin/bic_query_result.cgi?table=brca2_exons&nt=995&base_change=del%20CAAAT will fail because NHGRI converts it to https://research.nhgri.nih.gov/projects/bic/Member/cgi-bin/bic_query_result.cgi?table=brca2_exons&nt=995&base_change=del%2520CAAAT. Until this bug is fixed at NHGRI, you can either

  • Edit the URL back to https://research.nhgri.nih.gov/projects/bic/Member/cgi-bin/bic_query_result.cgi?table=brca2_exons&nt=995&base_change=del%20CAAAT
  • Edit the URL to https://research.nhgri.nih.gov/projects/bic/Member/cgi-bin/bic_query_result.cgi?table=brca2_exons&nt=995&base_change=del CAAAT

and do the search. Then the query record at BIC will be displayed.
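
The %2520 in the failing URL is what results when an already-encoded space (%20) is percent-encoded a second time, because the % character itself becomes %25. The short Python sketch below reproduces the effect with the standard library and shows one possible way to undo it. The function undo_double_encoding is a hypothetical helper, and it assumes the value never contains an intentional literal percent sign.

```python
from urllib.parse import quote, unquote

raw = "del CAAAT"
once = quote(raw)     # the space becomes %20
twice = quote(once)   # the "%" of %20 becomes %25, giving %2520
print(once)
print(twice)


def undo_double_encoding(value):
    """Decode until the value stops changing, then encode exactly once.

    Assumes the original value holds no literal "%" that should be preserved.
    """
    previous = None
    while value != previous:
        previous, value = value, unquote(value)
    return quote(value)


print(undo_double_encoding("del%2520CAAAT"))
```

Applying the helper to the base_change value of the failing URL gives back the singly encoded form used in the first option above.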

A variant in ClinVar is described as "suspect". What does this mean?

This annotation is imported from dbSNP, and indicates a suspected false positive. See dbSNP's documentation for more information.

Data sources and processing

How do I find out what data sources are included in ClinVar?

From the ClinVar homepage, click on the Statistics link in the navigation bar at the top of the page. This takes you to a summary of ClinVar's submitters.

Why don't ClinVar records include HGMD identifiers?

Allele information from HGMD is not publicly available, so ClinVar is unable to connect variants accurately to the appropriate record in HGMD. The files ClinVar previously provided on the ftp site (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar) were removed at HGMD's request.

Does ClinVar represent variation that has not been localized on the genome?

ClinVar accepts the description of the variant provided by the submitter. We believe it helps our public to find records, even though the location of the variant on a defined sequence may be uncertain. In some cases, the original assay was at the protein level, and the numbering system for that protein is uncertain or the nucleotide change that may generate the protein change is indeterminate. In other cases, deletions have been reported relative to a transcript, and it is not clear whether the nucleotide change in the genome resulted from a corresponding genomic change, or aberrant splicing generated from another genomic location. Many submissions do have supporting citations, but ClinVar does not have the resources to review the literature to establish the precise nucleotide locations. We welcome submissions that would improve these data for us.

In other words, ClinVar does assign accessions to submissions that represent human variation descriptively, rather than based on an explicit public nucleotide sequence. In that situation, the search results table shows no nucleotide location for the variant and no links are provided to viewers. If the location of the allele is determined, the record is updated.  This may not require re-submission from the submitter, if NCBI staff are able to establish the location of the variant base from review of the literature.

Is a new version number assigned for any change in an SCV record?

Any change that the submitter makes to an SCV record causes the version number to increase. The SCV version number does not increase if there is a change in data that NCBI provides, such as allele frequencies, additional HGVS expressions or a MedGen ID for a phenotype. Data that NCBI provides are packaged only in the RCV accession, and changes to those data do not cause an RCV version to increment either. A new version is assigned to an RCV if there is a new version of an SCV, or if more SCVs are aggregated into the RCV.

When there are multiple RefSeqs for a gene, does ClinVar select a subset to use for reporting?

Yes, ClinVar uses as its default the RefSeq cDNAs that have been selected as reference standards by the RefSeqGene/LRG collaboration. These sequences can be identified within the GenBank view of each cDNA by the comment worded as:

This sequence is a reference standard in the RefSeqGene project.

The full set can be retrieved from RefSeqGene's ftp site:
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene

You can identify all RefSeqs that have been classified as reference standards using these approaches:

Sample queries

Goal: find all RefSeqGene sequences
Approach: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/
Comment: The FTP site provides all sequences, as well as several reports about reference standard cDNAs and matching identifiers in LRG.

Goal: find the reference standard sequences for a gene
Approach: "refseqgene standard"[Properties] AND fgfr3[gene]
Comment: Use the nucleotide database, and apply the property 'refseqgene standard' with the additional qualifier of the gene symbol.

Goal: find all reference standard transcripts
Approach: "refseqgene standard"[Properties] AND biomol_rna[prop]
Comment: Use the nucleotide database, and apply the property 'refseqgene standard' with the additional qualifier of the sequence type RNA.

What is ClinVar's convention for representing the location of variation with length differences when there are multiple options, left or right justified?

To conform to the conventions of HGVS notation, ClinVar represents the location of the sequence change at the right-most location. In VCF, however, the POS coordinate is based on the leftmost possible position of the variant. For this reason, the location represented by a dbSNP rs number may be to the left of the location represented by the HGVS notation.
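
The difference between the two conventions can be illustrated with a deletion inside a repeat, where several positions describe the same resulting sequence. The sketch below is a simplified illustration and not ClinVar's or dbSNP's actual normalization code. The function shift_deletion is a hypothetical name, positions are 0-based indexes into a plain string, and strand and VCF anchor-base details are ignored.

```python
def shift_deletion(seq, start, length, direction):
    """Return the 0-based start of an equivalent deletion moved fully left or right.

    The deletion removes seq[start:start + length]. Moving it by one base keeps
    the resulting sequence unchanged only if the base that enters the deleted
    window equals the base that leaves it.
    """
    if direction == "left":
        while start > 0 and seq[start - 1] == seq[start + length - 1]:
            start -= 1
    elif direction == "right":
        while start + length < len(seq) and seq[start] == seq[start + length]:
            start += 1
    else:
        raise ValueError("direction must be 'left' or 'right'")
    return start


def apply_deletion(seq, start, length):
    """Return the sequence with seq[start:start + length] removed."""
    return seq[:start] + seq[start + length:]


reference = "GATTTTC"   # a run of four T bases at indexes 2 to 5
left = shift_deletion(reference, 3, 1, "left")     # leftmost placement
right = shift_deletion(reference, 3, 1, "right")   # rightmost placement
print(left, right)
# Both placements describe the same deleted allele.
print(apply_deletion(reference, left, 1) == apply_deletion(reference, right, 1))
```

In this toy example the same single-base deletion can be reported at either end of the T run, which is why a left-justified coordinate and a right-justified HGVS expression can differ for one variant.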

Reports

Why doesn't the VCF file contain all the data in the XML file?

ClinVar's VCF files are currently limited to records that have been assigned rs# in dbSNP. Thus there may be two types of gaps:

  • The variant is not in scope for dbSNP (i.e. the length is greater than 50 bp or the exact location is not known).
  • The variant has not yet been processed by dbSNP.

The ClinVar VCF and related VCF set can be retrieved from ClinVar's ftp site:

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/

Where can I find who has submitted to ClinVar and statistics about number of ClinVar submissions?

ClinVar reports statistics for the number of submissions, genes, and variants from ClinVar submissions. Total counts and counts per submitter are provided. Note that variation is represented according to the ClinVar data model, in that a variation is represented as a set which may have one or more members. For example, two variations submitted together in cis are members of a single set and are counted in the statistics as one variation.

Other

How should I refer to a ClinVar record in written reports?

ClinVar records should be referred to with accession and version numbers. It is important to include the version number because the interpretation may change over time; the version number allows you to distinguish between previous and current versions. Note that the RCV accession number refers to the aggregate record for the variant and phenotype, while the SCV accession number refers to a specific submission for the variant and phenotype. If you need to build URLs based on a ClinVar accession, please note the instructions for constructing links to ClinVar.

How should I reference ClinVar?

A description of ClinVar has been published in Nucleic Acids Research.

Landrum MJ, Lee JM, Riley GR, Jang W, Rubinstein WS, Church DM, Maglott DR. ClinVar: public archive of relationships among sequence variation and human phenotype. Nucleic Acids Res. 2014 Jan 1;42(1):D980-5. doi: 10.1093/nar/gkt1113. PubMed PMID: 24234437.

If you wish to reference a specific submission, please cite the SCV accession and version.

If you wish to reference a specific ClinVar assertion, please cite the RCV accession and version.

I thought all variants in ClinVar were based on medical relevance. Why does ClinVar have a record for a variant derived from chemical mutagenesis?

ClinVar includes evidence for pathogenicity of variants, which may include in vitro experiments as well as information about individuals in whom a variant was observed.

Write to the Help Desk

Last updated: 2014-12-11T11:56:50-05:00