How ClinVar validates submissions

Validation during submission processing

  1. Variant
  2. Condition
  3. Variant-gene
  4. Variant-condition
  5. Miscellaneous
  6. Interpretation

Information reported after submission processing

Validation during submission processing

ClinVar analyzes the content of submissions and validates selected elements. This analysis includes both automated checks and manual checks by curators. Some checks result in rejecting a submission; others allow the submission to proceed but with questioned information returned to the submitter for review.

Variant

Variant definition

ClinVar validates variants with precise location described by either an HGVS expression or chromosome coordinates. This validation primarily uses NCBI’s Variation Services, which are based on SPDI notation (PMID: 31738401); it is supplemented with validation of intronic variants described on RefSeq transcripts and with limited validation of variants that are large and/or have an imprecise location, like CNVs. Variants that do not pass validation are not processed by ClinVar; they are returned to the submitter for correction or removal from the submission.

For HGVS expressions

  • validate HGVS format
  • validate content, including the reference sequence and reference allele
  • ClinVar does have some older submitted records based on HGVS expressions that were not handled as rigorously. These may include variant descriptions that are truly valid that our code did not handle correctly, and variants that are truly invalid but that we accepted anyway. For those submissions the HGVS expression is reported as “non-validated”, rather than invalid.
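As a rough illustration of a format check only, the sketch below splits an HGVS expression into reference sequence accession, version, coordinate type, and change. The pattern and the function name `split_hgvs` are assumptions made for this example; the pattern covers only simple RefSeq-style expressions and is not the grammar used by ClinVar or NCBI's Variation Services. Content checks, such as confirming the reference allele, would need the reference sequence itself.

```python
import re
from typing import Dict, Optional

# Hypothetical, deliberately narrow pattern: ACCESSION.VERSION:TYPE.CHANGE
HGVS_SHAPE = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)"   # RefSeq-style accession, e.g. NM_004006
    r"\.(?P<version>\d+)"             # version number is required
    r":(?P<kind>[cgmnopr])\."         # coordinate type prefix, e.g. c. or g.
    r"(?P<change>\S+)$"               # variant description, not checked here
)


def split_hgvs(expression: str) -> Optional[Dict[str, str]]:
    """Return the parts of a simple HGVS expression, or None if the shape is wrong."""
    match = HGVS_SHAPE.match(expression.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(split_hgvs("NM_004006.2:c.357+1G>A"))
    print(split_hgvs("NM_004006:c.357+1G>A"))  # no version, so the shape check fails
```

A shape check like this can only reject malformed input early; an expression that passes it may still be invalid.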

For chromosome coordinates

  • validate assembly and chromosome
  • validate that the chromosomal locations are within the range of the chromosome
  • validate that outer start < inner start < start < stop < inner stop < outer stop
  • validate that start and stop are consecutive nucleotides for insertions
  • validate that the length of the reference allele matches the start/stop locations for deletions
  • validate that the reference and alternate alleles are provided; otherwise, variant type is expected, to describe large deletions, duplications, etc.
  • validate that for insertions, the reference allele is either “-” or an anchor nucleotide (VCF-style)
  • validate that for deletions, the alternate allele is either “-” or an anchor nucleotide (VCF-style)
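Several of the coordinate checks above are simple arithmetic, and a minimal sketch may make them concrete. The dictionary keys, the `kind` field, and the function name `check_coordinates` are hypothetical and do not reflect ClinVar's submission schema. Two assumptions are worth noting: the ordering check here tolerates equal positions (the list above states strict inequalities), and the deletion check assumes the reference allele is given without an anchor nucleotide.

```python
from typing import Any, Dict, List

# Expected left-to-right order of the position fields along the chromosome.
POSITION_ORDER = ["outer_start", "inner_start", "start", "stop", "inner_stop", "outer_stop"]


def check_coordinates(variant: Dict[str, Any], chrom_length: int) -> List[str]:
    """Return a list of problems; an empty list means every check here passed."""
    problems: List[str] = []

    # Keep only the positions that were actually supplied, in expected order.
    positions = [(name, variant[name]) for name in POSITION_ORDER
                 if variant.get(name) is not None]

    # Every supplied position must fall on the chromosome (1-based).
    for name, pos in positions:
        if not 1 <= pos <= chrom_length:
            problems.append(f"{name} {pos} is outside 1..{chrom_length}")

    # Positions must not decrease (assumption: equal values are tolerated).
    for (left_name, left), (right_name, right) in zip(positions, positions[1:]):
        if left > right:
            problems.append(f"{left_name} {left} is after {right_name} {right}")

    kind = variant.get("kind")
    start, stop, ref = variant.get("start"), variant.get("stop"), variant.get("ref")

    if kind == "insertion" and start is not None and stop is not None:
        # An insertion sits between two consecutive nucleotides.
        if stop != start + 1:
            problems.append("insertion start and stop are not consecutive")

    if kind == "deletion" and start is not None and stop is not None and ref not in (None, "-"):
        # The deleted reference bases must span exactly start..stop.
        if len(ref) != stop - start + 1:
            problems.append("deletion reference allele length does not match start/stop")

    return problems


if __name__ == "__main__":
    print(check_coordinates({"kind": "insertion", "start": 100, "stop": 105}, 1000))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue for a variant in a single pass.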

For both HGVS expressions and chromosomal coordinates

  • validate that the asserted reference allele matches the allele in the reference sequence
  • validate that the alternate allele is an IUPAC base, or one of a number of non-standard abbreviations for retroviral insertions (insAlu, insLINE, insLINE1, insSINE)
  • validate that the reference sequence is human
  • validate that the reference sequence is valid.
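The allele checks in this list can be sketched as follows. The function names are made up for this example, and the reference sequence is passed in as a plain string, whereas a real check would retrieve it from a sequence database. The sketch assumes that an alternate allele may be a string of several IUPAC nucleotide codes, and it does not handle the “-” placeholder used for deletions.

```python
# IUPAC nucleotide codes, including ambiguity codes.
IUPAC_BASES = set("ACGTRYSWKMBDHVN")

# Non-standard abbreviations named in the list above.
NON_STANDARD_INSERTIONS = {"insAlu", "insLINE", "insLINE1", "insSINE"}


def is_acceptable_alternate_allele(alt: str) -> bool:
    """True if alt is a listed abbreviation or consists only of IUPAC codes."""
    if alt in NON_STANDARD_INSERTIONS:
        return True
    return bool(alt) and all(base in IUPAC_BASES for base in alt.upper())


def reference_allele_matches(reference_sequence: str, start: int, asserted_ref: str) -> bool:
    """Compare the asserted reference allele with the sequence at a 1-based start."""
    observed = reference_sequence[start - 1:start - 1 + len(asserted_ref)]
    return observed.upper() == asserted_ref.upper()


if __name__ == "__main__":
    print(is_acceptable_alternate_allele("insAlu"))
    print(reference_allele_matches("ACGTACGT", 3, "GT"))
```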

Known issues

Our variant validation has a few known issues. They include some less frequently used HGVS standards that our code does not handle yet, as well as other issues for which we do not have a programmatic solution yet. We will continue to improve the code; please contact us at clinvar@ncbi.nlm.nih.gov if you have a submission that is blocked by one of these issues.

  • The HGVS format that combines a genomic reference sequence (NC or NG) with an NM for intronic variants, e.g. NC_000023.10(NM_004006.2):c.357+1G>A, is valid but not handled by ClinVar’s validation.
  • New versions of RefSeq NM transcripts may not be validated by ClinVar immediately.
  • Suppressed RefSeqs are accepted, but they should return an error.
  • Some HGVS expressions representing no change at intronic positions are not validated correctly.
  • Some HGVS expressions representing microsatellites with uncertain ranges or with ambiguous nucleotides are not validated correctly.
  • Some HGVS expressions representing inversions across intron-exon or exon-UTR boundaries are not validated correctly.

Examples

The examples that we use as test cases for our validation code are available on the ClinVar FTP site. They are organized into files with either valid or invalid cases, and with descriptions using HGVS expressions or chromosomal coordinates. The files using chromosomal coordinates are available in both GRCh37 and GRCh38 coordinates.

Consistency checking

  • If an rs number is provided for a variant, ClinVar validates that the genomic location for that identifier is consistent with the variant description.
  • Legacy descriptions are not validated.
  • If a submission is updated, ClinVar stops the processing if a change in the variant definition is detected. The update is allowed to continue only if the submitter verifies the previous definition was an error. ClinVar then assigns a new AlleleID and a new VariationID, when it updates the version of the SCV accession.

Multiple variants

  • If multiple variants are submitted together as a single record, e.g. as a haplotype or as a compound heterozygote, curators confirm the submitter's intent. This may result in splitting a submission. For example, two variants submitted as a compound heterozygote and reported as pathogenic for an autosomal recessive disease should be split into two distinct submissions, each reporting one variant as pathogenic for the disease.

Condition

  • If a submission defines a disease or phenotype with a database identifier, ClinVar validates:
    • that the identifier is valid for the database
    • that the identifier is for a disease or phenotype. For example, a MIM number for a disease is valid; a MIM number for a gene is not.
  • If a submission defines a disease or phenotype with both a database identifier and a name, ClinVar validates that the identifier and name represent the same concept.
  • If multiple diseases or phenotypes are submitted as a single record, curators confirm the submitter's intent. For example, a variant that is reported as pathogenic for two diseases on a single record indicates that the variant results in the combination of those diseases, not that the variant can cause either disease. For the latter case, the submission is split into two distinct submissions, one for the variant with each of the diseases.

Variant-gene

  • If a submitter specifies the gene affected by a variant, ClinVar verifies that the genomic location of that variant is within that gene. (in progress)
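One way to picture this check is as interval containment on a shared chromosome. The sketch below is an assumption-laden illustration: the `Interval` type and `is_within` function are invented for this example, "within" is read as full containment, and the coordinates in the usage lines are made up rather than taken from any assembly.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A 1-based, inclusive genomic span on one chromosome."""
    chrom: str
    start: int
    stop: int


def is_within(variant: Interval, gene: Interval) -> bool:
    """True if the variant lies entirely inside the gene span on the same chromosome."""
    return (
        variant.chrom == gene.chrom
        and gene.start <= variant.start
        and variant.stop <= gene.stop
    )


if __name__ == "__main__":
    gene = Interval("X", 1000, 5000)                 # made-up gene span
    print(is_within(Interval("X", 1200, 1200), gene))
    print(is_within(Interval("X", 4990, 5010), gene))  # runs past the gene end
```

A variant that only partly overlaps the gene fails this test; whether such cases should pass is a policy decision the source does not state.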

Variant-condition

  • ClinVar is adding a check to determine whether, when a MIM number is submitted for a disorder and that disorder is specific to a causative gene, the variant's location is consistent with the location of that gene.
  • ClinVar checks whether the submitter has already submitted an interpretation for the variant and condition. We are aware that the database contains some duplicate records that were submitted before checks were put in place. We are working with submitters to resolve the duplicates.
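The duplicate check described above amounts to looking up a submitter, variant, and condition combination among records already held. A minimal sketch, assuming each record can be reduced to a tuple of three identifiers (the identifiers and the function name `find_repeats` are hypothetical):

```python
from typing import Iterable, List, Set, Tuple

# (submitter, variant, condition) identifiers; the exact identifiers are an assumption.
Key = Tuple[str, str, str]


def find_repeats(existing: Set[Key], incoming: Iterable[Key]) -> List[Key]:
    """Return incoming keys that repeat an existing record or an earlier incoming one."""
    seen = set(existing)          # copy so the caller's set is not modified
    repeats: List[Key] = []
    for key in incoming:
        if key in seen:
            repeats.append(key)
        seen.add(key)
    return repeats


if __name__ == "__main__":
    held = {("lab1", "var1", "cond1")}
    batch = [("lab1", "var1", "cond1"), ("lab1", "var2", "cond1")]
    print(find_repeats(held, batch))
```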

Evidence

ClinVar validates:

Miscellaneous

  • For an update to an SCV accession, ClinVar verifies that the submitter is the owner of that record.
  • If a batch of submissions includes only variants of "uncertain significance", curators confirm that the variants were determined to be uncertain as the result of an interpretation process. If the variants are uncertain because they were not interpreted, they are not in scope for ClinVar.

Interpretation

  • ClinVar does not validate the interpretation for a variant.
  • ClinVar does not determine which interpretation is correct when submitters disagree.
    • ClinVar represents the interpretation provided by each submitter.
    • ClinVar calculates an aggregate clinical significance based on submissions and indicates when there is conflict between submitters.
    • Submitters are encouraged to provide evidence for the interpretation so that users can understand why submitters may disagree with the interpretation.
    • A curated interpretation from an expert panel or a practice guideline overrules any conflict from other submitters.
  • ClinVar does not review criteria for interpretation used by submitters (assertion criteria).
    • The submitter may provide documentation of the categories used to classify variants and the criteria needed to categorize variants into each bin.
    • ClinVar staff may review this documentation to ensure that it describes categories and criteria, but they do not decide whether the categories and criteria are appropriate.
    • This documentation of assertion criteria is for users to evaluate how an interpretation was made and may help users understand why submitters disagree in their interpretation.

Information reported after submission processing

After submission, a report is provided to the submitter based on checks done when submitted data is integrated into the database. This report is provided only as an FYI for the submitter, and it includes information such as:

  • ClinVar processed a variant description that could not be validated
  • the submitted HGVS expression uses a previous version of the reference sequence
  • the interpretation is inconsistent with the allele frequency (e.g. a pathogenic variant with a high allele frequency in GO-ESP, 1000 Genomes, or ExAC)
  • the interpretation was made for a novel gene-disease relationship
  • ClinVar has conflicting submissions for the same variant-disease relationship (ClinVar checks for this issue proactively but this check addresses historical issues with redundant records)
  • the variant is flagged in dbSNP as suspect
  • the submitter’s interpretation conflicts with the interpretation from an expert panel or practice guideline
  • the submitter’s interpretation differs from another submitted interpretation
  • the interpreted disease is idiopathic
  • the interpretation is “pathogenic” but no disease was provided
  • the interpretation changed but the date last evaluated did not change
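Two of the items in this list depend only on the submitted record and its previous version, so they are easy to sketch. The field names (`interpretation`, `condition`, `date_last_evaluated`) and the function name are hypothetical, and the other items would need external data such as allele frequencies or other submitters' records.

```python
from typing import Any, Dict, List, Optional


def post_processing_notes(previous: Optional[Dict[str, Any]],
                          current: Dict[str, Any]) -> List[str]:
    """Collect informational notes for a record; nothing here blocks processing."""
    notes: List[str] = []

    # A pathogenic interpretation is expected to name a disease.
    if current["interpretation"].lower() == "pathogenic" and not current.get("condition"):
        notes.append("interpretation is pathogenic but no disease was provided")

    # A changed interpretation normally comes with a new evaluation date.
    if (
        previous is not None
        and previous["interpretation"] != current["interpretation"]
        and previous["date_last_evaluated"] == current["date_last_evaluated"]
    ):
        notes.append("interpretation changed but the date last evaluated did not change")

    return notes


if __name__ == "__main__":
    old = {"interpretation": "Likely pathogenic", "date_last_evaluated": "2019-01-01"}
    new = {"interpretation": "Pathogenic", "date_last_evaluated": "2019-01-01", "condition": ""}
    for note in post_processing_notes(old, new):
        print(note)
```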


Last updated: 2020-07-30T15:05:01Z