Structural Variation Data Hub

Here are the datasets most commonly requested by our users. For a complete listing of all dbVar data, please see our Study Browser or our Variant Summary page.

  1. Clinical Structural Variants
  2. Common Structural Variants
  3. Long Read Technology
  4. Genome-wide surveys of structural variation
  5. Datasets most accessed by users

Last updated: Friday, September 16, 2022

Clinical Structural Variants

All structural variants with clinical interpretations curated at ClinVar are included in a single dbVar study: Clinical Structural Variants (nstd102). Many of these variants were previously accessioned in separate studies at dbVar (e.g., nstd37 and nstd101). The old accessions will be retired in 2021. A file linking old accessions to new ones is available here. The easiest way to browse all dbVar clinical variants is to visit the Clinical Structural Variants (nstd102) data track in NCBI's Variation Viewer or connect to the Public dbVar Hub at the UCSC Genome Browser.

Study | Regions; Calls | Search Variants | Description
Clinical Structural Variants (nstd102) | 75,239; 79,230 | nstd102 variants | Structural Variants with clinical assertions, submitted to ClinVar by external labs. dbVar now imports all placements from ClinVar as "submitted" and only remaps what is missing in order to place all variants on both GRCh37 and GRCh38. See Variant Summary counts for nstd102 in dbVar Variant Summary. See the latest statistics for nstd102 in Summary of nstd102 (Clinical Structural Variants).
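
The "import as submitted, remap only what is missing" rule described for nstd102 can be pictured with a small sketch. This is purely illustrative and not dbVar's actual pipeline: the function names and the "remapped" label are assumptions made for this example. Only the "submitted" label and the two target assemblies come from the description above.

```python
from typing import Dict, List, Set

# The two assemblies on which every variant should end up placed.
TARGET_ASSEMBLIES = ("GRCh37", "GRCh38")


def assemblies_to_remap(submitted: Set[str]) -> List[str]:
    """Return the target assemblies that have no submitted placement."""
    return [a for a in TARGET_ASSEMBLIES if a not in submitted]


def placement_plan(submitted: Set[str]) -> Dict[str, str]:
    """Label each target assembly as 'submitted' or 'remapped' (hypothetical label)."""
    return {
        a: ("submitted" if a in submitted else "remapped")
        for a in TARGET_ASSEMBLIES
    }
```

For a variant that ClinVar placed only on GRCh37, `placement_plan({"GRCh37"})` keeps the GRCh37 placement as submitted and marks GRCh38 as the only assembly that needs remapping.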

Common Structural Variants

All common structural variants are included in a single dbVar study: NCBI Curated Common Structural Variants (nstd186). These variants are also accessioned in separate studies at dbVar (1000 Genomes Consortium Phase 3 Integrated SV (estd219), gnomAD Structural Variants (nstd166), DECIPHER Consensus CNVs (nstd183), Lee et al. 2020 (nstd194), Abel et al. 2020 (nstd200), Byrska-Bishop et al. 2021 (nstd206)). A file linking accessions between the studies is available here. The easiest way to browse all dbVar common variants is to visit the NCBI Curated Common Structural Variants (nstd186) data track in NCBI's Variation Viewer or connect to the Public dbVar Hub at the UCSC Genome Browser.

Study | Regions; Calls | Search Variants | Description
NCBI Curated Common Structural Variants (nstd186) | 92,934; 111,219 | nstd186 variants | A curated dataset of all structural variants in dbVar that meet the following criteria: were part of a study with at least 100 samples; included allele frequency data; had an allele frequency of >=0.01 in at least one population. Data content of this study is subject to change as new data become available. See Variant Summary counts for nstd186 in dbVar Variant Summary. See the latest statistics for nstd186 in Summary of nstd186 (NCBI Curated Common Structural Variants).
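
The three inclusion criteria listed for nstd186 amount to a simple filter. The sketch below is illustrative only: the function name `is_common_sv` and its inputs (a study sample count and a mapping of population name to allele frequency) are assumptions made for this example, not dbVar's actual schema or curation code.

```python
from typing import Mapping, Optional

MIN_STUDY_SAMPLES = 100  # "a study with at least 100 samples"
MIN_ALLELE_FREQ = 0.01   # "allele frequency of >=0.01 in at least one population"


def is_common_sv(study_sample_count: int,
                 population_afs: Optional[Mapping[str, float]]) -> bool:
    """Return True if a variant meets all three criteria described for nstd186."""
    # Criterion 1: the source study must have at least 100 samples.
    if study_sample_count < MIN_STUDY_SAMPLES:
        return False
    # Criterion 2: allele frequency data must be present.
    if not population_afs:
        return False
    # Criterion 3: at least one population reaches the frequency threshold.
    return any(af >= MIN_ALLELE_FREQ for af in population_afs.values())
```

A variant from a 2,504-sample study with frequencies `{"AFR": 0.02, "EUR": 0.001}` passes, because one population reaches the threshold. The same frequencies from a 50-sample study fail, as does a variant with no frequency data at all.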

Long Read Technology

Long-read sequencing is better than short-read technologies at capturing large structural variation events. The following studies used long-read sequencing methods to detect SVs.

Study | Regions; Calls | Search Variants | Description
Genome in a Bottle Structural Variants - Tier I, v0.6 (nstd175) | 12,745; 12,745 | nstd175 variants | The v0.6 Genome in a Bottle Consortium [www.genomeinabottle.org] structural variant (SV) benchmark set includes ~10,000 sequence-resolved insertions and deletions >49bp from the broadly-consented GIAB/Personal Genome Project Ashkenazi son (HG002/GM24385). These SVs, along with an accompanying benchmark BED file, are discovered and evaluated by multiple short, linked, and long read sequencing technologies and are intended as a benchmark for identifying false positive and false negative SV calls in any method. Original VCF files and the benchmark BED file can be found here. See Variant Summary counts for nstd175 in dbVar Variant Summary. PubMed:Genome in a Bottle.
PacBio Circular Consensus Sequencing of human male (nstd167) | 30,218; 30,634 | nstd167 variants | PacBio Circular Consensus Sequencing (CCS) of the human male HG002/NA24385 to evaluate the ability of highly-accurate long-read sequencing to identify small and large variants, to phase variants into haplotypes, and to assemble a genome de novo. See Variant Summary counts for nstd167 in dbVar Variant Summary. PubMed:Wenger et al. 2019.
Intermediate-sized deletions examined with Nanopore long-read sequencing (nstd171) | 4,378; 4,378 | nstd171 variants | Intermediate-sized deletions (30bp-5kbp) were identified from whole-genome sequencing data of a Japanese population using a two-stage identification process. Detected intermediate-sized deletions underwent stringent filtering, and the accuracy of the deletion calls was checked using data from Oxford Nanopore long-read sequencers. See Variant Summary counts for nstd171 in dbVar Variant Summary. PubMed:Wong et al. 2019.
Multi-platform discovery of haplotype-resolved structural variation in human genomes (nstd152) | 103,985; 214,917 | nstd152 variants | This is an integrated callset from three individuals (HG00514, HG00733, and NA19240) sequenced using Illumina, Illumina 3.5 kbp jumping libraries, Illumina 6kbp jumping libraries, PacBio, BioNano Genomics, 10x, Hi-C, and Strand-seq. See Variant Summary counts for nstd152 in dbVar Variant Summary. PubMed:Chaisson et al. 2019.
Major Structural Variant Alleles of the Human Genome (nstd162) | 99,810; 342,842 | nstd162 variants | Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we generated long-read sequence data on thirteen genomes. Systematically merging SVs yielded 95,827 sequence-resolved insertions, deletions, and inversions. Among these, we identified more than 1 Mbp of SVs shared among all genomes and more than 6.5 Mbp of SVs in the majority of genomes indicating errors or extreme minor alleles captured in the reference. See Variant Summary counts for nstd162 in dbVar Variant Summary. PubMed:Audano et al. 2019.
Discovery and genotyping of structural variation from long-read haploid genome sequence data (nstd137) | 32,954; 35,154 | nstd137 variants | In an effort to more fully understand the full spectrum of human genetic variation, we generated deep single-molecule, real-time (SMRT) sequencing data from two haploid human genomes. Using an assembly-based approach (SMRT-SV), we systematically assessed each genome independently for structural variants (SVs) and indels resolving the sequence structure of 461,553 genetic variants from 2 bp to 28 kbp in length. We find that 82% of these variants have been missed as part of analysis of the 1000 Genomes Project. We estimate that this theoretical human diploid differs by as much as ~16 Mbp with respect to the human reference, with long-read sequencing data providing a fivefold increase in sensitivity for genetic variants ranging in size from 7 bp to 1 kbp when compared to short-read sequence data. See Variant Summary counts for nstd137 in dbVar Variant Summary. PubMed:Huddleston et al. 2016.

Genome-wide surveys of structural variation

The following are high-quality datasets that contain the results of genome-wide discovery surveys of CNVs and other Structural Variants from a wide variety of global populations.

Study | Regions; Calls | Search Variants | Description
Structural variants in gnomAD (nstd166) | 304,733; 313,581 | nstd166 variants | The v2.1 release of gnomAD-SV represents a catalogue of structural variants (SVs) discovered from whole-genome sequencing of 14,891 individuals at 32X mean coverage with 2x150bp Illumina reads. From this dataset, site-level SV data could be released for 10,847 unrelated individuals with appropriate consent for broad data sharing. For more information, please refer to Collins*, Brand*, et al., bioRxiv (2019), or the gnomAD-SV explainer. Original VCF files can be found here and, with dbVar accessions included, here. See Variant Summary counts for nstd166 in dbVar Variant Summary. PubMed:gnomAD_Structural_Variants.
Major Structural Variant Alleles of the Human Genome (nstd162) | 99,810; 342,842 | nstd162 variants | Although the accuracy of the human reference genome is critical for basic and clinical research, structural variants (SVs) have been difficult to assess because data capable of resolving them have been limited. To address potential bias, we generated long-read sequence data on thirteen genomes. Systematically merging SVs yielded 95,827 sequence-resolved insertions, deletions, and inversions. Among these, we identified more than 1 Mbp of SVs shared among all genomes and more than 6.5 Mbp of SVs in the majority of genomes indicating errors or extreme minor alleles captured in the reference. See Variant Summary counts for nstd162 in dbVar Variant Summary. PubMed:Audano et al. 2019.
Multi-platform discovery of haplotype-resolved structural variation in human genomes (nstd152) | 103,985; 214,917 | nstd152 variants | This is an integrated callset from three individuals (HG00514, HG00733, and NA19240) sequenced using Illumina, Illumina 3.5 kbp jumping libraries, Illumina 6kbp jumping libraries, PacBio, BioNano Genomics, 10x, Hi-C, and Strand-seq. See Variant Summary counts for nstd152 in dbVar Variant Summary. PubMed:Chaisson et al. 2019.
Genome in a Bottle Structural Variants - Tier I, v0.6 (nstd175) | 12,745; 12,745 | nstd175 variants | The v0.6 Genome in a Bottle Consortium [www.genomeinabottle.org] structural variant (SV) benchmark set includes ~10,000 sequence-resolved insertions and deletions >49bp from the broadly-consented GIAB/Personal Genome Project Ashkenazi son (HG002/GM24385). These SVs, along with an accompanying benchmark BED file, are discovered and evaluated by multiple short, linked, and long read sequencing technologies and are intended as a benchmark for identifying false positive and false negative SV calls in any method. Original VCF files and the benchmark BED file can be found here. See Variant Summary counts for nstd175 in dbVar Variant Summary. PubMed:Genome in a Bottle.
1000 Genomes Project (Phase 3 SV analysis) (estd219) | 68,825; 8,812,557 | estd219 variants | 1000 Genomes Phase 3 structural variants as reported in a companion paper specifically dedicated to SV analysis. Much of these data are identical to those reported in the main paper as study estd214. See Variant Summary counts for estd219 in dbVar Variant Summary. PubMed:1000 Genomes Consortium Phase 3 Integrated SV.
Short Tandem Repeat (STR) Population Survey (nstd128) | 1,328,521; 4,394,628 | nstd128 variants | We report high quality genomes from 300 individuals from 142 diverse populations. As part of this study, we generated a comprehensive catalog of short tandem repeat (STR) genotypes. We used this call set to characterize allele frequency spectra, analyze sequence determinants of STR variation, and to identify common loss of function alleles. See Variant Summary counts for nstd128 in dbVar Variant Summary. PubMed:Mallick et al. 2016.
CNV Global Population Survey (nstd112) | 15,012; 3,303,297 | nstd112 variants | To explore the diversity and selective signatures of duplications and deletions in human copy number variation (CNV), we sequenced 236 individuals from 125 distinct human populations. We observed that duplications exhibit fundamentally different population genetic and selective signatures than deletions and are more likely to be stratified between human populations. We find that the proportion of CNV to SNV base pairs is greater among non-Africans than it is among African populations but we conclude that this difference is likely due to unique aspects of non-African population history as opposed to differences in CNV load. See Variant Summary counts for nstd112 in dbVar Variant Summary. PubMed:Sudmant et al. 2015.

Datasets most accessed by users

The following are the top 10 most accessed datasets in the last 12 months.

Study | Regions; Calls | Search Variants | Description
Clinical Structural Variants (nstd102) | 75,239; 79,230 | nstd102 variants | Structural Variants with clinical assertions, submitted to ClinVar by external labs. dbVar now imports all placements from ClinVar as "submitted" and only remaps what is missing in order to place all variants on both GRCh37 and GRCh38. See Variant Summary counts for nstd102 in dbVar Variant Summary. See the latest statistics for nstd102 in Summary of nstd102 (Clinical Structural Variants).
Structural variants in gnomAD (nstd166) | 304,733; 313,581 | nstd166 variants | The v2.1 release of gnomAD-SV represents a catalogue of structural variants (SVs) discovered from whole-genome sequencing of 14,891 individuals at 32X mean coverage with 2x150bp Illumina reads. From this dataset, site-level SV data could be released for 10,847 unrelated individuals with appropriate consent for broad data sharing. For more information, please refer to Collins*, Brand*, et al., bioRxiv (2019), or the gnomAD-SV explainer. Original VCF files can be found here and, with dbVar accessions included, here. See Variant Summary counts for nstd166 in dbVar Variant Summary. PubMed:gnomAD_Structural_Variants.
NCBI Curated Common Structural Variants (nstd186) | 92,934; 111,219 | nstd186 variants | A curated dataset of all structural variants in dbVar that meet the following criteria: were part of a study with at least 100 samples; included allele frequency data; had an allele frequency of >=0.01 in at least one population. Data content of this study is subject to change as new data become available. See Variant Summary counts for nstd186 in dbVar Variant Summary. See the latest statistics for nstd186 in Summary of nstd186 (NCBI Curated Common Structural Variants).
Byrska-Bishop et al. 2021 (nstd206) | 173,332; 173,354 | nstd206 variants | The 1000 Genomes Project (1kGP), launched in 2008, is the largest fully open resource of whole genome sequencing (WGS) data consented for public distribution of raw sequence data without access or use restrictions. Here, we present a new, high coverage WGS resource encompassing the original 2,504 1kGP samples, as well as an additional 698 related samples that result in 602 complete trios in the 1kGP cohort. We sequenced this expanded 1kGP cohort of 3,202 samples to a targeted depth of 30X using Illumina NovaSeq 6000 instruments. We performed SNV/INDEL calling against the GRCh38 reference using GATK’s HaplotypeCaller, and generated a comprehensive set of SVs by integrating multiple analytic methods through a sophisticated machine learning model, upgrading the 1kGP dataset to current state-of-the-art standards. We called 7% more SNVs, 59% more INDELs, and 170% more SVs per genome than the phase 3 callset. Moreover, we leveraged the presence of families in the cohort to achieve superior haplotype phasing accuracy and we demonstrate improvements that the high coverage panel brings especially for INDEL imputation. We make all the data generated as part of this project publicly available and we envision this updated version of the 1kGP callset to become the new de facto public resource for the worldwide scientific community working on genomics and genetics. See bioRxiv pre-print. See Variant Summary counts for nstd206 in dbVar Variant Summary.
Refining analyses of CNV and developmental delay (nstd100) | 70,319; 318,775 | nstd100 variants | Copy Number Variants from 29,083 cases of Developmental Delay and Intellectual Disability from Signature Genomics, and 11,256 Control Samples. This study contains samples in common with Cooper et al. 2011. Due to analysis differences (see manuscripts) please use the case samples (Sampleset 1) from only one of these submissions. Control sample sets do not overlap and may be combined. See Variant Summary counts for nstd100 in dbVar Variant Summary. PubMed:Coe et al. 2014.
Abel et al. 2020 (nstd200) | 299,092; 299,092 | nstd200 variants | We used a scalable pipeline to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. See Variant Summary counts for nstd200 in dbVar Variant Summary. PubMed:Abel et al. 2020.
Almarri et al. 2020 (nstd209) | 152,814; 167,799 | nstd209 variants | Structural variants contribute substantially to genetic diversity and are important evolutionarily and medically, but they are still understudied. Here we present a comprehensive analysis of structural variation in the Human Genome Diversity panel, a high-coverage dataset of 911 samples from 54 diverse worldwide populations. We identify, in total, 126,018 variants, 78% of which were not identified in previous global sequencing projects. Some reach high frequency and are private to continental groups or even individual populations, including regionally restricted runaway duplications and putatively introgressed variants from archaic hominins. By de novo assembly of 25 genomes using linked-read sequencing, we discover 1,643 breakpoint-resolved unique insertions, in aggregate accounting for 1.9 Mb of sequence absent from the GRCh38 reference. Our results illustrate the limitation of a single human reference and the need for high-quality genomes from diverse populations to fully discover and understand human genetic variation. See Variant Summary counts for nstd209 in dbVar Variant Summary. PubMed:Almarri et al. 2020.
A CNV morbidity map of developmental delay (nstd54) | 81,345; 468,909 | nstd54 variants | Copy Number Variants from 15,767 cases of Developmental Delay and Intellectual Disability from Signature Genomics, and 8,329 Control Samples. This study contains samples in common with Coe et al. 2014. Due to analysis differences (see manuscripts) please use the case samples (Sampleset 1) from only one of these submissions. Control sample sets do not overlap and may be combined. See Variant Summary counts for nstd54 in dbVar Variant Summary. PubMed:Cooper et al. 2011.
ClinGen Curated Dosage Sensitivity Map - obsoleted (nstd45) | 364; 397 | nstd45 variants | Genes/genomic regions with sufficient evidence supporting (pathogenic) or refuting (benign) dosage sensitivity as a mechanism for disease. Evidence is evaluated on a continual basis by the ClinGen Structural Variation Working Group as described in Riggs et al. 2012. See Variant Summary counts for nstd45 in dbVar Variant Summary. PubMed:Riggs et al. 2011.
1000 Genomes Project (Phase 3, landmark paper) (estd214) | 61,678; 6,943,353 | estd214 variants | This study contains the structural variants from the combined release set which contains more than 79 million variant sites and includes not just biallelic SNPs but also indels, deletions, complex short substitutions and other structural variant classes. It is based on data from 2504 unrelated individuals from 26 populations around the world. Most of the structural variants reported here can also be found in estd219, where they were reported in a companion paper specifically dedicated to SV analysis. See Variant Summary counts for estd214 in dbVar Variant Summary. PubMed:1000 Genomes Project Consortium et al. 2015.
