Variant interpretation using population databases: Lessons from gnomAD

Abstract Reference population databases are an essential tool in variant and gene interpretation. Their use guides the identification of pathogenic variants amidst the sea of benign variation present in every human genome, and supports the discovery of new disease–gene relationships. The Genome Aggregation Database (gnomAD) is currently the largest and most widely used publicly available collection of population variation from harmonized sequencing data. The data is available through the online gnomAD browser (https://gnomad.broadinstitute.org/) that enables rapid and intuitive variant analysis. This review provides guidance on the content of the gnomAD browser, and its usage for variant and gene interpretation. We introduce key features including allele frequency, per‐base expression levels, constraint scores, and variant co‐occurrence, alongside guidance on how to use these in analysis, with a focus on the interpretation of candidate variants and novel genes in rare disease.


INTRODUCTION
Reference population databases are a powerful tool for understanding the biological function of genetic variation. Population frequency data allow the rare variants that are more likely to be the cause of Mendelian disorders to be distinguished from the millions of common and largely benign variants present in every human genome.
In the era prior to the availability of large sequenced cohorts, the frequency of candidate pathogenic variants was typically defined through the painstaking genotyping of small in-house cohorts of healthy individuals. However, over the last decade, a series of databases have provided increasingly more accurate and comprehensive genomewide estimates of variant frequency through the generation and aggregation of large collections of human sequencing data. The 1000 Genomes Project was a pioneer in creating a publicly available reference database of variation from sequence data (1000Genomes Project Consortium et al., 2010, followed by the Exome Sequencing Project, where 6,500 European and African American individuals were sequenced and aggregate data was shared on the Exome Variant Server (Fu et al., 2013). The need for a larger and more diverse reference population database was well recognized (MacArthur et al., 2014), and the first large-scale aggregation of existing sequence data from 60,000 individuals, the Exome Aggregation Consortium (ExAC) dataset, was released in 2014 (Lek et al., 2016). With the addition of genome data to ExAC, it was renamed as the Genome Aggregation Database (gnomAD), that today has variant data from more than 195,000 individuals and is currently the most widely accessed reference population dataset with over 150,000 weekly page views ( Figure 1) (Karczewski et al., 2020). Other large databases include NHLBI's Trans-Omics for Precision Medicine (TOPMed)-BRAVO and the Geisinger Healthcare System DiscovEHR dataset (Dewey et al., 2016;Taliun et al., 2021).
Given known mutation rates, it is almost certain that every possible single base change compatible with life exists in a living human. Synonymous variation is under less selective pressure than missense or loss of function (LoF) variation and can be used to estimate how close gnomAD is to sampling the full spectrum of natural human variation.
As of now, gnomAD is approaching saturation for the highly mutable CpG dinucleotides  (grey), 5-10,000 (pink), and more than 10,000 (purple). (c) Schematic of the distribution and overlap of more than 195,000 unique individuals in gnomAD v2 exomes (orange), v2 genomes (green) and v3 genomes (violet). (Duncan & Miller, 1980;Lander et al., 2001), with 85% of all possible synonymous CpGto TpG transitions observed (Karczewski et al., 2020). However, across non-CpG trinucleotide contexts, less than 12% of possible synonymous variants have been observed in gnomAD, indicating that a much larger number of individuals will need to be sequenced before we begin to discover the full spectrum of tolerated variation. The fraction of observed variants is even lower for variants under purifying selection, with less than 4% of nonsense variants currently observed in gnomAD.
With the existing sample size of gnomAD, an individual will on average carry about 200 very rare coding variants (gnomAD allele frequency < 0.1%). This number varies by ancestry, partly depending on the populations represented in the database, but is also influenced by the heterozygosity rate. At the current size, each individual has tens of variants that are absent from gnomAD. Within one individual's exome, there is a mean of 27 ±13 novel coding variants that are absent in all other gnomAD individuals (variants unique to that individual), with more novel variants in East Asians (35 ± 11), and South Asians (38 ± 14) and fewer in African/African Americans (21 ± 7), Latino/Admixed Americans (19 ± 11) and Europeans (23 ± 11) ( Figure 2B and Supp. Table S1), correlating with sample sizes and representation in the dataset for each continental population. For population-groups not well represented in gnomAD, we would expect these numbers to be even higher. Often there is limited evidence to rule out pathogenicity for the variants not observed in a population database, resulting in an increased number of variants of uncertain significance in clinical genetic testing and highlighting the need for continued aggregation of sequencing data to improve the accuracy of genetic test interpretation (Naslavsky et al., 2021). The gain from increased sample size and improved representation is demonstrated by the decrease in number of unique variants per individual when utilizing the entire gnomAD dataset versus v2 Figure 2: (a) Mean count of coding very rare variants (allele frequency < 0.1%), and (b) mean count of unique coding variants (across v2 and v3) grouped by population; black bar represents the 95% confidence interval. (c) Comparison of genome-wide distribution of loss of function (LoF) constraint scores in 19,197 genes, colored by LOEUF decile; a continuous distribution for LOEUF score and a dichotomous-like distribution for pLI scores. Dotted line marks suggested thresholds for LoF constrained genes at pLI ≥ 0.9 and LOEUF < 0.35 in gene interpretation. exomes only (Supp. Figure S1 and Supp. Table S2).
This review provides guidance for using the gnomAD browser and key features like allele frequency, per-base expression levels, constraint scores and variant cooccurrence, for variant and gene interpretation in clinical and research analysis.

Individuals represented in gnomAD
The gnomAD browser displays summary statistics and aggregate variant data from deidentified exome and genome data, with approaches consistent with the guidance in the NIH data sharing policy (NOT-OD-03-032) to reduce the risk of subject identification.
The gnomAD database aggregates data from over 195,000 individuals through a worldwide collaborative effort on data sharing. More than 140 principal investigators have contributed genome data from over 60 studies, including data from several other population datasets (https://gnomad.broadinstitute.org/about). Most of the sequence data in gnomAD is generated for case-control studies of common adult-onset disease, such as type 2 diabetes, psychiatric disorders, and cardiovascular disease. No sequencing has been done for the purpose of depositing data in gnomAD; some data has been reprocessed for inclusion, particularly sequence data from populations that are under-represented in gnomAD. Data contributions were made with an assumption that no phenotype or individual-level data would be shared with users. However, access to many datasets included in gnomAD are available through the Database of Genotypes and Phenotypes (dbGaP) and the Analysis, Visualization, and Informatics Lab-Space (AnVIL) as well as other repositories (Schatz et al., 2021;Tryka et al., 2014).
Aggregation of data from disparate sources and platforms has been made possible by uniform joint variant calling using a standardized BWA-Picard-GATK pipeline (Van der Auwera et al., 2013) and Hail for data processing, analysis, and the addition of a gVCF combiner used in the v3 dataset. The aggregated dataset has been subjected to thorough sample and variant quality control (QC), with samples removed if they have low coverage, too many or too few variants for the population, or sex aneuploidy. To avoid inflation of allele frequencies for rare variants, first and second degree relatives have also been removed. In addition, to create a dataset as close as possible to a general population reference, individuals known to be affected with severe pediatric disease, as well as their first degree relatives, are also excluded. An allelespecific random forest approach (Karczewski et al., 2020) or the allele-specific version of GATK Variant Quality Score Recalibration (VQSR) have been applied to distinguish true genetic variants from artifacts. Additionally, variants were removed if no sample harboring the variant had a high quality genotype (depth ≥ 10, genotype quality ≥ 20, minor allele fraction ≥ 0.2 for non-reference heterozygous variants). With this design, gnomAD is particularly suitable for aiding in the interpretation of variants in rare disease genetic analysis. !

gnomAD version 2 and version 3
As new cohorts are added, new versions of gnomAD are released. To date, the database consists of two versions, v2.1.1 and v3.1.2 (referred to as v2 and v3) released after the original ExAC database that is largely represented within gnomAD v2 as well as a separate dataset on the browser (Figure 3:1). With 125,748 exomes and 15,708 genomes aligned to GRCh37, gnomAD v2 is preferable over v3 for interpreting coding variants. The current v3 release has 76,156 genomes aligned to GRCh38, providing more data for noncoding regions or coding regions not covered well in exomes, such as regions with high GC content or regions not targeted with exome capture.
While gnomAD does not contain duplicated individuals, or first or second degree relatives within a version release, there is significant overlap between v2 and v3, which is important to note if using both versions for variant interpretation. The majority of v2 genomes (81%) are also in v3, additionally 6% of v2 exomes are also represented as genomes in v3. In total, approximately 14% of the individuals with exome or genome data in v2 have genome data in v3; 26% of individuals in v3 are also present in v2 ( Figure 1C). The overlap of individuals can be resolved on the browser by looking at the v3 non-v2-dataset (further explored in section 3.2.1), important for rare disease analysis that benefits from investigating the entire gnomAD dataset (Supp. Figure S1 and Supp. Table S2). The planned v4 release will include the exomes and genomes from v2 and v3, along with additional data for an expected database of over 500,000 samples aligned on GRCh38, which will be the recommended reference dataset for all analyses.

Structural and mitochondrial variants
This review focuses on the interpretation of single nucleotide variants (SNVs) and indels from the nuclear genome. The gnomAD browser also provides allele frequencies for structural variants (SVs) and mitochondrial variants. As part of gnomAD v2, there are annotations for ~445,000 SVs from 10,738 genomes (Collins et al., 2020) that can be explored using the search bar menu on the landing page or the gene page (Figure 3:1 and 3:2). Mitochondrial variants are available from 56,434 gnomAD v3 genomes that can be found by searching "MT-" followed by gene name or "M-" followed by a mitochondrial chromosome position. The release includes mitochondrial specific data such as homoplasmic and heteroplasmic calls as well as both population and haplogroup-specific allele frequencies (Laricchia et al., 2021).    (17)  GRCh38. We display the transcripts and tissue expression profiles using data provided by the Genotype-Tissue Expression (GTEx) project, a resource for gene expression and regulation with RNA-sequencing of samples from 54 non-diseased tissue sites across nearly 1,000 individuals (GTEx Consortium, 2015). The transcript with the highest mean expression can be inferred from the size of the dot placed next to each transcript ( Figure 3:7). In this example, the second listed NSD1 ENST00000354179.4 transcript has the highest mean expression across all tissues, with the highest tissue specific expression in "Brain -Cerebellar Hemisphere". More detail is provided in the heatmap of transcript tissue expression (Supp. Figure S2), found by selecting "Show transcript tissue expression" (Figure 3:8).

Gene constraint metrics
Constraint metrics are key features of the gnomAD database and have been widely used to aid gene and variant interpretation in rare disease (Bamshad, Nickerson, & Chong, 2019;Oved et al., 2020). The constraint metrics are based on a gene's observed vs. expected number of very rare SNVs (allele frequency < 0.1%), corrected for sequence context and coverage. Constrained genes have fewer variants than expected and are under a higher degree of selection than less constrained genes. The metrics are divided by mutation class: synonymous, missense, and predicted loss of function (pLoF). The constraint scores for the canonical transcript are shown in the constraint table on the gene page; additionally, an updated transcript-specific constraint table is available for each transcript on the transcript page.
The degree of synonymous and missense constraint is measured by an observed to expected ratio (o/e) and the Z-score (Figure 3:9). A positive Z-score indicates fewer variants observed than expected, hence increased constraint (intolerance to variation), and a negative Z-score indicates that a gene has more variants observed than expected. Synonymous variants are classically not under evolutionary selective pressure indicated by a Z-score close to zero or an o/e ratio close to one. However, genes with highly similar paralogs, pseudogenes, or segmental duplications may have mapping challenges with short read data that can inflate observed variant counts. Hence a Z-score for synonymous variation deviating significantly from zero may indicate that the constraint measurement is unreliable for that gene and interpretation of all constraint data for the gene should be done with care.
As an example the HIST1H4E gene has a synonymous Z-score of -9.89. In addition to missense constraint, some genes will have regions of missense constraint, often overlapping with functionally important domains (Samocha et al., 2017). The regional missense constraint track can be viewed by selecting the ExAC subset on the gene page. Of note, the track is only shown for genes where regional missense constraint is observed ( Figure 3:2, Supp. Figure S2). LoF constraint is measured by two different scores: probability of being LoF intolerant (pLI) and LoF observed/expected upper bound fraction (LOEUF) ( Figure   3:10). The pLI score was developed for the ExAC release (Lek et al., 2016) on the premise that genes can be divided into genes where biallelic LoF is tolerated by natural selection (LoF tolerant genes, pLI close to 0) and genes where LoF is not tolerated (haploinsufficient genes, pLI close to 1) ( Figure 2C, right). However, the increased sample size of gnomAD enhanced the power to detect LoF constraint with a continuous metric; LOEUF is a conservative estimate (upper bound) of the observed/expected ratio of the LoF effect predictor (LOFTEE) high confidence SNV-LoF variants ( Figure 2C, left). The use of constraint score in analysis is further explored under section 4.2.

Expression score (pext)
Using per-base expression annotations aids in determining if a variant occurs in a biologically relevant exon and helps deprioritize variants unlikely to impact gene function (Abou Tayoun et al., 2018). The gnomAD browser displays the mean proportion expressed across transcripts (pext) score, based on GTEx v7 data from 800 individuals (GTEx Consortium, 2015). The pext score is unique in providing a normalized expression value for each position in a gene. By doing so, pext allows quick visualization of the mean expression of exons across a gene either as an aggregate score including all tissues ( Figure 3:11, blue track), or separated across 38 different tissues by selecting "show tissues" (Cummings et al., 2020). The NSD1 gene page demonstrates the applicability of pext by occurrence of pLoF variants in gnomAD individuals in low pext regions and, to a larger extent, a lack of pathogenic ClinVar variants in the same regions ( Figure 3:12). Of note, pext is based on adult postmortem tissue and may not accurately represent genes with differential expression during development (GTEx Consortium, 2015). The use of pext in variant interpretation is explored under section 4.3. , the clinical significance of each variant will be indicated by shape. This allows for visualization of patterns that can inform variant classification such as identifying genes where there is a predominant type of pathogenic variation (LoF or missense variation) or regional clustering of missense variation (hot spot). NSD1 ( Figure 3) displays clear enrichment of pathogenic (red) pLoF (X) variants, consistent with haploinsufficiency of NSD1 resulting in Sotos syndrome. There are also pathogenic missense variants (red triangles) that cluster within a portion of NSD1 with evidence of regional missense constraint, harboring the SET and PWWP domains (Supp. Figure S3). The variant table lists all variants in a gene sorted by genomic position ( Figure   3:19, Variant ID). The most severe consequence across transcripts is noted in the HGVS Consequence column. When the most severe consequence occurs in a transcript that is non-canonical, the HGVS nomenclature is denoted with a † symbol.

Variant tracks and the variant table
The variants in the table can be filtered by variant effect, exomes or genomes, and SNV or indels, and there is also an option to include variants that did not pass gnomAD quality control; however, these variants should be interpreted with caution ( Figure 3:20).
Configuring and re-ordering the variant table ( Figure 3:21), and sorting by column values of interest is also possible, for example including pext score for each variant site.
Some pLoF variants will have a LoF curation verdict and warning flags noted in the table (Supp. Figure S4). More information about LOFTEE and the flags it generates as well as the manual LoF curation results that are present for a subset of variants and genes, can be found by hovering over these flags, or on the variant page. Furthermore, the customized variant table can be downloaded as a comma separated values (CSV) file for local analysis ( Figure 3.22).

The variant page
Searching for a variant in the search bar (   (8) visualization of v3 non-v2 dataset; (9) age data; (10) genotype/depth/allele balance for heterozygotes and (11) site quality metrics; (12) read data and (13) the option to load read data for additional individuals. Presence in v2 genomes only is the most common reason for low allele number and it is not a cause for concern. The regional coverage should be investigated by looking at The dataset menu (Figure 4:6) provides the opportunity to explore allele frequency in specific subsets of gnomAD (non-cancer, non-neuro, non-TOPMed, control and non-v2). The control subset consists of individuals reported as controls in common disease studies or included from a biobank, but these individuals may still have medical conditions. For example, someone who participates as a control in a type 2 diabetes study could have a past history of, or in the future develop cancer or neuropsychiatric disease and still be included in the control subset. This is one reason that these subsets should not be used as a control set in common disease studies (further discussed in section 5). While we expect individuals with severe early-onset disease to be heavily depleted from gnomAD, individuals with conditions that would not prohibit participation in common disease research are likely included at mildly depleted or similar levels to the general population, particularly for phenotypes like infertility, vision and hearing impairment, and conditions with late onset or reduced penetrance. It can be useful to investigate the allele count across the database for rare variant analysis (Supp. Figure   S1 and Supp. Table S2)

Age data
Age data is available for a subset of individuals. While it is defined as the last known age, for some cohorts it is the age at enrollment (Figure 4:9). The data is displayed as a distribution including the age of carriers and the age of all individuals in the dataset (striped), colored by sequencing method (blue for exome, green for genome). For variants in genes associated with later onset conditions, age data can reveal if a carrier is of an age younger than onset of disease, which might explain the presence of a disease-variant in gnomAD. In general, age data is of limited use in variant interpretation given the lack of available phenotype data on these individuals.

Quality control
Variants reported in gnomAD have passed robust quality control including hard filters and a random forest model for assessing both the quality of the variant and the site.
However, presence of artifacts is inevitable in any reference database and careful review of individual variants is important, especially in rare disease analysis. While sequence artifacts occur relatively evenly across the genome, biologically important sequences are depleted for natural variation, which results in a relative enrichment of artifacts compared to natural variation in disease-associated regions (Gudmundsson et al., 2021).
Genotype quality metrics (Figure 4:10 interpretation. For detailed information on how to use IGV we refer to the IGV User Guide and previous articles (Robinson, Thorvaldsdóttir, Wenger, Zehir, & Mesirov, 2017). Not all CRAMS were still available during gnomAD production to generate read data. For variants where read data is missing from the variant page of one version of gnomAD (v2, v3, and ExAC), we suggest investigating if they are represented in another version.

VARIANT INTERPRETATION USING gnomAD
The gnomAD dataset is used in the majority of rare disease analysis pipelines in both diagnostic and research settings around the world ( Figure 1B). The American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP) have defined standards for variant classification (Richards et al., 2015). These provide rigorous guidance for the evaluation and aggregation of variant evidence, including the use of reference databases such as gnomAD. They defined terminology for variant classification including five categories: benign, likely benign, uncertain significance, likely pathogenic, and pathogenic. Further, they defined four major areas in which variants can be awarded with evidence that determines their final classification including: population, computational, functional, and segregation data.

Allele frequency in variant interpretation
The vast majority of pathogenic variants are rare, hence identifying rare variants is an essential step in Mendelian analysis. It is important to remember that most rare variants are not pathogenic and rarity is consistent with, but not sufficient for, determining pathogenicity (Figure 2A and B). Variants that are absent from gnomAD, or present at a lower frequency than expected (particularly for recessive disease variants where unaffected carriers are expected), were initially considered a moderate level of evidence for pathogenicity (PM2) (Richards et al., 2015). However, rarity or absence in population databases has been recommended to be downgraded to supporting evidence by the Clin Gen Sequencing Variant Interpretation (SVI) Working Group, given that most unrepresented variation is benign and reference population databases are far from saturation for most variation types (Karczewski et al., 2020).
Population evidence for pathogenicity can also be applied for variants that are more prevalent in affected individuals compared to controls (PS4). There have been some attempts to use gnomAD as a control population in formal association studies, but in general this is not recommended. Challenges to this approach include the lack of information about case numbers for any specific disease in gnomAD, the inability to correct for confounders such as population stratification (which would require individual-level data access), as well as differences in the technical processing and QC of data between cases and controls which can lead to erroneous associations .
To most effectively filter out common variants, we recommend using the popmax allele frequency, defined as the population maximum allele frequency in the continental populations (African/African American, East Asian, European, Latino/Admixed American, and South Asian). Generally, if a variant is common in one population, it can be assumed to be benign across all populations. Consideration should be taken if studying a condition that is much more common in one population.
Following ACMG variant classification guidelines, stand-alone benign (BA1) evidence should be applied to variants with an allele frequency ≥ 5%, unless the variant was previously noted to be pathogenic, as high allele frequency can be a result of low penetrance in monogenic disease genes (Ghosh et al., 2018). Hypomorphic variants in particular may have a higher allele frequency. An allele frequency of > 1% is considered strong evidence that the variant is benign (BS1); however, certain recessive disorders can have common pathogenic variants that rise above this threshold (e.g. Phe508del in CFTR associated with Cystic Fibrosis, OMIM #219700); these well-established variants can often be identified using ClinVar. There are occasions when 1% is a too conservative threshold, particularly for severe, dominant disorders. Whiffin et al. has developed a more refined frequency filtering approach (Whiffin et al., 2017). A maximum credible allele frequency for the specific condition is defined using information about the prevalence, inheritance mode, penetrance, and genetic architecture (accounting for maximum genetic or allelic contribution). As gnomAD is a sampling of the general population, a filtering allele frequency (FAF) is generated from the popmax allele frequency to adjust for sampling variance (Figure 4:3). If the FAF is higher than the maximum credible population allele frequency, then benign evidence (BS1) can be applied.
Estimating allele frequency in genes affected by clonal hematopoiesis has been a particular challenge (Carlston et al., 2017;Karczewski et al., 2020) as somatic variants that increase proliferation of the hematopoietic lineage can rise to high allele fraction (Jaiswal et al., 2014). A warning has been added to the gene page for genes where this phenomenon has been shown to occur (i.e. ASXL1, DNMT3A, TET2).
Variants in these genes should be interpreted with caution in any reference population database (Jaiswal et al., 2014).

Constraint scores in variant interpretation
Constraint scores are useful to indicate when specific types of variation are depleted in a gene. For example, haploinsufficient genes often have a high pLI score and a low LOEUF score; therefore pLoF variants occurring within LoF-constrained genes are of high interest (Bamshad et al., 2019). Choice of score may be influenced by whether the analysis is more suited to a continuous (LOEUF) or dichotomous-like (pLI) metric ( Figure 2C). When a cut off is being applied, we recommend using pLI ≥ 0.9 (3,060 genes in v2) or LOEUF < 0.35 (2,968 genes in v2). While many LoF constrained genes are associated with Mendelian disease to date, 68% (2,071 of 3,060) of LoF constrained genes (pLI ≥ 0.9) are not yet linked to a phenotype in humans.
Genes with a missense Z-score ≥ 3.1 are significantly depleted for missense variation. Unlike LoF variation, there may be only a region of a gene that is intolerant for missense variation rather than the entire gene, often overlapping with a protein domain.
Reviewing the ClinVar track for pathogenic variants along with the regional missense constraint track (Supp. Figure S3) can help identify hotspots or domains without benign variation, providing moderate evidence towards variant pathogenicity (PM1) (Harrison, Biesecker, & Rehm, 2019). For missense constrained genes with many pathogenic missense variants, this can be considered supporting evidence of pathogenicity (PP2) (Harrison et al., 2019). At the other end of the missense variation spectrum lie genes without evidence of missense constraint (Z scores around zero or less) and where only truncating variants have been reported as pathogenic, which is considered as supporting benign evidence for classification (BP1) for missense variants.
In rare disease analysis, careful attention should be paid to rare variation in constrained genes, both in prioritizing variants and also in identifying novel diseasegene relationships. There are some caveats that are worth noting. Constraint is more commonly seen for dominant disease genes, particularly for phenotypes that are absent or depleted in gnomAD. It is not as informative in the interpretation of recessive disease, as carriers of recessive disease variants will be present in gnomAD. As constraint is due to negative selection, variation that results in a phenotype later in life, particularly in post-reproductive years, may not be depleted in gnomAD. For example, LoF variants in the BRCA1 gene are strongly correlated with risk for breast and ovarian cancer, yet BRCA1 does not show evidence of constraint for LoF variation, as the phenotype presents post-reproduction and also is of lower penetrance in males.
A full list of all constraint metrics is available on the gnomAD Downloads page (https://gnomad.broadinstitute.org/downloads#v2-constraint) including genome-wide pLI and LOEUF ranking of all genes.

Per-base expression in variant interpretation
Proportion expressed across transcripts (pext) scores can be used to examine the perbase expression pattern across transcripts and exons as well as in a tissue of interest (Cummings et al., 2020). Regions with low pext scores are likely of less biological importance. While the pext score is helpful in the interpretation of any coding variant, it can be particularly useful when deciding whether to apply the very strong evidence for a pLoF variant in a gene where LoF is a known mechanism of disease (PVS1). A pext score < 0.2 is evidence that a pLoF at that site may not be of biological relevance and PVS1 may not apply.
For conditions impacting specific tissues, the pext score for that tissue can be reviewed, as transcript expression between tissues can differ. Of note, pext score is based on adult tissue and the developmental expression profile may be different.
Additionally, particularly long genes can be affected by the 3' bias of polyA tail transcriptome sequencing to a degree which is not always adequately corrected for in pext scores, and genes with this pattern should be interpreted with caution (for example DMD). It is important to consider the relative value compared to the mean and maximum pext value of the gene. For example, if a variant falls in an exon with a pext of 0.1, but the gene average is 0.2 (such as in the setting of expression of a noncoding transcript), the data does not provide any information for interpretation, and PVS1 may still apply.

Variant co-occurrence
The gnomAD variant co-occurrence feature allows investigation of the statistical likelihood of two variants occurring on the same or different haplotypes in individuals in gnomAD. This can be helpful in rare disease analysis to deprioritize compound heterozygous genotypes that are seen in the general population or to predict phasing where only a proband is available for sequencing and two rare variants are observed within a gene. The co-occurrence can be assessed if both variants are in gnomAD exomes, appear in the same gene, have an allele frequency ≤5%, and are coding, in the splice region, or in the 5' or 3' untranslated regions (UTRs). When two variants co-occur in individuals in gnomAD more often than expected based on population frequencies, then the variants are predicted to be on the same haplotype (Supp. Figure S5A). If two variants co-occur in some individuals in gnomAD but only at the expected rate, then these variants are predicted to be on different haplotypes, and it is unlikely that this variant combination is deleterious for phenotypes that are not expected to be seen in gnomAD (Supp. Figure S5B). If two variants are found in gnomAD but do not co-occur in any individuals in gnomAD, then they are likely on different haplotypes but no information would be available about the potential impact of a compound heterozygous genotype. These estimates will be more accurate when considered within a specific population. A detailed description of the varant co-occurrence feature is found in the News section (https://gnomad.broadinstitute.org/news).

gnomAD flags and warnings
The gnomAD browser provides flags and warnings displayed in the variant table and on the variant page to highlight variant details important for interpretation. Variants with flags or warnings should not be automatically discounted but we advise careful consideration of whether these may impact the analytical validity or the effect of the variant.

Multi-nucleotide variants
When two variants in the same codon are present on the same strand, the variant consequence may be misrepresented as the type if each variant were present independently rather than interpreting the variant combination. The assumption of independence is present in almost all current variant annotation pipelines. Multinucleotide variants (MNVs) that occur within a codon are annotated in gnomAD v2 data to aid interpretation of their combined effect (Wang et al., 2020). For example, two variants independently annotated as nonsense and missense could result in a missense variant when interpreted in combination (Supp. Figure S6). Specific information about the MNVs identified in gnomAD can be found on the MNV page, including the number of individuals who have the MNV versus either of the SNVs.
Frame restoring indels (for example a 4 base pair deletion and nearby 5 base pair deletion on the same haplotype) are not annotated in the gnomAD browser.
However, these are an important source of rescues and can be identified by inspecting the read data (Supp. Figure S7).

Loss of function transcript effect estimator (LOFTEE) and manual LoF curation
The LOFTEE package was developed to complement pLoF annotations by variant effect predictor (VEP) and filter variants that are unlikely to result in LoF. Variants are determined as low-confidence if predicted to not result in LoF due to criteria such as terminating at the 3' end of a gene or affecting splicing of the UTR. The remaining variants are categorized as high-confidence (Karczewski et al., 2020). Around 14% of high-confidence pLoF variants in gnomAD have additional flags (i.e. single exon genes, poor conservation by Phylogenetic Codon Substitution Frequencies (PhyloCSF) (Lin, Jungreis, & Kellis, 2011), NAGNAG splice acceptor sites, and non-canonical splice sites) that can be found on the variant page together with the VEP information. The additional flags differ from those variants that are filtered as low-confidence pLoF as they generally relate to the properties of individual transcripts or exons and can be overruled by gene-specific knowledge.  Figure S4). The flags and manual curation verdicts will often impact whether PVS1 can be applied; however, careful review is essential in determining if these flags change the interpretation of the variant in a specific context. Of note, pLoF variants suggested to not result in LoF or pLoF variants with flags could still be pathogenic, as LoF might not be the only mechanism of disease.

Low complexity regions
Low complexity sequences are enriched for artifacts. The low complexity region (LCR) flag highlights variants identified in these regions to allow a more careful review (Morgulis, Gertz, Schäffer, & Agarwala, 2006). The allele frequency of variants in LCRs might be skewed because of enrichment for artifacts, and population frequency evidence (PM2/BA1/BS1/BS2) should be applied cautiously. An additional type of LCR that is often not flagged in gnomAD are homopolymer runs, which also show an enrichment of sequence artifacts. Skewed allele balance can provide evidence that the variant is an artifact in the population data. However, pathogenic variants also commonly occur here. If the variant is Sanger confirmed in the patient, it should not be discounted as causal despite being present at an apparent appreciable frequency in population data.

Variant page warnings
There are two main warnings that may be seen on the variant page underneath the header. Firstly, variants that are covered in less than 50% of individuals are highlighted since there is risk that variant allele frequencies may be inaccurate (Supp. Figure S8).
These variants often fall in regions that are difficult to sequence with current exome and genome methods. Coverage can also vary because of differences in exon capture between different sequencing platforms. If a variant is better covered in the genome data, using v3 allele frequencies might be more appropriate.
Secondly, heterozygous variants with a high proportion of alternate reads (allele balance for the variant is ≥ 90%) are highlighted (Supp. Figure S9). Heterozygous variants with inflated alt-reads are likely homozygous variants called heterozygotes due to contamination that affects the variant calling likelihood models. This depletion of homozygous calls can incorrectly deflate the allele frequencies .

LIMITATIONS OF REFERENCE POPULATION DATABASES
The gnomAD database is a useful, publicly available collection of human sequence data, but there are a number of caveats that are important to note when drawing inferences about variant pathogenicity from this resource (or, indeed, most other existing variant databases).
It is important to note that some individuals with Mendelian disease may still be included in the datasets. We suggest caution about excluding variants as disease candidates when seen in one or a few individuals. Also, as demonstrated in Figure 2A, 2B, and prior work (Karczewski et al., 2020), we are still far from representing all possible variation, and a variant's absence from gnomAD is consistent but far from sufficient evidence for its involvement in disease ( Figure 2B).

Phenotype and other individual-level data is not available for individuals included
in the aggregate gnomAD data. This is better accessed through a biobank or other study, such as the UK Biobank, BioMe, FinnGen, or All of Us. Because individuals known to have Mendelian phenotypes have been removed, gnomAD would not be useful to try to match for patient phenotype. Other resources for this type of matching are available, including Genotype 2 Mendelian Phenotype (Geno2MP) and VariantMatcher, which are aggregated databases of rare disease sequence data with associated HPO terms and the ability to contact the researcher.
The over-representation of European participants in genetics studies is reflected in the database. There is poor representation of many communities, including African, Middle Eastern, and Oceanian populations, leading to patients from these communities having more rare variants of uncertain significance. Improving diverse representation of populations is of high priority, with resources dedicated to reprocessing available datasets for inclusion in gnomAD.
Despite extensive quality control, gnomAD (like any large genomics resource) contains sequencing and annotation artifacts, and we have suggested approaches to evaluate variant quality.

RESOURCES
Additional features are frequently released on the gnomAD browser. Extensive feature releases are described in the News section (https://gnomad.broadinstitute.org/news) and updates to the browser are detailed in the Changelog (https://gnomad.broadinstitute.org/news/changelog). Additionally, updates are announced by @gnomAD_project on Twitter.
On the browser, additional information is available via the "?" buttons located throughout the gnomAD pages and on the Help page

CONCLUDING REMARKS
The gnomAD resource illustrates both the power and the challenges of interpreting human biology using large-scale genomic datasets. The sheer size of gnomAD makes it possible to obtain accurate estimates of allele frequency extending down to incredibly rare variation, and to explore the patterns of variation across genes and regions of genes. This power has proven invaluable for variant interpretation in patients with rare genetic disorders, and has also empowered a wide range of scientific applications including comparison of the mutational intolerance of mouse and human genes (Dickinson et al., 2016), estimation of selection coefficients (Cassa et al., 2017), Future releases of gnomAD will further increase the size and scope of the resource, leading to improved power for all downstream applications. New releases of structural variation calls will increase the resolution of analysis for large deletions, duplications, inversions, and complex rearrangements, while the next exome release (expected to exceed 500,000 individuals, mapped to GRCh38) will dramatically enhance power for assessing coding allele frequency as well as constraint against gene disruption and regional missense variation. Over time we expect that improvements in variant-calling methods for currently under-represented classes of variation, such as repeat expansions and complex structural rearrangements, will provide increasingly accurate frequency estimates for these variants and support the discovery of additional pathogenic alleles.
Finally, a major ongoing focus on increasing the representation of diverse ancestries, both from the gnomAD aggregation effort, and from the broader human genomics community, will be needed to improve the applicability of this database to currently under-represented populations. This effort will require greater efforts to ensure these communities are included in global genomics projects, as well as ensuring that the resulting data are shared with aggregation efforts in a manner that balances accessibility with respect for the wishes of communities and individuals, especially for Indigenous peoples (Hudson et al., 2020). Increased representation of all communities will decrease the number of variants of uncertain significance in patients from currently under-represented ancestries, while also improving the power of this resource for all communities.

ACKNOWLEDGMENTS
We thank the individuals whose data is in gnomAD for their contributions to research.
We thank Ellie Seaby for generous data sharing and inspiration for Figure 2C

CONFLICT OF INTEREST
DGM is a founder with equity of Goldfinch Bio, and serves as a paid advisor to GSK, Variant Bio, Insitro, and Foresite Labs. AO-DL is on the Scientific Advisory Board for Congenica.

Analysis of very rare coding variants in gnomAD populations
A sample of 100 individuals was randomly selected from each of the following populations in the gnomAD v2.1.1 exome dataset: African/African American, Latino/Admixed American, East Asian, Non-Finnish European, and South Asian.
Variants were then filtered to only very rare variants using a popmax allele frequency of < 0.1% (popmax refers to the maximum frequency across continental populations in gnomAD, not including Ashkenazi Jewish, Finnish, or any other remaining samples) across the entire gnomAD dataset (v2 exomes, v2 genomes, and v3 genomes).
Variants were also excluded if they had not passed gnomAD quality control or had been flagged as part of a problematic region (low complexity, decoy, and segmental duplication). Unique variants for the 100 individuals were obtained by filtering to variants with AC of 1 in gnomAD v2 exomes and no observations in v2 genomes or in the v3non v2 subset.
The variant annotations previously applied by the Variant Effect Predictor (VEP; version 85) were used to assign each variant to one of three categories: pLoF, missense/inframe indel, or synonymous (McLaren et al., 2016). The VEP annotations were filtered to the most severe consequence for the canonical transcript. A variant was annotated as pLoF if the most severe consequence was one of: "splice_acceptor", "splice_donor_variant", "stop_gained", "frameshift_variant" and it was found to be highconfidence (indicated by LOFTEE) (Karczewski et al., 2020). The missense/indel annotation was applied if the most severe consequence was one of: "missense_variant", "inframe_insertion", "inframe_deletion". A synonymous annotation was used for a variant if the most severe consequence was "synonymous_variant".
Analysis was performed using Hail (https://hail.is/). Variants were exported to a tsv for data visualization in R. The code used to select individuals and filter variants can be found on GitHub: https://github.com/broadinstitute/gnomad_review_hum_mut. Supplementary Figure S1: Total unique variants in two sets of 100 randomly selected individuals filtered using v2 exomes (n = 125,748; left) versus the entire gnomAD dataset (n = 197,912; right). The increased sample size results in a decrease in the number of unique variants per individual in all populations. The increase from ~8,000 African/African American individuals in v2 exomes to ~30,000 individuals across the entire database results in a significant decrease in unique variants from mean 42 (± 11) to 21 (± 7) variants per individual in this population (purple). In comparison the minor increase in South Asian samples from ~15,000 to ~16,500 results in an non-significant decrease from mean 41 (± 15) to 38 (± 14) variants per individual (orange). Of note, the mean is influenced by the fact that these individuals are in gnomAD and likely from a community that has some representation in the database. An individual from a subpopulation or community that is not as well represented in gnomAD will likely have a higher number of unique very rare variants than seen in this analysis. Black bar represents the 95% confidence interval. ** p < 0.01, *** p < 0.001, ns: not significant with Student t-test. (±) Standard deviation.
Transcripts (y-axis) can be sorted by their expression in a specific tissue. In this example, sorting by Brain -Cerebellar Hemisphere lists ENST00000354179.4 first as the highest expressed transcript in that tissue. Tissues (x-axis) can be sorted alphabetically (this example) or by highest mean expression. A darker purple square indicates higher expression of a specific transcript in a specific tissue measured by transcripts per million (TPM). The exact value is available as a hover over (in this example highlighting the expression of ENST00000354179.4 in Pituitary tissue as 7.75 TPM). The highest expression is found in "Brain -Cerebellar Hemisphere" and "Brain -Cerebellum" (darkest purple). *The canonical transcript.
Supplementary Figure S3: ExAC dataset gene page for NSD1. The track is available under the pext track (black arrow) for genes with regional missense constraint. The numbers on the track refer to the proportion of observed versus expected missense variation seen in the region in gnomAD. In this example, missense constrained regions (red and orange) show clustering of pathogenic/likely pathogenic missense variants in ClinVar, while there is a depletion of missense variation in gnomAD for the same region (box). The variants are found in isolation in 68 and 170 individuals (red) and co-occur in 304 individuals (purple and blue). Given the high proportion of individuals where both variants are present, they are predicted to be on the same haplotype in the informative populations. East Asian and European (Finnish) are uninformative as there are no individuals harboring both variants in these populations. The variants must be on the same haplotype in the one individual (blue) who is homozygous for both variants. (B) In the second example chr1:55505647G>T (NM_174936.4:c.137G>T) and chr1:55523855G>A (NM_174936.4:c.1327G>A) in the PCSK9 gene are investigated.

Supplementary
They are reported to likely be on different haplotypes in all informative populations as they co-occur in only four individuals in gnomAD (purple). Thus, the presence of these four individuals in gnomAD suggest that compound heterozygosity for these two variants is not associated with a severe, early-onset, highly penetrant disorder. (c) Regional view of IDUA gene displaying low coverage in exomes (blue), but not genome (green) in gnomAD v2 for the region of the highlighted variant (yellow oval). Allele frequencies in the region are likely reliable in the IUDA gene from the genome (but not exome) data (c) but not reliable in CBS gene from either genome or exome data (b).