Leveraging Base Pair Mammalian Constraint to Understand Genetic Variation and Human Disease

Although thousands of genomic regions have been associated with heritable human diseases, attempts to elucidate biological mechanisms are impeded by a general inability to discern which genomic positions are functionally important. Evolutionary constraint is a powerful predictor of function that is agnostic to cell type or disease mechanism. Here, single base phyloP scores from the whole genome alignment of 240 placental mammals identified 3.5% of the human genome as significantly constrained, and likely functional. We compared these scores to large-scale genome annotation, genome-wide association studies (GWAS), copy number variation, clinical genetics findings, and cancer data sets. Evolutionarily constrained positions are enriched for variants explaining common disease heritability (more than any other functional annotation). Our results improve variant annotation but also highlight that the regulatory landscape of the human genome still needs to be further explored and linked to disease.


Supplementary Methods
This section contains essays that provide detail regarding the approaches taken in this work. These generally have a 1:1 connection to a section of the main manuscript. This information is intended to facilitate reproducible research and to document and justify the decisions made in these analyses. All figures called in the methods have the format Fig. SX.Y, where X is the section number, and Y is the Figure number within section X.

By: Patrick Sullivan
We describe here the latest efforts and available resources to catalog and predict the effect of genetic variation on human phenotypes including: common and rare variant association studies and catalogs of human variation and control of gene expression.

Genome-wide Association Studies (GWAS).
We downloaded the NHGRI/EMB GWAS catalog (10/2021) (1). The inline table summarizes the results, and Fig. S1.1A shows the publication of GWAS over time and Fig. S1.1B shows the cumulative genome-wide significant associations by publication date. As noted below, over 95% of the associated SNPs are non-coding.  Genotype-Tissue Expression (GTEx) has provided a large set of gene expression data in multiple tissues (2). Pairing individual gene expression with genomics-in essence, conducting GWAS for SNPs in the vicinity of every transcript-yields a catalog of eQTL-SNP/eQTL-gene pairs. The inline table summarizes the results. There are millions of eQTL-SNPs in the genome. Across all tissues, most protein-coding genes have one or more eQTL-SNPs (90.9%, or 18,123/19,937). Fig1 S1. 2 shows the distribution of the number of eQTL-SNPs for each protein coding gene; due to extensive LD in the human genome, most genes have many eQTL-SNPs (3).  (18,123 genes). Due to LD, most genes have many eQTL-SNPs.
In most instances, GWAS and eQTL studies have not directly identified underlying causal variation. Linkage disequilibrium is a double-edged sword. These studies usually genotype a selected set of ~500K biallelic SNPs using a commercial SNP array and then leverage linkage disequilibrium to accurately impute ~8 million variants. However, once a genomic region is identified, linkage disequilibrium usually makes confident identification of causal variation within an associated locus difficult. Fig. S1.3 depicts a strong association for schizophrenia over a ~1 Mb multigenic region.  Single gene diseases. Online Mendelian Inheritance in Man (OMIM) is a curated source of gene-level findings from human genetics that accumulates human disease gene findings (4). The numbers of genes with phenotype-causing mutations was slightly over 1,000 in 2001 and 4,525 in 10/2021. Another source is the variant-level database ClinVar (5) which listed 20,493 "pathogenic" variants in 2016 and 85,320 in 10/2021.
Fig. S1.5 shows the allele frequency distribution of biallelic SNPs in gnomAD. The rarest variants dominate: 87.9% of biallelic SNPs are in the rarest categories (allele counts ≤ 10, allele frequencies ≤ 3.98e-5), and only 1.2% are common (allele frequency ≥ 0.005). The pLoF subset is defined by mutations that occur at canonical intronic splice sites, frameshift insertion-deletion variants, or introduction of a premature stop codon. However, the numbers of pLoF variants are far less than missense and synonymous variants. Reasonable prediction of the fraction of missense variants that alter protein function is an unsolved problem. Summary. These data sets document the marked increase in genomic knowledge in the recent past. For the approximately 200,000 genomic positions associated with human rare and common variant studies (i.e., combining the GWAS catalog SNPs and ClinVar "pathogenic" variants), the functional importance or impact of most is unclear. Most GWAS findings are non-coding, and a major hypothesis is that they directly or indirectly point at regulatory variants. The ~200K cataloged associations in the human genome from WES and WGS have greatly exceeded knowledge of their potential functional consequences (particularly in non-coding regions). Even for strong eQTL-SNP to eQTL-gene expression associations-the RNA phenotype "closest" to DNA variation-linkage disequilibrium generates many associations, and there is a clear need for ways to prioritize the observed associations to understand which are more likely to be causal. Taken together, these "fine-mapping" and functional interpretation of non-coding variation problems have motivated the NIH/NHGRI Impact of Genomic Variation on Function (IGVF) consortium ($185 million, 25 awards across 30 sites).

Section 2: Comparison of Zoonomia Constraint with Human Surveys
By: Patrick Sullivan, Steven Gazal, Jennifer R. S. Meadows The inline table below compares high level features of mammalian constraint to constraint measures derived from large catalogs of human genetic variation. As noted in Supplemental Methods, Section 1, the most comprehensive human catalogs to date are gnomAD (based on WES) and TOPMed (based on WGS) (6,8). These approaches share the intention of inferring the impact of purifying selection via observed constraint. Beyond this similarity, the methods have sets of strengths and weaknesses. Relatively limited: ultra-rare variants exposed for tens of person-years in few individuals, genomic backgrounds, sets of environmental exposures, and sexes. Effects on reproductive fitness and late-onset diseases often unknown.

Inference of function
Agnostic to functional mechanism Requires algorithmic prioritization to identify impactful variation. Current expression quantitative trait loci (eQTL) analyses and epigenomic surveys are incomplete in samples of cell types over developmental time. eQTL analyses are complicated by cell mixtures, gene co-expression, and linkage disequilibrium to highlight causal variant-gene pairs (3).  (25,035) are in CDS; of the constrained, common, CDS SNPs, only 10.5% (2,621) occur at a synonymous site. Constrained common SNPs are a particularly interesting subset given that they are under evolutionary constraint in mammals but are commonly variable in extant humans. They may play roles in important but variable human phenotypes, and may be a priority SNP set for gene-environment interactions (see Supplementary Methods, section 4). ‡ For most cohorts in human variation catalogs, subjects had to survive to adulthood, provide informed consent, and meet inclusion/exclusion criteria. An unknown number of individuals with strongly deleterious mutations would not be in these studies due to censoring-e.g., early mortality at some point between fertilization and eligibility for inclusion in a human sequencing catalog. Additional individuals would not be included due to some physical disability or illness (i.e., not "healthy" controls) or because of cognitive impairment (that might complicate or make impossible provision of informed consent for participation). Some of the information in these unobserved variants can be captured in gene-based measures of constraint (i.e., fewer observed deleterious variants in a gene compared to a model expectation), but this will generally occur in genes with high levels of constraint and may be missed if the variant occurs in a non-constrained segment of a constrained gene or in a constrained segment of a non-constrained gene. Direct measurement and association of these rare variants with disease is far more informative. § We consider here the annotation of observed genetic variation in exons, the "Achilles heel" of WES studies. Zoonomia is a base pair annotation, and its major strength is its potential to stratify important single nucleotide variants at missense and synonymous positions (the most numerous types of variation). It has lesser informativeness for predicted loss-of-function (pLoF) variants that are a key focus for many WES studies. pLoF are defined by their occurrence in structurally-defined parts of protein-coding genes: intronic splice sites; stop-gain, missense mutation to stop or insertion of a Ter codon; frameshift (insertion-deletion of bases that are not a multiple of 3); and start-or stop-loss. Although there are nuances and complexities, pLoF variants are generally identified by their location in relation to a gene model and the genetic code (CDS exons and codon-position). As we show in Supplementary Methods, section 8, many of the sites at which pLoF can occur have high levels of evolutionary constraint-e.g., start codons, stop codons, intronic splice sites, and positions that can mutate to generate a stop codon). The specific base-pair constraint of a given CDS base may add little new information for an observed variant in one of these locations. In particular, evolutionary constraint adds little to interpretation of ultra-rare and often individual-specific frameshift variants as these are not directly observed in cross-species alignments (there are 3 possible changes to a single base but many possible insertions or deletions of 1, 2, 4, 5, or more bases). Moreover, impactful frameshifts can occur in non-constrained regions. For example, consider three codons: AAA-GTG-AAA (Lys-Val-Lys). A frameshift deletion of the "G" in the fourth position (REF:ALT AG:A) yields a pLoF of AAA-TGA (Lys-Ter, peptide termination). Under a random model of indel generation, this pLoF could occur in a critical, highly constrained CDS region or in an unconstrained, neutrally-evolving CDS region-base-pair constraint may add no new information. The reality may be more complex if local features make indel generation non-random.

Section 3: Deriving Mammalian and Primate Constraint Measures for the Human Genome
By: Michael Dong, Ola Wallerman, Xue Li, Matthew J Christmas, Voichita D. Marinescu, Jennifer R. S. Meadows In this section, we describe the generation of the key constraint metrics. PhyloP scores were computed for each human position to reflect constraint over mammalian evolution (240 mammal species), and PhastCons scores reflect constraint over the evolution of 43 primate species. Notes. Full details of the rationale behind Zoonomia, justification of the species selected, the specific species, the sources of whole genome sequences, the builds used, and the reference-free alignment approach are given in a prior paper (11). As these data are extensive, they are not duplicated here. The Zoonomia alignment has 241 genome sequences from 240 mammalian species (Canis lupus familiaris was included twice, and handled as described below). In this section, "MAF" refers to Multiple Alignment Format. Human gene models were based on GENCODE v36 (11/2020) (12), and the genome build was hg38/GRCh38 (unless otherwise noted).
Neutral Model. Three human neutral models (autosomes, chrX, and chrY) were generated using the HALformat (HAL Tools v2.1) (13) of the Zoonomia alignment of 241 mammalian sequences (11). Ancestral repeats were identified by applying RepeatMasker Open-4.0. 2013-2015 (http://www.repeatmasker.org) to the ancestral sequence of the mammal alignment. Sequence from the penultimate ancestral branch (fullTreeAnc238) was used instead of the most ancestral sequence (fullTreeAnc239), as interspersed repeats on fullTreeAnc238 had better reconstruction of the eutherian ancestral form than fullTreeAnc239 (A. Smit, personal communication, 2 Feb 2021). Repeat coordinates were converted to fullTreeAnc239 using halLiftOver (13). Ancestral sequence repeat sets were filtered to exclude: non-mammalian repeats shared with birds and reptiles; repeat sequences annotated as structural RNA copies, satellites, tandem repeats, and low complexity annotations; and regions where synteny was not present across the four major branches of the mammalian tree (Xenarthran, Afrotherian, Laurasiatherian and Euarchontoglires). A representative random set of ancestral repeat positions (i.e., 100Kb) from the above list was the input for the neutral evolution model calculation. An ancestor-referenced multiple alignment format alignment was extracted (hal2maf). PhyloFit (with default parameters, --subst-mod REV --EM) was applied to the corrected tree to estimate branch lengths, fixing the tree topology. The resulting alignment was processed (mafDuplicateFilter) to remove sequences that aligned multiple times to the same region in order to avoid biases due to species over-alignment. These steps was repeated for the sex-specific models but using repeat positions from chrX or chrY of the human-referenced alignment (i.e., 100Kb). Primate neutral models were constructed in the same fashion, with the ancestral branch reconstruction based on the 43 primates present in the alignment. Synteny with five of the reconstructed branches in the primates tree were checked.
PhyloP and phastCons constraint score calculation. The alignment was preprocessed with mafDuplicateFilter (https://github.com/dentearl/mafTools), in order to keep one sequence per species (i.e., the best match) in each alignment block compared to the sequence to the consensus for that block (14). For the primate-specific data, non-primate species were filtered out from the alignment using the mafSpeciesSubset command from mafTools. Alignment depth, ACGT ratios, and lists of species across the alignment were collected using the Bio.Align package from BioPython (https://biopython.org/wiki/Multiple_Alignment_Format) in a custom script. Total branch lengths for each alignment block in the multiple alignment format files were calculated using the tree_doctor command with the branch length (--branchlen) option from the PHAST software package. Two constraint scores, phyloP and phastCons, were calculated using the PHAST package, (https://github.com/CshlSiepelLab/phast). PhyloP (v1.5) was used to calculate per-base constraint and acceleration p-values (15). Scores were presented as -log p-values under null hypothesis of neutral evolution, where computation involved performing a likelihood ratio test at each alignment column (--method LRT) and with the output of constraint and acceleration scores (--mode CONACC). PhyloP scores calculated on the human-referenced, MAF-formatted, duplicate-filtered alignment range from -20 to 9.28. Scores from the human-referenced, primates-only alignment (43-way) range from -20 to 1.26. Higher scores reflect greater deviation from the neutral models and hence greater constraint.
Scores calculated on the primates have lower ranges, as the total branch lengths of clade-specific trees are relatively lower compared to the entire mammalian tree. PhastCons (Phast v1.5), implemented a phylogenetic hidden Markov model (phylo-HMM) to identify evolutionarily constrained elements (16,17). In contrast to phyloP single base constraint, phastCons metrics incorporate the columns of flanking bases. Model parameters were selected to match those used to generate the phastCons output for the UCSC 100 Species Vertebrate Multiz Alignment (expected-length=45, target-coverage=0.3, and rho=0.31). The phastCons, a per-base constraint score, was outputted.
Constraint score thresholds. We applied two metrics of constraint for 2.85 billion bases in the human genome, phyloP base scores across the evolutionary history of mammals (240 species, range -20 to 9.28, higher scores indicating greater constraint) and windowed phastCons base scores across the evolutionary history of primates (43 species, range 0 to 1, higher scores indicating greater constraint). Both metrics compared the observed phylogeny to a neutral model. The phastCons metric focuses on bases which may have diverged over mammalian history but remained under constraint over the shorter primate branch lengths.
A mammalian phyloP threshold was determined by converting absolute phyloP scores into p-values and then to q-values using a false discovery rate (FDR) correction (18) (R function qvalue). Outputs were classified as constrained or accelerated based on the sign of the score, any column with a resulting q-value ≤ 0.05 was deemed significantly constrained or accelerated (5% FDR 240 mammal phyloP constraint score ≥ 2.27, 3.53% of the human genome). We took a different approach for primate constraint given the fewer species (43) and lesser branch lengths available; a phyloP score across primates species would be underpowered to discriminate highly constrained bases from the background. We identified the phastCons threshold that yielded a similar fraction of the genome under constraint as for all mammals (i.e., phastCons base score ≥ 0.96, 3.54% of the human genome). Given that the 43 primates are a subset of 240 mammals, the locations of significantly constrained bases in mammals and primates overlap.
We note that we validated the mammalian phyloP threshold, the use of different scores to measure constraint in mammals and primates, and the base pair resolution of mammalian phyloP scores by using heritability analyses of human diseases and complex traits.
Fraction constrained of human genome. We estimated the lower bound for the fraction of sites under purifying selection across the genome (π hat) by comparing the empirical cumulative distribution function (ECDF) of phyloP scores across the genome to the ECDF of ancestral repeats, following the same method detailed in (19) where: Here, F is the ECDF for all sites across the genome (a function of scores s) and G is the ECDF for ancestral repeats (i.e., neutrally evolving sites). We extracted coordinates of ancestral repeats from the UCSC repeatmasker track and filtered these to retain a set of ancestral repeats present in four clades of the mammalian tree (Xenarthran, Afrotherian, Laurasiatherian, and Euarchontoglires). The ECDFs of phyloP values for all positions across the genome, as well as those only in the set of ancestral repeats, were calculated using the ecdf function in R v.4.0.4. PhyloP values below -1.5 were excluded so that the extreme left tail of the ECDFs did not heavily influence estimates of constraint. We estimated πhat for human-and primate-centered phyloP sets.

Section 4: Genomic Properties of Constraint Scores
By: Michael Dong, Matteo Bianchi, Ola Wallerman, Chao Wang, Jennifer R. S. Meadows, Patrick Sullivan We describe here the properties of mammalian constraint: genome-wide and in genes, gene parts, and regulatory features.

Annotations and basic definitions
The inline table below defines and references the data sources for these analyses.

Group
Type Description All transcripts, dropping those with an annotation "issue". Selected all protein-coding (PC) transcripts, kept non-PC transcripts with different gene IDs and no PC transcript overlap. For PC genes, we required consistency in the cDNA, CDS, and peptide lengths (e.g., that the coding length was 3x the peptide length and start codon methionine Regulatory features in brain Brain OC Open chromatin (OC) using ATAC-seq in human brain (36 datasets from fetal cortex, adult cortex, and multiple other brain regions) (27)(28)(29)(30). Brain enhancers PsychENCODE brain enhancers (31).

Properties of constrained bases in the human genome
The inline table below shows the genome-wide proportions of constrained bases for three FDR thresholds. As noted in Supplementary Methods, Section 2, we principally consider bases as constrained in mammals for phyloP ≥ 2.27. For 43 primates, the phastCons filter analogous to phyloP FDR 0.05 in all mammals was phastCons base score ≥ 0.961. Given the higher resolution afforded by 240 mammalian species, we focus principally on phyloP in this section.  We compared the overlap of constraint in mammals to that in primates: mammals=100,651,377 bases, primates=101,134,907 bases, intersection=46,739,217, union=155,047,067, and Jaccard index of 0.301. The inline table below shows GC content and nucleotide proportions in different genomic partitions (three phyloP FDR thresholds and genome-wide). GC content is strongly related to constraint, from 40.9% genome-wide to 55.5% in the most constrained bases. For comparison, CDS bases had GC content of 52.3% (pickOne PC transcript set). As shown in Fig. S4.1, bases constrained in mammals showed a marked clustering tendency with the genome-wide distance between constrained bases of median 2 bp (IQR 1-8) vs a random median of 24 bp (IQR 10-47).

Constrained genes, PC gene parts, and regulatory elements
We next evaluated the degree of constraint within multiple important genome annotations (definitions and sources noted above). In Fig. S4.2, we show the relationship between the enrichment in a genome partition and the fraction of the genome occupied by that partition. The enrichment compares the fraction of the partition that is constrained in mammals divided by the fraction of the genome that is constrained. For example, introns are the largest annotation but with no notable enrichment. We note four enrichment groupings: • Small annotations that are highly enriched for constrained bases including start codons, stop codons, exon splice sites, and TSS; • PC transcripts are enriched for constrained bases but particularly for CDS and 5'UTRs more than 3'UTRs; • Proximal promoters (50 bp upstream of TSS) are notably enriched for constrained bases and more than for a broader definition of proximal promoter (500 bp upstream of TSS) and for a commonly used promoter region definition (TSS ± 2kb); • Multiple regulatory annotations showed enrichment including ENCODE promoter-like elements, brain enhancers, open chromatin in brain, and ENCODE enhancer-like elements. Enrichment of constrained bases in start/stop codons, splice sites, and TSS was expected as was the CDS enrichment, and these are examined more closely in other parts of this manuscript.

Constraint in codons
As shown in Fig. S4 Fig. S4.3B-E, key aspects of this CDS constraint overview are that phyloP scores were sensitive to multiple features of the protein code.

singleton (AC=1) and common SNVs (AF≥0.005). Because nearly all P-values were smaller than the operating system's minimum numeric value (P < 2.2e-308), the Y-axis shows patterns of beta coefficients between the two logistic regression models. Each model had 64 terms (phyloP score plus 63 base sequences AAC to TTT compared to reference AAA). Green line shows betas for phyloP constraint. Most of the 3 base context terms were similar (grey lines) with the exception of the 8 terms containing the dinucleotide CG (red, GCN or NGC). The modelling was done in this way (i.e., singleton SNVs vs 10x random local positions and common SVS vs 10x random local positions) due to the great imbalance in the numbers of variants
We next focused on the constraint of SNVs occurring in CDS. Fig. S4.5 summarizes the 6,140,669 TOPMed SNVs that occur in 31,809,819 unique CDS bases. Most SNVs were rare (AC=1 44.8%) and common SNVs were the least frequent grouping (AF≥0.005 1.4%). Most were missense mutations (missense 65.4%, synonymous 32.8%, and mutation to stop 1.8%). SNV mutation types show different relations with AC and constraint. Synonymous SNVs occur at non-constrained sites with a small inverse relation with AC. Missense SNVs and rare terminal mutation SNVs occur at highly constrained sites and the median constraint levels are very high except for common SNV. Due to the importance of CDS, we fit two multivariable logistic regression models. The data frame is fully defined in the next section. For the first model, the dependent variable was the presence of a singleton biallelic SNV with AC=1 versus all other CDS sites containing no variation. For the second model, the dependent variable was the presence of a common biallelic SNV (AF ≥ 0.005) versus all other CDS sites containing no variation. The predictors were phyloP score, CDS base (A, C, G, or T), codon position (1, 2, or 3), the number of possible synonymous mutations at that site (0-3), if the position could mutate to stop, the number of CDS transcripts in which a base is used, whether the base was in multiple reading frames, and local constraint (mean phyloP score in a rolling window of ±9 bases excluding the current base).
The results are summarized in the inline table below. For biallelic singleton SNVs (AC=1), the major effects were: greater odds of occurring at a C or G base, in codon positions 1 and 2, lesser odds of occurring at a base that could mutate to stop, inverse association with constraint, and greater odds of occurring in regions of higher constraint. The catalog of singleton SNVs is incomplete and this class of variant is of recent origin (de novo in the subject or an immediate ancestor) with lesser exposure to purifying selection. In comparison, common biallelic SNVs have a stronger inverse association with constraint, greater odds of occurring at a C or G base, and tend not to occur in important positions (codon position 1 or 2, or at bases that could mutate to stop). The major difference in the two analyses is in the effect of local constraint where singleton SNVs are more likely to occur in constrained regions and common SNVs are less likely.

Constraint and common SNVs
Because common biallelic SNVs are the main markers used in GWAS, we conducted additional analyses focusing on this class of variants. These SNPs had AF ≥ 0.005 over all TOPMed samples. As TOPMed is composed of ancestral groups of different sizes, it is possible that a common variant in a smaller ancestry group might not be common in the full group so we also included SNPs in AFR, ASI, EUR, or SAS that were common in any one population. There are 15,777,878 SNPs (i.e., common biallelic SNVs) in TOPMed and 1.849% (N=291,669) are constrained, far less than the fraction of the fraction of the genome that is constrained (3.53%). For each common SNP, we added annotations for location in CDS, ENCODE3 regulatory elements (promoter, DHS/open chromatin, enhancer, or transcription factor binding site), or in the "fine-mapped" set of distinct eQTL-SNPs in any GTEx tissue (posterior probabilities ≥ 0.8 in 3 credible set algorithms). We fit a logistic regression model over these data with common SNP constraint as the dependent variable and six SNP annotation predictors.
The inline table shows the results. All terms were significant. SNP location in CDS, promoter, GTEx finemapped eQTL-SNP, open chromatin, and enhancer regions were positively associated and location in TF binding site was negatively associated with whether a common SNP was under evolutionary constraint. In the above analysis, common CDS SNPs had 22-fold (e 3.09 ) greater odds of being constrained: 28% of common SNPs in CDS were constrained compared to only 1.7% of common SNPs not in CDS. This was unexpected given the results in Fig. S4.5, but could be explained if selection effects were variable across genes-common CDS SNPs may be permitted in some genes but not others. An important role for some genes is to generate variability in protein structure (e.g., olfactory receptors) whereas many genes are highly intolerant of any CDS variation.
We determined the number of common, constrained CDS SNPs for each PC gene. Irrespective of amino acid consequence, we found that 37.8% of genes contained no common SNPs in constrained CDS whereas other genes had up to 10% of CDS bases composed of common constrained SNPs. (We found similar findings for any common SNP regardless of constraint, 10.7% of genes had none and some genes had common SNPs in up to 26% of CDS). Gene set analysis (hypergeometric test, background of all PC genes, Bonferroni correction to P < 0.05) of 980 PC genes in the top 5% for fraction of CDS containing common, constrained SNPs were significantly enriched for: • G-protein coupled receptors (GPCR) • "Druggable" genes (32) • Olfactory receptors, taste • Skin development • Cell defense response to bacterium, cell killing, lymphocyte chemotaxis, CCR chemokine receptor binding These genes contain human variable alleles in many highly constrained positions. The biological processes suggested by these gene sets function at the interface of a mammal and its environment in the sense of allowing adaptation of a mammal to its environmental niche: olfaction and taste are central to identifying nutrients and avoiding predators and phytotoxins; skin development for thermoregulation, protection, and coloration; and immune processes central to adaptation to local microbiota.
GPCRs are a large and diverse gene family (~4% of human genes) of cell surface receptors that detect extracellular molecules. Some 25% of approved drugs target GCPRs (33). GPCRs are involved in multiple sensory modalities (vision, taste, smell), key neurotransmitter systems relating to behavior and mood (e.g., dopamine and serotonin), and immune sensing (chemokine receptors) among others. Many of the gene sets are due to different types of GCPR (e.g., there are 116 olfactory receptors along with chemokine receptors).
There are also 178 genes in this list that are "druggable" (32) but are not GPCRs. Other notable groups of genes include: hepatic metabolic genes-alcohol and aldehyde dehydrogenases (including ADH1B), cytochrome P450 genes (including the clinically important CYP2D6 and CYP2A6)-and genes involved in blood groups (ABO) and hemoglobin subunits (HBE1, HBG2, and HBQ1); and steroid receptors (including ESR1). Notably the list does not include key CNS neurotransmitter receptors like DRD2.
Moreover, this is not a dismissable set of genes. Many of these genes have disease or medical relevance as 131 have an OMIM entry for diverse conditions: albinism, alcohol dependence, amyloidosis, asthma, familial candidiasis, warfarin resistance, deafness, glycogen storage diseases, hypogonadotropic hypogonadism, hypotrichosis, and retinitis pigmentosa. The list also includes neurological disorders (developmental delay, epilepsy, and spinocerebellar ataxias).

Section 5: Benchmarking ClinVar and CADD against phyloP constraint
By: Matteo Bianchi, Graham Hughes, Patrick Sullivan and Jennifer R. S. Meadows ClinVar class annotated for mammalian phyloP score. The ClinVar (34) summary file was downloaded (7 May 2021), and SNVs extracted and annotated with mammalian phyloP scores. Summary statistics were calculated for benign, pathogenic, and "likely classes" for these groups (inline In addition, for variants in an early version of ClinVar (01/2016), we compared the phyloP scores for the classification of the same variants in 2021. For variants whose status was unchanged (benign/likely benign or pathogenic/likely pathogenic), phyloP scores were highly discriminating (median 0.27 vs 6.36).
Similarly, for variants whose status changed from uncertain to benign or to pathogenic, phyloP scores were also highly discriminating (median 0.76 vs 6.34). Variants whose status was downgraded had a median under that of all CDS bases (3.37 vs 3.6) (inline  (35) integrates multiple annotations, including functional and pathogenicity predictions, experimentally measured regulatory effects and modifications, information on genetic context, as well as constraint metrics based on existing mammalian and vertebrate scores (i.e. GERP++, phastCons, phyloP). While using a single universal cutoff for CADD to discretise variants into pathogenic and benign is inherently flawed (due to the nature of the score and the unique nature of any analysis framework), the recommendation is to further evaluate variants with a scaled CADD score ranging from 10 to 20 or above (35). A scaled score ≥10 indicates that the variants are predicted to be the 10% most deleterious substitutions identifiable in the human genome, while a score ≥20 indicates the 1% most deleterious.
We downloaded gnomAD release 3.0 (8) SNV positions with CADD v1.6 (36) directly from the CADD browser (https://cadd.gs.washington.edu/download) and annotated these with mammalian phyloP scores. This resulted in 540,568,449 variant positions for consideration, and this set was used to generate all classes of variants described below. For multiallelic variants, we calculated the average CADD score of all alternate alleles and found that phyloP and CADD metrics were moderately correlated (Spearman rho = 0.49, P < 2.2x10 -16 ).
To explore the predictive potential of mammalian phyloP scores in comparison to scaled CADD scores, we evaluated the extent to which these metrics were able to capture genetic variation overlapping with different gene and regulatory annotations with increasing analysis thresholds. The gene and regulatory annotations were obtained from i) GENCODE (v36) using one gene transcript (pickOne, see Table page 12), ii) ENCODE3 (DHS, enhancer, promoter, TFBS) and iii) UNICORNs (Companion paper #1, FS1 and Supplementary Methods, Section 8). The phyloP threshold was kept constant at ≥ 2.27 and CADD ranged from ≥10 to ≥20 in single score steps (Fig. S5.1A). A convergence between the fraction constrained by both phyloP and CADD was observed at several annotations at approximately CADD ≥15, suggesting that phyloP scores could potentially be employed for prioritizing variants (Fig. S5.1B). At this general cutoff certain gene and regulatory features (e.g. DHS, enhancer, TFBS) showed a greater enrichment for phyloP, indicating that this metric might perform better for these classes of annotations ( Fig. S5.1B). Considering that one of the current objectives of the CADD project is to improve and advance prediction in regions of the genome that are not protein-coding (e.g. cis-regulatory elements and non-coding RNA species) (36), the implementation of Zoonomia mammalian phyloP within the CADD framework may greatly boost the power to identify functional relevant genetic variation in these regions.
To assess the independent and shared value of phyloP and CADD scores as contributors to the same abovedescribed gene and regulatory elements, we again calculated the enrichment as above, but for three different categories of variant positions, those annotated with i) only scaled CADD, ii) only phyloP, and iii) with both scaled CADD and phyloP scores ( Fig. S5.1C). Of note, the category containing variant positions with both scaled CADD (≥10 step and above) and phyloP constraint scores (≥ 2.27) showed the highest enrichment for the great majority of the annotations, especially at average levels of deleteriousness (i.e. scaled CADD ≥ 15). This indicates the utility of mammalian phyloP scores in the context of functional prediction. For higher steps of deleteriousness, it seems that the great majority of the annotations considered are enriched for variants with only constrained phyloP (i.e. ≥ 2.27), suggesting that this metric can capture the residual functional signals (Fig. S5.1D).  Orthologs from Genome Alignments (Companion paper #1, FS1; Companion paper #10, Kirilenko et al).
The TOGA alignment dataset consisted of 37,552 alignments representing 19,300 genes, with different alignments existing for genes with multiple human transcripts. Sites where a species had a deletion (represented by a gap in the alignment) or an indeterminate amino acid (represented by an 'X'), were ignored and did not impact the 'constrained' status of a specific site. TOGA alignments were selected as they contained both Zoonomia and previously sequenced taxa, including non-placental mammals (See Companion paper #1, FS1). Constrained sites across alignments were identified using a Perl script. ClinVar annotations were used to label sites, and entries available for each gene were collated, with mutations considered 'pathogenic' and 'likely pathogenic' extracted. As a comparison, entries considered 'benign' and 'likely benign' were also collated to determine whether constrained sites had more pathogenic than benign instances. As the human reference sequence used in ClinVar was not necessarily the same reference sequence present in the protein alignment, the relative positions between both were determined by inferring a pairwise alignment using MAFFT (38). Gaps in either sequence were accounted for when present, such that the relative positions between human sequences could be identified and explored in ClinVar. If the residue at a site in the TOGA human sequence differed to the residue at the same relative site in the ClinVar reference sequence, that site was not investigated further. Relative sites between human sequences were determined using several Perl scripts (https://github.com/GMHughes/ZoonomiaScripts). For each gene, the transcripts available and the list of constrained sites, with ClinVar references, are reported in table S19.
The TOGA dataset was used to assess if constraint scores could be used to predict deleterious mutations at constrained sites using the case example of ADRB2. Previously, a deep mutational screen of the sole human exon had been performed, and primary signalling of the protein assessed (39). The top 15 sites in ADRB2 reported by the authors (39) as intolerant to mutation in (P211, G315, Y326, W286, N51, F289, D113, S319, W158, G83, C190, C184, W99, C191, D79) were constrained across all mammals in our data, as were a number of other sites predicted as being crucial to functionality (G102, C106, A181, Y185, S203, V222, A271; 161/483 sites constrained). Taken together, these results suggest that the single metric of phyloP could help predict and direct the genetic basis of novel disease-causing substitution mutations.
Naturally occurring adaptive 'pathogenic' variants in non-human mammals to aid the study of human health. While certain substitutions may result in a human disease state, identification of these residues occurring naturally in non-human mammals has the potential to act as biomarkers or sources of novel therapies. Sites showing pathogenic amino acid substitutions in ClinVar were used to identify instances of pathogenic residues occurring naturally as a wild type in non-human mammals. A total of 697 such mutations were found across 330 genes (TOGA alignment from above). One example of a human disease mutation naturally occurring in other mammals was observed in SOD1, the gene responsible for amyotrophic lateral sclerosis (ALS) in humans. Pathogenic mutations were found in several mammal species at five sites: E22, E23, D97, E101 and N140. The E22K mutation, which results in the loss of a TaqI restriction site (40), naturally occurs in three xenarthran species (Tamandua tetradactyla, Myrmecophaga tridactyla, Choloepus hoffmanni), one bat species (Carollia perspicillata) and in all five rhinocerotid species present (Rhinoceros unicornis, Dicerorhinus sumatrensis sumatrensis, Diceros bicornis, Ceratotherium simum, Ceratotherium simum cottoni), possibly representing the ancestral rhinocerotid state. Another mutation, E101K was observed in equids (Equus przewalskii, Equus caballus, Equus burchellii boehmi, Equus asinus) and three rhinocerotid species. These data augment a previous analysis of SOD1where an E40K mutation otherwise pathogenic in humans was observed as naturally occurring in E. caballus (41). Our data show that this mutation is present in all perissodactylid species for which we have sequences, suggesting not only that multiple mutations in SOD1 that are pathogenic to humans were present in the Perissodactyla ancestor, but that these residues may offer some selective advantage. Additional mutations for the other three sites were found across a range of mammals (E23L, D97V, N140K; table S20). Naturally occurring adaptive 'pathogenic' variants expressed without a disease phenotype in non-human mammals warrant further investigation, representing a source of information that may lead to the development of new therapies, alleviating disease symptoms. Additionally, given that several species analysed are 'critically endangered', this work also highlights the important role that wildlife conservation will play in the future of genomics-based medicine.

Section 6: Constraint Variation and GWAS of Human Disease By: Steven Gazal
We applied stratified LD score regression (S-LDSC), a polygenic approach for partitioning the heritability (h 2 ) causally explained by common human SNPs across genomic annotations (42)(43)(44). All analyses were conditioned to the baseline-LD model (version 2.2, 97 annotations including functional annotations and multiple constrained annotations) and to six ENCODE3 annotations (PLS, pELS and dELS annotations and corresponding flanking regions defined in Supplementary Methods, Section 2) for ~10M reference SNPs with a minor allele count ≥ 5 in the 1000 Genomes Project European reference panel (45). All analyses were performed on hg19 (after liftOver of constraint scores from hg38).
We analyzed a set of 63 GWAS summary statistics with independent association data (labeled as independent traits) by excluding genetically correlated traits in overlapping samples by measuring the intercept of cross-trait LD score regression (46), as previously described (42). For traits with summary statistics from the UK Biobank, we also excluded traits with a squared genetic correlation (47) greater than 0.1 (similar to the squared phenotypic correlation threshold used in (48)). The 63 datasets included six traits that were duplicated in two different datasets (genetic correlation of at least 0.9). After excluding these duplicates, we analyzed 57 independent diseases and complex traits. Traits were prioritized using the zscore for nonzero SNP-heritability computed using S-LDSC with the baseline-LD model (minimum of 6, as in (43)). We also performed a meta-analysis of 11 autoimmune diseases and blood cell traits (5 included in the 63 datasets and 6 additional traits, increasing the total number of GWAS datasets analyzed to 69; table S1) and 9 brain disorders (all included in the 63 datasets); note that Alzheimer disease was considered as a brain disorder only. Results were meta-analyzed across traits using random effects.
To evaluate phyloP and phastCons constraint scores, we created 15 annotations based on genome-wide percentile scores for each score: phyloP in mammals, phyloP in primates, phastCons in mammals, and phastCons in primates. We validated the use of different scores to measure constraint in mammals and primates by the fact that phyloP scores were unable to detect single-base constraint in primates due to lack of power (43 primates, lesser branch lengths) to detect significant h 2 enrichment. In mammals, both phyloP and phastCons scores performed very similarly in heritability analyses, and we defined phyloP as superior for having single-base resolution ( fig. S1). In addition, by performing S-LDSC conditional analyses of different constrained scores, we observed that a constraint annotation derived from phastCons scores in mammals was systematically less informative than annotations derived from phyloP scores in mammals and phastCons scores in primates (see inline table For further analyses, we created a "constrained in mammals" annotation by using phyloP scores at the FDR 5% threshold (i.e., phyloP scores ≥ 2.27), which represents 3.54% of the genome bases with the highest phyloP scores (Supplementary Methods, Section 3). We also created an "accelerated in mammals" annotation by considering SNPs with a phyloP score in mammals ≤ -2.270; we note that this annotation has a non-significant depletion of h 2 in (0.9±0.1 fold). To create a "constrained in primate" annotation, we considered SNPs with a phastCons score in primates ≥ 0.961, as it also covers 3.54% of the genome bases with the highest phastCons score in primates (Supplementary Methods, Section 3). We note that the phastCons Viterbi outputs defined 6.2% of the genome bases and 3.9% of common SNPs as constrained in primates; the corresponding annotation had an enrichment of 7.1±0.2x and was not considered in main analyses.
The final conditional analysis contained 106 total annotations: 97 annotations from the baseline-LD model, six ENCODE3 annotations, and three annotations from Zoonomia ("constrained in mammal", "accelerated in mammal" and "constrained in primate").
We partitioned constrained annotations using functional annotations from the baseline-LD model and from ENCODE3. We defined SNPs as Exonic by using the Coding annotation of the baseline-LD model. We defined SNPs as Promoter by using the union of two promoter annotations of the baseline-LD model and ENCODE3 promoters (PLS), and by excluding SNPs previously defined as Exonic. We defined SNPs as Enhancer by using the union of four enhancer annotations of the baseline-LD model and ENCODE3 pELS and dELS, and by excluding SNPs previously defined as Exonic or Promoter. We defined SNPs as nonfunctional by excluding SNPs previously defined as Exonic, Promoter, and Enhancer, but also by excluding SNPs with a posterior probability of being a QTL-SNP > 0.05 in the four fine-mapped QTL annotations of the baseline-LD model (49). Note that this functional annotation is independent of the UNICORNs defined in Supplementary Methods, Section 8.
We applied S-LDXR to the same 31 independent traits as in (50). We jointly analyzed the recommended baseline-LD-X model (62 total annotations, 28 main binary annotations), the constrained in mammals annotation, the constrained in primates annotation, and their intersection.
We partitioned SNP-heritability of low-frequency variants using S-LDSC with the baseline-LF model, as described in (48). Unlike previous analyses, we used 27 independent UK Biobank traits that have wellimputed low-frequency variants and that are powerful enough to partition low-frequency variant architectures (48). To compute mean standardized squared effects as a function of allele frequencies, we applied the following steps: 1. divided the functional annotation of interest into three low-frequency variant bins and three common variant bins, 2. ran S-LDSC with the baseline-LF model for 27 traits, 3. computed per-SNP heritabilities for 27 traits, 4. computed per-SNP squared effects by dividing per-SNP heritabilities by 2p (1-p), where p is the allele frequency in UK10K (used here as a reference panel), 5. computed standardized squared effects of each bin by summing the standardized per-SNP squared effects with the bin, and computed corresponding standard errors using a Jackknife procedure, and 6. meta-analyzed standardized squared effects across the 27 traits.

By: BaDoi Phan, Steven Gazal
We applied functionally-informed fine-mapping (PolyFun), an approach for prioritizing candidate-causal variants from GWAS by accounting for LD and per-SNP heritability, on 24 well-powered UK Biobank diseases and complex traits (51,52) (mean N = 440K; table S12). All analyses were performed using SuSIE as the underlying fine-mapper and on EUR ancestry UKBB imputed SNPs and LD matrices (53). We performed fine mapping using three progressive annotation sets to assess the informativeness of Zoonomia mammalian phyloP and primate phastCons to fine-map more confident candidate causal SNPs: 1) nonfunctional, fine-mapping using LD information on the SNPs alone, 2) baseline-LF, 187 annotations as previously published using the baseline-LF version 2.2.UKB set (52) , and 3) baseline-LF+Zoonomia, an updated set of 209 annotations comprising the baseline annotations, the Zoonomia annotations, six ENCODE3 annotations, and excluding 12 older primate and mammalian constrained annotations.

Section 8: Constraint highlights functionally important SNP in unannotated and functionally validated regions
By: Matthew Christmas, Jennifer R. S. Meadows, Xue Li UNannotated Intergenic COnstraint RegioNs (UNICORNs) define binned regions of the genome with high constraint but which lack functional annotations (Companion paper #1, FS1 for more detail). Briefly, The unannotated regions of the genome were defined subtractively (BEDTools v.2.29.2), removing GENCODE v37 (12) exons (UTRs and exons for all protein-coding genes) and promoters (TSS±1 kb), ENCODE3 (22) cCREs, DHS (including TF binding sites), ChIA-PET anchors, three promoter annotations (Promoter_UCSC, Human_Promoter_Villar, PLS_hg38), and six enhancer annotations (Enhancer_Andersson, Enhancer_Hoffman, Human_Enhancer_Villar, SuperEnhancer_Hnisz, dELS_hg38, pELS_hg38. Positions were lifted to hg38 where appropriate. For the remaining unannotated regions, bins of constraint were identified by first grouping any two phyloP constraint positions within 5 bp of each other, retaining clusters greater than 10 bp, giving a set of 424,179 UNICORNs. A set of non-constraint clusters was also generated to enable comparisons with all unannotated intergenic regions greater than 10 bp that did not contain constraint positions (phyloP < 2.27). This set generated over 6 million clusters, which were randomly down-sampled to a matched set of 424,179 elements.
We tested for enrichment of fine-mapped SNPs (Supplemental Methods, section 7) within UNICORNs, non-constrained unannotated regions (Unannotated), and the rest of the annotated genome (Other), by overlapping these regions with the available SNPs. The number of SNPs present in each region was 833, 5,895, and 305,599 respectively with mean PIP of 0.15, 0.05, and 0.05 respectively. The list of fine-mapped SNPs, their PIP score and UNICORN regions are listed in table S13. SNP PIP scores for each region were compared using linear regression in R v. 4 Most fine-mapped SNP within UNICORNs were limited to a single trait (738/833), but others showed association to multiple traits (inline table), indicating that these variants lacking annotation could be used to further the biological understating of these traits. One variant, rs72782676, implicated in four traits was placed into genomic context with predicted TFBS (FS), and ENCODE3 regulatory annotations (cCREs, DHS). Further, allele frequencies for all SNPs in the region from TOPMed data freeze 8 were included, as well as mammalian phyloP and primate phastCons scores. Plots were generated using ggplot2 package in R v.4.0.1. Massively parallel reporter assays (MPRAs) are designed to interrogate the regulatory potential of hundreds to thousands of DNA sequences in a single high-throughput experiment. We tested the utility of mammalian phyloP to predict the strongest effect in MPRA reads. If successful, phyloP would first and foremost be benmarked as a tool to suggest causative function at base pair resolution, and could further be used to prioritise variants for more limited MPRA experiments. We accessed MPRA data targeted to four areas of genome regulation: eQTLs (54)  for "all" group, to 420 for the "FDR 1%" group, while for 3'UTR MPRA, it ranged from 12,134 variants for the "all" group, to 471 for the "FDR 1%" group. To make the number of variants in each group comparable, we randomly selected 350 variants (eQTL MPRA) or 450 variants (3'UTR MPRA) in each group, calculated the percentage of expression-modulating variants and repeated this 1,000 times. We also calculated the standard deviation of the percentage of expression-modulating variants for each group. The fold enrichment was calculated as the percentage of true expression-modulating variants at FDR 5% divided by the percentage derived from the "all" group. This was 1.3 fold for 3'UTR MPRA, and 1.8 fold for eQTL MPRA (Fig. S8.3).

Fig. S8.3. Enrichment and standard deviation of true expression modulating skewed variants captured at increasing FDR steps for (A) 3'UTR and (B) eQTL MPRA data sets.
For the final regulatory space, the MPRA data from the error-prone PCR saturation mutagenesis of 9 enhancer and 11 promoter disease-associated regions, size ranging from 186 to 600 bp, was downloaded (56). That data set contained the functional measurements for over 30,000 single nucleotide substitutions and deletions, as each position from the region was mutated to the other three nucleotides or a deletion, and an MPRA effect read out reported. We calculated the average of the MPRA effects for each position to represent the relative ability of this position. These positions were subsequently annotated with mammalian phyloP scores, and the Spearman correlation between phyloP score and MPRA effect for each promoter region computed (inline table). The correlation ranged from negligible in the FOXE1 locus (0.00) to highest in the LDLR locus(0.51), and so we also annotated the regions using ClinVar pathogenic variants and plotted the results for visual inspection (MPRA effect >2 Fig. S8.4, MPRA effect <2 Fig. S8.5). We note that high constraint is not indicative of correlated effect across the oligo and that cell line specific effects will alter the correlation (e.g. correlation ranges from 0.04 to 0.10 for the TERT locus in HEK293T and SF7996 cell lines).

Section 9: Creation of a Gene-Based Measure of Constraint By: Patrick Sullivan, Quan Sun, Yun Li
Measures of the degree to which a gene is constrained across mammalian evolution are often used to help understand the impact of genetic variation on human disease. Measures of gene-based constraint include pLI (7), LOEUF (8), and RVIS (57). Developing a measure of gene constraint is complicated by the extraordinary diversity of human genes. In this section, we describe and validate measures of gene constraint based on the Zoonomia alignment.
Human gene models were based on GENCODE v36 (11/2020) (12), and the genome build was hg38/GRCh38. Supplementary Methods, Section 4 shows data sources and definitions. We conducted extensive checks for each protein-coding (PC) transcript, and removed those flagged as incomplete or containing an error like a mismatch between the number of codons and amino acids.

, ⅙, ¼, ⅓, ½, and 1). Except for Met and Trp (one codon each), all codons deviate from equal usage, most notably for Leu-CTG, Val-GTG, Ter-TGA, and Gln-CAG.
We next applied a model-based approach to adjust phyloP scores for a range of coding features with an impact on the CDS base constraint. The dependent variable was the phyloP score for each CDS base (median 3.6 over the pickOne set of 32.13 million CDS bases). The inline table below defines 12 covariates and gives the median phyloP values associated with the covariate levels. These included: codon data (start codon, stop codon, position in codon, the codon string, and the specific base), mutational consequences (of the 3 possible mutations at a base, the number of synonymous mutations and whether a mutation to stop could occur), positional features (coding base located at an inner edge of an exon, the number of PC transcripts including a given base, whether a base appeared in multiple reading frames), whether the base had a notably small number of species aligning, and, to capture local constraint, the rolling mean phyloP score (±9 bases excluding the current base). The informational units in PC genes are codons not individual base scores, and the above approaches may not completely capture the complexities of codon constraint-in many instances, evolutionary selection will operate on amino acids and not necessarily individual bases. As a complementary approach, we thus evaluated the extent to which each human amino acid was constrained across species. The TOGA multiple species alignment provides peptide comparisons for 442 vertebrate species (including all 240 Zoonomia species). All species are aligned to the human reference for each transcript and each amino acid. We used the R package Bio3D (58) to compute the normalized Shannon entropy score for each amino acid (ranging from 0=high entropy/not constrained to 1=low entropy/highly constrained). This measure was based on the standard amino acid alphabet. We also evaluated collapsing amino acids into 10 groups by physiochemical properties (e.g., hydrophobic/aliphatic, aromatic, polar, etc), and found highly correlated results. (6) The gene constraint metric based on the TOGA multiple species protein alignment was the mean normalized Shannon entropy score over all CDS bases.
(7) From Supplementary Methods, Section 4, we included the fraction of common, constrained CDS SNPs for each PC gene given the results above suggesting its salience for identifying constraint.
In summary, we compared 9 gene constraint metrics in the pickOne transcript set. For each PC gene, we calculated: 1. fracCdsConsFdr05, number of CDS bases with mammalian phyloP ≥ 2.27 divided by the number of CDS bases 2. fracCdsConsPri, number of CDS bases with primate phastCons ≥ 0.96 divided by the number of CDS bases 3. phylopNormMean, mean of normalized phyloP score in CDS bases (vs all codon-position phyloP scores) 4. phylopNormMeanRobustP50, median of robust normalized phyloP scores in CDS bases (vs all codon-position phyloP scores) 5. phylopFittedMean, mean of phyloP scores in CDS bases after adjustment for 12 covariates in a linear model 6. HnormMean, mean normalized Shannon entropy score from the TOGA multiple species protein alignment 7. fracCdsConsCommon, number of common, constrained CDS SNPs divided by the number of CDS bases 8. For comparison, we included the gnomAD LOEUF and pLI scores for the same PC transcripts (8) derived from large exome sequencing studies, and use different methods to contrast the observed number of predicted loss-of-function variants to a model-based expectation per transcript The inline table below shows summary statistics for these 9 measures along with the number of CDS bases (N=19,387 genes, pickOne transcript set). Greater values indicate higher constraint except for LOEUF and fracCdsConsCommon which are inverted. The completion rates are generally high, but HnormMean, LOEUF, and pLI are missing for >10% of genes. The univariate histograms for most measures show reasonable spread except for fracCdsConsCommon (point mass at zero) and pLI (essentially bimodal).  Fig. S9.2A shows the Spearman pairwise correlations among these measures. There were small correlations with the number of CDS bases with the notable exception of LOEUF which was strongly correlated with the number of coding bases (rho = -0.51). This suggests that LOEUF is impacted, and perhaps biased, toward informativeness in larger CDS transcripts. The main gene constraint measures derived from Zoonomia (fracCdsConsFdr05, phylopNormMean, phylopNormRobustp50, and phylopFittedMean) were highly intercorrelated (Spearman rho ≥ 0.95), and there were high correlations with fracCdsConsPri (rho ≥ 0.87) but far lower correlations for HnormMean (rho 0.45-0.52). fracCdsConsCommon was not informative due to lack of variation. into peptides. Many other genes are composed of non-coding exons that can be transcribed into cDNA. We also created a gene-based metric for cDNA which can be calculated for any biotype. fracExonCons is the fraction of cDNA bases that are constrained in mammals. For PC genes, the Spearman rho of fracCdsCons (CDS) with fracExonCons (CDS and UTRs) was 0.64. Fig. S9.3 shows box plots for five biotypes. Most non-coding genes have low median levels of constraint but a fraction have high levels of constraint. Constraint can be used to identify non-coding genes with high levels of mammalian constraint as one index of potential importance.

Section 10: Evaluation of a Gene-Based Measure of Constraint By: Patrick Sullivan, Steven Gazal
Identifying genes whose constraint is measurable: Human PC genes are highly diverse. No single measure of gene constraint will be applicable to all genes. We sought an empirical approach to identify PC genes for which an evolutionary gene constraint metric is applicable. As detailed in Supplementary Methods, Section 9, the gene constraint metric is fracCdsCons (number of CDS bases with phyloP scores ≥ 2.270 divided by the total number of CDS bases). For each PC gene (pickOne transcripts, N=19,386), we created four annotations: fracCdsCons, fracConsPr (the fraction of CDS bases constrained in 43 primates, phastCons ≥ 0.961), the fraction of CDS bases without a phyloP score, and the fraction of CDS bases with a low number of species aligning (≤110). The latter two measures capture difficult to align coding bases across 240 mammalian species (i.e., due to error, complexity, and sequence specific to primates or humans). Adapting scRNA-seq methods, we applied UMAP (59) as a dimensional reduction algorithm and then used DBSCAN to identify clusters in the results (60) (Fig. S10.1).

Fig. S10.1. (A) Results from UMAP dimensional reduction and five clusters from DBSCAN (labeled A-E). (B) Parallel plot of outlier clusters A-D. (C) Mammalian constraint (fracCdsCons, grey=least, red=most). (D) Genes with missing gnomAD LOEUF scores.
These analyses yielded an interpretable high-level classification of human PC genes with respect to mammalian evolutionary features. The inline table describes the clusters. We evaluated human ortholog data (ENSEMBL, 1:1 high-confidence human orthologs from 73 species including 23 primates) (61). Cluster E contained most genes (18,425) and the other 4 clusters were different types of outliers. The 56 genes in Cluster A did not map well and were poorly constrained (e.g., GAGE*, NBPF*, USP17L* and several olfactory receptors). The 221 genes in Cluster B mapped well but were not constrained, and 90% had no ortholog or an ortholog only in primates, a pattern suggestive of a human-or primate-specific gene. Cluster C (669 genes) was notable for low numbers of species with CDS alignment (e.g., five HLA genes and 14 interferon alpha genes). The 15 genes in Cluster D mapped well and were very highly constrained across mammals (especially in primates, fraction CDS constrained > 0.993, multiple HOX* genes).
We conducted hypergeometric gene set analyses for outlier Clusters B and C (outlier clusters with >200 genes). Genes in Cluster B showed significant enrichment for gene sets involved in: (a) olfaction, (b) epidermis development, and (c) male gamete generation. Genes in Cluster C showed significant enrichment for: (a) adaptive immunity, (b) olfaction, and (c) epidermis development. A common theme in Clusters B and C is the position of these genes at the interface between a mammal and its external milieu, and these may evolve for a mammal to adapt to specific environments. Fig. S10.1C shades each gene from light grey (fracCdsCons = 0) to red (fracCdsCons = 1). There is a gradient in mammalian constraint from the top middle anticlockwise to the lower right.
Interpretation: Cluster A contains genes whose CDS bases are in complex genomic regions that align poorly; Clusters B and C contain genes that evolved during mammalian evolution in response to environment; Cluster D contains the most highly constrained genes; and Cluster E are all other genes. We retained Clusters C, D and E as sufficiently comparable genes and viewed Clusters A and B as not having interpretable phyloP-based gene constraint.

Gene set analysis of most and least constrained genes:
We conducted gene set analysis for 19,109 PC genes (Clusters C, D, and E) in the top or bottom 5% (ventile) of the fraction of mammalian gene constraint (fracCdsCons, overall median 0.623, IQR 0.471-0.725, range 0-0.975). Gene set analyses were conducted for genes in the top 5% (955 genes, fracCdsCons 0.811-0.975) and the bottom 5% (956 genes, fracCdsCons 0-0.150). Gene set analysis (hypergeometric test, background all PC genes, Bonferroni-correction to P < 0.005) were conducted for Gene Ontology (GO) pathways, for genes with high expression specificity across over ~1,200 human cell types (via single-cell RNA-seq, Human Cell Landscape, HCL), and the expertcurated set of synaptic genes (SynGO) (62)(63)(64) Gene set analyses of the bottom ventile (lowest mammalian constraint) were dominated by genes at the interface of a mammal with its environment: microbiota defense, taste and smell, and multiple skin processes. These gene sets are central to the adaptive evolution of a mammal to its environment-i.e., the specific microbiota, adaptations of smell and taste to detect prey, predators, and poisons, and adaptations of skin for temperature regulation, camouflage, and defense.
Gene sets in the top ventile (highest mammalian constraint) were notably fundamental to mammalian development. Significant gene sets were involved in basic embryology, morphogenesis of most organ systems, cell cycle, and basic synaptic processes.

Mammalian constraint and external validators:
Following part of the validation strategy for gnomAD LOEUF (8), we evaluated the pattern of constraint with respect to external gene sets with established patterns of constraint. We repeated the analyses previously reported for LOEUF in comparison to fracCdsCons (coding both as deciles, 0=least and 1=most constrained). We evaluated the impact of common SNPs linked to PC genes in each fracCdsCons decile by using S-LDSC on 63 independent GWAS datasets (Supplemental Methods, Section 6). In the main analyses, SNPs were linked to genes using a functional approach (i.e., considering SNPs in exons, promoters, linked enhancers, or are fine-mapped cis-eQTLs using the cS2G approach (65)). We also performed secondary analyses by using a naive approach linking SNPs falling in a 100kb window around a gene. Analyses included the baseline-LD v.2.2 model, one annotation for all genes, 10 annotations for the deciles of six gene scores (fracCdsCons, pLI, LOEUF, and fraction of promoter, proximal enhancer and distal enhancer that were linked to the genes, see below). For each decile annotation, we meta-analyzed its heritability enrichment (i.e. fraction of heritability linked to the genes in the annotation divided by the fraction of common SNPs linked to the genes in the annotation) across independent GWAS datasets. We also computed its gene heritability enrichment, defined as the meta-analyzed heritability enrichment of the annotation divided by the mean of the meta-analyzed heritability enrichments of the 10 decile annotations (considered as a constant to compute standard errors). Significance of gene heritability enrichments were computed using z scores. We observed significantly higher gene h 2 enrichment for SNPs linked to genes in the most constrained deciles (Fig. S10.4 and table S17). Interestingly, while SNPs linked to genes had a higher heritability enrichment in a meta-analysis of 11 blood and immune traits vs 9 brain disorders, we observed stronger gene h 2 enrichment patterns for brain disorders compared to blood and immune traits (Fig. S10.4), suggesting that fracCdsCons can be used to prioritize genes in brain disorders. We obtained similar conclusions using both SNP-gene linking strategies (Fig. S10.4). We also observed similar patterns of gene h 2 enrichment between fracCdsCons, pLI and LOEUF (Fig. S10.5). Mammalian constraint compared to LOEUF: We specifically compared fracCdsCons to the gnomAD LOEUF measure (8) given its widespread use. The Spearman correlation of fracCdsCons with LOEUF was -0.55, and LOEUF had the undesirable property of a strong residual correlation with the number of coding bases (Supplemental Methods, Section 9).
Fig. S10.1D shows the dispersed location of genes with missing gnomAD LOEUF scores (gene constraint based on human exonic variation) (8). Many PC genes do not have LOEUF scores (N=1,962 genes, 10.1%). Fig. S10.6 shows an upset plot to contrast fracCdsCons clusters with LOEUF scores. In general, fracCdsCons appears to provide measures of constraint to more PC genes than LOEUF (98.6% vs 89.1%) and is thus applicable to a greater number of PC genes. Fig. S10.7 shows a direct comparison of fracCdsCons with gnomAD LOEUF and pLI. The first two measures are smoothly continuous with no point of rarity at which to separate "constrained" from "not constrained". pLI is nearly bimodal. These results are an important validator for both Zoonomia and gnomAD. Fig. S10.7A compares gene constraint between Zoonomia fracCdsCons and LOEUF. There was agreement between the gene measures (Spearman r = -0.56). This degree of congruence is notable as these measures are based on markedly divergent samples and phenomena-i.e., constraint over ~100 million years of mammalian evolution vs statistical modeling of pLoF counts in WES catalogs of humans who survived to consent to be included in a gnomAD cohort. In theory, these measures should be related and this empirical confirmation is an important validator for both measures.
However, the correlation is far from perfect, and we highlight that there are advantages and disadvantages to each measure: (a) missingness is lesser for fracCdsCons (see Fig. S10. 6) and (b) LOEUF should perform better for human-or primate-specific genes as Zoonomia has low power (but, in practice, few of these genes had low LOEUF scores).
We investigated discrepancies between the most constrained quantile on one measure and the least constrained on the other. Genes with high fracCdsCons/low LOEUF constraint compared to high fracConsCds/high LOEUF constraint (i.e., top right vs bottom right corners of Fig. S10.7A) had smaller CDS sizes (mean 542 vs 2073 bases) as suggested by the large residual correlation of LOEUF with CDS size. We suspect that this resulted from an issue with the LOEUF model for smaller genes with far fewer expected pLoF variants (83 vs 390) and a large confidence interval (mean 1.3 vs 0.2) and thus high LOEUF score. We suspect that these genes are misclassified by LOEUF. For example, these genes included 3 small HOX genes that are known to be highly constrained (HOXA9, HOXC12, and HOXC5), and gene set analyses were significantly enriched for genes with specific expression in brain cortex, a generally highly constrained subset.

Constraint between the parts of PC genes:
We evaluated the relationship between constraint of different gene parts in more detail here. We finally evaluated constraint in gene enhancers. Investigating gene enhancers is challenging as enhancers do not necessarily regulate the closest genes (66), and because ENCODE3 did not provide enhancer-gene linking predictions of their enhancer-like signatures (ELS) elements. Here, we used the EpiMap enhancergene linking predictions (67) to link ENCODE3 proximal and distal ELS (pELS and dELS, respectively). Briefly, EpiMap predicted enhancers in more than 800 cell-types (including ENCODE datasets), and linked these enhancers to their target genes using correlation between enhancer activity and gene expression across cell-types. Using EpiMap enhancer-gene links in all cell-types, we were able to link 33% and 53% of pELS and dELS to at least one gene, respectively. We observed non negligible correlation between dELS constraint gene scores with constraint in CDS and promoters (rho = 0.25 and 0.25, respectively); we note that we observed similar values when considering all EpiMap enhancers (rather than restricting to ENCODE3 cCRE), and lower correlation when considering pELS (Fig. S10.8). These data are consistent with the theory that if the function of a gene in mammals requires high constraint of protein structure, then the non-coding sequence in the proximal promoter that regulates its expression also tends to be constrained. As discussed below, for many genes these correlations reflect dosage sensitivity (as assessed by intolerance to predicted loss-of-function, pLoF, variation in large surveys of currently living humans) -if a mammalian gene cannot tolerate pLoF then this may imply a corresponding constraint of its regulatory mechanisms. Gene constraints are available in  Finally, we evaluated the impact of common SNPs linked to genes in each constraint score decile by using S-LDSC on 63 independent GWAS datasets (see above). We observed that common SNPs linked to genes with constrained promoters and distal enhancers are as enriched in h 2 as genes with constrained CDS, suggesting that constraint in regulatory elements can be leveraged in the analyses of human diseases and complex traits (Fig. S10.5). We obtained similar conclusions using two SNP-gene linking strategies.

Section 11: Rare copy number variation and human disease
By: Jin Szatkiewicz, Jia Wen, Patrick Sullivan Copy number variants (CNVs) are genomic segments that have fewer or more copies compared to a reference sample. Their sizes can range from 1kb to megabases. CNVs are important drivers of evolution and are notable risk factors for multiple human diseases (68)(69)(70). However, CNVs often occur in high repeat/low mappability regions meaning that detecting their presence and disease significance often carries considerable uncertainty (71,72). We thus evaluated the salience of mammalian constraint in the analysis of disease-related CNVs.
First, as an initial qualitative check, CNV deletions in a gene desert upstream of SOX9 can cause the craniofacial condition Pierre Robin sequence by altering the copy number of distal enhancers. Functional analyses of the CNV-deletion regions in Pierre Robin sequence patients suggested a causal role for several small enhancers >1.2 Mb from SOX9 (73). We found that these enhancers were highly constrained: the fraction of constrained bases comprising three small (3.5-6.9 kb) enhancers (1.45, 1.35, and 1.25 Mb from SOX9) were all 11-12%, and a functionally important element within one enhancer (the "min2" element (73)) was 73% constrained.
Second, we generalized the approach by analyzing TOPMed (6)  However, these SV calls are highly overlapping, and the 335,238 SVs occur in 2,182 genomic regions. A given base can be part of multiple SV that range from very rare to common. To obtain a more detailed view of the impact of SV with respect to base pair constraint, we converted the TOPMed SV calls to base pair

Section 12: Somatic Cancer Driver Genes and Constraint
By: Sharadha Sakthikumar, Jessika Nordin, Ananya Roy, Kerstin Lindblad-Toh, Karin Forsberg Nilsson The full set of processes used to identify and functionally explore non-coding constrained driver genes of cancers can be found in Companion paper #12, Sakthikumar et al. Key methodologies are included in brief below.
Medulloblastoma and pilocytic astrocytoma sample set and basic validation. We downloaded variant calling format files with somatic mutations based on whole genome sequencing data from the International Cancer Genomics Consortium (ICGC, project code PBCA-DE). The data were aligned to the human genome build hg19. A total of 89 pilocytic astrocytoma and 146 medulloblastoma samples were available.
A standard analysis to identify significantly mutated genes was performed to validate our data against previous analyses. The GATK's Funcotator module (76) was used to functionally classify the medulloblastoma somatic point mutation (SPM) and somatic indel mutations (SIM) as being either coding or non-coding changes. Coding mutations were used for identification of SMGs. We found one SMG in pilocytic astrocytoma and three SMGs in medulloblastoma, consistent with what has been seen previously.
Identification of non-coding constraint mutations (NCCMs). SPM and SIM calls were annotated with mammalian phyloP scores to identify NCCMs. PhyloP scores based on hg38 were converted to hg19 coordinates using liftOver (77). Previous estimates of the fraction of the human genome that is functional has been between 6-13%. We used an estimate of the top 8% constraint positions (i.e., mammalian phyloP ≥ 1.2) which we previously used to examine NCCMs in glioblastoma (78). Non-coding variants with phyloP ≥ 1.2 were termed NCCMs. NCCMs located within 5′ or 3′ UTRs, introns, or ±100 kb intergenic flanking regions of protein-coding genes were extracted. The rate of NCCMs, normalized to the length of the queried genomic region for each gene was computed, and those genes with ≥ 2 NCCMs per 100 kb were termed as candidate driver genes.

Evaluation of putative functional impact on NCCMs.
To verify if the NCCMs around candidate genes have cis-regulatory changes, their coordinates were annotated with regulatory annotations from UCSC (77) and ENCODE. Annotations included transcription factor binding sites (TFBS), regulatory markers (ORegAnno), transcription start sites (TSS), enhancers, and DNase I hypersensitive sites (DHS). sTRAP (79) was used to predict if the NCCMs were likely to alter the binding affinity of transcription factors to the mutated versus wild-type sequences. For every NCCM allele and its corresponding wild-type, 20 bp upstream and downstream were used as input sequences. The matrices for the analysis were set to the JASPAR vertebrates database (79,80), and for the background model, the 'human promoter' option was selected. The JASPAR database (http://jaspar.genereg.net) was used to obtain information about TF binding matrices and motifs of interest. Affinity profiles for the top-ranking matrices with P ≤ 0.05 were recorded as having significantly more TF binding for the input sequences than what could be expected from a random sequence.
Electromobility shift assays (EMSA) of NCCMs. Multiple EMSAs were run for NCCMs in the MB002 medulloblastoma cell line (81). All probes and experimental conditions are stated in the Companion paper #12, Sakthikumar et al.

Expression and survival analysis.
To assess gene expression levels between normal cerebellum and medulloblastoma, or between the medulloblastoma subgroups, processed gene expression data from Gene Expression Omnibus (GEO) were used. Dataset GSE124814 contains batch-normalized transcriptional profiles of 1350 medulloblastoma samples and 291 normal cerebellum samples (82). Dataset GSE85217 contains 763 medulloblastoma samples that were used for survival analysis (83). Based on the order of expression, every increasing expression value is used as a cutoff to create the two groups and the significance difference tested in a log-rank test. The most significant expression cutoff for survival analysis is plotted on a Kaplan Meier curve. Fig. S1. Evolutionary constraint in multiple genomic partitions.

Fig. S5. Heritability enrichment of the constrained in mammals (A) and constrained in primates (B) annotations in 69 GWAS.
Traits are sorted by the lower bound of the 95% confidence interval. Significant enrichment (P < 0.05/69) are plotted in blue, with error bars representing 95% confidence intervals. Red lines represent enrichment and corresponding 95% confidence intervals meta-analyzed across 63 independent GWAS. Numerical results are reported in table S5.

Fig. S7. Heritability enrichment of the constraint in mammal annotation stratified by quartiles of PhastCons scores in primates (A) and of the constraint in primate annotation stratified by quartiles of phyloP scores in mammals (B).
PhastCons score quartile boundaries in regions constraint in mammals are 0.

Fig. S8. Common and low-frequency variant heritability enrichments in constraint regions.
We report common variant heritability enrichment (CVE) and low-frequency variant heritability enrichment (LFVE) of constrained in mammals and primates annotations intersected together, and stratified by their genomic function. We plotted coding and nonsynonymous enrichments for comparison purposes. The dashed red lines represent a null enrichment of 1. Results are metaanalyzed across 27 independent UK Biobank traits. Error bars represent 95% confidence intervals. Numerical results are reported in table S10. Supplementary Tables   Table S1.

GWAS summary statistics used in common variant heritability analyses.
We report the 69 summary statistics used in this study, including 63 independent GWAS covering 57 independent diseases and complex traits, 11 autoimmune diseases and blood cell traits, and 9 bran diseases.

Table S2. Heritability partitioning results for common SNPs falling in the top percentiles of phyloP and PhastCons scores in mammals and primates.
We report the heritability enrichments (with corresponding standard errors and P-values), and S-LDSC regression coefficients (with corresponding standard errors and P-values) for common SNPs falling in the top percentiles of phyloP and PhastCons scores in mammals and primates. Results are meta-analyzed across 63 independent GWAS.

Table S3. Heritability partitioning results for common SNPs as a function of their distance to a constrained base.
We report the heritability enrichments (with corresponding standard errors and P-values), for common SNPs as a function of their distance to a constrained base. Results are meta-analyzed across 63 independent GWAS.

Table S4. Heritability partitioning results for the main S-LDSC analysis.
We report the heritability enrichments (with corresponding standard errors and P-values), and S-LDSC regression coefficients (with corresponding standard errors and P-values) for our main model including 97 annotations from the baseline-LD model v2.2, 6 ENCODE3 annotations, and 3 zoonomia annotations.
Results are meta-analyzed across 63 independent GWAS.

Table S5. Heritability partitioning results for the constrained in mammals and constrained in primates annotations in 69 GWAS.
We report the heritability enrichments (with corresponding standard errors and P-values) for the constrained in mammals and constrained in primates annotations in 69 GWAS, as well as the statistical significance between the trait enrichment and the meta-analyzed enrichment.

Table S6. Heritability partitioning results for constrained annotations in 11 blood and immune traits and 9 brain disorders.
We report the heritability enrichments (with corresponding standard errors and P-values) for constrained in mammals and primates annotations intersected together and meta-analyzed across 11 blood and immune traits and 9 brain disorders.

Table S7. Heritability partitioning results for constrained in mammals and primates annotations intersected together and stratified by their genomic function.
We report the heritability enrichments (with corresponding standard errors and P-values), and S-LDSC regression coefficients (with corresponding standard errors and P-values) for constrained in mammals and primates annotations intersected together and stratified by their genomic function. Results are metaanalyzed across 63 independent GWAS.

Table S8. S-LDXR results for the 28 main binary annotations of the baseline-LD-X model and 3 constrained annotations.
We report the enrichment of stratified squared trans-ancestry genetic correlation (with corresponding standard errors and P-values) for the 28 main binary annotations of the baseline-LD-X model and 3 constrained annotations. Results are meta-analyzed across 31 independent traits.

Table S9. UK Biobank GWAS summary statistics used in common and low-frequency variant heritability analyses.
We report the 27 summary statistics used to partition common and low-frequency variant heritability. We report their sample size (N), effective sample size (Neff), prevalence (for binary traits), and common and low-frequency variant heritability (with corresponding Z score) estimated using the baseline-LF model. We restricted our analyses to these traits (different the the 63 used for common variant heritability analyses), as low-frequency variant heritability analyses require very large sample size (48). These traits are the same as the ones used in ref. (48).

Table S10. Common and low-frequency variant heritability partitioning results for constrained in mammals and primates annotations intersected together and stratified by their genomic function.
We report the common and low-frequency variant heritability enrichments (with corresponding standard errors and P-values for differences) for constrained in mammals and primates annotations intersected together and stratified by their genomic function. Results are meta-analyzed across 27 independent UK Biobank traits.

Table S11. Standardized squared effect sizes as a function of allele frequency.
We report the standardized squared effect sizes in multiple annotations where variants are stratified by their allele frequency. Results are meta-analyzed across 27 independent UK Biobank traits.

Table S12. Functionally-informed fine-mapping results across 24 UK Biobank traits.
We report functionally-informed fine-mapping results using 3 sets of annotations for 24 UK Biobank traits. We report trait sample size (N) and effective sample size (Neff), the number of fine-mapped variants with posterior inclusion probability (PIP) > 0.75 and the number and fraction of fine-mapped variants with PIP > 0.75 that are constrained in mammals or primates.

Table S13. Fine-mapped variants within UNICORNs.
We report the location and posterior inclusion probability (PIP) of fine-mapped variants within UNannotated Intergenic COnstraint RegioNs (UNICORNs) as well as the trait associated with each variant. Further, we indicate the ranking of SNPs within the primate PhastCons and mammalian phyloP score distributions.

Table S14. Gene constraint metrics for 19,386 protein-coding genes.
We report gene constraint metrics (pickOne transcript subset) referred to in the manuscript including the fraction of CDS bases constrained in mammals and in primates. We also include the UMAP/DBSCAN cluster identifiers (A-E), the constraint ventiles, and indicate the genes in the top and bottom ventiles (5%) .

Table S15. Gene Set Analysis: most/least constrained 5% of genes (GO gene sets).
We report the results of the hypergeometric gene set analyses of Gene Ontology (GO) pathways for genes in the top or bottom ventiles of mammalian constraint. The background gene set was 19,109 protein-coding genes. We provide data used for P-value calculation.

Table S16. Gene Set Analysis: most/least constrained 5% of genes (SynGO terms).
We report the results of the hypergeometric gene set analyses of the expert-curated synaptic gene sets from the SynGO consortium. The background gene set was 19,109 protein-coding genes. We provide data used for P-value calculation.

Table S17. Heritability partitioning results for common variants linked to constrained gene sets.
We report the heritability enrichments (with corresponding standard errors and P-values) and gene heritability enrichment (with corresponding standard errors and P-values) for common variants linked to gene sets based on their decile for their constrained gene scores. We considered 6 different constrained scores (fraction of CDS bases constrained, fraction of promoter bases constrained, fraction of ENCODE3 proximal enhancer bases constrained, fraction of ENCODE3 distal enhancer bases constrained, pLI, and LOEUF). We considered two strategies to link SNPs to genes: by using the cS2G map (65), and by using the gene-body +/-100 kb window.

Table S19. ClinVar classes of variants identified as constrained with TOGA.
Gene alignments generated using TOGA were explored to identify constrained amino acid residues unchanged across mammalian evolutionary history. All sites were compared with known pathogenic/benign records in ClinVar. The number of sites, as well as their ClinVar disease status where present, are displayed.

Table S20. Naturally occurring adaptive 'pathogenic' variants in non-human mammals.
Pathogenic residues identified in ClinVar were explored in non-human mammals. Any instances of potentially adaptive 'pathogenic' human variants occurring naturally in other species are documented. All species showing human pathogenic residues are listed.