Bayesian estimation of gene constraint from an evolutionary model with gene features

Measures of selective constraint on genes have been used for many applications, including clinical interpretation of rare coding variants, disease gene discovery, and studies of genome evolution. However, widely used metrics are severely underpowered at detecting constraint for the shortest ~25% of genes, potentially causing important pathogenic mutations to be overlooked. We developed a framework combining a population genetics model with machine learning on gene features to enable accurate inference of an interpretable constraint metric, s_het. Our estimates outperform existing metrics for prioritizing genes important for cell essentiality, human disease, and other phenotypes, especially for short genes. Our new estimates of selective constraint should have wide utility for characterizing genes relevant to human disease. Finally, our inference framework, GeneBayes, provides a flexible platform that can improve estimation of many gene-level properties, such as rare variant burden or gene expression differences.


Identifying the genes important for disease and fitness is a central goal in human genetics. One particularly useful measure of importance is how much natural selection constrains a gene [1][2][3][4]. Constraint has been used to prioritize de novo and rare variants for clinical followup [5,6], predict the toxicity of drugs [7], link GWAS hits to genes [8], and characterize transcriptional regulation [9,10], among many other applications.

To estimate the amount of constraint on a gene, several metrics have been developed using [...]

To address this problem, we developed a flexible empirical Bayes framework, GeneBayes, that [...] (see Table 2 for estimates of s_het). [...] are substantially more concentrated when using gene features.
Next, we compared our estimates of s_het from GeneBayes to LOEUF and to selection coefficients estimated by [4] (Figure 2D). To facilitate comparison, we use the posterior modes of s_het reported in [4] as point estimates, but we note that [4] emphasizes the value of using full posterior distributions. While the correlation between our estimates is high for genes with sufficient LOFs (for genes with more expected LOFs than the median, Spearman ρ with LOEUF = 0.94; ρ with s_het from [4] = 0.88), it is lower for genes with few expected LOFs (for genes with fewer expected LOFs than the median, Spearman ρ with LOEUF = 0.71; ρ with s_het from [4] = 0.71). We further explored the reduced correlations for genes with few expected LOFs. For example, [...]

[...] constrained by s_het, with the highest enrichment observed for LOF variants (Figure 3D; enrichment for s_het and LOEUF, respectively: missense mutations = 2.2, 1.9; splice site mutations = 6.3, 4.6; nonsense mutations = 9.5, 6.7). Consistent with previous findings, the excess burden of de novo variants is predominantly in highly constrained genes (Supplementary Figure 2C, left).

Notably, this difference in enrichment remains after removing known DD genes (Supplementary Figure 2C, right). Together, these results indicate that s_het not only improves identification of known disease genes but may also facilitate discovery of novel DD genes [5].

Finally, constraint can also be related to longer-term evolutionary processes that give rise to the variation among individuals or species, including variation in gene expression levels. We expect constrained genes to maintain expression levels closer to their optimal values across evolutionary time scales, as each LOF can be thought of as a ∼50% reduction in expression. Consistent with this expectation, we find that less constrained genes have larger absolute differences in expression between human and chimpanzee in cortical cells [42], with a stronger correlation for s_het than for LOEUF (Figure 3E). This pattern should also hold for the variation in expression within a species. We quantified variability using the normalized standard deviation of gene expression levels estimated from RNA-seq samples in GTEx [43] and found that this variability decreases with increased constraint, again with a stronger correlation for s_het (Figure 3E).

[...] we divided our gene features into 10 distinct categories (Figure 4A) and trained a separate model per category using only the features in that category. We found that missense constraint, gene expression patterns, evolutionary conservation, and protein embeddings are the most informative categories.

236
Next, we further divided the expression features into 24 subgroups, representing tissues, cell types, and developmental stages (Table 6). Expression patterns in the brain, in the digestive system, and during development are the most predictive of constraint (Figure 4B). Notably, a study that [...]

To contextualize our s_het estimates, we compared the distributions of s_het for different gene sets [...] For the hypothetical genes in Figure 5C, a GWAS hit affecting Gene 1 has a stronger selective effect than a LOF affecting Gene 2, despite having a smaller effect on expression.

[...] Here, we developed an empirical Bayes approach to accurately infer s_het, an interpretable metric [...] Our approach eliminates this tradeoff between power and interpretability present in existing metrics.

Our estimates of s_het will be useful for many applications. For example, by informing gene-[...]

[...] severely reduce fitness may be more relevant.

As more exomes are sequenced, one might expect that we would be able to estimate s_het more accurately. Yet, in a companion paper [16], we show that increasing the sample size used for estimating LOF frequencies will provide essentially no additional information for the ∼85% of genes with the lowest values of s_het. This fundamental limit on how much we can learn about these genes from LOF data alone highlights the importance of approaches like ours that can leverage additional data types. By sharing information across genes, we can overcome this fundamental limit on how accurately we can estimate constraint.

Here we focused on estimating s_het, but our empirical Bayes framework, GeneBayes, can be [...]

In summary, we developed a powerful framework for estimating a broadly applicable and readily interpretable metric of constraint, s_het. Our estimates provide a more informative ranking of gene importance than existing metrics, and our approach allows us to interrogate potential causes and consequences of natural selection.

Feature processing and selection

We compiled 10 types of gene features from several sources: [...]

Training and validation

We fine-tuned a set of hyperparameters for our full empirical Bayes approach, using the best hyperparameters from an initial feature selection step ([...])

In classifying essential genes (Figure 3A), we define a gene as essential if its score is < −1 in at least 25% of screens, and as not essential if its score is > −1 in all screens. In classifying [...]
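The essentiality rule above can be sketched as a small helper. This is a sketch: the −1 threshold and 25% cutoff come from the text, while the function name and the convention of returning None for ambiguous genes are ours.

```python
def classify_essential(scores, threshold=-1.0, frac=0.25):
    """Classify a gene from its per-screen essentiality scores.

    Returns "essential" if the score is below `threshold` in at least
    `frac` of screens, "not essential" if the score is above `threshold`
    in every screen, and None (ambiguous, excluded) otherwise.
    """
    n_below = sum(s < threshold for s in scores)
    if n_below >= frac * len(scores):
        return "essential"
    if all(s > threshold for s in scores):
        return "not essential"
    return None
```

Genes that fall between the two rules (e.g. below the threshold in only a small fraction of screens) are left unlabeled rather than forced into either class.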

For each set of genes, we computed the mean enrichment over sites and 95% Poisson confidence intervals for the mean using the code provided by [5].
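As an illustration, the enrichment summary can be sketched as follows. This is a sketch only: it defines mean enrichment as total observed over total expected and uses a normal approximation to the Poisson interval, whereas the actual analysis uses the exact code provided by [5]; the function name is ours.

```python
import math

def mean_enrichment_ci(counts, expected, z=1.96):
    """Mean enrichment over sites with an approximate 95% Poisson CI.

    counts:   observed de novo counts per site.
    expected: expected counts per site under the null.
    Treats the total observed count as Poisson and uses a normal
    approximation for the interval (an exact Garwood interval would
    use chi-square quantiles instead).
    """
    total_obs = sum(counts)
    total_exp = sum(expected)
    enrich = total_obs / total_exp
    half = z * math.sqrt(total_obs) / total_exp
    return enrich, (enrich - half, enrich + half)
```

For well-populated gene sets the normal approximation is close to the exact interval; for small counts the exact interval is preferable.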

Expression variability across species

To understand the variability in expression between humans and other species, we focused on [...]

In contrast, we found that CV is only slightly correlated with µ_i (Spearman ρ = −0.06 in GTEx).
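A minimal stdlib-only sketch of this check: a coefficient of variation per gene and a Spearman correlation computed as the Pearson correlation of ranks (no tie handling; the paper's normalized standard deviation and the exact correlation code may differ in detail, and the function names are ours).

```python
import math
import statistics

def cv(values):
    """Coefficient of variation: sample sd / mean (assumes positive mean)."""
    return statistics.stdev(values) / statistics.mean(values)

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of ranks.

    Simplification: ties get arbitrary (not averaged) ranks.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Applying `cv` to each gene's per-sample expression values and correlating against mean expression reproduces the style of check described above.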

LOESS curves were computed as in "Expression variability across species."

Training models on feature subsets

We grouped features into categories (see Supplementary Table 4 for the features in each category), [...]

In the simplest version of empirical Bayes, we specify the form of the prior distribution and assume that that prior is shared across all genes. For example, for gene i we might assume that logit(s_het^(i)) is normally distributed with mean µ and variance σ². We can then estimate µ and σ using the observed LOF data for each gene, y_1, ..., y_M, by maximizing the marginal likelihood:

    (µ̂, σ̂) = argmax_{µ,σ} ∏_{i=1}^{M} ∫ p(y_i | s) p_{µ,σ}(s) ds.    (1)

Next, we can compute the posterior distribution of s_het^(i) for each gene:

    p(s | y_i) = p(y_i | s) p_{µ̂,σ̂}(s) / ∫ p(y_i | s′) p_{µ̂,σ̂}(s′) ds′.    (2)

However, rather than learning the parameters of the prior from only the LOF data, we can also [...] Specifically, for gene i, we assume the prior distribution is parameterized by µ_i = f(x_i) and σ_i = g(x_i), where x_i denotes gene i's features and f and g are learned by gradient boosting (NGBoost). Then, with M genes in the training set, the score that NGBoost maximizes during training is:

    S = ∑_{i=1}^{M} log ∫ p(y_i | s) p_{µ_i,σ_i}(s) ds.    (3)

To do this, NGBoost first initializes the parameters of f and g such that all genes have the same prior distribution. Next, NGBoost adopts a gradient descent approach to maximize the score function. For each iteration t until training ends:

(a) Compute the natural gradient of gene i's score with respect to the parameters µ_i and σ_i, where the natural gradient of S = S(y_i; µ_i, σ_i) is defined as

    ∇̃S = I(µ_i, σ_i)⁻¹ ∇S,

where I(µ_i, σ_i) is the Fisher information matrix for p_{µ_i,σ_i}(s) and ∇S is the ordinary gradient, whose two components we denote ∇S_µ and ∇S_σ.

(b) Fit decision trees f^(t) and g^(t) on the natural gradients: [...]

Once training is complete, we obtain a learned prior with parameters µ_i = f(x_i) and σ_i = g(x_i) for each gene.

In order to obtain a more interpretable measure of constraint, we formalize constraint as the strength of natural selection acting against gene loss-of-function in a population genetics model. That is, we can ask how much fitness is reduced on average for an individual with one or two nonfunctional copies of a gene relative to individuals with two functional copies, following previous work [1,2,4].
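The posterior computation of equation (2) can be sketched as a discretized Bayes update over a grid of s_het values. This is a sketch under stated assumptions: the uniform prior and the toy Poisson likelihood below (in which stronger selection leaves fewer observed LOFs) are stand-ins for the learned prior and the population-genetic likelihood p(y_i | s), and all names are ours.

```python
import math

def posterior_on_grid(grid, prior_weights, log_lik):
    """Discretized Bayes rule: posterior ∝ likelihood × prior on a grid."""
    logs = [math.log(w) + log_lik(s) for s, w in zip(grid, prior_weights)]
    m = max(logs)                       # stabilize before exponentiating
    unnorm = [math.exp(v - m) for v in logs]
    z = sum(unnorm)
    post = [u / z for u in unnorm]
    mean = sum(s * p for s, p in zip(grid, post))
    return post, mean

# Toy example: y observed LOFs, E expected under neutrality, and a
# Poisson likelihood with mean E*(1 - s) -- NOT the paper's
# Wright-Fisher likelihood, just a stand-in with the right shape.
grid = [i / 100 for i in range(1, 100)]
prior = [1 / len(grid)] * len(grid)

def make_log_lik(y, E):
    def log_lik(s):
        lam = E * (1 - s) + 1e-9
        return y * math.log(lam) - lam
    return log_lik

_, mean_no_lofs = posterior_on_grid(grid, prior, make_log_lik(0, 10))
_, mean_many_lofs = posterior_on_grid(grid, prior, make_log_lik(10, 10))
# genes with no observed LOFs get a higher inferred constraint
```

The same grid update applies unchanged once the uniform prior is replaced by the learned, gene-specific prior p_{µ_i,σ_i}(s).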
To tie this concept of constraint to observed allele frequency data, we use a slightly simplified version of the discrete-time Wright-Fisher model. This model contains mutation, selection, and genetic drift, and assumes that there are only two alleles and that the population is panmictic, monoecious, and has non-overlapping generations. While all of these assumptions are violated in humans (there are four nucleotides, population structure, two sexes, and overlapping generations), the model still provides a good approximation to allele frequency dynamics through time. If the allele frequency in generation k is f_k, then we model the allele frequency in the next generation via binomial sampling:

    2N_{k+1} f_{k+1} | f_k ∼ Binomial(2N_{k+1}, p(f_k)),

where N_{k+1} is the number of diploid individuals in generation k+1 and p(f_k) is the expected frequency after mutation and selection, with [...] rare this would be.

We instead use the equilibrium of another process as the initial condition, which avoids these conceptual pitfalls. We assume the distribution of frequencies at generation 0 is the equilibrium conditioned on the LOF allele never reaching fixation in the population. We then compute the likelihood [...] given that the allele frequency in generation k is i/2N_k. This formulation makes clear that we can obtain the likelihood of observing a given frequency at present, given some initial distribution, by performing a series of matrix-vector multiplications [...]

    p(y^(i) | s_het^(i)) = ∏_j p(y_j^(i) | s_het^(i)),

where y^(i) is a vector of the observed allele frequencies at each possible LOF site j in gene i, and s_het^(i) is the selection coefficient for having a heterozygous loss-of-function of gene i. Under this formulation, we can easily model misannotation by assuming that each LOF independently has some probability p_miss of being misannotated, and that misannotated variants evolve neutrally:

    p̃(y_j^(i) | s_het^(i)) = (1 − p_miss) p(y_j^(i) | s_het^(i)) + p_miss p(y_j^(i) | s_het = 0).
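One generation of this model can be sketched as follows. This is a sketch under stated assumptions: the genotype fitnesses 1, 1 − s_het, 1 − 2·s_het are our additive simplification (it breaks down if the LOF allele is fixed at s_het = 0.5), and the function names are ours.

```python
import random

def expected_freq(f, s_het, mu, nu):
    """Deterministic part of one generation: selection, then mutation.

    f: current LOF allele frequency; s_het: heterozygous fitness cost;
    mu: functional -> LOF mutation rate; nu: back mutation rate.
    Genotype fitnesses are 1, 1 - s_het, 1 - 2*s_het.
    """
    w_bar = 1 - 2 * s_het * f * (1 - f) - 2 * s_het * f * f
    f_sel = (f * (1 - f) * (1 - s_het) + f * f * (1 - 2 * s_het)) / w_bar
    return f_sel * (1 - nu) + (1 - f_sel) * mu

def next_freq(f, N_next, s_het, mu, nu):
    """Genetic drift: binomial sampling of 2*N_next chromosomes."""
    p = expected_freq(f, s_het, mu, nu)
    k = sum(random.random() < p for _ in range(2 * N_next))
    return k / (2 * N_next)
```

Iterating `next_freq` over a sequence of population sizes N_1, N_2, ... simulates an allele frequency trajectory; the likelihood computation in the text instead propagates the full distribution over frequencies with transition-matrix multiplications.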

Model parameters
[...] (population sizes through time), and the probability that different variants are misannotated.
In our population genetics model, we assume that there are only two alleles (a functional allele and an LOF allele), whereas in reality there are four nucleotides. We approximate the rate of mutating from the functional allele to the LOF allele as the sum of the mutation rates from the reference nucleotide to any nucleotide that might result in LOF. For example, if the reference allele is A, and either a C or a T would result in LOF, then we say that the rate at which the functional allele mutates to the LOF allele is the rate at which A mutates to C in this context plus the rate at which A mutates to T in this context. For the rate of back mutation from the LOF allele to the functional allele, we compute a weighted average of the rates of each possible LOF nucleotide back-mutating to any possible non-LOF nucleotide, weighted by the probability that the original non-LOF nucleotide mutated to that particular LOF nucleotide. Continuing our previous example, suppose A mutates to C at rate 1 × 10⁻⁸ and A mutates to T at rate 1.5 × 10⁻⁸. Then, conditioned on there having been a single mutation resulting in an LOF variant, there is a 1/2.5 = 0.4 chance that the LOF is C and a 0.6 chance that the LOF is T. We then compute the back mutation rate as 0.4 times (the rate at which C mutates to A in this context plus the rate at which C mutates to G in this context, since both A and G do not result in LOF) plus 0.6 times (the rate at which T mutates to A in this context plus the rate at which T mutates to G in this context). Implicitly, this scheme assumes that the [...]

[...] only has a π_i probability of being differentially expressed: [...]

Variant burden tests

In this example, users have sequencing data from patients with a disease or (if calling de novo mutations) sequencing data from family trios, and would like to identify genes with excess mutational burden in patients (e.g.
an excess of missense or LOF variants). One approach is to infer the [...]

Prior

Because η_i is non-negative, one may want to choose a gamma prior with parameters α_i and β_i:

    η_i ∼ Gamma(α_i, β_i).
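If the observed variant count is modeled as Poisson with mean η_i times its expected count, the gamma prior yields a closed-form posterior by conjugacy. This is a sketch: the Poisson likelihood is one natural choice for count data and the function name is ours; the framework itself accepts other likelihoods.

```python
def gamma_poisson_posterior(alpha, beta, count, expected):
    """Conjugate update for a relative burden eta_i.

    prior:  eta ~ Gamma(alpha, beta)   (shape/rate parameterization)
    data:   count ~ Poisson(eta * expected)
    gives:  eta | count ~ Gamma(alpha + count, beta + expected)
    """
    a_post = alpha + count
    b_post = beta + expected
    return a_post, b_post, a_post / b_post  # (shape, rate, posterior mean)

# e.g. 5 observed LOFs where 2 are expected, with a Gamma(1, 1) prior:
# posterior mean (1 + 5) / (1 + 2) = 2.0, a two-fold excess burden
```

In the GeneBayes setting, α_i and β_i would be predicted from gene features rather than fixed, with the same conjugate update applied per gene.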