Insufficient evidence for natural selection associated with the Black Death

Klunk et al. analyzed ancient DNA data from individuals in London and Denmark before, during and after the Black Death [1], and argued that allele frequency changes at immune genes were too large to be produced by random genetic drift and thus must reflect natural selection. They also identified four specific variants that they claimed show evidence of selection including at ERAP2, for which they estimate a selection coefficient of 0.39–several times larger than any selection coefficient on a common human variant reported to date. Here we show that these claims are unsupported for four reasons. First, the signal of enrichment of large allele frequency changes in immune genes comparing people in London before and after the Black Death disappears after an appropriate randomization test is carried out: the P value increases by ten orders of magnitude and is no longer significant. Second, a technical error in the estimation of allele frequencies means that none of the four originally reported loci actually pass the filtering thresholds. Third, the filtering thresholds do not adequately correct for multiple testing. Finally, in the case of the ERAP2 variant rs2549794, which Klunk et al. show experimentally may be associated with a host interaction with Y. pestis, we find no evidence of significant frequency change either in the data that Klunk et al. report, or in published data spanning 2,000 years. While it remains plausible that immune genes were subject to natural selection during the Black Death, the magnitude of this selection and which specific genes may have been affected remains unknown.


The signal of enrichment of unusual frequency changes at immune genes, a primary motivation for the search for individual signals, is a statistical artifact not driven by evolution.
A primary motivation for Klunk et al.'s search for individual variants that changed in frequency after the Black Death is the finding that in a survey of thousands of variants in immune genes, there was highly significant enrichment of large FST values compared to a panel of putatively neutral genomic regions (P < 10 -11 ). However, when we randomly permuted which samples were labeled as pre-and post-Black Death, such that there should be no enrichment signal due to differences between time periods (Methods), we continued to find an extremely strong signal with inflated P values for the binomial test for enrichment over the 99th percentile of variants with minor allele frequency greater than 10% ( Figure 1A). The P values do not follow a null expectation, and instead the c 2 values had an average inflation factor of 11.80 ( Figure 1B). In particular, 7% of the permutations produce a P value more significant that that observed in the original paper (P = 7.9×10 -12 ), corresponding to an increase of ten orders of magnitude to a non-significant level (P > 0.05). This is the pattern expected from a data artifact related to systematic coverage differences between the sets of immune and neutral loci (mean coverage at GWAS loci 7.1x, at exonic loci 2.9x and at neutral loci 8.3x), and not what would be expected if the signal was due to natural selection. We also carried out an independent set of 100 simulations, taking genome-wide sequencing data from present-day British and North American individuals with Northern European ancestry [2], and randomly assigning individuals to being before and after the Black Death while down-sampling the sequence data to approximately match the coverage of the individuals used in the actual published study on a per-site basis. This analysis showed enrichment of large FST values at immune loci with an average inflation factor of 68.08 of the c 2 values (Methods, Figure 1C,D) and 39% of simulation replicates resulted in P values more significant than observed in the original paper. Taken together, these results demonstrate that differences in coverage at the tested sites can produce P values that do not follow a null expectation and can thus generate a spurious signal of enrichment similar to that reported by Klunk et al.

Allele frequencies are estimated incorrectly and consequently biased; computed correctly, none of the four reported loci pass the filtering thresholds
Because it is difficult to make genotype calls from low coverage data, it is common to estimate allele frequencies based on genotype likelihoods. Typically, this is done using a maximum likelihood approach [3]. With ancient DNA, it is common to use pseudohaploid calls for robustness [4], although maximum likelihood approaches based on genotype likelihoods can also be used [5]. However, Klunk et al. do not do this and instead estimate allele frequencies using the genotype likelihoods as follows: "we calculated the expected number of alternate alleles as the likelihood the individual is heterozygous plus 2x the likelihood the individual is homozygous alternate." This procedure is incorrect. Genotype likelihoods are the probability of the data conditional on the genotype P(data|genotype) but the approach of Klunk et al. treats them as though they were the probability of the genotype conditional on the data P(genotype|data). Their expectation is mathematically equivalent to a posterior mean with a prior that all three genotypes are equally likely. This will produce allele frequency estimates that are biased towards the prior mean of 0.5 and the extent of the bias will depend on coverage (because the posterior information depends on the read depth; Figure 2A-C). Thus, estimated differences in allele frequency across time periods can be driven by differences in coverage, rather than real changes in allele frequency (mean coverage in London pre-Black Death 6.8x, during Black Death 7.5x, post-Black Death 6.0x). Using the genotype likelihoods reported by Klunk et al., we recalculated allele frequencies using maximum likelihood and applied their filtering criteria (Methods). None of the four originally reported loci pass the thresholds used by Klunk et al. with the correctly estimated allele frequencies (one other locus does, but does not pass a Bonferroni-corrected threshold; Figure 2D). Rerunning the enrichment analysis of the coverage-matched data described in the previous section with this maximum likelihood allele frequency estimates still resulted in inflated P values indicating that this issue is separate to the one described above. Finally, we note that in order to avoid false positive calls due to damage most ancient DNA studies have either restricted to known polymorphic sites, restricted to transversions, or used Uracil-DNA glycosylase. Klunk et al. do not use any of these strategies. Of the 22,868 variants with MAF>5% that they analyzed, the transition/transversion ratio is around 19 (compared to an expectation of 2-3 for genomic data) and only 4,458 appear in the ~96 million UK Biobank SNP imputation set [6] which would be expected to include almost all sites with MAF>5% in the ancient populations. This suggests that the majority of the variants analyzed by Klunk et al. are false positives, likely due to ancient DNA damage.

Filtering thresholds do not appropriately correct for multiple testing
In order to control the experiment-wide false-positive rate, genome-wide scans must correct for the large number of statistical tests performed [7]. Klunk et al. do not apply such corrections, instead using an ad hoc filtering strategy with no clear statistical justification. Specifically, their filtering strategy requires variants to be in the top 5% of the FST distribution in London and the top 10% in Denmark. Additionally, the variants are required to first increase then decrease (or vice-versa) in frequency in London and to move in the same direction in London and Denmark, when comparing pre-to post-Black Death samples. We simulated variants under a null model of identical allele frequencies in all populations using the same sample sizes as Klunk et al. (optimistically assuming diploid coverage at all loci) and find that a proportion 1.4×10 -4 of variants pass these filters. For comparison, a conventional Bonferronicorrected significance threshold would be 0.05/3293 = 1.5×10 -5 , approximately ten times smaller. Klunk et al. tested 3293 variants and should therefore expect 3293×1.4×10 -4 = 0.46 . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 15, 2023. ; variants to pass, if there were no difference in allele frequencies across populations. This is consistent with the one that we find when correctly estimating allele frequencies. Furthermore, many ancient DNA studies have required multiple linked variants to exceed the significance threshold in order to remove spurious outliers driven by genotyping errors [5,8]-a requirement that is also standard in genome-wide association studies. Due to the sparse genotyping, Klunk et al. cannot apply this additional filter. Klunk et al. show that almost all of their candidate immune genes are differentially expressed in response to stimulation by multiple pathogens including Y. pestis. They additionally show that their reported top SNP at ERAP2 (rs2549794) is associated with a differential response to Y. pestis stimulation. However, this does not prove that the rs2549794 C allele is protective against infection, nor does it provide evidence of natural selection at the SNP -which can only be proven by a statistical analysis showing a change in population frequency due to Black Death exposure. However, the sample sizes in the study are so small that the uncertainty in FST estimates is too large to infer selection. If there were in reality no difference in allele frequency, the probability of observing an FST at least as high as the value of 0.0247 observed for rs2549794 with the available pre-vs post-Black Death London sample sizes and perfect genotype information would be approximately P = 0.067 (Methods, Figure 3A). This is likely an underestimate since in reality low coverage data, genetic drift and reference bias [9] could lead to inflated FST values. The large point estimate of the selection coefficient-approximately five times larger than that at the lactase persistence allele [10] which is the largest selection coefficient at a common variant documented in humans to date-is likely simply a consequence of the fact that candidate variants were required to have large estimated FST values and is not an independent piece of evidence for selection ( Figure 3B).

No evidence of natural selection at ERAP2 over the past 2000 years
We also examined the allele frequencies of rs2549794 in ancient and present-day populations from the United Kingdom [11][12][13][14][15][16] and Denmark [16,17] to study whether there was any evidence for shifts in allele frequencies (Methods). The C allele frequency in present-day people with "British/Irish" ancestry from UK Biobank is 0.438 (N=264,261 alleles), not significantly different compared to ancient individuals from England dated between 2000 and 1000 BP (frequency=0.466, N=146 alleles, χ 2 P = 0.55). Similarly, the frequency in present-day Denmark (0.429, N= 100,528 alleles) is not significantly different than Denmark 2000-1000 BP (0.368, N=76 alleles, P = 0.34). There is therefore no evidence that the frequency of rs2549794 has changed in either England or Denmark over the past 2,000 years ( Figure 3C). Finally, we note that this SNP shows no evidence of selection in other scans that attempt to detect adaptation in Britain in this time period using a range of different data and methods. These include a scan based on time series of ancient DNA (rs2549794 empirical P = 0.18) [10], the singleton density . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 15, 2023. ; https://doi.org/10.1101/2023.03.14.532615 doi: bioRxiv preprint score based on high coverage sequence data (rs2549794 Z=1.33, P = 0.09) [18] and a scan based on estimated coalescent times in UK Biobank (minimum P value within 1Mb 0.40) [19]. The locus also failed to replicate in two independent studies of ancient genomes from before and after the Black Death in Britain and Norway [9,20].

Conclusion
Inadequate correction for multiple testing, implausibly large estimated effects due to small sample sizes, and post hoc rationalization for marginal associations are precisely the issues that led to the development of clear statistical standards for genome-wide scans. Even ignoring technical issues with the allele frequency estimation and systematic differences in coverage between loci, the evidence presented by Klunk et al. does not meet these standards.

Permutation test for enrichment at immune loci
To test whether the observed enrichment of large FST values at immune loci was due to real signals of differentiation or to technical artifacts related to the samples and targeted loci, we performed a permutation analysis. We randomly shuffled the labels for the London pre-Black Death and London post-Black Death samples such that for every run of the permutation test, the new cohort for each time point consisted of random individuals drawn from the union of the original time points without replacement. We also preserved the number of samples in each time point included for each target panel (GWAS loci, exon loci, and neutral loci). We performed this sampling 100 times, and ran each set of samples through the original Klunk et al. allele frequency estimation and enrichment pipeline, testing for enrichment at a 99% threshold for variants with MAF > 0.10.

Testing whether differences in coverage can lead to apparent enrichment at immune loci
The differences in the mean coverage of the three target panels suggested a possible cause of the observed enrichment of high FST values at the immune loci. To test this, we replicated 100 datasets with approximately matched coverage to the samples sequenced in Klunk et al. and ran them through their original pipeline. For each run, we randomly sampled 206 of the 270 GBR and CEPH samples from the 1000 Genomes Project [2] sequenced to ~30x to match each of the samples in the Klunk et al. study that were not dropped from the final analysis due to high missingness. We computed read depths using "samtools depth" [21] at all sites tested by Klunk et al. and present in the UK Biobank v3 imputation panel [6] for both datasets using the UCSC browser version of LiftOver to convert from hg19 sites to GRCh38 [22]. Next, we used the "samtools view -s" option to down-sample each 1000 Genomes alignment to the fraction of reads present in the original sample. Where the Klunk et al. read depth exceeded that in the . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 15, 2023. ; https://doi.org/10.1101/2023.03.14.532615 doi: bioRxiv preprint alignments through HaplotypeCaller to produce gvcfs with calls at each of the sites and then subsequently joint called again with GATK HaplotypeCaller [23] across all 207 simulated samples, in both instances using the same parameters as those in the original pipeline, again using the original allele frequency estimator. Finally, we used these calls as input for the original Klunk et al. pipeline. The only alteration to this method was that, because of the reduced number of sites compared to the original paper, the same samples were dropped or retained as in the original study regardless of their missingness in the new dataset. In addition to the results reported above, the enrichment of 99% for variants with MAF > 0.30 reported in Klunk et al. showed inflation of chi-squared statistics of 67.92, and 23% of the P values were more significant than that reported in the original study (P=1.16x10 -14 ). When using a maximum likelihood method to calculate allele frequencies rather than the original method, the enrichment of 99% for variants with MAF > 0.10 resulted in 50% of P values more significant than that calculated for the original data with an updated maximum likelihood estimator (P=2.09x10 -16 ) and an inflation of c 2 statistics of 147.66. Additionally, for the enrichment of 99% for variants with MAF > 0.30, 31% of P values were more significant than that calculated for the original data with an updated maximum likelihood estimator (P=1.64x10 -13 ), and there was an inflation of c 2 statistics of 68.81.

Maximum likelihood estimation of allele frequencies
Let the genotype likelihoods for individual i=1…n be given by " # for genotypes j=0,1,2. Then the maximum likelihood estimate of the allele frequency p is: We implemented the maximum likelihood estimator by numerically maximizing the expression above using the genotype likelihoods reported by Klunk  .

Evaluation of filtering threshold under a null model
We simulated 1 million observations of genotype data from populations with identical allele frequencies ranging from 0.1 to 0.5 in intervals of 0.1 with sample sizes corresponding to the cohort sizes of the London and Denmark samples. We then calculated FST using the Weir and Cockerham estimator [24] and counted the proportion of observations that passed the reported filtering threshold, averaged across all allele frequencies.
Selection coefficients for variants that pass the filtering threshold under a null model . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 15, 2023. ; https://doi.org/10.1101/2023.03.14.532615 doi: bioRxiv preprint . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made  . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 15, 2023. ; https://doi.org/10.1101/2023.03.14.532615 doi: bioRxiv preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted March 15, 2023. ; https://doi.org/10.1101/2023.03.14.532615 doi: bioRxiv preprint