IntroUNET: identifying introgressed alleles via semantic segmentation

A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient—ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of those individual’s alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled “ghost” population, performing comparably to a supervised learning method tailored specifically to that task. Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method’s success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.


Supporting Information Legends
and "pop 2" respectively), the true introgressed alleles for these two populations (labeled "pop 1 (y)" and "pop 2 (y)" respectively), the introgressed alleles inferred by IntroUNET ("pop 1 (pred)" and "pop 2 (pred)"), and IntroUNET's inferred introgression probabilities (labelled "prob", and scaled according to the color bar shown with the third example).Alignments and introgressed histories, true and predicted, are shown in the same format as in Figure 1.introgression times (with the lower bound always set to zero).These simulations were performed in the same manner as described for the simple bidirectional model in the Methods, with the exception of these two parameters.Each heatmap in this grid shows the accuracy of one version of IntroUNET on each of the 25 test sets, and the parameter combination used to train that network is marked by a circle.For example, if the true split time is 2N generations ago and the true split time is 0.1 times the split time, one can observe the impact of misspecification on accuracy by comparing the top-left value in the top-left heatmap (i.e.no misspecification in this case) to the top-left value of all other heatmaps in the figure, which experience varying degrees of misspecification.
Figure S6: Performance of IntroUNET on the task of identifying introgression on a dataset where introgression is occuring in only one direction, but IntroUNET was trained to detect bidirectional introgression.The test data here were the same as those examined in Figure 4A, but the network used to perform inference was the same as that used for Figure 4C.  1 × 10 −5 , or ∼ 0.095 for a 10 kb introgressed segment).Each example shows the input alignments for the two populations (labeled "pop 1" and "pop 2" respectively), the true introgressed alleles for these two populations (labeled "pop 1 (y)" and "pop 2 (y)" respectively), the introgressed alleles inferred by IntroUNET ("pop 1 (pred)" and "pop 2 (pred)"), and IntroUNET's inferred introgression probabilities (labelled "prob").
Alignments and introgressed histories, true and predicted, are shown in the same format as in Figure 1.  Figure S11: Performance of IntroUNET on the task of identifying bidirectional introgression using unphased genotype data.The test data here were the same as those examined in Figure 4C, but here rather than 47 predicted which haplotypes are introgressed, IntroUNET, when given a matrix of diploid genotypes, infers which diploid individuals have at least one introgressed allele at a given polymorphic site.
Figure S12: Diagram of the ghost introgression demographic model from [40], in which an unsampled archaic population splits off from the main population, before a pulse introgression event introduces alleles from this population into a sampled "Target" population, not an unsampled "Reference" population.Figure S16: Alignments and segmentations from six windows on chr3R in the vicinity of the locus of adaptive introgression (AI).Three of the windows (left) are from outside of the region affected by AI, and the other three (right) are within the AI locus.Each example shows the input alignments for the two populations (labeled "X (dsim)" and "X (dsech)", respectively), IntroUNET's inferred introgression probabilities ("ŷ (probs)"), and the most probable class for each allele in each individual ("ŷ (class)").Inferred introgressed histories and introgression probabilities are shown in the same format as in Figure 1.Alignments are shown in the same format as for previous figures, with the exception that for D. sechellia which has a different color scheme in haplotypes that were inferred to be introgressed in order to highlight these regions of the alignment: blue for the ancestral allele, and red for the derived allele.Table S2: CPU / GPU time estimates for accomplishing the experiments in the paper.The simulation and formatting results for this table were computed from a small sample of 430 replicates over 4 cores of an Intel Core i9-9900K CPU @ 3.60GHz and then scaled to give estimates for 10 5 replicates in each case.The Formatting column lists the time estimates for alignment sorting (for IntroUNET) and statistic calculation (for ArchIE).The Discriminator and Segmenter columns list the training times for classifying entire windows and for identifying introgressed haplotypes, respectively.In the GPU columns the estimate is simply the run time for the training described.The training of the neural networks was done on an NVIDIA A40 GPU, and we found that VRAM usage was <12Gb in all cases.We note that the ArchIE method computes statistics over the entire simulated window (224.64 on average in our simulated 50kb windows) whereas our method only formats a small sequential sample of SNPs from each replicate (192 for the Archaic introgression program).
Below the time estimates for simulation are the simulated region size, and below the formatting times are the resulting "image" size or (populations, individuals, sites) and for the case of ArchIE, the window size in base pairs.Note that we do not include the time for execution on data after training, but we observed that classification times for all are generally negligible (although the sorting/statistic calculation steps must be performed first and these can be costly as shown in the Formatting section  and "pop 2" respectively), the true introgressed alleles for these two populations (labeled "pop 1 (y)" and "pop 2 (y)" respectively), the introgressed alleles inferred by IntroUNET ("pop 1 (pred)" and "pop 2 (pred)"), and IntroUNET's inferred introgression probabilities (labelled "prob", and scaled according to the color bar shown with the third example).Alignments and introgressed histories, true and predicted, are shown in the same format as in Figure 1.   , or ∼ 0.095 for a 10 kb introgressed segment).Each example shows the input alignments for the two populations (labeled "pop 1" and "pop 2" respectively), the true introgressed alleles for these two populations (labeled "pop 1 (y)" and "pop 2 (y)" respectively), the introgressed alleles inferred by IntroUNET ("pop 1 (pred)" and "pop 2 (pred)"), and IntroUNET's inferred introgression probabilities (labelled "prob").Alignments and introgressed histories, true and predicted, are shown in the same format as in Figure 1.A) The fraction of alleles falling within a given bin of IntroUNET's predicted probability of introgression that were in fact truly introgressed, prior to Platt recalibration.B) Same as (A), after recalibration. 64 Within AI region Outside of AI region  Each example shows the input alignments for the two populations (labeled "X (dsim)" and "X (dsech)", respectively), IntroUNET's inferred introgression probabilities ("ŷ (probs)"), and the most probable class for each allele in each individual ("ŷ (class)").Inferred introgressed histories and introgression probabilities are shown in the same format as in Figure 1.Alignments are shown in the same format as for previous figures, with the exception that for D. sechellia which has a different color scheme in haplotypes that were inferred to be introgressed in order to highlight these regions of the alignment: blue for the ancestral allele, and red for the derived allele.Table S2: CPU / GPU time estimates for accomplishing the experiments in the paper.The simulation and formatting results for this table were computed from a small sample of 430 replicates over 4 cores of an Intel Core i9-9900K CPU @ 3.60GHz and then scaled to give estimates for 10 5 replicates in each case.The Formatting column lists the time estimates for alignment sorting (for IntroUNET) and statistic calculation (for ArchIE).The Discriminator and Segmenter columns list the training times for classifying entire windows and for identifying introgressed haplotypes, respectively.In the GPU columns the estimate is simply the run time for the training described.The training of the neural networks was done on an NVIDIA A40 GPU, and we found that VRAM usage was <12Gb in all cases.We note that the ArchIE method computes statistics over the entire simulated window (224.64 on average in our simulated 50kb windows) whereas our method only formats a small sequential sample of SNPs from each replicate (192 for the Archaic introgression program).Below the time estimates for simulation are the simulated region size, and below the formatting times are the resulting "image" size or (populations, individuals, sites) and for the case of ArchIE, the window size in base pairs.Note that we do not include the time for execution on data after training, but we observed that classification times for all are generally negligible (although the sorting/statistic calculation steps must be performed first and these can be costly as shown in the Formatting section).70

Figure S1 :
Figure S1: A diagram of the chosen residual block structure shown with the dimensions for the first convolution for an initial size of two-populations by 64 individuals by 128 polymorphisms.

Figure S2 :
Figure S2: Loss function value trajectories calculated on training and validation data for versions of IntroUNET with different values of the label smoothing strength parameter alpha.All tests were calculated on simulated examples of the simple bidirectional scenario described in the Methods.Note that for alpha = 0.1 training loss is higher than validation loss.This is because label smoothing is only applied during training, and smoothing increases loss by adding noise to the target y values.

Figure S3 :
Figure S3: Five randomly chosen example segmentations on simulations from our simple bidirectional introgression scenario.Each example shows the input alignments for the two populations (labeled "pop 1"

Figure S4 :
Figure S4: Loss function value trajectories calculated on training and validation data for versions of IntroUNET with increasing sample sizes (32, 64, and 128 individuals per subpopulation), training set sizes (1000, 10000, and 100000 alignments), and window sizes (64, 128, and 256 polymorphisms).All tests were calculated on simulated examples of the simple bidirectional scenario described in the Methods.

Figure S5 :
Figure S5: IntroUNET's accuracy when the split and migration times may be misspecified.IntroUNET was trained on 25 different combinations of the population split time and the upper bound of the range of possible

Figure S7 :
Figure S7: The impact of direct selection on IntroUNET's performance.The left column shows confusion matrices obtained when using the same trained network used for the simple bidirectional introgression scenario whose parameters are laid out in Table 1, and applied to data simulated under the same model but with introgressed nucleotides experiencing negative selection.The column on the right shows results when testing the same IntroUNET model on data where introgressed segments are positively selected.The values of s represent the selection coefficient per introgressed nucleotide.Note that the shading represents the number of examples in each entry of the confusion matrix rather than the fraction of examples.

Figure S8 :
Figure S8: Five randomly chosen example segmentations on simulations from our simple bidirectional introgression scenario with relatively strong positive selection acting on each introgressed nucleotide (s =

Figure S9 :
Figure S9: The impact of background selection (BGS) on IntroUNET's performance.The confusion matrix shows IntroUNET's classification performance when applying the same trained network used for the simple bidirectional introgression scenario whose parameters are laid out in Table 1 to data simulated under the the D. melanogaster BGS model specified in the Methods.Note that the shading represents the number of examples in each entry of the confusion matrix rather than the fraction of examples.

Figure S10 :
Figure S10: The impact of recombination rate on IntroUNET's accuracy under the D. melanogaster background selection model.The left panel shows each simulation's accuracy, averaged across classified pixels from the central 100 kb of the simulated chromosome, as a function of the average recombination rate in that same window.The right panel shows each simulation's accuracy, averaged across classified pixels from the central 100 kb of the simulated chromosome, as a function of the recombination rate averaged across the entire simulated 1 Mb chromosome.

Figure S13 :
Figure S13: Ten example segmentations on simulations from our archaic introgression scenario: 5 from IntroUNET (left) and 5 from ArchIE (right).For each example we show the true and inferred introgressed alleles in the recipient population.For each method, both examples with and without introgression are shown.

Figure S14 :
Figure S14: Five example segmentations on simulations from our Drosophila introgression scenario.Each example shows the input alignments for the two populations, the true and inferred introgressed alleles for the D. sechellia population, and IntroUNET's inferred introgression probabilities.Alignments and introgressed histories, true and predicted, are shown in the same format as in Figure 1.

Figure S15 :
Figure S15: Calibration curves showing the impact of Platt scaling on the accuracy of the introgression probability estimates produced by IntroUNET, calculated on the validation set from the Drosophila simulations (Methods).A) The fraction of alleles falling within a given bin of IntroUNET's predicted probability of introgression that were in fact truly introgressed, prior to Platt recalibration.B) Same as (A), after recalibration.

Figure S17 :
Figure S17: The lengths of introgressed haplotypes predicted by IntroUNET at increasing distances away from the adaptive introgression even on chr3R.Introgressed haplotypes were defined as runs of consecutive SNPs classified as introgressed for a given individual.The sweep center was set to position 4624900, the center of the window with the lowest level of diversity in this region (data from [39]).

Figure S18 :
Figure S18: Results of training the same architecture on data seriated via different distance metrics, as well as using unsorted data (i.e.individuals within each population are arranged in the arbitrary order produced by the simulator), for the ghost-introgression problem (top row, panels A and B) and the Drosophila demographic model (bottom row, panels C and D).These plots show the values of training (A and C) and validation (B and D) loss over the course of training.Validation loss is usually lower than training in the case of Drosophila because label smoothing was applied to the training data for the purposes of regularization, but not to the validation data.

Figure S19 :
Figure S19: Confusion matrix showing performance of a classifier that detects genomic windows that have experienced introgression, trained and evaluated on data simulated under the D. simulans-D.sechellia scenario as described in the Methods.

Figure S2 :
Figure S2: Loss function value trajectories calculated on training and validation data for versions of IntroUNET with different values of the label smoothing strength parameter alpha.All tests were calculated on simulated examples of the simple bidirectional scenario described in the Methods.Note that for alpha = 0.1 training loss is higher than validation loss.This is because label smoothing is only applied during training, and smoothing increases loss by adding noise to the target y values.

Figure S4 :
Figure S4: Loss function value trajectories calculated on training and validation data for versions of IntroUNET with increasing sample sizes (32, 64, and 128 individuals per subpopulation), training set sizes (1000, 10000, and 100000 alignments), and window sizes (64, 128, and 256 polymorphisms).All tests were calculated on simulated examples of the simple bidirectional scenario described in the Methods.

Figure S6 :
FigureS6: Performance of IntroUNET on the task of identifying introgression on a dataset where introgression is occuring in only one direction, but IntroUNET was trained to detect bidirectional introgression.The test data here were the same as those examined in Figure4A, but the network used to perform inference was the same as that used for Figure4C.

Figure S7 :
Figure S7: The impact of direct selection on IntroUNET's performance.The left column shows confusion matrices obtained when using the same trained network used for the simple bidirectional introgression scenario whose parameters are laid out in Table 1, and applied to data simulated under the same model but with introgressed nucleotides experiencing negative selection.The column on the right shows results when testing the same IntroUNET model on data where introgressed segments are positively selected.The values of s represent the selection coefficient per introgressed nucleotide.Note that the shading represents the number of examples in each entry of the confusion matrix rather than the fraction of examples.

Figure S9 :
Figure S9: The impact of background selection (BGS) on IntroUNET's performance.The confusion matrix shows IntroUNET's classification performance when applying the same trained network used for the simple bidirectional introgression scenario whose parameters are laid out in Table 1 to data simulated under the the D. melanogaster BGS model specified in the Methods.Note that the shading represents the number of examples in each entry of the confusion matrix rather than the fraction of examples.

Figure S11 :
FigureS11: Performance of IntroUNET on the task of identifying bidirectional introgression using unphased genotype data.The test data here were the same as those examined in Figure4C, but here rather than predicted which haplotypes are introgressed, IntroUNET, when given a matrix of diploid genotypes, infers which diploid individuals have at least one introgressed allele at a given polymorphic site.

Figure S12 :Figure S13 :DFigure S14 :
Figure S12: Diagram of the ghost introgression demographic model from[40], in which an unsampled archaic population splits off from the main population, before a pulse introgression event introduces alleles from this population into a sampled "Target" population, not an unsampled "Reference" population.
Chromosomal position (in Mb) Chromosomal position (in Mb) Chromosomal position (in Mb) Chromosomal position (in Mb) Chromosomal position (in Mb) Chromosomal position (in Mb)

Figure S16 :
FigureS16: Alignments and segmentations from six windows on chr3R in the vicinity of the locus of adaptive introgression (AI).Three of the windows (left) are from outside of the region affected by AI, and the other three (right) are within the AI locus.Each example shows the input alignments for the two populations (labeled "X (dsim)" and "X (dsech)", respectively), IntroUNET's inferred introgression probabilities ("ŷ (probs)"), and the most probable class for each allele in each individual ("ŷ (class)").Inferred introgressed histories and introgression probabilities are shown in the same format as in Figure1.Alignments are shown in the same format as for previous figures, with the exception that for D. sechellia which has a different color scheme in haplotypes that were inferred to be introgressed in order to highlight these regions of the alignment: blue for the ancestral allele, and red for the derived allele.

Figure S17 :Figure S18 :Figure S19 :
Figure S17:The lengths of introgressed haplotypes predicted by IntroUNET at increasing distances away from the adaptive introgression even on chr3R.Introgressed haplotypes were defined as runs of consecutive SNPs classified as introgressed for a given individual.The sweep center was set to position 4624900, the center of the window with the lowest level of diversity in this region (data from[39]). ).

Table S1 :
Bootstrap parameter estimates for the D. simulans and D. sechellia joint demographic model obtained via ∂a∂i.The parameters of the model are the ancestral population size (N ref ), the final population sizes of D. sechellia and D. simulans (N 0−sech and N 0−sim ), the initial population sizes (N sech and N sim ), the population split time (t s ), and the backwards migration rates (m sim→sech and m sech→sim ).Note that parameter estimates are shown for each bootstrap replicate for which our optimization procedure succeeded (Methods), but only those with log-likelihood scores greater than −1750 were used to simulate training data.69