How array design creates SNP ascertainment bias

Single nucleotide polymorphisms (SNPs), genotyped with arrays, have become a widely used marker type in population genetic analyses over the last 10 years. However, compared to whole genome re-sequencing data, arrays are known to lack a substantial proportion of globally rare variants and tend to be biased towards variants present in populations involved in the development process of the respective array. This affects population genetic estimators and is known as SNP ascertainment bias. We investigated factors contributing to ascertainment bias in array development by redesigning the Axiom™ Genome-Wide Chicken Array in silico and evaluating changes in allele frequency spectra and heterozygosity estimates in a stepwise manner. A sequential reduction of rare alleles during the development process was shown. This was mainly caused by the identification of SNPs in a limited set of populations and a within-population selection of common SNPs when aiming for equidistant spacing. These effects were shown to be less severe with a larger discovery panel. Additionally, a generally massive overestimation of expected heterozygosity for the ascertained SNP sets was shown. This overestimation was 24% higher for populations involved in the discovery process than not involved populations in case of the original array. The same was observed after the SNP discovery step in the redesign. However, an unequal contribution of populations during the SNP selection can mask this effect but also adds uncertainty. Finally, we make suggestions for the design of specialized arrays for large scale projects where whole genome re-sequencing techniques are still too expensive.

Various SNP arrays exist for humans [14], plants [15,16] and all major livestock species [17][18][19][20][21][22][23]. SNP numbers within these arrays range from 10 k SNPs [20] [16,17,19,22] up to 600 k [15,21]. The design process of every array has an initial step of SNP discovery in common, where SNPs are identified from existing databases and/or from a small set of sequenced individuals. SNPs are then selected based on different quality criteria like minor allele frequency (MAF) thresholds and platform specific design scores [24]. Additional criteria like equidistant spacing over the genome [21], overrepresentation of some areas like chromosomal ends to increase imputation accuracy [20] or genic regions [21], or increased overrepresentation of high MAF SNPs [17] are applied dependent on the design intentions. In the end, draft arrays are validated either on the set of populations used for the SNP discovery itself [18] and/or on a broad set of individuals from different populations [21,24]. In contrast to whole genome re-sequencing (WGS) data, SNP arrays often show a clear underrepresentation of SNPs with extreme allele frequencies [25]. As population genetic statistics are mostly based on estimates of allele frequencies, this context leads to biased population genetic estimators [25,26] and is known as SNP ascertainment bias.
The absence of rare alleles is mainly driven by two factors in the array design process where SNPs are selected (ascertained) based on different requirements and decisions [27]. The first factor is a relatively small panel of individuals being used for discovery of SNPs, leading to a large proportion of globally rare variants not being selected, since they appear monomorphic in the discovery panel [26,28]. The second factor is the across population use of arrays. Arrays are developed based on the variation within the discovery panel, thus missing variation present in distantly related individuals or populations [25,27]. This second source of bias was shown to be of relatively high importance for livestock studies, where arrays are usually developed for large commercial breeds and later used to genotype diverse sets of local breeds all over the world [29,30].
Besides different strategies to minimize the impact of ascertainment bias [30,31], there are some attempts to correct the allele frequency spectrum via Bayesian methods [25,28,32]. However, those corrections highly rely on detailed statistical assumptions of the ascertainment process [33,34] or take a variety of ascertainment processes and demographic patterns into account to model evolutionary scenarios which are then compared to real world data [29,35]. However, those methods are currently only tested for corrections of the first source of ascertainment bias, the small discovery panel [25,28,32]. Additionally, detailed information on the design process is limited in practice [34] and the complexity of the processes makes statistical models for the corrections inaccurate.
Agricultural species such as chickens often show a complex domestication history, and therefore allow for few prior assumptions on ascertainment bias. Domestic chickens are assumed to originate from red jungle fowl (Gallus gallus) ancestors in Southeast Asia [36,37], represented by the five subspecies G. g. gallus, G. g. spadiceus, G. g. murghi, G. g. bankiva and G. g. jabouillei [38]. Additionally, some hybridization events with other Gallus species (e.g. grey jungle fowl; Gallus sonneratii) have been suggested [37,39]. The diversity of today's local breeds of chickens in Europe originates from chickens that reached the continent about 3000 years ago via a northern and a southern route, followed by selection and crossing with Asian chicken breeds introduced in the 19th century [38]. While commercial white layers were derived solely by intensive directional selection of a single breed, the White Leghorn, commercial brown layers are derived from a broader genetic basis (e.g. Rhode Island Red, New Hampshire, Barred Plymouth Rock). Commercial broilers are derived by cross-breeding of paternal lines (e.g. White Cornish) with maternal lines which descend from a comparable basis as brown layers (e.g. White Plymouth Rock) [40]. For more detailed information on chicken ancestry we refer to Lawal et al. [37] and for a comprehensive overview on diversity and population structure of domesticated chickens to Malomane et al. [13].
Given the complexity of modern array design processes and the chicken population structure, this study aims at highlighting the mechanisms which promote the bias by illustrating the effects of the different steps of the array design process on the allele frequency spectrum, using real data in a typical setting from livestock sciences. For this purpose, the design process of the Axiom™ Genome-Wide Chicken Array [21] was simulated in a set of diverse chicken WGS data. Allele frequency spectra as well as expected heterozygosity (H exp ) were compared to the WGS data and the SNPs of the Axiom™ Genome-Wide Chicken Array. Finally, some recommendations are made to design an array for monitoring genetic diversity.

Material and methods
Ethics approval and consent to participate DNA samples were taken from a data base established during the project AVIANDIV (EC Contract No. BIO4-CT98_0342; 1998-2000; https://aviandiv.fli.de/) and later extended by samples of the project SYNBREED (FKZ 0315528E; 2009-2014; www.synbreed.tum.de). Blood sampling was done in strict accordance to the German animal welfare regulations, with written consent of the animal owners and was approved by the at the according times ethics responsible persons of the Friedrich-Loeffler-Institut. According to German animal welfare regulations, notice was given to the responsible governmental institution, the Lower Saxony State Office for Consumer Protection and Food Safety (33.9-42502-05-10A064).

Populations and sequencing
The analysis is based on WGS data of a diverse set of 46 commercial, non-commercial and wild chicken populations, sampled within the framework of the projects AVIANDIV (https:// aviandiv.fli.de/) and SYNBREED (www.synbreed.tum.de). Commercial brown (BL) and white layer (WL) populations consist of 25 individually re-sequenced animals each, while the two commercial broiler lines (BR1 and BR2) include 20 individually sequenced animals each. For 41 populations, pooled DNA from 9-11 animals per population was sequenced, while Gallus varius (green jungle fowl; GV) samples of only two animals were sequenced as a pool. More detailed information about the samples can be found in S1 File and two previously published papers, from Malomane et al. [30] and Qanbari et al. [41]. Coverage was between 7X and 10X for the individual sequences, while DNA pools were sequenced with 15X to 70X coverage. Sequencing was conducted on Illumina HiSeq machines at the Helmholtz Zentrum, German Research Center for Environmental Health in Munich, Germany.

Raw data preparation and SNP calling
Sequences were aligned to the reference genome Gallus_gallus-5.0 [42,43] and the SNP calling was conducted according to GATK Best Practices guidelines [44,45]. BWA-MEM 0.7.12 [46] was used for the alignment step, duplicates were marked using Picard Tools 2.0.1 [47] Mark-DuplicatesWithMateCigar and base qualities were recalibrated with GATK 3.7 [48] BaseQuali-tyRecalibrator. The set of known SNPs, necessary for base quality score recalibration, was downloaded from ENSEMBL release 87 [49]. SNPs were called for all samples separately using the GATK 3.7 HaplotypeCaller and later on simultaneously genotyped across samples with GATK 3.7 GenotypeGVCFs. Due to computational limitations, the ploidy parameter of Haplo-typeCaller was set to two instead of the higher true ploidy of the pooled sequences. By this, slightly less rare alleles were called. However, effects of this limitation are negligible (S2 File; S1 Fig). Note that allele frequencies were estimated from the ratio of allelic depth by total depth. SNP filtering was conducted using GATK 3.7 VariantRecalibrator, which filtered the called SNPs by a machine learning approach (use of a Gaussian mixture model), which uses both a set of previous known (low confidence needed) and a set of highly reliable (assumed to be true) variants as training sources [50]. The source for known SNPs (prior 2) provided to Var-iantRecalibrator was again ENSEMBL (release 87) and the SNPs of the Axiom™ Genome-Wide Chicken Array were defined as true training set (prior 15). The algorithm was trained on the quality parameters DP, QD, FS, SOR, MQ and MQRankSum. Filters were set to recover 99% of the training SNPs in the filtered set, which resulted in a Transition/Transversion ratio of 2.52 for known SNPs, and a Transition/Transversion ratio of 2.26 for novel SNPs. Only biallelic autosomal SNPs were used in all further analyses.

Identification of the ancestral allele
Ancestral alleles were defined using allele frequency information from the three wild populations Gallus gallus gallus (GG), Gallus gallus spadiceus (GS) and Gallus varius (GV) by an approach comparable to Rocha et al. [51]. It was assumed that the Gallus gallus and Gallus varius species emerged from a common ancestor and Gallus gallus later split into Gallus gallus gallus and Gallus gallus spadiceus subspecies. Additionally, assuming neutral molecular evolution [52], the ancestral allele was most likely the major allele within those three populations, when weighting the allele frequency of Gallus varius twice. This procedure assigned the ancestral status to the reference allele for 86% of the SNPs and to the alternative allele for 14% of the SNPs. The change in the allele frequency spectrum was only relevant for the interval from 0.95-1.00, which was reduced by 111,851 SNPs (0.39% of all SNPs) when switching from alternative to derived allele frequency (S2 File; S2 Fig).

Reference sets
Three different reference sets were defined as follows: the unfiltered WGS SNPs (28.5 M SNPs), SNPs filtered using GATK 3.7 [48] VariantRecalibrator (20.9 M SNPs; filtered WGS) and array SNPs (540 k SNPs), which are the intersection of the unfiltered SNPs and the SNPs of the Axiom™ Genome-Wide Chicken Array. The separate use of unfiltered and filtered WGS SNPs was done to assess the effect of filtering (especially the use of an ascertained SNP set as the true set) on ascertainment bias.

Redesigning the SNP array
The process of redesigning the array in silico is briefly shown in Fig 1 and explained in more detail in the following. For the design process, the populations were divided into four groups:  For SNP discovery, firstly the four commercial lines (commercial white layers, WL; commercial brown layers, BL and the two commercial broiler lines, BR1 and BR2) were used. The set was then extended by additionally selecting those populations that were closest related to each of the commercial populations based on pairwise Nei's standard genetic distance [53]. As the two broiler populations were closest related (S3 Fig), the next two closest populations were chosen. This resulted in the inclusion of White Leghorn (LE), Rhode Island Red (RI), Marans (MR) and Rumpless Araucana (AR). Note that the commercial populations are closely related to the populations used as discovery populations for the development of the Axiom™ Genome-Wide Chicken Array [21] with exception of some inbred lines from the Roslin Institute in Edinburgh of which we do not know the genetic origin. The discovery set used for the original array [21] additionally consisted of more animals from multiple layer and broiler lines than ours. Further, the discovery set had to be split into broilers (BR1, BR2, MR, AR) and layers (WL, LE, BL, RI) for the equal spacing step. From the remaining populations, 19 were randomly chosen for validation of previously discovered SNPs (validation populations), 18 populations (which were not included in the array development) were used as a case study for an application of the array (application populations), and Gallus varius as a different species was defined as outgroup. The interested reader can find all underlying pairwise Nei's standard genetic distances [53] in S3 File and additionally pairwise F ST values [54] in S4 File.
Based on the unfiltered SNP set, the sampling of the SNPs for an approximately 600 k sized array was remodeled in silico in five consecutive steps according to the design process of the original array which was described by Kranis et al. [21], starting from the unfiltered SNP set:

Cluster removal ! 8.8 M SNPs
SNP clusters were defined as SNPs with less than 4 bp invariant sites at one side of a SNP and less than 10 bp invariant sites at the other side of the SNP within the discovery populations. Those SNPs were removed, which is justified rather technically to enable probe binding, but could also lead to an overrepresentation of conserved regions compared to highly variable regions of the genome.

Equal spacing ! 2.1 M SNPs
Reduction of SNPs to achieve approximately equidistant spacing between variable SNPs within discovery populations based on genetic distances. This algorithm was modeled according to Kranis et al. [21] and followed a two-step procedure. The first step was setting up an initial backbone of common SNPs (three sub-steps). It started with selecting SNPs which segregated in all discovery populations (MAF within each population > 0) while requiring a minimal distance of 2 kb, resulting in about 8 k SNPs. This was complemented by a backbone of SNPs which segregated in all layer populations and another one of SNPs which segregated in all broiler populations. Note that Kranis et al. [21] additionally constructed a backbone from a group of inbred lines for which no comparable samples were available for this study. In the second step, the algorithm iterated over all single populations and filled in potential gaps between backbone SNPs which are variable within the according population. This was done by choosing the SNPs closest to equidistant positions within the gap while aiming for a predefined local target density of 667 segregating SNPs/cM (linkage map taken from [56]). See S4 Fig for the detailed contribution of additional SNPs from each sub-step of the algorithm.

SNP validation ! 1.7 M SNPs
Removing SNPs (~20%) which were not variable in at least 8 of the 19 validation populations. This step would in reality be done by genotyping with preliminary test arrays and therefore allows the use of a broader set of populations than the discovery step.

Downsampling ! 580 k SNPs
Downsampling of SNPs comparable to step 3, but without adding the broiler/ layer specific backbones and instead keeping all exonic SNPs (annotation using Ensembl VEP 89.7; [57]). Additionally, the target density in broiler lines was set as three times the target density of the layer lines. The increased target density in broilers is intended to account for lower levels of linkage disequilibrium in these lines.

Variation of the design process
The whole design process was repeated 50 times with populations being randomly assigned to be discovery, validation or application populations, while the Gallus varius population was always kept as the outgroup. In this process, the number of populations per group was the same as in the previous scenario.
To assess the impact of the number of discovery populations on the design process, the number of discovery populations was varied in additional runs from 4 to 40 randomly chosen populations (while assigning the remaining populations, except Gallus varius, to validation and application groups of equal size) with 20 random replicates for each number of discovery populations. In a last scenario, equal spacing was varied with respect to the target density (33-3333 SNPs/cM) with 20 independent population groupings for each target density, with or without the initial backbone. As the number of SNPs from the backbone was constant, the increase of the target density led to a higher number of SNPs chosen by the algorithm due to the equal spacing itself and hence the relative influence of the fixed number of common backbone SNPs decreased.

Analyses of the results
Per-locus-allele frequencies for individually sequenced populations were estimated from genotypes, whereas the estimation for the sequenced DNA-pools was based on the allelic depth. Influences on the allele frequency spectra were examined by comparing density estimates of derived allele frequency spectra (unfolded frequency spectrum). Further H exp , the expected heterozygosity assuming Hardy Weinberg frequencies of the genotypes, for the different populations were used as summary statistics of the within population allele frequency spectra and calculated as in Eq (1), where p ref;l denotes the frequency of the reference allele at locus l and L the total number of loci.
Deviations in the estimation of H exp from the various SNP sets were quantified as differences between the H exp calculated from the respective SNP set and the H exp calculated from the filtered WGS SNPs relative to the H exp from the filtered WGS SNPs, further called overestimation of H exp (OHE; Eq (2)), which was calculated per population.
An OHE of zero means that the estimates are equal, while an OHE of one describes doubling of the unbiased estimate.
The effects of the population group assignments on the OHE of the random population assignments were evaluated by pairwise comparisons of least square means (LSMEANS; calculated with the R package emmeans [58,59] by using Tukey correction for multiple pairwise contrasts) of the population groups. An underlying mixed linear model for the estimation of LSMEANS was fitted using the R package lme4 [60] as shown in Eq (3), where the OHE depended on an overall mean μ, the fixed effect of the population group popG i (i can be discovery-, validation-, application-or outgroup), a random effect for the j th repetition of random population grouping (rep j eNð0; Is 2 rep Þ) and a random error e ijk eNð0; Is 2 e Þ. The procedure is comparable to simple pairwise comparisons of group means, the correction by the repetition only reduces the error variance and thus decreases the confidence intervals.

Numbers of SNPs
The SNP calling identified 28.5 M biallelic autosomal SNPs from which 20.9 M SNPs passed GATK's filtering procedure. 540 k SNPs from the unfiltered WGS SNP set are also mapped on the original Axiom™ Genome-Wide 580 k Chicken Array. The remodeling of the array according to the design process of the original array returned 10.9 M SNPs from the discovery step, which were reduced to approximately 580 k in steps as described. Numbers of identified SNPs for the additional runs differed depending on the populations and settings used and are listed in S1 Table. It has to be noted that the different sub-steps of the equal spacing algorithm con-

Underrepresentation of rare SNPs
A clear underrepresentation of rare SNPs in all ascertained SNP sets compared to WGS is evident from the allele frequency spectra (Fig 2). Major changes in the allele frequency spectra during the array development process were observed after the SNP discovery step and the equal spacing step. The SNP discovery led to an underrepresentation of rare SNPs compared to sequence data, which was intensified by the equal spacing step (Fig 2). The process finally resulted in a spectrum which was comparable to the spectrum of the original array, albeit slightly more right skewed. Randomly choosing populations as discovery populations confirmed the shape of the first remodeling, where the population groups were chosen according to the original array [21]. As major changes in the spectra mainly occurred after the SNP discovery and equal spacing, further results will concentrate on those steps. The allele frequency spectra (Fig 3) within discovery populations, compared to the spectra over all populations, clearly showed the cutoff from the MAF 0.05 filter. Furthermore, the allele frequency spectra of the discovery populations revealed a higher share of common SNPs than the overall spectra after equal spacing. In contrast, the spectra within validation-and application populations showed less pronounced peaks after the discovery step and the outgroup (Gallus varius) revealed fixation of most SNPs variable in the discovery populations.

Influence of number of discovery populations and target density on allele frequency spectra
Not surprisingly, an increased number of discovery populations resulted in a higher number of rare alleles after the discovery step, and thus an allele frequency spectrum with a more pronounced peak of rare alleles (Fig 4A). Apparently, the shift of the allele frequency spectrum after the equal spacing step was dependent on the number of discovery populations, as an increase in the number of discovery populations shifted the allele frequency spectra towards a higher proportion of alleles with a low derived allele frequency. With an increasing number of discovery populations, the shape of the allele frequency spectra got closer to the spectrum of the original array.
A very low target density, indicating that SNPs were mostly called due to being common backbone SNPs, resulted in an allele frequency spectrum with the majority of alleles having a MAF of around 0.5 (Fig 4B). Increasing the target density for the equal spacing and thus reducing the influence of the initial backbone of common SNPs shifted the peak of the allele frequency spectrum left towards a higher proportion of alleles with small derived allele frequencies. Using only the backbone SNPs common over all discovery populations and thus calling SNPs mostly by the equal spacing procedure resulted, independently from the target density, in a spectrum similar to the one obtained with a high target density with backbone ( Fig 4B). Fig 5 shows the H exp of different SNP sets by population. The H exp obtained from the filtered WGS SNPs were slightly higher than from the unfiltered WGS SNPs. H exp obtained from the ascertained SNP sets showed an even more pronounced overestimation together with an increase during the design steps. In general, the correlations between the H exp obtained in the different SNP sets were relatively high (� 0.95; S2 Table). Especially the H exp of the two WGS SNP sets showed a nearly perfect correlation of > 0.99, which led to an almost constant OHE of -0.23 (Table 1) for the unfiltered WGS SNPs. As already recognizable from the H exp themselves, the OHE was positive for all ascertained SNP sets (0.66-1.29), which at the same time showed a slightly reduced correlation to the filtered WGS SNP set (0.95-0.97). Comparable to the allele frequency spectra, the most pronounced increase of the OHE was caused by the SNP discovery and followed by the equal spacing step (OHE increased by 0.66), while the OHE  Table 1) laid in the range covered by the remodeling steps.

Overestimation of H exp
Averaging the OHE within the population groups revealed a 30% higher OHE of the discovery populations compared to validation and application populations after the discovery step. The equal spacing step reduced this difference to an only 1% larger OHE for discovery populations, while it came with a substantial increase of the variance of OHE, which was larger for the discovery populations than validation and application populations. The validation step then increased the OHE of the validation populations more than the OHE of discovery and application populations. This stronger OHE of discovery populations was also apparent within the array SNPs (24% higher). In contrast to the other populations, the outgroup showed an underestimation of the H exp , resulting in an OHE of < -0.84 for all ascertained SNP sets (Fig 5; Table 1).
A closer look on the contribution of the sub-steps during the equal spacing step revealed that 62% of the SNPs which were preserved during equal spacing were variable in all of the four closely related broiler populations (BR1, BR2, MR, AR; maximum pairwise Nei's distance of 0.06 and FST of 0.17 in the filtered SNP set), while only 3% of the SNPs were retained due to    Table 2) of the population groups revealed 24% larger OHE for discovery populations than for validation and application populations after discovery and cluster removal step, which was decreased to a numerically insignificant difference after the equal spacing step. Interestingly, and in contrast to the findings from the first remodeling, SNP validation led to a significantly higher OHE (5% larger) for application populations than discovery and validation populations. Fig 6A shows that increasing the number of discovery populations reduces the median OHE of discovery populations after SNP discovery while not affecting the OHE of validation and application populations. Equal spacing (Fig 6B)removed the average difference of OHE between the different population groups. Due to the limited number of populations in the complete set, the number of validation populations had to be reduced with more populations in the discovery set. This led to an increasing impact of individual validation populations on the ascertainment. The OHE of validation populations therefore increased with a high number of discovery populations (S9D Fig), comparable to the higher OHE of discovery populations for a small number of discovery populations. In our case, the biased array for validation populations was therefore obtained with a combination of 30 populations in the discovery set and 7 populations in the validation set. However, the least biased array for discovery and application populations was the array with the maximum number of discovery populations (40).

Influence of number of discovery populations and target density on H exp
In the equal spacing step, using only backbone SNPs resulted in a higher OHE for discovery than for non-discovery populations. Increasing the target density and thus increasing the proportion of SNPs due to the equal spacing part of the algorithm reduced the difference in OHE between the population groups ( Fig 7A). If the SNPs from the initial backbone were not used, no difference of OHE between discovery and non-discovery populations was present, regardless of the target density (Fig 7B).

Discussion
In this study we used a uniquely diverse collection of sequenced wild, commercial and noncommercial chicken populations, mainly based on samples of the Synbreed Chicken Diversity Panel [13]. Parts of our set were also involved in the development process of the Axiom™ Genome-Wide 580 k Chicken Array [21]. This offered an excellent possibility for assessing the impact of ascertainment bias on real data in a complex scenario. In general, results derived from this study should therefore be transferable to other species. However, domestic chickens show a rich history of hybridization and crossbreeding events [13]. The effects of using a discovery set closely related to the commercial populations and distributing the discovery set randomly across the spectrum of populations were therefore comparably small in this study. Special patterns of population structure e.g. the stronger differentiation in cattle due to the two subspecies Bos taurus and Bos indicus [61] accompanied by limiting the discovery set to one of the two clades, should increase the impact of population structure dependent ascertainment bias.

Potential impacts of the SNP calling pipeline
As the state of the art pipeline of GATK relies on a supervised machine learning approach for filtering the SNP calls, which needs a highly reliable set of known SNPs, we started with

PLOS ONE
examining potential impacts of the filtering procedure on ascertainment bias. The number of rare variants was slightly reduced by the filtering procedure and thus increased estimates of H exp were obtained in the filtered WGS set. As rare variants have a higher risk to be discarded as sequencing errors [62], this reduction is expected when applying quality filters. However, a clear assessment of correctly and falsely filtered variants is not possible here and one has to balance this tradeoff based on the study purpose. Another source of ascertainment bias could be the use of array SNPs as training set for GATK RecalibrateVariants, which potentially leads to discarding rare variants more likely if they are not present in the discovery populations of the used array. As the correlation between the H exp of the unfiltered and filtered WGS SNPs was nearly one, this source seems to be negligible and the use of array variants as a highly reliable training set seems to be unproblematic.
Due to computational limitations, we had to assume a ploidy of two for pools during the SNP calling process, which resulted in a minimal reduction of rare alleles. However, this effect was shown to have a very minor impact on the findings of this study (S2 File). Nevertheless, pooled sequencing itself can slightly bias allele frequency estimates compared to individual sequencing [63][64][65][66]. As all frequency estimates for single SNPs were taken from the same data source throughout the study, this does not affect our results. However, estimates for the magnitude of the ascertainment bias for single populations have to be understood rather relative to our gold standard than as absolute values.

General impact over all groups
The general reduction of rare alleles in array data compared to WGS data and the resulting overestimation of H exp supports findings of previous studies [25,26,30,34]. This reduction of rare alleles was mainly seen at steps where selection was explicitly biased towards high MAF alleles (MAF filter for quality control in discovery step and use of common alleles for the backbone in the equal spacing step) and/ or was applied to a small number of populations (small discovery set vs. small validation set). Thereby, the strongest shifts of the allele frequency spectra and increases of H exp are observed after SNP discovery and equal spacing. Both, cluster removal and second downsampling had almost no effect on the allele frequency spectra and H exp , while the validation step slightly decreased the share of rare SNPs.
The discovery step had the strongest impact on discovery populations, when a small set of discovery populations was used (Fig 6A). Similarly, the influence of the validation step on validation populations was strongest in case of a small number of validation populations (S9D Fig). A balancing of these two groups of samples is therefore necessary, if the number of available DNA samples for array development is limited. Instead of using separate populations for discovery and validation, we rather suggest to space the discovery set across all available populations and validate test arrays on additional samples of the same populations.
If the equal spacing step contains a preselection of SNPs based on their variability within population groups, the bias is stronger towards high MAF SNPs and thus yields a higher OHE. This effect was reduced by increasing the target density and thus selecting relatively more SNPs due to the equal spacing instead of common occurrence.

Differences between groups
If allele frequency spectra are changed in the same way for all populations and are therefore biasing heterozygosity estimates to the same extent, findings for between population comparisons will be little affected. Ascertainment bias then is only of importance if one compares populations based on different arrays, and corrections of the allele frequency spectrum as reviewed by Nielsen [25] should be possible. As correlations between H exp of ascertained SNP sets and unfiltered/ filtered WGS SNP sets were consistently high (> 0.94), arrays designed in the way as performed in this paper should mostly be suitable for robust and cost efficient analyses. Biasedness of estimates could be reduced even more by considering filter strategies according to Malomane et al. [30].
However, we could show that the bias acts with different extent on different population groups (population structure dependent bias) and therefore changes ranking of populations and can affect conclusions. This population structure dependent bias was already shown to have severe impact on findings from SNP arrays. For example, Bradbury et al. [67] found a demographic decline up to an approximately 30% lower H exp for Atlantic cod based on the distance to the sampling location of the discovery panel and McTravish and Hillis [29] showed strong deviations between simulated and observed polymorphisms for different combinations of migration and ascertainment scenarios on simulated cattle populations. In concordance with this, populations which are closely related to the discovery populations of the original array in our study on average showed a 24% higher OHE than validation and application populations for the original array.
This population structure dependent bias was mainly introduced by the initial discovery step. It was also observed in the random population groupings, but to a slightly different extent. The difference in overestimation decreased with an increase in the number of discovery populations (Fig 6) and was smallest if the discovery populations showed minimum distance to the application and validation populations (results not shown). Comparable observations were already made by Frascaroli et al. [68] which found very small ascertainment bias for European elite maize lines when using a SNP panel discovered in a combination of a maize diversity set and inbred lines, but strong ascertainment bias when using SNPs which were discovered in American elite lines. Therefore, we suggest to ideally choose an array where the discovery panel does span the scope of populations it will be applied to, and by this covers the existing variation in a most representative way, or to design such an array for oneself if it does not exist.
The equal spacing step lowered the difference in mean OHE between population groups in most of our remodeling scenarios, but obviously not in case of the original array. In the remodeling, we saw this difference only with a low target density and thus calling SNPs in the equal spacing step mainly due to being common over many populations (Fig 7A). However, the equal spacing step also increased the variance of OHE in the discovery panel, meaning that the OHE was increased more for some of the discovery populations than for others, thus causing more uncertainty for resulting effects. This effect is driven by the unequal contribution of variable SNPs to the chosen SNP set by the different populations during the equal spacing step (S3 Fig). The equal spacing step increases the OHE for some of the discovery populations, while it decreases it for others, and hence it does not remove the population structure dependent bias. This means that the knowledge of which discovery populations were used is not sufficient to draw conclusions regarding a possible ascertainment bias, since their relative contribution varies through the described pipeline.

Outgroup
Gallus varius as an outgroup showed a different behavior than all other populations. It already exhibited the lowest H exp in the unfiltered WGS SNP set, which was most likely driven by the small number of only two samples in the pool, and showed less upward bias of H exp in the filtered WGS SNP set than all other populations. The Gallus varius sequence reads on average showed weak Phred-scaled mapping quality scores of 19 (1.3% probability of misalignment), while the mean quality scores of the other populations ranged from 25 (0.3%) to 28 (0.1%).
Variation, only present in Gallus varius, will therefore be more likely missed due to misplacement of the reads or discarded as possible sequencing errors. Additionally, every ascertained SNP set showed an OHE for Gallus varius of < -0.84, as variation being present only in Gallus varius was not found in Gallus discovery panels and, vice versa, variants from Gallus were not variable in Gallus varius (Fig 3). This demonstrates that arrays should not be used if different species (even closely related ones) are included in the research project. Even sequence based estimates can be slightly biased, if the reference genome does not fit properly.

Potential impact on other breeding applications
In general, we cannot infer the impact on breeding applications which require phenotypic data (e.g. genomic selection [69] or genome wide association studies [70]) and/or individually sequenced or genotyped individuals (e.g. linkage disequilibrium decay [71] or runs of homozygosity analyses [72]) from this study. However, literature highlights the increased power of high MAF SNPs to capture/ detect effects which are caused by common variants due to stronger linkage disequilibrium and higher levels of variance explained. Therefore, increasing MAF in a first instance increases prediction accuracy when the number of SNPs is limited [73] and therefore some SNP ascertainment schemes intentionally bias the used SNPs towards high MAF within the desired populations [17]. The switch to WGS data, and therefore the additional inclusion of rare alleles, is then expected to increase the possibility of capturing the effects of rare alleles [73][74][75]. However, the increase in efficiency by higher numbers of SNPs levels off when going towards WGS data [76]. Nevertheless, we would expect negative impacts of ascertainment bias due to the across population use of the arrays. When biasing the genotyped variation towards the discovery population, the variability in populations, which are less related to the discovery populations, is less increased or even reduced, and arrays therefore become less valuable in non-target populations. Slight effects of this were demonstrated by simulation [73] and we can clearly support these findings by the levels of differences in the genotyped heterozygosity which we observed in this study. For the effect of ascertainment bias on a broader set of applications, we further refer the interested reader to studies which specifically address those issues (e.g. 25,30,31,35,71).

Further recommendations for future studies
We showed that existing arrays come with a large potential for ascertainment bias which is barely predictable due to a diverse set of promoting factors. Strongly decreasing costs for WGS and increasing availability of powerful computing resources therefore promote an intensified use of WGS for population genetic analyses, especially when diverse populations are included in the studies. However, costs and computational effort will still be substantial for large scale projects. Possible cost effective alternatives could be reduced library sequencing approaches like Genotyping-by-Sequencing [62,77], even though such methods introduce other problems related to the use of restriction enzymes which are reviewed by Andrews et al. [78].
For the purpose of monitoring genetic diversity in a large set of small non-commercial populations, the development of a specialized new array for cost effective high throughput genotyping could be still a good option. For the design of such an array, unbiasedness would thereby be represented by a random draw of the total variation within the target populations. As this is only a theoretical possibility, the practical solution closest to unbiasedness one can achieve would be a random draw form the SNPs present in the discovery set. It is thereby crucial to extend the discovery set in a way which represents the total variability over all populations as balanced as possible. The use of publicly available sequences can be helpful to reach this goal. The ascertainment of the SNPs should then be done preferably over a large set of highly diverse populations covering a wide spectrum of the diversity within a species available populations instead of biasing the process towards common alleles by performing within population ascertainment.
Supporting information S1 File. Abbreviations for breeds and accession numbers for the sequencing data. The tree was calculated from the filtered WGS SNPs. Populations defined as layers or broilers, which form in total the discovery set for the array design close to the original array, are highlighted. The plot was produced using the R package ape [55]. Note that the plot is only supposed to reveal close clustering chicken populations and cannot be interpreted in depth as chickens show a rich history of hybridization events. The interested reader can find all underlying pairwise Nei's standard genetic distances [53] in S3 File and additionally pairwise FST values [54]