Copyright © 2008, Cold Spring Harbor Laboratory Press

Population genetic analysis of shotgun assemblies of genomic sequences from multiple individuals

1 Departments of Integrative Biology and Statistics, University of California, Berkeley, California 94720, USA; 2 Wilhelm Johannsen Centre for Functional Genome Research, Department of Cellular and Molecular Medicine, University of Copenhagen, 2200 Copenhagen, Denmark; 3 Bioinformatics R&D, Applied Biosystems, Rockville, Maryland 20850, USA; 4 Computational Genetics, Applied Biosystems, Foster City, California 94404, USA; 5 Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA

6 Corresponding author. E-mail inesh/at/berkeley.edu; fax (510) 642-2740.

Received November 8, 2007; Accepted April 7, 2008.

Abstract

We introduce a simple, broadly applicable method for obtaining estimates of nucleotide diversity θ from genomic shotgun sequencing data. The method takes into account the special nature of these data: random sampling of genomic segments from one or more individuals and a relatively high error rate for individual reads. Applying this method to data from the Celera human genome sequencing and SNP discovery project, we obtain estimates of nucleotide diversity in windows spanning the human genome and show that the diversity to divergence ratio is reduced in regions of low recombination. Furthermore, we show that the elevated diversity in telomeric regions is mainly due to elevated mutation rates and not due to decreased levels of background selection. However, we find indications that telomeres as well as centromeres experience greater impact from natural selection than intrachromosomal regions. Finally, we identify a number of genomic regions with increased or reduced diversity compared with the local level of human–chimpanzee divergence and the local recombination rate.
The nature of population genetic data has changed dramatically over the past few years. For the past 15–20 yr the standard data were Sanger sequenced DNA from one or a few genes or genomic regions, microsatellite markers, AFLPs, or RFLPs. With the availability of new high-throughput genotyping and sequencing technologies, large genome-wide data sets are becoming increasingly available. The focus of this article is the analysis of tiled population genetic data, i.e., data obtained as many small reads of DNA sequences that align relatively sparsely to a reference genome sequence or in segmental assemblies. These data differ from classical sequence data in several ways. The main difference is that for each nucleotide position under scrutiny, a different set of chromosomes is sampled. While this problem is similar to the usual missing data problem in directly sequenced data, it is different for diploid organisms, because it is unknown how many chromosomes from an individual are represented in any segment of the assembly. This implies that for any particular segment of the alignment it is not known whether aligned sequence reads are drawn from one or both chromosomes. The main objective of this study is to develop and apply statistics for addressing these problems. We will primarily do this in the framework of composite likelihood estimators (CLEs). CLEs are becoming popular for dealing with large-scale data in population genetics. They form the basis for a number of recent methods for analyzing large-scale population genetic data, including methods for estimating changes in population size (e.g., Nielsen 2000; Wooding and Rogers 2002; Polanski and Kimmel 2003; Adams and Hudson 2004; Myers et al. 2005) and methods for quantifying recombination rates and identifying recombination hotspots (Hudson 2001; McVean 2002). 
A fundamental parameter of interest in population genetic analyses is θ = 4Neμ, where Ne is the effective population size and μ is the mutation rate per generation. There are several estimators of θ, including the commonly used estimator by Watterson (1975) based on the number of segregating sites. One reason for the interest in this parameter is that it is informative regarding both demographic processes (for review, see Donnelly and Tavare 1995) and natural selection (Hudson et al. 1987). For example, a reduction in θ in a region with normal or elevated between-species divergence suggests the action of recent natural selection acting in the region. Therefore, estimates of θ can be used to identify candidate regions of recent selection. In addition, the relationship between recombination rates and θ is highly informative regarding the relative importance of genetic drift and natural selection in shaping diversity in the genome. In Drosophila, it is well established that θ varies with the local recombination rate (Begun and Aquadro 1992). This has been interpreted as evidence for the action of selection in the genome. Both positive and negative selection can lead to a reduction in population genetic variability, and in both cases the effect is stronger in regions of low recombination. In flies, some recent evidence suggests that positive selection is the dominant force (Andolfatto and Przeworski 2001; Sawyer et al. 2003; Andolfatto 2005), and the results from several recent studies suggest that positive selection may also be common in the human genome (Voight et al. 2006; Tang et al. 2007; Williamson et al. 2007). However, there has been very little evidence for a strong correlation between θ and recombination rate in humans beyond what can be explained by possible mutagenic effects of recombination (Hellmann et al. 2003, 2005). 
There is no simple way of reconciling the lack of a correlation between diversity and recombination rate with claims of selection in the human genome.

While in the near future most tiled population genetic data will undoubtedly be generated by platforms such as 454 pyrosequencing (Roche) (Margulies et al. 2005), Illumina (formerly Solexa) (Bentley 2006), and SOLiD sequencing (ABI), once the sequences are assembled and single nucleotide polymorphisms (SNPs) are identified, the population genetic problems relating to the analysis of these data are the same as the ones arising when analyzing assemblies of reads obtained through traditional Sanger sequencing. We therefore illustrate the potential for population genetic analysis of this type of data on a classical assembly of Sanger sequencing reads in humans: the Celera Genomics human sequencing and SNP discovery data (Venter et al. 2001). Based on these data, we obtain unbiased estimates of θ in windows throughout the genome, and re-examine the relationship between human diversity and recombination. Finally, we identify regions with increased or reduced ratios of polymorphism to divergence, which can be seen as candidate regions for either balancing selection or selective sweeps, respectively. The aim of this study is therefore twofold: to illustrate how shotgun assembly data can be used for population genetic analysis and to illustrate this kind of population genetic analysis using data from the Celera shotgun assembly.

Results and Discussion

Composite likelihood estimation

The composite likelihood estimators (CLEs) are constructed by taking the product of individual likelihood functions and maximizing this product, even if these marginal likelihood functions are not independent. In the context of DNA data, this usually implies taking the product of the likelihood calculated in individual nucleotide sites (e.g., Nielsen 2000) or pairs of nucleotide sites when linkage disequilibrium is of interest (Hudson 2001).
Assuming data from one population, the likelihood function in a single site is given by p(X = x|γ), the probability of a nucleotide variant segregating at frequency x/n in the population, x = 1, 2, ..., n − 1, in a sample of n chromosomes, under a model parameterized by γ. The composite likelihood function for γ is then defined as (e.g., Nielsen 2000; Adams and Hudson 2004):
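The displayed expression is not preserved in this copy of the text. As a hedged reconstruction from the definition above (a product of single-site likelihoods), and assuming the hypothetical notation that j indexes the S variable sites and x_j is the allele count observed at site j, the composite likelihood can be written as:

```latex
\mathrm{CL}(\gamma) \;=\; \prod_{j=1}^{S} p\left(X_j = x_j \mid \gamma\right)
```

The indices j and S are introduced here for illustration only; the cited references may use different notation.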
The CLEs can be generalized to tiled population genetic data, by summing over all possible (unknown) chromosomal sample sizes in a segment. The marginal sampling distribution for a single SNP from a particular segment can then be calculated as
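The displayed expression is also missing here. A sketch of what summing over the unknown chromosomal sample size would look like, assuming the hypothetical notation Pr(n) for the probability that the reads covering the segment represent n distinct chromosomes:

```latex
p\left(X = x \mid \gamma\right) \;=\; \sum_{n} p\left(X = x \mid n, \gamma\right)\,\Pr(n)
```

Here the sum runs over all chromosomal sample sizes compatible with the reads in the segment. The symbol Pr(n) is an assumption of this sketch and is not taken from the original equation.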
The method can also be extended to data from multiple populations by considering the joint frequency spectrum from the populations. For example, for two populations, the data for a single SNP consist of the allelic counts, X1 and X2, in the two populations. The likelihood function in a single SNP is then p(X1 = x1, X2 = x2|γ), and everything follows as before. However, in this study we will treat the data as if they had been sampled from only one population. The estimator of θ we develop is a modification of Watterson's (1975) classical estimator applicable to the tiled shotgun sequencing data. It assumes an infinite sites model and a constant sequencing error rate. It can be derived as a composite likelihood estimator, but in the Methods section we provide a simpler derivation based on the method of moments. We assume that the alignment can be divided into v segments, where the v − 1 divisions between segments are chosen to fall at the points where a sequencing read starts or ends (Fig. 1).
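As a minimal sketch of such an error-corrected, Watterson-type moment estimator, the following Python function subtracts the expected number of false SNPs from the observed SNP count and divides by the expected "harmonic coverage" summed over segments. The function names, the tuple layout of `segments`, and the per-segment dictionary `n_dist` (probability of each possible number of chromosomes) are assumptions made for illustration, not the authors' implementation.

```python
def harmonic(n):
    """Watterson's constant a_n = sum_{i=1}^{n-1} 1/i (0 when n < 2)."""
    return sum(1.0 / i for i in range(1, n))


def theta_estimate(segments, n_snps, error_rate):
    """Moment estimate of theta per site for tiled data.

    segments   -- list of (length_bp, n_reads, n_dist), where n_dist maps a
                  possible number of sampled chromosomes to its probability
                  (hypothetical input format)
    n_snps     -- observed number of segregating sites over all segments
    error_rate -- assumed constant sequencing error rate per nucleotide
    """
    expected_errors = 0.0
    denominator = 0.0
    for length, n_reads, n_dist in segments:
        # Errors can only create an apparent SNP where at least two reads overlap.
        if n_reads > 1:
            expected_errors += error_rate * length * n_reads
        # Expected Watterson constant, averaged over the unknown sample size.
        denominator += length * sum(p * harmonic(n) for n, p in n_dist.items())
    if denominator == 0.0:
        raise ValueError("no segment has two or more chromosomes sampled")
    return (n_snps - expected_errors) / denominator
```

With a single segment, a known sample size, and an error rate of zero, the function reduces to Watterson's classical estimator, S / (L · a_n).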
The assumption of errors occurring at a constant and independent rate is not necessarily realistic for DNA sequence data, but deviations from this assumption may not affect the analysis much, as long as the analysis is done on a regional scale and read-by-read variance of the error rate averages out over larger regions. However, it is clearly desirable to develop more accurate error models for particular types of data. Such error models can be incorporated directly into the population genetic analysis by modifying the expression for the expected number of errors in the region.

Genome-wide estimates of nucleotide diversity in humans

For purposes of estimating θ, we use the original whole-genome shotgun sequences by Celera Genomics (Venter et al. 2001) and the associated SNP-discovery data (see Methods), which contain DNA from seven individuals. The SNPs, in conjunction with the mappings of the actual shotgun reads, allowed us to obtain genome-wide estimates of nucleotide diversity. We used a window of 100 kb, sliding it by steps of 20 kb. The sequence coverage within the windows averaged five reads per segment, which corresponds to approximately two chromosomes. In order to quantify the statistical uncertainty around our estimates, we conducted neutral coalescent simulations under realistic recombination and mutation rates (see Methods). The coefficient of variation in the estimate of θ for 100-kb windows for these simulated data ranged from 0.1 to 0.67 (Supplemental Fig. S1). This indicates that, although the average sample size in number of chromosomes is low, there is useful information regarding θ in 100-kb windows.
On average, we estimate θ to be 0.00163, a value somewhat higher than is generally cited; possible reasons for the difference are given below.

The effect of selection on θ in the human genome

Negative selection (e.g., background selection) and positive selection (e.g., selective sweeps due to hitch-hiking) reduce the average nucleotide diversity at linked neutral sites (Begun and Aquadro 1992; Charlesworth et al. 1993). The number of affected linked sites depends on the recombination rate per site per generation (ρ). Therefore, if either background selection (BS) or hitch-hiking (HH) is common, regions of low recombination are expected to have a lower diversity than regions with high recombination. Innan and Stephan (2003) suggested a simple method to distinguish between the two types of selection. This method is based on two simplified equations that describe the reduction in neutral diversity θ0 under a model of BS and HH. In the BS model
When we plot θ̂ against ρ, there is a clear reduction at very low recombination rates that fits rather well with both the BS- and HH-model (Fig. 2A).
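To make the comparison of the two models concrete, the sketch below fits both curves to binned (ρ, θ̂) values by least squares. The BS form θ0·exp(−u/ρ) follows the expression given in the Methods; the HH form θ0·ρ/(ρ + α) is an assumption (a commonly used hitch-hiking approximation), since the original equation is not preserved here. The Methods report a Nelder-Mead simplex fit; this sketch swaps in a simpler profile grid search, in which θ0 is solved in closed form for each trial value of u or α. All names and the example data are hypothetical.

```python
import math


def bs_shape(rho, u):
    """Background selection: fraction of neutral diversity retained, exp(-u/rho)."""
    return math.exp(-u / rho)


def hh_shape(rho, alpha):
    """Hitch-hiking (assumed form): fraction retained, rho / (rho + alpha)."""
    return rho / (rho + alpha)


def log_grid(lo, hi, n):
    """n log-spaced trial values between lo and hi (inclusive)."""
    return [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]


def fit_profile(rhos, thetas, shape, grid):
    """Least-squares fit of theta = theta0 * shape(rho, k).

    For each trial k the optimal theta0 is linear least squares, so only k
    needs to be searched. Returns (sse, theta0, k) for the best trial value.
    """
    best = None
    for k in grid:
        f = [shape(r, k) for r in rhos]
        denom = sum(v * v for v in f)
        if denom == 0.0:
            continue
        theta0 = sum(t * v for t, v in zip(thetas, f)) / denom
        sse = sum((t - theta0 * v) ** 2 for t, v in zip(thetas, f))
        if best is None or sse < best[0]:
            best = (sse, theta0, k)
    return best


# Hypothetical binned data generated from the BS form (arbitrary rho units).
rhos = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 4.0]
thetas = [0.0016 * bs_shape(r, 0.05) for r in rhos]
grid = log_grid(1e-4, 10.0, 2001)
bs_fit = fit_profile(rhos, thetas, bs_shape, grid)
hh_fit = fit_profile(rhos, thetas, hh_shape, grid)
```

Comparing `bs_fit[0]` with `hh_fit[0]` on resampled bins would mirror the bootstrap comparison of least-squares statistics described in the text, under the stated assumptions about the model forms.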
In 1000 bootstrap samples of the data (see Methods), the HH-model provided, in all but one case, a better fit than the BS-model. We also simulated data under a BS model given the estimated parameter values, and applied the same bootstrapping procedure to each of the simulated data sets (Fig. 2B).

Identifying outliers

In order to identify candidate regions for recent selective sweeps and balancing selection, we conducted coalescent simulations in a sliding window along the genome, taking the observed distribution of sequence reads, local mutation rate estimated from human–chimpanzee divergence (d), and local recombination rate into account (see Methods). Furthermore, we did half of the simulations under the best-fitting background selection model (see above). The expected value of θ, given these factors, is denoted by θE. In total, 324 and 80 regions in the genome had smaller and larger values of θ̂, respectively, than any of the 2000 simulations for the region. The 10 regions with the lowest values of θ̂/θE are given in Table 1 and the 10 regions with the highest values of θ̂/θE are summarized in Supplemental Table S2. Regions that have recently experienced a selective sweep should be marked by a low θ̂/θE. However, because θE is calculated based on d, an increased d could have a similar effect. Similarly, an increased θ̂/θE may be indicative of balancing selection (Kreitman and Hudson 1991), but might also be caused by misassemblies of the human shotgun reads, or by reduced levels of divergence. We compare our results with other genome-wide scans for selection in Table 2.
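The outlier criterion described above (an observed value smaller or larger than every simulated value for the region) can be sketched as an empirical tail comparison. The function names and return labels are hypothetical, and `simulated` is assumed to pool the values of θ̂ obtained from the neutral and background selection simulations for one region.

```python
def empirical_tails(observed, simulated):
    """Fractions of simulated values that are <= and >= the observed value."""
    n = len(simulated)
    below = sum(1 for s in simulated if s <= observed) / n
    above = sum(1 for s in simulated if s >= observed) / n
    return below, above


def classify_region(observed, simulated):
    """Label a region by comparing its estimate with all simulated estimates.

    'reduced'  -- observed is smaller than every simulated value
    'elevated' -- observed is larger than every simulated value
    'within'   -- observed falls inside the simulated range
    """
    below, above = empirical_tails(observed, simulated)
    if below == 0.0:
        return "reduced"
    if above == 0.0:
        return "elevated"
    return "within"
```

With 2000 simulations per region, a label of "reduced" or "elevated" corresponds to an empirical one-sided P-value below 1/2000, which is only as reliable as the simulation model itself.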
Regions with high θ̂/θE are candidates for balancing selection or they might contain more slightly deleterious variants, i.e., substitutions that can segregate within a population, but are unlikely to become fixed. In protein-coding regions, both possibilities result in an excess of nonsynonymous polymorphisms, as can be detected in a McDonald-Kreitman test. Indeed, if we match the high θ̂/θE regions to genes identified in Bustamante et al. (2005) to be under negative selection, we find them to be significantly enriched (Table 2). Furthermore, the HLA-cluster on chromosome 6 contains five regions with highly elevated θ̂/θE (Supplemental Fig. S4), of which one is the second highest overall (Supplemental Table S2). This is encouraging because the HLA-region has previously been shown to evolve under balancing selection (Klitz et al. 1986; Erlich and Gyllensten 1991; Begovich et al. 1992; Hughes et al. 1993). Furthermore, we also find that large clusters of olfactory receptors (as annotated in Aloni et al. 2006 and with more than three human genes) exhibit unusually high values of θ̂/θE (Table 2). The largest cluster, with 103 genes in humans, is located on chromosome 11. This region encompasses ~1 Mb and contains five θ̂/θE peaks (Supplemental Fig. S5). Another chromosome 11 olfactory receptor cluster has previously been shown to be under positive selection (Clark et al. 2003; Gilad et al. 2003; Nielsen et al. 2005). A further indication that high θ̂/θE regions may also have experienced selective sweeps is that they show a significant overlap with the regions of recently selected genes as identified by Tang et al. (2007).

A third possible explanation for elevated θ̂/θE is copy number polymorphism in which a copy is gained. Indeed, the region with the highest value of θ̂/θE is surrounded by common copy number polymorphisms (CNPs).
The actual peak in θ̂/θE does not lie within a known copy number gain, suggesting that the increased value of θ̂/θE has not been inflated by a gain in copy number in one or more individuals compared to the assembly. However, CNPs may affect alignments, thereby inflating θ̂.

Characterizing extreme θ̂/θE regions

We summarize the results for gene-specific analyses of regions with elevated or reduced values of θ̂/θE by dividing genes into different GO categories (Ashburner et al. 2000). RefSeq genes were associated with the nonoverlapping windows, and if a window contained multiple genes with the same GO category, this GO category was only counted once for this window. This prevents GO categories from becoming significant merely because of one cluster of genes. For example, unlike other studies, we do not find the GO categories related to olfaction to be significant, although individual OR clusters show clear signals. We find that all three ontologies (biological process, cellular component, and molecular function) show a significant enrichment of outlier regions in certain GO categories (Supplemental Table S3). We show a comprehensive summary of the results for what we consider the most informative category, biological process, in Supplemental Table S5.

Regions with reduced θ̂/θE contain an enrichment of categories traditionally associated with selective sweeps and positive selection (Chimpanzee Sequencing and Analysis Consortium 2005; Nielsen et al. 2005; Gibbs et al. 2007), including the following immune response related categories: regulation of B cell activation, B cell differentiation, and leukocyte chemotaxis (Supplemental Table S5). Other immune-related categories and categories involved in apoptosis are not among the most significant categories (Supplemental Table S4). One explanation for this is that these categories are also more likely to have experienced balancing selection, and hence also have an elevated θ̂/θE.
This is true for several apoptosis related groups (Supplemental Table S6).

The region in the genome with the lowest value of θ̂/θE is the cystatin cluster on chromosome 20 (Table 1). The cystatins in this cluster are potent inhibitors of cysteine proteases, especially cathepsin B. The cystatins in the middle of the θ̂/θE valley belong to the S-cystatins (Supplemental Fig. S6), which are abundant in saliva, but also occur in other body fluids such as tears. Presumably, they have a protective function. However, this does not appear to fully explain their abundance in saliva (Dickinson 2002).

The region with the second lowest value of θ̂/θE has only one gene in its proximity, the ephrin receptor A6 (Fig. 3).
We think that these two loci might be interesting cases of recent selection that are worthwhile to study in more depth.

Subcentromeric regions

Subcentromeric regions generally have very low recombination rates (Yu et al. 2001; Kong et al. 2002). Consequently, centromeres may have a reduced level of diversity under both a model of background selection and of selective sweeps. Additionally, centromeric repeats are thought to be prone to meiotic drive acting during female meiosis (for review, see Henikoff et al. 2001). Subcentromeric regions may, therefore, be affected by selective sweeps more frequently than other genomic regions. A previous scan for selective sweeps in the human genome (Williamson et al. 2007) found increased evidence for selective sweeps around centromeric regions. We define windows as centromeric if they overlap with the chromosomal band labeled as centromeric in the UCSC genome browser (http://genome.ucsc.edu). As expected, these windows are associated with extremely low recombination rates (Fig. 4A). θ̂ appears to be reduced, while human–chimpanzee divergence is increased relative to intrachromosomal regions (Fig. 4B,C). [...] θ̂ in subcentromeric regions is still significantly lower than expected (MWU-test: n = 159, P = 1.5 × 10^−6). This result strongly suggests that selective sweeps are, in fact, more common in centromeric regions.
Such a picture could also emerge if the number of chromosomes (n) was overestimated due to assembly errors. As the number of reads (m) in the alignment is indeed elevated for the centromeres (Fig. 4C) [...] θ̂ still remains significantly lower than expected (MWU-test: n = 112, P = 0.009). Therefore, we believe that the reduction in centromeric diversity is indeed due to positive selection, consistent with the theory of a centromeric meiotic drive.

Subtelomeric regions

One of the unexplained results from the analysis of the chimpanzee genome was the observation of increased divergence in subtelomeric regions (Fig. 4C). As was the case for subcentromeric regions, subtelomeric regions also show a decrease in θ̂/θE (MWU-test: n = 1175, P < 10^−15), possibly suggesting increased levels of selective sweeps in telomeric regions. Unlike centromeres, telomeric repeats show no evidence of meiotic drive. However, telomeres are enriched for segmental duplications (for review, see Cheng et al. 2005; Riethman et al. 2005), and hence, one could speculate that subtelomeric regions are hubs for neofunctionalization of duplicate genes and, therefore, that more variants are fixed due to positive selection.

Discussion

Our average estimate, θ̂ = 0.00163, is higher than in other studies. Halushka et al. (1999) reported estimates of θ based on numbers of segregating sites of 0.0015 for silent substitutions and 0.00105 for introns. The estimate from the resequenced data in the Seattle SNP database (http://pga.gs.washington.edu/summary_stats.html; Akey et al. 2004), based on the average number of pairwise differences, is 0.00085. One explanation for the difference between the results by Akey et al. (2004) and our results is that the Seattle SNPs data set only includes genic regions. However, our estimate is also higher than the estimate from 50 intergenic regions (Voight et al. 2005).
A slight difference in the estimates is expected, because our estimator based on the number of segregating sites will give higher estimates than the estimator based on pairwise differences in the presence of a negative Tajima's D value, as observed for the Seattle data and at least one population of the Voight data. Further, Voight et al. (2005) report diversity only for the populations separately, not the overall diversity. Finally, as pointed out in Johnson and Slatkin (2006), estimates of θ may be biased when quality values have been used to call SNPs. As our data have been subject to an initial quality screening, it is likely that our data are also affected by this bias. However, in the absence of regional differences in the use of protocols to call SNPs, none of our conclusions should be affected by this bias.

Furthermore, our estimate of diversity was made without correction for demographic influences. In part, this is because the sample for the Celera Genomics study included an overdispersed sampling of humans from the major geographic groups. Therefore, we want to stress that the P-values and confidence intervals that we obtain through the simulations are only exact if the assumptions of our simulations are correct. If we overestimate, or the individuals that we analyzed did not come from a Wright-Fisher population but from a population with a more complex demography, we may, for some demographic models, underestimate the variance of θ̂.

We examined the overlap between regions with low values of θ̂/θE and regions identified in other genome-wide screens for positive selection. Voight et al. (2006) conducted a genome-wide screen for selective sweeps based on haplotype structure. This method has maximal power to detect ongoing selective sweeps, while the reduction in diversity that we measure is strongest after a sweep has just finished. Therefore, the lack of overlap in candidate regions identified by the two studies is not surprising (Table 2).
Next, we looked for overlap between our data and the data by Williamson et al. (2007). Their test statistic is based on the frequency spectrum at variable sites and, hence, their power is also best for finished sweeps. However, since the statistic only looks at variable sites, the power of this test for regions of very low diversity will also be lower than in our test. This might explain why we find so little overlap in candidate regions. Another possible cause is that our confidence intervals are widest for regions of low recombination. Accordingly, the average recombination rate of our candidate regions is 1.87 cM/Mb, while the two LD studies show an opposite trend (median recombination rates: 0.76 and 1.2 cM/Mb). On the other hand, there is good overlap between our candidate regions and regions identified in a recent study contrasting patterns of LD among different populations to detect nearly complete selective sweeps (Tang et al. 2007).

Further, we find a significant overlap with the study by Bustamante et al. (2005) and the clusters of positively selected genes as identified by the Chimpanzee Sequencing and Analysis Consortium (2005) (Table 2). Both studies use the ratio of nonsynonymous to synonymous mutations in human–chimpanzee alignments to detect selection within protein-coding genes. Bustamante et al. additionally used ratios of nonsynonymous to synonymous human diversity in an extension of the McDonald-Kreitman test (McDonald and Kreitman 1991). If no selection were acting in the genome, we would not expect a correlation between diversity to divergence ratios and ratios of nonsynonymous to synonymous mutations. The strong correspondence between the studies, therefore, helps solidify the argument that extreme values of θ̂/θE do, in fact, provide evidence for positive selection.

A number of factors can affect the analyses presented in this study.
The methods used for identifying SNPs and accommodating sequencing, assembly, and alignment errors may affect local estimates of genetic diversity. Future studies may incorporate more specific and detailed modeling of errors based on experimental evidence and genotype confidence scores. We also note that we used a very simple standard population genetic model, assuming constant population sizes and no population structure. Finally, the sample size is very low (seven individuals from diverse racial groups) with one individual contributing a large proportion of the reads analyzed. However, the main conclusions of the study stand and are unlikely to be influenced by this: (1) tiled population genetic data can easily be dealt with in population genetic analyses using appropriately modified composite likelihood methods, (2) the human genome shows a reduction in variability in regions of low recombination that cannot be explained by possible mutagenic effects of recombination, (3) both telomeres and centromeres show a decrease in the levels of diversity compared with the expectation given between-species divergence and recombination rate, and (4) outlier analyses of variability identify a number of candidate genes for both balancing and directional selection, including HLA and olfactory receptor clusters. However, our ability to reliably identify outlier regions may be challenged if there is undetected regional variation in the error rate. While even this relatively simplistic approach to demography and sequence errors will see immediate application, an exciting challenge in future studies is to incorporate inference procedures for more elaborate and realistic models. This can be achieved using the composite likelihood framework outlined in this study.

Methods

SNPs from shotgun data

This procedure was described in dbSNP under http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?method_id=2929.
Briefly, potential nucleotide variants from the WGA2 alignment were identified using only the Celera reads. Potential nucleotide variants need to pass the sequence quality value (QV), neighbor quality value (NQV), and the heterozygosity check. The default QV value is >23 for the polymorphic base and >21 for the minimal neighbor QV (4 bp). For deeply covered minor alleles, the QV threshold is adjusted lower. Every supported minor allele decreases the threshold, but the minimal QV cutoff does not fall below 16. During the heterozygosity check, sequences containing more than two alleles per individual were filtered out. The locations of all reads were mapped relative to the assembly. We then divided the shotgun assembly and associated SNPs into segments according to shotgun read ends (Fig. 1).

Error rate λ

Based on Altshuler et al. (2000), we assumed that the error rate λ = 1/35,000, which most likely corresponds to an upper bound. If we assume that the error rate is approximately constant, the actual value of λ influences θ̂ linearly. Further, the number of expected errors is on average 20-fold lower than our SNP counts; hence, the relative magnitude of θ̂ should be robust with respect to assumptions about λ.

Human–chimpanzee divergence

We downloaded the axt alignments between the human genome hg16 and the chimpanzee genome panTro2 (http://hgdownload.cse.ucsc.edu/goldenPath/hg16/vsPanTro1/axtRBestNet/). For each window, we counted the number of bases that differed and divided it by the number of bases that could be compared.

Recombination rates

We downloaded genetic and physical distances from http://www.stats.ox.ac.uk/mathgen/Recombination.html and calculated the recombination rates as slopes from regressing genetic on physical distance for windows of 500 kb centered on a given 100-kb window. Here, estimates of recombination rate are based on patterns of LD; therefore, there may be concerns that the estimates of recombination rates depend on SNP density.
However, SNP density mainly influences estimates of variation in recombination rate (e.g., inferences of recombination hotspots) and not so much the average rate over large distances as examined here. Also, the difference in recombination rate estimates between the pedigree map (Kong et al. 2002) and the LD map does not vary systematically with θ̂ (Supplemental Fig. S2). Hence, it seems an acceptable approximation to use the LD map for this purpose.

Estimating θ

The estimator of θ we develop is a modification of Watterson's (1975) classical estimator applicable to the tiled shotgun sequencing data. It can be derived as a composite likelihood estimator, but we will provide the simpler derivation using a method of moments estimator. To this end, we divide the genome into v segments, where the v − 1 divisions between segments are chosen to fall at the points where a fragment starts or ends (Fig. 1).
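The segmentation step can be sketched as follows, assuming reads are given as half-open (start, end) intervals on the assembly. The function name and the input format are hypothetical; the sketch simply cuts the alignment at every read start or end and counts the reads spanning each resulting segment.

```python
def split_into_segments(reads):
    """Cut an alignment at every read start/end and count reads per segment.

    reads -- list of (start, end) half-open intervals in assembly coordinates
    Returns a list of (segment_start, segment_end, n_reads) for every segment
    covered by at least one read, in coordinate order.
    """
    # Every read boundary is a potential division between segments.
    breakpoints = sorted({pos for start, end in reads for pos in (start, end)})
    segments = []
    for left, right in zip(breakpoints, breakpoints[1:]):
        # A read covers the segment if it spans it completely; because the
        # segment lies between adjacent breakpoints, partial overlap cannot occur.
        depth = sum(1 for start, end in reads if start <= left and right <= end)
        if depth > 0:
            segments.append((left, right, depth))
    return segments
```

Within each segment the set of reads is constant, which is what allows the number of reads m, and the distribution of the number of chromosomes n, to be treated as fixed per segment. The quadratic scan is kept for clarity; a sweep over sorted boundaries would be preferable for genome-scale input.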
For n chromosomes, in the absence of sequencing errors and assuming an infinitely many sites model, the expected number of segregating sites in a segment of length L is (Watterson 1975) Lθ ∑_{i=1}^{n−1} (1/i). The expected number in a tiled alignment is then obtained similarly by summing over all possible values of n. The expected number of false SNPs introduced by sequencing errors is LλmI(m > 1) + O(λ^2) as λ → 0, where λ is the sequencing error rate per nucleotide, m is the total number of reads in the segment, and I(·) is an indicator function. Assuming that the error rate is low enough that two sequencing errors in the same site can be ignored, the expected number of segregating sites for tiled data in an alignment segment is:
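The displayed equations are missing from this copy of the text. The following is a hedged reconstruction assembled term by term from the quantities defined above, not a verbatim copy of the original. It assumes the hypothetical notation P_r(n) for the probability that segment r contains n distinct chromosomes, S_r for the number of segregating sites in segment r, and S for the total over all v segments.

```latex
% Expected number of segregating sites in segment r (cf. Equation 2)
E[S_r] \;=\; L_r\left[\theta \sum_{n} P_r(n) \sum_{i=1}^{n-1} \frac{1}{i}
        \;+\; \lambda\, m_r\, I(m_r > 1)\right]

% Rearranged and summed over the v segments (cf. Equation 3)
\hat{\theta} \;=\;
  \frac{S \;-\; \lambda \sum_{r=1}^{v} L_r\, m_r\, I(m_r > 1)}
       {\sum_{r=1}^{v} L_r \sum_{n} P_r(n) \sum_{i=1}^{n-1} \frac{1}{i}}
```

When every segment has the same known n and λ = 0, the second expression reduces to Watterson's estimator, S divided by the total length times ∑ 1/i.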
An estimator of θ similar to the Watterson (1975) estimator can be obtained by rearranging (Equation 2) and summing over alignment segments (Equation 3).

Confidence intervals for θ̂

The variance of θ̂ cannot be obtained analytically in the presence of recombination, even for complete data that have not been obtained by shotgun sequencing. Only in the presence of no recombination or full recombination, i.e., no linkage disequilibrium among SNPs, can formulas for the variance be obtained. Such formulas can also be derived for tiled shotgun data, but are not of much practical use, as linkage disequilibrium is indeed widely observed in most data. Our approach for the estimation of confidence intervals is, therefore, based on simulating data, taking into account variation in local recombination rates, sequencing errors, and the number of reads sampled for any particular genomic segment. To this end, we used the program ms to do coalescent simulations (Hudson 2002) under the standard neutral model and a simple background selection model, which will be described below. After we obtain the sample sequences from the coalescent simulations, we sample "reads" from those sequences in exactly the same way as they were obtained from the Celera data, and then calculate θ̂ as described.

Parameter estimation under selection models

In order to estimate the selection parameters α and u, we need to correct for variation in θ0 due to mutation rate variation. To this end, we scaled θ0 using a scalar ci for window i based on estimates of chimpanzee–human divergence as proxies for mutation rate variation. Estimates of ρ for each window were calculated from the Myers-map (Myers et al. 2005). Because these simple models assume that the strength and frequency of positive and negative selection are constant across the genome, we decided to reduce the noise by binning the data according to recombination rates. We sorted nonoverlapping windows according to their recombination rate into bins of 100.
Then, we fitted the models of selection described in Equations 4 and 5 to the binned data, thus obtaining estimates of $\theta_0$, α, and u. The model was fitted using the Nelder-Mead simplex algorithm as implemented in the GNU Scientific Library (http://www.gnu.org/software/gsl/), using least squares as the fit statistic. We also attempted to fit the model to unbinned data; however, the model fit, as assessed by a simple sum of squares, was always inferior to that obtained with the binned data. We also tried a more complicated model of background selection that could accommodate variation in recombination rates across the flanking regions of each window. Again, as with the unbinned data, the more complicated model did not fit the data better than the simple model. In order to compare the fit of the BS and HH models, we generated 1000 bootstrap samples over bins and counted how often the least squares statistic was better for the BS or the HH model.

Simulations under the background selection and a neutral model

We used the program ms for all coalescent simulations (Hudson 2002). For coalescent simulations under a background selection model, we reduced the effective population size by a factor of $e^{-u/\rho}$, where u is the deleterious mutation rate per generation per base pair and ρ is the number of crossovers per generation per base pair. For each window i, we simulated with $\theta_E = \theta_0 e^{-u/\rho_i} c_i$, where $\theta_0$ is the average diversity and $c_i$ is a scalar allowing variation in the mutation rate, calculated as $c_i = d_i/d$, where $d_i$ is the human–chimpanzee divergence of window i and d is the mean divergence. We simulated 14 chromosomes. We then subsampled segments from these chromosomes, corresponding to the segments obtained from the Celera shotgun reads. The probability that both chromosomes from an individual were sampled was taken as $p = 1 - 0.5^{x_{rj}-1}$, where $x_{rj}$ is the number of reads from individual j in segment r.
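The two quantities that drive these simulations can be written out directly. The sketch below only evaluates the formulas given in the text; the function and argument names are illustrative assumptions, and the coalescent simulation itself (done with ms in the study) is not reproduced.

```python
import math


def theta_background_selection(theta0, u, rho_i, d_i, d_mean):
    """Expected diversity theta_E for window i under background selection.

    theta_E = theta0 * exp(-u / rho_i) * c_i, with c_i = d_i / d_mean.
    """
    c_i = d_i / d_mean  # mutation-rate scalar from human-chimpanzee divergence
    return theta0 * math.exp(-u / rho_i) * c_i


def prob_both_chromosomes(x_rj):
    """Probability that both chromosomes of a diploid individual are sampled.

    With x_rj reads drawn independently from either chromosome with
    probability 0.5 each, all reads come from a single chromosome with
    probability 2 * 0.5**x_rj, so p = 1 - 0.5**(x_rj - 1).
    """
    if x_rj < 1:
        raise ValueError("x_rj must be at least 1")
    return 1.0 - 0.5 ** (x_rj - 1)
```

For example, a single read can never represent both chromosomes (p = 0), two reads do so half of the time (p = 0.5), and three reads do so with p = 0.75.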
To simulate sequencing errors, we added $S_e$ errors drawn from a Poisson distribution with mean $\lambda \sum_r L_r m_r I(m_r > 1)$, where $L_r$ is the length of segment r in bp, $m_r$ is the number of reads in segment r, and I is an indicator function.

Outlier analysis

In order to identify candidate regions for recent selective sweeps and balancing selection, we conducted 200 coalescent simulations (100 under the standard neutral model and 100 under a BS model) for each 100-kb window, sliding by 20 kb, taking the observed distribution of sequence reads in each window into account and using a window-specific value of θ to drive the simulations. We identified all windows with observed values of $\hat{\theta}$ falling outside the distributions of $\hat{\theta}$ in the simulated data under both neutrality and background selection. Those windows were merged with all adjacent windows where $\hat{\theta}$ fell among the 5% most extreme values on either side in the simulated data. For the resulting 1046 regions with elevated values and 589 regions with reduced values of $\hat{\theta}$, we conducted another 2000 simulations to get a more precise estimate of the P-value (Supplemental Table S1).

Gene Ontology analysis

All LocusLink genes overlapping with the 1046 high or 589 low regions, as well as all nonoverlapping 100-kb windows outside of these regions, were identified. The LocusLink identifiers were then used in BioMart (Kasprzyk et al. 2004) to associate them with Gene Ontology (GO) groups (Ashburner et al. 2000). We only took reviewed annotations into account, i.e., we disregarded annotations with evidence codes IEA and ND. The Gene Ontology version from May 2007 was used. For each region/window only a nonredundant set of GO identifiers was kept. To identify GO groups with over-representation of either high $\hat{\theta}/\theta_E$ or low $\hat{\theta}/\theta_E$ regions, we used the program FUNC (Prüfer et al. 2007). More specifically, we used the hypergeometric test, requiring a minimum of 10 windows associated with a given node.
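The hypergeometric over-representation test for a single GO node can be sketched as below. This is a generic illustration of the test, not FUNC's implementation: the function names, the argument layout, and the choice to return `None` for nodes below the minimum size are assumptions made here.

```python
from math import comb


def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N items, K marked, n drawn).

    N -- total number of windows/regions considered
    K -- windows annotated to the GO node
    n -- windows in the outlier set (e.g., high-diversity regions)
    k -- outlier windows annotated to the GO node
    """
    total = comb(N, n)
    upper = min(K, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(max(k, 0), upper + 1))
    return hits / total


def go_node_pvalue(N, K, n, k, min_windows=10):
    """Over-representation P-value, skipping nodes with too few windows."""
    if K < min_windows:
        return None  # node not tested (assumed handling of the size cutoff)
    return hypergeom_upper_tail(N, K, n, k)
```

A small P-value indicates that the outlier set contains more windows annotated to the node than expected if outlier windows were drawn at random from all windows.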
After obtaining the general statistics for each ontology, we made use of the refinement option in FUNC that keeps only the most specific significant categories; higher categories that are significant solely because of genes from a significant subordinate category are removed.

Acknowledgments

We thank M. Przeworski and an anonymous reviewer for helpful comments. This work was supported by Danish FNU and Danmarks Grundforskningsfond. I.H. was supported by a HFSP long-term fellowship.

Footnotes

[Supplemental material is available online at www.genome.org.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.074187.107.