Copyright © 2005, Cold Spring Harbor Laboratory Press
Genome-wide regulatory complexity in yeast promoters: Separation of functionally conserved and neutral sequence
Department of Biochemistry and Biophysics, University of California, San Francisco, California 94143, USA
1These authors contributed equally to this work. 2Corresponding author. E-mail haoli/at/genome.ucsf.edu; fax (415) 514-2617.
Received September 8, 2004; Accepted November 23, 2004.
Abstract
To gauge the complexity of gene regulation in yeast, it is essential to know how much promoter sequence is functional. Conservation across species can be a sensitive means of detecting functional sequences, provided that the significance of conservation can be accurately calibrated with the local neutral mutation rate. By analyzing yeast coding and promoter sequences, we find that neutral mutation rates in yeast are uniform genome-wide, in contrast to mammals, where neutral mutation rates vary along chromosomes. We develop an approach that uses this uniform rate to estimate the amount of promoter sequence under purifying selection. This amount is ~30%, corresponding to roughly 90 bp for a typical promoter. Furthermore, using a hidden Markov model, we are able to separate each promoter into distinct high and low conservation regions. Known regulatory motifs are strongly biased toward high conservation regions, while low conservation regions have mutation rates similar to that of the neutral background. Certain Gene Ontology groupings of genes (e.g., Carbohydrate Metabolism) have large amounts of high conservation sequence, suggesting complexity in their transcriptional regulation. Others (e.g., RNA Processing) have little high conservation sequence and are likely to be simply regulated. The separation of functionally conserved sequence from the neutral background allows us to estimate the complexity of cis-regulation on a genomic scale.
The regulation of gene expression is a universal, yet complex, process in biological systems. In the model organism Saccharomyces cerevisiae, several hundred transcription factors are thought to be involved in the regulation of ~6000 genes (Gene Ontology Consortium 2000; Dolinski et al. 2004). Much of this regulation is mediated by transcription-factor-binding sites in promoter sequences, making knowledge of these sites essential for understanding the logic of cis-regulation. One promising approach for detecting binding sites is phylogenetic footprinting, the identification of selectively constrained elements by their conservation across species. For example, the genome sequences of S. cerevisiae and several of its close relatives have been used to predict motifs likely to describe transcription-factor-binding sites (Chiang et al. 2003; Cliften et al. 2003; Kellis et al. 2003; Pritsker et al. 2004; Siddharthan et al. 2004; Tanay et al. 2004). Although a number of transcription factors and binding motifs have been studied in detail, some basic parameters of transcriptional regulation are not known. One important feature that has not been characterized is the amount of functional sequence under purifying selection in yeast promoters. This is crucial for assessing how many transcription factors bind a typical promoter, how prevalent combinatorial control is, and whether regulation is more complex for particular gene families. Inspection of aligned yeast promoters shows rich structure in conservation patterns. There are blocks of highly conserved sequence, as well as blocks with lower conservation rates. The extent of conservation also varies from promoter to promoter. But do these variations in conservation reflect differences in the complexity of cis-regulation, or simply differences in local neutral mutation processes? Conservation of a sequence does not necessarily imply functionality. Conservation may also be neutral, because of lack of divergence time. 
Variations in conservation could be explained by regional biases in mutation rates. In mammals, for example, mutation rates are known to vary in blocks several megabases long (Hardison et al. 2003; Chuang and Li 2004), but it is not known whether such regional effects are present in yeast. To determine the amount of functionally conserved sequence in yeast promoters, it is necessary to measure the neutral conservation rates along the genome, as these rates provide the calibration for significance. In the first section of this paper we determine the local neutral mutation rates by measuring the degree of sequence conservation across the genome, using data from silent site positions in aligned yeast coding sequences (Kellis et al. 2003) from S. cerevisiae, Saccharomyces paradoxus, Saccharomyces bayanus, and Saccharomyces mikatae. Our results indicate that, unlike in mammals, the neutral mutation rate is uniform across the S. cerevisiae genome. We are able to distinguish the small set of genes that deviate from neutral expectations because of codon usage selection. With knowledge of the neutral mutation rate, it becomes possible to determine what parts of yeast promoters evolve neutrally. In the second section, we show that yeast promoters can be separated into neutral and functionally conserved regions. Using a hidden Markov model (HMM), we are able to distinguish regions of high and low sequence conservation. We find that the conservation rates in the low conservation regions are similar to the neutral mutation rate determined from the silent sites. The highly conserved regions, on the other hand, contain an over-abundance of known transcription-factor-binding sites. We next estimate the total amount of promoter sequence under selection in all S. cerevisiae promoters. Through an analysis of the frequencies of conserved blocks of different lengths, we find that ~30% of sites in the promoters are under selection. 
This result is robust over several different species comparisons. Finally, we analyze the length of sequence in high conservation regions for each promoter, as this provides a rough measure of how much regulation acts on each gene. We perform a functional analysis of the types of genes having long lengths of high conservation regions, finding several Gene Ontology categories with unusual conservation levels. Results Neutral mutation rates are uniform genome-wide An understanding of neutral mutation rates is important for calibrating the functional significance of sequence conservation between yeast species. To determine neutral rates of conservation, we measured rates of conservation in genes shared among the species S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus, using data from fourfold degenerate base positions (see Methods). It might be argued that fourfold sites are not appropriate for measuring neutral rates, because codon usage selection affects silent sites in some S. cerevisiae genes. However, as we show below, it is possible to distinguish selective effects from neutral ones by analyzing the distribution of conservation rates and codon usage bias. Regional biases can have a strong effect on mutation rates, as has been observed in mammals (Hardison et al. 2003; Chuang and Li 2004). Therefore, proper accounting for such biases could potentially be crucial for calibration of phylogenetic footprinting. We tested for regional biases in the yeast genome by first comparing S. cerevisiae to its closest sequenced relative, S. paradoxus, which diverged from it ~5 million years ago (Myr) (Kellis et al. 2003). We used this closest relative to minimize the possible effects of chromosomal rearrangements, which could obscure regional biases when species are further diverged. For each gene, we measured the conservation rate as the fraction of shared fourfold sites that are identical in S. cerevisiae and S. paradoxus. 
Genome-wide, the mode conservation rate at fourfold sites was 0.74. To properly account for finite-length effects, we mapped each rate to a normalized value r, by subtracting the mode conservation rate in the genome and dividing by the standard deviation predicted by an independent sites model (Chuang and Li 2004) (see Methods). We used these normalized rates to test for regional correlations, through an analysis of the autocorrelation function ⟨r(0)r(x)⟩, where r(0) is the normalized conservation rate of a gene, r(x) is the normalized conservation rate of a gene x base pairs downstream from the first gene, and ⟨...⟩ indicates an average over all gene pairs separated by a distance x. Distances were measured along the S. cerevisiae coordinate. Since the rates r were normalized around r = 0, we expected ⟨r(0)r(x)⟩ ≈ 0 if rates were not correlated at a distance x (see Methods).
Overall, rate correlations between genes were weak. The autocorrelation function is plotted versus the gene separation x in Figure 1.
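The normalization and autocorrelation test described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the independent-sites standard deviation is the binomial one, sqrt(p0(1 - p0)/N), and it pairs genes by distance on a single chromosome rather than by neighbor rank. The function names, `bin_width`, and `max_dist` are hypothetical.

```python
import math
from collections import defaultdict

def normalized_rate(p, p0, n_sites):
    """Map a per-gene conservation rate p to a normalized value r.

    p0 is the genome-wide mode rate and n_sites the number of fourfold
    sites in the gene. The binomial standard deviation is an assumption.
    """
    sigma = math.sqrt(p0 * (1.0 - p0) / n_sites)
    return (p - p0) / sigma

def binned_autocorrelation(positions, rates, bin_width, max_dist):
    """Average r(0)*r(x) over gene pairs on one chromosome, binned by x.

    positions: gene coordinates (bp); rates: normalized rates r.
    Returns {bin_index: mean product} for separations below max_dist.
    """
    order = sorted(range(len(positions)), key=lambda i: positions[i])
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            x = positions[j] - positions[i]
            if x >= max_dist:
                break  # genes are sorted, so later pairs are farther apart
            k = int(x // bin_width)
            sums[k] += rates[i] * rates[j]
            counts[k] += 1
    return {k: sums[k] / counts[k] for k in sorted(counts)}
```

With uncorrelated rates, every bin average should scatter around zero; a run of positive values at small x would indicate regional bias.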
We next measured the rates at which silent sites were conserved among all four yeast species, to allow for phylogenetic footprinting using all of the genomes. For each gene shared by all four species, we measured the fraction of fourfold sites in which every species has the same base. Although classifying sites based on four-species data ignores lineage-specific effects, it has the advantage of providing much stronger signal-to-noise than any two-species comparison. Because of this choice of data, we generally use the term “selection” to indicate purifying selection common to all four species. Conversely, we use “neutral” to indicate the absence of such four-species selection. The four-species conservation rates were then normalized, analogously as for the S. cerevisiae-S. paradoxus conservation rates, using the genome-wide mode conservation rate of 0.33. We repeated the autocorrelation analysis using the four-species rates, and again found no significant correlations. For completeness, we tested rate autocorrelations in the comparisons S. cerevisiae-S. mikatae and S. cerevisiae-S. bayanus, in each case finding no correlations. The conclusion that mutation rates are uncorrelated along the yeast genome was further supported by the genome-wide distribution of normalized four-species conservation rates. A useful property of normalized rates is that they should be Gaussian-distributed with unit standard deviation, if all fourfold sites mutate independently at the same rate (see Methods). The observed rate distribution followed the characteristics expected for independently mutating sites, as shown in Figure 2
Although the observed and expected four-species rate distributions were largely similar, the observed distribution had a bias toward high conservation rates. This is evidenced by the long tail at large values of r in Figure 2. [...] The high conservation values for the remaining 8% of the genes were explainable by codon usage selection (Li and Sharp 1987). Genes with high conservation values were observed to have high codon usage bias, as measured by the codon adaptation index (CAI) (Li and Sharp 1987) values for the S. cerevisiae versions of the gene. The Pearson correlation of the normalized substitution rate with CAI (Dolinski et al. 2004) was 0.67 (3568 genes; p < 10^-250). Many of the genes having high conservation rate were among the types known to be under codon usage selection, such as ribosomal genes and those involved in carbohydrate metabolism (Fig. 2).
Neutral conservation rates in promoters
Promoters, the sequences upstream of coding regions, contain functional elements such as transcription-factor-binding sites. Conservation across species can be an effective way to detect functional elements, although conservation can also be due to shared ancestry. To detect the functional elements, they must be separated from the neutral background. We used a hidden Markov model (HMM) (Rabiner 1989) to decompose promoters into neutral and selectively constrained regions based on their patterns of conservation, where conservation was defined as sharing of the same base in all four species. Hidden Markov models have previously been applied to several sequence decomposition problems, such as the identification of genes or CpG islands in DNA sequence (Burge and Karlin 1997; Durbin et al. 1998). Our HMM (see Methods; full details are in the Supplemental material) was designed to identify functional and neutral regions in promoters by breaking the promoters into high conservation regions (HCR) and low conservation regions (LCR).
Since neutral rates were found to be uniform, we chose a hidden Markov model with a single set of global parameters, and trained it using a set of 2453 promoter sequences (Kellis et al. 2003) that had four-species alignments. An example of a sequence decomposition is shown in Figure 3.
The HCRs and LCRs identified by the HMM had distinct conservation rates (Fig. 4).
The neutral rates in the LCRs were consistent with the neutral rates obtained from the fourfold site analysis, suggesting that both methods accurately measure the neutral conservation rate. The distribution of conservation rates for the LCRs overlapped well with that of the fourfold sites (solid bars in Fig. 4). The HCRs, on the other hand, contained an excess of functional elements. Using a set of regulatory motifs identified computationally by W. Wang, M. Cherry, Y. Nochomovitz, E. Jolly, D. Botstein, and H. Li (in prep.) from chromatin immunoprecipitation data, we tested whether regulatory motifs were biased toward HCRs over LCRs. While the HCRs covered only 34.3% of the promoter regions, they contained 406 of the 567 motifs in the test set and in our promoters (71.6%). As a control, we randomized the locations of the motifs within their respective promoters. In the random case, only 178 ± 10 motifs were inside HCRs, giving the observed results a p-value of 10^-107. We found similar enrichment when a set of motifs from SCPD (Zhu and Zhang 1999) was used. Out of 234 motifs, 168 (71.8%) overlapped with our HCRs, while only 85 ± 6 overlapped when their locations were randomized. Thus, we found strong overlap of the predicted motifs with our HCR regions, despite the fact that both the SCPD and Wang et al. motif sets were obtained independently of conservation information.
Genome-wide amount of promoter sequence under selection
In the previous section we showed that the fourfold site neutral conservation rate can provide a calibration for detecting the functionally conserved regions within promoters. In this section, we use the neutral rate to calculate the fraction of all promoter sequence that is under purifying selection. Although a good approximation, the HCRs and LCRs did not always correspond to functional and neutral regions, as the separation by the HMM was imperfect.
To precisely evaluate the amount of functional sequence, we therefore used a different approach. The procedure involved counting the numbers of blocks of n consecutive conserved bases in the promoter sequences, which were then compared to neutral expectations. This approach did not infer whether any specific region is functional, instead only predicting the total amount of functional sequence. This method, which we refer to as the Frequency of Conserved Blocks (FCB) method, was more robust than the HMM for inferring the amount of selectively conserved sequence (see Methods). The FCB method yields an accurate estimate of the amount of functionally conserved sequence subject to two requirements: (1) the frequency distribution of conserved blocks in neutral sequence is known; and (2) this neutral component can be extracted from the real frequency distribution. Both of these requirements could be met for the yeast data set. We were able to generate the conserved n-mer distribution from synthetic neutral sequences of the same lengths as the real promoters using the known neutral conservation rate. We could extract the neutral component of the observed frequency distribution by considering the counts of conserved n-mers at small n. This extraction was based on the assumption that for small n, the conserved n-mer counts would be dominated by the neutral component, because functional conservation would be unlikely to cause isolated conserved bases. The assumption was supported by the neutral conservation rate inferred from the small n counts, which agreed with the rate estimated from gene silent sites (see below). We calculated the amount of promoter sequence selectively conserved among all four species, as well as the amount in the pairwise comparisons of S. cerevisiae to S. paradoxus, S. mikatae, and S. bayanus. The observed conserved n-mer distribution and the simulated neutral distribution for S. cerevisiae-S. bayanus are shown in Figure 5
We obtained the percentage of promoter sequence evolving neutrally using the ratio of conserved singlet (n = 1) counts for several different species comparisons (Table 1). This neutral percentage was calculated 100 times using independent simulations of neutral promoter sequence. The estimated percentages were similar (70%-74%) in all of the pairwise and four-species comparisons. In other words, ~30% of the promoter sites are under selection in each of these lineages. The robustness of this estimate is remarkable, given that the number count of conserved singlets varies by 2.5-fold over the different comparisons. These similarities in amounts of functional promoter sequence reflect the close phylogeny of these species (divergence times were 5-20 Myr). We also expected that more closely related species would have slightly higher levels of functional conservation, because they should have more shared functional sequences. This trend was, indeed, observed, as shown in Table 1, where the rows are arranged in order of increasing divergence time.
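The singlet-ratio step of the Frequency of Conserved Blocks method can be illustrated with a short Python sketch. It rests on one reading of the text that should be treated as an assumption: the neutral fraction is taken as the observed count of isolated conserved bases divided by the count expected if every promoter site were neutral, with the expectation obtained by simulation. All function names and the per-site Bernoulli model of neutral conservation are hypothetical choices for this example.

```python
import random
from collections import Counter

def run_length_counts(conserved):
    """Count maximal runs of consecutive conserved positions by length n."""
    counts = Counter()
    run = 0
    for is_conserved in conserved:
        if is_conserved:
            run += 1
        else:
            if run:
                counts[run] += 1
            run = 0
    if run:
        counts[run] += 1
    return counts

def simulate_neutral(length, rate, rng):
    """Synthetic neutral sequence: each site conserved independently."""
    return [rng.random() < rate for _ in range(length)]

def neutral_fraction(observed_seqs, rate, n_sims=100, seed=0):
    """Ratio of observed singlets to the mean simulated neutral singlets.

    Simulated sequences have the same lengths as the observed ones.
    """
    rng = random.Random(seed)
    observed = sum(run_length_counts(s)[1] for s in observed_seqs)
    simulated = []
    for _ in range(n_sims):
        total = 0
        for s in observed_seqs:
            total += run_length_counts(simulate_neutral(len(s), rate, rng))[1]
        simulated.append(total)
    return observed / (sum(simulated) / len(simulated))
```

Applied to real alignments, `observed_seqs` would hold one conserved/not-conserved vector per promoter and `rate` would be the neutral conservation rate for the comparison in question (for example, 0.33 for the four-species case).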
The separation of the neutral component from the frequency distribution also allows one to calculate the likelihood that a block of n conserved bases is functional. This can be done by comparing the estimated number of neutrally conserved n-mers to the total number of conserved n-mers. For example, for a 6-mer conserved across all four species, the probability that it is functional is 90% (see Methods). The inset in Figure 5 [...]
Gene-specific selection in promoters
Gene promoters have varying levels of functional conservation. Because the HCRs strongly correlate with known transcription-factor-binding sites, much of this functional conservation is likely to be related to transcriptional regulation, although some other functional features may contribute as well (see Discussion). Thus, the HCRs provide a rough characterization of the transcriptional regulation in each promoter. All HCRs have conservation rates higher than the typical LCR (Fig. 4). The mode length of HCRs in promoters was found to be 90 bases, although the distribution has a long tail (see Supplemental material). This corresponds to most genes having 15%-25% of their promoter sequence in HCRs. We systematically surveyed all Gene Ontology (GO) (Gene Ontology Consortium 2000) terms to determine if any were biased toward long HCR regions. For each GO term, we calculated the mean length of the HCRs for the associated genes. This mean length was compared to the mean for an equal number of randomly chosen genes, to determine an HCR length z-score z_l (see Methods). There were several strong outliers with positive z-scores, but fewer for negative z-scores. The GO terms with the largest HCR length biases were those involved in the energy generation (Glucose Catabolism, Alcohol Catabolism) and steroid synthesis (Steroid Metabolism, Sterol Biosynthesis) pathways, suggesting that these types of genes have unusually complex regulation.
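The HCR length z-score for a GO term can be sketched as a resampling comparison. The exact procedure is given in the paper's Methods and is not reproduced here, so the following is only one plausible reading: the term's mean HCR length is compared with the distribution of means from equally sized random gene sets. The function name, the number of random draws, and sampling without replacement are assumptions.

```python
import random
import statistics

def hcr_length_zscore(term_genes, hcr_length, n_samples=1000, seed=0):
    """z-score of a gene set's mean HCR length against random gene sets.

    term_genes: gene identifiers annotated with the GO term.
    hcr_length: dict mapping every gene to its total HCR length (bases).
    """
    rng = random.Random(seed)
    all_genes = sorted(hcr_length)
    k = len(term_genes)
    observed = statistics.mean(hcr_length[g] for g in term_genes)
    # Means of random gene sets of the same size as the GO term
    sample_means = [
        statistics.mean(hcr_length[g] for g in rng.sample(all_genes, k))
        for _ in range(n_samples)
    ]
    return (observed - statistics.mean(sample_means)) / statistics.stdev(sample_means)
```

A strongly positive value indicates a term whose genes carry more HCR sequence than expected by chance; a strongly negative value indicates the opposite.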
The GO terms with the shortest HCRs included RNA processing (GO: 0006396) and condensed chromosomes (GO:0000793), although their biases were not as pronounced as for the high z-scores. The full list of GO terms can be found in the Supplemental material. The frequency of amino-acid-changing mutations in genes (Ka) provided an interesting comparison for the amount of HCR sequence (see Methods). These quantities measure the amounts of selective pressure on protein sequence and cis-regulation, respectively. Figure 6
Discussion By analyzing sequence conservation patterns at silent sites in yeast coding sequence as well as in promoter sequences, we have made several findings. First, we found that the neutral conservation rate is uniform across yeast genomes (S. cerevisiae, S. paradoxus, S. mikatae, and S. bayanus). This uniformity was found in pairwise comparisons of S. cerevisiae to each of the other species and among all four species collectively. Knowledge of the uniform neutral conservation rate allowed us to separate functional and neutral sequence. A significant fraction of promoter sequence was under purifying selection. For various species comparisons, the amount of selectively conserved sequence ranged from 26% to 30%. The similarities in amount of functional sequence suggested that lineage-specific functional elements are rare compared to elements common to all four species. More recently diverged species were found to have a slightly larger amount of shared functional sequence. For example, we predicted that S. cerevisiae and S. bayanus, which are diverged by ~20 Myr (Kellis et al. 2003) have 27% of their promoter sequences functionally conserved. Meanwhile, S. cerevisiae and S. paradoxus, which are diverged by ~5 Myr (Kellis et al. 2003) have 30% of their promoter sequence functionally conserved. This mere 3% change between these two evolutionary branches suggests that S. cerevisiae is unlikely to have much more than 30% of its promoter sequence functional, even including species-specific elements. Promoters contained blocks of highly conserved sequence embedded in a background mutating at the uniform neutral rate. Known functional elements were found to be strongly biased to these high conservation regions (HCRs). The HCR blocks were typically 8-20 bases long, although some blocks had as many as 40 consecutive conserved bases. Thus, a typical block may contain one or two protein-binding sites. 
The typical length of all HCR regions in a promoter was 90 bases, providing an upper bound of ~10 transcription-factor-binding sites in a promoter, although this number could be mitigated by sites conserved for other reasons. The example of the HIS4 promoter in Figure 3 Genes involved in energy generation and steroid synthesis were found to have the strongest biases toward long HCRs. This suggested that these genes may be subject to complex transcriptional regulation, because the number of transcription-factor-binding sites should increase with the length of HCRs in a promoter, and increased binding sites would allow more specific responses to different cellular conditions. The long HCRs of these genes may be related to the fact that these genes are often members of multiple pathways. For example, ERG10 (YPL028W) is a steroid metabolism gene with 215 HCR sites, and it is involved in nine different KEGG pathways (Kanehisa and Goto 2000). CDC19 (YAL038W) is a glycolysis gene with 394 HCR sites, and it is involved in four different pathways. Energy generation and steroid synthesis genes are each important for fitness—both were found to have above average protein sequence conservation as measured by KA. One quantitative distinction between these types of genes is their level of silent site conservation. Energy generation genes have high silent conservation rates, which could be an indication of high translation efficiency (Akashi 2004), while we found that steroid synthesis genes are more average in this regard. At the other extreme of promoter sequence conservation, genes with the shortest HCR lengths tended to be involved in RNA processing. This was consistent with the view that RNA processing genes are constitutively expressed, which would require simple transcriptional regulation. Our observation that the neutral mutation rate is uniform in the yeast genome is in sharp contrast to what is known in mammals, where the rate varies along chromosomes. 
This result was surprising, because recombination rates vary along the yeast genome (Gerton et al. 2000) and recombination events have been reported to increase local mutation rates (Rattray et al. 2002). However, recombination-associated mutagenesis does not appear to have occurred frequently enough to have had a major impact on the sequences. We found that recombination rates in S. cerevisiae genes, as obtained from Gerton et al. (2000), were not correlated with our four-species normalized conservation rates (3530 genes; r = 0.02, p = 0.18). Since the mechanism for variable rate in mammals is still unknown, we can only speculate on what may be responsible for the difference between yeast and mammals. One nonselective possibility is that yeast chromosomes are too short to have heterogeneity in their mutational environment. For example, in the mouse-human comparison, rate correlations have a typical length scale of several megabases, while most yeast chromosomes are <1 Mb long. Alternatively, a selective possibility is that nonuniform rates create hotspots that provide a benefit in mammals (Cox 1972; Chuang and Li 2004), but not in yeast. The general approach we followed was to determine the local neutral conservation rate and then to calibrate the separation of functional and neutral sequence in promoters by this rate. This approach was of a qualitatively different nature than those of Kellis et al. (2003) and Cliften et al. (2003), who also analyzed sequence conservation in yeast promoters. Both of these groups searched for sequence motifs of a specified pattern, and then assessed their significance by evaluating the overall conservation rate across all motif occurrences in the genome. By applying bootstrapping-type methods, they were able to determine which motifs of the specified pattern had outlier behavior. In contrast, we analyzed the overall conservation pattern and not specific motifs. 
By accurately calibrating for the neutral rate of conservation, we were able to decompose the neutral and functionally conserved sequence on a global scale. Our approach was similar in spirit to that taken by the Mouse Sequencing Consortium in estimating the total amount of functional sequence in the human genome (Mouse Genome Sequencing Consortium 2002). They did so by analyzing the distribution of alignment scores for all aligned human-mouse 50-mers and comparing it to a distribution of scores from presumably neutral repeat sequences. Our separation method based on the frequency of conserved blocks has higher resolution, as it is able to evaluate functionally constrained sites a few base pairs in length, and is also robust over different species' divergence times. Mutational uniformity was a key aspect of our findings, as it justified genome-wide approaches to separating functional from neutral sequence. In addition to the HMM method, it allowed us to apply a novel separation method based on the frequencies of conserved blocks. Despite the fact that the neutral conservation rate differed 2.5-fold among different yeast comparisons, our conserved block method was able to robustly infer similar amounts of selectively conserved sequence in each of these lineages. If uniformity of mutation can be established in other species, it should be possible to apply these separation methods to them as well. For genomes with regional mutation biases, such as mouse and human, it may be possible to apply these methods by breaking up the genome into blocks with internally homogeneous neutral mutation rates. By measuring the lengths of the HCRs in each promoter, we produced a novel dimension for quantifying selection on yeast genes. Traditionally, selective arguments have been applied to genes based on conservation of their protein sequence (KA) and/or their silent sites in the coding sequence (KS). 
In contrast, the total length of HCRs serves as a measure of regulatory complexity for each gene. Certainly, this is a simplification, as not all functional elements need be conserved across species, and not all of the conserved functional elements need be transcription-factor-binding sites; for example, they could be RNAs (Ludwig 2002; Martens et al. 2004), replication origins, or may be related to translation (Vilardell and Warner 1997; Cliften et al. 2003). However, the strong bias of known transcription-factor-binding sites to HCRs suggests that the HCRs are a useful approximation to regulatory sequence. These HCRs will be beneficial not only for characterizing regulation in each gene promoter, but may also shed light on the evolution of yeast promoters and gene expression. Decompositions of each promoter into high conservation and low conservation regions are available at http://genome.ucsf.edu/YeastReg. Methods Calculation of substitution rates from fourfold sites We obtained coding sequence data from the data set of Kellis et al. (2003) for each of the four species S. cerevisiae, S. paradoxus, S. bayanus, and S. mikatae. We translated each coding sequence into protein sequence, aligned the protein sequences with their orthologs using CLUSTALW, and then back-translated to determine the aligned coding sequences. In some cases, the coding sequences contained stop codons. In these cases, any sequence following a stop codon was discarded. We used fourfold sites, the third bases of codons for which the amino acid is specified by the first two bases, to analyze mutation rates. If the sequence before a stop codon contained fewer than 20 fourfold sites, the entire sequence was discarded. For each of the 4541 genes in the data set shared by S. cerevisiae and S. paradoxus, we calculated the fraction of fourfold sites that were identical. For each of the 3571 genes shared by all four species, we calculated the fraction of fourfold sites identical in all species. 
Fourfold site conservation rates were analyzed because they do not affect amino acid sequence and therefore provide an estimate of neutral conservation processes. These conservation rates were then mapped to normalized values to account for finite-size effects. We defined the normalized substitution rate to be
$$r = \frac{p - p_0}{\sigma(N)},$$
where p is the fourfold conservation rate in the gene, p0 is the mode conservation rate in the genome, N is the number of fourfold sites, and σ(N) is the standard deviation calculated from a binomial model, equal to
$$\sigma(N) = \sqrt{\frac{p_0 (1 - p_0)}{N}}.$$
Although this mutation model is rather simple with respect to base composition, it has several good features: It gives r = 0 when the mutation rate is equal to the typical rate in the genome, accounts for fluctuations due to gene length, and predicts a normal distribution for r values if fourfold sites mutate independently with a uniform rate (Chuang and Li 2004).
Mutational uniformity
We considered all pairs of genes on continuous orthologous blocks, starting from the first neighbor up to the 35th gene downstream. Orthologous block boundaries were defined by genes at which the S. cerevisiae chromosome changes. This allowed us to get hundreds of measurements of r(0)r(x) for x values as large as 100 kb. We binned these data into 50 uniformly spaced groups covering x ∈ [0, 300000] and then averaged over each of these bins to determine the correlation function ⟨r(0)r(x)⟩, where ⟨...⟩ indicates an average over all gene pairs in the bin. Error bars were given by the standard deviation of the values in each bin, multiplied by a correction factor of M^(-1/2), where M is the number of gene pairs in the bin, because we were interested in the error in the mean. Since the plotted values for each bin are averages over the M gene pairs, our autocorrelation analysis may still report uniformity if there are fewer than M^(1/2) gene pairs whose mutation rates are correlated. The number of gene pairs in any bin is proportional to the number of genes in our data set (~4500 for S. cerevisiae-S. paradoxus), meaning the sensitivity of our autocorrelation analysis is ~70 gene pairs.
The conclusion of uniformity from the autocorrelation analysis is a statement that there are no more than ~70 pairs of genes (1%-2% of the genome) whose rates are correlated at any given distance.

For randomly sampled gene pairs, the average value of r1 × r2 was 0.01. ⟨r(0)r(x)⟩ values were only considered significant if the lower error bar was above this value. In practice, 0.01 is indistinguishable from 0 for the scale of values in Figure 1. [...] ⟨r(0)r(x)⟩, this slight autocorrelation vanished. The slight autocorrelation was caused by a block of 12 yeast genes on Chromosome 14 (YNL320-YNL323, YNL325-YNL332) with high sequence conservation between S. cerevisiae and S. paradoxus. When these 12 genes were removed, the unusual values disappeared. Furthermore, correlations in the rates of neighboring genes largely vanished (full data set: Pearson correlation = 0.10, p-value = 3.3 × 10^-12; data set with these genes removed: Pearson correlation = 0.04, p-value = 0.01). Because these genes were a special case, Figure 1 [...]

The rates determined from the four-species comparisons also indicated no regional biases in mutation rate. Neighboring genes had only marginal correlation in conservation rate (Pearson correlation = 0.03, p-value = 0.05). Autocorrelation analysis was more complicated than for the S. cerevisiae-S. paradoxus comparison because of the significant number of genes with high four-species conservation rates, as illustrated by the long tail in Figure 2. These genes caused ⟨r(0)r(x)⟩ to have a spurious nonzero expectation (~0.1) at all x. However, we found that when genes with extremely high conservation rates (r ≥ 5, 88 genes) were removed, this effect vanished. After removal, ⟨r(0)r(x)⟩ had an average value of 0.02 in the region x ∈ (0, 100 kb).
Removal of these genes did not correspond to removing any clusters of slowly mutating genes; the median distance between these cold genes was 82 kb (~30 genes away).

As mentioned in the main text, there was also no correlation between the normalized conservation rates of the LCRs inferred from the HMM and the fourfold sites of their genes. It might be supposed that regional effects could exist for the small group of genes with the highest fourfold conservation rates. However, this was not the case. For those genes with fourfold conservation z-scores >3, there was also no correlation (Pearson correlation r = -0.18; p = 0.1, 195 genes).

We also performed uniformity tests for comparisons of S. cerevisiae-S. mikatae and S. cerevisiae-S. bayanus. For the S. cerevisiae-S. mikatae comparison, neighboring genes had insignificant Pearson correlation in their normalized fourfold substitution rates (r = 0.004; p = 0.80), the median values of the autocorrelation were all within error of zero, and the rate distribution closely followed a normal distribution. For S. cerevisiae-S. bayanus, these qualities all held true as well (neighboring gene Pearson correlation: r = 0.023; p = 0.10).

Separation of high and low conserved regions with a hidden Markov model

A hidden Markov model (HMM) (Durbin et al. 1998) assumes that there is a hidden state at each position in a sequence (in our case, the two hidden states are neutral and under selective constraint). The observed sequence values (in our case, the conservation patterns along the aligned promoter sequences) are probabilistic outcomes emitted by the hidden states. These emission probabilities, as well as the transition probabilities between hidden states, are unknown parameters that can be learned through an iterative procedure that attempts to maximize the likelihood of the observed sequence as a function of these parameters (Baum 1972).
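To make the quantity being maximized concrete, the sketch below computes the log-likelihood of a conservation pattern under a two-state HMM using the standard scaled forward recursion (as described in Durbin et al. 1998). It is an illustration only: the encoding of observations as 0 (not conserved) and 1 (conserved), the state order (0 = neutral, 1 = constrained), and every parameter value in the usage lines are assumptions, not values from this study.

```python
import math


def forward_loglik(obs, start, trans, emit):
    """Log-likelihood of a 0/1 conservation pattern under an HMM.

    obs:   list of observations, 0 (not conserved) or 1 (conserved)
    start: start[s]    = P(first hidden state is s)
    trans: trans[s][t] = P(next state is t | current state is s)
    emit:  emit[s][o]  = P(observing o | hidden state is s)
    """
    n_states = len(start)
    # alpha[s] is proportional to P(observations so far, current state = s)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    loglik = 0.0
    for o in obs[1:]:
        # rescale to avoid underflow on long promoters, tracking the log scale
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
        alpha = [
            sum(alpha[s] * trans[s][t] for s in range(n_states)) * emit[t][o]
            for t in range(n_states)
        ]
    return loglik + math.log(sum(alpha))


# Hypothetical parameters: neutral sites conserved 70% of the time,
# constrained sites 95% of the time, with "sticky" states.
start = [0.7, 0.3]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.3, 0.7], [0.05, 0.95]]
print(forward_loglik([1, 1, 0, 1, 1, 1, 1, 0, 0, 1], start, trans, emit))
```

An iterative fitting procedure would repeatedly adjust `trans` and `emit` so that this value, summed over all promoters, increases.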
The value of the hidden state at each position can be recovered by considering the sequence of hidden states most likely to have produced the observed sequence values. Our HMM methods follow standard protocols as described in Durbin et al. (1998). Full details of the HMM are provided in the Supplemental materials.

Genome-wide percentage of promoter sites under selection

In the frequencies of conserved blocks (FCB) method, we defined a conserved n-mer to be n consecutive conserved sites flanked by nonconserved sites or gaps. The numbers of conserved n-mers f(n) were used to calculate the percentage of sites under selection. This percentage was determined by comparing the f_obs(n) distribution for the observed data to the f_sim(n) function found for simulated neutrally evolving sequence. Such simulated neutrally evolving sequence was generated using the neutral conservation rate obtained from the fourfold site analysis. In each species comparison, the neutral conservation rate was obtained by Bayesian demixing. For each promoter, we generated a conservation pattern by randomly assigning each site to be conserved with a probability equal to the neutral fourfold conservation rate. The gap positions and the length of each promoter were preserved. This yielded a distribution f_sim(n), which was expected to have the same n-dependence as would be expected for neutrally evolving sequence. However, since f_sim(n) was generated using the full lengths of promoters, it was necessary to separately determine how much of the real promoter sequence followed the f_sim(n) distribution.
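The counting and simulation steps of the FCB method can be sketched as follows. This is a minimal illustration under stated assumptions: each promoter alignment is encoded as a string with `C` for a conserved site, `N` for a nonconserved site, and `-` for a gap; the ends of a promoter are treated as block boundaries; and the neutral rate of 0.7 and the two toy patterns in the usage lines are placeholders rather than values from this study. The encoding and function names are hypothetical.

```python
import random
from collections import Counter


def count_conserved_blocks(pattern):
    """Count maximal runs of conserved sites ('C'), keyed by run length n.
    A run ends at a nonconserved site, a gap, or the end of the promoter."""
    counts = Counter()
    run = 0
    for site in pattern:
        if site == "C":
            run += 1
        else:
            if run:
                counts[run] += 1
            run = 0
    if run:
        counts[run] += 1
    return counts


def simulate_neutral(pattern, neutral_rate, rng):
    """Redraw every non-gap site as conserved with probability neutral_rate,
    keeping the gap positions and the promoter length unchanged."""
    return "".join(
        "-" if site == "-" else ("C" if rng.random() < neutral_rate else "N")
        for site in pattern
    )


# Toy usage: accumulate f_obs(n) and f_sim(n) over a set of promoters.
rng = random.Random(0)
promoters = ["CCNC-CCCN", "NCCCCNN-C"]
f_obs = sum((count_conserved_blocks(p) for p in promoters), Counter())
f_sim = sum(
    (count_conserved_blocks(simulate_neutral(p, 0.7, rng)) for p in promoters),
    Counter(),
)
print(sorted(f_obs.items()))
print(sorted(f_sim.items()))
```

In practice the simulation would be repeated many times and averaged so that f_sim(n) is smooth enough to compare against f_obs(n).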
An example of the observed distribution and the simulated distribution is shown in Figure 5. [...] Note that this implies the overall fraction of sequence in promoters which is neutral is [...]. It also follows that for a conserved n-mer, the probability that it is functional is [...].

We used the FCB method to estimate the amount of functionally conserved sequence, rather than just the HCRs from the HMM, because the FCB method was more robust. In general, the HMM decomposition worked if the neutral and functional conservation rates were sufficiently distinct. This appeared to be the case in the four-species comparison: The amount of predicted functional sequence was similar from the HMM and the conserved n-mer methods, the conservation rates in LCRs overlapped well with the fourfold rates, and the HCR and LCR rate location dependences were consistent with functionality and neutrality, respectively (see Supplemental material). However, the HMM tended to overestimate the amount of functional sequence when species were closely related. The HMM method identified 75% of the promoter sequence as HCR in the S. cerevisiae-S. paradoxus comparison, 50% for the S. cerevisiae-S. mikatae comparison, and 38% for the S. cerevisiae-S. bayanus comparison. On the other hand, the FCB method gave consistent estimates of the amount of functional sequence over a range of species divergences.

z-score in Gene Ontology analysis

For a given type of measurement, for example, the average HCR length, the z-score for each GO category was calculated by comparing the value of the measurement for the genes in the GO category against the values for randomly sampled genes. For example, if a GO category had x genes, a mean m_s and standard deviation σ_s of the measurement s were calculated from 1000 samples of x randomly chosen genes. The z-score for the GO category was then reported as z = (m - m_s)/σ_s, where m was the value of the measurement from the genes in the GO category.
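The sampling-based z-score can be sketched in Python as below. This is an illustrative sketch rather than the original code: it assumes the measurement for a gene set is the mean of a per-gene value, that random gene sets are drawn without replacement from all genes, and that σ_s is the sample standard deviation of the 1000 sampled means. The function name and the dictionary input are hypothetical.

```python
import random
import statistics


def go_zscore(values_by_gene, category_genes, n_samples=1000, seed=0):
    """z = (m - m_s) / sigma_s for one GO category.

    values_by_gene: dict mapping gene name -> measurement (e.g. HCR length)
    category_genes: list of genes annotated to the category
    """
    rng = random.Random(seed)
    all_genes = list(values_by_gene)
    x = len(category_genes)

    # measurement for the category itself
    m = statistics.mean(values_by_gene[g] for g in category_genes)

    # the same measurement for n_samples random sets of x genes
    sample_means = [
        statistics.mean(values_by_gene[g] for g in rng.sample(all_genes, x))
        for _ in range(n_samples)
    ]
    m_s = statistics.mean(sample_means)
    sigma_s = statistics.stdev(sample_means)
    return (m - m_s) / sigma_s


# Toy usage with made-up values: 100 genes, category = the 10 largest.
values = {f"gene{i}": float(i) for i in range(100)}
category = [f"gene{i}" for i in range(90, 100)]
print(go_zscore(values, category))
```

Because the null distribution is built from gene sets of the same size as the category, small and large GO categories are compared on an equal footing.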
To calculate z_a, the z-score based on amino-acid-changing substitutions in the coding sequence, we used the Ka values published by Kellis et al. (2003) for the S. cerevisiae-S. bayanus comparison.

Acknowledgments

J.C. is supported by the National Science Foundation under a grant awarded in 2003. C.C. and H.L. are supported in part by the NIH (GM070808 to H.L.) and by a Packard fellowship in science and engineering (to H.L.). The authors thank E. O'Shea, P. O'Farrell, B. Tuch, and M. Samanta for comments on the manuscript.

Footnotes

[Supplemental material is available online at www.genome.org and http://genome.ucsf.edu/YeastReg.] Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.3243305. Article published online before print in January 2005.