Genetics. 2006 Jan; 172(1): 569–577.
PMCID: PMC1456184

Variation in Mutation Dynamics Across the Maize Genome as a Function of Regional and Flanking Base Composition


We examine variation in mutation dynamics across a single genome (Zea mays ssp. mays) in relation to regional and flanking base composition using a data set of 10,472 SNPs generated by resequencing 1776 transcribed regions. We report several relationships between flanking base composition and mutation pattern. The A + T content of the two sites immediately flanking the mutation site is correlated with rate, transition bias, and GC → AT pressure. We also observe a significant CpG effect, or increase in transition rate at CpG sites. At the regional level we find that the strength of the CpG effect is correlated with regional A + T content, ranging from a 1.7-fold increase in transition rate in relatively G + C-rich regions to a 2.6-fold increase in A + T-rich regions. We also observe a relationship between locus A + T content and GC → AT pressure. This regional effect is in opposition to the influence of the two immediate neighbors in that GC → AT pressure increases with increasing locus A + T content but decreases with increasing flanking base A + T content and may represent a relationship between genome location and mutation bias. The data indicate multiple context effects on mutations, resulting in significant variation in mutation dynamics across the genome.

EVOLUTION is ultimately dependent on mutation and thus characterizing mutation rates and biases, within and among genomes, is a prerequisite for studying genomics and molecular evolution. For example, comparative genomics requires an understanding of mutation dynamics in different lineages (e.g., Dermitzakis et al. 2002), and compositional patterns such as the possible isochore structure in vertebrates (Bernardi 2000, but see Cohen et al. 2005) cannot be adequately studied without an understanding of how mutation bias varies along chromosomes (e.g., Duret et al. 2002). Increasingly, analyses of large SNP data sets, such as the recent analysis of 2,576,903 human SNPs (Zhao and Boerwinkle 2002), are proving to be valuable for studies of mutation bias. The availability of SNP data from many different taxa now makes it feasible to develop a more detailed knowledge of factors that contribute to variation in mutational biases.

A number of analyses of mutations have demonstrated that context, or the composition of nucleotides flanking a mutation, can have a significant influence on both mutation bias and overall mutation rate (Bulmer 1986; Morton 1995; Krawczak et al. 1998; Zhao and Boerwinkle 2002; Morton 2003). Although context effects are not often considered in studies that apply mutation parameters (although see Arndt et al. 2003; Siepel and Haussler 2003), there is evidence that understanding and incorporating such effects may be very important for interpreting genomic data (Morton 2003; Siepel and Haussler 2003) since they can result in variation in mutation dynamics across sites. In nuclear genes, the most apparent neighboring nucleotide effect that has been studied to date is the CpG effect, which is an increased rate of transitions at CpG dinucleotides as a result of deamination of methylated CpG sites (Duncan and Miller 1980; Bulmer 1986; Cooper and Youssoufian 1988). The CpG effect has been primarily studied in vertebrate genomes (Krawczak et al. 1998; Zhao and Boerwinkle 2002; Fryxell and Moon 2005), and in human sequences there is a fivefold increase in the rate of transitions at CpG sites due to deamination of methylated cytosines (Krawczak et al. 1998). The CpG effect appears to be weaker in G + C-rich regions, possibly due to greater local helix stability (Fryxell and Moon 2005), and appears to be slightly stronger on the coding strand than on the template strand near genes (Krawczak et al. 1998).

Context dependency of mutations has also been studied in grass chloroplast DNA (cpDNA; Morton 1995, 2003). In this genome there is a significant correlation between the A + T content of the two sites flanking a mutation (the A/T context) and both the overall substitution rate and the transition:transversion (Ts:Tv) bias, due to a decreasing rate of transition substitutions as the A/T context increases (Morton 2003). Since the observed context dependency is not consistent with CpG deamination, and since CpG methylation is not known to occur in cpDNA, it has been suggested that factors such as polymerase fidelity and variable repair efficiency may be responsible for context-dependent mutation biases (Morton 2003). Neighboring base composition also influences substitution dynamics in cpDNA in other ways; both the bias toward A + T and the bias toward pyrimidines are a function of context (Morton 2003). Similar context-dependent mutation patterns appear to exist in cpDNA across different flowering-plant lineages (Morton 1997; Yang et al. 2002).

Given the growing body of evidence regarding context dependency and the lack of data about regional variation in mutation properties, there is a need to better understand context dependency and how mutation dynamics vary across individual genomes. To further our understanding of mutational context and variation, we have analyzed a large SNP data set generated from nuclear genes of maize (Zea mays ssp. mays) with respect to both regional and flanking base composition. We find evidence that the A + T content of flanking nucleotides has an influence on various aspects of mutation dynamics and report a correlation between regional base composition and both CpG effect and the relative rates of GC → AT and AT → GC mutations, or GC → AT mutation pressure.


MATERIALS AND METHODS

Sequence data:

The sequence data analyzed in this article were reported previously (Wright et al. 2005; Yamasaki et al. 2005; GenBank nos. BV123534–BV144210, BV446558–BV447590, and BV106362–BV123527). Briefly, PCR primers were designed to amplify the 3′ regions of ∼2000 sequences from the Maize Mapping Project/Dupont unigene set (http://www.agron.missouri.edu/files_dl/MMP/Cornsensus). For each locus, PCR was performed on genomic DNA from 14 individuals representing the genetic diversity of modern maize inbreds. The sequencing, processing, alignment, and quality of the DNA sequence data were described previously (Wright et al. 2005; Yamasaki et al. 2005).

We modified the alignments in three ways. First, any SNP site that was not supported by a phred quality score of at least 30 for both variants was assigned an “N” for all individuals and ignored in analyses. Second, some alignments were modified slightly to correct for apparent indel errors in coding regions (see below). Third, some loci were excluded from our analyses, either because they did not contain sequences from at least four of the inbred lines or because coding region assignment was uncertain. In total we analyzed 1776 loci with an average A + T content of 53.0% and a variance of 7.3%.

Definition of coding and noncoding regions:

To define coding regions, the unigene sequences were compared to the annotated rice peptide set (version 2 at http://www.tigr.org) and Arabidopsis peptide set (http://www.ncbi.nlm.nih.gov/ on August 16, 2004) with BLASTx. Any hit with an e-value <1e-5 was retained and considered a putative protein coding region (pCDS). The pCDS for each unigene was also estimated by finding the longest open reading frame on the basis of analyses with the bio perl module “getorf” of the EMBOSS package (Rice et al. 2000). Getorf was applied without assuming 5′–3′ directionality and without assuming the presence of a start codon.

To ascertain whether any portion of pCDSs from unigenes were present in genomic alignments, we compared the pCDS to genomic data with BLASTn. All BLAST hits with an e-value <1e-5 were retained, as were the extreme 5′ and 3′ sites of the region(s) of the pCDS aligned by BLAST. The portion of the pCDS defined by the 5′ and 3′ sites was aligned to the entire genomic alignment with the program sim4 (Florea et al. 1998), using default settings. Sim4 aligns EST sequence to genomic sequence while accounting for genomic features such as consensus intron/exon junctions. Each alignment was also edited by hand both to confirm consensus intron/exon junctions and to eliminate 1-bp indels in coding regions, which were assumed to be sequencing errors when present in only one or two sequences. If there were larger indels or potential frameshifts, the coding region definition was considered ambiguous and the locus was removed from analysis. The 1776 alignments used in this study, including coding regions, are available from http://gautlab.bio.uci.edu/data.

Analysis of mutations:

The alignments were analyzed using a Java package written by one of the authors (B. R. Morton). Sites with a gap introduced into any sequence and SNPs at sites defined as coding were excluded from the analyses. At every variable noncoding site the most parsimonious number of changes was assumed and, given the lack of data from an outgroup taxon, mutations were polarized using the most frequent nucleotide at that site. The reliability of this method of polarization has yet to be established, so any conclusions dependent upon polarization must be considered in this light. As data from outgroup taxa become available, they will allow us to evaluate the validity of this method of polarization.
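The majority-base polarization described above can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' Java package: the function name `polarize_site` is hypothetical, and the handling of ties (returning no polarization) is an assumption, since the text does not say how ties were resolved.

```python
from collections import Counter

def polarize_site(column):
    """Infer (ancestral, derived_alleles) for one alignment column.

    column: string holding one base per sequence, e.g. "AAAAGAAAAAAA".
    Characters other than A, C, G, T (such as N) are ignored.
    Returns None for conserved sites and for ties (tie rule is an assumption).
    """
    counts = Counter(b for b in column.upper() if b in "ACGT")
    if len(counts) < 2:
        return None  # conserved site, or no usable data
    ranked = counts.most_common()
    if ranked[0][1] == ranked[1][1]:
        return None  # no unique majority base
    ancestral = ranked[0][0]           # most frequent base = inferred ancestral state
    derived = sorted(b for b, _ in ranked[1:])
    return ancestral, derived

print(polarize_site("AAAAGAAANAAA"))
```

Under this sketch, a column with eleven A's and one G is scored as an A → G change.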

The context of every site, conserved or variable, was calculated using the majority base at the appropriate neighboring site(s). The contexts analyzed were (1) composition of the 5′ neighbor, (2) composition of the 3′ neighbor, (3) composition of the two 5′ neighboring nucleotides, (4) composition of the two 3′ neighboring nucleotides, (5) composition of both the 5′ and 3′ neighbor, and (6) the composition of the four flanking nucleotides, two on each side. Note that all sites occur in multiple contexts since many of these cases overlap. Heterogeneity in mutation dynamics among contexts was assessed by a likelihood-ratio test, or G-test (Sokal and Rohlf 1995).
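For readers unfamiliar with the G-test, the statistic for a contingency table of observed counts is G = 2 Σ O ln(O/E), where E is the count expected under independence. The sketch below computes G and its degrees of freedom for a table such as variable versus conserved sites across contexts. The function name and the example counts are hypothetical, and the P-value lookup against the chi-square distribution is omitted.

```python
import math

def g_statistic(table):
    """Return (G, degrees_of_freedom) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / n
                g += observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return 2.0 * g, df

# Hypothetical counts: rows are two contexts, columns are (variable, conserved)
g, df = g_statistic([[20, 10], [10, 20]])
print(round(g, 3), df)
```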

For every context we analyzed mutations as both polarized and unpolarized. For unpolarized changes we simply scored the change as a transition or a transversion. For those sites where there were two changes possible, due to three character states across the sequences, we inferred one transversion (which is necessary) and one unknown change. The latter were included in rate calculations but not in transition:transversion calculations. Only 74 of the 5932 noncoding SNP sites (1.2%) had multiple changes and exclusion of these sites did not affect the conclusions (data not shown). For the analysis of polarized mutations we generated 4 × 4 mutation matrices for every context analyzed. For each matrix, the entry mij is the number of sites observed to have a change from nucleotide i to nucleotide j, with the matrix diagonal representing the conserved sites. The rate of each mutation type was then calculated from the matrix by dividing each element by the row total. In addition, for each matrix we calculated the stationary vector (Morton 2003), which represents the expected equilibrium composition for a sequence evolving under that mutation (substitution) model. This stationary vector can be used as a descriptive parameter of the mutation matrix similar to Ts:Tv.
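The matrix calculations described above can be illustrated with a short sketch. It assumes the stationary vector is the vector π satisfying πP = π for the row-normalized matrix P, and finds it by simple power iteration. The function names and the example counts are hypothetical and are not taken from the data set.

```python
BASES = "ACGT"

def rate_matrix(counts):
    """Divide each element of a 4 x 4 count matrix by its row total."""
    return [[c / sum(row) for c in row] for row in counts]

def stationary_vector(P, max_iter=100000, tol=1e-12):
    """Power iteration for pi with pi P = pi (P is row-stochastic)."""
    pi = [0.25] * 4
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical counts; rows = original base, columns = new base, diagonal = conserved
example_counts = [
    [9600, 100, 100, 200],   # from A
    [200, 9500, 100, 200],   # from C
    [200, 100, 9500, 200],   # from G
    [200, 100, 100, 9600],   # from T
]
pi = stationary_vector(rate_matrix(example_counts))
print({b: round(p, 3) for b, p in zip(BASES, pi)})
print("equilibrium A + T:", round(pi[0] + pi[3], 3))
```

In this toy matrix every change toward A or T is twice as frequent as a change toward G or C, so the predicted equilibrium A + T content is two-thirds.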

We also examined regional effects on mutation. For this we calculated the overall A + T content of each locus, including both coding and noncoding sites, and then divided the loci into five classes: (1) A + T < 48%, (2) 48% ≤ A + T < 52%, (3) 52% ≤ A + T < 56%, (4) 56% ≤ A + T < 60%, and (5) A + T ≥ 60%. Mutations occurring in loci within each class were then grouped and analyzed together.
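The five class boundaries above translate directly into a small lookup. The sketch below is illustrative only; the function names are hypothetical, and A + T content is expressed as a fraction rather than a percentage.

```python
def at_fraction(sequence):
    """A + T fraction of a sequence, counting only A, C, G, T characters."""
    seq = sequence.upper()
    at = sum(seq.count(b) for b in "AT")
    gc = sum(seq.count(b) for b in "GC")
    return at / (at + gc)

def regional_class(at):
    """Map a locus A + T fraction to regional composition class 1-5."""
    upper_bounds = [0.48, 0.52, 0.56, 0.60]  # class k holds loci with A + T < bound k
    for k, bound in enumerate(upper_bounds, start=1):
        if at < bound:
            return k
    return 5  # A + T >= 60%

print(regional_class(at_fraction("ATGCATGCAT")))  # 6 of 10 bases are A or T
```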


RESULTS

Sequence composition:

We analyzed data from a resequencing project in which loci were sequenced from genomic DNA of up to 14 maize inbred lines (Wright et al. 2005; Yamasaki et al. 2005). Each locus is a single transcribed region of the genome that was amplified using primers designed from a unigene sequence. An alignment was generated for each locus using the coding strand sequence data. We examined 1776 of these loci for which the coding regions could be reliably defined with an average sample size of n = 12.1 sequences. The 1776 loci represent a combined alignment length of 531,503 nucleotides, of which 260,475 (49.0%) are noncoding. A total of 10,472 SNPs representing ∼2% of the sites were scored. A total of 5932 (56.6%) of the SNPs were at noncoding sites. Each SNP was scored in two ways: as an unpolarized (nondirectional) change and as a polarized (directional) change, for which the most frequent nucleotide at the site was taken as the ancestral state.

The distribution of A + T content from these loci is shown in Figure 1 for all sites as well as for only noncoding sites. In general the loci are slightly A + T-rich with an average A + T content of 53.0%. The noncoding sites are only slightly higher in A + T content, with an average composition across loci of 55.1% A + T. Along with the bias toward A + T, we observed a consistent bias of G over C and T over A both in the sequences overall and in only the noncoding sites (a “GT skew”). If we measure the T-A skew by (T − A)/(T + A) and the G-C skew by (G − C)/(G + C), the T-A skew in the noncoding sites of our data is 12.0% while the G-C skew at noncoding sites is 5.6%. This skew toward G and T in the noncoding regions near genes is similar to a recent observation of human genes (Louie et al. 2003).

Figure 1.
Distribution of A + T content across loci. A separate distribution is shown for the A + T content of all sites and for the A + T content of noncoding sites only.

To study the effect of regional composition on mutation bias, loci were divided by A + T content into the following categories: (1) A + T < 48%, (2) 48% ≤ A + T < 52%, (3) 52% ≤ A + T < 56%, (4) 56% ≤ A + T < 60%, and (5) A + T ≥ 60%. These will be referred to as the regional composition classes. The results reported here are for the SNPs at noncoding sites but all conclusions discussed below were unchanged when analyses were repeated using all SNPs, although the higher proportion of noncoding SNPs relative to noncoding sites may reflect constrained sites within the coding regions. In addition, varying the categories into which loci were divided by A + T content did not change the general results (data not shown).

General mutation patterns:

Overall, the polarized SNP data yielded a G and C nucleotide mutation rate (the GC rate) that is ∼1.6 times the rate of mutation for A and T nucleotides (the AT rate) (Table 1). The higher GC rate could potentially be due to the CpG effect, which is discussed in detail below. However, when the GC and AT rates were calculated for different 5′ and 3′ flanking nucleotides, there was a higher GC rate in every context. Thus, although the effect of CpG deamination is apparent in the higher GC rates when there is a 5′ C or a 3′ G (Table 1), the CpG effect cannot account for the overall higher GC rate. The ratio of GC-to-AT rates, which reflects the mutational AT pressure, varies across the regional composition classes; the GC:AT rate ratio is higher in those regions with a higher A + T content and lower in those regions with a lower A + T content (Table 2). This variation in mutation pressure is discussed in more detail below. Note that the rates in Table 1 tend to be slightly lower than the rates in Table 2 since accounting for context reduces the number of sites in the analysis by eliminating the first and/or last sites as well as any internal site for which context is ambiguous.

Table 1.
Rates of change from G or C nucleotides as compared to A or T nucleotides as a function of the flanking base composition

Table 2.
A comparison of GC and AT rates across the regional composition classes

We also examined mutation bias by looking at the Ts:Tv ratio, which has not been well characterized in plant nuclear genomes. Overall, transitions occur at a rate ∼1.5 times that of transversions (Table 3). This ratio is consistent across loci: although there is a slight variation in this ratio across loci as a function of regional composition, the variation is not significant (G = 3.6, P > 0.05). The maize nuclear Ts:Tv ratio is slightly higher than that of grass cpDNA, which shows an overall 1.3:1 Ts:Tv ratio. Note, however, that the Ts:Tv ratio in grass cpDNA ranges from <1 to >2.5 as a function of flanking base composition (Morton 2003).

Table 3.
Observed transitions and transversions across the regional composition classes

The effect of cytosine deamination:

To examine the influence of context on mutation bias, we first compared the frequency of transition events at CpG dinucleotides, which are known to be methylated in plant nuclear DNA, to the transition rate at other dinucleotides. Deamination of methylated cytosines at CpG dinucleotides is known to generate a significant increase in transition rate in many vertebrate taxa (Krawczak et al. 1998; Fryxell and Moon 2005) so we hypothesized that a similar CpG effect would exist in our data.

Since both strands at a CpG dinucleotide are methylated, deamination will lead to the observation of either a CG → CA change, for a deamination on the template strand, or a CG → TG change if the deamination is on the coding strand. To measure the CpG effect, we compared the rate of transition in the CpG context to the average rate in all other contexts. For the template strand this involved calculating the ratio of the rate of CG → CA changes to the average rate of AG → AA, TG → TA, and GG → GA changes. For deamination on the coding strand we calculated the ratio of the rate of CG → TG changes to the average of CA → TA, CT → TT, and CC → TC. The average CpG effect was then calculated as the average of the two strand values.
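The strand-specific ratios and their average, as described above, can be written out as a short sketch. The function name, the dictionary keys, and the rates in the example are hypothetical placeholders and are not values from the data.

```python
def cpg_effect(rates):
    """Return (template_ratio, coding_ratio, average) from dinucleotide transition rates.

    rates: dict keyed by 'XY>XZ' strings, e.g. 'CG>CA' for a CG -> CA change.
    """
    def mean(values):
        return sum(values) / len(values)

    # Template-strand deamination: CG -> CA versus G -> A in the other 5' contexts
    template = rates["CG>CA"] / mean([rates["AG>AA"], rates["TG>TA"], rates["GG>GA"]])
    # Coding-strand deamination: CG -> TG versus C -> T in the other 3' contexts
    coding = rates["CG>TG"] / mean([rates["CA>TA"], rates["CT>TT"], rates["CC>TC"]])
    return template, coding, (template + coding) / 2

# Hypothetical rates, for illustration only
example_rates = {
    "CG>CA": 0.02, "AG>AA": 0.01, "TG>TA": 0.01, "GG>GA": 0.01,
    "CG>TG": 0.03, "CA>TA": 0.01, "CT>TT": 0.01, "CC>TC": 0.01,
}
print(cpg_effect(example_rates))
```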

Using the polarized SNP data (see materials and methods) the rates of mutation for each dinucleotide are shown in Table 4. Overall there is a 2.1-fold increase in transition rate in the CpG context relative to other contexts and this increase at CpG dinucleotides is significant (G = 78.0, P < 10⁻⁶). The CpG effect is also apparent when the rates of all possible dinucleotide changes are compared: the various transitions have higher rates of change than transversions do, as expected from the Ts:Tv > 1 described above, with the highest rates being the transitions away from the CpG dinucleotide, namely the CG → CA and CG → TG changes (Figure 2). Across the regional composition classes there is a correlation between the CpG effect and regional A + T content, with A + T-rich regions showing a much stronger CpG effect than A + T-poor regions (Table 4). There is also a significant increase in CpG transition rate with increasing regional A + T content (G = 21.3, P < 0.001).

Figure 2.
The rate of every dinucleotide mutation for (A) mutations at the 5′ nucleotide of the pair and (B) mutations at the 3′ nucleotide of the pair. Transitions away from a CpG dinucleotide, which could involve cytosine deamination, are lightly ...

Table 4.
Rates of transitions at CpG dinucleotides relative to transitions at other dinucleotides

When we compared the rate of CpG transition for the two different strands, the rate of CG → CA was found to be significantly lower than the rate of CG → TG (G = 7.1, P < 0.01). Both CG → CA and CG → TG rates increase with increasing regional A + T content but the latter rate is higher in each composition class. These data suggest that the two DNA strands are affected differently by CpG deamination, similar to the data from humans (Krawczak et al. 1998). However, there is no apparent difference in the increase of CG → CA changes and the CG → TG changes relative to G → A and C → T transitions, respectively (Table 4), so it is possible that the rate differences between CG → CA and CG → TG are more general than only CpG deamination. Overall, our data do not unambiguously indicate a difference in CpG effect between the two strands.

Context and transition:transversion bias:

In addition to the apparent effect of methylated cytosine deamination, we studied the general relationship between neighboring base composition and mutation bias. Given the observation from grass cpDNA that flanking base A + T content is correlated with mutation bias (Morton 2003), we divided all sites into three categories depending on the number of A/T base pairs (0, 1, or 2) in the two immediate neighbors and defined this as the A/T context. SNPs that differed in A/T context were then analyzed separately for comparison.
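The A/T context classification can be illustrated with a brief sketch that scores each SNP by its two immediate neighbors and tallies transitions and transversions per context. The function names and the example SNPs are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def at_context(five_prime, three_prime):
    """Number of A/T bases (0, 1, or 2) among the two immediate neighbors."""
    return sum(base in "AT" for base in (five_prime.upper(), three_prime.upper()))

def is_transition(a, b):
    """True for purine <-> purine or pyrimidine <-> pyrimidine changes."""
    pair = {a.upper(), b.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)

def ts_tv_by_context(snps):
    """snps: iterable of (five_prime, allele_1, allele_2, three_prime) tuples.

    Returns {context: [transitions, transversions]}.
    """
    tally = {0: [0, 0], 1: [0, 0], 2: [0, 0]}
    for five, a, b, three in snps:
        ctx = at_context(five, three)
        tally[ctx][0 if is_transition(a, b) else 1] += 1
    return tally

# Hypothetical SNPs, for illustration only
example_snps = [("G", "C", "T", "C"), ("A", "A", "G", "T"), ("A", "A", "C", "G")]
print(ts_tv_by_context(example_snps))
```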

As observed in cpDNA, we found a significant negative correlation between A/T context and Ts:Tv due to a decreasing rate of transitions with increasing A/T context (Table 5). This decreasing rate of transitions also results in a significant decrease in overall mutation rate with increasing A/T context. From Table 5, the overall rates of mutation in the A/T = 0, A/T = 1, and A/T = 2 contexts are 0.0276, 0.0238, and 0.0218, respectively. A comparison of variable (SNP) to conserved sites reveals that this variation in rate among contexts is significant (G = 26.8, P < 10⁻⁵). The negative correlation between A/T context and transition bias was observed in the regional composition classes where A + T > 52% but not in the regions with lower A + T content. Unlike the case for cpDNA, however, this correlation between Ts:Tv and A/T context in nuclear DNA could be due solely to the CpG effect. To remove the CpG effect, we repeated the analysis for the A/T = 1 and A/T = 2 contexts using only sites without a 5′C or 3′G. (There is only a single A/T = 0 context without a potential CpG, namely sites with a 5′G and 3′C, so we excluded this context altogether.) There was still a significant difference in Ts:Tv between the A/T = 1 and A/T = 2 contexts (G = 15.9, P < 10⁻⁴) and, again, this context effect tended to be significant in regions with higher A + T content (Table 5). The data in Table 5 show that flanking bases influence mutations beyond the CpG effect and in a manner similar to what is observed in cpDNA. The variation in Ts:Tv across the three A/T contexts, however, is weaker in these data than in cpDNA (Morton 2003).

Table 5.
Transition:transversion ratio as a function of A/T context

Previous studies of other taxa have indicated that nucleotides beyond immediate neighbors can influence nucleotide mutation biases (Krawczak et al. 1998; Morton 2000; Zhao and Boerwinkle 2002). We thus examined the effect of context beyond the nucleotide sites that immediately flank an SNP. However, previous studies have not always separated the effects of immediate neighbors from the composition of more distant nucleotide sites (Zhao and Boerwinkle 2002). In our analysis we controlled for the composition of the immediate neighbors by holding the composition of these sites constant and then comparing the composition of the nucleotides one base removed, both 5′ and 3′, from the SNP sites. For these data, we assessed both mutation rate and the Ts:Tv ratio. No significant relationship was found between the composition of these sites and either mutation rate or bias (data not shown).

Context and mutational AT pressure:

In this section we examine the relationship between context and GC → AT pressure using the polarized SNP data. All sites, both conserved and SNP, were separated by context. Two different sets of contexts were used: (1) A/T context (number of A/T base pairs immediately flanking the site, as above) and (2) regional A + T composition (the regional composition classes described above). Using all sites within a specified context, we generated a 4 × 4 matrix where πij is the rate of change from nucleotide i to nucleotide j in that context as described in materials and methods. Once the matrix for each context was determined, the matrices were analyzed using two approaches. The first approach involved finding the equilibrium composition of a sequence evolving under each mutation model. This was determined by calculating the stationary vector for each matrix, which represents the expected equilibrium distribution for that mutation model (see materials and methods). For the second approach we compared the GC → AT and AT → GC mutation rates within each matrix.

A correlation is observed between the A/T context and equilibrium A + T composition (Table 6). As A/T context increases, the predicted equilibrium A + T content of a site decreases. This trend is observed in each of the regional composition classes, indicating that sites evolving in a local context that is more A + T rich are themselves less biased toward A and T than sites in a local context that is A + T poor. The opposite trend is observed across regional composition classes: SNPs in loci that are more A + T rich overall predict a higher A + T bias than SNPs in loci that are relatively A + T poor (Table 6). Therefore, variation across regional composition classes cannot be explained by the influence of immediate neighbors, since the two influences act in opposite directions; it must instead reflect some other feature of mutations.

Table 6.
Predicted equilibrium A + T content given the observed mutation dynamics in loci of different composition

The second approach, comparing GC → AT and AT → GC mutation rates directly, yielded similar results. As shown in Table 2, the GC:AT rate ratio increases with increasing regional A + T content. However, the GC and AT rates presented above were not limited to the GC → AT and AT → GC changes, which determine AT pressure. Therefore, we partitioned the GC and AT rates into two components each: the GC rate into GC → AT and GC → GC rates and the AT rate into AT → GC and AT → AT rates. The data demonstrate that the regional A + T content is correlated with GC → AT mutation pressure; the ratio of GC → AT:AT → GC rates increases with increasing regional A + T content as does the ratio of GC → AT:GC → GC change rates (Table 7). On the other hand, the rate of GC → GC transversions is not much higher than the rate of AT → AT transversions and this ratio decreases with increasing regional A + T content (Table 7), indicating that it is specifically GC → AT rates that are associated with regional composition, not only a general GC mutation rate.
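One way to picture the partitioning described above is to pool rows and columns of a polarized 4 × 4 count matrix. The sketch below assumes each rate is the number of changes of a given type divided by the total number of G + C (or A + T) sites; that denominator, the function name, and the example counts are assumptions made for illustration.

```python
BASES = "ACGT"

def partition_rates(counts):
    """Split GC and AT rates into GC>AT, GC>GC, AT>GC, and AT>AT components.

    counts[i][j]: sites changing from BASES[i] to BASES[j]; diagonal = conserved.
    """
    idx = {b: k for k, b in enumerate(BASES)}

    def rate(sources, targets):
        changes = sum(counts[idx[s]][idx[t]]
                      for s in sources for t in targets if s != t)
        total_sites = sum(sum(counts[idx[s]]) for s in sources)
        return changes / total_sites

    return {
        "GC>AT": rate("GC", "AT"),
        "GC>GC": rate("GC", "GC"),
        "AT>GC": rate("AT", "GC"),
        "AT>AT": rate("AT", "AT"),
    }

# Hypothetical counts; rows = original base, columns = new base
example_matrix = [
    [9600, 100, 100, 200],   # from A
    [200, 9500, 100, 200],   # from C
    [200, 100, 9500, 200],   # from G
    [200, 100, 100, 9600],   # from T
]
r = partition_rates(example_matrix)
print(r, "GC>AT : AT>GC =", r["GC>AT"] / r["AT>GC"])
```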

Table 7.
Rates of GC → AT mutations expressed relative to other mutation rates

Since transitions occur at a higher rate than transversions and GC → AT changes include transitions while GC → GC changes do not, we repeated the analysis using only G → T and C → A as well as T → G and A → C transversion mutations. These data show the same correlation between regional A + T content and GC → AT pressure (Table 7). Overall, the mutation dynamics shown in Tables 6 and 7 demonstrate a bias toward GC → AT changes, a bias that is stronger in regions with higher A + T content, and show a direct relationship between AT pressure and regional composition.

DISCUSSION


The SNP analyses presented here yield some of the first data about context and variation in mutation dynamics within a genome. They demonstrate that context has a significant influence on mutation dynamics in maize nuclear DNA: there is a relationship between flanking base composition and mutation bias, an increased rate of transitions at CpG dinucleotides, and a relationship between regional base composition and GC → AT pressure. We should note that a number of our observations are based on polarizing mutations. For our analyses we polarized mutations by using the majority base at each site to infer the original state. This will not affect the analyses concerning flanking base effect on rate and transition bias and, therefore, the overall conclusions about context effects. In addition, although the polarization does allow us to infer the mutation rate away from CpG dinucleotides and provides stronger evidence, the high rate of transitions at these sites is in itself strong support for a CpG effect. Conclusions based on predicted equilibrium composition and GC → AT pressure are, however, fully dependent on polarizing the mutations that allow us to generate the 4 × 4 matrices. Future analysis using an outgroup taxon will allow us to examine these effects and to assess the validity of using the majority base to polarize mutations.

The most notable context effect is an elevated rate of CG → TG and CG → CA transitions relative to other transitions (Figure 2). Given the existence of CpG methylation in plants (Tariq and Paszkowski 2004), this rate elevation is most likely the result of a deamination of methylated cytosines at these dinucleotides. It is difficult to compare the magnitude of the CpG effect observed here directly to studies of nonplant taxa since methodologies differ, but it appears that the increase in transition rate that we observed at CpG sites, roughly a 2.1-fold increase relative to the transition rate at other sites, is not as high as what has been observed in vertebrates (Krawczak et al. 1998). Although we observe an overall 2.1-fold increase in transition rate due to CpG deamination, this increase ranges from a 1.7-fold increase in regions with lower A + T content (<48%) to a 2.6-fold increase in regions with higher A + T content (>60%) and shows a general increase with increasing regional A + T content (Table 4). This trend may reflect variation in the degree of CpG methylation across loci or that repair of deamination products is more efficient in G + C-rich regions (Fryxell and Moon 2005).

Along with a significant CpG effect, there are other influences of context on mutations apparent in our data. In particular, the composition of the two immediate neighbors, one 5′ and one 3′, of the mutation site is correlated with overall rate, transition bias, and GC → AT pressure. These effects are similar to what is observed in grass cpDNA and it is likely that they are due to an influence of local composition on polymerase misincorporation or mismatch repair (Morton 1995, 2003). The similar relationship between context and mutation properties in both nuclear and cpDNA is interesting since it suggests shared replication and/or repair processes or that these properties are fundamental to mutations. Much remains to be learned about replication and repair in plants, but it is known that the two genomes do not share the same replication machinery and have significant differences in repair dynamics (Heinhorst and Cannon 1993; Cannon et al. 1995; Hada et al. 1998; Kimura et al. 2002, 2005). As more is uncovered about the replication and repair processes in the two genomes, we should be able to better understand the causes of similar context effects.

Although we found a correlation between the composition of the two immediate neighbors and mutation properties, we did not see a clear relationship between mutation and the composition of individual neighboring nucleotides that do not flank the mutation. This contrasts with a recent study of human SNPs (Zhao and Boerwinkle 2002). Again, however, differences in methodology make it difficult to draw any specific conclusions about differences in context effects. In our study we controlled for the composition of the immediate neighbors, something that was not done in the study of human SNPs. Thus, it is possible that the human SNP study confounded immediate flanking base effects and nonrandom dinucleotide composition.

Despite the lack of correlation between specific individual nucleotides beyond the immediate neighbors and mutation dynamics, we do observe a correlation between regional composition and GC → AT mutation pressure. It is possible that this correlation is not a context effect but a secondary effect arising from a relationship between chromosome location and replication/mutation dynamics. For example, a correlation between location, replication time and the available nucleotide pool, which could affect misincorporation biases, could potentially lead to a relationship along the lines of what we observe.

One interesting feature of our inferred mutation dynamics is that they predict an A + T content at equilibrium that is higher than the observed base composition. Although we observe a correlation between regional A + T content and predicted A + T content (Table 6), the observed A + T content is lower than expected in each of the regional composition classes. If we group all mutations from our data set into one matrix, we predict an A + T content of 62.0% at equilibrium (Table 6), which is higher than the average regional A + T content of 55.1% observed for noncoding sites. Although, as stated above, the predicted equilibrium may not be accurate since the context of most sites will vary over time, the fact that in every composition class even the lowest predicted equilibrium A + T (typically in the A/T = 2 context) is higher than the observed A + T indicates a real discrepancy. This discrepancy is similar to what was observed in noncoding cpDNA (Morton 2003) and suggests two possibilities. One is that the sequence is not at equilibrium and the A + T content is increasing in this lineage, as has been proposed recently for other taxa (e.g., Duret et al. 2002; Tiffin and Hahn 2002; Ebersberger and Meyer 2005). The other is that there is a fixation bias, such as selection or biased gene conversion. Investigating these two possibilities in future studies should yield important insights into plant mutational dynamics.
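The logic of a predicted equilibrium can be illustrated with a simplified two-state model, in which sites are either A/T or G/C. This is only a sketch: the analysis above uses a full mutation matrix, and the rates below are hypothetical values chosen for illustration, not estimates from the data. If G/C sites mutate to A/T at rate u and A/T sites mutate to G/C at rate v, the A + T fraction stops changing when u(1 − f) = vf, that is, at f = u/(u + v).

```python
def equilibrium_at(gc_to_at, at_to_gc):
    """Equilibrium A+T fraction for a two-state model.

    gc_to_at: rate per G/C site of mutation to A/T
    at_to_gc: rate per A/T site of mutation to G/C
    """
    if gc_to_at < 0 or at_to_gc < 0 or gc_to_at + at_to_gc == 0:
        raise ValueError("rates must be non-negative and not both zero")
    return gc_to_at / (gc_to_at + at_to_gc)


def project_at(start_at, gc_to_at, at_to_gc, generations):
    """Iterate the A+T fraction forward; rates are per site per generation."""
    at = start_at
    for _ in range(generations):
        at += (1.0 - at) * gc_to_at - at * at_to_gc
    return at


# Hypothetical rates chosen only for illustration (not estimates from the data).
u, v = 3e-3, 2e-3
expected = equilibrium_at(u, v)              # 0.6 for these illustrative rates
approached = project_at(0.551, u, v, 5000)   # start from the observed noncoding A+T
```

Under these assumptions a sequence that starts below the equilibrium value drifts upward toward it, which corresponds to the first of the two possibilities raised above. A fixation bias would instead hold the observed composition below the mutational equilibrium indefinitely.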

Finally, the mutation dynamics inferred from the SNP data predict the GT skew observed in the data (see results). The total 4 × 4 matrix inferred from the SNPs predicts an equilibrium composition of 20.0% G, 18.0% C, 28.0% A, and 34.1% T, which is a 9.8% skew of T over A and a 5.3% skew of G over C, similar to the 12.0% and 5.6% T-A and G-C skews, respectively, observed in the noncoding sequences. Similar T-A and G-C skews are found when we consider SNPs in the different contexts described above (data not shown). Since our alignments are of coding strand sequences in transcribed regions, these skews further suggest the possibility that the bias is associated with transcription.
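The two calculations in this paragraph can be sketched in Python. The skew is assumed here to be (X − Y)/(X + Y), a definition that reproduces the 9.8% and 5.3% figures from the reported equilibrium composition. The function `stationary_composition` is a hypothetical helper that iterates base frequencies forward under a 4 × 4 matrix of relative rates until they stop changing; it illustrates the general approach and is not the procedure used in this study.

```python
BASES = "ACGT"


def stationary_composition(rates, step=0.01, iterations=20000):
    """Approximate equilibrium base frequencies for a 4 x 4 mutation matrix.

    rates[x][y] is the relative rate of x -> y mutation per x site (x != y).
    step must be small enough that step * (total rate out of any base) < 1.
    """
    freq = {b: 0.25 for b in BASES}
    for _ in range(iterations):
        new = {}
        for b in BASES:
            gain = sum(freq[x] * rates[x][b] for x in BASES if x != b)
            loss = freq[b] * sum(rates[b][y] for y in BASES if y != b)
            new[b] = freq[b] + step * (gain - loss)
        freq = new
    return freq


def skew(x, y):
    """Skew of x over y, assumed to be (x - y) / (x + y)."""
    return (x - y) / (x + y)


# Equilibrium composition reported in the text (percent).
reported = {"G": 20.0, "C": 18.0, "A": 28.0, "T": 34.1}
ta_skew = skew(reported["T"], reported["A"])  # T over A
gc_skew = skew(reported["G"], reported["C"])  # G over C
```

With an unbiased matrix (all off-diagonal rates equal) the helper returns 25% for each base and both skews are zero, so any predicted skew reflects asymmetry in the inferred rates.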

This skew toward T over A and G over C has recently been reported for human genes (Louie et al. 2003). Since, like our data, their observation was for noncoding sequences near genes on the coding strand and was found across numerous loci, they proposed that the skew was due to a transcription-coupled mismatch repair system. If this is the case, then the similar finding in our data suggests a similar mechanism in plant nuclear genes. It also raises the possibility that the G over C and T over A skew observed along the leading strand in prokaryotic genomes (Lobry 1996; McInerney 1998; McLean et al. 1998; Morton 1999) is at least partially the result of a transcription-coupled repair mechanism. The possibility of a transcription-coupled repair mechanism has significant implications for our understanding of compositional bias in genes, such as codon usage bias.


The authors thank Stephen Wright, Richard Morton, Brian Golding, Shozo Yokoyama, and two anonymous reviewers for helpful comments. This work was supported by National Science Foundation grants DBI0096033, DBI9872655, and DBI0321467 and by the United States Department of Agriculture-Agricultural Research Service.


  • Arndt, P. F., C. B. Burge and T. Hwa, 2003. DNA sequence evolution with neighbor-dependent mutation. J. Comput. Biol. 10: 313–322.
  • Bernardi, G., 2000. Isochores and the evolutionary genomics of vertebrates. Gene 241: 3–17.
  • Bulmer, M., 1986. Neighboring base effects on substitution rates in pseudogenes. Mol. Biol. Evol. 3: 322–329.
  • Cannon, G. C., L. A. Hedrick and S. Heinhorst, 1995. Repair mechanisms of UV-induced DNA damage in soybean chloroplasts. Plant Mol. Biol. 29: 1267–1277.
  • Cohen, N., T. Dagan, L. Stone and D. Graur, 2005. GC composition of the human genome: in search of isochores. Mol. Biol. Evol. 22: 1260–1272.
  • Cooper, D. N., and H. Youssoufian, 1988. The CpG dinucleotide and human genetic disease. Hum. Genet. 78: 151–155.
  • Dermitzakis, E. T., A. Reymond, R. Lyle, N. Scamuffa, C. Ucla et al., 2002. Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature 420: 578–582.
  • Duncan, B. K., and J. H. Miller, 1980. Mutagenic deamination of cytosine residues in DNA. Nature 287: 560–561.
  • Duret, L., M. Semon, G. Piganeau, D. Mouchiroud and N. Galtier, 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics 162: 1837–1847.
  • Ebersberger, I., and M. Meyer, 2005. A genomic region evolving towards different GC contents in humans and chimpanzees indicates a recent and regionally limited shift in the mutation pattern. Mol. Biol. Evol. 22: 1240–1245.
  • Florea, L., G. Hartzell, Z. Zhang, G. M. Rubin and W. Miller, 1998. A computer program for aligning a cDNA sequence with genomic DNA sequence. Genome Res. 8: 967–974.
  • Fryxell, K. J., and W.-J. Moon, 2005. CpG mutation rates in the human genome are highly dependent on local GC content. Mol. Biol. Evol. 22: 650–658.
  • Hada, M., T. Hashimoto, O. Nikaido and M. Shin, 1998. UVB-induced DNA damage and its photorepair in nuclei and chloroplasts of Spinacia oleracea L. Photochem. Photobiol. 68: 319–322.
  • Heinhorst, S., and G. C. Cannon, 1993. DNA replication in chloroplasts. J. Cell Sci. 104: 1–9.
  • Kimura, S., Y. Uchiyama, N. Kasai, S. Namekawa, A. Saotome et al., 2002. A novel DNA polymerase homologous to Escherichia coli DNA polymerase I from a higher plant, rice (Oryza sativa L.). Nucleic Acids Res. 30: 1585–1592.
  • Kimura, S., T. Ishibashi, T. Yamamoto and K. Sakaguchi, 2005. DNA repair in higher plants. Seikagaku 77: 113–123.
  • Krawczak, M., E. V. Ball and D. N. Cooper, 1998. Neighboring-nucleotide effects on the rates of germ-line single-base-pair substitution in human genes. Am. J. Hum. Genet. 63: 474–488.
  • Lobry, J. R., 1996. Asymmetric substitution patterns in the two DNA strands of bacteria. Mol. Biol. Evol. 13: 660–665.
  • Louie, E., J. Ott and J. Majewski, 2003. Nucleotide frequency variation across human genes. Genome Res. 13: 2594–2601.
  • McInerney, J. O., 1998. Replicational and transcriptional selection on codon usage in Borrelia burgdorferi. Proc. Natl. Acad. Sci. USA 95: 10698–10703.
  • McLean, M. J., K. H. Wolfe and K. M. Devine, 1998. Base composition skews, replication orientation, and gene orientation in 12 prokaryote genomes. J. Mol. Evol. 47: 691–696.
  • Morton, B. R., 1995. Neighboring base composition and transversion/transition bias in a comparison of rice and maize chloroplast noncoding regions. Proc. Natl. Acad. Sci. USA 92: 9717–9721.
  • Morton, B. R., 1997. Rates of synonymous substitution do not indicate selective constraints on the codon bias of the psbA gene. Mol. Biol. Evol. 14: 412–419.
  • Morton, B. R., 1999. Strand asymmetry and codon usage bias in the chloroplast genome of Euglena gracilis. Proc. Natl. Acad. Sci. USA 96: 5123–5128.
  • Morton, B. R., 2000. Codon bias and the context dependency of nucleotide substitutions in the evolution of plastid DNA. Evol. Biol. 31: 55–103.
  • Morton, B. R., 2003. The role of context-dependent mutations in generating compositional and codon usage bias in grass chloroplast DNA. J. Mol. Evol. 56: 616–629.
  • Rice, P., I. Longden and A. Bleasby, 2000. EMBOSS: the European molecular biology open software suite. Trends Genet. 16: 276–277.
  • Siepel, A., and D. Haussler, 2003. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol. Biol. Evol. 21: 468–488.
  • Sokal, R. R., and F. J. Rohlf, 1995. Biometry. W. H. Freeman, New York.
  • Tariq, M., and J. Paszkowski, 2004. DNA and histone methylation in plants. Trends Genet. 20: 244–251.
  • Tiffin, P., and M. W. Hahn, 2002. Coding sequence divergence between two closely related plant species: Arabidopsis thaliana and Brassica rapa ssp. pekinensis. J. Mol. Evol. 54: 746–753.
  • Wright, S. I., I. V. Bi, S. G. Schroeder, M. Yamasaki, J. F. Doebley et al., 2005. The effects of artificial selection on the maize genome. Science 308: 1310–1314.
  • Yamasaki, M., M. I. Tenaillon, I. V. Bi, S. G. Schroeder, H. Sanchez-Villeda et al., 2005. A large-scale screen for artificial selection in maize identifies candidate agronomic loci for domestication and crop improvement. Plant Cell 17: 2859–2872.
  • Yang, Y. W., P. Y. Tai and W.-H. Li, 2002. A study of the phylogeny of Brassica rapa, B. nigra, Raphanus sativa and their related genera using non-coding regions of chloroplast DNA. Mol. Phylogenet. Evol. 23: 268–275.
  • Zhao, Z., and E. Boerwinkle, 2002. Neighboring-nucleotide effects on single nucleotide polymorphisms: a study of 2.6 million polymorphisms across the human genome. Genome Res. 12: 1679–1686.

Articles from Genetics are provided here courtesy of Genetics Society of America