Proc Natl Acad Sci U S A. 2009 Feb 17; 106(7): 2295–2300.
Published online 2009 Jan 23. doi:  10.1073/pnas.0807350106
PMCID: PMC2650150

Duplicate genes increase expression diversity in closely related species and allopolyploids


Polyploidy or whole genome duplication (WGD) provides raw genetic materials for sequence and expression evolution of duplicate genes. However, the mode and tempo of expression divergence between WGD duplicate genes in closely related species and recurrent allopolyploids are poorly understood. Arabidopsis is a suitable system for testing the hypothesis that duplicate genes increase expression diversity and regulatory networks. In Arabidopsis, WGD occurred more than once before the split between Arabidopsis thaliana and Arabidopsis arenosa, and both natural and human-made allotetraploids are available. Comparative genomic hybridization analysis indicated that single-copy and duplicate genes after WGD were well preserved in A. thaliana and A. arenosa. Analysis of gene expression microarrays showed that duplicate genes generally had higher levels of expression divergence between two closely related species than single-copy genes. The proportion of the progenitors' duplicate genes that were nonadditively expressed in the resynthesized and natural allotetraploids was significantly higher than that of single-copy genes. Duplicate genes related to environmental stresses tended to be differentially expressed, and multicopy duplicate genes were likely to diverge in expression between progenitors and in the allotetraploids. Compared with single-copy genes, duplicate genes tended to contain TATA boxes and less DNA methylation in the promoter regions, facilitating transcriptional regulation by binding transcription factors and/or cis- and trans-acting proteins. The data suggest an important role of WGD duplicate genes in modulating diverse and novel gene expression changes in response to external environmental cues and internal genetic turmoil such as recurrent polyploidy events.

Keywords: evolution, gene expression, polyploidy, stress

Polyploidy or whole genome duplication (WGD) is formed by genome duplication within a species (autopolyploidy) or between two or more distinct species (allopolyploidy) (1). Despite longstanding efforts to classify polyploids into auto- or allopolyploids, nature has produced a continuum of polyploid types (2–4). Moreover, polyploid formation is often recurrent and complex (4). Estimates indicate that 30–80% of angiosperms including many important crops, such as wheat, cotton, canola, potato, coffee, sugarcane, and switchgrass, are polyploids (2–5). The common occurrence of polyploidy suggests an advantage of having additional raw genetic material for evolution and adaptation. Some traits, such as increasing levels of stress tolerance, apomixis, pest resistance, flowering-time variation, and organ size, may allow polyploids to enter new niches or improve their fitness in harsh environments such as cold climates and high altitudes and latitudes (2, 4).

After WGD, some duplicate genes may be lost or mutated as predicted (6), but many duplicate genes diverge their expression within and between species. Within a species, expression divergence between duplicate genes expands regulatory networks and contributes to morphological diversity (6, 7). Between species, duplicate genes tend to cause expression divergence and evolve faster than single-copy genes (8).

The single-copy or duplicate genes originating in 2 progenitors become homoeologous in an allotetraploid (1). Here, we use homoeologous duplicate and single-copy genes to describe the gene content in the allotetraploids. In the new allotetraploids, gene expression changes rapidly and stochastically (1, 9–11). We predict that in the recurrent allopolyploids, the homoeologous duplicate genes derived from WGD in the progenitors would further increase expression diversity and expand regulatory interactions, leading to a wider range of growth and morphological changes.

Arabidopsis suecica is a natural allotetraploid species derived from the extant progenitors A. thaliana and A. arenosa 12,000–300,000 years ago (12). The resynthesized allotetraploids between A. thaliana and A. arenosa morphologically resemble A. suecica (11, 13) and display growth vigor (14). These allotetraploid lineages are suitable for testing the above hypothesis. A. thaliana and A. arenosa diverged ≈6 million years ago (Mya) (15), after the most recent WGD ≈20 Mya (16–18). The two species share >90% of nucleotide sequence identity in coding regions, and >90% of the ≈26,000 70-mer oligonucleotides (oligos) designed from A. thaliana genes cross-hybridize with A. arenosa genes (13). Thus, the oligo-gene microarrays are suitable for testing nonadditive expression of homoeologous loci in allotetraploids (the null hypothesis for additive expression is 1 + 1 = 2) (1, 13), although A. thaliana and A. arenosa locus-specific expression may not be detected.

In this study, we compared expression divergence between single-copy and duplicate genes in A. thaliana and A. arenosa and examined how the expression of duplicate genes changes in resynthesized and natural allotetraploids. We determined expression divergence among duplicate genes with different copy numbers and investigated regulatory sequence and DNA methylation changes in the promoter regions of single-copy and duplicate genes in A. thaliana.


Sequence Divergence Between A. thaliana and A. arenosa.

To determine sequence divergence between A. thaliana and A. arenosa that split ≈6 Mya (15), we sequenced 1 A. arenosa BAC in the vicinity of the FLC locus (Fig. 1, Table S1 and Table S2). The BAC had a ≈110-kb insert and consisted of 32 genes including FLC (At5g10140). The gene orders and genomic organization were completely colinear between A. thaliana- and A. arenosa-derived sequences, suggesting that these 2 regions are highly conserved. Within the ≈110-kb regions, the nucleotide sequence identities between A. thaliana and A. arenosa were 94.6% in exons, 79.3% in introns, 82.8% in untranslated regions (UTRs), and 42.6% in aligned intergenic regions (Table S2). Together, the available data suggest that coding sequences between A. thaliana and A. arenosa are highly conserved (19).

Fig. 1.
Sequence comparison between A. thaliana and A. arenosa FLC-containing orthologous regions on chromosome 5. BAC sequence from A. arenosa was aligned to A. thaliana genomic sequence using genome VISTA. Purple, blue, and pink represent protein coding exons, ...

To test the expression evolution of single-copy and duplicate genes between species, we selected the duplicate genes from the most recent WGD event that occurred 20–40 Mya (Fig. 2 A–C, Table S3) because they are accurately predicted and generally conserved (16–18). Moreover, these duplicate genes are old enough to have accumulated a substantial degree of expression divergence but not too old to make statistical inferences difficult (20). We defined the duplicate genes using the following criteria: (i) duplicate genes are present in both A. thaliana and A. arenosa before speciation (Fig. 2 A–C); (ii) orthologous genes equally hybridize with microarray probes; and (iii) paralogous genes do not cross-hybridize. To satisfy these requirements, we selected the duplicate genes with the same hybridization intensities in comparative genomic hybridization (CGH) between A. thaliana and A. arenosa (Table S4 and Table S5). To minimize cross-hybridization in spotted oligo-gene microarrays, we selected duplicate and single-copy genes based on 70-mer oligos that had ≤70% of sequence identity with any other cDNAs and did not have 17 contiguous bases complementary to any other cDNAs. With these selection criteria cross-hybridization should be negligible (21, 22).

Fig. 2.
An experimental model for testing the effects of polyploidy on expression evolution of single-copy and duplicate genes. (A) A whole genome duplication (WGD) occurred 20–40 Mya in Arabidopsis. (B) WGD is accompanied by mutations, deletions, and ...

A single-copy gene was defined by its protein sequence that did not match any other paralogous proteins using BLASTp (E ≤0.01) (8). Only the single-copy genes with the same hybridization intensities in CGH between A. thaliana and A. arenosa were included in the study. Together, 1,347 single-copy and 2,694 WGD duplicate genes that have unique array features in the spotted oligo-gene microarrays were used for further analysis.

Expression Divergence of Duplicate Genes Between Species.

We tested whether more duplicate genes than single-copy genes are differentially expressed between A. thaliana and A. arenosa. Among 26,090 genes tested, 3,923 (≈15%) genes were differentially expressed between the two species using both tests of common and per-gene variances, and up to ≈43% were differentially expressed under per-gene variance (13). By comparing 3,923 differentially expressed genes with 2,694 duplicate genes and 1,347 single-copy genes, we found that the proportion of duplicate genes (≈18%, 478/2,694) that were differentially expressed between the two species was significantly higher than that of single-copy genes (≈13%, 175/1,347) (Fig. 3A and Table S5). The data suggest that, compared with the single-copy genes, the duplicate genes that are retained after WGD tend to show higher levels of expression divergence between the closely related species.
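
The size of this difference can be checked with a quick calculation. The following is a minimal sketch that applies a pooled two-proportion z-test to the counts reported above (478/2,694 and 175/1,347). The choice of test and the function name `two_proportion_z_test` are assumptions made for illustration; the text does not state here which test was used.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p1, p2, z, p_value

# Counts taken from the text: differentially expressed / total genes
p_dup, p_single, z, p_value = two_proportion_z_test(478, 2694, 175, 1347)
print(f"duplicate: {p_dup:.3f}  single-copy: {p_single:.3f}")
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

A chi-square or Fisher exact test on the same 2 × 2 table would be an equally reasonable choice for counts of this size.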

Fig. 3.
Expression divergence of single-copy and duplicate genes in resynthesized and natural allotetraploids and their progenitors. Proportions of single-copy and duplicate genes are shown to be differentially expressed between A. thaliana (At4) and A. arenosa ...

Expression Divergence Between Homoeologous Single-Copy and Duplicate Genes in Allopolyploids.

In Arabidopsis WGD occurred more than once in ancestral species before the divergence between A. thaliana and A. arenosa (16–18). The expression divergence among progenitors' duplicate genes in response to recurrent polyploidization is unknown. Resynthesized allopolyploids offer a tractable system for the study because the exact progenitors are known. Moreover, homoeologous genomes in the resynthesized allotetraploids are relatively stable after selfing for 5 generations (13, 19), and morphological differences between 2 allotetraploid lineages (Fig. 2D) are associated with genetic and epigenetic changes (1).

First, we identified the differentially expressed genes in 2 resynthesized allotetraploid lines by comparing mRNA levels in an allotetraploid with an equal mixture of RNAs from the two parents (midparent value, MPV) (13). If gene expression is additive, the expression level of a gene in the allotetraploid should be equal to the sum of 2 parental loci (null hypothesis: 1 + 1 = 2). Nonadditive expression suggests repression (<2) or activation (>2) of a gene in the allotetraploid compared with MPV. This method may underestimate the number of nonadditively expressed genes because we could not detect a situation in which repression of one allele was compensated by the activation of another (1, 13). At least 1,362 (≈5.2%) and 1,469 (≈5.6%) genes are expressed nonadditively in the 2 allotetraploid lineages Allo733 and Allo738 (13).
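
The additivity test and its blind spot can be illustrated with a toy calculation. This is a sketch only: the expression values are hypothetical, and the fixed tolerance `tol` stands in for the statistical test used in the actual analysis. It shows why compensating changes at the two homoeologous loci go undetected when a probe reports only their sum.

```python
def total_signal(thaliana_locus, arenosa_locus):
    """A probe that hybridizes to both homoeologs reports only their sum."""
    return thaliana_locus + arenosa_locus

def classify(allo_total, mpv_total, tol=1e-9):
    """Compare the allotetraploid signal with the midparent value (MPV).

    tol is a placeholder for a proper significance test (assumption).
    """
    if abs(allo_total - mpv_total) <= tol:
        return "additive"
    return "activated" if allo_total > mpv_total else "repressed"

mpv = total_signal(1.0, 1.0)  # null hypothesis: 1 + 1 = 2

# Hypothetical per-locus expression levels in an allotetraploid
cases = {
    "unchanged": (1.0, 1.0),
    "one locus repressed": (0.4, 1.0),
    "one locus activated": (1.0, 1.8),
    "compensated": (0.5, 1.5),  # repression offset by activation
}
for label, (at_locus, aa_locus) in cases.items():
    print(label, "->", classify(total_signal(at_locus, aa_locus), mpv))
```

The "compensated" case is classified as additive even though both loci changed, which is the source of the underestimate noted above.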

Second, we compared nonadditively expressed genes with progenitors' single-copy and duplicate genes in Allo733 and Allo738. The proportion of homoeologous duplicate genes that were nonadditively expressed in both allotetraploids was significantly higher than that of single-copy genes (Fig. 3 B and C and Table S5). The data suggest rapid expression divergence among progenitors' duplicate genes in response to the instantaneous allopolyploidization process. Note that the proportion of duplicate genes displaying expression changes may be underestimated because many duplicate genes that were nonadditively expressed under per-gene variance analysis (13) were excluded from the analysis (see Materials and Methods).

To test whether gene expression changes in resynthesized allotetraploids also occur in “old” allotetraploids, we examined expression diversity of single-copy and duplicate genes in a natural A. suecica strain. Relative to the extant progenitors A. thaliana and A. arenosa, 1,855 (≈7%) genes are nonadditively expressed in A. suecica, which is consistent with the number of nonadditively expressed genes (5–6%) found in 2 resynthesized allotetraploids. Moreover, more homoeologous duplicate genes than single-copy genes are nonadditively expressed in A. suecica (Fig. 3D and Table S5). Although 1 strain of natural allotetraploids was examined, the data suggest that similar to rapid expression divergence among duplicate genes in resynthesized allotetraploids, the duplicate genes derived from extant progenitors display more expression divergence than single-copy genes in natural allopolyploids.

Expression Divergence Between Duplicate Genes Involved in External Processes.

Duplicate genes after WGD are often differentially expressed in various developmental stages and environmental conditions, and external factors accelerate expression divergence between duplicate genes in A. thaliana (20, 23). To test how duplicate genes respond to external (environmental) and internal (developmental) signals, we compared Gene Ontology Slim (GOSlim) biological process classifications of single-copy and duplicate genes that were differentially expressed between the progenitors or nonadditively expressed in the allotetraploids (Fig. 4). Compared with single-copy genes, duplicate genes were enriched in all functional categories except for nucleotide metabolism (NM) that has a small number of single-copy genes (Table S6), suggesting high retention rates of duplicate genes after WGD (17, 24, 25). In the category of “response to stress” (RS) and “response to abiotic or biotic stimulus” (RA), more duplicate genes than single-copy genes were differentially expressed between A. thaliana and A. arenosa (Fig. 4A, Table S7). Moreover, a higher proportion of homoeologous duplicate genes in these two categories than others showed nonadditive expression in 2 resynthesized allotetraploids and A. suecica (Fig. 4 B–D, Table S8, Table S9, and Table S10).

Fig. 4.
Differential expression of duplicate genes in GOSlim biological process classifications in allotetraploids and their progenitors. (A) Duplicate genes in external biological processes are differentially expressed between A. thaliana and A. arenosa (RS, ...

Rapid Expression Divergence in Genes with Multiple Paralogs.

Within species, the duplicate genes retained after WGD are more likely to be retained again after a subsequent genome duplication (24), probably via a “balanced gene drive” mechanism (26, 27). Functional compensation by duplicate genes may decrease constraints on gene dosage and increase variability of biological pathways. To test this hypothesis, we examined the relationship between the number of paralogous genes and gene expression variation between species and in allotetraploids. We divided all single-copy and duplicate genes tested into 4 categories relative to the number of paralogs (0 = single-copy, 1, 2–9, and ≥10). Among the genes that displayed differential expression patterns between species and nonadditive expression in the allotetraploids, those with 2–9 and ≥10 paralogs were significantly overrepresented, whereas single-copy genes were underrepresented (Fig. 5A).

Fig. 5.
High levels of expression divergence among the genes with multiple paralogs and low levels of DNA methylation in duplicate genes. (A) Differentially expressed genes with the number of paralogs between A. thaliana and A. arenosa (open triangle) and nonadditively ...

Using a simple logistic regression model, we analyzed the association of differential gene expression with the number of paralogs. The regression coefficients for the number of paralogs with the proportion of differentially expressed genes between related species and in 2 resynthesized allotetraploids and A. suecica are significantly >0 (Fig. 5A). The data suggest that the proportion of differentially expressed genes in the related species and allopolyploids increases as the number of paralogs increases. The overall expression distribution in A. suecica is higher than in 2 resynthesized allotetraploid lines. A low frequency of nonadditively expressed single-copy genes in A. suecica suggests a functional constraint or fixation of progenitors' single-copy genes in natural allopolyploids. Alternatively, expression of these genes is sensitive to dosage changes. Among the duplicate genes with high-copy numbers (≥10 copies), sequence divergence over time may account for a decreased frequency of nonadditively expressed high-copy duplicate genes in A. suecica.

Promoter Regions of the Duplicate Genes Contain TATA Boxes and less DNA Methylation.

Gene expression changes coevolve with promoter sequences and regulatory elements as shown in yeast (28). The gene expression changes in closely related species and allopolyploids may be caused by cis-regulatory elements and trans-acting and species-specific factors (19, 29). Compared with those without TATA box, the TATA-containing duplicate genes tend to be differentially expressed between A. thaliana and A. arenosa and nonadditively expressed in the allotetraploid lineages (Table S7, Table S9, and Table S10). These data suggest that after species divergence duplicate genes related to environmental cues and stress pathways undergo rapid changes in expression probably via alteration in regulatory elements, and the expression divergence among progenitors' duplicate genes is maintained in recurrent resynthesized and natural allotetraploids.

Upstream sequence divergence between duplicate genes may alter binding affinities to RNA polymerases, transcription factors, and epigenetic modifiers such as DNA and histone methylation that are essential for transcriptional regulation (30). Although DNA methylation occurs in both coding and noncoding regions (31), DNA methylation in the promoter regions is generally associated with transcriptional silencing (31, 32). To test a role of DNA methylation in expression divergence, we investigated DNA methylation patterns in the upstream regions of single-copy and duplicate genes in A. thaliana. A gene is considered to be methylated in the upstream region if 2 or more adjacent probes are significantly methylated within a 1-kbp upstream region (32). We found that in A. thaliana the proportion of duplicate genes with DNA methylation in the upstream regions is significantly lower than that of single-copy genes (Fig. 5B). The data suggest that duplicate genes have a higher potential of expression regulation than single-copy genes through interactions with transcription and trans-acting factors, leading to expression divergence between duplicate genes in the closely related species and interspecific hybrids and allopolyploids.


All plant genomes sequenced to date underwent 1 or more WGD in their evolutionary history (33, 34). Our data suggest an advantage of having WGD duplicate genes that promote expression diversity within and between Arabidopsis species as observed in yeast and Drosophila (8). Moreover, despite a few lines tested, more homoeologous duplicate genes than single-copy genes are nonadditively expressed in the recurrent resynthesized and natural allotetraploids, which may lead to nonadditive and novel phenotypic variation such as growth vigor (1, 14). Moreover, gene silencing or activation may occur immediately after allotetraploid formation or take several generations, suggesting gene expression niches for adaptive evolution (11).

A major contribution of the duplicate genes is to expand gene expression regulatory networks, facilitating morphological and adaptive evolution (1, 7). In A. thaliana, the rates of expression divergence between duplicate genes among GOSlim biological processes vary substantially (23), and the level of expression divergence between gene duplicates in response to environmental stress is higher than in response to developmental changes (20). Between 2 closely related species A. thaliana and A. arenosa, the proportion of differentially expressed duplicate genes in the category of responses to stresses and abiotic and biotic stimuli is also significantly higher than that of differentially expressed single-copy genes. Moreover, duplicate genes inherited from the progenitors in the response to stress category are preferentially nonadditively expressed in new and natural allotetraploids. These data suggest that expression divergence between duplicate genes may promote species-specific defense mechanisms, leading to wider adaptation of polyploid plants to harsh environments (2). Dosage-dependent gene expression increases the range of responses to various stress conditions. Alternatively, neofunctionalization (35) and/or escape from adaptive conflict (36) might have occurred in some gene duplicates.

Increasing expression diversity of duplicate genes suggests changes in upstream regulatory elements and/or protein domains (7, 28). In A. thaliana, high rates of single-nucleotide polymorphisms (SNPs) correlate with gene families in response to environmental conditions (37). Presence of sequence variation in regulatory regions may induce expression divergence between duplicate genes. Indeed, the 5′ upstream regions of the duplicate genes related to stresses tend to possess TATA boxes and low levels of DNA methylation, which may facilitate expression divergence between duplicate genes through interactions with transcription factors and trans-acting proteins (30). Alternatively, protein domain divergence among duplicate transcription factors may affect downstream genes and pathways such as in response to stresses, which remains to be tested.

Theory predicts that after WGD many gene duplicates are lost (6). However, duplicate genes are well preserved within and between species (24, 25, 34). The high retention rate of duplicate genes may be explained by several models, including neofunctionalization (gain of new function) (35), subfunctionalization (functional divergence) (38), and escape from adaptive conflict (free constraints on a gene with novel and ancestral functions) (36). Although increasing expression diversity of gene duplicates does not directly address the high retention rate of duplicate genes, the high levels of expression divergence between duplicate genes in response to environmental stresses are consistent with the high retention rates of duplicate genes in GOSlim classifications (20, 23, 25). The genes retained in duplicate after WGD tend to be retained in duplicate again after a subsequent WGD (24). A recent model of balanced gene drive suggests preferential maintenance of duplicate genes and their products such as transcription factors that increase morphological complexity (26). Indeed, many duplicate genes in transcription and signal transduction are preferentially retained (17, 23), whereas some gene families return to single-copy status (39). In Arabidopsis, after 3 rounds of WGD there is >90% increase of the genes in transcription and signal transduction (25).

The positive relationship between the copy number of gene duplicates and their expression diversity may facilitate the retention of duplicate genes after WGD. On one hand, genome duplication may be considered a mutagen, and imbalance in gene-copy numbers would be deleterious and selected against (40). On the other hand, duplicate genes may escape adaptive conflict (36), reduce constraints on dosage-dependent expression variation, and increase expression diversity for adaptive evolution. The data may explain why duplicate genes are well preserved after WGD in the closely related species.

In conclusion, the preservation of duplicate genes between related species complements the diploidization process that leads to the reduction of duplicate genes and genome size (6). Gene duplicates retained after WGD increase expression diversity and promote regulatory sequence divergence within and between species (8, 20, 28). Expression diversity of duplicate genes expands gene expression regulatory networks and facilitates organismal adaptation to its environment (20, 23). The duplicate genes derived from WGD in the progenitors provide a beneficial effect of genome obesity on morphological and adaptive evolution in subsequent allopolyploids. The empirical and experimental data suggest an important role of expression divergence between duplicate genes in adaptive evolution of closely related species and recurrent formation of allopolyploids.

Materials and Methods

Plant Materials.

Plant materials included autotetraploid A. thaliana Landsberg (Ler) (At4, 2n = 4x = 20) (Arabidopsis Biological Resource Center, ABRC, CS3900), tetraploid A. arenosa (2n = 4x = 32) (CS3901), and natural A. suecica (9502, 2n = 4x = 26) (CS22508). Allotetraploids were resynthesized by manually pollinating At4 using A. arenosa pollen (Fig. 2D), and 2 fertile allotetraploid lines (11) in the 6th selfing generation (Allo733, CS3897 and Allo738, CS3899) (13) were used for the study. All plants were grown in vermiculite mixed with 30% soil in a growth chamber with growth conditions of 22 °C/18 °C (day/night) and 16 h of illumination per day. Rosette leaves before bolting were collected from a pool of 10–12 plants in each genotype for the analysis of DNA and gene expression.

Sequencing an Orthologous A. arenosa BAC Containing FLC Locus.

An A. arenosa BAC library was obtained from Genomex (http://www.genomex.com). The orthologous A. arenosa FLC BAC was selected using FLC as a hybridization probe (19) and sequenced using the shotgun method (Agencourt Bioscience Corporation, Beverly, MA). The sequencing reads were assembled and edited. A. arenosa genes were annotated using FGENESH (www.softberry.com) and BLAST against cDNA sequences of A. thaliana. Annotated genomic sequences are deposited in GenBank (accession no. FJ461780).

DNA Microarray Experiments.

DNA microarray experiments were performed using comparative genomic hybridization (CGH) between At4 and A. arenosa. Genomic DNA was isolated and sheared using a sonicator. Probe labeling, slide hybridization, and washing were performed as described in ref. 13 (see below). The array features with statistically significantly different signals were excluded (GEO accession nos. GSE9513 and GSM213692).

Analysis of Gene Expression Data.

Gene expression microarray data were obtained from 4 sets of comparisons between (1) A. thaliana and A. arenosa (GSM341923), (2) Allo733 and midparent (an artificial RNA mix of 2 parents) (GSM342087), (3) Allo738 and midparent (13) (GSM342029), and (4) A. suecica and midparent (GSM334287). Two biological replications were used in each comparison. One biological replicate includes 2 sets of RNA samples. Each RNA sample was split into 4 aliquots: 2 aliquots were labeled with Cy3 and Cy5, respectively, and the other 2 were “reverse”-labeled with Cy5 and Cy3, respectively. The Cy3-labeled cDNA from one RNA sample was mixed with Cy5-labeled cDNA from another RNA sample as 1 probe, resulting in a total of 4 probes for hybridizations (2 dye swaps) in 1 biological replication. Thus, each comparison (two biological replicates) comprised 8 hybridizations (13).

A linear model was used to perform ANOVA by array (A), dye (D), gene (G), and treatment (T, or genotype or species) (13): $\log(Y_{ijkl}) = \mu + G_i + T_j + A_k + D_l + (GT)_{ij} + (GA)_{ik} + (GD)_{il} + (GTD)_{ijl} + \varepsilon_{ijkl}$, where $Y$ is the hybridization intensity after background subtraction; $\mu$ represents the overall mean; G, T, A, and D are major sources of variation; the interaction terms GT, GA, GD, and GTD represent gene-by-treatment, gene-by-array, gene-by-dye, and gene-by-treatment-by-dye, respectively; and $\varepsilon_{ijkl}$ denotes the random error.

A standard t test statistic is used based on the normality assumption for the residuals and the null hypothesis $H_0: T_{AS} + (GT)_{i,AS} = T_{MPV} + (GT)_{i,MPV}$ (AS: A. suecica; MPV: midparent value). The type I error rate of multiple testing was controlled at ≤0.05 using the false discovery rate (FDR) of Benjamini and Hochberg (41). The differentially expressed genes that were statistically significant under both common and per-gene variances in resynthesized allotetraploids (13) and under per-gene variance in A. suecica were selected for further analysis.
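
The multiple-testing correction named above follows the Benjamini–Hochberg step-up procedure. A minimal sketch is given below; the function name and the example p values are hypothetical and serve only to show how the rejection threshold is applied across ranked tests.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR <= alpha."""
    m = len(p_values)
    # Indices of the p values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected

# Hypothetical per-gene p values, for illustration only
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values))
```

Because the rule is step-up, a gene can be called significant even if its own p value exceeds its rank threshold, provided a higher-ranked p value meets its threshold.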

Identification of Expression of Duplicate and Single-Copy Genes.

Entire cDNA and protein sequences of A. thaliana were downloaded from TAIR database (ftp.arabidopsis.org/home/tair/Genes). All-against-all protein sequence alignment was performed using BLAST. A locus was considered to be a single-copy gene if a protein sequence did not align with any other proteins using BLAST search (E ≤ 0.01). We used well-characterized gene duplicates that arose from the most recent WGD event 20–40 Mya (16–18). The WGD duplicate gene set was further processed to remove ambiguous loci as published (20). To avoid cross-hybridization among paralogous genes in microarrays, we selected the array features if a 70-mer oligo did not match any other cDNA sequences with ≥70% identity and did not have 17 contiguous bases identical to any other cDNAs in A. thaliana.
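
The oligo selection rule can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: the identity values are taken as precomputed inputs (for example from an alignment tool), the sequences are toy strings, and the function names are hypothetical. Only the 17-base contiguous-match check is implemented directly.

```python
def shares_contiguous_run(oligo, cdna, run_length=17):
    """True if oligo and cdna share an identical stretch of run_length bases."""
    if len(oligo) < run_length or len(cdna) < run_length:
        return False
    oligo_kmers = {oligo[i:i + run_length]
                   for i in range(len(oligo) - run_length + 1)}
    return any(cdna[j:j + run_length] in oligo_kmers
               for j in range(len(cdna) - run_length + 1))

def is_specific(oligo, other_cdnas, identities, max_identity=0.70, run_length=17):
    """Apply both criteria against every other cDNA.

    other_cdnas: dict name -> sequence
    identities:  dict name -> best alignment identity with the oligo (0-1),
                 assumed to be computed elsewhere
    """
    for name, seq in other_cdnas.items():
        if identities.get(name, 0.0) >= max_identity:
            return False
        if shares_contiguous_run(oligo, seq, run_length):
            return False
    return True

# Toy 70-mer and two toy cDNAs (hypothetical sequences)
oligo = ("ACGTTGCA" * 9)[:70]
cdna_hit = "TTTTTTTT" + oligo[20:37] + "GGGGGGGG"  # embeds 17 bases of the oligo
cdna_clean = "AT" * 40

print(is_specific(oligo, {"clean": cdna_clean}, {"clean": 0.40}))
print(is_specific(oligo, {"hit": cdna_hit}, {"hit": 0.30}))
```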

Identification of Paralogs.

The same definition of paralogs as Gu et al. (8) was used. Two genes were defined as paralogous if their protein sequences were matched using all-against-all BLAST with the following criteria: (i) E ≤ $10^{-10}$; (ii) sequence identity is >30%; and (iii) the length of the alignable region between 2 protein sequences is >50% of the longer sequence. With these criteria, paralogous genes for duplicate genes were identified, and the number of paralogs of each duplicate gene was counted.
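
The three criteria translate directly into a filter over pairwise alignment results. The sketch below assumes each BLAST hit has already been parsed into a small record; the `BlastHit` fields, gene names, and values are hypothetical and chosen only to exercise each criterion.

```python
from dataclasses import dataclass

@dataclass
class BlastHit:
    query: str
    subject: str
    evalue: float
    identity_pct: float  # percent identity over the aligned region
    aligned_len: int     # length of the alignable region (residues)
    query_len: int
    subject_len: int

def is_paralogous(hit):
    """Criteria (i)-(iii): E <= 1e-10, identity > 30%, aligned > 50% of longer protein."""
    longer = max(hit.query_len, hit.subject_len)
    return (hit.query != hit.subject
            and hit.evalue <= 1e-10
            and hit.identity_pct > 30.0
            and hit.aligned_len > 0.5 * longer)

def count_paralogs(hits):
    """Count distinct paralogous partners for each gene."""
    partners = {}
    for hit in hits:
        if is_paralogous(hit):
            partners.setdefault(hit.query, set()).add(hit.subject)
            partners.setdefault(hit.subject, set()).add(hit.query)
    return {gene: len(found) for gene, found in partners.items()}

hits = [
    BlastHit("geneA", "geneB", 1e-50, 62.0, 300, 400, 420),  # passes all three
    BlastHit("geneA", "geneC", 1e-12, 28.0, 300, 400, 410),  # fails identity
    BlastHit("geneB", "geneC", 1e-20, 45.0, 150, 420, 410),  # fails coverage
]
print(count_paralogs(hits))
```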

Gene Ontology.

The Gene Ontology of A. thaliana genome was downloaded from The Arabidopsis Information Resource (TAIR) (ftp.arabidopsis.org/home/tair/Ontologies/Gene_Ontology/OLD/ATH_GO_GOSLIM.20080419.txt) released on 19 April 2008. GOSlim was used to classify 13 biological process categories. Among the 2,694 gene duplicates and 1,347 single-copy genes, 2,380 (88%) and 1,205 (89%) genes, respectively, were assigned to GOSlim classifications.

Logistic Regression Model.

A simple logistic regression model was used to test the association of gene expression variation with the number of paralogs. The log odds is related to the categories of gene duplicate numbers by a linear model:

$$\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x$$

where p(x) represents the proportion of differentially expressed genes, and x is the category of gene duplicate numbers (x = 0, number of paralogs = 0; x = 1, number of paralogs = 1; x = 2, number of paralogs = 2 to 9; and x = 3, number of paralogs ≥10). $\beta_0$ is the intercept, and the regression coefficient, parameter $\beta_1$, measures the degree of association between the tendency of differential expression and the number of paralogs.
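
A model of this form, with the log odds linear in the paralog category x and slope β1, can be fitted by Newton–Raphson on grouped counts. The sketch below is illustrative only: the counts are invented toy values rather than data from this study, the intercept `b0` and the function names are assumptions, and a statistics package would normally be used in practice.

```python
import math

def paralog_category(n_paralogs):
    """Map a paralog count to the category x used in the model (0, 1, 2-9, >=10)."""
    if n_paralogs == 0:
        return 0
    if n_paralogs == 1:
        return 1
    if n_paralogs <= 9:
        return 2
    return 3

def fit_logistic(xs, successes, totals, n_iter=25):
    """Fit log(p / (1 - p)) = b0 + b1 * x to grouped binomial counts."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, n in zip(xs, successes, totals):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = n * p * (1.0 - p)  # binomial variance weight
            r = y - n * p          # observed minus expected count
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the score vector
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)  # standard error of the slope
    return b0, b1, se_b1

# Toy counts (hypothetical): differentially expressed genes per category
xs = [0, 1, 2, 3]
totals = [1000, 1000, 1000, 1000]
successes = [100, 130, 170, 220]

b0, b1, se_b1 = fit_logistic(xs, successes, totals)
print(f"b1 = {b1:.3f}, SE = {se_b1:.3f}, z = {b1 / se_b1:.1f}")
```

A slope significantly greater than 0 indicates that the proportion of differentially expressed genes rises with the paralog category.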

DNA Methylation in the 5′ Upstream Regions.

Genome-wide DNA methylation data were obtained from the published work (32). Genes methylated in the 5′ upstream region are defined as presence of 2 or more adjacent methylated probes (immunoprecipitated DNA/Input ≥1.28) within a 1-kbp 5′ upstream region. A total of 965 genes were detected to be methylated in the 5′ upstream regions. Fifty-eight out of 2,694 duplicate genes and 54 out of 1,347 single-copy genes, respectively, are methylated in their 5′ upstream regions.
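
The methylation call described above can be sketched as a scan for consecutive methylated probes. This is an illustration under assumptions: probes are represented as (distance upstream in bp, immunoprecipitated/input ratio) pairs, "adjacent" is taken to mean consecutive probes when ordered by position, and the probe values are hypothetical. The final two lines use the counts reported in the text.

```python
def upstream_methylated(probes, threshold=1.28, window=1000, min_adjacent=2):
    """Call a gene methylated if >= min_adjacent consecutive probes within
    the upstream window have an IP/input ratio >= threshold.

    probes: list of (distance_upstream_bp, ip_over_input) tuples for one gene
    """
    in_window = sorted((d, r) for d, r in probes if 0 < d <= window)
    run = 0
    for _, ratio in in_window:
        run = run + 1 if ratio >= threshold else 0
        if run >= min_adjacent:
            return True
    return False

# Hypothetical probe sets
print(upstream_methylated([(100, 1.5), (300, 1.3), (600, 0.9)]))   # adjacent pair
print(upstream_methylated([(100, 1.5), (300, 0.9), (600, 1.4)]))   # not adjacent
print(upstream_methylated([(900, 1.5), (1100, 1.6)]))              # one probe outside 1 kbp

# Proportions from the counts reported in the text
dup_frac = 58 / 2694
single_frac = 54 / 1347
print(f"duplicate: {dup_frac:.1%}, single-copy: {single_frac:.1%}")
```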


We thank Wen-Hsiung Li and members of the Z.J.C. laboratory for valuable suggestions and the editor and anonymous reviewers for their constructive comments to improve the manuscript. The work was supported by National Science Foundation Plant Genome Research Program Grant DBI0733857 and National Institutes of Health Grant GM067015 (to Z.J.C.).


The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The microarray data reported in this paper have been deposited in the Gene Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession nos. GSE9513 and GSM213692). The BAC sequence reported in this paper has been deposited in the GenBank database (accession no. FJ461780).

This article contains supporting information online at www.pnas.org/cgi/content/full/0807350106/DCSupplemental.


1. Chen ZJ. Genetic and epigenetic mechanisms for gene expression and phenotypic variation in plant polyploids. Annu Rev Plant Biol. 2007;58:377–406.
2. Grant V. Plant Speciation. 2nd Ed. New York: Columbia Univ Press; 1981. p. 563.
3. Wendel J, Doyle JJ. Polyploidy and evolution in plants. In: Henry RJ, editor. Plant Diversity and Evolution: Genotypic and Phenotypic Variation in Higher Plants. Wallingford, UK: CABI; 2005. p. 23.
4. Tate JA, Soltis PS, Soltis DE. Polyploidy in Plants. In: Gregory TR, editor. The Evolution of the Genome. New York: Academic; 2004. p. 43.
5. Masterson J. Stomatal size in fossil plants: Evidence for polyploidy in majority of angiosperms. Science. 1994;264:421–424.
6. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155.
7. Carroll SB. Endless forms: The evolution of gene regulation and morphological diversity. Cell. 2000;101:577–580.
8. Gu Z, Rifkin SA, White KP, Li WH. Duplicate genes increase gene expression diversity within and between species. Nat Genet. 2004;36:577–579.
9. Comai L. The advantages and disadvantages of being polyploid. Nat Rev Genet. 2005;6:836–846.
10. Leitch AR, Leitch IJ. Genomic plasticity and the diversity of polyploid plants. Science. 2008;320:481–483.
11. Wang J, et al. Stochastic and epigenetic changes of gene expression in Arabidopsis polyploids. Genetics. 2004;167:1961–1973.
12. Jakobsson M, et al. A unique recent origin of the allotetraploid species Arabidopsis suecica: Evidence from nuclear DNA markers. Mol Biol Evol. 2006;23:1217–1231.
13. Wang J, et al. Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics. 2006;172:507–517.
14. Ni Z, et al. Altered circadian rhythms regulate growth vigor in hybrids and allopolyploids. Nature. 2009;457:327–331.
15. Koch MA, Haubold B, Mitchell-Olds T. Comparative evolutionary analysis of chalcone synthase and alcohol dehydrogenase loci in Arabidopsis, Arabis, and related genera (Brassicaceae). Mol Biol Evol. 2000;17:1483–1498.
16. Simillion C, Vandepoele K, Van Montagu MC, Zabeau M, Van de Peer Y. The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci USA. 2002;99:13627–13632.
17. Blanc G, Hokamp K, Wolfe KH. A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res. 2003;13:137–144.
18. Bowers JE, Chapman BA, Rong J, Paterson AH. Unraveling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature. 2003;422:433–438.
19. Wang J, Tian L, Lee HS, Chen ZJ. Nonadditive regulation of FRI and FLC loci mediates flowering-time variation in Arabidopsis allopolyploids. Genetics. 2006;173:965–974.
20. Ha M, Li WH, Chen ZJ. External factors accelerate expression divergence between duplicate genes. Trends Genet. 2007;23:162–166.
21. Kane MD, et al. Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res. 2000;28:4552–4557.
22. Chou CC, Chen CH, Lee TT, Peck K. Optimization of probe length and the number of probes per gene for optimal microarray analysis of gene expression. Nucleic Acids Res. 2004;32:e99.
23. Casneuf T, De Bodt S, Raes J, Maere S, Van de Peer Y. Nonrandom divergence of gene expression following gene and genome duplications in the flowering plant Arabidopsis thaliana. Genome Biol. 2006;7:R13.
24. Seoighe C, Gehring C. Genome duplication led to highly selective expansion of the Arabidopsis thaliana proteome. Trends Genet. 2004;20:461–464.
25. Maere S, et al. Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci USA. 2005;102:5454–5459.
26. Freeling M, Thomas BC. Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res. 2006;16:805–814.
27. Thomas BC, Pedersen B, Freeling M. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res. 2006;16:934–946.
28. Tirosh I, Weinberger A, Carmi M, Barkai N. A genetic signature of interspecies variations in gene expression. Nat Genet. 2006;38:830–834.
29. Wittkopp PJ, Haerum BK, Clark AG. Evolutionary changes in cis and trans gene regulation. Nature. 2004;430:85–88.
30. Li B, Carey M, Workman JL. The role of chromatin during transcription. Cell. 2007;128:707–719.
31. Zhang X, et al. Genome-wide high-resolution mapping and functional analysis of DNA methylation in Arabidopsis. Cell. 2006;126:1189–1201.
32. Zilberman D, Gehring M, Tran RK, Ballinger T, Henikoff S. Genome-wide analysis of Arabidopsis thaliana DNA methylation uncovers an interdependence between methylation and transcription. Nat Genet. 2007;39:61–69.
33. Ming R, et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature. 2008;452:991–996.
34. Cui L, et al. Widespread genome duplications throughout the history of flowering plants. Genome Res. 2006;16:738–749.
35. Lynch M, O'Hely M, Walsh B, Force A. The probability of preservation of a newly arisen gene duplicate. Genetics. 2001;159:1789–1804.
36. Des Marais DL, Rausher MD. Escape from adaptive conflict after duplication in an anthocyanin pathway gene. Nature. 2008;454:762–765.
37. Clark RM, et al. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science. 2007;317:338–342.
38. Lynch M, Force A. The probability of duplicate gene preservation by subfunctionalization. Genetics. 2000;154:459–473.
39. Paterson AH, et al. Many gene and domain families have convergent fates following independent whole-genome duplication events in Arabidopsis, Oryza, Saccharomyces and Tetraodon. Trends Genet. 2006;22:597–602.
40. Birchler JA, Bhadra U, Bhadra MP, Auger DL. Dosage-dependent gene regulation in multicellular eukaryotes: Implications for dosage compensation, aneuploid syndromes, and quantitative traits. Dev Biol. 2001;234:275–288.
41. Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57:289–300.
