Copyright © 2008 Armisén et al; licensee BioMed Central Ltd.

Unique genes in plants: specificities and conserved features throughout evolution

Unité de Recherche en Génomique Végétale (URGV), UMR INRA 1165 – CNRS 8114 – Université d'Evry Val d'Essonne, 2 rue Gaston Crémieux, CP 5708, F-91057 Evry Cedex, France

Corresponding author. David Armisén: armisen/at/evry.inra.fr; Alain Lecharny: lecharny/at/evry.inra.fr; Sébastien Aubourg: aubourg/at/evry.inra.fr

Received April 24, 2008; Accepted October 10, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background

Plant genomes contain a high proportion of duplicated genes as a result of numerous whole-genome, segmental and local duplications. These duplications lead to the formation of gene families, which are the usual material for many evolutionary studies. However, all characterized genomes include single-copy (unique) genes that have not received much attention. Unlike gene duplication, gene loss is not a nonspecific mechanism but is instead influenced by functional selection. In this context, we established and used stringent criteria to identify suitable sets of unique genes present in plant proteomes. Comparisons of unique genes in the green phylum were used to characterize the gene and protein features exhibited by both conserved and species-specific unique genes.

Results

We identified the unique genes within both the A. thaliana and O. sativa genomes and classified them according to the number of homologs in the other species: none (U{1:0}), one (U{1:1}) or several (U{1:m}). Regardless of the species, all the genes in these groups present some conserved characteristics, such as small average protein size and abnormal intron number.
In order to understand the origin and function of unique genes, we further characterized the U{1:1} gene pairs. The possible involvement of sequence convergence in the creation of U{1:1} pairs was discarded due to the frequent conservation of intron positions. Furthermore, an orthology relationship between the two members of each U{1:1} pair was strongly supported by a high conservation in the protein sizes and transcription levels. Within the promoter of the unique conserved genes, we found a number of TATA and TELO boxes that specifically differed from their mean number in the whole genome. Many unique genes have been conserved as unique through evolution from the green alga Ostreococcus lucimarinus to higher plants. Plant unique genes may also have homologs in bacteria and we showed a link between the targeting towards plastids of proteins encoded by plant nuclear unique genes and their homology with a bacterial protein.

Conclusion

Many of the A. thaliana and O. sativa unique genes are conserved in plants for which the ancestor diverged at least 725 million years ago (MYA). Half of these genes are also present in other eukaryotic and/or prokaryotic species. Thus, our results indicate that (i) a strong negative selection pressure has conserved a number of genes as unique in genomes throughout evolution, (ii) most unique genes are subjected to a low divergence rate, (iii) they have some features observed in housekeeping genes but for most of them there is no functional annotation and (iv) they may have an ancient origin involving a possible gene transfer from ancestral chloroplasts or bacteria to the plant nucleus.

Background

The role of gene duplications in evolution was suggested forty years ago (see the review by Taylor and Raes 2004 [1]). More recently, complete sequencing of several eukaryotic genomes showed the quantitative importance of duplicated genes [2,3].
In particular, plant genomes contain a high proportion of duplicated genes and, in several plant gene families, the number of paralogous genes is more than one hundred [4,5]. Frequent gene duplications [6], occasional segmental [7], chromosomal and genomic duplications [8-13] shaped present genomes. The underlying mechanisms indicate that the primary molecular events in gene duplication should affect most of the genes independently of their function. Nevertheless, all characterized genomes include single-copy (unique) genes, i.e. genes without apparent homolog in the same genome [14] and, for some of them, without any homolog, even in phylogenetically close relatives [15]. Indeed, evolution is not a one-direction process and a high proportion of duplicated genes are rapidly lost [6,16,17]. This definition of unique gene is fully independent of the gene function and is only based on the protein sequence uniqueness in the whole proteome of a considered species. For instance, in the framework of this definition, the bHLH transcription factors, whatever the different functions that might be assigned to each of them, are not considered as unique because they all share sequence similarity and, as such, are thought to have arisen from a common ancestor. In other words, in this paper we define as single-copy or unique gene, a gene coding for a protein without detectable sequence motif or global similarities with any protein in the same proteome. Unlike gene duplication, gene loss is not an unspecific mechanism but it is instead influenced by functional selection [12,18]. Thus, duplicates that are maintained show a bias toward certain gene functional classes [19] or transcriptional level [6,20,21]. Unique genes may also be duplicates that diverged too much to be distinguished now [22]. With the recent availability of whole plant proteomes it is possible to consider further some questions about the generation and evolution of unique genes in plants. 
In many evolutionary studies, sound groups of duplicated genes are selected but the genes left apart by the process are far from being all unique genes. Indeed, the potential adaptive significance of duplicated genes and genomes has received great attention [23-25]. It is however more difficult to speculate on the meaning of species- or phylum-specific unique genes mainly because of a critical lack of functional annotation for most of them. Major differences in gene repertoire among species were attributed to proteins with obscure features that lack currently defined motifs or domains (POFs) and are often species- or phylum-specific [26]. The definition of POFs [27] relying only on the absence of characterized conserved sequence signatures is thus independent of the existence or absence of paralogs. POFs and unique genes are nevertheless overlapping populations of genes. Hypotheses on the possible origins of POFs include convergent evolution and rapid divergence [26]. The question of the origin of unique genes, either purifying selection against duplicates or rapid divergence, remains unsolved. In this study we first established and used stringent criteria in order to identify suitable sets of unique genes present in the extensively known proteomes of Arabidopsis thaliana (core eudicotyledons, Brassicaceae) and Oryza sativa (Liliopsida, Poaceae), two plants that diverged ~150 million years ago (MYA) [28,29]. Second, we used the intersection between the two sets of unique genes in order to characterize a set of genes conserved as unique in both A. thaliana and O. sativa, i.e. pan-orthologs as defined by Blair et al. [30]. Third, we searched for gene, promoter and protein features shared between all unique genes and/or within pairs of pan-orthologs. Fourth, using the pan-orthologs between A. thaliana and O. sativa, we searched for their conservation in a green unicellular alga and a moss, for which reasonably good proteomes are also available. 
Within the limits of the proteomes used, we show that several unique genes are species specific but that a significant number are conserved even outside of the green phylum. The clusters of homologous unique genes highly conserved throughout the green phylum globally present specific structural features that indicate a strong purifying selection supporting the orthology links between the conserved unique genes. These conserved unique genes would be important targets for functional studies since it is likely that they perform ancient but as yet undescribed biological functions.

Results and discussion

How many unique genes in Arabidopsis thaliana and Oryza sativa?

Because we aimed to search for evidence of particular features of the unique proteins, our method had to be stringent enough to keep false positives to a minimum. To achieve this objective we used a protocol that combined detection of conserved motifs (through the PFAM library [31]) and local sequence alignments (BLASTp), taking into account the relative length of the conserved regions. A. thaliana and O. sativa were the first two plants with a whole genome sequenced and annotated [4,5]. The corresponding proteins were used separately to run our protocol for each species (Figure 1).
Previously published estimates of the number of A. thaliana unique proteins range from 3,405 to 12,265 proteins [4,33-35], depending on the protocol used. The smallest value (3,405) comes from the PHYTOPROT project [34] and was obtained through extensive all-against-all sequence comparisons using the LASSAP software [36]. The list of unique genes delivered by PHYTOPROT was longer than the list provided by our method but 81% of the unique proteins were shared between both lists. Examination of the additional proteins identified in PHYTOPROT shows that they are members of a PFAM family and were therefore excluded from our list.

Unique proteins conserved and non-conserved between Arabidopsis thaliana and Oryza sativa

A protein that is unique in a given species may have no, one or several homologs in another species. We named U{1:0} the unique proteins in one species with no homolog in the other one, U{1:1} the unique proteins with only one homolog and U{1:m} the unique proteins with more than one homolog. A 2-letter prefix was added to indicate the plant species when necessary, e.g. AtU{1:m} refers to A. thaliana unique genes with at least 2 homologs in the O. sativa genome. Both U{1:1} and U{1:m} are conserved single copy genes in the reference genome (hereafter called conserved single copy genes) and are respectively qualified as pan-orthologs and syn-orthologs according to Blair et al. [30]. After sequence comparison based on BLASTp, 995 (3.7% of the whole A. thaliana proteome) and 6,418 (11.1% of the whole O. sativa proteome) unique genes were classified as AtU{1:0} and OsU{1:0} respectively (Figure 2). Sequence conservation between the Liliopsida and core eudicotyledon members of a pair of proteins strongly supports the gene prediction of U{1:1} and U{1:m} genes. However, an over-prediction of U{1:0} genes remained possible. Thus, we searched for evidence of transcription for the genes coding for the U{1:0} proteins in both plants.
We found transcript sequences for 544 (out of 995) and 1,462 (out of 6,418) U{1:0} proteins from A. thaliana and O. sativa respectively. This class of proteins, for which the corresponding gene structure was supported by transcript sequences (ESTs and/or cDNA), was named U{1:0}E (for Expressed) genes. Similarly, the class of unique proteins without homologs in the other plant species and without cognate ESTs was named U{1:0}NE (for No proof of Expression) genes (Figure 2).
In A. thaliana, we further analysed the possible over-prediction of the 451 AtU{1:0}NE proteins by searching for corresponding gene expression in the CATMA [37] and Affymetrix [38,39] transcriptome resources. Statistical proof of expression was found for 311 additional AtU{1:0}NE genes. Altogether, these data indicated that most of the predicted AtU{1:0} coding genes were expressed and thus actual genes. It was more difficult to conclude on the accuracy of the number of unique genes for O. sativa since there remained a large number of OsU{1:0}NE genes (4,956) with insufficient transcriptome data. Using the 2,570 A. thaliana unique proteins as queries in a BLASTp search against the 8,041 O. sativa unique proteins, we found 974 pairs of AtU{1:1} proteins, and 960 OsU{1:1} pairs when doing the inverse search. Of these, 937 shared pairs remained as U{1:1} protein pairs after crossing both lists. A manual check of U{1:1} protein pairs present in only one list showed that the differences were due to gene splitting/fusions that may come either from actual events or from gene prediction errors in one of the two genomes. These processes changed an actual U{1:1} relationship into an apparent U{1:m} relationship.

Topological organization of unique genes

Both A. thaliana and O. sativa have large regions that are still recognizable as duplicated regions [4,40]. We analyzed the distribution of AtU{1:0}, AtU{1:1} and AtU{1:m} genes in A. thaliana non-duplicated regions, which contained 15.7% of the nuclear genome. No significant preferential occurrences of AtU{1:0}, AtU{1:1} and AtU{1:m} genes were observed inside the apparently non-duplicated regions, where we observed about 18% of them. Therefore, this result showed that most of the genes are unique not because they belong to a genomic region deleted after whole genome duplication, but because of non-reciprocal local losses between two paralogous duplicated genomic regions. We also analysed the distribution of each class of unique genes along A.
thaliana and O. sativa chromosomes using a Chi-square test with a confidence level of 99.5% (critical values of 14.86 and 26.76, i.e. 4 and 11 degrees of freedom, respectively). All gene classes were evenly distributed among the 5 chromosomes of A. thaliana with a Chi-square of 3.91 for U{1:0}, 3.95 for U{1:1} and 0.63 for U{1:m} genes. The O. sativa distribution was also even for U{1:0} and U{1:m}, with Chi-square values of 23.63 and 25.64 respectively, but uneven (Chi-square of 65.10) for U{1:1} genes. Detailed analysis showed that in the O. sativa genome there was a higher density of U{1:1} genes on chromosomes 2 and 3 and a lower density on chromosomes 11 and 12. This particular distribution is unexpected since chromosomes 11 and 12 are the only two rice chromosomes that do not show evidence for large regional duplications with any other rice chromosomes [40,41]. The recent duplication described between the first 3 Mb of chromosomes 11 and 12 [13,41,42] only covers 11% of their size, which is not sufficient to explain the low number of unique genes observed within each chromosome (60% of the expected number). Thus, our results suggest that in O. sativa, as well as in A. thaliana, non-reciprocal losses between duplicated genomic regions are a frequent mechanism for generating and maintaining a set of genes as unique.

Unique gene and protein features

We compared the relative intron numbers, the presence of some TFBS and the protein lengths between random sets of nuclear genes and the 3 groups of unique genes, U{1:0}E, U{1:1} and U{1:m}. All the U{1:0}NE genes and the U{1:0}E genes not fully covered by cognate transcripts were excluded from the study due to the uncertainty on their structural annotation (intron number and positions, CDS size). The GC content of all the groups was not significantly different from the 44.2% in A. thaliana and the 53.3% in O. sativa.

Intron number

This feature separates all the unique genes into two distinct groups.
On one side, U{1:0} clustered intron poor genes that had 30% fewer introns than all nuclear genes. On the other side, U{1:m} and U{1:1} genes have a higher number of introns with a density of 1.35 and 1.57 introns per 100 amino acids as compared to 1.09 for all the nuclear genes in A. thaliana (Table 1). These differences are the same for rice unique genes. Our results are in agreement with the fact that, in general, evolutionarily conserved genes preferentially accumulate introns [43]. Nevertheless, there is no difference in the number of introns in the 5' and 3' UTRs between unique genes and the whole genome. These observations suggest that the pressure of selection that is at work to keep unique a set of orthologous genes in a genome has an effect down to the level of gene structures mainly in the ORFs. Indeed, functional reasons may be put forward since introns may play a functional role through alternative splicing, effects on gene expression [44,45] or by their involvement in protein transport [46].
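The intron densities quoted above are normalised per 100 amino acids of encoded protein. A minimal sketch of that normalisation follows, assuming the density of a gene set is pooled (total introns over total residues); the text does not say whether values were pooled or averaged per gene, and the function names and the toy gene list are illustrative only.

```python
def intron_density(n_introns, protein_length_aa):
    """Introns per 100 amino acids for a single gene."""
    if protein_length_aa <= 0:
        raise ValueError("protein length must be positive")
    return 100.0 * n_introns / protein_length_aa


def pooled_density(genes):
    """Pooled density of a gene set: total introns over total residues.

    `genes` is an iterable of (n_introns, protein_length_aa) tuples.
    Pooling is an assumption; averaging per-gene densities is an alternative.
    """
    genes = list(genes)
    total_introns = sum(n for n, _ in genes)
    total_length = sum(length for _, length in genes)
    return intron_density(total_introns, total_length)


def relative_difference(group_density, genome_density):
    """Fractional excess (positive) or deficit (negative) versus the genome."""
    return group_density / genome_density - 1.0


# Hypothetical gene set: (number of introns, protein length in residues).
toy_genes = [(4, 300), (6, 450), (2, 250)]
print(round(pooled_density(toy_genes), 2))

# Densities reported in the text for A. thaliana (introns per 100 aa).
GENOME, U_1M, U_11 = 1.09, 1.35, 1.57
print(f"U{{1:m}} vs genome: {relative_difference(U_1M, GENOME):+.0%}")
print(f"U{{1:1}} vs genome: {relative_difference(U_11, GENOME):+.0%}")
```

With the reported A. thaliana values, the conserved classes carry roughly a quarter (U{1:m}) to nearly a half (U{1:1}) more introns per residue than the genome as a whole.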
Transcription factor binding sites (TFBS) in promoter sequences

In the whole genome of A. thaliana and O. sativa we found respectively 20% and 16% of genes with a TATA-box in their promoters. Comparing the genome-wide frequencies of the two well-characterized TFBS (TATA-box and TELO-box) with their frequencies in promoters of unique genes splits the unique genes into two groups: the U{1:0} class on one side and the U{1:m} and U{1:1} classes on the other side. On the one hand, the promoters of Arabidopsis U{1:m} and U{1:1} genes contain the same relative number of TATA-boxes (Chi-squared test, P-value = 0.40) and they have a significantly lower frequency of TATA-boxes (Chi-squared test, P-value = 2.3e-14) than the other nuclear genes (Table 1). On the other hand, TELO-box presence was significantly higher in AtU{1:m} and AtU{1:1} genes than in the other nuclear genes (Chi-squared test, P-value = 0.0057). The same differences are observed in unique O. sativa genes (Table 1). The two other TFBS analysed, SORLIP2 [47,48] and CAAT [49] boxes, present slight variations in each class when compared with the whole genome distribution, but these variations were not consistent in both species (Table 1). The different frequencies of TATA and TELO boxes observed in the promoter sequences of unique genes cluster them in the same way as the intron density criterion: the class U{1:0} on one side and the two classes U{1:1} and U{1:m} on the other side. This particular clustering, conserved in both A. thaliana and O. sativa, is discussed below.

Protein length

We compared the size distribution of each group of unique proteins in the two species (Figure 3).
In summary, in the A. thaliana genome, there are 2,570 unique genes and 995 do not have a homolog in O. sativa. Conserved single copy genes comprise both the 974 A. thaliana genes that have only one ortholog and the 601 genes that have more than one homolog in O. sativa. In the O. sativa genome, 8,041 genes are unique and 6,418 do not have a homolog in A. thaliana. Furthermore, 960 conserved unique genes have only one ortholog while 663 have more than one ortholog. Even if we might suspect some over-prediction of unique O. sativa genes, our results about the common features shared by unique genes are highly similar in both A. thaliana and O. sativa. First, conserved single copy genes (U{1:1} and U{1:m} classes) have relatively more introns than the whole genome and their promoters are characterized by a lower presence of TATA-box and a higher presence of TELO-box than in the nuclear genes. Second, unique genes code for shorter proteins than the whole genome and the difference is greatest for unconserved proteins.

Functional features of U{1:0} genes

We recovered the annotated gene functions available for the 544 AtU{1:0}E genes. Despite the fact that we used "annotation" in the broadest sense of the word, only 105 of them have a predicted function (Table 2), i.e. 2 to 3 times fewer than expected from the whole genome [60]. In the 105 annotated AtU{1:0}E genes we observed 15 genes coding for recognized peptide phytohormones [61] including CLAVATA3 and 5 CLAVATA3 related peptides, POLARIS, 3 PROPEP, RALF and N Hydroxyprolin-rich glycoprotein coding genes. The small peptide phytohormones are involved in signalling roles in defence or non-defence functions [61]. Most of the peptide phytohormones are proteolytic products of larger propeptides encoded by different genes. Some peptide phytohormones may be clustered based on short motif conservation, such as the CLAVATA3 group, which is characterised by only 12 residues while the remaining parts of the propeptides are highly divergent.
When we searched for peptide phytohormones in AtU{1:1} genes, we did not find any even though there were almost 6 times more genes with predicted functions compared to AtU{1:0}E genes. Another specific feature of the AtU{1:0}E group is a relatively high percentage of genes coding for proteins targeted to the endoplasmic reticulum (Table 2), as is the case for the propeptides of secreted peptide phytohormones [61]. This observation suggests that the AtU{1:0}E group might contain many other not yet characterized genes coding for phytohormone propeptides that might be involved in unknown signalling processes. For instance, in the AtU{1:0}E group, we found 13 genes coding for proline- or glycine-rich proteins that were mainly predicted to be targeted to the endoplasmic reticulum (Table 2). Additionally, genes encoding secreted peptides have been reported as having a low intron density [53], as we observed for the U{1:0} group of genes.
Structural and functional features conserved in At and OsU{1:1} gene pairs

The 937 pairs of U{1:1} genes between A. thaliana and O. sativa were established by local sequence comparisons (reciprocal best hit or RBH) of U{1:1} gene lists with criteria generally accepted to define an orthology relationship [62]. Nevertheless, to support the orthology and the functional relationships more strongly, we looked for some structural features shared by the two members of U{1:1} pairs (see Additional file 1).

Protein length

Protein lengths of the two members of a U{1:1} pair were highly correlated (Figure 4A).
Intron position

The conceptual position of introns was searched for in the global alignment of each pair of protein sequences. Nearly 45% of U{1:1} pairs had a conserved number and conserved positions of introns, while the mean value for random pairs of conserved unique genes was 0.2% (Table 3). Less stringently, 71% of the U{1:1} pairs exhibited at least one intron at a conserved position as compared to 10.6% in the random pairs. Overall, the high intron conservation is strong evidence for orthology between members of a U{1:1} gene pair, ruling out any mechanism of convergence between their sequences. Comparison of gene structures in the U{1:1} pairs also highlights the fact that, since the speciation, the numbers of intron gains or losses are nearly equivalent in the two species. Indeed, the ratio between the number of non-conserved introns (in terms of position) in A. thaliana and the number of non-conserved introns in O. sativa is 1.03 (Table 3). Comparative studies on A. thaliana and O. sativa genes showed three different evolutionary trends based on the orthology relationships. First, recently duplicated genes are subject to high rates of intron loss and gain [63]; second, two orthologous genes tend to keep the same gene structure and only a relatively small number of species-specific introns are observed [64]; and, third, slowly evolving conserved genes are also subject to an elevated rate of intron gain but tend to conserve their introns [43]. As a consequence, there is a negative correlation between density of introns and sequence evolution rate of genes [43]. The density and the high conservation of intron positions in conserved unique genes, U{1:1}, suggest that these genes are orthologous and slowly evolving genes.
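The intron-position comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: intron positions are taken to be already mapped to columns of the global protein alignment, a position counts as conserved only on an exact column match, and pairs with no intron in either gene are labelled separately but kept in the denominator. The function names and the toy pairs are hypothetical.

```python
def classify_pair(at_positions, os_positions):
    """Label one gene pair from the alignment columns of its introns."""
    at, os_ = set(at_positions), set(os_positions)
    if not at and not os_:
        return "intronless"      # nothing to compare
    if at == os_:
        return "identical"       # same number and same positions
    if at & os_:
        return "partial"         # at least one shared position
    return "none"


def summarise(pairs):
    """Fractions of conserved pairs and the ratio of species-specific introns."""
    pairs = [(set(a), set(o)) for a, o in pairs]
    labels = [classify_pair(a, o) for a, o in pairs]
    n = len(labels)
    identical = labels.count("identical")
    at_least_one = identical + labels.count("partial")
    at_only = sum(len(a - o) for a, o in pairs)   # introns seen only in species A
    os_only = sum(len(o - a) for a, o in pairs)   # introns seen only in species B
    return {
        "identical_fraction": identical / n,
        "at_least_one_fraction": at_least_one / n,
        "nonconserved_ratio": at_only / os_only if os_only else float("nan"),
    }


# Hypothetical pairs: (A. thaliana intron columns, O. sativa intron columns).
toy_pairs = [
    ([30, 120], [30, 120]),   # identical structure
    ([45], [45, 200]),        # one shared intron, one specific to O. sativa
    ([10, 80], [60]),         # no shared position
    ([], []),                 # intronless pair
]
print(summarise(toy_pairs))
```

A ratio of non-conserved introns close to 1, as reported in the text (1.03), corresponds to `nonconserved_ratio` being near 1 in this sketch.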
Transcription

The methods available to compare the expression of orthologous genes from different species are limited. Since A. thaliana and O. sativa benefit from large collections of EST and cDNA sequences, we used the number of available cognate transcripts of each member of a U{1:1} pair to estimate and compare their expression levels. In order to avoid sampling bias, we focused our comparison on genes with at least 30 cognate transcripts. The retrieved information showed that genes with at least 30 cognate transcripts are in similar proportion in the population of U{1:1} genes as in the whole genome, whatever the considered species: 14.6% and 17.3% respectively for A. thaliana and 7.2% and 10.1% respectively for O. sativa. A correlation (Kendall's test, P-value = 1e-6) between the normalized numbers of transcripts in A. thaliana and O. sativa could be observed for U{1:1} pairs (Figure 5A).
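As an illustration of the rank correlation used here, the sketch below computes Kendall's tau-b on transcript counts after applying the 30-transcript threshold. It assumes the threshold is applied to both members of a pair, which the text does not state, and it does not compute a P-value. The counts are invented. Because tau depends only on ranks, rescaling the counts within each species by a constant factor leaves it unchanged.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b (tie-corrected) by direct pair enumeration, O(n^2)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("at least two observations are required")
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1
        if dy == 0:
            tied_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


def filter_pairs(counts, min_transcripts=30):
    """Keep pairs where both genes reach the transcript threshold (assumption)."""
    return [(a, o) for a, o in counts
            if a >= min_transcripts and o >= min_transcripts]


# Hypothetical (A. thaliana, O. sativa) cognate transcript counts per U{1:1} pair.
toy_counts = [(35, 40), (120, 90), (60, 75), (12, 300), (200, 150), (45, 31)]
kept = filter_pairs(toy_counts)
tau = kendall_tau_b([a for a, _ in kept], [o for _, o in kept])
print(len(kept), round(tau, 3))
```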
TFBS conservation

In the previous section, we showed that conserved unique genes less frequently have a TATA-box and more frequently have a TELO-box in their promoters than the other genes. Nevertheless, the general over-representation of one TFBS in the unique gene promoter set does not mean that TFBS are conserved in the two promoters of pan-orthologs. Therefore, we counted the U{1:1} gene pairs in which a TATA-box or a TELO-box was present simultaneously in both promoters. Surprisingly, the percentage of pan-orthologs that presented a TATA-box motif within both promoters was only 0.8% and is not significantly different (Chi-squared test, P-value = 0.13) from the expected value, i.e. the value observed in promoters of randomly selected pairs of genes (0.4%). In contrast, the simultaneous presence of a TELO-box motif within both promoters of a U{1:1} pair was significantly higher (Chi-squared test, P-value = 5.22e-5) than found in random pairs (3.8% compared to 1.6%). In order to complete the promoter comparison between A. thaliana and O. sativa pan-orthologs, we used the CONREAL [68] and CREDO [69] packages to find any other conserved motifs, i.e. known or unknown putative TFBS. This phylogenetic footprinting approach did not highlight a promoter sequence conservation different from that detected in random pairs of promoters. Additionally, the global analysis of all pan-ortholog promoter pairs with Motif sampler [70] failed to discover over-represented motifs except the previously identified TELO-box. Thus, contrary to our observation of conserved features in the CDS, we found almost no trace of sequence conservation within the promoters of U{1:1} gene pairs even if our dataset of pan-orthologs might be regarded as the best situation to see common regulatory sequences in A. thaliana and O. sativa promoters.
Nevertheless, promoter pairs of pan-orthologs might share conserved TFBS (not over-represented in the unique gene population) which we cannot distinguish from background noise through the comparison of two sequences. In summary, conserved genes maintained unique in both A. thaliana and O. sativa have (i) clearly a common origin as indicated by the conservation of the intron positions and the conservation in their product lengths, (ii) no apparent conservation between their promoters which contrasts with (iii) a conservation in their relative transcription level. Nevertheless, the number of ESTs that may be associated to a gene is a general indication of the level of transcription but it is a mixed measurement that is dependent on both high expression in specific situations and expression in a large range of conditions. Transcriptome data from DNA chips inform better on the breadth of expression. Analyses of large transcriptome data collections have shown that A. thaliana genes responding to many stimuli are frequently characterized by the presence of a TATA-box, shorter CDS and fewer introns [71,72]. Conversely, A. thaliana genes controlled by TELO-box have a narrow stimuli response and tend to be larger and have more introns [71]. In this context, the conserved single copy genes, which rarely contain a TATA-box and are relatively short genes containing more introns, might constitute a group of genes quite apart in the whole genome.

Are unique A. thaliana and O. sativa genes conserved as unique in other plants?

We extended our study to other genomes for which our knowledge was not as complete as for the A. thaliana and O. sativa ones but, nevertheless, with a relatively complete proteome available. Thus, we systematically searched, with our approach, for unique proteins in the available proteomes of Ostreococcus lucimarinus and Physcomitrella patens [73,74].
The nearly complete proteomes of Populus trichocarpa [75] and Vitis vinifera [76] were not used in our phylogenetic analysis in order not to distort our results by an overrepresentation of the core eudicotyledon branch. Pairwise comparisons of the unique proteins from the 4 studied species showed that the number of U{1:1} pairs decreased with the evolutionary distance separating the plants. However, the numbers of observed U{1:1} pairs were always significantly above the number expected by chance (Figure 6).
The phylogenetic studies of unique gene conservation from O. lucimarinus to A. thaliana provided a final list of 192 unique genes, the intersection between the two lists (200 and 209) provided by comparisons going in the two opposite directions (Figure 6). Structural features of U{1:1:1:1} genes showed a mean protein length and exon number similar to those of U{1:1} genes as well as the same tendency towards a low TATA-box and a high TELO-box presence in promoters. These characteristics suggest that unique genes underwent the same kind of selection pressure from the common ancestor to the present organisms. An estimation of this pressure was obtained by calculating the non-synonymous and synonymous substitution rates (dN and dS) with Nei-Gojobori's method [80] included in the Codeml program from the PAML package [81]. Each gene within a cluster of U{1:1:1:1} genes was paired and compared to every other gene included in the cluster (Table 4). Additionally, the dN/dS ratio was computed for U{1:1} gene pairs. Results showed a high selective pressure against non-synonymous substitutions with a median dN/dS ratio of 0.32 for the 937 U{1:1} genes and from 0.25 to 0.41 for unique genes conserved among the three land plants and with a maximum median of 0.79 for pairs including O. lucimarinus (Table 4). In comparison, we observed that the median dN/dS ratio calculated from 7,551 alignments of putative A. thaliana – O. sativa orthologous proteins (RBH, Methods section) is 0.33. A dN/dS ratio of 1 is usually considered as the limit below which selection is negative (purifying); a ratio equal to 1 indicates neutral drift and a ratio higher than 1 indicates positive selection [82,83]. Thus, our results show purifying selection pressure on conserved unique genes in plants and strongly suggest that most of these genes are actual functional pan-orthologs.
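The interpretation of the dN/dS ratio used above can be written down in a few lines. The sketch below only post-processes dN and dS values that a tool such as Codeml would already have produced; it does not estimate substitution rates. The tolerance used to call a ratio "neutral" is an arbitrary illustrative choice (in practice a statistical test would be needed), and the example values are invented.

```python
from math import isclose
from statistics import median


def omega(dn, ds):
    """dN/dS ratio for one pairwise comparison."""
    if ds <= 0:
        raise ValueError("dS must be positive to form the ratio")
    return dn / ds


def selection_regime(ratio, rel_tol=0.05):
    """Read a dN/dS ratio against the conventional threshold of 1.

    rel_tol is a hypothetical tolerance for treating a ratio as equal to 1.
    """
    if isclose(ratio, 1.0, rel_tol=rel_tol):
        return "neutral"
    return "purifying" if ratio < 1.0 else "positive"


# Hypothetical (dN, dS) estimates for three gene pairs.
toy_pairs = [(0.05, 0.20), (0.12, 0.30), (0.30, 0.40)]
ratios = [omega(dn, ds) for dn, ds in toy_pairs]
print(round(median(ratios), 2), selection_regime(median(ratios)))

# Medians reported in the text all fall on the purifying side of the threshold.
for reported in (0.25, 0.32, 0.41, 0.79):
    print(reported, selection_regime(reported))
```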
Phylogenetic conservation of unique genes and functional implications

The existence of homologs to U{1:1:1:1} genes in other species was searched for by BLASTp against the Uniprot database in order to define the range of conservation in other branches of the tree of life. Our results show that 26% of U{1:1:1:1} genes were specific to plants, 13% were conserved in plants and bacteria, 43% could be found in both plants and metazoa, and 18% were conserved in plants, bacteria and metazoa. This phylogenetic profile shows that 74% of U{1:1:1:1} genes were highly conserved not only in plants but also in other phyla. This situation implies an ancient origin of these genes and increases the probability of a critical function promoting their conservation. However, no evidence of shared or similar functions can be found in the fraction of U{1:1:1:1} proteins for which functional information has been inferred from sequence homologies. The fraction of unique conserved genes with a functional annotation, i.e. 60%, is the same as in all A. thaliana nuclear genes [60]. In order to get information about the function and origin of unique plant genes, we explored the predicted subcellular localization of the proteins according to their phylogenetic profile (Table 5). This work was based on the analysis of the 937 U{1:1} proteins since the 192 U{1:1:1:1} proteins constitute too small a set to obtain statistically robust results. Compared to 20,000 random A. thaliana nuclear genes, the unique plant genes having homolog(s) only in bacteria frequently encode plastidial proteins since 49.1% of them have a predicted targeting peptide specific to chloroplasts (Table 5). We observed the same tendency within the 192 U{1:1:1:1} proteins. This significant bias (Chi-squared test, P-value = 1e-5) suggests that a large part of the subset of unique conserved plant genes may come from DNA transfer from the chloroplast to the nuclear genome.
Horizontal transfer from bacteria to the plant genome can also explain a fraction of this gene subset. This gene transfer probably predated the speciation between Liliopsida and core eudicotyledons for the concerned U{1:1} genes and is close to the root of the plant phylum for the group of U{1:1:1:1} genes. Our results suggest that, after their transfer to the nucleus, these genes have been subjected to a strong selection pressure that conserved them as unique. This hypothesis is more parsimonious than many independent gene transfer events in each concerned plant species. In their 26 clusters of pan-orthologs, Zimmer et al. [78] also suggest a DNA transfer from organellar genomes, mainly from mitochondria. Our observations on the U{1:1} gene population showed that transfer from mitochondria was also significant (Chi-squared test, P-value = 0.0002) but less important than from chloroplasts (Table 5).
A second subset of U{1:1} genes with homologs in metazoa (including fungi) must have been conserved from ancient eukaryotic cells through the entire phylum and probably has a critical function. Ancient origin, low divergence rate, presence of TELO-box and dearth of TATA-box (Table 5) suggest that they are, or are related to, housekeeping genes [47,84], but no evidence could be retrieved from the Gene Ontology annotation due to the high number of unclassified genes. This metazoan conserved subset represents 28% of the 937 U{1:1} genes but, interestingly, this fraction increases to 43% in the 192 U{1:1:1:1} genes.

Conclusion

We defined 2,570 and 8,041 proteins as unique in A. thaliana and O. sativa respectively. Unique proteins, products of unique (or single-copy) genes, are proteins with no sequence motif shared by any other protein in the same species. A. thaliana unique genes can be further classified according to the number of orthologous genes found in the O. sativa genome, or vice versa. The final classification included: 451 AtU{1:0}NE, 544 AtU{1:0}E, 974 AtU{1:1}, 601 AtU{1:m}, 4,956 OsU{1:0}NE, 1,462 OsU{1:0}E, 960 OsU{1:1} and 663 OsU{1:m} genes (2). Unique genes are distributed all over the genomes, including regions with evidence for segmental duplication, suggesting that unique genes have been created by non-reciprocal local losses between two paralogous duplicated genomic regions. These non-reciprocal losses may have been directed by a selective pressure according to the structural features present in unique genes conserved in the two species (U{1:1} and U{1:m} genes). These specific features are a relatively small protein size and a high intron density that have been described as evidence of a slow evolution rate [43]. 
From a functional point of view, unique conserved genes are characterized by a rare occurrence of TATA-box and a high occurrence of TELO-box in their promoters, suggesting that unique genes could be linked to critical housekeeping functions such as protein catabolism and synthesis, RNA processing or DNA repair [47,71,84]. These results differ from previous observations which showed that genes involved in transcription regulation and signal transduction tend to be more duplicated [12,85]. Additionally, even if unique genes have been conserved in plants, no significant over-representation of TFBS related to photosynthesis or light regulation processes, such as SORLIP2 and CAAT boxes, has been found in A. thaliana and O. sativa (Table 1). Unlike conserved single copy genes, the A. thaliana and O. sativa U{1:0} genes exhibit a low intron density and a normal presence of TFBS in their promoters, and they encode proteins about 2.5 times shorter than those of all the nuclear genes. Very short proteins have been reported as proproteins, precursors of regulatory peptides [86]. Despite the fact that the function of 80% of AtU{1:0}E genes and 95% of OsU{1:0}E genes remains unknown, the analysis of the 105 AtU{1:0}E genes with annotated function seems to reinforce this hypothesis, as we have found that many AtU{1:0}E genes code for known precursors of short peptide phytohormones with signalling roles [61].

From a phylogenetic point of view, product length conservation and similar relative transcription level of the 937 pan-orthologous genes in A. thaliana and O. sativa (U{1:1}) are clear evidence of a common origin. However, intron insertion site conservation is the best proof that pairs of U{1:1} genes have evolved from a common ancestor and are not the consequence of convergence. 
This intron conservation is also evident for the 192 U{1:1:1:1} genes, where dN/dS analysis shows that those genes conserved as unique in very distant photosynthetic species are pan-orthologs under negative selection pressure that keeps them unique and at a low divergence rate. This situation reinforces the idea of a probable important conserved function. It could be suggested that the characterization of pan-orthologs (conserved single copy genes in two or more species) could be confounded by the presence of paralogs in the situation where opposite members of a pair of duplicated genes are lost in two daughter species. Nevertheless, our results about conservation of protein sizes, transcription levels and sequence conservation (dN/dS) argue that, if this is the case, the gene loss occurred before both duplicates had diverged enough to allow us to recognize them as paralogs rather than as orthologs.

The phylogenetic profiles of conserved single copy genes and the predicted subcellular location of the corresponding proteins provide additional information on the origin and the function of these particular genes. An A. thaliana subset of unique genes with homologs in plants and bacteria contains 49.1% of genes encoding proteins with targeting peptides specific to the chloroplast. This observation suggests that the origin of this subset of unique genes could be a DNA transfer from the chloroplast or a bacterial genome posterior to the eukaryote radiation. Our analysis of the conserved single copy genes, coming in addition to many duplicated gene studies, provides new information on plant gene evolution. Thus, an important part of the genes present in only one copy in present plant genomes have an ancient origin and a low divergence rate controlled by a strong selection pressure. 
The species-specific unique genes that have some structural features in common with the conserved single copy genes are probably recruited from some conserved single copy genes experiencing a rapid divergence linked to a speciation event. However, functions of many of these conserved single copy genes remain unknown. Deeper annotation of small coding sequences that may not be identified by gene finders because of the conservative nature of the prediction algorithms, as well as more experimental data, could help to decipher the biological functions of this particular gene population.

Methods

Data sources

The complete proteomes were obtained from TAIR [87] for A. thaliana (R6), TIGR [88] for O. sativa (R3), and JGI [89] for P. patens and O. lucimarinus. For A. thaliana and O. sativa, we retrieved data concerning the number of transcripts, the PFAM motifs and the promoter sequences from FLAGdb++ [90]. Expression data were obtained from CATdb [37] and Genevestigator [39].

Unique gene characterization

All the proteins encoded by the nuclear genes of each species were retrieved and those from pseudogenes were removed. To identify genes coding for proteins unique in a genome, three different filters were successively applied to the genes (Figure 1).

Conserved single copy genes

A BLASTp of the unique proteins of each species was launched against a database containing the unique protein sequences from every other species. Pairs of proteins showing an e-value lower than e-10, or up to e-5 but satisfying the condition imposed by the size ratio filter described above, were classified as conserved between the two species. Conserved proteins were then separated into two groups: the U{1:1} proteins if there was only one positive hit, or the U{1:m} proteins if there was more than one hit. 
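The hit classification described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the hits are assumed to be already parsed into (e-value, hit length) tuples, and the size ratio threshold of 0.8 is a placeholder because the actual value of the size ratio filter is not given in this section.

```python
def passes_filter(evalue, len_query, len_hit, min_size_ratio=0.8):
    """Accept strong hits outright; accept weaker hits only if sizes are similar.

    min_size_ratio is a hypothetical placeholder for the size ratio filter.
    """
    if evalue < 1e-10:
        return True
    if evalue <= 1e-5:
        ratio = min(len_query, len_hit) / max(len_query, len_hit)
        return ratio >= min_size_ratio
    return False


def classify_unique_protein(hits, len_query):
    """Classify a unique protein from its BLASTp hits in the other species.

    hits: list of (evalue, hit_length) tuples against the other species'
    unique proteins. Returns 'U{1:1}', 'U{1:m}' or 'no hit'.
    """
    n_positive = sum(passes_filter(e, len_query, length) for e, length in hits)
    if n_positive == 0:
        return "no hit"
    return "U{1:1}" if n_positive == 1 else "U{1:m}"


# Hypothetical hits for a 200-aa query protein (illustration only).
print(classify_unique_protein([(1e-40, 210)], 200))               # one strong hit
print(classify_unique_protein([(1e-40, 210), (1e-7, 195)], 200))  # two accepted hits
print(classify_unique_protein([(1e-7, 90)], 200))                 # weak hit, size mismatch
```

Proteins returned as "no hit" at this stage are not yet species-specific; in the procedure described in the text they go through a second BLASTp against all proteins of the other species.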
U{1:1} genes characterized in each species were compared to retain only reciprocal best hits (RBH), which allowed us to remove U{1:1} genes in one species that qualified as U{1:m} in the other species because of a gene splitting/fusion process. A second BLASTp was launched with those proteins without any hit against a database containing all the proteins from every other species. Applying again the same e-value and size ratio filter as described above, we clustered them as U{1:m} proteins if they had more than one hit, and as U{1:0} if they had no hit in the other species, i.e. the species-specific unique proteins.

Genomic organization of unique genes

The limits defining the boundaries of duplicated regions in A. thaliana and O. sativa genomes were retrieved from the TIGR database [88]. The even distribution of each group of unique gene pairs between the chromosomes was tested using a chi-square (χ2) test with a confidence level of 99.5% (critical values of 14.86 and 26.76 for 4 and 11 degrees of freedom, respectively).

Unique gene and protein features

All the information about genes and proteins was retrieved from the FLAGdb++ database [90]. Information includes protein lengths, number of exons, intron positions and promoter sequences (see Additional file 1). Only the genes with CDS fully covered by experimental transcript data were used (17,108 and 15,814 nuclear genes in A. thaliana and O. sativa respectively). For the analysis of promoter sequences, only genes with at least one cognate transcript covering the regions were studied (14,689 and 17,720 for A. thaliana and O. sativa respectively). Intron positions were compared after aligning protein sequences with ClustalW [91]. Intron positions were considered conserved when they diverged by no more than 5 amino acids, to take into account the minor variability in intron position found in different organisms [92]. 
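The 5-amino-acid tolerance used for intron positions can be illustrated with a short Python sketch. It assumes that intron positions have already been mapped onto the coordinates of the protein alignment; the function name and the greedy nearest-match pairing are choices made for this example and are not taken from the article.

```python
def conserved_intron_positions(positions_a, positions_b, max_shift=5):
    """Pair intron positions from two aligned proteins.

    positions_a, positions_b: intron positions in alignment coordinates
    (amino acids). Two positions are paired when they differ by at most
    max_shift residues; each position is used at most once.
    """
    pairs = []
    used = set()
    for pos_a in sorted(positions_a):
        best = None
        for pos_b in sorted(positions_b):
            if pos_b in used or abs(pos_a - pos_b) > max_shift:
                continue
            if best is None or abs(pos_a - pos_b) < abs(pos_a - best):
                best = pos_b
        if best is not None:
            used.add(best)
            pairs.append((pos_a, best))
    return pairs


# Hypothetical intron positions for one gene pair (illustration only).
at_introns = [30, 112, 250]
os_introns = [32, 120, 251, 400]
print(conserved_intron_positions(at_introns, os_introns))
```

In this example the introns at 30/32 and 250/251 are counted as conserved, whereas 112 and 120 differ by 8 residues and are not paired.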
For promoter analyses, the TSS (Transcription Start Site) was defined as the point where the 5' UTR (minimum size of 50 bp) started, and promoter sequences comprised the 1,000 nucleotides upstream from it. Positions of such well-known promoter motifs as the TATA (TATAWA consensus [93]), TELO (AAACCCTAA consensus [47]), SORLIP2 (also called motif II: GGCCA consensus [47,48]) and CAAT (CCAAT consensus [49]) boxes in each species were set with a program developed by Bernard et al. [94], capable of defining significant TFBS preferential positions in promoter regions while avoiding false positives [95]. If the TSS defines position 1, in A. thaliana preferential positions were set at: -40 to -21 for TATA-box; -60 to 140 for TELO-box; -240 to -21 for SORLIP2-box and -160 to -41 for CAAT-box. Similarly, in O. sativa TFBS were searched for in the following regions: -40 to -21 for TATA-box; -80 to 180 for TELO-box; -280 to -1 for SORLIP2-box and -200 to -1 for CAAT-box.

At and Os U{1:1} gene expression

We based our estimation of the correlation between U{1:1} gene expression in A. thaliana and O. sativa on EST/cDNA resources. The numbers of associated transcripts of each gene were normalized and logarithmically transformed for comparison purposes. Normalization avoided biases caused by both the number of transcripts available and the different number of genes for each species. The normalization established an equivalence of 1.56 transcripts in O. sativa for one transcript in A. thaliana. Comparisons of observed values were made against values from 100 random samples of 937 nuclear gene pairs. To avoid sampling biases due to genes with no or very few transcripts, we only considered the gene pairs with at least 30 cognate transcripts for each member. Furthermore, the random samples only contained protein pairs with a maximum size difference of 20 amino acids between the two members. 
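The expression comparison described above can be approximated by the following Python sketch. Only the 1.56 equivalence factor and the 30-transcript threshold come from the text; dividing the O. sativa counts by 1.56, applying the threshold to raw counts, using a base-10 logarithm and measuring the relationship with a Pearson coefficient are assumptions made for this illustration, as are all function names and example counts.

```python
import math

OS_PER_AT = 1.56       # from the text: 1.56 O. sativa transcripts per A. thaliana transcript
MIN_TRANSCRIPTS = 30   # from the text: minimum cognate transcripts for each member


def normalized_log_pairs(pairs):
    """Filter and transform (at_count, os_count) transcript counts.

    Keeps pairs where both members reach MIN_TRANSCRIPTS (assumed to apply
    to raw counts), rescales the O. sativa count to A. thaliana equivalents
    and applies a log10 transform (the base is an assumption).
    """
    out = []
    for at_count, os_count in pairs:
        if at_count >= MIN_TRANSCRIPTS and os_count >= MIN_TRANSCRIPTS:
            out.append((math.log10(at_count), math.log10(os_count / OS_PER_AT)))
    return out


def pearson(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical transcript counts for four U{1:1} gene pairs (illustration only).
counts = [(30, 47), (100, 156), (10, 500), (1000, 1560)]
points = normalized_log_pairs(counts)   # the (10, 500) pair is filtered out
print(len(points), round(pearson(points), 3))
```

The same two functions could be applied to each random sample of gene pairs to obtain the background distribution against which the observed correlation is compared.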
Phylogenetic and functional analyses

The phylogenetic evolution of unique genes was analysed from Ostreococcus lucimarinus (Prasinophyceae) to Arabidopsis thaliana, including Physcomitrella patens (Funariaceae) and Oryza sativa. With the unique gene characterization method (described above), we systematically searched for unique proteins in the available proteomes of the four species studied. Once obtained, we used them in a BLASTp search to look for O. lucimarinus unique proteins with a pan-ortholog on each branch of evolution (Figure 6).

Abbreviations

TFBS: Transcription Factor Binding Sites; MYA: Million Years Ago; RBH: Reciprocal Best Hit; TSS: Transcription Start Site; aa: amino acids.

Authors' contributions

DA conducted the analyses and drafted the manuscript. AL and SA supervised the project, contributed to data interpretation and improved the manuscript. All authors have read and approved the final version of the manuscript.

Additional file 1

Additional Table 1. Information about the 937 U{1:1} genes and proteins. Each line corresponds to a pair of orthologous genes (402K, xls).
1. Id. of the A. thaliana gene.
2. Id. of the O. sativa gene.
3. Size of the protein encoded by the A. thaliana gene (in aa). Shaded in: Grey, genes with CDS fully covered by ESTs/cDNA.
4. Size of the protein encoded by the O. sativa gene (in aa). Shaded in: Grey, genes with CDS fully covered by ESTs/cDNA.
5. Number of exons of A. thaliana gene. Shaded in: Grey, genes with CDS fully covered by ESTs/cDNA.
6. Number of exons of O. sativa gene. Shaded in: Grey, genes with CDS fully covered by ESTs/cDNA.
7. Percentage of similarity between A. thaliana and O. sativa proteins after a ClustalW alignment.
8. Protein targeting according to predicted signal peptide in A. thaliana. Shaded in: Light Red: mitochondria; Blue: nucleus; Light Purple: endoplasmic reticulum; Light Green: plastid.
9. Protein targeting according to predicted signal peptide in O. sativa. Shaded in: Light Red: mitochondria; Blue: nucleus; Light Purple: endoplasmic reticulum; Light Green: plastid.
10. TATA-box within the promoter sequence of the A. thaliana gene. Shaded in: Red: no presence; Green: presence.
11. TATA-box within the promoter sequence of the O. sativa gene. Shaded in: Red: no presence; Green: presence.
12. TELO-box within the promoter sequence of the A. thaliana gene. Shaded in: Red: no presence; Green: presence.
13. TELO-box within the promoter sequence of the O. sativa gene. Shaded in: Red: no presence; Green: presence.
14. SORLIP2-box within the promoter sequence of the A. thaliana gene. Shaded in: Red: no presence; Green: presence.
15. SORLIP2-box within the promoter sequence of the O. sativa gene. Shaded in: Red: no presence; Green: presence.
16. CAAT-box within the promoter sequence of the A. thaliana gene. Shaded in: Red: no presence; Green: presence.
17. CAAT-box within the promoter sequence of the O. sativa gene. Shaded in: Red: no presence; Green: presence.
18. Gene conservation in P. patens according to BLASTp results. Shaded in: Green: conservation of both species genes; Yellow: conservation of A. thaliana gene; Blue: conservation of O. sativa gene.
19. Gene conservation in O. lucimarinus according to BLASTp results. Shaded in: Green: conservation of both species genes; Yellow: conservation of A. thaliana gene; Blue: conservation of O. sativa gene.
20. Phylogenetic conservation of genes in plant, bacteria and metazoa taxa. Results based on BLASTp of A. thaliana protein sequence against the Uniprot database.
21. Gene function based on annotated functions retrieved from BLASTp results on the Uniprot database.

Acknowledgements

We are grateful to Jean-Loup Risler for providing A. thaliana unique genes defined in the PHYTOPROT resource and Vincent Thareau for providing PFAM motifs for O. lucimarinus and P. patens proteomes. We acknowledge Marie-Laure Martin-Magniette for her statistical advice, Virginie Bernard for TFBS definitions, Joan Sobota for correcting the manuscript and the referees for their helpful comments. This work is supported by a European Marie Curie grant to DA.

References