Copyright © 2007, American Society of Plant Biologists

Plant Biotechnology, Faculty of Biology, University of Freiburg, D–79104 Freiburg, Germany

*Corresponding author; e-mail stefan.rensing/at/biologie.uni-freiburg.de; fax 49–761–203–6945.
2 These authors contributed equally to the paper.

Received January 10, 2007; Accepted February 19, 2007.

This article has been cited by other articles in PMC.

Abstract

Diversification of transcription-associated protein (TAP) families during land plant evolution is a key process yielding increased complexity of plant life. Understanding the evolutionary relationships between these genes is crucial to gain insight into plant evolution. We have determined a substantial set of TAPs that are focused on, but not limited to, land plants using PSI-BLAST searches and subsequent filtering and clustering steps. Phylogenies were created in an automated way using a combination of distance and maximum likelihood methods. Comparison of the data to previously published work confirmed their accuracy and usefulness for the majority of gene families. Evidence is presented that the flowering plant apical stem cell regulator WUSCHEL evolved from an ancestral homeobox gene that was already present after the water-to-land transition. The presence of distinct expanded gene families, such as COP1 and HIT in moss, is discussed within the evolutionary backdrop. Comparative analyses revealed that almost all angiosperm transcription factor families were already present in the earliest land plants, whereas many are missing among unicellular algae. A global analysis not only of transcription factors but also of transcriptional regulators and novel putative families is presented. A wealth of data about plant TAP families and all data accrued throughout their automated detection and analysis are made available via the PlanTAPDB Web interface. Evolutionary relationships of these genes are readily accessible to the nonexpert at a mouse-click.
Initial analyses of selected gene families revealed that PlanTAPDB can readily be used for knowledge discovery.

The coordinated expression control of the entirety of genes in a given cell determines its physiological state, morphology, and identity in the organism. Reprogramming the set of transcribed genes during development or physiological adaptation requires modulated activation and deactivation of regulatory factors. In eukaryotes, the transcription of protein-coding genes is controlled by complex networks of transcription-associated proteins (TAPs). Specific transcription factors (TFs) activate or repress transcription of their target genes by binding to cis-active elements. Further transcriptional regulators (TRs) include the following: (1) coactivators and corepressors, which bind and influence TFs; (2) general transcription initiation factors, which recognize core promoter elements and recruit components of the basal transcription machinery; and (3) chromatin remodeling factors, which affect the accessibility of DNA through histone modifications and DNA methylation. The modular nature of TFs, possessing DNA-binding and protein-protein interaction domains, facilitates the high diversity of transcriptional regulation. Changes in transcriptional regulation enhance complexity at the genetic level and thus can generate novel signal transduction pathways. Such changes, mediated by recombined complexes of regulatory proteins as well as by altered regulatory sequence elements, were repeatedly proposed to be a major driving force of evolution (Doebley and Lukens, 1998; Tautz, 2000; Hsia and McGinnis, 2003; Levine and Tjian, 2003; Gutierrez et al., 2004; Carroll, 2005). Previous studies have shown that TAPs are highly specific across prokaryotic and eukaryotic lineages and that their diversity appears to be linked to their phylogenetic distance (Coulson et al., 2001; Coulson and Ouzounis, 2003).
In eukaryotes, key players of the basal transcription machinery are highly conserved, whereas many families of DNA-binding TFs are taxon specific and show substantial sequence diversity (Coulson and Ouzounis, 2003). Moreover, the size and genomic fraction of TF families seem to correlate with cellular complexity (Levine and Tjian, 2003). The evolution of eukaryotic TF genes involves the processes of specific amplification of common families through duplication and diversification, as well as the shuffling of functional domains, resulting in lineage-specific families that can facilitate novel networks of protein-protein interactions and can take over new functions. In plants, the evolution and expansion of specific gene families seem to be more pronounced than in other eukaryotes (Lespinet et al., 2002). In Arabidopsis (Arabidopsis thaliana), genes involved in transcriptional regulation were preferentially retained following whole-genome duplications (Blanc and Wolfe, 2004; Seoighe and Gehring, 2004). It could be demonstrated that TF genes show a higher duplicability as well as retention rate in seed plants compared to other crown eukaryotes and other plant genes (Shiu et al., 2005), which results in considerable lineage-specific expansion of distinct TF families in plants. Consequently, 45% of the TF genes in Arabidopsis were found to belong to families that are specific to plants (Riechmann et al., 2000). Evidence that many plant-specific proteins resemble TFs (Gutierrez et al., 2004) further supports the assumption that the increase of complexity in transcriptional regulation mechanisms has been crucial for the evolution of plants. In recent years, much emphasis was placed on the understanding of regulatory networks controlling the transcription of genes. Genome-wide comparative analyses aid in revealing the evolution of transcriptional regulation that underlies the diversity of organisms. 
TAP genes and transcriptional networks have been studied extensively in unicellular organisms (e.g. Kyrpides and Ouzounis, 1999; Perez-Rueda et al., 2004; Madan Babu et al., 2006), as well as in basal metazoans (Satou and Satoh, 2005; Larroux et al., 2006) and crown eukaryotes (Messina et al., 2004; Reece-Hoyes et al., 2005). Within the plant kingdom, only two seed plants, Arabidopsis and rice (Oryza sativa), were globally investigated (for review, see Qu and Zhu, 2006) and their TAP gene families compared to those of the unicellular green alga Chlamydomonas reinhardtii, fungi, and metazoans (Riechmann et al., 2000; Shiu et al., 2005). Little is known about TAPs in nonseed plants, like the bryophyte Physcomitrella patens, and no genome-wide compendium of its TAP genes is available, as is the case for nongreen algae. While phylogenetic studies have been carried out for single TAP families, e.g. sigma factors, LEAFY (LFY)/FLO, MADS, and AP2 (Ichikawa et al., 2004; Maizel et al., 2005; Riese et al., 2005; Shigyo et al., 2006), a large-scale phylogenetic analysis of TAP gene families from nonseed plants is still lacking. Here, we investigated and compared plant TAP gene families on a genome-wide scale across species of all three domains of life to gain insight into the evolution of transcriptional regulation in plants. We covered the whole evolutionary range from unicellular algae through bryophytes to angiosperms by including genomic-scale sequence data of the diatom Thalassiosira pseudonana, the red alga Cyanidioschyzon merolae, the green alga C. reinhardtii, the moss P. patens, the monocot rice, and the dicot Arabidopsis. The moss P. patens diverged from the ancestor of extant flowering plants at least 450 million years ago (Theissen et al., 2001; Hedges et al., 2004). It was chosen as an offset for this study because, in comparison with flowering plants, it might enable inference of the ancestral state of land plant transcriptional regulation. 
A comprehensive analysis of gene families can be performed using the large collection of clustered expressed sequence tag (EST) data (Rensing et al., 2002; Lang et al., 2005). Starting from the complete set of P. patens candidate TAP genes, we collected homologs using PSI-BLAST and carried out automated filtering and clustering procedures, followed by manual annotation. From the resulting ample pool of TAP genes, taxonomic distribution, lineage-specific expansion, and high-quality phylogenies were inferred.

RESULTS AND DISCUSSION

Availability: All resources are available via the PlanTAPDB Web interface (http://www.cosmoss.org/bm/plantapdb).

Compilation of the Query Dataset

In terms of evolution, mosses are located half way between seed plants and algae and were therefore chosen as an offset for the global phylogenetic analysis of plant TAPs. In addition, mosses morphologically resemble the first plants that occupied the land (Kenrick and Crane, 1997). In the moss P. patens, a total of 1,592 putative TAPs (PTs) were identified from a comprehensive clustered and annotated EST database (Lang et al., 2005) by two strategies: (1) TBLASTN searches with plant and algae reference TAPs compiled by relaxed keyword searches, and (2) motif scans for transcription-associated domains. The resulting comprehensive set of candidate moss TAPs included nearly all TF families known from seed plants (http://arabtfdb.bio.uni-potsdam.de/v1.1/, http://ricetfdb.bio.uni-potsdam.de/v2.1/; Riechmann et al., 2000; Guo et al., 2005; Gao et al., 2006), as well as sequences putatively encoding TAPs. False-positive sequences introduced by this compilation of queries were later removed during the annotation process. To avoid potentially fragmentary virtual transcripts, we determined the full-length closest homolog for each of the moss candidate TAPs to be used subsequently as seed query sequence.
For a homolog to be considered, its BLASTX match needed to be in the same frame as the original annotation of the moss candidate transcript and its predicted open reading frame (ORF). A closest homolog could be determined for about 99% of the 1,592 P. patens candidate TAPs. For 19 of the candidate sequences, no homolog was found, yet 12 of those were included into the seed query set because they contained a predicted ORF. The complete nonredundant set of closest homologs used for PSI-BLAST searches comprises 1,162 sequences (Fig. 1

Filtering and Clustering of PSI-BLAST Results

During the PSI-BLAST searches, 369,118 hits were generated, representing a total of 144,941 distinct protein sequences (Fig. 1

Redundancy Removal and Homology Reduction

While it greatly improves taxon sampling, the strategy to use both a huge multispecies-containing database like UniProt and the individual whole-genome protein predictions results in the detection of identical protein sequences from these overlapping databases. In addition, the same locus is often represented by more than one protein sequence due to divergent predicted gene models, splice variants, as well as sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses using an identity cutoff of ≥99% for sequences of the same species. The total number of cluster members was thus reduced by 30%, resulting in 42,133 total sequences, 37,247 of which are distinct. In addition, a homology-reduced set of the 540 clusters was compiled to infer phylogenies (Fig. 1

Multiple Alignments and Selection of Conserved Sites

Due to errors introduced by the alignment algorithm, a certain fraction of columns in a multiple sequence alignment (MSA) generates noise that disturbs correct inference of phylogenetic relationships (Castresana, 2000; Rosenberg, 2005). Such positions are usually removed manually in the course of a phylogenetic analysis. While current approaches to automated phylogenies (Sicheritz-Ponten and Andersson, 2001; Fuellen et al., 2003; Frickey and Lupas, 2004; Gouret et al., 2005) mostly rely on unprocessed ClustalW alignments, we placed more emphasis on the alignment quality to increase the reliability of the resulting phylogenies. Thus, we used a measure that describes evolutionary informative sites. We implemented a best-of-two approach, during which two alignments were (1) calculated using different state-of-the-art algorithms and (2) filtered using the sum-of-pairs score.
In the next step, the alignment with the maximum number of remaining columns was chosen (Fig. 1

Automated Reconstruction of High-Quality Phylogenies

Many approaches to phylogenomics rely solely on a distance approach using neighbor joining (NJ; Saitou and Nei, 1987). However, NJ is known to be susceptible to noisy data, provides no confidence measures, and makes it hard to compute reliable distances for strongly divergent sequences. Probabilistic approaches, such as maximum likelihood (ML) and Bayesian methods, are known to overcome most of these problems, but both are very time consuming and thus usually not applied in large-scale phylogenomics approaches. We followed a combined approach by calculating ML consensus branch lengths using gamma-distributed rates from bootstrapped NJ topologies (Fig. 1

Cluster Annotation

The functional annotation of the 540 candidate TAP clusters was inferred from identified InterPro domains and associated Gene Ontology (GO) terms (Camon et al., 2004) of the cluster members after redundancy removal. A total of 482 out of 540 clusters contained one or more InterPro domains with a relative occurrence of ≥80% among the nonredundant cluster members. While those were used for automated annotation, clusters with uncertain domain occurrence were manually checked and annotated. In total, only three clusters were composed of sequences from multiple unrelated TAP families. These large mixed clusters were formed due to shared DNA-binding or protein-protein interaction domains (IPR001487 Bromodomain, IPR002110 Ankyrin, IPR002713 FF domain) and were not further considered in this study. Members of 237 clusters are not directly associated with transcriptional regulation but function in related processes, such as DNA and RNA metabolism, and were also not further considered. They derive from the loose initial query selection, which was intended to include as many novel TAP families as possible.
The vast majority (94%) of the remaining 300 annotated TAP clusters (Fig. 1

TAP Gene Family Annotation

TAP clusters with the same functional annotation (main and subfamily), which had not been merged during single linkage clustering due to the stringent parameters applied there, were manually grouped, resulting in 138 families of TAPs (Supplemental Table S1). This resulted in a total number of 14,680 nonredundant TAP family members, while the remaining overlap among the families was reduced to 3.6% (14,157 distinct nonredundant family members; Fig. 1

We divided the TAP families into three categories according to their molecular function and associated GO terms: (1) DNA-binding TFs (59), which comprise direct activators or repressors of transcription; (2) TRs (56), comprising basal TFs interacting with RNA polymerase II or the core promoter, coactivators/corepressors, and chromatin remodeling factors; and (3) proteins with unknown function and/or domains that are possibly associated with transcriptional regulation (PT, 23; Fig. 1

Previously, plant TF gene families were globally identified in two seed plants, Arabidopsis and rice (http://arabtfdb.bio.uni-potsdam.de/v1.1/, http://ricetfdb.bio.uni-potsdam.de/v2.1/; Riechmann et al., 2000; Guo et al., 2005; Gao et al., 2006). Of the previously described TF families, just 14 are not present among the annotated families due to their absence from the P. patens candidate TAP set. However, eight of those (AS2/LOB, BES1, BZR, GeBP, GFR/ENBP, HRT-like, TCP, VOZ) could be identified in the whole-genome shotgun sequences produced by the U.S. Department of Energy Joint Genome Institute (http://www.jgi.doe.gov/), which became available recently, i.e. they were not covered by the clustered EST database used for query compilation. This confirms earlier estimates (Rensing et al., 2002) and shows that the EST data cover the P. patens transcriptome almost completely (in terms of TAP families, the coverage is 95%).
For the other six missing TF families (C2C2-YABBY, NOZZLE [NZZ], PBF-2-like/Whirly, S1Fa-like, STERILE APETALA [SAP], ULTRAPETALA [ULT]), no homologs could be identified in the P. patens genomic traces. This might be due to the actual lack of these genes in the P. patens genome (which might also be a derived feature, i.e. secondary gene loss) or because differing rates of mutation fixation render detection using only homology searches impossible. Yet, the above-mentioned results demonstrate that using moss as an offset to identify a broad scope of plant TAPs is a valid approach, as only 4% of angiosperm TF families are unaccounted for. Furthermore, it provides evidence that the majority of flowering plant TF families can be tracked down to the basal land plant P. patens. The above-mentioned TF gene families that are absent from moss are all of small size and have specialized functions in flowering plants. They probably emerged after the evolutionary split of mosses and seed plants. The vegetative and reproductive development of flowering plants is entirely different from that of mosses, the life cycle of which is dominated by a haploid gametophytic phase. They do not possess flowers, the organs for sexual reproduction of angiosperms. While mosses do contain homologs of some angiosperm (floral) homeotic genes, like KNOX (TF031) and MIKC-type MADS box (TF041), their function remains unclear (Theissen et al., 2001). On the other hand, NZZ, SAP, and ULT all play specific roles during development of flowers (Byzova et al., 1999; Schiefthaler et al., 1999; Carles et al., 2005) and are absent from P. patens. The C2C2 zinc finger protein YABBY is expressed in a polar manner and specifies the abaxial identity of lateral organs of the Arabidopsis sporophyte (Siegfried et al., 1999), while the moss sporophyte is extensively reduced and possesses no lateral organs. 
Likewise, spinach (Spinacia oleracea) S1F mRNA accumulates in roots and etiolated seedlings (Zhou et al., 1995), while both tissues are not present as such in P. patens. Hence, absence of these specialized TF families from a basal land plant seems plausible.

Coverage of Known TAP Families

To analyze the level of completeness of our dataset, we compared numbers of PlanTAPDB family members with the size of well-known Arabidopsis TAP families. In Supplemental Table S2, those PlanTAP families that were previously described by Riechmann and colleagues (Riechmann et al., 2000) and/or are included in the current version (Version 2; July 2006) of DATF (Guo et al., 2005) are listed. To allow comparison of PlanTAPDB Arabidopsis members with these resources, only those member sequences corresponding to The Institute for Genomic Research (TIGR) Arabidopsis loci (loci themselves or those replaced by redundant UniProt sequences) were counted. The numbers shown were ascertained immediately after filtering and clustering, as well as after redundancy removal and homology reduction (Supplemental Table S2). Fortunately, the step of redundancy removal in no case accidentally reduced the number of detected Arabidopsis loci. As expected, the homology reduction leads to a decrease in size of large families. The coverage of a minority of Arabidopsis TAP families by PlanTAPDB differs significantly due to possible annotation errors within the different resources (e.g. the C3H family, which probably also includes RNA-binding C3H zinc fingers). Taken together, the data illustrate that the PSI-BLAST approach is able to recover most of the members for the majority of gene families. However, especially in gene families with low sequence conservation apart from functional domains (e.g. MADS, HB), a significant number of family members might be missing. This is an inevitable shortcoming of this automated approach for the discovery of gene families.
Nevertheless, on average the filtered and nonredundant Arabidopsis loci as present in PlanTAPDB cover 81% of the previously published gene family members.

Web Interface

The PlanTAPDB Web interface (http://www.cosmoss.org/bm/plantapdb) provides dynamic access to the results generated in this study. TAP gene families can be retrieved by their accession numbers and identifiers or queried via keyword searches among the family annotations. In addition, all 37,247 TAP cluster sequences (Fig. 1

Different Expansion of TAP Gene Families among Algae and Plant Lineages

Previous global comparative studies of plant TAP gene families focused mainly on the subgroup of DNA-binding TFs in seed plants (for review, see Qu and Zhu, 2006). On the basis of the PlanTAPDB data, we compared characteristics of plant TAP gene families across six species, for which genome-scale databases were queried during homolog detection. These included three algae, a moss, and two flowering plants to provide a broad evolutionary perspective. The total number of distinct TFs, TRs, and PTs of these species was extracted using the taxonomic annotation of the family members. The numbers of TFs detected by the approach presented here are smaller than previously published results for Arabidopsis, rice (Xiong et al., 2005; Gao et al., 2006; Qu and Zhu, 2006), and C. reinhardtii (http://chlamytfdb.bio.uni-potsdam.de/v2.0/), which is due to the stringent filtering process applied to prevent false-positive hits. There seems to be a trend that total amounts of TAPs (Fig. 2
The gene family data (Fig. 3A

Species-Specific Expansion of Individual TAP Families

The absolute size of the 138 annotated TAP families for the above-mentioned six species is shown in Supplemental Table S1. The size distribution of the Arabidopsis TF gene families correlates well with published results (Qu and Zhu, 2006), although the families are generally smaller due to the stringent elimination of false-positive hits applied in this study. The overall lineage-specific expansion of family number and size is evident from Figure 3

As an example, members of a distinct branch of the His triad family (TF033, HIT) known from animals (Kijas et al., 2006) and fungi are only present in rice and moss. Interestingly, the human HIT protein Aprataxin, which belongs to this family, has recently been shown to be involved in the protection against genotoxic stress by interaction with proteins that are involved in DNA repair (Gueven et al., 2004). Apparently, the forefather of this particular gene was already present in ancestral eukaryotes but has been lost in some plant and algal lineages. The P. patens Aprataxin-like protein might be involved as an upstream component of DNA mismatch repair (Trouiller et al., 2006) and thus might be related to the high efficiency of homologous recombination observed in the moss (Kamisugi et al., 2005).

Taxonomic Distribution of Plant TAP Families across All Domains of Life

For visualization of the distribution of TAP family members across all taxonomic lineages, a taxonomic profile was created and is presented as a heat map in Figure 4

The WUSCHEL/WOX Phylogeny

The HB/WUSCHEL (WUS) family (TF032_373) exhibits a rigorous land plant-specific taxonomic profile, comprising the species Arabidopsis, tomato (Solanum lycopersicum), poplar (Populus spp.), rice, and P. patens. The consensus domains for this family are Homeobox (IPR001356), Homeodomain_like (IPR009057), and Homeodomain-rel (IPR012287). During redundancy filtering, 10 nearly identical sequences belonging to Arabidopsis, rice, and poplar were removed. The average identity between the remaining sequences is relatively low (36.26%); therefore, the alignment was reduced from an initial 950 columns to 167 columns that could be unequivocally aligned, comprising mainly the actual homeobox domain. Due to the low conservation grade of the WUS-related (WOX) gene family (e.g. 30.6% amino acid identity between Arabidopsis WOX9 and WOX14), several annotated homologs were not detected by the PSI-BLAST search and thus are missing from the above-mentioned phylogeny. To add those, all annotated Arabidopsis WUS/WOX sequences were retrieved from Swissprot. After retrieval of the remainder of the sequences using the PlanTAPDB Web interface, MSA and tree reconstruction were performed. The phylogeny is available via the Web interface as well, as an example for manually curated data to be added upon request. The resulting tree (Fig. 5

The COP1 Phylogeny

The three uppermost clusters of the taxonomic profile (Fig. 4

Caveats

PlanTAPDB users should be aware that the automated homolog detection and clustering approach resulted in the loss of some gene families, i.e. a low percentage (approximately 4%) of plant TAP families is missing. In addition, on average 19% of the gene family members known from well-annotated genomes are lacking. To present phylogenetic trees that can be viewed on a normal computer screen, large gene families have been reduced to contain a maximum of 150 homology-condensed members. Due to the fragmentary nature of the data (incomplete genome/transcriptome data, fragmentary sequences, sampling bias), the phylogenetic analyses might be biased or flawed. Taken together, users should keep the points raised above in mind while interpreting the data.

Potential Uses

The PlanTAPDB resource might be used as a starting point for knowledge discovery. Using the family and cluster annotation available through the Web interface, designated gene families can be located, e.g. by name or member sequence accession number. MSAs of the gene families as well as arbitrary sequence subsets can be retrieved. The taxonomic profile (Fig. 4

CONCLUSION

So far, most comparative analyses dealing with plant TAPs have focused on TFs of Arabidopsis and rice. To broaden our evolutionary understanding of transcriptional regulation in plants, we have included three algae and a moss in the present analysis, as well as the complete UniProt database. In addition, we have analyzed both TFs and TRs, and have detected several novel PT families. Using automated methods, a stringent detection and representation of gene clusters has been established that can easily be expanded to cover more genomes in the future, while manual curation of gene clusters into families assures their quality.
High-quality phylogenetic trees were created from these clusters and are available through an easy-to-use Web interface together with a multitude of accompanying data, such as alignments, domain-based family annotation, and taxonomic profiles. Instant knowledge discovery using the PlanTAPDB is straightforward, as has been demonstrated using several examples. In addition, such comparative data can be applied to aid phylogenomics. The general expansion of both the total number of TAP genes and the number of TAP families seems to coincide with organism complexity. A dramatic increase in the complexity of transcriptional regulation, particularly at the level of TFs, might have occurred after the development of multicellularity or the transition from water to land. Subsequently, during land plant evolution, the intricacy of the previously established TF families increased again, possibly reflecting large-scale morphological and physiological changes paralleling angiosperm radiation. Apart from these general trends, distinct TAP gene families were subject to expansion in individual species. Interesting details about the evolution of the stem cell regulator WUS, the photomorphogenesis switch COP1, and the genotoxic stress-related HIT gene family were revealed.

MATERIALS AND METHODS

Sequence Datasets

For the identification of Physcomitrella patens transcription-associated EST sequences, National Center for Biotechnology Information (NCBI) Entrez (Geer and Sayers, 2003) was utilized to query GenPept (Benton, 1990) Release 141. The Arabidopsis Information Resource (TAIR; Rhee et al., 2003) resources were searched via keyword. GenPept Release 151 and the TIGR Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) predicted proteins (see below) were used for the closest homolog determination.
For the collection of homologs throughout the available protein space using PSI-BLAST, the UniProt Knowledgebase Release 7.1 (http://www.ebi.uniprot.org/database/download.shtml) was used. In addition, the following organism-specific protein databases were included.

Arabidopsis: 28,952 predicted proteins, TIGR ATH1.pep 01/04 (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/a_thaliana/annotation_dbs/ATH1.pep).
Rice: 88,149 predicted proteins, TIGR OSA1.pep 04/04 (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_2.0/).
P. patens: ESTs were clustered and assembled according to Lang et al. (2005), http://www.cosmoss.org Release 03/04. For the resulting virtual transcripts, ORFs were predicted using FrameD (Schiex et al., 2003) and ESTscan 2.0 (Iseli et al., 1999) with P. patens-specific models, yielding a total of 52,458 ORFs.
Thalassiosira pseudonana: 11,397 predicted proteins from Release 1.0, Department of Energy Joint Genome Institute (http://genome.jgi-psf.org/thaps1/thaps1.download.ftp).
Cyanidioschyzon merolae: 5,013 translated mRNAs, Release 11/04 (http://merolae.biol.s.u-tokyo.ac.jp/download/cds_nt.fasta).
Chlamydomonas reinhardtii: 19,832 predicted proteins from Release 2.0, Department of Energy Joint Genome Institute (http://genome.jgi-psf.org/chlre2/chlre2.download.ftp).

For the calculation of Figure 2B

Software

The results and resources presented here were generated using an automated phylogeny pipeline that utilizes BLAST and PSI-BLAST (Altschul et al., 1997), InterProScan 4.2 (Quevillon et al., 2005), EMBOSS 3.0.0 (Rice et al., 2000), MAFFT 5.8 (Katoh et al., 2005), ProbCons 1.1 (Do et al., 2005), Muscle 3.52 (Edgar, 2004), Phylip 3.65 (Felsenstein, 1989), Tree-Puzzle 5.2 (Schmidt et al., 2002), a modified version of the puzzleboot script (http://www.tree-puzzle.de/puzzlebootREADME.txt), and the PostgreSQL 8.0.8 (http://www.postgresql.org) relational database.
This so-called TreePipe is able to construct phylogenetic trees for large datasets without manual interference and is implemented with Perl 5.8.7 (http://www.perl.com), SQL, and shell scripts, making use of the Bioperl CVS “live” branch (Stajich et al., 2002) and the Bio::Phylo Version 0.09 (http://search.cpan.org/~rvosa/Bio-Phylo-0.09/) packages. The extensive data that are collected throughout the pipeline are stored in a relational database schema developed for this project, called TreePipeDB. The PlanTAPDB Web interface is implemented using mod_perl 2.0 (http://perl.apache.org/) and JavaScript with the TreePipeDB as backend. For the interactive exploration of MSA and phylogenetic trees, we integrated the Jalview multiple alignment editor 2.08.1 (http://www.jalview.org/) and ATV phylogenetic tree viewer 2.0 BETA (http://www.phylogenomics.us/atv/) Java applets.

Identification of the TAP Query Set

NCBI GenBank was queried using the keywords “transcription factor,” “transcription activator,” “transcription repressor,” and “transcription regulator,” as well as taxon IDs of Viridiplantae and nongreen algae (txids 33090, 136419, 3027, 33682, 38254, 2830, 2763, 33634). Additionally, Arabidopsis loci were extracted from TAIR matching the keyword “transcription factor.” With this reference set of 7,476 TAPs, the clustered P. patens EST sequences were searched by TBLASTN. A total of 286 PFAM HMM profiles and 67 PROSITE patterns of transcription-associated domains without taxonomic restriction were used for motif searches in the same database. A total of 1,592 nonredundant P. patens candidate TAP sequences were identified. Full-length closest homologs of the 1,592 moss candidate TAP transcripts were determined via BLASTX (Altschul et al., 1997) with an E-value cutoff of 1E-3 against GenPept and the TIGR Arabidopsis and rice predicted protein databases.
The resulting hits were filtered using an alignment length and percent identity threshold of 50 amino acids and 25%, respectively.

PSI-BLAST Searches and Filtering of the Results

PSI-BLAST searches were performed against the UniProt Knowledgebase, all available whole-genome predicted protein databases of plants and algae, and the predicted ORFs of the P. patens virtual transcripts using an E-value threshold of 1E-4, a profile inclusion threshold of 1E-5, and four iterations. Up to 500 results per query were considered and parsed into the TreePipeDB. Each result set (composed of one query and its hits after one of the four PSI-BLAST iterations) was run through a series of six filter steps with increasing stringency concerning the length and percent identity of the PSI-BLAST matches (step 1: 25% identity/50-amino acid alignment length; step 2: 30%/60 amino acids; step 3: 35%/80 amino acids; step 4: 45%/100 amino acids; step 5: 45%/150 amino acids; step 6: 45%/300-amino acid length). For each query and iteration, the filtering process determines the first filtering step that reduces the result set to ≤50 and ≥5 members. Afterward, the optimal iteration (plus determined filtering step) is chosen for each query, using a set of sequentially applied criteria: (1) the most stringent possible filtering step, (2) the maximal number of remaining sequences, and (3) the lowest iteration step (in order to select result sets with low amounts of false-positive hits).

Clustering of the Filtered Result Sets

Single-linkage clustering using a stringent hit-coverage-based distance measure was implemented in Perl and the TreePipeDB backend. Result sets of two queries were merged if they shared at least one hit covering the same region of this hit sequence. The length of the region to be shared depends on the previously selected filter step, namely, the most stringent filter step possible. For example, suppose result set A overlaps with B on hit X, where A was filtered using step 6 and B using step 5.
Hence, A and B can be merged into a cluster only if they overlap by at least 300 amino acids (step 6 criterion) on sequence X. Result sets without any significant overlaps were added as single-query clusters. For all cluster members, the corresponding NCBI taxonomy annotation was retrieved and stored in TreePipeDB.

Redundancy Removal and Homology Reduction

For the removal of redundant sequences, an MSA was performed using MAFFT FFT-NS-2 and pairwise distances were calculated using the EMBOSS distmat program. This alignment was used to infer initial phylogenies of the complete clusters. The resulting matrix was scanned for sequence pairs from the same species with a distance ≤1 substitution per 100 amino acids. For each pair, one representative was selected based on the originating database (UniProt sequences were preferred), sequence length, and lexical sort order of the accession number. The procedure was implemented in Perl using several Bioperl modules, including a modified version of the Bio::Tools::Run::Alignment::MAFFT module. For the parsing of the distmat distance matrices, an object-oriented Bioperl module (Bio::Matrix::IO::distmat) was written. Homology reduction was implemented in the same program but follows a different strategy. Beginning with 1 substitution per 100 amino acids and heuristically increasing this distance threshold, the distance matrix is iteratively scanned for sequence pairs with the respective distance, regardless of their species. The iteration stops when the remaining representative cluster members reach a given limit (150 sequences).

Multiple Alignments and Selection of Informative Sites

Multiple alignments for a given cluster were performed using MAFFT G-INSI and ProbCons (clusters ≤150) or Muscle (clusters >150). Subsequently, sum-of-pairs scores using the BLOSUM62 substitution matrix, gap ratios, and Shannon's entropy scores (Valdar, 2002) were calculated and recorded columnwise in the TreePipeDB.
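As a rough illustration of the columnwise scoring, the following Python sketch computes the gap ratio and a plain Shannon entropy for each alignment column. It is not the Perl program used in the pipeline: the function names, the toy alignment, and the choice to ignore gaps when computing entropy are assumptions made for this example, and the BLOSUM62 sum-of-pairs score is left out because it requires the substitution matrix.

```python
import math
from collections import Counter

GAP = "-"  # gap symbol assumed for this sketch


def gap_ratio(column):
    """Fraction of gap characters in one alignment column."""
    return column.count(GAP) / len(column)


def shannon_entropy(column):
    """Shannon entropy (in bits) of the non-gap residues in one column."""
    residues = [c for c in column if c != GAP]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # p * log2(1/p) summed over the observed residue types
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def column_scores(alignment):
    """Return a list of (column index, gap ratio, entropy) tuples.

    `alignment` is a list of aligned sequences of equal length.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for i in range(length):
        column = "".join(seq[i] for seq in alignment)
        scores.append((i, gap_ratio(column), shannon_entropy(column)))
    return scores


# Hypothetical toy alignment, not data from the study
msa = ["MKV-L", "MKI-L", "MRVAL", "MKVAF"]
scores = column_scores(msa)
```

A fully conserved column yields an entropy of 0, and higher values indicate more variable columns; such per-column values can then be stored or thresholded.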
Finally, columns below a sum-of-pairs score of −2 were excised from the alignment. The procedure was implemented in a Perl program, which, besides the filtering of a given MSA, also produced overview graphics of the different scores along the overall alignment.

Reconstruction of Phylogenies of the Representative Cluster Members

Phylogenies for the representative cluster members were inferred using a Perl program on all clusters. After generation of 100 bootstrapped alignments using seqboot from the PHYLIP package, ML distance matrices were computed for these alignments using puzzleboot as implemented in Tree-Puzzle. These distance matrices were then used to infer topologies by applying the NJ algorithm as implemented in PHYLIP's neighbor program. Afterward, the resulting 100 trees were used to create a ML consensus topology using Tree-Puzzle. For the two steps where Tree-Puzzle was used to compute maximum likelihoods, eight gamma-distributed rates were used to model mutation rate heterogeneity and full (exact) ML parameter estimation was performed for each gene family. Manual ML trees were created using the same parameter settings. The WAG (Whelan and Goldman, 2001) evolutionary model of sequence evolution, which is derived from a database of globular proteins, was used. The resulting phylogenetic tree offered both an overall confidence value, i.e. the ML of the tree, and confidence values for every branch in the form of bootstrap values. Finally, the trees were parsed and midpoint-rooted via an additional Perl program that also collects a large variety of parameters from the tree topologies using both Bioperl and the Bio::Phylo modules (e.g. the longest internal branch, the Fiala stemminess [Fiala and Sokal, 1985], and the resolution) and writes them into the TreePipeDB. The initial phylogenies for the complete clusters were inferred in analogy to the procedure described above, using a Perl wrapper combining the PHYLIP tools seqboot and neighbor.
However, in this case JTT distances (Jones et al., 1992) were calculated with PHYLIP's protdist and consensus trees with consense to cope with the runtime demands of clusters up to 1,182 members. Finally, the consensus topologies were used to estimate ML branch lengths with the user-tree option of Tree-Puzzle, using uniform rates and exact parameter estimation.

Cluster and Gene Family Annotation

The nonredundant cluster member sequences were annotated using InterProScan 4.2 with all available databases of the InterPro Release 12.1. The annotated domains and associated GO terms were stored in the TreePipeDB. InterProScan searches (Quevillon et al., 2005) were performed for the 37,247 distinct cluster members after redundancy removal. A total of 99.8% of the sequences could be annotated with InterPro domains. Sixty-two percent of the domains found were from the PANTHER (Mi et al., 2005), PFAM (Finn et al., 2006), and PROSITE (Hulo et al., 2006) databases. Manual curation was performed by inspection of the description lines of the enclosed UniProt sequences and by inferring the classification of Arabidopsis cluster members from DATF (Guo et al., 2005) and ArabTFDB (http://arabtfdb.bio.uni-potsdam.de/v1.1/). To further assign thus far undetected TAP families, their corresponding Arabidopsis and rice members collected from DATF (Guo et al., 2005), ArabTFDB (http://arabtfdb.bio.uni-potsdam.de/v1.1/), DRTF (Gao et al., 2006), and RiceTFDB (http://ricetfdb.bio.uni-potsdam.de/v2.1/) were used to screen the nonredundant cluster members for homologs by BLASTP.

Species-Specific Expansion, Taxonomic Profiling, and Statistical Tests

The PlanTAPDB family sizes in six genera, Arabidopsis, rice, P. patens, C. reinhardtii, C. merolae, and T. pseudonana, were inferred using the NCBI taxonomy information of the nonredundant list of family members.
These values were normalized using the total number of members per group (TF, TR, or PT) in order to account for the general differences in TAP family sizes. If the fraction of family members in a given species deviated from the arithmetic average of the group with a z score of ≥1.8, it was marked as expanded (no gene family was significantly reduced according to this criterion). The cutoff was chosen based on a distribution plot of all z scores (data not shown). For visualization of the taxonomic composition of the TAP families (taxonomic profile), all taxa were allocated into 20 nonredundant taxonomic groups that were chosen because they contributed significantly to the distribution of NCBI taxonomy strings. After normalization for taxonomic group size (columnwise log ratio per average), the rows were used for average-linkage clustering with a centered Pearson-correlation distance and heat map visualization using Cluster 3.0 and JavaTreeview 1.0.12 (Eisen et al., 1998). Hypothesized differences in the size distribution of TAP gene families between organisms (Fig. 3B

Supplemental Data

The following materials are available in the online version of this article.
[Supplemental Data]
Acknowledgments

We thank T. Kretsch, T. Laux, and M. Woriedh for helpful discussions, A.K. Prowse for critically reading the manuscript, and several anonymous reviewers for helpful comments.

Notes

1 This work was supported by the German Research Foundation (grant nos. Re 837/7–3 and Re 837/10–1 to R.R.).

The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Stefan A. Rensing (stefan.rensing/at/biologie.uni-freiburg.de).

[C] Some figures in this article are displayed in color online but in black and white in the print edition.
[W] The online version of this article contains Web-only data.
[OA] Open Access articles can be viewed online without a subscription.

References