Copyright © 2009 The Author(s)

TARGeT: a web-based pipeline for retrieving and characterizing gene and transposable element families from genomic sequences

Department of Plant Biology, University of Georgia, Athens, GA 30602, USA

*To whom correspondence should be addressed. Tel: +1 706-542-1810; Fax: +1 706-542-1805; Email: sue/at/plantbio.uga.edu

Received March 7, 2009; Revised April 15, 2009; Accepted April 15, 2009.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Gene families compose a large proportion of eukaryotic genomes. The rapidly expanding genomic sequence database provides a good opportunity to study gene family evolution and function. However, most gene family identification programs are restricted to searching protein databases where data are often lagging behind the genomic sequence data. Here, we report a user-friendly web-based pipeline, named TARGeT (Tree Analysis of Related Genes and Transposons), which uses either a DNA or amino acid ‘seed’ query to: (i) automatically identify and retrieve gene family homologs from a genomic database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Due to its high speed, TARGeT is also able to characterize very large gene families, including transposable elements (TEs). We evaluated TARGeT using well-annotated datasets, including the ascorbate peroxidase gene family of rice, maize and sorghum and several TE families in rice. In all cases, TARGeT rapidly recapitulated the known homologs and predicted new ones. We also demonstrated that TARGeT outperforms similar pipelines and has functionality that is not offered elsewhere.
INTRODUCTION

A major discovery of eukaryote genome projects is that unexpectedly large numbers of genes are members of gene families. Gene families comprise 49% of the genes in Caenorhabditis elegans, 41% in Drosophila melanogaster, 38% in Homo sapiens, 65% in Arabidopsis thaliana and 77% in Oryza sativa L. ssp. japonica (1–5). Variation in the sizes of gene families among closely related species indicates that gene duplication and gene family diversification are ongoing processes (6–8). Duplicate genes arise in several ways including whole-genome duplication (9–11) and segmental duplication (12,13). Segmental duplication events can be further classified into tandem and interspersed (14). A tandem duplication event can result from either homologous (15) or nonhomologous recombination mechanisms (16), while interspersed duplication events are mainly caused by the activity of transposable elements (TEs) (17–20).

Gene family members can be detected by clustering genes based on their similarity (21,22), and new members can be identified through similarity comparison to known members. Many gene family databases have been established, including Pfam (23), TreeFam (24) and PANTHER (25). While these gene family databases are useful resources, they are not updated at the same rapid pace as newly generated genomic sequences. Researchers interested in particular gene families often have to perform their own searches to obtain the most current collection of sequences.

The identification of gene family members using sequence similarity searches is often complicated by the detection of homologs from other gene families. Phylogenetic analysis is a powerful tool to identify homologs of interest and to provide additional information about gene function and evolution.
To this end, researchers can perform manual searches using publicly available programs such as BLAT (26), Wise2 (27), BLAST (28), FASTA (29) and HMMER (30), followed by sequence alignment and phylogenetic analysis. However, these procedures can be complicated as they often require extensive manual curation, particularly if homologous regions need to be extracted from genomic sequences. While this is a manageable problem for a small gene family, it can be a tedious and time-consuming process when the target gene family is large. More significantly, the quality of the results often suffers.

In addition to the more traditional gene families, TEs can also be viewed as members of ‘special’ gene families that are able to duplicate themselves by the activity of element-encoded proteins. TEs often constitute the largest component of eukaryotic genomes, and their identification and classification are essential to accurate genome annotation (31,32). However, as with large gene families, the very high copy numbers of some TEs make their retrieval from genomic sequence and characterization an extremely difficult task.

The increasing pace of genomic sequencing projects demands a computer-assisted pipeline that can rapidly and accurately identify and characterize gene families. Several automated pipelines have been developed to ease homolog identification, but most are limited to protein or expressed sequence tag (EST) databases. For example, PhyloBLAST (33), Pyphy (34), HoSeqI (35), PhyloGena (36) and TRIBE-MCL (37) perform BLASTP searches and retrieve data from protein databases. SimESTs uses TBLASTN to search EST databases (38). Because these programs only compare protein-coding sequences, they will miss any mutational events that occur within noncoding regions.

TARGeT (Tree Analysis of Related Genes and Transposons) is a program to streamline the process of retrieving, annotating and analyzing both gene families and TE families from a genomic database.
The core of the TARGeT pipeline is an algorithm called putative homolog identifier (PHI) that uses a series of steps to predict gene structure using BLAST results. From the predicted gene structure, PHI extracts the amino acid sequences of putative homologs for use in subsequent phylogenetic analysis. We have compared TARGeT with two pipelines, FGF and GFScan, which can also be used to retrieve gene families from genomic databases. Results are presented showing that TARGeT significantly outperforms both programs and adds several layers of functionality not present in existing programs.

To make it easier for users, especially nonspecialists, TARGeT was implemented as a user-friendly web-based pipeline (http://target.iplantcollaborative.org/). All initial input for TARGeT is organized on a web form and the results are presented in the browser. All results and supporting files are documented and are available for download. TARGeT provides several points where results can be inspected and analyses can be repeated.

METHODS

TARGeT can use either protein or DNA sequence as the query. BLASTN searches are used for DNA queries, while TBLASTN is used for protein queries. The pipeline that uses TBLASTN is the focus of this article because it is more complex and may have wider application. TARGeT uses Muscle (39) to calculate the multiple alignment and TreeBest (24) to generate the phylogenetic tree of the putative homologs with the neighbor-joining method (40). The other functions of TARGeT are carried out by several Perl scripts developed by the authors.

Rice genomic data were obtained from Genbank (41,42) with accession numbers from NC_008394 to NC_008405. Maize genomic data were downloaded from the Maize Genome Sequencing Project (http://www.maizesequence.org; version: Dec. 2008). Sorghum genomic data were from the Sorghum Bicolor Genome Project (http://www.jgi.doe.gov; version: 2008 Sorbi1 assembly).
There are five main steps in the TARGeT pipeline with a checkpoint at the end of each: (i) preparation of the query when multiple sequences are to be submitted, (ii) BLAST search (either BLASTN or TBLASTN), (iii) homolog prediction, (iv) multiple alignment and (v) phylogenetic tree estimation (Figure 1
TARGeT can be accessed on a web server, where all data used and generated by TARGeT are entered in a log file. TARGeT output is presented in a single webpage that uses nested tabs to organize the data, images and re-submission forms for each TARGeT run during a session. There is a final tab for each run called Provenance, where the user can view the parameters used by TARGeT in a log file and also download an archive that includes all files and images for offline viewing and analysis. The output includes the XML log file, BLAST results in image and text format, PHI results in image and text format, multiple alignments in FASTA format and the phylogenetic tree in Newick and jpeg formats.

RESULTS

Searching for APx gene family in rice

Rice and Arabidopsis serve as model plant monocot and dicot species, respectively. They diverged from a common ancestor 120–200 million years ago (43) and their genomes are fully sequenced (1,2,44). Thus, they provide excellent opportunities to evaluate the cross-species searching ability of TARGeT. We searched the rice APx gene family using the Arabidopsis APx protein sequence as query and compared the results generated by TARGeT to the published data. The goal of this exercise was to see how well TARGeT would perform at predicting the rice APx family members. We chose APx because it is a small but important gene family that has been well annotated in both Arabidopsis and rice. Based on the literature, there are as many as nine APx family members in Arabidopsis (45) and eight in rice (46) (Table 1). The APx family shares sequence similarity with several other peroxidase families (47) and, as such, is a good dataset to test the ability of TARGeT to discriminate between closely related protein families.
BLAST search

To improve the chances of finding target gene family members, multiple queries can be submitted as long as they are homologs. An optional multiple alignment step is provided for users to select sequences from conserved regions (Figure 1
To aid users in viewing the BLAST result, TARGeT produces an image showing a rough estimation of BLAST high scoring pair (HSP) numbers and conserved regions along the length of each query sequence (Figures 1
Putative homolog identification

Several factors make it difficult to identify reliable homologs from BLAST output and result in a high false positive rate (48–50). Lack of explicit treatment of frameshifts and introns is also a disadvantage of TBLASTN (51). To solve these problems, we developed a program called PHI, which takes into account the e-value (default 0.01) as well as a second parameter called the minimal match percentage (MMP, default 70%) to find reliable homologs. The two main stages in PHI (grouping and refinement) are explained below.

Grouping

Introns or low-similarity regions can break a complete alignment into smaller HSPs. In addition, when a frameshift occurs, TBLASTN produces separate HSPs. To retrieve the intact sequence of each homolog or pseudogene, PHI sorts the HSPs based on position and strand in the genomic sequence. In this step, HSPs that are from the same homolog are grouped together by the sequence position of query and subject (Figure 4
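The grouping idea and the MMP filter can be illustrated with a short Python sketch. This is not the authors' Perl implementation of PHI: the `HSP` record, the `max_gap` distance threshold and the collinearity test are assumptions made for illustration, and MMP is interpreted here as the fraction of query residues covered by the HSPs of one group.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HSP:
    """One TBLASTN hit; the field names are hypothetical."""
    subject: str   # chromosome or contig name
    strand: str    # "+" or "-"
    s_start: int   # 1-based start on the genomic sequence (s_start <= s_end)
    s_end: int     # 1-based inclusive end on the genomic sequence
    q_start: int   # 1-based start on the protein query
    q_end: int     # 1-based inclusive end on the protein query


def group_hsps(hsps: List[HSP], max_gap: int = 8000) -> List[List[HSP]]:
    """Group HSPs that plausibly belong to the same homolog.

    HSPs are sorted by subject, strand and genomic position. A neighbour
    joins the current group when it lies on the same subject and strand,
    within max_gap nt (an assumed threshold), and its query position is
    collinear with the previous HSP.
    """
    groups: List[List[HSP]] = []
    for h in sorted(hsps, key=lambda x: (x.subject, x.strand, x.s_start)):
        if groups:
            prev = groups[-1][-1]
            same_locus = (
                h.subject == prev.subject
                and h.strand == prev.strand
                and h.s_start - prev.s_end <= max_gap
            )
            # On "+" the query advances along the genome; on "-" it runs backwards.
            if prev.strand == "+":
                collinear = h.q_start > prev.q_start
            else:
                collinear = h.q_start < prev.q_start
            if same_locus and collinear:
                groups[-1].append(h)
                continue
        groups.append([h])
    return groups


def match_fraction(group: List[HSP], query_len: int) -> float:
    """Fraction of query residues covered by at least one HSP in the group."""
    covered = set()
    for h in group:
        covered.update(range(h.q_start, h.q_end + 1))
    return len(covered) / query_len


def passes_mmp(group: List[HSP], query_len: int, mmp: float = 0.70) -> bool:
    """Apply the minimal match percentage cut-off (default 70%, as in the text)."""
    return match_fraction(group, query_len) >= mmp


if __name__ == "__main__":
    demo = [
        HSP("chr1", "+", 1000, 1300, 1, 100),
        HSP("chr1", "+", 1500, 1800, 95, 200),
        HSP("chr1", "+", 500000, 500150, 1, 50),
    ]
    for g in group_hsps(demo):
        print(len(g), "HSP(s); passes MMP:", passes_mmp(g, query_len=250))
```

Counting covered query positions, rather than summing HSP lengths, keeps overlapping HSPs from being counted twice. That is a design choice of this sketch, not something stated in the article.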
Refinement

After the grouping step, several potential problems often remain in the HSPs of each group. A demonstration figure to illustrate some problems is shown in the lower part of Figure 4

Resolving the boundary between two overlapping HSPs

In TBLASTN outputs, two successive HSPs often overlap due to coincident similarity beyond the true boundaries, resulting in misalignment between the query and the subject. An example is shown in Figure 4

Identifying small introns

The function of this step is to identify and remove introns that appear as gaps within the HSPs. Any gap in the subject that has a length greater than the minimum intron length parameter (user-adjustable parameter, default 60 nt) is identified as an intron and will be removed, resulting in two (smaller) new HSPs (Figure 4

Identifying small exons

Small exons will be missed by BLAST searches when their alignments do not meet the e-value cut-off. Such small exons may be found by increasing the e-value. However, for a large database, simply increasing the e-value could increase the computational burden of TARGeT, and there is no guarantee that all exons will be identified because the suitable e-value is unknown. To improve the prediction of small exons, PHI can perform a second round BLAST search, using a small database containing only the sequences of putative homologs (including the predicted intronic and flanking regions). Because e-value calculation is dependent in part on the size of the database, short alignments to the original query sequence(s) may now be significant (Figure 4

Illustration of PHI output

After the refinement stage, an image is generated that provides a view of the predicted gene structure for each putative homolog (Figure 1

Using default parameters, 46 putative rice APx homologs were identified and clustered into two groups based on their gene structures (Figure 5
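The reason a second-round search against a small database can rescue short exons follows from the standard Karlin-Altschul relation between bit score and expected number of chance hits, E = m · n · 2^(−S′), where m is the query length, n the database length and S′ the bit score. The Python sketch below applies this relation to a single short alignment. It ignores BLAST's effective-length corrections and the details of translated searches, and the sequence lengths are illustrative values, not figures from the article.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from a bit score: E = m * n * 2**(-S').

    Simplified: uses raw lengths rather than BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    E_CUTOFF = 0.01            # default e-value cut-off quoted in the text
    QUERY_LEN = 250            # assumed protein query length (aa)
    GENOME_LEN = 370_000_000   # assumed whole-genome database size (nt)
    SMALL_DB_LEN = 460_000     # assumed second-round database size (nt)
    SCORE = 35.0               # assumed bit score of a short exon alignment

    for label, n in (("whole genome", GENOME_LEN), ("homologs only", SMALL_DB_LEN)):
        e = expected_hits(SCORE, QUERY_LEN, n)
        verdict = "kept" if e <= E_CUTOFF else "rejected"
        print(f"{label:>13}: E = {e:.3g} ({verdict})")
```

Because E scales linearly with database length, the same alignment becomes more significant by exactly the ratio of the two database sizes.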
To assess the accuracy of homolog sequences retrieved by TARGeT, we considered two situations (this is not a step of TARGeT). The first occurs at the ends of the query-target alignment, where the program fails to identify some amino acids. We refer to this as ‘missing’; it can occur when the ends of homolog sequences are less well conserved than the internal sequence. By comparing the homolog sequence to the query sequence, the numbers of ‘missing’ amino acids were counted manually. For example, if the query is 100 amino acids and the alignment is from 5 to 97, the missing number of this homolog is 4 + 3 = 7. The ‘missed’ rate is calculated by dividing the number of missed amino acids by the length of the query (7% in the above example). In contrast, we refer to an ‘error’ as a situation where the program incorrectly predicts amino acids within a homolog sequence. By comparing the homolog sequence to the previously published rice APx protein sequence, mismatched amino acids were counted manually as the ‘error’ number of this homolog. The ‘error’ rate is calculated by dividing the number of incorrect amino acid assignments by the length of the corresponding region in the previously published rice APx protein sequence. The missed and error rates may vary for each predicted homolog sequence because they depend on the level of conservation between the homolog and the query sequences. For the rice APx example above, the average missed rate is 1.11% and the average error rate is 0.49% (Table 1).

Multiple alignment and tree estimation

If users are satisfied with the putative homologs found by TARGeT, they can either download the sequences in FASTA format or let TARGeT use the data to generate a phylogenetic tree. Users also have the option to employ other tree estimation methods by downloading the alignment and using the software of their choice. The phylogenetic tree and the figure showing the tree are generated by TreeBest.
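The ‘missed’ and ‘error’ rates defined in the accuracy assessment above were counted manually by the authors. As a small worked illustration, they can be written as two Python functions. The function names are hypothetical, and `error_rate` assumes the predicted and published sequences have already been trimmed to the same region and aligned without gaps, which the article does not specify.

```python
def missed_rate(query_len: int, aln_start: int, aln_end: int) -> float:
    """Fraction of query residues missing at the two ends of the alignment.

    Coordinates are 1-based and inclusive on the query.
    """
    missing = (aln_start - 1) + (query_len - aln_end)
    return missing / query_len


def error_rate(predicted: str, reference: str) -> float:
    """Fraction of mismatched residues between two equal-length sequences.

    Assumes both strings cover the same region, position by position.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must cover the same region")
    mismatches = sum(p != r for p, r in zip(predicted, reference))
    return mismatches / len(reference)


if __name__ == "__main__":
    # Example from the text: query of 100 aa aligned from 5 to 97 gives 4 + 3 = 7 missing.
    print(missed_rate(100, 5, 97))              # 0.07
    # Toy sequences with one mismatch out of eight residues.
    print(error_rate("MKTAYIAK", "MKTAYLAK"))   # 0.125
```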
When there are many homologs, names on the figure will be difficult to read because the figure size cannot be varied. To solve this problem, users can download the newick file and draw their own tree using software such as TreeView (52). We have also provided two more solutions on the server. The first is to use Jalview (53) and the second is to copy the newick format tree file and submit it to PhyloWidget (54), which is a powerful web-based tree viewer. From the TARGeT-generated tree of APx homologs (shaded region in Figure 6
Searching for APx gene family members in maize and sorghum

To further evaluate the cross-species search ability of TARGeT, we searched for APx gene families in maize and sorghum, using the same query that was used to search for rice APx genes. The reasons for choosing maize and sorghum are as follows. First, at the time of the final analysis for this study, the available maize and sorghum sequences were incomplete. Maize is being sequenced using a BAC by BAC approach, while sorghum was sequenced using a whole genome shotgun approach. As such, they are more representative of the available genomic databases than the complete rice sequence. Second, search results of maize and sorghum can be compared with the rice and Arabidopsis output. Finally, the APx gene families in maize and sorghum have not yet been characterized.

We identified 11 APx homologs in maize and 9 in sorghum (Supplementary Figures 2–5). To get a comprehensive view of the APx family in plants, we produced a phylogenetic tree with MEGA (55) using the published APx data from Arabidopsis and the data predicted by TARGeT for rice, maize and sorghum (Figure 7
Searching DNA TE families in rice

TARGeT is a powerful tool for rapid TE identification, characterization and phylogenetic analysis. We have illustrated this by using TARGeT to search for TEs in the rice genome, using conserved transposase sequences from five DNA TE superfamilies as queries. The queries were constructed from known TE protein sequences that were downloaded from Repbase (59) and additional sequences annotated as part of another study (data not shown). Here, we focus on the TARGeT results for the Tc1/mariner superfamily because it has been well annotated and characterized in rice.

The Tc1/mariner superfamily is widespread in plant and animal genomes (60). A previous study (60) annotated 34 coding mariner-like elements (MLEs) from two partially sequenced rice genomes (14 from the indica database and 20 from the japonica database). Here, we used TARGeT to search the complete japonica database and, in ~1 min, generated a phylogenetic tree that was consistent with that of Feschotte and Wessler (60). TARGeT successfully retrieved the 20 MLEs reported in the previous study and, in addition, detected 27 new MLEs (Figure 8
Evaluating the speed of TARGeT

Many factors can affect the speed of TARGeT, such as the number and length of the query sequences, the gene/TE family size, the database size and the number of exons. Other issues that affect the run time include the server hardware and current usage. In addition, because TARGeT is entirely web based, upload and download times vary from user to user. For the gene or TE families that were analyzed in this study, we calculated the average time for each search as an average of 10 independent runs. For example, TARGeT took ~1.2, 2.5 and 6.8 min to complete the searches of the APx gene family in rice, sorghum and maize, respectively. The search of the rice Tc1/mariner superfamily took ~1 min to complete.

Comparison of TARGeT with similar programs

Two other pipelines, GFScan (50) and FGF (61), can also retrieve and characterize gene families from genomic databases. GFScan searches for gene family members with the representative genomic DNA motif, while FGF performs a TBLASTN search followed by GeneWise and phylogenetic analysis. Here, we briefly compare the features and performance of TARGeT with these two pipelines.

TARGeT versus GFScan

The cross-species searching ability of GFScan was previously tested by using a human query sequence to retrieve the carbonic anhydrase (CA) family from the mouse genome (50). GFScan was able to identify only 5 of the 11 known CA genes along with two putative new CA genes in the available mouse genome sequence. The authors stated that this discrepancy was due to the large difference between the human and mouse motifs. We did a similar search using TARGeT for CA genes in the mouse. Because there is no record of the version of the mouse genomic database used in the GFScan paper, we chose the latest version of the reference data (18 October 2006) from Genbank. A query composed of 14 protein sequences from 14 known human CA genes was constructed.
Using default parameters except that the minimal intron length was set to 80 000 nt, TARGeT found 14 out of 16 known CA genes (data in 2008) in mouse, and the remaining two were identified together with a putative new CA homolog after the MMP cut-off was reduced from 0.7 to 0.5.

TARGeT versus FGF

Direct comparison between the results of TARGeT and FGF proved difficult. First, the FGF server is often not available. Second, TARGeT and FGF use different local databases. We ran TARGeT with the queries that were used in the paper describing FGF. Using a peptidylprolyl isomerase Cyp2 gene (AK061894, GI: 115443875) as query to search against the rice database with default parameters, TARGeT found six more putative homologs than FGF (Supplementary Figure 6). We also found one possible mistake in the result of FGF: it identified two overlapping homologs, AK061894_chr06 and AK061894_chr06, while there is no such overlap in the result of TARGeT. Using Hsp90 (GI: 40254816) as the query to search against the human database, both FGF and TARGeT found 15 homologs (Supplementary Figure 7).

DISCUSSION

To date, most gene family search programs can only retrieve homologs from protein sequence databases. More commonly, BLAST has been widely used to search genomic sequence databases. However, manual retrieval of homolog sequences from BLAST outputs requires a great deal of time. This is especially true for large gene or TE families. TARGeT is particularly useful if one wants to quickly retrieve and characterize gene families from DNA databases, especially when a newly sequenced genome is available. TARGeT uses a Perl program named PHI that automatically retrieves homolog sequences from BLAST outputs. In addition, TARGeT can do multiple alignment and phylogenetic analysis with the retrieved homolog sequences.

Speed is another major advantage of TARGeT.
As demonstrated in this report, TARGeT can routinely retrieve and characterize gene family homologs, including TEs, from plant and animal genome sequences on the order of minutes.

Although TARGeT shares similarity with homology-based TE annotation tools like RepeatMasker (62), there are some important differences. First, instead of showing each fragmented match as RepeatMasker does, TARGeT tries to identify homologs that are long enough for phylogenetic tree estimation. A fragmented TE can be identified as long as the sum length of its fragments satisfies the MMP to the query. As such, using the same query and databases, the number of homologs identified by TARGeT is usually lower than the hit number found by RepeatMasker. Second, when there are no repeat libraries available for a particular species, RepeatMasker gives the user the option of performing a BLASTX search to annotate coding regions of TEs in the submitted sequences. In contrast, TARGeT uses a TBLASTN search to identify coding regions from the whole genomic database. Finally, RepeatMasker lacks most of the functionality that is provided by TARGeT, including the generation of phylogenetic trees and gene structure figures.

When used to search genomic databases, protein sequence queries can efficiently detect distantly related homologs even when their DNA sequences cannot be aligned. Based on our experience, TBLASTN can detect sequences with identities as low as 25% to the query (data not shown). Comparison of the results of TARGeT, FGF and GFScan shows that TARGeT retrieved more homologs. To further improve TARGeT's ability to identify distantly related homologs, we are planning to optimize matrix and BLAST parameters (such as gap penalties).

Using multiple queries can also increase the chances of finding additional gene family homologs. TARGeT can accept multiple queries at one time.
Although more than one query may hit one homolog, a unique feature of TARGeT is that it can select the one that has the best match to the homolog.

When there is too much sequence divergence between a homolog sequence and the query, the homolog may not be found by TARGeT. However, TARGeT may still provide a clue for users to find them. Most homologs whose HSPs fail to meet the MMP cut-off value may still have short matches to the query in conserved regions. In this case, the file containing the BLAST HSPs that do not meet the qualified homolog cut-off would be valuable. Inspecting this file may give users a reason why TARGeT failed to detect some homologs and help users design new queries to find additional homologs.

TARGeT uses two approaches to separate closely related gene families. Because there is no absolute similarity cut-off among genes that are within or between families, closely related gene families may be retrieved, under certain circumstances, with the target gene family. This is often the case when the query is short, such as a domain sequence. An efficient way to separate closely related gene families is using phylogenetic analysis because homologs from the same family tend to cluster on a phylogenetic tree into the same clade (Figures 6–8 To overcome these limitations, TARGeT displays the gene structure of each homolog and their sequence similarity to the queries. Because different gene families often have distinct gene structures, homologs that have high sequence similarity to the queries and also have similar gene structures can be easily identified as members of the target gene family. For example, the homologs in the shaded clade in Figure 6

FUNDING

National Science Foundation (Grant DBI-0607123); Howard Hughes Medical Institute (Grant 52005731 to S.R.W.). Funding for open access charge: Howard Hughes Medical Institute.

Conflict of interest statement. None declared.

Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS

The authors would like to thank The iPlant Collaborative for hosting services on its cyber infrastructure and providing systems administration for the TARGeT web server. In addition, the authors would like to thank Drs Hongyan Shan, Jim Leebens-Mack, Russell Malmberg and Yaowu Yuan for critical reading of the manuscript and for valuable discussions.

REFERENCES

1. Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science. 2002;296:79–92.
2. Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science. 2002;296:92–100.
3. Li WH, Gu Z, Wang H, Nekrutenko A. Evolutionary analyses of the human genome. Nature. 2001;409:847–849.
4. Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, et al. Comparative genomics of the eukaryotes. Science. 2000;287:2204–2215.
5. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.
6. Lespinet O, Wolf YI, Koonin EV, Aravind L. The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res. 2002;12:1048–1059.
7. Lyckegaard EM, Clark AG. Ribosomal DNA and Stellate gene copy number variation on the Y chromosome of Drosophila melanogaster. Proc Natl Acad Sci USA. 1989;86:1944–1948.
8. Neitz M, Neitz J. Numbers and ratios of visual pigment genes for normal red-green color vision. Science. 1995;267:1013–1016.
9. Wendel JF. Genome evolution in polyploids. Plant Mol. Biol. 2000;42:225–249.
10. Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997;387:708–713.
11. Dehal P, Boore JL. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005;3:e314.
12. Cheung J, Wilson MD, Zhang J, Khaja R, MacDonald JR, Heng HH, Koop BF, Scherer SW. Recent segmental and gene duplications in the mouse genome. Genome Biol. 2003;4:R47.
13. Eichler EE. Recent duplication, domain accretion and the dynamic mutation of the human genome. Trends Genet. 2001;17:661–669.
14. Hurles M. Gene duplication: the genomic trade in spare parts. PLoS Biol. 2004;2:E206.
15. Bailey JA, Liu G, Eichler EE. An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 2003;73:823–834.
16. Koszul R, Caburet S, Dujon B, Fischer G. Eucaryotic genome evolution through the spontaneous duplication of large chromosomal segments. EMBO J. 2004;23:234–243.
17. Morgante M, Brunner S, Pea G, Fengler K, Zuccolo A, Rafalski A. Gene duplication and exon shuffling by helitron-like transposons generate intraspecies diversity in maize. Nat. Genet. 2005;37:997–1002.
18. Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR. Pack-MULE transposable elements mediate gene evolution in plants. Nature. 2004;431:569–573.
19. Tchenio T, Segal-Bendirdjian E, Heidmann T. Generation of processed pseudogenes in murine cells. EMBO J. 1993;12:1487–1497.
20. Vanin EF. Processed pseudogenes: characteristics and evolution. Annu. Rev. Genet. 1985;19:253–272.
21. Dayhoff MO. The origin and evolution of protein superfamilies. Fed. Proc. 1976;35:2132–2138.
22. Heger A, Holm L. Towards a covering set of protein family profiles. Prog. Biophys. Mol. Biol. 2000;73:321–337.
23. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al. The Pfam protein families database. Nucleic Acids Res. 2008;36:D281–D288.
24. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, Li R, Liu T, Zhang Z, Bolund L, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–D580.
25. Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, et al. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 2005;33:D284–D288.
26. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–664.
27. Birney E, Clamp M, Durbin R. GeneWise and genomewise. Genome Res. 2004;14:988–995.
28. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
29. Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 1990;183:63–98.
30. Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14:755–763.
31. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
32. Meyers BC, Tingey SV, Morgante M. Abundance, distribution, and transcriptional activity of repetitive elements in the maize genome. Genome Res. 2001;11:1660–1676.
33. Brinkman FS, Wan I, Hancock RE, Rose AM, Jones SJ. PhyloBLAST: facilitating phylogenetic analysis of BLAST results. Bioinformatics. 2001;17:385–387.
34. Sicheritz-Ponten T, Andersson SG. A phylogenomic approach to microbial evolution. Nucleic Acids Res. 2001;29:545–552.
35. Arigon AM, Perriere G, Gouy M. HoSeqI: automated homologous sequence identification in gene family databases. Bioinformatics. 2006;22:1786–1787.
36. Hanekamp K, Bohnebeck U, Beszteri B, Valentin K. PhyloGena–a user-friendly system for automated phylogenetic annotation of unknown sequences. Bioinformatics. 2007;23:793–801.
37. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 2002;30:1575–1584.
38. Frank RL, Mane A, Ercal F. An automated method for rapid identification of putative gene family members in plants. BMC Bioinformatics. 2006;7(Suppl. 2):S19.
39. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797.
40. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 1987;4:406–425.
41. Karsch-Mizrachi I, Ouellette BF. The GenBank sequence database. Methods Biochem. Anal. 2001;43:45–63.
42. Burks C, Fickett JW, Goad WB, Kanehisa M, Lewitter FI, Rindone WP, Swindell CD, Tung CS, Bilofsky HS. The GenBank nucleic acid sequence database. Comput. Appl. Biosci. 1985;1:225–233.
43. Salse J, Piégu B, Cooke R, Delseny M. Synteny between Arabidopsis thaliana and rice at the genome level: a tool to identify conservation in the ongoing rice genome sequencing project. Nucl. Acids Res. 2002;30:2316–2328.
44. Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000;408:796–815.
45. Mittler R, Vanderauwera S, Gollery M, Van Breusegem F. Reactive oxygen gene network of plants. Trends Plant Sci. 2004;9:490–498.
46. Teixeira FK, Menezes-Benavente L, Margis R, Margis-Pinheiro M. Analysis of the molecular evolutionary history of the ascorbate peroxidase gene family: inferences from the rice genome. J. Mol. Evol. 2004;59:761–770.
47. Passardi F, Theiler G, Zamocky M, Cosio C, Rouhier N, Teixera F, Margis-Pinheiro M, Ioannidis V, Penel C, Falquet L, et al. PeroxiBase: the peroxidase database. Phytochemistry. 2007;68:1605–1611.
48. Frickey T, Lupas AN. PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res. 2004;32:5231–5238.
49. Koski LB, Golding GB. The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 2001;52:540–542.
50. Xuan Z, McCombie WR, Zhang MQ. GFScan: a gene family search tool at genomic DNA level. Genome Res. 2002;12:1142–1149.
51. Gertz EM, Yu YK, Agarwala R, Schaffer AA, Altschul SF. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol. 2006;4:41.
52. Page RD. TreeView: an application to display phylogenetic trees on personal computers. Comput. Appl. Biosci. 1996;12:357–358.
53. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ. Jalview Version 2–a multiple sequence alignment editor and analysis workbench. Bioinformatics. 2009;25:1189–1191.
54. Jordan GE, Piel WH. PhyloWidget: web-based visualizations for the tree of life. Bioinformatics. 2008;24:1641–1642.
55. Tamura K, Dudley J, Nei M, Kumar S. MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol. Biol. Evol. 2007;24:1596–1599.
56. Qian W, Zhang J. Gene dosage and gene duplicability. Genetics. 2008;179:2319–2324.
57. Liang H, Plazonic KR, Chen J, Li WH, Fernandez A. Protein under-wrapping causes dosage sensitivity and decreases gene duplicability. PLoS Genet. 2008;4:e11.
58. Papp B, Pal C, Hurst LD. Dosage sensitivity and the evolution of gene families in yeast. Nature. 2003;424:194–197.
59. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J. Repbase Update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 2005;110:462–467.
60. Feschotte C, Wessler SR. Mariner-like transposases are widespread and diverse in flowering plants. Proc. Natl. Acad. Sci. USA. 2002;99:280–285.
61. Zheng H, Shi J, Fang X, Li Y, Vang S, Fan W, Wang J, Zhang Z, Wang W, Kristiansen K. FGF: a web tool for Fishing Gene Family in a whole genome database. Nucleic Acids Res. 2007;35:W121–W125.
62. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 2004 http://www.repeatmasker.org.