Copyright © 2007 The Author(s)

EPGD: a comprehensive web resource for integrating and displaying eukaryotic paralog/paralogon information

1Bioinformatics Center, Key Lab of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Road, 2Graduate School of the Chinese Academy of Sciences, Shanghai 200031, 3Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, Shanghai 200235 and 4Shanghai Information Center for Life Sciences, Shanghai Institutes for Biology Science, Chinese Academy of Science, Shanghai 200031, P. R. China

*To whom correspondence should be addressed. Phone: +86 21 54920089, Fax: +86 21 54920143, Email: yxli/at/sibs.ac.cn

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Received August 4, 2007; Revised October 8, 2007; Accepted October 10, 2007.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Gene duplication is common in all three domains of life, especially in eukaryotic genomes. The duplicates provide new material for the action of evolutionary forces such as selection or genetic drift. Here we describe a sophisticated procedure to extract duplicated genes (paralogs) from 26 available eukaryotic genomes, to pre-calculate several evolutionary indexes (evolutionary rate, synonymous distance/clock, transition redundant exchange clock, etc.) based on the paralog family, and to identify block or segmental duplications (paralogons). We also constructed an internet-accessible Eukaryotic Paralog Group Database (EPGD; http://epgd.biosino.org/EPGD/). The database is gene-centered and organized by paralog family.
It focuses on paralogs and evolutionary duplication events. The paralog families and paralogons can be searched by text or sequence, and are downloadable from the website as plain text files. The database will be very useful for both experimentalists and bioinformaticians interested in the study of duplication events or paralog families.

INTRODUCTION

The occurrences and consequences of gene and genome duplication events have been discussed for a long time (1,2). The duplication of genes and large genome regions (or entire genomes) is proposed to be an important mechanism for the evolution of phenotypic complexity, diversity and innovation, and an origin of novel gene functions. To uncover the evolutionary trajectories of duplicated genes, previous studies have integrated transcriptomic, interactomic and other data (1). Such integrated approaches, focusing on gene duplications in genomes, have already contributed robust insights into important evolutionary questions, such as the complexity of genes (3), the evolution of genome architecture (4), the growth of gene networks (5), the 2R hypothesis (6) and the diversity of gene expression (7). Moreover, duplicated genes can be used to investigate diverging gene functions, which, when allied with computational methods, may provide useful information for experimental approaches. An example is the analysis of the molecular basis of the adaptive evolution of the duplicated pancreatic ribonuclease gene in leaf-eating monkeys with both computational and experimental approaches (8). As more genomes are examined, increasing evidence supports the dominant role of gene duplication events in the expansion of genome content (2,9).

A crucial step in the study of gene duplications is to identify duplicated genes (known as paralogs) in genome sequences and to distinguish these from genes that have similar sequences but arose through convergent evolution or other mechanisms. Algorithm-based homology detection from primary sequences is the preferred approach to detect paralogs or paralogous regions (4). In contrast to ortholog databases, there are only a few specific paralog databases available in the public domain. Even though several general homolog databases, such as Inparanoid (10), Ensembl Compara (11) and NCBI HomoloGene (12), include some paralog information, they do not comprehensively summarize and display the evolutionary information of paralogs. In order to construct a stable web resource that supports easy browsing and downloading of evolutionary information on paralogous genes, we created EPGD (Eukaryotic Paralog Group Database; http://epgd.biosino.org/EPGD/).

Several steps used to identify the paralogs contained in the EPGD were used previously to detect the duplication events in the family of animal transmembrane genes (13). Using this work (13) as a basis, we developed a semi-automatic procedure for collecting the within-species paralog families from genomes and pre-calculating several evolutionary indexes of these families. We collected the paralogs only from eukaryotes, as they are known to have a higher rate of gene duplication than prokaryotes (14) and are more widely studied in this field.

A pioneer in the construction of paralog databases is ParaDB (15). A highlight of ParaDB is the display of paralogons, which have been thoroughly investigated in the human genome (16) and are reviewed by Van de Peer (4). EPGD inherits this feature and adopts the term ‘paralogon’, defined as homologous genomic segments created by partial or complete genome duplication. EPGD focuses on families of paralogs and integrates spatial and temporal data to diagnose gene duplication processes comprehensively (17). The ratio of dN (the rate of non-synonymous substitutions) to dS (the rate of synonymous substitutions) (18), synonymous distance/clock, transition redundant exchange (TREx) clock (19), paralogons and several other features were generated by computational methods and deposited in the database. In the current EPGD version, 26 eukaryotic genomes were processed and 35 991 paralog families and 29 480 paralogons were identified and stored (Table 1). To our knowledge, it is one of the most extensive paralog databases in the public domain. All data can be browsed, searched and downloaded directly from the website.
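For readers unfamiliar with the dN/dS index mentioned above, the following minimal Python sketch shows how the ratio is conventionally read: values well below 1 suggest purifying selection, values near 1 suggest near-neutral evolution, and values above 1 suggest positive selection. It is an illustration only, not the EPGD implementation. The function names, the example gene pairs and the `tolerance` band around 1 are assumptions made here; in practice dN and dS would come from a dedicated estimation program.

```python
from typing import Optional


def dn_ds_ratio(dn: float, ds: float) -> Optional[float]:
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    if dn < 0 or ds < 0:
        raise ValueError("substitution rates must be non-negative")
    if ds == 0:
        return None
    return dn / ds


def selection_regime(omega: Optional[float], tolerance: float = 0.1) -> str:
    """Label a dN/dS value; the tolerance band around 1 is an arbitrary choice."""
    if omega is None:
        return "undefined"
    if omega < 1 - tolerance:
        return "purifying"
    if omega > 1 + tolerance:
        return "positive"
    return "near-neutral"


# Hypothetical (dN, dS) estimates for three paralog pairs.
example_pairs = {
    "pair_A": (0.05, 0.50),
    "pair_B": (0.30, 0.29),
    "pair_C": (0.60, 0.20),
}

for name, (dn, ds) in example_pairs.items():
    omega = dn_ds_ratio(dn, ds)
    print(name, omega, selection_regime(omega))
```

A single ratio per gene pair averages over all sites and over the whole time since duplication, so a label such as "purifying" describes the dominant signal rather than every codon.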
CONSTRUCTION AND CONTENT

EPGD is implemented with a MySQL relational database (http://www.mysql.com) and JavaServer Pages technology (http://java.sun.com/products/jsp/). The raw datasets of 26 eukaryotic genomes (Table 1) in GenBank flat file format (GBK) were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes) in March 2007. Proteins, coding sequences (CDS) and gene location information were extracted from these GBK files with a Perl script.

Overview of the procedure

A total of 531 715 coding sequences and corresponding proteins were obtained after preprocessing. Only the protein sequences were used to construct the paralog families. The procedure is briefly described below:
Content in the database

Large datasets were obtained when the procedure was applied to 26 genomes. We housed the data in a MySQL relational database. The kernel tables in the schema of EPGD are the table of paralog families and the table of paralogons. The peripheral tables, i.e. evolutionary indexes and annotation information, surround these two core tables. A summary of the data in EPGD is shown in Table 1.

Web interface

The web interface was implemented using Java and JavaServer Pages technologies. The user can inspect the datasets in the EPGD and see a summary of the current version. The records of paralog families, paralogons and genes (Figure 1
As shown in Figure 1
The outline of the family page is similar to that of the gene page (Figure 1
The main part of the paralogon page contains basic information (taxonomy, locations in the chromosomes, average block length, average block density, number of links) of the paralogon, followed by an image thumbnail displaying a graphic view of the paralogon. Here, ‘the average block density’ is the arithmetic mean of the ratio of paralogon-defining genes to all genes on both sides of the paralogon; ‘number of links’ is the number of unique paralog families linked in the paralogon region. When the mouse hovers over this thumbnail, an enlarged view of the image pops up. Gene names and their regions in the enlarged graphic view of the paralogon are hyperlinked to the gene records in the database. The user can access the records in the EPGD with customized queries (Figure 2).
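The definition of the average block density given above can be made concrete with a small Python sketch. It reflects one reading of that definition: each side (block) of the paralogon has its own ratio of paralogon-defining genes to all genes, and the two ratios are averaged. The function names and the example counts are hypothetical and are not taken from EPGD.

```python
from typing import Tuple


def block_density(defining_genes: int, total_genes: int) -> float:
    """Ratio of paralogon-defining genes to all genes within one block."""
    if total_genes <= 0:
        raise ValueError("a block must contain at least one gene")
    if not 0 <= defining_genes <= total_genes:
        raise ValueError("defining genes must lie between 0 and the block total")
    return defining_genes / total_genes


def average_block_density(side_a: Tuple[int, int], side_b: Tuple[int, int]) -> float:
    """Arithmetic mean of the densities of the two sides of a paralogon.

    Each side is given as (paralogon-defining genes, all genes in the block).
    """
    return (block_density(*side_a) + block_density(*side_b)) / 2


# Hypothetical paralogon: 6 linked families, with 20 genes on one side and 12 on the other.
density = average_block_density((6, 20), (6, 12))
print(round(density, 3))
```

Averaging the two per-side ratios, rather than pooling the gene counts, gives both blocks equal weight even when one block spans many more genes than the other.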
DATA AVAILABILITY

The EPGD is available for download through the ‘DOWNLOADS’ link on the website as a FASTA file containing all proteins, together with family member lists, evolutionary indexes and paralogon regions in plain text files.

RESULTS AND DISCUSSION

The properties of the paralog family spaces in EPGD

Table 1 gives a summary of the content of the current EPGD version. The proportions of duplicated genes in eukaryotes collected by EPGD range from 9% (Plasmodium falciparum) to 52% (Strongylocentrotus purpuratus), and are smaller than previously reported (e.g. Homo sapiens, 38%; Arabidopsis thaliana, 65%; Drosophila melanogaster, 41%; Caenorhabditis elegans, 49%; Saccharomyces cerevisiae, 30%) (2). This is due to the rigorous criteria for paralog definition used to construct the EPGD and because many duplicated genes have lost characteristic signatures from their sequences during their evolutionary history (2). Since evolutionary indexes are highly unreliable for ancient gene duplications, rigorous criteria are essential for our database. Most paralog families contain fewer than five genes. The distributions of paralog family size in all species of EPGD follow a power law (data not shown) (29,30). As an example, Figure 4
Consistent with previous studies on Bacteria and a small set of Eukarya (9,29,31), large genomes possess more paralog families and a higher proportion of genes belonging to paralog families than small genomes (Figure 3
The number of paralogons increases with the genome size (Figure 3

The example of H. sapiens

Taking H. sapiens as an example (Figure 4

Transition redundant exchange (TREx) processes at the positions of conserved 2-fold codon sites are thought to offer an approximation for a neutral molecular clock (19). We calculated the TREx distances for each paralog family, which provide a more homogeneous molecular clock than that provided by the dS. If the time since two genes diverged is long relative to the reciprocal of the rate constant with which these silent sites suffer transition substitutions, the TREx distance approximates 0.5. As seen from Figure 4

Similar to the work of Lynch and Conery (32), dN was plotted as a function of dS (Figure 4

PERSPECTIVES

We plan to update EPGD every six months. As new eukaryotic organisms are fully sequenced and annotated, they will be added to EPGD using our procedure. In the future, ortholog annotation information will also be included. However, the development of the utilities for EPGD will still focus on tools for the analysis of duplication events, such as statistical tests of the paralogons (unpublished data) and chromosome ideograms. Furthermore, we will thoroughly analyze the data in EPGD and present insights into the effect of duplication events on genome evolution. The procedure to build the EPGD is currently semi-automatic. We will make the procedure fully automatic and start an open source project in the future.

ACKNOWLEDGEMENTS

We thank Zhonghao Yu, Xiaobin Xing, Yun Li, Kang Tu and Guangyong Zhen for helpful comments and suggestions. This research was supported by grants from the National Basic Research Program of China (2006CB910700, 2004CB720103, 2004CB518606, 2003CB715901). Funding to pay the Open Access publication charges for this article was provided by the National High-Tech R&D Program (863) (2006AA02Z334) and the National Basic Research Program of China (2006CB910700, 2004CB720103, 2004CB518606, 2003CB715901).
Conflict of interest statement. None declared.

REFERENCES

1. Taylor JS, Raes J. Duplication and divergence: the evolution of new genes and old ideas. Annu. Rev. Genet. 2004;38:615–643.
2. Zhang J. Evolution by gene duplication: an update. Trends Ecol. Evol. 2003;18:292–298.
3. He X, Zhang J. Gene complexity and gene duplicability. Curr. Biol. 2005;15:1016–1021.
4. Van de Peer Y. Computational approaches to unveiling ancient genome duplications. Nat. Rev. Genet. 2004;5:752–763.
5. Teichmann SA, Babu MM. Gene regulatory network growth by duplication. Nat. Genet. 2004;36:492–496.
6. Makalowski W. Are we polyploids? A brief history of one hypothesis. Genome Res. 2001;11:667–670.
7. Wagner A. Decoupled evolution of coding region and mRNA expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc. Natl Acad. Sci. USA. 2000;97:6579–6584.
8. Zhang J, Zhang YP, Rosenberg HF. Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nat. Genet. 2002;30:411–415.
9. Jordan IK, Makarova KS, Spouge JL, Wolf YI, Koonin EV. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res. 2001;11:555–565.
10. O'Brien KP, Remm M, Sonnhammer EL. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res. 2005;33:D476–D480.
11. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, et al. Ensembl 2007. Nucleic Acids Res. 2007;35:D610–D617.
12. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2007;35:D5–D12.
13. Ding G, Kang J, Liu Q, Shi T, Pei G, Li Y. Insights into the coupling of duplication events and macroevolution from an age profile of animal transmembrane gene families. PLoS Comput. Biol. 2006;2:e102.
14. Lynch M, Conery JS. The origins of genome complexity. Science. 2003;302:1401–1404.
15. Leveugle M, Prat K, Perrier N, Birnbaum D, Coulier F. ParaDB: a tool for paralogy mapping in vertebrate genomes. Nucleic Acids Res. 2003;31:63–67.
16. McLysaght A, Hokamp K, Wolfe KH. Extensive genomic duplication during early chordate evolution. Nat. Genet. 2002;31:200–204.
17. Durand D, Hoberman R. Diagnosing duplications – can it be done? Trends Genet. 2006;22:156–164.
18. Nei M, Kumar S. Molecular Evolution and Phylogenetics. USA: Oxford University Press; 2000.
19. Benner SA. Interpretive proteomics – finding biological meaning in genome and proteome databases. Adv. Enzyme Regul. 2003;43:271–359.
20. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
21. Wootton JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996;266:554–571.
22. Perriere G, Duret L, Gouy M. HOBACGEN: database system for comparative genomics in bacteria. Genome Res. 2000;10:379–385.
23. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 2003;31:3497–3500.
24. Wernersson R, Pedersen AG. RevTrans: multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res. 2003;31:3537–3539.
25. Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 1986;3:418–426.
26. Yang Z, Nielsen R. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 2000;17:32–43.
27. Yang Z. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 1997;13:555–556.
28. Clamp M, Cuff J, Searle SM, Barton GJ. The Jalview Java alignment editor. Bioinformatics. 2004;20:426–427.
29. Enright AJ, Kunin V, Ouzounis CA. Protein families and TRIBES in genome sequence space. Nucleic Acids Res. 2003;31:4632–4638.
30. Kunin V, Teichmann SA, Huynen MA, Ouzounis CA. The properties of protein family space depend on experimental design. Bioinformatics. 2005;21:2618–2622.
31. Pushker R, Mira A, Rodriguez-Valera F. Comparative genomics of gene-family size in closely related bacteria. Genome Biol. 2004;5:R27.
32. Lynch M, Conery JS. The evolutionary fate and consequences of duplicate genes. Science. 2000;290:1151–1155.
33. Li W.-H. Molecular Evolution. Sunderland, Massachusetts, USA: Sinauer Associates, Inc.; 1997.