Copyright © 2008 The Author(s)

DAhunter: a web-based server that identifies homologous proteins by comparing domain architecture

1Korean BioInformation Center, KRIBB, Daejeon 305-806 and 2Department of Bio and Brain Engineering, KAIST, Daejeon 305-701, Korea

*To whom correspondence should be addressed. Phone: +82 42 879 8531, Fax: +82 42 879 8519, Email: bulee/at/kribb.re.kr
Correspondence may also be addressed to Doheon Lee. Phone: +82 42 869 4316, Fax: +82 42 869 8680, Email: dhlee/at/biosoft.kaist.ac.kr

Received February 9, 2008; Revised March 17, 2008; Accepted March 25, 2008.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We present DAhunter, a web-based server that identifies homologous proteins by comparing domain architectures, the organization of protein domains. A major obstacle in comparing domain architectures is the existence of ‘promiscuous’ domains, which carry out auxiliary functions and appear in many unrelated proteins. To distinguish these promiscuous domains from other protein domains, we assigned a weight score to each domain extracted from RefSeq proteins, based on its abundance and versatility. A domain's score represents its importance in the ‘protein world’ and is used in the comparison of domain architectures. In scoring domains, DAhunter considers domain combinations as well as single domains. To measure the similarity of two domain architectures, we developed several methods that are based on algorithms used in information retrieval (the cosine similarity, the Goodman–Kruskal γ function, and domain duplication index) and then combined these into a similarity score.
Compared with other domain architecture algorithms, DAhunter is better at identifying homology. The server is available at http://www.dahunter.kr and http://localodom.kobic.re.kr/dahunter/index.htm

INTRODUCTION

There are now >600 completely sequenced genomes, and >5 million unique protein sequences are available in public databases (1). A common approach for identifying protein function is to assume that proteins with similar sequences have similar functions. Thus, sequence similarity search algorithms such as BLAST (2) and FASTA (3) can detect sequence similarities in proteins that have not diverged greatly. However, proteins that have diverged greatly can be homologous even though they exhibit little sequence similarity (4). Thus, sequence-based homology searches can yield false negatives, especially when comparing proteins with multiple domains (5).

Domains are the building blocks of proteins and one of the most useful characteristics for determining protein function (6). The functions of the individual domains of a multi-domain protein contribute to our understanding of the properties of the protein as a whole (7). The sequential order of protein domains is known as the domain architecture. Architectures are useful for classifying evolutionarily related proteins, detecting evolutionarily distant homologs, and comparing multi-domain proteins (8,9).

Several previous studies have proposed methods of domain architecture comparison for identification of protein homology. CDART (10) presents a list of proteins with domain architectures similar to that of a given query sequence by counting the number of shared domains. PDART (11) presents domain architectures in the leaves of a species tree. Bjorklund et al. (12) proposed a ‘domain distance’, calculated as the number of domains that differ between two domain architectures.

Multi-domain proteins evolve by gene duplication and domain shuffling, causing certain domains to appear in many unrelated proteins (13).
The functions of these ‘promiscuous’ domains are typically auxiliary to the primary protein function (14). Promiscuous domains also form many combinations with other domains, although the orientation of domain combination and the type of neighboring domains in proteins are generally limited. Because promiscuous domains are not directly related to homology, they should be given less importance in domain architecture comparison than non-promiscuous domains. Another important feature in domain evolution is that two- or three-domain combinations (supra-domains) are re-used in different protein contexts with different partner domains (15). These combinations, as well as single domains, should be considered in domain architecture comparison. In this study, we define a ‘domain unit’ as either a single domain or a two-domain combination whose domains are separated by no more than 30 amino acid residues.

Here we present DAhunter, a web-based server that identifies homologous proteins by comparing domain architecture. DAhunter source code and database contents are freely available to academic users upon request.

METHODS

Calculating weight scores of domain units

The accuracy of DAhunter depends on the domain unit weight scores, which are calculated from the abundance and versatility of domain units. To obtain domain unit weight scores, we downloaded 4 234 906 protein sequences from RefSeq Release 26 (ftp://ftp.ncbi.nih.gov/refseq/release/) (16) and classified these proteins into eukaryota, bacteria or archaea. Then we analyzed the domain content of these proteins with the Pfam (17) database. The Pfam domain annotations of all RefSeq proteins were obtained from the Similarity Matrix of Proteins (SIMAP) database (http://mips.gsf.de/simap/). This database provides a comprehensive and up-to-date dataset of the pre-calculated sequence features for all proteins in all major public sequence databases (18).
For eukaryotic genes with several alternative transcripts, we kept the longest coding structure so as to retain the maximum number of domains. We filtered domain hits in proteins with a cutoff E-value of 0.01 and excluded proteins without Pfam signatures. The Pfam-annotated proteins were converted into domain architectures, with sequences between adjacent domains divided into two types: those ≤30 residues and those >30 residues. Previous researchers have used a threshold value of 30 residues to claim that two domains have a particular functional and spatial relationship (19). We extracted domain units from the domain architectures (Table 1) and assigned weight scores according to their abundance and versatility.
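A minimal Python sketch of these two steps, extracting domain units and scoring them, may help make the procedure concrete. It is an illustration only, not the server's Perl code. The function names, the input layout (a list of Pfam names plus a list of linker lengths) and the default pseudocounts of 1 are all assumptions. The scoring follows the IAF and IVF definitions given in this section: log2 of the total number of proteins over the number of proteins containing the unit, and log2 of the total number of domain families over the number of adjacent families, each with a pseudocount in the denominator.

```python
from math import log2


def domain_units(domains, linkers, max_gap=30):
    """List the domain units of one architecture.

    domains: Pfam domain names in N- to C-terminal order.
    linkers: residues between each adjacent pair (len(domains) - 1 values).
    Single domains become 1-tuples; adjacent pairs separated by at most
    max_gap residues become 2-tuples. (Hypothetical layout, for illustration.)
    """
    if len(linkers) != max(len(domains) - 1, 0):
        raise ValueError("expected one linker length per adjacent domain pair")
    units = [(name,) for name in domains]
    for left, right, gap in zip(domains, domains[1:], linkers):
        if gap <= max_gap:
            units.append((left, right))
    return units


def weight_score(p_total, p_unit, f_total, f_adjacent, alpha=1.0, beta=1.0):
    """Weight score = IAF * IVF for one domain unit.

    alpha and beta are pseudocounts; the defaults are placeholders,
    not values taken from the paper.
    """
    iaf = log2(p_total / (p_unit + alpha))      # inverse abundance frequency
    ivf = log2(f_total / (f_adjacent + beta))   # inverse versatility frequency
    return iaf * ivf
```

With these definitions, a unit that occurs in many proteins or has many different neighbors receives a smaller score than a rare unit with few neighbors, which is the behavior the text describes for promiscuous domains.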
To measure the abundance of a domain unit, we defined the Inverse Abundance Frequency (IAF), which is derived from the Inverse Document Frequency (IDF), a statistic commonly used in information retrieval (5). The IAF of a domain unit, d, is defined as $d_{iaf} = \log_2 \frac{p_t}{p_d + \alpha}$, where $p_t$ is the total number of proteins, $p_d$ is the number of proteins containing domain unit d and $\alpha$ is a pseudocount parameter to balance protein frequency. To measure the versatility of a domain unit, we defined the Inverse Versatility Frequency (IVF), whose definition is also similar to that of IDF. IVF reflects how many domain families occur adjacent to a domain unit on its N- and C-terminal sides. The definition of IVF is $d_{ivf} = \log_2 \frac{f_t}{f_d + \beta}$, where $f_t$ is the total number of domain families, $f_d$ is the number of domain families adjacent to domain unit d and $\beta$ is a pseudocount parameter to balance domain families. The weight score ($d_{ws}$) of a domain unit is simply the product of the IAF and IVF of the domain unit: $d_{ws} = d_{iaf} \times d_{ivf}$. If a domain unit is highly promiscuous, it will have a low score. Domain unit analysis of RefSeq proteins shows that the weight score of each domain unit differs among eukaryota, bacteria and archaea. The DAhunter website provides the IAF and IVF for all domain units in the three kingdoms.

Construction of the domain architecture database

To compare domain architectures of a query protein with proteins in public databases, we built a domain architecture database, consisting of domain architectures of proteins in RefSeq, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (20). The Pfam annotations of these proteins were obtained from the SIMAP database. To illustrate domain combinations in domain architectures, we represent an intervening space of ≤30 residues as ‘∧’ and one of >30 residues as ‘~~~’.
Thus, the three domain architectures, ‘A∧B~~~C’, ‘A~~~B∧C’ and ‘A∧B∧C’ (where A, B and C stand for different Pfam domains), are different even though they have the same three domains. According to the definition of a domain unit, the three architectures also have different two-domain combinations (‘A∧B~~~C’ has AB; ‘A~~~B∧C’ has BC; ‘A∧B∧C’ has AB and BC). The domain architecture database has 39 336 different domain architectures (36 634 in RefSeq; 4253 in UniProtKB/Swiss-Prot; and 10 065 in UniProtKB/TrEMBL).

Similarity between two domain architectures

To identify homologs of a query protein, DAhunter first identifies Pfam domains in the query protein using the hmmpfam program and the Pfam database. If Pfam domains are present, the server extracts domain units from the query domain architecture and selects from the database candidate domain architectures that contain any of the query domain units. Then, DAhunter compares the query domain architecture against the candidate domain architectures with respect to the content, order and duplication of domain units.

Domain unit content

The two sets of domain units derived from the two architectures are represented as indices built using the vector space model (VSM) (21). Each domain architecture is represented by a vector in which each component corresponds to the weight score of a domain unit. The similarity of the two vectors is measured by their cosine similarity, a measure based on the angle between two vectors that is commonly used in text mining algorithms. Thus, if x and y are vectors of two domain architectures X and Y, the cosine similarity is defined as:
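$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert} \tag{1}$$

where $\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i$ and $\lVert \mathbf{x} \rVert = \sqrt{\sum_i x_i^2}$. The range of the cosine similarity is [0, 1], where 1 indicates that x and y have the same domain units and 0 indicates that they share no domain units.

Domain unit order

To measure the order similarity between two domain architectures, we used the Goodman–Kruskal γ function (22), a symmetric measure based on the difference between concordant pairs (P) and discordant pairs (Q). This function is defined as:

$$\gamma = \frac{P - Q}{P + Q} \tag{2}$$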
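The content and order measures can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the server's implementation. The cosine similarity is the standard one over sparse weight vectors, and γ is computed in its standard form, (P − Q)/(P + Q). How DAhunter forms the pairs is not spelled out here, so the sketch assumes that pairs are drawn from the domain units shared by both architectures, that a repeated unit is placed at its first occurrence, and that γ is 0 when fewer than two units are shared. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two sparse weight vectors.

    x and y map a domain unit (any hashable label) to its weight score;
    units missing from a dict have an implicit weight of zero.
    """
    dot = sum(weight * y[unit] for unit, weight in x.items() if unit in y)
    norm_x = sqrt(sum(w * w for w in x.values()))
    norm_y = sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: an empty architecture matches nothing
    return dot / (norm_x * norm_y)


def first_positions(units):
    """Map each domain unit to the index of its first occurrence."""
    positions = {}
    for index, unit in enumerate(units):
        positions.setdefault(unit, index)
    return positions


def goodman_kruskal_gamma(order_x, order_y):
    """Gamma = (P - Q) / (P + Q) over pairs of shared domain units.

    A pair is concordant (P) when its two units appear in the same relative
    order in both architectures, and discordant (Q) when the order is reversed.
    """
    pos_x = first_positions(order_x)
    pos_y = first_positions(order_y)
    shared = set(pos_x) & set(pos_y)
    concordant = discordant = 0
    for u, v in combinations(shared, 2):
        if (pos_x[u] - pos_x[v]) * (pos_y[u] - pos_y[v]) > 0:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    if total == 0:
        return 0.0  # assumption: gamma is undefined with < 2 shared units
    return (concordant - discordant) / total
```

For instance, `goodman_kruskal_gamma(["A", "B", "C"], ["C", "B", "A"])` evaluates to -1.0 because all three shared pairs appear in reversed order, while two identical orderings give 1.0.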
Domain unit duplication

Duplication of a domain unit with a higher weight score is more important than duplication of one with a lower weight score. For example, if ‘A’ is a high-repeat domain and ‘B’ is a low-repeat domain, the change from ‘AB’ to ‘ABB’ is more significant than the change from ‘AB’ to ‘AAB’. To measure duplication similarity between two domain architectures, we developed a function similar to the cosine similarity (defined above). This function uses the VSM, where each component ($C_x$) corresponds to the product of the weight score of a domain unit and its duplication number. The duplication similarity between two vectors is defined as:
The final similarity score between two domain architectures, X and Y, is obtained by combining the indices from Equations (1–3) (each normalized to [0, 1]) using a simple linear function with parameters a and b,
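As a hedged illustration of such a linear combination, the sketch below assumes that the three indices are weighted by a, b and 1 − a − b, so that the weights sum to 1. That functional form, the rescaling of γ from [−1, 1] to [0, 1] as (γ + 1)/2, the default parameter values and the function name are all assumptions made for this example, not details confirmed by the text.

```python
def combined_score(content, gamma, duplication, a=0.5, b=0.3):
    """Combine the three similarity indices into one score in [0, 1].

    content:      cosine similarity of domain unit content, in [0, 1]
    gamma:        Goodman-Kruskal gamma for domain unit order, in [-1, 1]
    duplication:  duplication similarity, in [0, 1]
    a, b:         weights for content and order (placeholder defaults);
                  the duplication index receives the remaining 1 - a - b.
    """
    if a < 0.0 or b < 0.0 or a + b > 1.0:
        raise ValueError("a and b must be non-negative and sum to at most 1")
    order = (gamma + 1.0) / 2.0  # assumed rescaling of gamma to [0, 1]
    return a * content + b * order + (1.0 - a - b) * duplication
```

Under this assumed form, two architectures with identical content, order and duplication score 1, and lowering any single index lowers the score in proportion to its weight.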
To evaluate the DAhunter algorithm, we compared the results of DAhunter on the HomoloGene database with those of the PDART program (using default parameters), which does not consider ‘promiscuous’ domains or domain combinations. PDART generated 4961 (59%) matched combinations of the same group. This indicates that DAhunter, which considers ‘promiscuous’ domains and domain combinations, is a better algorithm for comparing domain architectures. The comparison results are given on the DAhunter webpage.

IMPLEMENTATION

The DAhunter server consists of a web interface, a MySQL database management system (DBMS), and core programs. The web interface is implemented with static HTML and CGI scripts, and the MySQL DBMS is used to store the DAhunter database. The core programs were written in Perl and are divided into three main steps (Figure 1
INPUT AND OUTPUT

Input

The query interface accepts protein sequences in the FASTA format. The user can paste the sequences directly into the input form or can upload a file from a local disk. The maximum number of input protein sequences for a single submission is 500, and the length of each sequence is limited to 5000 residues. When submitting more than two protein sequences, users must input an Email address to receive DAhunter results.

Output

The output of the DAhunter service is an HTML-formatted file (Figure 2
ACKNOWLEDGEMENTS

B.L. was supported by a grant from the KRIBB Research Initiative Program. D.L. was supported by the Ministry of Knowledge Economy, Korea, under the ITRC supervised by the IITA (IITA-2008-C1090-0801-0001). Funding to pay the Open Access publication charges for this article was provided by the Ministry of Education, Science and Technology.

Conflict of interest statement. None declared.

REFERENCES

1. Lee D, Redfern O, Orengo C. Predicting protein function from sequence and structure. Nat. Rev. Mol. Cell Biol. 2007;8:995–1005.
2. Ye J, McGinnis S, Madden TL. BLAST: improvements for better sequence analysis. Nucleic Acids Res. 2006;34:W6–W9.
3. Pearson WR. Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 1990;183:63–98.
4. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 1998;284:1201–1210.
5. Song N, Sedgewick RD, Durand D. Domain architecture comparison for multidomain homology identification. J. Comput. Biol. 2007;14:496–516.
6. Chothia C, Gough J, Vogel C, Teichmann SA. Evolution of the protein repertoire. Science. 2003;300:1701–1703.
7. Krishnadev O, Rekha N, Pandit SB, Abhiman S, Mohanty S, Swapna LS, Gore S, Srinivasan N. PRODOC: a resource for the comparison of tethered protein domain architectures with in-built information on remotely related domain families. Nucleic Acids Res. 2005;33:W126–W129.
8. Caetano-Anolles G, Caetano-Anolles D. An evolutionarily structured universe of protein architecture. Genome Res. 2003;13:1563–1571.
9. Fukami-Kobayashi K, Minezaki Y, Tateno Y, Nishikawa K. A tree of life based on protein domain organizations. Mol. Biol. Evol. 2007;24:1181–1189.
10. Geer LY, Domrachev M, Lipman DJ, Bryant SH. CDART: protein homology by domain architecture. Genome Res. 2002;12:1619–1623.
11. Lin K, Zhu L, Zhang DY. An initial strategy for comparing proteins at the domain architecture level. Bioinformatics. 2006;22:2081–2086.
12. Bjorklund AK, Ekman D, Light S, Frey-Skott J, Elofsson A. Domain rearrangements in protein evolution. J. Mol. Biol. 2005;353:911–923.
13. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science. 1999;285:751–753.
14. Patthy L. Genome evolution and the evolution of exon-shuffling: a review. Gene. 1999;238:103–114.
15. Vogel C, Berzuini C, Bashton M, Gough J, Teichmann SA. Supra-domains: evolutionary units larger than single protein domains. J. Mol. Biol. 2004;336:809–823.
16. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65.
17. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, et al. The Pfam protein families database. Nucleic Acids Res. 2008;36:D281–D288.
18. Rattei T, Tischler P, Arnold R, Hamberger F, Krebs J, Krumsiek J, Wachinger B, Stumpflen V, Mewes W. SIMAP: structuring the network of protein similarities. Nucleic Acids Res. 2008;36:D289–D292.
19. Apic G, Gough J, Teichmann SA. An insight into domain combinations. Bioinformatics. 2001;17(Suppl. 1):S83–S89.
20. Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006;34:D187–D191.
21. Glenisson P, Coessens B, Van Vooren S, Mathys J, Moreau Y, De Moor B. TXTGate: profiling gene groups with text-based information. Genome Biol. 2004;5:R43.
22. Jaroszewicz S, Simovici DA, Kuo WP, Ohno-Machado L. The Goodman-Kruskal coefficient and its applications in genetic diagnosis of cancer. IEEE Trans. Biomed. Eng. 2004;51:1095–1102.
23. Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank. Nucleic Acids Res. 2008;36:D25–D30.