Genome Inform. 2004;15(1):93-104.

Clustering of database sequences for fast homology search using upper bounds on alignment score.

Author information

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan. itoh@kuicr.kyoto-u.ac.jp

Abstract

Homology data are among the most important sources of information for predicting the functions of unknown proteins, so fast and accurate search methods are needed. In this paper, we propose a new approach to fast and accurate homology search that uses pre-computed all-against-all similarity scores in a target database. We previously developed a method for deriving an upper bound on the Smith-Waterman score (SW-score) between a query and a homolog candidate sequence from the SW-score between the candidate and a sequence similar to the query. Using this upper bound, we first cluster the sequences in the target database so that the upper bounds of the SW-scores for all members of each cluster are less than a given value, and we select a representative sequence for each cluster. The query sequence is then searched against the representative sequences, and an upper bound of the SW-scores is estimated for each cluster. SW-alignments are computed for all the sequences in a cluster only if its upper bound is higher than a given threshold. We performed computational experiments to test the efficiency of the proposed method on the KEGG/GENES database using the KEGG/SSDB. The results suggest that our method is efficient for redundant databases that include multiple closely related species.
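
The search procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the formula for the upper bound, so the sketch takes the bound as a pluggable function and uses a trivial length-based bound as a stand-in. The names `Cluster`, `length_bound`, and `filtered_search`, the toy scoring scheme, and the choice to return only scores above the threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MATCH, MISMATCH, GAP = 2, -1, -2  # assumed toy scoring scheme, not from the paper


def sw_score(a: str, b: str) -> int:
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (MATCH if ch_a == ch_b else MISMATCH)
            cell = max(0, diag, prev[j] + GAP, curr[j - 1] + GAP)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


@dataclass
class Cluster:
    representative: str
    members: List[str]  # assumed to include the representative


# Hypothetical signature: (query, SW-score against representative, cluster) -> bound
BoundFn = Callable[[str, int, Cluster], int]


def length_bound(query: str, rep_score: int, cluster: Cluster) -> int:
    """Stand-in bound, NOT the paper's bound: a local alignment cannot score
    more than MATCH times the length of the shorter sequence. It ignores
    rep_score, which the paper's bound would presumably exploit."""
    longest = max(len(m) for m in cluster.members)
    return MATCH * min(len(query), longest)


def filtered_search(
    query: str,
    clusters: List[Cluster],
    threshold: int,
    bound_fn: BoundFn = length_bound,
) -> Dict[str, int]:
    """Align against every member of a cluster only if the cluster's
    upper bound is higher than the threshold."""
    hits: Dict[str, int] = {}
    for cluster in clusters:
        rep_score = sw_score(query, cluster.representative)
        if bound_fn(query, rep_score, cluster) <= threshold:
            continue  # no member of this cluster can score above the threshold
        for member in cluster.members:
            # Reuse the score already computed for the representative.
            score = rep_score if member == cluster.representative else sw_score(query, member)
            if score > threshold:
                hits[member] = score
    return hits
```

The speed-up comes from skipped clusters: each one costs a single alignment against its representative instead of one alignment per member. A tighter `bound_fn`, such as the one derived in the authors' earlier work, would skip more clusters than the loose length-based stand-in used here.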

PMID: 15712113 [Indexed for MEDLINE]
