Format

Send to

Choose Destination
Bioinformatics. 2001;17 Suppl 1:S5-S12.

An efficient algorithm for finding short approximate non-tandem repeats.

Author information

1
Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Sand 13, Tübingen, 72076, Germany. adebiyi@informatik.uni-tuebingen.de

Abstract

We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(DN(3pow(epsilon)-1)log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences.

PMID:
11472987
[Indexed for MEDLINE]

Supplemental Content

Loading ...
Support Center