Format

Send to

Choose Destination
Bioinformatics. 2009 Aug 1;25(15):1869-75. doi: 10.1093/bioinformatics/btp342. Epub 2009 Jun 8.

Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.

Author information

1
Department of Biochemistry & Molecular Biology and The Institute for Genome Sciences, University of Maryland, School of Medicine, BioPark II, Baltimore, MD 21201, USA. aneuwald@som.umaryland.edu

Erratum in

  • Bioinformatics. 2009 Nov 15;25(22):3047.

Abstract

MOTIVATION:

The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical.

RESULTS:

This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences.

AVAILABILITY:

A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu.

SUPPLEMENTARY INFORMATION:

Supplementary data are available at Bioinformatics online.

PMID:
19505947
PMCID:
PMC2732367
DOI:
10.1093/bioinformatics/btp342
[Indexed for MEDLINE]
Free PMC Article

Supplemental Content

Full text links

Icon for Silverchair Information Systems Icon for PubMed Central
Loading ...
Support Center