Bioinformatics. Jun 15, 2010; 26(12): 1564–1565.
Published online Apr 22, 2010. doi:  10.1093/bioinformatics/btq208
PMCID: PMC2881392

HangOut: generating clean PSI-BLAST profiles for domains with long insertions

Abstract

Summary: Profile-based similarity search is an essential step in structure-function studies of proteins. However, inclusion of non-homologous sequence segments into a profile causes its corruption and results in false positives. Profile corruption is common in multidomain proteins, and single domains with long insertions are a significant source of errors. We developed a procedure (HangOut) that, for a single domain with specified insertion position, cleans erroneously extended PSI-BLAST alignments to generate better profiles.

Availability: HangOut is implemented in Python 2.3 and runs on all Unix-compatible platforms. The source code is available under the GNU GPL license at http://prodata.swmed.edu/HangOut/

Contact: kim/at/chop.swmed.edu; grishin/at/chop.swmed.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

PSI-BLAST (Altschul et al., 1997) is an indispensable tool for remote homology inference and structure-function predictions (Devos and Valencia, 2000; Friedberg, 2006; Grishin, 2001; Hegyi and Gerstein, 2001). However, false positives in PSI-BLAST can cause errors in automated annotations (Bork and Koonin, 1998). One major source of such false positives is profile corruption, which usually results from extension of alignments over non-homologous sequence regions (Galperin and Koonin, 1998). For instance, for two 2-domain proteins, AB and A′C, PSI-BLAST may extend a correct alignment of the homologous domains A and A′ to include sequences from the non-homologous domains B and C. Despite significant effort devoted to this multidomain problem, no satisfactory solution exists (Gonzalez and Pearson, 2010; Galzitskaya and Melnik, 2003; George and Heringa, 2002; Nagarajan and Yona, 2004). Currently, the best approach is to start PSI-BLAST with precisely defined query sequence bounds (Corpet et al., 2000; Wheeler et al., 2001).

However, we found that even a single, well-defined domain does not guarantee a corruption-free profile. Domains hosting insertions, which represent close to 5% of domains in the structural classification of proteins (SCOP) 1.75 database (Murzin et al., 1995), may generate a corrupted PSI-BLAST profile due to incorrect alignment extension around the insertion position. Our analysis shows that the N- and C-terminal segments of the host domain are frequently aligned as separate PSI-BLAST high-scoring segment pairs (HSPs), and the two HSPs overlap when mapped onto the query sequence. Each alignment can be divided into two segments: (i) correctly aligned and (ii) incorrectly aligned or extended (Fig. 1a and Supplementary Fig. S1). These incorrectly aligned ‘overhangs’ are detected and removed by the HangOut program to clean the profile and prepare it for subsequent remote homology searches with various tools, such as PSI-BLAST and HHsearch.
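The overlap of two HSPs on the query can be illustrated with a short sketch. It is not taken from the HangOut source code: the interval representation (inclusive, 1-based query coordinates), the function names and the example hit identifiers are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical representation: inclusive, 1-based (start, end) on the query.
Interval = Tuple[int, int]


def query_overlap(a: Interval, b: Interval) -> int:
    """Number of query positions covered by both intervals (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def hits_with_overlapping_hsps(hsps_by_hit: Dict[str, List[Interval]]) -> List[str]:
    """Return the hits that have at least one pair of HSPs overlapping on the query."""
    return [
        hit_id
        for hit_id, intervals in hsps_by_hit.items()
        if any(query_overlap(a, b) > 0 for a, b in combinations(intervals, 2))
    ]


# Made-up example: only hitA has two HSPs that overlap on the query.
example = {
    "hitA": [(1, 60), (50, 120)],
    "hitB": [(1, 40), (45, 100)],
    "hitC": [(5, 80)],
}
print(hits_with_overlapping_hsps(example))
```

In this toy input, the two HSPs of hitA share query positions 50 to 60, which is the pattern described above for host-domain segments that were each extended past the insertion position.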

Fig. 1.
HangOut method to clean PSI-BLAST profiles. (a) HangOut flowchart. Starting from a query domain (blue–red, with inserted domain in yellow removed), a PSI-BLAST search of the NCBI non-redundant database (NR) is performed (Step 1) to produce alignments. ...

2 METHODS

The HangOut input is a single-domain query sequence with the insertion boundary specified. The HangOut algorithm proceeds as follows (Fig. 1a): (1) Run BLAST with the input sequence against the NCBI non-redundant database with an e-value threshold of 0.001. (2) Detect and remove lower-scoring regions from HSPs (defined in the second half of this paragraph) as well as regions matching a PSI-BLAST profile of the inserted domain (see Supplementary Figures S2 and S3 for rationale). (3) Terminate upon convergence or when the iteration limit is reached. Otherwise, repeat Steps 1 to 3 with the following modifications: (i) PSI-BLAST replaces BLAST, seeded (-B option) with the cleaned profile from Step 2 and (ii) profile scores (PSSM) replace BLOSUM62 scores (for HSP removal). Thus, HangOut builds multiple sequence alignments similarly to PSI-BLAST, but adds a ‘clean-up’ step after each iteration that is intended to remove incorrect extensions. HangOut is based on two assumptions: (i) each HSP contains at least one correctly aligned region, and (ii) incorrectly extended regions exist in every HSP that crosses the insertion point. Based on these assumptions, HangOut splits every local alignment into two segments with a boundary at the insertion point and selects the higher-scoring (BLOSUM62 or PSSM) segment of each split pair. The lower-scoring segment is removed as a possibly erroneous extension. In addition to this HangOut procedure, we applied RemoveHit, a simpler method that does not require a defined insertion point and removes entire alignments for hits with two overlapping HSPs (Supplementary Fig. S4).
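The split-and-select step can be sketched as follows. This is a minimal illustration under stated assumptions, not the HangOut implementation: `toy_score` is a hypothetical stand-in for a BLOSUM62 or PSSM lookup, the column representation and function names are invented for this example, and `insertion_pos` is assumed to be the last query residue before the insertion in the discontinuous query.

```python
from typing import Callable, List, Tuple

# Hypothetical column type: (query position, query residue, subject residue).
Column = Tuple[int, str, str]


def toy_score(q: str, s: str) -> int:
    """Stand-in for a BLOSUM62 or PSSM score; NOT a real substitution matrix."""
    if q == "-" or s == "-":
        return -2
    return 2 if q == s else -1


def hsp_columns(query_start: int, query_aln: str, subject_aln: str) -> List[Column]:
    """Label each alignment column with the 1-based query position it maps to.

    A gap in the query inherits the position of the preceding query residue.
    """
    pos = query_start - 1
    columns = []
    for q, s in zip(query_aln, subject_aln):
        if q != "-":
            pos += 1
        columns.append((pos, q, s))
    return columns


def clean_hsp(
    columns: List[Column],
    insertion_pos: int,
    score: Callable[[str, str], int] = toy_score,
) -> List[Column]:
    """Split an HSP at the insertion point and keep the higher-scoring segment.

    HSPs that do not cross the insertion point are returned unchanged.
    """
    left = [c for c in columns if c[0] <= insertion_pos]
    right = [c for c in columns if c[0] > insertion_pos]
    if not left or not right:
        return columns
    left_score = sum(score(q, s) for _, q, s in left)
    right_score = sum(score(q, s) for _, q, s in right)
    return left if left_score >= right_score else right


# Made-up HSP: the first five columns match, the extension past position 5 does not.
hsp = hsp_columns(1, "ACDEFGHIK", "ACDEFWWWW")
kept = clean_hsp(hsp, insertion_pos=5)
print("".join(s for _, _, s in kept))  # subject residues that survive the clean-up
```

Passing the scoring function as a parameter mirrors the switch described above from BLOSUM62 scores in the first round to profile (PSSM) scores in later iterations; a PSSM-based scorer would additionally need the query position, which this simplified signature does not provide.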

HangOut was tested on a set of 40% representative SCOP 1.75 domains defined to contain insertions (302 domains, see Supplementary Table 1 for the list) to measure the number of corrupted profiles (false positives) and the number of correct homologs found by each discontinuous query domain sequence (with insertion sequence removed). The 302 hidden Markov models (HMMs) built from each set of profiles (PSI-BLAST, HangOut or RemoveHit) were compared to HMMs built from all 9528 SCOP 1.75 40% representative domains (Murzin et al., 1995) using HHsearch ver. 1.5.1 (Soding, 2005). The number of corrupted profiles was increased by one if HHsearch found homologs of inserted domains with probability higher than 0.9. The number of homologs found was counted as the number of hits that had strong profile similarity (HHsearch probability above 0.9) and overall structural similarity (DaliLite Z-score higher than 4) (Holm and Park, 2000) or belonged to the same SCOP superfamily as the query domain.
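The two counts described above can be expressed as a short sketch. The `Hit` record and its field names are hypothetical, probabilities are assumed to be fractions between 0 and 1, and the homolog criterion follows one possible reading of the text: probability above 0.9 combined with either a DaliLite Z-score above 4 or membership in the same SCOP superfamily.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hit:
    """Hypothetical record for one HHsearch hit of a query profile."""
    probability: float             # HHsearch probability as a fraction (0 to 1)
    dali_z: float                  # DaliLite Z-score against the query domain
    same_superfamily: bool         # same SCOP superfamily as the query domain
    matches_inserted_domain: bool  # hit is a homolog of the inserted domain


def profile_is_corrupted(hits: Iterable[Hit], min_prob: float = 0.9) -> bool:
    """A profile counts as corrupted if it confidently finds the inserted domain."""
    return any(h.matches_inserted_domain and h.probability > min_prob for h in hits)


def count_homologs(hits: Iterable[Hit], min_prob: float = 0.9, min_z: float = 4.0) -> int:
    """Count hits under the assumed reading: prob > 0.9 and (Z > 4 or same superfamily)."""
    return sum(
        1
        for h in hits
        if h.probability > min_prob and (h.dali_z > min_z or h.same_superfamily)
    )


# Made-up hits for a single query profile.
hits = [
    Hit(0.99, 12.0, True, False),   # confident hit, same superfamily
    Hit(0.95, 5.5, False, False),   # confident hit, structurally similar
    Hit(0.98, 1.0, False, True),    # confident hit to the inserted domain
    Hit(0.50, 8.0, True, False),    # below the probability cut-off
]
print(profile_is_corrupted(hits), count_homologs(hits))
```

Summing `profile_is_corrupted` over all query profiles would give the corrupted-profile count for one method, and summing `count_homologs` would give its total number of recovered homologs.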

3 RESULTS

HangOut is intended to clean PSI-BLAST generated profiles of erroneous extensions caused by domain insertions. One typical example of this domain problem is shown in Figure 1b: an α/β P-loop hydrolase (yellow in Fig. 1b) is inserted into an α-helical bundle (blue and red in Fig. 1b). Corruption of the PSI-BLAST alignment built from hits to the α-helical bundle is evidenced by a profile-based similarity search (HHsearch), which finds the α/β P-loop hydrolase domain with probability 98%. Since the query α-helical bundle does not share any sequence or structural similarities with the hydrolase domain, the high HHsearch probability results from profile corruption (for details see Supplementary Fig. S2).

Given the success of this example, we tested the ability of HangOut to clean profiles of all SCOP domains with defined insertions (302 domains). As a basis for comparison, 91 PSI-BLAST profiles (30%) were corrupted. RemoveHit cleans only 23 of these profiles, while HangOut cleans all but one (Fig. 1c). The single exception is probably due to distant homology, since both the host and inserted domain represent similar doubly wound Rossmann folds (Supplementary Fig. S5). Because the removal of sequence segments from alignments may impoverish the profile, we also checked for the loss of true hits. Surprisingly, cleaned HangOut profiles retained ~98% of the homologs found by PSI-BLAST profiles (99.6% for RemoveHit), suggesting that useful information is not lost from the profiles. Compared to RemoveHit, the added complexity of HangOut, which uses domain boundary information, appears to be needed to clean corrupted profiles. The presence of overlapping HSPs (removed by RemoveHit) is not a sufficient indicator of corrupted segments: for remote homologs, only a single HSP may be found and incorrectly extended to cover part of the insertion.

Although our current HangOut procedure does not offer a comprehensive solution to the multidomain problem, it addresses a special case of domains with insertions that represent the major source of profile corruption when PSI-BLAST is initiated with single, discontinuous domain queries. HangOut will be especially useful for large-scale bioinformatics efforts that are initiated from defined structure domains and require uncorrupted sequence profiles for subsequent analysis. Additional work will be done to offer a general solution without prior knowledge of domain boundaries.

Supplementary Material

[Supplementary Data]

ACKNOWLEDGEMENTS

The authors thank Lisa N. Kinch, Jimin Pei and Jeremy Semeiks for helpful comments.

Funding: Welch Foundation I1505 (to N.V.G.).

Conflict of Interest: none declared.

REFERENCES

  • Altschul S, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PMC free article] [PubMed]
  • Bork P, Koonin EV. Predicting functions from protein sequences–where are the bottlenecks? Nat. Genet. 1998;18:313–318. [PubMed]
  • Corpet F, et al. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 2000;28:267–269. [PMC free article] [PubMed]
  • Devos D, Valencia A. Practical limits of function prediction. Proteins. 2000;41:98–107. [PubMed]
  • Friedberg I. Automated protein function prediction - the genomic challenge. Brief. Bioinform. 2006;7:225–242. [PubMed]
  • Galperin MY, Koonin EV. Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1998;1:55–67. [PubMed]
  • Galzitskaya OV, Melnik BS. Prediction of protein domain boundaries from sequence alone. Protein Sci. 2003;12:696–701. [PMC free article] [PubMed]
  • George RA, Heringa J. SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol. 2002;316:839–851. [PubMed]
  • Gonzalez MW, Pearson WR. Homologous over-extension: a challenge for iterative similarity searches. Nucleic Acids Res. 2010;38:2177–2189. [PMC free article] [PubMed]
  • Grishin NV. Fold change in evolution of protein structures. J. Struct. Biol. 2001;134:167–185. [PubMed]
  • Hegyi H, Gerstein M. Annotation transfer for genomics: measuring functional divergence in multi-domain proteins. Genome Res. 2001;11:1632–1640. [PMC free article] [PubMed]
  • Holm L, Park J. DaliLite workbench for protein structure comparison. Bioinformatics. 2000;16:566–567. [PubMed]
  • Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995;247:536–540. [PubMed]
  • Nagarajan N, Yona G. Automatic prediction of protein domains from sequence information using a hybrid learning system. Bioinformatics. 2004;20:1335–1360. [PubMed]
  • Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005;21:951–960. [PubMed]
  • Wheeler DL, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2001;29:11–16. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press
