Bioinformation. 2008; 2(7): 279–283.
Published online Feb 22, 2008.
PMCID: PMC2374371

A tool for the prediction of functionally important sites in proteins using a library of functional templates

Abstract

Understanding and characterizing the biochemical and evolutionary information within the wealth of protein sequence and structural data, particularly at functionally important sites, is very important. A comprehensive analysis of physico-chemical properties and evolutionary conservation patterns at the molecular and biological function level is expected to yield important clues for identifying similar sites in as-yet uncharacterized proteins. We present a library of protein functional templates (PFTs) designed to represent the compositional and evolutionary conservation patterns of functional sites at the molecular and biological function level. Subsequently we developed LIMACS (LInear MAtching of Conservation Scores), a software tool that uses the template library for the prediction of functionally important sites in a multiple sequence alignment, transferring the molecular function annotation from the most-similar functional site in the template library to a predicted site.

Availability

The PFT library, the LIMACS program and source code are available for PC, Mac and Linux operating systems from ftp://ftp.ncbi.nih.gov/pub/lanczyck/limacs.

Keywords: prediction, proteins, functional templates, library

Background

Determining the functional importance of proteins remains a challenging task, even several years into the post-genomic era. Several computational efforts have been undertaken to derive functional insights using three-dimensional structural information [1,2], sequentially conserved residues [3-5], multiple sequence alignments (MSAs) [6-10] and information extracted from experimental studies and literature searches. Despite these varied efforts, the accuracy of functional site prediction remains low. Prediction accuracy can suffer in part from the lack of comprehensive knowledge of evolutionary conservation patterns of amino acid residues for a wide range of known functionally important sites. Additional factors, such as the availability of high-quality MSAs and the accurate identification of domain structure and of distantly homologous sequences, may also affect the success of protein function prediction.

Here, we present a library of protein functional templates (PFTs) along with the companion software tool LIMACS (LInear MAtching of Conservation Scores), which uses the PFT library to predict functional sites for a query MSA. Each PFT is derived from a specific column of a high-quality MSA where a known functionally important site exists, and summarizes the compositional and evolutionary conservation patterns present at that location. The Conserved Domain Database (CDD) [11] maintains such functionally annotated MSAs for a wide range of protein families. The curated CDD alignments and their linked functional annotations have been used to compute the various quantitative measures of conservation that define a PFT at each known functionally important site. The LIMACS software implements an algorithm designed to predict functionally important sites in a query multiple alignment, transferring the molecular function annotation from the most-similar PFT to the predicted query sites. This simple prediction approach is based solely on information derived from homologous sequences; no structural information is required. As a result, we envisage this method as being extremely useful in the context of large-scale functional annotation.

Methodology

PFT library

A library of protein functional templates has been assembled by creating a PFT for each of the 7108 functionally important sites specified in a set of 340 curated alignments taken from the CDD (version 2.09). The CDD alignments for these families contain functionally important sites (e.g., active sites, ligand binding sites, protein binding sites) identified in an extensive manual curation effort, and are based on evidence available in the literature and other relevant scientific sources [11].

Each PFT has a quantitative portion designed to represent the compositional and evolutionary characteristics of the functional site from which it was derived. The former is given by its functional group composition pattern expressed in terms of a ten-letter reduced amino acid alphabet that organizes the twenty standard amino acids based on their physico-chemical properties (see Table SM1 in Supplementary material). A composition pattern vector is defined, the elements of which represent the fraction of a column’s residues in each of the ten reduced functional groups. To quantify the degree of evolutionary conservation we include the information content, median PSSM (Position Specific Scoring Matrix) score, frequency of negative PSSM scores and relative weight of negative PSSM scores in the PFT [12].
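
The composition pattern vector and the information content described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the ten-group alphabet below is a hypothetical grouping chosen for illustration (the actual grouping is given in Table SM1), the information content is assumed to be measured against a uniform background over the twenty amino acids, and the PSSM-based measures are omitted because they require a scoring matrix that is not described here.

```python
import math
from collections import Counter

# HYPOTHETICAL ten-group reduced alphabet, for illustration only.
# The authors' actual grouping is defined in their Table SM1 and may differ.
REDUCED_GROUPS = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "hydroxyl": "ST",
    "acidic": "DE",
    "amide": "NQ",
    "basic": "KR",
    "histidine": "H",
    "cysteine": "C",
    "glycine": "G",
    "proline": "P",
}
GROUP_NAMES = list(REDUCED_GROUPS)
RESIDUE_TO_GROUP = {aa: g for g, aas in REDUCED_GROUPS.items() for aa in aas}


def composition_vector(column):
    """Fraction of a gapless column's residues in each of the ten groups."""
    counts = Counter(RESIDUE_TO_GROUP[aa] for aa in column.upper())
    return [counts[g] / len(column) for g in GROUP_NAMES]


def information_content(column):
    """log2(20) minus the Shannon entropy of the column, in bits.

    Assumes a uniform background distribution (an assumption of this sketch).
    """
    n = len(column)
    counts = Counter(column.upper()).values()
    entropy = -sum((c / n) * math.log2(c / n) for c in counts)
    return math.log2(20) - entropy
```

For a column such as "DDEE", all of the composition falls in the group labelled acidic here, while the information content is reduced by one bit relative to a fully conserved column because two residue types occur equally often.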

Each template also has a qualitative portion comprising its molecular and biological functional assignment. Assignments of molecular and biological function have been performed by the authors through an extensive survey of the available literature and experimental references. Tables SM2 and SM3 (see supplementary material) specify the six molecular and sixteen different biological functional categories used, and Figure 1 provides the representation of each within the PFT template library.

Figure 1
Percentage of sites belonging to six molecular and sixteen biological functional categories within the PFT library. Percentage of sites is plotted for each molecular (a) and biological (b) functional category. Actual numbers of the sites are shown on ...

LIMACS program

LIMACS employs the same scoring function we introduced previously to examine the feasibility of using a template-based approach to functional site prediction [13]. Given a multiple sequence alignment, LIMACS uses its heuristic match score to find those alignment columns that are significantly similar to a PFT from the library. The alignment columns so identified are the predicted functionally important sites, and the molecular function annotation of the best-scoring PFT is transferred to each predicted functional site. Only gapless columns of the query MSA are compared to the PFT library, although LIMACS can be extended to deal with gapped alignment columns. Additional details about the LIMACS scoring scheme can be found in the Supplementary material.
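
The overall flow of this step (skip gapped columns, find the best-scoring template, transfer its annotation) might be sketched as follows. The LIMACS match score itself is described in [13] and the Supplementary material and is not reproduced here; the `similarity` function is a stand-in chosen for simplicity, and `Template`, `predict_sites`, the `featurize` argument and the threshold are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Template:
    site_id: str             # hypothetical identifier of the source site
    features: list           # e.g. composition fractions summing to 1
    molecular_function: str  # annotation to be transferred


def similarity(a, b):
    """Stand-in score in [0, 1]: 1 minus half the L1 distance.

    NOT the LIMACS match score; assumes both vectors sum to 1.
    """
    return 1.0 - 0.5 * sum(abs(x - y) for x, y in zip(a, b))


def gapless_columns(msa_rows):
    """Yield (index, column) for columns without '-' or '.' gap characters."""
    for i, col in enumerate(zip(*msa_rows)):
        if "-" not in col and "." not in col:
            yield i, "".join(col)


def predict_sites(msa_rows, library, featurize, threshold):
    """Map column index -> (transferred function, score) for columns whose
    best-scoring template reaches the threshold."""
    predictions = {}
    for idx, column in gapless_columns(msa_rows):
        feats = featurize(column)
        best = max(library, key=lambda t: similarity(feats, t.features))
        score = similarity(feats, best.features)
        if score >= threshold:
            predictions[idx] = (best.molecular_function, score)
    return predictions
```

Passing the feature function as an argument keeps the matching loop independent of how a column is summarized, so a richer description (composition plus conservation measures) could be swapped in without changing the loop.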

Features of LIMACS

LIMACS accepts input MSAs in FASTA format. To help avoid false positives, hits between query columns and PFTs are by default screened against a set of 2000 randomly aligned sequences of 500 residues to estimate statistical noise and deal with cases where the signal-to-noise ratio is low. Various scores are provided for each reported query/template hit to aid the user in interpreting the significance of individual results (see the distribution's ‘README.txt’ file for details).
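
How LIMACS uses the random alignment internally is documented in the distribution rather than here, so the following is only one plausible reading, given as a sketch: score every column of a random alignment and take a high quantile of those scores as the noise level a real hit should exceed. The uniform residue frequencies, the 0.99 quantile and the names `random_alignment` and `noise_threshold` are assumptions of this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_alignment(n_seqs, length, rng):
    """Random gapless 'alignment' with uniformly drawn residues
    (uniform frequencies are an assumption of this sketch)."""
    return ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
            for _ in range(n_seqs)]


def noise_threshold(best_score, n_seqs=2000, length=500, quantile=0.99, seed=0):
    """Score each random column with best_score(column), a caller-supplied
    function returning the column's best match score against the template
    library, and return the requested quantile of those scores."""
    rng = random.Random(seed)
    rows = random_alignment(n_seqs, length, rng)
    scores = sorted(best_score("".join(col)) for col in zip(*rows))
    return scores[min(len(scores) - 1, int(quantile * len(scores)))]
```

A query hit would then be reported only if its score exceeds the returned value; fixing the seed makes the estimate reproducible between runs.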

Prediction of functionally important sites

The performance of LIMACS was measured by computing the average fraction of predicted true positives in a 5-fold cross-validation analysis, in which the full template library was divided into five parts: four parts together were used as the database template library, with the remaining part as the test set. This procedure was repeated five times by randomly generating different subsets. The accuracy of the prediction was calculated as the number of correctly matched functional sites divided by the total number of predictions at a given match score threshold. This cross-validation analysis suggests a high accuracy (~73%) in functional site attribution.
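
The accuracy definition above (correct matches divided by all predictions at a given threshold) and the five-way split can be written down compactly. This sketch assumes a standard partition into five folds, which is one reading of the procedure; the function names and the (score, predicted, true) triple layout are hypothetical.

```python
import random


def five_fold_splits(items, k=5, seed=0):
    """Shuffle items and yield (library, test) pairs; each of the k parts
    serves once as the test set while the rest form the template library."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        library = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield library, folds[i]


def accuracy_at_threshold(predictions, threshold):
    """predictions: iterable of (score, predicted_label, true_label).

    Returns correct / total among predictions scoring >= threshold,
    or None when no prediction reaches the threshold.
    """
    kept = [(p, t) for s, p, t in predictions if s >= threshold]
    if not kept:
        return None
    return sum(p == t for p, t in kept) / len(kept)
```

Averaging `accuracy_at_threshold` over the five folds gives a single figure comparable in form to the one reported above.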

Next, the sensitivity and specificity of the prediction algorithm were evaluated using a slightly modified 5-fold cross-validation analysis, in which we examined the ability of LIMACS to pick out functional sites among a mixture of functional and non-functional sites in an MSA. As before, the full template library was randomly divided by removing a test set containing 20% of the PFTs. In this case, however, individual PFTs were not extracted independently; rather, all of the PFTs in the library derived from the same CDD alignment were extracted together. This results in a test set of MSAs that are no longer represented among the remaining PFTs in the template library, each of which has a set of well-defined true positives (i.e., the 20% of PFTs extracted from the full template library). The MSAs containing PFTs in this test set are used as the input to LIMACS, which attempts to match each column (not just those columns corresponding to the PFTs) against a PFT from the template library. This procedure was repeated five times by randomly generating different subsets. The sensitivity of predicting correct functional sites at a 15% false positive rate (error rate) is ~67% (unpublished data).
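
The two ingredients of this second analysis, holding out whole alignments and reading off sensitivity at a fixed false positive rate, can be sketched as below. The function names, the template-to-alignment mapping and the (score, is_functional) pair layout are assumptions made for illustration, and the greedy way of reaching roughly 20% of templates is only one possible choice.

```python
import random


def hold_out_alignments(template_to_alignment, fraction=0.2, seed=0):
    """Pick whole alignments at random until at least `fraction` of all
    templates are held out, so templates from one alignment stay together."""
    alignments = sorted(set(template_to_alignment.values()))
    random.Random(seed).shuffle(alignments)
    target = fraction * len(template_to_alignment)
    held, n_held = set(), 0
    for aln in alignments:
        if n_held >= target:
            break
        held.add(aln)
        n_held += sum(1 for a in template_to_alignment.values() if a == aln)
    return held


def sensitivity_at_fpr(scored_columns, max_fpr=0.15):
    """scored_columns: (score, is_functional) for every column of the test MSAs.

    Returns the highest sensitivity over score thresholds whose false
    positive rate does not exceed max_fpr.
    """
    pos = sum(1 for _, f in scored_columns if f)
    neg = len(scored_columns) - pos
    best = 0.0
    for thr in sorted({s for s, _ in scored_columns}):
        tp = sum(1 for s, f in scored_columns if s >= thr and f)
        fp = sum(1 for s, f in scored_columns if s >= thr and not f)
        if neg and fp / neg > max_fpr:
            continue  # threshold too permissive for the allowed error rate
        best = max(best, tp / pos if pos else 0.0)
    return best
```

Grouping by alignment matters because templates from the same family tend to resemble each other; leaving some of them in the library would make the held-out sites artificially easy to recover.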

Caveats and future development

LIMACS is written in C++ and the source code is available for download. Executables are available for Windows, Macintosh and Linux operating systems. The NCBI C++ Toolkit and C++ development tools are required to build LIMACS from source code (instructions are provided with the package). We plan to develop a web version of the LIMACS program.

Supplementary material

Data 1:

Acknowledgments

This work was supported by the Intramural Research Program of the National Library of Medicine at National Institutes of Health/DHHS.

References

1. Fetrow JS, Skolnick J. J Mol Biol. 1998;281:949.
2. Zhang B, et al. Protein Sci. 1999;8:1104.
3. Hofmann K, et al. Nucleic Acids Res. 1999;27:215.
4. Ashburner M, et al. Nature Genetics. 2000;25:25.
5. Casari G, et al. Nat Struct Biol. 1995;2:171.
6. Hannenhalli S, Russell RB. J Mol Biol. 2000;303:61.
7. Li L, et al. Proc Natl Acad Sci USA. 2003;100:4463.
8. Pei J, Grishin NV. Bioinformatics. 2001;17:700.
9. Lichtarge O, et al. J Mol Biol. 1996;257:342.
10. Berezin C, et al. Bioinformatics. 2004;20:1322.
11. Marchler-Bauer A, et al. Nucleic Acids Res. 2002;30:281.
12. Chakrabarti S, et al. Nucleic Acids Res. 2006;34:2598.
13. Chakrabarti S, Lanczycki CJ. Protein Sci. 2006;16:4.

Articles from Bioinformation are provided here courtesy of Biomedical Informatics Publishing Group
