Prediction and evolutionary information analysis of protein solvent accessibility using multiple linear regression

Proteins. 2005 Nov 15;61(3):481-91. doi: 10.1002/prot.20620.

Abstract

A multiple linear regression method was applied to predict real values of solvent accessibility from sequence and evolutionary information. This method allowed us to obtain regression and correlation coefficients relating the occurrence of an amino-acid residue at a target position and at its neighboring sequence positions to the solvent accessibility of the target residue. Our linear regression models based on sequence information and on evolutionary information were found to predict residue accessibility with mean absolute errors of 18.9% and 16.2%, respectively, which is better than or comparable to the best available methods. To examine the role of evolutionary information at neighboring positions, a correlation matrix for several of these positions was developed and analyzed. As expected, the effective frequency of hydrophobic residues at target positions shows a strong negative correlation with solvent accessibility, whereas the reverse is true for charged and polar residues. The correlation of solvent accessibility with effective frequencies at neighboring positions falls abruptly with distance from the target residue. Longer protein chains were found to be predicted more accurately than shorter ones.
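
The general idea of regressing accessibility on a window of residues can be illustrated with a minimal Python sketch. It is not the authors' implementation: the one-hot window encoding, the window size, the function names (`encode_windows`, `fit_linear_model`, `mean_absolute_error`), and the toy sequence and accessibility values are all assumptions made for illustration. An evolutionary variant would presumably replace the one-hot columns with per-position residue frequencies from an alignment profile, which is not shown here.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_windows(sequence, half_window=2):
    """One-hot encode each residue together with its sequence neighbors.

    Each row has 20 columns per window position plus a final intercept
    column. Window positions beyond the chain ends stay all-zero.
    (Hypothetical encoding, chosen for illustration only.)
    """
    n = len(sequence)
    width = 2 * half_window + 1
    features = np.zeros((n, width * 20 + 1))
    features[:, -1] = 1.0  # intercept term
    for i in range(n):
        for offset in range(-half_window, half_window + 1):
            j = i + offset
            if 0 <= j < n and sequence[j] in AA_INDEX:
                col = (offset + half_window) * 20 + AA_INDEX[sequence[j]]
                features[i, col] = 1.0
    return features


def fit_linear_model(features, accessibility):
    """Ordinary least-squares fit; returns one coefficient per column."""
    coef, *_ = np.linalg.lstsq(features, accessibility, rcond=None)
    return coef


def mean_absolute_error(predicted, observed):
    """Mean absolute difference between predicted and observed values."""
    return float(np.mean(np.abs(predicted - observed)))


if __name__ == "__main__":
    # Toy input: the sequence and accessibility values are placeholders,
    # not data from the paper.
    seq = "MKTAYIAKQRQISFVKSHFSRQ"
    rng = np.random.default_rng(0)
    acc = rng.uniform(0.0, 100.0, size=len(seq))  # made-up relative accessibility (%)

    X = encode_windows(seq, half_window=2)
    coef = fit_linear_model(X, acc)
    print("training MAE:", mean_absolute_error(X @ coef, acc))
```

In this sketch the fitted coefficients play the role of per-residue, per-position contributions to accessibility, so inspecting the block of 20 coefficients for a given window offset gives a rough view of how much that neighbor position matters. A real evaluation would measure the error on held-out chains rather than on the training data.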

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acids / chemistry
  • Evolution, Molecular*
  • Linear Models
  • Proteins / chemistry*
  • Regression Analysis
  • Sequence Alignment
  • Sequence Analysis, Protein
  • Solvents / chemistry*

Substances

  • Amino Acids
  • Proteins
  • Solvents