Evolution of protein structures and protein-protein interaction networks.
My research interests focus on
understanding how protein structure, function and protein-protein
interaction networks change in the course of evolution. Protein evolution
occurs under strong functional and structural constraints, and understanding
the relationships between protein sequence
(genotype) and structure/function (phenotype) is crucial for
inferring the causes of many diseases.
We investigate how tolerant different protein structures are to sequence
change and whether this structural plasticity depends on protein function
or fold (1).

Proteins perform their functions in
a cell environment via interaction with other proteins/domains. As a result,
structural changes that are not deleterious in a single protein/domain can
have lethal effects when combined in a protein complex. As the rate
at which structures are solved continues to increase, the analysis of interactions
between domains within the same protein structure can provide important
clues for understanding the interactions between proteins in a cell. We
analyze the conservation of interaction patterns between protein domains
and investigate how different protein structures can adapt to multiple
interaction partners (2, 3, 4).
Evolution of protein loops and indels.
The intervening unaligned regions ("loops") between the superimposable helices and strands in proteins can
exhibit a wide range of similarity and may offer clues to the structural
evolution of folds. One might argue that more closely related proteins
differ less in their nonconserved loop regions
than distantly related proteins and, at the same time, the degree of
variability in the loop regions in structurally similar but unrelated
proteins is higher than in homologs. Conventional
sequence and structure similarity measures that compare protein
cores are often not sensitive enough to detect subtle (dis)similarities
between proteins; we therefore developed loop-based metrics to improve
protein classification and gauge evolutionary relationships between proteins (5, 6, 7).

Changes in protein domains result mostly from
point mutations, insertions and deletions. Although amino acid
insertion and deletion (indel) events in proteins
are less frequent than amino acid substitutions, they can have a major
effect on protein evolution. The mechanisms of indel
events are not well understood, and there are only a few statistical
models describing these events in evolution. We studied whether
insertion and deletion events in protein domains are balanced and whether there are trends toward increasing or decreasing indel or domain lengths.
We found that more than one
third of all studied domains (the test set of 362 manually curated domain alignments, together with their rooted
phylogenetic trees, is available at ftp://ftp.ncbi.nih.gov/mmdb.tree.files)
have a statistically significant tendency to increase or decrease in size in
evolution, as judged from the overall domain size distribution as well as
from the size distribution of individual indels.
Moreover, the fraction of domains and individual indels
increasing in size is almost twofold larger than the fraction decreasing in
size. We showed that the tolerance to insertion and deletion events depends
on the domain's taxonomic span. Eukaryotic domains are depleted in
insertions compared to the overall test set, whereas
ancient domain families show some bias towards insertions (8).
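
To make the notion of a "statistically significant tendency" concrete, the following minimal sketch tests whether insertions and deletions counted in one domain family are balanced. It uses an exact two-sided binomial (sign) test, which is an assumption chosen for illustration; the text above does not state which statistical procedure was used in (8). The function names, the 0.05 threshold and the example counts are all hypothetical.

```python
from math import comb


def two_sided_binomial_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k out of n trials.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that ties are not lost to floating-point rounding.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def indel_bias(n_insertions: int, n_deletions: int, alpha: float = 0.05):
    """Classify a domain family as 'growing', 'shrinking' or 'balanced'.

    Assumption: each indel event is an independent trial that is an
    insertion with probability 0.5 under the null hypothesis.
    """
    n = n_insertions + n_deletions
    p_value = two_sided_binomial_p(n_insertions, n)
    if p_value >= alpha:
        trend = "balanced"
    elif n_insertions > n_deletions:
        trend = "growing"
    else:
        trend = "shrinking"
    return trend, p_value


if __name__ == "__main__":
    # Hypothetical counts, not taken from the test set described above.
    print(indel_bias(18, 2))
    print(indel_bias(10, 10))
```

A per-family test of this kind treats events as independent, which ignores the phylogenetic tree; a fuller analysis would have to account for where on the rooted tree each event occurred.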
Prediction of protein function.
Protein classification can be exploited to transfer functional annotation from
experimentally characterized proteins to uncharacterized homologs. However,
common descent does not necessarily imply functional similarity and
functional annotation transferred from one homologous protein to another
can result in incorrect assignment. To verify functional assignments, we
examine features conserved among families of homologs to identify
family- and subfamily-specific functionally important sites (9).
The rapid increase in the amount of protein
sequence data has created a need for automated identification of sites that
determine functional specificity among related subfamilies of proteins. A
significant fraction of subfamily specific sites are only marginally
conserved, which makes it extremely challenging to detect those amino acid
changes that lead to functional diversification.
To address this critical problem,
we developed a method named SPEER (specificity prediction using amino
acids' properties, entropy and evolution rate) to distinguish specificity-determining
sites from other sites. SPEER encodes the conservation patterns of
amino acid types using their physico-chemical
properties and the heterogeneity of evolutionary changes between and within
the subfamilies. To test the method, we compiled a test set containing 13
protein families with known specificity determining sites (the set of
alignments together with the subfamily determinants can be obtained at ftp://ftp.ncbi.nih.gov/pub/chakraba/SPEER/).
Extensive benchmarking against other
specificity-site prediction algorithms has shown that SPEER performs better in
predicting several categories of subfamily-specific sites (10).
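
The idea of a site that is conserved within subfamilies but differs between them can be illustrated with a simple entropy contrast. The sketch below is not SPEER: it uses only Shannon entropy of residue identities and ignores the physico-chemical properties and evolutionary rates that SPEER combines. The function names and the toy alignment are hypothetical, and gap characters would be treated as just another residue type.

```python
from collections import Counter
from math import log2


def shannon_entropy(residues):
    """Shannon entropy (in bits) of the residue types in one alignment column."""
    counts = Counter(residues)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def specificity_scores(subfamilies):
    """Score each alignment column by (overall entropy - mean within-subfamily entropy).

    `subfamilies` maps a subfamily name to a list of aligned sequences of
    equal length. A high score marks a column that varies across the whole
    family but is conserved inside each subfamily.
    """
    all_seqs = [seq for seqs in subfamilies.values() for seq in seqs]
    length = len(all_seqs[0])
    scores = []
    for col in range(length):
        overall = shannon_entropy([seq[col] for seq in all_seqs])
        within = sum(
            shannon_entropy([seq[col] for seq in seqs])
            for seqs in subfamilies.values()
        ) / len(subfamilies)
        scores.append(overall - within)
    return scores


if __name__ == "__main__":
    # Toy alignment: column 0 is invariant, column 1 separates the two
    # subfamilies, column 2 varies equally inside both subfamilies.
    toy = {"A": ["GKD", "GKE"], "B": ["GRD", "GRE"]}
    print(specificity_scores(toy))
```

In the toy alignment only the middle column gets a high score, which matches the intuition that fully conserved and uniformly variable columns carry no subfamily-specific signal.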
Algorithms of sequence alignment and fold recognition.
Pairwise sequence alignment methods may fail to detect distant evolutionary
relationships in the twilight zone of sequence similarity, whereas methods
based on the analysis of residue conservation patterns in multiple
sequence alignments have proved very powerful in this respect. Moreover,
algorithms for protein structure prediction and fold recognition may
recognize even more remote evolutionary relationships that are not
detectable by sequence comparison alone.
We develop sequence alignment algorithms that score
sequence and structure conservation within protein families (PSSM-based protein
threading; 11, 12), algorithms that find an optimal alignment between two
sequence profiles (profile-profile alignment; 13), and algorithms that
refine existing multiple sequence alignments while retaining the
structural and functional information embedded in the protein family model (14).
Realignment of each
sequence can correct misalignments between a given sequence and the rest of
the profile while preserving the family's overall block
model. Large-scale benchmarking studies showed a noticeable improvement in
alignment quality after refinement.
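
As a rough illustration of what aligning two sequence profiles involves, the sketch below runs a global dynamic-programming alignment over profile columns. It is not the algorithm of (13): the column score (a shifted dot product of residue frequencies), the linear gap penalty and all names and parameter values are assumptions made for this example.

```python
def column_score(col_a, col_b, shift=0.5):
    """Dot product of two residue-frequency columns, shifted by a constant
    so that dissimilar columns score below zero (assumed scoring scheme)."""
    return sum(freq * col_b.get(res, 0.0) for res, freq in col_a.items()) - shift


def align_profiles(prof_a, prof_b, gap=-0.5):
    """Global (Needleman-Wunsch style) alignment of two profiles.

    Each profile is a list of columns; each column is a dict mapping a
    residue to its frequency. Returns the optimal score and the list of
    aligned column index pairs (i in prof_a, j in prof_b).
    """
    n, m = len(prof_a), len(prof_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(prof_a[i - 1], prof_b[j - 1]),
                score[i - 1][j] + gap,  # column of prof_a aligned to a gap
                score[i][j - 1] + gap,  # column of prof_b aligned to a gap
            )
    # Trace back from the bottom-right corner to recover matched columns.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + column_score(prof_a[i - 1], prof_b[j - 1])
        if score[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


if __name__ == "__main__":
    # Toy profiles with fully conserved columns (hypothetical data).
    prof_a = [{"A": 1.0}, {"K": 1.0}, {"G": 1.0}]
    prof_b = [{"A": 1.0}, {"G": 1.0}]
    print(align_profiles(prof_a, prof_b))  # (0.5, [(0, 0), (2, 1)])
```

In the toy case the middle column of the first profile is left unmatched, so the two conserved columns pair up. Practical profile-profile methods use log-odds or information-theoretic column scores and affine gap penalties rather than the simple choices made here.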