Compressing Proteomes: The Relevance of Medium Range Correlations

Copyright © 2007 Dario Benedetto et al.

1Dipartimento di Matematica, Università di Roma “La Sapienza”, Piazzale Aldo Moro 5, Roma 00185, Italy
2Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, Heidelberg 69117, Germany
*Claudia Chica: Email: chica/at/embl.de

Recommended by Teemu Roos

Received January 14, 2007; Revised May 28, 2007; Accepted September 10, 2007.

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at a short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.

1. INTRODUCTION

Protein sequences have been considered for a long time as nearly random or
highly complex sequences, from the informational content point of view. The
main reason for this is the local complexity of amino acid composition, that is, the type and number of amino acids found in a sequence segment, especially
inside the globular domains [1]. This complexity could be related to the so-called
randomness of coding sequences in DNA, already pointed out in a
pioneering work [2] and explained by evolutionary models [3]. Studies on
protein sequence compression show that proteins behave as sequences of
independent characters and have a very low compressibility, around 1%
[4]. The ordered set of protein sequences belonging to one organism,
the proteome, was also considered incompressible due to this
weak Markov dependency [5]. Improvements were obtained in [6, 7].
However, later studies [8–10] suggest that proteomes contain
different sources of regularities, and can be compressed to rates around
30%. For a relevant discussion on the validity of these results see Cao et al.
[7].

In this work, we focus on the statistical study of proteome sequences, using the concept
of entropy introduced into information theory by Shannon [11]. The Shannon entropy is
related to the amount of information of a sequence emitted by a certain source.

Statistical compression algorithms achieve their goal by assigning shorter code
words to the most probable characters; their efficiency depends on the accuracy
of the model used to estimate each character’s probability. Models try to take
advantage of the correlations between characters considering, for example, how
the preceding characters, that is, the character’s context, determine the
probability of the next one, as in the prediction by partial matching (PPM)
scheme [12].

The most successful algorithms for proteome compression are based on the
identification of duplicated sequences or repeats. The compress protein (CP)
algorithm [5], for example, considers that duplicated sequences in proteomes are
similar but not identical because of mutation and evolutionary divergence. CP
uses a modified PPM that includes the probability of amino acid substitutions
when estimating each residue probability. The ProtComp algorithm [8] optimises
the use of approximate repeats by updating the amino acid substitution matrix
as the repeated similar blocks appear along the sequence. Context-tree
weighting (CTW) [13] is another context-based method that has been
applied to biological sequence compression. In [6] the authors present a
CTW-based algorithm that predicts the probability of a character by
weighting the importance of short and long contexts, considering as well the
occurrence of approximate repeats or palindromes in those contexts. XM
[7] is a statistical algorithm which combines, via a Bayesian average,
the probability of an amino acid calculated on a local scale with the
probability of that same residue being part of a duplicated region of the
proteome.

Nonstatistical approaches, based on the Burrows-Wheeler transform (BWT) [9],
have also been used for identifying overlapping and distant repeats in proteomes,
and efficiently use them in compression. Even simpler models, which rely on a
block code representation of the protein sequences [10], have proved to be
successful in some cases.

All the algorithms discussed above provide evidence of the existence and importance
of redundancy in proteome sequences. Here we present a purely statistical study of 8 eukaryotic
and prokaryotic proteomes. Firstly, we analyse the correlation function of the whole
sequences and find evidence of medium range correlations, between amino acids
located 100 residues apart. Then we calculate the amino acid correlations considering
the protein boundaries and identify the role of the intra/interprotein
scale in determining the medium range correlations. Furthermore, we
generate groups of amino acids using their pair correlations at distance 100, that
reveal the structural meaning of the medium range correlations. Using the results
of proteome correlations, we propose a statistical model for the distribution of amino
acids in 4 proteomes: Haemophilus influenzae (bacteria), Methanococcus jannaschii
(archaea), Saccharomyces cerevisiae (eukarya) and Homo sapiens (eukarya), and
we estimate their compression rate to compare our results against previous
works.

The sources of nonrandomness studied fall into two scales: the medium range
correlations between amino acids of the same and neighboring sequences,
at distances of order 100, and the short range Markovian correlations
between the contiguous residues up to distance 10. Previous studies [9] show
that proteomes present repeated subsequences at very long distances (50–300). In
this article, we do not consider these long-range correlations of the order of the
proteome length. Protein length range correlations are in agreement with the
process of sequence duplication, as has been previously suggested for long-range correlations [9]; in addition, we show that they also contain
information about the three-dimensional structure of the proteins. Short range
correlations might instead relate to the local constraints on amino acid
distribution due to secondary structure requirements.

2. RESULTS AND DISCUSSION

For our statistical analysis, we used the proteomes of 4 prokaryotic
and 4 eukaryotic organisms shown in Table 1. They were retrieved from the database of
the Integr8 web portal [14], with the exception of the Hi, Mj, Sc, and Hs
proteomes that were obtained from the protein corpus in [15], for the sake
of comparison of our compression rate results with previous studies on
the same proteomes. The proteomes are not complete (in particular the
version of Hs in the protein corpus) but they represent a natural set of
proteins where the redundancy has a biological meaning. It is important to
remark that the order of the proteins in the proteome files of the
Integr8 database is not the natural one, so those files are not directly useful for our
analysis. Nevertheless, using the additional information available in the
database, it is possible to order the proteins as they are found in the
chromosomes. The proteome files of the protein corpus do not present this
problem, but the sequence of the proteins is not available. Therefore, for the
analysis shown in Table 2 and in Figure 2

2.1. Correlations

As a first approximation to the general trends in residue distribution, we study the
co-occurrence of amino acids. More precisely, we calculate the pair correlations at
different distances, that is, the average number of times equal residues a appear at
distance k along the whole sequence.
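
As a rough illustration of a same-residue pair correlation at distance k, the sketch below counts how often a position and the position k residues later carry the same amino acid, and subtracts the value expected if residues were independent. The function name `pair_correlation` and the normalisation (observed match fraction minus the sum of squared residue frequencies) are assumptions made for this example, not necessarily the definition used in the analysis.

```python
from collections import Counter


def pair_correlation(seq, k):
    """Excess probability of finding the same residue at positions i and i + k.

    Assumed definition (illustrative only): the observed fraction of
    positions with seq[i] == seq[i + k], minus the value expected for
    independent residues, sum_a p_a ** 2.
    """
    n = len(seq) - k
    if k <= 0 or n <= 0:
        raise ValueError("k must satisfy 0 < k < len(seq)")
    # Observed fraction of equal residues at distance k.
    matches = sum(1 for i in range(n) if seq[i] == seq[i + k])
    # Expected fraction under independence, from single-residue frequencies.
    total = len(seq)
    expected = sum((c / total) ** 2 for c in Counter(seq).values())
    return matches / n - expected


if __name__ == "__main__":
    toy = "MKVLAAGIVGLL" * 40  # hypothetical toy sequence, not real data
    for k in (1, 10, 100):
        print(k, round(pair_correlation(toy, k), 4))
```

A value near zero indicates no excess of equal residues at that distance; positive values indicate that equal residues co-occur more often than chance would predict.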
The medium range correlations imply that, in proteomes, the amino acid
distribution of neighboring proteins tends to be more similar than that of
distant ones. This fact can be related to the process of duplication, recognized as
the dominant force in the evolution of protein function [16]. As protein
repeats have been related to duplication at different scales (genome,
gene, or exon) [17], it is possible that the amino acid patterns responsible
for the observed medium range correlation have the same evolutionary
origin.

Due to the correlation definition used, the medium range correlations could be
caused either by pairs of amino acids belonging to the same protein, or to different
ones. Therefore, we split the nonlocal correlation into two groups and analyse
them separately: interprotein correlations (between 2 contiguous proteins) and
intraprotein correlations (inside the same protein sequence). In Table 2, we
present the results for the intraprotein correlation between the two halves of the
same protein and the interprotein correlation between corresponding and noncorresponding halves of two contiguous proteins (for instance, first half with first half).

Gene duplication can explain both the existence and order dependence of
interprotein correlation, but it is not enough to justify why intraprotein
correlations remain high, because high interprotein correlations can also appear
when intraprotein correlations are low. Indeed, the presence of intraprotein
correlations indicates a nonrandom distribution of amino acids at a protein
length scale. This nonrandomness can be related to segmental duplication, that is, duplication of segments inside the same protein; likewise, it can reflect
the maintenance of amino acid patterns during the protein divergence that
follows gene duplication as a consequence of the structural constraints imposed
upon protein sequences. As an example, extensive searches of protein databases [18] reveal
the high frequency of tandemly repeated sequences.

The evidence obtained from the correlation analysis does not allow us to clarify the
nature of the structural constraints measured: do they reflect the modular
repetition of secondary structure elements, caused by duplication, or do
they depend on the conservation of higher order tertiary structure units like
domains? We try to address this question by defining amino acid groups as
explained in the next section.

2.2. Grouping of amino acids

In a previous study [4], the complexity of large sets of nonredundant protein
sequences was measured using a reduced alphabet approximation, that is,
using groups of amino acids defined by an a priori classification. The
Shannon entropy was then estimated from the entropies of blocks of n characters.
The authors did not find enough evidence to support the existence of short range
correlations between the amino acids of protein sequences.

Conversely, given the above evidence of medium range correlations in proteome
sequences, we build groups of correlated amino acids using the pair correlations between
amino acids at medium range. We construct the groups in such a way that they maximise
the positive medium range correlation; in practical terms, this means that
amino acids which are more likely to appear at distances of order 100
are grouped together. The number and the structure of the groups chosen are those that give the highest value of this correlation.

The idea behind our grouping scheme is to simplify the amino acid pattern
mining by taking advantage of their synonymous relationships. It is well
known that mutations between amino acids sharing geometrical and/or
physico-chemical properties are the basis of neutral evolution at a molecular level
[20]; this fact also explains why there is not a one-to-one relationship between
protein sequences and structures [21]. Moreover, structurally neighboring
residues have been found to distribute differentially (proximally/distally) in
the protein sequences, depending on their physico-chemical properties
[22]. Indeed, the groups defined from the pair correlations at a medium range
(Table 3) almost correspond with the natural classification based on their
physico-chemical properties: hydrophobic, polar, charged, small, and ambiguous.
In particular, the fact that hydrophobic amino acids group together suggests
that the correlation function is gathering some of the three-dimensional
information contained in the protein sequence, more precisely tertiary structure
information, as hydrophobic interactions are considered the driving forces of the
protein folding process [23].
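
The grouping idea of this section, putting together residues whose medium range pair correlations are positive, can be illustrated with a small sketch. It uses a simple greedy merging heuristic chosen for brevity, which is not claimed to be the optimisation procedure used for Table 3; the function name `group_by_correlation` and the toy correlation values are hypothetical.

```python
from itertools import combinations


def group_by_correlation(residues, corr):
    """Greedy merging: repeatedly join the two groups with the largest
    positive summed cross-correlation, stopping when no merge has a
    positive gain.

    `corr` maps frozenset({a, b}) to a (hypothetical) medium range pair
    correlation between residues a and b; missing pairs count as 0.
    """
    groups = [frozenset([r]) for r in residues]

    def gain(g, h):
        # Total correlation between members of two disjoint groups.
        return sum(corr.get(frozenset((a, b)), 0.0) for a in g for b in h)

    while len(groups) > 1:
        best = max(combinations(groups, 2), key=lambda pair: gain(*pair))
        if gain(*best) <= 0:
            break
        groups = [g for g in groups if g not in best] + [best[0] | best[1]]
    return groups


if __name__ == "__main__":
    # Toy values only: positive among L, I, V and between K, E.
    toy_corr = {
        frozenset("LI"): 0.30, frozenset("LV"): 0.20, frozenset("IV"): 0.25,
        frozenset("KE"): 0.40, frozenset("LK"): -0.30, frozenset("IE"): -0.20,
    }
    print(group_by_correlation("LIVKE", toy_corr))
```

With these toy values the hydrophobic residues end up in one group and the charged residues in another, which mirrors the kind of separation discussed above.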
Therefore, the reason why intraprotein correlations remain high is related not
only to the repetition of secondary structure units, but also to
the conservation of the amino acids responsible for the protein tertiary
structure. Besides this, it is important to note that, even if the amino acid usage in
eukaryotes and prokaryotes is very similar [24], the amino acid correlations are
not, as they collect part of the structural information contained in the sequences.
The number of groups is also different: 3 for H. influenzae and M. jannaschii, 4 for
S. cerevisiae and H. sapiens. This could indicate a higher interchangeability of
residues in some proteomes, but further analysis is needed to confirm this
hypothesis.

2.3. Sequence entropy estimation

In order to quantify the capability of a statistical model to identify the nonrandomness of a sequence, one can use it to construct an arithmetic coding
compressor [25]. We estimate the compression rate of such a compressor with the
sequence entropy.
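
To make the link between a statistical model and its compression rate concrete, the sketch below computes the ideal code length, in bits per character, that an arithmetic coder would approach when driven by a model's predicted probabilities. The model used here is a deliberately simple adaptive order-0 model with add-one smoothing, chosen only for illustration; the baseline of log2(20) bits per residue in `compression_rate` is also an assumption of this example.

```python
import math
from collections import Counter


def adaptive_code_length(seq, alphabet):
    """Ideal arithmetic-coding cost, in bits per character, of `seq` under an
    adaptive order-0 model with add-one (Laplace) smoothing.

    Each character costs -log2(p), where p is the probability the model
    assigned to it before seeing it.
    """
    counts = Counter()
    seen = 0
    bits = 0.0
    for ch in seq:
        p = (counts[ch] + 1) / (seen + len(alphabet))
        bits -= math.log2(p)
        counts[ch] += 1
        seen += 1
    return bits / len(seq)


def compression_rate(bits_per_char, alphabet_size=20):
    """Fractional saving relative to a flat code of log2(alphabet_size) bits
    per character (an assumed baseline)."""
    return 1.0 - bits_per_char / math.log2(alphabet_size)


if __name__ == "__main__":
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    toy = "MKVLAAGIVGLLLAQ" * 30  # hypothetical toy sequence, not real data
    h = adaptive_code_length(toy, amino_acids)
    print(round(h, 3), round(compression_rate(h), 3))
```

Any better model, for instance one conditioning on preceding residues, plugs into the same accounting: only the probability `p` assigned to each character changes.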
Previous works on protein sequence compression like [5] are based on
short range Markovian models. In those models, the probability of each
amino acid is calculated as a function of the context in which it appears,
considering the frequency with which this amino acid follows the preceding residues.

Following this idea, we start our statistical description of proteome sequences taking
into account the information given by the neighboring residues using a variation
of the interpolated Markov models [26]. In order to predict the probability of the ith character, we consider contexts up to a fixed maximum length. The model depends on a set of short range parameters.

The short range correlations support the existence of periodic patterns in protein
sequences. They can be caused by the alternation of alpha-beta secondary
structure units, as argued in other works on latent periodicity of protein
sequences [27, 28]. From the point of view of protein sequence evolution, the
short range parameters can also reflect the existence of constraints on the
distribution of residues. Protein sequences are modified by mutation, but still
have to cope with folding requirements that determine a nonrandom positioning
of key residues, depending on their geometrical and physico-chemical properties.
In fact, structural alphabets derived from hidden Markov models indicate that
local conformations of protein structures have different sequence specificity
[29].

The intra/interprotein correlations identified in previous sections suggest that
the frequencies of the single residues have nonnegligible fluctuations on the
medium range. We take into account these fluctuations in our second model
(model 2 in Table 4).
Finally, in model 3, we use the groups found in Section 2.2 (see Table 3). In
particular, a contribution to the probability of a given residue is obtained by
computing the probability that the residue belongs to a certain group
and then the conditional probability of the residue once the group is
given.
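
The two-step estimate described for model 3, first the probability of the group and then the probability of the residue given its group, can be sketched as follows. This is a minimal illustration under stated assumptions: the group probability is estimated from a window of recent residues with add-one smoothing, and the within-group probability comes from overall residue frequencies. The function name, the window, and the toy partition are hypothetical and do not reproduce the actual model.

```python
from collections import Counter


def grouped_probability(residue, window, groups, background):
    """P(residue) = P(group | window) * P(residue | group).

    window     : recent residues used to estimate group frequencies
    groups     : dict residue -> group label (hypothetical partition)
    background : dict residue -> overall frequency in the proteome
    """
    labels = set(groups.values())
    group_counts = Counter(groups[r] for r in window)
    g = groups[residue]
    # Add-one smoothing so that unseen groups keep a nonzero probability.
    p_group = (group_counts[g] + 1) / (len(window) + len(labels))
    # Conditional probability of the residue inside its group.
    group_mass = sum(f for r, f in background.items() if groups[r] == g)
    p_in_group = background[residue] / group_mass
    return p_group * p_in_group


if __name__ == "__main__":
    toy_groups = {"L": "hydrophobic", "I": "hydrophobic",
                  "K": "charged", "E": "charged"}
    toy_background = {"L": 0.30, "I": 0.20, "K": 0.25, "E": 0.25}
    for aa in "LIKE":
        print(aa, round(grouped_probability(aa, "LLIK", toy_groups, toy_background), 4))
```

Because the group probabilities sum to one and the within-group probabilities sum to one inside each group, the resulting residue probabilities form a proper distribution and can feed an arithmetic coder directly.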
As one can see in Table 4, the capability of our statistical model to represent the
nonrandom information contained in proteomes is comparable to that of models
that consider repeated amino acid patterns at both short and medium scale [6, 7].

The improvement in the performance of models 2 and 3 is due to the fact that
they identify the short range correlations and separate them from the
fluctuations of amino acid frequencies at a protein length range. This
demonstrates that both correlation types are informative and that the statistical
significance of repetitions at those scales is enough to model the amino acid
probabilities.

The compression rate achieved when the medium range correlations are modelled
with the frequency of amino acid groups (model 3) is almost equivalent to the
compression rate of model 2. From a biological perspective, this indicates that
groups of amino acids are meaningful, and that the redundant information at
medium scale has a structural component that might come from
three-dimensional structure constraints.

According to our results, there is an important difference in the compressibility
rates of the eukaryotic and prokaryotic proteomes, which is in agreement with the
correlation function in Figure 1.

3. CONCLUSIONS

In this article, we show that the correlation function gathers evolutionary and
structural information of proteomes. Even if proteins are highly complex
sequences, at a proteome scale, it is possible to identify correlations between
characters at short and medium ranges. This confirms that protein sequences are
not completely random; indeed, they present repeated amino acid patterns at
those two scales. The alternation of secondary structure units can determine the
local redundancy. This was already known and generally modelled using Markov
models. In our opinion, sequence duplication is a reasonable explanation for the
interprotein correlation. However, it does not account for the intraprotein
correlations; this can instead be related to the maintenance of the amino acid
patterns responsible for the three-dimensional structure, as the segregation
between hydrophobic and polar amino acids indicates. In more detail, the
sampling of the space of structures during proteome evolution is determined by
the duplication processes but it is highly constrained by the structural and
functional requirements that protein sequences have to meet inside a living
system.

Prokaryotic proteomes show lower correlation values, especially at shorter distances.

ACKNOWLEDGMENTS

The authors would like to thank Toby Gibson for reading and commenting on the
manuscript and the reviewers for their constructive criticism that helped to
improve the quality of the paper.

References

1. Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Computers & Chemistry. 1994;18(3):269–285.
2. Blaisdell BE. A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences. Journal of Molecular Evolution. 1983;19(2):122–133.
3. Almirantis Y, Provata A. An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome. BioEssays. 2001;23(7):647–656.
4. Weiss O, Jiménez-Montaño MA, Herzel H. Information content of protein sequences. Journal of Theoretical Biology. 2000;206(3):379–386.
5. Nevill-Manning CG, Witten IH. Protein is incompressible. Proceedings of the Data Compression Conference (DCC '99); 1999 March; Snowbird, Utah, USA. pp. 257–266.
6. Matsumoto T, Sadakane K, Imai H. Biological sequence compression algorithms. Genome Informatics. 2000;11:43–52.
7. Cao MD, Dix TI, Allison L, Mears C. A simple statistical algorithm for biological sequence compression. Proceedings of the Data Compression Conference (DCC '07); 2007 March; Snowbird, Utah, USA. pp. 43–52.
8. Hategan A, Tabus I. Protein is compressible. Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG '04); 2004 June; Espoo, Finland. pp. 192–195.
9. Adjeroh D, Nan F. On compressibility of protein sequences. Proceedings of the Data Compression Conference (DCC '06); 2006 March; Snowbird, Utah, USA. pp. 422–434.
10. Sampath G. A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae. Proceedings of the IEEE Bioinformatics Conference (CSB '03); 2003 August; Stanford, Calif, USA. pp. 287–293.
11. Shannon CE. A mathematical theory of communication. Bell System Technical Journal. 1948;27:379–423 and 623–656.
12. Cleary J, Witten I. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications. 1984;32(4):396–402.
13. Willems FMJ, Shtarkov YM, Tjalkens TJ. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory. 1995;41(3):653–664.
14. Integr8 web portal 2006. ftp://ftp.ebi.ac.uk/pub/databases/integr8/.
15. Abel J. The data compression resource on the internet. 2005. http://www.datacompression.info/.
16. Orengo CA, Thornton JM. Protein families and their evolution: a structural perspective. Annual Review of Biochemistry. 2005;74:867–900.
17. Heringa J. The evolution and recognition of protein sequence repeats. Computers & Chemistry. 1994;18(3):233–243.
18. Andrade MA, Petosa C, O'Donoghue SI, Müller CW, Bork P. Comparison of ARM and HEAT protein repeats. Journal of Molecular Biology. 2001;309(1):1–18.
19. Kirkpatrick S, Gelatt CD Jr, Vecchi MP. Optimization by simulated annealing. Science. 1983;220(4598):671–680.
20. Mirny LA, Shakhnovich EI. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. Journal of Molecular Biology. 1999;291(1):177–196.
21. Huynen MA, Stadler PF, Fontana W. Smoothness within ruggedness: the role of neutrality in adaptation. Proceedings of the National Academy of Sciences of the United States of America. 1996;93(1):397–401.
22. Karlin S. Statistical signals in bioinformatics. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(38):13355–13362.
23. Dill KA. Dominant forces in protein folding. Biochemistry. 1990;29(31):7133–7155.
24. Rost B. Did evolution leap to create the protein universe? Current Opinion in Structural Biology. 2002;12(3):409–416.
25. Rissanen J, Langdon GG Jr. Arithmetic coding. IBM Journal of Research and Development. 1979;23(2):149–162.
26. Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Research. 1998;26(2):544–548.
27. Turutina VP, Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Identification of latent periodicity in amino acid sequences of protein families. Biochemistry (Moscow). 2006;71(1):18–31.
28. Korotkov EV, Korotkova MA. Enlarged similarity of nucleic acid sequences. DNA Research. 1996;3(3):157–164.
29. Camproux AC, Tufféry P. Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity. Biochimica et Biophysica Acta. 2005;1724(3):394–403.
30. Bentley SD, Parkhill J. Comparative genomic structure of prokaryotes. Annual Review of Genetics. 2004;38:771–791.
31. Raes J, Korbel JO, Lercher MJ, von Mering C, Bork P. Prediction of effective genome size in metagenomic samples. Genome Biology. 2007;8(1):R10.