EURASIP J Bioinform Syst Biol. 2007; 2007: 60723.
Published online 2007 October 31. doi: 10.1155/2007/60723.
PMCID: PMC2205990
Compressing Proteomes: The Relevance of Medium Range Correlations
Dario Benedetto,1 Emanuele Caglioti,1 and Claudia Chica2*
1Dipartimento di Matematica, Università di Roma “La Sapienza”, Piazzale Aldo Moro 5, Roma 00185, Italy
2Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, Heidelberg 69117, Germany
*Claudia Chica: Email: chica/at/embl.de
Recommended by Teemu Roos
Received January 14, 2007; Revised May 28, 2007; Accepted September 10, 2007.
Abstract
We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at a short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause for this redundancy is related to the evolutionary origin of proteomes and protein sequences.
Protein sequences have been considered for a long time as nearly random or highly complex sequences, from the informational content point of view. The main reason for this is the local complexity of amino acid composition, that is, the type and number of amino acids found in a sequence segment, especially inside the globular domains [1]. This complexity could be related to the so-called randomness of coding sequences in DNA, already pointed out in a pioneering work [2] and explained by evolutionary models [3]. Studies on protein sequence compression show that proteins behave as sequences of independent characters and have a very low compressibility, around 1% [4]. The ordered set of protein sequences belonging to one organism, the proteome, was also considered to be not compressible due to this weak Markov dependency [5]. Improvements were obtained in [6, 7]. However, later studies [8–10] suggest that proteomes contain different sources of regularities, and can be compressed to rates around 30%. For a relevant discussion on the validity of these results see Cao et al. [7].
In this work, we focus on the statistical study of proteome sequences, using the concept of entropy brought into information theory by Shannon [11]. The Shannon entropy is related to the amount of information of a sequence emitted by a certain source. The entropy h of a sequence is the limit of the average amount of information per character, when the length of the sequence tends to infinity. In particular, for a finite sequence of length L, the informational content in bits is approximately Lh and so Lh is the minimum length in bits of any sequence that contains the same information. In this way Lh provides a theoretical lower bound for the sequence's compression. A compression algorithm is intended to code a sequence into a shorter one, from which it is possible to recover the original unequivocally. In practice, one cannot compress at a rate equal to the Shannon entropy for the given sequence. Nonetheless, it is possible to approximate such a limit, using an efficient compression algorithm.
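As a rough illustration of the bound Lh, the sketch below computes the one-character entropy of a sequence, that is, the entropy obtained when residues are treated as independent draws from their observed frequencies. This is only an upper estimate of the entropy h discussed above, since it ignores every correlation between characters. The function name and the toy sequence are hypothetical and are not taken from the proteomes studied here.

```python
import math
from collections import Counter

def one_character_entropy(sequence):
    """Empirical entropy in bits per character, assuming independent characters."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical toy sequence (not real proteome data).
toy = "ACDEACDEACDEACDE"
h_estimate = one_character_entropy(toy)   # bits per character
bound_in_bits = len(toy) * h_estimate     # the quantity L*h for this estimate
```

For comparison, a sequence that used all 20 amino acids with equal frequency would have a one-character entropy of log2(20), about 4.32 bits per character.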
Statistical compression algorithms achieve their goal by assigning shorter code words to the most probable characters; their efficiency depends on the accuracy of the model used to estimate each character’s probability. Models try to take advantage of the correlations between characters considering, for example, how the preceding characters, that is, the character’s context, determine the probability of the next one, as in the prediction by partial matching (PPM) scheme [12].
Most successful algorithms for proteome compression are based on the identification of duplicated sequences or repeats. The compress protein (CP) algorithm [5], for example, considers that duplicated sequences in proteomes are similar but not identical because of mutation and evolutionary divergence. CP uses a modified PPM that includes the probability of amino acid substitutions when estimating each residue probability. The ProtComp algorithm [8] optimises the use of approximate repeats by updating the amino acid substitution matrix as the repeated similar blocks appear along the sequence. The context-tree weighting (CTW) [13] is another context-based method that has been applied for biological sequence compression. In [6] the authors present a CTW-based algorithm that predicts the probability of a character by weighting the importance of short and long contexts considering as well the occurrence of approximate repeats or palindromes in those contexts. The XM [7] is a statistical algorithm which combines, via a Bayesian average, the probability of an amino acid calculated on a local scale with the probability of that same residue being part of a duplicated region of the proteome.
Nonstatistical approaches, based on the Burrows-Wheeler transform (BWT) [9], have also been used for identifying overlapping and distant repeats in proteomes, and efficiently use them in compression. Even simpler models, that rely on a block code representation of the protein sequences [10], have proved to be successful in some cases.
All the algorithms discussed above highlight the existence and importance of redundancy in proteome sequences. Here we present a purely statistical study of 8 eukaryotic and prokaryotic proteomes. Firstly, we analyse the correlation function of the whole sequences and find evidence of medium range correlations, between amino acids located 100 residues apart. Then we calculate the amino acid correlations considering the protein boundaries and identify the role of the intra/interprotein scale in determining the medium range correlations. Furthermore, we generate groups of amino acids using their pair correlations at distance 100, which reveal the structural meaning of the medium range correlations. Using the results of proteome correlations, we propose a statistical model for the distribution of amino acids in 4 proteomes: Haemophilus influenzae (bacteria), Methanococcus jannaschii (archaea), Saccharomyces cerevisiae (eukarya) and Homo sapiens (eukarya), and we estimate their compression rate to compare our results against previous works.
The sources of nonrandomness studied fall into two scales: the medium range correlations between amino acids of the same and neighboring sequences, at distances of order 100, and the short range Markovian correlations between the contiguous residues up to distance 10. Previous studies [9] show that proteomes present repeated subsequences at very long distances (50–300). In this article, we do not consider these long-range correlations of the order of the proteome length. Protein length range correlations are in agreement with the process of sequence duplication, as it has been previously suggested for long-range correlations [9]; in addition to that, we show that they also contain information about the three-dimensional structure of the proteins. Short range correlations might instead relate to the local constraints on amino acid distribution due to secondary structure requirements.
For our statistical analysis, we used the proteomes of 4 prokaryotic and 4 eukaryotic organisms shown in Table 1. They were retrieved from the database of the Integr8 web portal [14], with the exception of the Hi, Mj, Sc, and Hs proteomes that were obtained from the protein corpus in [15], for the sake of comparison of our compression rate results with previous studies on the same proteomes. The proteomes are not complete (in particular the version of Hs in the protein corpus) but they represent a natural set of proteins where the redundancy has a biological meaning. It is important to remark that the sequence of the proteins in the proteome files of the Integr8 database is not the natural one. Those files are not useful for our analysis. Nevertheless, using the additional information available in the database, it is possible to order the proteins as they are found in the chromosomes. The proteome files of the protein corpus do not present this problem, but the sequence of the proteins is not available. Therefore, for the analysis shown in Table 2 and in Figure 2, we have used the version of Hi, Mj, Sc in the Integr8 database. For the same reason, the data for Hs is missing in Table 2 since the protein order is not obtainable at the Integr8 site.
Table 1. Proteome sequences.
Table 2. Intra- and interprotein correlation. Intraprotein correlation is always higher than interprotein correlation, and correlation between matching halves (−−) is higher than that of noncorresponding halves (equation M84).
Figure 2. Correlation function, at distance of k proteins, between amino acids belonging to corresponding (equation M82), and noncorresponding (equation M83) halves; S. cerevisiae proteome. Correlation between corresponding halves is higher, suggesting that structural requirements modulate ...
2.1. Correlations
As a first approximation to the general trends in residue distribution, we study the cooccurrence of amino acids. More precisely, we calculate the pair correlations at different distances, that is, the average number of times equal residues a appear at distance k along the whole sequence
equation M3
(1)
with
equation M4
(2)
where N is the sequence length, equation M5 is the characteristic function of finding residue a at position i, and equation M6 is the relative frequency of amino acid a in the proteome. According to this definition, a positive correlation means that, for a distance k, pairs of equal amino acids are more frequent than expected from their frequency in the proteome. The resulting correlation function for the 8 proteomes we studied (Figure 1) shows that eukaryotic sequences have stronger correlations than prokaryotic ones. Moreover, for all the proteomes, the correlation remains positive at a medium range, for values of k bigger than equation M8 or equation M9, depending on the proteome. We notice that the natural order of proteins in the proteomes, given by the succession of genes in the chromosomes, is relevant: when we randomly permute proteins, the medium range correlations are lost, both in eukaryotes and prokaryotes.
Figure 1. Correlation function for the 8 proteomes. Notice that the function remains positive for distances up to 1000 and that eukaryotic proteomes (continuous lines) tend to present higher values.
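A minimal sketch of a correlation function of this kind is given below. It assumes one natural reading of the definition: for each residue a, the fraction of position pairs (i, i + k) that both carry a, minus the square of the relative frequency of a, summed over all residues. The normalisation by the number of pairs and the sum over residues are assumptions of this sketch, and the function name is hypothetical.

```python
from collections import Counter

def same_residue_correlation(seq, k):
    """Excess frequency of equal residues at distance k, summed over residues."""
    n = len(seq)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(seq)")
    pairs = n - k
    # Relative frequency of each residue in the whole sequence.
    freq = {a: c / n for a, c in Counter(seq).items()}
    # Number of positions i where the residues at i and i + k coincide, per residue.
    matches = Counter(seq[i] for i in range(pairs) if seq[i] == seq[i + k])
    # Observed pair frequency minus the value expected for independent residues.
    return sum(matches[a] / pairs - f * f for a, f in freq.items())
```

With this convention a value of zero corresponds to independent residues, a positive value to an excess of equal residues at distance k, and a negative value to a deficit.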
The medium range correlations imply that, in proteomes, the amino acid distribution of neighboring proteins tends to be more similar than that of distant ones. This fact can be related to the process of duplication, recognised as the dominant force in the evolution of protein function [16]. As protein repeats have been related to duplication at different scales (genome, gene, or exon) [17], it is possible that the amino acid patterns responsible for the observed medium range correlation have the same evolutionary origin.
Due to the correlation definition used, the medium range correlations could be caused either by pairs of amino acids belonging to the same protein, or to different ones. Therefore, we split the nonlocal correlation into two groups and analyse them separately: interprotein correlations (between 2 contiguous proteins) and intraprotein correlations (inside the same protein sequence). In Table 2, we present the results for the intraprotein correlation between the two halves of the same protein and the interprotein correlation between corresponding and noncorresponding halves of two contiguous proteins: first half with first half (equation M10) and second half with first half (equation M11).
These correlations are defined as follows. Let equation M12 be the number of proteins, let equation M13 and equation M14 be the relative frequency of the residue a in the first and the second half of the ith protein, respectively, and let equation M15 be the corresponding mean value. We define
equation M16
(3)
for instance,
equation M17
(4)
We also define
equation M18
(5)
The intraprotein correlation is
equation M19
(6)
The two interprotein correlations are
equation M20
(7)
The correlation values in Table 2 have the same trend for all the proteomes: intraprotein correlation is always higher than interprotein correlation.
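The comparison between halves can be sketched as follows. For one residue, the code computes its relative frequency in the first and second half of every protein and then the covariance between those series: within the same protein (intraprotein) and between contiguous proteins (interprotein). The use of a plain covariance, the pairing of the second half of protein i with the first half of protein i + 1, and all function names are assumptions of this sketch rather than the exact quantities of equations (3) to (7).

```python
from statistics import mean

def half_frequencies(protein, residue):
    """Relative frequency of `residue` in the first and second half of one protein."""
    mid = len(protein) // 2
    first, second = protein[:mid], protein[mid:]
    return first.count(residue) / len(first), second.count(residue) / len(second)

def covariance(xs, ys):
    """Mean product of deviations from the mean (no variance normalisation)."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def half_correlations(proteins, residue):
    """Covariances between half-protein frequencies.

    `proteins` must be in chromosomal order and contain at least two
    sequences of length >= 2; shorter sequences are skipped.
    """
    halves = [half_frequencies(p, residue) for p in proteins if len(p) >= 2]
    first = [f for f, _ in halves]
    second = [s for _, s in halves]
    return {
        "intra": covariance(first, second),
        "inter_first_first": covariance(first[:-1], first[1:]),
        "inter_second_first": covariance(second[:-1], first[1:]),
    }
```

Summing or averaging these values over the 20 residues would give one number per proteome for each of the three comparisons.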
The correlations defined by means of equation M21 are different from the traditional correlation equation M22 which is the correlation of the symbol a at distance k, where k is the number of residues: we have calculated the correlation function of the frequencies of the amino acids at the distance of one protein. In Figure 2, we also analyse how the interprotein correlations between matching and nonmatching protein halves vary with the number k of proteins separating the two halves. We compare
equation M23
(8)
As an extension of the results in Table 2, we find that the correlation between matching halves is kept higher than that of noncorresponding halves along the proteome. Analogous results to Table 2 and Figure 2 hold for second-second and first-second halves.
Gene duplication can explain both the existence and order dependence of interprotein correlation, but it is not enough to justify why intraprotein correlations remain high, because high interprotein correlations can also appear in a low intraprotein correlations context. Indeed, the presence of intraprotein correlations indicates a nonrandom distribution of amino acids at a protein length scale. This nonrandomness can be related to segmental duplication, that is, duplication of segments inside the same protein; likewise, it can reflect the maintenance of amino acid patterns during the protein divergence that follows gene duplication as a consequence of the structural constraints imposed upon protein sequences.
As an example, extensive searches of protein databases [18] reveal the high frequency of tandemly repeated sequences of approximately 50 amino acids, ARM and HEAT, in eukaryotic proteins. Moreover, those repeats present a core of strongly conserved hydrophobic residues even when the other residues start to differ at several other positions.
The evidence obtained from the correlation analysis does not allow us to clarify the nature of the structural constraints measured: do they reflect the modular repetition of secondary structure elements, caused by duplication, or do they perhaps depend on the conservation of higher order tertiary structure units like domains? We try to address this question by defining amino acid groups as explained in the next section.
2.2. Grouping of amino acids
In a previous study [4], the complexity of large sets of nonredundant protein sequences was measured using a reduced alphabet approximation, that is, using groups of amino acids defined by an a priori classification. The Shannon entropy was then estimated from the entropies of the blocks of n-characters. The authors did not find enough evidence to support the existence of short range correlations between the amino acids of protein sequences.
Conversely, given the above evidence of medium range correlations in proteome sequences, we build groups of correlated amino acids using the correlations between the 20 amino acids. We calculate equation M26, the correlation between all amino acid pairs equation M27 at distances k, in the same way we calculate equation M28 in the previous section:
equation M29
(9)
A quick look at the resulting 20 × 20 matrix for equation M31 (Figure 3), which presumably includes both intraprotein and interprotein correlation, shows that the signs of the matrix elements, and thus the positive and negative correlations, are not distributed randomly among residues but, instead, in a grouped fashion: some amino acids present positive or negative correlations with the same subset of residues.
Figure 3. Correlation between the 20 amino acids for Hi. Positive (black) and negative (grey) correlations determine amino acid groups.
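A matrix of this kind can be sketched as below, assuming that the correlation between residues a and b at distance k is the observed frequency of finding a at position i and b at position i + k, minus the product of their relative frequencies. The normalisation and the function name are assumptions of this sketch.

```python
from collections import Counter

def pair_correlation_matrix(seq, k, alphabet):
    """Map each ordered pair (a, b) to its correlation at distance k."""
    n = len(seq)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(seq)")
    pairs = n - k
    counts = Counter(seq)
    # Number of positions i with residue a at i and residue b at i + k.
    joint = Counter((seq[i], seq[i + k]) for i in range(pairs))
    return {
        (a, b): joint[(a, b)] / pairs - (counts[a] / n) * (counts[b] / n)
        for a in alphabet
        for b in alphabet
    }
```

The sign pattern described above would then be read directly from the values: residues whose rows are positive against the same subset of columns are candidates for the same group.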
Then, we construct groups of amino acids in such a way that they maximise the positive medium range correlation; in practical terms it means that amino acids which are more likely to appear at distances of order 100 would be grouped together.
For a given partition of the set of amino acids in equation M32 groups, we calculate the sum of the correlation function between any pair of residues equation M33 belonging to a same group. More precisely, groups are obtained by maximising the following quantity:
equation M34
(10)
which is a function of a partition G of the amino acids in equation M36 disjoint sets equation M37. Due to the huge number of possible choices for the groups, we maximise this value using a simulated annealing algorithm. This is a Monte Carlo algorithm used for optimisation [19]. For a given partition G, we construct a new partition G′ by choosing a residue at random and changing its group. If F(G′) > F(G), the algorithm accepts the new partition. Iterating this procedure, we would reach a local maximum, which may not be the absolute maximum. In order to avoid being trapped in a local maximum, the algorithm accepts, with a small probability P, a new partition G′ for which F(G′) < F(G). The value of this probability P slowly decreases to zero as the number of iterations increases, in such a way that the convergence of the algorithm to the absolute maximum of F is guaranteed.
The number and the structure of the groups chosen have the highest value of F and represent a balanced partition of the 20 amino acids, that is, groups with only one element are not accepted.
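The search over partitions can be sketched as follows. The score is taken to be the sum of the pair correlations over all ordered pairs of residues that share a group, and a worse partition is accepted with the Metropolis probability exp(delta / T) under a geometric cooling schedule. The acceptance rule, the cooling schedule, the parameter names, and the toy correlation values are all assumptions of this sketch, and the rule that rejects single-element groups is not implemented.

```python
import math
import random

def partition_score(groups, corr):
    """Sum of corr[(a, b)] over ordered pairs of residues sharing a group."""
    return sum(c for (a, b), c in corr.items() if groups[a] == groups[b])

def anneal_partition(residues, corr, n_groups, steps=20000,
                     t_start=1.0, t_end=1e-3, seed=0):
    """Search for a high-scoring partition; returns (best_groups, best_score)."""
    rng = random.Random(seed)
    groups = {a: rng.randrange(n_groups) for a in residues}
    score = partition_score(groups, corr)
    best, best_score = dict(groups), score
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end (an assumption).
        temperature = t_start * (t_end / t_start) ** (step / steps)
        residue = rng.choice(residues)
        old_group = groups[residue]
        groups[residue] = rng.randrange(n_groups)
        new_score = partition_score(groups, corr)
        delta = new_score - score
        # Always accept improvements; accept losses with a shrinking probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            score = new_score
            if score > best_score:
                best, best_score = dict(groups), score
        else:
            groups[residue] = old_group
    return best, best_score

# Hypothetical toy correlations: A-B and C-D attract, every other pair repels.
toy_residues = ["A", "B", "C", "D"]
toy_friends = {("A", "B"), ("B", "A"), ("C", "D"), ("D", "C")}
toy_corr = {
    (a, b): 0.0 if a == b else (1.0 if (a, b) in toy_friends else -1.0)
    for a in toy_residues for b in toy_residues
}
toy_best, toy_score = anneal_partition(toy_residues, toy_corr, n_groups=2, steps=2000)
```

Diagonal terms (a paired with itself) always fall in the same group, so including them only shifts the score by a constant and does not change which partition is best.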
The idea behind our grouping scheme is to simplify the amino acid pattern mining by taking advantage of their synonymous relationships. It is well known that mutations between amino acids sharing geometrical and/or physico-chemical properties are the basis of neutral evolution at a molecular level [20]; this fact also explains why there is not a one-to-one relationship between protein sequences and structures [21]. Moreover, structurally neighboring residues have been found to distribute differentially (proximally/distally) in the protein sequences, depending on their physico-chemical properties [22].
Indeed, the groups defined from the pair correlations at a medium range (Table 3) almost correspond with the natural classification based on their physico-chemical properties: hydrophobic, polar, charged, small, and ambiguous. In particular, the fact that hydrophobic amino acids group together allows us to think that the correlation function is gathering some of the three-dimensional information contained in the protein sequence, more precisely tertiary structure information, as hydrophobic interactions are considered the driving forces of the protein folding process [23].
Table 3. Groups of amino acids determined by maximisation of the positive medium range correlation. Amino acids that are more likely to appear at 200 residues distance are grouped together.
Therefore, the reason why intraprotein correlations remain high is related not only to the repetition of secondary structure units, but also to the conservation of the amino acids responsible for the protein tertiary structure.
Besides this, it is important to notice that, even if the amino acid usage in eukaryotes and prokaryotes is very similar [24], the amino acid correlations are not, as they collect part of the structural information contained in the sequences. The number of groups is also different: 3 for H. influenzae and M. jannaschii, 4 for S. cerevisiae and H. sapiens. This could indicate a higher interchangeability of residues in some proteomes, but further analysis is needed to confirm this hypothesis.
2.3. Sequence entropy estimation
In order to quantify the capability that a statistical model has to identify the nonrandomness of a sequence, one can use it to construct an arithmetic coding compressor [25]. We estimate the compression rate of such a compressor with the sequence entropy
equation M43
(11)
using the model to calculate the probability equation M44 of character equation M45 at position i. The better the model, the lower the estimated value of the sequence entropy. We construct three models to estimate the probability of each character, considering the previous ones and taking into account both short and medium range correlations. For each model, we find parameters that minimise the sequence entropy. The equation M46 value obtained is taken as an estimate of the compression rate of a running arithmetic codification [25] of the proteomes and is used to compare our results with other compression algorithms (Table 4).
Table 4. Compression rate in bits per character for the studied proteomes. One-character entropy is the entropy of the sequences considering that their residues are independently distributed.
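The entropy estimate can be sketched as the average code length, in bits per character, that an ideal arithmetic coder would spend when driven by a predictive model: minus the base-2 logarithm of the probability the model gave to each character that actually occurred, averaged over the sequence. The predictor interface and the adaptive order-0 model with add-one smoothing below are illustrative assumptions and are not the models evaluated in Table 4.

```python
import math
from collections import Counter

def sequence_entropy(seq, predict):
    """Average of -log2 p(actual character) under an adaptive predictor.

    `predict(seq, i)` must return a dict of probabilities for position i
    that depends only on seq[:i].
    """
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        total_bits -= math.log2(predict(seq, i)[symbol])
    return total_bits / len(seq)

def adaptive_order0(alphabet):
    """Order-0 model: smoothed frequencies of the characters seen so far."""
    def predict(seq, i):
        counts = Counter(seq[:i])  # quadratic overall; fine for a sketch
        denominator = i + len(alphabet)
        return {a: (counts[a] + 1) / denominator for a in alphabet}
    return predict

def uniform_model(alphabet):
    """Baseline that ignores the history entirely."""
    def predict(seq, i):
        return {a: 1 / len(alphabet) for a in alphabet}
    return predict
```

A better model concentrates probability on the characters that really occur, so it yields a lower value; the uniform baseline always returns log2 of the alphabet size.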
Previous works on protein sequence compression like [5] are based on short range Markovian models. In those models, the probability of each amino acid is calculated as a function of the context in which it appears, considering the frequency with which this amino acid happens to be after the equation M47 previous residues.
Following this idea, we start our statistical description of proteome sequences taking into account the information given by the neighboring residues using a variation of the interpolated Markov models [26]. In order to predict the probability of the ith character, we consider the contexts up to a length equation M48 (number of contexts) that precede it, that is, the substrings equation M49 for equation M50. For any character a, we count the number equation M51 of previous occurrences of the substring equation M52. The conditional frequency of finding character equation M53 after the context equation M54 is obtained dividing by the sum over all amino acids b at position i:
equation M55
(12)
Our model 1 predicts the probability of character a at position i with
equation M56
(13)
We remark that the main difference between our short range approach and CTW is that we give a weight to the different contexts, while in [6] a weight is given to their corresponding conditional probabilities. We find that the most informative positions were the previous 8; this length is in qualitative agreement with the results found in [6]. Model 1 in Table 4 indicates the results obtained considering only the short range correlations for equation M57.
The model depends on the parameters equation M58 that are optimised, using standard algorithms for minimisation, in order to achieve the best estimate of the compression rate. This "entropy minimisation" stage is very time-consuming. In a real compression procedure, those parameters should be specified and therefore would contribute to the estimated entropy. In our case this contribution is negligible.
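One possible reading of weighting the contexts, rather than their conditional probabilities, is sketched below: every context length k contributes the raw number of times each residue followed that context earlier in the sequence, multiplied by a weight, and the weighted counts are normalised at the end. The small floor added to every residue, the naive rescanning of the history, and the names used are assumptions of this sketch.

```python
def context_mixture_probabilities(seq, i, weights, alphabet, floor=1e-3):
    """Probabilities for position i from weighted context counts.

    weights[k] is the weight of the context made of the k characters
    preceding position i (k = 0 is the empty context). Every character
    of `seq` is assumed to belong to `alphabet`.
    """
    scores = {a: floor for a in alphabet}
    for k, weight in enumerate(weights):
        if k > i:
            break  # not enough history for a context of this length
        context = seq[i - k:i]
        # Count what followed earlier occurrences of the same context.
        for j in range(k, i):
            if seq[j - k:j] == context:
                scores[seq[j]] += weight
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}
```

Because longer contexts match less often, their raw counts are smaller, so in this form the weights also compensate for how rarely each context length is observed. A call such as `context_mixture_probabilities(seq, i, [0.2, 1.0, 3.0], alphabet)` would use contexts of length 0, 1 and 2 with hypothetical weights.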
The short range correlations support the existence of periodic patterns in protein sequences. They can be caused by the alternation of alpha-beta secondary structure units, as argued in other works on latent periodicity of protein sequences [27, 28]. From the point of view of protein sequence evolution, the short range parameters can also reflect the existence of constraints on the distribution of residues. Protein sequences are modified by mutation, but still have to cope with folding requirements that determine a nonrandom positioning of key residues, depending on their geometrical and physico-chemical properties. In fact, structural alphabets derived from hidden Markov models denote that local conformations of protein structures have different sequence specificity [29].
The intra/interprotein correlations identified in previous sections suggest that the frequencies of the single residues have nonnegligible fluctuations on the medium range. We take into account these fluctuations in our second model (model 2 in Table 4):
equation M59
(14)
Here we added
equation M60
(15)
This quantity is proportional to the frequency of the amino acid a in the subsequence of length L, with L a distance of medium scale, starting from the position equation M61. The factor equation M62 guarantees that equation M63, so that it increases with i in the same way as the other terms of the sum (e.g., equation M64). The parameter equation M65 is optimised as equation M66. The optimal values for equation M67 found during the entropy minimisation stage are equation M68 for Hi, equation M69 for Mj, equation M70 for Sc, and equation M71 for Hs.
Finally, in model 3, we use the groups found in Section 2.2 (see Table 3). In particular, a contribution to the probability of a given residue is obtained by computing the probability that the residue belongs to a certain group and then the conditional probability of the residue once the group is given:
equation M72
(16)
where equation M73 is the group of a, equation M74 is the relative frequency of a in its group, as measured up to the position equation M75, and
equation M76
(17)
For this model, the optimal values of the parameter L are equation M77 for Hi, equation M78 for Mj, equation M79 for Sc, and equation M80 for Hs.
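The factorisation behind this group term can be sketched as the product of two frequencies: how often the group of the residue appeared in the last L positions, and how often the residue appeared among all earlier members of its group. Taking the group frequency from a window of length L and the within-group frequency from the whole history is an assumption of this sketch, as are the function and parameter names.

```python
def group_factored_probability(seq, i, window, group_of, residue):
    """P(group of residue in the recent window) * P(residue | its group so far)."""
    recent = seq[max(0, i - window):i]
    history = seq[:i]
    group = group_of[residue]
    in_group_recent = sum(1 for c in recent if group_of[c] == group)
    in_group_history = sum(1 for c in history if group_of[c] == group)
    if not recent or in_group_history == 0:
        return 0.0  # no evidence yet; a real model would fall back to other terms
    p_group = in_group_recent / len(recent)
    p_residue_given_group = history.count(residue) / in_group_history
    return p_group * p_residue_given_group
```

When every group has been seen at least once in the history, these values sum to one over the residues, so the term can be mixed with the short range terms like any other probability estimate.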
As one can see in Table 4, the capability of our statistical model to represent the nonrandom information contained in proteomes is comparable to those models that consider repeated amino acid patterns at both short and medium scale [6, 7].
The improvement in the performance of models 2 and 3 is due to the fact that they identify the short range correlations and separate them from the fluctuations of amino acid frequencies at a protein length range. This demonstrates that both correlation types are informative and that the statistical significance of repetitions at those scales is enough to model the amino acid probabilities.
The compression rate achieved when the medium range correlations are modelled with the frequency of amino acid groups (model 3) is almost equivalent to the compression rate of model 2. From a biological perspective, this indicates that groups of amino acids are meaningful, and that the redundant information at medium scale has a structural component that might come from the three-dimensional structure constraints.
According to our results, there is an important difference in the compressibility rates of the eukaryotic and prokaryotic proteomes which is in agreement with the correlation function in Figure 1. The sequences of S. cerevisiae and H. sapiens are more redundant, and thus more compressible, than those of H. influenzae and M. jannaschii; correspondingly, the correlation functions of Sc and Hs remain positive for longer distances than Hi and Mj. This additional redundancy could be related to the presence, in eukaryotic proteomes, of paralogous proteins with very similar distribution of synonymous amino acids, but different function. There is evidence suggesting that paralogous genes have been recruited during evolution of different metabolic pathways and are related to the organism's adaptability to environmental changes [16]. On the other hand, the lower compressibility of the Hi and Mj proteomes is in agreement with the reduction of prokaryotic genome size as an adaptation to fast metabolic rates [30, 31].
In this article, we show that the correlation function gathers evolutionary and structural information of proteomes. Even if proteins are highly complex sequences, at a proteome scale, it is possible to identify correlations between characters at short and medium ranges. This confirms that protein sequences are not completely random; indeed, they present repeated amino acid patterns at those two scales. The alternation of secondary structure units can determine the local redundancy. This was already known and generally modelled using Markov models. In our opinion, sequence duplication is a reasonable explanation for the interprotein correlation. However, it does not account for the intraprotein correlations; these can instead be related to the maintenance of the amino acid patterns responsible for the three-dimensional structure, as the segregation between hydrophobic and polar amino acids indicates. More precisely, the sampling of the space of structures during proteome evolution is determined by the duplication processes but it is highly constrained by the structural and functional requirements that protein sequences have to meet inside a living system.
Prokaryotic proteomes show lower correlation values, especially for distances under equation M81 residues, and a smaller compressibility than eukaryotic proteomes. These characteristics point at a higher redundancy of eukaryotic proteome sequences, and suggest that the increase of proteome size does not imply de novo generation of protein sequences, with completely different amino acid distribution.
ACKNOWLEDGMENTS
The authors would like to thank Toby Gibson for reading and commenting on the manuscript and the reviewers for their constructive criticism that helped to improve the quality of the paper.
1. Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Computers & Chemistry. 1994;18(3):269–285. [PubMed]
2. Blaisdell BE. A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences. Journal of Molecular Evolution. 1983;19(2):122–133. [PubMed]
3. Almirantis Y, Provata A. An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome. BioEssays. 2001;23(7):647–656. [PubMed]
4. Weiss O, Jiménez-Montaño MA, Herzel H. Information content of protein sequences. Journal of Theoretical Biology. 2000;206(3):379–386. [PubMed]
5. Nevill-Manning CG, Witten IH. Protein is incompressible. Proceedings of the Data Compression Conference (DCC '99); 1999 March; Snowbird, Utah, USA. pp. 257–266.
6. Matsumoto T, Sadakane K, Imai H. Biological sequence compression algorithms. Genome Informatics. 2000;11:43–52. [PubMed]
7. Cao MD, Dix TI, Allison L, Mears C. A simple statistical algorithm for biological sequence compression. Proceedings of the Data Compression Conference (DCC '07); 2007 March; Snowbird, Utah, USA. pp. 43–52.
8. Hategan A, Tabus I. Protein is compressible. Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG '04); 2004 June; Espoo, Finland. pp. 192–195.
9. Adjeroh D, Nan F. On compressibility of protein sequences. Proceedings of the Data Compression Conference (DCC '06); 2006 March; Snowbird, Utah, USA. pp. 422–434.
10. Sampath G. A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae. Proceedings of the IEEE Bioinformatics Conference (CSB '03); 2003 August; Stanford, Calif, USA. pp. 287–293.
11. Shannon CE. A mathematical theory of communication. Bell System Technical Journal. 1948;27:379–423 and 623–656.
12. Cleary J, Witten I. Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications. 1984;32(4):396–402.
13. Willems FMJ, Shtarkov YM, Tjalkens TJ. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory. 1995;41(3):653–664.
14. Integr8 web portal 2006. ftp://ftp.ebi.ac.uk/pub/databases/integr8/.
15. Abel J. The data compression resource on the internet. 2005. http://www.datacompression.info/.
16. Orengo CA, Thornton JM. Protein families and their evolution—a structural perspective. Annual Review of Biochemistry. 2005;74:867–900.
17. Heringa J. The evolution and recognition of protein sequence repeats. Computers & Chemistry. 1994;18(3):233–243. [PubMed]
18. Andrade MA, Petosa C, O'Donoghue SI, Müller CW, Bork P. Comparison of ARM and HEAT protein repeats. Journal of Molecular Biology. 2001;309(1):1–18. [PubMed]
19. Kirkpatrick S, Gelatt CD, Jr., Vecchi MP. Optimization by simulated annealing. Science. 1983;220(4598):671–680. [PubMed]
20. Mirny LA, Shakhnovich EI. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. Journal of Molecular Biology. 1999;291(1):177–196. [PubMed]
21. Huynen MA, Stadler PF, Fontana W. Smoothness within ruggedness: the role of neutrality in adaptation. Proceedings of the National Academy of Sciences of the United States of America. 1996;93(1):397–401. [PubMed]
22. Karlin S. Statistical signals in bioinformatics. Proceedings of the National Academy of Sciences of the United States of America. 2005;102(38):13355–13362. [PubMed]
23. Dill KA. Dominant forces in protein folding. Biochemistry. 1990;29(31):7133–7155. [PubMed]
24. Rost B. Did evolution leap to create the protein universe? Current Opinion in Structural Biology. 2002;12(3):409–416. [PubMed]
25. Rissanen J, Langdon GG, Jr. Arithmetic coding. IBM Journal of Research and Development. 1979;23(2):149–162.
26. Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Research. 1998;26(2):544–548. [PubMed]
27. Turutina VP, Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Identification of latent periodicity in amino acid sequences of protein families. Biochemistry (Moscow). 2006;71(1):18–31. [PubMed]
28. Korotkov EV, Korotkova MA. Enlarged similarity of nucleic acid sequences. DNA Research. 1996;3(3):157–164. [PubMed]
29. Camproux AC, Tufféry P. Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity. Biochimica et Biophysica Acta. 2005;1724(3):394–403. [PubMed]
30. Bentley SD, Parkhill J. Comparative genomic structure of prokaryotes. Annual Review of Genetics. 2004;38:771–791.
31. Raes J, Korbel JO, Lercher MJ, von Mering C, Bork P. Prediction of effective genome size in metagenomic samples. Genome Biology. 2007;8(1):R10. [PubMed]
