Comput Chem. 2000 Jan;24(1):71-94.

A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins.

Author information

Computational Biology Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. hwan@nih.gov

Abstract

Different local regions of natural amino acid or nucleotide sequences show remarkable heterogeneity in residue composition, reflecting diversity in evolutionary history and physicochemical constraints. Compositional complexity measures are helpful for describing and understanding this variegation. Motivated by some open problems in comparative genomics and protein folding, we have developed a new 'global' compositional complexity measure, G1, which overcomes a crucial limitation of earlier methods. The 'local' measures used in previous research resemble entropy functions and are inherently dependent on an underlying probability distribution. Local measures cannot rigorously compare complexity across sequences of substantially different size, because real sequences show very irregular heterogeneity and do not have the necessary ergodicity in scaling and asymptotic properties. G1 is a member of a new class of scale-independent, distribution-independent complexity functions. For a sequence S of length L on an N-letter alphabet, G1 is derived from ratios in the integer partition lattice, P(L,N), of L with N parts, where the elements of P(L,N) are the state vectors of S, (n1, n2, ..., nN), ranked by an order principle. We present theorems and proofs relating to the metric properties of G1 and its relationship to other state-vector-dependent compositional complexity functions, together with a fully efficient O(L) algorithm to compute G1. The distributions of G1 were calculated for the entire sets of translated proteins encoded by extensively sequenced genomes. The results establish the existence of a clear evolutionary principle, common to bacteria, archaea and eukaryotes, that the proteins encoded by more extreme AT-rich and GC-rich genomes have generally lower compositional complexity than those of more typical organisms.
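The abstract does not give the formula for G1, so the following is only an illustrative sketch of its input, not the authors' algorithm. It shows how the state vector (n1, n2, ..., nN) of a sequence can be obtained in a single O(L) pass, and how a conventional entropy-like measure is computed from it for contrast. The function names `state_vector` and `shannon_entropy`, the 20-letter default alphabet, and the choice of descending order for the ranked counts are all assumptions made for this example.

```python
from collections import Counter
import math

# Assumed alphabet: the 20 standard amino acids (N = 20).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def state_vector(seq, alphabet=AMINO_ACIDS):
    """Return the residue counts of seq, sorted in descending order.

    The counts sum to L = len(seq) and there is one entry per letter of
    the alphabet, so the result is a partition of L into N parts
    (zero parts allowed). Counting takes one pass over the sequence.
    """
    counts = Counter(seq)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"letters outside the alphabet: {sorted(unknown)}")
    return tuple(sorted((counts.get(a, 0) for a in alphabet), reverse=True))


def shannon_entropy(vector):
    """Entropy in bits of the composition described by a state vector.

    This is the familiar entropy-like measure, shown only for contrast;
    it is not G1.
    """
    total = sum(vector)
    return -sum((n / total) * math.log2(n / total) for n in vector if n > 0)


if __name__ == "__main__":
    vec = state_vector("AACG", alphabet="ACGT")
    print(vec)                   # counts ranked from most to least frequent
    print(shannon_entropy(vec))  # entropy of that composition, in bits
```

Because the state vector discards residue order, any function defined on it, including the entropy above, depends only on composition; how G1 ranks these vectors within the partition lattice is described in the paper itself.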

PMID: 10642881
[Indexed for MEDLINE]