Proc Natl Acad Sci U S A. Jan 8, 2002; 99(1): 333–338.
PMCID: PMC117561

Amino acid runs in eukaryotic proteomes and disease associations


We present a comparative proteome analysis of the five complete eukaryotic genomes (human, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Arabidopsis thaliana), focusing on individual and multiple amino acid runs, charge and hydrophobic runs. We found that human proteins with multiple long runs are often associated with diseases; these include long glutamine runs that induce neurological disorders, various cancers, categories of leukemias (mostly involving chromosomal translocations), and an abundance of Ca2+ and K+ channel proteins. Many human proteins with multiple runs function in development and/or transcription regulation and are Drosophila homeotic homologs. A large number of these proteins are expressed in the nervous system. More than 80% of Drosophila proteins with multiple runs seem to function in transcription regulation. The most frequent amino acid runs in Drosophila sequences occur for glutamine, alanine, and serine, whereas human sequences highlight glutamate, proline, and leucine. The most frequent runs in yeast are of serine, glutamine, and acidic residues. Compared with the other eukaryotic proteomes, amino acid runs are significantly more abundant in the fly. This finding might be interpreted in terms of innate differences in DNA-replication processes, repair mechanisms, DNA-modification systems, and mutational biases. There are striking differences in amino acid runs for glutamine, asparagine, and leucine among the five proteomes.

Several human inherited neurodegenerative diseases are triplet-repeat diseases associated with proteins containing long runs of glutamine (long CAG codon iterations; for reviews, see refs. 1 and 2). Disease severity seems to be correlated with the extent of iterations of the CAG codon above a threshold (3). Strikingly, many of the triplet-repeat disease proteins contain multiple long runs of amino acids other than glutamine. Listing all runs of lengths of at least five residues (and using the standard one-letter amino acid code), the huntingtin protein contains Q23, P11, P10, E5, E6; atrophin-1 (dentatorubral pallidoluysian atrophy, DRPLA) contains Q20, S7, S10, P6, H5; the androgen-receptor protein (Kennedy's disease) contains Q26, Q6, Q5, P8, A5, G24; and the brain-voltage-dependent calcium channel protein CCAA (spinocerebellar ataxia 6) contains H10 and Q11.

Consequences of hyperexpansion of DNA-triplet repeats might include altered rates of transcription or translation, mRNA instability, and aberrant DNA-hairpin structures (4, 5). Protein aggregation attributed to attachment of glutamine-rich proteins to unrelated molecules may lead to inappropriate multimerization or to formation of “polar zippers,” in which a long stretch of glutamine residues links strands by hydrogen bonds (6–8).

The foregoing examples motivate our comparative analysis of eukaryotic proteomes focusing on proteins containing multiple amino acid runs. The complete genomes investigated are those of the Human Genome Project tentative draft,§ Drosophila melanogaster (fly), Caenorhabditis elegans (worm), Saccharomyces cerevisiae (yeast), and Arabidopsis thaliana (weed). Many eukaryotic proteins with multiple amino acid runs show other unusual protein sequence properties, including anomalous charge distributions, high counts of amino acid multiplets, extended alternating basic and acidic charge residues, and periodic histidine patterns. In each genome, the number of acidic runs exceeds the number of basic runs by a factor of three or four, and acidic runs tend to be longer than basic runs. In human sequences, proteins with multiple long amino acid runs are often associated with diseases (see below). Fly proteins with multiple runs have predominantly developmental functions and/or transcriptional regulatory capacities, with the majority of them active in central or peripheral nervous system function and development.

Runs of Individual Amino Acids in the Five Eukaryotic Genomes.

For a “typical” protein of 400 residues and average composition, a run of an individual amino acid is statistically significant (at the 0.1% significance level) if it is five or more residues long (9). Table 1 displays the percentage of all proteins ≥200 residues long in the five eukaryotic genomes that have at least one amino acid run, along with the percentage of runs accounted for by each amino acid type. The percentage of proteins with at least one run ranges from 13% in worm and 15% in yeast to around 20% in human and weed and 27% in the fly. The residues A, S, and Q account for a significant proportion of the runs in each eukaryote; S runs range from 13.7% (human) to 33.4% (weed); A runs range from 4.7% (yeast) to 26.3% (fly), and Q runs range from 5.8% (weed) to 33.9% (fly). Amino acid runs emphasize small polar residues and the acidic residues E and D but avoid aliphatic, aromatic, arginine, and cysteine residues. Runs of the hydrophobic residues I, V, M and runs of the aromatics Y, F, and W are sparse in all five genomes; no human or fly protein has more than one run of each. However, leucine (L) runs occur in 19% of human sequences with a run, whereas in the other genomes, only 3.4–5.2% of proteins with runs have a run of L. In human sequences, ≈90% of L runs occur within 40 amino acids of the amino terminus of the protein, recognizable as part of a signal-peptide sequence. This proportion is much lower in the other eukaryotes, with only 0.4% of fly and weed proteins, 0.2% of worm, and 0.04% of yeast proteins having a similarly located L run. Enigmatically, runs of asparagine (N) are very infrequent in human proteins but are substantial in yeast and fly sequences. Specifically, only 0.06% of human proteins have an N run, compared with 1.0% among weed proteins and 2.7% among fly proteins. The dearth of N runs among human proteins applies to all mammalian species and contrasts sharply with N runs in invertebrate protein sequences (see ref. 10 and Discussion).
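As an illustration of the run criterion used here (an uninterrupted stretch of one amino acid, at least five residues long), the following minimal Python sketch scans a sequence and reports runs in the compact notation of the text (e.g., Q23 for 23 consecutive glutamines). The function names `find_runs` and `run_labels` and the toy sequence are assumptions made for this example; they are not taken from the original analysis.

```python
from itertools import groupby


def find_runs(seq, min_len=5):
    """Return (residue, start, length) for every maximal run of one residue
    that is at least min_len long. Start positions are 0-based."""
    runs = []
    pos = 0
    for residue, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((residue, pos, length))
        pos += length
    return runs


def run_labels(seq, min_len=5):
    """Format runs in the compact notation used in the text, e.g. 'Q23'."""
    return [f"{residue}{length}" for residue, _, length in find_runs(seq, min_len)]


# Toy sequence (hypothetical, not a real protein): the E4 stretch is too
# short to count, the later E6 stretch qualifies.
toy = "MA" + "Q" * 23 + "LT" + "P" * 11 + "GS" + "EEEE" + "K" + "E" * 6
print(run_labels(toy))
```

Because `groupby` yields maximal stretches of identical letters, a run is never reported twice and overlapping sub-runs are not counted.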

Table 1
Frequency of runs of amino acids among eukaryotic proteins

The data indicate that human proteins are more abundant in specific charged amino acid runs than fly proteins. R runs account for 3.2% of runs in human, 2.1% in fly, and 1.6–2.2% in the other genomes. The corresponding figures for K runs are 6.2% in human, 3.0% in fly and 3–7.2% in the others, with K runs being more frequent than R runs in each proteome. This finding may reflect the relative A + T nucleotide richness of the yeast, weed, and worm genomes. Genes in human and Drosophila favor strong amino acid types (G + C-rich codons: alanine, proline, and glycine), whereas yeast, weed, and worm genomes favor A + T rich codons. Glycine runs occur in only 1.0% of yeast proteins with runs, in 11.0% of weed, and in 19.5% of fly. Proline runs account for 18.2% of runs in human proteins and 14% in fly and are least common in the yeast genome (about 6%). The proportion of histidine runs among fly proteins is about 6.1%, exceeding by a factor of two the percent of H runs in the other genomes. Strikingly, the proportion of D runs in yeast proteins (14%) is higher than in the other genomes (4–8%); E runs compose about 20% of runs in human (the highest percentage of all amino acid types), 11% in worm proteins, 14% in yeast, 16% in weed, but only 7% in fly.

Multiple Amino Acid Runs.

A protein has multiple amino acid runs if it has one or more runs, each ≥5 residues, with aggregate length ≥20 amino acids. With this definition, a random sequence of 1,000 residues has only a 0.1% chance of containing multiple long runs (9). As shown in Table 1, the proportion of fly proteins with multiple runs (7.2%) is dramatically higher than for human (1.9%), yeast (1.7%), worm (1.1%), or weed (1.0%). Multiple runs differ most for A, G, H, S, and Q. At least one A run is found in 47% of fly sequences with multiple runs, and S runs occur among 37.4% of these proteins. Of human proteins, 43.9% have multiple runs of P, 35.4% have a run of A, 33.3% of G, 28.8% of S, 26.8% of E, 22.2% of Q, and 10.6% of H (data not shown). All are lower than the corresponding assessments in the fly genome.
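The multiple-runs criterion stated above (runs of ≥5 residues each, with aggregate length ≥20) can be sketched as follows. This is a minimal illustration under the definition as worded in the text; the names `run_lengths` and `has_multiple_runs` are hypothetical, and the test sequences are invented.

```python
from itertools import groupby


def run_lengths(seq, min_len=5):
    """Return (residue, length) for each maximal single-residue run >= min_len."""
    lengths = ((residue, sum(1 for _ in group)) for residue, group in groupby(seq))
    return [(residue, n) for residue, n in lengths if n >= min_len]


def has_multiple_runs(seq, min_len=5, min_total=20):
    """Apply the criterion from the text: qualifying runs (each >= min_len)
    must sum to an aggregate length of at least min_total residues."""
    total = sum(n for _, n in run_lengths(seq, min_len))
    return total >= min_total


# Three runs of five residues give an aggregate of 15, below the threshold.
three_short = "A" * 5 + "G" + "S" * 5 + "T" + "P" * 5
# Adding a fourth run of five reaches the aggregate of 20.
four_short = three_short + "K" + "Q" * 5

print(has_multiple_runs(three_short), has_multiple_runs(four_short))
```

Note that, as the definition is worded, a single run of 20 or more residues also satisfies the criterion.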

Except for A and L, hydrophobic runs almost never occur in proteins with multiple runs. Human and fly sequences containing multiple runs are comparable for A (about 40%), G (30–33%), D (6–8%), H (9–14%), S (28–38%), and R (3–4%) but differ significantly with respect to E (human 27%, fly 8%), K (human 7%, fly 3%), L (human 7%, fly 1%), P (human 44%, fly 23%), T (human 3%, fly 16%), N (human 1.0%, fly 22%), and Q (human 22%, fly 70%). Intriguingly, of the five proteomes, human has the lowest percentage of Q runs in proteins with multiple runs.

Hydrophobic and Charge Runs.

A useful concept applicable to all sequence statistics is the grouping of letters in one alphabet to form natural new alphabets. In this context, amino acids can be classified according to structural, chemical, charge, hydrophobicity, physical and/or kinetic properties, and associations with secondary structure. For example, the Lehninger functional alphabet is based on four amino acid categories: acidic (−), represented by the amino acids D or E; basic (+), represented by K or R; polar uncharged, (p) = (GHNPQSTY); and hydrophobic (h) = (IVLMFACW). This reduced alphabet (+, −, p, and h) requires a longer run (n ≥ 6 instead of n ≥ 5) to achieve the same significance level. In terms of this alphabet, Table 2 reports numbers, percentages, and lengths of charged and hydrophobic runs in the five complete eukaryotic genomes; the distributions are rather similar.
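The reduced alphabet described above can be illustrated with a short Python sketch that maps each residue to its class and then reports class runs of at least six residues. The class membership follows the text; the function names and the `?` placeholder for nonstandard residues are assumptions made for this example.

```python
from itertools import groupby

# Four-letter functional alphabet as listed in the text.
CLASSES = {"-": "DE", "+": "KR", "p": "GHNPQSTY", "h": "IVLMFACW"}
TO_CLASS = {aa: symbol for symbol, members in CLASSES.items() for aa in members}


def reduce_sequence(seq):
    """Translate a protein sequence into the (+, -, p, h) alphabet.
    Residues outside the 20 standard amino acids become '?' (an assumption)."""
    return "".join(TO_CLASS.get(aa, "?") for aa in seq)


def class_runs(seq, min_len=6):
    """Return (class_symbol, start, length) for runs of one class >= min_len."""
    runs = []
    pos = 0
    for symbol, group in groupby(reduce_sequence(seq)):
        length = sum(1 for _ in group)
        if length >= min_len and symbol != "?":
            runs.append((symbol, pos, length))
        pos += length
    return runs


# A mixed D/E stretch counts as one acidic run even though no single
# residue is repeated five times in a row.
print(class_runs("MDEDEDEEK"))
```

The example shows why the reduced alphabet finds runs that the single-residue scan misses: a mixed D/E stretch is one acidic run.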

Table 2
Frequency and length distribution of charge, noncharged polar, and hydrophobic runs (≥6 long) in eukaryotic proteins

The percentage of acidic runs among the five complete genomes is of the same order, ranging from 5.9 to 7.2% (except for worm, 4.4%). The percentages of basic runs range from 1.2 to 2.2%. The worm has no basic runs exceeding 10 residues in length. The disparity between basic and acidic runs in protein sequences is pronounced, the latter being far more numerous and longer. The longest uninterrupted acidic run is 38 residues in human, 33 residues in fly, 31 residues in worm, 56 residues in yeast and 41 in weed. In contrast, the longest uninterrupted basic runs are of length 14 in human, 12 in fly, 14 in yeast, and 10 in worm. For each eukaryotic genome, the percent of acidic runs exceeds the percent of basic runs by a factor of 3 or 4. By contrast, bacterial genomes have roughly equal proportions of acidic and basic runs (data not shown).

For most extended charge runs, there is considerable variation in codon usage, which argues for an essential function for these charge runs. For example, in Drosophila, the codon counts for the positive run R9 in sevenless are (CGC)4, (CGG)1, (AGA)2, and (AGG)2. Large variation at codon site three also is observed for long acidic runs in the cut protein (fly), Rad6p (yeast ubiquitin-protein ligase), Cenp-B (human major centromere autoantigen B), and others. Variable codon usages suggest that the longer runs are likely not generated entirely by strand slippage.
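The codon-usage argument can be made concrete with a small tally. The sketch below counts the codons underlying a run, using the codon counts quoted above for the R9 run in sevenless. The order of codons within the run is not given in the text, so the DNA string is a hypothetical arrangement, and `codon_counts` is a name introduced only for this example.

```python
from collections import Counter


def codon_counts(cds, run_start, run_len):
    """Count the codons encoding residues run_start .. run_start+run_len-1
    (0-based residue indices) of an in-frame coding sequence."""
    codons = [cds[3 * i:3 * i + 3] for i in range(run_start, run_start + run_len)]
    return Counter(codons)


# Hypothetical ordering of the codons reported for the R9 run:
# (CGC)4, (CGG)1, (AGA)2, (AGG)2. Only the counts come from the text.
r9_dna = "CGC" * 4 + "CGG" + "AGA" * 2 + "AGG" * 2

counts = codon_counts(r9_dna, 0, 9)
distinct = len(counts)
top_fraction = max(counts.values()) / sum(counts.values())
print(counts, distinct, round(top_fraction, 2))
```

A run generated purely by slippage of one triplet would be expected to show a single codon; four distinct codons, with the most common covering under half of the run, is the kind of variation the text refers to.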

We found acidic runs exceeding 10 residues in 134 sequences of human, 86 of fly, 75 of worm, 63 of yeast, and in 149 of the mustard weed plant (data not shown). The number of proteins with basic runs exceeding 10 residues is far smaller: 2 in human, 2 in fly, 0 in worm, 1 in yeast, and 8 in weed. Paradoxically, on average, proteins show anionic frequencies in ≈11.5–12.0% and cationic frequencies in ≈11.0–11.5%, yet the numbers of proteins with long (at least six residues in length) acidic runs well exceed the numbers of long basic runs. The longest hydrophobic residue run in the eukaryotic genomes under study is in the range 20–27 amino acids, whereas noncharged polar runs can extend beyond 60 amino acids in length (Table 2). Hydrophobic long runs appear frequently as helical transmembrane segments, but these are generally confined to 17–25 amino acids in length.

Multiple Amino Acid Runs and Disease Associations.

There are 192 human protein sequences (of the 10,651 ≥ 200 amino acids long) that have multiple amino acid runs (see Table 4, which is published as supporting information on the PNAS website, www.pnas.org). More than 40% of these proteins are associated with diseases [as identified in Online Mendelian Inheritance in Man (OMIM), which can be found at http://www.ncbi.nlm.nih.gov/Omim/], including: triplet-repeat proteins with long glutamine runs that underlie certain neurodegenerative disorders (Table 3); 14 cancer-related proteins [e.g., adenomatosis polyposis coli, breast carcinoma-associated antigen (BCAA), and matrix metalloproteinase 24 (MMP24)]; 10 leukemia-related proteins often resulting from chromosomal translocations (e.g., anaplastic lymphoma kinase Ki-1, myeloid/lymphoid mixed-lineage leukemia 2, meningioma 1); 14 channel proteins, mainly voltage-gated Ca2+ and K+ channel proteins (Table 3; see also ref. 11); 6 proteases including sperm trypsin-like acrosin, calpain 4, and some metalloproteinases (see also ref. 12); 7 kinases; and a variety of disease syndrome-related proteins (e.g., Wiskott-Aldrich syndrome, cat-eye syndrome, and cleidocranial dysplasia). A key aspect of 82 of the 192 human protein sequences is their role in transcription, translation, and development regulation. Many of these proteins are homeotic homologs of Drosophila developmental sequences and transcription factors, including forkhead, frizzled, engrailed, distal-less, timeless, diaphanous 1–3, pumilio, trithorax, runt-related and caudal. Other examples include isoforms of E2F transcription factor, neuregulin, several translation–initiation factors, numerous homeobox genes, global transcription factors such as GATA-binding proteins 4 and 6, POU domain class proteins 3 and 4, and various nuclear-receptor coactivators and corepressors. There are two major immune-system proteins among the 192 with multiple runs: immunoglobulin superfamily member 4 (IGSF4) and HLA-B associated transcript 2 (D6S51E).

Table 3
Multiple runs in human triplet-repeat (polyglutamine) disease proteins, transport channel proteins, and cancer-related proteins

In marked contrast, no metabolic enzymes (e.g., glycolysis, tricarboxylic acid cycle, pentose phosphate pathway), structural proteins (e.g., actin, myosin, and troponin 1), or housekeeping proteins contain multiple runs. However, several structural–regulatory proteins do have multiple runs, including ankyrin 3, nucleolin, SMARCA2 (actin dependent regulator of chromatin), and synapsin II, which coats synaptic vesicles and may function in the regulation of neurotransmitter release. Major chaperone and degradation proteins, including heat shock protein 70 (Hsp70), Tcp1, and subunits of the proteasome also lack multiple runs. Hsp70, which modulates protein folding and some transport and secretion activities, can counteract the toxic effects of aggregations caused by extended glutamine iterations (13–15). The DNA-repair protein repertoire (e.g., Rad51 and -54, Dmc1, uracil glycosylase, ERCC) does not carry multiple runs. Calcium and potassium channel proteins stand out with multiple runs, but transporters of Cu2+, Fe2+, Mn2+, and Zn2+ do not have multiple runs. The hyperpolarization-activated cyclic nucleotide-gated potassium channel 2 (HCN2) is expressed in the heart ventricle and atrium and functions in cardiac pacemaking (16); KCNA4 (CIK4) shaker-related channel protein mediates the voltage-dependent potassium ion permeability of excitable membranes; KCNMA1 (slowpoke Drosophila homolog) is a calcium-activated potassium-channel gene exhibiting many alternative splicings; the small conductance calcium-activated potassium channel KCNN3 is voltage-independent and lacks the transmembrane S4 motif (+,ϕ,ϕ)4–6 of positive-charge residues separated by two hydrophobic residues. Among the Ca2+ channel proteins with multiple runs, CACNA1F is involved with X-linked congenital stationary night blindness and, in addition, is a target for drugs alleviating hypertension. The ataxin-6 calcium channel (SCA6), which also contains extended CAG (polyglutamine) repeats, has been linked to familial hemiplegic migraine.

Strikingly, prokaryote protein analogs/homologs in the human genome do not have multiple amino acid runs. On this basis, multiple runs in human proteins may be a recent evolutionary outcome, concomitant with complex brain development. More than 80% of Drosophila proteins with multiple runs seem to function in developmental and transcription regulation. It is plausible that the corresponding human proteins are developmental proteins that function in embryogenesis and/or neurogenesis and become relatively quiescent during normal life. In a few anomalous cases, some maladies could become exacerbated at adult life stages, as with the late-onset triplet-repeat diseases. Screening mouse for proteins with multiple runs reveals substantial conservation with the human proteins. Specifically, we identified 56 SwissProt mouse entries with multiple runs, of which 52 have a known human homolog. In 43 cases (83%), the human homolog also has multiple runs; 5 (10%) of the mouse proteins have a homolog that has amino acid runs but does not meet the criterion for multiple runs; and 4 (7%) have human homologs that have one or no runs (these are DDX9 ATP-dependent RNA helicase A, DUS8 neuronal tyrosine threonine phosphatase 1, HOXD9 homeobox protein D-9, and UBF1 nucleolar transcription factor 1). Prominent examples of mouse/human homologs that share multiple runs include the CREB-binding protein, diaphanous 1 homolog, even-skipped homolog, GATA-binding proteins 4 and 6, anaplastic lymphoma kinase, MAZ myc-associated zinc finger, and the ZIC2 and ZIC3 proteins.

It is useful to highlight unusual protein sequence features accompanying many proteins with multiple runs. (i) Charge clusters. A charge cluster refers to a protein segment (typically 20–80 residues) with high specific-charge content relative to the charge composition of the whole protein (see ref. 9 for elaborations). The percentage of proteins with at least one significant charge cluster is about 19–23% in most eukaryotic species. In all current complete prokaryotic genomes, the percent of proteins with one or more charge clusters ranges from 6–10%. Proteins with multiple-charge clusters in eukaryotes are uncommon, about 2–4% and <1% in prokaryotes. In eukaryotes, charge clusters are associated with transcriptional activation, membrane receptor activity, and developmental regulation. By contrast, charge clusters are rare among the bulk of housekeeping and metabolic proteins, cytoplasmic enzymes, and among prokaryotic proteins. Primary families of proteins with multiple charge clusters include essential developmental proteins, voltage-gated Ca2+ and K+ ion channel complexes and transporters, and transactivator proteins of large eukaryotic DNA viruses (17). (ii) Alternating charge runs. A typical example is the alternating charge run (−,+)10 = EK(ER)4EK(ER)2EKER observed in the human triplet-repeat disease gene atrophin 1 (DRPLA). The human immune system-related RD-protein possesses an alternating (+,−)24 sequence, and the 42-kDa mouse histocompatibility complex MHC-H2 contains an unparalleled (+,−)26 sequence. The fly female sterile homeotic protein FSH contains (DR)4(ER)3. (iii) Histidine patterns. The period two-histidine pattern (HX)8 = H2HQHSHIHSHLHLHQ in the DRPLA protein is distinctive. In the human N-OCT3 (nervous-system specific octamer binding) protein, we observe the pattern HHADH(HP)2HSHPHQ. Many histidine periodic patterns occur in Drosophila developmental proteins. Histidine is a versatile amino acid that can adopt flexible roles in conformation, in catalytic actions, and in various enzymatic activities. Histidine patterns and runs also provide opportunities for differential charge gradients, hydrogen-bonding networks, and metal coordination. (iv) Multiplets. There are several levels and forms of repetitive structures (18). Multiplets comprise all homodipeptides XX, homotripeptides XXX, etc., where X denotes any specific amino acid. The count of multiplets provides a measure of the homopeptide density of the protein sequence.
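The compact pattern notation used above, such as EK(ER)4EK(ER)2EKER and H2HQHSHIHSHLHLHQ, can be expanded and checked mechanically. The following Python sketch is an illustration only: `expand` and `alternating_pairs` are hypothetical helper names, and the parser assumes that a number after a letter or a parenthesized group is a repeat count.

```python
import re

# A parenthesized group with a repeat count, or a single letter with an
# optional repeat count (assumed reading of the notation in the text).
TOKEN = re.compile(r"\(([A-Z]+)\)(\d+)|([A-Z])(\d*)")


def expand(compact):
    """Expand notation such as 'EK(ER)4' or 'H2HQ' into a plain sequence."""
    parts = []
    for group, group_count, letter, letter_count in TOKEN.findall(compact):
        if group:
            parts.append(group * int(group_count))
        else:
            parts.append(letter * (int(letter_count) if letter_count else 1))
    return "".join(parts)


def alternating_pairs(seq, first="DE", second="KR"):
    """Return the number of (first, second) pairs if seq strictly alternates
    between the two residue classes, else 0."""
    if len(seq) % 2:
        return 0
    ok = all((aa in first) if i % 2 == 0 else (aa in second)
             for i, aa in enumerate(seq))
    return len(seq) // 2 if ok else 0


drpla_charge = expand("EK(ER)4EK(ER)2EKER")   # alternating run from the text
drpla_his = expand("H2HQHSHIHSHLHLHQ")        # period-two histidine pattern
fsh = expand("(DR)4(ER)3")                    # fly FSH pattern from the text

print(alternating_pairs(drpla_charge), alternating_pairs(fsh))
print(all(drpla_his[i] == "H" for i in range(0, len(drpla_his), 2)))
```

Under this reading, the atrophin 1 pattern expands to 20 residues forming 10 acidic-basic pairs, consistent with the (−,+)10 label, and the histidine pattern places H at every second position over 16 residues, consistent with (HX)8.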


Our proteomic analysis comparing the five complete eukaryotic (human, fly, worm, yeast, weed) genomes focuses on proteins containing specific individual amino acid, charge, or hydrophobic runs. Multiple long amino acid runs in human proteins often are associated with diseases; e.g., triplet-repeat diseases, diseases induced by long acidic charge runs (as with lupus antigenic afflictions), CENP-B or nucleolin, and chromosomal translocation proteins, several of which cause leukemia. The fly proteome collection with multiple runs emphasizes proteins involved in developmental activities where glutamine runs are especially profuse. Serine runs are frequent in all genomes at a high level in the range 14–33% of the proteome.

Amino acid runs are significantly more abundant in many respects in the fly proteome compared with the other complete proteomes. (i) About 27% of fly proteins contain at least one amino acid run, whereas at most, 20% of protein sequences in the other genomes have long runs (Table 1). (ii) For proteins with multiple runs (runs of aggregate length ≥20 residues), fly sequences again stand out, accounting for about 7% of the proteome compared with ≤2.2% for the other genomes. (iii) The fly has 81 protein sequences, each with at least 10 runs, whereas worm has only 9 such proteins, human has 7, weed has 8, and yeast has 2. (iv) The most common amino acid run among fly sequences consists of Q residues (33.9%), but only 6% of runs in human proteins involve Q (the lowest proportion of the five genomes). Yet the human coding triplet-repeat diseases feature excessively long Q runs (Table 3). The percentage of proteins with runs in fly and human genomes differs significantly for the amino acids Q (fly 33.9%, human 6.0%), N (9.9%, 0.3%), and S (23.7%, 13.7%). What could account for the proliferation of runs in fly sequences compared with human sequences? The fly genome contains (percentage-wise) more protein runs than the other genomes (Table 1). This fact cannot be attributed to a protein sampling bias, because we are dealing with complete genomes. Is this abundance of runs true for all Drosophila species (e.g., D. virilis, pseudoobscura) and perhaps other insect populations? Is it possible that the current Drosophila melanogaster laboratory and/or domesticated strain sequences are significantly inbred? Early protein studies suggested that Drosophila exhibits high polymorphism (19). Is there a tie-in between polymorphism and run counts?

Another contingency is that there are innate differences in replication, information processing mechanisms, repair systems, DNA modification operations, and mutational biases between human (mammals in general) and fly, as shown in the following examples. (i) There is a lack of methylation activity in the fly and most invertebrates. (ii) Drosophila (and apparently all protostomes), unlike mouse, lacks embryonic transcription-coupled repair capacity (20). Drosophila also lacks mammalian type uracil DNA glycosylase (21). Does this mean that Drosophila DNA-replication processes are less accurate than those in mammalian eukaryotes? (iii) Drosophila is very different from mouse (and apparently also human) in replication processes. First, Drosophila DNA replicates frenetically in the first hours after fertilization, with replication bubbles distributed about every 10 kb (22). By 12 h, effective origins are spread to around 40 kb. In mice, the rate of replication seems to be uniform throughout developmental and adult stages. Moreover, cell divisions involve DNA stacking on itself and loopouts that need to be decondensed to undergo segregation. The observed narrow limits to intragenomic heterogeneity putatively correlate with conserved features of DNA structure. Second, Drosophila zygotic nuclei divide into 128 copies before the initial cell division (syncytium). It is possible there is DNA exchange (recombination) among these nuclei that generates extra amino acid runs. (iv) A difference in mutational patterns is manifest between human and fly genomes. In fact, complex sequence deletions in the fly are more frequent and extensive, especially evidenced by microsatellite changes (23, 24).

There seems to be some influence of the genome G + C content and dinucleotide relative abundances on occurrence of runs. For example, the yeast genome with only 38% G + C content is very low in the strong amino acids A, G, and P. The worm, yeast, and weed genomes are G + C poor (<40%), even in regions rich with genes, whereas human and fly genes favor enriched G + C content around gene-rich regions. The strong-codon amino acid group (A, G, P) is translated from codon types SSN (S is the strong nucleotide C or G, N is any nucleotide), and the weak-codon amino acid group (F, I, M, K, N, Y) is translated from codon types WWN (W is the weak nucleotide A or T). The G + C-rich human and fly proteins favor use of strong amino acids, compared with the A + T-rich yeast, worm, and weed sequences.
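The strong-codon and weak-codon groups can be derived from the standard genetic code. The sketch below builds the standard codon table and selects the amino acids whose codons all begin with two strong (C or G) or two weak (A or T) nucleotides. Restricting each group to amino acids encoded exclusively by such codons is an assumption of this example, adopted because it reproduces the groups listed in the text; `exclusively` is a hypothetical helper name.

```python
# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def exclusively(first_two):
    """Amino acids for which every codon has both of its first two bases
    drawn from first_two (stop codons are ignored)."""
    by_aa = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            by_aa.setdefault(aa, []).append(codon)
    return {
        aa
        for aa, codons in by_aa.items()
        if all(c[0] in first_two and c[1] in first_two for c in codons)
    }


strong = exclusively("CG")  # codon type SSN
weak = exclusively("AT")    # codon type WWN
print(sorted(strong), sorted(weak))
```

Arginine and leucine drop out under this rule because each is also encoded by codons of the other type (AGR for arginine, CTN for leucine), which matches their absence from the two groups named in the text.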

There is obviously strong selection against asparagine runs among mammalian sequences. Structurally, N runs avoid the secondary structures of α-helices and β-strands and tend to establish disordered loops (25). We further speculate that runs of N may be prone to excessive glycosylation in mammals and seem to be selected against among mammalian protein sequences. For unknown reasons, the very A + T-rich malaria parasite Plasmodium falciparum is replete with N runs (data not shown). We conjecture that this fact may in some way assist Plasmodium in evading the host immune system response. The dearth of N runs in human protein sequences cannot be attributed to differences in amino acid usage. In fact, the median asparagine usage frequency is quite similar across the five genomes: human, 4.3%; fly, 4.5%; worm, 3.7%; yeast, 3.7%; weed, 3.2%. Also, the full quantile usage distributions for asparagine are rather similar across eukaryotes.

Nonspecific hydrophobic runs commonly identify transmembrane segments of receptor or extracellular proteins, and L runs (4–7 residues) stand out in signal peptide sequences near the amino terminus of membrane and extracellular proteins. Unlike other aliphatic and aromatic residues in the human genome, L runs are strikingly high (19.0%). The prominence of L among protein sequences certainly reflects its important role in hydrophobic cores, in transmembrane segments, and in signal peptides, and its prevalence and stability in secondary and tertiary structures. The relatively high alanine frequency in proteins also may reflect on α-helix stability and flexible hydrophobic properties. Interestingly, in human nuclear proteins, serine runs predominate.

Charge Compositional Biases.

For all eukaryotes, the median net charge of proteins is slightly negative (around −0.5%). The aggregate positive charge (K + R) per protein is generally constant over species, at 11.5–12.0%. However, the median K and R frequencies per protein vary individually across the different species. For example, in human, R is under-represented, presumably because of CpG suppression, whereas in E. coli, K is under-represented. Why are E runs more frequent than D runs? From a structural viewpoint, D is recognized as an α-helix breaker, whereas E is favorable to α-helix formation. Moreover, the side chain of E involves two methylene groups as against a single methylene group in D, thus providing greater conformational flexibility. D and E are encoded by similar codon forms (GAY and GAR, respectively), but the juxtaposition of purine-pyrimidine at codon sites 2 and 3 may be sterically unfavorable compared with a purine-purine arrangement (26).

Residues on the surface of proteins presumably need to be highly selective to be able to interact with appropriate structures or to avoid interacting with other structures. From this viewpoint, a general net negative charge or a negative charge run may more easily avoid (for example, mediated by electrostatic repulsion) undesirable interactions with DNA, RNA, membrane surfaces, and other proteins. The extracellular environment for metazoans is mildly alkaline, with pH ~ 7.2–7.4 (27), whereas the intracellular pH is variable, ranging from 5.0 to 7.2, depending on tissue type and subcellular localizations (28, 29). One might speculate that enzyme activity is “optimal” at a pH similar to the pH of the host cells, which in mammalian organisms tend to be slightly acidic. Moreover, protein negative charge runs can contribute in modulating secretion and intracellular transport, in inducing transcriptional activation, and generally, in mediating rapid and potent interactions of protein assemblages. Mixed charge runs often contribute to protein–protein interaction at the interface of quaternary formations (30).

There is strong correlation between protein sequences with multiple runs and highly anomalous charge distribution. In particular, many of these proteins contain two or more charge clusters that putatively function through domain interactions with DNA, RNA, or other proteins and facilitate intramolecular conformation. Segments linking the domains are often uncharged polar regions involving moderate length polar homopeptides. The charge regions might contribute functional properties, whereas uncharged stretches have scaffold or hinge roles, providing flexibility to the three dimensional conformation, or help in fine-tuning domain organization. However, excessively lengthened homopeptides can induce incorrect domain interactions, producing aberrant conformation and inappropriate protein–protein interactions. Extended polyQ tracts may corrupt protein conformation, causing mis-folding of the protein. Also, long glutamine runs or glutamine-rich domains can recruit proteins into polyQ aggregates with concomitant instabilities (4, 31). Long coding CAG triplets (polyglutamine) are unstable and produce insoluble aggregates that seem to be toxic (14). There are dynamic mutations leading to disease based on noncoding triplet nucleotide repeats; e.g., fragile X, myotonic dystrophy, and Friedreich ataxia. It may be the repetitive nature of the nucleotides rather than the ability to code multiple amino acid runs that is critical to the disease mechanism (however, see refs. 6–8 and 32).

What are the potential benefits and the problems of multiple runs in proteins? Extended runs can provide substrates for caspase cleavage, yielding tangles, plaques, dead neurons, and a signal for apoptosis. Runs may provide binding sites for protein–protein interactions. Also, extended runs may trigger inflammatory brain responses, oxidative damage, and protein aggregations that clog the proteasome (15).

Why do the polyglutamine disease genes all encode multiple amino acid runs in addition to the pathogenic repeat? The reasons are not known. The contemporaneous presence of other unusual protein features such as charge clusters, alternating charge runs, periodic histidine patterns, and high numbers of multiplets is fascinating. Multiple runs likely fulfil a role in protein structure, protein–protein interactions, and transcription regulation. Extensive runs prominently feature glutamine which can produce aggregation with consequent toxicity. Apart from long Q runs, long E runs in human sequences do occur and could engender structural distortions, but perhaps contribute positively to function. Two questions spring to mind. First, are multiple runs highly polymorphic, as is the case with the polyglutamine repeat in many triplet repeat diseases? Second, are multiple runs predictive of disease associations? Both questions may be addressed experimentally, by surveying the population for polymorphism at repeat loci, and by testing whether multiple repeats are expanded in disease phenotypes. Further, novel mouse disease models can be made by expanding the repeats in candidate proteins.



We are grateful to Allan Campbell, Dmitri Petrov, and Lubert Stryer (Stanford) and Dan Geschwind (Univ. of California, Los Angeles) for helpful discussions and suggestions regarding this manuscript. This research was supported in part by National Institutes of Health Grants 5R01GM10452-36 and 5R01HG00335-14.


§Human proteins from SwissProt have essentially identical statistics (percentages) on runs as those presented for the genome.


1. Zoghbi H Y, Orr H T. FASEB J. 1997;11:864–864.
2. Cummings C J, Zoghbi H Y. Annu Rev Genomics Hum Genet. 2000;1:281–328.
3. Sutherland G R, Richards R I. Curr Opin Genet Dev. 1995;5:323–327.
4. Sinden R R. Nature (London) 2001;411:757–758.
5. Kovtun I V, Goellner G, McMurray C T. Biochem Cell Biol. 2001;79:325–336.
6. Green H. Cell. 1993;74:955–956.
7. Perutz M F, Johnson T, Suzuki M, Finch J T. Proc Natl Acad Sci USA. 1994;91:5355–5358.
8. Perutz M F. Trends Biochem Sci. 1999;24:58–63.
9. Karlin S. Curr Opin Struct Biol. 1995;5:360–371.
10. Kreil D P, Kreil G. Trends Biochem Sci. 2000;25:270–271.
11. Rolfs A, Hediger M A. J Physiol. 1999;518:1–12.
12. Yong V W, Power C, Forsyth P, Edwards D R. Nat Rev Neurosci. 2001;2:502–511.
13. Warrick J M, Chan H Y E, Gray-Board G L, Chai Y, Paulson H, Bonini N M. Nat Genet. 1999;23:425–428.
14. Chai Y, Koppenhafer S L, Bonini N M, Paulson H L. J Neurosci. 1999;19:10338–10347.
15. Bence N F, Sampat R M, Kopito R R. Science. 2001;292:1552–1555.
16. Ludwig A, Zong X, Stieber K, Hullin R, Hofmann F, Biel M. EMBO J. 1999;18:2323–2329.
17. Karlin S, Blaisdell B E, Brendel V. Methods Enzymol. 1990;183:382–402.
18. Karlin S, Blaisdell B E, Bucher P. Protein Eng. 1992;5:729–738.
19. Nevo E, Beiles A, Ben-Shlomo R. Lect Notes Biomath. 1984;53:13–213.
20. deCock J G, Klink E C, Ferro W, Lohman P H, Eeken J C. Mutat Res. 1992;293:11–20.
21. Aravind L, Koonin E V. Genome Biol. 2000;1:RESEARCH0007.
22. Blumenthal A B, Kriegstein H J, Hogness D S. Cold Spring Harbor Symp Quant Biol. 1974;38:205–223.
23. Petrov D A, Lozovskaya E R, Hartl D L. Nature (London) 1996;384:364–369.
24. Petrov D A, Hartl D L. Mol Biol Evol. 1998;15:293–302.
25. Richardson J S, Richardson D C. Science. 1988;240:1648–1652.
26. Hunter C A. J Mol Biol. 1993;230:1025–1054.
27. Roos A, Boron W F. Physiol Rev. 1981;62:296–434.
28. Alberts B, Bray D, Lewis J, Raff M, Roberts K, Watson J D. Molecular Biology of the Cell. 3rd Ed. New York: Garland; 1994.
29. Stryer L. Biochemistry. 4th Ed. New York: Freeman; 1995.
30. Zhu Z-Y, Karlin S. Proc Natl Acad Sci USA. 1996;93:8350–8355.
31. Michelitsch M D, Weissman J S. Proc Natl Acad Sci USA. 2000;97:11910–11911.
32. Karlin S, Burge C. Proc Natl Acad Sci USA. 1996;93:1560–1565.
