Genome Res. Apr 2005; 15(4): 537–551.
PMCID: PMC1074368

Functional insights from the distribution and role of homopeptide repeat-containing proteins


Expansion of “low complex” repeats of amino acids such as glutamine (Poly-Q) is associated with protein misfolding and the development of degenerative diseases such as Huntington's disease. The mechanism by which such regions promote misfolding remains controversial, the function of many repeat-containing proteins (RCPs) remains obscure, and the role (if any) of repeat regions remains to be determined. Here, a Web-accessible database of RCPs is presented. The distribution and evolution of RCPs that contain homopeptide repeat tracts are considered, and the existence of functional patterns investigated. Generally, it is found that while polyamino acid repeats are extremely rare in prokaryotes, several eukaryote putative homologs of prokaryote RCPs (involved in important housekeeping processes) retain the repetitive region, suggesting an ancient origin for certain repeats. Within eukarya, the most common uninterrupted amino acid repeats are glutamine, asparagine, and alanine. Interestingly, while poly-Q repeats are found in vertebrates and nonvertebrates, poly-N repeats are only common in more primitive nonvertebrate organisms, such as insects and nematodes. We have assigned function to eukaryote RCPs using Online Mendelian Inheritance in Man (OMIM), the Human Protein Reference Database (HPRD), FlyBase, and Wormpep. Prokaryote RCPs were annotated using BLASTp searches and Gene Ontology. These data reveal that the majority of RCPs are involved in processes that require the assembly of large, multiprotein complexes, such as transcription and signaling.

Single amino acid repeats are regions within proteins that comprise a single homopolymeric tract of a particular amino acid. Uncontrolled genetic expansions of such regions have been shown to lead to the development of serious debilitating human diseases. For example, expanded poly-Q and poly-A tracts are associated with the development of neurological disorders such as Huntington disease and Oculopharyngeal Muscular Dystrophy (OPMD), respectively. Several studies have also demonstrated that many nondisease-linked polyamino acid tracts are toxic to cells and/or lead to protein aggregation or misfolding (Dorsman et al. 2002; Fandrich and Dobson 2002).

Of the polyamino acid repeats characterized to date, poly-Q repeats are the most extensively studied. Nine poly-Q-linked diseases have been identified, and the proteins believed to be responsible for the disease contain expanded poly-Q tracts that have been shown to possess an enhanced tendency to aggregate and form fibrils both in vitro and in vivo (Scherzinger et al. 1997; Skinner et al. 1997; Becher et al. 1998; Holmberg et al. 1998; Li et al. 1998; Warrick et al. 1998; Chow et al. 2004c). The pathogenic length of the poly-Q tract appears to be specific to each protein family (Cummings and Zoghbi 2000). For example, Huntington's disease only develops when the poly-Q repeat within the Huntingtin protein is 38 amino acids (generally encoded by 36 CAG repeats + CAA + CAG). In contrast, Machado-Joseph disease develops when the poly-Q repeat in ataxin-3 is 45 amino acids in length (Cummings and Zoghbi 2000; Chow et al. 2004a). The accumulation of the aggregated state in vivo appears to correlate with cell death and the onset of degenerative disease; however, the mechanism of poly-Q toxicity remains controversial. While the aggregated or fibrillized conformation of poly-Q tracts is postulated to be β-sheet rich, the precise structure and mechanism of fibril formation remains to be characterized (Wetzel 2002; Chow et al. 2004b). Furthermore, the function (if any) of a typical poly-Q tract remains to be determined. In contrast, studies focusing on the function of poly-Q RCPs have revealed interesting functional patterns; an investigation of Drosophila melanogaster poly-Q RCPs revealed that many of these proteins are transcription factors involved in development (Karlin and Burge 1996; Alba and Guigo 2004).

More recently, proteins containing expanded alanine tracts have been linked to several human diseases (Brown and Brown 2004). Like poly-Q RCPs, many of the disease-linked alanine RCPs are transcription factors (Brown and Brown 2004), and proteins containing lengthened alanine tracts (>10) also demonstrate an enhanced tendency to aggregate and form fibrils (Fan et al. 2001). Structural studies have revealed that while short polyalanine peptides form α-helices (Giri et al. 2003), the secondary structure of longer polyalanine tracts is predicted to be predominantly β-strand like (Giri et al. 2003). It is thus suggested that longer alanine-tracts possess an enhanced tendency to form β-sheet-rich fibrillar structures.

Several other repeat types have also been investigated. A polyglycine tract in the plant protein Toc-75 (a component of the protein import machinery in the chloroplast) has been shown to be important for targeting this protein to the outer envelope of the chloroplast (Inoue and Keegstra 2003) and several viral polyarginine-rich proteins have been shown to be involved in RNA binding (Calnan et al. 1991; Nam et al. 2001). Finally, a recent biochemical study (Oma et al. 2004) revealed that long amino acid tracts of almost all types possess a general tendency to aggregate.

To date, the role of many amino acid repeats and RCPs remains somewhat obscure, and it is likely that numerous disease-linked RCPs remain to be identified. To begin to address this problem, a global genome survey has been performed to identify all homopeptide RCPs; these data have been stored in an online database. Resources such as Online Mendelian Inheritance in Man (OMIM) and FlyBase were used to map function onto eukaryote RCPs. BLASTp and Gene Ontology were used to functionally annotate where possible prokaryote RCPs. When considered as a whole, striking functional patterns, independent of amino acid type, can be observed across all RCPs; these data reveal that the majority of RCPs perform roles in processes that require the assembly of large multiprotein or protein/nucleic acid complexes.


A Web-accessible database of RCPs

We identified all homopeptide repeats in GENPEPT greater than six amino acids in length; these data are available at http://repeats.med.monash.edu.au. The homepage includes a table listing the number of RCPs for each amino acid type. A variety of search options are available (accession numbers, description of the protein, etc.), and searches based upon a repeat pattern (for example, poly-A followed by poly-Q) within a single RCP can be performed for multirepeat-containing proteins. A graphical display of the results is presented, and all proteins are linked to their National Center for Biotechnology Information (NCBI) record and an OMIM record, if applicable.

Within GENPEPT (2,677,049 proteins) 1.4% of proteins are RCPs; a total of 54,566 homopeptide repeats could be identified in 37,355 RCPs (Table 1; Fig. 1A). RCPs from environmental sequences (Venter et al. 2004) and viral sequences are listed in Table 1; however, these have not been further considered in this present study.

Figure 1.
Distribution of RCPs in GENPEPT. (A) The distribution of the RCPs in GENPEPT and the total number of repeats in GENPEPT. The bars represent the total number of repeats and the solid diamonds the number of RCPs. (B) Distribution of the repeats based on ...
Table 1.
Number of homopeptide repeats and RCPs in GENPEPT, Eukaryotes, and Prokaryotes

Several general trends are apparent across all the data. The vast majority (87%) of all RCPs are from eukaryotes; prokaryote RCPs are rare (4%) (Table 1). This is in agreement with previous studies (Karlin and Burge 1996; Marcotte et al. 1999; Huntley and Golding 2000). Within GENPEPT, poly-Q repeats are the most common (16%), whereas poly-W repeats are extremely rare (only three poly-W RCPs could be identified) (Fig. 1A). In eukaryotes, poly-Q, poly-N, poly-A, poly-S, and poly-G are the most common repeat types. In prokaryotes, poly-S, poly-G, poly-A, and poly-P are most common, whereas poly-Q and poly-N repeats are relatively rare. In both eukaryotes and prokaryotes, poly-I, poly-M, poly-W, poly-C, and poly-Y repeats are either absent or very rare.

When classified according to their physicochemical properties and normalized for the overall frequency of single amino acids within GENPEPT, there is an overrepresentation of polar repeats in comparison to hydrophobic repeats and of acidic repeats in comparison to basic repeats (Fig. 1B). These data are in agreement with a previous study that suggested that long stretches of hydrophobic residues possess greatly enhanced toxicity in comparison to similar stretches of hydrophilic residues (Dorsman et al. 2002; Oma et al. 2004) and are thus selected against.

Homopeptide length

Figure 2 shows a graph of homopeptide repeat length against frequency for each amino acid type. Several general trends are apparent. Most hydrophobic or rare repeats (poly-I, poly-F, poly-Y, poly-V, poly-L, poly-M, poly-R, and poly-W) tend to be short (<15 amino acids in length). In contrast, more extensive repeats of amino acids such as poly-Q, poly-N, poly-T, poly-S, and poly-E are common. Certain very long (>50 amino acids) repeats can be identified; however, these are very rare (148 repeats).

Figure 2.
Length of homopeptide repeats. Three-dimensional plot showing repeat length (x-axis) versus amino acid type (y-axis; also highlighted in key) versus percentage of each repeat class of a particular length limited to those repeats <51 amino acids ...

Proteins with more than one homopeptide repeat

Within GENPEPT, 23% of all RCPs contain more than one repeat tract (Table 1). In eukaryotes, 24% of RCPs contain multiple repeat tracts and only 9% of proteins in prokaryotes (two archaeal and 113 bacterial proteins) are multirepeat-containing proteins.

The most common pattern in GENPEPT after a single amino acid repeat is the doublet GG (752) followed by QQ (736), PP (596), SS (505), and NN (485). The propensity of one repeat type to occur with another in the same protein was investigated. Repeat pairs were tallied according to the number of related sequence families in which they were found; Table 2 shows the frequency with which a repeat of one type occurs with another. Strikingly, for all repeats except poly-L, poly-R, and poly-V, the strongest association was with either poly-N or poly-Q tracts (excluding self–self pairs).

Table 2.
Frequency of repeat pairing

Distribution of repeats within eukaryote organisms

Figure 3, A and B, shows the distributions of RCPs in eukaryotes whose genomes are either complete or near completion. These data highlight several interesting anomalies. Drosophila melanogaster possesses an overabundance of poly-Q RCPs, >3.5-fold more than that of Homo sapiens and sixfold more than another insect, the mosquito Anopheles gambiae (Fig. 3B). In contrast, poly-Q repeats are extremely rare in Plasmodium falciparum; this organism instead possesses an overabundance of poly-N RCPs. Analysis of other complete eukaryote genomes revealed that poly-Q repeats are absent in Encephalitozoon cuniculi (an intracellular parasite). Another striking difference is in the distribution of poly-N RCPs. Nonvertebrate organisms all contain asparagine RCPs, whereas poly-N tracts are either absent or extremely rare in vertebrates (Fig. 3A). The human genome contains 233 poly-Q RCPs, but only eight poly-N RCPs, all of which are 8-residue repeats in the N terminus of the insulin receptor substrate 2. The genome of Mus musculus contains 170 poly-Q RCPs and only 13 poly-N RCPs (seven from thioredoxin interacting factor, one in the insulin receptor substrate 2, one in a transcription factor, and four in unknown proteins). The genomes of Gallus gallus and Xenopus laevis do not contain asparagine RCPs.

Figure 3.
Distribution of RCPs in eukaryotes. (A) Distribution of the RCPs in vertebrate species, Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Xenopus laevis (frog), Danio rerio (fish), and Gallus gallus (chicken). (B) Distribution of RCPs ...

In order to include an avian representative in our analysis, we also examined the distribution of RCPs in the chicken. These data reveal an apparent paucity of RCPs in G. gallus as compared with H. sapiens and M. musculus (Fig. 3A); however, we cannot exclude the possibility that this observation is a result of the preliminary nature of the available genomic data. Finally, we note that repeats are completely absent in the nucleomorph of Guillardia theta (Chromophyte algae).

Evolution of RCPs

In order to investigate the evolutionary context of RCPs, prokaryote RCPs were clustered into 1435 families (www.genome.org). Of these families, the majority (1056) are “orphan” RCPs (i.e., a single member). In order to search for eukaryote homologs of RCPs, PSI-BLAST searches of GENPEPT using probes from all clusters with more than five members (47) and eight randomly chosen clusters with fewer than five members (a total of 55 clusters) were performed (Table 3). The largest cluster considered contains 117 members from the xylanase family. Eukaryote putative homologs of this family were identified, and these putative homologs did not contain the repeat tract. Three other large families, the Ribonuclease E (25 members), membrane carboxypeptidase (25 members), and the β-propeller domain of methanol dehydrogenase (23 members), were also identified. Eukaryote homologs of the Ribonuclease family did not contain repeat tracts, and no convincing eukaryote putative homolog of methanol dehydrogenase could be identified. These results are summarized in Table 3. We were able to detect 20 prokaryote RCP families that extended into eukaryotes. Of these, only three families contained analogous repeat regions or an “interrupted” single amino acid-rich repeat tract in both prokaryote and eukaryote members: the heat-shock protein DnaJ (a molecular chaperone; HSP40) (Fig. 4), and the two ribosomal proteins L10 and L12 (Figs. 5 and 6).

Figure 4.
Multiple sequence alignment of DnaJ. A multiple alignment of the glycine repeat in DnaJ (Hsp40). From prokarya: Bacillus halodurans, Clostridium thermocellum, Sinorhizobium meliloti, Shigella flexneri, Magnetospirillum magnetotacticum, Geobacter sulfurreducens ...
Figure 5.
Multiple sequence alignment of the ribosomal protein L12. A multiple sequence alignment of the glutamic acid repeat in L12 from the prokaryote species M. thermautotrophicus str. Delta H, Archaeoglobus fulgidus DSM 4304, Pyrococcus abyssi, M. kandleri ...
Figure 6.
Multiple sequence alignment of the ribosomal protein L10. A multiple alignment of the glutamic acid-rich region in L10 from the following prokaryote species: Halobacterium sp. NRC-1, Haloarcula marismortui, Haloferax volcanii, Sulfolobus solfataricus, ...
Table 3.
RCP representatives from the prokaryote clusters with more than four members and the seven randomly chosen clusters with fewer than five members

Functional groups in H. sapiens, D. melanogaster, C. elegans, and prokaryote RCPs

We used the OMIM database and related resources to functionally group human RCPs (Fig. 7A). Sixty percent of the human RCPs have an OMIM record, and 120 diseases are associated with these records (Supplemental Table 1). In addition, all D. melanogaster RCPs were mapped onto FlyBase (FlyBase Consortium 2003). These were functionally grouped using Gene Ontology (GO) (Fig. 7B; Supplemental Table 2). These data reveal that the majority (85%) of D. melanogaster RCPs are intracellular proteins. Of the remainder, 13% are predicted to be membrane associated, and only 2% are predicted to be extracellular.

Figure 7.
Function of RCPs. (A) Bar graph showing the function of human RCPs, based upon OMIM. Amino acid types are represented by different colors (see key). On the x-axis, the functional classes DNA RNA (i.e., Transcription/chromatin binding/DNA binding/RNA binding/translation), ...

Interestingly, clear functional trends are apparent throughout the data set, the majority of both human and fruit-fly RCPs performing roles in transcription/translation and signaling processes. Enzymes, transport proteins, adhesion proteins, and structural proteins also commonly contain homopeptide repeats. We performed a similar analysis of C. elegans RCPs using Wormpep, and observed similar trends (Fig. 7C). Finally, we functionally annotated, where possible, prokaryote RCPs (Fig. 7D).

Discrete domains within RCPs

In order to investigate the functional context of homopeptide repeats within RCPs, we performed a survey of protein domains associated with repeat tracts, using the OMIM data set. Ninety-three percent of RCPs listed in OMIM contained a domain listed in Pfam (Bateman et al. 2004) and/or SMART (Letunic et al. 2004). Of these, 43% contained a repeat N-terminal to a characterized domain, 10% contained a repeat between domains, and 30% contained a repeat C-terminal to a characterized domain. Also, 16% of RCPs contained a repeat within a characterized domain. Many RCPs are transcription factors or involved in transcriptional processes; hence, many repeat tracts are associated with a variety of nucleic acid-binding domains. However, no obvious pattern of association between repeat type and precise domain type could be detected.
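The positional classes used above (N-terminal, between domains, C-terminal, within a domain) can be illustrated with a minimal sketch. The text does not state the exact rules applied, so the coordinate convention (inclusive start and end positions), the function name `classify_repeat`, and the choice to treat any overlap with a domain as "within" are all assumptions made for illustration.

```python
def classify_repeat(repeat, domains):
    """Classify a repeat tract relative to the annotated domains of one protein.

    repeat  -- (start, end) of the repeat, inclusive residue positions
    domains -- list of (start, end) for each Pfam/SMART domain, inclusive
    Assumption: any overlap between repeat and domain counts as "within".
    """
    if not domains:
        raise ValueError("at least one characterized domain is required")
    r_start, r_end = repeat
    # Repeat overlaps a domain: counted as lying within that domain.
    if any(r_start <= d_end and d_start <= r_end for d_start, d_end in domains):
        return "within"
    # Repeat ends before every domain starts.
    if all(r_end < d_start for d_start, _ in domains):
        return "N-terminal"
    # Repeat starts after every domain ends.
    if all(r_start > d_end for _, d_end in domains):
        return "C-terminal"
    # Otherwise it lies in a gap between two domains.
    return "between"
```

For a hypothetical protein with domains at residues 50-120 and 200-260, a repeat at residues 130-140 would be classed as lying between domains.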


Our data reveal that RCPs are far more abundant in eukaryotes than in prokaryotes. In addition, based upon analysis of the D. melanogaster data set, the majority of eukaryote RCPs are predicted to be intracellular proteins. Furthermore, in agreement with the studies of Marcotte et al. (1999), there is an overrepresentation of polar repeats in comparison to hydrophobic repeats, and most hydrophobic repeats are relatively short in comparison to their polar counterparts. These data are consistent with previous studies that suggest hydrophobic RCPs aggregate and form toxic fibrils (Dorsman et al. 2002; Oma et al. 2004).

Glycine, serine, and proline repeats are common in both prokaryotes and eukaryotes; however, common eukaryote repeats such as glutamine, asparagine, and glutamic acid are relatively rare in prokaryote organisms. Of 29 asparagine RCPs in prokaryotes, 11 are orphans, 12 do not have eukaryotic homologs, and six have putative eukaryote homologs. However, these homologs do not contain the repeat or an equivalent amino acid-rich region. Of the 51 glutamine RCPs in prokaryotes, 23 are orphans, 21 do not have eukaryotic homologs, and seven have putative eukaryote homologs, but again, these homologs do not contain a repeat or an equivalent amino acid-rich region.

Certain discrepancies are clearly apparent when considering repeat distribution within eukaryotes. For example, glutamine RCPs (the most common eukaryote repeat) are rare or absent in P. falciparum, E. cuniculi, and G. theta. Furthermore, our data reveal that while asparagine repeats are common in nonvertebrates such as insects and nematodes, such repeats are extremely rare in vertebrates. Kreil and Kreil (2000) previously observed the rarity of asparagine repeats in mammals. This is surprising, since glutamine and asparagine are chemically and structurally similar. We could not identify any function specific to poly-N repeats; indeed, an analysis of all D. melanogaster poly-N RCPs reveals that, like other RCPs (see below), a large proportion (>60%) are transcription factors or proteins involved in transcriptional processes (Table 2; Supplemental data). Interestingly, in most multirepeat-containing RCPs, glutamine or asparagine is the most likely associated partner polyamino acid (Table 2). The functional significance of these data remains to be understood.

Very rarely, repeats are conserved across entire protein families, and only three families (DnaJ and the ribosomal proteins L10 and L12) could be identified with repeat regions in eukaryote and prokaryote putative homologs. A sequence alignment of the DnaJ family (Fig. 4) reveals that an extensive glycine repeat is present in most putative homologs. However, in the majority of these, the repeat is interrupted (typically with an alanine or phenylalanine residue), and this region is contracted in many eukaryotic counterparts. DnaJ functions in complex with at least two other proteins (DnaK and GrpE) to control processes such as protein folding, apoptosis, and the degradation of misfolded proteins (Gragerov et al. 1992; Hendrick et al. 1993; Craig et al. 1994; Gotoh et al. 2004). Deletion of the glycine/phenylalanine-rich motif decreases the binding affinity of the substrate σ32 to DnaK (Wall et al. 1995), and it has been proposed that this region may act as a flexible linker to allow DnaJ/K complex formation and subsequent activation of DnaK (Wall et al. 1995).

Both archaeal and eukaryotic L10 and L12 proteins contain a C-terminal region that comprises an alanine-rich region, termed the hinge, followed by a glutamic acid-rich repeat (Figs. 5, 6). L10 and L12 form part of a complex in the large ribosomal subunit termed the stalk protuberance. The hinge, as well as the acidic region in both proteins, are postulated to function as flexible regions that mediate a variety of protein–protein interactions and are important for processes such as elongation (Remacha et al. 1995; Uchiumi et al. 2002a,b; Gonzalo and Reboud 2003). The acidic region in L10 and L12 is severely truncated in certain eukaryote orthologs; for example, this region is absent in L12e from X. laevis. Bacterial L12 and L10 proteins do not contain the acidic tail. The absence of this region has been observed previously (Ramirez et al. 1989), and it is suggested that the evolution of this region arose after the archaea/bacteria split, but prior to the emergence of eukaryotes.

A major question in repeat-related research fields is the role of RCPs and, in particular, the role of the repeat region itself. Evolutionary pressure on repeat regions is likely to include functional requirement, mutability of the underlying nucleotide sequence, and potential toxicity. Our analysis allows us to begin to address these questions from a functional perspective. Assigning function to protein sequences is nontrivial, since many proteins perform overlapping functions (for review, see Whisstock and Lesk 2003). However, using the OMIM database and other annotation resources, we have attempted to classify RCPs under as broad a class as possible. Most eukaryote RCPs are involved in transcription/translation or interact directly with DNA, RNA, or chromatin, irrespective of repeat type (Fig. 7A). Other common classes of RCPs include signaling molecules, structural proteins, transport molecules, and enzymes (Fig. 7). Previous research on smaller eukaryote data sets also note that repeat tracts are overrepresented in transcription factors and DNA-binding proteins (Karlin and Burge 1996; Mar Alba et al. 1999; Alba et al. 2002; Alba and Guigo 2004). The results of this study are consistent with these findings and identify other functional classes rich in RCPs.

We performed a functional analysis of RCPs from prokaryotes (Table 3; Fig. 7D). While 38% of these families perform as yet uncharacterized roles, several of the functional themes apparent in the eukaryote data set are also noticeable in prokaryotes. The most common functional class (accounting for 24% of classified molecules) comprises RCPs that perform roles as enzymes. Transport proteins, structural proteins, and transcription/translation-related RCPs are also common; however, in contrast to eukaryotes, the dramatic bias toward transcription/translation-related processes is not observed. One possible explanation for these data is that bacterial genomes are smaller than those of eukaryotes and are not packaged and controlled in such a complex fashion. RCPs involved in signaling-related processes are also relatively rare in prokaryotes in comparison to eukaryotes. Again, we argue that this may be a result of the increased complexity of eukaryote processes; while bacteria utilize intracellular signaling processes to communicate intracellularly and with their environment, these processes are relatively modest in comparison to eukaryote-signaling cascades.

The repeats database provides a basis for understanding the function of RCPs and their associated repeats in all organisms. Strikingly, the majority of RCPs considered are involved in processes that require the assembly or association of large multiprotein and/or nucleic acid complexes (Fig. 7). For example, the ribosome (itself a large protein/RNA complex) requires a large number of additional factors (e.g., elongation factors) to function properly. Processes such as transcription involve the assembly of multiprotein complexes (e.g., RNA polymerase) and the binding of discrete sequences of DNA that may be kilobases apart (Tolhuis et al. 2002); chromatin condensation requires bundling and folding of nucleosome arrays by protein factors such as Nucleoplasmin (for review, see Akey and Luger 2003; Grigoryev 2004); signaling requires the assembly of a large number of proteins into a signalosome or transducisome (e.g., the MAP Kinase complex) (for review, see Burack and Shaw 2000); and ion channels such as the cGMP-gated channels in rod photoreceptors are involved in large protein complexes that link the plasma membrane with the outer segment disk (Korschen et al. 1999). In many of these processes, the involvement of repeats in mediating protein–protein interactions has been noted (for example, the glutamic acid-rich proteins of cGMP-gated channels and the acidic tails of ribosomal proteins) (Remacha et al. 1995; Korschen et al. 1999; Poetsch et al. 2001).

The data presented in this study reveal that the vast majority of the human repeat tracts present in the OMIM data set (83%) are located N-terminal to, as well as C-terminal to, or in-between discrete domains. Alba and Guigo (2004) have also noted that certain repeats (L, A, P, and Q) appear to be preferentially located at the N terminus. Structural studies reveal that most repeat regions do not adopt discrete well-ordered structures, and instead, are often disordered (Huntley and Golding 2002).

The function of RCPs, as well as the interdomain or terminal location of the repeat tract within these molecules and the disordered structure of these repeat sequences, supports the idea that repeats play roles as flexible spacer elements/“tethers” between individual folded domains in molecules that mediate protein–protein or protein–nucleic acid interactions (Karlin and Burge 1996).

Based upon this work, it is suggested that a general function of the majority of repeat sequences is to mediate the assembly of protein complexes, and that RCPs may act as molecular “fishing lines”, mediating interactions either through tethered distant domains, or indeed, through interactions with the repeat itself (e.g., Gerber et al. 1994; Korschen et al. 1999). Such a flexible structure would allow a single protein within a complex to recruit additional factors from the cytoplasmic or nuclear milieu (Fig. 8). A repeat tract would also be able to bridge large distances, such as is required in chromatin packaging or transcription. While many specialized proteins are able to achieve both distance and conformational mobility via an ordered three-dimensional fold (e.g., myosin) (Pollard 2000) or the serpin superfamily of proteinase inhibitors (Whisstock et al. 1998), a flexible unstructured region is a simple way of achieving a similar, albeit conformationally uncontrolled, outcome. A repeat does not require complicated molecular machinery to achieve flexibility and, because it is unstructured, appropriately sized repeats would be predicted to interfere minimally with other molecules in a complex.

Figure 8.
General role of repeat regions. It is suggested that RCPs (red) function from within large multiprotein and/or nucleic acid complexes (green circle). An example is shown where a two-domain protein (pink circles) functions via a flexible repeat to recruit ...

Several studies have revealed that long stretches of hydrophobic amino acids are more toxic than hydrophilic counterparts (Dorsman et al. 2002; Oma et al. 2004). Thus, in order to bridge large distances, long tracts of polar repeats such as poly-Q, poly-N, and poly-E can be used as suggested by the distribution of repeat length and amino acid type (Fig. 2). The consequence of utilizing single amino acid repeats is that repeat expansion can result in the formation of ordered aggregates or fibrils. Such an event could result in cell death through a variety of mechanisms; the toxic nature of the fibril, loss or gain of function, the destruction of a large essential multiprotein complex, and/or the sequestering of nonspecific factors.

The majority of proteins sequenced to date do not contain repeats. While certain repeats are common throughout entire protein superfamilies (such as the DnaJ family), the data gleaned in this study reveal that repeat proteins are often “orphans.” We suggest that the putative role of repeats is ancient; however, the relatively sporadic distribution of these regions suggests that repeats often evolve to perform in specialized processes unique to a particular organism or set of organisms.


The March (2004) version of GENPEPT (from ftp://ftp.ncbi.nih.gov/blast/db) was used in this study.

Homopeptide searches

All homopeptides longer than 4 amino acids were discovered using the regular expression “([AaVvLlIiPpMmFfWwGgSsTtCcNnQqYyDdEeKkRrHh])\1+” across the GENPEPT database. This expression will find all occurrences of an amino acid repeat such as “aa,” “aaa,” and “gggg” (Friedl 2002). A minimum length threshold of 7 amino acids was set. The results from the study can be queried and visualized via the Web site http://repeats.med.monash.edu.au.
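As a minimal sketch of how such a search might be run on a single sequence, the following Python fragment applies a backreference expression of the kind described and then filters by the 7-residue threshold. The function name `find_homopeptides` and the returned tuple layout are illustrative assumptions, not part of the original pipeline; the character class lists the 20 standard amino acids in both cases.

```python
import re

# One residue captured, then the same residue repeated one or more times.
HOMOPEPTIDE = re.compile(r"([AaVvLlIiPpMmFfWwGgSsTtCcNnQqYyDdEeKkRrHh])\1+")

MIN_LENGTH = 7  # minimum repeat length stated in the text


def find_homopeptides(sequence, min_length=MIN_LENGTH):
    """Return (residue, start, length) for each run of at least min_length.

    start is a 0-based offset into the sequence (an assumption of this sketch).
    """
    hits = []
    for match in HOMOPEPTIDE.finditer(sequence):
        length = match.end() - match.start()
        if length >= min_length:
            hits.append((match.group(1), match.start(), length))
    return hits
```

Note that the backreference is case-sensitive, so a run mixing upper- and lower-case letters of the same residue would be split; normalizing case before searching would avoid this.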

Evolution of RCPs

All RCPs from prokarya were used in an all-against-all BLASTp search with the following parameters: -F T (on), e = 0.001. Single-linkage clustering of the BLASTp results was then performed using a similar approach to the package GeneRAGE (Enright and Ouzounis 2000). That is, an N × N matrix was created from the BLASTp results and forced into symmetry. Clustering was performed across the matrix.
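The clustering step can be sketched as follows. This is not the GeneRAGE implementation; it is a minimal illustration under two assumptions: that "forced into symmetry" means a significant hit in either direction links two sequences, and that single-linkage clusters correspond to the connected components of the resulting matrix. The function name and input format are hypothetical.

```python
def single_linkage_clusters(n, hits):
    """Cluster n sequences from (query_index, subject_index) BLASTp hit pairs.

    Assumption: a hit in either direction is enough to link two sequences.
    """
    # Build the N x N matrix and force it into symmetry.
    linked = [[False] * n for _ in range(n)]
    for i, j in hits:
        linked[i][j] = True
        linked[j][i] = True

    # Single linkage: each connected component of the matrix is one cluster.
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack = [start]
        members = []
        while stack:
            node = stack.pop()
            members.append(node)
            for other in range(n):
                if linked[node][other] and not seen[other]:
                    seen[other] = True
                    stack.append(other)
        clusters.append(sorted(members))
    return clusters
```

A sequence with no significant hit to any other sequence ends up in a cluster of size one, which corresponds to an "orphan" RCP in the sense used in this study.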

When considering eukaryote RCPs, only completed or near-complete genomes were used, so as to avoid potential bias due to overrepresentation of commonly studied protein families.

Thus, the following species were considered: Homo sapiens, Mus musculus, Rattus norvegicus, Gallus gallus, Danio rerio, Xenopus laevis, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Saccharomyces cerevisiae, Plasmodium falciparum, Oryza sativa, Triticum aestivum, and Arabidopsis thaliana.

The longest sequence from each of the prokaryote clusters was used as a probe to search GENPEPT using PSI-BLAST. The following parameters were used: j = 5, b = 100,000, e = 0.001, and -F T. All sequences with significant expect scores (<0.001) (Park et al. 1998) from the sequenced eukaryote organisms (listed above) were selected. Sequences that were >98% identical were then removed using the nrdb90 algorithm (Holm and Sander 1998). For each eukaryote species, the top 10 putative homologs (with the lowest expect scores) were used for the multiple-sequence alignment. The alignments were initially generated using CLUSTALW (Thompson et al. 1994) with default parameters. Inspection of the alignments enabled identification of eukaryote putative homologs that contained a repeat or an equivalent amino acid-rich tract; these were subject to further analysis. The alignments shown in Figures 4, 5, and 6 were initially generated with T-Coffee (Notredame et al. 2000), then manually curated.
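The hit-selection step (expect score below 0.001, then the 10 lowest-scoring putative homologs per species) can be sketched as below. The input format, a list of `(species, accession, expect)` tuples already parsed from PSI-BLAST output and already filtered for redundancy, is an assumption of this sketch, as is the function name.

```python
def top_hits_per_species(hits, max_expect=0.001, per_species=10):
    """Keep significant hits and return the best accessions for each species.

    hits -- iterable of (species, accession, expect) tuples (hypothetical format)
    """
    by_species = {}
    for species, accession, expect in hits:
        if expect < max_expect:  # significance cutoff used in the text
            by_species.setdefault(species, []).append((expect, accession))
    # Sort each species' hits by expect score and keep the lowest ones.
    return {
        species: [accession for _, accession in sorted(rows)[:per_species]]
        for species, rows in by_species.items()
    }
```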

Analysis of repeat pairs

In order to explore whether certain repeat types have a tendency to appear together within a given protein, the colocalization of repeat pairs was investigated. PSI-BLAST (three iterations, inclusion threshold E < 0.02, reporting threshold E < 1 × 10^-6) was used to obtain clusters of sequences from a database of multiple repeat-containing proteins, where each pairwise alignment spanned >50% of the length of the smaller sequence. For each unique repeat pair identified in a protein, a score was given as the reciprocal of the number of related proteins; i.e., a protein with the repeats TNNNNNK and related to four others would increment the scores for repeat pairs TN, TK, NN, and NK by 0.25. This was performed for all multiple repeat-containing proteins.
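The scoring scheme can be sketched as below, reproducing the worked example from the text. This is an illustration under stated assumptions: the function name and input layout are hypothetical, pairs are treated as unordered and stored in alphabetical order (so TN appears as "NT"), and a protein with no related sequences is assumed to contribute a full score of 1, a case the text does not specify.

```python
from collections import defaultdict
from itertools import combinations


def score_repeat_pairs(proteins):
    """Accumulate colocalization scores for repeat pairs.

    proteins -- iterable of (repeats, n_related), where `repeats` is a string
                of one-letter repeat types in order of occurrence and
                `n_related` is the number of related proteins
    """
    scores = defaultdict(float)
    for repeats, n_related in proteins:
        # Unique unordered pairs, each stored as a sorted two-letter key.
        pairs = {"".join(sorted(pair)) for pair in combinations(repeats, 2)}
        # Assumption: a protein with no relatives contributes a weight of 1.
        weight = 1.0 / max(n_related, 1)
        for pair in pairs:
            scores[pair] += weight
    return dict(scores)


if __name__ == "__main__":
    # Example from the text: repeats TNNNNNK, related to four other proteins.
    print(score_repeat_pairs([("TNNNNNK", 4)]))
```

Down-weighting by the number of related proteins keeps a large family of near-identical sequences from dominating the score of any one repeat pair.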

Functional annotation

Existing information within OMIM (http://www.ncbi.nlm.nih.gov/omim/) and HPRD (http://www.hprd.org/) was used to annotate the human data set, FlyBase (FlyBase Consortium 2003) to annotate the Drosophila data set, and Wormpep (http://www.sanger.ac.uk/Projects/C_elegans/WORMBASE/current/wormpep.shtml) to annotate the nematode data set. In order to annotate prokaryote sequences, the longest member from each of the 1435 clusters was used as a probe in BLASTp searches of GENPEPT (using the default parameters). The results were manually analyzed, and a putative general function was assigned if a cluster shared significant similarity with a characterized family (only hits with an expect score <0.001 were considered). Literature searches were used to establish general function where required.


Acknowledgments

J.C.W. is a National Health and Medical Research Council of Australia Senior Research Fellow and Monash University Logan Fellow. S.P.B. is an NHMRC R.D. Wright Fellow and Monash University Logan Fellow. J.A.I. is an Anti-Cancer Council of Victoria Fellow, Monash University Research Fund Fellow, and NHMRC C.J. Martin Fellow. M.G.B. is a Monash University Logan Fellow. We thank the NHMRC, the Australian Research Council, the Victorian Partnership for Advanced Computing, and the State Government of Victoria for support. We thank Sophie Katsabanis for discussion and comment on the manuscript and Michael Cameron, Michelle Dunstone, Sheena McGowan, and Michelle Chow for helpful discussion.


Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3096505.


[Supplemental material is available online at www.genome.org.]


References

  • Akey, C.W. and Luger, K. 2003. Histone chaperones and nucleosome assembly. Curr. Opin. Struct. Biol. 13: 6-14.
  • Alba, M.M. and Guigo, R. 2004. Comparative analysis of amino acid repeats in rodents and humans. Genome Res. 14: 549-554.
  • Alba, M.M., Laskowski, R.A., and Hancock, J.M. 2002. Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics 18: 672-678.
  • Barton, G.J. 1993. ALSCRIPT: A tool to format multiple sequence alignments. Protein Eng. 6: 37-40.
  • Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141.
  • Becher, M.W., Kotzuk, J.A., Sharp, A.H., Davies, S.W., Bates, G.P., Price, D.L., and Ross, C.A. 1998. Intranuclear neuronal inclusions in Huntington's disease and dentatorubral and pallidoluysian atrophy: Correlation between the density of inclusions and IT15 CAG triplet repeat length. Neurobiol. Dis. 4: 387-397.
  • Brown, L.Y. and Brown, S.A. 2004. Alanine tracts: The expanding story of human illness and trinucleotide repeats. Trends Genet. 20: 51-58.
  • Burack, W.R. and Shaw, A.S. 2000. Signal transduction: Hanging on a scaffold. Curr. Opin. Cell. Biol. 12: 211-216.
  • Calnan, B.J., Tidor, B., Biancalana, S., Hudson, D., and Frankel, A.D. 1991. Arginine-mediated RNA recognition: The arginine fork. Science 252: 1167-1171.
  • Chow, M.K., Ellisdon, A.M., Cabrita, L.D., and Bottomley, S.P. 2004a. Polyglutamine expansion in Ataxin-3 does not affect protein stability: Implications for misfolding and disease. J. Biol. Chem. 279: 47643-47651.
  • Chow, M.K., Lomas, D.A., and Bottomley, S.P. 2004b. Promiscuous β-strand interactions and the conformational diseases. Curr. Med. Chem. 11: 491-499.
  • Chow, M.K., Paulson, H.L., and Bottomley, S.P. 2004c. Destabilization of a non-pathological variant of ataxin-3 results in fibrillogenesis via a partially folded intermediate: A model for misfolding in polyglutamine disease. J. Mol. Biol. 335: 333-341.
  • Craig, E.A., Weissman, J.S., and Horwich, A.L. 1994. Heat shock proteins and molecular chaperones: Mediators of protein conformation and turnover in the cell. Cell 78: 365-372.
  • Cummings, C.J. and Zoghbi, H.Y. 2000. Fourteen and counting: Unraveling trinucleotide repeat diseases. Hum. Mol. Genet. 9: 909-916.
  • Dorsman, J.C., Pepers, B., Langenberg, D., Kerkdijk, H., Ijszenga, M., den Dunnen, J.T., Roos, R.A., and van Ommen, G.J. 2002. Strong aggregation and increased toxicity of polyleucine over polyglutamine stretches in mammalian cells. Hum. Mol. Genet. 11: 1487-1496.
  • Enright, A.J. and Ouzounis, C.A. 2000. GeneRAGE: A robust algorithm for sequence clustering and domain detection. Bioinformatics 16: 451-457.
  • Fan, X., Dion, P., Laganiere, J., Brais, B., and Rouleau, G.A. 2001. Oligomerization of polyalanine expanded PABPN1 facilitates nuclear protein aggregation that is associated with cell death. Hum. Mol. Genet. 10: 2341-2351.
  • Fandrich, M. and Dobson, C.M. 2002. The behaviour of polyamino acids reveals an inverse side chain effect in amyloid structure formation. EMBO J. 21: 5682-5690.
  • FlyBase Consortium. 2003. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 31: 172-175.
  • Friedl, J.E.F. 2002. Mastering regular expressions. O'Reilly, Sebastopol, CA.
  • Gerber, H.P., Seipel, K., Georgiev, O., Hofferer, M., Hug, M., Rusconi, S., and Schaffner, W. 1994. Transcriptional activation modulated by homopolymeric glutamine and proline stretches. Science 263: 808-811.
  • Giri, K., Ghosh, U., Bhattacharyya, N.P., and Basak, S. 2003. Caspase 8 mediated apoptotic cell death induced by β-sheet forming polyalanine peptides. FEBS Lett. 555: 380-384.
  • Gonzalo, P. and Reboud, J.P. 2003. The puzzling lateral flexible stalk of the ribosome. Biol. Cell 95: 179-193.
  • Gotoh, T., Terada, K., Oyadomari, S., and Mori, M. 2004. hsp70-DnaJ chaperone pair prevents nitric oxide- and CHOP-induced apoptosis by inhibiting translocation of Bax to mitochondria. Cell Death Differ. 11: 390-402.
  • Gragerov, A., Nudler, E., Komissarova, N., Gaitanaris, G.A., Gottesman, M.E., and Nikiforov, V. 1992. Cooperation of GroEL/GroES and DnaK/DnaJ heat shock proteins in preventing protein misfolding in Escherichia coli. Proc. Natl. Acad. Sci. 89: 10341-10344.
  • Grigoryev, S.A. 2004. Keeping fingers crossed: Heterochromatin spreading through interdigitation of nucleosome arrays. FEBS Lett. 564: 4-8.
  • Hendrick, J.P., Langer, T., Davis, T.A., Hartl, F.U., and Wiedmann, M. 1993. Control of folding and membrane translocation by binding of the chaperone DnaJ to nascent polypeptides. Proc. Natl. Acad. Sci. 90: 10216-10220.
  • Holm, L. and Sander, C. 1998. Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 14: 423-429.
  • Holmberg, M., Duyckaerts, C., Durr, A., Cancel, G., Gourfinkel-An, I., Damier, P., Faucheux, B., Trottier, Y., Hirsch, E.C., Agid, Y., et al. 1998. Spinocerebellar ataxia type 7 (SCA7): A neurodegenerative disorder with neuronal intranuclear inclusions. Hum. Mol. Genet. 7: 913-918.
  • Huntley, M. and Golding, G.B. 2000. Evolution of simple sequence in proteins. J. Mol. Evol. 51: 131-140.
  • Huntley, M. and Golding, G.B. 2002. Simple sequences are rare in the Protein Data Bank. Proteins 48: 134-140.
  • Inoue, K. and Keegstra, K. 2003. A polyglycine stretch is necessary for proper targeting of the protein translocation channel precursor to the outer envelope membrane of chloroplasts. Plant J. 34: 661-669.
  • Karlin, S. and Burge, C. 1996. Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc. Natl. Acad. Sci. 93: 1560-1565.
  • Korschen, H.G., Beyermann, M., Muller, F., Heck, M., Vantler, M., Koch, K.W., Kellner, R., Wolfrum, U., Bode, C., Hofmann, K.P., et al. 1999. Interaction of glutamic-acid-rich proteins with the cGMP signalling pathway in rod photoreceptors. Nature 400: 761-766.
  • Kreil, D.P. and Kreil, G. 2000. Asparagine repeats are rare in mammalian proteins. Trends Biochem. Sci. 25: 270-271.
  • Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P., and Bork, P. 2004. SMART 4.0: Towards genomic data integration. Nucleic Acids Res. 32: D142-D144.
  • Li, M., Miwa, S., Kobayashi, Y., Merry, D.E., Yamamoto, M., Tanaka, F., Doyu, M., Hashizume, Y., Fischbeck, K.H., and Sobue, G. 1998. Nuclear inclusions of the androgen receptor protein in spinal and bulbar muscular atrophy. Ann. Neurol. 44: 249-254.
  • Mar Alba, M., Santibanez-Koref, M.F., and Hancock, J.M. 1999. Amino acid reiterations in yeast are overrepresented in particular classes of proteins and show evidence of a slippage-like mutational process. J. Mol. Evol. 49: 789-797.
  • Marcotte, E.M., Pellegrini, M., Yeates, T.O., and Eisenberg, D. 1999. A census of protein repeats. J. Mol. Biol. 293: 151-160.
  • Nam, Y.S., Petrovic, A., Jeong, K.S., and Venkatesan, S. 2001. Exchange of the basic domain of human immunodeficiency virus type 1 Rev for a polyarginine stretch expands the RNA binding specificity, and a minimal arginine cluster is required for optimal RRE RNA binding affinity, nuclear accumulation, and trans-activation. J. Virol. 75: 2957-2971.
  • Notredame, C., Higgins, D.G., and Heringa, J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302: 205-217.
  • Oma, Y., Kino, Y., Sasagawa, N., and Ishiura, S. 2004. Intracellular localization of homopolymeric amino acid-containing proteins expressed in mammalian cells. J. Biol. Chem. 279: 21217-21222.
  • Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., and Chothia, C. 1998. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol. 284: 1201-1210.
  • Poetsch, A., Molday, L.L., and Molday, R.S. 2001. The cGMP-gated channel and related glutamic acid-rich proteins interact with peripherin-2 at the rim region of rod photoreceptor disc membranes. J. Biol. Chem. 276: 48009-48016.
  • Pollard, T.D. 2000. Reflections on a quarter century of research on contractile systems. Trends Biochem. Sci. 25: 607-611.
  • Ramirez, C., Shimmin, L.C., Newton, C.H., Matheson, A.T., and Dennis, P.P. 1989. Structure and evolution of the L11, L1, L10, and L12 equivalent ribosomal proteins in eubacteria, archaebacteria, and eucaryotes. Can. J. Microbiol. 35: 234-244.
  • Remacha, M., Jimenez-Diaz, A., Bermejo, B., Rodriguez-Gabriel, M.A., Guarinos, E., and Ballesta, J.P. 1995. Ribosomal acidic phosphoproteins P1 and P2 are not required for cell viability but regulate the pattern of protein expression in Saccharomyces cerevisiae. Mol. Cell. Biol. 15: 4754-4762.
  • Scherzinger, E., Lurz, R., Turmaine, M., Mangiarini, L., Hollenbach, B., Hasenbank, R., Bates, G.P., Davies, S.W., Lehrach, H., and Wanker, E.E. 1997. Huntingtin-encoded polyglutamine expansions form amyloid-like protein aggregates in vitro and in vivo. Cell 90: 549-558.
  • Skinner, P.J., Koshy, B.T., Cummings, C.J., Klement, I.A., Helin, K., Servadio, A., Zoghbi, H.Y., and Orr, H.T. 1997. Ataxin-1 with an expanded glutamine tract alters nuclear matrix-associated structures. Nature 389: 971-974.
  • Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680.
  • Tolhuis, B., Palstra, R.J., Splinter, E., Grosveld, F., and de Laat, W. 2002. Looping and interaction between hypersensitive sites in the active β-globin locus. Mol. Cell 10: 1453-1465.
  • Uchiumi, T., Honma, S., Endo, Y., and Hachimori, A. 2002a. Ribosomal proteins at the stalk region modulate functional rRNA structures in the GTPase center. J. Biol. Chem. 277: 41401-41409.
  • Uchiumi, T., Honma, S., Nomura, T., Dabbs, E.R., and Hachimori, A. 2002b. Translation elongation by a hybrid ribosome in which proteins at the GTPase center of the Escherichia coli ribosome are replaced with rat counterparts. J. Biol. Chem. 277: 3857-3862.
  • Venter, J.C., Remington, K., Heidelberg, J.F., Halpern, A.L., Rusch, D., Eisen, J.A., Wu, D., Paulsen, I., Nelson, K.E., Nelson, W., et al. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66-74.
  • Wall, D., Zylicz, M., and Georgopoulos, C. 1995. The conserved G/F motif of the DnaJ chaperone is necessary for the activation of the substrate binding properties of the DnaK chaperone. J. Biol. Chem. 270: 2139-2144.
  • Warrick, J.M., Paulson, H.L., Gray-Board, G.L., Bui, Q.T., Fischbeck, K.H., Pittman, R.N., and Bonini, N.M. 1998. Expanded polyglutamine protein forms nuclear inclusions and causes neural degeneration in Drosophila. Cell 93: 939-949.
  • Wetzel, R. 2002. Ideas of order for amyloid fibril structure. Structure 10: 1031-1036.
  • Whisstock, J.C. and Lesk, A.M. 2003. Prediction of protein function from protein sequence and structure. Q Rev. Biophys. 36: 307-340.
  • Whisstock, J., Skinner, R., and Lesk, A.M. 1998. An atlas of serpin conformations. Trends Biochem. Sci. 23: 63-67.

Web site references

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press