Mol Biol Evol. Sep 2010; 27(9): 2198–2209.
Published online Apr 28, 2010. doi:  10.1093/molbev/msq108
PMCID: PMC2922621

Low-Complexity Regions in Plasmodium falciparum: Missing Links in the Evolution of an Extreme Genome


Over the past decade, attempts to explain the unusual size and prevalence of low-complexity regions (LCRs) in the proteins of the human malaria parasite Plasmodium falciparum have used both neutral and adaptive models. This past research has offered conflicting explanations for LCR characteristics and their role in, and influence on, the evolution of genome structure. Here we show that P. falciparum LCRs (PfLCRs) are not a single phenomenon, but rather consist of at least three distinct types of sequence, and this heterogeneity is the source of the conflict in the literature. Using molecular and population genetics, we show that these families of PfLCRs are evolving by different mechanisms. One of these families, named here the HighGC family, is of particular interest because these LCRs act as recombination hotspots, both in genes under positive selection for high levels of diversity which can be created by recombination (antigens) and those likely to be evolving neutrally or under negative selection (metabolic enzymes). We discuss how the discovery of these distinct species of PfLCRs helps to resolve previous contradictory studies on LCRs in malaria and contributes to our understanding of the evolution of the parasite's unusual genome.

Keywords: Plasmodium falciparum, low-complexity regions, repeat sequences, genome evolution, recombination


Regions of reduced (low) complexity in protein sequence, also known as compositionally biased regions or nonglobular domains, are composed of a nonrandom, limited alphabet of amino acids (Wootton and Federhen 1993; Wootton 1994). Usually located in loop regions between motifs in secondary structure, these sequences are predominantly hydrophilic and tend to be flexible regions exposed to the external solvent (Aravind et al. 2003). These low-complexity regions (LCRs) evolve rapidly with few constraints on size or exact amino acid sequence (Wootton 1994) and can be composed of homopolymeric repeats of a single amino acid, heteropolymers of a short repeat sequence of residues, or aperiodic mosaics of a few amino acids (Wootton and Federhen 1993). Because they are local regions of reduced complexity, LCRs are also defined in information theory as contiguous stretches of sequence with reduced information content as measured by their Shannon entropy (Wootton and Federhen 1993).
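The information-theoretic definition above can be illustrated with a minimal Python sketch that computes the Shannon entropy of residue composition in sliding windows. This is not the SEG algorithm itself, only the underlying entropy measure; the function names and the window length are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    # Each residue type contributes -p * log2(p), where p is its frequency.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def window_entropies(seq, window=12):
    """Entropy of every sliding window of the given length along seq.

    The window length of 12 is an illustrative choice, not a value
    taken from the study.
    """
    return [
        shannon_entropy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]
```

A homopolymeric window (for example, a run of asparagine) has entropy 0, whereas a window in which every residue is different reaches the maximum for its length; low-complexity stretches are those where the windowed entropy stays low.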

Until recently, LCRs were of relatively little interest to biochemists or geneticists. It has been argued that these regions should be filtered from databases and database searches (using algorithms such as SEG [Wootton and Federhen 1993]) because they evolve rapidly (reducing the reliability of homology searches) and because their amino acid content reflects their hydrophilic functional constraints, rather than the unique properties of the surrounding sequence (Altschul et al. 1994). Recently, however, interest in LCRs intensified when studies revealed their unusually high frequency in the proteins of the virulent human malaria parasite, Plasmodium falciparum (Gardner et al. 1998; Bowman et al. 1999; Pizzi and Frontali 2001), and that insertions in these regions account for the greater length of P. falciparum proteins compared with orthologs in Saccharomyces cerevisiae (Aravind et al. 2003) and Plasmodium vivax (Carlton et al. 2008). The P. falciparum proteome is one of the most LCR rich among eukaryotes, with 87% of genes containing at least one LCR, as opposed to an average of 65–70% (DePristo et al. 2006) in other eukaryotes.

The scientific literature describes P. falciparum low-complexity regions (PfLCRs) as sequences with elevated AT composition (Xue and Forsdyke 2003; DePristo et al. 2006) comprising aperiodic sequence or short tandem repeats (Bowman et al. 1999; Pizzi and Frontali 2001; DePristo et al. 2006). They are, additionally, described as being highly divergent from orthologous regions (Bowman et al. 1999; Pizzi and Frontali 2001; DePristo et al. 2006), and frequently, if not universally, polymorphic in terms of size (Kaneko et al. 1997; Bowman et al. 1999; Hughes 2004).

Falciparum malaria is widespread, with 40% of the human population at risk of developing the disease (Toure and Oduola 2004). This public health problem is aggravated by the parasite's ability to quickly adapt to drugs and the human immune system, resulting in antigenic diversity and variation that makes vaccine development difficult (Freitas-Junior et al. 2000; Ferreira et al. 2003, 2004). Elucidating the genetic mechanisms behind these rapid adaptations is of key interest to evolutionary geneticists and medical science.

In an effort to connect the parasite's adaptations and its genetic evolution, studies of the P. falciparum genome have revealed a number of highly unusual features: a high average AT-content (average 75% in coding sequence, 87% in noncoding [Gardner et al. 2002; McCutchan et al. 1984]); a high recombination rate, a minimum of 17 kilobases per centimorgan (Su et al. 1999); and a large portion of protein sequence found in LCRs, about 20% of all residues (DePristo et al. 2006).

Several studies have sought to explain the unusual length and frequency of P. falciparum LCRs in proteins. Some works have emphasized an adaptive significance to PfLCRs, positing their use in immune evasion (Kemp et al. 1987; Hughes 2004); others have taken a more mechanistic approach proposing that they substitute for introns (Xue and Forsdyke 2003) or compensate for a reduced transfer RNA (tRNA) repertoire (Frugier et al. 2010); whereas still others propose that LCRs may have no function or adaptive significance (DePristo et al. 2006).

Here we report the first population-level study of LCRs in P. falciparum. We present a new conceptual framework for the evolution of PfLCRs, one that allows for increased resolution among existing models in the literature. For this study, we looked at a total of 34 LCRs in 28 genes across an array of P. falciparum genomes sampled from populations worldwide. Overall, we find that regions of reduced complexity in P. falciparum proteins are not a single type of sequence, as previously reported summary statistics suggest (Pizzi and Frontali 2001; DePristo et al. 2006) but represent at least three distinct families of sequence with nonoverlapping identities and modes of evolution.

These sequence families comprise the following three groups: the “Heterogeneous” family, characterized by LCRs that are aperiodic regions with a reduced alphabet of amino acids, relatively high AT-content, but low diversity; the “PolyN” family, containing LCRs that show a high level of diversity among P. falciparum and are defined largely by repeats of asparagine (indicated by the single amino acid code N) residues of varying length; and the “HighGC” family, composed of sequences of low AT content but high diversity. Looking at each family individually, we find that most of the earlier explanations of PfLCRs are not relevant to all three. However, each individual family was found to support separate, existing models. The evolution of one of these three PfLCR families, HighGC, is of particular interest because these sequences are among the lowest in AT-content in the P. falciparum genome and appear to be highly recombinogenic. In addition, we show here that all recombinogenic repeat regions (included in LCRs) known are part of this group of P. falciparum sequences.

Materials and Methods

This study was performed in three stages, two involving intraspecific diversity studies and one examining interspecific divergence. The primary study examined the sequence diversity from 16 LCRs (table 2, in bold), from 16 worldwide isolates of P. falciparum (table 1). Sequence data from 14 of these isolates were generated using Sanger sequencing, and data from publicly available sequence reads were used for the two additional isolates (details below). To confirm the trends shown in these data, we used publicly available data to construct a data set for an additional 18 LCRs from a subset of five isolates also used in the previous stage. Evolutionary divergence was studied among closely related species, comparing our own P. falciparum sequences with publicly available sequence reads for P. reichenowi, and among the more distantly related species of P. berghei, P. yoelii, P. chabaudi, P. vivax, and P. knowlesi.

Table 1.
Sources of Genomic DNA Used in This Study.
Table 2.
LCR Data Summary.

Plasmodium falciparum Sample Selection and DNA Isolation

Plasmodium falciparum isolates in this study were chosen to represent a worldwide sampling of diversity (table 1). Isolates D6, 7G8, HB3, Santa Lucia, D10, Dd2, K1, Malayan Camp V1/S, 3D7, and FCB/FCR3 were obtained from the Malaria Research and Reagent Resource Repository (MR4) and were kept in continuous culture using standard methods (Trager and Jensen 1976). Three additional field isolates from Senegal, Africa, were adapted to culture using the same protocols and used to represent diversity within a population: SenP31.01, SenV34.04, and SenV35.04. DNA was isolated from these 14 samples as described in Volkman et al. (2007). Sequence data generated on site from these isolates were supplemented with sequence from the genomes of the IT and the PFCLIN isolates of P. falciparum (Jeffares et al. 2007), which is publicly available from the Wellcome Trust Sanger Institute (WTSI) (http://www.sanger.ac.uk/Projects/P_falciparum).

LCR Selection

A total of 34 LCRs were examined for levels of diversity and divergence within and among Plasmodium species. LCRs were first identified from a reference isolate, the fully sequenced genome of 3D7 (Gardner et al. 2002) accessed from PlasmoDB (version 4, http://www.plasmodb.org), using the SEG algorithm (Wootton and Federhen 1993) as described in DePristo et al. (2006). Regions used for this study were selected to represent a range of sizes, sequence complexity within the range of low complexity, and putative functions (table 2 and supplementary table 1, Supplementary Material online). In addition, the regions were selected from both subtelomeric chromosomal positions (regions of high recombination) and central positions (regions of relatively low recombination) (fig. 1A).

FIG. 1.
Loci and Plasmodium species used in the study. (A) Genes shown arranged by chromosomal position, both subtelomeric (areas of higher recombination) and central (regions of lower recombination) genes are represented in this study; (B) a phylogeny of the ...

For the first part of the study, using laboratory-generated sequence, only loci for which sequence data were obtainable from at least 15 of 16 P. falciparum isolates were used, resulting in 16 LCRs from 16 genes (table 2). In the second part of the study, a data set consisting of 18 LCRs from 14 genes was constructed (table 2 and supplementary table 1, Supplementary Material online); the LCRs used were only those found in all five sequenced genomes and for which, in the case of monomorphic loci, P. reichenowi sequence could be found to confirm the monomorphism.

DNA Sequence Generation, Acquisition, and Analysis: Plasmodium falciparum

Primary DNA sequences were generated on site using PCR and Sanger sequencing methods. PCR primers were designed with the aid of the program Primer3 (Rozen and Skaletsky 2000) and tested using the PCR simulation program Amplify (version 3.1.4) (Engels 1993) (primer sequences available in supplementary table 1, Supplementary Material online). PCR was performed using HotStarTaq Mastermix (Qiagen, Valencia, CA) on a DNA Engine Tetrad thermal cycler (MJ Research, CA). Sequencing reactions were prepared using BigDye terminator chemistry and sequence obtained using the ABIPrism 3100 and 3730 DNA capillary sequencers (Applied Biosystems, CA).

LCR sequence data from publicly available P. falciparum databases were assembled for the isolates HB3 and Dd2 (publicly available sequence was used only for the second set of 18 LCRs) from the P. falciparum Sequencing Project at the Broad Institute of Harvard and MIT (http://www.broadinstitute.org/annotation/genome/plasmodium_falciparum_spp/MultiHome.html) and for PFCLIN and IT (used for all LCRs in this study) from the WTSI (http://www.sanger.ac.uk/Projects/P_falciparum). Orthologous sequences were identified from these genomes using the amino acid sequences from 3D7 for each LCR (plus approximately 35 amino acids of flanking sequence on both the N- and C-terminus ends) and the tBLASTx algorithm (Altschul et al. 1990) in the BLAST search engine on the project Web sites. Only sequences nearly identical to the probe LCR and flanking sequences were used. The reference data from the publicly available 3D7 genome sequence (http://www.plasmodb.org; Kissinger et al. 2002) were used as the fifth sequence for the 18 additional LCRs used in the second part of the study (table 2).

Nucleotide sequences were edited using Sequencher v.4.2.2 software (Gene Codes Corporation, MI), and then manually aligned and converted to protein alignments using MacClade software version 4.06 (Maddison DR and Maddison WP 2000). Nucleotide frequencies for LCRs studied for polymorphism were measured directly from sequence data using the 3D7 reference sequence and MacClade software with the option to show Taxon List with the Taxon List options set to show nucleotide frequencies.

From these alignments, different alleles were identifiable by insertions and deletions (indels) in the LCRs (see examples in fig. 4). Here, an allele is defined as an LCR with a unique repeat structure. This structure can differ in the number of repeat units (size of LCR) or composition of the different repeat units. Regions of the same size but different composition have different ancestries and are considered to be distinct alleles. The genetic diversity at each locus was calculated using expected heterozygosity:

$$H = 1 - \sum_{i=1}^{k} p_i^{2}$$

where p_i is the frequency of the ith of k alleles.
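As a minimal sketch, expected heterozygosity can be computed from a list of allele calls, taking it here as one minus the sum of squared allele frequencies. That exact form (with no sample-size correction) is an assumption, and the function name and allele labels are hypothetical.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity for one locus.

    `alleles` holds one allele label per isolate (any hashable label).
    Computed as 1 - sum(p_i ** 2) over the k distinct alleles; the
    absence of a sample-size correction is an assumption of this sketch.
    """
    counts = Counter(alleles)
    n = len(alleles)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())
```

A monomorphic locus gives 0, and a locus with two alleles at equal frequency gives 0.5. With alleles defined by repeat structure, two regions of the same length but different composition would carry different labels.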

FIG. 4.
Illustrative examples of each family of PfLCRs shown as amino acid alignments.

DNA Sequence Generation, Acquisition, and Analysis in other Plasmodium Species

To assess the level of evolutionary divergence of all LCRs in this study, in addition to diversity, each LCR was compared with its ortholog (where available) in the closely related species, the chimpanzee parasite P. reichenowi. DNA from this parasite is extremely scarce and not widely available for use; however, sequence is available in the form of reads and assembled contigs from the P. reichenowi genome project from WTSI (http://www.sanger.ac.uk/Projects/P_reichenowi). Reads from this partially completed genome were retrieved from the Web site BLAST search engine in the same manner as for the P. falciparum orthologs; however, only reads were used because contigs usually break at the point of LCRs and do not contain them.

To determine the level of sequence conservation at LCRs, relative to flanking sequence, over a longer span of evolutionary time, orthologous complete sequences for P. berghei, P. yoelii, and P. chabaudi (rodent parasites) and P. vivax and P. knowlesi (human and monkey parasites, respectively) (fig. 1B) were identified using the OrthoMCL algorithm (Li et al. 2003) via the PlasmoDB database (version 6, http://www.plasmodb.org; Kissinger et al. 2002). Alignments were initially generated using CLUSTALW2 software (Larkin et al. 2007) with a BLOSUM30 matrix, a gap opening penalty of 10.0, and gap extension penalty of 2.0, and corrected manually in MacClade 4.0. Because P. falciparum is a divergent species from those used for comparison, several loci do not have orthologs and could not be used for this portion of the study: the merozoite surface proteins 3 and 8 (PF10_0345 and PFE0120c), the reticulocyte-binding protein 2 homologs A and B (PF13_0198 and MAL13P1.176), and the cation transporting ATPase 1 (PFE0805w).

Statistical Analyses

Global AT-content for all 28,607 LCRs in the 3D7 reference genome was assessed using the nucleotide sequences for each LCR identified using the SEG algorithm (DePristo et al. 2006). The nucleotide sequences were retrieved using a custom program in Python that located the regions based on their locations in the gene. Another custom Python program was then used to count all A+T nucleotides to measure base content.
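The custom programs themselves are not reproduced here. The following is only a minimal sketch of the two steps described, assuming each LCR is given as amino acid coordinates within its protein (0-based, end-exclusive) and that the gene's coding sequence is available as a string; the function names and the coordinate convention are hypothetical.

```python
def lcr_nucleotides(coding_seq, aa_start, aa_end):
    """Nucleotides encoding residues aa_start..aa_end-1 of a protein.

    Coordinates are 0-based and end-exclusive (an assumed convention);
    each amino acid corresponds to three nucleotides of coding sequence.
    """
    return coding_seq[3 * aa_start:3 * aa_end]


def at_content(seq):
    """Percent A+T among the unambiguous bases (A, C, G, T) of seq."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return float("nan")  # no countable bases in the region
    at = sum(1 for b in bases if b in "AT")
    return 100.0 * at / len(bases)
```

For example, `at_content(lcr_nucleotides(cds, start, end))` would give the percent AT for one region, where `cds`, `start`, and `end` stand for a gene's coding sequence and an LCR's coordinates.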

Histograms were constructed to examine the distribution of AT content in LCRs and coding regions of the P. falciparum genome using Microsoft Excel and R (http://www.R-project.org) software, and R was used to calculate the first through fourth moment for each distribution. Microsoft Excel was also used to calculate the mean and variance of AT-content for each group of LCRs. Data randomization for mean LCR AT content was performed using Mathematica 5.2 software (Wolfram 2005) using the Combinatorica package. In all, 30,000 randomized data sets were constructed from the actual AT-content of each LCR; these were then randomly split into data sets of the same size as the HighGC group and all other LCRs. Mean AT-content was assessed for each randomized group, and P values for observed mean AT-content for each LCR family were calculated based on the resulting distribution. Values for mean AT-content that were too extreme to be represented on the randomized distribution were given a P value of 0.
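The randomization was run in Mathematica; the following Python sketch shows only the general shape of such a test under stated assumptions. It draws random groups of the same size as the family of interest from the pooled AT-content values and reports the fraction whose mean is at least as low as the observed mean. The lower-tail direction, the function name, and the fixed seed are assumptions of this sketch.

```python
import random


def permutation_p_value(values, group_size, observed_mean,
                        n_perm=30000, seed=0):
    """Lower-tail randomization P value for a group mean.

    values        : pooled AT-content values for all LCRs
    group_size    : size of the group being tested
    observed_mean : observed mean AT-content of that group
    Returns the fraction of random groups whose mean is <= observed_mean.
    A result of 0.0 means no randomized group was as extreme.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(values, group_size)  # draw without replacement
        if sum(sample) / group_size <= observed_mean:
            hits += 1
    return hits / n_perm
```

An upper-tail version, for testing whether a group's mean is unusually high, would reverse the comparison.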


AT-content and PfLCRs

Average AT-content at LCRs has been previously shown to be high, approximately 81% (DePristo et al. 2006). We found that the full distribution of PfLCR AT-content diverges from normal, having a long tail into the region of very low AT (negative skew [−1.26] and high kurtosis [2.79]) and a concentration of frequencies at 89% and above. There is also a noticeable spike at 100% AT, showing that many PfLCRs diverge from the mean (higher variance = 100.7) (fig. 2).
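The moments quoted above were calculated in R. As a rough illustration only, the first four moments of a list of AT-content values can be summarized in Python as follows. The use of population (divide-by-n) moments and of excess kurtosis is an assumption of this sketch and may not match the conventions behind the reported values.

```python
def distribution_moments(values):
    """Mean, variance, skewness, and excess kurtosis of values.

    Uses population (divide-by-n) central moments; this convention is
    an assumption of the sketch.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return mean, m2, skewness, excess_kurtosis
```

A distribution with a long tail toward low values, as described for PfLCR AT-content, yields negative skewness.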

FIG. 2.
Distribution of percent AT-content of all 28,607 LCRs in Plasmodium falciparum.

The average percent AT-content for P. falciparum genes selected for this study (mean = 74.9, variance = 7.5) is close to the overall average reported for all P. falciparum coding sequence (mean = 75 [Gardner et al. 2002], variance = 18.8), indicating that the AT-content of the LCRs in this study is not a reflection of their flanking sequence.

Polymorphism at PfLCRs

We found no indel polymorphism in 17 of the 34 LCRs in this study, with the remaining 17 PfLCRs being highly polymorphic, with heterozygosity ranging from 0.531 (53.1%) to 0.914 (91.4%). When heterozygosity was compared with AT-content for the PfLCRs, three distinct clusters formed, shown in figure 3: a group with no heterozygosity, Heterogeneous, and two highly polymorphic groups, PolyN and HighGC (these families are described in detail below). LCRs from each cluster were represented in antigenic and nonantigenic loci (table 2) and were found both in subtelomeric regions, where recombination rate is the highest, and in central locations, where recombination rate is the lowest (fig. 1A).

FIG. 3.
Cluster diagram showing the distinct groupings of PfLCR sequences by percent AT-content and heterozygosity. The larger shapes represent the mean values for each family.

The two polymorphic LCR families can be distinguished by their heterozygosity and AT-content. The first family, composed of 6 PfLCRs, shows a high level of polymorphism, with an average heterozygosity of 0.694 and a high AT-content, 85.4% (higher than the mean, but not significantly so, P > 0.05, random permutation tests). The 11 remaining PfLCRs fall into a distinct second set of polymorphic sequences. This group is the most polymorphic, with an average heterozygosity of 0.732, and it is characterized by an unusually low AT-content (a mean of 69.4% [table 3], significantly lower than the mean; P = 0, random permutation tests).

Table 3.
Plasmodium falciparum Low-Complexity Region Family Characteristics.

Divergence at PfLCRs

When orthologous sequence from P. reichenowi was added to the alignments for the monomorphic sequences, eight showed no size differences, four showed insertions in the P. reichenowi lineage, and two showed insertions for P. falciparum (complete orthologous sequence could not be retrieved for the remaining three regions [table 2]). Some nonsynonymous base changes were shown as fixed differences when compared with P. reichenowi, averaging one amino acid change per LCR.

For the polymorphic PfLCRs, complete orthologous P. reichenowi sequence was found for 7 of 17 regions (2 for the high AT regions and 5 for those with very low AT-content), all showing indels. For the high heterozygosity and high AT-content regions, 2 P. reichenowi orthologs were retrievable, one with the region larger in P. reichenowi and the other with the P. reichenowi region within the range of sizes for P. falciparum (table 2). For the PfLCRs with low AT-content, four of the five were smaller in P. falciparum (table 2).

To confirm previous findings of indels at PfLCRs in orthologs among divergent species, alignments were made for the full gene sequences for genes with orthologs in five other Plasmodium species. Of the 28 LCRs in these ortholog alignments, 26 showed indels.

Three Families of LCRs in Plasmodium falciparum

When PfLCRs of either high or no polymorphism were examined in sequence alignments, they showed distinctive sequence features associated with their levels of diversity (fig. 4 and supplementary fig. S1, Supplementary Material online). The sequence characters, coupled with heterozygosity, show that PfLCRs fall into three distinct groups. The first group of PfLCRs is the monomorphic group, classified here as the Heterogeneous family (fig. 4A). The mean length of these regions (41 amino acids) and mean AT-content (81.6%) are the same averages found for all PfLCRs (DePristo et al. 2006) (table 3). They also appear to be typical of previous PfLCR descriptions in that they are clearly of reduced complexity but are aperiodic with few recognizable amino acid repeat patterns or motifs (Wootton and Federhen 1993).

The second group of PfLCRs consists of the high AT-content/high heterozygosity cluster shown in figure 3. These regions are of average size (about 37 amino acids) and are largely composed of trinucleotide repeats (codon expansions) that code for asparagine (PolyN stretches) (fig. 4B and supplementary fig. S1B, Supplementary Material online). Because the repeats in these regions are in such small units, they are classified as microsatellites. The indel polymorphisms in these regions are characterized by stepwise changes in the size of the repeat unit, a single amino acid or codon (fig. 4B and supplementary fig. S1B, Supplementary Material online). The elevated AT-content in these regions is expected because asparagine is 2-fold degenerate and coded for by the codons AAT and AAC. These LCRs are a mixture of tandem arrays of either or both of the asparagine codons, often both (see example in supplementary fig. S1B, Supplementary Material online). We refer to this group as the PolyN family of PfLCRs (table 3).
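The link between asparagine codon usage and AT-content can be made concrete with a short Python sketch. AAT is entirely A/T and AAC is two-thirds A/T, so a pure asparagine tract can range from about 66.7% to 100% AT depending on the codon mixture. The function name and example counts are illustrative, not data from the study.

```python
def polyn_at_content(n_aat, n_aac):
    """Percent A+T of a tract built from n_aat AAT and n_aac AAC codons."""
    total_bases = 3 * (n_aat + n_aac)
    at_bases = 3 * n_aat + 2 * n_aac  # AAT has 3 A/T bases, AAC has 2
    return 100.0 * at_bases / total_bases
```

For instance, a tract with equal numbers of the two codons comes out at roughly 83.3% AT, which is above the genome-wide coding average of about 75% cited earlier.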

The third family of PfLCRs is the HighGC regions. These regions are the largest of the PfLCRs studied here, about twice the size on average (90 amino acids) as the other regions and are dominated by imperfect arrays of repeats whose unit lengths are longer than those of microsatellites (which are defined as being no longer than 12 base pairs per unit and usually di- or trinucleotide repeats) (Brown 2002). The HighGC repeat units are usually 21 or 24 base pairs in length and are consequently classified as minisatellites. These alignments also show indels in the size of the repeat units (fig. 4C and supplementary fig. S1C, Supplementary Material online), like the PolyN repeats; however, insertions and deletions are much larger, and the repeats in these regions are heterogenous, showing a mixture of repeat units in a single LCR. At the nucleotide level, the repeat units are often distinct, showing synonymous base changes and retaining orthology with repeats among different isolates (fig. S1, Supplementary Material online).

Because the HighGC regions are minisatellites and show numerous indels, it is likely that recombination is occurring in these regions. To examine recombination at HighGC PfLCRs, known LCR recombination hotspots were located from the literature and their AT-content was measured. LCRs known to be frequent recombination breakpoints have been located in one pair of parasite red cell invasion proteins, reticulocyte-binding protein 2 homologs A and B (RBP2) (Cortes 2005; Rayner et al. 2005), and in three sets of surface antigens: merozoite surface protein 2 (MSP2) (Irion et al. 1997), the var genes (DePristo et al. 2006), and in between block 2 and block 3 of merozoite surface protein 1 (MSP1) (Zilversmit MM, unpublished data). All these recombinogenic LCRs show a high level of size polymorphism (Rayner et al. 2005; Kiwanuka 2009) and have very low AT-content: 70.8/70.9% (RBP2A/B), 43.9% (MSP1), a mean of 43.5% (standard deviation 4.7, var gene paralogs), and 54% (MSP2). Because of their GC richness and association with recombination, they are placed in the HighGC family (see fig. 5 and supplementary table 2, Supplementary Material online). The AT-content was also measured for the repeat region of the circumsporozoite protein (CSP), 72%, a surface protein expressed in a different life stage than all the others, and known not to be associated with frequent recombination (Rich et al. 1997).

FIG. 5.
LCRs known to be recombinogenic in Plasmodium falciparum have low AT-content. AT-content distributions of known recombination breakpoints in LCRs, projected on the AT-content of all LCRs in the P. falciparum genome: Reticulocyte-binding protein 2 homologs ...


Previous work has described LCRs in P. falciparum proteins as polymorphic (Kaneko et al. 1997; Bowman et al. 1999; Hughes 2004), AT-rich regions (Xue and Forsdyke 2003; DePristo et al. 2006), evolving by replication slippage coupled with recombination (DePristo et al. 2006). The results presented here show it is rare that one of these protein regions fits this description. Rather, PfLCRs are a diverse, but structured, collection of three groups of LCRs, with regions in each of these categories satisfying only some of the characteristics globally attributed to them all.

Diversity of PfLCRs

The distribution of AT-content in PfLCRs shows that they are a diverse group of sequences. The high variance and asymmetrical distribution of AT-content in PfLCRs detected with the SEG algorithm reflects the confounding of data from at least three distinct types of LCRs. Previous work that describes PfLCRs as polymorphic is also shown here to be limited. Although many LCRs were found to be highly polymorphic (showing insertions and deletions in alignments), just as many showed no polymorphism. In addition, comparison with P. reichenowi sequence shows that they are not always, or even frequently, longer in P. falciparum; however, this may be a feature unique to the monophyletic group that contains these two species (Perkins and Schall 2002; Rich et al. 2009).

Establishing a structured set of LCR families that is based on specific sequence characteristics reveals the reason that broader descriptions failed to predict the diversity of PfLCRs. We designate these PfLCR groups as the “Heterogeneous” group, whose sequences are aperiodic with elevated AT-content; the “PolyN” group, characterized by poly-asparagine codon expansions (microsatellites), also with high percent AT; and the “HighGC” group, the most divergent group, with low AT-content (some of the lowest in the genome) and composed of imperfect minisatellite repeats (table 3 and figs. 3 and 4). Further explorations of these families also reveal that they have different evolutionary rates and mechanisms.

Evolution of PfLCRs

The Heterogeneous family of PfLCRs are the slowest evolving of all the regions, although they are evolving more quickly than flanking, high-complexity sequence, as indicated by the presence of indels in all but two LCRs when compared with the orthologous regions of distantly related Plasmodia. The relatively slow rate of evolution in these regions is shown by the lack of indel polymorphism and the lowest level of interspecific divergence; single-base change is likely the dominant mutational mechanism. This is not the case for the polymorphic, faster evolving PfLCRs, however.

The dominant evolutionary mechanism for the high AT-content polymorphic, PolyN, PfLCRs is likely to be replication slippage (slip-strand mispairing) during DNA replication. This is expected because the PolyN regions are microsatellites, and slippage is the most probable mechanism given the short period and homogeneity of the repeats; it is evidenced as well by the frequent insertions and deletions of a single codon in these regions (fig. 3B and supplementary fig. S1B, Supplementary Material online) (Richard and Paques 2000; Ellegren 2004). Replication slippage in microsatellite regions is one of the fastest mechanisms of molecular evolution, even showing changes in repeat number among parents and offspring (Jeffreys et al. 1988). This mechanism would explain both the rapid divergence and the high diversity observed in these regions.

The high level of heterozygosity and divergence in HighGC LCRs indicates that they are also evolving rapidly; however, evidence from these regions reveals that recombination is the probable dominant mechanism. Unfortunately, direct assessment of recombination in P. falciparum loci is difficult due to the paucity of markers (such as synonymous nucleotide changes) relative to the frequency of recombination breakpoints; thus, we use additional lines of evidence. A history of recombination in HighGC PfLCRs is shown by the presence of large insertions and deletions in the alignments, which are caused by unequal crossing-over during recombination. Unequal crossing-over during recombination is what creates minisatellite regions and has been shown to be the dominant evolutionary mechanism in these regions (Richard and Paques 2000; Ellegren 2004). In addition, all known LCR recombination breakpoints are in repeat regions of low AT-content, like these regions.

HighGC PfLCRs are found throughout the genome, in chromosomal regions that are structurally associated with recombination rates that are relatively high (subtelomeres) and low (central). Location does not explain the recombination; however, regions of locally elevated GC content have been associated with increased rates of local recombination and hotspots in most, if not all, eukaryote systems studied (Fullerton et al. 2001; Birdsell 2002; Benovoy et al. 2005). Although this association is widely accepted, the exact mechanism is still debated (Birdsell 2002). The mechanism with the most support is biased gene conversion, which states that the DNA repair process for AT/GC mismatches is biased toward repairing with GC (Birdsell 2002). Thus, the elevated GC content at these regions is most likely a by-product of the repair mechanism for the double-stranded breaks that occur during recombination, rather than a promoter of the mechanism itself.

Adaptation and PfLCRs

In the past, there has been conjecture on the evolution, and possible adaptive significance, of LCRs in P. falciparum (Kemp et al. 1987; Pizzi and Frontali 2001; Xue and Forsdyke 2003; Hughes 2004; DePristo et al. 2006). It is difficult to reconcile all the existing models, particularly considering those that use adaptive explanations with ones that favor a more neutral model. The four most prominent theories on the birth and retention of LCRs in the P. falciparum genome are (1) the rapid adaptation/smokescreen concept, which hypothesizes that PfLCRs exist to inhibit the host immune response (Kemp et al. 1987; Rich et al. 1997; Hughes 2004), (2) Xue and Forsdyke's (2003) model that PfLCRs are cryptic introns, present as nucleic acid–level adaptations to balance GC pressure for better RNA folding in intron-poor genes, (3) an idea posited by Frugier et al. (2010) that PfLCRs are tRNA “sponges” that compensate for a reduced number of tRNAs, and (4) the model of DePristo et al. (2006), which emphasizes neutral over adaptive evolution and states that PfLCR size and abundance is a result of a high AT-content genome with a high recombination rate.

Some of the first PfLCRs discussed in the literature were repeat regions found in the surface antigens MSP1, MSP2, and CSP. These loci were early candidates for vaccine targets, and their diversity and divergence were studied in detail at a time when few other genes in the P. falciparum genome were being studied in depth. When it was found that the CSP repeat region produced a nonproductive immune response (Nussenzweig et al. 1984), a hypothesis was proposed that repeats were adaptive (Verra and Hughes 1999; Ferreira et al. 2004; Hughes 2004). The presence of repeat regions in these three important proteins appeared to support an adaptive model, and when the abundance and large size of P. falciparum LCRs were revealed, the idea was extended to explain the ubiquity of these fast-evolving regions throughout the genome.

Recent work on genome-wide prevalence of PfLCRs (Xue and Forsdyke 2003; DePristo et al. 2006), however, does not support global adaptive hypotheses because too many LCRs are found in nonantigenic genes; as a result, broader theories came into favor. Our population-level work described here, however, gives new support to the idea that some, but not all, PfLCRs may be utilized to generate antigenic diversity, because HighGC PfLCRs can be recombination breakpoints in antigens. Recombination has been shown to generate antigenic diversity by creating new chimeric alleles, either through homologous or ectopic exchange (Freitas-Junior et al. 2000; Ferreira et al. 2004). The very low AT-content of the MSP1 and MSP2 repeat regions, and their associations with frequent recombination breakpoints (Irion et al. 1997; Zilversmit MM, unpublished data), indicate that they are HighGC regions. We do not find evidence, however, that HighGC PfLCRs (or any polymorphic LCRs) are preferentially found in antigens. It is notable that the CSP repeat region, possibly the most frequently studied PfLCR, is not associated with frequent recombination (Rich et al. 1997) and has a number of unusual, if not unique, characteristics that cause it to fall outside the classification system proposed here. This very large repeat region is composed of microsatellite amino acid units of NANP (asparagine, alanine, asparagine, proline), repeated between 30 and 50 times per gene, interspersed with units of NVDP (asparagine, valine, aspartic acid, proline) repeated two to four times per gene (Rich et al. 1997). Despite the clear pattern of tandem repeats in this region, it is more heterogeneous at the nucleotide level, with the NANP units coded for by 10 different codon combinations and the NVDP units by 4, and it shows greater conservation at the amino acid level relative to nucleotides than any other region in this study. These features, plus a relatively low AT-content (72%), indicate either that this region is unusual among PfLCRs or that it is a member of a group not detected in this study.
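The contrast between amino acid conservation and nucleotide heterogeneity in a repeat such as NANP can be made concrete by tallying how many distinct codon combinations encode each copy of the unit. The following is a minimal sketch of one way to do that count, not the procedure used in this study; the function names and the 48-nucleotide toy sequence are hypothetical and were made up for illustration (the sequence is not a real CSP allele).

```python
from collections import Counter

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA string; trailing partial codons are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def codon_variants(dna, unit):
    """Count the distinct nucleotide encodings of each non-overlapping
    occurrence of the amino acid repeat `unit` in the translated sequence."""
    dna = dna.upper()
    protein = translate(dna)
    variants = Counter()
    pos = protein.find(unit)
    while pos != -1:
        variants[dna[3 * pos:3 * (pos + len(unit))]] += 1
        pos = protein.find(unit, pos + len(unit))
    return variants


# Hypothetical toy sequence: three NANP units followed by one NVDP unit.
toy = "AATGCAAACCCA" + "AATGCCAATCCT" + "AATGCAAACCCA" + "AACGTAGATCCA"
nanp = codon_variants(toy, "NANP")
nvdp = codon_variants(toy, "NVDP")
```

In the toy sequence the three NANP copies are identical as protein but use two different codon combinations, which is the same kind of pattern, on a much smaller scale, as the 10 NANP and 4 NVDP combinations reported for the CSP repeat region.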

In 2006, DePristo et al. presented a nonadaptive model for the abundance and large size of PfLCRs, based on the overall high recombination rate and AT-content of the P. falciparum genome. The model states that LCRs are abundant because, in a genome that is of reduced nucleotide complexity, they can easily form when short tandem repeats begin to grow by replication slippage. Unequal crossing-over during recombination was proposed to be the mechanism by which the slippage-created LCRs would continue to grow to be unusually large. This model predicts that PfLCRs would be large, would be rich in high-AT codons with regions of codon expansion, and would show evidence of unequal crossing-over during recombination.

The DePristo model does accurately predict some of what we observe in the PolyN family of PfLCRs. Results presented here show that these regions are composed of AT-rich codon expansions, which are likely produced from the association of enough asparagine codons that they begin to frequently mispair during replication. It has been further proposed that the poly-asparagine runs in P. falciparum proteins serve a function, improving efficiency of translation in a system with a paucity of tRNAs (Frugier et al. 2010); however, no evidence has been presented that these codon expansions are preferentially caused or maintained by selection for this function. It is equally plausible that the PolyN regions seed and grow neutrally, even if they are able to compensate for the P. falciparum genome's limited repertoire of tRNAs, and more work is needed to explicitly test whether natural selection promotes or retains these regions.

The observations presented here on the repeat structure and heterozygosity of Heterogeneous PfLCRs, however, offer less support for the general applicability of the model of DePristo et al. The birth/expansion scenario could have occurred, but our evidence shows that it would have been in the more distant past, in the common ancestor of P. falciparum and P. reichenowi, before the two species split; that split has been estimated at anywhere from 5 to 7 million years ago to as recently as 10,000 years ago (Escalante et al. 1995; Rich et al. 2009). Thus, another model may be a better fit to describe these PfLCRs, such as that of Xue and Forsdyke (2003), which states that PfLCRs are nucleic acid adaptations that replace introns to allow for optimum folding in RNA. A test of this hypothesis would be an examination of regions orthologous with these Heterogeneous PfLCRs to assess if they are introns in other, related, organisms. The HighGC PfLCRs do not fit the specifics of the DePristo model, which is based on high AT-content sequence, because they represent some of the lowest AT-content in the P. falciparum genome. However, the evidence for recombination in these regions does support the prediction for an elevated recombination rate at PfLCRs.

The work we present here resolves some controversies in the literature and opens avenues for further study. A clear follow-up to this work is a full-scale genomic analysis of the distribution of each type of sequence from the PfLCR families (those presented here and possibly others) in genes under strong positive selection (antigens) versus genes under different selective pressures (e.g., housekeeping enzymes), to examine whether one type of PfLCR is overrepresented. Although it would be difficult to study heterozygosity in PfLCRs using comparative genomics because alignments in repeat regions are difficult, a systematic analysis of AT-content and repeat structure in the functional context of each gene could be used.
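As a rough illustration of what a systematic analysis of AT-content and repeat structure might involve, the sketch below computes overall and windowed AT-content for a nucleotide sequence and lists simple tandem repeats in a protein sequence. It is an assumption-laden sketch rather than the pipeline used in this study: the function names, the window size, the repeat thresholds (`max_unit`, `min_copies`), and the example sequence are all hypothetical choices made for illustration.

```python
import re


def at_content(dna):
    """Fraction of A and T among the unambiguous bases of a DNA string."""
    counted = [b for b in dna.upper() if b in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "AT" for b in counted) / len(counted)


def windowed_at(dna, window, step):
    """AT-content in successive windows, as (start, fraction) pairs."""
    return [
        (start, at_content(dna[start:start + window]))
        for start in range(0, len(dna) - window + 1, step)
    ]


def tandem_repeats(seq, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats.

    Units that are themselves made of a shorter repeated unit (for example
    "NN" inside a run of N) are skipped so each run is reported once.
    """
    hits = []
    for k in range(1, max_unit + 1):
        pattern = re.compile(r"(.{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if any(unit == unit[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


# Hypothetical protein fragment: a homopolymer run and a three-residue repeat.
example_hits = tandem_repeats("MKNNNNNNDEKPEKPEKPEKPLQ")
```

The regular expression only finds perfect repeats; real repeat arrays with substitutions or indels would need a tolerant method, so this is best read as a starting point for the per-gene survey suggested above.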

Other potentially important directions for further study involve recombination in PfLCRs. One important aspect to examine is whether a recombination motif exists in the high GC content recombinogenic regions, perhaps one similar to the GC-rich repeat motif found to be associated with human recombination hotspots (Myers et al. 2008). In addition to these broader studies, deeper work at the intraspecific level, evaluating sequences within and among populations, is needed to research the extent of concerted evolution at PfLCRs, which has been shown to be an important evolutionary mechanism in structured repeats in P. falciparum, and is associated with both recombination and replication slippage (Hughes 2004; Putaporntip et al. 2008, 2009).
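A search for a candidate recombination motif could start from a simple scan of both strands for a degenerate (IUPAC-coded) pattern. The sketch below shows one possible way to do this. The motif `CCNCC` and the test sequence are placeholders invented for illustration; they are not the motif reported by Myers et al. (2008), nor a motif identified in this study.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of a motif, including degenerate codes."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def find_motif(dna, motif):
    """Return sorted (position, strand) pairs for every motif occurrence.

    Positions are 0-based starts on the given (forward) sequence. A lookahead
    is used so that overlapping occurrences are all reported.
    """
    dna = dna.upper()
    hits = set()
    for strand, pattern_source in (("+", motif.upper()), ("-", reverse_complement(motif))):
        regex = re.compile("(?=(" + "".join(IUPAC[b] for b in pattern_source) + "))")
        hits.update((m.start(), strand) for m in regex.finditer(dna))
    return sorted(hits)


# Placeholder motif and sequence, for illustration only.
motif_hits = find_motif("ATATCCGCCATATGGAGGTAT", "CCNCC")
```

Comparing the density of such hits inside HighGC PfLCRs with the density in flanking sequence would be one simple way to ask whether a candidate motif is enriched in the recombinogenic regions.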

Here we show how a study combining computational and empirical methods reveals the reason for the disagreement in the recent literature on the evolution of LCRs in P. falciparum. Earlier works on PfLCRs described them with summary statistics that combined distinct forms of LCRs: aperiodic sequences and at least two different types of tandem repeats. Distinguishing LCRs that are regions of only reduced complexity from structured tandem repeats, and further distinguishing the two types of repeats, shows that much of this previous work is not in conflict but merely applies to only one type of LCR. Although there is no universal functional or adaptive explanation for PfLCRs, examining the different types of sequences reveals critical information for understanding protein evolution, recombination structure in the genome, and the potential for recombination as an important mechanism for generating diversity at antigenic loci.

Supplementary Material

Supplementary figure S1 and tables 1 and 2 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).


Acknowledgments

The work presented here was made possible by the technical assistance of Jehee Choi. The authors also wish to thank Drs Jane Carlton, Daniel Neafsey, Scott Roy, Christina Muirhead, Ferran Casals, and Rolf Zilversmit for important contributions through discussion and aid in the editing of the manuscript. We also wish to thank the two anonymous reviewers who made numerous helpful comments to enhance the quality of this paper and the clarity of its presentation. This work was supported by a National Institutes of Health (NIH) Genetics and Genomics Training Grant (to M.M.Z.), NIH grant (GM61351 to S.K.V., D.F.W., D.L.H.), NIH grant (GM079536 to D.L.H.), and a Human Frontiers in Science Grant (to P.A.).


References

  • Altschul SF, Boguski MS, Gish W, Wootton JC. Issues in searching molecular sequence databases. Nat Genet. 1994;6:119–129.
  • Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–410.
  • Aravind L, Iyer LM, Wellems TE, Miller LH. Plasmodium biology: genomic gleanings. Cell. 2003;115:771–785.
  • Benovoy D, Morris RT, Morin A, Drouin G. Ectopic gene conversions increase the G + C content of duplicated yeast and Arabidopsis genes. Mol Biol Evol. 2005;22:1865–1868.
  • Birdsell JA. Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol Biol Evol. 2002;19:1181–1197.
  • Bowman S, Lawson D, Basham D, Brown D, Chillingworth T, Churcher C, Craig A, Davies R, Devlin K, Feltwell T. The complete nucleotide sequence of chromosome 3 of Plasmodium falciparum. Nature. 1999;400:532–538.
  • Brown TA. Genomes. 2nd ed. New York: Garland Science; 2002.
  • Carlton JM, Adams JH, Silva JC, et al. (40 co-authors) Comparative genomics of the neglected human malaria parasite Plasmodium vivax. Nature. 2008;455:757–763.
  • Cortes A. A chimeric Plasmodium falciparum Pfnbp2b/Pfnbp2a gene originated during asexual growth. Int J Parasitol. 2005;35:125–130.
  • DePristo MA, Zilversmit MM, Hartl DL. On the abundance, amino acid composition, and evolutionary dynamics of low-complexity regions in proteins. Gene. 2006;378:19–30.
  • Ellegren H. Microsatellites: simple sequences with complex evolution. Nat Rev Genet. 2004;5:435–445.
  • Engels WR. Contributing software to the internet: the amplify program. Trends Biochem Sci. 1993;18:448–450.
  • Escalante AA, Barrio E, Ayala FJ. Evolutionary origin of human and primate malarias: evidence from the circumsporozoite protein gene. Mol Biol Evol. 1995;12:616–626.
  • Ferreira M, Nunes MS, Wunderlich G. Antigenic diversity and immune evasion by malaria parasites. Clin Diagn Lab Immunol. 2004;11:987–995.
  • Ferreira MU, Ribeiro WL, Tonon AP, Kawamoto F, Rich SM. Sequence diversity and evolution of the malaria vaccine candidate merozoite surface protein-1 (MSP-1) of Plasmodium falciparum. Gene. 2003;304:65–75.
  • Freitas-Junior LH, Bottius E, Pirrit LA, Deitsch KW, Scheidig C, Guinet F, Nehrbass U, Wellems TE, Scherf A. Frequent ectopic recombination of virulence factor genes in telomeric chromosome clusters of P. falciparum. Nature. 2000;407:1018–1022.
  • Frugier M, Bour T, Ayach M, Santos MA, Rudinger-Thirion J, Théobald-Dietrich A, Pizzi E. Low complexity regions behave as tRNA sponges to help co-translational folding of plasmodial proteins. FEBS Lett. 2010;584:448–454.
  • Fullerton SM, Bernardo Carvalho A, Clark AG. Local rates of recombination are positively correlated with GC content in the human genome. Mol Biol Evol. 2001;18:1139–1142.
  • Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498–511.
  • Gardner MJ, Tettelin H, Carucci DJ, et al. (27 co-authors) Chromosome 2 sequence of the human malaria parasite Plasmodium falciparum. Science. 1998;282:1126–1132.
  • Hughes AL. The evolution of amino acid repeat arrays in Plasmodium and other organisms. J Mol Evol. 2004;59:528–535.
  • Irion A, Beck HP, Felger I. New repeat unit and hotspot of recombination in FC27-type alleles of the gene coding for Plasmodium falciparum merozoite surface protein 2. Mol Biochem Parasitol. 1997;90:367–370.
  • Jeffares DC, Pain A, Berry A, et al. (15 co-authors) Genome variation and evolution of the malaria parasite Plasmodium falciparum. Nat Genet. 2007;39:120–125.
  • Jeffreys AJ, Royle NJ, Wilson V, Wong Z. Spontaneous mutation rates to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature. 1988;332:278–281.
  • Kaneko O, Kimura M, Kawamoto F, Ferreira MU, Tanabe K. Plasmodium falciparum: allelic variation in the merozoite surface protein 1 gene in wild isolates from Southern Vietnam. Exp Parasitol. 1997;86:45–57.
  • Kemp DJ, Coppel RL, Anders RF. Repetitive proteins and genes of malaria. Annu Rev Microbiol. 1987;41:181–208.
  • Kissinger JC, Brunk BP, Crabtree J, et al. (20 co-authors) The Plasmodium genome database. Nature. 2002;419:490–492.
  • Kiwanuka GN. Genetic diversity in Plasmodium falciparum merozoite surface protein 1 and 2 coding genes and its implications in malaria epidemiology: a review of published studies from 1997–2007. J Vector Borne Dis. 2009;46(1):1–12.
  • Larkin MA, Blackshields G, Brown NP, et al. (13 co-authors) ClustalW and ClustalX version 2. Bioinformatics. 2007;23:2947–2948.
  • Li L, Stoeckert CJ, Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189.
  • Maddison DR, Maddison WP. MacClade 4: analysis of phylogeny and character evolution. Sunderland (MA): Sinauer Associates; 2000.
  • McCutchan T, Dame J, Miller L, Barnwell J. Evolutionary relatedness of Plasmodium species as determined by the structure of DNA. Science. 1984;225:808–811.
  • Mu J, Awadalla P, Duan J, McGee KM, Joy DA, McVean GA, Su XZ. Recombination hotspots and population structure in Plasmodium falciparum. PLoS Biol. 2005;3:e335.
  • Myers S, Freeman C, Auton A, Donnelly P, McVean G. A common sequence motif associated with recombination hot spots and genome instability in humans. Nat Genet. 2008;40:1124–1129.
  • Nussenzweig RS, Nussenzweig V, Freeman RR. Towards the immunological control of human protozoal diseases. Philos Trans R Soc London B Biol Sci. 1984;307(1131):117–128.
  • Perkins SL, Schall JJ. A molecular phylogeny of malarial parasites recovered from cytochrome b gene sequences. J Parasitol. 2002;88:972–978.
  • Pizzi E, Frontali C. Low-complexity regions in Plasmodium falciparum proteins. Genome Res. 2001;11:218–229.
  • Putaporntip C, Jongwutiwes S, Hughes AL. Differential selective pressures on the merozoite surface protein 2 locus of Plasmodium falciparum in a low endemic area. Gene. 2008;427(1–2):51–57.
  • Putaporntip C, Jongwutiwes S, Hughes AL. Natural selection maintains a stable polymorphism at the circumsporozoite protein locus of Plasmodium falciparum in a low endemic area. Infect Genet Evol. 2009;9(4):567–573.
  • Rayner JC, Tran TM, Corredor V, Huber CS, Barnwell JW, Galinski MR. Dramatic difference in diversity between Plasmodium falciparum and Plasmodium vivax reticulocyte binding-like genes. Am J Trop Med Hyg. 2005;72:666–674.
  • Rich SM, Hudson RR, Ayala FJ. Plasmodium falciparum antigenic diversity: evidence of clonal population structure. Proc Natl Acad Sci U S A. 1997;94:13040–13045.
  • Rich SM, Leendertz FH, Xu G, et al. (14 co-authors) The origin of malignant malaria. Proc Natl Acad Sci U S A. 2009;106:14902–14907.
  • Richard GF, Paques F. Mini- and microsatellite expansions: the recombination connection. EMBO Rep. 2000;1:122–126.
  • Rozen S, Skaletsky HJ. Primer3 on the WWW for general users and for biologist programmers. In: Krawetz S, Misener S, editors. Bioinformatics methods and protocols: methods in molecular biology. Totowa (NJ): Humana Press; 2000. pp. 365–386.
  • Su X, Ferdig MT, Huang Y, Huynh CQ, Liu A, You J, Wootton JC, Wellems TE. A genetic map and recombination parameters of the human malaria parasite Plasmodium falciparum. Science. 1999;286:1351–1353.
  • Toure Y, Oduola A. Focus: malaria. Nat Rev Microbiol. 2004;2:276–277.
  • Trager W, Jensen J. Human malaria parasites in continuous culture. Science. 1976;193:673.
  • Verra F, Hughes AL. Biased amino acid composition in repeat regions of Plasmodium antigens. Mol Biol Evol. 1999;16:627–633.
  • Volkman SK, Sabeti PC, DeCaprio D, et al. (29 co-authors) A genome-wide map of diversity in Plasmodium falciparum. Nat Genet. 2007;39:113–119.
  • Wolfram S. Mathematica. Version 5.2. Champaign (IL): Wolfram Research, Inc.; 2005.
  • Wootton J, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem. 1993;17:149–163.
  • Wootton JC. Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem. 1994;18:269–285.
  • Xue HY, Forsdyke DR. Low-complexity segments in Plasmodium falciparum proteins are primarily nucleic acid level adaptations. Mol Biochem Parasitol. 2003;128:21–32.

Articles from Molecular Biology and Evolution are provided here courtesy of Oxford University Press