Mol Biochem Parasitol. 2007 Jul; 154(1): 98–102.
PMCID: PMC1906845

An approach to classifying sequence tags sampled from Plasmodium falciparum var genes

Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1) appears to play a key role both as a virulence factor and as a target of naturally acquired immunity [1,2]. This large family of molecules is encoded by the highly polymorphic superfamily of var genes, of which there are approximately 60 variants in every genome [3].

A rapidly growing collection of var sequences is now available from clinical isolates around the world [4–11]. Despite immense diversity both in terms of overall organization and primary sequence, the majority of var genes contain a DBL1α region [3]. The existence of short islands of homology within this region has enabled the design of primers that can be used to sample sequence from most var genes to create DBL1α sequence tags [5]. A standard approach to classification of these sequence tags would enable direct comparisons to be made between different studies. However, the extreme diversity of var genes and the fact that they undergo intra-genic recombination [4,12,13] make this difficult.

Despite the high diversity, there does appear to be an underlying simplicity to the var genes that supports the use of information present in DBL1α sequence tags in making comparisons between the expression levels in different isolates. Analysis of the fully sequenced genome of a single P. falciparum isolate, 3D7, suggests that the genomic location of the 60 var genes promotes genetic structuring and the maintenance of genetically distinct sequence types [14–16]. In addition, the structural features of the genes within the single genome of 3D7 closely mirror the range of structural features among collections of DBL1α sequence tags from clinical parasite isolates [9]. We previously used a small number of key sequence features in an algorithm to classify the DBL1α sequence tags from a single geographical location in Kenya into six groups [9] (see Fig. 1A and below). This var tag grouping system, though it is based on a portion of the DBL1α domain (see Supplementary information), corresponded well with whole var gene classification based on the whole genome sequence of the parasite line 3D7 [9]. This grouping system appears to be biologically meaningful. Expression of group 2 sequences was strongly associated with the parasite rosetting phenotype in Kilifi, whereas expression of group 1 sequences was negatively associated with the repertoire of antibodies to infected erythrocyte surface antigens carried by the patient at the time of disease [9]. Thus DBL1α sequence tags appear to contain useful information about the genes to which they belong that is currently not directly accessible in field studies of clinical parasite isolates.

Fig. 1
The cysteine/PoLV classification approach. (A) Sequence features extracted from DBL1α sequence tags. The input sequence is the DBL1α sequence starting from a DIGDI motif within homology block D and ending in PQFLR motif within homology ...

We have developed a rapid approach to performing the classification using text string analysis functions in Microsoft Excel and Perl (see Supplementary files). This classifies sequence tags directly without the need for prior alignment and can be performed on many sequences simultaneously. The approach is summarized in Fig. 1A. The classification is based around a count of the number of cysteine residues within the tag region and a set of sequence motifs at four positions of limited variability (PoLV 1–4) whose positions within the sequence are fixed in relation to four anchor points (a–d, marked with arrows in Fig. 1A). Thus PoLV1 and PoLV4 are fixed in relation to the 5′ and 3′ ends of the sequence, respectively (anchor points a and d). PoLV2 and PoLV3 are fixed in relation to a “WW” motif (anchor point b). The groups defined by these features are summarized in the box in Fig. 1. Henceforth we will refer to these groupings as cysteine/PoLV groups.
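The text-string approach described above can be illustrated with a short Python sketch. This is not the Excel or Perl implementation from the Supplementary files, and several details are assumptions made for illustration: the residue offsets of the PoLV motifs relative to their anchors (the real positions are defined in Fig. 1A), the use of the first “WW” alone as the internal anchor, and the rules for groups 3, 4 and 6, which are inferred from statements elsewhere in the text (groups 1–3 are cys2, group 4 is cys4, MFK* marks group 1, *REY marks groups 2 and 5). The function names are hypothetical.

```python
# Hypothetical offsets; the true positions are those shown in Fig. 1A.
POLV1_START = 10      # PoLV1 begins this many residues after the 5' end (anchor a)
POLV3_AFTER_WW = 10   # PoLV3 begins this many residues after the start of "WW"
POLV4_END_GAP = 8     # PoLV4 ends this many residues before the 3' end (anchor d)


def extract_features(tag):
    """Return cysteine count, PoLV1-4 motifs and length for a DIGDI...PQFLR tag.

    Returns None when the "WW" anchor is missing, so the tag cannot be classified.
    """
    ww = tag.find("WW")
    if ww < 4:
        return None
    n = len(tag)
    return {
        "cys": tag.count("C"),
        "polv1": tag[POLV1_START:POLV1_START + 4],
        "polv2": tag[ww - 4:ww],  # four residues immediately before WW (assumed)
        "polv3": tag[ww + POLV3_AFTER_WW:ww + POLV3_AFTER_WW + 4],
        "polv4": tag[n - POLV4_END_GAP - 4:n - POLV4_END_GAP],
        "length": n,
    }


def classify(features):
    """Assign a cysteine/PoLV group (1-6) under the assumed rules; None if unclassifiable."""
    if features is None:
        return None
    mfk = features["polv1"].startswith("MFK")  # MFK* at PoLV1
    rey = features["polv2"].endswith("REY")    # *REY at PoLV2
    if features["cys"] == 2:
        if mfk:
            return 1
        if rey:
            return 2
        return 3
    if features["cys"] == 4:
        return 5 if rey else 4
    return 6  # assumed: any other cysteine count


# Synthetic toy tag built to match the assumed offsets; not a real var sequence.
toy = ("DIGDIVRGKD" "MFKS" "NKKEK" "LRED" "WW" "TANRETVW"
       "KAIT" "CNAGC" "PTYF" "DYVPQFLR")
print(extract_features(toy), classify(extract_features(toy)))
```

Because each tag is handled with plain string indexing, no alignment step is needed and a whole collection can be processed in a single loop.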

This text string analysis approach was tested on the original set of sequences from Kilifi, Kenya [9] and sequences from 9 other studies (see Fig. 1B–E). The sequences were pre-screened to ensure that they contained 5′ DIGDI and 3′ PQFLR consensus sequences. Overall, 99.6% of sequences could be classified using this approach. This included 100% of sequences from Malawi (J. Montgomery, unpublished), Papua New Guinea [7,17], Mali [10], Solomon Islands [7], and the Philippines [7], together with 100% of sequences from one dataset from Brazil [6]. A dataset from Venezuela (52 non-identical sequences [8]) carried two sequences that could not be classified. A dataset from Brazil (137 non-identical sequences [18]) carried one sequence that could not be classified. The original dataset from Kilifi (878 non-identical sequences [9]) carried two sequences that could not be classified. All five of these sequences lacked the WW or VW motifs required as anchor points within the sequence.
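A pre-screening step of this kind might look like the following Python sketch. It assumes exact matches to DIGDI and PQFLR (a real screen may tolerate variants of these consensus motifs), trims each translated sequence to the region between them, and collapses identical tags; the function names are hypothetical and this is not the procedure from the Supplementary files.

```python
def trim_to_tag(protein):
    """Return the region from the first DIGDI to the last PQFLR, or None if either is absent."""
    start = protein.find("DIGDI")
    end = protein.rfind("PQFLR")
    if start < 0 or end < start:
        return None
    return protein[start:end + len("PQFLR")]


def prescreen(sequences):
    """Keep non-identical tags carrying both end motifs, in first-seen order."""
    unique = {}
    for seq in sequences:
        tag = trim_to_tag(seq.upper())
        if tag is not None:
            unique.setdefault(tag, None)
    return list(unique)


# Toy inputs only; these are not sequences from any of the studies.
print(prescreen(["AADIGDIKKKWWTTTPQFLRGG", "digdikkkwwtttpqflr", "DIGDIAAA"]))
```

Tags that pass this screen but lack the internal anchor would still need to be reported as unclassifiable at the classification step.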

Part of the rationale for this grouping system came from a search for PoLV motifs that were associated with sequences with distinct length distributions [9]. Two motifs were identified which were independently associated with short sequences. These are MFK* at PoLV1 and *REY at PoLV2 (an asterisk here denotes any amino acid). We hypothesised that if sequences of different length recombine with each other they will generate a wide range of sequences of different lengths, whereas genetically isolated sequences, i.e., those that are not recombining with one another, are able to maintain distinct distributions in their length. If these groupings are genuine, the sequences classified into different groups should have similar lengths in different settings. As shown in Fig. 1B–E, broadly similar distributions of sequence length are observed within the six different groups between three different continents, suggesting that sequences generated in these different studies shared the same set of structural features. Specifically, MFK* (carried at PoLV1 in group 1) and *REY (carried at PoLV2 in groups 2 and 5) are associated with short sequences in each geographical region. No examples of sequences with both MFK* and *REY motifs were found, suggesting that these motifs are mutually exclusive. In addition, though *REY motifs were found in sequences with 2 or 4 cysteine residues (cys2 or cys4), with the exception of a single cys4 (group 4) sequence from the Philippines, MFK* motifs were found exclusively in cys2 (group 1) sequences.
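One simple way to inspect whether groups keep distinct length distributions is to summarise tag length per group, as in the Python sketch below. The input format (pairs of group number and tag length) and the function name are assumptions, and the values are toy numbers rather than data from any of the studies.

```python
from collections import defaultdict
from statistics import median


def length_summary(records):
    """Return {group: (min, median, max)} of tag lengths from (group, length) pairs."""
    by_group = defaultdict(list)
    for group, length in records:
        by_group[group].append(length)
    return {g: (min(v), median(v), max(v)) for g, v in sorted(by_group.items())}


# Toy values for illustration only.
toy_records = [(1, 100), (1, 104), (4, 120), (4, 126), (4, 122)]
print(length_summary(toy_records))
```

Running the same summary separately for each study would allow the per-group ranges to be compared between geographical settings.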

Further support for the cysteine/PoLV groupings comes from recent publications. Trimnell et al. found a good correspondence between cysteine/PoLV groupings of cys2 sequences and groups defined phylogenetically within a globally sampled subset of var genes with a specific upstream control region, upsA [11]. Also evident from sequences reported in that study is the fact that DBL1 from two other globally sampled subsets of var genes can be easily distinguished from DBL1 domains from other vars using unique PoLV motifs. var2csa vars have a unique PoLV2 motif “EVIT”, whereas Type3 vars have a unique PoLV4 motif “PPVV” (data not shown).
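The two motifs mentioned above lend themselves to a simple lookup once PoLV2 and PoLV4 have been extracted from a tag. The following Python sketch only restates that observation as code; the function name is hypothetical, and treating an exact motif match as sufficient evidence is an assumption.

```python
def special_subset(polv2, polv4):
    """Flag tags carrying the PoLV motifs reported as unique to var2csa or Type3 vars."""
    if polv2 == "EVIT":
        return "var2csa"
    if polv4 == "PPVV":
        return "Type3"
    return None


print(special_subset("EVIT", "PTYF"))
```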

Kraemer et al. have recently performed an analysis and re-classification of whole var genes from 3D7, HB3 and IT4 [19]. Fig. 2A and B summarize the relationship between the cysteine/PoLV groupings and whole var gene classification. With the exception of group 6 sequences, which were not found in HB3 var genes, all sequence groups were represented. In all three genomes, cysteine/PoLV group 1 sequences are found exclusively in group A var genes and in long genes with >5 domains, whereas cysteine/PoLV group 5 sequences are found only in non-group A genes and those with 4–5 domains. Cys2 sequence tags (groups 1–3) were never found in group C var genes.

Fig. 2
The relationship between the cysteine/PoLV classification approach and other var gene classifications. (A–B) Comparison with whole var gene classification in laboratory isolates 3D7, HB3 and IT4 [19]. (A) The relationship between whole gene classifications ...

Kyriacou et al. used a phylogenetic approach to compare DBL1α sequence tags from Mali [10]. Visual inspection of the layout of these sequences reveals three main groups and a minor group. There was good correspondence between these groups and the cysteine/PoLV groupings (Fig. 2C [10]). This study showed that cys2 sequence tags were more frequent among parasites isolated from children with cerebral malaria than those from children with hyperparasitaemia. However, division of the sequences into cysteine/PoLV groups suggests that the frequency of group 2 sequences is similar in parasites from these two groups of children (Fig. 2D [10]).

At a higher level of resolution, the distinct sequence identifier (DSID) (see Fig. 1A) is a potentially useful method of further classifying sequence tags. This consists of a string of sequence features in the form “PoLV1-PoLV2-PoLV3-number of cysteines-PoLV4-sequence tag length”. The DSID captures more of the overall sequence diversity than the previously described “sequence signature” [9] whilst remaining robust to minor changes introduced by sequencing or PCR errors. Among the 1595 non-identical sequences identified in all the studies described here, there were 1111 DSIDs. Fig. 1F–G illustrates the potential usefulness of this approach to classification. In Fig. 1F, 44 “common” sequences that were shared between more than one study were selected. Fisher's exact test was used to determine whether these common sequences were shared between two studies more or less than would be expected by chance (+ or − symbols, respectively). Fig. 1G is the same except that the analysis was done at the level of 157 “common” DSIDs that were shared between more than one study. In contrast to Fig. 1F, there was a highly significant similarity between var genes from South American isolates, in support of a recent study of Amazonian isolates [18]. In contrast to the low overlap between DSIDs from Kilifi and from South America (Fig. 1G), there is considerable overlap in the constituent PoLV motifs themselves (see Supplementary information). This illustrates the potential for recombination to generate diversity from a limited number of sequence blocks [4,12,13].
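The DSID format quoted above can be assembled directly from the extracted features. The Python sketch below builds the string and lists DSIDs that occur in more than one study; the function names, the study labels and the toy DSIDs are hypothetical, and the sketch does not reproduce the Fisher's exact test used for Fig. 1F–G.

```python
from collections import defaultdict


def make_dsid(polv1, polv2, polv3, n_cys, polv4, length):
    """Build a DSID: PoLV1-PoLV2-PoLV3-number of cysteines-PoLV4-tag length."""
    return f"{polv1}-{polv2}-{polv3}-{n_cys}-{polv4}-{length}"


def shared_dsids(dsids_by_study):
    """Map each DSID found in more than one study to the set of studies carrying it."""
    studies = defaultdict(set)
    for study, dsids in dsids_by_study.items():
        for dsid in dsids:
            studies[dsid].add(study)
    return {d: s for d, s in studies.items() if len(s) > 1}


# Toy example with made-up study labels and DSIDs.
a = make_dsid("MFKS", "LRED", "KAIT", 2, "PTYF", 54)
b = make_dsid("LYLG", "LREY", "RAIT", 4, "PTNL", 120)
print(shared_dsids({"study_X": [a, b], "study_Y": [a]}))
```

Two tags that differ only at residues outside the PoLV positions, and that keep the same length and cysteine count, receive the same DSID, which is why the identifier tolerates some sequencing or PCR errors.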

Since the cysteine/PoLV system of classification is based on commonly occurring sequence features, it is hoped that it will be useful for initial analysis and annotation, comparison of different geographical regions over time and identification of unusual sequences.


We thank Norbert Peshu, the director of the Centre for Geographic Medicine Research, Coast, unit at Kilifi, and Alister Craig for useful discussion. We are grateful to Joe Smith and Sue Kraemer (Seattle Biomedical Research Institute, Seattle, USA) for pre-publication access to IT4 var sequence information. This paper is published with the permission of the Director of KEMRI. The work was supported by a Wellcome Trust Advanced Training Fellowship in Tropical Medicine (060678) to PB and by Wellcome Trust Project grants 076030 (PB, CN, KM) and 071376 (JM).

Appendix A. Supplementary data


1. Kyes S., Horrocks P., Newbold C. Antigenic variation at the infected red cell surface in malaria. Annu Rev Microbiol. 2001;55:673–707. [PubMed]
2. Bull P.C., Marsh K. The role of antibodies to Plasmodium falciparum infected erythrocyte surface antigens in naturally acquired immunity to malaria. Trends Microbiol. 2002;10:55–58. [PubMed]
3. Gardner M.J., Hall N., Fung E. Genome sequence of the human malaria parasite Plasmodium falciparum. Nature. 2002;419:498–511. [PMC free article] [PubMed]
4. Ward C.P., Clottey G.T., Dorris M., Ji D.D., Arnot D.E. Analysis of Plasmodium falciparum PfEMP-1/var genes suggests that recombination rearranges constrained sequences. Mol Biochem Parasitol. 1999;102:167–177. [PubMed]
5. Taylor H.M., Kyes S.A., Harris D., Kriek N., Newbold C.I. A study of var gene transcription in vitro using universal var gene primers. Mol Biochem Parasitol. 2000;105:13–23. [PubMed]
6. Kirchgatter K., del Portillo H.A. Association of severe noncerebral Plasmodium falciparum malaria in Brazil with expressed PfEMP1 DBL1α sequences lacking cysteine residues. Mol Med. 2002;8:16–23. [PMC free article] [PubMed]
7. Fowler E.V., Peters J.M., Gatton M.L., Chen N., Cheng Q. Genetic diversity of the DBLalpha region in Plasmodium falciparum var genes among Asia-Pacific isolates. Mol Biochem Parasitol. 2002;120:117–126. [PubMed]
8. Tami A., Ord R., Targett G.A., Sutherland C.J. Sympatric Plasmodium falciparum isolates from Venezuela have structured var gene repertoires. Malar J. 2003;2:7. [PMC free article] [PubMed]
9. Bull P.C., Berriman M., Kyes S. Plasmodium falciparum variant surface antigen expression patterns during malaria. PLoS Pathog. 2005;1:e26. [PMC free article] [PubMed]
10. Kyriacou H.M., Stone G.N., Challis R.J. Differential var gene transcription in Plasmodium falciparum isolates from patients with cerebral malaria compared to hyperparasitaemia. Mol Biochem Parasitol. 2006;150:211–218. [PMC free article] [PubMed]
11. Trimnell A.R., Kraemer S.M., Mukherjee S. Global genetic diversity and evolution of var genes associated with placental and severe childhood malaria. Mol Biochem Parasitol. 2006;148:169–180. [PubMed]
12. Taylor H.M., Kyes S.A., Newbold C.I. Var gene diversity in Plasmodium falciparum is generated by frequent recombination events. Mol Biochem Parasitol. 2000;110:391–397. [PubMed]
13. Freitas-Junior L.H., Bottius E., Pirrit L.A. Frequent ectopic recombination of virulence factor genes in telomeric chromosome clusters of P. falciparum. Nature. 2000;407:1018–1022. [PubMed]
14. Kraemer S.M., Smith J.D. Evidence for the importance of genetic structuring to the structural and functional specialization of the Plasmodium falciparum var gene family. Mol Microbiol. 2003;50:1527–1538. [PubMed]
15. Robinson B.A., Welch T.L., Smith J.D. Widespread functional specialization of Plasmodium falciparum erythrocyte membrane protein 1 family members to bind CD36 analysed across a parasite genome. Mol Microbiol. 2003;47:1265–1278. [PubMed]
16. Lavstsen T., Salanti A., Jensen A.T., Arnot D.E., Theander T.G. Sub-grouping of Plasmodium falciparum 3D7 var genes based on sequence analysis of coding and non-coding regions. Malar J. 2003;2:27. [PMC free article] [PubMed]
17. Kaestli M., Cortes A., Lagog M., Ott M., Beck H.P. Longitudinal assessment of Plasmodium falciparum var gene transcription in naturally infected asymptomatic children in Papua New Guinea. J Infect Dis. 2004;189:1942–1951. [PubMed]
18. Albrecht L., Merino E.F., Hoffmann E.H. Extense variant gene family repertoire overlap in Western Amazon Plasmodium falciparum isolates. Mol Biochem Parasitol. 2006;150:157–165. [PubMed]
19. Kraemer S.M., Kyes S.A., Aggarwal G. Patterns of gene recombination shape var gene repertoires in Plasmodium falciparum: comparisons of geographically diverse isolates. BMC Genom. 2007;8:45. [PMC free article] [PubMed]
20. Smith J.D., Subramanian G., Gamain B., Baruch D.I., Miller L.H. Classification of adhesive domains in the Plasmodium falciparum erythrocyte membrane protein 1 family. Mol Biochem Parasitol. 2000;110:293–310. [PubMed]
