Copyright © 2005, American Society of Plant Biologists
Arabidopsis Special Issue
Department of Plant Biology, University of Minnesota, St. Paul, Minnesota 55108
*Corresponding author; e-mail kvandenb/at/cbs.umn.edu; fax 612–625–1738.
2These authors contributed equally to the paper.
Received January 25, 2005; Revised March 4, 2005; Accepted March 9, 2005.
This article has been cited by other articles in PMC.

Abstract

Defensins represent an ancient and diverse set of small, cysteine-rich, antimicrobial peptides in mammals, insects, and plants. According to published accounts, most species' genomes contain 15 to 50 defensins. Starting with a set of largely nodule-specific defensin-like sequences (DEFLs) from the model legume Medicago truncatula, we built motif models to search the near-complete Arabidopsis (Arabidopsis thaliana) genome. We identified 317 DEFLs, yet 80% were unannotated at The Arabidopsis Information Resource and had no prior evidence of expression. We demonstrate that many of these DEFL genes are clustered in the Arabidopsis genome and that individual clusters have evolved from successive rounds of gene duplication and divergent or purifying selection. Sequencing reverse transcription-PCR products from five DEFL clusters confirmed our gene predictions and verified expression. For four of the largest clusters of DEFLs, we present the first evidence of expression, most frequently in floral tissues. To determine the abundance of DEFLs in other plant families, we used our motif models to search The Institute for Genomic Research's gene indices and identified approximately 1,100 DEFLs. These expressed DEFLs were found mostly in reproductive tissues, consistent with our reverse transcription-PCR results. Sequence-based clustering of all identified DEFLs revealed separate tissue- or taxon-specific subgroups. Previously, we and others showed that more than 300 DEFL genes were expressed in M. truncatula nodules, organs not present in most plants.
We have used this information to annotate the Arabidopsis genome and now provide evidence of a large DEFL superfamily present in expressed tissues of all sequenced plants.

Organisms are constantly confronted with potentially pathogenic microorganisms. Yet, few encounters result in disease, due to the multilayered lines of defense each organism possesses. In vertebrates, adaptive immunity has long held center stage because of its ability to recognize almost any foreign antigen. The ancient innate immune system is equally important and provides a critical line of defense in vertebrates, invertebrates, plants, and insects (Thomma et al., 2002; Beutler, 2004; Bulet et al., 2004; Finlay and Hancock, 2004).

In plants, innate immunity occurs via elaborate mechanisms (Dangl and Jones, 2001; Veronese et al., 2003). The plant cell wall serves as a barrier to microbial penetration. Antimicrobial compounds deter would-be invaders. Should penetration occur, recognition leads to the production of reactive oxygen intermediates, cell wall strengthening, activation of protein kinase pathways, and the production of signaling intermediates. Signaling events lead to localized responses such as a hypersensitive response (programmed cell death) or to the release of antimicrobial compounds. The plant is also immunized against unrelated pathogens via systemic acquired resistance (Delaney, 1997; Dong, 2001; Gozzo, 2003).

Much attention has focused on the interaction of plant resistance genes (R-genes) and pathogenic avirulence (avr) genes (Dangl and Jones, 2001). Plants have evolved R-genes to aid in the detection of specific pathogen avr gene products. Locked in an arms race, avr and R-genes are under strong pressure to evade and reestablish detection. Arabidopsis (Arabidopsis thaliana) has more than 150 of the nucleotide-binding-site/Leu-rich-repeat (NBS/LRR) class of R-genes (Baumgarten et al., 2003; Meyers et al., 2003).
Considerable effort has been made to elucidate the prevalence and activity of pathogenesis-related proteins such as antimicrobial peptides (AMPs; Broekaert et al., 1997; Garcia-Olmedo et al., 1998; Theis and Stahl, 2004). AMPs are widespread throughout the plant kingdom and include thionins, defensins, lipid transfer proteins, knottins, heveins, and snakins. Each of these classes of small, cationic secreted peptides has a characteristic number and linear arrangement of Cys pairs. These Cys pairs form disulfide bridges in class-specific three-dimensional folds (Broekaert et al., 1997). Members from most of these classes are active in vitro and in transgenic plants against a broad spectrum of bacterial and fungal pathogens (Broekaert et al., 1997; Berrocal-Lobo et al., 2002). AMPs frequently exhibit tissue-specific expression in epidermal and peripheral cell layers, or are nonspecifically expressed in response to wounding and pathogen attack (Broekaert et al., 1997; Garcia-Olmedo et al., 1998). Among the AMPs, plant defensins are particularly important. They have been identified in diverse taxa with many defense roles, including antifungal (Gao et al., 2000; Park et al., 2002; Cabral et al., 2003; Lay et al., 2003a), antibacterial (Osborn et al., 1995; Segura et al., 1998; Koike et al., 2002), anti-insect (Chen et al., 2002; Lay et al., 2003b), and protease inhibitory (Osborn et al., 1995; Wijaya et al., 2000) activities. In many cases, small differences in amino acid sequence can predict the specificity of the defense role (Garcia-Olmedo et al., 1998). In contrast to many field studies of R-genes, defensins have recently been shown to confer broad-spectrum resistance to pathogens in crops (Gao et al., 2000; Kanzaki et al., 2002). Defensins are thought to be members of small gene families. While Arabidopsis has 15 documented defensins (Thomma et al., 2002), human and mouse have less than 50 (Schutte et al., 2002). These numbers are likely gross underestimates. 
Spurred by the discovery of more than 300 defensin-like Cys cluster proteins (CCPs) in the legume Medicago truncatula (Fedorova et al., 2002; Mergaert et al., 2003; Graham et al., 2004) and more than a dozen similar unannotated open reading frames in the Arabidopsis genome (Graham et al., 2004), we set out to systematically identify more defensin-like sequences (DEFLs) in the Arabidopsis genome and in expressed sequences of higher plants.

RESULTS

Identification of DEFLs in Arabidopsis

Our search strategy used successive iterations of hidden Markov model (HMM; Durbin et al., 1998) and BLAST (Altschul et al., 1997) searches to identify small, secreted Cys-rich peptides in plants. As a starting point for the search, we used a set of HMMs generated from legume sequences identified in earlier work (Graham et al., 2004). These sequences encoded putative CCPs with distant homology to plant defensins and scorpion toxins (Graham et al., 2004), two groups known to have antimicrobial properties (Thomma et al., 2002; Graham et al., 2004; Yount and Yeaman, 2004). We identified 317 DEFLs in Arabidopsis, including all 15 known defensins (Figs. 1
Nearly all DEFL genes are composed of two exons. The first exon (approximately 65 bp) encodes the signal peptide, and the second (approximately 200 bp) encodes the mature peptide. The average intron size for expressed and predicted DEFLs, compiled separately, is 210 bp. More than 80% of expressed and predicted DEFLs have introns between 75 and 275 bp in size (size distributions in Supplemental Fig. 2). Further, the average position of the donor splice site relative to the start-ATG in predicted sequences closely matches expressed DEFLs (68 ± 2 bp versus 66 ± 1 bp, respectively).

Arabidopsis DEFLs were divided into 46 subgroups, each modeled with separate HMMs. Within a subgroup, signal peptide sequence, intron position, and intron size were well conserved. For example, 75% of sequences had intron start positions within 3 bp and intron sizes within 75 bp of the subgroup average (Supplemental Figs. 3 and 4). However, the mature peptides within subgroups were highly divergent with the exception of conserved Cys. This is reflected in the predicted pIs for mature peptides within individual subgroups, where some members are quite basic and others highly acidic. Indeed, the distribution of pI values for all DEFLs is bimodal with a distinct trough around 7.0 pH units (Supplemental Fig. 5). Of these subgroups, 78% had a Cys-stabilized alpha beta (CSαβ) motif common to defensins (Cornet et al., 1995; Lay et al., 2003b), and 80% had a γ-core motif common to all classes of Cys-containing AMPs (Fig. 2

Genome Organization of DEFLs in Arabidopsis

Clusters of DEFLs were explored to determine if they arose from tandem duplication and/or unequal recombination (Fig. 1
In addition to duplications within clusters, we also found evidence of single or multiple gene duplications to remote sites. Sequences within closely related subgroups appear dispersed throughout the genome (Supplemental Fig. 6). Within subgroups, there is evidence for at least 80 independent segmental duplication events (Supplemental Table II), half of which involve pairs of DEFLs with >50% amino acid identity. Dotplot analysis reveals that these segmental duplications are very small (typically 500–1,500 bp) and encompass only the defensin and adjacent regulatory regions (Supplemental Table III). One stunning example is a 1,450-bp duplication on chromosomes 3 and 5. The duplicated DEFLs (At3g59930 and At5g33355) share 97% nucleotide identity throughout the signal peptide, intron, and mature protein.

Once duplication to remote sites occurred, duplicated sites often underwent subsequent independent duplication or recombination events. For example, two pairs of DEFLs are duplicated on chromosomes 2 and 5 (At2g26010 and At2g26020, and At5g44430 and At5g44420). The duplicated regions, approximately 4,300 bp in length, share 79% nucleotide identity. Using the polymorphisms identified in the alignment, it is clear that two genes were duplicated as a unit. However, one duplicated pair has undergone subsequent unequal recombination. Interestingly, relatively few non-local related DEFL pairs overlap known large-scale segmental duplications in Arabidopsis (Supplemental Fig. 7; Vision et al., 2000; Cannon et al., 2003).

Experimental Verification of Expression in Arabidopsis

Given the high percentage of novel genes predicted in this work, we attempted to verify the expression of representatives within the six largest clusters. Of the 12 primer combinations used, nine (75%) amplified expressed DEFLs from five different clusters, two primer pairs failed to detect expression, and one primer pair failed to amplify either expressed DEFLs or the genomic DNA control (Fig. 3
Sequencing of cloned reverse transcription (RT)-PCR products identified 19 different DEFLs (GenBank accession nos. AY803252–AY803270). Of these, two represented alternate transcripts of the same gene (AY803263 and AY803265). We estimated that the nine primer pairs used in cloning could have amplified 27 different genes, including one predicted pseudogene. Therefore, 63% of the possible sequences were recovered. Of the 17 unique DEFLs identified, 12 had no previous evidence of expression. While this is a limited sample size, it suggests a large percentage of the 317 predicted DEFLs will be expressed.

Cloned sequences spanning introns were used to test the accuracy of our intron predictions. With the exception of the one sequence with two splice variants from cluster 4, all predicted intron boundaries were correct. Both splice variants disagreed with our prediction.

Evolution of DEFLs in Arabidopsis

To examine the evolutionary pressures acting on DEFL genes, the rates of nonsynonymous (Ka) and synonymous (Ks) substitutions were determined between 75 gene pairs representing 16 different clusters (Fig. 4
Identification of Expressed DEFLs from Higher Plants

Outside of Arabidopsis, 1,089 unique DEFLs were identified from 62 different plant species. These sequences contributed 47 subgroups lacking an Arabidopsis counterpart. Of these, 83% had the CSαβ motif, while 76% had the γ-core motif. Sequences within most of the 93 total subgroups displayed tissue-specific patterns of expression, particularly in seeds and other reproductive tissues (Table II). Note that reproductive tissues have been heavily sampled in the EST databases. They account for 41% of all plant ESTs. Thus, a subgroup of DEFLs composed almost entirely of ESTs from reproductive tissues may appear simply by chance. Despite this possibility, statistical analyses of our tissue-specific subgroups reveal that 26 of the 27 subgroups listed in Table II are indeed tissue specific (P < 0.05).
We also found evidence of taxon specificity among subgroups. Table II shows that 66% of all subgroups were highly specific to a single taxonomic family. In particular, many grass (Poaceae) sequences cluster into their own subgroups. Even though the Poaceae account for 53% of all unique EST sequences, the taxonomic specificity observed is statistically significant in 24 out of the 28 Poaceae-specific subgroups reported in Table II (P < 0.05). Expanded detail on the taxonomic and tissue distribution for ESTs in all subgroups is provided in Supplemental Table V.

DISCUSSION

Are These Genes Really Defensins?

The 317 genes described in this work have all of the hallmarks of defensin genes. Nearly all encode small putatively secreted peptides that are quite diverse with the exception of six, eight, or 10 conserved Cys. Roughly 80% have either a defensin CSαβ motif (Cornet et al., 1995; Lay et al., 2003b) or a γ-core motif common to all classes of Cys-rich AMPs (Yount and Yeaman, 2004). We have shown that they have a genomic organization virtually indistinguishable from mammalian defensins (Schutte et al., 2002; Maxwell et al., 2003) and plant R-genes (Baumgarten et al., 2003; Meyers et al., 2003), which have been amplified by successive rounds of duplication and divergent selection.

Defensin-Like Genes Constitute Large Gene Families in Many Plants

Previous work suggests that defensins exist as small gene families (Schutte et al., 2002; Thomma et al., 2002). However, we have shown that the Arabidopsis genome contains 317 DEFLs. In addition, mining the collective EST data for many higher plants identified very high representation of DEFLs, particularly among the grasses. In earlier work (Fedorova et al., 2002; Mergaert et al., 2003; Graham et al., 2004), more than 300 DEFLs were identified in M. truncatula. The vast majority of these were expressed exclusively in the nodule, an organ not even present in Arabidopsis. Are plants truly anomalous in their number of DEFLs?
It is possible. More likely, however, this highly divergent superfamily has eluded detection using current experimental and bioinformatics practices.

Current Bioinformatics Practices Have Hindered the Detection of DEFLs

The Arabidopsis genome has been nearly complete for several years. Gaps remain in centromeric and a few euchromatic regions (Hosouchi et al., 2002). Given the status of the genome sequence, it seemed surprising that such a large gene family as the DEFLs remained undiscovered. Initially, we hypothesized that current gene finding algorithms were limited in their abilities to find small genes because they were trained to recognize characteristics of known, and likely larger, genes (Zhang, 2002). Alternatively, computational data filters could have been used to remove short sequences (Scheetz et al., 2003; Wortman et al., 2003).

The original Arabidopsis genome annotation was produced without the accurate gene prediction software available today. Hence, many genes remained unpredicted, including the majority of our DEFLs. In The Institute for Genomic Research's (TIGR's) reannotation of the Arabidopsis genome, the latest suite of gene finders was utilized to capture missing gene annotations (Haas et al., 2005). A one-time minimum cutoff of 110 amino acid residues was applied to avoid adding numerous short false-positive predictions. Because of this size cutoff, the majority of our DEFLs continued to lack representation in the latest Arabidopsis genome annotation generated at TIGR (B. Haas, TIGR, personal communication).

The method we and others (Pegg and Babbitt, 1999; Vanoosthuyse et al., 2001; Schutte et al., 2002) have used is complementary to current annotation approaches. Computational size filters are often used in whole-genome annotation efforts to eliminate short, false-positive predictions. However, they also eliminate small genes present in the genome.
Our approach offers a means to distinguish spurious predictions from families of real genes. Common features shared among predicted genes offer clues of biological importance. We and others (Pegg and Babbitt, 1999) have observed that members of large divergent superfamilies may have poor overall sequence similarity, yet have associations of biological significance. Statistical similarity between subgroups of DEFLs is very low; yet, they share similar signal peptides, Cys arrangements, and genomic organization. In our searches, we were less rigid in requiring sequence similarity but required potential hits to have an upstream signal sequence that was not built into our motif search. The consistency of the predicted donor splice site relative to the translation start site provided further validation of our predictions. This enabled a high level of confidence even prior to our experimental verification.

DEFLs Evolve by Duplication and Selection

As mentioned previously, the DEFLs in Arabidopsis exist as single genes and clusters throughout the genome. Clearly, clusters have arisen by successive rounds of local duplication. In addition, clusters have been dispersed to remote regions of the genome by segmental duplication. Within clusters, analyses of nonsynonymous and synonymous substitution rates provide evidence for evolutionary pressures that might be acting on these genes. We found that the signal peptide is conserved, while the mature peptide may be under diversifying or purifying selection depending upon the cluster analyzed. The results are similar to what has been seen in mammalian defensins (Maxwell et al., 2003; Semple et al., 2003) and the NBS/LRR family of R-genes (Baumgarten et al., 2003; Meyers et al., 2003).

The extreme divergence between subgroups, and even within local clusters, has made it difficult to produce the accurate sequence alignments of DEFL genes that reliable phylogenetic inference requires.
However, reliable phylogenetic studies performed on NBS/LRR genes in Arabidopsis may provide insight into the evolution of DEFL genes. Baumgarten et al. (2003) and Meyers et al. (2003) found that clusters of NBS/LRR genes have evolved by successive rounds of duplication, unequal recombination, and segmental duplication to remote regions of the genome. However, these two groups disagree on the physical scale of the segmental duplications. Baumgarten et al. (2003) assert that large-scale segmental duplications and chromosomal rearrangements are responsible for the distribution of NBS/LRR genes in the genome. By contrast, Meyers et al. (2003) found that segmental duplications have occurred on a microscale level. In our analyses of the DEFL genes, our observations are closely aligned with those of Meyers et al. (2003).

Numerous DEFLs May Be Required to Protect against Potential Pathogens

Organisms are constantly confronted with potential pathogens. Therefore, it stands to reason that each organism should possess a wide range of genes to combat threats to their growth and survival. Characterized defensin peptides have been shown to have broad-spectrum activity in vitro; however, their potency is highly dependent on ionic concentrations and synergistic interactions with other AMPs (Broekaert et al., 1997; Garcia-Olmedo et al., 1998). Another important observation is that defensins and other AMPs are often expressed in an organ- or tissue-specific manner (Broekaert et al., 1997; Garcia-Olmedo et al., 1998). Thus, considerable redundancy in function may exist: multiple gene products expressed in distinct tissues may defend against the same or overlapping sets of pathogens.

In our previous work in M. truncatula, the majority of DEFLs were expressed in nodules. The symbiosis between plant and rhizobium leads to suppression of typical defense responses (Mithöfer, 2002; Mitra and Long, 2004).
We hypothesized that nodule-specific DEFLs protect the nutrient-rich nodule from the multitude of pathogens present in the soil. Like nodules, seeds are also nutrient rich. Large amounts of protein, polysaccharides, and lipids provide energy and raw materials for germination and development of the seedling (Wang et al., 2003). When dormant, seeds may be unable to respond to biotic threats by induction of defense response genes. Therefore, developmental control of antimicrobial peptide accumulation in seeds may be a preventive measure to avert attack on nutrient-rich resources. The observation of specific expression in seeds and other reproductive tissues for many of our DEFL genes is consistent with this hypothesis.

DEFLs May Be Involved in Non-Host Resistance

With the discovery of so many DEFLs largely specific to individual plant families, one may speculate that DEFLs could be major contributors to non-host resistance. Non-host resistance is a phenomenon in which an entire plant species is resistant to a specific pathogen (Heath, 2000; Mysore and Ryu, 2004). It is believed to be a complex phenomenon involving both preformed barriers to microbial penetration and inducible defense responses. Non-host resistance provides broad-spectrum, durable protection in the field. Defensins are ideal candidates for key players in this response. They are constitutively expressed in peripheral cell layers of nutrient-rich tissues and are inducible by microbial penetration in other tissues (Broekaert et al., 1997; Garcia-Olmedo et al., 1998). They engage in complex synergistic interactions with other AMPs to increase their potency. Moreover, individual defensins are active against a broad spectrum of microbes, and have been shown to confer resistance to microbes in transgenic crops, durable over several generations (Gao et al., 2000; Kanzaki et al., 2002). Zimmerli et al.
(2004) recently showed that several defensins in Arabidopsis were up-regulated in response to the non-host pathogens responsible for barley powdery mildew and potato late blight, but not in response to closely related host pathogens. These findings are consistent with the hypothesis that defensins and related DEFLs may be major contributors to non-host resistance.

DEFLs May Have Functions Unrelated to Defense

Not all small secreted Cys-rich plant peptides have roles in defense. Some of our DEFL genes could be involved in reproductive regulation as are members of the stig1 gene family (Goldman et al., 1994), or the Sterility-locus (S-locus) Cys-rich (SCR; Schopfer et al., 1999) and related pollen coat proteins (Watanabe et al., 2000). Beginning with the male determinant of sporophytic self-incompatibility (SSI), SP11 (a SCR protein), Vanoosthuyse et al. (2001) used iterative BLAST searches to discover 37% of the peptides we identified. SP11 adopts the same three-dimensional fold as many defensins (Chookajorn et al., 2004) and displays high levels of divergent selection and allelic diversity (Watanabe et al., 2000). Binding of SP11 from self-pollen to the stigma-specific S-locus receptor kinase (SLK) starts the cascade of responses that leads to rejection. SP11 and SLK are genetically linked at the S-locus and coevolve (Sato et al., 2002; Chookajorn et al., 2004). Despite the similarities between defense and pollen recognition, it is hard to see why so many SCRs would lie outside the S-locus and why they would be expressed in so many other reproductive and somatic tissues. It is more plausible that SP11 was coopted from an ancient defensin to perform a new function (Nasrallah, 2002). While defensins are widely dispersed among eukaryotes, only a limited distribution of flowering plants use SSI, suggesting that it has only recently evolved (Hiscock and McInnis, 2003).
CONCLUSION

We set out to systematically identify DEFLs in the Arabidopsis genome and in the expressed sequences of higher plants. In Arabidopsis, we experimentally confirmed the expression of a subset of these genes. Genome analysis demonstrates that this large gene family has evolved by successive rounds of tandem and segmental duplication followed by purifying or diversifying selection. Members of the DEFL superfamily are not restricted to legumes and are far more abundant and diverse than previously appreciated. Thus, DEFLs constitute excellent candidates for crop improvement.

MATERIALS AND METHODS

Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third-party owners of all or parts of the material. Obtaining any permissions will be the responsibility of the requestor.

The cloned sequences reported in this manuscript have been deposited in the GenBank database (accession nos. AY803252–AY803270). Identified DEFL sequences from the Arabidopsis (Arabidopsis thaliana) genome were provided to TAIR, and Arabidopsis Genome Initiative (AGI) gene codes were assigned.

Identification of DEFLs from Arabidopsis and Other Plants

Our search strategy used successive iterations of HMM builds and searches to identify small, secreted Cys-rich peptides in plants. BLAST (Altschul et al., 1997) similarity searches were also used as a complementary approach. As a starting point for the search, we used a set of HMMs generated from legume sequences identified in earlier work (Graham et al., 2004). These sequences encoded putative CCPs with distant homology to plant defensins and scorpion toxins (Graham et al., 2004), two groups known to have antimicrobial properties (Thomma et al., 2002; Graham et al., 2004; Yount and Yeaman, 2004).
HMMs representing 15 groups of CCPs were chosen from that work (groups 36, 40, 41, 645, and 31.01–31.11) because they identified homologs in Arabidopsis. These initial HMMs and subsequent HMMs were constructed only from the mature peptide, not the signal sequence. The signal peptide was left out because it appears as a separate exon in genomic sequences. Although eliminating the signal sequence from the HMMs reduced the sensitivity of the models, it provided a measure of confidence because we required genome hits to have an upstream signal peptide. Starting with this set of legume HMMs, the entire Arabidopsis genome (Huala et al., 2001) was translated in all six frames, and scanned with each of the HMMs using hmmsearch from the HMMer package version 2.2g (Durbin et al., 1998) with default parameters. The upstream sequence of all hits with E < 10 was manually screened using showorf (Rice et al., 2000) for the presence of a signal peptide. Signal peptides were confirmed with SignalP version 3.0 (Bendtsen et al., 2004). Splice sites for the single expected intron were predicted using the NetPlantGene server (Hebsgaard et al., 1996). If the server failed to predict a donor or acceptor, then the missing splice sites were predicted using alignment with close homologs when available. After all hits were manually examined and false-positive hits were removed, the new sequences were added to a master list of putative Arabidopsis defensins. To pick up neighboring homologs in sequence space, BLASTP version 2.2.1 (Altschul et al., 1997) was used to scan all new protein sequences against the translated Arabidopsis genome. The parameters -G 12 -E 2 -M BLOSUM45 -F F were used to emphasize distant homologs. Each new hit was examined as above to identify the upstream signal peptide and splice sites. Finally, acceptable hits were added to the master list. 
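The six-frame translation that precedes the HMM scan can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the standard genetic code is assumed, and codons containing anything other than A, C, G, or T are rendered as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translation("ATGTGCTAA").items():
        print(label, protein)
```

Each translated frame would then be written out as a protein sequence and passed to the profile search; stop codons appear as `*` so that candidate open reading frames can be delimited afterwards.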
All new protein sequences in the master list were aligned via ClustalW version 1.82 (Thompson et al., 1994), and split out into subgroups using the dendrogram generated by that software. Each of the resulting rough set of sequence subgroups was then separately realigned via ClustalW and visualized using JalView (Clamp et al., 2004). The alignments were trimmed to remove the signal peptide. The trimmed alignments were then used as input for hmmbuild and hmmcalibrate (Durbin et al., 1998) using default parameters. All sequences in the master set were scanned against each of the HMMs, and shuffled around among models until each sequence scored best against its own HMM (after realigning and rebuilding any HMMs affected by sequence exchanges). At this stage, a full iteration cycle was completed, and the new HMMs were used to rescan the translated Arabidopsis genome. Successive iterations were carried out until convergence occurred (i.e. no new hits were found). The relevant details for all identified Arabidopsis sequences in this work (e.g. precise genome location; size, position, and prediction status of the intron; expression status) are provided in Supplemental Table I. The prediction status and location of each sequence are also depicted in Supplemental Figure 1. Following convergence within the Arabidopsis genome, the search was expanded to include the unigene sequences from all 25 plant gene indices at TIGR (Quackenbush et al., 2001), the comprehensive uniref100 collection (version 2.2) of all known protein sequences (Apweiler et al., 2004), and the emerging genomic sequence data from the Medicago truncatula sequencing project (September 2004, http://www.medicago.org/genome). 
The TIGR plant gene indices used included (total unigene counts and ESTs, respectively): Arabidopsis AGI version 11.0 (45,683, 227,670), Capsicum annuum CaGI version 1.0 (10,712, 22,804), Gossypium hirsutum CGI version 5.0 (24,350, 52,818), Chlamydomonas reinhardtii ChrGI version 4.0 (30,339, 152,263), Glycine max and Glycine soja GmGI version 11.0 (67,826, 333,481), Helianthus annuus HaGI version 3.0 (20,520, 59,426), Hordeum vulgare HvGI v. 8.0 (49,190, 341,924), Lycopersicon esculentum LeGI version 9.0 (31,012, 155,317), Lotus japonicus LjGI version 3.0 (28,460, 109,618), Lactuca sativa LsGI version 2.0 (22,185, 68,120), Mesembryanthemum crystallinum McGI version 4.0 (8,455, 25,640), M. truncatula MtGI version 7.0 (36,976, 189,714), Nicotiana benthamiana NbGI version 1.0 (6,118, 18,832), Nicotiana tabacum NtGI version 1.0 (10,232, 9,998), Oryza sativa OsGI version 15.0 (88,765, 272,567), Allium cepa OnGI version 1.0 (11,726, 19,553), Pinus spp. PGI version 4.0 (31,771, 125,061), Secale cereale RyeGI version 3.0 (5,347, 9,119), Sorghum bicolor SbGI version 8.0 (39,148, 187,282), Saccharum officinarum SoGI version 1.0 (95,884, 255,635), Solanum tuberosum StGI version 9.0 (32,553, 157,197), Triticum aestivum TaGI version 8.0 (123,807, 542,781), Theobroma cacao TcaGI version 1.0 (2,539, 5,981), Vitis vinifera VvGI version 3.1 (23,109, 132,316), and Zea mays ZmGI version 14.0 (56,364, 377,188). After these datasets were translated in six frames (translation excluded for the protein dataset uniref100), they were scanned with the HMMs generated above. The same procedures as described earlier were used to verify new hits, break up alignments into new subgroups, and build new HMMs. Splice site prediction was unnecessary for the gene indices and the uniref100 sequences as they only contained exon sequences. Non-plant hits (e.g. numerous insect defensins and scorpion toxins) from uniref100 were ignored. 
Iterations were carried out over the 25 gene indices, the uniref100 dataset, and the Arabidopsis and Medicago genomes until no new hits were found.

In determining the tissue-specificity of identified subgroups, the following terms were identified as reproductive tissue (1,383,976 ESTs): aleurone, anther, caryopsis, coleoptile, crown, ear (maize), embryo, endosperm, fiber (cotton), flower, fruit, grain, head (wheat), inflorescence, kernel, ovary, ovule, panicle, pedicel, pericarp, pistil, pod, pollen, scutellum, seed, silk (maize), silique, sperm cell, spike, and tassel. Seed tissues (558,352 ESTs) included: aleurone, caryopsis, coleoptile, embryo, endosperm, fiber (cotton), grain, kernel, pedicel, pod, and seed. Tissues of origin for all EST libraries among the TIGR's 25 plant gene indices were determined by judicious examination of the GenBank records, and merging this extracted information with the incomplete library descriptions at TIGR. The total number of ESTs from identifiable tissues was 3,336,384. The compiled list of tissue origins is available upon request.

A tabular summary of the characteristics of the sequence-related subgroups is provided in Supplemental Table V. Additionally, alignments, fasta files, HMMs, and expression summaries are available upon request from the corresponding author. Additional nodule-specific subgroups that lack close Arabidopsis homologs are not included in the supplemental files, nor have they been included in the statistics thus far. They have been reported previously (Graham et al., 2004) and are the subject of a separate expanded genomic analysis (unpublished data).

Statistical Analyses of Taxon and Tissue Specificity for DEFL Subgroups

We used a chi-squared association test (Dunn and Clark, 2001) to determine whether sequence subgroups were statistically enriched by sequences from a specific tissue or simply represented tissues that were overly sampled in the EST databases.
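As a rough illustration of this test, together with the small-count fallback to the Fisher exact test that this section goes on to describe, the Python sketch below works on a 2×2 table. Several points are assumptions rather than details given in the text: the table layout (subgroup versus all remaining ESTs, chosen tissue versus other tissues), the use of a two-sided Fisher test, the absence of a continuity correction, and the hypothetical subgroup counts in the usage example. The database totals (1,383,976 reproductive ESTs out of 3,336,384) are taken from the text.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and P value (1 df) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-squared with 1 df: erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact P value (assumed two-sided; not stated in the text)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = math.comb(n, row1)

    def pmf(k):
        # Hypergeometric probability of k tissue ESTs in the subgroup.
        return math.comb(col1, k) * math.comb(n - col1, row1 - k) / denom

    observed = pmf(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    total = sum(p for p in map(pmf, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))
    return min(1.0, total)


def enrichment_p_value(in_tissue, in_other, total_tissue, total_other, min_expected=5.0):
    """Pick the test from the subgroup's expected counts, then return (name, P)."""
    a, b = in_tissue, in_other
    c, d = total_tissue - a, total_other - b
    n = total_tissue + total_other
    size = a + b
    expected = (size * total_tissue / n, size * total_other / n)
    if min(expected) >= min_expected:
        return "chi-squared", chi2_2x2(a, b, c, d)[1]
    return "fisher", fisher_exact_2x2(a, b, c, d)


if __name__ == "__main__":
    # Hypothetical subgroup: 35 reproductive ESTs and 5 from other tissues.
    print(enrichment_p_value(35, 5, 1_383_976, 3_336_384 - 1_383_976))
```

For taxon specificity the same functions would apply with unigene counts in place of EST counts.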
This method was used as long as the expected count was at least five for each of two data sets within the subgroup: sequences derived from the chosen tissue set and those derived from other tissues. The chi-squared association test exaggerates statistical significance when expected counts are low; therefore, in cases where these counts were less than five, we used the Fisher exact probability test (Siegel, 1956). This protocol was also used to assess the taxon specificity of sequence subgroups, with unigene counts replacing EST counts.

Analysis of Defense Peptide Sequence Motifs

The alignments for most subgroups showed clear evidence of characteristic defense motifs described in the literature. We visually scanned each subgroup alignment for two of these motifs: the CSαβ motif, which is common to all proteins adopting the three-dimensional knottin defensin fold (Cornet et al., 1995); and the γ-core motif, found in all known Cys-containing AMPs (Yount and Yeaman, 2004). The CSαβ motif contains the residues C…CXXXC…C…CXC, where C = Cys, X = any amino acid, and … indicates a nonconserved number of amino acids. The γ-core motif is described by GXCX{3,9}C, or its enantiomeric sequence permutations CX{3,9}CXG and CX{3,9}GXC, where G = Gly and the numbers in braces indicate a range of nonconserved amino acids. In several cases, as noted in Supplemental Table V, we extended the range of nonconserved amino acids. Subgroups were considered to have a motif if the majority of the sequences within the alignment matched it.

Duplication of DEFLs in Arabidopsis

Sequences falling within a 100,000-bp window in the Arabidopsis genome were grouped. Clusters of four or more closely spaced sequences were then assigned cluster numbers and analyzed for evidence of local duplication (Fig. 1).

Degenerate Primer Design to Identify Expressed Genes

To verify defensin expression and splice site predictions, six of the clusters identified in Figure 1

Two primer pairs were used as controls for experimental procedures. Primers secNS2/secNS3 (Supplemental Table VI) correspond to the secret agent gene (SEC, At3g04240), and primers act2f/act2r2 correspond to actin 7 (ACT7, At5g09810). SEC and ACT7 are expressed in a variety of tissues and developmental stages (TAIR; http://www.arabidopsis.org).

Plant Materials

To obtain shoot and root material for RT-PCR, Arabidopsis ecotype Columbia seeds were surface sterilized and grown on agar plates containing Murashige and Skoog salts (2.16 g L−1; Sigma, St. Louis), 1% (w/v) Suc, and 0.8% (w/v) agar. Plates were chilled for 2 d at 4°C and then placed vertically in a growth chamber programmed to provide a 16-h-day/8-h-night cycle at 21°C. The light intensity of the growth chamber was approximately 80 μE m−2 s−1. Following 10 d of growth, shoots and roots were separated and frozen in liquid nitrogen. Floral material was obtained from plants grown on soil (LG3 and LP5; Sungrow Horticulture, Bellevue, WA) at 22°C with a constant light intensity of 80 μE m−2 s−1. Floral material ranged from immature inflorescence to flowers 2 DPA.

RT-PCR of DEFLs

Prior to RT-PCR, total RNA was isolated from root, shoot, and flower samples using the RNeasy Plant mini kit (Qiagen, Valencia, CA). To remove contaminating genomic DNA, RNA samples were treated with DNA-free (Ambion, Austin, TX). First-strand cDNA synthesis of all three samples was performed using Transcriptor reverse transcriptase (Roche, Indianapolis), Oligo-p(dT)15 primer (Roche), and 0.25 μg of total RNA, following the manufacturer's recommendations. Minus reverse-transcriptase libraries were made from all three samples to test for genomic DNA contamination. Following cDNA synthesis, PCR was performed using a PTC-225 DNA Engine thermocycler from MJ Research (Watertown, MA).
Control primers mentioned above were used to test for genomic DNA contamination and cDNA synthesis efficiency. Once genomic DNA contamination was ruled out, the 12 primer combinations shown in Supplemental Table VI were used to monitor the expression of the DEFLs in flowers, shoots, and roots. PCR reactions were 20 μL in volume and contained 1× Promega PCR buffer, 2.2 mM MgCl2, 200 μM each dNTP, 0.2 μM each primer, 2 μL of template cDNA, and 0.5 units of Taq DNA Polymerase (Promega, Madison, WI). PCR cycling conditions were 94°C for 2 min; 35 cycles of 94°C for 45 s, annealing for 30 s, and 72°C for 45 s; followed by 72°C for 7 min. In addition to the six cDNA templates, genomic DNA was used as a positive control for PCR amplification. Following PCR, products were visualized on a 1.5% agarose gel. Upon analysis of the RT-PCR results, the flower RNA sample was chosen for use as template in all cloning reactions with the nine PCR primer pairs yielding visible product.

Cloning of RT-PCR Products

PCR reactions were repeated as above, except that the reaction volume was doubled to 40 μL. PCR products were purified and concentrated using Microcon YM-30 centrifugal filter devices (Millipore, Billerica, MA). PCR products were cloned using the pGEM-T Easy Vector System II, following the manufacturer's recommendations. Plasmid DNA was prepared using the Qiaprep Spin Miniprep kit (Qiagen). Eight clones from each primer pair were selected for sequencing. Double-pass sequencing was performed by the Advanced Genetic Analysis Center at the University of Minnesota. RT-PCR product sequences were analyzed using the Sequencher software (Gene Codes, Ann Arbor, MI) and then imported into the PILEUP alignment containing all gene sequences from the corresponding cluster. Sequence comparisons were made to determine which gene produced the specific RT-PCR product and to determine the size and position of the intron if present.
The size and position of identified intron sequences were compared to those predicted computationally in our analysis.

Evolutionary Analyses of DEFLs in Arabidopsis

To estimate the rates of nonsynonymous (Ka) and synonymous (Ks) substitutions, the predicted coding sequences of defensin-like genes were divided into two regions: the signal sequence as determined by SignalP and the mature protein. The coding sequences of the signal peptides and the predicted mature proteins from each cluster of defensin-like genes were aligned using PILEUP. In tobacco, defensins have been identified that contain a C-terminal prodomain in addition to the signal peptide and mature defensin (Lay et al., 2003a). Therefore, the predicted mature defensin was trimmed to begin at the first conserved Cys residue. Since the Cys residues are known to be well conserved (Thomma et al., 2002), the corresponding nucleotides were also removed. Predicted pseudogenes and DEFLs bearing no similarity to other genes in the cluster were removed from the alignments. In some cases, clusters were subdivided if subgroups within a cluster could not be reliably aligned. Equivalent subdivisions were used in analysis of both the signal peptide and mature protein. The Ka and Ks values for the signal peptide and the trimmed mature protein were determined using DIVERGE (Wisconsin package; Fig. 4).

Supplemental Data I
Supplemental Data II
Acknowledgments

We thank Dr. Eva Huala for providing TAIR AGI codes and examining each of our gene predictions, Brian Haas for verifying whether or not gene prediction algorithms used at TAIR failed to detect DEFLs, and Drs. Chris Town and Jeff Esch for helpful discussions. Floral RNA samples were provided by Dr. David Marks (University of Minnesota, St. Paul). Primer pairs ACT7 and SEC were provided by Dr. Lynn Hartweck (University of Minnesota, St. Paul).

Notes

1 This work was supported by a National Science Foundation Plant Genome Research Program award on Medicago truncatula genomics (DBI no. 0110206; Principal Investigator Douglas R. Cook) and by funds from the University of Minnesota College of Biological Sciences.

[w] The online version of this article contains Web-only data.