Appl Environ Microbiol. 2012 Mar; 78(5): 1523–1533.
PMCID: PMC3294464

Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes

Abstract

Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp).

INTRODUCTION

Phylogenetic analyses of rRNA gene sequences have led to important advances in microbiology, such as the discoveries of the Archaea branch on the tree of life (27) and of new lineages of eukaryotic taxa (13) and the realization that culturable microorganisms comprise a minute fraction of the microbiota present in environmental samples (11). The 16S rRNA gene has been a preferred target for bacterial and archaeal diversity studies, resulting in the growth of extensive sequence databases. To facilitate classification of bacteria, the Ribosomal Database Project (RDP) (7) developed a naïve Bayesian classifier that has become a widely used public resource (25). This classifier can also be applied to other gene sequences, but doing so requires characterization and optimization of the classifier's performance with a set of taxonomically accurate training sequences. We report here the establishment of a training set for fungal large-subunit rRNA (LSU) gene sequences and the performance of the naïve Bayesian classifier using that training set.

Multiple regions of the fungal rRNA genes have been used to study fungal taxonomy and diversity; these include the small-subunit (SSU) and large-subunit (LSU) rRNA genes and the internal transcribed spacer (ITS) region that separates the two rRNA genes (14, 18, 19). The LSU gene region contains two hypervariable regions, designated D1 (Saccharomyces cerevisiae bp 127 to 264) and D2 (S. cerevisiae bp 423 to 636) (9), that are flanked by relatively conserved sequence regions in most fungi. This arrangement allows LSU gene sequences to be aligned for phylogenetic analysis. The LSU region has been used extensively for fungal phylogeny and taxonomic placement, including the Assembling the Fungal Tree of Life (AFTOL) Project and environmental surveys (1, 2, 12, 20). Although the ITS region provides a useful “bar code” for environmental diversity studies, the extent of sequence variability in this region does not allow for robust sequence alignment. The LSU gene provides a molecular marker for placement of new fungal lineages from environmental surveys in a comprehensive phylogenetic framework or for analysis of basal fungal lineages (12, 20).

To enable use of the naïve Bayesian classifier for fungal LSU gene sequences, we developed an 8,506-member sequence data set and characterized the performance of the classifier. Accuracy was tested at phylogenetic levels ranging from phylum to genus, using a leave-one-out cross-validation (LOOCV) approach in which a random sequence was removed from the training data set and then used as a query sequence to test its taxonomic placement against the remaining training sequence set. Classification using the naïve Bayesian classifier was compared to BLASTN classification using the same training set. The influences of sequence read length and sequence location across a 1,400-bp LSU gene region are presented. In addition, features of the training set that affect classifier performance are presented, including variability in training set sequence coverage, entropy, and bootstrap calculations. The training set and classification tool described herein are publicly available at the Ribosomal Database Project website (http://rdp.cme.msu.edu/classifier/classifier.jsp).

MATERIALS AND METHODS

Figure S1 in the supplemental material summarizes the steps used to create the training set and test the performance of the naïve Bayesian classifier and BLASTN approaches for taxonomic assignment.

Fungal LSU gene training set.

A set of 13,475 fungal LSU gene GenBank sequences were manually checked and downloaded from the NCBI database. Sequences with duplicate accession numbers were removed, and sequence and taxonomy information was obtained from the NCBI nucleotide and taxonomy database by using NCBI Entrez in batch mode, along with an in-house Perl script. After removal of duplicate sequences, the LSU gene sequences were aligned using the SILVA website (22), using a reference data set of aligned large-subunit (23S/28S) rRNA gene sequences from all three domains. These SILVA-generated alignments were checked in ARB (16). The Saccharomyces cerevisiae (RDN25-1) sequence was also aligned with the SILVA aligned data set and served as a base position reference across the LSU gene training set. Based on the alignment, sequences that included long regions of the ITS region with only an approximately 30-bp LSU gene fragment and sequences that contained less than 20% of the LSU gene were removed. An additional 993 sequences that had inconsistent taxonomy information were also excluded.

A total of 8,506 fungal LSU gene sequences spanning the first 1,400 bp of the LSU gene were hand curated using published phylogenies for different taxa and taxonomic databases. This final data set was used as the training set for classification analyses and included sequences from 1,702 validated fungal genera spanning 40 classes and 9 phyla, with some nonfungal sequences included for comparison (Table 1). The taxonomic composition of the training set is shown in Table 2. The majority of the sequences (71%) represented 15 classes (and one classification designated incertae sedis) within the Basidiomycota. Twenty-eight percent of the sequences represented 14 classes (and one classification designated incertae sedis) in the Ascomycota. The Agaricomycetes (67.23%), Dothideomycetes (7.96%), and Sordariomycetes (10.31%), which are common fungi found in soils (4, 18, 21), represented most of the training sequences (Table 2). The majority of basal fungal lineages (Chytridiomycota and Zygomycota) were excluded from this analysis because the taxonomy of these groups is currently inaccurate. Taxonomic outlines and curated databases for these groups are being created for future testing and will be added to the database of the RDP as completed.

Table 1
Taxonomic composition of the fungal LSU gene database
Table 2
Taxonomic composition of the 7,730-sequence fungal LSU gene training set used for LOOCV tests, with accuracies of class-level classifications obtained using 100-bp fragments of the D2 region (bp 500 to 599) with the naïve Bayesian classifier

For the LOOCV analyses of classification accuracy, 776 genera that were each represented by only a single sequence (hereafter termed singletons) were excluded from the analysis, since this test would obviously give an incorrect genus assignment for the singletons. The resulting data set used for LOOCV classification accuracy included 7,730 sequences (Table 2).
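
The singleton filter described above can be illustrated with a minimal sketch. The record layout (pairs of sequence identifier and genus), the function name, and the toy data are assumptions made for this example and are not the in-house scripts used in the study.

```python
from collections import Counter

def drop_singleton_genera(records):
    """Keep only records whose genus is represented by at least two sequences.

    `records` is a list of (sequence_id, genus) pairs; this layout is an
    assumption of the example.
    """
    genus_counts = Counter(genus for _, genus in records)
    return [(sid, genus) for sid, genus in records if genus_counts[genus] > 1]

# Toy data with invented identifiers: "Tuber" is a singleton and is dropped.
toy_records = [
    ("s1", "Amanita"), ("s2", "Amanita"),
    ("s3", "Tuber"),
    ("s4", "Russula"), ("s5", "Russula"),
]
kept = drop_singleton_genera(toy_records)
print(kept)
```

A singleton must be excluded because, once it is held out as the query, no sequence of its genus remains in the training set, so a correct genus assignment is impossible.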

Genetic region representation.

All 7,730 LSU gene sequences were aligned using SILVA (22) to generate the master alignment. The reference sequence of the Saccharomyces cerevisiae LSU gene (RDN25-1) was used to designate the sequence positions (without counting S. cerevisiae gaps) and for sequence fragment extraction (see below).

Extraction of different test sequences from the alignment.

Using the LSU gene sequence alignment, two extraction methods were used to obtain short test sequences of two lengths for LOOCV testing: 100 bp to represent Illumina sequence reads and 400 bp to represent 454 Titanium sequence reads. The first approach was a whole-region overlapping extraction where a 100-bp or 400-bp fixed-length sliding window was used to tile across the master alignment. For each step, the sequence extraction was based on 25-base intervals spanning bp locations relative to S. cerevisiae RDN25-1 as a reference sequence (without counting gap positions). Tiled, extracted fragments of 100 bp or 400 bp were then used independently for LOOCV testing. This approach tested the classification accuracy across the entire 1,400-bp sequence length.
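
The whole-region tiling can be sketched as follows. This is an illustration under stated assumptions, not the extraction code used in the study: the function names are hypothetical, gaps are assumed to be written as "-" or ".", and window positions are 1-based reference positions that skip gap columns in the aligned reference sequence. The optional `min_len` argument mirrors the 50-bp minimum length applied in the Results.

```python
def reference_columns(aligned_ref, gap_chars="-."):
    """Alignment column indices that hold a reference base (gaps not counted)."""
    return [i for i, ch in enumerate(aligned_ref) if ch not in gap_chars]

def extract_window(aligned_seq, ref_cols, start_bp, window, gap_chars="-."):
    """Ungapped fragment of `aligned_seq` covering reference positions
    start_bp .. start_bp + window - 1 (1-based, inclusive)."""
    first_col = ref_cols[start_bp - 1]
    last_col = ref_cols[start_bp + window - 2]
    fragment = aligned_seq[first_col:last_col + 1]
    return "".join(ch for ch in fragment if ch not in gap_chars)

def tile_windows(aligned_seq, aligned_ref, window, step=25, min_len=50):
    """Slide a fixed-size window along the reference numbering and return
    (start_bp, fragment) pairs for fragments of at least `min_len` bases."""
    ref_cols = reference_columns(aligned_ref)
    tiles = []
    for start_bp in range(1, len(ref_cols) - window + 2, step):
        fragment = extract_window(aligned_seq, ref_cols, start_bp, window)
        if len(fragment) >= min_len:
            tiles.append((start_bp, fragment))
    return tiles

# Toy alignment (invented): the query has an insertion relative to the
# reference in column 3 and a deletion in column 6.
ref = "AC-GTAC--GT"
seq = "ACTGT-CAAGT"
tiles = tile_windows(seq, ref, window=4, step=2, min_len=1)
print(tiles)
```

In the toy output, a 4-position window yields fragments of 5, 3, and 5 bases, which shows why extracted test sequences can be longer or shorter than the nominal window size when positions are counted on the reference only.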

The second approach was a PCR primer-anchored extraction where fragments of the S. cerevisiae reference sequence, anchored at the 3′ end of PCR primer LR0R (forward primer; located at ~26 bp to 42 bp) (http://www.biology.duke.edu/fungi/mycolab/primers.htm) or LR3 (reverse primer; located at 635 bp to ~651 bp) (http://www.biology.duke.edu/fungi/mycolab/primers.htm), were extracted. Sequence fragments of 75 bp, 100 bp, 200 bp, or 400 bp were tested independently using the LOOCV approach to determine the impact of the sequence length generated by use of either of these commonly used primers on classification accuracy.
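
As a rough illustration of what each primer-anchored fragment spans, the following sketch computes reference coordinates and their overlap with the D1 and D2 regions. It rests on several assumptions that the text does not state: the fragment is taken to start at the base immediately after the primer's 3′ end and to exclude the primer itself, the approximate primer boundaries are treated as exact (bp 42 for LR0R and bp 635 for LR3), and all function names are hypothetical.

```python
# Coordinates on the S. cerevisiae reference, taken from the text.
D1 = (127, 264)
D2 = (423, 636)
LR0R_3PRIME_END = 42   # forward primer spans roughly bp 26 to 42
LR3_3PRIME_END = 635   # reverse primer spans bp 635 to roughly 651

def anchored_range(primer, length):
    """Reference interval (start, end), inclusive, covered by a fragment of
    `length` bases read away from the primer's 3' end (assumed layout)."""
    if primer == "LR0R":                      # forward primer, reads rightward
        return (LR0R_3PRIME_END + 1, LR0R_3PRIME_END + length)
    if primer == "LR3":                       # reverse primer, reads leftward
        return (LR3_3PRIME_END - length, LR3_3PRIME_END - 1)
    raise ValueError(f"unknown primer: {primer}")

def overlap(a, b):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

for primer in ("LR0R", "LR3"):
    for length in (75, 100, 200, 400):
        span = anchored_range(primer, length)
        print(primer, length, span, "D1:", overlap(span, D1), "D2:", overlap(span, D2))
```

Under these assumptions, a 100-bp fragment anchored at LR3 lies entirely within the D2 coordinates, whereas a 100-bp fragment anchored at LR0R reaches only the first few bases of D1.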

Entropy exploration.

Shannon entropy (23) is a standard measure of the order state of symbol sequences. Given a sequence S with symbols $s_i$ (i = 0 …), the Shannon entropy can be calculated as $H = -\sum_i p_i \log(p_i)$, where $p_i$ denotes the probability of k-mer $i$, estimated as the number of occurrences of k-mer $i$ in this matrix divided by the total number of k-mers in this matrix. In this case, each k-mer within a subset of sequences extracted from the LSU gene alignment is regarded as a symbol. The occurrence of more distinct k-mers (symbols) within each subset of sequences suggests more potentially variable or unique internal features, which provides better resolution for classifying sequences. For this reason, Shannon entropy can be used to quantify the expected value of the information contained in each sliding window at a given k-mer size. Higher entropy in a window means that the included sequences contain richer information or more distinct features, therefore providing better resolution and a higher accuracy of classification.

We used an entropy calculation across the 1,400-bp LSU gene sequence region to select the optimal k-mer size for LOOCV testing. The k-mer size can influence the accuracy of taxonomic placement, so the Shannon entropy index was calculated for each base position. If all of the k-mers within a test sequence were unique, then the entropy would reach its maximum. Entropy calculations were then normalized by dividing by this maximum (all positions are unique). The relative percentages of maximum entropy for sliding windows along the length of the sequence were plotted for comparison of entropies for different k-mer sizes and different sequence regions.

Composition-based classification using the naïve Bayesian classifier.

The Java tool of the naïve Bayesian classifier was downloaded from RDP's sourceforge page (http://sourceforge.net/projects/rdp-classifier/) and installed locally. For each naïve Bayesian classifier-based LOOCV test, a single random sequence was removed from the training set as a query sequence to test its taxonomic placement against the remaining training sequence set. The process was repeated for all sequences in the training set. In-house Perl scripts were then used to parse the taxonomy assignment.
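
The LOOCV loop itself is independent of the classifier being tested and can be sketched as follows. The `shared_kmer_classifier` function is a deliberately simple stand-in chosen for this example (it assigns the label of the training sequence sharing the most 3-mers with the query); it is not the naïve Bayesian classifier, and all names and sequences are hypothetical.

```python
def loocv_accuracy(records, classify):
    """Leave-one-out cross-validation.

    records:  list of (sequence, label) pairs.
    classify: function (query_sequence, training_records) -> predicted label.
    Returns the fraction of held-out sequences assigned their true label.
    """
    correct = 0
    for i, (query, truth) in enumerate(records):
        training = records[:i] + records[i + 1:]   # everything except the query
        if classify(query, training) == truth:
            correct += 1
    return correct / len(records)

def shared_kmer_classifier(query, training, k=3):
    """Stand-in classifier: label of the training sequence that shares the
    largest number of distinct k-mers with the query."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}

    def shared(item):
        seq, _ = item
        return len(query_kmers & {seq[i:i + k] for i in range(len(seq) - k + 1)})

    return max(training, key=shared)[1]

# Toy training set (invented sequences and labels).
toy = [
    ("AAAAAAAA", "genus1"), ("AAAAAAAT", "genus1"),
    ("CCCCCCCC", "genus2"), ("CCCCCCCG", "genus2"),
]
accuracy = loocv_accuracy(toy, shared_kmer_classifier)
print(accuracy)
```

Any classifier with the same call signature can be substituted for the stand-in without changing the loop.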

Bootstrap analysis with the naïve Bayesian classifier.

The naïve Bayesian classifier provides a bootstrap measurement for each assignment, which is a confidence estimate representing the number of times the assigned taxon was selected out of 100 bootstrap trials (25). After assigning a taxon to a query sequence, a one-eighth subset of eight character words was randomly chosen from the set of all overlapped words from each query sequence. The joint probability of observing words in this subset was calculated for each genus. The randomly chosen subset sequence was assigned to the genus giving the highest probability based on the naïve Bayesian assumption. This resampling process was repeated for 100 bootstrap trials, and the bootstrap value was computed using the number of times the assigned taxon was selected out of 100 bootstrap trials. For higher-order assignments, we summed the results for all genera under each taxon. We used this calculation directly as a measure of the maximum confidence achievable with the current training set and to compare the classification accuracies achievable across the 1,400-bp LSU gene sequence when different bootstrap or confidence thresholds were selected.
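
The bootstrap procedure can be illustrated with a toy genus-level classifier. The resampling follows the description above (100 trials, each using one-eighth of the query's overlapping 8-character words). Two details are assumptions of this sketch, since the text does not specify them: the smoothed word probabilities and sampling with replacement. Function names and sequences are hypothetical.

```python
import math
import random
from collections import defaultdict

def words(seq, k=8):
    """All overlapping k-character words of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(records, k=8):
    """Count, overall and per genus, how many sequences contain each word."""
    word_seqs = defaultdict(int)                         # sequences containing w
    genus_size = defaultdict(int)                        # sequences per genus
    genus_words = defaultdict(lambda: defaultdict(int))  # per-genus counts of w
    for seq, genus in records:
        genus_size[genus] += 1
        for w in set(words(seq, k)):
            word_seqs[w] += 1
            genus_words[genus][w] += 1
    return len(records), word_seqs, genus_size, genus_words

def best_genus(word_list, model):
    """Genus with the highest joint log probability of the given words."""
    n_total, word_seqs, genus_size, genus_words = model
    best, best_score = None, -math.inf
    for genus in sorted(genus_size):
        score = 0.0
        for w in word_list:
            prior = (word_seqs.get(w, 0) + 0.5) / (n_total + 1)   # assumed smoothing
            score += math.log((genus_words[genus].get(w, 0) + prior)
                              / (genus_size[genus] + 1))
        if score > best_score:
            best, best_score = genus, score
    return best

def classify_with_bootstrap(query, model, k=8, trials=100, seed=0):
    """Assign a genus, then report how often a random one-eighth subset of
    the query's words yields the same genus (percent of trials)."""
    rng = random.Random(seed)
    query_words = words(query, k)
    assigned = best_genus(query_words, model)
    subset_size = max(1, len(query_words) // 8)
    hits = 0
    for _ in range(trials):
        subset = [rng.choice(query_words) for _ in range(subset_size)]
        hits += best_genus(subset, model) == assigned
    return assigned, 100 * hits / trials

toy = [("A" * 40, "GenusA"), ("A" * 39 + "C", "GenusA"),
       ("C" * 40, "GenusC"), ("C" * 39 + "G", "GenusC")]
model = train(toy)
result = classify_with_bootstrap("A" * 40, model)
print(result)
```

A bootstrap value for a higher rank would be obtained, as the text describes, by summing the trial counts of all genera that belong to the same higher-level taxon.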

Similarity-based BLASTN classification.

BLASTN is commonly used to classify rRNA gene sequences and was used here as a comparison approach. BLAST+ executables (5) were downloaded from GenBank (http://www.ncbi.nlm.nih.gov/) and installed locally. The lengths of all test sequences in this study were between 50 bp and around 400 bp, and correspondingly, BLASTN parameters were set to a word size of 6 and an E value threshold of 1,000. For each BLASTN-based LOOCV test, a single sequence was reserved from the 7,730 fungal LSU gene training set sequences as the test sequence, and the remaining 7,729 sequences comprised a reformatted BLAST database (formatdb). The process was repeated for all sequences in the training set. The BLASTN top hits were parsed using an in-house Perl script to gain overall taxonomic information for the accuracy calculation (% correct assignment). The speeds of processing using the naïve Bayesian classifier and BLASTN were compared using a Mac OS X (10.5.8) server with a 2.66-GHz Quad-Core Intel Xeon processor and 3 GB of 1,066-MHz DDR3 memory.
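
For readers reproducing the comparison, the search settings stated above (word size of 6, E value threshold of 1,000) can be expressed as a BLAST+ `blastn` command line. The sketch below only builds the argument list and parses tabular output; it does not run BLAST. The file names, the choice of tabular output (`-outfmt 6`), and the rule of taking the first reported hit per query as the top hit are assumptions of this example and are not details given in the text.

```python
def blastn_command(query_fasta, database, out_path, word_size=6, evalue=1000):
    """Argument list for a BLAST+ blastn search with tabular output."""
    return [
        "blastn",
        "-query", query_fasta,
        "-db", database,
        "-word_size", str(word_size),
        "-evalue", str(evalue),
        "-outfmt", "6",          # 12-column tab-separated hit table
        "-out", out_path,
    ]

def top_hits(tabular_text):
    """Map each query ID to the subject ID of its first listed hit.

    Assumes hits for a query are listed best first, as in tabular output.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        best.setdefault(query_id, subject_id)
    return best

# Hypothetical file names, for illustration only.
cmd = blastn_command("query.fasta", "lsu_training_db", "hits.tsv")
print(" ".join(cmd))

# Invented example rows in the 12-column tabular layout.
sample = (
    "q1\tseqA\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-40\t170\n"
    "q1\tseqB\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-25\t120\n"
    "q2\tseqC\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-35\t150\n"
)
print(top_hits(sample))
```

The taxonomy of each top-hit subject would then be compared with the known taxonomy of the query to compute the percentage of correct assignments.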

RESULTS

Sequence coverage in the LSU gene training set.

The percentage of the 7,730 aligned training set sequences represented at each bp location (termed coverage) is shown in Fig. 1. Very few (<10%) of the sequences retrieved from public databases included the region past bp 1400, and all aligned sequences were trimmed at this location prior to use with the classifier. Representation in the training set dropped to 80% (79.86%) between bp 650 and 1000 and then ranged between 10% and 40% between bp 963 and 1400 (Fig. 1). Sequence coverage at a particular bp dropped in regions containing sequence gaps, and these regions are shown by dropping lines in Fig. 1. The training sequences between primers LR0R (bp 26 to 42) and LR3 (bp 635 to 651) had more than 80% coverage in the training set. A detailed inspection of the alignment revealed that the primer sequences were typically removed prior to sequence deposition into the public database and that the regions internal to the primers are more highly represented than the primer sequences.
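
Coverage as defined here can be computed directly from the alignment. The following is a minimal sketch, assuming that gaps are written as "-" or "." and that positions are counted on the non-gap columns of the aligned reference; the function name and toy alignment are hypothetical.

```python
def coverage_by_reference_position(aligned_seqs, aligned_ref, gap_chars="-."):
    """Percentage of sequences with a base (not a gap) at each reference position."""
    ref_cols = [i for i, ch in enumerate(aligned_ref) if ch not in gap_chars]
    n = len(aligned_seqs)
    return [
        100.0 * sum(seq[col] not in gap_chars for seq in aligned_seqs) / n
        for col in ref_cols
    ]

# Toy alignment (invented); column 2 is a gap in the reference and is skipped.
ref = "AC-GT"
seqs = ["AC-GT", "A--GT", "--TG-", "ACAGT"]
coverage = coverage_by_reference_position(seqs, ref)
print(coverage)
```

In the toy example the second reference position is covered by only half of the sequences, the kind of drop that appears as a descending line in a coverage plot.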

Fig 1
Sequence coverage in the training set across a 1,400-bp region of the LSU gene, based on multiple sequence alignment with S. cerevisiae. The percent coverage is shown on the y axis, and the corresponding S. cerevisiae gene position is shown on the x axis. ...

Coverage variation caused by gap regions in sliding window.

Extraction of sequence fragments for classification testing used a 100-bp or 400-bp sliding window with a 25-bp length interval or step size that corresponded directly to the S. cerevisiae LSU gene sequence numbering. Because gaps were not counted, some of the extracted test sequences were longer or shorter than the window size (100 bp or 400 bp) (see Fig. S2 in the supplemental material). Using the sliding window observations, we found that the naïve Bayesian classifier did not perform well with sequences of <50 bases (data not shown), likely due to an insufficient number of features. Thus, a minimum length cutoff of 50 bp extracted from the master alignment was applied for all subsequent analyses.

Entropy calculation and optimal k-mer size.

Entropy calculations were performed to estimate which k-mer size should be used for optimal classifier performance. Entropy was calculated to measure the sequence complexity among different k-mer sizes. Figure 2A and B show the calculated entropies obtained using word sizes of 1-mer to 15-mer along the 1,400-bp LSU gene fragment, using the 100-bp and 400-bp sliding windows, respectively. Although the entropy increased with larger k-mer sizes, the degree of increase became more consistent once the k-mer size reached 8 bp. As expected, measured entropy was higher across the two hypervariable regions, which is consistent with the larger proportion of gaps and variable sequences, and the D2 region exhibited higher entropy than the D1 region (Fig. 2A). As expected, entropy was lower and more stable for the longer sequence fragments extracted with the 400-bp sliding window (Fig. 2B) than for fragments extracted with the 100-bp window (Fig. 2A).

Fig 2
Entropy across the 1,400-bp LSU gene region. The Shannon entropy index (H′) was calculated for each 100-bp (A) and 400-bp (B) tiled sequence fragment, and the mean information entropy for k-mer-sized windows along the length of the sequence was ...

Classification accuracy using the LOOCV test. (i) Training set influence.

Classification accuracy varied across the 1,400-bp region and was influenced by training set coverage, sequence length, and sequence variability. Accuracy was higher in the region of bp 1 through about bp 680, where sequence coverage in the training set was high. The accuracy from bp 680 to 1400 dropped significantly (Fig. 3A to D), due in part to fewer training sequences with representation in this region (Fig. 1; see Fig. S2 in the supplemental material).

Fig 3
Classification accuracy and bootstrap confidence across the 1,400-bp LSU gene region. (A and B) Classification accuracies for the BLASTN LOOCV test with sequence segments of 100 bp (A) and 400 bp (B), moving 25 bases at a time. (C and D) Classification ...

Each tiled 100-bp or 400-bp test sequence contained a different set of features that affected the classification of that sequence into a taxonomic bin. For example, the average bootstrap value or confidence with which a 100-bp sequence from bp 150 to 250 (within the D1 region) was classified to the same assigned genus by using this training set was 75%, and this dropped to less than 50% for bp 300 to 400. The measured bootstrap values (Fig. 3E and F) from our LOOCV testing followed the same trends observed with the accuracy measurements (Fig. 3C and D). Accuracy values are discussed in detail below. The differences in accuracy and bootstrap confidence across the LSU gene region emphasize that continued improvement and expansion of high-quality fungal sequence data sets are needed in parallel to development of rapid classification approaches.

(ii) Effect of sequence read length.

The 400-bp read length provided a higher classification accuracy than the 100-bp read length with the naïve Bayesian classifier (Fig. 3C [100 bp] versus Fig. 3D [400 bp]) and also with the BLASTN approach (Fig. 3A [100 bp] versus Fig. 3B [400 bp]). With the naïve Bayesian classifier, the average accuracies across the D1 and D2 regions at the genus level were about 78% (D1) and 80% (D2) for 400-bp sequences and about 60% (D1) and about 70% (D2) for 100-bp sequences.

(iii) Effect of sequence variability on accuracy in the LSU gene D1 and D2 hypervariable regions.

Classification accuracy was greater for query sequences mapping to the D1 and D2 hypervariable regions (Fig. 3C) than for other locations in the 1,400-bp test region. Sequence location was especially important for the 100-bp test sequences. For example, with the 100-bp query sequences, the genus-level accuracy of the naïve Bayesian classifier was about 64.5% across the D1 region (bp 127 to 264) and about 74.1% for the D2 region (bp 423 to 636) but less than 40% in the highly conserved regions of the gene fragment (Fig. 3C).

The classification accuracy results obtained using the 400-bp sequence length are shown in Fig. 3B and generally paralleled the results from the 100-bp sequence tests. The average accuracy derived from the D2 region was slightly higher than that for the D1 region. Overall, the results suggested that the D2 region provided the best classification accuracy for the 100-bp and 400-bp sequence lengths.

(iv) LR3 and LR0R primer extraction test.

For many applications, it is desirable to anchor sequences from a conserved sequence that can be used as a PCR primer. The primers LR0R (forward primer) and LR3 (reverse primer) are commonly used for PCR amplification and sequencing of LSU gene fragments from fungal cultures, fruiting bodies, and environmental samples. Using these primer sequences as anchor points, we extracted 75-bp to 400-bp sequences from the training set and used them in LOOCV tests to mimic classification of sequence reads from different next-generation sequencing (NGS) platforms. For the naïve Bayesian classifier (RDP) LOOCV tests, the overall accuracy for reverse primer LR3 with the 400-bp sequence length (mimicking 454 Titanium sequences) was about 80% at the genus level (Fig. 4B; see Table S1 in the supplemental material). Even with 100-bp sequences (mimicking Illumina amplicons), the classifier had an accuracy of about 89% at the family level and about 70% at the genus level. The accuracy for the 75-bp sequence length (mimicking Illumina sequences) was about 80% at the family level and decreased dramatically, to about 60%, at the genus level (Fig. 4B; see Table S1). The BLASTN approach provided comparable results but was not as accurate with 75-bp fragments at the family and genus levels (Fig. 4A and B; see Table S1).

Fig 4
Classification accuracy by query sequence length and primer position for LOOCV testing using the naïve Bayesian classifier and BLASTN approaches. Numbers are percentages of correctly classified query sequences. (A) Accuracy using BLASTN; (B) accuracy ...

The overall accuracies of the naïve Bayesian classifier and BLAST approaches with the forward primer LR0R were lower than those with the reverse primer LR3. With LR0R, the accuracy with the 400-bp fragments declined to 73% at the genus level, about 90% at the family level, and ~97% at the order level. The accuracy for the 75-bp segments was about 99% at the phylum level and decreased dramatically, to about 16%, at the genus level (Fig. 4; see Table S1 in the supplemental material). The results indicated that the LR3-anchored amplicons substantially outperformed LR0R-anchored amplicons and that the longer, 400-bp reads outperformed the 100-bp reads. With LR3-extracted sequences, genus-level accuracy quickly reached a plateau, at about 80%, with sequences as short as 200 bp. For LR0R-extracted sequences, accuracy increased steadily with sequence length and did not reach ~75% until 400-bp sequences were used (Fig. 4). This suggests that LR3-extracted sequences are suitable for both 454 Titanium and Illumina platforms, since they perform better than LR0R-extracted sequences for both 100- and 400-bp fragments. LR0R-extracted sequences are also suitable for the 454 Titanium platform if 400-bp fragments are used. However, because sequence lengths vary substantially within a 454 Titanium sequencing run, LR3 is the better choice.

Optimization of classification using the naïve Bayesian classifier by use of bootstrap cutoff values.

When used with LOOCV testing, the bootstrap confidence value was calculated using the naïve Bayesian classifier as the frequency of the same taxonomic assignment out of 100 resamplings. A bootstrap confidence value can be chosen and used as a threshold to improve taxonomic assignment. This has a significant influence on the accuracy of the final assignment and is affected by the composition of the underlying training set. Selection of a high bootstrap cutoff increases the accuracy of taxonomic placement (if a close match exists in the data set) by removing uninformative or ambiguous sequences from the training set but then does not allow classification of less-well-represented taxa or poorer-quality sequences (6). Thus, there is always a tradeoff in implementing a bootstrap cutoff. The use of a bootstrap cutoff will result in a greater percentage of unclassified query sequences, but the classified sequences are more likely to be identified correctly. Calculation of the classifier bootstrap cutoffs provided an additional measurement of the confidence level of each assignment, tiled across the 1,400-bp LSU gene fragment, and helped to determine whether the available training set data were sufficient for a robust classification (Fig. 5).

Fig 5
Different bootstrap cutoffs across the 1,400-bp LSU gene region. (A and B) Classification accuracies obtained with sequence segments of 100 bp (A) and 400 bp (B) when different bootstrap cutoffs are used. (C and D) Percentages of remaining 100-bp (C) ...

Across the 1,400-bp LSU gene sequence region, we determined the effects of bootstrap cutoff values ranging from 0% (no bootstrap cutoff) to 90% on classification accuracy (Fig. 5A and B), as well as the percentage of sequences remaining in the training set when a certain bootstrap cutoff was used (Fig. 5C and D), for the tiled 100-bp and 400-bp sequences. For the D2 region, the classification accuracy at the genus level for 100-bp test sequences improved from about 75% without a cutoff to about 81% and 88% with bootstrap values of 50% and 90%, respectively. Similarly, for the D1 region, the classification accuracy improved from about 65% without a cutoff to about 76% and 87% with bootstrap values of 50% and 90%, respectively. These results indicated that classification confidence was greatly improved by using a bootstrap cutoff to estimate classification reliability. The results also illustrate that with lower bootstrap cutoff values, where more error in taxonomic placement is allowed, more sequences remain in the training set. For example, if a 50% bootstrap cutoff was applied to 100-bp sequences, providing about 85% accuracy in taxonomic placement (Fig. 5A), about 70% of the sequences in our training set would be retained in the data set if the target sequence spanned the D1 region, and about 80% of the sequences would be retained if the target sequence spanned the D2 region (Fig. 5D).

For each tested bootstrap cutoff value, more sequences from the D2 region than the D1 region and from the 400-bp data set than the 100-bp data set were above the threshold and classified to the genus level (Fig. 5C and D). These observations suggest that different bootstrap values should be applied for different fragment lengths and gene regions.
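
The tradeoff between accuracy and the fraction of sequences retained can be made concrete with a short sketch. The input format, the function name, and the toy results are assumptions of this example, and a sequence is treated as classified when its bootstrap value is greater than or equal to the cutoff, which the text does not specify.

```python
def cutoff_summary(results, cutoff):
    """Accuracy among retained sequences and percentage retained.

    results: list of (is_correct, bootstrap_percent) pairs, one per query.
    Returns (accuracy_percent, retained_percent); accuracy is NaN when no
    sequence passes the cutoff.
    """
    kept = [is_correct for is_correct, bootstrap in results if bootstrap >= cutoff]
    retained = 100.0 * len(kept) / len(results)
    accuracy = 100.0 * sum(kept) / len(kept) if kept else float("nan")
    return accuracy, retained

# Invented LOOCV outcomes: (assignment correct?, bootstrap value).
toy_results = [
    (True, 100), (True, 95), (True, 80), (False, 60),
    (True, 55), (False, 40), (False, 30), (True, 20),
]
for cutoff in (0, 50, 90):
    print(cutoff, cutoff_summary(toy_results, cutoff))
```

In the toy data, raising the cutoff from 0% to 90% increases accuracy from 62.5% to 100% while the share of classified sequences falls from 100% to 25%, the same direction of change reported for the LSU gene fragments.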

Bootstrap cutoffs and their associated accuracies were calculated for the sequences anchored to either the LR0R or LR3 primer and illustrated the same general trends as the tiled sequences (Fig. 6; see Table S2 in the supplemental material). Accuracy increased with higher bootstrap cutoffs but resulted in lower sequence representation in the training set.

Fig 6
(A) Naïve Bayesian classifier accuracies obtained using different bootstrap cutoff values. (B) Percentages of training set sequences remaining when different bootstrap cutoff values are used. Each y axis shows percentages relevant to the panel ...

The bootstrap values helped to determine whether the underlying training set data were sufficient for a robust classification. Low confidence estimates were obtained for query sequences from underrepresented fungal clades. For example, in our 100-bp training set LOOCV test shown in Table 2, most classification assignments were made with high accuracy and confidence. One exception was an Ascomycota classification designated incertae sedis, with both the lowest accuracy (75%) and the lowest bootstrap (96.5%), even at the class level. This low confidence was probably due either to a lack of enough training set sequences or to errors in the underlying taxonomy.

Performance of naïve Bayesian classifier versus BLASTN.

The accuracy results obtained using the BLASTN approach were similar to those obtained using the naïve Bayesian classifier (Fig. 3A). The genus-level classification accuracy was higher across the D2 region than across the D1 region (~70% versus 60%) (Fig. 3C).

(i) Speed of processing.

The naïve Bayesian classifier provided rapid taxonomic placement and bootstrap calculation, avoiding the computationally expensive alignment step required by the BLASTN approach. We compared the process times of the naïve Bayesian classifier and BLASTN for 100-bp and 400-bp LOOCV tests. The naïve Bayesian classifier processed about 24,000 100-bp read fragments and 8,300 400-bp read fragments per minute. BLASTN processed about 3,100 100-bp read fragments and 6,700 400-bp read fragments per hour. With these specifications, the naïve Bayesian classifier was 464 times faster for 100-bp reads and 74 times faster for 400-bp reads and was able to handle large data sets without the need for alignments.
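
The fold differences follow from converting both throughputs to the same time unit. The short calculation below uses only the rounded throughputs quoted above; the function name is hypothetical.

```python
def fold_speedup(fast_reads_per_minute, slow_reads_per_hour):
    """Ratio of two throughputs after converting the first to reads per hour."""
    return fast_reads_per_minute * 60 / slow_reads_per_hour

speedup_100bp = fold_speedup(24_000, 3_100)   # naive Bayesian vs BLASTN, 100-bp reads
speedup_400bp = fold_speedup(8_300, 6_700)    # naive Bayesian vs BLASTN, 400-bp reads
print(round(speedup_100bp, 1), round(speedup_400bp, 1))
```

With the rounded inputs the ratios come to roughly 464.5 and 74.3, in line with the reported 464-fold and 74-fold differences.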

(ii) Classification accuracy.

The LOOCV tests using the naïve Bayesian classifier and BLASTN with 100-bp and 400-bp sequence lengths showed that both methods can produce highly accurate, comparable results (Fig. 3A versus C and B versus D). For example, comparison between the BLASTN and naïve Bayesian classifier approaches at the D2 region for 400-bp fragments showed 99.9 versus 100.0%, 99.5 versus 99.7%, 98 versus 98%, 93.2 versus 94.3%, and 79.4 versus 80.0% accuracy at the rank of phylum, class, order, family, and genus, respectively (Fig. 3B and D). Both approaches also provided comparable results when tested with the 100-bp window size. At the genus level, both methods had about 63% accuracy for the D1 region and about 72% accuracy for the D2 region (Fig. 3A and C).
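
Accuracy at each rank is simply the percentage of queries whose assigned taxon matches the known taxon at that rank. A minimal sketch follows, assuming that each lineage is stored as a five-element tuple ordered from phylum to genus; the function name and the four example assignments are invented for illustration.

```python
RANKS = ("phylum", "class", "order", "family", "genus")

def accuracy_by_rank(pairs):
    """Percentage of correct assignments at each rank.

    pairs: list of (true_lineage, predicted_lineage), each lineage a tuple
    ordered as in RANKS.
    """
    correct = {rank: 0 for rank in RANKS}
    for truth, predicted in pairs:
        for i, rank in enumerate(RANKS):
            if truth[i] == predicted[i]:
                correct[rank] += 1
    return {rank: 100.0 * correct[rank] / len(pairs) for rank in RANKS}

amanita = ("Basidiomycota", "Agaricomycetes", "Agaricales", "Amanitaceae", "Amanita")
limacella = ("Basidiomycota", "Agaricomycetes", "Agaricales", "Amanitaceae", "Limacella")
fusarium = ("Ascomycota", "Sordariomycetes", "Hypocreales", "Nectriaceae", "Fusarium")
xylaria = ("Ascomycota", "Sordariomycetes", "Xylariales", "Xylariaceae", "Xylaria")
alternaria = ("Ascomycota", "Dothideomycetes", "Pleosporales", "Pleosporaceae", "Alternaria")

# Invented (truth, prediction) pairs: two exact, one wrong genus, one wrong order.
pairs = [(amanita, amanita), (amanita, limacella),
         (fusarium, xylaria), (alternaria, alternaria)]
by_rank = accuracy_by_rank(pairs)
print(by_rank)
```

Because an error at a coarse rank usually implies errors at all finer ranks, accuracy typically decreases from phylum to genus, as in the values reported for both methods.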

DISCUSSION

Accurate sequence classification is a critical component of the identification of core microbial taxa, discovery of novel taxa, and evaluation of fungal community diversity and ecology (2, 3, 24). However, the capacity to obtain thousands of fungal sequences from environmental samples by using high-throughput sequencing technologies has outpaced our ability to classify them accurately and rapidly.

This study describes a new, hand-curated LSU gene sequence data set and calibrates the use of that data set for classification of fungal sequences by use of the naïve Bayesian classifier. The combination of a curated LSU gene sequence training set with the naïve Bayesian classifier (25) now provides a means to rapidly classify large sequence data sets being generated from environmental surveys that otherwise could not be processed or interpreted using traditional taxonomic methods. The classifier and LOOCV analysis used in this study are also useful tools for the identification of underrepresented taxonomic groups and sequences that require taxonomic revisions. This approach will also facilitate the discovery and identification of teleomorph-anamorph matches (24) and the detection of fungi that are difficult to study due to their large phenotypic plasticity. In addition, the testing approach and results obtained in this study will facilitate the implementation of this classifier for use with fungal ITS sequences and other regions that are commonly used for fungal bar coding but cannot be aligned accurately (17).

In this study, we showed features of the LSU gene sequence and the current training set coverage that affect classification accuracy and described the classifier performance with sequence fragments of different lengths and from different regions of the LSU gene. We demonstrated that the naïve Bayesian classifier, which was originally developed for 16S rRNA gene taxonomy (25), can be utilized as a tool for rapid taxonomic placement of large fungal LSU gene data sets. In addition, a bootstrap value can be calculated without going through a computationally more expensive alignment step or sequence homology-level BLAST searches. Although the classifier and BLAST approaches provided similar classification accuracies, the substantially higher speed of the classifier makes it an attractive choice for large data sets that cannot be analyzed using traditional alignment-based algorithms.
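To illustrate why a bootstrap value can be obtained without any alignment step, the following toy sketch scores a query by its overlapping words (k-mers) and then reclassifies random subsets of those words. It is a simplified illustration, not the RDP implementation: the word length of 8, the smoothing constants, the one-eighth subsampling, and the 100 trials are assumptions loosely modeled on the published classifier description (25), and all function names and sequences are hypothetical.

```python
import math
import random
from collections import defaultdict

K = 8  # word length; an assumption for this sketch


def words(seq, k=K):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(records):
    """records: list of (genus, sequence). Returns word counts for scoring."""
    word_count = defaultdict(int)   # training sequences containing each word
    genus_size = defaultdict(int)   # training sequences per genus
    genus_word = defaultdict(lambda: defaultdict(int))
    for genus, seq in records:
        genus_size[genus] += 1
        for w in words(seq):
            word_count[w] += 1
            genus_word[genus][w] += 1
    return len(records), word_count, genus_size, genus_word


def log_score(model, genus, query_words):
    """Sum of log smoothed word probabilities for one genus (assumed smoothing)."""
    n_total, word_count, genus_size, genus_word = model
    score = 0.0
    for w in query_words:
        prior = (word_count.get(w, 0) + 0.5) / (n_total + 1)
        score += math.log((genus_word[genus].get(w, 0) + prior) / (genus_size[genus] + 1))
    return score


def best_genus(model, query_words):
    return max(model[2], key=lambda g: log_score(model, g, query_words))


def classify(model, seq, trials=100, seed=0):
    """Return (assigned genus, fraction of bootstrap trials that agree)."""
    rng = random.Random(seed)
    all_words = sorted(words(seq))
    assigned = best_genus(model, all_words)
    subset = max(1, len(all_words) // 8)  # words drawn per trial (assumed)
    hits = sum(
        best_genus(model, rng.choices(all_words, k=subset)) == assigned
        for _ in range(trials)
    )
    return assigned, hits / trials


# Hypothetical toy training set: two genera, labels "A" and "B".
TOY = [
    ("A", "ACGTACGTACGTACGTACGT"),
    ("A", "ACGTACGTACGTTCGTACGT"),
    ("B", "GGGCCCTTTAAAGGGCCCTTTAAA"),
]
model = train(TOY)
genus, support = classify(model, "ACGTACGTACGTACGT")
print(genus, support)
```

Because each trial only looks up word counts, the cost per query grows with the number of words and candidate genera rather than with pairwise alignment, which is consistent with the speed advantage reported here.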

The LOOCV approach to evaluate the classification accuracy of fungal LSU gene sequences identified the most important factors affecting classification accuracy. Using the summaries shown in Fig. 3 to 6, one can select an appropriate sequence length and region of the gene, with a prediction of the classification accuracy and confidence achievable. This is especially important because available sequencing technologies have very different capabilities regarding sequence length and quality. For fungal LSU gene sequences, longer sequence lengths (e.g., 400 bp) provided higher classification accuracy, especially for the finer-scale taxa (e.g., genus), than shorter lengths. This was not surprising given the structure of the LSU gene, in which conserved regions are interspersed with highly variable regions. However, sequences as short as 75 to 100 bp provided taxonomic placement with >95% accuracy at the order level and >70% accuracy at the genus level when anchored to the LR3 reverse primer. The D2 region was more variable across different fungal taxa than the D1 region (Fig. 3). Consequently, the taxonomic accuracy achieved was higher when test sequences included this region. The accuracy of taxonomic placement with short sequences was higher using the LR3 reverse primer than the LR0R primer (Fig. 4 and 6), suggesting that for PCR amplicon sequencing using technologies capable of single-direction reads only, the LR3 primer is the optimal choice.
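The leave-one-out procedure with fragments anchored at one end of the amplicon can be sketched as follows. This is a hedged illustration only: the shared-word nearest-neighbor rule stands in for either classifier, the "start" and "end" anchors are a simplification of forward- and reverse-primer anchoring, trimming only the query while training on full-length sequences is an assumption, and the labels and sequences are invented.

```python
def kmers(seq, k=4):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fragment(seq, length, anchor):
    """Trim a sequence to `length` bases from its start or its end."""
    return seq[:length] if anchor == "start" else seq[-length:]


def nearest_label(training, query):
    """Stand-in classifier: label of the training sequence sharing most k-mers."""
    q = kmers(query)
    return max(training, key=lambda rec: len(q & kmers(rec[1])))[0]


def loocv_accuracy(records, classify_fn, length=None, anchor="start"):
    """Hold out each record in turn, classify it against the rest, and score."""
    correct = 0
    for i, (label, seq) in enumerate(records):
        training = records[:i] + records[i + 1:]   # everything except the query
        query = seq if length is None else fragment(seq, length, anchor)
        if classify_fn(training, query) == label:
            correct += 1
    return correct / len(records)


# Hypothetical toy data: genera "A" and "B" have two sequences each; "C" has one.
RECORDS = [
    ("A", "AAAACCCCAAAACCCC"),
    ("A", "AAAACCCCAAAACCCG"),
    ("B", "GGGGTTTTGGGGTTTT"),
    ("B", "GGGGTTTTGGGGTTTA"),
    ("C", "ACGTACGTACGTACGT"),
]

full_length = loocv_accuracy(RECORDS, nearest_label)
end_anchored = loocv_accuracy(RECORDS, nearest_label, length=8, anchor="end")
print(full_length, end_anchored)
```

In this toy set the singleton genus "C" is always misassigned, because removing it leaves no representative in the training data. This mirrors how LOOCV exposes underrepresented taxonomic groups.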

Our accuracy calculations were based on high-quality, double-pass sequences that were hand curated and extracted from the NCBI database by using published phylogenetic studies and voucher and culture collections as guidelines. Although current sequencing technologies provide large amounts of sequence information, single reads coupled with inherent biases and sequence error rates (8, 15) and the large number of potential taxa that may not be represented in curated databases could result in more ambiguities in taxonomic placement. This will likely reduce the accuracy of taxonomic placement from that shown here, and caution must be taken in assigning final taxonomic identification to the species level. Therefore, taxonomic expertise is highly recommended (17) when using automated tools, and new guidelines for taxonomic classification of environmental samples need to be developed (10).

The LSU gene training set presented here represents an initial effort based on the availability of high-quality sequences and phylogenies. It lacks sufficient coverage for many described (and incompletely described) fungal taxa, and obtaining longer and more accurate sequence reads of the LSU gene for reference sequences with poor coverage is a priority. A more diverse training set will be especially useful for classifying the less-well-represented phyla (26). Development of a more comprehensive LSU gene sequence database for sequence classification will be accomplished best with wide scientific community participation. We have deposited the current training set on the Ribosomal Database Project website, where it is available for public use and for continued enhancement and inclusion of underrepresented taxa.

ACKNOWLEDGMENTS

This study was supported by the Los Alamos National Laboratory through a Laboratory Directed Research and Development Grant (20080662DR; G.X. and K.-L.L.) and an NIH NIDCR grant (Y1-DE-6006-02; G.X.) and by the U.S. Department of Energy, Office of Biological and Environmental Research, through a Science Focus Area grant (2009LANLF260; C.R.K., A.P.-A., S.A.E., and K.-L.L.). Additional support was provided by the NSF (grant 0919510; A.P.-A.), by the WIU Foundation and Office of Sponsored Projects (A.P.-A.), and by the National Science Council in Taiwan (NSC97-2917-I-006-111; K.-L.L.).

We thank Nick Hengartner, Patrick Chain, and Jim Cole for valuable discussions, La Verne Gallegos-Graves for technical support, and Srivathsan Vijayaraghavan, Sagar Yeraballi, Tabitha Williams, Chipampe Mpondela, Zachary Gossage, and Lynnaun Johnson for their assistance with the development of the training data set.

K.-L.L. led the classifier analysis, and A.P.-A. generated and curated the training set.

Footnotes

Published ahead of print 22 December 2011

Supplemental material for this article may be found at http://aem.asm.org/.

REFERENCES

1. Arnold AE, et al. 2009. A phylogenetic estimation of trophic transition networks for ascomycetous fungi: are lichens cradles of symbiotrophic fungal diversification? Syst. Biol. 58:283–297.
2. Blackwell M. 2011. The fungi: 1, 2, 3 … 5.1 million species? Am. J. Bot. 98:426–438.
3. Blackwell M, Hibbett DS, Taylor JW, Spatafora JW. 2006. Research coordination networks: a phylogeny for kingdom Fungi (deep hypha). Mycologia 98:829–837.
4. Buee M, et al. 2009. 454 pyrosequencing analyses of forest soils reveal an unexpectedly high fungal diversity. New Phytol. 184:449–456.
5. Camacho C, et al. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.
6. Claesson MJ, et al. 2009. Comparative analysis of pyrosequencing and a phylogenetic microarray for exploring microbial community structures in the human distal intestine. PLoS One 4:e6669.
7. Cole JR, et al. 2009. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 37:D141–D145.
8. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. 2008. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36:e105.
9. Guadet J, Julien J, Lafay JF, Brygoo Y. 1989. Phylogeny of some Fusarium species, as determined by large-subunit rRNA sequence comparison. Mol. Biol. Evol. 6:227–242.
10. Hibbett DS, et al. 2011. Progress in molecular and morphological taxon discovery in fungi and options for formal classification of environmental sequences. Fungal Biol. Rev. 25:38–47.
11. Hugenholtz P, Goebel BM, Pace NR. 1998. Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J. Bacteriol. 180:4765–4774.
12. James TY, et al. 2006. A molecular phylogeny of the flagellated fungi (Chytridiomycota) and description of a new phylum (Blastocladiomycota). Mycologia 98:860–871.
13. Jones MD, et al. 2011. Discovery of novel intermediate forms redefines the fungal tree of life. Nature 474:200–203.
14. Jumpponen A, Jones KL. 2009. Massively parallel 454 sequencing indicates hyperdiverse fungal communities in temperate Quercus macrocarpa phyllosphere. New Phytol. 184:438–448.
15. Kunin V, Engelbrektson A, Ochman H, Hugenholtz P. 2010. Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ. Microbiol. 12:118–123.
16. Ludwig W, et al. 2004. ARB: a software environment for sequence data. Nucleic Acids Res. 32:1363–1371.
17. Nilsson RH, Kristiansson E, Ryberg M, Hallenberg N, Larsson KH. 2008. Intraspecific ITS variability in the kingdom fungi as expressed in the international sequence databases and its implications for molecular species identification. Evol. Bioinform. Online 4:193–201.
18. O'Brien HE, Parrent JL, Jackson JA, Moncalvo JM, Vilgalys R. 2005. Fungal community analysis by large-scale sequencing of environmental samples. Appl. Environ. Microbiol. 71:5544–5550.
19. Öpik M, Moora M, Liira J, Zobel M. 2006. Composition of root-colonizing arbuscular mycorrhizal fungal communities in different ecosystems around the globe. J. Ecol. 94:778–790.
20. Öpik M, et al. 2010. The online database MaarjAM reveals global and ecosystemic distribution patterns in arbuscular mycorrhizal fungi (Glomeromycota). New Phytol. 188:223–241.
21. Porras-Alfaro A, Herrera J, Natvig DO, Lipinski K, Sinsabaugh RL. 2011. Diversity and distribution of soil fungal communities in a semiarid grassland. Mycologia 103:10–21.
22. Pruesse E, et al. 2007. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 35:7188–7196.
23. Shannon CE. 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27:623–656.
24. Shenoy BD, Jeewon R, Hyde KD. 2007. Impact of DNA sequence-data on the taxonomy of anamorphic fungi. Fungal Divers. 26:1–54.
25. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73:5261–5267.
26. Werner JJ, et al. 2011. Impact of training sets on classification of high-throughput bacterial 16S rRNA gene surveys. ISME J. 6:94–103.
27. Woese CR, Fox GE. 1977. Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl. Acad. Sci. U. S. A. 74:5088–5090.

Articles from Applied and Environmental Microbiology are provided here courtesy of American Society for Microbiology (ASM)