The sequencing of 16S rRNA genes from clone libraries of DNAs from environmental samples has led to a wealth of information concerning prokaryotic diversity. However, in addition to methodological problems in producing libraries representative of the environmental sample (for a review, see reference 8), this approach is also limited by the difficulty in comparing libraries and determining if they are significantly different.

This problem can be addressed quantitatively by application of the formula for coverage as described by Good (4). Let *X* be a collection of sequences, such as a library of 16S rRNA genes. Define the “homologous” coverage of *X* (or *C*_{X}) by a sample from *X* to be *C*_{X} = 1 − (*N*_{X}/n), where *N*_{X} is the number of unique sequences in the sample (i.e., sequences without a replicate) and *n* is the total number of sequences. In practice, the definition of *N*_{X} depends upon the criteria used to define uniqueness. For instance, McCaig et al. (6) considered sequences without a homolog of ≥97% similarity to be unique. Other authors have used ≥99% sequence similarity as the criterion. In principle, uniqueness can be defined at any level of sequence similarity or evolutionary distance (*D*) and a “homologous coverage curve,” or *C*_{X}(D), can be generated by plotting *C*_{X} versus *D* (Fig. ). The coverage curve then describes how well the sample represents the entire library *X* at various levels of relatedness. Typically, coverage might be low at high levels of relatedness (low values of *D*), indicating that only a small fraction of the sequences representing unique species are, in fact, sampled. In contrast, coverage might be much higher at low levels of relatedness, indicating that representatives of most of the deep phylogenetic groups present in *X* are found in the sample.
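The coverage definition above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the sample is supplied as a symmetric matrix of pairwise evolutionary distances, that a sequence is "unique" at a cutoff *D* when no other sequence in the sample lies within distance *D* of it, and that the curve is evaluated from 0 to an assumed upper limit `d_max` in steps of `step`. All function and parameter names, and the toy distances, are hypothetical.

```python
def homologous_coverage(dist, d):
    """Return C_X = 1 - N_X / n at distance cutoff d.

    dist: symmetric n x n matrix (list of lists) of pairwise distances.
    A sequence is counted in N_X when no other sequence in the sample
    lies within distance d of it (assumed definition of "unique").
    """
    n = len(dist)
    n_unique = sum(
        1
        for i in range(n)
        if not any(dist[i][j] <= d for j in range(n) if j != i)
    )
    return 1.0 - n_unique / n


def coverage_curve(dist, step=0.01, d_max=0.5):
    """Return [(D, C_X(D)), ...] for D = 0, step, ..., d_max."""
    n_steps = int(round(d_max / step))
    return [(k * step, homologous_coverage(dist, k * step))
            for k in range(n_steps + 1)]


# Toy sample of four sequences: two close relatives and two distant ones.
toy = [
    [0.00, 0.02, 0.20, 0.40],
    [0.02, 0.00, 0.20, 0.40],
    [0.20, 0.20, 0.00, 0.30],
    [0.40, 0.40, 0.30, 0.00],
]
for cutoff in (0.01, 0.02, 0.30):
    print(cutoff, homologous_coverage(toy, cutoff))
```

In this toy sample, coverage is 0 at *D* = 0.01 (every sequence is unique), 0.5 at *D* = 0.02 (two of the four sequences remain unique), and 1.0 at *D* = 0.30, which mirrors the typical rise of coverage with *D* described above.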

Results of selected LIBSHUFF comparisons. Homologous (○) and heterologous (●) coverage curves for 16S rRNA gene sequence libraries from environmental samples are shown. Solid lines indicate the value of (*C*_{X} − *C*_{XY})^{2} for the original **...**

While *C*_{X} is the “homologous coverage” of *X* by a sample of *X*, it is also possible to calculate a “heterologous coverage” of *X* (or *C*_{XY}) by a sample *Y* from another collection of sequences by the following formula: *C*_{XY} = 1 − (*N*_{XY}/n), where *N*_{XY} is the number of sequences in a sample of *X* that are not found in a sample of *Y* and *n* is the number of sequences in the sample of *X*. Similarly to *N*_{X}, *N*_{XY} can also be defined at different levels of *D* to generate a coverage curve, *C*_{XY}(D). Moreover, if *X* = *Y*, one might expect the coverage curves *C*_{X}(D) and *C*_{XY}(D) [as well as *C*_{Y}(D) and *C*_{YX}(D)] to be similar. Thus, a test for differences between these coverage curves is also a test for differences between *X* and *Y*. To determine if the coverage curves *C*_{X}(D) and *C*_{XY}(D) are significantly different, the distance between the two curves is first calculated by using the Cramér-von Mises test statistic (7):

$$\Delta C_{XY} = \sum_{D} \left[ C_{X}(D) - C_{XY}(D) \right]^{2}$$

where *D* increases in increments of 0.01. If *X* = *Y*, then Δ*C*_{XY} should not be significantly different from a Δ*C* calculated after randomly shuffling sequences between the two samples, *X* and *Y*. Typically, the sequences are randomly shuffled a large number (*N*) of times (e.g., *N* = 999) and Δ*C* is recalculated after each shuffling. The randomized values plus the empirical value of Δ*C*_{XY} are ranked from largest to smallest, and then the *P* value is estimated to be *r*/(*N* + 1), where *r* denotes the rank of the empirical value of Δ*C*_{XY} (5). The two libraries are considered significantly different when *P* < 0.05. We have created a computer program (LIBSHUFF) that uses a sorted distance matrix containing both *X* and *Y* as input and returns the coverage curves *C*_{X}(D), *C*_{Y}(D), *C*_{XY}(D), and *C*_{YX}(D), as well as the *P* values for both Δ*C*_{XY} and Δ*C*_{YX}, from the distribution of Δ*C*. In addition, the distribution of (*C*_{X} − *C*_{XY})^{2} with *D* appears to be informative and is given as well (see below). The computer program LIBSHUFF was written in Perl and can be downloaded along with more detailed instructions on its use at http://www.arches.uga.edu/~whitman/libshuff.html.
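The shuffling procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the LIBSHUFF code itself: the input is a plain pooled distance matrix indexed by sequence (not the sorted distance matrix that LIBSHUFF reads), the statistic is taken to be the sum over *D* of the squared difference between *C*_{X}(D) and *C*_{XY}(D), the upper limit `d_max` of the cutoffs is assumed, and shuffled values that tie with the empirical value are ranked above it. All names are hypothetical.

```python
import random


def _uncovered(dist, targets, pool, d):
    """Count targets with no other sequence from pool within distance d."""
    return sum(
        1
        for i in targets
        if not any(dist[i][j] <= d for j in pool if j != i)
    )


def delta_c(dist, x, y, cutoffs):
    """Sum over D of (C_X(D) - C_XY(D))^2 for index lists x and y."""
    n = len(x)
    total = 0.0
    for d in cutoffs:
        c_x = 1.0 - _uncovered(dist, x, x, d) / n   # homologous coverage
        c_xy = 1.0 - _uncovered(dist, x, y, d) / n  # heterologous coverage
        total += (c_x - c_xy) ** 2
    return total


def shuffle_test(dist, x, y, n_shuffles=999, step=0.01, d_max=0.5, seed=0):
    """Return (observed delta C_XY, P) with P = r / (N + 1)."""
    cutoffs = [k * step for k in range(int(round(d_max / step)) + 1)]
    observed = delta_c(dist, x, y, cutoffs)
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    rank = 1  # the empirical value itself occupies one rank
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        xs, ys = pooled[:len(x)], pooled[len(x):]
        # Ties are ranked above the empirical value (a conservative assumption).
        if delta_c(dist, xs, ys, cutoffs) >= observed:
            rank += 1
    return observed, rank / (n_shuffles + 1)
```

Here `x` and `y` are lists of row indices into the pooled matrix. Calling `shuffle_test(dist, y, x)` gives the reciprocal comparison, Δ*C*_{YX}.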

A first test of this method was done to ensure that samples from the same library were not shown to be different. Thus, a collection of clonal sequences (*n* = 275) from a soil community study (6) was divided into two samples based upon accession numbers (138 odds and 137 evens). Although the study contained sequences from two sample sites (SL and SAF clones), sequences from both sites were placed in each data set to form nearly equivalent samples. A comparison of Δ*C*_{odds/evens} to Δ*C* values resulted in *P* = 0.871, which indicated that the two samples were not significantly different (Fig. A). Similar results were obtained for Δ*C*_{evens/odds} and other arbitrarily divided sequence libraries (Table ). Thus, as expected, samples taken from the same library were not found to be different.

Comparisons of environmental clone libraries

To demonstrate that this procedure could correctly differentiate samples from different libraries, sequences of clones obtained from an activated sludge (SBR1; *n* = 97; reference 1) were compared to grassland soil SL clones. The SBR1 clones were found to be significantly different from the SL clones (*P* = 0.001; Fig. B). More information on the nature of this difference was obtained by examination of the distribution of (*C*_{X} − *C*_{XY})^{2} with *D* (Fig. B). At low *D*, the actual (*C*_{X} − *C*_{XY})^{2} exceeded the comparable values at *P* = 0.05 obtained during the calculation of Δ*C*. This result suggested that the libraries differed greatly at *D* < 0.10 but shared many deep taxa. However, smaller differences at *D* > 0.3 suggested that not all deep phylogenetic groups were found in both libraries. Similar results were also obtained for comparisons of other soil and bioreactor libraries (Table and data not shown).
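The per-distance comparison described above can be sketched as follows. How LIBSHUFF derives the *P* = 0.05 values at each *D* is not detailed in this section, so the sketch assumes that the same rank rule, *r*/(*N* + 1), is applied separately at each cutoff to the shuffled values of (*C*_{X} − *C*_{XY})^{2}. The function name and input layout are hypothetical.

```python
def cutoffs_exceeding_null(cutoffs, observed_sq, shuffled_sq, alpha=0.05):
    """Return the cutoffs D where the observed (C_X - C_XY)^2 is unusually large.

    observed_sq[k] is the observed squared difference at cutoffs[k];
    shuffled_sq[s][k] is the same quantity after shuffle s.
    """
    n_shuffles = len(shuffled_sq)
    flagged = []
    for k, d in enumerate(cutoffs):
        # Rank of the observed value among itself plus the shuffled values.
        rank = 1 + sum(1 for run in shuffled_sq if run[k] >= observed_sq[k])
        if rank / (n_shuffles + 1) < alpha:
            flagged.append(d)
    return flagged
```

The returned list shows the range of *D* over which the two libraries differ, which is the kind of information read from the solid lines in the figure.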

Three sequence collections consisting of multiple samples were analyzed to determine if differences between the samples could be detected (Table ). Clonal libraries derived from the microbial populations of phosphate-removing (SBR1) and non-phosphate-removing (SBR2) bioreactors differed in the abundance of certain taxa (1). However, these differences were not shown to be significant by our method (Table ). The compositions of libraries from the microbial communities of improved (SL) and unimproved (SAF) upland grass pasture soils were not found to be significantly different (6). We also obtained the same conclusion by our method (Table ). Finally, comparisons of restriction fragment length types from C0 and S0, two clonal libraries derived from arid soils, suggested that C0 was more diverse than S0 (2). Our analysis of the sequences obtained from this study was consistent with this conclusion and further suggested that S0 was a subset of C0. Δ*C*_{S0/C0} was not significant, which suggested that all of the taxa present in S0 were also present in C0 (Table ). However, the reciprocal value Δ*C*_{C0/S0} was significant; therefore, C0 also contained sequences of one or more taxa not found in S0. The distribution of (*C*_{X} − *C*_{XY})^{2} with *D* further indicated that the additional taxa in C0 represented moderately deep phylogenetic groups, 0.15 < *D* < 0.25 (Fig. C).
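The reasoning applied to the reciprocal comparisons, such as S0 versus C0, can be summarized in a small decision function. This is only a sketch of the interpretation given above: the wording of the returned labels is ours, the function name is hypothetical, and the *P* values in the usage line are made up for illustration rather than taken from the table.

```python
def interpret_reciprocal(p_xy, p_yx, alpha=0.05):
    """Summarize a pair of reciprocal P values for libraries X and Y."""
    xy_sig = p_xy < alpha  # X holds sequences not covered by Y
    yx_sig = p_yx < alpha  # Y holds sequences not covered by X
    if xy_sig and yx_sig:
        return "X and Y each contain taxa absent from the other"
    if yx_sig:
        return "X appears to be a subset of Y"
    if xy_sig:
        return "Y appears to be a subset of X"
    return "no significant difference detected"


# Hypothetical P values shaped like the S0 (X) versus C0 (Y) outcome.
print(interpret_reciprocal(p_xy=0.50, p_yx=0.001))
```

With *X* = S0 and *Y* = C0, a nonsignificant Δ*C*_{XY} together with a significant Δ*C*_{YX} falls in the second branch, matching the conclusion that S0 was a subset of C0.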

Sample size should have a major effect on comparisons of libraries. The minimum number of sequences necessary to distinguish two dissimilar libraries was expected to increase with the complexity of the libraries and decrease with the magnitude of the dissimilarity. This point was examined in detail by using two libraries of high diversity and dissimilarity. Variable numbers of clonal sequences were randomly selected from either library SBR1 or SL (*Y*) and compared to the opposite library (*X*), and *P* values were determined for 10 replicates. Approximately 20 and 25 sequences from SBR1 and SL, respectively, were required to differentiate the two libraries (*P* < 0.05) when *X* was represented by 97 and 137 sequences, respectively (Fig. ). Tests were also performed to investigate the required sample size of *X* (SBR1) when the size of *Y* (SL) was small. It was found that nearly all (≥90) of the sequences from the SBR1 library were required to distinguish these libraries when the SL library (*Y*) was represented by 20 sequences (data not shown). When the sizes of both libraries were varied, they were consistently detected as different when the SBR1 (*X*) and SL (*Y*) libraries were represented by ≥40 and ≥30 sequences, respectively (data not shown). While these results may not generalize to all environmental samples, they should be representative of comparisons of libraries from diverse communities, such as those found in soil and bioreactors. Importantly, these results suggest that modestly sized libraries from microbial communities similar in complexity to those used in this study will be distinguished by this method.
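The subsampling experiment can be outlined with a short Python sketch. It assumes the caller supplies a function `p_value(x, y_subset)` that runs the library comparison (for instance, a wrapper around a shuffling test) and that a sample size counts as sufficient only when every replicate gives *P* < 0.05. The exact criterion used to arrive at the sizes reported above is not stated, so that rule, like the function names, is an assumption.

```python
import random


def subsample_p_values(x, y, sizes, p_value, replicates=10, seed=0):
    """For each size, draw random subsets of y and test them against x.

    p_value is a caller-supplied function p_value(x, y_subset) -> float.
    Returns a dict mapping each subset size to the list of P values
    obtained from the replicates.
    """
    rng = random.Random(seed)
    results = {}
    for size in sizes:
        results[size] = [
            p_value(x, rng.sample(y, size)) for _ in range(replicates)
        ]
    return results


def smallest_reliable_size(results, alpha=0.05):
    """Smallest size whose replicates all give P < alpha, or None."""
    for size in sorted(results):
        if all(p < alpha for p in results[size]):
            return size
    return None
```

Here `x` and `y` are lists of sequence identifiers, and each entry in `sizes` must not exceed `len(y)`. Varying the size of *X* instead, or both sizes together, follows the same pattern with the roles of the two libraries exchanged.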

Effect of sample size on the discrimination of libraries. A comparison of the SL library from grassland soil (*Y*; *n* = variable) to the bioreactor library SBR1 (*X*; *n* = 97) (●) and a comparison of the SBR1 (*Y*; *n* = variable) library to the SL (*X*; **...**