Coverage theories for metagenomic DNA sequencing based on a generalization of Stevens’ theorem
Abstract
Metagenomic project design has relied variously upon speculation, semi-empirical and ad hoc heuristic models, and elementary extensions of single-sample Lander–Waterman expectation theory, all of which are demonstrably inadequate. Here, we propose an approach based upon a generalization of Stevens’ theorem for randomly covering a domain. We extend this result to account for the presence of multiple species, from which we derive useful probabilities for fully recovering a particular target microbe of interest and for average contig length. These show improved specificity compared to older measures and recommend deeper data generation than the levels chosen by some early studies, supporting the view that poor assemblies were due at least in part to insufficient data. We assess predictions empirically by generating roughly 4.5 Gb of sequence from a twelve-member bacterial community, comparing coverage for two particular members, Selenomonas artemidis and Enterococcus faecium, which are the least (about 3 %) and most (about 12 %) abundant species in the community, respectively.

Introduction
Microbes are both ubiquitous and singularly important to almost every aspect of life as we know it. There is no shortage of remarkable statistics that might be quoted; for example, symbiont microbial cells outnumber human somatic cells by about 10-fold in most individuals, microbes represent about half the world’s biomass, and most of the probably more than 10 million bacterial species remain to be discovered. Such numbers contrast starkly with our relatively limited understanding of these organisms, which stems largely from difficulties in isolating and culturing most species in a laboratory setting. However, technology has now reached the point where comprehensive metagenomic approaches are being used. Here, whole-genome shotgun (WGS) sequencing is applied directly to the collective DNA of a community of organisms. A number of metagenomes have already been examined in this way (Breitbart et al. 2002; Tyson et al. 2004; Venter et al. 2004; Tringe et al. 2005; Gill et al. 2006; Culley et al. 2006; Angly et al. 2006; Martín et al. 2006; Rusch et al. 2007; Schlüter et al. 2008; Qin et al. 2010; Hess et al. 2011).
Project design remains a significant issue facing metagenomic research. In particular, it is difficult to know how much sequence data should be generated for any particular community. Early projects in the Sanger-era of sequencing often made pragmatic choices based simply on speculation (Handelsman et al. 1998) or budgetary constraints (Kunin et al. 2008). Sequencing was relatively expensive, limiting the amount of data. This meant that while simple metagenomic communities could still be mostly reconstructed (Tyson et al. 2004; Culley et al. 2006), large tracts within highly complex communities would necessarily be left uncharted (Venter et al. 2004; Tringe et al. 2005).
The commonality across all sequencing scenarios is that project success depends strongly on the notion of covering (Wendl and Wilson 2008, 2009a, b), i.e. the process that randomly places one-dimensional DNA segments onto larger genomic DNA targets. Venter et al. (2004) summarize the coverage idiosyncrasies of metagenomic sequencing in terms of the differences in both genome size among the member species and among their relative abundances. In essence, if abundance levels are roughly uniform, any single sequencing read is more likely to have come from a large genome rather than a small one. If instead genome sizes are all similar, this read probably represents an abundant species rather than a rare one. The sampling dynamics of an actual metagenomic project are a community-specific mixture of these two phenomena, and the obvious danger is one of missing the proverbial “needle in the haystack” (Kowalchuk et al. 2007). That is, data may not adequately capture a member that plays some particularly vital internal role within the community and/or has some otherwise important biomedical relevance outside the community. The serendipitous discovery of the proteorhodopsins is a good example (Béjà et al. 2000).
Abundance biases are especially important in metagenomic projects because they can be quite extreme. Consider the viral community studied by Breitbart et al. (2003), which was estimated to contain around 1200 species. Its top 10 members, numbering about 0.8 % of those species, account for about 22 % of the community biomass (Fig. 1). Sequence representation will accrue rapidly for them, while their rare counterparts having abundances only on the order of 

Rank abundance curves are shown for the 12-member test microbial community used here for comparison and for the viral community analyzed by Breitbart et al. (2003). The latter was estimated to have around 1,200 species and to be distributed according to the power law 
The economics of DNA sequencing have improved dramatically with the commercialization of so-called next-generation technologies (Harismendy et al. 2009), suggesting that comprehensive studies of some of the more complex metagenomes are now becoming feasible. It is likely that the amounts of data that will have to be generated in such projects will be larger than what is now typical. For example, the remarkable figure of 10 Tb (more than 3,000 human genomes) has been floated for a single instance of a soil metagenome (Riesenfeld et al. 2004).
With a few exceptions, the current de facto standard methods for making such calculations rely on an elementary extension of traditional single-genome coverage theory (Venter et al. 2004; Tringe et al. 2005; Allen and Banfield 2005; Kunin et al. 2008). Specifically, a species is taken to have a sequence redundancy of 


Table 1
Mathematical notation
| Variable | Meaning |
|---|---|
| $\alpha$ | Abundance of a species within the metagenomic community |
| $G$ | Size in nucleotides of the sequenceable genome |
| $L$ | Average length in nucleotides of a sequence read |
| $R$ | Total number of sequenced reads for a community |
| $\rho$ | Expected number of reads for target species: $\rho = \alpha R$ |
| $\varphi$ | Probability of a position being covered: often $\varphi = L/G$ |
| $\rho\varphi$ | Avg. number of reads spanning a position (redundancy): $\rho\varphi = \alpha R L / G$ |
| $M$ | Stevens’ series limiter: the smaller of $\rho$ and $\lfloor 1/\varphi \rfloor$ |
| $\gamma$ | Number of sequence gaps in target species (random variable) |
| $\kappa$ | Number of reads hitting target species (random variable) |
| $\xi$ | Contig size in target species (random variable) |
| $C$ | Coverage: amount of genome covered by reads (random variable) |
| $V$ | Vacancy: complement of coverage (random variable) |
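As a point of reference for what follows, the elementary expectation-based approach can be sketched in a few lines. This is a sketch of the standard single-genome Lander–Waterman extension (treating the target species at abundance alpha as a lone genome sampled at redundancy alpha·R·L/G), not the new theory; the design point is hypothetical.

```python
import math

def lw_expectations(alpha, R, L, G):
    """Elementary single-genome extension: a target species at abundance
    alpha receives an expected alpha*R of the R project reads, giving a
    Lander-Waterman redundancy of alpha*R*L/G on a genome of size G."""
    c = alpha * R * L / G                  # redundancy (fold coverage)
    covered = G * (1.0 - math.exp(-c))     # expected covered bases
    gaps = alpha * R * math.exp(-c)        # expected number of gaps
    return c, covered, gaps

# Hypothetical design point: 1 Mb target at 1 % abundance, 100 bp reads,
# 10 million project reads.
c, covered, gaps = lw_expectations(0.01, 10_000_000, 100, 1_000_000)
print(f"redundancy {c:.1f}x; covered {covered:,.0f} of 1,000,000 bases; ~{gaps:.1f} gaps")
```

Note that these are ensemble averages only; the discussion below examines why such expectations can mislead as design criteria.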
While such formulae are attractive because of their simplicity, the salient question is whether they are sufficient for project design. Consider the calculation by Rusch et al. (2007). They predicted that 6-fold Sanger redundancy for a 10 Mb genome at 1 % abundance should give an average contig length of about 


We briefly mention a few other results which, though more sophisticated, are still unsuited to this particular design problem. There is an appreciable body of work in the statistics literature regarding abundance estimation and these methods are readily applied to coverage-type calculations, for example as recently described by Hooper et al. (2009). They propose an expected coverage whose modeling parameters rely on fitting data to a user-chosen kernel function. Reported shortcomings include iterative tuning of parameters, limitations of kernel fidelity, and the need to discard certain portions of the data to preserve the model’s integrity. Perhaps even more important is that calculations can only be made once the project is already underway, having generated enough data for parameter-fitting. The model described by Breitbart et al. (2002) has similar technical issues and does not account for variation in genome size. Alternatively, Wendl (2008) developed the density function for the project-wide number of sequence gaps, but that equation also does not adequately consider the sampling biases mentioned above. Stanhope proposed an approximation model (Stanhope 2010) based on the idealized “occupancy” concept of covering (Wendl 2006b). That approach either takes all species at uniform genome size and abundance, or requires speculative distributions for these unknowns. Finally, there are scattered rules-of-thumb (Dutilh et al. 2009; Riesenfeld et al. 2004) whose origins are not entirely clear and upon which we also comment further below (Sect. 3.1).
These observations collectively point to the need for improved theoretical tools to quantify the metagenomic sequencing process. We propose several such results here. Most are corollaries of a generalization of Stevens’ theorem (Stevens 1939; Fisher 1940; Solomon 1978; Wendl and Waterston 2002), suitably extended to account for the distribution aspect of multiple species and its ensuing “abundance bias”. Like all of the methods above, this work does not strictly consider effects related to particular DNA sequence or instrumentation biases, within-species variation, or choices regarding computational processing. Consequently, we view it merely as another installment within a broader research program of metagenomic sequencing theory.
Results
The basic premise is to develop useful and rigorous quantitative tools for designing metagenomic projects based on the community members and the level at which one desires to characterize them. The goal might range anywhere from light sampling simply to estimate community membership, to reconstructing the dominant species, to fully recovering an extremely rare member within a very complex constituency. Consequently, we will speak of the target species as the basis of design. Species that are more readily accessible to sequencing than the target will almost certainly be even better characterized, while the converse is true for less accessible members. This is an inherent property of all random metagenomic sequencing.
The concept of a “target species” is implicit in expectation models and enables quantitative analysis without having to first speculate closures for the invariably unknown properties of the larger metagenomic community. This aspect is enormously practical. The closure problem is necessarily present for semi-empirical models (Hooper et al. 2009; Breitbart et al. 2002), but our theory does not depend on closure estimates.
Generalization of Stevens’ theorem
The problem of covering a one-dimensional domain with finite segments had been examined for some time before being solved successfully by W. L. Stevens in 1939 using a form of the well-known probability concept of inclusion–exclusion and a clever geometric observation (Stevens 1939; Fisher 1940; Solomon 1978). We generalize this result to the scenario of covering one particular domain from among a population of distinct domains. The abstraction is clearly applicable to metagenomic sequencing.
Consider a case in which 




Theorem 1
(Gap Census) If 


for 





This theorem can be applied either directly, or in various derivative ways to obtain rigorous probabilistic quantifiers for metagenomic sequencing. We discuss two of the more useful implementations in Sect. 2.2: the probability of complete target species coverage and the probability that the average size of contiguous regions of coverage in the target exceeds some threshold. (There are other possibilities, though of lesser practical interest; Roach 1995). Finally, we give another handy formula for community sampling, not related to Theorem 1, but derivable rather from elementary considerations.
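For orientation, the single-domain case that Stevens solved can be evaluated directly. The sketch below implements the classical circle-covering formula (n arcs, each covering a fraction a of the circle), not the multispecies generalization of Theorem 1; the alternating series terminates at the series limiter min(n, ⌊1/a⌋).

```python
from math import comb, floor

def stevens_full_coverage(n, a):
    """Stevens (1939): probability that n randomly placed arcs, each a
    fraction a of the circle, cover it completely.  The alternating
    series runs only to min(n, floor(1/a)), the 'series limiter'."""
    m = min(n, floor(1.0 / a))
    return sum((-1) ** j * comb(n, j) * (1.0 - j * a) ** (n - 1)
               for j in range(m + 1))

# Coverage probability rises with the number of arcs:
for n in (30, 60, 90):
    print(n, round(stevens_full_coverage(n, 0.1), 4))
```

The sharp dependence on n in this small example already hints at why probability-based design criteria behave so differently from expectations.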
Implementations of Theorem 1 for metagenomic sequencing
As alluded to in the above discussion of expectation models, let 

Corollary 1
(Complete Coverage) Complete coverage of the target species, 


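A numerical sketch of how a complete-coverage probability of this type can be evaluated is given below. It is an illustration, not necessarily the paper’s exact formula: it assumes (consistent with Theorem 2 below) that the number of target-hitting reads is approximately Poisson with mean alpha·R, and mixes the classical single-domain Stevens probability over that count, with terms built in log space to avoid overflow.

```python
from math import exp, floor, lgamma, log, sqrt

def stevens(n, a):
    """Classical Stevens probability that n arcs of fractional length a
    leave no gap; terms are assembled in log space to avoid overflow."""
    m = min(n, floor(1.0 / a))
    total = 0.0
    for j in range(m + 1):
        x = 1.0 - j * a
        if x <= 0.0:
            break
        logterm = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                   + (n - 1) * log(x))
        total += (-1) ** j * exp(logterm)
    return total

def p_complete(alpha, R, L, G):
    """Probability of completely covering a target of size G at abundance
    alpha: mix Stevens over a Poisson(alpha*R) number of hitting reads
    (the Poisson thinning is itself an approximation)."""
    lam, a = alpha * R, L / G
    lo = max(1, int(lam - 8.0 * sqrt(lam)))
    hi = int(lam + 8.0 * sqrt(lam)) + 1
    total = 0.0
    for n in range(lo, hi):
        logw = n * log(lam) - lam - lgamma(n + 1)   # Poisson log-pmf
        total += exp(logw) * stevens(n, a)
    return total

# Hypothetical target: 20 kb genome at 0.04 % abundance, 100 bp reads,
# 5 million project reads (about 10-fold expected redundancy).
print(p_complete(4e-4, 5_000_000, 100, 20_000))
```

The circular form of Stevens’ result is used for simplicity; end effects on a linear genome would lower the probability slightly.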
This is a high standard of coverage. More relaxed conditions based on contig size are also relevant (Roach 1995; Stanhope 2010). Here, we exploit the fact that 



Corollary 2
(Average Contig Size) If coverage is almost complete, the average contig length is, to a very good approximation, a function only of the target size, 



where 




Formula for community sampling
Sequencing can also be used in a diagnostic capacity to assess what species are present in a community (Eisen 2007; Kunin et al. 2008). In the simplest case, coverage structure and contiguity are subordinated by raw counts of reads, especially if their lengths are sufficient to identify species merely by alignment against reference sequences.
Theorem 2
(Read Count) Let 


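In practice, calculations of this kind reduce to Poisson tail sums with mean alpha·R. The sketch below (function names are illustrative, not from the paper) computes the probability of at least k target hits and inverts the single-hit case to obtain a detection milestone.

```python
from math import ceil, exp, log

def p_at_least(k, alpha, R):
    """P(at least k reads hit a target of abundance alpha among R total
    reads), via the Poisson approximation with mean alpha*R."""
    lam = alpha * R
    pmf, cdf = exp(-lam), 0.0
    for i in range(k):          # accumulate P(0), ..., P(k-1)
        cdf += pmf
        pmf *= lam / (i + 1)
    return 1.0 - cdf

def reads_for_first_hit(alpha, p=0.95):
    """Smallest R with P(at least one hit) >= p: R = ln(1/(1-p))/alpha."""
    return ceil(log(1.0 / (1.0 - p)) / alpha)

# Rare member at 0.05 % abundance: reads needed to sample it at all,
# and the chance of at least 10 hits from 50,000 project reads.
print(reads_for_first_hit(5e-4))          # about 6,000 reads
print(p_at_least(10, 5e-4, 50_000))
```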
Numerical evaluation
Theorem 1 and its corollaries have a number of interesting mathematical properties, the most relevant of which here is the convergence rate. Evaluation requires summing terms that are themselves products of progressively larger and smaller numbers. Consequently, round-off error overwhelms slowly converging series unless extended-precision arithmetic is employed. While such precision is required for much of the parameter space, standard precision can be used for Corollary 1 if the heuristic

is satisfied, where 

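The round-off hazard can be demonstrated directly. The sketch below uses exact rational arithmetic as a stand-in for extended precision (an illustration, not the authors’ implementation), comparing against naive double-precision summation in a benign regime where the series terms shrink quickly.

```python
from fractions import Fraction
from math import comb

def stevens_exact(n, num, den):
    """Stevens series for n arcs of fractional length num/den, summed in
    exact rational arithmetic so alternating cancellation is harmless."""
    a = Fraction(num, den)
    m = min(n, int(1 / a))
    return sum((-1) ** j * comb(n, j) * (1 - j * a) ** (n - 1)
               for j in range(m + 1))

def stevens_float(n, a):
    """Same series in ordinary double precision."""
    m = min(n, int(1.0 / a))
    return sum((-1) ** j * comb(n, j) * (1.0 - j * a) ** (n - 1)
               for j in range(m + 1))

# Benign regime (terms shrink quickly): doubles agree with exact values.
exact = float(stevens_exact(60, 1, 10))
naive = stevens_float(60, 0.1)
print(exact, naive, abs(exact - naive))
```

In slowly converging regimes the intermediate terms grow enormous and the float version degrades catastrophically while the rational version does not, which is precisely the behavior the heuristic is meant to flag.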
Parameter estimation
The formulae above can be used either parametrically or applied to specific species. In the former role, calculations will reveal the attributes of the most extreme member, i.e. its size and abundance, that could be captured for a given P-value and amount of data. In the latter, specific estimates of 





Discussion
Coverage probability as a design variable
We already mentioned above some of the shortcomings of using an expectation-based quantity such as 

The more obvious issues are based on the ensemble nature of expectations themselves. That is, they only characterize trials collectively and not necessarily any single one taken alone. In most instances, variances will not be terribly large compared to respective expectations. For example, the expected number of reads hitting the target species is 



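The first point is easy to quantify in a short sketch: the number of target hits is binomial with mean R·alpha and variance R·alpha(1 − alpha), so the coefficient of variation decays like 1/√(R·alpha) as data accumulate.

```python
from math import sqrt

def hit_cv(alpha, R):
    """Coefficient of variation of the binomial number of target hits:
    sqrt(variance)/mean = sqrt(R*alpha*(1 - alpha)) / (R*alpha)."""
    return sqrt(R * alpha * (1 - alpha)) / (R * alpha)

# Relative spread shrinks like 1/sqrt(R*alpha) as a project accumulates
# reads (target at 0.1 % abundance shown here).
for R in (10_000, 1_000_000, 100_000_000):
    print(R, hit_cv(0.001, R))
```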
The much more substantive concern is actually based on the sensitivity of predictions to small changes in the measure itself. Let us first be clear about the differences in what these measures mean. 

Figure 2 shows the characteristics of both measures for a 1 Mb target species using 100 bp reads. Here, the expectation results were plotted according to 







Abundance versus required number of project reads for a 1 Mb target species using 100 bp reads as specified by various theories. Here, 
A final argument, compelling more from an empirical standpoint, is that 

Figure 2 also shows two rules-of-thumb gleaned from the literature: the product of target species enrichment and redundancy should be at least 20 (Dutilh et al. 2009) and the metagenome redundancy should be around 1000 (Riesenfeld et al. 2004). The former is plotted for an enrichment factor of 15, again showing clearly insufficient data. This factor is largely arbitrary, being adjustable down to values that move the curve well past those of 

Lastly, we comment on another class of models based on contig length. The standard expectation result, quoted above (Sect. 1), is readily derivable as the ratio of coverage expectation to gap expectation, the latter obtained from Lander–Waterman theory. The formula is often avoided because it is divergent (Lander and Waterman 1988; Roach 1995), a consequence of the fact that gaps approach zero much faster than coverage approaches completion. (This can be demonstrated through simple differentiation.) More recently, Stanhope proposed a metagenomic coverage theory based on the occupancy concept (Stanhope 2010). It furnishes the probability that the largest contig exceeds some length, 

where 



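The divergence of the expectation-based contig measure noted above is easy to exhibit numerically. This sketch uses only the standard Lander–Waterman quantities (not the new theory): mean contig length taken as expected covered bases divided by the expected number of contigs.

```python
from math import exp

def lw_mean_contig(L, c):
    """Lander-Waterman mean contig length for reads of length L at
    redundancy c: expected coverage divided by the expected number of
    contigs, i.e. L*(exp(c) - 1)/c.  It diverges as c grows because
    gaps vanish exponentially faster than coverage completes."""
    return L * (exp(c) - 1.0) / c

# Mean contig length explodes with redundancy (100 bp reads):
for c in (2, 6, 10, 14):
    print(c, round(lw_mean_contig(100, c)))
```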
Empirical comparison for a 12-member microbial community
On the more pragmatic side, a model’s ability to make worthwhile predictions can be assessed empirically. Here, we compare Corollary 1 to the data obtained from a 12-member bacterial community for which we generated roughly 4.47 Gb of sequence (about 46 million reads) from 1 lane on an Illumina GA-IIx instrument. Table 2 shows the project parameters, where “data” and “size” indicate the total amount of data generated for each genome and actual genome size, respectively. The “depth” column is their quotient, representing the average number of reads spanning each position in the genome, while the “vacant” column indicates the amount of genome remaining uncovered in the assembly. Though the community has only mild complexity, abundance bias is certainly evident, given that the ratio of highest to lowest abundances exceeds a factor of 4. Larger values are admittedly more common, for example Breitbart et al. (2003) estimate a ratio of more than 300 (Fig. 1). However, the assemblies for such communities generally remain fragmented (Rusch et al. 2007; Hess et al. 2011) and are therefore unworkable as comparisons for the metric 


Table 2
Sequence data for 12-member microbial community
| Species (NCBI accession number) | Depth (fold) | Size (Mb) | Vacant (kb) | Data (Mb) | Abundance |
|---|---|---|---|---|---|
| E. faecalis (AEBQ00000000) | 142.5 | 3.00 | 2.26 | 427.4 | 0.096 |
| E. coli (AJGD01000000) | 62.8 | 4.57 | 2.16 | 287.1 | 0.064 |
| F. prausnitzii (AECU00000000) | 80.7 | 2.96 | 3.24 | 239.0 | 0.054 |
| S. artemidis (AECV01000000) | 56.8 | 2.22 | 2.19 | 126.0 | 0.028 |
| E. faecalis (AEBB00000000) | 139.2 | 2.85 | 1.00 | 396.6 | 0.089 |
| E. faecalis (AEBP00000000) | 115.1 | 3.01 | 3.45 | 346.4 | 0.078 |
| E. faecalis (AEBF00000000) | 148.5 | 2.83 | 1.64 | 420.1 | 0.094 |
| E. faecalis (AEBD00000000) | 147.7 | 2.88 | 1.26 | 425.5 | 0.095 |
| E. faecalis (AEBN00000000) | 132.2 | 3.12 | 1.55 | 412.5 | 0.092 |
| E. faecalis (AEBO00000000) | 131.4 | 3.12 | 3.36 | 409.5 | 0.092 |
| E. faecalis (AEBE00000000) | 131.3 | 3.26 | 2.56 | 426.5 | 0.095 |
| E. faecium (AEBC00000000) | 188.0 | 2.94 | 1.25 | 552.8 | 0.124 |
An important but more subtle aspect in all empirical-theoretical comparisons is controlling for the unavoidable differences that arise as a consequence of project-specific factors, including DNA sequence and instrumentation biases (Harismendy et al. 2009) and the vagaries related to specific combinations of software packages used for processing, alignment, and assembly. In metagenomic projects, we must add inter-strain variation within species as another confounder. These factors, which we will henceforth refer to collectively as “coverage bias”, tend to reduce actual performance below predictions because portions of each species’ genome are inclined against locally spanning reads. While simplistic bias models have been used for posterior fitting (Port et al. 1995; Schbath 1997; Wendl et al. 2001), there is no established, general methodology for resolving this aspect of the design problem a priori.
Table 2 shows that the covering process for this community is indeed biased. Specifically, the amount of uncovered genome (vacancy) for each species is on the order of kilobases, despite sequence depths that often substantially exceed 
We compare Corollary 1 specifically to S. artemidis and E. faecium, which are the least (2.8 %) and most (12.4 %) abundant members of the community, respectively (Table 2).







Comparison of data from bacterial community in Table 2 (circles) to the probability of total genome coverage given by Corollary 1 (solid curves). Each datum represents the average of 50 random drawings from 


The plots show reasonable agreement when considered in light of the bias problem. Although our elementary truncation procedure referenced above corrects somewhat for the worst factors, it unquestionably falls short. If biases are “simple”, meaning relatively benign and not distributed in complicated or extreme ways, it may be possible to further compensate by artificially lowering the read length to simulate less efficient covering. This procedure is demonstrated on E. faecium, where we reduced 


Empirical simplification and the metagenomic design map
Theorem 1 is completely general in that it describes probability as a function of all four independent variables: 








With respect to coverage, the two-variable dependence enables us to construct what is essentially a “design map” for all metagenomic projects in the form of a single plot (Fig. 4). Assuming estimates of 




Let us illustrate the process with a brief example. Suppose our hypothetical 1 Mb target discussed above in the context of the Rusch et al. (2007) project is to be fully recovered at 90 % power using 100 bp reads. This scenario is denoted by the asterisk in Fig. 4 and corresponds to an abscissa value of roughly 



Let us also illustrate the compounding effect of size by now increasing the target to 10 Mb while holding all other parameters constant. Expectation theory simply multiplies everything by 10, according to the rule that the redundancy is constant if we maintain 

Assessing community membership
So far, we have concentrated on the special case 
Consider the example of a 50 kb target at 0.05 % abundance (Fig. 5), which is characteristic of a relatively rare virus. Assuming a read length of 100 bp, i.e. 




Quantification of various sequencing milestones when using 100 bp reads for a 50 kb target at an abundance of 




Substantive contigs start to form only at higher levels of coverage. For example, average contig length reaches 2 kb (20 read lengths) at 95 % probability only after about 4.9 million project reads. Expectation theory predicts 


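Milestone read counts of this kind rescale straightforwardly across abundances because the theory depends on abundance and project size essentially through the product alpha·R. The sketch below illustrates the transformation using the roughly 4.9-million-read contig milestone quoted above as a reference point.

```python
def rescale_reads(R_ref, alpha_ref, alpha_new):
    """Transform a milestone read count computed at one abundance to
    another abundance, holding the product alpha*R fixed."""
    return R_ref * alpha_ref / alpha_new

# A milestone needing ~4.9 million reads at 0.05 % abundance becomes
# 5x larger at 0.01 % abundance.
print(rescale_reads(4.9e6, 5e-4, 1e-4))
```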
The results for this target virus are readily transformed to other abundance values on the basis that 

Finally, it is interesting to assess the maximum data required for very complex communities. For instance, the (Breitbart et al. 2003) power law estimation suggests the least abundant species in their viral community (Fig. 1) is on the order of 





Closing remarks
We have described a rigorous mathematical framework for the analysis and design of metagenomic sequencing projects that does not suffer from various resolution, consistency, or closure problems of earlier works. Though it does not address every outstanding issue, including those related to bias, the theory will be useful for a broad spectrum of calculations. We demonstrated several such aspects above, including use of 
Some have argued that sufficiently complex communities will necessarily remain beyond reach (DeLong 2005; Wooley et al. 2010), primarily because of limitations in sampling, while others have maintained that it is simply a matter of generating enough data (Venter et al. 2004; Tyson et al. 2004; Allen and Banfield 2005). This issue may be debatable in the philosophical sense of “proving a negative”. Yet, in a practical sense, our theory furnishes quantitative conditions under which even the most complex metagenomes can be decoded and the least abundant species recovered. Developments in instrumentation continue apace, suggesting many of these communities will be within reach in the near future.
Methods
Proof of Theorem 1
Assume all reads are independently and identically distributed (IID) among species in the metagenome, each with a Bernoulli probability 





Here, 

The following combinatorial identity can readily be constructed

and substituting this result leads to the factored expression

where 


Proof of Theorem 2
Given the IID property of reads, the Bernoulli proposition of either hitting or missing the target species implies a binomial distribution. Theorem 2 follows directly from its Poisson approximation (Feller 1968), justified by the fact that 

Derivation of numerical heuristic in Equation 1
The heuristic is based on the notion that the rate of growth of successive terms in Corollary 1 is bounded to the degree that the largest one does not overwhelm standard arithmetic precision. The first term is always unity, so we focus on the second. Given 



Collapse of variables
The independent variables in Theorem 1 are governed by 






The outer and inner combinatorial terms are well-approximated by 







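One standard ingredient in collapses of this kind, shown here as an assumption for illustration rather than the paper’s verbatim derivation, is the approximation C(n, j) ≈ n^j/j! for j ≪ n, which removes the separate dependence on n and leaves only products such as alpha·R in play.

```python
from math import comb, factorial

def comb_approx(n, j):
    """For j much smaller than n, the binomial coefficient C(n, j) is
    well-approximated by n^j / j! (illustrative assumption)."""
    return n ** j / factorial(j)

# The ratio of approximation to exact value tends to 1 as n/j grows:
n = 100_000
for j in (1, 3, 6):
    print(j, comb_approx(n, j) / comb(n, j))
```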
Sequence generation and analysis
Whole genome shotgun libraries were constructed from 1

Analytical processing and assembly of the 12 genomes were managed with the Genome Institute automated pipeline. It initially performs a BWA-style trim (Li and Durbin 2009) to a threshold of q10 on all input instrument data. Reads trimmed to less than 35 bp were discarded. The pipeline then runs Velvet (Zerbino and Birney 2008), cycling through the 31–35 k-mer range and optimizing for the k-mer that produces the longest N50 contig length. The entire data set is publicly available through the NCBI Sequence Read Archive (SRA) under the accession numbers listed in Table 2.
BWA (Li and Durbin 2009) was used to align clean paired end reads to the 12 bacterial assemblies, ultimately placing 46,079,563 reads. Up to 5 mismatches were allowed per read, corresponding roughly to minimum 95 % identity. The distribution among the 12 organisms was then assessed using an in-house program called Refcov (Todd Wylie, unpublished) based on the generated alignments. Experimental coverage was then simulated by randomly picking reads from the total pool and assessing subsequent coverage for the target organism, again using Refcov. For E. faecium (NCBI accession: AEBC00000000), we ran 50 simulations each of 3, 3.5, 4, and 4.5 million reads; for S. artemidis (NCBI accession: AECV01000000), we ran 50 selections each of 9, 10, 11, 12, and 15 million reads. These numbers were based on species abundance within the community in Table 2.
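The read-subsampling procedure can be imitated with a toy Monte Carlo, sketched below. This is illustrative only (not the BWA/Refcov pipeline) and uses a circular genome for simplicity: read starts are drawn uniformly, a subsample of a given size is placed, and the fraction of trials achieving complete single-base coverage is reported.

```python
import random

def trial_fully_covered(n_reads, G, L, rng):
    """Place n_reads of length L uniformly on a circular genome of G
    bases (reads wrap around the origin) and report whether every base
    ends up covered."""
    covered = bytearray(G)
    for _ in range(n_reads):
        s = rng.randrange(G)
        e = s + L
        if e <= G:
            covered[s:e] = b"\x01" * L
        else:                      # read wraps past the origin
            covered[s:] = b"\x01" * (G - s)
            covered[: e - G] = b"\x01" * (e - G)
    return all(covered)

# Toy community member: 20 kb genome, 100 bp reads; sweep subsample sizes
# and estimate the full-coverage probability from 50 trials each.
rng = random.Random(7)
G, L = 20_000, 100
for n in (1200, 1800, 2400):
    frac = sum(trial_fully_covered(n, G, L, rng) for _ in range(50)) / 50
    print(n, frac)
```

The empirical fractions rise steeply with subsample size, mirroring the sigmoidal behavior of the coverage probability curves compared against the data above.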
Acknowledgments
The authors wish to acknowledge funding sources for this work: National Human Genome Research Institute grants HG003079 and HG004968.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
References
- Ajay SS, Parker SCJ, Abaan HO, Fuentes-Fajardo KV, Margulies EH. Accurate and comprehensive sequencing of personal genomes. Genome Res. 2011;21(9):1498–1505. doi: 10.1101/gr.123638.111. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
- Allen EE, Banfield JF. Community genomics in microbial ecology and evolution. Nat Rev Microbiol. 2005;3(6):489–498. doi: 10.1038/nrmicro1157. [PubMed] [CrossRef] [Google Scholar]
- Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F (2006) The marine viromes of four oceanic regions. PLoS Biol 4(11), article no. e368 [PMC free article] [PubMed]
- Béjà O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA, Spudich JL, Spudich EN, DeLong EF. Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science. 2000;289(5486):1902–1906. doi: 10.1126/science.289.5486.1902. [PubMed] [CrossRef] [Google Scholar]
- Beyer WH. CRC standard mathematical tables. Boca Raton: CRC Press; 1984. [Google Scholar]
- Bouck J, Miller W, Gorrell JH, Muzny D, Gibbs RA. Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Res. 1998;8(10):1074–1084. [PMC free article] [PubMed] [Google Scholar]
- Breitbart M, Salamon P, Andresen B, Mahaffy JM, Segall AM, Mead D, Azam F, Rohwer F. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci. 2002;99(22):14250–14255. doi: 10.1073/pnas.202488399. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
- Breitbart M, Hewson I, Felts B, Mahaffy JM, Nulton J, Salamon P, Rohwer F. Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol. 2003;185(20):6220–6223. doi: 10.1128/JB.185.20.6220-6223.2003. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
- Chen K, Pachter L. Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol. 2005;1(2):106–112. doi: 10.1371/journal.pcbi.0010024. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
- Clarke L, Carbon J. A colony bank containing synthetic Col El hybrid plasmids representative of the entire E. coli genome. Cell. 1976;9(1):91–99. doi: 10.1016/0092-8674(76)90055-6. [PubMed] [CrossRef] [Google Scholar]
- Culley AI, Lang AS, Suttle CA. Metagenomic analysis of coastal RNA virus communities. Science. 2006;312(5781):1795–1798. doi: 10.1126/science.1127404. [PubMed] [CrossRef] [Google Scholar]
- DeLong EF. Microbial community genomics in the ocean. Nat Rev Microbiol. 2005;3(6):459–469. doi: 10.1038/nrmicro1158. [PubMed] [CrossRef] [Google Scholar]
- Dutilh BE, Huynen MA, Strous M. Increasing the coverage of a metapopulation consensus genome by iterative read mapping and assembly. Bioinformatics. 2009;25(21):2878–2881. doi: 10.1093/bioinformatics/btp377. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
- Eisen JA (2007) Environmental shotgun sequencing: its potential and challenges for studying the hidden world of microbes. PLoS Biol 5(3), article no. e82 [PMC free article] [PubMed]
- Feller W. An introduction to probability theory and its applications. New York: Wiley; 1968. [Google Scholar]
- Fisher RA. On the similarity of the distributions found for the test of significance in harmonic analysis and in Stevens’ problem in geometrical probability. Ann Eugen. 1940;10:14–17. doi: 10.1111/j.1469-1809.1940.tb02233.x. [CrossRef] [Google Scholar]
- Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, McKenney K, Sutton G, Fitzhugh W, Fields C, Gocayne JD, Scott J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna MC, Nguyen DT, Saudek DM, Brandon RC, Fine LD, Fritchman JL, Fuhrmann JL, Geoghagen NSM, Gnehm CL, McDonald LA, Small KV, Fraser CM, Smith HO, Venter JC. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science. 1995;269(5223):496–512. doi: 10.1126/science.7542800. [PubMed] [CrossRef] [Google Scholar]
- Gill SR, Pop M, DeBoy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE. Metagenomic analysis of the human distal gut microbiome. Science. 2006;312(5778):1355–1359. doi: 10.1126/science.1124234. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
- Green ED. Strategies for the systematic sequencing of complex genomes. Nat Rev Genet. 2001;2(8):573–583. doi: 10.1038/35084503. [PubMed] [CrossRef] [Google Scholar]
- Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM. Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol. 1998;5(10):R245–R249. doi: 10.1016/S1074-5521(98)90108-9. [PubMed] [CrossRef] [Google Scholar]
- Harismendy O, Ng PC, Strausberg RL, Wang X, Stockwell TB, Beeson KY, Schork NJ, Murray SS, Topol EJ, Levy S, Frazer KA (2009) Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol 10, article no. R32 [PMC free article] [PubMed]
- Hess M, Sczyrba A, Egan RWKT, Chokhawala H, Schroth G, Luo S, Clark DS, Chen F, Zhang T, Mackie RI, Pennacchio LA, Tringe SG, Visel A, Woyke T, Wang Z, Rubin EM. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331(6016):463–467. doi: 10.1126/science.1200387. [PubMed] [CrossRef] [Google Scholar]
- Hooper SD, Dalevi D, Pati A, Mavromatis K, Ivanova NN, Kyrpides NC. Estimating DNA coverage and abundance in metagenomes using a gamma approximation. Bioinformatics. 2009;26(3):295–301. doi: 10.1093/bioinformatics/btp687. [PMC free article] [PubMed] [CrossRef] [Google Scholar]
- Kowalchuk GA, Speksnijder AGCL, Zhang K, Goodman RM, van Veen JA. Finding the needles in the metagenome haystack. Microb Ecol. 2007;53(3):475–485. doi: 10.1007/s00248-006-9201-2.
- Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008;72(4):557–578. doi: 10.1128/MMBR.00009-08.
- Lander ES, Waterman MS. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics. 1988;2(3):231–239. doi: 10.1016/0888-7543(88)90007-9.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–1760. doi: 10.1093/bioinformatics/btp324.
- Liles MR, Manske BF, Bintrim SB, Handelsman J, Goodman RM. A census of rRNA genes and linked genomic sequences within a soil metagenomic library. Appl Environ Microbiol. 2003;69(5):2684–2691. doi: 10.1128/AEM.69.5.2684-2691.2003.
- Martín HG, Ivanova N, Kunin V, Warnecke F, Barry KW, McHardy AC, Yeates C, He S, Salamov AA, Szeto E, Dalin E, Putnam NH, Shapiro HJ, Pangilinan JL, Rigoutsos I, Kyrpides NC, Blackall LL, McMahon KD, Hugenholtz P. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities. Nat Biotechnol. 2006;24(10):1263–1269. doi: 10.1038/nbt1247.
- Nicholls H (2007) Sorcerer II: the search for microbial diversity roils the waters. PLoS Biol 5(3), article no. e74
- Port E, Sun F, Martin D, Waterman MS. Genomic mapping by end-characterized random clones: a mathematical analysis. Genomics. 1995;26(1):84–100. doi: 10.1016/0888-7543(95)80086-2.
- Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, Nielsen T, Pons N, Levenez F, Yamada T, Mende DR, Li J, Xu J, Li S, Li D, Cao J, Wang B, Liang H, Zheng H, Xie Y, Tap J, Lepage P, Bertalan M, Batto JM, Hansen T, Paslier DL, Linneberg A, Nielsen HB, Pelletier E, Renault P, Sicheritz-Ponten T, Turner K, Zhu H, Yu C, Li S, Jian M, Zhou Y, Li Y, Zhang X, Li S, Qin N, Yang H, Wang J, Brunak S, Doré J, Guarner F, Kristiansen K, Pedersen O, Parkhill J, Weissenbach J, Bork P, Ehrlich SD, Wang J. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464(7285):59–65. doi: 10.1038/nature08821.
- Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–552. doi: 10.1146/annurev.genet.38.072902.091216.
- Roach JC. Random subcloning. Genome Res. 1995;5(5):464–473. doi: 10.1101/gr.5.5.464.
- Roach JC, Boysen C, Wang K, Hood L. Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics. 1995;26(2):345–353. doi: 10.1016/0888-7543(95)80219-C.
- Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, Platt T, Bermingham E, Gallardo V, Tamayo-Castillo G, Ferrari MR, Strausberg RL, Nealson K, Friedman R, Frazier M, Venter JC (2007) The Sorcerer II global ocean sampling expedition: Northwest Atlantic through eastern tropical Pacific. PLoS Biol 5(3), article no. e77
- Schbath S. Coverage processes in physical mapping by anchoring random clones. J Comput Biol. 1997;4(1):61–82. doi: 10.1089/cmb.1997.4.61.
- Schlüter A, Bekel T, Diaz NN, Dondrup M, Eichenlaub R, Gartemann KH, Krahn I, Krause L, Krömeke H, Kruse O, Mussgnug JH, Neuweger H, Niehaus K, Pühler A, Runte KJ, Szczepanowski R, Tauch A, Tilker A, Viehöver P, Goesmann A (2008) The metagenome of a biogas-producing microbial community of a production-scale biogas plant fermenter analysed by the 454-pyrosequencing technology. J Biotechnol 136(1–2):77–90
- Solomon H. Geometric probability. Philadelphia: Society for Industrial and Applied Mathematics; 1978.
- Stanhope SA (2010) Occupancy modeling, maximum contig size probabilities and designing metagenomic experiments. PLoS ONE 5(7), article no. e11652
- Stevens WL. Solution to a geometrical problem in probability. Ann Eugen. 1939;9:315–320. doi: 10.1111/j.1469-1809.1939.tb02216.x.
- Thousand Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–1073. doi: 10.1038/nature09534.
- Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–557. doi: 10.1126/science.1107851.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43. doi: 10.1038/nature02340.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO. Environmental genome shotgun sequencing of the Sargasso sea. Science. 2004;304(5667):66–74. doi: 10.1126/science.1093857.
- von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P. Quantitative phylogenetic assessment of microbial communities in diverse environments. Science. 2007;315(5815):1126–1130. doi: 10.1126/science.1133420.
- Vos M, Quince C, Pijl AS, DeHollander M, Kowalchuk GA (2011) A comparison of rpoB and 16S rRNA as markers in pyrosequencing studies of bacterial diversity. PLoS ONE 7(2), article no. e30600
- Wendl MC. A general coverage theory for shotgun DNA sequencing. J Comput Biol. 2006;13(6):1177–1196. doi: 10.1089/cmb.2006.13.1177.
- Wendl MC. Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing. Bull Math Biol. 2006;68(1):179–196. doi: 10.1007/s11538-005-9021-4.
- Wendl MC. Random covering of multiple one-dimensional domains with an application to DNA sequencing. SIAM J Appl Math. 2008;68(3):890–905. doi: 10.1137/06065979X.
- Wendl MC, Barbazuk WB (2005) Extension of Lander-Waterman theory for sequencing filtered DNA libraries. BMC Bioinform 6, article no. 245
- Wendl MC, Waterston RH. Generalized gap model for bacterial artificial chromosome clone fingerprint mapping and shotgun sequencing. Genome Res. 2002;12(12):1943–1949. doi: 10.1101/gr.655102.
- Wendl MC, Wilson RK (2008) Aspects of coverage in medical DNA sequencing. BMC Bioinform 9, article no. 239
- Wendl MC, Wilson RK (2009a) Statistical aspects of discerning indel-type structural variation via DNA sequence alignment. BMC Genom 10, article no. 359
- Wendl MC, Wilson RK (2009b) The theory of discovering rare variants via DNA sequencing. BMC Genom 10, article no. 485
- Wendl MC, Marra MA, Hillier LW, Chinwalla AT, Wilson RK, Waterston RH. Theories and applications for sequencing randomly selected clones. Genome Res. 2001;11(2):274–280. doi: 10.1101/gr.GR-1339R.
- Wooley JC, Godzik A, Friedberg I (2010) A primer on metagenomics. PLoS Comput Biol 6(2), article no. e1000667
- Xia LC, Cram JA, Chen T, Fuhrman JA, Sun F (2011) Accurate genome relative abundance estimation based on shotgun metagenomic reads. PLoS ONE 6(12), article no. e27992
- Zerbino DR, Birney E (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5):821–829





















