# SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data

^{1}Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 08540, USA,

^{2}The Centre for Applied Genomics (TCAG), the Hospital for Sick Children, Toronto, ON M5G 1L7, Canada and

^{3}Center for Applied Genomics, the Children's Hospital of Philadelphia, Department of Pediatrics, University of Pennsylvania, Philadelphia, PA 19104, USA

## Abstract

We develop a statistical tool, SNVer, for calling common and rare variants in analysis of pooled or individual next-generation sequencing (NGS) data. We formulate variant calling as a hypothesis testing problem and employ a binomial–binomial model to test the significance of the observed allele frequency against sequencing error. SNVer reports one single overall *P*-value for evaluating the significance of a candidate locus being a variant, based on which multiplicity control can be obtained. This is particularly desirable because tens of thousands of loci are simultaneously examined in typical NGS experiments. Each user can choose the false-positive error rate threshold he or she considers appropriate, instead of just the dichotomous decisions of whether to ‘accept or reject the candidates’ provided by most existing methods. We use both simulated data and real data to demonstrate the superior performance of our program in comparison with existing methods. SNVer runs very fast and can complete testing 300 K loci within an hour. This excellent scalability makes it feasible for analysis of whole-exome sequencing data, or even whole-genome sequencing data using a high-performance computing cluster. SNVer is freely available at http://snver.sourceforge.net/.

## INTRODUCTION

The past few years have seen a dramatic development in sequencing technology, which has made the per-base cost of DNA sequencing plummet by ~100000-fold over the past decade (1). Because of the affordable cost and high digital resolution, the new or ‘next-generation’ sequencing (NGS) technology is replacing the traditional hybridization-based microarray technology in many applications (2). For genetics studies, NGS holds the promise to revolutionize genome-wide association studies (GWAS). The recently completed phase of GWAS mainly addresses common SNPs with minor allele frequency (MAF) >5%, based upon the common disease/common variant (CD/CV) hypothesis (3). However, the identified common variants explain only a small proportion of heritability (4). Rare variants have therefore been hypothesized to account for the missing heritability (5,6). To identify rare variants, a direct and more powerful approach is to sequence a large number of individuals (7). This line of thought also implicitly motivates the recent 1000 Genomes Project, which will sequence the genomes of 1200 individuals of various ethnicities by NGS (8). It is expected to extend the catalog of known human variants down to a frequency of ~1%.

Although the cost of whole-genome or exome sequencing of all enrolled subjects is prohibitively high now, such studies will eventually be carried out in a manner similar to GWAS with very large sample sizes (9). While the cost of sequencing a whole genome is being brought down toward $1000 (10), in the interim a cost-effective strategy is needed to take full advantage of NGS. Such issues with cost and labor are not new: similar problems were confronted in the early, expensive stage of GWAS and were circumvented by focusing on small candidate regions and by pooling genomic DNA (11,12). Borrowing the same idea, many targeted re-sequencing applications utilizing pooling have appeared in the past few years (13–16).

The first-step analysis of NGS data for genetics study is often to identify genomic variants among sequenced samples. Quite a few SNP calling tools have been implemented to identify SNPs from sequencing of individual genomes. SNP calling is a relatively straightforward problem in analysis of sequencing data of individual genomes, because the frequency of a candidate allele can be only 0 (non-variant), 0.5 (heterozygous) or 1 (alternate homozygous) for a diploid genome. Despite (high) sequencing error of NGS, a reliable call can be easily made given a high depth of coverage, say 20× to 30×. Consequently, statistical models for SNP calling have been developed and integrated as one simple functional module in many NGS short reads analysis tools such as SAMtools (17), MAQ (18), GATK (19) and VarScan (20). SAMtools and MAQ use a Bayesian statistical model to compute the posterior probabilities of the three possible genotypes. Specifically, for the likelihood part, they employ a binomial distribution to characterize sampling of the two haplotypes, and the prior probability, like other Bayesian approaches, is pre-specified. SAMtools and MAQ empirically set the prior probability of observing a heterozygote to be 0.001 for the discovery of new SNPs, and 0.2 for inferring genotypes at known SNP sites. A similar Bayesian algorithm is used by GATK followed by sophisticated filtering. Such Bayesian approaches may not be ideal for multiplicity control because of the subjectivity of assigning the prior probability. VarScan implements a heuristic/statistical method. For each candidate site, it applies several heuristic filters such as having a minimum number of supporting reads and allele frequency reaching a minimum threshold. It also conducts a Fisher's exact test for testing the deviation of the read counts supporting variant alleles from being generated because of sequencing error. 
Those heuristic filters overlap with the Fisher's exact test in terms of reducing false positives. When they are not considered systematically, they may distort the null distribution of the test statistic and thus invalidate the resultant *P*-values for multiplicity control. The variant calling program we develop here is based on a frequentist approach, which systematically considers all relevant factors and outputs *P*-values that are valid for multiplicity control.

Identifying SNPs from pooled NGS data is more challenging in that pooled DNA is sampled from a number of individuals, which consequently gives rise to variant allele frequencies other than simply 0, 0.5 or 1. Driven by the need for analysis of an increasing amount of pooled NGS data, several programs/methods for the detection of variants from pooled data have been developed. SNPSeeker employs large deviation theory for SNP detection (21). It compares observed allele frequencies against the distribution of sequencing errors as measured by the Kullback–Leibler (KL) distance (22). One limitation of this approach is that its error model has to be estimated from negative control data. SNPSeeker was recently extended to SPLINTER with two main improvements (23). First, it is capable of detecting rare short indels. Second, it provides a good cutoff after ranking all candidate variants to balance power and type I error rate, which, however, requires additional positive control data. CRISP (24) models the number of reads of the reference and alternate alleles at a particular position across all pools as a contingency table, which is then tested by the Fisher's exact test. Its working hypothesis is that, due to their rareness, rare variants will be present only sporadically across pools, so that a pool carrying one shows an excess of reads with the alternate allele compared with the other pools, and this excess is expected to be captured by the Fisher's exact test. CRISP then conducts a complementary test for the overabundance of alternate alleles within each pool against the sequencing error rate. Although it is shown that CRISP outperforms SNPSeeker, MAQ and VarScan (24), it has the following limitations. First, its working hypothesis does not hold well for common variants. When the MAF is large and/or the number of individuals in each pool is large, sporadic presence will disappear and result in no prominent excess of reads that can be captured by the Fisher's exact test. 
Second, the method is not applicable to single-pool data. Third, rareness and overabundance of alternate alleles are related but are captured separately using two different models, which may not be an efficient approach. In addition, these two separate tests make it hard to obtain an overall multiplicity control. Finally, its limited computational efficiency makes scalability an issue and may prevent its application in analysis of whole-exome or whole-genome sequencing data. The main bottleneck comes from computing the *P*-values of a large number of contingency tables in the Fisher's exact test.

In addition to the above direct SNP calling programs, there are also other relevant studies for analysis of pooled NGS data, including estimating allele frequencies from pooled sequencing (25), evaluating the ability to detect rare SNPs (15) and investigating the power of variant detection in pooled DNA for NGS and the optimal pooling designs (26), among others. In this article, we develop a statistical tool SNVer (single nucleotide variant caller/seeker) for detecting variants in analysis of NGS data. SNVer is applicable to both pooled and individual data, and in particular it addresses the limitations that pre-existing methods have.

## MATERIAL AND METHODS

### Statistical models for single-pool data

For a genomic locus, let *θ* be its MAF in a population. If *θ* is larger than a threshold *θ*_{0} (*θ*>*θ*_{0}), then we call it a single nucleotide polymorphism (SNP). Suppose that we sample *N* individuals (haploids) from this population for pooled sequencing. We assume that the number of individuals (*n*) carrying the minor allele follows a binomial distribution *b*(*N*, *θ*), namely,

$$P(n \mid N, \theta) = \binom{N}{n}\,\theta^{n}\,(1-\theta)^{N-n},$$

with *n* = 0, 1, …, *N*.

Now we re-sequence this genomic region. Suppose that *K* short reads cover this locus. In the absence of sequencing error, given *n* individuals carrying the minor allele, the number of minor alleles *X* we observe from the *K* short sequence reads also follows a binomial distribution *b*(*K*, *n/N*), namely:

$$P(X \mid K, n) = \binom{K}{X}\left(\frac{n}{N}\right)^{X}\left(1-\frac{n}{N}\right)^{K-X},$$

with *X* = 0, 1, …, *K*.

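Now we assume the sequencing error rate to be *ε*, under which the minor allele will be flipped to one of the other three alternate alleles, and vice versa. So the observed *X* follows a binomial distribution *b*(*K*, *p*_{n}), namely,

$$P(X \mid K, n, \varepsilon) = \binom{K}{X}\,p_{n}^{X}\,(1-p_{n})^{K-X},$$

with

$$p_{n} = \frac{n}{N}\,(1-\varepsilon) + \left(1-\frac{n}{N}\right)\frac{\varepsilon}{3}.$$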

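Since *n* is not observable, we sum it out and obtain the statistical model for *X* as

$$P(X \mid K, N, \theta, \varepsilon) = \sum_{n=0}^{N} P(X \mid K, n, \varepsilon)\,P(n \mid N, \theta).$$

Now we consider the hypothesis test of whether this locus is a (rare) variant (*θ*>*θ*_{0}):

$$H_{0}: \theta \le \theta_{0} \quad \text{versus} \quad H_{1}: \theta > \theta_{0}.$$

For an observed count *x* of reads carrying the minor allele, its significance *P*-value will be

$$P\text{-value} = P(X \ge x \mid K, N, \theta_{0}, \varepsilon) = \sum_{k=x}^{K} P(k \mid K, N, \theta_{0}, \varepsilon).$$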
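The single-pool test described above can be sketched in a few lines of Python. This is a minimal illustration and not the SNVer implementation: the function names are hypothetical, and the per-read probability assumes that a miscalled base lands on each of the three other bases with equal probability (*ε*/3), which is one reading of the error model in the text. Binomial terms are evaluated in log space so that large *N* or *K* do not overflow.

```python
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (log space for stability)."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1.0 - p))


def read_minor_prob(n, N, eps):
    """Chance that one read shows the minor allele when n of N haplotypes carry it.

    ASSUMPTION: a miscall lands on each of the three other bases with equal
    probability, so a non-minor base is read as minor with probability eps / 3.
    """
    frac = n / N
    return frac * (1.0 - eps) + (1.0 - frac) * eps / 3.0


def marginal_pmf(x, K, N, theta, eps):
    """P(X = x) after summing out the unobserved carrier count n."""
    return sum(
        binom_pmf(n, N, theta) * binom_pmf(x, K, read_minor_prob(n, N, eps))
        for n in range(N + 1)
    )


def snv_pvalue(x, K, N, theta0, eps):
    """One-sided P-value P(X >= x), evaluated at the null boundary theta = theta0."""
    tail = sum(marginal_pmf(k, K, N, theta0, eps) for k in range(x, K + 1))
    return min(1.0, tail)  # guard against rounding slightly above 1


if __name__ == "__main__":
    # Hypothetical locus: 50 pooled haplotypes, 30 reads, 4 reads with the minor allele.
    print(snv_pvalue(x=4, K=30, N=50, theta0=0.01, eps=0.01))
```

Evaluating the tail at *θ* = *θ*_{0} is the usual choice for a one-sided composite null, because the distribution of *X* shifts upward as *θ* grows, so the boundary gives the largest tail probability under the null.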

### Partial conjunction test for multiple-pool data

The above statistical model is for testing a locus in one single-pool data set. For *M* pools, we propose to test it in each pool separately. We therefore obtain a set of *M* hypotheses for each candidate variant. The problem of making a variant call at one specific locus involves the simultaneous testing of hypotheses at the set level. Typical questions considered in the multiple-testing framework include: (i) Are all *M* hypotheses in the set true? (ii) Are all *M* hypotheses in the set false? (iii) Are at least *u* out of *M* hypotheses in the set false? These questions are referred to as conjunction test, disjunction test and partial conjunction test, respectively (27). Testing whether a locus is a variant based on multiple-pool data is equivalent to the partial conjunction test that at least *u*=1 out of the *M* hypotheses for that locus is false. Let *p*_{(1)} ≤ *p*_{(2)} ≤ … ≤ *p*_{(M)} be the ordered *P*-values obtained from each single-pool test. Following (27), we employ the Simes method to calculate the pooled *P*-value for the partial conjunction test as

$$p^{1/M} = \min_{1 \le i \le M}\left\{\frac{M}{i}\,p_{(i)}\right\}.$$

If the set of *M* null *P*-values at the tested locus are independent, Benjamini and Heller (27) show that *p*^{1/M} is a valid *P*-value for testing the partial conjunction null. The Benjamini–Hochberg (BH) procedure (28) and other multiple-test adjustments can then be applied to the pooled Simes *P*-values for multiplicity control when testing a large number of loci. It has been shown that this Simes–BH procedure controls the false discovery rate (FDR) at the pre-specified nominal level (27).
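The two steps in this subsection, combining the per-pool *P*-values at one locus with the Simes method and then applying the BH procedure across loci, can be sketched as follows. The function names are hypothetical and the numbers in the usage lines are made up for illustration; this is a sketch of the standard procedures, not code taken from SNVer.

```python
def simes_pvalue(pool_pvals):
    """Simes combination for the u = 1 partial conjunction test:
    the minimum over i of (M / i) * p_(i), with p_(1) <= ... <= p_(M)."""
    ordered = sorted(pool_pvals)
    M = len(ordered)
    return min(M / i * p for i, p in enumerate(ordered, start=1))


def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up procedure.

    Returns the sorted indices of the hypotheses rejected at FDR level q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda j: pvals[j])
    cutoff = 0
    for rank, j in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose P-value is under its threshold.
        if pvals[j] <= rank * q / m:
            cutoff = rank
    return sorted(order[:cutoff])


if __name__ == "__main__":
    # Hypothetical per-pool P-values for three loci, three pools each.
    loci = [[0.001, 0.40, 0.90], [0.20, 0.50, 0.70], [0.004, 0.006, 0.30]]
    pooled = [simes_pvalue(p) for p in loci]
    print(pooled, bh_reject(pooled, q=0.05))
```

Because BH is a step-up procedure, a locus whose pooled *P*-value misses its own rank threshold can still be called if a less significant locus passes its threshold, which is why the loop records the largest passing rank rather than stopping at the first failure.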

### Data sets

#### Simulated data

We simulate synthetic data to investigate the numerical performances of our approach. For the single-pool scenario, a total of 10000 data sets are generated under each combination of several conditions:

- Sequencing coverage: low (10×) and high (30×).
- Sequencing error: low (0.01) and high (0.05).
- MAF: rare variants with *θ*~*U*(0.001, 0.01), less common variants with *θ*~*U*(0.01, 0.05) and very common variants with *θ*~*U*(0.05, 0.5).
- The number of sequenced individuals, from low to high: *N*=10, 20, 50, 100, 200, 500, 1000, 1500, 2000.

For each MAF setting *θ*~*U*(*θ*_{min}, *θ*_{max}), we calculate the power of our approach for detecting variants by testing the null hypothesis *H*_{0}: *θ*<*θ*_{min}. Meanwhile, to demonstrate that type I error is controlled at the nominal level by our proposed test, we simulate *θ*~*U*(*0*, *θ*_{min}), and evaluate how likely the same null hypothesis *H*_{0}: *θ*<*θ*_{min} will be rejected by mistake. For both power and type I error evaluations, we call a variant at the nominal level 0.05.
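A scaled-down version of this single-pool simulation can be sketched as below. It is an illustration under stated assumptions rather than the simulation code used for the figures: the names are hypothetical, *K* is treated as the total read depth at the locus, miscalls are assumed to split evenly over the three other bases (*ε*/3), and the replicate count is reduced from 10000 to keep the run short.

```python
import random
from math import exp, lgamma, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (log space for stability)."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log(1.0 - p))


def read_minor_prob(n, N, eps):
    """Per-read chance of seeing the minor allele (ASSUMPTION: eps / 3 miscalls)."""
    frac = n / N
    return frac * (1.0 - eps) + (1.0 - frac) * eps / 3.0


def pvalue_table(K, N, theta0, eps):
    """P(X >= x) for every x in 0..K at the null boundary theta = theta0."""
    pmf = [
        sum(binom_pmf(n, N, theta0) * binom_pmf(x, K, read_minor_prob(n, N, eps))
            for n in range(N + 1))
        for x in range(K + 1)
    ]
    table, tail = [0.0] * (K + 1), 0.0
    for x in range(K, -1, -1):  # accumulate the upper tail from x = K downward
        tail += pmf[x]
        table[x] = min(1.0, tail)
    return table


def draw_binomial(rng, n, p):
    """Draw one binomial variate as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def rejection_rate(theta_low, theta_high, theta0, K, N, eps, reps, alpha=0.05, seed=0):
    """Fraction of simulated loci with theta ~ U(theta_low, theta_high) that are called."""
    rng = random.Random(seed)
    table = pvalue_table(K, N, theta0, eps)
    rejected = 0
    for _ in range(reps):
        theta = rng.uniform(theta_low, theta_high)
        n = draw_binomial(rng, N, theta)                       # carriers in the pool
        x = draw_binomial(rng, K, read_minor_prob(n, N, eps))  # minor-allele reads
        rejected += table[x] <= alpha
    return rejected / reps


if __name__ == "__main__":
    settings = dict(theta0=0.05, K=30, N=100, eps=0.01, reps=2000)
    print("power:", rejection_rate(0.05, 0.5, **settings))
    print("type I error:", rejection_rate(0.0, 0.05, **settings))
```

Drawing *θ* above the threshold estimates power, and drawing it below the threshold estimates the type I error rate of the same test; tabulating the *P*-value for every possible read count once avoids recomputing the marginal sum in each replicate.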

For the multiple-pool scenario, we follow the above single-pool simulation settings except that we simulate five pools with the same number of individuals in each pool and the total *N*=10, 20, 50, 100, 200, 500, 1000, 1500, 2000.

#### Real data

We also assess the performance of our method in analysis of two pooled and one individual real NGS data sets as summarized in Table 1. The first one was an in-house Autism data set generated using ABI SOLiD platform from sequencing three genomic regions, denoted as Core, CDH9 and CDH10, of size 187, 158 and 158kb, respectively, on chromosome 5 of the human genome. We made 24 pools with six individuals in each, totaling 144 samples. We have 12 pools for Autism case samples and the other half 12 pools for control samples. One case pool experiment failed and we therefore have 23 pools in total for analysis. We aligned short sequence reads by the Bioscope software from ABI SOLiD with default parameters. The mapped short sequence reads cover >96% of the three target regions with average 90× depth of coverage per individual. Meanwhile, we collected individual genotyping data for each sample, which were generated from Illumina HumanHap550v3 SNP arrays with approximately 550000 markers. With individual genotyping data, we may calculate the concordance of identified variants between pooled sequencing data and individual genotyping data for evaluating variant call quality.

The second data set was collected in a recent study of causative Type 1 Diabetes (T1D) variants (14). Exons and splice sites of 10 candidate genes were re-sequenced by the 454 sequencing system. Ten pooled samples each comprising equal amounts of DNA from 48 T1D patients and 10 pooled samples each comprising equal amounts of DNA from 48 healthy controls were made, totaling 480 T1D patients and 480 healthy controls from Great Britain. For each of the 20 pooled DNA samples, the numbers of produced short reads range from 281270 to 579102, with average length of 250 bases and 9416365 reads in total. We mapped these reads by BWA-SW (29) with default parameters and the average depth of coverage is 80× per individual.

The third one was an in-house individual sequencing data set. We performed paired end exome sequencing on three members affected with attention deficit/hyperactivity disorder (ADHD) in a pedigree, using the Illumina Genome Analyzer IIx platform with read lengths of 76bp. It targets all human exonic regions totaling ~38Mb. We aligned the short reads by BWA with default parameters and removed duplicates by picard (http://sourceforge.net/projects/picard/). These mapped and cleaned short reads were then re-aligned locally by the GATK IndelRealigner tool (30). The average depth of coverage is ~20× for each patient. Meanwhile, we also collected the genotyping data of these three patients, generated from the Illumina Human610-Quad version 1 SNP arrays with ~610000 markers (including ~20000 non-polymorphic markers).

For pooled sequencing data, CRISP has been shown to outperform other existing methods (24), so we focus on the comparison of our program with CRISP in performance evaluation. We also include SAMtools for comparison although it is not designed for pooled sequencing data. For the ADHD individual data, we compare SNVer with SAMtools and GATK. Variant positions were called and filtered by SAMtools with all default settings plus the filter `awk '($3=="*"&&$6>=50)||($3!="*"&&$6>=20)'`, as suggested by the SAMtools website. For the ADHD data, SAMtools with the suggested setting returned so many variants that we also report SAMtools results with an additional filter, `-d20`, which removes variant calls with sequencing coverage less than 20, in order to obtain numbers of variant calls comparable to those of SNVer. We also called variants using the GATK UnifiedGenotyper, followed by further filtering based on the latest recommendations from the authors of GATK (see Supplementary Data for the detailed settings). SNVer utilizes SAMtools (17) to process and pile up mapped short reads. CRISP has its own pileup procedure integrated in its analysis pipeline. To make a fair comparison, following CRISP (24), we perform similar quality control and set the same processing parameters such as mapping quality and base quality filtering thresholds.

## RESULTS

### Power and type I error evaluations

The single-pool results are shown in Figure 1. We can see that our method controls the type I error rate at the nominal level 0.05 in all settings. The number of sampled individuals (sample size) and the depth of coverage are both shown to improve power. The largest improvement attributed to depth of coverage (from 10× to 30×), ~10%, is observed for rare variants with high sequencing error (upper-right panel). The improvement contributed by larger sample size keeps growing, at a decreasing rate, until it saturates. These power curves would be helpful for pooling experiment design and provide guidance as to how to balance sample size (cost) and desired power. As expected, rare variants are much harder to detect than common variants, and a large sample size is required to achieve high power for them. Finally, higher sequencing error (0.05 versus 0.01) reduces power only slightly.

Figure 2 shows similar results for the multiple-pool scenario. Again, the type I error rate is controlled at the nominal level 0.05. We also observe that, given the same number of sequenced individuals, the single-pool design yields slightly higher power with a lower type I error rate than the multiple-pool design, for example, 1000 individuals in one single pool versus five pools with 200 individuals in each. CRISP selects candidate SNPs by the Fisher's exact test, which is then followed by additional filtering steps. In the multiple-pool scenario, we show that the rankings of candidate SNPs by our test are superior to those by the Fisher's exact test employed by CRISP. To compare the efficiencies of these two rankings, we divide the 10000 positives with *θ*~*U*(*θ*_{min}, *θ*_{max}) and 10000 negatives with *θ*~*U*(0, *θ*_{min}) into 100 groups, each with 100 positives and 100 negatives. These 200 loci are then ranked by their significance levels of testing the null *H*_{0}: *θ*<*θ*_{min} using our statistical models. Rankings based on the Fisher's exact test are also generated. The area under the curve (AUC) score averaged over 100 groups is used to evaluate these two rankings, as shown in Figure 3 for the typical scenario of 30× coverage and 0.05 sequencing error. We can see that the Fisher's exact test is very inefficient for detecting common and less common variants. CRISP therefore has to rely on additional sequencing error models to complement the Fisher's exact test for detecting common variants. We apply the BH procedure to control FDR at the nominal levels of 0.1 and 0.05. As shown in Supplementary Table S1, the FDR for the Fisher's exact test is inflated, particularly dramatically for common and less common variants; SNVer controls the FDR very well. The number of sequenced individuals is modeled in our test and is shown to be helpful. 
This information is not explicitly utilized by CRISP in its Fisher's exact test and therefore contributes very little for detecting common and less common variants, although CRISP models it at the later filtering step.
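The grouped AUC comparison described above can be sketched as follows. The sketch assumes that loci are ranked by *P*-value with smaller values ranked first, and that the AUC is computed as the proportion of positive/negative pairs ordered correctly (ties counted as one half); the function names and the grouping by consecutive slices are illustrative choices, not details given in the text.

```python
def auc(pos_pvals, neg_pvals):
    """Probability that a random positive has a smaller P-value than a random
    negative, with ties counted as one half (the Mann-Whitney form of the AUC)."""
    wins = 0.0
    for p in pos_pvals:
        for q in neg_pvals:
            if p < q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_pvals) * len(neg_pvals))


def mean_group_auc(pos_pvals, neg_pvals, group_size=100):
    """Average AUC over consecutive groups of group_size positives and negatives."""
    n_groups = min(len(pos_pvals), len(neg_pvals)) // group_size
    if n_groups == 0:
        raise ValueError("need at least group_size positives and negatives")
    scores = []
    for g in range(n_groups):
        lo, hi = g * group_size, (g + 1) * group_size
        scores.append(auc(pos_pvals[lo:hi], neg_pvals[lo:hi]))
    return sum(scores) / n_groups


if __name__ == "__main__":
    # Made-up P-values: two groups of two positives and two negatives.
    positives = [0.01, 0.60, 0.02, 0.03]
    negatives = [0.50, 0.90, 0.40, 0.70]
    print(mean_group_auc(positives, negatives, group_size=2))
```

An AUC of 1 means every true variant is ranked ahead of every non-variant in the group, and 0.5 corresponds to a ranking no better than chance.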

The accuracy of allele frequency estimation has an impact on variant calling, and is more critical for establishing association in genetics studies. Therefore we also plot the estimated MAF against the actual MAF when *ε*=0.01 in Figure 4. For a moderate sample size of 250, we observe good concordance with correlation coefficients *r*^{2}=0.9828 and *r*^{2}=0.9318 for the single-pool design and the multiple-pool design, respectively. When the sample size increases to 1000, the concordance improves to *r*^{2}=0.9955 and *r*^{2}=0.9769 for the single- and the multiple-pool design, respectively. The lower concordance of the multiple-pool design may be attributed to its additional between-pool variance. It also explains why the single-pool design yields fewer false positives than the multiple-pool design for the same set of samples.

### Real data application

#### Better performance

The user of SNVer only needs to set the sequencing error rate *ε* and the variant threshold *θ*_{0}. SNVer then reports, for each tested locus, the significance *P*-value for the null hypothesis that its MAF satisfies *θ* ≤ *θ*_{0}. We assume *ε*=0.01 for all real data sets. CRISP calls both rare and common variants, so we set *θ*_{0}=0 for SNVer to compare their performance in calling variants. CRISP outputs the variants it calls, while SNVer reports an overall significance *P*-value for each locus, based on which the user can choose a threshold he/she feels appropriate and make variant calls. To make a comparison, we rank loci by their *P*-values output by SNVer and take the significance threshold that gives the same number of variants as called by CRISP. The loci identified as variants by these two programs are then annotated by SeattleSeq (http://gvs.gs.washington.edu/SeattleSeqAnnotation/), and we count how many of them have been confirmed as variants in dbSNP. Following (30), we evaluate variant call quality by examining dbSNP rate, transition/transversion (Ti/Tv) ratio and concordance of sequencing and individual genotyping calls. A higher Ti/Tv ratio generally indicates a higher accuracy; this metric is particularly helpful for assessing novel single nucleotide variant calls (30). The variant call results are summarized in Table 2. For the Autism and T1D pooled sequencing data sets, SNVer has higher dbSNP rates, higher overall Ti/Tv ratios and higher Ti/Tv ratios for new sites than CRISP. This indicates the better quality of the call sets SNVer produced. In contrast, SAMtools made far fewer SNP calls, which led to much lower sensitivities, despite its higher Ti/Tv ratios. Out of the 110 SNPs that have been genotyped by SNP arrays in the Autism data set, SAMtools identified only 16 SNPs with 100% genotyping concordance, while both SNVer and CRISP called about 100 SNPs with 100% genotyping concordance. 
This confirms that SAMtools may not be appropriate for pooled sequencing data. The correlation between alternate allele frequencies in individually genotyped DNA samples and frequency estimates in the sequenced DNA pools is plotted in Figure 5, with *r*^{2}=0.92 and *r*^{2}=0.94 for the Autism case and control, respectively. The achieved 100% genotype concordance with less perfect frequency estimates is not surprising because accurate estimate of allele frequency θ is only critical for rare variants when testing *θ*>0.

As shown in Table 2, for the ADHD individual sequencing data, at a family-wise error rate of 0.05, SNVer also obtained variant call sets of good quality. This is evidenced by the ~97% dbSNP rates, the overall Ti/Tv ratios of approximately 2.9, the Ti/Tv ratios of 2.22–2.73 for novel sites, and the 99% genotype concordance. SAMtools with the suggested parameters/filters made more than twice as many variant calls as SNVer (e.g. ~49 K versus ~18 K). The lower Ti/Tv ratios and genotype concordance suggest poorer quality for these larger call sets made by SAMtools. When applied with an additional filter of sequencing depth ≥20×, SAMtools identified fewer SNPs than SNVer, but its call sets are still of lower quality, as indicated by the lower Ti/Tv ratios and genotype concordance. Compared with GATK, SNVer has similar performance, but with higher Ti/Tv ratios for novel variants in all three individuals.

We note that the Ti/Tv ratios for novel variants in the pooled sequencing data are low for both programs. This suggests that they may not perform well for novel variants if we estimate the false-positive rates based on the Ti/Tv ratios following (30), and it confirms that variant calling is more challenging for pooled sequencing. Meanwhile, estimating false-positive rates from this summary statistic should be done with caution for pooled sequencing. First, the Ti/Tv estimate for pooled samples is not as accurate as for individual samples. Second, targeted resequencing regions are usually small, e.g. 31kb for the T1D data and 503kb for the Autism data, and therefore may exhibit higher genomic and statistical variances. For example, the ADHD individual 84060 has an exome-wide Ti/Tv ratio of 2.89 for all variants; if we calculate Ti/Tv ratios based on only 500-kb regions, then the smallest Ti/Tv ratio we obtain is 1.31 and the largest is 7.00, with SD=1.53 (we consider only 500-kb regions with at least 30 variants in order to have stable Ti/Tv ratio estimates).
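For readers unfamiliar with the Ti/Tv statistic, a minimal sketch of how it can be computed from a list of single nucleotide variant calls is given below. The function name and the input format (pairs of reference and alternate bases) are assumptions made for illustration; they do not describe the output format of SNVer or any other caller.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(variants):
    """Ratio of transitions to transversions for (ref, alt) single-base pairs."""
    ti = tv = 0
    for ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt:
            continue  # not a substitution, skip
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions: Ti/Tv ratio is undefined")
    return ti / tv


if __name__ == "__main__":
    # Made-up calls: two transitions and one transversion.
    print(ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
```

Each base has one possible transition and two possible transversions, so calls that are purely random errors would give a ratio near 0.5. With only a few dozen variants in a region, a handful of calls can move the ratio substantially, which is consistent with the wide range seen across 500-kb windows above.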

#### Better scalability

SNVer and SAMtools exhibit similar efficiency in terms of running time. The running time of SNVer and CRISP in analysis of the T1D and Autism data sets is given in Figure 6. The main bottleneck of CRISP comes from computing the *P*-values of a large number of contingency tables in the Fisher's exact test. Therefore, in addition to the number of tests, its time efficiency is also largely dependent on the number of pools and the depth of coverage. In contrast, these two factors have little impact on SNVer, whose running time is roughly linear in the region size (the number of tests). For example, SNVer spends 0.1h on 31kb and 1.5h on 503kb for the two data sets, respectively. SNVer is much faster than CRISP. Taking the T1D case for example, SNVer is ~500-fold faster than CRISP and achieves 300kb/h. Such efficiency makes it feasible to apply SNVer to whole-exome sequencing data, or even to whole-genome sequencing data using a high-performance computing cluster, both of which would take prohibitively long with CRISP.

### Informative ranking and multiplicity control

SNVer reports one single overall significance *P*-value for each locus, based on which the rankings of all tested loci can be produced. Such rankings are more informative and accurate than the dichotomous decision of whether to ‘accept or reject the candidate as a variant’ provided by CRISP and most other existing methods. For example, four rare variants have been found to be associated with T1D based on the T1D data set by comparing the estimated MAF in cases and controls (14). We use SNVer to call these four variants by testing the null hypothesis *θ* ≤ *θ*_{0} = 0.01. We give their rankings by SNVer in Table 3, as well as the dichotomous decisions made by CRISP. For SNVer, we observe very significant ranking changes of these four SNPs between cases and controls, which are consistent with their MAFs (relative to the threshold 0.01) and the MAF differences. CRISP identifies three of them, rs35337543, ss107794688 and ss107794687, as variants in both cases and controls, exhibiting no informative differential changes. It should be noted that the ranking difference may only reflect frequency difference. A large frequency difference between cases and controls for those variants may suggest their potential association with the phenotype, but their functional importance to the phenotype is yet to be assessed by further experiments.

In addition to ranking, valid *P*-values given by SNVer also make multiplicity control possible. Tens of thousands or even millions of loci are usually examined simultaneously in typical NGS experiments. It is particularly desirable to have multiplicity control, which gives users an idea of the chance of making any errors and/or the proportion of false positives among the variant calls they make. Each user can choose the type I error rate threshold he or she considers appropriate, instead of just the dichotomous decisions of whether to ‘accept or reject the candidates’ provided by most existing methods.

## DISCUSSION

We have developed a novel statistical tool SNVer for calling SNPs in analysis of pooled or individual NGS data. Different from the previous models employed by CRISP, it analyzes common and rare variants in one integrated model, which considers and models all relevant factors including variant distribution and sequencing errors simultaneously. As a result, the user does not need to specify several filter cutoffs as required by CRISP. Some variant calling methods simply discard loci with low depth of coverage to achieve reliable variant calls. Our statistical model does not discriminate against poorly covered loci. Loci with any (low) coverage can be tested and depth of coverage will be quantitatively factored into the final significance calculation. SNVer reports one single overall significance *P-*value for evaluating the significance of a candidate being a variant. An advantage of reporting results on a more continuous scale, instead of just the dichotomous decision of whether to ‘accept or reject the candidate as a variant’ as most existing methods do, is that the user can choose the alpha threshold he or she considers appropriate. We have used both simulated data and real data to demonstrate the superior performance of our program in comparison with pre-existing methods. Although SNVer is motivated by the need for analysis of pooled NGS data, it can also be applied to individual NGS data as a special case (*N*=2 for diploid species), as shown in the ADHD data set.
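To make the individual-data special case concrete, the binomial model for the carrier count can be written out for *N*=2. This is a worked restatement of the single-pool model rather than a new result, with $p_{n}$ denoting the per-read probability of observing the minor allele under the error model when *n* of the two haplotypes carry it:

$$P(n=0)=(1-\theta)^{2},\qquad P(n=1)=2\theta(1-\theta),\qquad P(n=2)=\theta^{2},$$

$$P(X \mid K, N=2, \theta, \varepsilon)=\sum_{n=0}^{2} P(n)\,\binom{K}{X}\,p_{n}^{X}\,(1-p_{n})^{K-X}.$$

The three terms correspond to the allele fractions *n/N* = 0, 0.5 and 1, that is, the non-variant, heterozygous and alternate homozygous genotypes of a diploid genome. When *θ*_{0}=0, the null distribution places all its mass on *n*=0, so the test reduces to asking whether the observed minor-allele reads exceed what sequencing error alone would produce at that depth.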

Sampling bias is a non-trivial problem in pooled sequencing, and rare variants are particularly prone to sampling issues. In this article, to make inference on the MAF *θ* of each site, we model the number of observed alleles conditional on the coverage, from a frequentist standpoint. The power to detect variants may be further improved if sampling bias is modeled properly, so that the coverage itself informs the inference rather than merely being conditioned on. Since we have only one observation for each site, to model sampling bias or to make any site-specific inference, e.g. on base quality/error, we have to pool information across sites. Bayesian models may be a better, if not the only, way to achieve this. For example, the distribution of coverage across all sites can be approximated by the Gamma distribution for Illumina's short read alignments (31). Shen and colleagues (32) propose to estimate the posterior error rates for each substitution through a Bayesian formula, in which error models are learned from training data sets. Our frequentist approach does not model sampling bias; however, it has its own merits. First, the sampling bias issue may be very application specific. Different target enrichment kits may have different coverage uniformities. More variable sampling bias is expected for targeted re-sequencing, currently the main pooling application, owing to region-specific GC content. Mapping algorithms also critically affect coverage. As a result, any approach that models sampling bias may have to check carefully whether the assumed sampling bias model/distribution fits well for every application. Second, our frequentist approach does not pool information across sites, and consequently has minimal input requirements and wider applicability. For example, when only one or a few sites are tested, and without any help from external training data, sampling bias cannot be modeled (well), but our frequentist approach can still be applied.
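As a rough illustration of pooling information across sites, the sketch below fits a Gamma distribution to per-site coverage values, in the spirit of the approximation mentioned above (31). Method-of-moments estimation is an assumption made here for simplicity and is not taken from the cited work. The function name and input are hypothetical.

```python
import statistics


def fit_gamma_moments(coverages):
    """Estimate Gamma (shape, scale) from per-site coverage by the method of moments.

    For a Gamma distribution, mean = shape * scale and variance = shape * scale**2,
    so shape = mean**2 / variance and scale = variance / mean.
    """
    mean = statistics.fmean(coverages)
    var = statistics.pvariance(coverages)  # population variance across sites
    if mean <= 0 or var <= 0:
        raise ValueError("coverage values must be positive and not all identical")
    shape = mean * mean / var
    scale = var / mean
    return shape, scale
```

Whether such a fit is adequate would have to be checked for each enrichment kit and mapping pipeline, which is the application-specific burden discussed above.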

In short, our frequentist approach does not consider sampling bias and consequently makes few assumptions, requires minimal input and has wider applicability. On the other hand, sampling issues may be addressed by more careful pooled re-sequencing designs (33). Companies such as NimbleGen and Agilent are also competing to improve the coverage uniformity of their target enrichment kits. With these upstream efforts, the impact of sampling bias on downstream variant calling algorithms may be minimized.

Our current program can be improved and extended in several ways. First, small indels are not supported. Indels pose a great challenge for NGS, including for DNA amplification and read mapping, both of which are under rapid development. When those techniques become mature in handling indels, we may investigate their distribution and work out a proper calling strategy. Second, sequencing quality scores can be utilized to estimate site-specific sequencing error. Third, the majority of loci in sequenced segments are known to carry no variants. SNP density is estimated to be around 1 per 1000 bases. Such prior information on the proportion of non-nulls may help obtain more precise multiplicity control. Fourth, the dependency among tests will also be informative in increasing testing efficiency. We have shown that LD dependency information is very informative in increasing the efficiency of genome-wide association tests in the analysis of GWAS data (34). We also found recently that dependency information is helpful for increasing the efficiency of testing hypotheses at the set level (35). For NGS data, one non-null (variant) is expected in every 1000 consecutive genomic bases. Such dependency patterns, if appropriately modeled, may help further improve testing efficiency. Lastly, our current program focuses on calling variants, namely, testing whether *θ* is larger than a threshold. Under the same framework, our models can be naturally extended to case-control association studies by testing whether *θ*_{case} = *θ*_{control}. We are currently working on these extensions.
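The two testing problems mentioned above can be written side by side as follows. The symbol $\theta_0$ for the calling threshold and the two-sided form of the case-control alternative are notational assumptions made here for illustration.

```latex
% Variant calling: is the MAF at a site above a threshold \theta_0?
H_0 : \theta \le \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0

% Case-control association: do cases and controls share the same MAF?
H_0 : \theta_{\mathrm{case}} = \theta_{\mathrm{control}} \quad \text{versus} \quad H_1 : \theta_{\mathrm{case}} \ne \theta_{\mathrm{control}}
```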

In summary, we have developed a statistical tool SNVer for calling common and rare variants in analysis of both pooled and individual NGS data. As more and more NGS data become available, we expect more applications of our program.

## SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

## FUNDING

Institute Development Fund to the Center for Applied Genomics from The Children's Hospital of Philadelphia (partial). The open access publication charge for this paper has been waived by Oxford University Press – *NAR* Editorial Board members are entitled to one free paper per year in recognition of their work on behalf of the journal.

*Conflict of interest statement*. None declared.

## ACKNOWLEDGEMENTS

The authors thank Juvenile Diabetes Research Foundation and Wellcome Trust for providing the T1D NGS data used in the study. The authors thank Dan Koboldt for clarifying the usage of the Fisher's exact test in VarScan and Dr Vikas Bansal for helpful discussion. The authors also thank all four referees for their constructive comments, which have greatly helped improve the presentation of the article. The authors declare Juvenile Diabetes Research Foundation and Wellcome Trust bear no responsibility for interpreting the T1D results generated in the study.

## REFERENCES

**Oxford University Press**

SNVer: a statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Research. 2011 Oct; 39(19):e132.
