Nucleic Acids Res. Jul 1, 2011; 39(Web Server issue): W551–W556.
Published online May 11, 2011. doi: 10.1093/nar/gkr312
PMCID: PMC3125748

BiQ Analyzer HT: locus-specific analysis of DNA methylation by high-throughput bisulfite sequencing

Abstract

Bisulfite sequencing is a widely used method for measuring DNA methylation in eukaryotic genomes. The assay provides single-base pair resolution and, given sufficient sequencing depth, its quantitative accuracy is excellent. High-throughput sequencing of bisulfite-converted DNA can be applied either genome wide or targeted to a defined set of genomic loci (e.g. using locus-specific PCR primers or DNA capture probes). Here, we describe BiQ Analyzer HT (http://biq-analyzer-ht.bioinf.mpi-inf.mpg.de/), a user-friendly software tool that supports locus-specific analysis and visualization of high-throughput bisulfite sequencing data. The software facilitates the shift from time-consuming clonal bisulfite sequencing to the more quantitative and cost-efficient use of high-throughput sequencing for studying locus-specific DNA methylation patterns. In addition, it is useful for locus-specific visualization of genome-wide bisulfite sequencing data.

INTRODUCTION

DNA methylation is a widely studied epigenetic modification. It is present in all vertebrates and many invertebrate animals as well as in plants (1). In mammals, DNA methylation plays an important role for developmental gene regulation and for germline repression of repetitive elements (2). Aberrant DNA methylation patterns are frequently observed in cancer (3) and may also occur in many other human diseases (4). The link between locus-specific DNA methylation alterations and common diseases has created significant interest in using these epigenetic alterations as biomarkers in drug discovery and clinical diagnostics (5).

To investigate the many roles of DNA methylation in development and disease, researchers depend on experimental methods that measure DNA methylation patterns at high accuracy and affordable cost. Many technologies with different advantages and disadvantages have been developed over the last 20 years, but only bisulfite-based methods provide quantitative DNA methylation data at single-base pair resolution (6). In bisulfite sequencing, the DNA is treated with sodium bisulfite, which selectively converts unmethylated cytosines into uracils but leaves methylated cytosines untouched (7). Hydroxymethylated DNA, which has recently been detected in some mammalian cell types, is also left unconverted and is indistinguishable from methylated DNA using bisulfite-based methods (8).
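The conversion chemistry described above can be illustrated with a minimal in silico sketch. It is not part of BiQ Analyzer HT; the function name `bisulfite_convert` and its arguments are hypothetical, and it assumes that only the forward strand is read after PCR, so that each uracil appears as a thymine.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment followed by PCR on the forward strand.

    Unmethylated C is read as T (via uracil); methylated C stays C.
    `methylated_positions` holds 0-based indices of protected cytosines
    (methylated or hydroxymethylated, which bisulfite cannot tell apart).
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


reference = "ACGTCCGA"
# The C at index 1 is methylated; the Cs at indices 4 and 5 are not.
print(bisulfite_convert(reference, {1}))  # ACGTTTGA
```

Comparing the converted read with the unconverted reference then reveals the methylation state: a retained C indicates a methylated (or hydroxymethylated) cytosine, and a T at a reference C indicates an unmethylated one.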

Bisulfite sequencing has recently been used to obtain the first genome-wide, high-resolution maps of DNA methylation in the human genome (9,10). Bisulfite-based methods also performed well in a benchmarking study of DNA methylation mapping technologies (11). Along with technologies for DNA methylation mapping at a genomic scale, locus-specific bisulfite sequencing plays an important role as a gold-standard validation method and promises to become a standard technology in clinical diagnostics (12).

Locus-specific bisulfite sequencing has traditionally been performed by Sanger sequencing of a few dozen hand-picked DNA clones, making this method rather time-consuming and costly. To address these limitations, researchers increasingly use high-throughput sequencing instead of Sanger sequencing (13–15), which has three major advantages: (i) due to the increased sequencing throughput, it becomes feasible to obtain highly quantitative DNA methylation patterns for the loci of interest. This is particularly relevant for studying heterogeneous tissue samples and for clinical diagnostics; (ii) due to lower per-base costs and the use of multiplexing to sequence many samples and/or loci in a single machine run, the sequencing costs are substantially reduced; and (iii) the cloning step for isolating DNA populations that carry the DNA sequence of a single DNA molecule becomes obsolete because current methods for high-throughput sequencing measure the sequences of individual DNA clones.

A major roadblock for the wider use of high-throughput bisulfite sequencing is the lack of software tools for processing and analyzing the large number of sequencing reads that are generated by this method. Several software tools have been developed for processing small-scale bisulfite sequencing data obtained by conventional Sanger sequencing. The BiQ Analyzer (16) software from our group has recently been updated to version 2.0 and continues to be a useful tool for interactive analysis of small-scale bisulfite sequencing data. Alternative tools include the QUMA web service (17), BISMA (18) and several more specialized programs (19–22). None of these tools can be scaled to the read numbers that are typically obtained by high-throughput sequencing. For this reason, recent studies utilized custom data analysis scripts, none of which are conveniently available (13–15).

Here, we describe BiQ Analyzer HT, a comprehensive software tool for locus-specific analysis of high-throughput bisulfite sequencing data. BiQ Analyzer HT builds on concepts that we originally developed for the popular BiQ Analyzer software (16), but it was redesigned and rewritten to meet the challenges arising for the analysis of high-throughput bisulfite sequencing data. All functionality of BiQ Analyzer HT is available through a web-startable graphical user interface, which guides the user through the data analysis (Figure 1). As an additional option, it is possible to run the computationally intensive parts of the software on a remote high-performance computer while maintaining the user-friendliness of a graphical interface run locally. Finally, BiQ Analyzer HT provides an optional command-line interface to facilitate integration into automatic data analysis pipelines.

Figure 1.
BiQ Analyzer HT workflow. Bisulfite sequencing data are generated either for the entire genome or selectively for a defined set of genomic loci using commercially available high-throughput sequencers (A). To reduce sequencing cost, bisulfite-converted ...

PROGRAM OVERVIEW

BiQ Analyzer HT facilitates locus-specific analysis, quality control and visualization of high-throughput bisulfite sequencing data. The tool takes sequencing read data as input, and it produces quality-controlled output tables and diagrams of the inferred DNA methylation information for each sample, locus and DNA methylation site.

BiQ Analyzer HT is a Java-based program that can be run on any computer with a recent version of the Java Virtual Machine installed. The tool is available as a self-installing Java Web Start distribution, and as a downloadable installation package for computers that are not connected to the Internet. BiQ Analyzer HT’s project-based user interface supports the interactive analysis of bisulfite sequencing data for multiple target loci in multiple samples. A typical analysis consists of three phases: (i) data import; (ii) sequence alignment and quality control; and (iii) visualization and export of the inferred DNA methylation information (Figure 1).

To prepare high-throughput sequencing data for analysis with BiQ Analyzer HT, the user first applies vendor-specific software to perform base-calling, to resolve any sample multiplexing and to convert the data into one of two standard formats, FASTA or BAM. When importing FASTA files obtained by locus-specific bisulfite sequencing, BiQ Analyzer HT expects one file per sample and locus. We currently provide a custom script that automates data preparation for the Roche 454 sequencing platform (http://biq-analyzer-ht.bioinf.mpi-inf.mpg.de), and we will add similar scripts for other platforms based on user demand. Alternatively, genome-scale bisulfite sequencing data can be imported as BAM files, which are most conveniently generated with BSMAP (23).

When a new BiQ Analyzer HT project is initialized, an output directory is created into which the software writes its analysis results (Table 1). The user specifies the project structure by adding samples and by loading FASTA files that define the genomic reference sequence of each locus. The resulting tree structure is shown in BiQ Analyzer HT’s main window. Once the data are loaded, this tree can be ordered either by samples or by loci, depending on the biological question of interest.

Table 1.
Analysis results generated by BiQ Analyzer HT

Read alignment and inference of DNA methylation information are controlled by parameters that the user selects on the setup screen. While the default values often provide good results, it is recommended that the user runs a first analysis with default parameters, inspects the results and then adjusts the parameters as necessary. Data set-specific choice of quality control parameters can sometimes compensate for quality issues that may be present in the primary data. For example, a decrease in alignment stringency parameters allows for retaining reads with reduced similarity to the reference, which would be removed by the default filtering criteria. This can be essential to process highly polymorphic sequences such as retrotransposable elements and DNA repeats.

Once satisfactory results are obtained, the inferred DNA methylation data can be exported in several formats, including sequence alignments, data tables and DNA methylation plots. Table 1 summarizes all output items. The sequence alignments provide a detailed account of how the DNA methylation levels were inferred. In addition, they can be used to identify allele-specific polymorphisms or evidence of structural variation in the sequence data. The data tables facilitate exploratory data analysis using spreadsheets, in-depth statistics using statistical software such as R/Bioconductor (24) and epigenetic biomarker development using BiQ Analyzer’s companion tool MethMarker (25). Finally, the DNA methylation plots visualize the results of BiQ Analyzer HT analyses for use in papers and scientific reports.

The visualization module of BiQ Analyzer HT utilizes the publicly available GSEA library (26) for plotting DNA methylation heatmaps. BAM file handling is implemented using the Picard library (http://picard.sourceforge.net), and parts of the sequence processing code are based on the BioJava framework (27).

DATA PROCESSING

BiQ Analyzer HT implements a data processing pipeline that is run for each combination of locus and sample in the project tree. The pipeline aligns all sequencing reads from the corresponding input file to the locus-specific genomic reference sequence, and based on these alignments it infers which cytosines are methylated or unmethylated by comparing the read sequence with the reference sequence. The key steps of the data processing pipeline are outlined in more detail below. All analyses are conveniently accessible via the graphical interface. They can also be run from the command line, which facilitates integration with automatic data processing pipelines.

Read alignment

The analysis of bisulfite sequencing data crucially depends on accurate alignments. This is an inherently difficult task when complex genomic regions with repetitive elements and structural variation are studied, and it is further complicated by the fact that bisulfite-converted DNA has substantially lower information content than genomic DNA. For this reason, speed-optimized seed-based aligners such as BLAT (28), MAQ (29) and BWA (30), which are commonly used for aligning high-throughput sequencing data, could undermine the accuracy of BiQ Analyzer HT. After exploring several alternatives, we chose to use the Needleman–Wunsch algorithm (31), which is guaranteed to find the optimal alignment under the chosen scoring scheme (although not necessarily the biologically correct one) between each sequencing read and the reference sequence. Furthermore, we made several modifications to the algorithm that account for recurrent issues with bisulfite-converted DNA (Supplementary Text S1). To partially compensate for the fact that the Needleman–Wunsch algorithm is substantially slower than current short-read aligners, we use a highly optimized implementation of this algorithm. This implementation provides excellent performance for read numbers on the order of 10^4 per locus on a standard laptop computer (Table 2). Furthermore, the read alignment can be outsourced to a remote high-performance computer, which makes it feasible to process on the order of one million reads per locus while working from a standard laptop computer.
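To make the alignment strategy concrete, the following is a minimal, unoptimized sketch of Needleman–Wunsch global alignment with a bisulfite-aware substitution score. It is not the implementation used in BiQ Analyzer HT, and the modifications described in Supplementary Text S1 are not reproduced here. The one assumption illustrated is that a T in the read aligned to a C in the reference is scored as a match, because it may reflect a converted, unmethylated cytosine. The function names, the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the linear gap penalty are all hypothetical choices.

```python
def bisulfite_score(ref_base, read_base, match=1, mismatch=-1):
    """Score one aligned pair; C (reference) vs T (read) counts as a match."""
    if ref_base == read_base:
        return match
    if ref_base == "C" and read_base == "T":
        return match  # consistent with a converted, unmethylated cytosine
    return mismatch


def needleman_wunsch(ref, read, gap=-2):
    """Global alignment by dynamic programming; returns (score, ref_aln, read_aln)."""
    n, m = len(ref), len(read)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + bisulfite_score(ref[i - 1], read[j - 1]),
                score[i - 1][j] + gap,  # gap in the read
                score[i][j - 1] + gap,  # gap in the reference
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    ref_aln, read_aln = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + bisulfite_score(ref[i - 1], read[j - 1])):
            ref_aln.append(ref[i - 1])
            read_aln.append(read[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ref_aln.append(ref[i - 1])
            read_aln.append("-")
            i -= 1
        else:
            ref_aln.append("-")
            read_aln.append(read[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(ref_aln)), "".join(reversed(read_aln))


print(needleman_wunsch("ACGTCCGA", "ATGTTTGA"))
```

The score is deliberately asymmetric: a C in the read opposite a T in the reference is still a mismatch, since bisulfite treatment cannot create cytosines. The full dynamic programming table takes time and memory proportional to the product of the two sequence lengths, which is why an optimized implementation matters at high read counts.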

Table 2.
Performance comparison of software packages for locus-specific analysis of bisulfite sequencing data

Quality control and read filtering

Based on the pairwise alignment of the sequencing reads with their corresponding genomic reference sequence, the data quality of the bisulfite sequencing experiment is estimated. Basic quality measures include the alignment score and sequence identity with the bisulfite-converted reference sequence, the estimated bisulfite conversion rate (derived from the fraction of unconverted cytosines outside of the analyzed methylation context, e.g. ‘CG’) and the number of DNA methylation sites with missing data. The sequencing read data can be filtered for each of these quality measures in order to quickly discard low-quality or otherwise unsuitable reads. The threshold values of each quality measure are set to empirically chosen defaults, but users may need to adjust these parameters interactively to account for the characteristics of their specific data sets.
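The quality measures listed above can be sketched for a single pairwise alignment as follows. This is an illustration under stated assumptions rather than the tool's actual code: the input is a gapped alignment of the unconverted reference and a read, the analyzed context is CpG, and the function names and threshold values (`min_identity`, `min_conversion`, `max_missing`) are hypothetical and are not the defaults of BiQ Analyzer HT.

```python
def next_ref_base(aligned_ref, i):
    """Return the next non-gap reference base after column i, or '' at the end."""
    for base in aligned_ref[i + 1:]:
        if base != "-":
            return base
    return ""


def quality_measures(aligned_ref, aligned_read, context_next="G"):
    """Compute identity, conversion rate and missing-site count for one read."""
    columns = len(aligned_ref)
    matches = converted = unconverted = missing_sites = 0
    for i, (r, q) in enumerate(zip(aligned_ref, aligned_read)):
        # Identity is measured against the bisulfite-converted reference,
        # so a reference C read as T counts as a match.
        if r != "-" and (r == q or (r == "C" and q == "T")):
            matches += 1
        if r != "C":
            continue
        if next_ref_base(aligned_ref, i) == context_next:
            # Cytosine inside the analyzed context: only check for missing data.
            if q not in ("C", "T"):
                missing_sites += 1
        elif q == "T":
            converted += 1
        elif q == "C":
            unconverted += 1
    informative = converted + unconverted
    return {
        "identity": matches / columns if columns else 0.0,
        "conversion_rate": converted / informative if informative else None,
        "missing_sites": missing_sites,
    }


def passes_filters(q, min_identity=0.8, min_conversion=0.9, max_missing=0):
    """Apply hypothetical thresholds to the measures of one read."""
    return (q["identity"] >= min_identity
            and q["conversion_rate"] is not None
            and q["conversion_rate"] >= min_conversion
            and q["missing_sites"] <= max_missing)


q = quality_measures("ACATCCGCTA", "ATATTCGCTA")
print(q, passes_filters(q))
```

In this example, two of the three cytosines outside the CpG context were converted, so the estimated conversion rate is 2/3 and the read fails a 0.9 conversion threshold despite a perfect identity.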

Inference of DNA methylation patterns

BiQ Analyzer HT's default settings focus on CpG methylation, which is the most common modification of eukaryotic DNA. The user can also choose to include other symmetric and asymmetric methylation contexts in the analysis, such as CpHpG and CpHpH. A methylation context is defined by a pair of DNA sequence motifs, one matching the methylated state and the other matching the unmethylated state. The positions of potential methylation sites are detected by scanning the reference sequence for matches of the methylated motif. Next, the methylation state is determined by comparing the read and reference sequences at each potential methylation site, and the site is recorded as methylated, unmethylated or missing value (‘1’, ‘0’ and ‘x’, respectively). The collection of DNA methylation states for all sites in a given sequencing read constitutes its methylation pattern, and the number of methylated sites divided by the total number of sites that are not missing values defines the mean methylation of a sequencing read.
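The inference step can be sketched as follows for the CpG context. The encoding (‘1’, ‘0’, ‘x’) and the definition of mean methylation follow the text; everything else is an assumption made for illustration, including the function names, the use of the literal motif pair `CG`/`TG`, and the rule that any read sequence matching neither motif is treated as a missing value.

```python
def methylation_pattern(aligned_ref, aligned_read,
                        methylated="CG", unmethylated="TG"):
    """Return the methylation pattern of one read as a string of 1/0/x."""
    # Alignment columns that carry a reference base (read insertions are skipped).
    cols = [i for i, base in enumerate(aligned_ref) if base != "-"]
    ref = "".join(aligned_ref[i] for i in cols)
    k = len(methylated)
    pattern = []
    start = ref.find(methylated)
    while start != -1:
        # Read bases aligned to this potential methylation site.
        observed = "".join(aligned_read[c] for c in cols[start:start + k])
        if observed == methylated:
            pattern.append("1")
        elif observed == unmethylated:
            pattern.append("0")
        else:
            pattern.append("x")  # gap, sequencing error or polymorphism
        start = ref.find(methylated, start + 1)
    return "".join(pattern)


def mean_methylation(pattern):
    """Methylated sites divided by all sites that are not missing values."""
    called = [state for state in pattern if state in ("0", "1")]
    return called.count("1") / len(called) if called else None


pattern = methylation_pattern("ACGTTCGACGA", "ACGTTTGAC-A")
print(pattern, mean_methylation(pattern))  # 10x 0.5
```

Here the reference contains three CpG sites. The read retains the first cytosine (methylated), shows a T at the second (unmethylated) and has a gap at the third (missing), so the mean methylation is computed over the two called sites only.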

Data visualization and export

The inferred DNA methylation data and quality control information can be exported for documentation and follow-up analysis using statistical tools (Table 1). The resulting tables list the quality measures, DNA methylation patterns and mean methylation levels for each sequencing read that has not been filtered out during quality control. Prior to exporting these tables, they can be sorted by one of the quality measures or by the inferred DNA methylation information.
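As a rough illustration of such an export, the sketch below writes one row per retained read to a tab-separated table, sorted by a chosen column. The column names and the function `export_table` are hypothetical and do not reflect the actual file layout produced by BiQ Analyzer HT (see Table 1 for the real output items).

```python
import csv
import io

FIELDS = ["read_id", "identity", "conversion_rate", "pattern", "mean_methylation"]


def export_table(rows, sort_by, handle):
    """Write per-read results as tab-separated text, highest `sort_by` first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in sorted(rows, key=lambda r: r[sort_by], reverse=True):
        writer.writerow(row)


rows = [
    {"read_id": "r1", "identity": 0.95, "conversion_rate": 1.0,
     "pattern": "10x", "mean_methylation": 0.5},
    {"read_id": "r2", "identity": 0.90, "conversion_rate": 0.98,
     "pattern": "111", "mean_methylation": 1.0},
]
buffer = io.StringIO()
export_table(rows, "mean_methylation", buffer)
print(buffer.getvalue())
```

A plain tab-separated layout of this kind can be opened in a spreadsheet or read into statistical software without further conversion.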

PERFORMANCE EVALUATION

To confirm the practical utility of BiQ Analyzer HT for large data sets and to assess its performance relative to existing low-throughput tools, we benchmarked the tools on data sets with up to one million reads mapping to a single locus (Table 2). These data sets were obtained by multiplexed locus-specific bisulfite sequencing on the Roche 454 sequencing platform. Briefly, three classes of repetitive elements (RE1, RE2 and RE3) were amplified from bisulfite-treated mouse DNA, and several thousand reads were sequenced for these repetitive elements. To evaluate BiQ Analyzer HT’s performance for higher read numbers, we further constructed artificial test sets from the actual data set of region RE3 by reusing sequencing reads multiple times. The results of this benchmarking show that all existing tools have severe limitations in the number of reads that can be processed (Table 2). In contrast, with BiQ Analyzer HT, we could successfully analyze a data set with one million reads mapping to a single locus.
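One simple way to build such artificial test sets is to cycle through the real reads until a target count is reached. The sketch below is only an illustration of that idea; the function name `inflate_reads` and the `_copyN` naming scheme are hypothetical, and the source does not describe how the reads were actually reused.

```python
from itertools import cycle, islice


def inflate_reads(reads, target_count):
    """Yield (read_id, sequence) pairs, reusing `reads` until target_count is reached.

    `reads` is a list of (read_id, sequence) tuples; each reuse round gets a
    distinct suffix so that read identifiers stay unique.
    """
    for n, (read_id, seq) in enumerate(islice(cycle(reads), target_count)):
        yield f"{read_id}_copy{n // len(reads)}", seq


real_reads = [("r1", "ATGTTTGA"), ("r2", "ACGTTTGA")]
for read_id, seq in inflate_reads(real_reads, 5):
    print(f">{read_id}\n{seq}")
```

Because the inflated set contains only duplicates of real reads, it tests scalability in read number but not robustness to additional sequence diversity.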

CONCLUSIONS

BiQ Analyzer HT provides comprehensive support for locus-specific analysis, quality control and visualization of high-throughput bisulfite sequencing data. It addresses the bioinformatic challenges of using high-throughput sequencing as a fast and cost-efficient alternative to clonal bisulfite sequencing, and it is fully compatible with multiplex analysis of several loci and samples. The alignment algorithm was specifically optimized for bisulfite-converted sequences, and it supports the analysis of both CpG and non-CpG methylation patterns. In summary, the combination of locus-specific high-throughput sequencing and interactive data analysis with BiQ Analyzer HT provides a highly practical approach for measuring the DNA methylation patterns of tens to hundreds of loci in hundreds to thousands of samples, for example, in the context of biomarker validation and clinical diagnostics.

AVAILABILITY

http://biq-analyzer-ht.bioinf.mpi-inf.mpg.de (This website/software is free and open to all users and there is no login requirement).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

CANCERDIP project (HEALTH-F2-2007-200620); ColoNet project (BMBF 0315417-D). Funding for open access charge: Max Planck Institute for Informatics and Saarland University.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We would like to thank Dr Sascha Tierling and Dirk Schuemacher for helpful discussions and the provision of test data, Yassen Assenov for advice with Java programming and Fabian Müller for advice on the BAM format.

REFERENCES

1. Suzuki MM, Bird A. DNA methylation landscapes: provocative insights from epigenomics. Nat. Rev. Genet. 2008;9:465–476.
2. Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002;16:6–21.
3. Esteller M. Epigenetics in cancer. N. Engl. J. Med. 2008;358:1148–1159.
4. Feinberg AP. Phenotypic plasticity and the epigenetics of human disease. Nature. 2007;447:433–440.
5. Laird PW. The power and the promise of DNA methylation markers. Nat. Rev. Cancer. 2003;3:253–266.
6. Laird PW. Principles and challenges of genome-wide DNA methylation analysis. Nat. Rev. Genet. 2010;11:191–203.
7. Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc. Natl Acad. Sci. USA. 1992;89:1827–1831.
8. Huang Y, Pastor WA, Shen Y, Tahiliani M, Liu DR, Rao A. The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing. PLoS ONE. 2010;5:e8888.
9. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009;462:315–322.
10. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, Low HM, Kin Sung KW, Rigoutsos I, Loring J, et al. Dynamic changes in the human methylome during differentiation. Genome Res. 2010;20:320–331.
11. Bock C, Tomazou EM, Brinkman AB, Müller F, Simmer F, Gu H, Jäger N, Gnirke A, Stunnenberg HG, Meissner A. Quantitative comparison of genome-wide DNA methylation mapping technologies. Nat. Biotechnol. 2010;28:1106–1114.
12. Bock C. Epigenetic biomarker development. Epigenomics. 2009;1:99–110.
13. Korshunova Y, Maloney RK, Lakey N, Citek RW, Bacher B, Budiman A, Ordway JM, McCombie WR, Leon J, Jeddeloh JA, et al. Massively parallel bisulphite pyrosequencing reveals the molecular complexity of breast cancer-associated cytosine-methylation patterns obtained from tissue and serum DNA. Genome Res. 2008;18:19–29.
14. Taylor KH, Kramer RS, Davis JW, Guo J, Duff DJ, Xu D, Caldwell CW, Shi H. Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing. Cancer Res. 2007;67:8511–8518.
15. Varley KE, Mitra RD. Bisulfite Patch PCR enables multiplexed sequencing of promoter methylation across cancer samples. Genome Res. 2010;20:1279–1287.
16. Bock C, Reither S, Mikeska T, Paulsen M, Walter J, Lengauer T. BiQ Analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics. 2005;21:4067–4068.
17. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res. 2008;36:W170–W175.
18. Rohde C, Zhang Y, Reinhardt R, Jeltsch A. BISMA–fast and accurate bisulfite sequencing data analysis of individual clones from unique and repetitive sequences. BMC Bioinformatics. 2010;11:230.
19. Carr IM, Valleley EM, Cordery SF, Markham AF, Bonthron DT. Sequence analysis and editing for bisulphite genomic sequencing projects. Nucleic Acids Res. 2007;35:e79.
20. Grunau C, Schattevoy R, Mache N, Rosenthal A. MethTools–a toolbox to visualize and analyze DNA methylation data. Nucleic Acids Res. 2000;28:1053–1058.
21. Hetzl J, Foerster AM, Raidl G, Mittelsten Scheid O. CyMATE: a new tool for methylation analysis of plant genomic DNA after bisulphite sequencing. Plant J. 2007;51:526–536.
22. Xu YH, Manoharan HT, Pitot HC. CpG PatternFinder: a Windows-based utility program for easy and rapid identification of the CpG methylation status of DNA. Biotechniques. 2007;43:334, 336–340, 342.
23. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics. 2009;10:232.
24. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80.
25. Schüffler P, Mikeska T, Waha A, Lengauer T, Bock C. MethMarker: user-friendly design and optimization of gene-specific DNA methylation assays. Genome Biol. 2009;10:R105.
26. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550.
27. Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, Foisy S, Dräger A, Yates A, Heuer M, et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24:2096–2097.
28. Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res. 2002;12:656–664.
29. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851–1858.
30. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760.
31. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48:443–453.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press