Copyright © 2009, EMBO and Nature Publishing Group

Adaptable gene-specific dye bias correction for two-channel DNA microarrays

1 Department of Physiological Chemistry, University Medical Center Utrecht, Universiteitsweg, Utrecht, The Netherlands
a Department of Physiological Chemistry, University Medical Center Utrecht, PO box 85060, Universiteitsweg 100, Utrecht, 3584 CG, The Netherlands. Tel.: +31 88 755 5874/31 30 253 8486; Fax: +31 088 756 8479; E-mail: f.c.p.holstege/at/umcutrecht.nl
* These authors contributed equally to this work

Received May 20, 2008; Accepted March 13, 2009.

This is an open-access article distributed under the terms of the Creative Commons Attribution Licence, which permits distribution and reproduction in any medium, provided the original author and source are credited. Creation of derivative works is permitted but the resulting work may be distributed only under the same or similar licence to this one. This licence does not permit commercial exploitation without specific permission.

Abstract

DNA microarray technology is a powerful tool for monitoring gene expression or for finding the location of DNA-bound proteins. DNA microarrays can suffer from gene-specific dye bias (GSDB), causing some probes to be affected more by the dye than by the sample. This results in large measurement errors, which vary considerably for different probes and also across different hybridizations. GSDB is not corrected by conventional normalization and has been difficult to address systematically because of its variance. We show that GSDB is influenced by label incorporation efficiency, explaining the variation of GSDB across different hybridizations. A correction method (Gene- And Slide-Specific Correction, GASSCO) is presented, whereby sequence-specific corrections are modulated by the overall bias of individual hybridizations.
GASSCO outperforms earlier methods and works well on a variety of publicly available datasets covering a range of platforms, organisms and applications, including ChIP on chip. A sequence-based model is also presented, which predicts which probes will suffer most from GSDB, useful for microarray probe design and correction of individual hybridizations. Software implementing the method is publicly available.

Keywords: DNA microarrays, fluorescent dye labelling, mRNA expression profiling, normalization, two channel

Introduction

DNA microarrays are applied widely throughout the life sciences for a variety of purposes including mRNA analysis and genome-wide localization studies (Young, 2000). The accuracy of microarray experiments depends on appropriate normalization of signals derived from different samples or arrays (Quackenbush, 2002). For dual-channel microarrays, two samples are labelled with different fluorescent dyes and hybridized to a single microarray. The differences in the properties and in the detection of the fluorescent dyes result in measurement bias. Global and/or intensity-dependent bias can be normalized by locally weighted linear regression (Yang et al, 2002). A different type of bias is gene specific and is not corrected by such methods. Various such gene- or probe-specific biases have been reported for the majority of platforms (Dombkowski et al, 2004; Rosenzweig et al, 2004; Dobbin et al, 2005; Martin-Magniette et al, 2005; Kelley et al, 2008), including single-channel platforms (Hekstra et al, 2003; Naef and Magnasco, 2003; Zhang et al, 2003; Schuster et al, 2007). Such biases interfere with accurate determination of differential expression.

For two-channel microarray experiments, gene-specific dye bias (GSDB) is easily observed in dye-swap replicates. When dyes are swapped between two samples, affected probes follow mainly the dyes rather than the samples.
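Why a dye-swap pair exposes GSDB can be sketched with a simple additive model. The sketch below assumes that the measured log2 ratio of a probe is the sum of a sample effect, which changes sign when the dyes are swapped, and a dye effect, which does not. The function name and the example values are hypothetical and serve only as illustration; the model also assumes the same dye effect on both slides, which is exactly the assumption that variable GSDB violates.

```python
def split_dye_swap(m_forward, m_reverse):
    """Separate sample and dye effects for one probe from a dye-swap pair.

    m_forward: log2(Cy5/Cy3) with sample A in Cy5 and sample B in Cy3.
    m_reverse: log2(Cy5/Cy3) after swapping the dyes (B in Cy5, A in Cy3).

    Assumed additive model (an illustration, not the article's full model):
        m_forward = +sample_effect + dye_effect
        m_reverse = -sample_effect + dye_effect
    """
    sample_effect = (m_forward - m_reverse) / 2.0
    dye_effect = (m_forward + m_reverse) / 2.0
    return sample_effect, dye_effect


# Hypothetical probe with a sample effect of 0.25 and a dye effect of 0.75.
sample_effect, dye_effect = split_dye_swap(0.25 + 0.75, -0.25 + 0.75)
print(sample_effect, dye_effect)
```

For a probe dominated by dye bias, the half-sum is large relative to the half-difference, so the ratio barely changes sign when the dyes are swapped.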
Because of its variable nature, GSDB is difficult to address systematically and can result in deviations of more than two-fold from the correct ratio (Martin-Magniette et al, 2005; Kelley et al, 2008). GSDB is likely caused by differential fluorescence quenching within labelled material (Cox et al, 2004). A recent study showed that GSDB depends on the labelled nucleoside and presented a maximum-likelihood error model for correcting GSDB. The VERA method (Kelley et al, 2008) works well if the degree of GSDB is uniform across different hybridizations.

Here, we show that the degree of GSDB is strongly linked to sample labelling efficiency. This agrees with the variable nature of GSDB across different hybridizations, which has complicated earlier attempts to model and correct GSDB effectively. We present a method that consists of a gene-specific correction that is modulated by the degree of overall bias observed in individual hybridizations. This significantly alleviates GSDB and results in greater accuracy of DNA microarray experiments. A sequence-based model is also presented.

Results and discussion

GSDB varies across hybridizations

As part of a project to determine differential expression between various mutant yeast strains, a number of control experiments were carried out. These controls consisted of labelling and hybridizing a single reference wild-type (wt) RNA sample against other wt RNA samples, each processed on different days. These hybridizations show diverse degrees of variation (Figure 1A and B) [...]
When samples are labelled in reverse dye orientations, the outliers do not also reverse (Figure 1C) [...]

GSDB is linked to the degree of label incorporation

To validate GASSCO further and to investigate the cause and variable nature of GSDB, a dataset was generated that consisted of self versus self hybridizations, whereby the degree of label incorporation was varied from 0.5 to 3%. Strikingly, the degree of GSDB shows strong association with the degree of label incorporation (Figure 1E) [...] GSDB correction based on the previously described control hybridizations greatly reduces self–self variation in this dataset with variable label incorporations (Figure 1F) [...]

Comparison with existing methods

Several earlier studies have explicitly addressed the problem of GSDB in two-colour arrays (Dombkowski et al, 2004; Rosenzweig et al, 2004; Dobbin et al, 2005; Martin-Magniette et al, 2005; Kelley et al, 2008). In most cases, these methods do not take any slide-specific GSDB into account, or make the assumption that this is constant within a batch (Rosenzweig et al, 2004). The method reported by Dobbin et al (2005) does model slide-specific effects of GSDB. However, in their linear model, the slide effect is allowed to vary for each probe individually and therefore requires an inordinate number of hybridizations to be estimated properly. In GASSCO, the slide-specific effect is constant for all probes. Another way of countering GSDB is to average the results of dye-swap replicate hybridizations, in an attempt to cancel out GSDB. We [...]
GSDB correction for small-scale projects

The GSDB correction presented above is based on a set of 12 control hybridizations, reference wt versus other wt, undertaken as part of a large-scale expression-profiling project. GASSCO can also be independently applied to projects that do not include any same versus same controls (Figure 2) [...] Identical clustering results were also obtained with iGSDB estimates based on leave-one-out cross-validation (data not shown). To approximate the minimum number of dye-swapped slides needed to get trustworthy iGSDB estimates, we compared estimates derived from all combinations to the original that was based on all five pairs. Using just two pairs of hybridizations resulted in a correlation between the different iGSDB estimates always >0.94, suggesting that three pairs of dye swaps may generally suffice. Although such iGSDB estimates are slightly less accurate than those obtained from self versus self hybridizations, Figure 2 [...]

Applying GASSCO to previously published data

To assess the generality of our method, we applied GASSCO to five publicly available datasets from different laboratories. These include cDNA and Agilent oligo arrays, for both expression profiling and ChIP on chip studies, for yeast, mouse and human, using different labelling protocols (Dobbin et al, 2005; Chua et al, 2006; Chen et al, 2007; Tan et al, 2008; Tuteja et al, 2008). The criterion for selection of these datasets was that they included dye-swap replicates, which were used to estimate the iGSDB and subsequently correct the data. The results are shown in Figure 3 [...]

Sequence-only based model

Differences between Cy3 and Cy5 with regard to dye–dye quenching contribute to GSDB (Cox et al, 2004). Using linear regression, several probe characteristics were scrutinized for correlation with GSDB.
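The kind of regression described above can be sketched as a least-squares fit of a per-probe bias estimate against a single probe characteristic. The sketch below is only an illustration: the characteristic (the fraction of adenines in the probe sequence), the function names, the probe sequences and the bias values are all hypothetical, and the toy numbers carry no biological meaning.

```python
def base_fraction(sequence, base):
    """Fraction of positions in a probe sequence occupied by the given base."""
    seq = sequence.upper()
    return seq.count(base.upper()) / len(seq)


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up probes and made-up bias estimates, for illustration only.
toy_probes = ["AAAATTGC", "ACGTACGT", "GGCCGGCT", "AAAAAAGC"]
toy_bias = [0.5, 0.0, -0.5, 1.0]

adenine_fraction = [base_fraction(p, "A") for p in toy_probes]
intercept, slope = fit_line(adenine_fraction, toy_bias)
print(intercept, slope)
```

With several characteristics at once, the same idea extends to multiple linear regression, where each characteristic receives its own coefficient.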
In agreement with an earlier study (Kelley et al, 2008), the adenine content of the probe, which corresponds to the aminoallyl-UTP used here to label the target RNA, has the strongest influence (Supplementary information). This fits with the finding that a higher degree of dye incorporation results in a higher degree of GSDB (Figure 1E) [...]

In summary, the correction method described here works robustly and alleviates the GSDB artefact to a great extent, in a wide variety of experimental designs and without loss of statistical power. It is therefore likely to be beneficial to most two-colour microarray studies.

Materials and methods

Full details about samples, microarrays, datasets, hybridization, scanning, normalization and clustering are described in Supplementary information. All microarray data and full protocols have been deposited in the public microarray database ArrayExpress (http://www.ebi.ac.uk/microarray) under accession E-MTAB-462.

GSDB correction

The total GSDB is expressed as the product of two factors, the iGSDB and a slide-dependent factor (F). That is,
$$\mathrm{GSDB}_{ij} = \mathrm{iGSDB}_i \times F_j \qquad (1)$$

$\mathrm{GSDB}_{ij}$ is the GSDB of gene i on slide j, $\mathrm{iGSDB}_i$ is the intrinsic gene-specific dye bias of gene i and $F_j$ is the slide-dependent factor of slide j. The method consists of the following steps: estimation of the iGSDBs once for an entire project; estimation of the slide-dependent factor for each hybridization; and lastly application of the individual corrections. The full R source code, including detailed documentation and worked examples, is available in the dyebias package from www.holstegelab.nl/publications/margaritis_lijnzaad/ or through BioConductor (www.bioconductor.org).

Estimation of iGSDB

For the first step, $\mathrm{iGSDB}_i$ is arbitrarily defined as the average dye effect of probe i in the set of hybridizations used to estimate iGSDB. The set of hybridizations used to estimate iGSDB may consist of same versus same hybridizations (Figure 1) [...]

Estimation of the slide-dependent factor

The slide-dependent factor (F) is most accurately determined from probes that show the strongest GSDB. First, probes mapping to highly variable transcripts are discarded. For instance, Ty-elements and mitochondrial genes were ignored in Figures 1 [...]

Application of GASSCO

The dye bias correction for each probe in an individual hybridization is the product of the probe's iGSDB and the slide-dependent factor (Formula 1). The correction is subtracted from M, but only for probes unlikely to be affected by border effects due to the dynamic range of the technology. Probes with a log2(intensity) >15 in either channel, or with an intensity <1.5-fold above the local area background, are not corrected. If the local area background is not available, 1.5-fold the minimum intensity of all probes can be used as a threshold instead.

Performance evaluation

To measure the performance of the correction method, we compared the variances of the apparent M (i.e. the log2 ratio of Cy5 over Cy3) of the hybridization before and after the correction.
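The three steps and the variance comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dyebias package: all function names and toy values are hypothetical, and the slide-factor estimator in particular (a zero-intercept least-squares slope of M on iGSDB over the probes with the largest absolute iGSDB) is an assumption made here, since the text above states only that the strongest-bias probes are used. The intensity filter reads "<1.5-fold above background" as intensity below 1.5 times the background.

```python
import math
import statistics


def estimate_igsdb(control_slides):
    """Step 1: per-probe average M over same versus same control slides."""
    n_probes = len(control_slides[0])
    return [statistics.mean(slide[i] for slide in control_slides)
            for i in range(n_probes)]


def estimate_slide_factor(m_values, igsdb, top_fraction=0.5):
    """Step 2 (ASSUMED estimator): slope of M on iGSDB through the origin,
    fitted on the probes with the largest absolute iGSDB."""
    n_top = max(1, int(len(igsdb) * top_fraction))
    strongest = sorted(range(len(igsdb)),
                       key=lambda i: abs(igsdb[i]), reverse=True)[:n_top]
    numerator = sum(m_values[i] * igsdb[i] for i in strongest)
    denominator = sum(igsdb[i] ** 2 for i in strongest)
    return numerator / denominator


def is_correctable(cy5, cy3, bg5, bg3, max_log2=15.0, min_fold=1.5):
    """Skip probes that are near saturation or too close to background."""
    for signal, background in ((cy5, bg5), (cy3, bg3)):
        if math.log2(signal) > max_log2 or signal < min_fold * background:
            return False
    return True


def apply_correction(m_values, igsdb, slide_factor, correctable):
    """Step 3: subtract iGSDB_i * F_j from M for correctable probes only."""
    return [m - g * slide_factor if ok else m
            for m, g, ok in zip(m_values, igsdb, correctable)]


# Toy data for four probes (hypothetical values, illustration only).
controls = [[0.5, -0.5, 0.0, 1.0],
            [1.5, -1.5, 0.0, 1.0]]
igsdb = estimate_igsdb(controls)

m_slide = [0.5, -0.5, 0.0, 0.5]          # a self versus self hybridization
cy5 = [2000.0, 1000.0, 1500.0, 40000.0]  # last probe exceeds 2**15
cy3 = [1400.0, 1400.0, 1500.0, 28000.0]
background = [100.0] * 4

slide_factor = estimate_slide_factor(m_slide, igsdb)
correctable = [is_correctable(r, g, b, b)
               for r, g, b in zip(cy5, cy3, background)]
corrected = apply_correction(m_slide, igsdb, slide_factor, correctable)

# Performance: variance of M before and after correction.
var_before = statistics.pvariance(m_slide)
var_after = statistics.pvariance(corrected)
print(slide_factor, corrected, var_before, var_after)
```

In this toy example the saturated probe keeps its original M, while the other probes are corrected, so the variance of M on a self versus self slide drops after correction.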
In Figure 3 [...]

Conflict of interest

The authors declare that they have no conflict of interest.

Supplementary Materials 1 (887K, doc)
GSDB correction of very high or very low label incorporation data
Prediction of GSDB from sequence
Correction of an Agilent array ChIP experiment using sequence-predicted iGSDB
Materials and methods not described in the main text

Supplementary Materials 2 (57K, xls)
Details regarding dye bias correction of previously published data sets