Using probe secondary structure information to enhance Affymetrix GeneChip background estimates

Bioinformatics Research Center, University of North Carolina at Charlotte, 9201 University City Blvd, Charlotte, North Carolina 28223, USA.

*Corresponding author. Tel.: +1 704 687 8378; fax: +1 704 687 6610. E-mail address: cgibas/at/uncc.edu.

The publisher's final edited version of this article is available at Comput Biol Chem.

Abstract

High-density short oligonucleotide microarrays are a primary research tool for assessing global gene expression. Background noise on microarrays comprises a significant portion of the measured raw data. A number of statistical techniques have been developed to correct for this background noise. Here, we demonstrate that probe minimum folding energy and structure can be used to enhance a previously existing model for background noise correction. We estimate that probe secondary structure accounts for up to 3% of all variation on Affymetrix microarrays.

Keywords: DNA Microarray, Probe secondary structure, Background correction

1. Introduction

Microarray technology holds the promise of capturing global gene expression by providing molecular snapshots of the products of the cell's transcriptional machinery (Lockhart et al., 1996). The ultimate goal of gene expression microarrays is to measure the abundance of each known transcript in the sample under investigation. The abundance is inferred from the signal generated by each probe as a result of a hybridization reaction with a labeled target (transcript). However, this signal reflects not only the target abundance but also background noise, including non-specific binding and autofluorescence of the chip surface. In the Affymetrix GeneChip system, each transcript's abundance is measured by a set of 11-20 probe pairs.
Each pair is composed of a perfect match probe (PM), which exactly complements a region on the transcript, and a mismatch probe (MM), which is identical to the PM probe except at the 13th base, where the complementary nucleotide is introduced. MM probes were originally introduced by Affymetrix to measure background noise. However, many groups have shown that MM probes contain a significant amount of PM signal and are therefore unreliable as estimators of background noise (Chudin et al., 2001; Forman et al., 1998; Irizarry et al., 2003; Naef et al., 2003). A true estimate of background noise would improve the quality of Affymetrix GeneChip data.

Inconsistency of the signal generated from each probe is a common phenomenon in GeneChip microarray experiments (Li and Wong, 2001; Nielsen et al., 2005). The differences in the signal produced can be attributed to many sources: optical noise, cross-hybridization, dye-related contributions and probe sequence composition. Many algorithms have been developed to attempt to correct for these inconsistencies (Irizarry et al., 2006; Wu and Irizarry, 2005; Zhang et al., 2003). In particular, it has been found that probe sequence composition can significantly affect the intensity of the signal generated from that probe, independent of the concentration of its target. A number of groups have suggested models in which the background intensity of probes is estimated from their sequence composition (Naef and Magnasco, 2003; Zhang et al., 2003).

The process of nucleic acid hybridization in solution has been well studied, and models such as the nearest-neighbor model provide a robust description of hybridization thermodynamics (SantaLucia and Hicks, 2004). Probe-target hybridization on the microarray surface, however, does not follow the solution analogue, and the nearest-neighbor parameters that describe solution hybridization appear to differ from those for microarrays (Zhang et al., 2003).
On-chip DNA hybridization is likely to be complicated by the geometric constraints of having one strand (i.e. the probe) attached to the surface of the chip (Shchepinov et al., 1997). In addition, many other factors, such as probe and target secondary structure, effective reaction volume, electrostatics, diffusion and surface effects, reaction thermodynamics and kinetics, competitive binding effects, hybridization buffer composition and probe-probe interactions, are believed to affect microarray DNA hybridization (Lima et al., 1992; Southern et al., 1999).

In this study, we examine the effect of predicted probe secondary structure on background hybridization in Affymetrix microarrays. Although microarray probes are attached to the surface of the chip, they are dynamic molecules that, depending on their sequence composition, can fold back on themselves into stable secondary structure. Such stable secondary structure has the potential to interfere with probe-target hybridization (Lima et al., 1992). Consequently, the signal obtained from such probes may not reflect the actual transcript concentrations. It has been shown, for example, that a stable secondary structure motif in a 20-mer probe dramatically decreases the final signal, to the point where the probe is considered insensitive to its intended target (Anthony et al., 2003).

Microarray probes are usually screened for the presence of stable secondary structure either by a simple base complementarity check or by more sophisticated and time-consuming energy minimization algorithms (Markham and Zuker, 2005). The base complementarity check is used more routinely because of its simplicity and speed. Discrepancies between methods do exist, and there are no guidelines that determine which method is preferable (Koehler and Peyret, 2005). It is therefore likely that, despite these screening procedures, a significant amount of secondary structure is present in probes in microarray experiments.
Here we propose that the background noise of each probe can be modeled as a function of its sequence composition and its minimum folding energy and secondary structure. By incorporating probe secondary structure information into a previously described model of background concentration (Naef and Magnasco, 2003), we improved the fit of that model to microarray data by 1-3% with minimal addition of significant free parameters.

2. Methods

2.1. Data sets

Seven data sets were used in this study (Table 1): the human genome U133 Latin Square data set (http://www.affymetrix.com/support/technical/sample_data/datasets.affx), the Choe control data set (Choe et al., 2005), a Leukemia data set (Armstrong et al., 2002), a Malaria PM only data set (Le Roch et al., 2003), an Etoposide response data set (Fodor et al., 2006), a BK potassium channel knockout data set (Meredith et al., 2006; Pyott et al., 2006) and an alternative splicing PM only tiling microarray data set (Sugnet et al., 2006).
2.2. System and software

All the computational work was done on a 73-node Apple cluster. Each node is a dual 2.7 GHz PowerPC G5 with 2 GB RAM running Mac OS X 10.4. Secondary structure prediction was done using the hybrid-min-ss program of the UNAFold-2.5 software package (Markham and Zuker, 2005). All probes were folded as single DNA strands at 45 °C and 1.0 M sodium concentration. All other options were set to the program defaults. Simple linear model fitting and p-value calculations were done using the R linear model function (lm) (http://www.r-project.org). The Naef and Magnasco model (Naef and Magnasco, 2003) and the position-dependent secondary-structure attenuated affinity model were implemented in Perl. All Perl code is available upon request.

3. Results

3.1. Simple linear models

The signal intensity generated from each probe can be modeled as:
Controlling the GC content of the probe is one of the basic principles of microarray probe design. A probe with high GC content tends to hybridize better and to form a stable duplex with both target and non-target sequences. A simple linear model that relates probe intensity to GC content can be written as follows:
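A minimal sketch of this kind of regression, assuming the response is a log-transformed probe intensity and the predictor is the number of G and C bases in the probe. The function names `gc_count` and `fit_line`, the choice of transform, and the toy values are all hypothetical; the study itself used the R `lm` function.

```python
def gc_count(seq):
    """Number of G and C bases in a probe sequence."""
    s = seq.upper()
    return s.count("G") + s.count("C")


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared). Assumes x and y each contain
    at least two distinct values, otherwise a division by zero occurs.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


# Toy data (made up for illustration): probe sequences and log-intensities.
probes = ["ATATAT", "ATGCAT", "GCGCAT", "GCGCGC"]
log_intensity = [1.0, 2.0, 3.0, 4.0]
intercept, slope, r_squared = fit_line([gc_count(p) for p in probes], log_intensity)
```

The r-squared value returned here is the fraction of variance in the response explained by the single predictor, which is the quantity compared across the simple models in this section.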
We asked how much of the background noise probe secondary structure would explain, relative to GC content, when put into a simple linear model. The free energy of probe secondary structure formation (ΔGss) is an indicator of the stability of the secondary structure in which the probe folds on itself. The more stable the secondary structure, the less a probe will be able to hybridize to its target or non-target sequences. As a result, one would expect to observe a low signal from such a probe. How much of all probe variance can be explained directly by secondary structure predictions? A simple linear model is:
If we apply this simple linear model to the Latin Square data set, we find very low r-squared values, R² < 10⁻⁴ (Fig. 1).

One may argue that the low r-squared values in Eq. 3 are due to the fact that the ΔGss value does not reflect the size of the secondary structure motif found in that probe or the number of free bases available for hybridization. For the most stable secondary structure of a probe, the program hybrid-min-ss reports whether each nucleotide is involved in secondary structure formation. We can define a value, SL, which is the longest stretch of nucleotides that are not involved in secondary structure formation (for example, SL = 10 in Fig. 2).
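The SL value can be computed from a per-base paired/unpaired mask. The sketch below assumes the predicted structure has already been parsed into a sequence of booleans; the dot-bracket helper is an assumed convenience format and not necessarily what hybrid-min-ss emits. Both function names are hypothetical.

```python
def longest_unpaired_stretch(paired):
    """SL: length of the longest run of bases not involved in structure.

    `paired` is an iterable of booleans, one per probe base, where True
    means the base takes part in secondary structure formation.
    """
    best = 0
    run = 0
    for is_paired in paired:
        run = 0 if is_paired else run + 1
        best = max(best, run)
    return best


def mask_from_dot_bracket(structure):
    """Assumed input format: '.' marks an unpaired base, anything else paired."""
    return [symbol != "." for symbol in structure]


# Toy 25-mer mask (made up): a 5-base stem, 5-base loop, 5-base stem,
# followed by a 10-base unpaired tail.
toy_mask = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 10
sl = longest_unpaired_stretch(toy_mask)
```

A probe with no predicted structure has SL equal to its full length, and a fully paired probe has SL of zero.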
We see that GC content can explain a modest amount of overall intensity. Models based on secondary structure explain much less of the intensity data, although they are still highly statistically significant.

3.2. Position-dependent secondary-structure attenuated affinity model

The three simple linear models (Eq. 2-4) all hold significant relationships with the observed intensity (Fig. 1). The model of Naef and Magnasco (Naef and Magnasco, 2003) provides a starting point that meets our requirement for individual base information. In this model, probe background is modeled based on sequence composition:
Eq. 5 is a simple model that has four free parameters for each probe base (100 free parameters for a 25-base probe). The values of these 100 free parameters are generated by linear least squares fit (Naef and Magnasco, 2003). Given the large number of probes on each chip (about half a million for the human genome U133 chip, for example) over-fitting is not a concern. In our approach, we add the continuous variable θ to reflect the involvement of the probe nucleotides in secondary structure formation. The model now is written as:
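A minimal sketch of a position-dependent affinity fit of this kind, using NumPy. It assumes the response is log(B/M), with B the raw probe intensity and M the array median, and that each probe contributes one indicator per (position, letter) pair. The function names and the synthetic data are hypothetical, and this is not the authors' Perl implementation.

```python
import numpy as np

LETTERS = "ACGT"


def one_hot(seq):
    """Flattened indicator vector with 4 entries per probe position."""
    row = np.zeros(len(seq) * 4)
    for k, base in enumerate(seq.upper()):
        row[4 * k + LETTERS.index(base)] = 1.0
    return row


def fit_affinities(seqs, intensities):
    """Least squares fit of log(B / median(B)) on per-position letter indicators.

    Returns (A, r_squared), where A has shape (probe_length, 4).
    """
    B = np.asarray(intensities, dtype=float)
    y = np.log(B / np.median(B))  # assumed response transform
    S = np.vstack([one_hot(s) for s in seqs])
    # The four indicators at each position sum to 1, so S is rank deficient.
    # lstsq returns the minimum-norm solution; fitted values are unaffected.
    A, *_ = np.linalg.lstsq(S, y, rcond=None)
    resid = y - S @ A
    centered = y - y.mean()
    r_squared = 1.0 - (resid @ resid) / (centered @ centered)
    return A.reshape(len(seqs[0]), 4), r_squared


# Synthetic example (made up): 500 random 25-mers with noise-free intensities.
rng = np.random.default_rng(0)
toy_seqs = ["".join(rng.choice(list(LETTERS), size=25)) for _ in range(500)]
true_A = rng.normal(scale=0.1, size=100)
toy_B = np.exp(np.vstack([one_hot(s) for s in toy_seqs]) @ true_A + 5.0)
A_hat, r2 = fit_affinities(toy_seqs, toy_B)
```

Because of the rank deficiency noted in the comments, individual affinities are only identified up to a per-position offset; comparisons between models should therefore rest on the fitted values and r-squared rather than on raw coefficients.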
The θ term reflects the degree to which an individual probe base participates in secondary structure formation. In our model, it can take any value between 0 and 1. There are a large number of ways in which values for θ could be generated. We made the following simplifying assumptions. We begin by considering nucleotides that are not involved in secondary structure formation. In cases where a probe's ΔGss > 0 kcal/mol we can set θ for all bases within that probe to 1. Likewise, when a base within a probe is not involved in secondary structure hydrogen bonding (yellow ovals in Fig. 2), θ for that base is set to 1.
This equation has two unknown parameters, ΔGss-cutoff and tb. To find the best values for these parameters, we tested the effects of changing ΔGss-cutoff and tb on the performance of the model (Eq. 6) on a single chip from the Latin Square data set. We found that the best performance of the model was obtained at ΔGss-cutoff = -3.6 kcal/mol and tb = 0.35 (Fig. 4).
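The parameter scan described above can be sketched as a grid search. The form of θ used here is an assumption, not the published definition: bases that are paired in a probe whose ΔGss is at or below the cutoff receive θ = tb, and all other bases receive θ = 1. The function names, the grids, and the synthetic data are hypothetical.

```python
import numpy as np

LETTERS = "ACGT"


def theta(paired, dg_ss, dg_cutoff, tb):
    """Per-base weights. ASSUMED form: tb for paired bases in probes with
    dg_ss <= dg_cutoff, and 1 for every other base."""
    paired = np.asarray(paired, dtype=bool)
    if dg_ss > dg_cutoff:
        return np.ones(paired.size)
    return np.where(paired, tb, 1.0)


def design_row(seq, weights):
    """Letter indicators per position, each scaled by that base's theta."""
    row = np.zeros(len(seq) * 4)
    for k, base in enumerate(seq.upper()):
        row[4 * k + LETTERS.index(base)] = weights[k]
    return row


def fit_r2(seqs, masks, dgs, y, dg_cutoff, tb):
    """R-squared of the least squares fit for one (cutoff, tb) pair."""
    X = np.vstack([design_row(s, theta(m, g, dg_cutoff, tb))
                   for s, m, g in zip(seqs, masks, dgs)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def scan(seqs, masks, dgs, y, cutoffs, tbs):
    """Try every (cutoff, tb) pair; return (best_r2, cutoff, tb)."""
    return max((fit_r2(seqs, masks, dgs, y, c, t), c, t)
               for c in cutoffs for t in tbs)


# Synthetic example (made up): 400 random 25-mers with random structure masks.
rng = np.random.default_rng(1)
toy_seqs = ["".join(rng.choice(list(LETTERS), size=25)) for _ in range(400)]
toy_masks = [rng.random(25) < 0.3 for _ in range(400)]
toy_dgs = rng.uniform(-8.0, 2.0, size=400)
true_coef = rng.normal(scale=0.1, size=100)
toy_y = np.vstack([design_row(s, theta(m, g, -3.6, 0.35))
                   for s, m, g in zip(toy_seqs, toy_masks, toy_dgs)]) @ true_coef
best_r2, best_cutoff, best_tb = scan(toy_seqs, toy_masks, toy_dgs, toy_y,
                                     cutoffs=[-5.0, -3.6, -2.0],
                                     tbs=[0.2, 0.35, 0.5])
```

Each grid point requires a full refit, so the cost grows with the product of the two grid sizes; scanning a single chip, as in the text, keeps this manageable.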
To summarize, we define our position-dependent secondary-structure attenuated affinity model (PSAA) as Eq. 6, where B is the raw probe intensity, M is the median intensity of the array, l is letter index, k is the position of l along the probe, A is the per-site per letter affinity, S a Boolean variable equal to 1 if the probe sequence has l at k and zero otherwise, and θ is:
Here, the involvement of each probe base in secondary structure hydrogen bonding is based on its minimum energy structure. The model defined in Eq. 6 was fitted to all the data sets (Table 1). The fitting was done on the PM and MM probes separately. Table 1 shows a comparison between the native Naef and Magnasco model (Naef and Magnasco, 2003) and our position-dependent secondary-structure attenuated affinity model. We see that including probe secondary structure information improved the fit of the native Naef and Magnasco model (Naef and Magnasco, 2003) by 1-3%, depending on the chip and probe type. Note that all the models (Eq. 2-6) perform better on the MM probes due to the higher background noise present in the MM signal.

3.3. Gains in performance cannot be trivially explained by additional free parameters

We note that there are two distinct kinds of free parameters in our model. The 100 free parameters from the original Naef and Magnasco model (Eq. 5) are calculated for each chip by linear least squares fit. We have added two free parameters in Eq. 6, ΔGss-cutoff and tb. These parameters were determined from one of the Latin Square data set chips, using the curves shown in Fig. 4.
4. Discussion

In the absence of a clear understanding of microarray hybridization mechanisms, and given the frequent use of probes that fold into stable secondary structure under microarray hybridization conditions, a model is needed to explain or approximate the effects of such behavior on microarray signal. Using simple linear models, we saw a modest relationship (R² < 10⁻³) between probe intensity and its ΔGss or SL. As a more powerful alternative to two-parameter linear models, we propose a modification of the Naef and Magnasco model (Naef and Magnasco, 2003) that includes probe secondary structure effects on the background intensity. Our model works by equating an increase in secondary structure with a decreased contribution to a linear least squares fit. If a particular base is involved in secondary structure hydrogen bonding (Fig. 2), its contribution to the fit is reduced accordingly.

The secondary structure information used here is based on the minimum folding energy (ΔGss) and the minimum energy structure, as predicted by an energy minimization algorithm (Markham and Zuker, 2005) that uses the nearest-neighbor parameters (SantaLucia, 1998) to predict secondary structure of single-stranded DNA molecules in solution. In the absence of a clear understanding of how attaching one end of the DNA probe to the chip surface constrains its secondary structure, the nearest-neighbor parameters represent a reasonable approximation for microarrays (Held et al., 2003). We are also fully aware that single-stranded DNA molecules are highly dynamic and that each molecule is likely to exist in an ensemble of structures. Consequently, the predicted minimum folding energy (ΔGss) and minimum energy structure for any single-stranded DNA molecule can differ between prediction algorithms, even when the same folding conditions are used.
The results presented here are based on the minimum folding energy (ΔGss) and the minimum energy structure calculated using UNAFold (Markham and Zuker, 2005). It has been shown that the differences in the predicted minimum folding energy (ΔGss) and the minimum energy structure between different prediction algorithms are small (Ding et al., 2004; Ratushna et al., 2005). Consequently, we would expect similar results no matter which of the currently popular secondary structure prediction algorithms were used.

The results presented in this work suggest that, on average, 1-3% of all the intensities on Affymetrix GeneChip microarrays can be explained by probe secondary structure independent of any target information. Given that not all the probes form stable secondary structure (50% of the human genome U133 Latin Square data set probes, for example, have predicted ΔGss > 0), the 1-3% enhancement over the original model is quite satisfactory, and represents a step forward in understanding the factors that affect the on-chip hybridization process.

The current design of GeneChip microarrays devotes half of the chip to MM probes. The sole purpose of these probes is to estimate the background noise present in the PM signal and thereby enhance the chip's ability to detect differentially expressed genes. Advances in the ability to correctly estimate background noise on Affymetrix GeneChip microarrays from probe sequence information may in the future eliminate the need for MM probes on these arrays, freeing space to interrogate more genes on the same array.

Acknowledgments

This research was supported in part by NIH 1R01GM072619-01 (C.J.G.) and by the UNC-Charlotte GASP program (R.Z.G.). CEL files for the splicing microarray data set were generously provided by Manny Ares.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.