Copyright © 2008 The Author(s)
A competitive hybridization model predicts probe signal intensity on high density DNA microarrays
Gulf Coast Research Lab, Department of Coastal Sciences, University of Southern Mississippi, Ocean Springs, MS 39564, USA
*To whom correspondence should be addressed. Tel: +1 228 872 4278; Fax: +1 228 872 4204; Email: shuzhao.li/at/gmail.com
Received August 20, 2008; Revised October 1, 2008; Accepted October 2, 2008.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
A central, unresolved problem of DNA microarray technology is the interpretation of different signal intensities from multiple probes targeting the same transcript. We propose a competitive hybridization model for DNA microarray hybridization. Our model uses a probe-specific dissociation constant that is computed with the current nearest-neighbor model and existing parameters, and only four global parameters that are fitted to Affymetrix Latin Square data. This model can successfully predict signal intensities of individual probes and therefore makes it possible to quantify the absolute concentration of targets. Our results offer critical insights into the design and data interpretation of DNA microarrays.
INTRODUCTION
Current DNA microarray technology utilizes multiple oligonucleotide probes to detect the concentration of target molecules. These probes, even though they interrogate the same target, often yield very different signal intensities. Without understanding the physicochemistry underlying this problem, the quantification of absolute gene abundance is unattainable and inter-probe comparison is unjustified, leaving DNA microarray technology severely compromised.
A number of physical models have been proposed to address this problem, mostly in the form of Langmuir derivatives (1–10). The Langmuir model is a generic mathematical form that also fits the description of first-order chemical reactions, which is frequently used for probe/target binding on DNA microarrays:
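For orientation, the textbook Langmuir isotherm gives the fraction of occupied probe sites as θ = c/(c + K), where c is the target concentration and K a dissociation constant. The symbols θ, c and K and the function name below are illustrative assumptions, not necessarily the notation of this article. A minimal sketch of how this form behaves as the concentration increases:

```python
def langmuir_occupancy(conc, k_d):
    """Equilibrium fraction of occupied probe sites (textbook Langmuir isotherm).

    conc and k_d must be given in the same concentration units.
    """
    return conc / (conc + k_d)


# Two hypothetical probes with different affinities (illustrative values only).
for k_d in (1.0, 100.0):
    occupancies = [langmuir_occupancy(c, k_d) for c in (1.0, 1e3, 1e6)]
    print(k_d, [round(v, 4) for v in occupancies])
```

In this form, probes with different K differ only in how quickly the occupancy rises with c.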
According to the Langmuir model, all probes should saturate at the same level, which is clearly not the case in microarray hybridizations. Various modifications were proposed to accommodate this difference in saturation levels. A generic version may be written as
The best prediction of probe signals to date was reported by Zhang et al. (13). They accounted for both specific and nonspecific binding in a form that depends on the total target concentration, while fitting 83 parameters to the data. Mei et al. (14) also sought a linear composition of binding energy, in which the single-base energy contribution alone used 75 parameters. Over-parameterization has been a concern in all these previous studies and has invited criticism of their general applicability (15).
After all, a valid physical model of microarray hybridization will have to explain the probe difference through sequence-specific thermodynamics, as its oligonucleotide sequence is the defining property of a microarray probe. The free energy of polynucleotide hybridization in bulk solution has been successfully described by a nearest neighbor (NN) model (16,17). However, this NN model is widely regarded as not applicable to high-density microarray hybridization, as it was either modified and re-parameterized (5,7,9,13,18), or abandoned (1,14,19). We will first demonstrate that the NN model is applicable to probes free of secondary structures. With the thermodynamic component calculated from the NN model, we then propose a new competitive hybridization model to describe the kinetics. Our model, using only four global parameters that are fitted to Affymetrix Latin Square data, can successfully predict signal intensities of individual probes and therefore achieve the absolute quantification of target concentrations.
METHODS
The Affymetrix Latin Square spike-in data U133A were retrieved from Ref. (12). They contained 14×3 hybridizations in which spike-in targets were added at various concentrations from 0 pM to 512 pM. Probe information was obtained through Ref. (20), where only 30 of the 42 probesets were found. A total of 365 probes matched target sequences. Among them, 10 probes with very low signal intensities (under 900 at the highest target concentration) were removed.
In total, 355 probes are included in this study. Background was taken as the signal intensity at zero spike-in concentration and subtracted from the data at other concentrations. No normalization was performed on these data. The probe self-folding energy was computed by RNAStructure [version 4.5, function OligoWalk (21)]. The duplexing energy was computed by the current NN model with the parameters from Ref. (17). Compiled data and computational scripts used in this study are available upon request.
RESULTS AND DISCUSSION
Thermodynamic predictability in DNA microarrays
Microarray probes and targets may form secondary structures by intramolecular self-folding. These structural effects are not accounted for in the NN model, which poses a problem for the thermodynamic calculation. As a first step, we investigated the structural effects through the self-folding energy of probes. In the Affymetrix Latin Square data (see Methods section for details), about 45% of the probes can be selected by a cutoff on the self-folding energy. For these probes, a clear correlation appears between the log signal intensities at the highest target concentration and the duplexing energies computed by the current NN model with existing parameters (Figure 1). If the selection criterion is relaxed, 75% of the probes are included and a correlation is still observed (data not shown). However, the correlation diminishes at lower target concentrations (data not shown). These observations suggest that the current NN model offers a certain degree of predictability, but they cannot be accommodated by previous, Langmuir-like models. A new kinetic model is needed.
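The probe selection and correlation check described above can be sketched as follows. This is a minimal illustration, not the authors' script: the cutoff value, its direction (self-folding energies closer to zero are taken as less structured), the tuple layout and the function names are all assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def energy_intensity_correlation(probes, fold_cutoff):
    """Correlate duplexing energy with log intensity for weakly folding probes.

    probes: iterable of (self_fold_energy, duplex_energy, intensity) tuples,
            energies in kcal/mol, intensity background-subtracted and > 0.
    fold_cutoff: hypothetical threshold; probes whose self_fold_energy is
                 above it (closer to zero) are kept.
    Returns (number of probes kept, Pearson r).
    """
    kept = [(dg, math.log(sig)) for fold, dg, sig in probes if fold > fold_cutoff]
    energies = [dg for dg, _ in kept]
    log_signals = [ls for _, ls in kept]
    return len(kept), pearson(energies, log_signals)
```

Running the same function at several spike-in concentrations would show whether the correlation weakens at lower concentrations, as reported above.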
A competitive hybridization model
We treat DNA microarray hybridization as two subprocesses, the binding of targets to probes and the dissociation of target/probe duplexes. Assuming that equilibrium is reached at the end of hybridization and the binding rate is the same for all target molecules (see below), the dissociation rate is governed by the duplexing energy between paired target/probe. A kinetic equilibrium between binding and dissociation should be observed. Two types of targets are explicitly modeled: ‘specific targets’ (perfect match) with probe-specific dissociation rate kd, and ‘cross-hybridizing targets’ with dissociation rate kn. These cross-hybridizing targets are present in large quantities because partially matching sequences are abundant in a transcriptome. For the moment, we simplify them as a uniform mixture with a probe-nonspecific kn.
The target/probe duplex formation is commonly believed to start with an initiation step, the base-pairing between a small number of nucleotide bases, and then extend to the rest of the complementary regions (22,23). If the initiation step sets the rate limit, the binding rate should be hardly specific to probe sequences. We therefore assume a single binding rate, kb, for all target molecules. How the specific factors (24,25), including adsorption and electrostatics (26), steric and brush effects (27) and labeling (19,28), come into play is not yet entirely clear. In this study, we postulate that the available area of probe spots is the limiting factor in adsorption, so that the binding is described as
Here the binding rate counts the number of target molecules going into the exposed probes per unit time, NA is the Avogadro constant and V is the volume of hybridization solution. On the right side, α is the fraction of probes bound to specific targets and β is the fraction of probes bound to cross-hybridizing targets. p is the total number of probes expressed as a molar concentration (for simplicity, as if they were dissolved in the hybridization solution).
On the other hand, the dissociation is described by
Here the dissociation rate counts the number of target molecules leaving target/probe duplexes per unit time; kd and kn are the dissociation rates for specific targets and cross-hybridizing targets, respectively.
At equilibrium between binding and dissociation,
These conditions involve the concentration of free specific targets and the concentration of free cross-hybridizing targets.
Equations (6) and (7) can be combined to express β as:
The concentration of free specific target molecules is less than the nominal spike-in concentration by the amount bound to probes:
with the nominal spike-in concentration taken as the total amount.
We assume that the concentration of cross-hybridizing targets is large and can be treated as constant in this model. Let
It can be shown that the other analytical solution of Equation (12), which bears a plus sign before the square root, has no valid physical meaning and merits no further discussion. So α, the fraction of probes bound to specific targets, is described by three global parameters (p, kb and γ), one probe-specific parameter (kd) and one variable, the nominal spike-in concentration. kd can be expressed as:
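A minimal numerical sketch of this step, under two stated assumptions rather than the article's exact equations: (i) the equilibrium condition has the form kb·x·(1 − α) = γ·kd·α with free target x = x0 − α·p, which gives a quadratic in α whose smaller root is the physical one, and (ii) kd follows a Boltzmann-type expression exp(ξ·ΔG/RT). The symbol x0 for the nominal concentration, the function names, the default temperature and the units are illustrative choices.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def dissociation_constant(delta_g, xi, temp_k=318.15):
    """Assumed form kd = exp(xi * delta_g / (R * T)), in arbitrary units.

    delta_g: NN duplexing free energy in kcal/mol (negative for stable duplexes).
    xi: scaling factor for binding to immobilized probes.
    temp_k: temperature in kelvin (45 C by default, an assumption).
    """
    return math.exp(xi * delta_g / (R_KCAL * temp_k))


def specific_fraction(x0, kd, kb, p, gamma):
    """Fraction of probes bound to specific targets (alpha).

    Solves kb*p*alpha**2 - (kb*(x0 + p) + gamma*kd)*alpha + kb*x0 = 0 and
    returns the smaller root, written in a form that avoids cancellation
    at low x0. x0, p and kd must share consistent concentration units.
    """
    b = kb * (x0 + p) + gamma * kd
    disc = b * b - 4.0 * kb * kb * p * x0
    return 2.0 * kb * x0 / (b + math.sqrt(disc))
```

Under these assumptions a smaller kd gives a larger α at the same x0, and the larger root of the quadratic exceeds 1 whenever γ·kd > 0, so it cannot represent a fraction of probes.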
Here the energy term is the duplexing energy computed from the NN model, and ξ is a scaling factor that accounts for binding to immobilized probes.
The physical meaning of our model is clear. Both specific binding α and cross-hybridization β compete for the same probe sites. As a result, high-affinity probes (small kd) can achieve a higher fraction of specific binding, while low-affinity probes (large kd) saturate at a lower fraction. γ serves as a cross-hybridization factor. We made assumptions that are important to real experimental settings: a large quantity of cross-hybridizing targets is present; kb is uniform for all targets; and adsorption is limited by the available area of probe spots. These assumptions make our model fundamentally different from previous competitive kinetic models (29,30).
Experimentally, signal intensity is what is observed after washing, where most of the cross-hybridized targets have been washed off:
In this expression, the observed signal intensity is modeled with τ as the residual intensity from cross-hybridized targets, ι as the scanner bias and A as the detection coefficient of fluorescence. As the unit of signal intensities is arbitrarily digitized, it acquires a physical meaning only through A.
Explanation of the correlation
First of all, we shall demonstrate that our model is capable of explaining the correlation at high target concentration in Figure 1. Equation (12) can be rearranged to a logarithmic form:
Note that the second item on the right side still contains the probe-dependent variable α. However, at high target concentration, the bound targets are minor compared with the free targets. This means that the free specific-target concentration can be approximated by the nominal spike-in concentration. Hence, Equation (17) at high target concentration is approximated as:
At high target concentration, both cross-hybridization and scanner bias can be neglected. Therefore Equation (16) can be simplified so that the signal is proportional to α through the detection coefficient A. We substitute this expression for α into Equation (20):
The remaining terms form a constant for a fixed target concentration. Thus, the log signal intensity is inversely correlated to the duplexing energy. The observed correlation is explained by our competitive hybridization model. At low target concentration, the premise is less valid; as a result, the log signal intensity is less correlated to the duplexing energy. A similar effect may be created by very strong probe affinity, where a large fraction of targets is bound to probes and taken out of solution.
Procedure of fitting model to Latin Square data
In DNA microarray experiments, signal intensities are measured in place of fluorescent densities of bound targets. However, common photomultiplier tube scanners usually carry a significant nonlinearity for low signal intensities (31). This means that the lower end of these Affymetrix data may deviate from the true fluorescent densities, a problem that is difficult to correct without the specific instrument calibration data. In addition, the signals from targets below 1 pM are hardly distinguishable from background; therefore, only data from spike-in concentrations of 1 pM and above are used for our modeling. The model fitting matches the theoretical calculation of signal intensity to its experimentally observed counterpart. The theoretical value is defined from Equation (16):
A fixed value of ι is taken. This value is relatively small and has no significant effect on our model parameters, although it is useful for stabilizing the small numbers in the fitting process.
With the theoretical value of signal intensity defined in this way, Equation (13) can be written as
The nominal target concentration is known, the signal intensity is observed, and kd can be calculated from Equation (15). So we only need to fit four global parameters: A, p, kb and γ.
We use a fitness function of weighted squares [similar to Ref. (1)]. For a probe i, the fitting error is calculated as
Here the error compares the observed signal intensity with the value calculated by Equation (24), and t is one of the nominal target concentrations (1–512 pM). Our model in Equation (24) is fitted to the training data by minimizing the sum of Ei through brute-force searches, as heuristic ranges of the four parameters can be obtained from their physical meanings.
A useful constraint on the fitting is the value of P0. This is the signal intensity in Equation (23) at saturation, often referred to as the saturation level of hybridization. P0 should be larger, but not infinitely larger, than the highest signal intensity observed in the experiment. The effect of varying P0 is shown in Figure 3.
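The brute-force search can be sketched as an exhaustive scan over a parameter grid. The exact weighting of the fitting error is not reproduced here; the sketch assumes a relative squared error, and the `model` callable, the record layout and the grid values are hypothetical placeholders for the signal model of Equation (24).

```python
import itertools


def weighted_sq_error(observed, predicted):
    """Assumed weighting: squared error relative to the observed intensity."""
    return ((observed - predicted) / observed) ** 2


def brute_force_fit(model, records, grids):
    """Exhaustively search a parameter grid for the lowest total error.

    model: callable (params, kd, x0) -> predicted intensity.
    records: list of (kd, x0, observed_intensity), one per probe and
             nominal concentration; observed_intensity must be non-zero.
    grids: list of candidate-value lists, one list per global parameter.
    Returns (best parameter tuple, its total error).
    """
    best_params, best_err = None, float("inf")
    for params in itertools.product(*grids):
        err = sum(
            weighted_sq_error(obs, model(params, kd, x0))
            for kd, x0, obs in records
        )
        if err < best_err:
            best_params, best_err = params, err
    return best_params, best_err
```

An exhaustive scan is practical in this setting because only four parameters are fitted and each has a physically motivated range; the cost grows as the product of the grid sizes, so the grids have to stay coarse.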
Probe signal intensities can be successfully modeled
Figure 1 indicates that the duplexing energy (hence kd) can be reasonably approximated by the current NN model for probes free of secondary structures. We use half of these probes to fit our competitive hybridization model and determine the four global parameters, A, p, kb and γ. The evaluation is then performed on the rest of the probes.
The results indicate that our model captures the probe properties well. Figure 4 shows the model results, including the prediction on evaluation data and the case in which about 75% of the total probes are included.
In the previous, heavily parameterized models, the best predictions were reported as correlation coefficients in Refs (13) and (14). In comparison, our four-parameter model is evaluated both on all probes and on the 75% of probes that pass a preliminary selection by secondary structure (see Figure 4).
Prediction of target concentrations
With the four global parameters, target concentration can be calculated from Equation (12):
If we substitute α by its expression in terms of the observed signal,
Since the kd calculation is more accurate for probes free of secondary structures, we focus on the 19 out of the 30 probesets (transcripts) in this study that have five or more probes meeting the self-folding criterion. For these transcripts, Equation (27) is applied to calculate a target concentration from each probe. The final concentration of a transcript is taken as the median of the values from its probes (Figure 5). At low concentrations, the predicted values in Figure 5
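The per-probe calculation and the median step can be sketched as follows. This is a sketch under assumptions, not the article's Equation (27): it takes the equilibrium condition to be kb·x·(1 − α) = γ·kd·α with free target x = x0 − α·p, and estimates α from the signal as (I − ι)/P0. All function names and the record layout are hypothetical.

```python
from statistics import median


def alpha_from_signal(intensity, p0, iota=0.0):
    """Assumed mapping from signal to bound fraction: alpha = (I - iota) / P0."""
    return (intensity - iota) / p0


def target_concentration(alpha, kd, kb, p, gamma):
    """Nominal concentration x0 implied by alpha for a single probe.

    Rearranges kb*(x0 - alpha*p)*(1 - alpha) = gamma*kd*alpha for x0.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return alpha * p + gamma * kd * alpha / (kb * (1.0 - alpha))


def transcript_concentration(probe_records, kb, p, gamma, p0, iota=0.0):
    """Median of the per-probe estimates for one transcript.

    probe_records: list of (intensity, kd) pairs for the probes of a probeset.
    """
    estimates = [
        target_concentration(alpha_from_signal(sig, p0, iota), kd, kb, p, gamma)
        for sig, kd in probe_records
    ]
    return median(estimates)
```

Taking the median rather than the mean limits the influence of individual probes whose kd is poorly estimated.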
DISCUSSION
In DNA microarray experiments, systematic variations stem from sample preparation and instrument operations. They are likely reflected in the global parameters of our model, A, kb and γ. Therefore, batch variations can be expected in these parameters. The highest signal intensity in the Latin Square data was about 16 000. Compared with the saturation level P0, this means that about half of those probe sites are bound to specific targets at that intensity.
We would like to emphasize that kd is the only probe-specific factor in our model, and therefore plays a pivotal role in model accuracy. The accuracy of kd, or of the duplexing energy, in this article is limited by the NN model, which is only a coarse approximation and is affected by probe/target secondary structures. This can be improved, but doing so is beyond the scope of the current study.
We assumed a constant cross-hybridization factor γ for all probes, which may not be the case. Further research on γ may improve the accuracy of our model. We did not deal with the background levels in this study, which are not important to signals at high target concentration but will affect signals at low concentrations. Background levels show a clear probe dependency and are well addressed in other studies (32,33).
CONCLUSION
Our study presents the first model of DNA microarray hybridization that explains probe signal intensities through sequence-based thermodynamic properties without excessive parameter fitting. This fills a long-standing knowledge gap in DNA microarray hybridization. Our model provides a mechanism of absolute quantification and should improve the quality control and reproducibility of the technology. With only four global parameters, this model can be easily calibrated through control features that are built into microarrays, and adopted in practice. We expect new design and quantification algorithms to take advantage of our results.
FUNDING
National Oceanic and Atmospheric Administration (NA05NOS4261163 and NA06NOS42600117).
Conflict of interest statement: None declared.
REFERENCES
1. Hekstra D, Taussig AR, Magnasco M, Naef F. Absolute mRNA concentrations from sequence-specific calibration of oligonucleotide arrays. Nucleic Acids Res. 2003;31:1962–1968.
2. Held GA, Grinstein G, Tu Y. Modeling of DNA microarray data by using physical properties of hybridization. Proc. Natl Acad. Sci. USA. 2003;100:7575.
3. Halperin A, Buhot A, Zhulina EB. Specificity, sensitivity and the hybridization isotherms of DNA chips. Biophys. J. 2004;86:718–730.
4. Abdueva D, Skvortsov D, Tavare S. Non-linear analysis of GeneChip arrays. Nucleic Acids Res. 2006;34:e105.
5. Binder H, Preibisch S. GeneChip microarrays-signal intensities, RNA concentrations and probe sequences. J. Phys. Condens. Matter. 2006;18:S537–S566.
6. Heim T, Tranchevent LC, Carlon E, Barkema GT. Physical-chemistry-based analysis of Affymetrix microarray data. J. Phys. Chem. B. 2006;110:22786–22795.
7. Held GA, Grinstein G, Tu Y. Relationship between gene expression and observed intensities in DNA microarrays–a modeling study. Nucleic Acids Res. 2006;34:e70.
8. Burden CJ, Pittelkow Y, Wilson SR. Adsorption models of hybridization and post-hybridization behaviour on oligonucleotide microarrays. J. Phys. Condens. Matter. 2006;18:5545–5565.
9. Bruun GM, Wernersson R, Juncker AS, Willenbrock H, Nielsen HB. Improving comparability between microarray probe signals by thermodynamic intensity correction. Nucleic Acids Res. 2007;35:e48.
10. Burden CJ. Understanding the physics of oligonucleotide microarrays: the Affymetrix spike-in data reanalysed. Phys. Biol. 2008;5:16004.
11. Skvortsov D, Abdueva D, Curtis C, Schaub B, Tavare S. Explaining differences in saturation levels for Affymetrix GeneChip (R) arrays. Nucleic Acids Res. 2007;35:4154–4163.
12. Affymetrix Latin Square data. http://www.affymetrix.com/support/datasets.affx (6 May 2008, date last accessed).
13. Zhang L, Miles MF, Aldape KD. A model of molecular interactions on short oligonucleotide microarrays. Nat. Biotechnol. 2003;21:818–821.
14. Mei R, Hubbell E, Bekiranov S, Mittmann M, Christians FC, Shen MM, Lu G, Fang J, Liu WM, Ryder T, et al. Probe selection for high-density oligonucleotide arrays. Proc. Natl Acad. Sci. USA. 2003;100:11237–11242.
15. Wu Z, Irizarry RA. Preprocessing of oligonucleotide array data. Nat. Biotechnol. 2004;22:656–658.
16. SantaLucia J., Jr A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl Acad. Sci. USA. 1998;95:1460–1465.
17. Wu P, Nakano S, Sugimoto N. Temperature dependence of thermodynamic properties for DNA/DNA and RNA/DNA duplex formation. FEBS J. 2002;269:2821–2830.
18. Ono N, Suzuki S, Furusawa C, Agata T, Kashiwagi A, Shimizu H, Yomo T. An improved physico-chemical model of hybridization on high-density oligonucleotide microarrays. Bioinformatics. 2008;24:1278–1285.
19. Naef F, Magnasco MO. Solving the riddle of the bright mismatches: Labeling and effective binding in oligonucleotide arrays. Phys. Rev. E. 2003;68:11906.
20. NetAffx Analysis Center. http://www.affymetrix.com/analysis/index.affx (6 May 2008, date last accessed).
21. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc. Natl Acad. Sci. USA. 2004;101:7287–7292.
22. Turner DH. Bloomfield VA, Crothers DM, Tinoco I., Jr Nucleic Acids: Structures, Properties, and Functions. University Science Books, Sausalito, CA. 2000:259–334.
23. Christensen U, Jacobsen N, Rajwanshi VK, Wengel J, Koch T. Stopped-flow kinetics of locked nucleic acid (LNA)-oligonucleotide duplex formation: studies of LNA-DNA and DNA-DNA interactions. Biochem. J. 2001;354:481–484.
24. Levicky R, Horgan A. Physicochemical perspectives on DNA microarray and biosensor technologies. Trends Biotechnol. 2005;23:143–149.
25. Halperin A, Buhot A, Zhulina EB. On the hybridization isotherms of DNA microarrays: the Langmuir model and its extensions. J. Phys. Condens. Matter. 2006;18:S463–S490.
26. Vainrub A, Pettitt BM. Thermodynamics of association to a molecule immobilized in an electric double layer. Chem. Phys. Lett. 2000;323:160–166.
27. Halperin A, Buhot A, Zhulina EB. Brush effects on DNA chips: thermodynamics, kinetics, and design guidelines. Biophys. J. 2005;89:796–811.
28. Zhang L, Hurek T, Reinhold-Hurek B. Position of the fluorescent label is a crucial factor determining signal intensity in microarray hybridizations. Nucleic Acids Res. 2005;33:e166.
29. Zhang Y, Hammer DA, Graves DJ. Competitive hybridization kinetics reveals unexpected behavior patterns. Biophys. J. 2005;89:2950–2959.
30. Bishop J, Blair S, Chagovetz AM. A competitive kinetic model of nucleic acid surface hybridization in the presence of point mutants. Biophys. J. 2006;90:831–840.
31. Shi L, Tong W, Su Z, Han T, Han J, Puri RK, Fang H, Frueh FW, Goodsaid FM, Guo L, et al. Microarray scanner calibration curves: characteristics and implications. BMC Bioinformatics. 2005;6:S11.
32. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A model-based background adjustment for oligonucleotide expression arrays. J. Am. Stat. Assoc. 2004;99:909–917.
33. Schuster EF, Blanc E, Partridge L, Thornton JM. Estimation and correction of non-specific binding in a large-scale spike-in experiment. Genome Biol. 2007;8:R126.