Copyright © 2003, The National Academy of Sciences

Statistics, Biophysics

Modeling of DNA microarray data by using physical properties of hybridization

IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598

* To whom correspondence should be addressed at: IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598. E-mail: gaheld/at/us.ibm.com.

Communicated by Charles H. Bennett, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, April 25, 2003

Received January 31, 2003.

Abstract

A method of analyzing DNA microarray data based on the physical modeling of hybridization is presented. We demonstrate, in experimental data, a correlation between observed hybridization intensity and calculated free energy of hybridization. Then, combining hybridization rate equations, calculated free energies of hybridization, and microarray data for known target concentrations, we construct an algorithm to compute transcript concentration levels from microarray data. We also develop a method for eliminating outlying data points identified by our algorithm. We test the efficacy of these methods by comparing our results with an existing statistical algorithm, as well as by performing a cross-validation test on our model.

Use of DNA microarrays, wherein the expression levels of thousands of genes can be monitored simultaneously (1, 2), has become widespread in biochemical research. Accurate interpretation of microarray data remains a significant challenge, however. The raw data exhibit large fluctuations, the origin of which is unclear, and most algorithms for extracting quantitative expression levels from the data are either empirical or statistical in nature.

The central component of all DNA microarray technologies is an array of different oligonucleotides (between 20 and several hundred bases long) deposited onto a single substrate. 
When a fluorescently labeled nucleic acid sample is washed over the substrate, it hybridizes to those probes that are complementary to it. Subsequent detection of the fluorescent signal allows one to quantify the presence of various sequences (and thus the expression levels of various genes) within the sample. One class of DNA microarrays is exemplified by Affymetrix (Santa Clara, CA) GeneChips (2, 3), wherein each transcript is probed by multiple, short oligomers (typically 20- to 25-mers) or probes. Typically, ≈20 perfect match (PM) probes and an equal number of mismatch (MM) probes correspond to each transcript. Each PM probe exactly complements a short region of the transcript, referred to as the target. The corresponding MM probe is identical to the PM probe except at its centermost base, which is noncomplementary to the transcript. Typical algorithms that use microarray data to infer quantitative transcript expression levels begin by subtracting the MM intensity from the corresponding PM intensity (with adjustments to the MM value if MM > PM) (4). The expression level is then taken to be a weighted average of those differences obtained from all the pairs that probe the given transcript. One advantage of these genechips over other technologies, such as arrays spotted with a single, longer probe for each transcript, is that these chips can provide an absolute measure of gene expression (i.e., transcript copies per cell), whereas spotted arrays typically measure up- or down-regulation relative to some control condition (1). Although many techniques have been developed to identify trends in the expression levels inferred from DNA microarray data (5), less attention has been devoted to methods of obtaining accurate expression levels from the raw data. One promising method, developed by Li and Wong (6), assumes probe-dependent target-binding strengths determined through statistical analysis of data. 
In addition, several studies have been carried out to identify sources of noise present in microarray data (7, 8). However, to date there has been little attempt (9) to analyze raw microarray data by constructing an algorithm based on the underlying physical principles of the hybridization process. In this article we present such an analysis. That is, we develop a model based on calculated hybridization free energies and the rate equation for duplex formation and melting. Using publicly available data (see www.affymetrix.com/analysis/download_center2.affx), we demonstrate a clear correlation between calculated hybridization free energies and observed hybridization intensities, and show that the data can be well modeled by using the static equilibrium solution of the rate equation and incorporating the correlation. We further present a method of systematically identifying and removing outlying data points from the analysis, and compare our results both with the known answers for a controlled, publicly available data set and with the results of an existing statistical algorithm. Finally, we discuss possible causes for variations observed among data sets from nominally equivalent experiments.

Experiments and Data

All the data shown and analyzed in this article are taken from the publicly available results of experiments carried out at Affymetrix (see www.affymetrix.com/analysis/download_center2.affx and ref. 10). In particular, the data are the results of a series of controlled "spike-in" experiments, in which a transcript group comprising known concentrations (each between 0 and 1,024 pM) of 14 human genes is spiked into a background consisting of a labeled mixture of mRNA from a human pancreatic tissue source. (None of the spike-in genes were expressed in the tissue source.) This mixture is then hybridized onto Affymetrix U95A GeneChips. 
Fourteen different transcript groups, each containing different concentrations of the various spike-in genes (following an experimental design known as a Latin square), are then each hybridized onto a GeneChip. The net result is that each of the genes is spiked into one of the transcript groups at each of the following concentrations: 0, 0.25, 0.5, 1, 2, 4,... 1,024 pM. That is, data are collected for all 14 genes, each at 14 different concentrations. Sixteen distinct PM–MM probe pairs interrogate each transcript. Finally, each of the measurements for a given gene at a given concentration was replicated between 2 and 12 times. The raw data provided by Affymetrix contain intensities for each PM and MM probe (the intensities provided correspond to the 75th percentile of the intensity of the scanned pixels associated with the probe site). We performed no normalization of these data before the analysis discussed below. Because two of the genes in the experiment (407_at and 36889_at) had some defective probes, only 12 of the 14 genes are used in our analysis. Note that throughout this article we use Affymetrix notation to identify genes.

Fluctuations and Hybridization Energies

Because target fragments complementary to each of the 16 PM probes representing any given transcript are present with virtually identical concentration in the sample solution, one might expect that in any experiment the intensities of the 16 spots containing these probes should be close in magnitude. In practice, however, significant variations in intensity are reproducibly observed among probes within the probe set interrogating a given target (3). One possible cause for these variations is the tendency of certain probes to hybridize more strongly with their complementary targets than others. In this article we study and exploit this possibility. 
Because the hybridization rate increases as the free-energy differential associated with the reaction becomes more negative, some of the probe-to-probe variations observed in the data might be ascribable to sequence-dependent variations in these free energies. We show below that this indeed is the case and then incorporate these systematic differences in hybridization rates into algorithms designed to infer transcript concentrations from microarray data. Studying the dependence of measured microarray intensities on target–probe hybridization free energies requires first determining those free energies. In the absence of detailed experimental results on hybridization in the confined geometry of microarrays, we use an existing nearest-neighbor model (11) developed for calculating free energies in solution. Obviously this model represents an approximation for the microarray geometry, where (at least) the entropic contributions to free-energy changes clearly differ from those in solution. Because the entropies, in microarrays, of both the initial, confined, single-strand probe and the final, hybridized, probe–target double strand are reduced relative to their values in solution, however, one hopes that the two reductions are close enough to make the approximation reasonable. The fact that such a model provides a reasonable fit to the experimental data provides some a posteriori justification for its use. In nearest-neighbor models, the hybridization free energy of any base pair depends not only on whether that pair is a C-G or an A-T but also on which base pairs occupy the neighboring positions along the strand. Table 2 of ref. 11 gives the changes in enthalpy and entropy, and hence free energy (at 310 K), associated with each of the 10 independent stackings of two base pairs along an oligonucleotide chain in a 1 M solution. 
To calculate the hybridization free energy for any sequence, one sums the contributions for each of these stacked pairs along the chain and adds a correction (see table 2 of ref. 11) for the base pairs terminating the sequence at each end. One now can check whether measured microarray intensities in fact increase monotonically with increasingly negative free-energy changes. Although such a trend is not obvious in the data obtained in a single measurement of the 16 probes representing any single gene, it becomes apparent once one aggregates the Latin-square data from all the genes at a given concentration. Fig. 1
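The summation just described can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis in this article: the function name `hybridization_dg37` is hypothetical, and the stacking and end-correction values are the unified DNA nearest-neighbor parameters of ref. 11 as transcribed here from memory, so they should be checked against table 2 of that reference before any real use. The sketch ignores the symmetry correction for self-complementary sequences.

```python
# Nearest-neighbor free energies at 310 K in kcal/mol (assumed values,
# transcribed from ref. 11; verify against its table 2).
# Keys are 5'->3' dinucleotides on one strand; the 10 independent
# stackings expand to 16 entries because each stack can be read
# from either strand.
NN_DG37 = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}

# Correction for the base pair terminating the duplex at each end
# (assumed values, same source).
END_DG37 = {"G": 0.98, "C": 0.98, "A": 1.03, "T": 1.03}


def hybridization_dg37(seq):
    """Return the duplex free energy (kcal/mol, 310 K) for a DNA sequence.

    The result uses the sign convention of ref. 11 (more negative means
    more stable). The quantity called dG in this article is the negative
    of this value.
    """
    seq = seq.upper()
    if len(seq) < 2 or set(seq) - set("ACGT"):
        raise ValueError("sequence must contain at least two of A, C, G, T")
    # Sum the contribution of every stacked pair along the chain.
    stacking = sum(NN_DG37[seq[i:i + 2]] for i in range(len(seq) - 1))
    # Add one terminal correction per end of the duplex.
    ends = END_DG37[seq[0]] + END_DG37[seq[-1]]
    return stacking + ends
```

With the values above, `hybridization_dg37("CGTTGA")` evaluates to about -5.35 kcal/mol (five stacks summing to -7.36 plus end corrections of 0.98 and 1.03), which corresponds to ΔG = 5.35 kcal/mol in the sign convention of this article.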
One can also use the nearest-neighbor hybridization model to compute the melting temperatures of oligonucleotides in solution (11) and then compare the results with experimental observations. Forman et al. (12) measure the degree of hybridization of specific microarray probes as a function of temperature, observing, for example, that a specific 18-mer melts at 338 K (the temperature at which the fraction of hybridized pairs goes to zero, presumably corresponding to the melting of the full-length 18-mers on the given probe spot). The calculated melting temperature for the same probe is 337 K, suggesting that the application of thermodynamic quantities determined for nucleic acids in solution to confined geometries is reasonable. Calculating the melting temperatures for each of the PM and MM probes for the 12 genes that we analyzed from the Latin-square data set (11, 13), one finds that the variation in melting temperature across the range of free energies represented is ≈20 K, whereas that between a PM–MM probe pair is typically ≈4 K (see Fig. 11, which is published as supporting information on the PNAS web site, www.pnas.org). Thus, incorporating the effects of free energy seems at least as significant as incorporating those of single MMs.

Data Analysis

To understand the dependence of measured intensity on transcript concentration, we plot in Fig. 2
In principle, data from each individual probe should follow Eq. 2, with nI varying little from probe to probe. In practice, however, fitting the data for each probe individually results in values of nI that range over an order of magnitude. To reduce the effects of such fluctuations, we fit all the data with a single nI. This is accomplished by binning the data over free energy for each concentration and then fitting all the binned data simultaneously. In Fig. 3
We fit each of the curves in Fig. 3 From equilibrium thermodynamics, one would expect that ne = exp(–ΔG/RT). The best-fit straight line through the data in the semilog plot of Fig. 4a In Fig. 4b Combining the best-fit value of nI from the fits of Fig. 3

To determine the concentration, c, of a given transcript from the data obtained from a single GeneChip, we plot the observed PM-probe intensities for that transcript as a function of calculated probe hybridization free energy. We then perform a least-squares fit of these data to Eq. 2, using the above values of nI, ne, and bge, with c as the only adjustable parameter and the constraint c ≥ 0. In Fig. 5
To improve the analysis, it is desirable to identify those data points that have either anomalously low or high intensities. Such points can be identified by their large distance from the best-fit curve of the model just described. In particular, we calculate the mean distance from the 16 data points for a given transcript to the best-fit curve, and the standard deviation of these distances. We next identify all the data points with distances from the best fit that exceed the mean distance by at least 1 standard deviation. We discard these data points and then refit the remaining data. In Fig. 5

Analysis of the Accuracy of the Algorithm

Having used the above algorithm to obtain fitted values of the concentration for each of the spike-in transcripts in each of the Latin-square experiments, we can compare these results with both the known spike-in values of the concentrations and the signal values obtained by using Affymetrix software. For each spike-in concentration (i.e., 0.25–1,024 pM), we plot the median value of the calculated concentrations of all transcript measurements taken at that spike-in concentration (Fig. 6
Note that for the spike-in concentration 0.25 pM, the median value of the fits with our model was 0 (the lower limit allowed by our fitting procedure); thus, this point could not be included in the log–log plot above. If we also exclude our results for the spike-in value of 0.5 pM, the best-fit slope to our fitted concentrations becomes 1.01. Given the anomalously low values of our results at 0.25 and 0.5 pM, we believe that our predicted concentrations are not valid for spike-in concentrations <1 pM. (Note that the Affymetrix software predicts that a transcript is present in only 18%, 36%, and 33% of the 0.25-, 0.5-, and 1-pM spike-in experiments, respectively.) These unreliable predictions at low concentration are very likely connected with the fact that the background is comparable with the total signal at the lowest concentrations. Fig. 7
Fig. 8
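The outlier-rejection step described under Data Analysis (discard points whose distance from the best-fit curve exceeds the mean distance by at least 1 standard deviation, then refit) can be sketched as follows. This is a minimal illustration under stated assumptions: the names `reject_outliers` and `fit_line` are hypothetical, the distance is taken as the absolute vertical residual, and a straight line stands in for the fitted model, since the actual fit in this article is to Eq. 2.

```python
import numpy as np


def fit_line(x, y):
    """Stand-in model (an assumption, not Eq. 2): least-squares straight line."""
    slope, intercept = np.polyfit(x, y, 1)
    return lambda t: slope * np.asarray(t, dtype=float) + intercept


def reject_outliers(x, y, fit_model):
    """Fit, drop points far from the fit, and refit the remaining points.

    fit_model(x, y) must return a callable that predicts y from x.
    Returns a boolean mask of the retained points and the refitted predictor.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    predict = fit_model(x, y)
    # Distance of each point from the best-fit curve.
    dist = np.abs(y - predict(x))
    spread = dist.std()
    if spread == 0.0:
        # All points are equally distant, so there is nothing to reject.
        keep = np.ones(len(x), dtype=bool)
    else:
        # Discard points exceeding the mean distance by >= 1 standard deviation.
        keep = dist < dist.mean() + spread
    return keep, fit_model(x[keep], y[keep])
```

For example, with 16 points lying on a straight line and one of them displaced far above it, the call `reject_outliers(x, y, fit_line)` drops only the displaced point, and the refit recovers the underlying line. Any other fitting routine with the same call signature could be passed in place of `fit_line`.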
As a final test of the efficacy of our methods, we performed a cross-validation study on the Latin-square data. The data set first was divided into three nonoverlapping subsets, each comprising the data from all experiments performed on 4 of the 12 genes. One of the three subsets was arbitrarily chosen as the "test" subset, and the data from the other two subsets then were used as the "training" subset to determine values of nI, bge, and ne as described above. These values were then used (again, just as described above) to fit the concentrations of the genes of the test subset. This procedure was then repeated twice more, with each of the two remaining subsets used as the test subset. The results for all three of these tests are combined and displayed in Fig. 9

Discussion

In this article we have described algorithms, based on simple physical principles of hybridization, for extracting gene-expression levels from microarray data. The results for the controlled Latin-square spike-in data were comparable in accuracy with those obtained from the statistical Affymetrix algorithm. Specifically, for multiple measurements taken at the same spike-in concentration, our algorithm tended to yield a more accurate median prediction, whereas the Affymetrix software tended to yield a narrower distribution of predicted values. At least in part, the accuracy of our model is limited by the quantity and range of control data available as a training set. More control data, including some at higher spike-in concentrations, would presumably reduce the fluctuations in the data (e.g., Fig. 3

Although the dependence of hybridization strength on free energy clearly accounts for some of the observed variations in probe intensity, there are substantial systematic variations that, equally clearly, have different origins. This may be seen in Fig. 10
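The cross-validation split described earlier (three nonoverlapping subsets of 4 genes, each used once as the test subset while the other two form the training subset) can be sketched as below. The function name `cross_validation_splits` and the gene labels are placeholders, and assigning genes to subsets in list order is an assumption made here for illustration; the text does not state how the subsets were chosen.

```python
def cross_validation_splits(genes, n_subsets=3):
    """Yield (training, test) gene lists, one pair per subset.

    The genes are cut into n_subsets equal, nonoverlapping subsets in
    list order. Each subset serves once as the test subset, and the
    remaining subsets together form the training subset.
    """
    genes = list(genes)
    if len(genes) % n_subsets != 0:
        raise ValueError("number of genes must be divisible by n_subsets")
    size = len(genes) // n_subsets
    subsets = [genes[i * size:(i + 1) * size] for i in range(n_subsets)]
    for k, test in enumerate(subsets):
        training = [g for j, s in enumerate(subsets) if j != k for g in s]
        yield training, test


# Placeholder labels standing in for the 12 genes used in the analysis.
example_genes = ["gene_%02d" % i for i in range(1, 13)]
example_splits = list(cross_validation_splits(example_genes))
```

Each element of `example_splits` pairs 8 training genes with 4 test genes. In the procedure described in the text, the training genes would be used to determine nI, bge, and ne, and those values would then be used to fit the concentrations of the test genes.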
Note from Fig. 10 We noted earlier that, in Fig. 4a

Supporting Information
Notes

Abbreviations: PM, perfect match; MM, mismatch.

Note. After submission of this article, we became aware of concurrent research (19) that addresses similar questions to our work, albeit through different methods.

Footnotes

†We have defined ΔG as the negative of the free energy of hybridization; thus, increasing ΔG implies a stronger tendency to hybridize. Our ΔG is equal to –ΔG0 in the notation of ref. 11.

References

1. Brown, P. O. & Botstein, D. (1999) Nat. Genet. 21, 33–37.
2. Lipshutz, R. J., Fodor, S. P. A., Gingeras, T. R. & Lockhart, D. J. (1999) Nat. Genet. 21, 20–24.
3. Lockhart, D. J., Dong, H. L., Byrne, M. C., Follettie, M. T., Gallo, M. V., Chee, M. S., Mittmann, M., Wang, C. W., Kobayashi, M., Horton, H. & Brown, E. L. (1996) Nat. Biotechnol. 14, 1675–1680.
4. Affymetrix (2001) Statistical Algorithms Reference Guide, Affymetrix Technical Note (Affymetrix, Santa Clara, CA).
5. Mount, D. W. (2001) Bioinformatics: Sequence and Genome Analysis (Cold Spring Harbor Lab. Press, Plainview, NY), pp. 519–523.
6. Li, C. & Wong, W. H. (2001) Proc. Natl. Acad. Sci. USA 98, 31–36.
7. Naef, F., Hacker, C. R., Patil, N. & Magnasco, M. (2002) Genome Biol. 3, research0018.1–0018.11.
8. Tu, Y., Stolovitzky, G. & Klein, U. (2002) Proc. Natl. Acad. Sci. USA 99, 14031–14036.
9. Affymetrix (2001) Array Design for the GeneChip Human Genome U133 Set, Affymetrix Technical Note (Affymetrix, Santa Clara, CA).
10. Affymetrix (2001) New Statistical Algorithms for Monitoring Gene Expression on GeneChip Probe Arrays, Affymetrix Technical Note (Affymetrix, Santa Clara, CA).
11. SantaLucia, J. (1998) Proc. Natl. Acad. Sci. USA 95, 1460–1465.
12. Forman, J. E., Walton, I. D., Stern, D., Rava, R. P. & Trulson, M. O. (1998) in Molecular Modeling of Nucleic Acids, eds. Leontis, N. B. & SantaLucia, J. (Am. Chem. Soc., Washington, DC), Vol. 682, pp. 206–228.
13. Peyret, N., Seneviratne, P. A., Allawi, H. T. & SantaLucia, J. (1999) Biochemistry 38, 3468–3477.
14. Hill, A. A., Brown, E. L., Whitley, M. Z., Tucker-Kellog, G., Hunter, C. P. & Slonim, D. K. (2001) Genome Biol. 2, research0055.1–0055.13.
15. Hoffmann, R., Seidl, T. & Dugas, M. (2002) Genome Biol. 3, research0033.1–0033.11.
16. Dai, H. Y., Meyer, M., Stepaniants, S., Ziman, M. & Stoughton, R. (2002) Nucleic Acids Res. 30, e86.
17. Kepler, T. B., Crosby, L. & Morgan, K. T. (2002) Genome Biol. 3, research0037.1–0037.12.
18. Cantor, C. R. & Schimmel, P. R. (1980) Biophysical Chemistry Part III: The Behavior of Biological Macromolecules (Freeman, San Francisco).
19. Hekstra, D., Taussig, A. R., Magnasco, M. & Naef, F. (2003) Nucleic Acids Res. 31, 1962–1968.