![]() | ![]() |
Formats:
|
||||||||||||||||||||||||
Copyright © 2005 Choe et al.; licensee BioMed Central Ltd. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset 1Department of Genetics, Harvard Medical School, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA 2Division of Genetics, Department of Medicine, Brigham and Women's Hospital, New Research Building, 77 Avenue Louis Pasteur, Boston, MA 02115, USA 3Howard Hughes Medical Institute, Brigham and Women's Hospital, 20 Shattuck Street, Boston, MA 02115, USA 4Department of Biochemistry, 140 Farber Hall, 3435 Main St., SUNY at Buffalo, Buffalo, NY 14214, USA 5Center of Excellence in Bioinformatics, 140 Farber Hall, 3435 Main St., SUNY at Buffalo, Buffalo, NY 14214, USA 6German Cancer Research Center (DKFZ/B110), Im Neuenheimer Feld 580, 69120 Heidelberg, Germany Corresponding author.Sung E Choe: sung_choe/at/post.harvard.edu; Michael Boutros: m.boutros/at/dkfz.de; Alan M Michelson: michelson/at/receptor.med.harvard.edu; Marc S Halfon: mshalfon/at/buffalo.edu Received August 3, 2004; Revised October 20, 2004; Accepted December 2, 2004. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Background As more methods are developed to analyze RNA-profiling data, assessing their performance using control datasets becomes increasingly important. Results We present a 'spike-in' experiment for Affymetrix GeneChips that provides a defined dataset of 3,860 RNA species, which we use to evaluate analysis options for identifying differentially expressed genes. The experimental design incorporates two novel features. First, to obtain accurate estimates of false-positive and false-negative rates, 100-200 RNAs are spiked in at each fold-change level of interest, ranging from 1.2 to 4-fold. Second, instead of using an uncharacterized background RNA sample, a set of 2,551 RNA species is used as the constant (1x) set, allowing us to know whether any given probe set is truly present or absent. Application of a large number of analysis methods to this dataset reveals clear variation in their ability to identify differentially expressed genes. False-negative and false-positive rates are minimized when the following options are chosen: subtracting nonspecific signal from the PM probe intensities; performing an intensity-dependent normalization at the probe set level; and incorporating a signal intensity-dependent standard deviation in the test statistic. Conclusions A best-route combination of analysis methods is presented that allows detection of approximately 70% of true positives before reaching a 10% false-discovery rate. We highlight areas in need of improvement, including better estimate of false-discovery rates and decreased false-negative rates. Background Since their introduction in the mid 1990s [1,2], expression-profiling methods have become a widespread tool in numerous areas of biological and biomedical research. However, choosing a method for analyzing microarray data is a daunting task. Dozens of methods have been proposed for the analysis of both high-density oligonucleotide (for example, Affymetrix GeneChip) and spotted cDNA or long oligonucleotide arrays, with more being put forward on a regular basis [3]. Moreover, it is clear that different methods can produce substantially different results. For example, two lists of differentially expressed genes generated from the same dataset can display as little as 60-70% overlap when analyzed using different methods ([4] and see Additional data file 1). Despite the large number of proposed algorithms, there are relatively few studies that assess their relative performance [5-9]. A significant challenge to undertaking such studies is the scarcity of control datasets that contain a sufficiently large number of known differentially expressed genes to obtain adequate statistics. The comparative studies that have been performed have used a small number of positive controls, and have included a background RNA sample in which the concentrations of the various genes are unknown, preventing an accurate assessment of false-positive rates and nonspecific hybridization. The most useful control datasets to date for evaluating the effectiveness of analysis methods for Affymetrix arrays are cRNA spike-in datasets from Affymetrix and Gene Logic. The Affymetrix Latin square dataset [10] is a series of transcriptional profiles of the same biological RNA sample, into which 42 cRNAs have been spiked at various known concentrations. The dataset is designed so that, when comparing any two hybridizations in the series, all known fold changes are powers of two. The Gene Logic dataset [11] has a similar experimental design, but with 11 cRNAs spiked in at varying fold changes, ranging from 1.3-fold upwards. Here we present a new control dataset for the purpose of evaluating methods for identifying differentially expressed genes (DEGs) between two sets of replicated hybridizations to Affymetrix GeneChips. This dataset has several features to facilitate the relative assessment of different analysis options. First, rather than containing a limited number of spiked-in cRNAs, the current dataset has 1309 individual cRNAs that differ by known relative concentrations between the spike-in and control samples. This large number of defined RNAs enables us to generate accurate estimates of false-negative and false-positive rates at each fold-change level. Second, the dataset includes low fold changes, beginning at only a 1.2-fold concentration difference. This is important, as small fold changes can be biologically relevant, yet are frequently overlooked in microarray datasets because of a lack of knowledge as to how reliably such small changes can be detected. Third, our dataset uses a defined background sample of 2,551 RNA species present at identical concentrations in both sets of microarrays, rather than a biological RNA sample of unknown composition. This background RNA population is sufficiently large for normalization purposes, yet also enables us to observe the distribution of truly nonspecific signal from probe sets which correspond to RNAs not present in the sample. We have used this dataset to compare several algorithms commonly used for microarray analysis. To perform a direct comparison of the selected methods at each stage of analysis, we applied all possible combinations of options to the data. Thus, it was possible to assess whether some steps are more critical than others in maximizing the detection of true DEGs. Our results show that at several steps of analysis, large differences exist in the effectiveness of the various options that we considered. These key steps are: first, adjusting the perfect match probe signal with an estimate of nonspecific signal (the method from MAS 5.0 [12] performs best); second, checking that the log fold changes are roughly distributed around 0 (by observing the so-called M versus A plot [13], the plot of log fold change (M) versus average log signal intensity (A)), and if necessary, performing a normalization at the probe-set level to center this plot around M = 0; and third, choosing the best test statistic (the regularized t-statistic from CyberT [14] is most accurate). Overall, we find a significant limit to the sensitivity of microarray experiments to detect small changes: in the best-case scenario we could detect approximately 95% of true DEGs with changes greater than twofold, but less than 30% with changes below 1.7-fold before exceeding a 10% false-discovery rate. We propose a 'best-route' combination of existing methods to achieve the most accurate assessment of DEGs in Affymetrix experiments. Results and discussion Experimental design A common use of microarrays is to compare two samples, for example, a treatment and a control, to identify genes that are differentially expressed. We constructed a control dataset to mimic this scenario using 3,860 individual cRNAs of known sequence in a concentration range similar to what would be used in an actual experimental situation (see Materials and methods). The cRNAs were divided into two samples - 'constant' (C) and 'spike' (S) - and each sample was hybridized in triplicate to Affymetrix GeneChips (six chips total). The S sample contains the same cRNAs as the C sample, except that selected groups of approximately 180 cRNAs each are present at a defined increased concentration compared to the C sample (Figure (Figure1,1
Assignment of Affymetrix probe sets to DGC clones In the Affymetrix GeneChip design, the expression level of each RNA species is reported by a probe set, which in the DrosGenome1 chip [15] comprises 14 oligonucleotide probe pairs. Each probe pair contains two 25-mer DNA oligonucleotide probes; the perfect match (or PM) probe matches perfectly to the target RNA, and the mismatch (or MM) probe is identical to its PM partner probe except for a single homomeric mismatch at the central base-pair position, and thus serves to estimate nonspecific signal. The DrosGenome1 chip used in this experiment is based on release version 1.0 of the Drosophila genome sequence and thus does not represent the most up-to-date annotated version of the genome. To ensure that probe-target assignments are made correctly, we assigned the 14,010 probe sets on the DrosGenome1 GeneChip to the spiked-in RNAs by BLAST of the individual PM probe sequences against the Drosophila Gene Collection release 1.0 (DGC [16]) clone sequences that served as the template for the cRNA samples (Materials and methods). Of the 3,860 DGC clones used in this study, 3,762 (97%) have full-length cDNA sequence available at the DGC web site, 90 have 3' and 5'-end sequence only, and eight have no available sequence. For each probe set, all clone sequences with BLAST matches to PM probe sequences in that probe set are collected, allowing at most two (out of 25 base-pair (bp)) mismatches, and only allowing matches on the correct strand. If at least three PM sequences match to a given clone, then the probe set is assigned to that clone. Matches of one probe set to more than one clone are allowed. In this manner, 3,866 probe sets are assigned to at least one DGC clone each. Among these probe sets, 1,331 have an increased concentration between the S and C chips, whereas 2,535 represent RNAs with equal concentration between the two samples. Among those probe sets which do not have any assignment using this criterion, if fewer than three PM probes within the probe set have a BLAST match to any clone, the probe set is then called 'empty' (that is, its signal should correspond to nonspecific hybridization). There are 10,131 empty probe sets; combined with the 2,535 1x probe sets, about 90% of the probe sets on the chip represent RNAs with constant expression level between the C and S samples. The rest of the probe sets are then called 'mixed', meaning that they match to more than one clone, but each with only a few PM probe matches. There are only 13 mixed probe sets. The numbers of probe sets assigned to each fold-change class are depicted in Table 1. Assessment of absent/present call metrics Our dataset design provides the rare knowledge of virtually all of the RNA sequences within a complex sample (excepting the small number (3%) of clones for which only partial sequence was available, and the possible rare mistakenly assigned or contaminated clone). We can therefore evaluate various absent/present call metrics on the basis of their ability to distinguish between the known present and absent RNAs. We investigate this issue at both the probe pair level and probe set level. For the probe pair level assessment, we first identify the probe pairs which we expect to show signal, and those which should not. We thus define two classes of probe pairs: first, perfect probe pairs, whose PM probe matches perfectly to a target RNA sequence, and neither PM nor MM probe matches to any other RNA in the sample with a BLAST E-value cutoff of 1 and word size of 7, and second, empty probe pairs, whose PM and MM probes do not match to any RNA sequence when using the same criteria. On the chip, which contains 195,994 probe pairs, there are 50,859 perfect probe pairs and 117,904 empty ones. Observation of the signal for these probe pairs (Figure 2a,b , and log2(PM) - to distinguish between perfect and empty probe pairs, by calculating receiver-operator characteristics (ROC) curves using the perfect probe pairs as true positives and the empty ones as true negatives. Each point on a curve depicts the specificity and sensitivity for RNA detection, when using a specific value of the corresponding metric as a cutoff for classifying probe sets as present or absent. Instead of depicting the false-positive rate (the fraction of true negatives that are detected as present) on the x-axis, which is customary for these types of graphs, we show the false-discovery rate (the fraction of detected probe sets which are true negatives), which distinguishes between the metrics more effectively for the top-scoring probe sets. Figure Figure22
When signals from the 14 probe pairs in each probe set are combined to create a composite absence/presence call, a much larger fraction of the spiked-in RNA species can be detected reliably. To obtain absent/present calls at the probe-set level, we perform the Wilcoxon signed rank test using each of the metrics listed above [17]. The p-values from this test are used to generate the ROC curves in Figure Figure2d.2d Generating expression summary values The first task in analyzing Affymetrix microarrays is to combine the 14 PM and 14 MM probe intensities into a single number ('expression summary') which reflects the concentration of the probe set's target RNA species. Generating this value involves several discrete steps designed to subtract background levels, normalize signal intensities between arrays and correct for nonspecific hybridization. To compare the effectiveness of different analysis packages at each of these steps, we created multiple expression summary datasets using every possible (that is, compatible) combination of the options described below. Algorithms were chosen for their popularity with microarray researchers and their open-source availability, and were generated using the implementations found in the Bioconductor 'affy' package [18]. Figure Figure33
Background correction An estimate of the background signal, which is the signal due to nonspecific binding of fluorescent molecules or the autofluorescence of the chip surface, was generated using two possible metrics. The MAS background [17] is calculated on the basis of the 2nd percentile signal in each of 16 subsections of the chip, and is thus a spatially varying metric. The Robust Multi-chip Average (RMA) algorithm [22] subtracts a background value which is based on modeling the PM signal intensities as a convolution of an exponential distribution of signal and a normal distribution of nonspecific signal. Normalization at the probe level The signal intensities are normalized between chips to allow comparisons between them. Because in our dataset, a large number of RNAs are increased in S versus C (and none are decreased), commonly used methods often result in apparent downregulation for spiked-in probe sets in the 1x change category. We thus added a set of modified normalization methods which used our knowledge of the 1x probe sets. The following different methods were applied. Constant is a global adjustment by a constant value to equalize the chip-wide mean (or median) signal intensity between chips. Constantsubset is the same global adjustment but equalizing the mean intensity for only the probe sets with fold change equal to 1. Invariantset [23] is a nonlinear, intensity-dependent normalization based on a subset of probes which have similar ranks (the rank-invariant set) between two chips. Invariantsetsubset is the same as invariantset but the rank-invariant set is selected as a subset of the probe sets with fold change equal to 1. Loess normalization [24] is a nonlinear intensity-dependent normalization which uses a local regression to make the median fold change equal to zero, at all average intensity levels. Loesssubset normalization is the same as loess but using only the probe sets with fold change equal to 1. Quantile normalization [24] enforces all the chips in a dataset to have the same distribution of signal intensity. Quantilesubset normalization is the same as quantile but normalizes the spiked-in and non-spiked-in probe sets separately. PM correction We chose three ways to adjust the PM signal intensities to account for nonspecific signal. The first is to subtract the corresponding MM probe signal (subtractmm). The second is the method used in MAS 5.0, in which negative values are avoided by estimating the nonspecific signal when the MM value exceeds its corresponding PM intensity [17]. The third is PM only (no correction). The subtractmm and MAS methods are compatible only with the MAS background correction method; that is, it does not make sense to combine these with RMA background correction. Expression summary The 14 probe intensity values were combined using one of the following robust estimators: Tukey-biweight (MAS 5.0); median polish (RMA); or the model-based Li-Wong expression index (dChip). Analyses including the subtractmm PM correction method require dealing with negative values when PM is less than MM, which occurs in about a third of the cases. Within Bioconductor, the Li-Wong estimator can handle negative values, but the other two metrics mostly output 'not applicable' (NA) for the probe set when any of the constituent probe pairs has negative PM - MM. The result for MAS and median polish is NA for about 85% of the probe sets on the chip. To study the consequence of losing so many probe sets, we modified one of these two metrics (median polish) to accept negative (PM - MM) (medianpolishna), and added this metric whenever subtractmm was used. Normalization at the probe set level Many of the expression summary datasets that were produced still show a dependence of fold change on the signal intensity (Figure (Figure4a).4a
Comparison of the observed fold changes with known fold changes For each of the 150 expression summary datasets that we generated, fold changes between the S and C samples were calculated and then compared with the actual fold changes. Most expression summary datasets show good correlation between the observed and actual fold changes (Figure (Figure5).5
We also observed that the fold changes resulting from the chips are consistently lower than the actual fold changes. Apparently, the decrease in fold change is only partly the result of signal saturation (Figure 5b-c Test statistics and ROC curves Because a typical microarray experiment contains a large number of hypotheses (here 14,010) and a limited number of replicates (in this case three), high false-positive rates are a common problem in identifying DEGs. An important factor in minimizing false positives is to incorporate an appropriate error model into the signal/noise metric. We compared three t-statistic variants, which differ in their calculations of noise. The first is significance analysis for microarrays (SAM) [26], in which the t-statistic has a constant value added to the standard deviation. This constant 'fudge factor' is chosen to minimize the dependence of the t-statistic variance on standard deviation levels. The second is CyberT [14], in which the standard deviation is modeled as a function of signal intensity. The third is the basic (Student's) t-statistic. For CyberT and the basic t-test, we performed the tests on the expression summaries after log transformation, as well as on the raw data. As shown in the example ROC curve, the CyberT statistic outperforms the other statistics for the vast majority of expression summary datasets (Figure (Figure6a).6a
Comparison of options at each of the other analysis steps Performance of the various options that were investigated varied significantly, as seen by the ROC curves shown in Figure Figure7.7
Among the background correction methods, the MAS 5.0 method generally performs better than the RMA method (Figure (Figure7b).7b With respect to adjusting the PM probe intensity with an estimate of nonspecific signal, Figure Figure7d7d Figure Figure7e7e We were concerned that some of our analyses might be confounded by a possible correlation between low fold change and low expression summary levels, which could affect the interpretations of Figure Figure77 Models dependent on probe sequence provide a promising route to improving the accuracy of nonspecific signal measures. Here, we applied two different models (perfect match and gcrma) to the control dataset. With respect to detecting the true DEGs, these two models perform reasonably well, although slightly less well than the MAS 5.0 PM correction method. When we consider only the low signal DEGs (Additional data file 5), gcrma outperforms perfect match, and is similar in effectiveness to the top analysis option combinations. Estimating false discovery rates We have identified a set of analysis choices that optimally ranks genes according to significance of differential expression. To decide how many of the top genes to investigate further in follow-up experiments, it would be useful to have accurate estimates of the false-discovery rate (FDR or q-value), which is the fraction of false positives within a list of genes exceeding a given statistical cutoff. We used our control dataset to compare the actual q-values for the 10 optimal expression summary datasets with q-value estimates from the permutation method implemented in SAM. As shown in Figure Figure8b,8b
Assessment of sensitivity and specificity As the identities and relative concentrations of each of the RNAs in the experiment were known, we were able to assess directly the sensitivity and specificity obtained by the best-performing methods. Examination of the ROC curves in Figure Figure77 We next looked at the dependence of sensitivity and specificity on the magnitude of the spiked-in fold-change value. We find that at q = 10%, sensitivity is increased to 93% when only cRNAs that differ by twofold or more are considered as DEGs (Figure (Figure9a).9a
It is tempting to conclude from this that we are achieving adequate sensitivity in our experiments and merely need not bother with DEGs below the twofold change level. However, we would argue that obtaining greater sensitivity should be an important goal. There is ample demonstration in the biological and medical literature that small changes in gene expression can have serious phenotypic consequences, as seen both from haploinsufficiencies and from mutations that reduce levels of gene expression through transcriptional regulation or effects on mRNA stability. Furthermore, effective fold changes seen in a microarray experiment might be considerably smaller than actual fold changes within a cell, if the sample contains additional cell populations that dilute the fold-change signal. As it is often not possible to obtain completely homogeneous samples (for example, when profiling an organ composed of several specialized cell types), this is likely to prove a very real limitation to detecting DEGs. In cases where pure cell populations can be obtained, for example by laser capture microdissection, the numbers of cells are often small and RNA needs to undergo amplification in order to have enough for hybridization. Here, non-linearities in RNA amplification might also lead to observed fold changes that fall below the twofold level. We used three microarray replicates for this study, as this is frequently the number chosen by experimentalists because of cost and limiting amounts of RNA. One possible extension of this work would be to examine how many replicates are necessary for reliable detection of DEGs at a given fold change level. Conclusions We have compared a number of popular analysis options for the purpose of identifying differentially expressed genes using an Affymetrix GeneChip control dataset. Clear differences in sensitivity and specificity were observed among the analysis method choices. By trying all possible combinations of options, we could see that choices at some steps of analysis are more critical than at others; for example, the normalization methods that we considered perform similarly, whereas the choice of the PM adjustment method can strongly influence the accuracy of the results. On the basis of our observations, we have chosen a best route for finding DEGs (Figure (Figure3).3 Materials and methods cRNA and hybridization PCR products from Drosophila Gene Collection release 1.0 cDNA clones [16] were generated in 96-well format, essentially as described [29]. Each PCR product includes T7 and SP6 promoters located 5' and 3' to the coding region of the cDNA, respectively. Each PCR reaction was checked by gel electrophoresis for a band of detectable intensity and the correct approximate size. Those clones which did not yield PCR product were labeled as 'failed' and eliminated from subsequent analysis. From sequence verification of randomly selected clones, we estimate the number of mislabeled clones to be < 3%. The contents of the plates were collected into 19 pools, such that each pool contained the PCR product from one to four plates (approximately 96-384 clones). Biotinylated cRNA was generated from each pool using SP6 polymerase (detailed protocol available upon request) and the reactions were purified using RNeasy columns (Qiagen). Concentration and purity for each pool was determined both by spectrophotometry and with an Agilent Bioanalyzer. The labeled products were then divided into each of two samples - constant (C) and spike (S) - at specific relative concentrations (Table 1, Figure Figure1).1 Estimate of RNA concentrations The total amount of labeled cRNA that was added to each chip (approximately 18 μg) was comparable to a typical Affymetrix experiment (20 μg). Although we do not know the individual RNA concentrations, we estimate that these span the average RNA concentration in a biological GeneChip experiment. Our biological RNA samples typically result in about 40% of the probe sets on the DrosGenome1 chip called present, so the mean amount of individual RNA is 20 μg/(14,010 × 0.40) = 0.003 μg/RNA. In the C chips, the average concentration of individual RNAs in the different pools range from 0.0008 to 0.007 μg/RNA, so the concentrations are roughly similar to those in a typical experiment. We note, however, that there is no way to ensure that the concentration distribution is truly reflective of a real RNA distribution. This is especially true with respect to the low end of the range, as it is usually unknown how many of the absent genes on an array are truly absent versus weakly expressed and thus poorly detected by the analysis algorithms used. Therefore, our analysis possibly favors methods that perform best when applied to highly expressed genes. Software All of the analysis was performed using the statistical program R [30], including the affy and gcrma packages from Bioconductor [18], and scripts adapted from the hdarray library by Baldi et al. [31,32]. In addition, we used the dChip [19], MAS 5.0 [12], Perfect Match [20,21] and SAM [27] executables made available by the respective authors. Note that the false-discovery rate calculations were slightly different depending on the t-statistic variant: for the SAM statistic, false discovery rates from the authors' Excel Add-in software was used, whereas for the CyberT and basic t-statistics, the Bioconductor false-discovery rate implementation was applied, which includes an extra step to enforce monotonicity of the ROC curve. In our experience, this extra step does not qualitatively alter the results. All scripts generated in this study are available for use [27,28]. Calculation of the statistic that combines the results of multiple expression summary datasets Say we have n datasets and Cij, Sij are the logged signals for a given probe set in the jth C and S chips, respectively, in dataset i. The mean signal (for this probe set) for the C chips in dataset i is: ![]() where is the number of C chips in dataset i; similarly, the mean signal for the S chips in dataset i is:![]() The mean fold change over all datasets is: ![]() The modified standard deviation for the C chips in dataset i is based on the CyberT estimate: ![]() where const is the weight for the contribution of the average standard deviation for probe sets with the same average signal intensity as Cij. The modified standard deviation for the S chips in dataset i (sd.Si) is defined analogously. The pooled variance over all 10 datasets is defined as:![]() The variance between the 10 datasets is defined as: ![]() Then the combined statistic was chosen to be: ![]() Additional data files Additional data is available with the online version of this article. Additional data file contains a figure and explanatory legend showing the degree of overlap between two lists of differentially expressed genes. Additional data file 2 lists all analysis option combinations used to generate the expression summary datasets in this study. Additional data file 3 is a plot of observed vs actual spiked-in fold changes at the probe level. Additional data file 4 shows an example of asymmetric M (log2 fold change) vs A (average log2 signal) plot for the comparison of two biological samples. Additional data file 5 contains a comparison of the analysis methods with respect to the detection of DEGs with low signal. Additional data file 6 is a Zip archive containing plain text files (in Affymetrix CEL format), Affymetrix *.CEL files for the C chips in this dataset. Additional data file 7 is a Zip archive containing plain text files (in Affymetrix CEL format), Affymetrix *.CEL files for the S chips in this dataset. Additional data file 8 contains detailed information for the individual DGC clones used in this study. Additional data file 1 A figure and explanatory legend showing the degree of overlap between two lists of differentially expressed genes Click here for additional data file(41K, doc) Additional data file 2 All analysis option combinations used to generate the expression summary datasets in this study Click here for additional data file(31K, xls) Additional data file 3 A plot of observed vs actual spiked-in fold changes at the probe level Click here for additional data file(44K, doc) Additional data file 4 An example of asymmetric M (log2 fold change) vs A (average log2 signal) plot for the comparison of two biological samples Click here for additional data file(48K, doc) Additional data file 5 A comparison of the analysis methods with respect to the detection of DEGs with low signal Click here for additional data file(803K, doc) Additional data file 6 A Zip archive containing plain text files (in Affymetrix CEL format), Affymetrix *.CEL files for the C chips in this dataset Click here for additional data file(8.7M, zip) Additional data file 7 A Zip archive containing plain text files (in Affymetrix CEL format), Affymetrix *.CEL files for the S chips in this dataset Click here for additional data file(8.9M, zip) Additional data file 8 Detailed information for the individual DGC clones used in this study Click here for additional data file(853K, xls) Acknowledgements We thank M. Ramoni and M. Morrissey for helpful comments on the manuscript, K. Kerr and A. Wohlheuter for assistance with, and N. Perrimon for resources for, the PCR, the HMS Biopolymers facility for assistance with robotics and GeneChip hybridization, and B. Estrada and L. Raj for sharing the data depicted in Additional Data Files 1 and 4, respectively. S.E.C. was supported by a PhRMA Foundation CEIGI grant, a Brigham and Women's Research Council bioinformatics grant, and NIH fellowship F32 GM67483-01A1. A.M.M. is an Associate Investigator of the Howard Hughes Medical Institute. G.M.C. is supported by a PhRMA Foundation CEIGI grant. M.S.H. is supported by NIH grant K22-HG002489. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||
Nat Biotechnol. 1996 Dec; 14(13):1675-80.
[Nat Biotechnol. 1996]Science. 1995 Oct 20; 270(5235):467-70.
[Science. 1995]Nucleic Acids Res. 2003 Feb 15; 31(4):e15.
[Nucleic Acids Res. 2003]Genome Biol. 2003; 4(6):R41.
[Genome Biol. 2003]Nucleic Acids Res. 2002 Feb 15; 30(4):e15.
[Nucleic Acids Res. 2002]Bioinformatics. 2001 Jun; 17(6):509-19.
[Bioinformatics. 2001]Proc Natl Acad Sci U S A. 2001 Jan 2; 98(1):31-6.
[Proc Natl Acad Sci U S A. 2001]Nat Biotechnol. 2003 Jul; 21(7):818-21.
[Nat Biotechnol. 2003]Biostatistics. 2003 Apr; 4(2):249-64.
[Biostatistics. 2003]J Cell Biochem Suppl. 2001; Suppl 37():120-5.
[J Cell Biochem Suppl. 2001]Bioinformatics. 2003 Jan 22; 19(2):185-93.
[Bioinformatics. 2003]Bioinformatics. 2003 Aug 12; 19(12):1469-76.
[Bioinformatics. 2003]Biostatistics. 2003 Apr; 4(2):249-64.
[Biostatistics. 2003]Genome Biol. 2002; 3(1):RESEARCH0005.
[Genome Biol. 2002]Proc Natl Acad Sci U S A. 2001 Apr 24; 98(9):5116-21.
[Proc Natl Acad Sci U S A. 2001]Bioinformatics. 2001 Jun; 17(6):509-19.
[Bioinformatics. 2001]Genome Biol. 2003; 4(6):R41.
[Genome Biol. 2003]Nucleic Acids Res. 2003 Feb 15; 31(4):e15.
[Nucleic Acids Res. 2003]Proc Natl Acad Sci U S A. 2001 Jan 2; 98(1):31-6.
[Proc Natl Acad Sci U S A. 2001]Nat Biotechnol. 2003 Jul; 21(7):818-21.
[Nat Biotechnol. 2003]