Copyright © 2007, Cold Spring Harbor Laboratory Press

Assessing the performance of different high-density tiling microarray strategies for mapping transcribed regions of the human genome

1 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520-8114, USA; 2 Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520-8103, USA; 3 Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 06520-8005, USA; 4 Department of Computer Science, Yale University, New Haven, Connecticut 06520-8285, USA; 5 Center for Nanotechnology, NASA Ames Research Center, Moffett Field, California 94035, USA

6 Present address: Stockholm Bioinformatics Center, AlbaNova University Center, Stockholm University, SE-10691 Stockholm, Sweden

7 Corresponding authors. E-mail michael.snyder/at/yale.edu; fax (360) 838-7861. E-mail mark.gerstein/at/yale.edu; fax (360) 838-7861.

Received December 7, 2005; Accepted June 8, 2006.

Freely available online through the Genome Research Open Access option.

Abstract

Genomic tiling microarrays have become a popular tool for interrogating the transcriptional activity of large regions of the genome in an unbiased fashion. There are several key parameters associated with each tiling experiment (e.g., experimental protocols and genomic tiling density). Here, we assess the role of these parameters as they are manifest in different tiling-array platforms used for transcription mapping. First, we analyze how a number of published tiling-array experiments agree with established gene annotation on human chromosome 22. We observe that the transcription detected from high-density arrays correlates substantially better with annotation than that from other array types.
Next, we analyze the transcription-mapping performance of the two main high-density oligonucleotide array platforms in the ENCODE regions of the human genome. We hybridize identical biological samples and develop several ways of scoring the arrays and segmenting the genome into transcribed and nontranscribed regions, with the aim of making the platforms most comparable to each other. Finally, we develop a platform comparison approach based on agreement with known annotation. Overall, we find that the performance improves with more data points per locus, coupled with statistical scoring approaches that properly take advantage of this, where this larger number of data points arises from higher genomic tiling density and the use of replicate arrays and mismatches. While we do find significant differences in the performance of the two high-density platforms, we also find that they complement each other to some extent. Finally, our experiments reveal a significant amount of novel transcription outside of known genes, and an appreciable sample of this was validated by independent experiments. Mapping transcribed regions of the human genome in an unbiased fashion is a crucial step toward understanding at a molecular level the organization of hereditary information and the specific functions of each human cell or tissue type. To this end, a number of approaches using genomic tiling microarrays have been tested and published over the last few years, including key studies by Kapranov et al. (2002), Rinn et al. (2003), Bertone et al. (2004), Schadt et al. (2004), and Cheng et al. (2005). While the strategies differ substantially in most of their details, they all share a basic array design concept: to construct an array whose probes (the molecules attached to the microarray at the manufacturing) cover all of the nonrepetitive sequence of the genome or genomic region under investigation. Kapranov et al. 
(2002) used a high-density oligonucleotide array design containing perfect match probes of length 25 bp and corresponding mismatch probes. The arrays were synthesized in situ (directly on the supporting array material) using physical masks (Lipshutz et al. 1999) and covered chromosomes 21 and 22 with probe starting positions spaced every 35 bp (genomic distance). They were hybridized with samples representing 11 cell lines. The data was later reanalyzed (Kampa et al. 2004) and a more sophisticated approach to genomic segmentation was introduced. We refer to this setup as the Affymetrix tiling-array platform. Rinn et al. (2003) mapped transcribed regions of chromosome 22 with an array of PCR products (amplicons), tiled end-to-end with a probe size range of 300–1400 bp. This array represents the PCR tiling-array platform and was hybridized with placenta poly(A)+ RNA (Rinn et al. 2003) and later with RNA from two cell lines (White et al. 2004). Schadt et al. (2004) used tiling arrays where the probes were synthesized on the array using the Agilent ink-jet technology (Shoemaker et al. 2001). They tiled chromosomes 20 and 22 with 60-mers uniformly spaced every 30 bp. The statistical treatment of the data was presented in Ying et al. (2003). Bertone et al. (2004) used oligonucleotide microarrays with 36-bp probes spaced every 46 bp to map transcribed regions of the entire nonrepetitive portion of the human genome. The arrays are synthesized in situ using maskless technologies developed by NimbleGen Systems. We refer to this as the MAS (maskless array synthesis) tiling array platform (Singh-Gasson et al. 1999; Nuwaysir et al. 2002). Cheng et al. (2005) used an updated version of the Affymetrix platform with a tighter spacing of the probes, every 5 bp, and covering 10 chromosomes of the human genome. Transcript maps were generated for polyadenylated cytosolic RNA from eight cell lines (and for one of these cell lines, also nonpolyadenylated RNA). 
These different studies produced a wealth of data. However, the experiments represent very different choices in array design and manufacturing, RNA extraction and hybridization conditions, and data processing methods. As such, comparing the results from these studies is not trivial. Here, we outline some of the key parameters differentiating the various studies. The array design parameters include the length and genomic spacing of the probes, the use of mismatch probes, and whether to cover one or both genomic strands. For the oligonucleotide tiling experiments referenced above, the probe length varies between 25 and 70 bases. The genomic spacing of the probes is measured between probe initiation points and can range from the smallest possible distance of one single base up to the length of the probe, or even further. At the design stage it is important to minimize potential cross-hybridization, self-pairing, and other probe sequence artifacts such as DNA secondary structure formation (SantaLucia Jr. and Hicks 2004). Genomic regions considered as repeats (by, e.g., RepeatMasker [A.F.A. Smit and P. Green, unpubl.]) are usually omitted from the design due to potential cross-hybridization. If some flexibility is allowed in the design process, probes may be chosen so as to achieve better probe thermodynamics. This is possible for arrays interrogating genes (Mathews et al. 1999; Hughes et al. 2001; Rouillard et al. 2003), but for tiling arrays with high genomic density probe optimization options are limited (Bertone et al. 2006). The experimental protocols for extraction, labeling, and hybridization of the RNA sample to the array vary considerably. Choosing the type of target RNA (i.e., tissue or cell line, poly(A)+ or total RNA), and the reactions and conditions to use in the hybridization will affect the results. 
The number of technical and biological replicates is an additional crucial parameter: more replicates potentially enable greater certainty and detail in the interpretation of the results.

Once the tiling arrays have been designed, manufactured, hybridized with labeled RNA, and the hybridization intensities have been extracted, there are a number of ways to transform the raw intensities into a score for each probe. This is usually done using statistical methods such as a sign test or the t-test. Exactly what methods are available depends on the design features of the array, such as the presence of mismatch probes. The segmentation of the genome into transcribed and nontranscribed regions is then performed based on the scores.

Our goal is to assess different tiling microarrays that are currently used for transcription mapping, an area where no detailed comparison has thus far been performed, and ultimately to aid the ENCODE Consortium in choosing a strategy for the multiple-tissue whole-genome transcription mapping of the human genome (The ENCODE Project Consortium 2004). Previous work on comparing gene-based microarrays includes studies by Tan et al. (2003), Jarvinen et al. (2004), Mah et al. (2004), Park et al. (2004), and Yauk et al. (2004). Most of these indicate differences in the gene expression results from different microarray platforms, which have been attributed to differences in data processing or inadequate choice of comparison metrics (Larkin et al. 2005).

We start our microarray comparison by analyzing a set of already published chromosome 22 transcription experiments. Overall, this study indicated that high-density oligonucleotide arrays perform significantly better than amplicon (PCR) arrays. We then describe a direct comparison of the two in situ-synthesized oligonucleotide-based platforms MAS (Bertone et al. 2004) and Affymetrix (Affy) (Kapranov et al.
2002) on the manually picked part of the ENCODE regions of the human genome (http://www.genome.gov/10005107). We hybridized identical biological samples to the arrays and developed a unified data-processing scheme based on statistical treatment of the data. Using this approach, we compare the results from the two platforms with each other and with the recently generated GENCODE gene annotation (Guigo et al. 2003; Ashurst et al. 2005; http://genome.imim.es/gencode/). Results and Discussion Pilot study: Comparison of public chromosome 22 tiling data We carried out an initial comparison of previously published transcription maps of chromosome 22 generated from PCR-based tiling arrays (Rinn et al. 2003; White et al. 2004) and two oligonucleotide tiling-array platforms, MAS and Affymetrix (Kapranov et al. 2002; Bertone et al. 2004) (Fig. 1
Approach

Oligonucleotide array designs and hybridizations

An oligonucleotide array containing 36-bp oligonucleotides that tile both strands of the nonrepetitive sequence of the ENCODE regions end-to-end (allowing some positional shifts to reduce self-complementarity) was prepared using maskless photolithography, MAS (maskless array synthesis). The MAS arrays cover both strands of the ENCODE regions ENm001–ENm011 (11.6 Mb). An Affymetrix ENCODE array, which covers one strand of the entire ENCODE region on one array and is tiled with 25-mer oligonucleotides with an average distance between oligonucleotide starts of 21 bases, was obtained from the manufacturer. This array has both perfect match (PM) and mismatch (MM) probes.

As outlined in Table 1, five different hybridization experiments were carried out: two different RNA targets (placenta poly(A)+ RNA and NB4 total RNA) were hybridized to the two different array types. (We follow the nomenclature of Royce et al. [2006]; i.e., target or sample is the RNA extracted from a biological entity [tissue or cell line], which is hybridized to the probes on the microarray.) The Affymetrix arrays were hybridized according to the manufacturer’s recommendation. The MAS arrays were hybridized using two different experimental protocols: MAS-B, described in Bertone et al. (2004), and MAS-N, a variant of the manufacturer’s recommended protocol. The placental RNA was hybridized using both MAS protocols, the NB4 RNA only with MAS-N.
Generating comparable maps of transcriptionally active regions (TARs) Development of consistent scoring schemes To bring the outcomes from the two technologies MAS and Affymetrix into a comparable form, we developed ways of scoring them similarly. For each spot on the microarrays, a hybridization intensity was collected. For oligonucleotide tiling arrays, it is usually advantageous to aggregate the intensities from probes that are adjacent to each other in genomic space (Kampa et al. 2004; Cheng et al. 2005; Royce et al. 2005). This is done by applying a sliding genomic window encompassing multiple probes and converting the intensities within the window into a score, which is assigned to the middle probe. The windowed approach is logical since we are ultimately interested in obtaining a set of regions whose intensities are significantly higher than the background, and we expect those regions to be of the same length as exons (150–200 bp on average, depending on exon type) rather than of single probes (25–36 bp in this study). We developed new ways of scoring the MAS arrays and describe these in terms of three levels of scoring: single probe intensities, robust statistics within a sliding window, and robust statistics using paired data within a sliding window (Cawley et al. 2004). Single-probe intensities Single-probe intensity scoring uses the raw intensities from the arrays. By wisely choosing methods and parameters to deal with the genomic segmentation (see below) it is possible to obtain reasonable results from this approach (Bertone et al. 2004). In this approach, both intra- and interarray normalization of the microarray data may be particularly important (Royce et al. 2005). Robust nonparametric statistics within a sliding window We used the sign test for scoring MAS array data. The sign test is attractive since it is statistically robust and does not assume normally distributed data. 
Comparing each intensity within a sliding genomic window of a specified size with the array median yields a measure or a score of the significance of the intensities (see Methods for details). It is easy to include multiple replicates in this scheme: Each probe is simply compared with the median intensity of its own array, and no interarray normalization is necessary. The number of available score levels is restricted, however, because the counting introduces discrete values (the count follows a binomial distribution), so the test may not be sufficient in situations in which discerning the top scores (say, top 5%) from near-top scores is important. With an average genomic spacing of 36 bp between the starts of two adjacent probes, the window (160 bp) encompasses five probes. We also applied the sign test to the Affymetrix data as a part of our comparison.

Robust nonparametric statistics using paired data within a sliding window

When paired data is available, such as the PM and MM probe intensities on Affymetrix arrays, the paired Wilcoxon signed rank test is a more powerful option than the standard sign test. It was first used with tiling microarrays by Cawley et al. (2004) to score ChIP-chip data (Horak and Snyder 2002), and it is also immediately applicable to transcription data, as is shown in Kampa et al. (2004) and Cheng et al. (2005). All pairwise PM–MM differences within the window are calculated, and a P-value, which essentially measures how significantly the distribution of PM–MM differences is skewed to either side around zero, is calculated along with the corresponding point estimate (the pseudomedian). While this approach is analogous to the standard sign test, it has considerably greater statistical power. The MAS arrays did not contain proper mismatch probes. Instead, we tried to simulate these using the complementary strand oligonucleotide of the MAS arrays as the “mismatch” probe.
We call this approach the Fwd-Rev scoring, and it is justified on the MAS-B (placenta) data, since the correlation between forward and reverse-strand probes is close to the correlation between PM and MM probes for the Affy placenta data (Table 2).
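The paired scoring described above can be illustrated with a minimal Python sketch for a single window. It is not the implementation used by Affymetrix or in this study; the function names are hypothetical, the P-value is one-sided (PM greater than MM) and computed by exact enumeration of sign assignments, and zero differences are dropped from the ranking. All three choices are assumptions made for illustration.

```python
from itertools import product
from statistics import median


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def signed_rank_window(pm, mm):
    """Return (one-sided P-value, pseudomedian) for one window of paired probes.

    pm, mm: equal-length, non-empty lists of intensities (hypothetical inputs).
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    nonzero = [d for d in diffs if d != 0]  # assumption: zeros are dropped
    ranks = average_ranks([abs(d) for d in nonzero])
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    # Exact null distribution: every rank is positive with probability 1/2.
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(ranks))
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    p_value = hits / 2 ** len(ranks)
    # Pseudomedian: median of all pairwise averages of the differences.
    n = len(diffs)
    walsh = [(diffs[i] + diffs[j]) / 2 for i in range(n) for j in range(i, n)]
    return p_value, median(walsh)
```

For Fwd-Rev scoring, the same function would be called with the forward-strand intensities as `pm` and the reverse-strand intensities at the same locus as `mm`. Exact enumeration is only practical for the small number of pairs found in a single window.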
Segmentation of genomic regions

After obtaining one score value per oligonucleotide probe, the next step is to construct a transcription map based on these scores, i.e., to segment the genomic regions into transcribed and nontranscribed regions. We call the transcribed regions TARs (Transcriptionally Active Regions) (Rinn et al. 2003), regardless of overlap with genes, exons, or other genomic features. (Note that an alternate term, transfrag [Transcriptional Fragment], was introduced by Kampa et al. [2004].)

Maxgap/minrun segmentation

In Bertone et al. (2004), TARs were generated by requiring at least five adjacent probes with a raw intensity in the top 10% of all intensities of that slide. Thus, the threshold above which to consider a probe “positive” was the intensity value corresponding to the 90th percentile, and any probe that was below the threshold immediately terminated the transcribed region. In the Affymetrix series of publications (Kapranov et al. 2002; Cheng et al. 2005), the threshold for generating TARs was based on setting a maximum false-positive rate of the hybridization levels of negative bacterial controls, thus enabling an optimized percentile cutoff for each array set and biological sample. Furthermore, gaps were allowed, such that a maximum stretch of a certain number of nucleotides (called maximal gap, or maxgap for short) with a score below the threshold was allowed between probes whose scores were above the cutoff. Typically, the maxgap parameter allows one or two probes to be below the cutoff while still being incorporated into the TAR. The TAR is then required to have at least a certain total length (a minimal run, or minrun), usually corresponding to at least two probes.

HMM segmentation

As an alternative to the maxgap/minrun segmentation, a hidden Markov model (HMM) (Rabiner 1989; Ji and Wong 2005; Li et al. 2005) was used to predict TARs, given the derived probe scores (above).
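Returning briefly to the maxgap/minrun rule described above, the following Python sketch shows one way it could be applied to a list of scored probes. The function name, the half-open coordinate convention, the choice to measure the gap from the end of the last positive probe to the start of the next, and the use of `>=` for the threshold are all assumptions made for illustration, not details taken from the published pipelines.

```python
def maxgap_minrun(positions, scores, threshold, maxgap, minrun, probe_len):
    """Segment scored probes into TARs given as (start, end) half-open intervals.

    positions: probe start coordinates, sorted ascending (hypothetical input).
    A probe is "positive" when its score is >= threshold (assumed convention).
    Positive probes are joined while the gap between them is <= maxgap nt,
    and a region is kept only if it spans at least minrun nt.
    """
    tars = []
    start = end = None
    for pos, score in zip(positions, scores):
        if score < threshold:
            continue  # below-threshold probes never start or extend a region
        if start is not None and pos - end <= maxgap:
            end = max(end, pos + probe_len)  # bridge the gap and extend
        else:
            if start is not None and end - start >= minrun:
                tars.append((start, end))
            start, end = pos, pos + probe_len
    if start is not None and end - start >= minrun:
        tars.append((start, end))
    return tars
```

With 36-nt probes starting every 36 nt, `maxgap=80` lets a region bridge up to two below-threshold probes, and `minrun=50` discards regions supported by a single isolated probe.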
Each probe can be in one of four HMM states (TAR, non-TAR, and two intermediate transition states), emitting the assigned score (i.e., the emission spectrum is continuous). The parameters of the HMM can be estimated by learning from the sequences of probes that fall into regions with known transcription characteristics (e.g., according to gene annotation). The HMM can then be applied to sequences of probes bearing the same scoring protocol to determine the most likely corresponding state sequence, in order to identify TARs (Viterbi decoding). Platform comparison We have analyzed the five microarray tiling experiments, representing the MAS and Affymetrix platforms, introduced in Table 1 at multiple stages throughout the data processing.
The results are available at http://tiling.gersteinlab.org/platformcmp. Outcomes Conclusions about optimal scoring and segmentation systems We first examined the effect of the segmentation threshold on the size of the resulting TAR sets. The results are in Figure 2A
Different scoring schemes give different results and also differ from the results obtained using single-probe intensity scores when comparing to the gene annotation. As is clear from Figure 2B For the Affymetrix data, Figure 2C Figure 2D These two points (elaborate scoring with replicates and the advantage of using mismatches) are further illustrated in Figure 2E The analysis of different segmentation algorithms reveals that the TAR sets generated by the nonparametric HMM segmentation (Viterbi decoding) are biased toward a high sensitivity (Fig. 2B Results from comparison pipeline Replicate comparison of unprocessed hybridization intensities As is shown in Table 2, we obtained Pearson correlation coefficients of 0.83 and 0.96 for placenta MAS-B and MAS-N data, respectively, measured on pairwise comparison of the raw hybridization intensities of the arrays. The figure for Affymetrix was 0.96 and NB4 results were similar. We also note that the correlation of PM and MM probes for Affy placenta is close to the correlation of Fwd and Rev probes for MAS-B. Comparing the preliminary TAR sets, generated from single arrays, across technical replicates (Supplemental Table S1) again indicates that the MAS-B data is the most variable. Choosing TAR sets to include in comparison For each of the five experiments, the best-performing scoring and segmentation algorithm was chosen, and the segmentation threshold was tuned to generate TAR sets of roughly equal size, as measured in number of bases. The chosen sets are presented in Table 3 (and its extended version, Supplemental Table S2) and the points corresponding to these sets in Figure 2A
Compare TAR sets with each other and to GENCODE annotation and conserved regions Figure 2D Figure 3A
As shown in Figure 3B The bimodal distribution in Figure 4
Table 4 and Supplemental Figure S10 show a comparison of each of the five TAR sets to a set of conserved elements, generated from the union of conserved regions called by the Threader Blockset Aligner (TBA) (Blanchette et al. 2004) and MLagan (Brudno et al. 2003). The union set of conserved elements covers ~10% of the ENCODE regions. We find that most of the novel (intergenic) TARs do not overlap with conserved regions. Only 7%–8% of Affymetrix novel TARs and 1%–2% of MAS novel TARs overlap fully (>90%) with conserved regions.
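As an illustration of the overlap criterion used here, the sketch below computes, for each TAR, the fraction of its bases covered by a union of conserved elements and keeps those above the 90% cutoff. The function names and the half-open `(start, end)` interval convention are assumptions; the actual comparison was not necessarily implemented this way.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(tar, merged):
    """Fraction of the TAR's bases that fall inside the merged intervals."""
    start, end = tar
    covered = sum(max(0, min(end, e) - max(start, s)) for s, e in merged)
    return covered / (end - start)


def fully_conserved(tars, conserved, min_fraction=0.9):
    """Return the TARs whose covered fraction exceeds min_fraction."""
    merged = merge_intervals(conserved)  # union of, e.g., two element sets
    return [t for t in tars if covered_fraction(t, merged) > min_fraction]
```

Merging first matters because the conserved set is a union of two callers' output, so elements may overlap and would otherwise be double-counted.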
Transcription status of known genes and exons The transcription status of all known splice variants (transcripts) of the 264 GENCODE genes in the regions ENm001–ENm011 was assessed, and the results are shown in Table 5. A gene is considered as “transcribed” if at least one of its transcripts is detected at significance level P < 0.001. If a transcript has <10 probes, it will be unable to reach a P-value below 0.001, and if this is true for all splice variants of a gene, that gene is in the “Too few probes” category. In total, 158 (69.3%) genes are considered transcribed according to both platforms and 221 (83.7%) according to at least one platform (similar percentages on the transcript level). In Table 6, the multi-exon coherence (either all exons on or all exons off) is assessed and found to be higher for Affy. For both placenta and NB4 there is an enrichment of multi-exon coherence in transcripts that are considered as transcribed in both platforms. One example is the Affy NB4 set, for which, in total, 15.5% of all transcripts have all their exons transcribed, while 26.9% of the transcripts that are on in both NB4 sets have all their exons transcribed. The difference in score distribution between exons and introns is assessed (Supplemental Fig. S7) and for all five sets exons are indeed overrepresented at the high end of the score spectrum, but also many introns have high scores.
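The 10-probe limit mentioned above can be checked with a short calculation. Assuming a one-sided sign test with one measurement per probe (an assumption made for illustration), the smallest attainable P-value for a transcript with n probes occurs when all n probes lie above the median:

```latex
\[
P_{\min}(n) = \left(\tfrac{1}{2}\right)^{n}, \qquad
P_{\min}(9) = \tfrac{1}{512} \approx 0.00195 > 0.001, \qquad
P_{\min}(10) = \tfrac{1}{1024} \approx 0.00098 < 0.001 .
\]
```

Under this assumption, nine probes can never reach the 0.001 significance level, whereas ten probes can.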
Experimental validation of novel TARs and known genes Experimental validation of the microarray transcription data is crucial to the interpretation of the results. Table 7 shows that we used RT–PCR to assess, in total, 144 regions experimentally in placenta. Of these, 98 were novel TARs (no overlap with known genes). The experiments verified the presence of 56.4% (22/39) of the assayed novel TARs that were exclusively found on the MAS platform (MAS-B), 66.7% (26/39) of the novel TARs that were exclusively found on the Affymetrix platform, and 85% (17/20) of the assessed novel TARs that were common to both. In total, 66.3% of all assessed novel TARs were verified.
Forty-three known genes were also validated. Genes that were completely off (i.e., none of their splice variants were considered transcribed) according to one of the platforms, but not the other, were assessed. In total, 58.8% (10/17) of the MAS-B exclusive genes were verified and 87.5% (7/8) of the Affymetrix exclusive genes were verified. For genes that were considered “off” in both platforms, 33.3% (6/18) were found in our experimental validation. Discussion In this work we have attempted to assess the suitability of two oligonucleotide tiling microarray strategies for transcription mapping in human. We tried to overcome the inherent differences between the approaches through using the same biological samples and a unified scoring and TAR generation procedure, and we have produced, compared, and validated several sets of transcribed regions. We conclude that many factors are significant for the outcome of the experiments. Here, we elaborate on some key findings. Arrays are noisy In the comparison between the two microarray tiling platforms, the Affymetrix platform yielded TARs that better agreed with the GENCODE annotation (Figs. 2D,E A comparison of the NB4 total RNA and placenta poly(A)+ TAR sets within the two platforms (Tables 2–4, 6; Fig. 3 Counteract the noise: More data, appropriate scoring Figure 2D,E To take advantage of the data, appropriate probe scoring procedures are needed. We tried several scoring schemes for our array data and found, in Figure 2D,E In Figure 2D Conclusions from the array platform comparison According to our study, the current form of the Affymetrix tiling microarray platform is better suited than the MAS platform for detailed transcription mapping of the human genome. This is true in the sense that the agreement of the TARs with known annotation is larger (Fig. 
2 While the results obtained from the Affy arrays agree better with the annotation and the validation results, the advantage of the MAS technology is that it allows for rapid manufacturing of customized designs and cost-effective production of small array series. Using true mismatches in the MAS design may improve the results for MAS arrays as well, but there are currently no results publicly available. We conclude that oligonucleotide tiling microarrays are suitable to detect novel transcribed regions, and that the use of replicates and statistically based scoring schemes significantly improves the performance for all investigated oligonucleotide-tiling microarray-based transcription-mapping experiments.

Methods

Array designs

Affymetrix arrays

Arrays were designed and manufactured by Affymetrix, Inc., using a physical mask. Probes are 25 bp long with an average genomic spacing of 21 bp, and they cover one genomic strand, with the exception of repeat regions, as defined by RepeatMasker (A.F.A. Smit and P. Green, unpubl.). Each probe is present in a “perfect match” and a “mismatch” version. The mismatch probe contains a single substitution at the middle probe position (A→T, T→A, C→G, G→C). Each array contains in total ~1,400,000 features.

MAS arrays

Arrays were designed by us and manufactured by NASA using a NimbleGen maskless array synthesizer. Probes are 36 bp long with an average genomic spacing of 36 bp. Positional shifts were allowed to avoid self-complementarity at the probe ends (defined as at least four consecutive complementary nucleotides within the six 5′/3′ nucleotides). The probes cover both genomic strands, with the exception of repeat regions. The design was based on NCBI build v34 of the human genome, and each array contains almost 390,000 features.
RNA extraction and array hybridization Cell culture The human NB4 cells were cultured in RPMI medium containing 20 mM L-glutamine (Media Tech) and supplemented with 10% fetal bovine serum (Invitrogen), 100 IU/mL penicillin (Media Tech) and 100 μg/mL streptomycin (Media Tech). Cells were maintained at 37°C under 5% CO2/95% air in a humidified incubator. RNA samples Total RNA from the human NB4 cells was extracted using a Qiagen RNA extraction kit according to the manufacturer’s instructions. Human placental poly(A)+ mRNA (obtained from total RNA) was purchased from Ambion. Protocols The Supplemental material contains a detailed description of all three experimental protocols (MAS-B, MAS-N, Affy). The MAS-N protocol yields in-vitro transcribed, biotin-labeled, single-stranded cRNA (Van Gelder et al. 1990), fragmented to an average size of 50–200 bp before hybridization. The MAS-B protocol yields Cy3-aminoallyl-labeled unfragmented single-stranded cDNA. The Affymetrix protocol yields end-labeled (bio-ddATP) double-stranded cDNA, fragmented to an average size of 50–100 bp before hybridization. Scoring schemes To obtain the desired statistical resolution, MAS array scoring was done pooling the data from all three biological samples (for both placenta and NB4). For placenta, this corresponds to seven measurements for each probe and six for NB4. For Affymetrix, three technical replicates were used, corresponding to six measurements for each probe (three PM and three MM probes). Sign test using array median intensity The intensity of every probe within the window is compared with the median intensity of the slide and assigned a “1” if it is above and “0” otherwise. The number of ones within the window is counted and the probability P of finding at least this number of 1’s under the null hypothesis that half of the probes should be above the median is calculated. The score assigned to the probe in the middle of the window is then defined as score = −log(P). 
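The sign-test score defined above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used in the study: the function name is hypothetical, the window is expressed as a number of probes (five, matching the 160-bp window) instead of base pairs, windows are truncated at the ends of the array, a single array is scored, and the logarithm is taken to base 10, which the text does not specify.

```python
from math import comb, log10
from statistics import median


def sign_test_scores(intensities, window=5):
    """Score every probe with a one-sided sign test over a centered probe window."""
    array_median = median(intensities)
    half = window // 2
    scores = []
    for i in range(len(intensities)):
        # Probes in the window centered on probe i, truncated at the array ends.
        probes = intensities[max(0, i - half): i + half + 1]
        n = len(probes)
        k = sum(1 for value in probes if value > array_median)
        # P(at least k of n probes above the median) under Binomial(n, 0.5).
        p = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
        scores.append(-log10(p))  # base 10 is an assumption
    return scores
```

Because k can take only n + 1 values, a five-probe window yields at most six distinct scores, which is the discreteness limitation noted for this test.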
Window sizes of 90–240 bp were tried, choosing 160 (five probes) for the MAS data in this study. No inter-array normalization is performed, since each intensity is compared with the median intensity on its own array only. A variant of this scoring approach is to weight the probes within the window differently, such that the central probe(s) becomes more important. For instance, the intensities within the window can be multiplied with a discretized Gaussian envelope. Several parameter settings were tried (Supplemental Figs. S2–S5). Paired Wilcoxon signed rank sum test Inter-array normalization is undertaken through dividing each intensity with the array median (median normalization). Within a window, all pairwise differences between the intensities of a perfect match probe and its corresponding mismatch probe are calculated and ranked. A sign is assigned to each rank number depending on whether the PM or the MM intensity was greater, and a P-value is calculated from the sum of this signed ranking (keeping track of the rank sum of all negative ranks and the rank sum of all positive ranks). The P-value, which is a measure of how significantly the distribution of PM–MM differences is skewed to either side around zero, can then be used to compute the final score for the probe in the middle of the window (Kampa et al. 2004; Royce et al. 2005). The corresponding point estimate, the pseudomedian, is obtained by taking the median value of all the pairwise averages of PM–MM values within the window. The Affymetrix scores used in this study were calculated by Affymetrix using a window size of 101 nt, corresponding to on average five probes in the window. For MAS arrays, the paired Wilcoxon signed rank test was applied using the probe corresponding to the reverse strand of the exact same genomic locus as mismatch probe (instead of a designed mismatch probe). Segmentation of genomic regions Maxgap/minrun segmentation The transcribed regions were generated from scored data. 
The maxgap parameter was set to 50 for Affymetrix data and 80 for MAS data. The minrun parameter was set to 50 for both approaches. Other maxgap/minrun parameter settings were also tested (data not shown). We evaluated segmentation thresholds of the 70–99th percentile. HMM segmentation The emission and transition probability distributions of the four-state HMM for each data set were learned according to the scores of those probes that fall into known gene regions, where the score characteristics in the exon regions were used to estimate the parameters for the TAR state, and those in the intron regions for the non-TAR state. The parameters for the two intermediate transition states were obtained by investigating those probes containing both exon and intron regions. These emission distributions were fitted with mixed-Gaussian distributions to generate a continuous model. The Viterbi algorithm was utilized to identify TARs. Assessing transcription of annotated genes The transcription status was assessed using the sign test as described for all annotated splice variants of all known genes in the GENCODE annotation of regions ENm001–ENm011, accepting the exons with labels “VEGA_known,” “VEGA_Novel_CDS,” “VEGA_Novel_transcript_gencode_conf,” and “VEGA_Putative_ gencode_conf.” For the exon/intron-based investigations (Table 6; Supplemental Fig. S7), the median probe score for each feature was used, with a percentile threshold for on/off calls as defined for each experiment in Table 3. Choosing primer pairs for validation Primer pairs were generated using Primer3 (Rozen and Skaletsky 2000). Primers assessing novel TARs were required to define a genomic region with no overlap with any GENCODE gene. When assessing known genes, the exon with the highest P-value-based transcription score was chosen. 
Primer3 settings were the defaults or more stringent; e.g., GC content was required to be within 35%–65%, primer size between 20 and 28 nt, and the resulting PCR products between 100 and 200 bp. Validation candidates were checked using UCSC In Silico PCR (http://genome.ucsc.edu/cgi-bin/hgPcr) against the NCBI v35 human genome build to ensure that exactly one PCR product was possible; those that generated no hits or multiple hits were discarded. Three regions that did not contain any verified or predicted transcription were chosen to act as negative controls. The experimental protocol of the PCR validation is in the Supplemental material.

Accessing data and results

The MAS ENCODE array platform has GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/) accession number GPL2105; the corresponding data series has GEO accession number GSE2720 (placenta and untreated NB4). The Affymetrix anti-sense ENCODE array platform has GEO accession number GPL1789; the corresponding data series have accession numbers GSE2671 (placenta) and GSE2679 (untreated NB4). The TAR sets, the gene/transcript/exon transcription status, the validation results, and the raw data are available at http://tiling.gersteinlab.org/platformcmp.

Acknowledgments

O.E. is supported by a Knut and Alice Wallenberg Foundation postdoctoral fellowship. We acknowledge support from the NIH (1U01HG003156-01).

Footnotes

[Supplemental material is available online at www.genome.org.]

Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5014606