Copyright © 2006, Cold Spring Harbor Laboratory Press

Experimental validation of predicted mammalian erythroid cis-regulatory modules

1 Center for Comparative Genomics and Bioinformatics of the Huck Institutes of Life Sciences, 2 Department of Biochemistry and Molecular Biology, 3 Intercollege Graduate Degree Program in Genetics, 4 Intercollege Graduate Degree Program in Integrative Biosciences, 5 Department of Computer Science and Engineering, 6 Department of Statistics, and 7 Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA; 8 Department of Pediatrics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania 19104, USA

9 Corresponding author. E-mail rch8/at/psu.edu; fax (814) 863-7024.

Received April 6, 2006; Accepted June 7, 2006.

Freely available online through the Genome Research Open Access option.

Abstract

Multiple alignments of genome sequences are helpful guides to functional analysis, but predicting cis-regulatory modules (CRMs) accurately from such alignments remains an elusive goal. We predict CRMs for mammalian genes expressed in red blood cells by combining two properties gleaned from aligned, noncoding genome sequences: a positive regulatory potential (RP) score, which detects similarity to patterns in alignments distinctive for regulatory regions, and conservation of a binding site motif for the essential erythroid transcription factor GATA-1. Within eight target loci, we tested 75 noncoding segments by reporter gene assays in transiently transfected human K562 cells and/or after site-directed integration into murine erythroleukemia cells. Segments with a high RP score and a conserved exact match to the binding site consensus are validated at a good rate (50%–100%, with rates increasing at higher RP), whereas segments with lower RP scores or nonconsensus binding motifs tend to be inactive. 
Active DNA segments were shown to be occupied by GATA-1 protein by chromatin immunoprecipitation, whereas sites predicted to be inactive were not occupied. We verify four previously known erythroid CRMs and identify 28 novel ones. Thus, high RP in combination with another feature of a CRM, such as a conserved transcription factor binding site, is a good predictor of functional CRMs. Genome-wide predictions based on RP and a large set of well-defined transcription factor binding sites are available through servers at http://www.bx.psu.edu/.

Comprehensive discovery of functional DNA sequences in genomes requires both computational and experimental approaches (Collins et al. 2003). A particularly difficult challenge is identifying the cis-acting sequences, called cis-regulatory modules (CRMs), that are responsible for determining the amount, timing, and tissue specificity of gene expression. Unlike the situation for protein-coding genes, systematic rules for encoding CRMs in genomic DNA are not yet elucidated (Wasserman and Sandelin 2004), although various predictive methods are being explored. Methods that seek overrepresented motifs in co-expressed genes have limited but improving success (Tompa et al. 2005); however, most of these methods are not applicable to large genomic intervals. Consensus binding sites have been deduced for many transcription factors and are stored as positional weight matrices in databases such as TRANSFAC (Wingender et al. 2001) and JASPAR (Sandelin et al. 2004). Matches to the positional weight matrices in single DNA sequences far exceed the sites verified as being occupied by transcription factors (e.g., Grass et al. 2003). However, the number of predicted binding sites can be reduced substantially with increased specificity by requiring the matches to be conserved in multiple species (Berman et al. 2004; Gibbs et al. 2004). DNA segments that appear to be under evolutionary constraint are good candidates for functional elements. 
This predictive method relies on the assumption that sequences carrying out similar functions in two related species are constrained to maintain a level of sequence similarity in excess of that seen for nonfunctional, or neutral, DNA (Pennacchio and Rubin 2001; Miller et al. 2004). Indeed, most DNA sequences known to be functional, such as exons and CRMs, align among human, mouse, and rat genomes (Waterston et al. 2002; Gibbs et al. 2004), but many CRMs fail to align between human and chicken (Hillier et al. 2004). Statistical methods that score multiple sequence alignments to find highly constrained elements are being developed (Stojanovic et al. 1999; Margulies et al. 2003; Cooper et al. 2005; Siepel et al. 2005). These discriminate very well between stringently constrained sequences and likely neutral DNA, but they are less effective for analyzing more diverse reference sets of CRMs (Hughes et al. 2005; King et al. 2005). Many studies demonstrate that constrained noncoding sequences can be used as guides to discovering functional binding sites and CRMs (e.g., Gumucio et al. 1996; Elnitski et al. 1997; Loots et al. 2000; Cliften et al. 2003; Kellis et al. 2003; Frazer et al. 2004), and, furthermore, function is strongly associated with evolutionary conservation in noncoding regulatory regions of Ciona (Johnson et al. 2004). Another approach using aligned genomic sequences to predict CRMs is the computation of regulatory potential (RP), which captures context and pattern information in addition to conservation (Elnitski et al. 2003; Kolbe et al. 2004; Taylor et al. 2006). The statistical models used to compute RP scores are derived from a positive training set of alignments of known CRMs and a negative training set of alignments of ancestral repeats (a model for likely neutral DNA). 
A high RP score for an aligned block of sequences means that the patterns of alignment columns in it are more similar to the patterns observed in aligned CRMs than those seen in aligned ancestral repeats. The number of possible alignment columns is very large and increases exponentially with additional sequences. Patterns of alignment columns that are distinctive for different functional classes can be found only by grouping columns and training statistical models using these groups of alignment columns as a reduced representation of the alignments. Our current approach of incorporating phylogenetic information and utilizing performance to evaluate the many possible groupings insures that the reduced representation retains the information valuable for discrimination between functional classes (Taylor et al. 2006). The distinctive patterns that contribute to a high RP score are actually a series of groups of alignment columns. These capture multiple subtle contributions to discrimination rather than a single motif such as a nucleotide string needed to bind a single transcription factor. The RP scores perform better than constraint scores against a reference set of known regulatory elements from the HBB gene complex (King et al. 2005), and we have selected these as part of our strategy to predict CRMs. The fraction of a mammalian genome whose conservation or RP score exceeds a predictive threshold (determined by equivalent sensitivity and specificity against a reference set) is larger than the lower-bound estimate of the fraction under purifying selection since the primate–rodent divergence (~7% versus ~5%) (Waterston et al. 2002; Chiaromonte et al. 2003; King et al. 2005). Thus, using RP or conservation alone for prediction should capture many CRMs, but either should also return many false positives. It is prudent to use an additional filter for predictions of CRMs (Berman et al. 2002, 2004). 
We use conserved binding site motifs for the transcription factor GATA-1 as the additional filter, because most known erythroid CRMs have this binding site (Weiss and Orkin 1995), the binding specificity has been studied extensively (e.g., Ko and Engel 1993; Merika and Orkin 1993), and this protein is required for late erythroid maturation (Pevny et al. 1991). The mouse G1E cell line, derived from Gata1 knock-out embryonic stem cells, is blocked at the level of an immature committed erythroblast (Weiss et al. 1997) and undergoes terminal erythroid maturation when GATA-1 function is restored. Using this model system, we identified GATA-1-regulated erythroid genes by transcriptome analysis (Welch et al. 2004). In addition, we used similar approaches to identify patterns of altered gene expression in murine erythroleukemia cells induced to mature in vitro. We combined these studies to identify candidate genes that are likely to have GATA-1 and its binding site involved in regulation and applied our bioinformatics tools to predict CRMs. Many of the predicted CRMs had significant effects on the expression of reporter genes in transfected cells, showing the power of bioinformatic predictions based on RP scores plus conserved transcription factor binding sites.

Results

Cohorts of co-expressed genes from microarray expression analyses

Two somatic cell models of late erythroid maturation were used to find groups of genes whose expression levels increase or decrease during this process. Murine erythroleukemia (MEL) cells have properties of proerythroblasts, and are induced to mature into erythroblasts upon treatment with N,N′-hexamethylene-bis-acetamide (HMBA) (Reuben et al. 1976). Transcriptome analysis (Eisen et al. 1998) revealed a cohort of genes coexpressed with Hbb-b1 (encoding β-globin), which includes known markers of late erythroid differentiation, such as the heme biosynthetic gene Alas2 (May et al. 1995), the histone variant gene Hist1h1c (Brown et al. 
1985; Cheng and Skoultchi 1989), and other genes not previously known to be in this cohort, such as Vav2, Btg2, and Hipk2. The Gata2 gene was down-regulated during maturation, as expected (Grass et al. 2003). A second cell culture model of erythroid maturation is the G1E line of murine immature erythroblasts that carry a knockout of the Gata1 gene (Weiss et al. 1997). The subline G1E-ER4 stably expresses an estrogen-activated form of GATA-1. Reactivation of GATA-1 function induces terminal erythroid maturation synchronously in all cells. From the results of a previous microarray analysis of gene expression after the restoration of GATA-1 in G1E-ER4 cells (Welch et al. 2004) we identified cohorts of up- and down-regulated genes. Many of these show similar patterns of expression in induced MEL cells. Based on results from both cell lines, genes in the up-regulated cohort chosen for study were Alas2, Btg2, Vav2, Hist1h1c, Hipk2, and Hebp1. Gata2 was chosen for study as a down-regulated gene. Previous studies (Welch et al. 2004) also showed that Zfpm1 was an immediate target of GATA-1 in these cells, and thus this gene was also studied for predicted cis-regulatory modules. The patterns of expression for most genes were confirmed in an induced MEL cell line using RT-PCR analysis of RNA (Supplemental Fig. S1). At each target locus, we analyzed the gene of interest plus additional intergenic DNA extending to the flanking genes. A total of 1,012,000 bp (~1 Mb) was included in the eight target loci.

Selection of conserved noncoding regions to test as predicted CRMs

Mouse genomic DNA sequences whose alignments with four other mammals meet the following two criteria were predicted to be CRMs: (1) the RP score is greater than 0, and (2) the alignment contains a predicted match to a binding site for GATA-1. Only noncoding DNA sequences were used. The RP scores were determined using the phylogeny-based method of Taylor et al. (2006) on TBA alignments (Blanchette et al. 
2004) of the mouse DNA sequences with the orthologous sequences from rat, human, chimpanzee, and dog. Most loci have many genomic segments with positive RP scores (Fig. 1
Matches to the binding site for GATA-1 fall into three categories. Of all the exact matches to the consensus motif (A/T)GATA(A/G) in the mouse sequence (Fig. 1

Genome-wide preCRMs generated by a similar method are provided at http://www.bx.psu.edu/~ross/dataset/DatasetHome.html. The file can be uploaded to genome browsers (Kent et al. 2002) to identify erythroid preCRMs anywhere in the mouse genome, or to databases (Giardine et al. 2003, 2005; Elnitski et al. 2005) and other resources for further analysis.

Transient expression assay for gene regulatory effects of preCRMs

The RP scores and different classes of predicted GATA-1 binding sites were combined to identify distinctive groups of predicted cis-regulatory modules (preCRMs) for experimental tests (Fig. 2A
The first assay tests for altered expression of a luciferase reporter after transient transfection of human K562 leukemia cells (Fig. 2B

Activity measurements from predicted neutral fragments (preNeutral) rarely exceed log2 of 0.7 (corresponding to a 1.6-fold increase, Fig. 2B

Assay for effects of preCRMs after site-directed integration into MEL cells

One of the limitations of the transient expression assay is that the reporter plasmids do not assemble into a chromatin structure fully equivalent to that of a chromosome (Reeves et al. 1985). Thus, regulatory effects requiring a normal chromatin structure can be missed. We also tested the preCRMs in a reporter gene cassette after stable integration into a marked locus in MEL cells, using recombinase-mediated cassette exchange (Bouhassira et al. 1997). Targeting the expression cassettes to the same chromosomal location, using the Cre-loxP system (Fig. 2C

At all stages of induction, the distribution of GFP fluorescence measurements for expression cassettes containing a preCRMcc is broader and shifted upward with respect to the distribution for cassettes with a predicted neutral segment (Fig. 2C

Validated preCRMs

A total of 32 preCRMs was validated in the enhancer assays or by chromatin immunoprecipitation (Table 1). Four of these overlap previously described cis-regulatory elements, and all of these are validated in our assays (Supplemental Fig. S5; Supplemental Table S1). One is Alas2R3 (Fig. 1
Several novel preCRMs have strong effects in both transient and site-directed stable integration assays. For example, Alas2R1, a predicted CRM in the first intron of the Alas2 gene, caused a fivefold increase in expression in transfected K562 cells in two separate experiments (Fig. 3A
The role of GATA-1 in this validated preCRM was tested by mutating the two matches to GATA-1 binding sites and measuring the effect of the altered preCRMs after transfection into K562 cells. Luciferase expression by the mutated expression plasmids decreased to the level of the parental plasmid (Fig. 3B The gene Zfpm1 encodes a multiple zinc finger protein, FOG1, that cooperates with GATA-1 at some regulatory sites (Tsang et al. 1998; Crispino et al. 1999; Fox et al. 1999; Chang et al. 2002). This locus is an immediate target of GATA-1 in G1E-ER4 cells (Welch et al. 2004). Our bioinformatic approach predicts many preCRMs in Zfpm1 (Fig. 1
We chose Gata2 as an example of a gene whose expression is down-regulated in response to restoration of GATA-1 function in G1E-ER4 cells (Welch et al. 2004). The novel preCRMs Gata2R1 (~50 kb upstream, Fig. 1

All the individual preCRMs discussed so far contain at least one ccGATA1BS, and additional preCRMccs in three other loci, Vav2, Btg2, and Hebp1, were validated with strong effects (Supplemental Fig. S6A; Supplemental Table S1). Notably, four of the preCRMs for which the consensus GATA-1 binding site is present only in mouse or rodents (preCRMncc class) also were validated in transient transfection assays (Fig. 4D

Overall, the preCRMcc and preCRMncc classes had the highest validation rates, much higher than preCRMcnc or segments with negative RP (Table 1). This supports the use of a combination of high RP and a consensus binding site for a transcription factor in predicting CRMs. Also, deviation from the consensus binding site (preCRMcnc class) leads to less accurate predictions.

Site occupancy by GATA-1

A sampling of each class of preCRM was tested for occupancy by the protein GATA-1 in rescued G1E-ER4 cells using chromatin immunoprecipitation. Ten of 12 tested preCRMccs showed significant levels of GATA-1 protein bound (Fig. 5
An exception to this association is Zfpm1R4. This preCRMcc is bound by GATA-1, as previously reported by Welch et al. (2004), but it is not active in either enhancer assay. The site occupancy indicates that it is involved in some aspect of regulation, perhaps one displayed in the G1E-ER4 cells but not mimicked in the cell lines used for transfection. Conservation of the consensus GATA-1 binding motif, WGATAR, is strongly associated with site occupancy. In contrast, neither of the preCRMs with a conserved nonconsensus site is occupied by GATA-1 (Fig. 5

The preCRMccs Hipk2R28 and Zfpm1R9 appear to be false positives of our prediction pipeline. It is possible that they have a function for which we have not tested, but that function does not involve GATA-1 binding in G1E-ER4 cells.

Positive correlation of activity with RP and conserved consensus GATA-1 binding motif

Both the activities in the transfection assays and the magnitude of the RP signals vary for the validated preCRMs, which raises the question of whether RP scores have a positive correlation with activity. Thus, the correlation between the negative log10 of the P-value for activity in the assays (transient or stable) and the mean RP score was examined for all DNA segments assayed. The segments with negative mean RP scores consistently have low activity (Fig. 6A
Another method for evaluating the relative contributions of RP score and conserved factor binding sites to the prediction of CRMs is to determine sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Using RP scores ≥0 or ccGATA1BSs separately to predict CRMs gives many false positives, as expected, and consequently the specificity and PPV are marginal (Table 2). The sensitivity and negative predictive values are quite good for these individual features. Combining the two features improves the specificity to 0.68 and the PPV to 0.55. Raising the RP threshold to at least 0.05 for a positive prediction while using RP < 0 for a negative prediction improves the performance. Combining the latter RP thresholds with a requirement for a ccGATA1BS gives the best mix of sensitivity and specificity (0.83 and 0.73) and of positive and negative predictive value (0.65 and 0.88). This analysis supports the contribution of both features to predictive value. However, only DNA segments with both high RP and ccGATA1BSs were examined comprehensively in our experiments, and segments in other categories were only sampled. Thus the actual values for these measures may be different in more comprehensive experiments.
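The four measures discussed above follow directly from the counts of true and false predictions. The sketch below shows the arithmetic only; the function name and the example counts are hypothetical and are not taken from Table 2.

```python
def prediction_metrics(tp, fp, tn, fn):
    """Compute standard performance measures from confusion-matrix counts.

    tp: predicted CRM, validated      fp: predicted CRM, not validated
    tn: predicted neutral, inactive   fn: predicted neutral, but active
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of active segments that were predicted
        "specificity": tn / (tn + fp),  # fraction of inactive segments predicted neutral
        "ppv": tp / (tp + fp),          # fraction of positive predictions that validated
        "npv": tn / (tn + fn),          # fraction of negative predictions that were inactive
    }


# Hypothetical counts for illustration only (not the study's data).
example = prediction_metrics(tp=8, fp=2, tn=6, fn=4)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Because segments outside the high-RP, ccGATA1BS category were only sampled, counts assembled this way would carry the same caveat as the values in Table 2.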
Association of high RP segments with function

We also evaluated how well RP (in concert with ccGATA1BSs) works to enrich selections of noncoding DNA for function in regulation. All the noncoding mouse DNA in the eight target loci that aligned with other species was partitioned into bins by RP score (Fig. 6C

Discussion

We evaluated the effectiveness of using RP scores in combination with conserved binding sites for transcription factors to predict cis-regulatory modules for genes regulated during erythroid differentiation. Ninety-two DNA segments, covering a range of RP scores and varying in the presence or absence of predicted GATA-1 binding sites, were tested for enhancer activity. DNA segments with a high RP score (e.g., at least 0.05) and a conserved match to the consensus GATA-1 binding site were validated as enhancers at a high rate. The strength of the effects correlated positively both with RP score and with number of ccGATA1BSs. Thus, in a genome-wide application, RP > 0.05 and strict conservation of the GATA-1 binding site consensus should provide good specificity, >70% if the current rates are maintained (Table 2). However, some segments with RP scores between 0 and 0.05 are also validated, and thus greater sensitivity in studies of individual loci can be achieved by lowering the threshold for RP. For example, 18 out of 23 CRMs in the β-globin gene locus (King et al. 2005) pass an RP threshold of at least 0.05, computed using the method of Taylor et al. (2006). These include the CRMs with the strongest activity; the others have more subtle or lineage-specific effects. Thus, the thresholds for predictions should be set as appropriate to the scope and aims of the experiments. Our experimental approaches may actually underestimate the number of preCRMs that play a role in regulation. Most of our current assays test only one preCRM at a time, requiring that an individual preCRM be sufficient to cause a phenotype in order to be validated. 
However, it is common for groups of CRMs to work together, as has been observed for the locus control region of the HBB complex (Bungert et al. 1995; Hardison et al. 1997; Li et al. 1999). Initial results reported here show that at least one of the preCRMs that is not validated in individual assays can act in concert with another CRM to modulate its effects. In addition, some preCRMs may be specific for a subset of erythroid promoters, and may not be active on the globin gene promoters employed in this study. The cell lines used in our study do not mimic all aspects of developmental regulation, and assays in whole animals will be required for full exploration of potential activities. Positive and negative regulation are often exerted by the same CRMs, with changes in the trans-factors accomplishing the switch. This is the case for Gata2R3 (Grass et al. 2003; Martowicz et al. 2005). The inactivity of Gata2R3 alone in gene expression assays may result from complications of protein competition. This segment of DNA is located ~2.8 kb upstream of an alternate, tissue-specific promoter in an erythroid DNase hypersensitive site, and it has been shown to enhance expression in G1E cells (Grass et al. 2003; Martowicz et al. 2005). It is bound initially by GATA-2 but is then replaced by GATA-1 at progressive stages of erythroid maturation, leading to an inhibition of expression of Gata2 (Grass et al. 2003; Martowicz et al. 2005). Both GATA-1 and GATA-2 are present in the two cell lines used in our transfection studies, and it is possible that the effects of these competing proteins offset each other, preventing an obvious effect on expression. The 32 preCRMs validated in this study confirm four previously known erythroid CRMs and add 28 novel ones. Thus, this study increases substantially the set of confirmed erythroid CRMs, and they increase our knowledge about the regulation of individual genes. 
For example, our data reveal two more regulatory modules for Gata2, one located ~70 kb upstream and another in intron 4. CRMs are distributed throughout the Zfpm1 gene, and a highly active cluster of them in intron 3 may be particularly important for regulation. Our study demonstrates that the predictions based on RP scores and conserved transcription factor binding sites have good predictive value (Table 2), but improvements are still needed. Some strongly predicted regions show no effects in the assays used in this study. Further work with a wider range of assays is needed to determine whether these are actually false positives, or whether their function is not revealed by cell transfections. Many stringently conserved noncoding regions, which align between human and fish sequences, have been shown to function as developmental enhancers (Aparicio et al. 1995; Plessy et al. 2005; Woolfe et al. 2005), with validation rates as high as 90% (Nobrega et al. 2003; Woolfe et al. 2005). However, most mammalian CRMs are not so highly constrained (Hillier et al. 2004); indeed, none of the eight loci investigated in our study shows substantial noncoding sequence matches between mouse and fish. Thus, we restricted our analysis to alignments of mammalian genome sequences, filtering these for positive RP scores and a conserved consensus GATA-1 binding site. Other recent studies have employed alignments along with an additional filter with good success. Donaldson et al. (2005) exploited detailed knowledge of critical binding site motifs and their spacing, along with human–rodent conservation and positive RP, to predict and validate novel enhancers for genes involved in mammalian hematopoiesis. Johnson et al. (2005) used a combination of clusters of predicted motifs and evolutionary conservation successfully to identify muscle CRMs in a genome-wide scan of Ciona. 
Thus, the approach of combining RP scores with another feature of CRMs, such as conserved motifs or clusters of motifs, should be broadly applicable to studies of gene regulation in complex genomes. Combining RP scores with new methods to identify clusters of conserved factor binding sites (Berman et al. 2004; Blanchette et al. 2006) may be particularly productive. The phylogeny-based, multispecies RP scores (Taylor et al. 2006) improve prediction accuracy, compared with the initial implementations (Elnitski et al. 2003; Kolbe et al. 2004; Taylor et al. 2006). The current RP scores are systematically higher in the validated preCRMs. Three of the preCRMs that failed to validate had positive RP scores in the initial implementation but have negative scores by the current method. Thus, the improved methodology can reduce false positive predictions. In addition, requiring conservation of a stringent match to a GATA-1 binding site motif improves accuracy over use of conserved matches to GATA-1 binding site weight matrices. Another obvious limitation of our current prediction approach is that lineage-specific regulatory elements are invisible to techniques utilizing rodent–primate comparisons (Hughes et al. 2005; King et al. 2005). However, this limitation also could be overcome; lineage-specific CRMs may be identified utilizing a set of more closely related species, with techniques such as phylogenetic shadowing (Boffelli et al. 2003).

Methods

Conserved GATA-1 motifs

The motif-scorer scans aligned sequences with either the position-specific scoring matrix (PSSM, threshold = 0.85) for GATA-1 binding sites or pattern matching routines (searching for WGATAR), ignoring gap characters from the alignment rows in MAF format. The PSSM was generated by merging the PSSMs for GATA-1 binding sites in JASPAR (Sandelin et al. 2004) and TRANSFAC (Wingender et al. 2001). 
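The pattern-matching branch of this scan can be illustrated with a minimal sketch. It is not the motif-scorer itself: the function names are hypothetical, alignment rows are assumed to be equal-length strings with "-" as the gap character, only the forward strand is searched, and the PSSM branch is omitted.

```python
import re

# WGATAR with W = A or T and R = A or G. The lookahead lets overlapping
# matches be reported; this is an implementation choice of the sketch.
WGATAR = re.compile(r"(?=([AT]GATA[AG]))")
SITE = re.compile(r"[AT]GATA[AG]")


def consensus_columns(gapped_row):
    """Return, for each WGATAR match in a gapped alignment row, the list of
    alignment columns occupied by its six bases (gap characters are skipped)."""
    columns = [i for i, base in enumerate(gapped_row) if base != "-"]
    ungapped = "".join(gapped_row[i] for i in columns).upper()
    return [columns[m.start():m.start() + 6] for m in WGATAR.finditer(ungapped)]


def is_conserved_consensus(mouse_row, other_rows):
    """True if a mouse WGATAR match has WGATAR at the same alignment columns
    in at least one of the other rows (e.g., human, chimp, or dog)."""
    for columns in consensus_columns(mouse_row):
        for row in other_rows:
            site = "".join(row[c] for c in columns).upper()
            if SITE.fullmatch(site):
                return True
    return False


mouse = "CCAG-ATAAGG"   # toy rows, not real genomic sequence
human = "CCTGAATAGGG"
dog = "CCAGCATCAGG"
print(consensus_columns(mouse))
print(is_conserved_consensus(mouse, [human, dog]))
```

In this toy alignment the mouse match spans a gap column, which is why the comparison is made on alignment columns rather than on positions in the ungapped sequence.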
Sequence alignments with motif matches above the threshold in both mouse and at least one other non-rodent (human, chimp, or dog) comprised the conserved GATA1 binding site predictions. Three different types of matches to the GATA-1 binding site were employed. The most stringent is an exact match to the consensus motif, WGATAR, in mouse and in the aligned positions in at least one non-rodent sequence (conserved consensus GATA-1 binding site, or ccGATA1BS). The second is a conserved non-consensus site (cncGATA1BS), which matches the merged PSSM for GATA-1 binding sites in mouse and at least one non-rodent sequence in the multiple alignment, but is not an exact match to WGATAR. The third is a nonconserved consensus site (nccGATA1BS), which has an exact match to WGATAR in mouse but not in an aligned non-rodent sequence.

Prediction of erythroid cis-regulatory modules and negative controls

Predicted cis-regulatory modules (preCRMs) in the intervals containing the target mouse genes have the following properties: They (1) align among mouse, rat, human, chimp, and dog; (2) do not contain exons; (3) have a five-way mouse–rat–human–chimp–dog RP score (Taylor et al. 2006) >0; and (4) contain at least one match to one category of GATA-1 binding site (see above). For the preCRMs with a ccGATA1BS, all those in the eight target loci with a mean RP score >0.05 and most of those with scores between 0 and 0.05 were tested. Noncoding aligned DNA segments were initially predicted to be cis-regulatory modules (preCRMs) based on an empirically established threshold (Hardison et al. 2003) for mouse–human alignments (Elnitski et al. 2003; Schwartz et al. 2003) and their proximity to a predicted binding site for GATA-1 conserved among mouse, rat, and human (Schwartz et al. 2003). The advances in the availability of multiple sequences (Gibbs et al. 
2004; Chimpanzee Sequencing and Analysis Consortium 2005; Dog Sequencing and Analysis Consortium 2005), greater sensitivity in multiple alignment (Blanchette et al. 2004), improved algorithms for Markov models (Kolbe et al. 2004; Taylor et al. 2006), and modified methods for determining conserved matches to transcription factor binding sites (see section above) allowed us to reevaluate previously predicted preCRMs in terms of RP score and conservation of GATA-1 binding sites. The mouse DNA sequence in the chosen eight loci was aligned with the orthologous sequences from rat, human, chimpanzee, and dog using TBA (Blanchette et al. 2004), and RP scores were recomputed using the phylogeny-based five-species implementation of Taylor et al. (2006). A modification to this method was implemented to address a missing data problem. Some of the genome assemblies are incomplete, and thus the alignments include some gaps corresponding to missing sequence (rather than real gaps in the alignment). Gaps likely to result from absence of sequence were replaced with a wildcard symbol. RP scores were then computed using the same models as the other five-way scores, but each column with a wildcard was assigned to the same reduced alphabet symbol as the nearest non-wildcard column based on distance between ancestral reconstructions (Taylor et al. 2006). The previously predicted CRMs were mapped onto the new five-way RP scores to obtain the mean RP score used in this report. The type of predicted GATA-1 binding site was also determined using the procedure described in the previous section.

Transient transfection and expression

The luciferase expression plasmid MCSγluc (Elnitski et al. 2001) containing the human A gamma globin (HBG1pr) gene promoter (from −260 to +35) fused to the firefly luciferase coding region of pGL3Basic (Promega) was modified to contain restriction endonuclease sites for MluI and NotI. 
Predicted CRMs and neutrals were amplified from MEL_RL5 genomic DNA and inserted into the MCS via these sites to make each test expression plasmid. Primer sequences for preCRMs and negative controls are in Supplemental Tables S2 and S3. The plasmid DNAs were transiently transfected into K562 cells using the cationic lipid reagent Tfx50 (Promega) following the procedure described in Elnitski and Hardison (1999) and Elnitski et al. (2001). Briefly, 0.8 μg of plasmid containing firefly luciferase reporter and 0.008 μg of cotransfection control plasmid expressing Renilla luciferase were transfected in triplicate into 4 × 10⁵ cells at a 2:1 ratio (charge to mass) of Tfx50 to DNA. For a triplicate determination, a plasmid is prepared in three independent minipreps, each of which is transfected into the K562 cells. The entire triplicate experiment was done at least twice for each test plasmid. Two days after the transfection, cell extracts were subjected to a dual luciferase assay following the manufacturer’s protocol (Promega). For each of the triplicate samples, the firefly luciferase activity of the test plasmid (divided by the Renilla luciferase activity of the cotransfection control) was normalized by the firefly luciferase activity from the parental MCSγluc (divided by the Renilla luciferase activity of the cotransfection control) to obtain a fold change. The fold change is reported as its log (base 2).

Site-directed mutagenesis

Mutagenesis used the QuikChange Site-Directed Mutagenesis Kit (Stratagene), following the manufacturer’s protocol. The mutagenic primers were designed as follows:
The supercoiled double-stranded plasmid Alas2preCRM1γluc was used to amplify the mutated, nicked plasmid by PfuTurbo DNA polymerase. The PCR product was treated with DpnI (McClelland and Nelson 1992) and then transformed into XL1-Blue supercompetent cells.

Recombinase-mediated cassette exchange and expression after stable transfection of MEL_RL5 cells

The preCRMs were inserted into the plasmid L1-MCS2βEGFP-1L (Molete et al. 2001) containing MluI and NotI in the MCS, which expresses EGFP (Clontech) from the human β-globin gene (HBB) promoter (segment from −374 to +44 relative to the transcription start site). Thus, PCR-amplified DNA could be cloned into L1-MCS2βEGFP-1L as well as MCSγluc. The Cre expression plasmid pBS185 (CMV-CRE) was obtained from Clontech. Expression cassettes containing the parental HBB promoter-EGFP construct with or without a preCRM or preNeutral were integrated at RL5 by site-specific recombination directed by CRE recombinase, following procedures described in Bouhassira et al. (1997). Three pools of stably transfected cells carrying each expression cassette were isolated, and the median EGFP fluorescence was monitored by flow cytometry for several days to ascertain that the level was stable (Molete et al. 2001). The pools were then induced for erythroid maturation by incubating cultures of cells at a density of 2 × 10⁵/mL in DMEM containing 4 mM N,N′-hexamethylene-bis-acetamide (HMBA) for 6 d at 37°C. The level of green fluorescence from EGFP was measured daily by flow cytometry. Each measurement on each pool of cells containing a preCRM (or preNeutral) was divided by the fluorescence measurement from cells carrying the parental cassette (MCS2βEGFP, which is also integrated independently for each experiment) to obtain a fold change. The log2 of the fold change is the expression value analyzed in this work. 
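Both assays report the same quantity, a log2 fold change relative to the parental construct. The normalization can be sketched as below; the function names and the readings in the usage lines are hypothetical and serve only to show the arithmetic.

```python
import math


def transient_log2_fold(test_firefly, test_renilla, parent_firefly, parent_renilla):
    """Dual-luciferase assay: each firefly reading is divided by the Renilla
    reading of its cotransfection control, and the test ratio is then
    normalized by the ratio for the parental plasmid."""
    test_ratio = test_firefly / test_renilla
    parent_ratio = parent_firefly / parent_renilla
    return math.log2(test_ratio / parent_ratio)


def stable_log2_fold(test_fluorescence, parent_fluorescence):
    """Site-directed integration assay: EGFP fluorescence of a pool carrying a
    test segment divided by that of cells carrying the parental cassette."""
    return math.log2(test_fluorescence / parent_fluorescence)


# Hypothetical readings (arbitrary units), not data from the study.
print(transient_log2_fold(800.0, 10.0, 200.0, 10.0))  # fourfold increase -> 2.0
print(stable_log2_fold(300.0, 150.0))                 # twofold increase -> 1.0
```

On this scale a value of 0 means no change from the parental construct, and a value of 0.7 corresponds to roughly a 1.6-fold increase.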
Statistics for validation thresholds for enhancer activities

Validation of enhancer activity in either assay is based on comparison with the expression values from the set of predicted neutral fragments. Expression from a set of 17 preNeutrals (Supplemental Table S1) was measured in transiently transfected K562 cells, with triplicate determinations and at least two experiments. This yielded a set of 38 measurements. Expression after stable, site-directed integration into MEL_RL5 cells, before and after induction, was measured for a set of 11 preNeutrals (a subset of those tested in transient transfections; Supplemental Table S1). This produced a set of at least 30 measurements for each day of the induction series. The sets of measurements (all as log2 fold change relative to the parental expression cassettes or plasmids) for the sets of preNeutrals are the comparison distributions for the expression from each preCRM in each assay. A Wilcoxon-Mann-Whitney rank order test was applied to the set of expression measurements for each preCRM and preNeutral, comparing it to the set of values from all preNeutrals for that assay. For the transient transfection assay, no preNeutral had a P-value <0.0001, so we consider a preCRM to be validated if its P-value in this test is ≤0.0001. For the site-directed integration assay, no preNeutral had a P-value <0.0065 at any day of induction, so we consider the activity from a preCRM to be significant if its P-value in this test is ≤0.0065. For the latter assay, we further require that a significant activity be observed for at least three consecutive days during the induction series to ensure that a consistent effect is observed. The differences in P-value thresholds are influenced by the differences in numbers of measurements for both the preCRMs and the preNeutrals in the two assays, as well as differences in the dynamic range of values obtained.
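The decision rules described above can be sketched as follows. This is an illustrative sketch only: it assumes the P-values have already been computed by a Wilcoxon-Mann-Whitney test (for example with a statistics package), and the function and constant names are hypothetical rather than taken from the authors' analysis.

```python
# Thresholds stated in the text for the two assays.
TRANSIENT_P_THRESHOLD = 0.0001
INTEGRATION_P_THRESHOLD = 0.0065
MIN_CONSECUTIVE_DAYS = 3


def validated_transient(p_value):
    """Transient transfection assay: a single P-value at or below the threshold."""
    return p_value <= TRANSIENT_P_THRESHOLD


def validated_integration(daily_p_values):
    """Site-directed integration assay: P-values ordered by day of the
    induction series must be at or below the threshold on at least
    three consecutive days."""
    run = 0
    for p in daily_p_values:
        run = run + 1 if p <= INTEGRATION_P_THRESHOLD else 0
        if run >= MIN_CONSECUTIVE_DAYS:
            return True
    return False


# Hypothetical P-values, for illustration only.
print(validated_transient(0.00005))                              # True
print(validated_integration([0.5, 0.001, 0.002, 0.006, 0.2]))    # True
print(validated_integration([0.001, 0.002, 0.5, 0.001, 0.002]))  # False
```

The second integration example fails even though four days are individually significant, because no run of three consecutive significant days occurs.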
Chromatin immunoprecipitation

Chromatin immunoprecipitation (ChIP) was performed as described (Welch et al. 2004). GATA-1 (N6) and ER (Ab-10) antibodies used for ChIP were obtained from Santa Cruz Biotechnology and Neomarkers, respectively.

Acknowledgments

This work was supported by NIH grants DK65806 (R.H.) and HG02238 (W.M.), Tobacco Settlement funds from the Commonwealth of Pennsylvania, and the Huck Institutes of Life Sciences of the Penn State University.

Footnotes

[Supplemental material is available online at www.genome.org. The expression profile data obtained during MEL cell differentiation have been submitted to GEO under accession no. GSE2217.]

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5353806