PLoS Biol. Apr 2011; 9(4): e1001046.
Published online Apr 19, 2011. doi:  10.1371/journal.pbio.1001046
PMCID: PMC3079585

A User's Guide to the Encyclopedia of DNA Elements (ENCODE)

The ENCODE Project Consortium*
Peter B. Becker, Academic Editor

Abstract

The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

Author Summary

The Encyclopedia of DNA Elements (ENCODE) Project was created to enable the scientific and medical communities to interpret the human genome sequence and to use it to understand human biology and improve health. The ENCODE Consortium, a large group of scientists from around the world, uses a variety of experimental methods to identify and describe the regions of the 3 billion base-pair human genome that are important for function. Using experimental, computational, and statistical analyses, we aimed to discover and describe genes, transcripts, and transcriptional regulatory regions, as well as DNA binding proteins that interact with regulatory regions in the genome, including transcription factors, different versions of histones and other markers, and DNA methylation patterns that define states of the genome in various cell types. The ENCODE Project has developed standards for each experiment type to ensure high-quality, reproducible data and novel algorithms to facilitate analysis. All data and derived results are made available through a freely accessible database. This article provides an overview of the complete project and the resources it is generating, as well as examples to illustrate the application of ENCODE data as a user's guide to facilitate the interpretation of the human genome.

I. Introduction and Project Overview

Interpreting the human genome sequence is one of the leading challenges of 21st century biology [1]. In 2003, the National Human Genome Research Institute (NHGRI) embarked on an ambitious project—the Encyclopedia of DNA Elements (ENCODE)—aiming to delineate all of the functional elements encoded in the human genome sequence [2]. To further this goal, NHGRI organized the ENCODE Consortium, an international group of investigators with diverse backgrounds and expertise in production and analysis of high-throughput functional genomic data. In a pilot project phase spanning 2003–2007, the Consortium applied and compared a variety of experimental and computational methods to annotate functional elements in a defined 1% of the human genome [3]. Two additional goals of the pilot ENCODE Project were to develop and advance technologies for annotating the human genome, with the combined aims of achieving higher accuracy, completeness, and cost-effective throughput and establishing a paradigm for sharing functional genomics data. In 2007, the ENCODE Project was expanded to study the entire human genome, capitalizing on experimental and computational technology developments during the pilot project period. Here we describe this expanded project, which we refer to throughout as the ENCODE Project, or ENCODE.

The major goal of ENCODE is to provide the scientific community with high-quality, comprehensive annotations of candidate functional elements in the human genome. For the purposes of this article, the term “functional element” is used to denote a discrete region of the genome that encodes a defined product (e.g., protein) or a reproducible biochemical signature, such as transcription or a specific chromatin structure. It is now widely appreciated that such signatures, either alone or in combinations, mark genomic sequences with important functions, including exons, sites of RNA processing, and transcriptional regulatory elements such as promoters, enhancers, silencers, and insulators. However, it is also important to recognize that while certain biochemical signatures may be associated with specific functions, our present state of knowledge may not yet permit definitive declaration of the ultimate biological role(s), function(s), or mechanism(s) of action of any given genomic element.

At present, the proportion of the human genome that encodes functional elements is unknown. Estimates based on comparative genomic analyses suggest that 3%–8% of the base pairs in the human genome are under purifying (or negative) selection [4]–[7]. However, this likely underestimates the prevalence of functional features, as current comparative methods may not account for lineage-specific evolutionary innovations, functional elements that are very small or fragmented [8], elements that are rapidly evolving or subject to nearly neutral evolutionary processes, or elements that lie in repetitive regions of the genome.

The current phase of the ENCODE Project has focused on completing two major classes of annotations: genes (both protein-coding and non-coding) and their RNA transcripts, and transcriptional regulatory regions. To accomplish these goals, seven ENCODE Data Production Centers encompassing 27 institutions have been organized to focus on generating multiple complementary types of genome-wide data (Figure 1 and Figure S1). These data include identification and quantification of RNA species in whole cells and in sub-cellular compartments, mapping of protein-coding regions, delineation of chromatin and DNA accessibility and structure with nucleases and chemical probes, mapping of histone modifications and transcription factor (TF) binding sites by chromatin immunoprecipitation (ChIP), and measurement of DNA methylation (Figure 2 and Table 1). In parallel with the major production efforts, several smaller-scale efforts are examining long-range chromatin interactions, localizing binding proteins on RNA, identifying transcriptional silencer elements, and understanding detailed promoter sequence architecture in a subset of the genome (Figure 1 and Table 1).

Figure 1
The Organization of the ENCODE Consortium.
Figure 2
Data available from the ENCODE Consortium.
Table 1
Experimental assays used by the ENCODE Consortium.

ENCODE has placed emphasis on data quality, including ongoing development and application of standards for data reproducibility and the collection of associated experimental information (i.e., metadata). Adoption of state-of-the-art, massively parallel DNA sequence analysis technologies has greatly facilitated standardized data processing, comparison, and integration [9],[10]. Primary and processed data, as well as relevant experimental methods and parameters, are collected by a central Data Coordination Center (DCC) for curation, quality review, visualization, and dissemination (Figure 1). The Consortium releases data rapidly to the public through a web-accessible database (http://genome.ucsc.edu/ENCODE/) [11] and provides a visualization framework and analytical tools to facilitate use of the data [12], which are organized into a web portal (http://encodeproject.org).

To facilitate comparison and integration of data, ENCODE data production efforts have prioritized selected sets of cell types (Table 2). The highest priority set (designated “Tier 1”) includes two widely studied immortalized cell lines, K562 erythroleukemia cells [13] and an EBV-immortalized B-lymphoblastoid line (GM12878, also being studied by the 1000 Genomes Project; http://1000genomes.org), as well as the H1 human embryonic stem cell line [14]. A secondary priority set (Tier 2) includes HeLa-S3 cervical carcinoma cells [15], HepG2 hepatoblastoma cells [16], and primary (non-transformed) human umbilical vein endothelial cells (HUVEC; [17]), which have limited proliferation potential in culture. To capture a broader spectrum of human biological diversity, a third set (Tier 3) currently comprises more than 100 cell types that are being analyzed in selected assays (Table 2). Standardized growth conditions for all ENCODE cell types have been established and are available through the ENCODE web portal (http://encodeproject.org, “cell types” link).

Table 2
ENCODE cell types.

This report is intended to provide a guide to the data and resources generated by the ENCODE Project to date on Tier 1–3 cell types. We summarize the current state of ENCODE by describing the experimental and computational approaches used to generate and analyze data. In addition, we outline how to access datasets and provide examples of their use.

II. ENCODE Project Data

The following sections describe the different types of data being produced by the ENCODE Project (Table 1).

Genes and Transcripts

Gene annotation

A major goal of ENCODE is to annotate all protein-coding genes, pseudogenes, and non-coding transcribed loci in the human genome and to catalog the products of transcription including splice isoforms. Although the human genome contains ~20,000 protein-coding genes [18], accurate identification of all protein-coding transcripts has not been straightforward. Annotation of pseudogenes and noncoding transcripts also remains a considerable challenge. While automatic gene annotation algorithms have been developed, manual curation remains the approach that delivers the highest level of accuracy, completeness, and stability [19]. The ENCODE Consortium has therefore primarily relied on manual curation with moderate implementation of automated algorithms to produce gene and transcript models that can be verified by traditional experimental and analytical methods. This annotation process involves consolidation of all evidence of transcripts (cDNA, EST sequences) and proteins from public databases, followed by building gene structures based on supporting experimental data [20]. More than 50% of annotated transcripts have no predicted coding potential and are classified by ENCODE into different transcript categories. A classification that summarizes the certainty and types of the annotated structures is provided for each transcript (see http://www.gencodegenes.org/biotypes.html for details). The annotation also includes extensive experimental validation by RT-PCR for novel transcribed loci (i.e., those not previously observed and deposited into public curated databases such as RefSeq). Pseudogenes are identified primarily by a combination of similarity to other protein-coding genes and an obvious functional disablement such as an in-frame stop codon. 
Because it is difficult to validate pseudogenes experimentally, three independent annotation methods from Yale (“pseudopipe”) [21], UCSC (“retrofinder”; http://users.soe.ucsc.edu/~markd/gene-sets-new/pseudoGenes/RetroFinder.html, and references therein), and the Sanger Center [20] are combined to produce a consensus pseudogene set. Ultimately, each gene or transcript model is assigned one of three confidence levels. Level 1 includes genes validated by RT-PCR and sequencing, plus consensus pseudogenes. Level 2 includes manually annotated coding and long non-coding loci that have transcriptional evidence in EMBL/GenBank. Level 3 includes Ensembl gene predictions in regions not yet manually annotated or for which there is new transcriptional evidence.
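The three confidence levels described above can be read as a simple decision rule. The following is a minimal sketch of that rule, not the GENCODE pipeline itself: the `TranscriptEvidence` record, its field names, and the choice to treat every remaining model as Level 3 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvidence:
    """Hypothetical evidence flags for one gene or transcript model."""
    rtpcr_validated: bool = False       # validated by RT-PCR and sequencing
    consensus_pseudogene: bool = False  # member of the consensus pseudogene set
    manually_annotated: bool = False    # manual annotation with EMBL/GenBank evidence


def confidence_level(ev: TranscriptEvidence) -> int:
    """Return 1, 2, or 3 following the level definitions in the text."""
    if ev.rtpcr_validated or ev.consensus_pseudogene:
        return 1
    if ev.manually_annotated:
        return 2
    # Assumption: anything else is an automated (Ensembl) prediction.
    return 3


if __name__ == "__main__":
    print(confidence_level(TranscriptEvidence(manually_annotated=True)))
```

Checking the Level 1 criteria first means that a manually annotated locus that has also been validated by RT-PCR is reported at the higher confidence level.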

The result of ENCODE gene annotation (termed “GENCODE”) is a comprehensive catalog of transcripts and gene models. ENCODE gene and transcript annotations are updated bimonthly and are available through the UCSC ENCODE browser, distributed annotation servers (DAS; see http://genome.ucsc.edu/cgi-bin/das/hg18/features?segment=21:33031597,33041570type=wgEncodeGencodeManualV3), and the Ensembl Browser [22].

RNA transcripts

ENCODE aims to produce a comprehensive genome-wide catalog of transcribed loci that characterizes the size, polyadenylation status, and subcellular compartmentalization of all transcripts (Table 1).

ENCODE has generated transcript data with high-density (5 bp) tiling DNA microarrays [23] and massively parallel DNA sequencing methods [9],[10],[24], with the latter predominating in ongoing efforts. Both polyA+ and polyA− RNAs are being analyzed. Because subcellular compartmentalization of RNAs is important in RNA processing and function, such as nuclear retention of unspliced coding transcripts [25] or snoRNA activity in the nucleolus [26], ENCODE is analyzing not only total whole cell RNAs but also those concentrated in the nucleus and cytosol. Long (>200 nt) and short RNAs (<200 nt) are being sequenced from each subcellular compartment, providing catalogs of potential miRNAs, snoRNA, promoter-associated short RNAs (PASRs) [27], and other short cellular RNAs. Total RNA from K562 and GM12878 cells has been mapped by hybridization to high-density tiling arrays and sequenced to a depth of >500 million paired-end 76 bp reads under conditions where the strand of the RNA transcript is determined, providing considerable depth of transcript coverage (see below).

These analyses reveal that the human genome encodes a diverse array of transcripts. For example, in the proto-oncogene TP53 locus, RNA-seq data indicate that, while TP53 transcripts are accurately assigned to the minus strand, those for the oppositely transcribed, adjacent gene WRAP53 emanate from the plus strand (Figure 3). An independent transcript within the first intron of TP53 is also observed in both GM12878 and K562 cells (Figure 3).

Figure 3
ENCODE gene and transcript annotations.

Additional transcript annotations include exonic regions and splice junctions, transcription start sites (TSSs), transcript 3′ ends, spliced RNA length, locations of polyadenylation sites, and locations with direct evidence of protein expression. TSSs and 3′ ends of transcripts are being determined with two approaches, Paired-End diTag (PET) [28] and Cap-Analysis of Gene Expression (CAGE) [29]–[31] sequencing.

Transcript annotations throughout the genome are further corroborated by comparing tiling array data with deep sequencing data and by the manual curation described above. Additionally, selected compartment-specific RNA transcripts that cannot be mapped to the current build of the human genome sequence have been evaluated by 5′/3′ Rapid Amplification of cDNA Ends (RACE) [32], followed by RT-PCR cloning and sequencing. To assess putative protein products generated from novel RNA transcripts and isoforms, proteins may be sequenced and quantified by mass spectrometry and mapped back to their encoding transcripts [33],[34]. ENCODE has recently begun to study proteins from distinct subcellular compartments of K562 and GM12878 cells by using this complementary approach.

Cis-Regulatory Regions

Cis-regulatory regions include diverse functional elements (e.g., promoters, enhancers, silencers, and insulators) that collectively modulate the magnitude, timing, and cell-specificity of gene expression [35]. The ENCODE Project is using multiple approaches to identify cis-regulatory regions, including localizing their characteristic chromatin signatures and identifying sites of occupancy of sequence-specific transcription factors. These approaches are being combined to create a comprehensive map of human cis-regulatory regions.

Chromatin structure and modification

Human cis-regulatory regions characteristically exhibit nuclease hypersensitivity [36]–[39] and may show increased solubility after chromatin fixation and fragmentation [40],[41]. Additionally, specific patterns of post-translational histone modifications [42],[43] have been connected with distinct classes of regions such as promoters and enhancers [3],[44]–[47] as well as regions subject to programmed repression by Polycomb complexes [48],[49] or other mechanisms [46],[50],[51]. Chromatin accessibility and histone modifications thus provide independent and complementary annotations of human regulatory DNA, and massively parallel, high-throughput DNA sequencing methods are being used by ENCODE to map these features on a genome-wide scale (Figure 2 and Table 1).

DNaseI hypersensitive sites (DHSs) are being mapped by two techniques: (i) capture of free DNA ends at in vivo DNaseI cleavage sites with biotinylated adapters, followed by digestion with a TypeIIS restriction enzyme to generate ~20 bp DNaseI cleavage site tags [52],[53] and (ii) direct sequencing of DNaseI cleavage sites at the ends of small (<300 bp) DNA fragments released by limiting treatment with DNaseI [54]–[56]. Chromatin structure is also being profiled with the FAIRE technique [40],[57],[58], in which chromatin from formaldehyde-crosslinked cells is sonicated in a fashion similar to ChIP and then extracted with phenol, followed by sequencing of soluble DNA fragments. An expanding panel of histone modifications (Figure 2) is being profiled by ChIP-seq [59]–[62]. In this method, chromatin from crosslinked cells is immunoprecipitated with antibodies to chromatin modifications (or other proteins of interest), the associated DNA is recovered, and the ends are subjected to massively parallel DNA sequencing. Control immunoprecipitations with a control IgG antibody or “input” chromatin (sonicated crosslinked chromatin that is not subjected to immune enrichment) are also sequenced for each cell type. These provide critical controls, as shearing of crosslinked chromatin may occur preferentially within certain regulatory DNA regions, typically promoters [41]. ENCODE chromatin data types are illustrated for a typical locus in Figure 4, which depicts the patterns of chromatin accessibility, DNaseI hypersensitive sites, and selected histone modifications in GM12878 cells.

Figure 4
ENCODE chromatin annotations in the HLA locus.

For each chromatin data type, the “raw signal” is presented as the density of uniquely aligning sequence reads within 150 bp sliding windows in the human genome. In addition, some data are available as processed signal tracks in which filtering algorithms have been applied to reduce experimental noise. A variety of specialized statistical algorithms are applied to generate discrete high-confidence genomic annotations, including DHSs, broader regions of increased sensitivity to DNaseI, regions of enrichment by FAIRE, and regions with significant levels of specific histone modifications (see Tables 3 and S1). Notably, different histone modifications exhibit characteristic genomic distributions that may be either discrete (e.g., H3K4me3 over a promoter) or broad (e.g., H3K36me3 over an entire transcribed gene body). Because statistical false discovery rate (FDR) thresholds are applied to discrete annotations, the number of regions or elements identified under each assay type depends upon the threshold chosen. Optimal thresholds for an assay are typically determined by comparison to an independent and standard assay method or through reproducibility measurements (see below). Extensive validation of the detection of DNaseI hypersensitive sites is being performed independently with traditional Southern blotting, and more than 6,000 Southern images covering 224 regions in >12 cell types are available through the UCSC browser.
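To make the “raw signal” definition above concrete, the sketch below counts aligned reads in 150 bp sliding windows along one chromosome. It is an illustration under stated assumptions rather than the Consortium's processing code: each read is represented only by its 0-based start coordinate, the window advances 1 bp at a time, and the function name `window_density` is hypothetical.

```python
import numpy as np


def window_density(read_starts, chrom_length, window=150):
    """Count reads starting in each `window`-bp window, sliding by 1 bp.

    read_starts: 0-based start positions of uniquely aligned reads,
                 all assumed to be smaller than chrom_length.
    Returns an array of length chrom_length - window + 1, where element i
    is the number of reads starting in [i, i + window).
    """
    starts = np.asarray(read_starts, dtype=int)
    # Per-base count of read starts along the chromosome.
    per_base = np.bincount(starts, minlength=chrom_length)
    # Summing over a window is a convolution with a box kernel.
    kernel = np.ones(window, dtype=int)
    return np.convolve(per_base, kernel, mode="valid")


if __name__ == "__main__":
    density = window_density([0, 5, 5, 200], chrom_length=400)
    print(density[:3], density.max())
```

A production track would typically also normalize for sequencing depth and mappability; those steps are omitted here because the text does not specify them.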

Table 3
Analysis tools applied by the ENCODE Consortium.

Transcription factor and RNA polymerase occupancy

Much of human gene regulation is determined by the binding of transcriptional regulatory proteins to their cognate sequence elements in cis-regulatory regions. ChIP-seq enables genome-scale mapping of transcription factor (TF) occupancy patterns in vivo [59],[60],[62] and is being extensively applied by ENCODE to create an atlas of regulatory factor binding in diverse cell types. ChIP-seq experiments rely on highly specific antibodies that are extensively characterized by immunoblot analysis and other criteria according to ENCODE experimental standards. High-quality antibodies are currently available for only a fraction of human TFs, and identifying suitable immunoreagents has been a major activity of ENCODE TF mapping groups. Alternative technologies, such as epitope tagging of TFs in their native genomic context using recombineering [63],[64], are also being explored.

ENCODE has applied ChIP-seq to create occupancy maps for a variety of TFs, RNA polymerase 2 (RNA Pol2) including both unphosphorylated (initiating) and phosphorylated (elongating) forms, and RNA polymerase 3 (RNA Pol3). The localization patterns of five transcription factors and RNA Pol2 in GM12878 lymphoblastoid cells are shown for a typical locus in Figure 5. Sequence reads are processed as described above for DNaseI, FAIRE, and histone modification experiments, including the application of specialized peak-calling algorithms that use input chromatin or control immunoprecipitation data to identify potential false-positives introduced by sonication or sequencing biases (Table 3). Although different peak-callers vary in performance, the strongest peaks are generally identified by multiple algorithms. Most of the sites identified by ChIP-seq are also detected by traditional ChIP-qPCR [65] or are consistent with sites reported in the literature. For example, 98% of 112 sites of CTCF occupancy previously identified by using both ChIP-chip and ChIP-qPCR [66] are also identified in ENCODE CTCF data. Whereas the binding of sequence-specific TFs is typically highly localized resulting in tight sequence tag peaks, signal from antibodies that recognize the phosphorylated (elongating) form of RNA Pol2 may detect occupancy over a wide region encompassing both the site of transcription initiation as well as the domain of elongation. Comparisons among ENCODE groups have revealed that TF and RNA Pol2 occupancy maps generated independently by different groups are highly consistent.

Figure 5
Occupancy of transcription factors and RNA polymerase 2 on human chromosome 6p as determined by ChIP-seq.

Additional Data Types

ENCODE is also generating additional data types to complement production projects and benchmark novel technologies. An overview of these datasets is provided in Table 1.

DNA methylation

In vertebrate genomes, methylation at position 5 of the cytosine in CpG dinucleotides is a heritable “epigenetic” mark that has been connected with both transcriptional silencing and imprinting [67],[68]. ENCODE is applying several complementary approaches to measure DNA methylation. All ENCODE cell types are being assayed using two direct methods for measuring DNA methylation following sodium bisulfite conversion, which enables quantitative analysis of methylcytosines: interrogation of the methylation status of 27,000 CpGs with the Illumina Methyl27 assay [69]–[72] and Reduced Representation Bisulfite Sequencing (RRBS) [73], which couples MspI restriction enzyme digestion, size selection, bisulfite treatment, and sequencing to interrogate the methylation status of >1,000,000 CpGs largely concentrated within promoter regions and CpG islands. Data from an indirect approach using a methylation-sensitive restriction enzyme (Methyl-seq) [74] are also available for a subset of cell types. These three approaches measure DNA methylation in defined (though overlapping) subsets of the human genome and provide quantitative determinations of the fraction of CpG methylation at each site.
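For the bisulfite sequencing data, the “fraction of CpG methylation at each site” can be illustrated with a short calculation. After bisulfite conversion, an unmethylated cytosine is read as T while a methylated cytosine is still read as C, so the fraction is the share of C calls among the informative reads covering the site. The sketch below assumes the base calls at one CpG cytosine have already been collected into a string; the function name and input format are hypothetical.

```python
def methylation_fraction(base_calls):
    """Estimate the methylated fraction at one CpG cytosine.

    base_calls: string of bases observed at the cytosine position on
                bisulfite-converted reads, e.g. "CCTC".
    Returns methylated / (methylated + unmethylated), or None when no
    informative (C or T) calls cover the site.
    """
    calls = base_calls.upper()
    methylated = calls.count("C")    # protected from conversion
    unmethylated = calls.count("T")  # converted C, read as T
    total = methylated + unmethylated
    if total == 0:
        return None
    return methylated / total


if __name__ == "__main__":
    print(methylation_fraction("CCCT"))
```

Calls other than C or T (for example sequencing errors) are ignored rather than counted toward either class.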

DNaseI footprints

DNaseI footprinting [75] enables visualization of regulatory factor occupancy on DNA in vivo at nucleotide resolution and has been widely applied to delineate the fine structure of cis-regulatory regions [76]. Deep sampling of highly enriched libraries of DNaseI-released fragments (see above) enables digital quantification of per nucleotide DNaseI cleavage, which in turn enables resolution of DNaseI footprints on a large scale [55],[77],[78]. Digital genomic footprinting is being applied on a large scale within ENCODE to identify millions of DNaseI footprints across >12 cell types, many of which localize the specific cognate regulatory motifs for factors profiled by ChIP-seq.

Sequence and structural variation

Genotypic and structural variations within all ENCODE cell types are being interrogated at ~1 million positions distributed approximately every 1.5 kb along the human genome, providing a finely grained map of allelic variation and sequence copy number gains and losses. Genotyping data are generated with the Illumina Infinium platform [79], and the results are reported as genotypes and as intensity value ratios for each allele. The genotype and sequence data from GM12878 generated by the 1000 Genomes Project are being integrated with sequence data from ENCODE chromatin, transcription, TF occupancy, DNA methylation, and other assays to facilitate recognition of functional allelic variation, a significant contributor to phenotypic variability in gene expression [80],[81]. The data also permit determination of the sequence copy number gains and losses found in every human genome [82]–[84], which are particularly prevalent in cell lines of malignant origin.

Long-range chromatin interactions

Because cis-regulatory elements such as enhancers can control genes from distances of tens to hundreds of kb through looping interactions [85], a major challenge presented by ENCODE data is to connect distal regulatory elements with their cognate promoter(s). To map this connectivity, the Consortium is applying the 5C method [86], an enhanced version of Chromosome Conformation Capture (3C) [87], to selected cell lines. 5C has been applied comprehensively to the ENCODE pilot regions as well as to map the interactions between distal DNaseI hypersensitive sites and transcriptional start sites across chromosome 21 and selected domains throughout the genome. Special interfaces have been developed to visualize these 3-dimensional genomic data and are publicly available at http://my5C.umassmed.edu [88].

Protein:RNA interactions

RNA-binding proteins play a major role in regulating gene expression through control of mRNA translation, stability, and/or localization. Occupancy of RNA-binding proteins (RBPs) on RNA can be determined by using immunoprecipitation-based approaches (RIP-chip and RIP-seq) [89]–[92] analogous to those used for measuring TF occupancy. To generate maps of RBP:RNA associations and binding sites, a combination of RIP-chip and RIP-seq is being used. These approaches are currently targeting 4–6 RBPs in five human cell types (K562, GM12878, H1 ES, HeLa, and HepG2). RBP associations with non-coding RNA and with mRNA are also being explored.

Identification of functional elements with integrative analysis and fine-scale assays of biochemical elements

ChIP-seq of TFs and chromatin modifications may identify genomic regions bound by transcription factors in living cells but does not reveal which segments bound by a given TF are functionally important for transcription. By applying integrative approaches that incorporate histone modifications typical of enhancers (e.g., histone H3, Lysine 4 monomethylation), promoters (e.g., histone H3, Lysine 4 trimethylation), and silencers (e.g., histone H3, Lysine 27, and Lysine 9 trimethylation), ENCODE is categorizing putative functional elements and testing a subset for activities in the context of transient transfection/reporter gene assays [93]–[97]. To further pinpoint the biological activities associated with specific regions of TF binding and chromatin modification within promoters, hundreds of TF binding sites have been mutagenized, and the mutant promoters are being assayed for effects on reporter gene transcription by transient transfection assays. This approach is enabling identification of specific TF binding sites that lead to activation and others associated with transcriptional repression.

Proteomics

To assess putative protein products generated from novel RNA transcripts and isoforms, proteins are sequenced and quantified by mass spectrometry and mapped back to their encoding transcripts [33],[34],[98]. ENCODE has recently begun to study proteins from distinct subcellular compartments of K562 and GM12878 with this complementary approach.

Evolutionary conservation

Evolutionary conservation is an important indicator of biological function. ENCODE is approaching evolutionary analysis from two directions. Functional properties are being assigned to conserved sequence elements identified through multi-species alignments, and conversely, the evolutionary histories of biochemically defined elements are being deduced. Multiple alignments of the genomes of 33 mammalian species have been constructed by using the Enredo, Pecan, Ortheus approach (EPO) [99],[100], and complementary multiple alignments are available through the UCSC browser (UCSC Lastz/ChainNet/Multiz). These alignments enable measurement of evolutionary constraint at single-nucleotide resolution using GERP [101], SCONE [102], PhyloP [103], and other algorithms. In addition, conservation of DNA secondary structure based on hydroxyl radical cleavage patterns is being analyzed with the Chai algorithm [7].

Data Production Standards and Assessment of Data Quality

With the aim of ensuring quality and consistency, ENCODE has defined standards for collecting and processing each data type. These standards encompass all major experimental components, including cell growth conditions, antibody characterization, requirements for controls and biological replicates, and assessment of reproducibility. Standard formats for data submission are used that capture all relevant data parameters and experimental conditions, and these are available at the public ENCODE portal (http://genome.ucsc.edu/ENCODE/dataStandards.html). All ENCODE data are reviewed by a dedicated quality assurance team at the Data Coordination Center before release to the public. Experiments are considered to be verified when two highly concordant biological replicates have been obtained with the same experimental technique. In addition, a key quality goal of ENCODE is to provide validation at multiple levels, which can be further buttressed by cross-correlation between disparate data types. For example, we routinely perform parallel analysis of the same biological samples with alternate detection technologies (for example, ChIP-seq versus ChIP-chip or ChIP-qPCR). We have also compared our genome-wide results to “gold-standard” data from individual locus studies, such as DNase-seq versus independently performed conventional (Southern-based) DNaseI hypersensitivity studies. Cross-correlation of independent but related ENCODE data types with one another, such as DNaseI hypersensitivity, FAIRE, transcription factor occupancy, and histone modification patterns, can provide added confidence in the identification of specific DNA elements. Similarly, cross-correlation between long RNA-seq, CAGE, and TAF1 ChIP-seq data can strengthen confidence in a candidate location for transcription initiation. 
Finally, ENCODE is performing pilot tests for the biological activity of DNA elements to assess the predictive potential of various ENCODE biochemical signatures for certain biological functions. Examples include transfection assays in cultured human cells and injection assays in fish embryos to test for enhancer, silencer, or insulator activities in DNA elements identified by binding of specific groups of TFs or the presence of DNaseI hypersensitive sites or certain chromatin marks. Ultimately, defining the full biological role of a DNA element in its native chromosomal location and organismic context is the greatest challenge. ENCODE is beginning to approach this by integrating its data with results from other studies of in situ knockouts and/or knockdowns, or the identification of specific naturally occurring single base mutations and small deletions associated with changes in gene expression. However, we expect that deep insights into the function of most elements will ultimately come from the community of biologists who will build on ENCODE data or use them to complement their own experiments.
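The replicate criterion described in this section (two highly concordant biological replicates) does not name a specific statistic. As one possible illustration, the sketch below compares two replicate signal vectors binned over the same genomic windows using the Pearson correlation. Both the choice of Pearson correlation and the 0.9 cutoff are assumptions made here for demonstration, not ENCODE standards.

```python
import numpy as np


def replicates_concordant(signal_a, signal_b, min_r=0.9):
    """Return (r, passed) for two replicate signal vectors.

    signal_a, signal_b: per-window signal values for the same windows.
    min_r: hypothetical cutoff for calling the replicates concordant.
    """
    a = np.asarray(signal_a, dtype=float)
    b = np.asarray(signal_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("replicates must cover the same windows")
    # Off-diagonal element of the 2x2 correlation matrix is Pearson's r.
    r = float(np.corrcoef(a, b)[0, 1])
    return r, r >= min_r


if __name__ == "__main__":
    print(replicates_concordant([1, 2, 3, 4], [2, 4, 6, 8.5]))
```

In practice a rank-based measure or a comparison of called peaks may be more appropriate for sparse signal; the same function shape would apply with a different statistic.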

Current Scope and Completeness of ENCODE Data

A catalog of ENCODE datasets is available at http://encodeproject.org. These data provide evidence that ~1 Gigabase (Gb; 32%) of the human genome sequence is represented in steady-state, predominantly processed RNA populations. We have also delineated more than 2 million potential regulatory DNA regions through chromatin and TF mapping studies.

The assessment of the completeness of detection of any given element is challenging. To analyze the detection of transcripts in a single experiment, we have sequenced to substantial depth and used a sampling approach to estimate the number of reads needed to approach complete sampling of the RNA population (Figure 6A) [104]. For example, analyzing RNA transcripts with about 80 million mapped reads yields robust quantification of more than 80% of the lowest abundance class of genes (2–19 reads per kilobase per million mapped tags, RPKM) [24]. Measuring RNAs across multiple cell types, we find that, after the analysis of seven cell lines, 68% of the GENCODE transcripts can be detected with RPKM >1.
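The RPKM unit quoted above normalizes a read count by both transcript length and sequencing depth. The following is a minimal sketch of that arithmetic only; the function name and the example numbers are illustrative assumptions and are not taken from any ENCODE pipeline.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


# Hypothetical example: 320 reads on a 2 kb transcript
# in a library of 80 million mapped reads.
example = rpkm(320, 2_000, 80_000_000)
```

In this hypothetical case the transcript would fall at the low end of the 2–19 RPKM abundance class mentioned in the text.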

Figure 6
Incremental discovery of transcribed elements and regulatory DNA.

In the case of regulatory DNA, we have analyzed the detection of regulatory DNA by using three approaches: 1) the saturation of occupancy site discovery for a single transcription factor within a single cell type as a function of sequencing read depth, 2) the incremental discovery of DNaseI hypersensitive sites or the occupancy sites for a single TF across multiple cell types, and 3) the incremental rate of collective TF occupancy site discovery for all TFs across multiple cell types.

For detecting TF binding sites by ChIP-seq, we have found that the number of significant binding sites increases as a function of sequencing depth and that this number varies widely by transcription factor. For example, as shown in Figure 6B, 90% of detectable sites for the transcription factor GABP can be identified by using the MACS peak calling program at a depth of 24 million reads, whereas only 55% of detectable RNA Pol2 sites are identified at this depth when an antibody that recognizes both initiating and elongating forms of the enzyme is used. Even at 50 million reads, the number of sites is not saturated for RNA Pol2 with this antibody. It is important to note that determinations of saturation may vary with the use of different antibodies and laboratory protocols. For instance, a different RNA Pol2 antibody that recognizes unphosphorylated, non-elongating RNA Pol2 bound only at promoters requires fewer reads to reach saturation [105]. For practical purposes, ENCODE currently uses a minimum sequencing depth of 20 M uniquely mapped reads for sequence-specific transcription factors. For data generated prior to June 1, 2010, this figure was 12 M.

To assess the incremental discovery of regulatory DNA across different cell types, it was necessary to account for the non-uniform correlation between cell lines and assays (see Figure 6C legend for details). We therefore examined all possible orderings of either cell types or assays and calculated the distribution of elements discovered as the number of cell types or assays increases, presented as saturation distribution plots (Figure 6C and 6D, respectively). For DNase hypersensitive sites, we observe a steady increase in the mean number of sites discovered as additional cell types are tested up to and including the 62 different cell types examined to date, indicating that new elements continue to be identified at a relatively high rate as additional cell types are sampled (Figure 6C). Analysis of CTCF sites across 28 cell types using this approach shows similar behavior. Analysis of binding sites for 42 TFs in the cell line with the most data (K562) also shows that saturation of the binding sites for these factors has not yet been achieved. These results indicate that additional cell lines need to be analyzed for DNaseI and many transcription factors, and that many more transcription factors need to be analyzed within single cell types to capture all the regulatory information for a given factor across the genome. The implications of these trends for defining the extent of regulatory DNA within the human genome sequence are as yet unclear.
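The ordering-based saturation analysis described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Consortium's code: it samples random orderings instead of enumerating all of them (which becomes infeasible beyond a handful of cell types), it treats each site as a simple identifier, and the function and parameter names are hypothetical.

```python
import random
from statistics import mean


def saturation_curve(sites_by_cell_type, n_orderings=200, seed=0):
    """Mean cumulative number of distinct sites vs. number of cell types.

    sites_by_cell_type: dict mapping a cell type name to a set of site IDs.
    Returns a list whose k-th entry is the mean number of distinct sites
    seen after adding k + 1 cell types, averaged over sampled orderings.
    """
    rng = random.Random(seed)
    names = list(sites_by_cell_type)
    totals = [[] for _ in names]
    for _ in range(n_orderings):
        order = names[:]
        rng.shuffle(order)
        seen = set()
        for k, name in enumerate(order):
            seen |= sites_by_cell_type[name]
            totals[k].append(len(seen))
    return [mean(column) for column in totals]


# Hypothetical toy input: three cell types with partly shared sites.
toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}}
curve = saturation_curve(toy)
```

A curve that is still rising at its last point, as reported for DNaseI hypersensitive sites across 62 cell types, suggests that further cell types would continue to reveal new elements.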

III. Accessing ENCODE Data

ENCODE Data Release and Use Policy

The ENCODE Data Release and Use Policy is described at http://www.encodeproject.org/ENCODE/terms.html. Briefly, ENCODE data are released for viewing in a publicly accessible browser (initially at http://genome-preview.ucsc.edu/ENCODE and, after additional quality checks, at http://encodeproject.org). The data are available for download and pre-publication analysis of any kind, as soon as they are verified (i.e., shown to be reproducible). However, consistent with the principles stated in the Toronto Genomic Data Use Agreement [106], the ENCODE Consortium data producers request that they have the first publication on genome-wide analyses of ENCODE data, within a 9-month timeline from its submission. The timeline for each dataset is clearly displayed in the information section for each dataset. This parallels policies of other large consortia, such as the HapMap Project (http://www.hapmap.org), that attempt to balance the goal of rapid data release with the ability of data producers to publish initial analyses of their work. Once a producer has published a dataset during this 9-month period, anyone may publish freely on the data. The embargo applies only to global analysis, and the ENCODE Consortium expects and encourages immediate use and publication of information at one or a few loci, without any consultation or permission. For such uses, identifying ENCODE as the source of the data by citing this article is requested.

Public Repositories of ENCODE Data

After curation and review at the Data Coordination Center, all processed ENCODE data are publicly released to the UCSC Genome Browser database (http://genome.ucsc.edu). Accessioning of ENCODE data at the NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/info/ENCODE.html) is underway. Primary DNA sequence reads are stored at UCSC and the NCBI Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?) and will also be retrievable via GEO. Primary data derived from DNA microarrays (for example, for gene expression) are deposited directly to GEO. The processed data are also formatted for viewing in the UCSC browser. Metadata, including information on antibodies, cell culture conditions, and other experimental parameters, are deposited into the UCSC database, as are results of validation experiments. Easy retrieval of ENCODE data to a user's desktop is facilitated by the UCSC Table Browser tool (http://genome.ucsc.edu/cgi-bin/hgTables?org=human), which does not require programming skills. Computationally sophisticated users may gain direct access to data through application programming interfaces (APIs) at both the UCSC browser and NCBI and by downloading files from http://genome.ucsc.edu/ENCODE/downloads.html.

An overview of ENCODE data types and the location of the data repository for each type is presented in Table 4.

Table 4
Overview of ENCODE data types.

IV. Working with ENCODE Data

Using ENCODE Data in the UCSC Browser

Many users will want to view and interpret the ENCODE data for particular genes of interest. At the online ENCODE portal (http://encodeproject.org), users should follow a “Genome Browser” link to visualize the data in the context of other genome annotations. Currently, it is useful for users to examine both the hg18 and the hg19 genome browsers. The hg18 browser has the ENCODE Integrated Regulation Track on by default, which condenses a large amount of data into a compact display. The hg19 browser has newer datasets and more ENCODE data than are available on hg18. Work is in progress to remap the older hg18 datasets to hg19 and generate integrated ENCODE tracks. On either browser, additional ENCODE tracks are marked by a double helix logo in the browser track groups for genes, transcripts, and regulatory features. Users can turn tracks on or off to develop the views most useful to them (Figure 7). To aid users in navigating the rich variety of data tracks, the ENCODE portal also provides a detailed online tutorial that covers data display, data download, and analysis functions available through the browser. Examples applying ENCODE data at individual loci to specific biological or medical issues are a good starting point for exploration and use of the data. Thus, we also provide a collection of examples at the “session gallery” at the ENCODE portal. Users are encouraged to submit additional examples; we anticipate that this community-based sharing of insights will accelerate the use and impact of the ENCODE data.

Figure 7
Accessing ENCODE data at the UCSC Portal.

An Illustrative Example

Numerous genome-wide association studies (GWAS) that link human genome sequence variants with the risk of disease or with common quantitative phenotypes have now become available. However, in most cases, the molecular consequences of disease- or trait-associated variants for human physiology are not understood [107]. In more than 400 studies compiled in the GWAS catalog [108], only a small minority of the trait/disease-associated SNPs (TASs) occur in protein-coding regions; the large majority (89%) are in noncoding regions. We therefore expect that the accumulating functional annotation of the genome by ENCODE will contribute substantially to functional interpretation of these TASs.

For example, common variants within a ~1 Mb region upstream of the c-Myc proto-oncogene at 8q24 have been associated with cancers of the colon, prostate, and breast (Figure 8A) [109]–[111]. ENCODE data on transcripts, histone modifications, DNase hypersensitive sites, and TF occupancy show strong, localized signals in the vicinity of major cancer-associated SNPs. One variant (rs698327) lies within a DNase hypersensitive site that is bound by several TFs and the enhancer-associated protein p300 and contains histone modification patterns typical of enhancers (high H3K4me1, low H3K4me3; Figure 8B). Recent studies have shown enhancer activity and allele-specific binding of TCF7L2 at this site [112], with the risk allele showing greater binding and activity [113],[114]. Moreover, this element appears to contact the downstream c-Myc gene in vivo, compatible with enhancer function [114],[115]. Similarly, several regions predicted via ENCODE data to be involved in gene regulation are close to SNPs in the BCL11A gene associated with persistent expression of fetal hemoglobin (Figure S2). These examples show that the simple overlay of ENCODE data with candidate non-coding risk-associated variants may readily identify specific genomic elements as leading candidates for investigation as probable effectors of phenotypic effects via alterations in gene expression or other genomic regulatory processes. Importantly, even data from cell types not directly associated with the phenotype of interest may be of considerable value for hypothesis generation. It is reasonable to expect that application of current and future ENCODE data will provide useful information concerning the mechanism(s) whereby genomic variation influences susceptibility to disease, which can then be tested experimentally.

Figure 8
ENCODE data indicate non-coding regions in the human chromosome 8q24 loci associated with cancer.

Limitations of ENCODE Annotations

All ENCODE datasets to date are from populations of cells. Therefore, the resulting data integrate over the entire cell population, which may be physiologically and genetically inhomogeneous. For instance, the source cell cultures in the ENCODE experiments are not typically synchronized with respect to the cell cycle and, as with all such samples, local micro-environments in culture may also vary, leading to physiological differences in cell state within each culture. In addition, one Tier 1 cell line (K562) and two Tier 2 cell lines (HepG2 and HeLa) are known to have abnormal genomes and karyotypes, with genome instability. Finally, some future Tier 3 tissue samples or primary cultures may be inherently heterogeneous in cell type composition. Averaging over heterogeneity in physiology and/or genotype produces an amalgamation of the contributing patterns of gene expression, factor occupancy, and chromatin status that must be considered when using the data. Future improvements in genome-wide methodology that allow the use of much smaller amounts of primary samples, or follow-up experiments in single cells when possible, may allow us to overcome many of these caveats.

The use of DNA sequencing to annotate functional genomic features is constrained by the ability to place short sequence reads accurately within the human genome sequence. Most ENCODE data types currently represented in the UCSC browser use only those sequence reads that map uniquely to the genome. Thus, centromeric and telomeric segments (collectively ~15% of the genome and enriched in recent transposon insertions and segmental duplications) as well as sequences not present in the current genome sequence build [116] are not subject to reliable annotation by our current techniques. However, such information can be gleaned through mining of the publicly available raw sequence read datasets generated by ENCODE.

It is useful to recognize that the confidence with which different classes of ENCODE elements can be related to a candidate function varies. For example, ENCODE can identify with high confidence new internal exons of protein-coding genes, based on RNA-seq data for long polyA+ RNA. Other features, such as candidate promoters, can be identified with less, yet still good, confidence by combining data from RNA-seq, CAGE-tags, and RNA polymerase 2 (RNA Pol2) and TAF1 occupancy. Still other ENCODE biochemical signatures come with much lower confidence about function, such as a candidate transcriptional enhancer supported by ChIP-seq evidence for binding of a single transcription factor.

Identification of genomic regions enriched by ENCODE biochemical assays relies on the application of statistical analyses and the selection of threshold significance levels, which may vary between the algorithms used for particular data types. Accordingly, discrete annotations, such as TF occupancy or DNaseI hypersensitive sites, should be considered in the context of reported p values, q values, or false discovery rates, which are conservative in many cases. For data types that lack focal enrichment, such as certain histone modifications and many RNA Pol2-bound regions, broad segments of significant enrichment have been delineated that encompass considerable quantitative variation in the signal strength along the genome.
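To make the relationship between p values and q values concrete, the sketch below applies the Benjamini-Hochberg adjustment, one widely used way of controlling the false discovery rate. It is offered only as an illustration: the text does not state which procedure any particular ENCODE data type uses, and the function name and example values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, keeping input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p values for four candidate sites.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Filtering candidate sites at, for example, an adjusted value below 0.05 would then bound the expected fraction of false positives among the retained sites, under the assumptions of the procedure.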

V. ENCODE Data Analysis

Development and implementation of algorithms and pipelines for processing and analyzing data has been a major activity of the ENCODE Project. Because massively parallel DNA sequencing has been the main type of data generated by the Consortium, much of the algorithmic development and data analysis to date has been concerned with issues related to producing and interpreting such data. Software packages and algorithms commonly used in the ENCODE Consortium are summarized in Tables 3 and S1.

In general, the analysis of sequencing-based measurements of functional or biochemical genomic parameters proceeds through three major phases. In the first phase, the short sequences that are the output of the experimental method are aligned to the reference genome. Algorithm development for efficient and accurate alignment of short read sequences to the human genome is a rapidly developing field, and ENCODE groups employ a variety of state-of-the-art software (see Tables 3 and S1). In the second phase, the initial sequence mapping is processed to identify significantly enriched regions from the read density. For ChIP-seq (TFs and histone modification), DNase-seq, or FAIRE-seq, either highly localized peaks or broader enriched regions may be identified. Within the ENCODE Consortium, each data production group provides lists of enriched regions or elements within their own data, which are available through the ENCODE portal. It should be noted that, for most data types, the majority of enriched regions show relatively weak absolute signal, necessitating the application of conservative statistical thresholds. For some data, such as those derived from sampling RNA species (e.g., RNA-seq), additional algorithms and processing are used to handle transcript structures and the recognition of splicing events.

The final stage of analysis involves integrating the identified regions of enriched signal with each other and with other data types. An important prerequisite to data integration is the availability of uniformly processed datasets. Therefore, in addition to the processing pipelines developed by individual production groups, ENCODE has devoted considerable effort toward establishing robust uniform processing for phases 1 and 2 to enable integration. For signal comparison, specific consideration has been given to deriving a normalized view of the sequence read density of each experiment. In the case of ChIP-seq for TFs, this process includes in silico extension of the sequence alignment to reflect the experimentally determined average lengths of the input DNA molecules that are sampled by the short sequence tag, compensation for repetitive sequences that may lead to alignment with multiple genomic locations, and consideration of the read density of the relevant control or input chromatin experiment. ENCODE has adopted a uniform standardized peak-calling approach for transcription factor ChIP-seq, including a robust and conservative replicate reconciliation statistic (Irreproducible Discovery Rate, IDR [117]), to yield comparable consensus peak calls. As the project continues, we expect further standardizations to be developed.
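The in silico extension step can be illustrated as follows. This is a simplified sketch, assuming 0-based half-open coordinates, a single chromosome, and a fixed fragment length; it does not include the compensation for repetitive sequence or the control normalization described above, and all names are hypothetical.

```python
def extend_read(start, end, strand, fragment_length, chrom_size):
    """Extend an aligned read to the estimated fragment length.

    Reads on the plus strand are extended to the right from their start;
    reads on the minus strand are extended to the left from their end.
    """
    if strand == "+":
        new_start, new_end = start, start + fragment_length
    elif strand == "-":
        new_start, new_end = end - fragment_length, end
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip to the chromosome boundaries.
    return max(0, new_start), min(chrom_size, new_end)


def coverage(reads, fragment_length, chrom_size):
    """Per-base count of extended fragments along one chromosome."""
    diff = [0] * (chrom_size + 1)
    for start, end, strand in reads:
        s, e = extend_read(start, end, strand, fragment_length, chrom_size)
        diff[s] += 1
        diff[e] -= 1
    depth, running = [], 0
    for delta in diff[:chrom_size]:
        running += delta
        depth.append(running)
    return depth


# Hypothetical example: two 36 bp reads flanking one binding site.
signal = coverage([(10, 46, "+"), (150, 186, "-")], 100, 200)
```

Extending reads from opposite strands toward each other makes their fragments overlap over the presumed binding site, which is why the extended density peaks between the two read clusters.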

There are many different ways to analyze and integrate large, diverse datasets. Some of the basic approaches include assigning features to existing annotations (e.g., assigning transcribed regions to annotated genes or Pol2-binding peaks to likely genes), discovery of correlations among features, and identification of particular gene classes (e.g., Gene Ontology categories) preferentially highlighted by a given annotation. Many software tools exist in the community for these purposes, including some developed within the ENCODE Project, such as the Genome Structure Correction statistic for assessing overlap significance [3]. Software tools used for integration by ENCODE are summarized in Tables 3 and S1.
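As a simple illustration of assigning features to existing annotations, the sketch below links a peak to the gene with the nearest transcription start site (TSS) on the same chromosome. Nearest-TSS assignment is only one possible heuristic and is not presented here as the Consortium's method; the gene names and coordinates are hypothetical.

```python
import bisect


def nearest_gene(peak_position, genes):
    """Return the name of the gene whose TSS is closest to peak_position.

    genes: list of (tss_position, gene_name) tuples for one chromosome,
    sorted by tss_position. Returns None if the list is empty.
    """
    if not genes:
        return None
    positions = [tss for tss, _ in genes]
    i = bisect.bisect_left(positions, peak_position)
    # Only the TSSs immediately left and right of the peak can be nearest.
    candidates = genes[max(0, i - 1): i + 1]
    return min(candidates, key=lambda g: abs(g[0] - peak_position))[1]


# Hypothetical annotation with three genes.
annotation = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
assigned = nearest_gene(4800, annotation)
```

As the 8q24 example earlier in this article suggests, a regulatory element may act on a gene that is not its nearest neighbor, so assignments of this kind are best treated as starting hypotheses.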

VI. Future Plans and Challenges

Data Production Plans

The challenge of achieving complete coverage of all functional elements in the human genome is substantial. The adult human body contains several hundred distinct cell types, each of which expresses a unique subset of the ~1,500 TFs encoded in the human genome [118]. Furthermore, the brain alone contains thousands of types of neurons that are likely to express not only different sets of TFs but also a larger variety of non-coding RNAs [119]. In addition, each cell type may exhibit a diverse array of responses to exogenous stimuli such as environmental conditions or chemical agents. Broad areas of fundamental chromosome function, such as meiosis and recombination, remain unexplored. Furthermore, ENCODE has focused chiefly on definitive cells and cell lines, bypassing the substantial complexity of development and differentiation. A truly comprehensive atlas of human functional elements is not practical with current technologies, motivating our focus on performing the available assays in a range of cell types that will provide substantial near-term utility. ENCODE is currently developing a strategy for addressing this cellular space in a timely manner that maximizes the value to the scientific community. Feedback from the user community will be a critical component of this process.

Integrating ENCODE with Other Projects and the Scientific Community

To better understand and functionally annotate the human genome, ENCODE is making efforts to analyze and integrate data within the project and with other large-scale projects. These efforts include 1) defining promoter and enhancer regions by combining transcript mapping and biochemical marks, 2) delineating distinct classes of regions within the genomic landscape by their specific combinations of biochemical and functional characteristics, and 3) defining transcription factor co-associations and regulatory networks. These efforts aim to extend our understanding of the functions of the different biochemical elements in gene regulation and gene expression.

One of the major motivations for the ENCODE Project has been to aid in the interpretation of human genome variation that is associated with disease or quantitative phenotypes. The Consortium is therefore working to combine ENCODE data with those from other large-scale studies, including the 1000 Genomes Project, to study, for example, how SNPs and structural variation may affect transcript, regulatory, and DNA methylation data. We foresee a time in the near future when the biochemical features defined by ENCODE are routinely combined with GWAS and other sequence variation–driven studies of human phenotypes. Analogously, the systematic profiling of epigenomic features across ex vivo tissues and stem cells currently being undertaken by the NIH Roadmap Epigenomics program will provide synergistic data and the opportunity to observe the state and behavior of ENCODE-identified elements in human tissues representing healthy and disease states.

These are but a few of many applications of the ENCODE data. Investigators focused on one or a few genes should find many new insights within the ENCODE data. Indeed, these investigators are in the best position to infer potential functions and mechanisms from the ENCODE data—ones that will also lead to testable hypotheses. Thus, we expect that the work of many investigators will be enhanced by these data and that their results will in turn inform the development of the project going forward.

Finally, we also expect that comprehensive paradigms for gene regulation will begin to emerge from our work and similar work from many laboratories. Deciphering the “regulatory code” within the genome and its associated epigenetic signals is a grand and complex challenge. The data contributed by ENCODE in conjunction with complementary efforts will be foundational to this effort, but equally important will be novel methods for genome-wide analysis, model building, and hypothesis testing. We therefore expect the ENCODE Project to be a major contributor not only of data but also of novel technologies for deciphering the human genome and those of other organisms.

Supporting Information

Figure S1

The Organization of the ENCODE Consortium. The geographical distribution of the members of the ENCODE Consortium, with pin colors indicating the group roles as detailed in the text below.

(TIF)

Figure S2

Quantitative trait example (BCL11A). Candidates for gene regulatory features in the vicinity of SNPs at the BCL11A locus associated with fetal hemoglobin levels. SNPs associated with fetal hemoglobin levels are marked in red on the top line; those not associated are marked in blue. The phenotype-associated SNPs are close to an antisense transcript (AC009970.1, light orange), shown in the ENCODE gene annotations. This antisense transcript is within a region (boxed in red) with elevated levels of H3K4me1 and DNase hypersensitive sites. The phenotype-associated region is flanked by two regions (boxed in blue) with multiple strong biochemical signals associated with transcriptional regulation, including transcription factor occupancy. The data are from the lymphoblastoid cell line GM12878, as BCL11A is expressed in this cell line (RNA-seq track) but not in K562 (unpublished data).

(TIF)

Table S1

This supplemental table contains additional details of the computational analysis tools used by the ENCODE Consortium that are listed in Table 3. The name of each software tool appears in the first column, and subsequent columns contain the tasks for which the tool is used, the PMID reference number when available, and a web address where the tool can be accessed.

(DOC)

Acknowledgments

We thank Judy R. Wexler and Julia Zhang at the National Human Genome Research Institute for their support in administering the ENCODE Consortium, additional members of our laboratories and institutions who have contributed to the experimental and analytical components of this project, and J. D. Frey for assistance in preparing the figures.

The ENCODE Consortium Authors

Writing Group. Richard M. Myers1, John Stamatoyannopoulos2, Michael Snyder3, Ian Dunham4, Ross C. Hardison5, Bradley E. Bernstein6,7, Thomas R. Gingeras8, W. James Kent9, Ewan Birney4, Barbara Wold10,11, Gregory E. Crawford12,13.

Broad Institute Group. Bradley E. Bernstein6,7, Charles B. Epstein6, Noam Shoresh6, Jason Ernst6,14, Tarjei S. Mikkelsen6, Pouya Kheradpour6,14, Xiaolan Zhang6, Li Wang6, Robbyn Issner6, Michael J. Coyne6, Timothy Durham6, Manching Ku6, Thanh Truong6, Lucas D. Ward6,14, Robert C. Altshuler14, Michael F. Lin6,14, Manolis Kellis6,14.

Cold Spring Harbor; University of Geneva; Center for Genomic Regulation, Barcelona; RIKEN; University of Lausanne; Genome Institute of Singapore Group. Cold Spring Harbor I: Thomas R. Gingeras8, Carrie A. Davis8, Philipp Kapranov15, Alexander Dobin8, Christopher Zaleski8, Felix Schlesinger8, Philippe Batut8, Sudipto Chakrabortty8, Sonali Jha8, Wei Lin8, Jorg Drenkow8, Huaien Wang8, Kim Bell8, Hui Gao16, Ian Bell15, Erica Dumais15, Jacqueline Dumais15. University of Geneva: Stylianos E. Antonarakis17, Catherine Ucla17, Christelle Borel17. Center for Genomic Regulation, Barcelona: Roderic Guigo18, Sarah Djebali18, Julien Lagarde18, Colin Kingswood18, Paolo Ribeca18, Micha Sammeth18, Tyler Alioto18, Angelika Merkel18, Hagen Tilgner18. RIKEN: Piero Carninci19, Yoshihide Hayashizaki19, Timo Lassmann19, Hazuki Takahashi19, Rehab F. Abdelhamid19. Cold Spring Harbor II: Gregory Hannon20, Katalin Fejes-Toth8, Jonathan Preall8, Assaf Gordon8, Vihra Sotirova8. University of Lausanne: Alexandre Reymond21, Cedric Howald21, Emilie Aït Yahya Graison21, Jacqueline Chrast21. Genome Institute of Singapore: Yijun Ruan22, Xiaoan Ruan22, Atif Shahab22, Wan Ting Poh22, Chia-Lin Wei22.

Duke University, EBI, University of Texas, Austin, University of North Carolina–Chapel Hill Group. Duke University: Gregory E. Crawford12,13, Terrence S. Furey12, Alan P. Boyle12, Nathan C. Sheffield12, Lingyun Song12, Yoichiro Shibata12, Teresa Vales12, Deborah Winter12, Zhancheng Zhang12, Darin London12, Tianyuan Wang12. EBI: Ewan Birney4, Damian Keefe4. University of Texas, Austin: Vishwanath R. Iyer23, Bum-Kyu Lee23, Ryan M. McDaniell23, Zheng Liu23, Anna Battenhouse23, Akshay A. Bhinge23. University of North Carolina–Chapel Hill: Jason D. Lieb24, Linda L. Grasfeder24, Kimberly A. Showers24, Paul G. Giresi24, Seul K. C. Kim24, Christopher Shestak24.

HudsonAlpha Institute, Caltech, Stanford Group. HudsonAlpha Institute: Richard M. Myers1, Florencia Pauli1, Timothy E. Reddy1, Jason Gertz1, E. Christopher Partridge1, Preti Jain1, Rebekka O. Sprouse1, Anita Bansal1, Barbara Pusey1, Michael A. Muratet1, Katherine E. Varley1, Kevin M. Bowling1, Kimberly M. Newberry1, Amy S. Nesmith1, Jason A. Dilocker1, Stephanie L. Parker1, Lindsay L. Waite1, Krista Thibeault1, Kevin Roberts1, Devin M. Absher1. Caltech: Barbara Wold10,11, Ali Mortazavi10,11, Brian Williams10, Georgi Marinov10, Diane Trout10, Shirley Pepke25, Brandon King10, Kenneth McCue10, Anthony Kirilusha10, Gilberto DeSalvo10, Katherine Fisher-Aylor10, Henry Amrhein10, Jost Vielmetter11. Stanford: Gavin Sherlock3, Arend Sidow3,26, Serafim Batzoglou27, Rami Rauch3, Anshul Kundaje26,27, Max Libbrecht27.

NHGRI Groups. NHGRI, Genome Informatics Section: Elliott H. Margulies28, Stephen C. J. Parker28. NHGRI, Genomic Functional Analysis Section: Laura Elnitski29. NHGRI, NIH Intramural Sequencing Center: Eric D. Green30.

Sanger Institute; Washington University; Yale University; Center for Genomic Regulation, Barcelona; UCSC; MIT; University of Lausanne; CNIO Group. Sanger Institute: Tim Hubbard31, Jennifer Harrow31, Stephen Searle31, Felix Kokocinski31, Browen Aken31, Adam Frankish31, Toby Hunt31, Gloria Despacio-Reyes31, Mike Kay31, Gaurab Mukherjee31, Alexandra Bignell31, Gary Saunders31, Veronika Boychenko31. Washington University: Michael Brent32, M. J. Van Baren32, Randall H. Brown32. Yale University: Mark Gerstein33,34,35, Ekta Khurana33,34, Suganthi Balasubramanian33,34, Zhengdong Zhang33,34, Hugo Lam33,34, Philip Cayting3,33,34, Rebecca Robilotto33,34, Zhi Lu33,34. Center for Genomic Regulation, Barcelona: Roderic Guigo18, Thomas Derrien18, Andrea Tanzer18, David G. Knowles18, Marco Mariotti18. UCSC: W. James Kent9, David Haussler9,36, Rachel Harte9, Mark Diekhans9. MIT: Manolis Kellis6,14, Mike Lin6,14, Pouya Kheradpour6,14, Jason Ernst6,14. University of Lausanne: Alexandre Reymond21, Cedric Howald21, Emilie Aït Yahya Graison21, Jacqueline Chrast21. CNIO: Alfonso Valencia37, Michael Tress37, Jose Manuel Rodriguez37.

Stanford-Yale, Harvard, University of Massachusetts Medical School, University of Southern California/UCDavis Group. Stanford-Yale: Michael Snyder3, Stephen G. Landt3, Debasish Raha38, Minyi Shi3, Ghia Euskirchen3, Fabian Grubert3, Maya Kasowski38, Jin Lian39, Philip Cayting3,33,34, Phil Lacroute3, Youhan Xu38, Hannah Monahan38, Dorrelyn Patacsil3, Teri Slifer3, Xinqiong Yang3, Alexandra Charos38, Brian Reed38, Linfeng Wu3, Raymond K. Auerbach33, Lukas Habegger33, Manoj Hariharan3, Joel Rozowsky33,34, Alexej Abyzov33,34, Sherman M. Weissman39, Mark Gerstein33,34,35. Harvard: Kevin Struhl40, Nathan Lamarre-Vincent40, Marianne Lindahl-Allen40, Benoit Miotto40, Zarmik Moqtaderi40, Joseph D. Fleming40. University of Massachusetts Medical School: Peter Newburger41. University of Southern California/UCDavis: Peggy J. Farnham42,43, Seth Frietze42,43, Henriette O'Geen43, Xiaoqin Xu43, Kim R. Blahnik43, Alina R. Cao43, Sushma Iyengar43.

University of Washington, University of Massachusetts Medical School Group. University of Washington: John A. Stamatoyannopoulos2, Rajinder Kaul2, Robert E. Thurman2, Hao Wang2, Patrick A. Navas2, Richard Sandstrom2, Peter J. Sabo2, Molly Weaver2, Theresa Canfield2, Kristen Lee2, Shane Neph2, Vaughan Roach2, Alex Reynolds2, Audra Johnson2, Eric Rynes2, Erika Giste2, Shinny Vong2, Jun Neri2, Tristan Frum2, Ericka M. Johnson2, Eric D. Nguyen2, Abigail K. Ebersol2, Minerva E. Sanchez2, Hadar H. Sheffer2, Dimitra Lotakis2, Eric Haugen2, Richard Humbert2, Tanya Kutyavin2, Tony Shafer2. University of Massachusetts Medical School: Job Dekker44, Bryan R. Lajoie44, Amartya Sanyal44.

Data Coordination Center. W. James Kent9, Kate R. Rosenbloom9, Timothy R. Dreszer9, Brian J. Raney9, Galt P. Barber9, Laurence R. Meyer9, Cricket A. Sloan9, Venkat S. Malladi9, Melissa S. Cline9, Katrina Learned9, Vanessa K. Swing9, Ann S. Zweig9, Brooke Rhead9, Pauline A. Fujita9, Krishna Roskin9, Donna Karolchik9, Robert M. Kuhn9, David Haussler9,36.

Data Analysis Center. Ewan Birney4, Ian Dunham4, Steven P. Wilder4, Damian Keefe4, Daniel Sobral4, Javier Herrero4, Kathryn Beal4, Margus Lukk4, Alvis Brazma4, Juan M. Vaquerizas4, Nicholas M. Luscombe4, Peter J. Bickel45, Nathan Boley45, James B. Brown45, Qunhua Li45, Haiyan Huang45, Mark Gerstein32,33,34, Lukas Habegger33, Andrea Sboner33,34, Joel Rozowsky33,34, Raymond K. Auerbach33, Kevin Y. Yip33,34, Chao Cheng33,34, Koon-Kiu Yan33,34, Nitin Bhardwaj33,34, Jing Wang33,34, Lucas Lochovsky33,34, Justin Jee33,34, Theodore Gibson33,34, Jing Leng33,34, Jiang Du35, Ross C. Hardison5, Robert S. Harris5, Giltae Song5, Webb Miller5, David Haussler9,36, Krishna Roskin9, Bernard Suh9, Ting Wang46, Benedict Paten9, William S. Noble2,47, Michael M. Hoffman2, Orion J. Buske2, Zhiping Weng48, Xianjun Dong48, Jie Wang48, Hualin Xi49.

University of Albany SUNY Group. Scott A. Tenenbaum50, Frank Doyle50, Luiz O. Penalva51, Sridar Chittur50.

Boston University Group. Thomas D. Tullius52, Stephen C. J. Parker28,52.

University of Chicago, Stanford Group. University of Chicago: Kevin P. White53, Subhradip Karmakar53, Alec Victorsen53, Nader Jameel53, Nick Bild53, Robert L. Grossman53. Stanford: Michael Snyder3, Stephen G. Landt3, Xinqiong Yang3, Dorrelyn Patacsil3, Teri Slifer3.

University of Massachusetts Medical School Groups. University of Massachusetts Medical School I: Job Dekker44, Bryan R. Lajoie44, Amartya Sanyal44. University of Massachusetts Medical School II: Zhiping Weng48, Troy W. Whitfield48, Jie Wang48, Patrick J. Collins54, Nathan D. Trinklein54, E. Christopher Partridge1, Richard M. Myers1.

Boise State University/University of North Carolina–Chapel Hill Proteomics Group. Morgan C. Giddings55,56,57, Xian Chen58, Jainab Khatun55, Chris Maier55, Yanbao Yu57, Harsha Gunawardena57, Brian Risk56.

NIH Project Management Group. Elise A. Feingold58, Rebecca F. Lowdon58, Laura A. L. Dillon58, Peter J. Good58.

Affiliations

1 HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, United States of America,

2 Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America,

3 Department of Genetics, Stanford University School of Medicine, Stanford, California, United States of America,

4 European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, United Kingdom,

5 Center for Comparative Genomics and Bioinformatics, Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America,

6 Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America,

7 Howard Hughes Medical Institute and Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, United States of America,

8 Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America,

9 Center for Biomolecular Science and Engineering, University of California, Santa Cruz, Santa Cruz, California, United States of America,

10 Biology Division, California Institute of Technology, Pasadena, California, United States of America,

11 Beckman Institute, California Institute of Technology, Pasadena, California, United States of America,

12 Institute for Genome Sciences and Policy, Duke University, Durham, North Carolina, United States of America,

13 Department of Pediatrics, Duke University, Durham, North Carolina, United States of America,

14 Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, United States of America,

15 Affymetrix, Santa Clara, California, United States of America,

16 Karolinska Institutet, Huddinge, Sweden,

17 University of Geneva, Geneva, Switzerland,

18 Bioinformatics and Genomics, Centre de Regulacio Genomica, Barcelona, Spain,

19 Omics Science Center, RIKEN Yokohama Institute, Yokohama, Kanagawa, Japan,

20 Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America,

21 Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland,

22 Genome Institute of Singapore, Singapore,

23 Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, Section of Molecular Genetics and Microbiology, University of Texas at Austin, Austin, Texas, United States of America,

24 Department of Biology, Carolina Center for Genome Sciences, and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America,

25 Center for Advanced Computing Research, California Institute of Technology, Pasadena, California, United States of America,

26 Department of Pathology, Stanford University School of Medicine, Stanford, California, United States of America,

27 Department of Computer Science, Stanford University, Stanford, California, United States of America,

28 Division of Intramural Research, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America,

29 National Human Genome Research Institute, Genome Technology Branch, National Institutes of Health, Rockville, Maryland, United States of America,

30 National Human Genome Research Institute, NIH Intramural Sequencing Center, National Institutes of Health, Bethesda, Maryland, United States of America,

31 Vertebrate Genome Analysis, Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire, United Kingdom,

32 Center for Genome Sciences and Department of Computer Science, Washington University in St. Louis, St. Louis, Missouri, United States of America,

33 Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America,

34 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America,

35 Department of Computer Science, Yale University, New Haven, Connecticut, United States of America,

36 Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, California, United States of America,

37 Structural Computational Biology, Centro Nacional de Investigaciones Oncológicas, Madrid, Spain,

38 Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut, United States of America,

39 Department of Genetics, Yale University, New Haven, Connecticut, United States of America,

40 Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts, United States of America,

41 Department of Pediatrics, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America,

42 Department of Biochemistry and Molecular Biology, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, California, United States of America,

43 Genome Center, University of California–Davis, Davis, California, United States of America,

44 Program in Gene Function and Expression, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America,

45 Department of Statistics, University of California at Berkeley, Berkeley, California, United States of America,

46 Department of Genetics, Washington University in St. Louis, St. Louis, Missouri, United States of America,

47 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America,

48 Program in Bioinformatics and Integrative Biology, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, Massachusetts, United States of America,

49 Bioinformatics Program, Boston University, Boston, Massachusetts, United States of America,

50 College of Nanoscale Sciences and Engineering, University at Albany–SUNY, Albany, New York, United States of America,

51 Children's Cancer Research Institute, Department of Cellular and Structural Biology, San Antonio, Texas, United States of America,

52 Department of Chemistry and Program in Bioinformatics, Boston University, Boston, Massachusetts, United States of America,

53 Institute for Genomics and Systems Biology, The University of Chicago, Chicago, Illinois, United States of America,

54 SwitchGear Genomics, Menlo Park, California, United States of America,

55 Biomolecular Research Center, Boise State University, Boise, Idaho, United States of America,

56 Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America,

57 Biochemistry Department, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, United States of America,

58 National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, United States of America

Abbreviations

3C: Chromosome Conformation Capture
API: application programming interface
CAGE: Cap-Analysis of Gene Expression
ChIP: chromatin immunoprecipitation
DCC: Data Coordination Center
DHS: DNaseI hypersensitive site
ENCODE: Encyclopedia of DNA Elements
EPO: Enredo, Pecan, Ortheus approach
FDR: false discovery rate
GEO: Gene Expression Omnibus
GWAS: genome-wide association studies
IDR: Irreproducible Discovery Rate
Methyl-seq: sequencing-based methylation determination assay
NHGRI: National Human Genome Research Institute
PASRs: promoter-associated short RNAs
PET: Paired-End diTag
RACE: Rapid Amplification of cDNA Ends
RNA Pol2: RNA polymerase 2
RBP: RNA-binding protein
RRBS: Reduced Representation Bisulfite Sequencing
SRA: Sequence Read Archive
TAS: trait/disease-associated SNP
TF: transcription factor
TSS: transcription start site

Footnotes

The authors have declared that no competing interests exist.

Funded by the National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA. The role of the NIH Project Management Group in the preparation of this paper was limited to coordination and scientific management of the ENCODE Consortium.

References

1. Collins F. S, Green E. D, Guttmacher A. E, Guyer M. S. A vision for the future of genomics research. Nature. 2003;422:835–847. [PubMed]
2. ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004;306:636–640. [PubMed]
3. ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. [PMC free article] [PubMed]
4. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. [PubMed]
5. Chiaromonte F, Weber R. J, Roskin K. M, Diekhans M, Kent W. J, Haussler D. The share of human genomic DNA under selection estimated from human-mouse genomic alignments. Cold Spring Harb Symp Quant Biol. 2003;68:245–254. [PubMed]
6. Stone E. A, Cooper G. M, Sidow A. Trade-offs in detecting evolutionarily constrained sequence by comparative genomics. Annu Rev Genomics Hum Genet. 2005;6:143–164. [PubMed]
7. Parker S. C, Hansen L, Abaan H. O, Tullius T. D, Margulies E. H. Local DNA topography correlates with functional noncoding regions of the human genome. Science. 2009;324:389–392. [PMC free article] [PubMed]
8. Asthana S, Noble W. S, Kryukov G, Grant C. E, Sunyaev S, Stamatoyannopoulos J. A. Widely distributed noncoding purifying selection in the human genome. Proc Natl Acad Sci U S A. 2007;104:12410–12415. [PMC free article] [PubMed]
9. Wold B, Myers R. M. Sequence census methods for functional genomics. Nat Methods. 2008;5:19–21. [PubMed]
10. Wang Z, Gerstein M, Snyder M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57–63. [PMC free article] [PubMed]
11. Rosenbloom K. R, Dreszer T. R, Pheasant M, Barber G. P, Meyer L. R, et al. ENCODE whole-genome data in the UCSC Genome Browser. Nucleic Acids Res. 2010;38(Database issue):D620–D625. [PMC free article] [PubMed]
12. Rhead B, Karolchik D, Kuhn R. M, Hinrichs A. S, Zweig A. S, et al. The UCSC Genome Browser database: update 2010. Nucleic Acids Res. 2010;38(Database issue):D613–D619. [PMC free article] [PubMed]
13. Lozzio C. B, Lozzio B. B. Human chronic myelogenous leukemia cell-line with positive Philadelphia chromosome. Blood. 1975;45:321–334. [PubMed]
14. Thomson J. A, Itskovitz-Eldor J, Shapiro S. S, Waknitz M. A, Swiergiel J. J, et al. Embryonic stem cell lines derived from human blastocysts. Science. 1998;282:1145–1147. [PubMed]
15. Gey G. O, Coffman W. D, Kubicek M. T. Tissue culture studies of the proliferative capacity of cervical carcinoma and normal epithelium. Cancer Res. 1952;12:264–265.
16. Knowles B. B, Howe C. C, Aden D. P. Human hepatocellular carcinoma cell lines secrete the major plasma proteins and hepatitis B surface antigen. Science. 1980;209:497–499. [PubMed]
17. Jaffe E. A, Nachman R. L, Becker C. G, Minick C. R. Culture of human endothelial cells derived from umbilical veins: Identification by morphologic and immunologic criteria. J Clin Invest. 1973;52:2745–2756. [PMC free article] [PubMed]
18. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. [PubMed]
19. Guigó R, Flicek P, Abril J. F, Reymond A, Lagarde J, et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006;7(Suppl 1):S2 1–31. [PMC free article] [PubMed]
20. Harrow J, Denoeud F, Frankish A, Reymond A, Chen C. K, et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7:S4 1–9. [PMC free article] [PubMed]
21. Zhang Z, Carriero N, Zheng D, Karro J, Harrison P. M, Gerstein M. PseudoPipe: an automated pseudogene identification pipeline. Bioinformatics. 2006;22:1437–1439. [PubMed]
22. Flicek P, Aken B. L, Ballester B, Beal K, Bragin E, et al. Ensembl's 10th year. Nucleic Acids Res. 2010;38:D557–D562. [PMC free article] [PubMed]
23. Kampa D, Cheng J, Kapranov P, Yamanaka M, Brubaker S, et al. Novel RNAs identified from a comprehensive analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 2004;14:331–342. [PMC free article] [PubMed]
24. Mortazavi A, Williams B. A, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods. 2008;5:621–628. [PubMed]
25. Schmid M, Jensen T. H. Nuclear quality control of RNA polymerase II transcripts. J Wiley Interdisciplinary Review 2010 [PubMed]
26. Bachellerie J. P, Cavaille J, Huttenhofer A. The expanding snoRNA world. Biochimie. 2002;84:775–790. [PubMed]
27. Kapranov P, Willingham A. T, Gingeras T. R. Genome-wide transcription and the implications for genomic organization. Nat Rev Genet. 2007;8:413–423. [PubMed]
28. Fullwood M. J, Wei C. L, Liu E. T, Ruan Y. Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses. Genome Res. 2009;19:521–532. [PMC free article] [PubMed]
29. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A. 2003;100:15776–15781. [PMC free article] [PubMed]
30. Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, et al. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. 2006;38:626–635. [PubMed]
31. Valen E, Pascarella G, Chalk A, Maeda N, Kojima M, et al. Genome-wide detection and analysis of hippocampus core promoters using DeepCAGE. Genome Res. 2009;19:255–265. [PMC free article] [PubMed]
32. Frohman M. A, Dush M. K, Martin G. R. Rapid production of full-length cDNAs from rare transcripts: amplification using a single gene-specific oligonucleotide primer. Proc Natl Acad Sci U S A. 1988;85:8998–9002. [PMC free article] [PubMed]
33. Giddings M. C, Shah A. A, Gesteland R, Moore B. Genome-based peptide fingerprint scanning. Proc Natl Acad Sci U S A. 2003;100:20–25. [PMC free article] [PubMed]
34. Merrihew G. E, Davis C, Ewing B, Williams G, Käll L, et al. Use of shotgun proteomics for the identification, confirmation, and correction of C elegans gene annotations. Genome Res. 2008;18:1660–1669. [PMC free article] [PubMed]
35. Maston G. A, Evans S. K, Green M. R. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet. 2006;7:29–59. [PubMed]
36. Wu C. The 5′ ends of Drosophila heat shock genes in chromatin are hypersensitive to DNase I. Nature. 1980;286:854–860. [PubMed]
37. Keene M. A, Corces V, Lowenhaupt K, Elgin S. C. DNase I hypersensitive sites in Drosophila chromatin occur at the 5′ ends of regions of transcription. Proc Natl Acad Sci U S A. 1981;78:143–146. [PMC free article] [PubMed]
38. McGhee J. D, Wood W. I, Dolan M, Engel J. D, Felsenfeld G. A 200 base pair region at the 5′ end of the chicken adult beta-globin gene is accessible to nuclease digestion. Cell. 1981;27:45–55. [PubMed]
39. Gross D. S, Garrard W. T. Nuclease hypersensitive sites in chromatin. Annu Rev Biochem. 1988;57:159–197. [PubMed]
40. Giresi P. G, Kim J, McDaniell R. M, Iyer V. R, Lieb J. D. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 2007;17:877–885. [PMC free article] [PubMed]
41. Auerbach R. K, Euskirchen G, Rozowsky J, Lamarre-Vincent N, Moqtaderi Z, et al. Mapping accessible chromatin regions using Sono-Seq. Proc Natl Acad Sci U S A. 2009;106:14926–14931. [PMC free article] [PubMed]
42. Kouzarides T. Chromatin modifications and their function. Cell. 2007;128:693–705. [PubMed]
43. Bernstein B. E, Meissner A, Lander E. S. The mammalian epigenome. Cell. 2007;128:669–681. [PubMed]
44. Heintzman N. D, Stuart R. K, Hon G, Fu Y, Ching C, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007;39:311–318. [PubMed]
45. Liang G, Lin J. C, Wei V, Yoo C, Cheng J, et al. Distinct localization of histone H3 acetylation and H3-K4 methylation to the transcription start sites in the human genome. Proc Natl Acad Sci U S A. 2004;101:7357–7362. [PMC free article] [PubMed]
46. Schneider R, Bannister A. J, Myers F. A, Thorne A. W, Crane-Robinson C, Kouzarides T. Histone H3 lysine 4 methylation patterns in higher eukaryotic genes. Nat Cell Biol. 2004;6:73–77. [PubMed]
47. Bernstein B. E, Kamal M, Lindblad-Toh K, Bekiranov S, Bailey D. K, et al. Genomic maps and comparative analysis of histone modifications in human and mouse. Cell. 2005;120:169–181. [PubMed]
48. Boyer L. A, Plath K, Zeitlinger J, Brambrink T, Medeiros L. A, et al. Polycomb complexes repress developmental regulators in murine embryonic stem cells. Nature. 2006;441:349–353. [PubMed]
49. Lee T. I, Jenner R. G, Boyer L. A, Guenther M. G, Levine S. S, et al. Control of developmental regulators by Polycomb in human embryonic stem cells. Cell. 2006;125:301–313. [PMC free article] [PubMed]
50. Bernstein B. E, Mikkelsen T. S, Xie X, Kamal M, Huebert D. J, et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006;125:315–326. [PubMed]
51. Strahl B. D, Allis C. D. The language of covalent histone modifications. Nature. 2000;403:41–45. [PubMed]
52. Sabo P. J, Hawrylycz M, Wallace J. C, Humbert R, Yu M, et al. Discovery of functional noncoding elements by digital analysis of chromatin structure. Proc Natl Acad Sci U S A. 2004;101:16837–16842. [PMC free article] [PubMed]
53. Boyle A. P, Davis S, Shulha H. P, Meltzer P, Margulies E. H, et al. High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008;132:311–322. [PMC free article] [PubMed]
54. Sabo P. J, Kuehn M. S, Thurman R, Johnson B. E, Johnson E. M, et al. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods. 2006;3:511–518. [PubMed]
55. Hesselberth J. R, Chen X, Zhang Z, Sabo P. J, Sandstrom R, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nat Methods. 2009;6:283–289. [PMC free article] [PubMed]
56. Sekimata M, Pérez-Melgosa M, Miller S. A, Weinmann A. S, Sabo P. J, et al. CCCTC-binding factor and the transcription factor T-bet orchestrate T helper 1 cell-specific structure and function at the interferon-gamma locus. Immunity. 2009;31:551–564. [PMC free article] [PubMed]
57. Giresi P. G, Lieb J. D. Isolation of active regulatory elements from eukaryotic chromatin using FAIRE (Formaldehyde Assisted Isolation of Regulatory Elements). Methods. 2009;48:233–239. [PMC free article] [PubMed]
58. Gaulton K. J, Nammo T, Pasquali L, Simon J. M, Giresi P. G, et al. A map of open chromatin in human pancreatic islets. Nat Genet. 2010;42:255–259. [PMC free article] [PubMed]
59. Barski A, Cuddapah S, Cui K, Roh T. Y, Schones D. E, et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837. [PubMed]
60. Johnson D. S, Mortazavi A, Myers R. M, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316:1497–1502. [PubMed]
61. Mikkelsen T. S, Ku M, Jaffe D. B, Issac B, Lieberman E, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007;448:553–560. [PMC free article] [PubMed]
62. Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007;4:651–657. [PubMed]
63. Poser I, Sarov M, Hutchins J. R, Hériché J. K, Toyoda Y, et al. BAC TransgeneOmics: a high-throughput method for exploration of protein function in mammals. Nat Methods. 2008;5:409–415. [PMC free article] [PubMed]
64. Hua S, Kittler R, White K. P. Genomic antagonism between retinoic acid and estrogen signaling in breast cancer. Cell. 2009;137:1259–1271. [PMC free article] [PubMed]
65. Raha D, Hong M, Snyder M. ChIP-seq: a method for global identification of regulatory elements in the genome. Curr Protoc Mol Biol. 2010;Chapter 21:Unit 21.19.1–14. [PubMed]
66. Kim T. H, Abdullaev Z. K, Smith A. D, Ching K. A, Loukinov D. I, et al. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell. 2007;128:1231–1245. [PMC free article] [PubMed]
67. Jaenisch R. DNA methylation and imprinting: Why bother? Trends Genet. 1997;13:323–329. [PubMed]
68. Bird A. DNA methylation patterns and epigenetic memory. Genes Dev. 2002;16:6–21. [PubMed]
69. Noushmehr H, Weisenberger D. J, Diefes K, Phillips H. S, Pujara K, et al. Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma. Cancer Cell. 2010;17:510–522. [PMC free article] [PubMed]
70. Teschendorff A. E, Menon U, Gentry-Maharaj A, Ramus S. J, Weisenberger D. J, et al. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 2010;20:440–446. [PMC free article] [PubMed]
71. Laurent L, Wong E, Li G, Tsirigos A, Ong C. T, et al. Dynamic changes in the human methylome during differentiation. Genome Res. 2010;20:320–331. [PMC free article] [PubMed]
72. Rakyan V. K, Down T. A, Maslau S, Andrew T, Yang T. P, et al. Human aging-associated DNA hypermethylation occurs preferentially at bivalent chromatin domains. Genome Res. 2010;20:434–439. [PMC free article] [PubMed]
73. Meissner A, Mikkelsen T. S, Gu H, Wernig M, Hanna J, et al. Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature. 2008;454:766–770. [PMC free article] [PubMed]
74. Brunner A. L, Johnson D. S, Kim S. W, Valouev A, Reddy T. E, et al. Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver. Genome Res. 2009;19:1044–1056. [PMC free article] [PubMed]
75. Galas D. J, Schmitz A. DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 1978;5:3157–3170. [PMC free article] [PubMed]
76. Strauss E. C, Orkin S. H. In vivo protein-DNA interactions at hypersensitive site 3 of the human beta-globin locus control region. Proc Natl Acad Sci U S A. 1992;89:5809–5813. [PMC free article] [PubMed]
77. Boyle A. P, Song L, Lee B. K, London D, Keefe D, et al. High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Res. 2011 in press. [PMC free article] [PubMed]
78. Pique-Regi R, Degner J. F, Pai A. A, Gaffney D. J, Gilad Y, Pritchard J. K. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 2011 in press. [PMC free article] [PubMed]
79. Lu Y, Dimasi D. P, Hysi P. G, Hewitt A. W, Burdon K. P, et al. Common genetic variants near the Brittle Cornea Syndrome locus ZNF469 influence the blinding disease risk factor central corneal thickness. PLoS Genet. 2010;6:e1000947. doi: 10.1371/journal.pgen.1000947. [PMC free article] [PubMed]
80. McDaniell R, Lee B. K, Song L, Liu Z, Boyle A. P, et al. Heritable individual-specific and allele-specific chromatin signatures in humans. Science. 2010;328:235–239. [PMC free article] [PubMed]
81. Kasowski M, Grubert F, Heffelfinger C, Hariharan M, Asabere A, et al. Variation in transcription factor binding among humans. Science. 2010;328:232–235. [PMC free article] [PubMed]
82. Redon R, Ishikawa S, Fitch K. R, Feuk L, Perry G. H, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. [PMC free article] [PubMed]
83. Korbel J. O, Urban A. E, Grubert F, Du J, Royce T. E, et al. Systematic prediction and validation of breakpoints associated with copy-number variants in the human genome. Proc Natl Acad Sci U S A. 2007;104:10110–10115. [PMC free article] [PubMed]
84. Conrad D. F, Bird C, Blackburne B, Lindsay S, Mamanova L, et al. Mutation spectrum revealed by breakpoint sequencing of human germline CNVs. Nat Genet. 2010;42:385–391. [PMC free article] [PubMed]
85. Miele A, Dekker J. Long-range chromosomal interactions and gene regulation. Mol Biosyst. 2008;4:1046–1057. [PMC free article] [PubMed]
86. Dostie J, Richmond T. A, Arnaout R. A, Selzer R. R, Lee W. L, et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res. 2006;16:1299–1309. [PMC free article] [PubMed]
87. Dekker J, Rippe K, Dekker M, Kleckner N. Capturing chromosome conformation. Science. 2002;295:1306–1311. [PubMed]
88. Lajoie B. R, van Berkum N. L, Sanyal A, Dekker J. My5C: web tools for chromosome conformation capture studies. Nat Methods. 2009;6:690–691. [PMC free article] [PubMed]
89. Baroni T. E, Chittur S. V, George A. D, Tenenbaum S. A. Advances in RIP-chip analysis: RNA-binding protein immunoprecipitation-microarray profiling. Methods Mol Biol. 2008;419:93–108. [PubMed]
90. Keene J. D, Komisarow J. M, Friedersdorf M. B. RIP-Chip: the isolation and identification of mRNAs, microRNAs and protein components of ribonucleoprotein complexes from cell extracts. Nat Protoc. 2006;1:302–307. [PubMed]
91. Tenenbaum S. A, et al. Identifying mRNA subsets in messenger ribonucleoprotein complexes by using cDNA arrays. Proc Natl Acad Sci U S A. 2000;97:14085–14090. [PMC free article] [PubMed]
92. Tenenbaum S. A, Lager P. J, Carson C. C, Keene J. D. Ribonomics: identifying mRNA subsets in mRNP complexes using antibodies to RNA-binding proteins and genomic arrays. Methods. 2002;26:191–198. [PubMed]
93. Trinklein N. D, Karaöz U, Wu J, Halees A, Force Aldred S, et al. Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome. Genome Res. 2007;17:720–731. [PMC free article] [PubMed]
94. Lin J. M, Collins P. J, Trinklein N. D, Fu Y, Xi H, et al. Transcription factor binding and histone modifications in human bidirectional promoters. Genome Res. 2007;17:818–827. [PMC free article] [PubMed]
95. Collins P. J, Kobayashi Y, Nguyen L, Trinklein N. D, Myers R. M. The ets-related transcription factor GABP directs bidirectional transcription. PLoS Genet. 2007;3(11):e208. doi: 10.1371/journal.pgen.0030208. [PMC free article] [PubMed]
96. Petrykowska H. M, Vockley C. M, Elnitski L. Detection and characterization of silencers and enhancer-blockers in the greater CFTR locus. Genome Res. 2008;18:1238–1246. [PMC free article] [PubMed]
97. Landolin J. M, Johnson D. S, Trinklein N. D, Aldred S. F, Medina C, et al. Sequence features that drive human promoter function and tissue specificity. Genome Res. 2010;20:890–898. [PMC free article] [PubMed]
98. Khatun J, Hamlett E, Giddings M. C. Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics. 2008;24:674–681. [PMC free article] [PubMed]
99. Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. Enredo and Pecan: genome-wide mammalian consistency-based multiple alignment with paralogs. Genome Res. 2008a;18:1814–1828. [PMC free article] [PubMed]
100. Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P. Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res. 2008b;18:1829–1843. [PMC free article] [PubMed]
101. Cooper G. M, Stone E. A, Asimenos G, Green E. D, et al. NISC Comparative Sequencing Program. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901–913. [PMC free article] [PubMed]
102. Asthana S, Roytberg M, Stamatoyannopoulos J, Sunyaev S. Analysis of sequence conservation at nucleotide resolution. PLoS Comput Biol. 2007;3:e254. doi: 10.1371/journal.pcbi.0030254. [PMC free article] [PubMed]
103. Pollard K. S, Hubisz M. J, Rosenbloom K. R, Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 2010;20:110–121. [PMC free article] [PubMed]
104. Wu J. Q, Habegger L, Noisa P, Szekely A, Qiu C, et al. Dynamic transcriptomes during neural differentiation of human embryonic stem cells revealed by short, long, and paired-end sequencing. Proc Natl Acad Sci U S A. 2010;107:5254–5259. [PMC free article] [PubMed]
105. Rozowsky J, Euskirchen G, Auerbach R. K, Zhang Z. D, Gibson T, et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009;27:66–75. [PMC free article] [PubMed]
106. Birney E, Hudson T. J, Green E. D, Gunter C, et al. Toronto International Data Release Workshop Authors. Prepublication data sharing. Nature. 2009;461:168–170. [PMC free article] [PubMed]
107. Manolio T. A, Collins F. S, Cox N. J, Goldstein D. B, Hindorff L. A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461:747–753. [PMC free article] [PubMed]
108. Hindorff L. A, Sethupathy P, Junkins H. A, Ramos E. M, Mehta J. P, et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362–9367. [PMC free article] [PubMed]
109. Wokolorczyk D, Gliniewicz B, Sikorski A, Zlowocka E, Masojc B, et al. A range of cancers is associated with the rs6983267 marker on chromosome 8. Cancer Res. 2008;68:9982–9986. [PubMed]
110. Curtin K, Lin W. Y, George R, Katory M, Shorto J, et al. Meta association of colorectal cancer confirms risk alleles at 8q24 and 18q21. Cancer Epidemiol Biomarkers Prev. 2009;18:616–621. [PMC free article] [PubMed]
111. Al Olama A. A, Kote-Jarai Z, Giles G. G, Guy M, Morrison J, et al. Multiple loci on 8q24 associated with prostate cancer susceptibility. Nat Genet. 2009;41:1058–1060. [PubMed]
112. Jia L, Landan G, Pomerantz M, Jaschek R, Herman P, et al. Functional enhancers at the gene-poor 8q24 cancer-linked locus. PLoS Genet. 2009;5:e1000597. doi: 10.1371/journal.pgen.1000597. [PMC free article] [PubMed]
113. Pomerantz M. M, Ahmadiyeh N, Jia L, Herman P, Verzi M. P, et al. The 8q24 cancer risk variant rs6983267 shows long-range interaction with MYC in colorectal cancer. Nat Genet. 2009;41:882–884. [PMC free article] [PubMed]
114. Tuupanen S, Turunen M, Lehtonen R, Hallikas O, Vanharanta S, et al. The common colorectal cancer predisposition SNP rs6983267 at chromosome 8q24 confers potential to enhanced Wnt signaling. Nat Genet. 2009;41:885–890. [PubMed]
115. Wright J. B, Brown S. J, Cole M. D. Upregulation of c-MYC in cis through a large chromatin loop linked to a cancer risk-associated single-nucleotide polymorphism in colorectal cancer cells. Mol Cell Biol. 2010;30:1411–1420. [PMC free article] [PubMed]
116. Kidd J. M, Sampas N, Antonacci F, Graves T, Fulton R, et al. Characterization of missing human genome sequences and copy-number polymorphic insertions. Nat Methods. 2010;7:365–371. [PMC free article] [PubMed]
117. Li Q, Brown J. B, Huang H, Bickel P. Measuring reproducibility of high-throughput experiments. Annals of Applied Statistics. 2011 in press.
118. Vaquerizas J. M, Kummerfeld S. K, Teichmann S. A, Luscombe N. M. A census of human transcription factors: function, expression and evolution. Nat Rev Genet. 2009;10:252–263. [PubMed]
119. Nelson S. B, Sugino K, Hempel C. M. The problem of neuronal cell types: a physiological genomics approach. Trends Neurosci. 2006;29:339–345. [PubMed]
120. Reisman D, Bálint É, Loging W. T, Rotter V, Almon E. A novel transcript encoded within the 10-kb first intron of the human p53 tumor suppressor gene (D17S2179E) is induced during differentiation of myeloid leukemia cells. Genomics. 1996;38:364–370. [PubMed]
121. Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T. Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat Genet. 2004;36:40–45. [PubMed]

Articles from PLoS Biology are provided here courtesy of Public Library of Science