Nature. 2012 Sep 6;489(7414):57-74. doi: 10.1038/nature11247.
An integrated encyclopedia of DNA elements in the human genome.
Dunham I, Kundaje A, Aldred SF, Collins PJ, Davis CA, Doyle F, Epstein CB, Frietze S, Harrow J, Kaul R, Khatun J, Lajoie BR, Landt SG, Lee BK, Pauli F, Rosenbloom KR, Sabo P, Safi A, Sanyal A, Shoresh N, Simon JM, Song L, Trinklein ND, Altshuler RC, Birney E, Brown JB, Cheng C, Djebali S, Dong X, Dunham I, Ernst J, Furey TS, Gerstein M, Giardine B, Greven M, Hardison RC, Harris RS, Herrero J, Hoffman MM, Iyer S, Kellis M, Khatun J, Kheradpour P, Kundaje A, Lassmann T, Li Q, Lin X, Marinov GK, Merkel A, Mortazavi A, Parker SC, Reddy TE, Rozowsky J, Schlesinger F, Thurman RE, Wang J, Ward LD, Whitfield TW, Wilder SP, Wu W, Xi HS, Yip KY, Zhuang J, Pazin MJ, Lowdon RF, Dillon LA, Adams LB, Kelly CJ, Zhang J, Wexler JR, Green ED, Good PJ, Feingold EA, Bernstein BE, Birney E, Crawford GE, Dekker J, Elnitski L, Farnham PJ, Gerstein M, Giddings MC, Gingeras TR, Green ED, Guigó R, Hardison RC, Hubbard TJ, Kellis M, Kent W, Lieb JD, Margulies EH, Myers RM, Snyder M, Stamatoyannopoulos JA, Tenenbaum SA, Weng Z, White KP, Wold B, Khatun J, Yu Y, Wrobel J, Risk BA, Gunawardena HP, Kuiper HC, Maier CW, Xie L, Chen X, Giddings MC, Bernstein BE, Epstein CB, Shoresh N, Ernst J, Kheradpour P, Mikkelsen TS, Gillespie S, Goren A, Ram O, Zhang X, Wang L, Issner R, Coyne MJ, Durham T, Ku M, Truong T, Ward LD, Altshuler RC, Eaton ML, Kellis M, Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Batut P, Bell I, Bell K, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Fastuca M, Fejes-Toth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena HP, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Li G, Luo OJ, Park E, Preall JB, Presaud K, Ribeca P, Risk BA, Robyr D, Ruan X, Sammeth M, Sandhu KS, Schaeffer L, 
See LH, Shahab A, Skancke J, Suzuki AM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Hayashizaki Y, Harrow J, Gerstein M, Hubbard TJ, Reymond A, Antonarakis SE, Hannon GJ, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R, Gingeras TR, Rosenbloom KR, Sloan CA, Learned K, Malladi VS, Wong MC, Barber GP, Cline MS, Dreszer TR, Heitner SG, Karolchik D, Kent W, Kirkup VM, Meyer LR, Long JC, Maddren M, Raney BJ, Furey TS, Song L, Grasfeder LL, Giresi PG, Lee BK, Battenhouse A, Sheffield NC, Simon JM, Showers KA, Safi A, London D, Bhinge AA, Shestak C, Schaner MR, Kim SK, Zhang ZZ, Mieczkowski PA, Mieczkowska JO, Liu Z, McDaniell RM, Ni Y, Rashid NU, Kim MJ, Adar S, Zhang Z, Wang T, Winter D, Keefe D, Birney E, Iyer VR, Lieb JD, Crawford GE, Li G, Sandhu KS, Zheng M, Wang P, Luo OJ, Shahab A, Fullwood MJ, Ruan X, Ruan Y, Myers RM, Pauli F, Williams BA, Gertz J, Marinov GK, Reddy TE, Vielmetter J, Partridge E, Trout D, Varley KE, Gasper C, Bansal A, Pepke S, Jain P, Amrhein H, Bowling KM, Anaya M, Cross MK, King B, Muratet MA, Antoshechkin I, Newberry KM, McCue K, Nesmith AS, Fisher-Aylor KI, Pusey B, DeSalvo G, Parker SL, Balasubramanian S, Davis NS, Meadows SK, Eggleston T, Gunter C, Newberry J, Levy SE, Absher DM, Mortazavi A, Wong WH, Wold B, Blow MJ, Visel A, Pennachio LA, Elnitski L, Margulies EH, Parker SC, Petrykowska HM, Abyzov A, Aken B, Barrell D, Barson G, Berry A, Bignell A, Boychenko V, Bussotti G, Chrast J, Davidson C, Derrien T, Despacio-Reyes G, Diekhans M, Ezkurdia I, Frankish A, Gilbert J, Gonzalez JM, Griffiths E, Harte R, Hendrix DA, Howald C, Hunt T, Jungreis I, Kay M, Khurana E, Kokocinski F, Leng J, Lin MF, Loveland J, Lu Z, Manthravadi D, Mariotti M, Mudge J, Mukherjee G, Notredame C, Pei B, Rodriguez JM, Saunders G, Sboner A, Searle S, Sisu C, Snow C, Steward C, Tanzer A, Tapanari E, Tress ML, van Baren MJ, Walters N, Washietl S, Wilming L, Zadissa A, Zhang Z, Brent M, Haussler D, Kellis M, Valencia A, Gerstein M, Reymond A, 
Guigó R, Harrow J, Hubbard TJ, Landt SG, Frietze S, Abyzov A, Addleman N, Alexander RP, Auerbach RK, Balasubramanian S, Bettinger K, Bhardwaj N, Boyle AP, Cao AR, Cayting P, Charos A, Cheng Y, Cheng C, Eastman C, Euskirchen G, Fleming JD, Grubert F, Habegger L, Hariharan M, Harmanci A, Iyengar S, Jin VX, Karczewski KJ, Kasowski M, Lacroute P, Lam H, Lamarre-Vincent N, Leng J, Lian J, Lindahl-Allen M, Min R, Miotto B, Monahan H, Moqtaderi Z, Mu XJ, O'Geen H, Ouyang Z, Patacsil D, Pei B, Raha D, Ramirez L, Reed B, Rozowsky J, Sboner A, Shi M, Sisu C, Slifer T, Witt H, Wu L, Xu X, Yan KK, Yang X, Yip KY, Zhang Z, Struhl K, Weissman SM, Gerstein M, Farnham PJ, Snyder M, Tenenbaum SA, Penalva LO, Doyle F, Karmakar S, Landt SG, Bhanvadia RR, Choudhury A, Domanus M, Ma L, Moran J, Patacsil D, Slifer T, Victorsen A, Yang X, Snyder M, Auer T, Centanin L, Eichenlaub M, Gruhl F, Heermann S, Hoeckendorf B, Inoue D, Kellner T, Kirchmaier S, Mueller C, Reinhardt R, Schertel L, Schneider S, Sinn R, Wittbrodt B, Wittbrodt J, Weng Z, Whitfield TW, Wang J, Collins PJ, Aldred SF, Trinklein ND, Partridge EC, Myers RM, Dekker J, Jain G, Lajoie BR, Sanyal A, Balasundaram G, Bates DL, Byron R, Canfield TK, Diegel MJ, Dunn D, Ebersol AK, Frum T, Garg K, Gist E, Hansen R, Boatman L, Haugen E, Humbert R, Jain G, Johnson AK, Johnson EM, Kutyavin TV, Lajoie BR, Lee K, Lotakis D, Maurano MT, Neph SJ, Neri FV, Nguyen ED, Qu H, Reynolds AP, Roach V, Rynes E, Sabo P, Sanchez ME, Sandstrom RS, Sanyal A, Shafer AO, Stergachis AB, Thomas S, Thurman RE, Vernot B, Vierstra J, Vong S, Wang H, Weaver MA, Yan Y, Zhang M, Akey JM, Bender M, Dorschner MO, Groudine M, MacCoss MJ, Navas P, Stamatoyannopoulos G, Kaul R, Dekker J, Stamatoyannopoulos JA, Dunham I, Beal K, Brazma A, Flicek P, Herrero J, Johnson N, Keefe D, Lukk M, Luscombe NM, Sobral D, Vaquerizas JM, Wilder SP, Batzoglou S, Sidow A, Hussami N, Kyriazopoulou-Panagiotopoulou S, Libbrecht MW, Schaub MA, Kundaje A, Hardison RC, Miller W, Giardine 
B, Harris RS, Wu W, Bickel PJ, Banfai B, Boley NP, Brown JB, Huang H, Li Q, Li JJ, Noble WS, Bilmes JA, Buske OJ, Hoffman MM, Sahu AD, Kharchenko PV, Park PJ, Baker D, Taylor J, Weng Z, Iyer S, Dong X, Greven M, Lin X, Wang J, Xi HS, Zhuang J, Gerstein M, Alexander RP, Balasubramanian S, Cheng C, Harmanci A, Lochovsky L, Min R, Mu XJ, Rozowsky J, Yan KK, Yip KY, Birney E.
Abstract
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
Figure 1. Impact of Selection on ENCODE Functional Elements in Mammals and Human Populations
Panel A shows the levels of pan-mammalian constraint (mean GERP score; 24 mammals, x-axis) compared to diversity, a measure of negative selection in the human population (mean expected heterozygosity, inverted scale, y-axis) for ENCODE datasets. Each point is an average for a single dataset. The top right corners have the strongest evolutionary constraint and lowest diversity. Coding (C), UTR (U), genomic (G), intergenic (IG) and intronic (IN) averages are shown as filled squares. In each case the vertical and horizontal cross hairs show representative levels for the neutral expectation for mammalian conservation and human population diversity respectively. Panel A shows the spread over all non-exonic ENCODE elements greater than 2.5 kb from TSSs. The inner dashed box indicates that parts of the plot have been magnified for the surrounding outer panels, although the scales in the outer plots provide the exact regions and dimensions magnified. The spreads for DHS sites (B) and RNA elements (D) are shown in the plots on the left. RNA elements are either long novel intronic (dark green) or long intergenic (light green) RNAs. The horizontal cross hairs are colour coded to the relevant dataset in panel D. Panel C shows the spread of TF motif instances either in regions bound by the TF (orange points) or the corresponding unbound motif matches in grey, with bound and unbound points connected with an arrow in each case showing that bound sites are generally more constrained and less diverse. Panel E shows the derived allele frequency spectrum for primate-specific elements, with variations outside ENCODE elements in black and variations covered by ENCODE elements in red. The increase in low-frequency alleles compared to background is indicative of negative selection occurring in the set of variants annotated by the ENCODE data.
Panel F shows aggregation of mammalian constraint scores over the glucocorticoid receptor (GR) TF motif in bound sites, showing the expected correlation with the information content of bases in the motif.
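As a rough illustration of the diversity axis in panel A, mean expected heterozygosity for a set of elements can be sketched as below. This assumes biallelic sites and the textbook 2p(1 - p) form; the exact estimator used in the paper is not given in this caption, and the function names are hypothetical.

```python
def expected_heterozygosity(allele_freq):
    """Expected heterozygosity 2p(1 - p) at one biallelic site.

    Assumption: Hardy-Weinberg proportions; not necessarily the paper's estimator.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * allele_freq * (1.0 - allele_freq)


def mean_expected_heterozygosity(variant_freqs, n_bases):
    """Average heterozygosity over all bases covered by an element set.

    variant_freqs: allele frequencies at the variable sites only.
    n_bases: total bases covered; invariant bases contribute zero.
    """
    variant_freqs = list(variant_freqs)
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    if len(variant_freqs) > n_bases:
        raise ValueError("more variable sites than bases")
    return sum(expected_heterozygosity(p) for p in variant_freqs) / n_bases
```

A lower value means less diversity, which is why the y-axis is drawn on an inverted scale so that constrained, low-diversity datasets appear towards the top right.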
Figure 2. Modelling Transcription Levels from Histone Modification and TF-Binding Patterns
Panels A and B show the correlative models between either histone modifications or TFs, respectively, and RNA production as measured by CAGE tag density at TSSs in K562. In each case the scatter plot shows the output of the correlation models (x-axis) compared to observed values (y-axis). The bar graphs show the most important histone modifications (A) or TFs (B) in both the initial classification phase (upper bar graph) and the quantitative regression phase (lower bar graph), with larger values indicating increasing importance of the variable in the model. Further analyses of other cell lines and RNA measurement types are reported elsewhere.
Figure 3. Patterns and Asymmetry of Chromatin Modification at Transcription Factor-binding Sites
Panel A shows the results of clustered aggregation of H3K27me3 modification signal around CTCF binding sites (a multi-functional protein involved with chromatin structure). The first three left-most plots show the signal behaviour of the histone modification over all sites (top) and then split into the high and low signal components. The high signal component is then decomposed further into six different shape classes on the right (see ref for details). The shape decomposition process is strand aware. Panel B summarises shape asymmetry for DNase1, nucleosome and histone modification signals by plotting an asymmetry ratio for each signal over all TF binding sites. All histone modifications measured in this study show predominantly asymmetric patterns at TF binding sites.
Figure 4. Co-association between Transcription Factors
Panel A shows significant co-associations of TF pairs using the GSC statistic across the entire genome in K562 cells. The colour strength represents the extent of association (red (strongest) through orange to yellow (weakest)), whereas the depth of colour represents the fit to the GSC model (white meaning that the statistical model is not appropriate) as indicated by the key. The majority of TFs have a non-random association to other TFs, and these associations are dependent on the genomic context, meaning that once the genome is separated into promoter-proximal and distal regions, the overall levels of co-association decrease, but more specific relationships are uncovered. Panel B illustrates three classes of behaviour. The first column shows a set of associations whose strength is independent of location in promoter and distal regions while the second shows a set of TFs which have stronger associations in promoter-proximal regions. Both of these examples are from data in K562 cells and are highlighted on the genome-wide co-association matrix (panel A) by the labelled boxes A and B, respectively. The third column shows a set of TFs that show stronger association in distal regions (in the H1 hESC cell line).
Figure 5. Integration of ENCODE Data by Genome-wide Segmentation
Panel A shows an illustrative region with the two segmentation methods (ChromHMM and Segway) in a dense view and the combined segmentation expanded to show each state in GM12878, beneath a compressed view of the GENCODE gene annotations. Note that at this level of zoom and genome browser resolution, some segments appear to overlap although they do not. Segmentation classes are named and coloured according to the scheme in . Beneath the segmentations are shown each of the normalised signals that were used as the input data for the segmentations. Open chromatin signals from the DNase 1-seq and FAIRE assays are shown in blue, signal from histone modification ChIP-seq in red and TF ChIP-seq signal for Pol II and CTCF in green. The mauve ChIP-seq control signal (“Input control”) at the bottom was also included as an input to the segmentation. Panel B shows the association of selected TF (left) and RNA (right) elements in the combined segmentation states (x-axis) expressed as an observed/expected ratio for each combination of TF or RNA element and segmentation class using the heatmap scale shown in the key beside each heatmap. Panel C shows the variability of states between cell lines, showing the distribution of occurrences of the state in the six cell lines at specific genome locations, from unique to one cell line to ubiquitous in all six cell lines, for five states (CTCF, E, T, TSS, and R). Panel D shows the distribution of the level of methylation at individual sites from RRBS analysis in GM12878 across the different states, showing the expected hypomethylation at TSSs and hypermethylation of gene bodies (T state) and repressed (R) regions.
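The observed/expected ratio in panel B can be sketched as follows. This is a minimal illustration, assuming that the expectation for each segmentation class is proportional to the fraction of the genome that class covers; the caption does not state how the expectation was computed, and the function name and input layout are hypothetical.

```python
def observed_expected_ratios(overlap_counts, state_bases):
    """Observed/expected enrichment of one element type per segmentation state.

    overlap_counts: state -> number of elements falling in that state.
    state_bases: state -> number of genome bases assigned to that state.
    Assumption: expected counts scale with each state's share of the genome.
    """
    total_elements = sum(overlap_counts.values())
    total_bases = sum(state_bases.values())
    if total_bases <= 0:
        raise ValueError("state_bases must cover at least one base")
    ratios = {}
    for state, bases in state_bases.items():
        expected = total_elements * bases / total_bases
        observed = overlap_counts.get(state, 0)
        # A state with zero coverage has no defined expectation.
        ratios[state] = observed / expected if expected > 0 else float("nan")
    return ratios
```

For example, if a TSS-like state covers 10% of the bases but holds 80 of 100 elements, its ratio is 8, while the remaining 90% of the genome holding 20 elements gives a ratio of about 0.22.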
Figure 6. Experimental Characterisation of Segmentations
Randomly sampled E state segments (see ) from the K562 segmentation were cloned for mouse- and fish-based transgenic enhancer assays. Panel A shows a representative LacZ-stained transgenic e11.5 mouse embryo obtained with construct hs2065 (EN167, chr10:46,052,882-46,055,670, GRCh37). Highly reproducible staining in the blood vessels was observed in 9 out of 9 embryos resulting from independent transgenic integration events. Panel B shows a representative green fluorescent protein reporter transgenic medaka fish obtained from a construct with a basal hsp70 promoter on meganuclease based transfection. Reproducible transgenic expression in the circulating nucleated blood cells and the endothelial cell walls was seen in 81 out of 100 transgenic tests of this construct.
Figure 7. High-Resolution Segmentation of ENCODE Data by Self-Organising Maps (SOM)
The training of the self-organising map (panel A) and analysis of the results (panels B and C) are shown. Initially we arbitrarily placed genomic segments from the ChromHMM segmentation on to the toroidal map surface, although the SOM does not use the ChromHMM state assignments (panel A). We then trained the map using the signal of the 12 different ChIP-seq and DNase-seq assays in the six cell types analysed. Each unit of the SOM is represented here by a hexagonal cell in a planar two-dimensional view of the toroidal map. Curved arrows indicate that traversing the edges of the two-dimensional view leads back to the opposite edge. The resulting map can be overlaid with any class of ENCODE or other data to view the distribution of that data within this high-resolution segmentation. In panel A the distributions of genome bases across the untrained and trained map (left and right, respectively) are shown using heatmap colours for log10 values. Panel B shows the distribution of TSSs from CAGE experiments of GENCODE annotation on the planar representations of either the initial random organisation (left) or the final trained SOM (right) using heatmaps coloured according to the accompanying scales. The bottom half of panel B expands the different distributions in the SOM for all expressed TSSs (left) or TSSs specifically expressed in two example cell lines, H1 hESC (centre) and HepG2 (right). Panel C shows the association of Gene Ontology (GO) terms on the same representation of the same trained SOM. We assigned genes that are within 20 kb of a genomic segment in a SOM unit to that unit, and then associated this set of genes with GO terms using a hypergeometric distribution after correcting for multiple testing.
Map units that are significantly associated to GO terms are now coloured green, with increasing strength of colour reflecting increasing numbers of genes significantly associated with the GO terms for either immune response (left) or sequence-specific TF activity (centre). In each case, specific SOM units show association with these terms. The right-hand panel shows the distribution on the same SOM of all significantly associated GO terms, now colouring by GO term count per SOM unit. For sequence-specific TF activity, two example genomic regions are extracted at the bottom of panel C from neighbouring SOM units. These are regions around the DBX1 (from SOM unit 26,31, left panel) and IRX6 (SOM unit 27,30, right panel) genes, respectively, along with their H3K27me3 ChIP-seq signal for each of the Tier 1 and 2 cell types. For DBX1, representative of a set of primarily neuronal TFs associated with unit 26,31, there is a repressive H3K27me3 signal in both H1 hESC and HUVEC cells; for IRX6, representative of a set of body patterning TFs associated with SOM unit 27,30, the repressive mark is restricted largely to the embryonic stem cell.
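The GO-term association described for panel C can be illustrated with a one-sided hypergeometric test. The sketch below is an assumption-laden example, not the authors' pipeline: the caption does not name the multiple-testing correction, so Bonferroni is used purely for illustration, and all function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) for X ~ Hypergeometric(population, annotated, drawn).

    population: all genes considered; annotated: genes carrying the GO term;
    drawn: genes assigned to one SOM unit; k: annotated genes in that unit.
    """
    if not (0 <= annotated <= population and 0 <= drawn <= population):
        raise ValueError("annotated and drawn must lie in [0, population]")
    lower = max(k, 0)
    upper = min(annotated, drawn)
    # math.comb returns 0 when the second argument exceeds the first,
    # so impossible configurations drop out of the sum automatically.
    numerator = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(lower, upper + 1)
    )
    return numerator / comb(population, drawn)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values (illustrative choice of correction)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

In use, one p-value would be computed per (SOM unit, GO term) pair and the whole list adjusted together before colouring units as significant.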
Figure 8. Allele-Specific ENCODE Elements
Panel A shows representative allele-specific information from GM12878 cells for selected assays around the first exon of the NACC2 gene (genomic region chr9:138,950,000-138,995,000, GRCh37). Transcription signal is shown in green, and the three sections show allele-specific data for three datasets (POLR2A, H3K79me2 and H3K27me3 ChIP-seq). In each case the purple signal is the processed signal for all sequence reads for the assay, while the blue and red signals show sequence reads specifically assigned to either the paternal or maternal copies of the genome, respectively. The set of common SNPs from dbSNP, including the phased, heterozygous SNPs used to provide the assignment, is shown at the bottom of the panel. NACC2 has a statistically significant paternal bias for POLR2A and the transcription-associated mark H3K79me2, and has a significant maternal bias for the repressive mark H3K27me3. Panel B shows pairwise correlations of allele-specific signal within single genes (below the diagonal) or within individual ChromHMM segments across the whole genome for selected DNase-seq and histone modification and TF ChIP-seq assays. The extent of correlation is coloured according to the heatmap scale indicated from positive correlation (red) through to anti-correlation (blue).
Figure 9. Examining ENCODE Elements on a per individual basis in the Normal and Cancer Genome
Panel A shows the breakdown of variants in a single genome (NA12878) by both frequency (common or rare (i.e., variants not present in the low-coverage sequencing of 179 individuals in the pilot 1 European panel of the 1000 Genomes project)) and by ENCODE annotation, including protein-coding gene and non-coding elements (GENCODE annotations for protein-coding genes, pseudogenes, and other ncRNAs, as well as TF-binding sites from ChIP-seq datasets, excluding broad annotations such as histone modifications, segmentations, and RNA-seq). Annotation status is further subdivided by predicted functional effect, being non-synonymous and missense mutations for protein-coding regions and variants overlapping bound TF motifs for non-coding element annotations. A substantial proportion of variants are annotated as having predicted functional effects in the non-coding category. Panel B shows one of several relatively rare occurrences, where alignment to an individual genome sequence (paternal and maternal panels) shows a different readout from the reference genome. In this case, a paternal haplotype-specific CTCF peak is identified. Panel C shows the relative level of somatic variants from a whole-genome melanoma sample that occur in DHSs unique to different cell lines. The coloured bars show cases that are significantly enriched or suppressed in somatic mutations. Details of ENCODE cell types can be found at http://encodeproject.org/ENCODE/cellTypes.html.
Figure 10. Comparison of Genome-wide Association Study-identified Loci with ENCODE Data
Panel A shows overlap of lead SNPs in the NHGRI GWAS SNP catalog (June 2011) with DHSs (left) or TF-binding sites (right) as red bars compared to various control SNP sets in blue. The control SNP sets are: SNPs on the Illumina 2.5M chip as an example of a widely used GWAS SNP typing panel; SNPs from the 1000 Genomes project; and SNPs extracted from 24 personal genomes (see Personal Genome Variants track at http://main.genome-browser.bx.psu.edu), all shown as blue bars. In addition, a further control utilised 1,000 randomisations from the genotyping SNP panel, matching the SNPs with each NHGRI catalog SNP for allele frequency and distance to the nearest TSS (light blue bars with bounds at 1.5 times the interquartile range, and any outliers beyond shown as circles). For both DHSs and TF-binding regions, a larger proportion of overlaps with GWAS-implicated SNPs is found compared to any of the control sets. Panel B shows the aggregate overlap of phenotypes to selected TF-binding sites (left matrix) or DHSs in selected cell lines (right matrix), with a count of overlaps between the phenotype and the cell line/factor. Values in green squares pass an empirical p-value threshold <=0.01 (based on the same analysis of overlaps between randomly chosen, GWAS-matched SNPs and these epigenetic features) and have at least a count of 3 overlaps. The p-value for the total number of phenotype-TF associations is <0.001. Panel C shows several SNPs associated with Crohn’s disease and other inflammatory diseases that reside in a large gene desert on chromosome 5, along with some epigenetic features suggestive of function. The SNP (rs11742570) strongly associated to Crohn’s disease overlaps a GATA2 TF binding signal determined in HUVEC cells. This region is also DNaseI hypersensitive in HUVEC and T-helper Th1 and Th2 cells.
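The empirical p-values in panel B compare an observed overlap count with counts from randomly chosen, matched SNP sets. A minimal sketch of that kind of calculation is given below; the add-one estimator is a common convention assumed here rather than something stated in the caption, and the function name is hypothetical.

```python
def empirical_p_value(observed_overlaps, null_overlaps):
    """One-sided empirical p-value for an observed overlap count.

    observed_overlaps: overlaps between phenotype SNPs and a feature set.
    null_overlaps: overlap counts from matched random SNP sets.
    Assumption: add-one estimator, so the p-value is never exactly zero.
    """
    null_overlaps = list(null_overlaps)
    if not null_overlaps:
        raise ValueError("at least one randomisation is required")
    as_extreme = sum(1 for count in null_overlaps if count >= observed_overlaps)
    return (as_extreme + 1) / (len(null_overlaps) + 1)
```

With this estimator, the smallest attainable p-value is 1 / (n + 1) for n randomisations, so the number of randomisations bounds how stringent a threshold can be tested.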