Nucleic Acids Res. Jan 2009; 37(Database issue): D690–D697.
Published online Nov 25, 2008. doi:  10.1093/nar/gkn828
PMCID: PMC2686571

Ensembl 2009

Abstract

The Ensembl project (http://www.ensembl.org) is a comprehensive genome information system featuring an integrated set of genome annotation, databases, and other information for chordate, selected model organism and disease vector genomes. As of release 51 (November 2008), Ensembl fully supports 45 species, and three additional species have preliminary support. New species in the past year include orangutan and six additional low coverage mammalian genomes. Major additions and improvements to Ensembl since our previous report include a major redesign of our website; generation of multiple genome alignments and ancestral sequences using the new Enredo-Pecan-Ortheus pipeline and development of our software infrastructure, particularly to support the Ensembl Genomes project (http://www.ensemblgenomes.org/).

INTRODUCTION

The genome sequence of an organism provides a natural index for organizing and understanding biological data. The Ensembl project provides a comprehensive genome information system consisting of data storage, integration, analysis and visualization of a wide variety of biological data. Ensembl's primary focus is around providing gene annotation and comparative genome integration for chordate genomes, the vast majority of which are vertebrates. Ensembl concentrates particularly on mammalian genomes having developed initially around the human genome sequence. In comparison to similar projects based at the University of California Santa Cruz (1) and the National Center for Biotechnology Information (2), some of the distinguishing characteristics of the Ensembl project are:

  1. It provides consistent sets of annotation data within and between genomes:
    – It provides a geneset for each genome, generated from an automatic pipeline where no manually curated geneset exists, with stable identifiers which are tracked between Ensembl releases.
    – It provides relationships between genes and genomes in a comparative genomics framework in the form of sequence alignments, ortholog and paralog assignments and genetrees, again generated from an automatic pipeline where no manually curated relationships exist.
  2. It is a completely open project, not only through providing downloads of all data and software source code, but through multiple levels of programmatic access:
    – It allows its database system to be programmed against using the Ensembl API, a powerful object oriented representation of biological entities (e.g. genes) implemented in the Perl programming language.
    – It allows its genome annotations, alignments, variation and functional genomics data to be dynamically federated with external data sources via the DAS protocol (3,4) and visualized through many of its website interfaces (http://www.ensembl.org/).
    – It allows its datasets to be dynamically federated with external datasets for data mining using the BioMart system (5).

The Ensembl project is now being joined by the Ensembl Genomes project (http://www.ensemblgenomes.org/), which will use Ensembl technology to ultimately provide a common interface to genomes across biology.

A continuing driver for developments in Ensembl is its active involvement in many data generation and analysis projects. Recent examples have been the Rat haplotype project (6–8) and the ENCODE project (9). Dealing with data generated by the ENCODE project in particular has led to the development of specific algorithms for experimental data handling, such as approaches for designing and assessing whole genome tiling arrays (10). Ensembl has continued to be strongly involved in analysis for publications of new vertebrate genome sequences, particularly through its genesets (11–13) (see below).

This report lists only some of the new features, new data and other improvements that we have added to Ensembl since our last report (14). Users interested in the most up-to-date details of the Ensembl project should visit the Ensembl main page (http://www.ensembl.org) and follow the ‘What's new’ link and/or subscribe to the low-volume ‘Ensembl announce’ mailing list by sending an email with ‘subscribe ensembl-announce’ as the message body to majordomo@ebi.ac.uk. There is also an Ensembl blog (http://ensembl.blogspot.com/) and associated RSS feeds which in particular cover upcoming Ensembl training courses around the world (see below). Users with questions about Ensembl can consult the extensive online help, FAQ and tutorial materials (15) (including animated tutorials) or contact the Ensembl helpdesk through the website or by emailing helpdesk@ensembl.org.

RESULTS

Ensembl Web site redesign

The majority of users access Ensembl through its web interface, making it a critical component of the project. It is generally recognized that major factors influencing website usability are speed and discoverability. As websites grow and their underlying databases become more complex, individual web pages tend to become larger, more complex and slower to display, and it becomes harder for users to discover new functionality and navigate to the pages most appropriate to their query. The case of Ensembl is no different: the data contained in its databases is presented in different ways through a number of different ‘views’, which have been progressively added as the project has developed, starting with relatively straightforward views summarizing information about a given gene, or displaying a region of genome sequence (16), to increasingly complex views such as TranscriptSNPView (17) showing sequence variation within a given transcript across a set of strains or individuals. At the same time, the amount of data contained in many views has grown; for example, the increased number of species has greatly added to the data presented in views containing comparative genome information. It is not straightforward to identify bottlenecks for users in web based systems. For example, analysing web log files does not easily distinguish between web pages which are of interest to a limited number of users and pages which most users have not discovered. Perceived web site performance can also be very different for different users as a result of different browsers, desktop machines and network speeds. Since the last report (14) considerable effort has been invested in understanding and addressing these issues, culminating in a substantially redesigned and reengineered website from Ensembl release 51 (November 2008).

In the new design (release 51) the web-code has been completely re-developed with improved speed as a high priority. The changes result in substantially smaller web pages which load much faster. A single page now requires far fewer network connections to the web servers, which substantially improves performance for users distant from the web servers. This has been achieved through the adoption of standards compliant HTML, JavaScript and CSS; a more streamlined use of AJAX (Asynchronous JavaScript and XML) to include additional content; the incorporation of shared memory caching using memcached (http://www.danga.com/memcached/); and optimized Apache web server settings to improve browser performance. To enable the project to prioritize improvements and measure their impact on speed, a system for continuous automated monitoring of the response speed of the Ensembl website from more than ten sites around the world was developed and deployed in early 2008.
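The caching idea mentioned above can be illustrated with a minimal sketch of the general look-aside pattern: a rendered page fragment is looked up in a cache first and is only regenerated on a miss or after expiry. This is not the Ensembl web-code (which is not written in Python); the class `TTLCache`, the function `render_cached`, the cache key and the expiry time are all hypothetical, and an in-process dictionary stands in for a shared memcached server.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process stand-in for a shared cache such as memcached (assumption)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def render_cached(cache: TTLCache, key: str,
                  render: Callable[[], str]) -> str:
    """Return the cached fragment for `key`, rendering it only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    html = render()  # the expensive step (database queries, drawing, ...)
    cache.set(key, html)
    return html
```

With a shared cache, a fragment rendered for one request can be reused by later requests handled by other server processes, which is the property an in-process dictionary cannot provide and the reason a separate cache service is used in practice.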

In parallel with the redevelopment of the underlying web-code, the website has been redesigned to improve navigability and discoverability (Figure 1). The new design organizes different views into four classes: Location, Gene, Transcript and Variation, which can be easily navigated between through tabs at the top of each web page. The location class includes views of the genome sequence at a range of resolutions and genome sequence based comparative views (Figure 1A). Gene based views include textual information about the gene, views of its local genomic environment, views of the gene in the context of its ortholog and paralog relationships with other genomes in the Ensembl system and views of sequence variation within that population (Figure 1B). Transcript based views are similar to the gene based ones, but focus around individual transcript structures with more detail (Figure 1C). Variation based views display information focused around individual SNPs (data not shown). Information presented in a single view in previous versions of Ensembl is now presented as separate smaller views in the new design. The relationship between these new views is clearly shown by the left hand hierarchical menu, which is context specific for each class. Each view within a class has a common header panel, summarizing the location or object. Clear and easy navigation between views is provided through the left hand menu and the left and right buttons below the header panel. Since only a specific chunk of information is shown in each view, pages are easier to read and the servers respond more quickly. Configuration controls have been considerably improved and now take the form of a context specific pop-up panel for most views, e.g. allowing tracks to be enabled and disabled in genome sequence based display elements. The same panel contains controls to allow external data to be uploaded into Ensembl, or for external data sources to be federated (DAS).

Figure 1.
Screenshots of the Release 51 Ensembl website illustrating the principles of the new design and some of the new features. The figure shows an example of three of the four classes of display view using human gene SLC24A5 as the context. (A) An example ...

The ideas for the new design were developed and tested through extensive interactions with users, including one to one sessions, testing sessions of design mock ups and web-based questionnaires. Questions investigated preferences between alternative overall layouts (e.g. use of tabs/left hand menu bars) as well as detailed behaviour such as the preference for a consistent name for the protein product of a transcript (translation, peptide, protein). The results of these surveys have led to a design which is user driven and is significantly different from the one we had initially planned. We will be maintaining a user panel to help in guiding interface development.

New species and improved gene annotations

In the past year, seven new species (all mammals) were added to Ensembl including one new high coverage genome, Pongo pygmaeus abelii (orangutan), and six new low coverage genomes [Pteropus vampyrus (megabat), Tursiops truncatus (dolphin), Tarsius syrichta (Philippine tarsier), Lama pacos (alpaca), Dipodomys ordii (kangaroo rat) and Procavia capensis (rock hyrax)]. Ensembl now supports 19 low coverage 2× genome sequences, the majority generated as part of the Mammalian Genome Project (http://www.broad.mit.edu/node/296). So far only one of the original 2× genomes, Cavia porcellus (guinea pig), has been upgraded to high coverage (6.8×). Together with the other 13 high coverage mammalian genomes, Ensembl contains a total of 32 mammals, making it an extensive resource for mammalian comparative genomics. In total Ensembl now supports 48 genomes, 41 of which are vertebrates.

One of the major goals of Ensembl is to provide genesets which are as accurate and complete as possible and these continue to be used as reference genesets in analysis of new vertebrate genomes. Recent genome publications based on Ensembl genesets include those of the platypus Ornithorhynchus anatinus (11), the opossum Monodelphis domestica (12) and the rhesus macaque Macaca mulatta (13). The gene build process is based on alignments of protein and cDNA sequences and there is continuous work to improve it and generate updated, more accurate and complete genesets. Different gene build strategies are used depending on the assembly, quality of the genome, its distance to high quality genomes and the extent of its organism-specific transcript evidence as has been previously described (18). This year one focus has been to develop a systematic post gene build comparative analysis process (using the Ensembl compara homology pipeline) to identify initial gene structures that appear to be evolutionarily inconsistent. These regions are then subject to a second, more computationally expensive localized gene build pipeline with more sensitive parameters. The major classes of problems identified are split genes, missing orthologous genes, partially predicted genes and false exons. For the test case of the horse genome with initially 20 322 gene models, this post-processing pipeline identified 236 genes that were split; added 1013 genes that had initially been missed, but for which there were orthologs; extended 1330 partially predicted genes and removed 840 false exons. The process is now being systematically applied to other high coverage mammalian genomes. These genesets will be patched in subsequent Ensembl releases.

The other major focus has been the ongoing improvement of the human geneset in collaboration with other groups. Ensembl, together with the Sanger Institute HAVANA group (19), is part of multiple collaborations to refine the human geneset including the CCDS (Consensus Coding Sequence) consortium, with RefSeq at NCBI (20) and UCSC (1), and the new ENCODE scale-up project GENCODE (http://www.sanger.ac.uk/encode/) with multiple collaborators. CCDS (http://www.ncbi.nlm.nih.gov/CCDS/) is a stable set of protein coding gene structures for which all consortium members agree to the base pair. Since our previous report (14) the human CCDS set has increased from 18 290 to 20 159 CDSs, which represents an increase from 16 003 to 17 052 genes with at least one CCDS entry. There is also a CCDS set for mouse, which has increased even more, from 13 374 to 17 707 CDSs and from 13 014 to 16 889 genes. GENCODE builds on CCDS to validate additional transcripts and extend into UTR regions, building on the ENCODE pilot project (9,21–23) and incorporating additional computational and experimental input and validation (24). One new computational approach, which is being built on within GENCODE, is to use alignments across the many mammalian genomes now available to evaluate the conservation of putative coding sequences (25). Several hundred transcript predictions generated by the Ensembl gene build pipeline which were found to have low scores in this analysis have been identified as spurious and are now filtered out. The Ensembl/HAVANA collaboration includes further efforts to improve geneset consistency, such as tighter links with UniProt (26) and input into the Genome Reference Consortium (http://www.sanger.ac.uk/sequencing/grc/) to flag discrepancies between the human genome sequence and transcript evidence.

The Ensembl/HAVANA human geneset shown in Ensembl is a combined output from these projects, incorporating all CCDS entries and merging HAVANA full length transcript annotation with the Ensembl gene build. In the last year, this process has been extended to include 4711 HAVANA pseudogenes and will be more regularly updated in future to incorporate additional validated annotation from GENCODE.

One additional geneset development is that canonical transcripts are now defined for all genes in all species. The canonical transcript is defined as the transcript with the longest CDS, if the gene has translated transcripts, or otherwise the transcript with the longest cDNA. Should a transcript already regarded as canonical not be selected using the above rules, there is support for storing this information in the Ensembl database.
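The selection rule for the canonical transcript can be written down as a short Python sketch. It follows only the rule stated above; the `Transcript` record, the function `pick_canonical`, the `override` argument (standing in for a stored exception to the rule) and the tie-breaking behaviour (first transcript wins) are assumptions made for illustration and do not reflect the Ensembl API or schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Transcript:
    stable_id: str        # hypothetical identifier, for illustration only
    cdna_length: int      # length of the spliced transcript in bases
    cds_length: int = 0   # 0 means the transcript has no translation


def pick_canonical(transcripts: Sequence[Transcript],
                   override: Optional[str] = None) -> Transcript:
    """Pick the canonical transcript of one gene.

    Rule from the text: longest CDS if any transcript is translated,
    otherwise longest cDNA. `override` models a stored exception.
    Ties go to the first transcript encountered (an assumption).
    """
    if not transcripts:
        raise ValueError("a gene needs at least one transcript")
    if override is not None:
        for t in transcripts:
            if t.stable_id == override:
                return t
        raise ValueError(f"override {override!r} is not a transcript of this gene")
    coding = [t for t in transcripts if t.cds_length > 0]
    if coding:
        return max(coding, key=lambda t: t.cds_length)
    return max(transcripts, key=lambda t: t.cdna_length)
```

Note that under this rule a long non-coding transcript never outranks a translated one, however short the CDS of the latter.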

Multiple alignments for comparative genomics

The genome-wide Ensembl comparative genomics pipeline has changed significantly over 2008, and is now based on the Enredo-Pecan-Ortheus pipeline (EPO). EPO is a set of three programs which feed into each other. The Enredo programme (28) takes a set of genomes and creates a segmentation graph across all the genomes to extract a set of colinear homologous segments. Unlike the algorithms Ensembl has used previously, Enredo handles lineage specific duplications (for example, a duplication on the primate lineage giving rise to two copies of a series of genes in primates compared to other mammals). These colinear segments are then handed on to Pecan, a consistency based multiple aligner, which provides a highly accurate alignment of the homologous regions. Using an assessment based on ancestral repeats, Enredo+Pecan outperforms other combinations of alignment programs in mammals. Finally, the ancestral sequence reconstruction programme, Ortheus (29), generates accurate ancestral sequences across each region. Ortheus uses a branch transducer model, a type of HMM, to call deletion and insertion events, providing a realistic model under which it can infer the ancestral sequence. Figure 2 shows the results of GERP (30) analysis of constraint across different feature types found in Ensembl, showing a sharp distinction between coding exons and ancestral repeats, with regulatory regions showing an intermediate level of constraint. Ensembl release 49 (March 2008) saw the first set of EPO alignments on a set of seven mammals. In release 50 (July 2008), this set of alignments was extended to include low-coverage genomes, creating a 23 mammal EPO alignment. A set of 4-way primate EPO alignments was also added, containing human, chimp, orangutan and macaque. We plan to produce EPO multiple alignments in the teleost lineage in the future.

Figure 2.
Figure shows a smoothed density plot of the GERP conservation scores (30) calculated from the 9-way EPO mammalian genome alignment corresponding to human chromosome X (Ensembl release 51). Four different types of genomic features are plotted: coding exons ...

To create the 23 mammal EPO alignment, the methodology had to be extended to include low-coverage genomes. The assemblies of low-coverage genomes are too fragmented, creating too many breakpoints in the Enredo graph, to use Enredo directly. The Enredo graph was therefore built using high-coverage genomes only. Low-coverage genomes were then mapped onto the colinear regions using pairwise alignments to the human genome. For each low-coverage genome, the segments defined by the pairwise alignments were linked with stretches of N's to facilitate the process of building the final multiple sequence alignment. After the alignment had been obtained, the stretches of N's were removed.
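The linking step can be illustrated with a minimal Python sketch, under several assumptions: the function names `link_segments` and `remove_padding` and the padding length are hypothetical, and the sketch works on plain unaligned strings, whereas in the pipeline described above the padding is removed from the finished alignment. The positions of the padding are recorded explicitly so that genuine N's inside a segment (assembly gaps) are never confused with the linker.

```python
from typing import List, Sequence, Tuple


def link_segments(segments: Sequence[str],
                  pad_len: int = 10) -> Tuple[str, List[Tuple[int, int]]]:
    """Join segments with runs of 'N' and record where each run lies.

    Returns the linked sequence and a list of half-open (start, end)
    intervals covering the padding.
    """
    if not segments:
        raise ValueError("at least one segment is required")
    linked = ("N" * pad_len).join(segments)
    pads: List[Tuple[int, int]] = []
    pos = 0
    for seg in segments[:-1]:
        pos += len(seg)                    # end of this segment
        pads.append((pos, pos + pad_len))  # padding that follows it
        pos += pad_len
    return linked, pads


def remove_padding(linked: str, pads: Sequence[Tuple[int, int]]) -> List[str]:
    """Cut the recorded padding intervals out again, restoring the segments."""
    pieces: List[str] = []
    prev = 0
    for start, end in pads:
        pieces.append(linked[prev:start])
        prev = end
    pieces.append(linked[prev:])
    return pieces
```

Tracking the padding by coordinate rather than searching for runs of N's keeps the round trip exact even when a segment itself begins or ends with N's.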

As well as providing alignments of genome sequence, the Ensembl comparative genome analysis pipelines also generate gene trees and orthology/paralogy predictions across all Ensembl genomes. A full description of the pipeline, including its close collaboration with the curated resource TreeFam (31), is forthcoming (Vilella, A. et al., submitted). A significant recent change (release 50) has been the calculation of site-wise dN/dS values in our gene trees using the SLR programme (sitewise likelihood ratio estimation of selection) (32). These values allow us to detect positions in the alignments that are under different evolutionary pressure.

Functional Genomics and Variation resources

The availability of genome wide functional data is one of the major changes in genomics in the last few years. Driven by involvement in analysis for the ENCODE project (9) and other international research consortia such as the EU FP6 funded HEROIC (High-throughput Epigenetic Regulatory Organisation In Chromatin) project, Ensembl has built up an infrastructure to support handling and display of this class of data (14). We have also recently participated in the creation of a genome-wide DNA methylation resource that has been incorporated into Ensembl (33,34). With the availability of next generation sequencing technology, array based ChIP-chip functional data is very rapidly giving way to sequence based ChIP-seq data. A major activity this year has been the development of a ChIP-seq analysis pipeline including a custom algorithm for the analysis of ChIP-seq data.

One of the characteristic features of the Ensembl project has been to go beyond presenting raw data aligned to the genome sequence by also presenting high quality consensus biological predictions, generated from automatic analysis pipelines developed to use the raw data as evidence. Examples are the Ensembl gene build pipeline generating protein coding genesets and the Ensembl comparative analysis pipelines generating genetrees and orthology and paralog relationships. The Ensembl regulatory build is the latest such pipeline and provides automatic, evidence based annotation of potential regulatory regions within the human genome. The primary inputs are maps of open chromatin created by DNase I hypersensitivity mapping and covalent modifications of histone protein tails assayed by chromatin immunoprecipitation (ChIP). The first build was released in coordination with the ENCODE Pilot Project publication (9). Since the first release reported last year (14), we have updated the regulatory build three times, each time adding more data (35,36) and a more sophisticated analysis of the chromatin conformation and modification data. The build now consists of approximately 175 000 genomic regions defined from data collected from several cell types, including CD4 cells which make up the majority of the supporting data. Approximately 40 different histone modifications are now included and more than 2700 combinations of these factors form patterns associated with protein coding genes or their promoters allowing over 23 000 of the regulatory features to be classified as gene- or promoter-associated.

The rapid adoption of next generation sequencing technologies is also having a major impact on variation data in Ensembl. Whereas data continues to be imported from dbSNP, a major new source of computationally discovered variation data is from the processing of resequencing data. This second data source is growing rapidly in parallel with next generation sequencing technology. This year, Ensembl imported the data from three successive builds of dbSNP (127, 128 and 129). It has also incorporated resequencing-based SNPs from platypus and orangutan as well as from the resequenced human genomes of Watson and Venter. The platypus SNPs were submitted to dbSNP and make up the largest set of SNPs for that species. The orangutan SNPs will be submitted in conjunction with the publication of that genome.

Within the variation database, we have increased support for copy number variation data and annotation of individual SNPs [e.g. with disease associations identified in genome-wide scans and with expression QTLs (37)]. The Ensembl variation group is synergistic with the European Genotype Archive (EGA http://www.ebi.ac.uk/ega/) and the 1000 Genomes Project (http://www.1000genomes.org/) data coordination centre groups at the EBI (European Bioinformatics Institute). The EGA was launched in the spring of 2008 and currently manages data from several projects including the Wellcome Trust Case Control Consortium (38) and other projects that are still in pre-publication status. The synergies between these projects will underpin the growth in variation data in Ensembl and the start of its functional annotation.

Outreach

Ensembl continues to make a substantial investment in training and user support. We regard this as critical not only to help users, but also to evaluate the relevance of the data we provide and the ease of use of the services we provide. As discussed earlier, user engagement has been critical in developing the web site redesign. The Ensembl Outreach and Training group provides on-site courses on request and has run 102 workshops since May 2007, with an expanding effort in Asia (workshops in China, Malaysia and India), and a substantial presence in the USA (20 workshops) and Europe (64 workshops). In addition to this, alongside standalone video tutorials, eLearning courses are now being developed and piloted within the EBI training platform (http://www.ebi.ac.uk/training/user/). Finally, the new Ensembl blog (http://ensembl.blogspot.com/) provides updates on upcoming Ensembl training courses around the world.

FUTURE DIRECTIONS

The impact of next generation sequencing on genomics is beginning to be felt and a major focus for Ensembl is adapting to changes in data type and scale that will result. As discussed last year (14) the scale of data is a major challenge for many bioinformatics resources. For the variation team the immediate challenge is to present the variation landscape that will be uncovered by the 1000 Genomes Project, which is now running. The gene build team is starting to develop pipelines that use next generation sequencing transcriptome data. We can envisage such data being collected systematically for many different cell types and developmental stages, providing increasingly complete evidence for alternative splicing variants and functional annotation of the time and localization of their expression.

At present the focus for genome sequencing is discovery of variation; however, as both experimental and computational techniques improve, it will become possible to sequence and assemble large genomes de novo. At this point it may become cost effective to sequence many more mammalian genomes. However, a major expansion of the number of genomes provided using Ensembl technology is already underway, in the form of the Ensembl Genomes project (http://www.ensemblgenomes.org/), which will use Ensembl technology to provide a common interface to genomes across biology. Significant API and schema developments have already taken place to support this, including the ability to store several species in a single core database.
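One general way to hold several species in a single database is to key species-specific rows by a species identifier, as in the following Python sketch using SQLite. The table names, column names and identifiers are hypothetical and chosen only for illustration; they are not the Ensembl core schema, and the sketch makes no claim about how the Ensembl API resolves species.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding genes for two toy species."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE species (
            species_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL UNIQUE
        );
        CREATE TABLE gene (
            gene_id    INTEGER PRIMARY KEY,
            species_id INTEGER NOT NULL REFERENCES species(species_id),
            stable_id  TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO species (species_id, name) VALUES (?, ?)",
        [(1, "species_a"), (2, "species_b")],
    )
    conn.executemany(
        "INSERT INTO gene (species_id, stable_id) VALUES (?, ?)",
        [(1, "GENE_A1"), (1, "GENE_A2"), (2, "GENE_B1")],
    )
    conn.commit()
    return conn


def genes_for_species(conn: sqlite3.Connection, name: str) -> List[str]:
    """Return the stable identifiers of all genes of one species."""
    rows = conn.execute(
        """
        SELECT g.stable_id
        FROM gene AS g
        JOIN species AS s ON s.species_id = g.species_id
        WHERE s.name = ?
        ORDER BY g.stable_id
        """,
        (name,),
    )
    return [row[0] for row in rows]
```

Every query is then scoped by species, so many small genomes can share one database without their annotation being mixed.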

Finally, it is clear from our website performance monitoring that despite the performance improvements from our improved web-code, network latency effects will always reduce performance for users far from our servers. As a result we have been investing in mirror sites in parallel, to improve performance for users and provide redundancy. We have recently deployed a mirror site in China in collaboration with the Beijing Genomics Institute, Shenzhen (BGI-SZ). This site's primary service region is our users in and around China as the connections between the UK and China are relatively slow. We will shortly be deploying a full mirror to the US west coast and have also been investigating operating servers in commercially managed cloud compute facilities.

FUNDING

This work was supported by the Wellcome Trust [grant number WT062023]; the European Molecular Biology Laboratory (EMBL); the National Institutes of Health (NIH) National Human Genome Research Institute (NHGRI); the National Institutes of Health (NIH) National Institute of Allergy and Infectious Diseases (NIAID); the Biotechnology and Biological Sciences Research Council (BBSRC); the Medical Research Council (MRC); and the European Union. Funding for open access charge: The Wellcome Trust.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We acknowledge those researchers and organizations that have provided data to Ensembl prior to publication under the understandings of the Fort Lauderdale meeting discussing Community Resource Projects. We thank all users of our website and other resources, and those who have provided useful feedback through our mailing list.

REFERENCES

1. Karolchik D, Kuhn RM, Baertsch R, Barber GP, Clawson H, Diekhans M, Giardine B, Harte RA, Hinrichs AS, Hsu F, et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 2008;36:D773–D779. [PMC free article] [PubMed]
2. Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Edgar R, Federhen S, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2008;36:D13–D21. [PMC free article] [PubMed]
3. Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. The Distributed Annotation System. BMC Bioinformatics. 2001;2:7. [PMC free article] [PubMed]
4. Jenkinson AM, Albrecht M, Birney E, Blankenburg H, Down T, Finn RD, Hermjakob H, Hubbard TJ, Jimenez RC, Jones P, et al. Integrating biological data – the Distributed Annotation System. BMC Bioinformatics. 2008;9(Suppl 8):S3. [PMC free article] [PubMed]
5. Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004;14:160–169. [PMC free article] [PubMed]
6. The Star Consortium. SNP and haplotype mapping for genetic analysis in the rat. Nat. Genet. 2008;40:560–566. [PubMed]
7. Twigger SN, Pruitt KD, Fernández-Suárez XM, Karolchik D, Worley KC, Maglott DR, Brown G, Weinstock G, Gibbs RA, Kent J, et al. What everybody should know about the rat genome and its online resources. Nat. Genet. 2008;40:523–527. [PMC free article] [PubMed]
8. Aitman TJ, Critser JK, Cuppen E, Dominiczak A, Fernandez-Suarez XM, Flint J, Gauguier D, Geurts AM, Gould M, Harris PC, et al. Progress and prospects in rat genetics: a community view. Nat. Genet. 2008;40:516–522. [PubMed]
9. The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. [PMC free article] [PubMed]
10. Gräf S, Nielsen FG, Kurtz S, Huynen MA, Birney E, Stunnenberg H, Flicek P. Optimized design and assessment of whole genome tiling arrays. Bioinformatics. 2007;23:i195–i204. [PubMed]
11. Warren WC, Hillier LW, Marshall Graves JA, Birney E, Ponting CP, Grutzner F, Belov K, Miller W, Clarke L, Chinwalla AT, et al. Genome analysis of the platypus reveals unique signatures of evolution. Nature. 2008;453:175–183. [PMC free article] [PubMed]
12. Mikkelsen TS, Wakefield MJ, Aken B, Amemiya CT, Chang JL, Duke S, Garber M, Gentles AJ, Goodstadt L, Heger A, et al. Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature. 2007;447:167–177. [PubMed]
13. Rhesus Macaque Genome Sequencing and Analysis Consortium. Evolutionary and biomedical insights from the rhesus macaque genome. Science. 2007;316:222–234. [PubMed]
14. Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al. Ensembl 2008. Nucleic Acids Res. 2008;36:D707–D714. [PMC free article] [PubMed]
15. Spudich G, Fernandez-Suarez XM, Birney E. Genome browsing with Ensembl: a practical overview. Briefings in functional genomics & proteomics. 2007;6:202–219. [PubMed]
16. Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, et al. The Ensembl genome database project. Nucleic Acids Res. 2002;30:38–41. [PMC free article] [PubMed]
17. Cunningham F, Rios D, Griffiths M, Smith J, Ning Z, Cox T, Flicek P, Marin-Garcin P, Herrero J, Rogers J, et al. TranscriptSNPView: a genome-wide catalog of mouse coding variation. Nat. Genet. 2006;38:853. [PMC free article] [PubMed]
18. Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al. Ensembl 2007. Nucleic Acids Res. 2007;35:D610–D617. [PMC free article] [PubMed]
19. Wilming LG, Gilbert JG, Howe K, Trevanion S, Hubbard T, Harrow JL. The vertebrate genome annotation (Vega) database. Nucleic Acids Res. 2008;36:D753–D760. [PMC free article] [PubMed]
20. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–D65. [PMC free article] [PubMed]
21. Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J, et al. Prominent use of distal 5′ transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007;17:746–759. [PMC free article] [PubMed]
22. Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, Gilbert JG, Storey R, Swarbreck D, et al. GENCODE: producing a reference annotation for ENCODE. Genome Biol. 2006;7(Suppl 1):S41–S49. [PMC free article] [PubMed]
23. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al. EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol. 2006;7:S2. [PMC free article] [PubMed]
24. Tress ML, Martelli PL, Frankish A, Reeves GA, Wesselink JJ, Yeats C, Olason PL, Albrecht M, Hegyi H, Giorgetti A, et al. The implications of alternative splicing in the ENCODE protein complement. Proc. Natl Acad. Sci. USA. 2007;104:5495–5500. [PMC free article] [PubMed]
25. Clamp M, Fry B, Kamal M, Xie X, Cuff J, Lin MF, Kellis M, Lindblad-Toh K, Lander ES. Distinguishing protein-coding and noncoding genes in the human genome. Proc. Natl Acad. Sci. USA. 2007;104:19428–19433. [PMC free article] [PubMed]
26. The UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2008;36:D190–D195. [PMC free article] [PubMed]
27. Bruford EA, Lush MJ, Wright MW, Sneddon TP, Povey S, Birney E. The HGNC Database in 2008: a resource for the human genome. Nucleic Acids Res. 2008;36:D445–D448. [PMC free article] [PubMed]
28. Paten B, Herrero J, Beal K, Fitzgerald S, Birney E. Enredo and Pecan: Genome-wide mammalian consistency based multiple alignment with paralogs. Genome Res. 2008;18:1814–1828. [PMC free article] [PubMed]
29. Paten B, Herrero J, Fitzgerald S, Beal K, Flicek P, Holmes I, Birney E. Genome-wide nucleotide level mammalian ancestor reconstruction. Genome Res. 2008;18:1829–1843. [PMC free article] [PubMed]
30. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901–913. [PMC free article] [PubMed]
31. Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, Heriche JK, Hu Y, Kristiansen K, Li R, et al. TreeFam: 2008 Update. Nucleic Acids Res. 2008;36:D735–D740. [PMC free article] [PubMed]
32. Massingham T, Goldman N. Detecting amino acid sites under positive selection and purifying selection. Genetics. 2005;169:1753–1762. [PMC free article] [PubMed]
33. Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Graf S, Johnson N, Herrero J, Tomazou EM, et al. A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis. Nat. Biotechnol. 2008;26:779–785. [PMC free article] [PubMed]
34. Rakyan V, Down T, Thorne N, Flicek P, Kulesha E, Graf S, Tomazou E, Backdahl L, Johnson N, Herberth M, et al. An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs). Genome Res. 2008;18:1518–1529. [PMC free article] [PubMed]
35. Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, et al. Combinatorial patterns of histone acetylations and methylations in the human genome. Nat. Genet. 2008;40:897–903. [PMC free article] [PubMed]
36. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K. High-resolution profiling of histone methylations in the human genome. Cell. 2007;129:823–837. [PubMed]
37. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, Beazley C, Ingle CE, Dunning M, Flicek P, Koller D, et al. Population genomics of human gene expression. Nat. Genet. 2007;39:1217–1224. [PMC free article] [PubMed]
38. Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007;447:661–678. [PMC free article] [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press