Copyright © 2008 The Royal Society

Genome and proteome annotation: organization, interpretation and integration

EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
*Authors for correspondence (Email: gabby/at/ebi.ac.uk) (Email: talavera/at/ebi.ac.uk)
Received September 29, 2008; Revised October 16, 2008; Accepted October 16, 2008.

Abstract
Recent years have seen a huge increase in the generation of genomic and proteomic data. This has been due to improvements in current biological methodologies, the development of new experimental techniques and the use of computers as support tools. All these raw data are useless if they cannot be properly analysed, annotated, stored and displayed. Consequently, a vast number of resources have been created to present the data to the wider community. Annotation tools and databases provide the means to disseminate these data and to comprehend their biological importance. This review examines the various aspects of annotation: type, methodology and availability. Moreover, it pays special attention to novel annotation fields, such as that of phenotypes, and highlights the recent efforts focused on integrating annotations.

Keywords: genome annotation, proteome annotation, sequencing

1. Introduction
Over the last decade, significant developments in the biological and computer sciences have made it possible to generate large amounts of raw genomic and proteomic data. However, meaningful biological inferences can only be gained where expert organization and interpretation of these data are carried out. In 1999, the DNA sequence of chromosome 22, the first human chromosome to be fully sequenced, was published (Dunham et al. 1999) and the first draft of the human genome assembly was completed in 2001 (Lander et al. 2001). The human genome took 10 years to sequence.
Today, the sequencing of entire genomes has become routine, resulting in ever increasing numbers of published genomes across the kingdoms of life (see Pop & Salzberg (2008) for the new challenges presented by this fact) and, even, the appearance of metagenomics research (Venter et al. 2004; Tringe et al. 2005). Consequently, this has led to a significant increase in the number of genes and corresponding translated proteomic sequences deposited into databases such as TrEMBL (Boeckmann et al. 2003), which now comprises over 6 million sequences. Likewise, the increased efficiency in tools for the elucidation of protein structure has helped to increase the number of experimentally determined structures that are released by the PDB each month. This is in part due to the advent of structural genomics initiatives (Burley 2000; Brenner 2001), which generally attempt to solve every representative structure of an interesting family or aim to cover fold space by picking targets unlike any other structures previously solved. Importantly, many of these genomic and proteomic data are experimentally uncharacterized, putting much emphasis on the need for accurate analytical tools and up-to-date specialist databases. This review discusses the various aspects of genome and proteome annotation, with particular focus on the way in which these methods have started to be integrated.

1.1 Overview of annotation process: genome to proteome
The increasing efficiency of genome sequencing has led to a significant rise in the release rate of sequenced genomes. Sequencing can be done at different levels ranging from whole-genome tiling arrays (Yazaki et al. 2007; low accuracy) to methods in which the genome is divided into contigs, each of which is sequenced separately and the data recombined (Mardis 2008).
Subsequent genome annotation involves the prediction of a number of features on the DNA: coding genes; pseudogenes; promoters–regulatory regions; untranslated regions; and repeats, to name a few. Although growing attention is focused on non-coding RNA (extensively reviewed in Huttenhofer et al. (2005), Mendes Soares & Valcarcel (2006) and Amaral et al. (2008)), traditionally, most interest is in the prediction of the coding genes, since peptides are seen as potential targets for drug discovery (Ofran et al. 2005). This prediction is not a trivial process as gene structure is not common among all the organisms (Wong et al. 2001; Blencowe 2006). For example, eukaryotes have exons and introns in their genes, whereas prokaryotes have the whole coding sequence as a continuum. In addition, different species have different rates of alternative splicing (Kan et al. 2001; Modrek et al. 2001). Alternative splicing is often predicted using expressed sequence tags (ESTs)—short sequences of a transcribed spliced nucleotide sequence that have been used extensively to identify gene transcripts and have been important in gene discovery and gene sequence determination (Adams et al. 1991). Subsequent studies have shown that the alternative splicing prediction by this method in the main sequencing projects depends on the EST coverage and that, in effect, the more ESTs, the more alternative splicing is reported (Brett et al. 2002; Gupta et al. 2004). Finally, not only cis-splicing is possible, but also some examples of trans-splicing (involving more than one pre-mRNA) have been found (Caudevilla et al. 1998; Dorn et al. 2001; Horiuchi & Aigaki 2006). Looking at all these facts together, it is easy to understand the reason for not yet having an accurate number of genes and transcripts from the fully sequenced genomes (reviewed in Southan (2004) and Brent (2008)). 
Once the number and location of the genes are known, the corresponding protein sequences can be generated via translation of the gene sequences. This is not always a trivial process since some genetic features make it difficult to identify and translate the correct region of a given gene sequence. Examples include exons that can be translated in different frames without encountering any STOP codon (Clark & Thanaraj 2002) and multiple NAGNAG tandem splice acceptor sites (Hiller et al. 2004), which lead to two possible peptides having different numbers of residues. In addition, many transcripts seem to be potential targets for the non-sense-mediated decay mechanism (Lewis et al. 2003), which eliminates mRNAs containing premature ends of translation. Thus, the biological relevance of the majority of these transcripts is unclear as there is no certainty about their biological role (Neu-Yilik et al. 2004; Ravasi et al. 2006). Experimental methods, such as Edman degradation or mass spectrometry, can also be used to determine the amino acid sequence of an expressed protein. The process of making sense of protein sequence data is also complex, and a huge number of tools and specialist databases have been developed in order to characterize these protein sequences (and their three-dimensional structures; figure 1).
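The frame ambiguity mentioned above can be illustrated with a minimal sketch. It assumes the standard genetic code stop codons (TAA, TAG, TGA) and only the three forward reading frames. The function name `open_frames` and the toy sequences are hypothetical illustrations, not part of any published tool.

```python
from typing import List

# Stop codons of the standard genetic code (an assumption of this sketch;
# some organisms and organelles use variant codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(exon: str) -> List[int]:
    """Return the forward frame offsets (0, 1, 2) in which the exon
    contains no in-frame stop codon."""
    seq = exon.upper()
    frames = []
    for offset in range(3):
        # Walk the sequence codon by codon, starting at this offset.
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(offset)
    return frames


# A toy exon with a stop codon in frame 0 only: two frames remain open,
# so the sequence alone does not tell us which peptide is produced.
print(open_frames("ATGTAAGCC"))  # [1, 2]
```

When more than one offset is returned, the exon is translatable in several frames, and additional evidence (for example, the frame of the flanking exons) is needed to choose between them.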
Furthermore, systems biology has started to put the genome and proteome in the context of the organism. Consequently, the expression levels of each transcript, the proteins involved in the regulation of their transcription and splicing and the generated interaction networks begin to be exhaustively studied and annotated. Also of special interest is the ChIP–chip methodology, which provides high-quality information about regulatory sequences in the DNA. Finally, not only the intra-individual systems have been studied, but also the evolutionary relationship between genes (orthology, in-paralogy and out-paralogy) is being included in the most important annotation databases. Annotation can be approached in a number of ways: from manual curation of the literature to automatic methods. The latter are very diverse; some use the transfer of information from one characterized sequence to a homologue, while, in ab initio methods, features are predicted based on a set of derived rules. Over the last two decades, tools for the annotation of genomic and proteomic sequences and their structures have been developed and made accessible for others to use. This has added to a huge availability of characterized data. The databases that store these data often specialize in curating one particular area of annotation and are often most powerful when arranged in such a way in which the data can be probed computationally. For example, CATH (Orengo et al. 1997) is a database of protein structural domains where users can obtain a broad view of a chosen protein family or a narrower view of a particular protein structure. Often, the tools that provide annotation also form the basis of the dataset that is represented in a given database. 
Over the last few years, there has been a move towards the integration of the wide range of genome and proteome annotation methods and databases in order to provide an overall view of the function of these genes (for an elegant project covering some of these points, see Fleming et al. 2006).

2. Types of annotation
Comprehensive protein feature annotation is an effective way to build up a picture of the function of the protein. Such features may include: for genes, expression levels, position of regulatory elements, binding sites, splicing junctions and individual variance; and, for proteins, position of functional residues, identification of post-translational modifications, description of residues that interact with DNA, protein or ligand, elucidation or prediction of the domain partners, description of the overall biological unit and even data describing the three-dimensional structure of the protein. The annotation of such features can be carried out in three ways. The first is manual curation from experimental data in the literature. This method provides the highest accuracy datasets; however, it relies on highly trained curators, is slow and can cover only a fraction of the data to be annotated. Alternatively, information can be derived by automatically transferring the knowledge we have about one sequence to a related sequence considered to be homologous. The accuracy of these methods depends on the evolutionary distance; the greater the distance, the less confidence one can have in accurately predicting a feature (Chothia & Lesk 1986; Wilson et al. 2000). In addition, the number of shared functional domains is also critical to the reliability of annotation transfer (Hegyi & Gerstein 2001). Consequently, greater coverage of the transfer of information usually results in decreasing the accuracy due to inference from more distant homologues.
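The coverage side of this trade-off can be made concrete with a small sketch. It assumes that each query sequence comes with a list of annotated hits and their percentage sequence identities. The identity thresholds, the `EXAMPLE_HITS` data and the function names are all invented for illustration and do not correspond to any specific published method.

```python
# Hypothetical input: query id -> list of (percent identity, annotation)
# for annotated sequences found by a similarity search.
EXAMPLE_HITS = {
    "q1": [(92.0, "kinase")],
    "q2": [(35.0, "hydrolase"), (28.0, "kinase")],
    "q3": [(18.0, "transporter")],
}


def transfer_annotations(hits, min_identity):
    """Transfer the annotation of the most similar hit to each query,
    but only if that hit reaches the identity threshold."""
    transferred = {}
    for query, candidates in hits.items():
        eligible = [c for c in candidates if c[0] >= min_identity]
        if eligible:
            _, annotation = max(eligible, key=lambda c: c[0])
            transferred[query] = annotation
    return transferred


def coverage(hits, min_identity):
    """Fraction of queries that receive an annotation at this threshold."""
    return len(transfer_annotations(hits, min_identity)) / len(hits)


# Lowering the threshold annotates more queries, but the extra
# annotations come from increasingly distant (less reliable) relatives.
for threshold in (40.0, 30.0, 15.0):
    print(threshold, round(coverage(EXAMPLE_HITS, threshold), 2))
```

The sketch measures coverage only. Estimating the accompanying loss of accuracy would require a benchmark set of sequences with known annotations.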
Finally, some annotations can be predicted using ab initio methods, which use rules trained on previous annotations or the physico-chemical properties of the molecule to predict the feature. The Rosetta method (Das et al. 2007), for example, predicts the fold of a protein by calculating the energy for different conformations of the protein structure. In general, when choosing an annotation method, one must weigh up the often competing demands of speed and accuracy. Manual curation or experimental methods have high accuracy but are time consuming, and are probably more appropriate for small datasets. Methods that produce annotations with higher speed and coverage (e.g. transfer by homology, many ab initio methods) often do so with lower accuracy, but such a trade-off may be reasonable where datasets are large.

2.1 Manual curation
Manually curated data resources are created by human inspection and, as a result, data are substantiated by highly trained and knowledgeable annotators. Through this process, annotators are able to survey and critically assess all the information available. In addition, annotators are able to access information buried deep in journal publications, which is not as accessible to more automated methods. Although labour-intensive and relatively slow compared with automatic annotation methods, manually annotated datasets provide an invaluable and reliable reference resource: an accurate ‘gold standard’ dataset on which users can base their annotation-by-similarity algorithms. A number of such data resources exist (table 1); however, we illustrate with just two examples: the Vega database (Wilming et al. 2008) and the UniProtKB/SwissProt database (Boeckmann et al. 2003; see below). As well as providing a useful gold standard dataset, in many cases, manual curation represents the only way of extracting information from the literature and putting it into a format that can be queried in bulk or by a computer.
This is certainly the case for the Catalytic Site Atlas (CSA; Torrance et al. 2005), which was created for the very reason that catalytic residues were only ever reported in the literature and were therefore impossible to extract computationally. Traditionally, information flows from the experimental laboratory into the literature, which provides the only reference to the data. These experimental data come from laboratories dedicated to that particular protein or family of proteins. For example, functional residues in a protein structure can be determined by site-directed mutagenesis (Grollman 1990), and the location of a gene product in a cell can be determined by fluorescence, electron microscopy or radioactivity (Hoque & Cole 2008; Lewis et al. 2008; Usami et al. 2008). Manual annotation is the most accurate way of moving data from experimental studies reported in the literature into databases, but it can also be slow. In order to try and speed up manual curation, text mining methods have been developed to extract this information automatically (Rebholz-Schuhman et al. 2007; Zweigenbaum et al. 2007; Cohen & Hunter 2008; Zhou & He 2008). However, as yet, these methods are in their infancy. One very effective way of preventing data from flowing only from experiment into the literature is to change the deposition procedure such that incorporation into a database in a computer-readable format is also compulsory at the time of submission. This also means that the process must be user-friendly. Ideally, input fields would be free text, allowing the experimentalist to fully explain the nature of the new information. However, it is likely that a high level of freedom would create extra work for the database curators.
A more manageable task might be created if experimentalists were to input into restricted fields; however, this still leaves the task of defining these fields in such a way that they are not so ambiguous that they result in an inconsistent dataset across different entries of the database. One example of this is the annotation of function. Function is associated with many mutually overlapping levels: chemical; biochemical; cellular; organism mediated; developmental; and physiological. Therefore, a simple definition of function could lead to a huge array of inconsistent information across the database entries. (The use of ontologies is described below.) Another sizeable problem is that of ensuring that the added data are of a high quality. In addition, it is also important to respect the information given in the original entry, and conflicts in data need to be dealt with carefully. The EMBL nucleotide database (Kulikova et al. 2007), in conjunction with GenBank (Kulikova et al. 2007; Benson et al. 2008) and DDBJ (Okubo et al. 2006), has recognized this and has begun to provide a service of this nature. This is done through an online submission procedure (WEBIN; Kulikova et al. 2007). Through the help of controlled input fields and pull-down lists, unambiguous annotations are produced. The WEBIN submission process has evolved over time so that specific information is required for each field. To obtain the final database entry, the submitter works closely with an EMBL curator so that the best annotation is achieved from the data.

2.1.1 Vega
The Vertebrate Genome Annotation database (Vega; http://vega.sanger.ac.uk; Wilming et al. 2008) provides high-quality manual annotation for 20 out of the 24 human chromosomes, four whole mouse chromosomes and approximately 40 per cent of the zebrafish Danio rerio genome.
Vega also displays regions of significance from other vertebrate genomes, human haplotypes and mouse strains including the finished sequence and annotation of the major histocompatibility complex from different human haplotypes, and mouse non-obese diabetes strain annotation of insulin-dependent diabetes candidate regions. The annotation process is slow and these datasets have been built up over many years; however, the result can be used as a trusted, standard data resource. For example, it has been used to provide the basis for integration between a number of organizations to form the Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/). This collaboration between the RefSeq group (Pruitt et al. 2007) at the NCBI, the Havana and the Ensembl groups at the EMBL and WTSI and the Genome informatics group at UCSC aims to provide a standardized, uniform set of protein-coding gene annotations across the human genome. Their goal is to provide a comprehensive annotation of coding and non-coding variants for each human and mouse CCDS locus to create a structured basis for a comparison with RefSeq. In addition to a high-quality dataset of gene structures, which can be used to predict gene structures on low-coverage genomes from other vertebrate species, the process of manual annotation can provide better results for the identification of polyadenylation features, non-coding genes, splice variants, pseudogenes and some of the more complex gene arrangements.

2.1.2 UniProtKB/SwissProt
UniProtKB/SwissProt (Boeckmann et al. 2003) is perhaps one of the most well-known manually curated data resources providing high-quality, well-structured database entries for almost 390 000 protein sequences (release 55.5 as of 10 June 2008).
Gene sequences deposited into the EMBL nucleotide database are then translated and incorporated into TrEMBL (a databank of coding nucleic acid sequences translated into protein), which stores almost 6 000 000 sequences (release 38.5 as of 10 June 2008). A subsection of these are then manually curated and added to SwissProt.

The process of annotation can be described in a number of stages. First, the sequence is captured. SwissProt is a non-redundant database (each entry groups all peptides from a single gene) and therefore sequences are compared and discrepancies between them are noted. Each sequence then undergoes literature-based curation (where information is manually extracted from literature sources and added to the entry) and rigorous sequence analysis. Currently, database entries contain information from over 1400 different journals. All information added during the curation process is verified by expert biologists and therefore considered highly reliable. Information is added to the entry in a highly structured and uniform manner, making it easier to read computationally. Each line is added using an information type identifier; for example, RA, to annotate a reference, or GN, to annotate a gene name. Much of the functional nature of the entry is recorded using three such identifiers: the FT identifier records a description of a defined region, using a list of feature types; the CC line can contain a free-text comment, which is marked with a clearly defined CC topic (category); and, finally, a set of predefined keywords is carefully chosen to best represent the entry. This enables maximum flexibility in how these data are used. The highly defined format is essential for the use of this data resource: it allows very little ambiguity between entries and gives maximum capacity for the data to be manipulated computationally. Such a task needs highly trained and knowledgeable curators.
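To show why the line-identifier structure described above lends itself to computational use, here is a minimal sketch that groups the lines of an entry by their two-letter code. It assumes the usual flat-file layout of a two-letter code followed by three spaces. `EXAMPLE_ENTRY` is an invented, heavily simplified entry (real FT and CC lines are richer), and the function names are hypothetical.

```python
from collections import defaultdict

# Invented, simplified entry for illustration only; it is not a real record.
EXAMPLE_ENTRY = "\n".join([
    "ID   EXAMPLE_HUMAN",
    "GN   Name=EXMP1;",
    "RA   Doe J., Roe R.;",
    "CC   -!- FUNCTION: Hypothetical example protein.",
    "FT   DOMAIN 10 50 Example domain",
    "KW   Example keyword; Another keyword.",
    "//",
])


def group_by_line_code(entry_text):
    """Group the content of each line under its two-letter identifier."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue  # skip blank lines and the entry terminator
        code, content = line[:2], line[5:].strip()
        fields[code].append(content)
    return dict(fields)


def keywords(fields):
    """Split the KW lines into a list of individual keywords."""
    joined = " ".join(fields.get("KW", [])).rstrip(".")
    return [kw.strip() for kw in joined.split(";") if kw.strip()]


entry = group_by_line_code(EXAMPLE_ENTRY)
print(entry["GN"])
print(keywords(entry))
```

Because every piece of information sits behind a fixed identifier, a few lines of code are enough to pull out gene names, comments or keywords across many entries.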
There are now 120 curators spread over the European Bioinformatics Institute (EBI) and the Swiss Institute of Bioinformatics. The SwissProt team has developed a number of resources to streamline the annotation process. Often curators within the SwissProt team will have specific families which they always curate, thus speeding up the process of knowing what to look for and also making it more accurate with the curator having a bank of knowledge already about that family. In addition, curators have a number of tools available to them to help the annotation process. These tools are bound together via a text editor with a powerful C-like macro language, Crisp (Boeckmann et al. 2003), which has been manipulated to provide a platform for highly formatted textual annotation and has the ability to launch a number of sequence analysis tools. During the curation process, a number of sequence analysis tools are used to help annotate such features as signal sequences, transmembrane domains, coiled coil domains and N-glycosylation sites, to name a few. Once an entry is created, it is checked with a syntax checker in order to highlight any inconsistencies in the format or mistakes in the controlled vocabulary fields of the entry. It has also become possible to provide some methods of automatic annotation on SwissProt sequences with high accuracy. The first automatic annotation project was that of high-quality automated and manual annotation of microbial proteins (HAMAP), in which information was transferred from manually annotated proteins to homologues of complete bacterial and archaeal proteomes and based on a set of manually curated rules (Gattiker et al. 2003). UniProt has begun to mine specialist knowledge from expert communities in their annotation procedure, with a pilot scheme to involve the yeast consortium in the annotation of yeast proteins. 
It is hoped that this scheme can be extended to include other specialist communities in the future (http://www.uniprot.org/news/2007/09/11/release).

2.2 Automatic annotation
Once high-quality information has been transferred into a database in a computationally interpretable way, it is possible, in many cases, to transfer these annotations to homologous genes in order to provide increased annotation coverage. For example, some genomes can be annotated by comparison: much of the gene structure of the chimpanzee can be elucidated by comparison with the human genome (Chimpanzee Sequencing and Analysis Consortium 2005). However, the most successful transfer can be achieved at the proteomic level once the three-dimensional protein structure is known, as protein structure is much more conserved than sequence during evolution (Chothia & Lesk 1986). Again, this is an equivocal process since phenomena including alternative splicing or single nucleotide polymorphisms (SNPs) can alter the structure and function of the proteins (Wen et al. 2004; Hiller et al. 2005; Stetefeld & Ruegg 2005). All these processes depend on the crucial step of identifying homologues. Although homology is an old morphological term, which implies an evolutionary divergence from a common ancestor (e.g. mammal and fish bodily extremities), in functional annotation, this word is sometimes not properly used (Petsko 2001). Indeed, many of the so-called ‘by homology’ annotations should be renamed as ‘by similarity’ since they are not based on evolution but on resemblance, e.g. all annotations provided by tools using similarity searches to transfer information. However, as homology is an extremely complex term, beyond the scope of this review, we refer the reader to the useful explanations given by Fitch (2000), while we use the common assumption that very similar sequences or structures are homologous.
Proteins that have evolved from a common ancestor are often found to share a related structure, function and sequence. The comparison of protein structures has the capability to identify very distant relationships between protein sequences. However, we do not know the three-dimensional structure of every sequence. Instead, the relationships hidden within sequence space must usually be teased out through the use of computational sequence comparison methods. Measuring sequence similarity to infer the evolutionary distance between proteins is a fundamental tenet of structural biology and has been drawn on to organize sequence space into clusters of proteins that have diverged from a common ancestor and therefore share a common protein fold. In order to detect as many related sequences as possible, powerful sequence comparison methods have been developed, such as PSI-BLAST (Altschul et al. 1997) and hidden Markov models (HMMs; Eddy 1996), which use sequence profiles built up of groups of related sequences to identify remote protein homologues. Other automatic methods involve the development of rules based on the analyses of previously characterized data. The most interesting thing about these methods is that they do not require the existence of annotated homologous relatives. Rosetta from the Baker laboratory is one of the most popular and successful tools using ab initio calculations for protein structure (Das et al. 2007). The method works by finding local common conformations of small residue stretches. These local conformations are then refined together on the tertiary structure by minimizing the free energy. This method has been extended to predict protein–protein interactions by modelling thousands of conformers, which are then ranked using van der Waals, solvation and H-bond energies (Gray et al. 2003). 
Ab initio methodologies have also been successfully used to predict other features such as gene promoters based on stiffness and helical deformation of nucleic acids (Goñi et al. 2007) and transcription factor binding sites from DNA–amino acid interaction preferences (Kaplan, T. et al. 2005). For an extensive review on ab initio and comparative genomics methods, see Jones (2006). A number of genome annotation sources have been created, including the UCSC genome browser (Karolchik et al. 2008) and the tools provided at the NCBI (Wheeler et al. 2007). The Ensembl pipeline (Curwen et al. 2004) provides automatic annotation of eukaryotic genomes for predicting gene structures (as well as providing other annotations such as homology mapping between species and mapping to other data resources such as expression arrays). The pipeline is a suite of programs built from observing how annotators build gene structures. The pipeline uses information from known proteins, cDNA and EST sequences. First, species-specific known protein sequences are taken from UniprotKB/SwissProt, UniprotKB/TrEMBL and RefSeq. Then, in order to increase coverage, proteins from other organisms are matched using different thresholds. In parallel, to increase coverage further, full-length cDNAs (Stoesser et al. 1998; Pruitt & Maglott 2001; Okazaki et al. 2002) are aligned to the genomic sequence. Annotation of this kind provides fast and high coverage for annotations of genomes and, as such, the genes on 39 species (release 49) have been annotated. The results of these analyses can be viewed on the Ensembl Web browser as part of the Ensembl project (Flicek et al. 2008), a comprehensive genome information portal providing an integrated set of genome annotation for chordate, disease vector genomes and a number of selected model organisms. Protein sequences in TrEMBL are annotated using an automated annotation pipeline of three programs. 
The first is Rulebase, which uses a number of manually curated annotation rules. For example, looking at eukaryotic protein sequences from the InterPro family IPR000685, it is found that 434 out of the 436 have chloroplast as a UniProt keyword. As a result, one can have good confidence in applying the ‘chloroplast’ keyword to any non-annotated eukaryotic protein sequences which have clustered into this InterPro family. It is a powerful technique. However, the manual generation of rules is time consuming and therefore the second program, Spearmint, generates similar rules automatically (Kretschmann et al. 2001). This program works in conjunction with the third in this suite of programs, Xanthippe, which aims to remove false positives and erroneous imports from other databases by generating rules to predict the absence of annotations, a ‘contradiction’ program rather than a prediction program (Wieser et al. 2004).

3. Annotation of phenotypes
Now that parts of the major genomes are annotated to a high quality, more attention has been turned to the annotation of allele-specific information and the differences between the genomes of individuals of the same species. For example, workers at the WormBase database have recently provided their users with details of all sequenced alleles described in papers published since 2001 (Rogers et al. 2008). Much of this variation takes the form of SNPs, which, in humans, occur every 500–1000 bp and are among the most common types of genetic variation. These are associated with altered response to drug treatment, susceptibility to disease and other phenotypic variation. These types of analyses highlight regions of the genome that are highly variable between individuals and could lead to differences in phenotype. This type of information is integrated into the major databases (e.g. UniProt and Ensembl); however, there are also other specific databases, for example dbSNP (Smigielski et al.
2000), a catalogue of variations from the National Center for Biotechnology Information. Other databases focus on the provision of annotation of the effect of these SNPs such as SNPeffect (Reumers et al. 2005).

There are a number of locus-specific databases (LSDBs) conveying such information for particular genes. Some groups have collected variation information for particular genes of interest. This information is particularly important for the elucidation of disease and integration of these sources would allow the creation of a catalogue of variation within the human genome. The federation of LSDBs was set up in order to ascertain the best mode of collecting and curating accurate lists of mutations. This was followed by an analysis of the characteristics of 94 LSDBs on the basis of 80 content criteria (Claustres et al. 2002). In addition, the Human Genome Variation Society aims to promote collection, documentation and free distribution of genomic variation information. With this in mind, work has begun on providing a catalogue of variation within the human genome. The elucidation of two individual human genomes (Levy et al. 2007; Wheeler et al. 2008) and now the beginning of the 1000 genomes project (Siva 2008) aim to cover variation in the human genome. The study of variation in complex phenotypes, where a phenotype is determined by the expression of several genes, is also being considered. These genes are called quantitative trait loci (QTLs; Reuveni et al. 2007), with each locus contributing only a small amount to the eventual phenotype (Flint & Mott 2001; Mott 2006). Such studies have identified and narrowed the genetic regions that control particular phenotypes involved in metabolic and immunological processes (Valdar et al. 2006; Liu et al. 2007) and particular phenotypes responsible for behavioural responses (Malmanger et al. 2006; Valdar et al. 2006; Liu et al. 2007).
Other studies have explored the influence of environment or inbreeding on phenotypic variation (Solberg et al. 2006). As many of the most common human diseases are caused by a polygenic effect, these sorts of studies will be crucial to find new drug targets (Rollins et al. 2006). Information conveying genetic mutations and their effect on the phenotype is essential for understanding inherited diseases. In contrast to SNPs, whose phenotypic effects are tolerated, some of these mutations are extremely deleterious (e.g. the mutations which go on to cause cancer). There are several approaches to the study of the phenotypic effects of a mutation: using gene information (e.g. presence of TF binding sites or splice junctions; Conde et al. 2004); using artificial intelligence tools to score protein features such as accessibility, secondary structure or number of specific residues (Ferrer-Costa et al. 2004, 2005a,b; Capriotti et al. 2006; Bromberg & Rost 2007); or combining both strategies (Conde et al. 2005; Karchin et al. 2005). To increase the amount of data available, cross-species prediction has been tested successfully. This involves training the tool using data from one species and predicting the effect on another one (Ferrer-Costa et al. 2005b). In addition, as more genome data are released, there have been initiatives to annotate a catalogue of human cancer genes and mutations (Futreal et al. 2004; Sjöblom et al. 2006; Greenman et al. 2007; Wood et al. 2007). This process typically involves two steps. The first step is known as the discovery screen, in which all coding exons are used to design primers to amplify DNA from tumour and normal samples from the same individual. After assembling the PCR results, the mutations are analysed both computationally and manually, discarding all changes appearing in normal samples and in SNP databases. The remaining genes are resequenced to verify that any changes are not an artefact (Sjöblom et al.
2006; Greenman et al. 2007; Wood et al. 2007). The second stage is the validation screen, which involves using other tumour lines to amplify and sequence the genes that had non-synonymous mutations in the discovery screen. Using a protocol similar to that of the discovery screen, the validation screen focuses on a set of genes instead of the whole genome (Sjöblom et al. 2006; Wood et al. 2007). However, these censuses contain both driver and passenger mutations, as it is impossible to differentiate which mutations are causative and which are the product of rapid replication cycles. Some attempts have been made to statistically predict causative genes or mutations. They use the number of mutations per gene and the synonymous/non-synonymous ratio to perform these predictions (Sjöblom et al. 2006; Greenman et al. 2007; Wood et al. 2007). Another successful approach to obtaining a catalogue of human cancer genes has involved a gene census compiled from the literature (Futreal et al. 2004). The authors collected genes for which at least two independent reports of somatic mutations, chromosomal rearrangements or copy number alterations were available. Finally, this protocol was used to create the Catalogue Of Somatic Mutations In Cancer (COSMIC; Forbes et al. 2008), which in its 38th release contains almost 60 000 mutations.

Among the biggest resources covering genomic data are those from the Genomics Institute of the Novartis Research Foundation, which include SymAtlas, a database of gene expression (formerly human and mouse genes, but now extended to other species; Su et al. 2004), SNPview, a database containing SNPs from 48 genotyped mouse strains (Pletcher et al. 2004), and an interface to search human druggable genes (Orth et al. 2004).

4. Integration of annotations

As we have described above, many annotation methods exist for genomic and proteomic data, often spread throughout the world.
It is impossible to query all these resources at once, and the interpretation of results from each site is different. Therefore, the user must learn to query and interpret each method separately. Furthermore, these methods often change locations as laboratories change their geographical sites. Accordingly, a new challenge has arisen: to integrate all these different sources of data into a single comprehensive source. In order for this to be possible, it is important that all databases communicate using the same language. For example, methods need to use the same term names to describe their annotations. In addition to this, there needs to be an appropriate infrastructure to facilitate the provision and display of annotations from these disparate laboratories.

4.1 Infrastructure

One method that provides such an infrastructure is the distributed annotation system (DAS; Dowell et al. 2001). DAS is a client–server system in which clients are configured to read and interpret data from multiple servers. It comprises a reference server that holds the information for other servers to ‘refer’ to; an example in this case would be a UniProt sequence. Each partner site then supplies its annotations, which refer to that sequence, in a particular XML format. Interpretation and display of the information from all servers (reference and annotation) is done by a DAS client, e.g. Dasty2 (Jimenez et al. 2008), Spice (Prlic et al. 2005), the Ensembl genome browser (Spudich et al. 2007) and the Pfam DAS alignment viewer (Finn et al. 2008). The information that the client interprets is thus provided by two kinds of server: ‘reference servers’, which hold specific information, such as the sequence, needed to relate one entry to another on a physical map; and ‘annotation servers’, hosted at multiple sites, which provide as few or as many annotations as they wish on a segment of the reference.
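The division of labour between reference server, annotation servers and client can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the identifier P00000, the sequence, the two server names and the XML payloads are invented, the XML is a much reduced form of a DAS features response, and no network request is made. It is not the code of any real DAS client.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference entry (invented identifier and sequence),
# standing in for what a reference server would supply.
REFERENCE = {"id": "P00000", "sequence": "MKTAYIAKQRQISFVKSHFSRQ"}

# Invented, simplified responses from two annotation servers.
SERVER_RESPONSES = {
    "domain-server": """<DASGFF><GFF><SEGMENT id="P00000">
        <FEATURE id="d1" label="example domain">
          <TYPE id="domain">domain</TYPE><START>3</START><END>12</END>
        </FEATURE></SEGMENT></GFF></DASGFF>""",
    "site-server": """<DASGFF><GFF><SEGMENT id="P00000">
        <FEATURE id="s1" label="example site">
          <TYPE id="site">site</TYPE><START>8</START><END>8</END>
        </FEATURE></SEGMENT></GFF></DASGFF>""",
}


def parse_features(source, xml_text, segment_id):
    """Extract the features that one server reports for the reference segment."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        if segment.get("id") != segment_id:
            continue  # annotation refers to a different reference entry
        for feature in segment.findall("FEATURE"):
            features.append({
                "source": source,
                "label": feature.get("label"),
                "type": feature.findtext("TYPE"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


def integrate(reference, responses):
    """Client-side step: merge all servers' features onto the reference."""
    merged = []
    for source, xml_text in responses.items():
        merged.extend(parse_features(source, xml_text, reference["id"]))
    merged.sort(key=lambda f: (f["start"], f["end"]))
    for f in merged:
        # Coordinates are treated as 1-based and inclusive.
        f["residues"] = reference["sequence"][f["start"] - 1:f["end"]]
    return merged
```

Calling `integrate(REFERENCE, SERVER_RESPONSES)` returns a single list of features ordered along the sequence, each tagged with the server it came from. The servers only supply data; all merging and interpretation happens in the client.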
The DAS approach allows great flexibility in the information provided: not only can information be integrated from a number of different laboratories into one central site, but any research group can also view the annotations provided in conjunction with their own data. Contrary to more traditional set-ups, in DAS it is the client that does all of the interpretation of information (the smart system), and the servers merely provide the information in a computer-readable format. This contrasts with the more widely used arrangement in which the data source both provides and interprets the data. The third component is the registry (Prlic et al. 2007), which provides a catalogue of all servers available. After its invention in 2001, the system was widely adopted by the genomic annotation community and forms a major part of the Ensembl infrastructure. It was then adopted by the proteomics world in 2005, and approximately 400 DAS servers are currently registered with the DAS registry. Today, this method is heavily used in popular genome browsers such as Ensembl, which provides approximately 200 different DAS sources, WormBase (Stein et al. 2001) and GBrowse (Stein et al. 2002). This circumvents the need for centralized control over the data: centralized database archives do not need to set aside time and resources to resolve contradictions between different third-party annotations, as all information is reported, leaving the user to interpret the results.

Integration can also be facilitated by providing annotation programs in such a way that others are able to run them remotely on their own datasets; this can be done by providing software as a Web service (Curcin et al. 2005; Labarga et al. 2007).
As a Web service, the program can be used however the user likes and can even be incorporated into a workflow (Kappler 2008), that is, a concatenation of several annotation tools, each one using the results from the last as the input to the next; examples of workflow systems include Triana and Taverna (Hull et al. 2006). These tools comprise a WSDL file (Web Services Description Language, an XML format), which describes the Web service and how to access it, and a CGI script returning results via Simple Object Access Protocol (SOAP) or Representational State Transfer (REST) interfaces. The Web service is based on ‘computer-to-computer’ communication, and therefore it does not need any Web interface and can be used from the command line, requiring only a client script (e.g. a Perl script) on the user side, which makes it a highly portable and accessible tool.

4.2 Integration of terms

An important step of integration is the development of a common language by which all methods can communicate. For both genomic and proteomic annotations, standardization of the nomenclature is fundamental to the success of integration. Without a common language, it is impossible for users to interpret the data manually or computationally. This can be illustrated by looking at the term ‘function’, which can have a variety of meanings. It can be used to describe the biochemical role of the residues carrying out the function, e.g. hydrolysis, or it can mean the overall function of the domain (e.g. ATP binding) or, in fact, the full function of the protein, e.g. glutathione reductase. The overall function of the protein can also be related to its biological process or cellular location. The need for standardization has been recognized by centrally managed databases as an important feature. For example, in UniProt, curators are highly trained in the format of the entry.
Each entry is controlled by a number of predefined descriptions or headings, which head free-text fields to allow easy retrieval of specific topics; there are 941 keywords (as of early September 2008) from which curators can pick to summarize the content of an entry. This allows uniformity in the curation of different entries and also the ability to manage the data computationally.

Vega has been instrumental in the classification and standardization of annotation terms used by the community. This is particularly important when comparing haplotypes or syntenic regions. The standardization aids comparative analysis of orthologues across the different finished regions. In order to provide the platform for integration and comparison, Vega communicates with the nomenclature committees from the Human Genome Organisation (HGNC; Bruford et al. 2008), ZFIN (Sprague et al. 2008) and MGD (Eppig et al. 2007). This collaboration has standardized the nomenclature associated with transcribed regions, allowing the user to interpret the evidence (cDNA, EST or protein sequences) associated with the data. These transcribed regions are assigned to one of five categories, ranging from the most to the least confidence: ‘known genes’, which are identical to human cDNA or protein sequences; ‘novel genes’, which have an open reading frame (ORF) and are identical or homologous to known cDNAs (vertebrates) and/or proteins (all species); ‘novel transcripts’, which are similar to novel genes but to which it has not been possible to assign an ORF; ‘putative genes’, which are homologous to spliced ESTs (vertebrates) but do not have a significant ORF/CDS; and ‘pseudogenes’, which are sequences homologous to proteins (over 50% of the subject length) with a disrupted CDS and for which an active gene can generally be found at another locus.

4.2.1 Ontologies

A further step is to provide a common language with a standardized relationship between the terms in the language in the form of an ontology.
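To make concrete what standardized relationships between terms allow, the following minimal Python sketch models a toy ontology in which each term has an identifier, a name, a definition and optional synonyms, and is linked to parent terms by ‘is_a’ or ‘part_of’ relationships. The identifiers (EX:0001 to EX:0004), definitions and protein names are invented for illustration and are not real GO terms; propagating an annotation along both relationship types is a simplifying assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    identifier: str
    name: str
    definition: str
    synonyms: list = field(default_factory=list)
    # Each relation is a (relationship, parent identifier) pair.
    relations: list = field(default_factory=list)


# Invented toy ontology; the identifiers are hypothetical.
ONTOLOGY = {t.identifier: t for t in [
    Term("EX:0001", "cellular component", "Toy root term."),
    Term("EX:0002", "organelle", "Toy definition.",
         relations=[("is_a", "EX:0001")]),
    Term("EX:0003", "nucleus", "Toy definition.",
         synonyms=["cell nucleus"], relations=[("is_a", "EX:0002")]),
    Term("EX:0004", "nucleolus", "Toy definition.",
         relations=[("part_of", "EX:0003")]),
]}

# Hypothetical direct annotations: gene product -> most specific term.
ANNOTATIONS = {"proteinA": "EX:0004", "proteinB": "EX:0003", "proteinC": "EX:0002"}


def ancestors(term_id, ontology):
    """Return every term reachable by following is_a/part_of links upwards."""
    found = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for _relationship, parent_id in ontology[current].relations:
            if parent_id not in found:
                found.add(parent_id)
                stack.append(parent_id)
    return found


def annotated_to(term_id, annotations, ontology):
    """Gene products annotated to term_id directly or through a more specific term."""
    return sorted(
        product for product, direct in annotations.items()
        if direct == term_id or term_id in ancestors(direct, ontology)
    )
```

In this toy data, a query for ‘nucleus’ (`annotated_to("EX:0003", ANNOTATIONS, ONTOLOGY)`) also retrieves the product annotated only to ‘nucleolus’, because the relationship between the two terms is stored in the ontology rather than in the annotation itself.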
The need for a unified language to describe many different areas within biology has now been widely recognized with the development of a number of ontologies; there are now over 60 ontologies listed on the EBI Ontology Lookup Service (http://www.ebi.ac.uk/ontology-lookup/). Each term in an ontology comprises a unique alphanumerical identifier, a common name, synonyms (if applicable) and a definition. These terms and definitions are linked together by drawing relationships between them; such relationships can be described as ‘is_a’ or ‘part_of’, providing users with not only a controlled vocabulary with common terms and meanings but also relationships that can be computationally inferred.

The gene ontology (GO; Ashburner et al. 2000) provides a comprehensive controlled vocabulary to describe gene product attributes. It is divided into three major sections: the molecular function of the gene product; the role it has in multi-step biological processes; and the location within the cellular components. There are now 25 264 terms in this ontology, and it was used for a variety of purposes in approximately 530 publications in 2007 alone. In 2001, UniProt became a member of the GO consortium and initiated the GOA project (Camon et al. 2004). It provided a dedicated database curation team for the assignment of GO terms to well-characterized proteins in UniProtKB/SwissProt. For example, the 941 keywords in UniProt have been manually mapped to GO terms.

A project within the GO Consortium is the Sequence Ontology (Eilbeck et al. 2005). This is an ontology describing the parts of a genomic annotation, which has been developed to facilitate exchange, analysis and management of genomic data. This standard has been used to support the features stored in the sequence databases of model organisms (Mungall & Emmert 2007) and to standardize the annotation exchange formats (GFF3 specification). Many model organism communities such as WormBase (Rogers et al.
2008), FlyBase (Grumbling & Strelets 2006), SGD (Christie et al. 2004) and DictyBase (Chisholm et al. 2006) use it to annotate their sequences. A recent addition to this ontology is the set of terms that describe features on proteomic sequences and structures (Reeves et al. 2008), which is being used to standardize annotations provided by the BioSapiens NoE (The BioSapiens Network of Excellence 2005). Ontologies have also been created to provide a similar role for other aspects of biology, including molecular interactions (Kerrien et al. 2007a,b), pathways (Twigger et al. 2007) and post-translational modifications (Montecchi-Palazzi 2008), to name a few.

4.3 Collaborations for the integration of annotations

When the concept of e-Science was first introduced, the need for collaboration within a number of scientific disciplines (bioinformatics, chemistry, engineering, healthcare, particle physics and astronomy) was recognized in order to provide greater integration (Hey & Trefethen 2003). For bioinformatics, the myGrid project was launched by a consortium comprising the Universities of Manchester, Southampton, Nottingham, Newcastle, Sheffield, the European Bioinformatics Institute and industrial partners GSK, AstraZeneca, IBM and SUN. It was launched in order to develop an integrated infrastructure to provide tools for automatic annotation and provenance tracking. Since then, a huge number of collaborative projects in all areas of bioinformatics have been taken on. One particular project, the BioSapiens Network of Excellence, is currently nearing its end and, as a result, its benefits and achievements can be examined.
The project was heavily influenced by two other projects: e-protein (http://www.e-protein.org/), which aimed to provide an automated and distributed pipeline for structural and functional annotation of all major proteomes using GRID technologies (the use of a large number of computers, often in varied geographical locations, in concert to perform very large tasks) and pioneered the use of DAS for protein sequences. The BioSapiens Network of Excellence was a direct follow-on and its main goal was to create a ‘Virtual Institute of Annotation’. This has been achieved through the use of the DAS infrastructure with the integration of 69 different distributed annotation sources from 19 partner sites. In addition to this, the consortium has tackled scientific and data integration from a different angle: its members have come together to provide a joint analysis of a number of different datasets.

One particular example is their involvement in the ENCODE project (The ENCODE Project Consortium 2004; Birney et al. 2007). This ongoing project aims to identify all functional elements in the human genome sequence and was designed in three stages: a pilot phase, in which methods and combinations of methodologies can be evaluated on a released dataset of 1 per cent of the genome (some random and some manually picked regions); a technology development phase, in which new laboratory and computational methods will be designed; and the production phase. The pilot phase was launched in September 2003, initially with the funding of eight groups with expertise in existing technologies for the detection of a variety of functional elements including gene promoter repressors and exons. However, as an open consortium, results on the pilot phase were collated from 35 groups including the BioSapiens Network of Excellence. This analysis has provided more than 200 experimental and computational datasets, annotating this 30 Mb dataset of the human genome in unprecedented detail.
The overall findings of this pilot phase have been published. A major conclusion of this study challenges the view that the genome could be annotated as a ‘dictionary of conserved genomic elements each with an annotation about their biochemical function’ (The ENCODE Project Consortium 2004; Birney et al. 2007). Instead, numerous groups found intercalated transcripts spanning the majority of the genome (Tress et al. 2007). Previous analyses have shown a similarly broad amount of transcription across the human (Bertone et al. 2004; Cheng et al. 2005) and mouse (Carninci et al. 2005) genomes. There were mixed opinions about the biological importance of these transcripts and the question remains unanswered. However, the presence of these transcribed elements was indeed established.

5. Future directions

Although the annotation field has advanced very fast in the last decade, many challenges remain, such as the role of the expressed non-coding RNA (Huttenhofer et al. 2005; Mendes Soares & Valcarcel 2006; Yazgan & Krebs 2007; Yazaki et al. 2007; Amaral et al. 2008), de novo biological functional prediction of proteins (Watson et al. 2005) or the systemic integration of annotated components (Ge et al. 2003; Reed et al. 2006). All these areas clearly cross the traditional borders of genome and proteome annotation and extend into the field of systems biology. In the coming years, it will probably be necessary to merge pure annotation work with more basic research in order to succeed. Furthermore, the integrative part of annotation must continue to progress, both within bioinformatics and in its influence on experimentalists.
Consequently, several steps could be taken to facilitate the integration of experimental results: (i) in addition to ontologies, other technical standards should be agreed to facilitate data linkage between resources, (ii) simple and unequivocal forms should be created for the introduction of data from biological experiments, and (iii) data deposition should be a condition for publication.

6. Conclusions

Given the plethora of annotation tools that have been developed over the past few years (figure 1

Acknowledgments

We gratefully thank Daniel Andrews, Roman Laskowski and Daniela Wieser for their helpful hints and comments. This work was completed as part of the BioSapiens Network of Excellence, funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LHSG-CT-2003-503265.

References