![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||
Copyright © 2005 Sczyrba et al; licensee BioMed Central Ltd. XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis 1FSU College of Medicine, Department of Biomedical Sciences, 1269 W. Call Street, Tallahassee, FL 32306, USA 2AG Praktische Informatik, Technische Fakultät, Universität Bielefeld, D-33594 Bielefeld, Germany 3The Rockefeller University, Laboratory of Molecular Vertebrate Embryology, 1230 York Avenue, New York, NY 10021, USA Corresponding author.#Contributed equally. Alexander Sczyrba: asczyrba/at/techfak.uni-bielefeld.de; Michael Beckstette: mbeckste/at/TechFak.Uni-Bielefeld.de; Ali H Brivanlou: brvnlou/at/mail.rockefeller.edu; Robert Giegerich: robert/at/TechFak.Uni-Bielefeld.de; Curtis R Altmann: curtis.altmann/at/med.fsu.edu Received May 5, 2005; Accepted September 14, 2005. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article has been cited by other articles in PMC.Abstract Background Research using the model system Xenopus laevis has provided critical insights into the mechanisms of early vertebrate development and cell biology. Large scale sequencing efforts have provided an increasingly important resource for researchers. To provide full advantage of the available sequence, we have analyzed 350,468 Xenopus laevis Expressed Sequence Tags (ESTs) both to identify full length protein encoding sequences and to develop a unique database system to support comparative approaches between X. laevis and other model systems. Description Using a suffix array based clustering approach, we have identified 25,971 clusters and 40,877 singleton sequences. Generation of a consensus sequence for each cluster resulted in 31,353 tentative contig and 4,801 singleton sequences. Using both BLASTX and FASTY comparison to five model organisms and the NR protein database, more than 15,000 sequences are predicted to encode full length proteins and these have been matched to publicly available IMAGE clones when available. Each sequence has been compared to the KOG database and ~67% of the sequences have been assigned a putative functional category. Based on sequence homology to mouse and human, putative GO annotations have been determined. Conclusion The results of the analysis have been stored in a publicly available database XenDB http://bibiserv.techfak.uni-bielefeld.de/xendb/. A unique capability of the database is the ability to batch upload cross species queries to identify potential Xenopus homologues and their associated full length clones. Examples are provided including mapping of microarray results and application of 'in silico' analysis. The ability to quickly translate the results of various species into 'Xenopus-centric' information should greatly enhance comparative embryological approaches. Supplementary material can be found at http://bibiserv.techfak.uni-bielefeld.de/xendb/. Background Following the publication of the first automated cDNA sequencing study in 1991 demonstrating the utility of large scale random clone cDNA sequencing approaches [1], there has been a rapid and accelerating growth of such Expressed Sequence Tags (EST). The initial study of 600 partial human sequences has grown to more than 20.0 × 106 while more than 30 organisms have more than 100,000 sequences. To make sense of the resulting sequence, a variety of bioinformatic approaches have been developed to identify protein coding sequences and domains [2-4] and generate 'unigene' sets based on agglomerative clustering methods [5,6]. Clustering EST sequences is a widely used method for analyzing the transcriptome of a genome. Especially for organisms whose genome is not (yet) sequenced, the EST data is a valuable source of information. While enormously useful, most current analysis tools result in the loss of significant biological information such as alternatively spliced transcripts and polymorphisms [7-18]. Alternative splicing in particular plays important roles during both development and in the mature organism [7-15]. Moreover, most EST based approaches appear to overestimate the number of unique sequences compared to gene predictions based on whole genome sequencing efforts [19-22]. There are different approaches for EST clustering; the most commonly used being (1) each cluster represents a distinct gene, alternative transcripts of the same gene are grouped together into the same cluster. UniGene is one approach that uses this gene-based strategy [23-27]. (2) Alternative transcripts are represented by distinct clusters. Using genome assembly tools like CAP3 [28] or Phrap [29,30] results in such a clustering, as these tools cannot (and are not designed to) handle the kinds of differences in the EST sequences. (3) STACK [6] groups ESTs based on their tissue source first, and clusters are then generated for each tissue separately. Our approach first generates gene-oriented clusters and then attempts to generate separate contigs which potentially correspond to alternative transcripts. The underlying principle for each of these approaches is a pairwise comparison of all sequences to identify common subsequences of a given length and identity that is subsequently used to group sequences into clusters. The types of pairwise comparisons result in a runtime that is quadratic in the number of sequences to be compared. To achieve better running times, most tools try to identify promising pairs of sequences by applying word-based algorithms, which consider the frequency of common words in each pair of sequences [31]. In any case these approaches have to compare all possible pairs of sequences, resulting in a running time that grows quadratically with the number of sequences. We have implemented a pipeline for rapid processing and clustering of EST data, based on enhanced suffix arrays [32-34]. Compared to other methods it reduces the running time tremendously. While we focus on generating gene-based clusters, we also assembled each cluster separately using CAP3 to generate consensus sequences for further analyses. Liang et al. evaluated Phrap, CAP3, TA-EST and TIGR Assembler and found in their analysis that CAP3 consistently out-performed the other programs [35]. We therefore chose CAP3 for cluster assembly. All sequence and clustering information obtained with our approach was stored in a relational database system. To allow for extensive queries, GenBank annotations were incorporated including the library source, tissue type, cell type and developmental stage. Results of all sequence analyses performed on the consensus sequences were stored in the database. This way, comparative queries could be answered to identify e.g. full length clones, sequences unique to X. laevis, or shared between Xenopus and another organism. The comparative query also allows the identification of the set of Xenopus sequences most related to a set from another organism. Thus, the XenDB database is designed to address a critical issue facing many researchers: the comparison of genomic studies in one organism and their application to studies in another model organism. This task is faced by many laboratories attempting to extract the information gained in human, mouse, fly and worm microarray and library sequencing studies which often consist of large tables of genes. While other databases such as UniGene [36] or TIGR Gene Indices [37] also provide collections of clustered ESTs, the unique batch functionality of mapping results from other organisms to Xenopus laevis and retrieving their potential full length clones was not available before. Moreover, our implementation is specifically designed and focused on relating Xenopus sequence data to the major model organisms. Thus, one can search for the Xenopus homologue directly using the human or mouse protein. Construction and Content Sequence sources and cleanup 350,468 Sequences were downloaded from GenBank release 138 and stored in a relational database using the open source ORDBMS PostgreSQL. The following divisions were included: Vertebrate Sequences (VRT, 5,506 sequences), EST (344,747 sequences) and High Throughput cDNA (HTC, 215 sequences). 228,496 sequences were annotated as 5' ESTs and 116,122 as 3' ESTs. 245,415 different cDNA clones were represented in the data set, out of which 92,463 had both 5' and 3' sequences. Entries annotated as being genomic sequences were excluded from the analysis. To enhance the usability and search capabilities of the database, complete GenBank entries were incorporated. Annotations including but not limited to library source, tissue type, cell type and developmental stage were extracted directly from GenBank entries (feature: source, qualifiers: clone_lib, tissue_type, cell_type and dev_stage). Unfortunately, the sequences are not very well annotated in GenBank. 34% of the sequences do not have a tissue type assigned and 36% have no developmental stage information. Distributions of tissue types, developmental stages and clone libraries are shown in supplemental files [see additional files 2, 3 and 4 respectively]. 197,888 ESTs (57.4% of the EST sequences) had information about high quality start or end of sequencing reads. This information was used to trim sequences according to high quality regions to insure best sequence quality. Vector sequence was downloaded from GenBank and VectorDB [38] and the sequence masked using the program Vmatch [39] developed by Stefan Kurtz. Vmatch is based on a novel sequence index (enhanced suffix arrays, [32-34]), allowing for the rapid identification of similarities in large sequence sets. ESTs were trimmed to eliminate vector sequence located at either the 5' or 3' end (6678 ESTs, 1.9% of total sequence set). In some cases, additional non vector sequence preceded or followed known vector sequence. If such non-vector sequence was less than 20 bases long, it was trimmed from the EST together with the vector sequence. ESTs that had vector sequences left after trimming were discarded completely. Repetitive elements were obtained from Repbase [40] and GenBank and masked using RepeatMasker [41]. In addition, if hits against ribosomal RNA and mitochondrial sequences were found in the downloaded sequence set, the corresponding sequences were removed. The availability of complete mitochondrial genomic and ribosomal sequences makes the inclusion of these sequences unnecessary while masking was performed to minimize possible clustering errors arising from these common sequences. Sequences that had less than 100 consecutive bases left after cleanup were discarded completely (21,039 sequences, 6.0%). The resulting sequence set consisted of 317,242 sequences (90.5%) with an average length of 536 bases (see Table 1).
Clustering and assembly of tentative contig sequences The cleaned X. laevis EST sequence set was grouped into gene specific clusters using Vmatch. Vmatch preprocesses the EST sequences into an index structure: an enhanced suffix array. This data structure has been shown to be as powerful as suffix trees, with the advantage of a reduced space requirement and reduced processing time. Further on, enhanced suffix arrays have been shown to be superior to other matching tools for a variety of applications [33,42,43]. For a detailed introduction of enhanced suffix arrays see Abouelhoda et al. [34]. Briefly, the index efficiently represents all substrings of the sequences and allows the solution of matching tasks, in time independent of the size of the index (unlike BLAST). Vmatch was chosen for the following reasons: (1) At first, there was no clustering tool available which could handle large data sets efficiently, and which was documented well enough to allow a detailed replication and evaluation of existing clusters. (2) Second, Vmatch identifies similarities between sequences rapidly, and it provides additional options to cluster a set of sequences based on these matches. Furthermore, the Vmatch output provides information about how the clusters were derived. Due to the efficiency of Vmatch, we were able to perform the clustering for a wide variety of parameters on the complete sequence set (see below). This allowed us to study the effect of the parameter choice on the clustering. Moreover, in the future, the efficiency will allow us to more frequently update the data set. A longer term goal of the project is to generate a data set that maintains the different alleles in this pseudotetraploid animal as separate entries. The clustering approach has been integrated into an analysis pipeline which can be applied to other organisms that often receive less attention from the bioinformatics community. The database sequences were clustered according to the matches found in a self comparison of the index. Initially each database sequence is put into its own cluster. Then all pairs of matches are generated and each pair is evaluated to possibly form single linkage clusters. To identify matching sequences, Vmatch first computes all maximal exact matches of a given minimal length (seeds) between all sequences. These seeds are extended in both directions allowing for matches, mismatches, insertions, and deletions using the X-Drop alignment strategy as described previously. This greedy alignment strategy was developed for comparing highly similar DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources [44]. In an attempt to objectively define appropriate clustering criteria, we took advantage of the speed of the Vmatch clustering approach to systematically vary the relevant parameters (overlap length, % identity, seedlength and X-drop value). It was hypothesized that the 'correct' parameters would be revealed as an abrupt change in the curve on the resulting graph. An example of such an analysis showing the effect of varying the overlap length and % identity is presented in supplemental materials [see additional file 1]. Here a number of conclusions become apparent. First, at this level of resolution (~30 independent clusterings), a distinct point indicating the 'correct' parameter does not become readily apparent. Second, the collapse of the cluster set to few clusters containing every larger numbers of individual sequences serves as a reminder that all sequences (regardless of species) can be considered part of a single cluster. Finally, as the length overlap decreased, we observed the formation of 'superclusters' containing >10,000 sequences clearly derived from multiple gene families. These problem of 'superclusters' diminished at an overlap length of ~135 (data not shown, and not apparent in additional file 1). These clusters appear to be due to the presence of undefined repetitive elements, chimeric sequences and possibly transposed elements. Studies on the nature of the clustered sequences and the effects of parameter variation are ongoing. For the current data set, we tried to select parameters which mimic the parameters that were probably used for generating the UniGene clusters. Unfortunately, the algorithm used for constructing the UniGene clusters is not sufficiently documented to allow complete reproduction. We selected parameters designed to produce a stringent clustering of the available sequences. For the described data set, sequences were clustered when a pairwise match of at least 150 nucleotides and 98% identity was found (seedlength = 33, X-Drop = 3). The construction of the enhanced suffix array took 33 minutes on a SUN UltraSparc III (900 MHz) CPU. Clustering took another 17 minutes. This resulted in 25,971 clusters containing 276,365 sequences (87.11% of the input set) and 40,877 singletons (12.89%). The average cluster size was 10.6 (std. dev 51.8) sequences. The distribution of cluster sizes is shown in Table 1. 22,834 clusters were composed of ESTs only, 61 clusters of mRNA sequences (VRT and HTC divisions) only and 3,076 clusters of both mRNAs and ESTs. Among the singletons are 4262 sequences which contain less than 150 nt (after sequence cleanup described above) and would therefore be incapable of being joined in a cluster. Less than 25% of these sequences have a significant match against NR database and less than 2% of the sequences match full length cDNA criteria described below. Next, a consensus sequence was generated for each cluster using CAP3 [28]. The aim of this approach was to both refine the number of clusters and to improve the overall sequence quality. This latter aim simplifies the design of oligonucleotide probes. The 25,971 clusters produced 31,353 tentative contig (TC) sequences (avg. length: 1,045 bp, std. dev: 729 bp) and 4,801 singlets (avg. length: 664 bp, std. dev: 424 bp). The longest TC was 13,130 bp (DNA-dependent protein kinase catalytic subunit, accession: [Genbank:AB016434]), while the smallest TC was 154 bases long. Here, it became obvious that CAP3 is a genome assembly program not designed to assemble EST clusters containing potential splice variants: CAP3 assembly subsequently split a fraction of the clusters into separate contigs and singletons. On average, a cluster was split into 1.2 (std. dev 3.0) TCs and 1.8 (std. dev 11.3) singlets by CAP3. As illustrated in Table 1, the average length of the sequences increased from 536 bp (average for input ESTs) to 1,045 bp (average for CAP3 contig sequences) which was lower than the average length for previously characterized Xenopus full length sequences (sequences selected as full length by XGC had an average length of 2,115 bp). There are many genes whose transcript is significant longer than 2× the current state of the art sequencing run of ~1000 bp. This means that 5' and 3' sequences derived from a >2 kb transcript are unable to be joined without sequence from incomplete cDNA clones which provide a source of nested deletions. Sequences from both ends can be linked by annotation, and this has been done by a variety of clustering approaches including NCBI UniGene which uses a double linkage rule. Non-overlapping 5' and 3' ESTs are assigned to the same cluster if clone IDs are found that link at least two 5' ends from one cluster with at least two 3' ends from another cluster and the two clusters are merged. We have examined the effect of double linkage joining using the clone annotation. In this analysis, 17,588 clusters were stable and the total number of clusters was reduced from 25,971 to 21,249. Most of the joined clusters (3,122) were created from two clusters while three clusters were combined 456 times. While the number of clusters is decreased by this joining, our overall analysis is not affected. Potential full length clones selected as part of the P5P group (see below) are also unaffected by annotation linkage. We provide the identity of clusters 'linked by annotation' as part of the XenDB output. Sequence analysis We have performed a variety of sequence comparisons at the protein level including translation analysis. The sequences of cluster TCs and all singletons were subject to extensive BLASTX [45] and FASTY [46] homology searches vs. the non-redundant protein database (NR) from NCBI and the proteomes of five major model organisms using the high throughput analysis pipeline of the Genlight system [47] Proteome sets for H. sapiens, M. musculus and R. norvegicus were obtained from the International Protein Index [48,49]. The IPI provides a top-level guide to the main databases: Swiss-Prot, TrEMBL, RefSeq and Ensembl. It curates minimally redundant yet maximally complete sets of the indexed organisms. C. elegans and D. melanogaster protein sequences were retrieved from the UniProt database [50]. UniProt proteome sets are solely derived from Swiss-Prot and TrEMBL entries. Additionally, all available protein sequences for X. laevis and X. tropicalis were extracted from GenBank. additional file 5 provides an overview of the downloaded data sets. Performing separate comparisons allows a search for matching sequences based on the identity of any gene known from each species as well as query for genes which have matches in some but not all databases. We believe that this will aid in the discovery and analysis of conserved and unique genes. In addition to these databases, we have included BLASTX searches in the KOG database and have used the results to functionally classify the Xenopus sequences. All sequences resulting from the clustering and assembly processes were compared to these protein sets using BLASTX with an E-value cutoff of 1.0e-6. ESTs are often of low sequence quality, and sequencing errors can still exist in the assembled TC sequences. Therefore, all analyses against the protein databases were also done using FASTY (E-value cutoff: 1.0e-6) a version of FASTA that compares a DNA sequence to a protein sequence database, translates the DNA sequence in three forward (or reverse) frames and allows (in contrast to BLASTX) for frame shifts, maximizing the length of the resulting alignments. Identification of chimeric sequences A significant issue in EST clustering methods is the presence of chimeric sequence which inappropriately joins unrelated genes into a single cluster. While the number of chimeric sequences is estimated at less than 1% [51,52], their presence has disproportionate effects on the clustering outcome. To identify potential chimeric sequences, we analyzed the FASTY hits in the protein NR database and applied the following simple procedure: Matches of at least 100 bp in length were mapped back to the TC sequences to identify the regions that are covered by a match. If two matches overlap, the region will be extended accordingly. If after the mapping two clearly separated regions remain, the TC is flagged as potential chimera (see Figure Figure33
Examination of the identified chimeric sequences reveals three major classes. In the first, two distinct FASTY hits can be identified which do not overlap and are in opposite orientation. In the second, the second identified FASTY hit matches retroviral or transposable element related sequences. This suggests the possibility that these may reflect real transcripts in which a mobile element has been inserted into the genome. A close evaluation of such sequences may provide some insights into the evolutionary history of various populations of Xenopus. The final class of potential chimeric sequences identified contains short predicted or hypothetical proteins. This class may in fact not be chimeric at all but may reflect errors in protein coding prediction methods. The described procedure identified 113 potential chimeric TCs (0.3% of the 33,034 sequences with matches against the protein NR database), which are flagged in the database as such. We do not eliminate these potential chimeras, as they don't significantly affect the results of the sequence analyses done later on, which are mainly based on the best hit only. In fact, the analysis underestimates the number of full length sequences, as some chimeras cover two full length protein matches. A complete identification of chimeric sequences is practically impossible without a comparison to the underlying genome sequence. And even then, polycistronic transcripts which may exist cannot be separated from chimeras perfectly [53]. Definitions In the subsequent analyses we were interested in three kinds of information: (1) Full Length Orf containing COntigs (FLOCOs), (2) Full Length Insert containing CLones (FLICLs), and (3) Predicted 5' (P5P) sequences. The result of the clustering and CAP3 analysis generates a set of tentative contig sequences (TC). FLOCOs are defined as TC sequences that have an (almost) full length hit against a known protein. These sequences are especially useful for gene identification. Full length insert containing clones, FLICLs, were predicted. Such clones are distinguished by sequence homologies corresponding to the amino terminal part of a protein but are not restricted at the carboxy-terminus. These sequences are derived from clones which are predicted to carry a full length insert (see below), though the full length sequence has not been determined, usually because of single pass EST sequencing from the 5' end. Finally, we identified sequences that we call P5P for which sequence similarity did not extend through the amino-terminal end of the protein but whose length was sufficient to include a full length coding sequence of a similarly sized protein. Identification of Full Length Orf containing COntigs (FLOCOs) We were especially interested in full length hits of the TC sequences vs. known proteins. For this purpose, BLASTX and FASTY hits were categorized into four classes, representing the quality of the full length matches (see Figure Figure1):1
Table 2 shows the number of identified FLOCOs using BLASTX. 3,942 TCs were Class 1 hits in the non-redundant protein database. As the stringency of the full length definition was relaxed, the number of TCs characterized as full length increases to 5,050 (Class 2), 7,792 (Class 3) and 12,389 (Class 4) TCs respectively. As EST sequences have many sequencing errors, and even the assembly of clusters can not correct all of these, FASTY comparisons were done for the same data set (Table 3). This way, the length of the resulting alignments could be maximized. A comparison of Table 2 and Table 3 shows the effect of frame shift corrections obtained by FASTY. The number of TCs having Class 1 hits could be increased to 5,139 while the less stringent categories increased similarly by an average of 20%. The effect of frameshift correction can clearly be seen in Figure Figure2.2
Comparing the numbers of full length sequences in Table 2 and Table 3, the matches in human, mouse, rat and X. laevis are in general agreement (2619 full length sequences for Class 1 on average). What is striking is the deviation of both the number of full length TCs as well as the average length of TCs having matches against Drosophila and C. elegans: only 268 and 190 full length sequences with average lengths of 1659 and 1575 bp for Drosophila and C. elegans in Class 1, respectively. Only within the Class 4 category there are 2,249 and 1,918 TCs with average lengths of 1,611 bp and 1,563 bp, respectively. A possible explanation for this difference is the divergence of the vertebrate species from these invertebrate model systems. Selection of putative Full Length Insert containing CLones (FLICLs) Often, biologists are interested in identifying a full length clone for further study and this desire has been met by the establishment of a number of the Gene Collections (the Mammalian Gene Collection [54], the Xenopus Gene Collection [55] and the Zebrafish Gene Collection [56]). We have extended our analysis described above to select potential full length insert containing clones (FLICLs) that are available through the IMAGE consortium and provide a simple yet powerful search tool to rapidly match homologous genes of interest to their Xenopus counterparts. The Gene Collections are an NIH initiative that supports the production of cDNA libraries, clones and 5'/3' sequences to provide a set of full-length (ORF) sequences and cDNA clones of expressed genes for a variety of model systems. Since the average length of the characterized full length vertebrate protein is 1,400 bases and the average sequence length of a TC is 1,045 bases, many sequences which are full length will not be detected by the previous approach and will contain sequence gaps of approximately 350 bases. To identify additional clones that potentially carry a full length insert, we queried the database for sequence matches which were sufficiently long to include the start methionine but which did not have sufficient homology to be detected by the previous methods Thus, a sequence with a query start position (Startq) which is greater than the subject start site (Starts) is potentially a full length open reading frame (hereafter referred to as P5P, predicted 5 prime). Clearly, the value of such a prediction decreases as the values of Startq increases and the predictive value increases with lower values of Starts. Full length clones predicted by this method are subject to 3' truncations due to mispriming in poly(A) rich regions rather than at the polyA tail. Such regions would be characterized by the presence of the amino acid lysine (codons AAA, AAG) or asparagine (codons AAU, AAC). Best FASTY hits were extracted for TCs from all four full length categories as well as the P5P categories as described above. For TCs matching these categories, the most 5' EST contributing to the CAP3 contig sequence was selected. In addition, the selected clone had to span the amino-terminal end of the FASTY protein match. Finally, to ensure the ready availability of the clones and therefore the utility of the analysis, the selected clone had to be available through the IMAGE consortium. See Figure Figure11
Due to the large number of sequences, we are unable to examine each sequence individually. Since the analysis depends on the overall degree of conservation among the sequences, such an approach will not be as successful on weakly conserved genes. In general, it seems likely that decreasing e-values correspond to higher quality predictions. On a global basis, the results need to be carefully considered, as an independent assessment of the distribution of conservation among the ensemble of sequences is not available. Gene Ontology prediction and Functional Classification The Gene Ontology (GO) project [57] is an ongoing international collaborative effort to generate consistent descriptions of gene products using a set of three controlled vocabularies or ontologies: biological processes, cellular components, and molecular functions. The GO vocabulary allows consistent searching of databases using uniform queries. The availability of such vocabularies can be critical to the interpretation of high through put approaches such as microarrays. Based on FASTY homologies with both mouse and human sequence, we have mapped GO annotations to the Xenopus sequences. Of the 30,683 TCs with matches to mouse (29,971) or human IPI sequences (29,963), 19,721 TCs have been assigned putative GO annotations. Among the 10,500 potential full length ORF containing IMAGE clones, 6,886 have been assigned GO annotations. The non-redundant X. laevis data set was then classified based on their homology to known proteins from the KOG [58] database (BLASTX 1.0e-5 E-value cutoff, best hit selection). KOGS are euKaryotic clusters of Orthologous Groups. KOG includes proteins from 7 eukaryotic genomes: C. elegans, D. melanogaster, H. sapiens, A. thaliana, S. cerevisiae, S. pombe, E. cuniculi.17,624 sequences (67.3%) had a hit against the KOG database and could be assigned a functional category. Identification of conserved genes not found in major model organisms To identify additional genes within the dataset that are not found by comparison to protein sets of the major model organisms and to assess the extent of diverged or non conserved sequences, open reading frames of 600 nucleotides or longer were selected from the clustered data set for analysis. 219 sequences that did not have any hit in the previous analyses were identified (188 TCs representing 178 clusters and 31 singlets). We further restricted the number of sequences by re-running the BLASTX and FASTY analysis with E-value cutoffs of 0.01. 111 sequences (91 TCs representing 87 clusters consisting of an average of 6 ESTs per cluster and 19 singlets) without any significant similarity in protein databases could be identified and these were examined by TBLASTN against the human, mouse and 'others' EST databases (22.7 million sequences total). Signal peptides were identified by SignalP [59] as well as transmembrane domains by TMHMM [60,61]. Results are presented in Table 6. The analysis identified 46 sequences with similarity to other organisms (E<0.01) with 11 sequences matching chicken (Gallus gallus), 10 sequences matching zebrafish (Danio rerio) and 6 sequences matching the rainbow trout (Oncorhynchus mykiss). Three of the sequences matched human sequences with less significance than the cutoff used above (i.e. 1.0e-6). Among the sequences with highly significant BLAST hits were two matches to the eastern tiger salamander (Ambystoma tigrinum tigrinum) and one to the rainbow trout (Oncorhynchus mykiss). A surprising match was to barley (Hordeum vulgare, E = 9.0e-35) which was the only plant represented among these hits. The remaining 65 sequences did not have significant homology to existing public database sequences. For 7 sequences both signal peptide cleavage sites and transmembrane domains could be identified. Another 15 sequences had either a signal peptide cleavage site or a transmembrane domain. These 22 sequences are potentially novel membrane proteins.
Utility User interface The results of the analyses described above have been incorporated into an SQL database amenable to complex queries. The database can be accessed through a user friendly web based interface (XenDB). XenDB allows individual and batch queries using Xenopus accession, GI, and XenDB, UniGene and TIGR cluster IDs. In addition, the user can query the Xenopus sequence hits using any protein accession/GI number both singly and in batch mode. This allows a rapid identification of Xenopus TCs and their corresponding clones with hits to given protein sequences. The output of various queries displays the matching Xenopus cluster(s) and links to a web page as presented in Figure Figure5.5
The analysis and database system provides a very powerful tool which will enable the Xenopus community to take advantage of a number of technical and experimental advances. We have selected a couple of examples to illustrate possible types of queries. In considering the results, it is important to bear in mind that these examples can be combined to further refine the sequence set. In the first example, we sought to identify all the genes of a known type or class. In the second example, we wished to identify the set of Xenopus sequences which best matched a set of genes from another species identified using the CGAP database administered by the National Cancer Institute (NCI) [62,63]. A final example demonstrates the ability of the system to translate results identified by microarray technologies, or other related high throughput technologies, to identify likely Xenopus homologues. Homeobox gene identification Homeobox containing proteins are a very important group of transcriptional regulators that play key roles in developmental processes. They can be divided into a 'complex' and a 'dispersed' super class representing the homeotic genes and the large number of homeodomain containing proteins dispersed (and diverged) within the genome [64]. The homeotic (Hox) genes play key roles in the anterior-posterior patterning of both vertebrate and invertebrate embryos and in Xenopus are often used as markers of anterior-posterior development. [65-67]. The vertebrate homeotic genes are organized into four clusters arranged in the same order in which they are expressed in the anterior-posterior axis [64]. Of the 39 vertebrate Hox genes, we have identified 28 homologs in Xenopus laevis, while 19 are present in the protein database (Table 7). For those sequences not identified, we sought to determine whether they had been identified in the genome of Xenopus tropicalis. To do so, we used TBLASTX, provided as a tool on the Xenopus tropicalis website [68] to search for the missing sequences. Strong matches were identified for all of the remaining Hox genes except HoxD12. Using the BLASTN tool on the genome site, we confirmed that the gene order was conserved within each scaffold (data not shown). Interestingly, we were unable to identify HoxD12 within the predicted region though both HosxD11 and HoxD13 were recognized.
Homologue identification from the Cancer Genome Anatomy Project (CGAP) A second example takes advantage of the CGAP database [69] administered by the National Cancer Institute (NCI). This database and resource incorporates a large number of interconnected modules aimed at gene expression in cancer. Among the modules are a Serial Analysis of Gene Expression (SAGE) database [70,71]. The SAGE approach counts polyadenylated transcripts by sequencing a short 14 bp tag at the genes 3'end and is a quantitative method to examine gene expression [70]. Another module is the Digital Gene Expression Displayer (DGED) which distinguishes statistical differences in gene expression between two pools of libraries [72]. Each method generates tables of genes based on a wide variety of selection criteria. As would be expected, the source for the vast majority of the available data comes from either human or mouse thus demanding a tool to cross match the results in Xenopus. For this particular example, we selected a tissue based query (DGED) derived from SAGE data in which we sought a set of genes that might include potential markers for glial or astrocyte fates. For this query, we selected all brain, cortex, cerebellum and spinal cord libraries excluding any libraries derived from cell lines. This yielded 58 potential libraries. From this we selected any library labeled as a glioblastoma for pool A and libraries labeled astrocytoma for pool B while excluding the remaining libraries (which included medulloblastomas, ependymomas, etc.). We did not distinguish between cancer grades. This limited the total number of libraries to six glioblastoma and nine astrocytoma libraries containing 487,197 and 863,610 SAGE tags each, respectively. Submission of the query resulted in the identification of 395 tags with a 2× expression factor and a 0.05 significance factor (default CGAP query values). These 395 tags represented 308 different sequences (180 were >2 fold higher in glioblastoma and 128 were >2 fold higher in astrocytoma) which corresponded to 278 proteins in the public database (115 glioblastoma, 163 astrocytoma) and were matched using the batch GenBank accession module available online in XenDB to 100 and 142 Xenopus sequences, respectively. (In the interests of space we have not included the extended table but provide the saved DGED query [see additional file 6] and the two text files [see additional files 7 and 8] that can be uploaded to the XenDB database). The results table includes links to the matching cluster and TC, the e-value and rank and whether a full length clone has been identified. The contig web link leads to additional information including the consensus analysis, the top FASTY hits to five model organisms and links to the Xenopus EST sequences in the TC (Figure (Figure5).5 Homologues of Drosophila eye development genes In the final example, we take advantage of the database to perform a comparative analysis of microarray expression data. In many instances, the outcome of an array type experiment is a variety of tables listing regulated genes and the associated expression changes. Currently, there are few published Xenopus array studies available [77-85] while there exist extensive databases of expression for a variety of model organisms. The NCBI maintains a common database, the Gene Expression Omnibus [86] which contains data from over 15,000 samples including 337 Human, 92 mouse and 12 Drosophila experiments (average 25 samples/experiment). Based on an ongoing interest in eye development, we selected a recent paper by Michaut and co-workers in the Gehring lab which examined gene expression changes induced by ectopic expression of the eyeless gene (ey/Pax-6) in Drosophila imaginal disks [87]. The development of the eye is evolutionarily conserved among both vertebrates and invertebrates [88,89]. Many important insights into eye development have come from studies in Drosophila which has defined a genetic cascade of evolutionarily conserved regulatory factors [90]. One such factor is Pax-6/eyeless which is capable of inducing ectopic eyes on both flies [91] and vertebrates [92]. In the Michaut study, 371 eye-induced genes are detected using two different oligonucleotide based array platforms (Affymetrix and Hoffmann-LaRoche) and 73 are discussed in detail within the text (Michaut et al., Table 1, 2). To identify likely homologues of these genes in Xenopus, GenBank accession numbers were obtained from the NCBI Gene Expression Omnibus ([93], accession # GSE271) and used to query the XenDB database to identify 47 potential homologues of the Drosophila Pax6/ey regulated genes and included 32 predicted full length sequences (Table 8). As these sequences are available from commercial sources, they can be readily obtained and tested using the various experimental approaches available to Xenopus such as gain of function studies by microinjection.
Discussion Comparative approaches to important biological problems have resulted in enormous progress in the past decades. The advent of genomic and proteomic approaches has led to a torrent of data in many organisms and has demanded increasingly sophisticated bioinformatic approaches to organize and manage the information. We have developed an integrated information resource with a user-friendly interface powered by an automated clustering pipeline which will allow researchers to take advantage of the wealth of knowledge available in the public domain. Comparison to human and mouse Human and mouse are the best studied vertebrate organisms at the molecular level. In addition to the well publicized genome projects, both have extensive EST collections. This has led to the prediction and characterization of 44,775+ human sequences and 36,182 mouse sequences [94]. As vertebrate development is well conserved, it is important to assess the extent to which the Xenopus EST project has identified the known vertebrate genes. At the same time, one would like to identify any genes that are unique to Xenopus. Most gene prediction programs rely on homology thus eliminating this approach to unique gene identification. Sequences without significant homology could arise from incomplete sequencing that does not extend into the coding region. Results of the human genome project suggest that this would not be the case for a majority of the sequences analyzed in this report. The average 5' UTR in humans is 240 bp and the 3' UTR is 400 bp [95]. Sequencing reactions with current technologies yield readable sequence of 700 bases on average. Therefore, at least some subset of sequences would yield their protein sequence to analysis. An alternative origin of non-homologous sequences would be unspliced or improperly spliced transcripts. This possibility is also minimized by the utilization of polyA tails for RNA selection and reverse transcription priming using oligo(dT). A final, obvious and expensive approach is to select non-homologous sequences for full length double stranded sequencing. Sequence without errors more easily yields the desired open reading frame in even the simplest bioinformatic programs. Sequences without hits A class of sequences includes those without significant BLAST hits. In our analysis we have used a cutoff e-value of 10e-6. This of course is necessarily arbitrary, since as mentioned above it is not known what the exact level of similarity is between any given sequence pair. Based on this value, we remain with 43,753 sequences that neither have a BLASTX nor a FASTY hit to a known model organism sequence. The lack of similarity could be due to significant divergence of the sequence, the lack of an appropriate homologue in the public dataset, sequencing errors inherent in EST data or due to the presence of non-coding, presumably regulatory sequences, in the EST clone set. These unmatched sequences mirror the situation in the UniGene set for both mouse and human with greater than 3 and 4 × 106 EST sequences in 76,000 and 106,000 clusters respectively while fewer than 25,000 coding sequences have been recognized [21,94,96]. The source of these discrepancies are currently unclear, but may arise from non coding RNA (ncRNA)[97], micro RNA precursors [98], incompletely or unspliced transcripts [99]. In particular, ncRNAs are a likely source for a large fraction of the discrepancy based on estimates of a 10-fold greater number of non-coding transcription units than protein coding genes [100]. It has been estimated that >95% of transcription is non-coding [101]. Much of the analysis and identification of ncRNA relies on the availability of genomic sequence which is currently unavailable for X. laevis and incomplete for X. tropicalis, the highly homologous diploid species. Completeness of Xenopus EST set We have compared all the Xenopus sequences to the human and mouse protein sets to identify conserved proteins. An obvious question is how complete is the Xenopus EST set and what percentage of genes have been identified assuming that the vast majority of protein coding sequences have been evolutionarily conserved. Of the ~40,000 sequences in the IPI databases, 9,225 human and 7,664 mouse sequences do not have a strong match (E < 1.0e-6). Thus, there is a considerable effort remaining to develop a complete Xenopus protein coding set. In the course of our analysis we note the high degree of similarity between the allotetraploid laevis and diploid tropicalis Xenopus species which depended on the length of the matching sequence. For sequences covering >= 95% of the query, there was an average of 94% identity while the average identity dropped to 91% and 88% as the coverage dropped to 90 and 80% respectively. This conservation may allow sequences from both species to be combined to generate a more complete set. It is well known that the outcome of clustering methods on a large scale depends on the variety of involved parameters. A systematic comparison between UniGene or TIGR Gene Indices and our results turns out to be extremely difficult, mainly because the underlying sequence sets differ as well due to different sequence cleanup and masking approaches. To maximize the utility and usability of our analysis, we have incorporated UniGene and TGI information into our dataset and provide simple tools for identifying the related UniGene and TGI identifier. Future prospects Both the clustering and consensus generation approaches are very rapid: 50 minutes for clustering on a single 900 MHz SPARC-CPU and a few hours for assembly on a cluster of 20 heterogeneous SPARC-based machines with 450 to 900 MHz. We therefore have achieved the design goal of being able to frequently update this aspect of the analysis. The subsequent comparative sequence analysis requires significantly greater resources and time (several weeks on same cluster of heterogeneous workstations). The analysis described above is performed by various PERL based scripts developed during the course of our analysis which will allow updates and application to other model systems. We are currently working on a tool to compare clusters over time which will allow the sequence analysis described below to be performed on the restricted set of modified/new clusters rather than to the entire ensemble. The effect of CAP3 consensus generation is that a given cluster can be split into several separate TC sequences, usually due to low sequence quality or differences in the UTR regions of the sequences. The UTR end splitting is likely due to the differences between the in-paralogs in this allotetraploid species. We believe that such information will be of value to those researchers interested in a variety of evolutionary questions, examples of which will be discussed below. The difference in ploidy makes Xenopus laevis distinct from all of the other organisms for which similar analysis have been performed. As with all ongoing high throughput sequencing efforts, certain aspects of the results change in proportion to the total number of sequences. As noted above, a complete gene set for Xenopus will require additional sequencing. The generation of tetra, octo and dodecaploid species of Xenopus between 80 and 10 million years ago [102] offers opportunities in the field of evolutionary biology. For example, comparisons of 3' UTR regions between in-paralogs of Xenopus laevis and their counterpart diploid tropical species may improve statistical models of molecular evolution. At the genome level, the potential availability of genome data from the polyploid species may provide insight into questions of chromosome segregation and silencing. The selection of Xenopus as a model organism by the NIH http://www.nih.gov/science/models/ and the establishment of the Trans-NIH Xenopus Initiative [103] have directly led to the support of EST and genome sequencing efforts. Among the priorities identified is the establishment and funding of a Xenopus Database [104] which will integrate sequence, expression and other Xenopus data. We hope to be able to update the results described here on a regular basis and contribute to the community effort. Conclusion One of the primary goals of the effort was to provide a resource of gene-oriented EST clusters and transcript oriented TCs, enriched with various information from heterogeneous sources, that would be of value to the biology community and the Xenopus community in particular. Using the XenDB system, the biologist can identify sequences of interest using simple gene name queries, accessions, or gene ontologies. The identified sequences have been mapped to public resources like NCBI's UniGene and TIGR Gene Indices and a consensus sequence prepared. In addition, we have identified publicly available IMAGE clones that maximizes the 5' sequence to provide a full length construct when possible. These clones are available from IMAGE consortium providers. Availability and requirements Sequence availability, XenDB database and results display The database and associated files are freely accessible through the XenDB website: http://bibiserv.techfak.uni-bielefeld.de/xendb/. The GenBank accession numbers and FASTA formatted files of the masked and clipped input sequences, as well as the TC sequences and results of the example applications (see below) can be downloaded. Additionally, the list of full length clones is available to researchers interested in performing genome-wide studies. Programs, scripts and database dumps are available from the authors upon request. The XenDB database should be cited with the present publication as a reference. List of abbreviations used EST: Expressed Sequence Tag, ORDBMS: Object Relational Database Managemant System, TC: tentative contig sequence, KOG: clusters of euKaryotic Orthologous Groups, GO: Gene Ontology, VRT: Vertebrate Sequences, HTC: High Throughput cDNA, XGC: Xenopus Gene Collection, MGC: Mammalian Gene Collection, ZGC: Zebrafish Gene Collection, FL: Full Length, IPI: International Protein Index, CGAP: Cancer Genome Anatomy Project, DGED: Differential Gene Expression Database, SAGE: Serial Analysis of Gene Expression, ncRNA: non-coding RNA, TGI: TIGR Gene Index Authors' contributions A.S. developed and implemented the Vmatch based clustering pipeline. M.B. contributed his high throughput sequence analysis system Genlight. A.S. and M.B. developed the XenDB database schema, performed the post clustering data analyses and contributed to the manuscript. A.H.B. provided supervision and guidance on the development of the project design goals and the interpretation of analysis output with regard to biological significance. R.G. provided supervision and guidance on the development of the clustering pipeline and provided essential infrastructure. C.R.A. provided advice and guidance on the development of the clustering pipeline, the incorporation of analysis into the database and performed and interpreted the various queries presented and wrote a significant portion of the manuscript. Additional File 2 Table S1, Distribution of EST sequences in the analysis based on the annotated tissue source for the preparation of the library. (NOTE: annotations are imported directly from GenBank entries and are dependent on the original annotation.) Click here for file(36K, doc) Additional File 3 Table S2: The 20 most abundant developmental stage annotations in the X. laevis data set as annotated in GenBank: Distribution of EST sequences in the analysis based on the annotated developmental stage of the source library. (NOTE: annotations are imported directly from GenBank entries and are dependent on the original annotation.) Click here for file(30K, doc) Additional File 4 Table S3: The 30 most abundant Clone Libraries in the X. laevis data set as determined by the GenBank annotation. (NOTE: annotations are imported directly from GenBank entries and are dependent on the original annotation.) Click here for file(55K, doc) Additional File 1 Figure S1, Effect of Parameter Variation on EST Clustering: Masked and trimmed EST sequences were clustered using the Vmatch algorithm using different overlap length and percentage identity values. The total number of clusters (blue) and the number of singletons (red) are plotted against the minimal overlap length. Values were plotted at different percentage identities (squares 98%, stars 96%, circles 94%). Click here for file(4.3K, pdf) Additional File 5 Table S4: Sizes of protein sets used for sequence analysis of clustered sequences. Click here for file(31K, doc) Additional File 6 file containing the SAGE database query used in the glioblastoma and astrocytoma analysis. Additional File 7 File containing protein accession numbers of SAGE glioblastoma genes for upload to XenDb system Click here for file(1.2K, txt) Additional File 8 File containing protein accession numbers of SAGE astrocytoma genes for upload to XenDb system Click here for file(1.7K, txt) Acknowledgements The authors thank Jan Reinkensmeier for his help in setting up the XenDB Web pages, Alin Vonika, Trent Clarke and Stefan Kurtz for comments on the manuscript. The FSU School of Computation Science and Information Technology and FSU Supercomputing Facility provided computing resources. CRA was supported by an FSU Research Foundation Program Enhancement Grant. References
|
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||
Science. 1991 Jun 21; 252(5013):1651-6.
[Science. 1991]Nat Rev Genet. 2002 Sep; 3(9):698-709.
[Nat Rev Genet. 2002]Nucleic Acids Res. 1999 Oct 1; 27(19):3911-20.
[Nucleic Acids Res. 1999]Nucleic Acids Res. 2001 Jan 1; 29(1):234-8.
[Nucleic Acids Res. 2001]Genome Res. 1999 Dec; 9(12):1288-93.
[Genome Res. 1999]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D35-40.
[Nucleic Acids Res. 2004]Nat Genet. 1995 Aug; 10(4):369-71.
[Nat Genet. 1995]Genome Res. 1999 Sep; 9(9):868-77.
[Genome Res. 1999]Genome Res. 1998 Mar; 8(3):186-94.
[Genome Res. 1998]Nucleic Acids Res. 2001 Jan 1; 29(1):234-8.
[Nucleic Acids Res. 2001]Genome Res. 1999 Nov; 9(11):1135-42.
[Genome Res. 1999]Nucleic Acids Res. 2000 Sep 15; 28(18):3657-65.
[Nucleic Acids Res. 2000]Nucleic Acids Res. 2005 Jan 1; 33(Database issue):D39-45.
[Nucleic Acids Res. 2005]Nucleic Acids Res. 2001 Jan 1; 29(1):159-64.
[Nucleic Acids Res. 2001]Trends Genet. 2000 Sep; 16(9):418-20.
[Trends Genet. 2000]Nucleic Acids Res. 2004 Jul 1; 32(Web Server issue):W301-4.
[Nucleic Acids Res. 2004]J Comput Biol. 2000 Feb-Apr; 7(1-2):203-14.
[J Comput Biol. 2000]Genome Res. 1999 Sep; 9(9):868-77.
[Genome Res. 1999]Nucleic Acids Res. 1997 Sep 1; 25(17):3389-402.
[Nucleic Acids Res. 1997]Genomics. 1997 Nov 15; 46(1):24-36.
[Genomics. 1997]Proteomics. 2004 Jul; 4(7):1985-8.
[Proteomics. 2004]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D115-9.
[Nucleic Acids Res. 2004]Genome Res. 1996 Sep; 6(9):829-45.
[Genome Res. 1996]Genome Res. 1996 Sep; 6(9):807-28.
[Genome Res. 1996]J Biol Chem. 2005 Jun 24; 280(25):23425-8.
[J Biol Chem. 2005]Nat Genet. 2000 May; 25(1):25-9.
[Nat Genet. 2000]Genome Biol. 2004; 5(2):R7.
[Genome Biol. 2004]J Mol Biol. 2004 Jul 16; 340(4):783-95.
[J Mol Biol. 2004]J Mol Biol. 2001 Jan 19; 305(3):567-80.
[J Mol Biol. 2001]Proc Int Conf Intell Syst Mol Biol. 1998; 6():175-82.
[Proc Int Conf Intell Syst Mol Biol. 1998]Genome Res. 2000 Jul; 10(7):1051-60.
[Genome Res. 2000]Cancer Invest. 2002; 20(7-8):1038-50.
[Cancer Invest. 2002]Annu Rev Biochem. 1994; 63():487-526.
[Annu Rev Biochem. 1994]Development. 1995 Dec; 121(12):4349-58.
[Development. 1995]EMBO J. 1998 Jun 15; 17(12):3413-27.
[EMBO J. 1998]Science. 1995 Oct 20; 270(5235):484-7.
[Science. 1995]Proc Natl Acad Sci U S A. 2002 Aug 20; 99(17):11287-92.
[Proc Natl Acad Sci U S A. 2002]Cancer Res. 1999 Nov 1; 59(21):5403-7.
[Cancer Res. 1999]J Neurosci. 1998 Jan 1; 18(1):237-50.
[J Neurosci. 1998]J Comp Neurol. 2000 Aug 14; 424(1):47-57.
[J Comp Neurol. 2000]Annu Rev Neurosci. 2002; 25():471-90.
[Annu Rev Neurosci. 2002]Nucleic Acids Res. 2004 Feb 13; 32(3):e29.
[Nucleic Acids Res. 2004]Genes Cells. 2004 Aug; 9(8):749-61.
[Genes Cells. 2004]Nucleic Acids Res. 2002 Jan 1; 30(1):207-10.
[Nucleic Acids Res. 2002]Proc Natl Acad Sci U S A. 2003 Apr 1; 100(7):4024-9.
[Proc Natl Acad Sci U S A. 2003]Nat Genet. 1992 Nov; 2(3):232-9.
[Nat Genet. 1992]Proc Natl Acad Sci U S A. 2004 Apr 20; 101(16):6062-7.
[Proc Natl Acad Sci U S A. 2004]Nature. 2002 Dec 5; 420(6915):520-62.
[Nature. 2002]Nature. 2004 Oct 21; 431(7011):931-45.
[Nature. 2004]Proc Natl Acad Sci U S A. 2004 Apr 20; 101(16):6062-7.
[Proc Natl Acad Sci U S A. 2004]FEBS Lett. 2004 Jun 1; 567(1):27-34.
[FEBS Lett. 2004]Cell. 2004 Jan 23; 116(2):281-97.
[Cell. 2004]Bioinformatics. 2004 Nov 1; 20(16):2579-85.
[Bioinformatics. 2004]