![]() | ![]() |
Formats:
|
||||||||||||||||||||||||||||||
Copyright © The Author 2005. Published by Oxford University Press. All rights reserved A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context School of Crystallography, Birkbeck College, London WC1E 7HX, UK 1Department of Pathology and Infectious Diseases, Royal Veterinary College, London NW1 0TU, UK *To whom correspondence should be addressed. Tel: +44 20 7631 6831; Fax: +44 20 7631 6803; Email: m.edwards/at/mail.cryst.bbk.ac.uk Received August 10, 2004; Revised September 27, 2004; Accepted May 16, 2005. The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions/at/oupjournals.org This article has been cited by other articles in PMC.Abstract An important step in understanding the regulation of a prokaryotic genome is the generation of its transcription unit map. The current strongest operon predictor depends on the distributions of intergenic distances (IGD) separating adjacent genes within and between operons. Unfortunately, experimental data on these distance distributions are limited to Escherichia coli and Bacillus subtilis. We suggest a new graph algorithmic approach based on comparative genomics to identify clusters of conserved genes independent of IGD and conservation of gene order. As a consequence, distance distributions of operon pairs for any arbitrary prokaryotic genome can be inferred. For E.coli, the algorithm predicts 854 conserved adjacent pairs with a precision of 85%. The IGD distribution for these pairs is virtually identical to the E.coli operon pair distribution. Statistical analysis of the predicted pair IGD distribution allows estimation of a genome-specific operon IGD cut-off, obviating the requirement for a training set in IGD-based operon prediction. We apply the method to a representative set of eight genomes, and show that these genome-specific IGD distributions differ considerably from each other and from the distribution in E.coli. INTRODUCTION Following determination of the genome sequence and gene identification, a major goal in understanding an organism's biology is to define the transcriptional regulatory networks. In prokaryotes, a key regulatory feature is the operon—a collection of genes transcribed into a polycistronic mRNA. Thus, determining the transcription unit (TU) or operon map indirectly locates promoters, and assists in regulon definition, identification of regulatory networks, regulatory motif detection and interpretation of microarray expression data. As co-transcription is an efficient regulation of genes that have a related biological function, determination of operons is also useful in assignment of gene function and metabolic reconstruction. Intergenic distance The most generally applicable and successful pairwise operon prediction method is the intergenic distance method developed by Collado-Vides and co-workers(1,2). A within-operon likelihood for a gene pair is derived by comparing the frequencies of within-operon and between-operon gene pairs at the given intergenic distance. The procedure has also been incorporated into several combination prediction methods (3,4). Using only intergenic distance, this method has a pairwise accuracy (mean of sensitivity and specificity) of 81% in the case of Escherichia coli, producing a map of TUs recovering 65% of known operons from the RegulonDB dataset (5) (http://www.cifn.unam.mx/Computational_Genomics/regulondb/). Besides distance, other predictors such as expression level, predicted promoter/terminator and operon size usually add comparatively little to the accuracy of predictions (4,6). Despite the importance of the distribution of inter-operon distances for operon prediction, such distribution is well established only for E.coli and Bacillus subtilis, the only organisms with enough experimentally verified operons to enable a reasonable density estimate. Due to this limitation, the distributions from E.coli or B.subtilis are usually copied over to other organisms for the purpose of operon prediction (7). This re-use of the E.coli operon set assumes that all prokaryotic genomes share a common within-operon intergenic distance distribution. Conservation methods Complementary to the intra-genome property of intergenic distance is the inter-genome property of contextual conservation of genes. Evolution in prokaryotes has occurred in a variety of niches. Operons coding for ubiquitous core processes, and to a lesser extent more specialized processes, will have evolved and been conserved across several organisms, in groups of varied phylogenetic composition. Occurrence of operons conserved in these groups has three possible explanations: the shared operon belongs to evolutionarily related organisms; there has been a horizontal transfer of the genes; or unrelated organisms have converged on a solution to a common problem using similar gene activities. While the first two cases initially impose a gene order, genome rearrangement is common (8), and convergent evolution imposes no gene order. Therefore, the content of the operon should remain broadly the same and observation of a conserved cluster of genes in proximity suggests an operon. This idea was used by Ermolaeva et al. (9) who developed a method to assess the probability of an adjacent gene pair being within an operon by observing the frequency with which their homologues occur in other genomes, imposing the constraints of immediate adjacency and a 200 bp intergenic distance cut-off on the candidate gene pairs and their homologues. Gene pair adjacency may not be conserved between species, even when genes stay within the same operon. There might be insertions, deletions or reshufflings within an operon. These problems have been addressed in several works. Overbeek et al. (10) allowed insertions in their assessment of conserved gene pairs. Wolf et al. (11) developed a method that allows for the insertion of genes into an instance of a conserved gene cluster, but the gene order must be at least partially conserved. The lack of conserved gene order is also partly addressed in later work on conserved gene neighbourhoods (12) and über-operons (13). The method of Zheng et al. (14) also relaxes the immediate adjacency constraint in their gapped phylogenetic profile method, but it remains limited to seeking gene pairs in a second genome separated by at most two genes. This gapped comparison increased the number and accuracy of their predictions of functional dependency. Here, we describe a novel graph algorithmic method that identifies conserved clusters of functionally related genes. These clusters are predominantly operons or fragments of operons. This is not surprising as co-transcription is probably the strongest constraint on conserving a gene's genomic context. The method is flexible enough to identify an instance of a conserved cluster that has been completely shuffled relative to the canonical operon and has any number of insertions. The method seeks to identify conserved clusters of genes in a query genome by finding reoccurring unordered groups of homologues from several genomes under the assumption that functionally related genes co-occur in the same directon. We call this process a directon versus directon analysis (DVDA). A directon is a set of consecutive genes on one strand of DNA which are not interrupted by RNA genes or genes on the opposite strand. Each directon is composed of one or more TUs. Conversely, all the genes in an operon exist in the same directon because operons do not normally cross strands. Using the intergenic distance distribution of pairs identified by the DVDA algorithm, we take an empirical Bayes approach to infer the probability that a pair of adjacent genes with a certain intergenic distance belongs to a common operon. This amounts to establishing an intergenic distance cut-off beyond which it becomes increasingly unlikely that the gene pair belongs to the same operon. A complete genome operon map can be created by joining gene pairs with a high probability of belonging to the same operon. By predicting an operon pair set and using this to establish a genome-specific intergenic distance distribution, we surmount the need for a known operon training set and can create a TU map of any completely sequenced genome with only a minimal annotation, that is, an open reading frame (ORF) prediction. The results of this analysis indicate that there is no universally applicable within-operon intergenic distance distribution, and that for intergenic distance-based operon prediction, each genome must be considered as a distinct case. In this paper, we suggest a series of progressively stronger assumptions on functionally related gene pairs and present the results that can be derived by adopting the assumptions. First, a graph algorithm exploits the effect of keeping genes which are functionally related in close vicinity throughout evolution. This stage results in a list of functionally related gene pairs. Such functional pairs mostly comprise gene pairs in a common operon but also a few nonoperon pairs (e.g. repressors or inducers near the controlled operon). For the next stage, we assume that the distance distribution within the functional pairs is representative of the distance distribution within operon pairs. Finally, we provide probability estimates for operon pairs in case one is prepared to accept all pairs identified by the graph algorithm as operon pairs. Although this introduces false positives (FPs), their number is quite limited and might be acceptable in view of the gain in accuracy in the probability estimations. MATERIALS AND METHODS Our approach to generating a genome-specific intergenic distance distribution for genes in operons consists of three steps. In the first step, conserved clusters of genes are identified by finding pairs of directons from different genomes with similar gene composition in a DVDA. Next, a graph algorithmic analysis of the directon comparisons results in a set of candidate operon gene pairs. Finally, a statistical analysis of the distance distribution of these gene pairs provides probabilities that a pair belongs to a common operon based on its intergenic distance. Genomes and operon resources A set of 74 genomes/chromosomes was obtained from the European Bioinformatics Institute ftp site (ftp://ftp.ebi.ac.uk/pub/databases/genomes/). We ensured that no two genomes of the set are too similar using the criteria set down in Ref. (2), assessing the evolutionary separation of two genomes by the degree of sequence divergence of a limited gene set common to both organisms. Such non-redundant set reduces the inclusion of gene pairs belonging to different operons but retaining adjacency due to evolutionary proximity. The set of non-redundant genomes comprised 1 Aquificae, 2 Crenarchaeota, 7 Euryarchaeota, 20 Proteobacteria, 3 Spirochaetes, 21 Firmicutes, 3 Actinobacteria, 2 Bacteroides, 3 Chlamydiae, 1 Chlorobi, 1 Deinococci, 1 Fusobacteria, 1 Cyanobacteria and 1 Thermotogae. The E.coli-verified operon set was provided by RegulonDB and 100 B.subtilis operons are available online (http://www.cib.nig.ac.jp/dda/taitoh/bsub.operon.html). The generation of a putative non-operon pair set was achieved by pairing the first and last gene of an operon with its neighbour in the same directon, if such gene exists. This ensures that for each non-operon pair a known transcriptional boundary is crossed. This is not an ideal solution as the extra gene could very well belong to the operon when expressed under different conditions or as a readthrough. However, using such a set does allow a rough estimate of prediction precision. Identification of maximum matchings between directons Each of the gene sets of the target genomes was compared to that of the query genome, for example, E.coli, using BLASTP (15), with the BLOSUM62 matrix and an E-value cut-off of 10−5. Assignment of homologous pairs, as detailed below, was not dependent on a high degree of sequence similarity. For each query directon with homologues in at least three organisms (suggesting conservation), the scored (BLASTP bitscore) homologues from a single target directon were retrieved. In general, there are instances where a query gene has more than one homologue in the target directon, and two query genes may exhibit homology to a single target gene. While the naïve approach would be to choose the homologue with the highest bitscore, this may force another gene in the query directon to align with a weaker homologue or not at all. To solve this problem, the following graph algorithmic approach was developed [for graph theoretic concepts and algorithms, see (16)]. The optimal assignment of query genes to their homologues in a single directon is achieved using a maximum weight maximum cardinality bipartite matching algorithm (graph algorithms were applied using LEDA—a Library of Efficient Datatypes and Algorithms; http://www.algorithmic-solutions.com/enleda.htm). A graph G(V, E) is bipartite if its vertex set V can be divided into two sets V1 and V2, such that all edges are between V1 and V2 and there are no edges within V1 or V2. In our application, V1 represents the genes in a directon of E.coli and V2, the genes in a directon of a target genome. An edge e E connects a vertex in V1 with one in V2 and represents a similarity above the cut-off; it is weighted with the alignment's bitscore. A matching is a set of edges that do not share any common vertices. A maximum weight maximum cardinality matching is a matching where the number of vertices matched is maximal and, in case there are several such matchings, one among them with the largest cumulative edge weight. Such a matching is more likely to yield the correct pair of orthologues as context and sequence similarity is a stronger indication of orthology than sequence similarity alone. Of all possible matchings for a query directon with all directons of a target organism, only the highest cardinality matching is retained. For some directons of E.coli and some target genomes, no matchings could be constructed due to the lack of homologies above the similarity cut-off.Figure 1
The assignment of orthology between a single gene from one directon to a single gene in another directon is complicated by gene duplication, that is, the presence of paralogues. If a gene in the query directon has more than one potential match, and either paralogue can legitimately be chosen without detriment to the overall matching, then the paralogue with the highest BLAST score will be chosen. Similarly for the case where there are two genes in the query directon that can match to a single gene in the target directon, the highest scoring pair will be matched. The other paralogue can be in either of two general positions relative to the matched paralogue. It can be inside, in the sense that it is between the matched paralogue and another gene matched to its orthologue; or it can be outside in the sense that it is not bound by other genes of the matching. If the unmatched paralogue is inside, then it will be recovered anyway (described in the next section). If it is outside, then it will be lost for the comparison of the query organism against that single target organism. Integrating such paralogues in an operon is beyond the reach of an approach based on comparing genomes. The final case is where both the query and target have a pair of paralogues, A and its paralogue B in the query and A′ and its paralogue B′ in the target. The assignment of the most likely orthologue match A with A′ and B with B′. Merging matchings for a single query directon For each directon of the query organism, all matchings are merged to create a graph G(Vd, Ed) as follows. The vertices v Vd are genes from the query directon. All vertices are connected to each other by edges of weight 0. The maximum matching of the first target genome is retrieved. For any two genes v1, v2 Vd in the query directon, it is established whether they are both connected to homologues from the target genome in the maximum matching. By construction of maximum matchings, the two homologues are different and located in the same directon of the target genome. If v1 and v2 both have homologues, the weight of the edge connecting them is incremented by one, otherwise the weight remains unchanged. The procedure is repeated for each target genome. Finally, all edges with weight zero are removed. This construction ensures that the integer weight of an edge e Ed connecting two genes in a directon of the query organism reflects the co-occurrences of homologues in a single directon across all the other target genomes.To maximize the use of the context data, the resulting merged graph is pruned using a minimum cut algorithm. A subset of edges of a connected graph is separating if their removal leaves a disconnected graph. A minimum cut algorithm returns a separating set of edges that has minimal total weight among all possible separating sets. If the weight of a minimal cut is below a fraction of 0.9 of the total edge weight of the graph, the edges of the cut are removed. This cut-off fraction was chosen after varying the threshold and assessing the specificity of the predictions. Decreasing this threshold would result in the inclusion of less well-conserved gene clusters, but at the expense of including more FP operon pairs. Furthermore, only the component with the largest total edge weight is retained while vertices in the other components are discarded (to be reclaimed in later iterations). At this stage of the process, the graph is trimmed down to a core of highly conserved genes within the query directon with all low frequency associations removed (see Figure 3A and B
The genes that remain in the core cluster are taken as the boundary of a set of genes contained in the same operon. The boundary genes as well as genes between them on the chromosome are collected into a conserved cluster. We include these intervening genes so as to account for organism-specific ‘hitch-hiking’ genes (12). These genes may be inserted and retained in an operon purely on the basis of a fortuitous coincidence of beneficial expression levels between the two TUs. For ease of discussion, the whole cluster will be referred to as ‘conserved’, though in general not all genes in the cluster will be found in the other genomes. The genes in the conserved cluster are then removed from all graph data structures and the above process is iterated for the remaining genes in the directon. In this way, all conserved clusters from all directons are gathered. The final output of the graph algorithmic stage of the analysis is a list of all adjacent pairs of genes which occur in a common conserved cluster. Estimating the probability of an operon pair In this section, we discuss an empirical Bayes approach (17) to the estimation of the probability that a pair of adjacent genes is in the same operon given the number, d, of base pairs between them. The DVDA algorithm as described above produces a list of DVDA pairs, pairs of adjacent genes that are functionally related. In the following, we denote the event that a pair is a DVDA pair by G ( Comparing the densities of distances in DVDA (dashed line in Figure 4a
Equation 3 is completely general and does not depend on knowledge whether a gene pair is a DVDA pair or not. It rests on the assumption that the intergenic distance distributions of operon pairs and DVDA pairs are the same. A third assumption allows us to take the information about the outcome of a DVDA analysis into account as well. Looking at Figure 4a
To summarize, Equation 3 rests on the assumption from Equation 1 only, and Equation 6 on the assumption from Equation 5 only, where both assumptions are suggested by Figure 4a Figure 5
RESULTS We assess the precision of our DVDA predictions by comparing all the adjacent gene pairs found in DVDA conserved gene clusters from E.coli to the RegulonDB known operon list and a putative non-operon set (defined in Materials and Methods). Of the 854 gene pairs in E.coli DVDA clusters (of size 2–18 genes), 334 are TP (true positive, found in RegulonDB) and 58 are FP (false positive, found in the putative non-operon pair set). This amounts to a recovery (proportion of TP among true pairs) of 49.1% of the RegulonDB known operon pairs and a precision (proportion of TP among positives) of 85.2%. Importantly, DVDA does not constrain intergenic distance and is able to predict conserved gene pairs with large intergenic separation, in contrast to the method of Salgado et al. (1) which cannot predict beyond 100 bp separation due to the paucity of known operon pairs at these distances. DVDA achieves a precision of (39%) beyond 100 bp separation, with 26 TP and 41 FP. This reduction in precision may be due in part to reduced coverage by RegulonDB at these large distances. Table 2 contains instances of possible operon pairs that are classified as FP pairs due to our definition. The set of DVDA predicted pairs separated by <100 bp has a precision of 95%. The B.subtilis 1050 predicted pairs set has 158 TP and 38 FP, a precision of 80%.
These high precision DVDA predictions are suitable to infer an intergenic distance distribution for operon pairs that is almost identical to the true (RegulonDB) distribution in the case of E.coli. Figure 4 Our approach of predicting adjacent operon pairs and estimating an intergenic distance distribution from them was applied to a representative set of eight query genomes (3 Proteobacteria, 4 Firmicutes and 1 Archaea). These results suggest that there are differences between genomes in the distribution of intergenic distances (IGDs) of predicted operon pairs. Figure 5 Table 1 describes the intersection points of the two probability estimates with a 0.5 cut-off, with 95% confidence intervals shown in brackets. It is evident that there are considerable differences in intra-operon distances between genomes. There is no obvious correlation of intra-operon distance with parameters such as number and length of directons, median IGD or size of genome (see the Supplementary Material for some examples of such parameters). The predicted E.coli cut-off values are very close to the values generated when the RegulonDB (known operon set) IGD distribution is used instead of the DVDA (predicted pair set) IGD distribution in the evaluation [P(O|d) = 0.5 cut-off is 76 bp for RegulonDB compared to 89 bp for DVDA, see Supplementary Material for figures]. Table 1 provides two different estimates P1(O) and P2(O) of the proportion of operon pairs in directons (adjacent gene pairs across directon boundaries are not counted). On average about two-thirds of adjacent gene pairs in directons are in a common operon. Again there is some variability for this characteristic between genomes that seems uncorrelated to any obvious genome statistic and suggests significant operon structure differences amongst prokaryotic genomes. DISCUSSION Operon prediction with DVDA The generation of operon maps is becoming an increasingly vital part of prokaryotic genome annotation, and IGD is the most informative predictor for this task. The inclusion of other data improves the accuracy of these predictions only marginally and in any case many of these data are available for only a limited number of genomes. Therefore, the generation of a reliable operon IGD distribution is crucial. We present a flexible operon pair prediction method able to detect permuted gene clusters containing conserved gene pairs separated by large IGD. Further, we show that predicted pairs can be used to generate genome-specific within-operon IGD distributions. From these, whole genome operon maps can be created. While the E.coli genome has become the de facto source of IGD distribution for prokaryotic operon prediction due to its experimentally validated operon set, our data suggest that transferring the distance distribution from E.coli to other genomes for operon prediction is questionable, as indicated by the variability in genome-specific values seen in Table 1 and Figure 5 The DVDA protocol differs from other methods in the field of gene context conservation in two important regards. First, in assignment of orthology, sequence similarity and context are balanced. A strong sequence similarity between two genes is not considered enough evidence to assign orthology, unless there is evidence of a conserved functional context. The whole genome is scanned for these contexts and even a bi-directional best hit could be discarded as not being the orthologue if a better context is found for that gene. Second, the conservation methods described in Introduction rely at least in part on conservation of gene order, even though this conservation is known to be very limited. DVDA is completely independent of gene order as the graph representation of the directon to directon comparison contains no information on the order of the genes on the chromosome. DVDA is also independent of IGD and so is able to predict operon pairs at relatively large separation. This property is particularly advantageous since there are verified instances of operon pairs with large intergenic separation but, due to the relative scarcity of these pairs in the RegulonDB training set, these are missed by the original IGD predictor (1). Given the occasionally very large P(O|d) and B.subtilis shows a very large cut-off for operon pairs. This large cut-off is only partly reflected in the known operon set of B.subtilis (see section 2 in the Supplementary Material). In fact, this is one of the reasons why we think the B.subtilis known operon pairs might not be representative. The prediction method is independent of any training data, excepting ORF prediction. Two of the genomes examined have large The DVDA method also allows us to estimate the proportion of within-operon to within-directon pairs for the tested organism. There is notable variability for this proportion between genomes (from ~50 to 82%) that seems uncorrelated to any obvious genome statistic. At one extreme, the plant pathogen X.fastidiosa is estimated to have ~4 out of 5 of its adjacent within-directon gene pairs belonging to operons. Again, this diversity suggests there is no universal operon distribution among prokaryotes. Contextual conservation of functionally interacting genes A conserved operon is not the only mechanism which can result in a persistent gene pair and here we describe some interesting predictions of conservation made by DVDA. These predictions further support DVDA as an accurate predictor of conserved functionally related genes, of which the operon set is the major component. Of the 58 predictions not in the RegulonDB set, 42 have an intergenic separation greater than 100 bp. Eight of these 42 pairs (lacZ lacI, 123 bp separation; pheS pheM 284 bp; atoC atoD 196 bp; ebgR ebgA 148 bp; mglB galS 280 bp; uhpT uhpC 138 bp; treB treR 119 bp; and uxuB uxuR 215 bp) consist of the leader peptide/inducer/repressor of an operon and the first gene of that operon. Retention of an operon repressor or inducer protein in the vicinity of its cognate binding site makes intuitive sense; the induction or repression mechanism would be more sensitive. Small changes in expression of the regulatory protein would result in a large change in the effective local concentration of the protein. When present, the leader peptide of an operon is necessarily immediately upstream of the operon it regulates. It is the conformation of the transcript of the leader peptide that determines whether transcription continues and the downstream operon is expressed. Leader peptides are usually associated with amino acid synthetic processes; pheM is the leader peptide for the pheST genes which code for phenylalanyl-tRNA synthetase. An interesting example of a predicted pair with ambiguous operon status is hisJ argT (221 bp separation). Active transport of metabolites in Gram-negative bacteria is facilitated by periplasmic substrate binding proteins. Both products of the conserved hisJ and argT pair are substrate binding proteins for the His permease (hisJQMP); HisJ is specific for l-Histidine while ArgT binds l-Lysine, l-Arginine and l-Ornithine (20,21). Microarray analysis (22) has suggested that both these sets of genes (argT and hisJQMP) are under the control of the nitrogen regulation system (ntrBC), and that an NtrC controlled promoter of the hisJQMP operon may be upstream of argT (23), i.e. under some conditions, these genes form an operon. An example of a conserved pair not associated with a known operon, but linked by some regulatory process, is the lon hupB predicted pair (209 bp separation). The hupB gene product (HU1) forms a histone like protein (HU) in complex with HU2 (product of the hupA gene). HU1 is degraded by the Lon protease in the absence of HU2 (24). During the log growth phase of E.coli, HU is a HU2 homodimer, whilst during stationary phase HU is a HU1–HU2 heterodimer. Claret and Rouviere-Yaniv (25) hypothesized that Lon inhibits the formation of HU1 homodimers in the absence of HU2. The HU1 homodimer lacks the necessary activity observed in the two other dimer species. This interaction is a clear case for the retention of lon in the vicinity of hupB. The intergenic region between lon and hupB in E.coli encompasses four promoters, three FIS binding sites (which extend more than 200 bp upstream of the hupB start codon) and a CRP binding site [EcoCyc, (26)], suggesting an explanation of the large separation. Further examples of DVDA predicted pairs that are not found in the RegulonDB set are presented in Table 2. Although these exceptions slightly distort the prediction of distances in operon pairs, they seem to be rare enough not to invalidate our approach and support DVDA as an accurate predictor of conserved functional interaction. It is difficult to make direct comparisons to existing gene conservation methods since databases are continuously growing and methods improved. For example, the method published by Ermolaeva et al. (9) was originally based on only 34 genomes. While the results currently available from the corresponding website (http://www.tigr.org/tigr-scripts/operons/operons.cgi) are based on a larger set of genomes, these predictions have been produced by a modified, but as yet unpublished, method. One way to compare the results of the current (February 2005) version of the Ermolaeva method to our method is to choose a confidence threshold for the Ermolaeva predictions so that each method produces approximately the same precision. Choosing a P-value of 0.6 results in 670 putative operon pairs of which 301 are TP and 49 are FN, i.e. a precision of 86.0% and a recovery of 44.3% based on the RegulonDB dataset of 680 operon pairs. This might be compared with a precision of 85.2% and a recovery of 49.1% of the 854 pairs predicted by the DVDA method on the same set. Assumptions made in the method We assume in general that a recent gene duplication (i.e. where the paralogous genes are still adjacent or at least in the same directon) are likely to remain within the same TU. There is no obvious way to assess at what point a paralogue has diverged sufficiently, in its regulatory or catalytic properties, to represent a new gene function with an optimal transcription level different enough to its sibling to justify, in terms of efficiency, the creation or modification of a separate operon. As with all distance-based operon prediction methods, precise IGDs are dependent on the accuracy of annotation of translational start sites. It could be argued that annotation of hypothetical genes is less certain than of those with defined function. In order to test the robustness of DVDA, we re-analysed data for genomes with more than 50% of non-hypothetical genes (E.coli, Mycobacterium tuberculosis, Pseudomonas aeruginosa, X.fastidiosa), removing all clusters containing a hypothetical gene. As shown in Table 1, the estimates of cut-offs for operon pairs are very similar and well within the sampling error as indicated by the bootstrap confidence intervals (see the Supplementary Material for plots of logistic regressions comparable to the ones in Figure 5 The detection of non-operon pairs that are still functionally related and conserved in proximity is a problem affecting all prediction methods based on context. The two most obvious causes of such conservation are über-operons (13) (collections of operons retaining proximity in unrelated organisms) and operon regulatory genes. Our results show instances of the second case, and these may push the 0.5 confidence threshold to larger IGDs. Presumably, this factor affects all genomes, and will not unduly impact cross-genome comparison. Further application of DVDA DVDA uses bipartite matching to decide which single region from a genome A is most similar to a single region in a genome B on the basis of overall homology rather than a few strong sequence similarities between the regions. The matched pairs of genes are probably true orthologues, an assignment supported not only by their sequence similarity but also by their conserved functional context. A family of orthologues could be created by collecting all the matchings for a single directon and clustering the genes by which (for example) E.coli gene they match to. It would be interesting to compare these families of orthologous proteins to the COGs database (27) and investigate any inconsistency. CONCLUSION In this paper, we describe a comparative genomics method for predicting gene pairs whose functional relationship depends on spatial vicinity. The majority of these pairs belong to operons. The method is based on the comparison of homologous genes across genomes and uses advanced graph algorithmic and statistical methods. No training data are required so the method can be applied to any prokaryotic genome almost immediately after it has been sequenced. In contrast to the modest input requirements (an ORF prediction), the output from applying the method is extensive: a complete genome operon map; a collection of context-assisted orthologue assignments from a broad range of unrelated organisms; and instances where functional assignments can be made on the basis of this orthology with already annotated genes. The method is independent of any conservation of gene order and will accept any number of gaps in the alignment between query and target gene clusters. Prediction of conservation is independent of IGD, allowing DVDA to identify conserved gene pairs at arbitrarily large separation. Due to the comparative nature of DVDA, as more genomes become available prediction quantity and quality will increase. DVDA predictions for the eight genomes presented and supplementary information are available online (http://dvda.cryst.bbk.ac.uk). Acknowledgments Thanks are due to the reviewers whose comments were appreciated. M.T.E. acknowledges a studentship by BBSRC and dedicates this work to the memory of Kenneth Peter Packwood. S.C.G.R. was funded by the Wellcome Trust (Grant 062508). L.W. acknowledges JISC and that funding to pay the Open Access publication charges for this article was provided by a Wellcome Trust Functional Genomics grant. Conflict of interest statement. None declared. REFERENCES 1. Salgado H., Moreno-Hagelsieb G., Smith T., Collado-Vides J. Operons in Escherichia coli: genomic analyses and predictions. Proc. Natl Acad. Sci. USA. 2000;97:6652–6657. [PubMed] 2. Moreno-Hagelsieb G., Collado-Vides J. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics. 2002;18:S329–S336. [PubMed] 3. Strong M., Mallick P., Pellegrini M., Thompson M., Eisenberg D. Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: a combined computational approach. Genome Biol. 2003;4:R59.1–R59.16. [PubMed] 4. Craven M., Page D., Shavlik J., Bockhorst J., Glasner J. A probabilistic learning approach to whole-genome operon prediction. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB'00); 20–23 August; La Jolla, California, USA. 2000. pp. 116–127. 5. Salgado H., Gama-Castro S., Martinez-Antonio A., Daz-Peredo E., Sanchez-Solano F., Peralta-Gil M., Garcia-Alonso D., Jimnez-Jacinto V., Santos-Zavaleta A., Bonavides-Martnez C., Collado-Vides J. Transcriptional regulation, operon organization and growth conditions in Escherichia coli k-12. Nucleic Acids Res. 2004;32:D307–D310. [PubMed] 6. Bockhorst J., Qiu Y., Glasner J., Liu M., Blattner F., Craven M. Predicting bacterial transcription units using sequence and expression data. Bioinformatics. 2003;19:i34–i43. [PubMed] 7. Paredes C., Rigoutsos I., Papoutsakis E. Transcriptional organization of the Clostridium acetobutylicum genome. Nucleic Acids Res. 2004;32:1973–1981. [PubMed] 8. Mushegian A., Koonin E. Gene order is not conserved in bacterial evolution. Trends Genet. 1996;12:289–290. [PubMed] 9. Ermolaeva M., White O., Salzberg S. Prediction of operons in microbial genomes. Nucleic Acids Res. 2001;29:1216–1221. [PubMed] 10. Overbeek R., Fonstein M., D'Souza M., Pusch G., Maltsev N. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA. 1999;96:2896–2901. [PubMed] 11. Wolf Y., Rogozin I., Kondrashov A., Koonin E. Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Res. 2001;11:356–372. [PubMed] 12. Rogozin I., Makarova K., Murvai J., Czabarka E., Wolf Y., Tatusov R., Szekely L., Koonin E. Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res. 2002;30:2212–2223. [PubMed] 13. Lathe W.C., III, Snel B., Bork P. Gene context conservation of a higher order than operons. Trends Biochem. Sci. 2000;25:474–479. [PubMed] 14. Zheng Y., Roberts R.J., Kasif S. Genomic functional annotation using co-evolution profiles of gene clusters. Genome Biol. 2003;3:RESEARCH0060.1–0060.9. 15. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402. [PubMed] 16. Gibbons A. Algorithmic Graph Theory. Cambridge, UK: Cambridge University Press; 1985. 17. Efron B., Tibshirani R., Storey J.D., Tusher V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 2001;96:1151–1160. 18. Wood S.N. Modelling and smoothing parameter estimation with multiple quadratic penalties. J. R. Stat. Soc. B. 2000;62:413–428. 19. da Silva F.R., Vettore A.L., Kemper E.L., Leite A., Arruda P. Fastidian gum: the Xylella fastidiosa exopolysaccharide possibly involved in bacterial pathogenicity. FEMS Microbiol. Lett. 2001;203:165–171. [PubMed] 20. Kanehisa M., Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30. [PubMed] 21. Lodish H., Berk A., Zipursky S., Matsudaira P., Baltimore P., Darnell J. Molecular Cell Biology, 4th edn. NY: W.H.Freeman and Company; 2000. 22. Zimmer D., Soupene E., Lee H., Wendisch V., Khodursky A., Peter B., Bender R., Kustu S. Nitrogen regulatory protein c-controlled genes of Escherichia coli: scavenging as a defense against nitrogen limitation. Proc. Natl Acad. Sci. USA. 2000;97:14674–14679. [PubMed] 23. Schmitz G., Durre P., Mullenbach G., Ames G.F. Nitrogen regulation of transport operons—analysis of promoters argTr and dhuA. Mol. Gen. Genet. 1987;209:403–407. [PubMed] 24. Bonnefoy E., Almeida A., Rouviere-Yaniv J. Lon-dependent regulation of the DNA binding protein HU in Escherichia coli. Proc. Natl Acad. Sci. USA. 1989;86:7691–7695. [PubMed] 25. Claret L., Rouviere-Yaniv J. Variation in HU composition during growth of Escherichia coli: the heterodimer is required for -long term survival. J. Mol. Biol. 1997;273:93–104. [PubMed] 26. Karp P., Riley M., Saier M., Paulsen I.T., Collado-Vides J., Paley S.M., Pellegrini-Toole A., Bonavides C., Gama-Castro S. The EcoCyc Database. Nucleic Acids Res. 2002;30:56–58. [PubMed] 27. Tatusov R.L., Koonin E.V., Lipman D.J. A genomic perspective on protein families. Science. 1997;278:631–637. [PubMed] 28. Iwata-Reuyl D. Biosynthesis of the 7-deazaguanosine hypermodified nucleosides of transfer RNA. Bioorg. Chem. 2003;31:24–43. [PubMed] 29. Llamas M.A., Ramos J.L., Rodriguez-Herva J.J. Transcriptional organization of the Pseudomonas putida tol-oprL genes. J. Bacteriol. 2003;185:184–195. [PubMed] 30. Dubuisson J., Vianney A., Lazzaroni J. Mutational analysis of the TolA C-terminal domain of Escherichia coli and genetic evidence for an interaction between TolA and TolB. J. Bacteriol. 2002;184:4620–4625. [PubMed] 31. Ge Y., Old I., Girons I., Charon N. The flgK motility operon of Borrelia burgdorferi is initiated by a sigma 70-like promoter. Microbiology. 1997;143:1681–1690. [PubMed] 32. Zhang Y., Cronan J. Transcriptional analysis of essential genes of the Escherichia coli fatty acid biosynthesis gene cluster by functional replacement with the analogous Salmonella typhimurium gene cluster. J. Bacteriol. 1998;180:3295–3303. [PubMed] 33. Boukhvalova M.S., Dahlquist F.W., Stewart R.C. CheW binding interactions with CheA and Tar. Importance for chemotaxis signaling in Escherichia coli. J. Biol. Chem. 2002;277:22251–22259. [PubMed] 34. Rahman M.S., Simser J.A., Macaluso K.R., Azad A.F. Molecular and functional analysis of the lepB gene, encoding a type I signal peptidase from Rickettsia rickettsii and Rickettsia typhi. J. Bacteriol. 2003;185:4578–4584. [PubMed] 35. Clements J.M., Coignard F., Johnson I., Chandler S., Palan S., Waller A., Wijkmans J., Hunter M.G. Antibacterial activities and characterization of novel inhibitors of LpxC. Antimicrob. Agents Chemother. 2002;46:1793–1799. [PubMed] 36. de Cock H., Pasveer M., Tommassen J., Bouveret E. Identification of phospholipids as new components that assist in the in vitro trimerization of a bacterial pore protein. Eur. J. Biochem. 2001;268:865–875. [PubMed] 37. Kloser A., Laird M., Deng M., Misra R. Modulations in lipid A and phospholipid biosynthesis pathways influence outer membrane protein assembly in Escherichia coli K-12. Mol. Microb. 1998;27:1003–1008. 38. Ravcheev D., Gelfand M., Mironov A., Rakhmaninova A. Purine regulon of gamma-proteobacteria: a detailed description. Genetika. 2002;38:1203–1214. [PubMed] |
PubMed related articles
Your browsing activity is empty. Activity recording is turned off. |
|||||||||||||||||||||||||||||
Proc Natl Acad Sci U S A. 2000 Jun 6; 97(12):6652-7.
[Proc Natl Acad Sci U S A. 2000]Bioinformatics. 2002; 18 Suppl 1():S329-36.
[Bioinformatics. 2002]Genome Biol. 2003; 4(9):R59.
[Genome Biol. 2003]Nucleic Acids Res. 2004 Jan 1; 32(Database issue):D307-10.
[Nucleic Acids Res. 2004]Bioinformatics. 2003; 19 Suppl 1():i34-43.
[Bioinformatics. 2003]Trends Genet. 1996 Aug; 12(8):289-90.
[Trends Genet. 1996]Nucleic Acids Res. 2001 Mar 1; 29(5):1216-21.
[Nucleic Acids Res. 2001]Proc Natl Acad Sci U S A. 1999 Mar 16; 96(6):2896-901.
[Proc Natl Acad Sci U S A. 1999]Genome Res. 2001 Mar; 11(3):356-72.
[Genome Res. 2001]Nucleic Acids Res. 2002 May 15; 30(10):2212-23.
[Nucleic Acids Res. 2002]Trends Biochem Sci. 2000 Oct; 25(10):474-9.
[Trends Biochem Sci. 2000]Bioinformatics. 2002; 18 Suppl 1():S329-36.
[Bioinformatics. 2002]Nucleic Acids Res. 1997 Sep 1; 25(17):3389-402.
[Nucleic Acids Res. 1997]Nucleic Acids Res. 2002 May 15; 30(10):2212-23.
[Nucleic Acids Res. 2002]Proc Natl Acad Sci U S A. 2000 Jun 6; 97(12):6652-7.
[Proc Natl Acad Sci U S A. 2000]Proc Natl Acad Sci U S A. 2000 Jun 6; 97(12):6652-7.
[Proc Natl Acad Sci U S A. 2000]FEMS Microbiol Lett. 2001 Sep 25; 203(2):165-71.
[FEMS Microbiol Lett. 2001]Nucleic Acids Res. 2000 Jan 1; 28(1):27-30.
[Nucleic Acids Res. 2000]Proc Natl Acad Sci U S A. 2000 Dec 19; 97(26):14674-9.
[Proc Natl Acad Sci U S A. 2000]Mol Gen Genet. 1987 Sep; 209(2):403-7.
[Mol Gen Genet. 1987]Proc Natl Acad Sci U S A. 1989 Oct; 86(20):7691-5.
[Proc Natl Acad Sci U S A. 1989]J Mol Biol. 1997 Oct 17; 273(1):93-104.
[J Mol Biol. 1997]Nucleic Acids Res. 2002 Jan 1; 30(1):56-8.
[Nucleic Acids Res. 2002]Nucleic Acids Res. 2001 Mar 1; 29(5):1216-21.
[Nucleic Acids Res. 2001]Trends Biochem Sci. 2000 Oct; 25(10):474-9.
[Trends Biochem Sci. 2000]Science. 1997 Oct 24; 278(5338):631-7.
[Science. 1997]