Genome Res. Sep 2006; 16(9): 1169–1181.
PMCID: PMC1557769

Græmlin: General and robust alignment of multiple large interaction networks

Abstract

The recent proliferation of protein interaction networks has motivated research into network alignment: the cross-species comparison of conserved functional modules. Previous studies have laid the foundations for such comparisons and demonstrated their power on a select set of sparse interaction networks. Recently, however, new computational techniques have produced hundreds of predicted interaction networks with interconnection densities that push existing alignment algorithms to their limits. To find conserved functional modules in these new networks, we have developed Græmlin, the first algorithm capable of scalable multiple network alignment. Græmlin's explicit model of functional evolution allows both the generalization of existing alignment scoring schemes and the location of conserved network topologies other than protein complexes and metabolic pathways. To assess Græmlin's performance, we have developed the first quantitative benchmarks for network alignment, which allow comparisons of algorithms in terms of their ability to recapitulate the KEGG database of conserved functional modules. We find that Græmlin achieves substantial scalability gains over previous methods while improving sensitivity.

The publication of the second complete genome sequence in 1995 (Kaneko et al. 1995) ushered in the era of computational comparative genomics. The years that followed saw the application of cross-species genomic comparisons to problems ranging from gene prediction (Bafna and Huson 2000; Batzoglou et al. 2000; Korf et al. 2001; Alexandersson et al. 2003) to functional genomics (Pellegrini et al. 1999) to the analysis of entire genomes (Waterston et al. 2002; Cooper et al. 2004; Hillier et al. 2004). These diverse application areas were united by perhaps the most important premise of modern biology: the principle that evolutionary conservation implies functional relevance (Bejerano et al. 2004; Cooper et al. 2005; Siepel et al. 2005).

Today, the most direct analog of the exponential growth in sequence data is the rise of large-scale protein interaction network data (Uetz et al. 2000; Giot et al. 2003; Li et al. 2004). Computational and experimental techniques for inferring these networks have steadily improved (Fromont-Racine et al. 1997; Eisen et al. 1998), and state-of-the-art methods use multiple data sources to produce a unified prediction of protein interactions (Lee et al. 2004; Lu et al. 2005). The number of interaction networks is likewise increasing rapidly; in particular, a recent technique for computationally scalable data integration has produced integrated protein interaction networks for 11 microbes (Srinivasan et al. 2006), with hundreds more in preparation. Just as the rapid deposition of genomic data enabled the study of sequence conservation, the growth in network quality and availability allows us to ask questions at the network level (Milo et al. 2002).

One promising way of answering such questions is through network alignment, a systems-biological analog of sequence alignment intended to identify conserved functional modules (Hartwell et al. 1999). Research in this area has steadily progressed, beginning with manual alignments of metabolic pathways (Dandekar et al. 1999; Forst and Schulten 2001), proceeding to precursors of network alignment guided by highest-scoring pairwise BLAST hits (Altschul et al. 1997; Ogata et al. 2000; Matthews et al. 2001; Stuart et al. 2003), and culminating in the modern formulation of network alignment (Kelley et al. 2003; Koyuturk et al. 2005). Recent work has partially removed previous limitations by enabling searches for conserved multiprotein complexes in addition to pathways (Sharan et al. 2005a) and allowing the simultaneous comparison of three species rather than two (Sharan et al. 2005b). However, the general problem of finding conserved modules of arbitrary topology within an arbitrary number of networks has not yet been addressed.

In this paper we describe Græmlin, a novel network alignment framework that is fast, scalable, and capable of searching large sets of dense networks for conserved functional modules. Græmlin's probabilistic formulation of the topology-matching problem eliminates earlier restrictions on the possible architecture of conserved modules. Most important, Græmlin is the first program capable of multiple alignment of an arbitrary number of networks.

To assess Græmlin's ability to find conserved functional modules, we have performed the first quantitative comparison of network aligners. Using data sets containing known biological modules as a benchmark (Ashburner et al. 2000; Kanehisa and Goto 2000), we find that Græmlin achieves substantial gains in sensitivity over previous methods while offering fast and scalable searches of multiple, large networks. In addition to statistical benchmarking, we present detailed analyses of several alignments that suggest interesting hypotheses about protein function.

Græmlin is available through a Web interface located at http://graemlin.stanford.edu, where users can search for conserved functional modules within a large database of microbial networks. Source code is also available under the GNU General Public License.

Methods

Græmlin is a network alignment algorithm capable of searching large sets of dense interaction networks for evolutionarily conserved functional modules, which are groups of homologous proteins with conserved pairwise interactions. Græmlin supports both global and local search; it can be used either to generate an exhaustive list of conserved modules in a set of networks (network-to-network alignment) or to find matches to a particular module within a database of interaction networks (query-to-network alignment).

Depending on the context of a study, one may wish to find functional modules that are present within all species or simply those that are enriched within a particular clade. Græmlin enables both kinds of comparative analysis, as it can rapidly search a large number (N > 3) of interaction networks to find functional modules that are significantly conserved in two or more species.

The efficient performance of Græmlin is due to the use of several strategies common in sequence alignment (Batzoglou 2005). First, its variant of “progressive alignment” (Feng and Doolittle 1987) allows it to scale linearly with the number of networks compared. Second, Græmlin searches for pairwise alignments between networks using a modification of the “seed extension” method popularized by BLAST (Altschul et al. 1997). Finally, it allows an explicit speed-sensitivity trade-off through the control of a parameter analogous to the BLAST word size (Altschul et al. 1990).

Below we outline our definition of a network alignment, the scoring model used by Græmlin, and its algorithm for finding high-scoring pairwise and multiple alignments.

Definition of an alignment

Given interaction networks for a set of related species, the goal of a network aligner is to extract conserved subnetworks that are statistically significant relative to alignments found in biologically unrelated networks. Such subnetworks are hypothesized to have evolved from a functional module originally present in the common ancestor.

We represent each input network as a weighted graph $G_i = (V_i, E_i)$, where nodes correspond to proteins and each weighted edge specifies the probability that two proteins interact. We define a network alignment as a set of subgraphs chosen from the interaction networks of different species, together with a mapping between corresponding, or aligned, proteins. To uniquely specify an alignment, we require that the mapping be transitive; that is, if protein A is aligned to proteins B and C, then protein B must also be aligned to protein C. Mathematically, this means that the mapping is an equivalence relation; consequently, the groups of aligned proteins are disjoint, and we refer to them as equivalence classes for this reason.
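As an illustration of the transitivity requirement, the following minimal sketch groups proteins into equivalence classes from a list of pairwise "aligned" relations using a union-find structure. The function name and the protein identifiers are hypothetical and are not taken from the Græmlin source code.

```python
from collections import defaultdict

def equivalence_classes(aligned_pairs):
    """Group proteins into disjoint classes from pairwise 'aligned' relations.

    Union-find enforces transitivity: if A is aligned to B and to C,
    then B and C end up in the same class.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in aligned_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical identifiers written as species:protein.
pairs = [("eco:A", "vch:B"), ("eco:A", "ccr:C"), ("eco:X", "vch:Y")]
classes = equivalence_classes(pairs)
```

Here B and C are never aligned to each other directly, yet they fall into one class because both are aligned to A.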

We also require that all aligned proteins be homologous. Therefore, proteins in the same equivalence class are in general members of the same protein family (Andreeva et al. 2004; Marchler-Bauer et al. 2005). In this manner, a biological interpretation of an alignment is a collection of protein families whose interactions are conserved across a given set of species.

This definition affords important advantages. Because the members of a protein family descend from a common ancestor, we can reconstruct the evolutionary events leading from each ancestral protein to its extant descendants. By combining this with a reconstruction of the evolutionary history of each pairwise interaction, we can interpret each network alignment as a hypothesis about the evolution of a conserved ancestral module. Intuitively, network alignments should receive high scores if their evolutionary dynamics resemble those of known, conserved functional modules rather than those of random collections of proteins.

With this definition, there are two core problems in network alignment. First, we must devise a scoring framework that captures the knowledge we have about module evolution. Then, we must find a way to rapidly identify high-scoring alignments, meaning conserved functional modules, from among the exponentially large set of possible alignments. We address each problem in turn.

Scoring of an alignment

The evolutionary interpretation of an alignment leads to a natural scoring function. We first define two models that assign probabilities to the evolutionary events leading from the hypothesized ancestral module to modules in the extant species: the alignment model posits that the module is subject to evolutionary constraint, while the random model assumes that the proteins are under no constraints. The score of the alignment is the log-ratio of the two probabilities, a common method for scoring sequence alignments (Durbin et al. 1998). Figure 1 shows a sample alignment, together with an overview of the scoring framework.

Figure 1.
Method for scoring a multiple network alignment. (A) A sample multiple alignment. The four networks are from four different species. Each circle represents a protein, and edges link proteins that are hypothesized to interact; the width of an edge is proportional ...

Græmlin individually scores each equivalence class and each edge of an alignment. To score equivalence classes, it uses a straightforward scheme that reconstructs the most parsimonious ancestral history of an equivalence class, based on five types of evolutionary events: protein sequence mutations, protein insertions and deletions, protein duplications, and protein divergences; a protein divergence occurs when a paralogous protein loses its function and is the inverse of a duplication. The alignment model and the random model give each of these events a different probability. Currently, we estimate probabilities of sequence mutations in a principled manner, but we determine probabilities of other events heuristically; a detailed discussion is provided in the Supplemental material. This is analogous to sequence alignment, where traditionally substitution matrices are estimated rigorously (Henikoff and Henikoff 1993; Chiaromonte et al. 2002) but gap penalties are set in a heuristic manner (Brudno et al. 2003; Blanchette et al. 2004; Bray and Pachter 2004; Edgar 2004).

To determine the probabilities for sequence mutations, Græmlin uses weighted sum-of-pairs scoring (Altschul et al. 1989). Each model assigns a probability to a pair of proteins based on a BLAST bitscore; we trained the alignment model by sampling pairs of proteins from within the same COG (Tatusov et al. 1997; Kelley et al. 2003; Sharan et al. 2005b), and we trained the random model on random pairs of proteins. The log-ratio of these two distributions gives a scoring function for a pair of proteins. The sequence mutation score of an equivalence class is the weighted sum-of-pairs score taken over all pairs of proteins in the class, using a phylogenetic tree relating the species in the alignment.
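The weighted sum-of-pairs computation can be sketched as follows. This is only an illustration under stated assumptions: the two step-function distributions over bitscores, the pair weights (which stand in for weights derived from a phylogenetic tree), and all names are hypothetical, and the example holds one protein per species.

```python
import math
from itertools import combinations

def pair_score(bitscore, p_align, p_random):
    """Log-ratio of the probability of a bitscore under the two models."""
    return math.log(p_align(bitscore) / p_random(bitscore))

def sum_of_pairs(cls, bitscores, weights, p_align, p_random):
    """Weighted sum-of-pairs score of one equivalence class.

    cls       : list of (species, protein) tuples, one protein per species here
    bitscores : {frozenset({protein1, protein2}): BLAST bitscore}
    weights   : {frozenset({species1, species2}): weight} (hypothetical values)
    """
    total = 0.0
    for (s1, p1), (s2, p2) in combinations(cls, 2):
        weight = weights[frozenset((s1, s2))]
        score = pair_score(bitscores[frozenset((p1, p2))], p_align, p_random)
        total += weight * score
    return total

# Toy distributions: high bitscores are likely under the alignment model
# and unlikely under the random model.
def p_align(bitscore):
    return 0.8 if bitscore >= 100 else 0.2

def p_random(bitscore):
    return 0.1 if bitscore >= 100 else 0.9

cls = [("eco", "a"), ("vch", "b"), ("ccr", "c")]
bitscores = {frozenset("ab"): 250, frozenset("ac"): 120, frozenset("bc"): 40}
weights = {
    frozenset(("eco", "vch")): 0.5,
    frozenset(("eco", "ccr")): 0.3,
    frozenset(("vch", "ccr")): 0.2,
}
class_score = sum_of_pairs(cls, bitscores, weights, p_align, p_random)
```

The two strongly similar pairs contribute positive terms, while the weakly similar pair contributes a negative term scaled by its smaller weight.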

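As with equivalence classes, we define edge scores as the log-ratio of two probabilities: Each edge e is assigned a score $S_e = \log\bigl(\Pr_{\text{align}}(e) / \Pr_{\text{random}}(e)\bigr)$, where the two probabilities are those assigned to the edge by the alignment model and the random model, respectively.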

The random model assigns each edge a probability parametrized not only by its weight but also by the degrees of its endpoints; this captures the intuitive notion that in any graph, two nodes of high degree are more likely to interact by chance than two nodes of low degree. The alignment model for edges, however, is not as straightforward: Unlike in the case of proteins, Græmlin cannot always assume that an edge existed in the ancestral module. This assumption would, for instance, always reward highly connected modules more than equally conserved but loosely connected modules. The alternative of considering only the conservation of the presence or absence of an edge would score a completely unconnected alignment highly.

To address these issues, Græmlin uses a novel scoring scheme that allows a user to specify the desired ancestral topology; this generalizes previous edge-scoring approaches (Sharan et al. 2005b) and permits searches for arbitrary module structures, including as special cases multiprotein complexes and pathways. We use an Edge Scoring Matrix, or ESM, to encapsulate the desired module structure into a symmetric matrix. An ESM has a set of labels by which its rows and columns are indexed, and each cell in the matrix contains a probability distribution over edge weights. To score edges in an alignment, Græmlin first assigns to each equivalence class one of the labels from the ESM. Then, it scores each edge e using the cell in the matrix indexed by the labels of the two equivalence classes to which its endpoints belong: The function in the cell maps the weight of the edge to a probability $\Pr_{\text{align}}(e)$ under the alignment model, which is used to compute the score $S_e = \log\bigl(\Pr_{\text{align}}(e) / \Pr_{\text{random}}(e)\bigr)$, where $\Pr_{\text{random}}(e)$ is the probability of the edge under the random model.

Figure 1C shows three examples of an ESM, as well as the functions used by each to score the edges in a sample alignment; these include two special cases, pathways and multiprotein complexes, that have been the subjects of past studies (Kelley et al. 2003; Koyuturk et al. 2005; Sharan et al. 2005b). To search for conserved multiprotein complexes, we use a Complex ESM, which consists of a single label with an alignment distribution assigning high probabilities to high edge weights. A Pathway ESM has one label for each protein in the pathway and rewards high edge weights between adjacent proteins; between all other proteins, the alignment and random distributions are the same, so that Græmlin neither rewards nor penalizes edges connecting nonadjacent proteins.
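A minimal sketch of how a Complex ESM and a Pathway ESM might be represented is given below. The densities over edge weights are hypothetical placeholders chosen only to illustrate the structure (they are not the distributions used by Græmlin), and for simplicity the random model here ignores node degrees.

```python
import math

def uniform(weight):
    """Hypothetical density over edge weights in [0, 1]."""
    return 1.0

def favors_high(weight):
    """Hypothetical density on [0, 1] that rewards high edge weights."""
    return 0.1 + 1.8 * weight

def make_esm(cells, default=(uniform, uniform)):
    """Build an edge-scoring function from a symmetric matrix of cells.

    cells maps frozenset({label1, label2}) to a pair
    (alignment density, random density). Cells that are not listed use
    `default`, whose two densities are equal and therefore score 0.
    """
    def score(label_u, label_v, weight):
        pr_align, pr_random = cells.get(frozenset((label_u, label_v)), default)
        return math.log(pr_align(weight) / pr_random(weight))
    return score

# Complex ESM: a single label, so every edge uses the same cell.
complex_esm = make_esm({frozenset(("c",)): (favors_high, uniform)})

# Pathway ESM for a three-protein pathway 0 - 1 - 2: only edges between
# adjacent labels are rewarded.
pathway_esm = make_esm({
    frozenset((0, 1)): (favors_high, uniform),
    frozenset((1, 2)): (favors_high, uniform),
})

strong_complex_edge = complex_esm("c", "c", 0.9)
weak_complex_edge = complex_esm("c", "c", 0.1)
nonadjacent_pathway_edge = pathway_esm(0, 2, 0.9)
```

In the Pathway ESM the edge between the nonadjacent labels 0 and 2 scores exactly zero, which matches the statement that such edges are neither rewarded nor penalized.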

The third type of ESM we consider is automatically generated when a user searches a large network for matches to a small query network. We refer to this as a Module ESM because the query network will often consist of a hypothetical or known biological module. In this case, Græmlin creates a label in the ESM for each node in the query and generates the alignment distribution based on the edges that are present or absent in the query. For each cell in the ESM, it defines the distribution based on the weight of the edge between the two corresponding proteins in the query. When aligning a query to multiple species, Græmlin refines the ESM as more species are added to the alignment; in this case, rather than creating a label for each protein, it creates a label for each equivalence class and uses kernel density estimation (Duda et al. 2000) to train the distributions from the entire set of edges present in the alignment. When used in this form, we refer to an ESM as a Location Specific Scoring Matrix, or LSSM, because of its conceptual and practical similarity to the Position Specific Scoring Matrix (PSSM) used in PSI-BLAST (Altschul et al. 1997). Figure 2 shows a simple example of the manner in which Græmlin constructs an LSSM.
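The refinement of an LSSM can be illustrated with a small Gaussian kernel density estimator. This is a sketch only: the choice of a Gaussian kernel, the bandwidth value, and the class and method names are assumptions made for illustration and are not specified in the text.

```python
import math

def gaussian_kde(samples, bandwidth=0.1):
    """Return a density function estimated from observed edge weights."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(weight):
        return norm * sum(
            math.exp(-0.5 * ((weight - s) / bandwidth) ** 2) for s in samples
        )
    return density

class LSSM:
    """One list of observed edge weights per pair of equivalence classes."""

    def __init__(self, bandwidth=0.1):
        self.bandwidth = bandwidth
        self.weights = {}  # frozenset({class1, class2}) -> list of weights

    def add_species(self, edges):
        """edges: {(class1, class2): edge weight seen in the new species}."""
        for (c1, c2), weight in edges.items():
            self.weights.setdefault(frozenset((c1, c2)), []).append(weight)

    def density(self, c1, c2):
        """Alignment-model density for the cell indexed by two classes."""
        return gaussian_kde(self.weights[frozenset((c1, c2))], self.bandwidth)

lssm = LSSM()
lssm.add_species({("A", "B"): 0.9, ("B", "C"): 0.8})   # the query itself
before = lssm.density("A", "B")(0.3)
lssm.add_species({("A", "B"): 0.3, ("B", "C"): 0.85})  # a newly aligned species
after = lssm.density("A", "B")(0.3)
```

After the second species is added, a weak A-B edge becomes far more probable under the cell's distribution, which is the sense in which the matrix adapts as the alignment grows.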

Figure 2.
An example of an LSSM. As Græmlin successively adds species to the multiple alignment, the distributions in the ESM cells change to reflect the new edges. At each step, the cell with a modified distribution is highlighted together with the edge ...

Alignment algorithm

Figure 3 shows an outline of the Græmlin algorithm, including the methodology it uses for pairwise and multiple alignment.

Figure 3.
Outline of the Græmlin algorithm. (A) Shown here are four networks, together with their phylogenetic relationship. Græmlin will multiply align all four. (B) Græmlin first performs a pairwise alignment of the two closest species. ...

Pairwise alignment

To search for high-scoring alignments between a pair of networks efficiently, Græmlin first generates a set of “seeds,” which it uses to restrict the size of the search space. We refer to the structures used for seed generation as “d-clusters,” which consist of d proteins that are close together in a network and are analogs of k-mers in seeded local alignment search.

For each network, Græmlin constructs one d-cluster for every node by finding the d − 1 nearest neighbors of that node, where the length of an edge is the negative logarithm of its weight. Græmlin compares two d-clusters $D_1$ and $D_2$ by mapping a subset of nodes in $D_1$ to a subset of nodes in $D_2$ and reporting a score equal to the sum of all pairwise scores induced by the mapping; the score of two d-clusters is the highest-scoring such mapping. Græmlin identifies pairs of d-clusters, one from each network, that score higher than a threshold T and uses these as seeds. Figure 3B shows a sample set of d-clusters generated from two networks, as well as a high-scoring pair.
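The construction of a d-cluster can be sketched with Dijkstra's algorithm, stopping once d nodes have been settled. The graph representation and function name below are assumptions made for illustration.

```python
import heapq
import math

def d_cluster(graph, source, d):
    """Return `source` followed by its d - 1 nearest neighbours.

    graph: {node: {neighbour: weight}} with weights in (0, 1].
    The length of an edge is -log(weight), so high-probability
    interactions are short.
    """
    best = {source: 0.0}
    settled = []
    visited = set()
    heap = [(0.0, source)]
    while heap and len(settled) < d:
        dist, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        settled.append(node)
        for neighbour, weight in graph[node].items():
            candidate = dist - math.log(weight)
            if candidate < best.get(neighbour, math.inf):
                best[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return settled

# Toy weighted network (symmetric adjacency).
toy = {
    "a": {"b": 0.9, "c": 0.5, "e": 0.3},
    "b": {"a": 0.9, "d": 0.9},
    "c": {"a": 0.5, "d": 0.9},
    "d": {"b": 0.9, "c": 0.9},
    "e": {"a": 0.3},
}
cluster = d_cluster(toy, "a", 3)
```

Node d is not adjacent to a, but the path a-b-d of two high-weight edges is shorter than the direct low-weight edge a-c, so d enters the cluster before c.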

The benefits of the d-cluster seeding technique are severalfold. First, Græmlin can compare d-clusters rapidly, since the comparison neglects edge scores. Second, the parameters d and T allow for a speed-sensitivity trade-off. As an example, a lower value of T will achieve higher sensitivity but require increased running time; this adjustable trade-off is not present in previous techniques (Koyuturk et al. 2005; Sharan et al. 2005b). Finally, high-scoring alignments are likely to contain high-scoring d-clusters, since a high node score of an alignment is usually a prerequisite to a high overall score. We can give this intuition a mathematical foundation using ideas similar to those underlying spaced seed analysis techniques (Ma et al. 2002; Sun and Buhler 2005); this analysis, which we discuss in the Supplemental material, yields some intuition into the interplay between the two dependent parameters d and T.
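Because the comparison of two d-clusters uses node scores only, it can be written as a search over mappings between the two small node sets. The sketch below enumerates all mappings by brute force, which is feasible for small d; this enumeration strategy, the toy node scores, and the function names are assumptions, and the text does not state how Græmlin performs the search.

```python
from itertools import permutations

def cluster_score(c1, c2, node_score):
    """Best total node score over partial one-to-one mappings of c1 to c2.

    A pair with a negative score is simply left unmapped, which is why
    each term is clipped at zero.
    """
    flip = len(c1) > len(c2)
    small, large = (c2, c1) if flip else (c1, c2)
    best = 0.0
    for perm in permutations(large, len(small)):
        total = 0.0
        for a, b in zip(small, perm):
            s = node_score(b, a) if flip else node_score(a, b)
            total += max(0.0, s)
        best = max(best, total)
    return best

def find_seeds(clusters1, clusters2, node_score, threshold):
    """Pairs of d-clusters, one per network, scoring above the threshold T."""
    return [
        (c1, c2)
        for c1 in clusters1
        for c2 in clusters2
        if cluster_score(c1, c2, node_score) > threshold
    ]

# Hypothetical pairwise node scores; unlisted pairs are treated as unrelated.
toy_scores = {("a1", "b1"): 4.0, ("a2", "b2"): 3.5,
              ("a3", "b3"): -2.0, ("a1", "b2"): 1.0}

def node_score(p, q):
    return toy_scores.get((p, q), -1.0)

c1 = ("a1", "a2", "a3")
c2 = ("b1", "b2", "b3")
pair_value = cluster_score(c1, c2, node_score)
seeds_default = find_seeds([c1], [c2], node_score, threshold=7)
seeds_strict = find_seeds([c1], [c2], node_score, threshold=8)
```

Raising T from 7 to 8 discards this pair as a seed, which illustrates how the threshold trades sensitivity for fewer extensions.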

Given two networks, Græmlin enumerates the set of seeds between them and tries to transform each, in turn, into a high-scoring alignment. In a manner similar to that used in existing methods (Koyuturk et al. 2005; Sharan et al. 2005b), the seed extension phase is greedy and occurs in successive rounds. At each step, all proteins adjacent to some node in the alignment constitute the “frontier,” which contains candidates to be added to the alignment. Græmlin selects from the frontier the pair of proteins that, when added to the alignment, yields the maximal increase in score; the extension phase stops when no pair of proteins on the frontier can increase the score of the alignment. Figure 3C illustrates the extension algorithm. Græmlin uses several heuristics to control for the exponential increase in the size of the frontier as it adds more nodes to the alignment.
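The greedy extension loop can be sketched as follows. The scoring function is passed in as a parameter, and the toy gain used here (a node score plus one point per conserved edge) is a hypothetical stand-in for the scoring model; the heuristics that limit the size of the frontier are omitted.

```python
def extend_seed(seed, g1, g2, gain):
    """Greedily grow an alignment, given as a list of (node1, node2) pairs.

    g1, g2 : {node: set of neighbours} for the two networks
    gain   : function(alignment, pair) -> change in score if pair is added
    """
    alignment = list(seed)
    while True:
        used1 = {u for u, _ in alignment}
        used2 = {v for _, v in alignment}
        # Frontier: unaligned nodes adjacent to some aligned node.
        front1 = {n for u in used1 for n in g1[u]} - used1
        front2 = {n for v in used2 for n in g2[v]} - used2
        candidates = [
            (gain(alignment, (a, b)), a, b)
            for a in sorted(front1)
            for b in sorted(front2)
        ]
        if not candidates:
            break
        best_gain, a, b = max(candidates)
        if best_gain <= 0:  # no pair on the frontier improves the score
            break
        alignment.append((a, b))
    return alignment

# Two toy four-node paths.
g1 = {"x1": {"x2"}, "x2": {"x1", "x3"}, "x3": {"x2", "x4"}, "x4": {"x3"}}
g2 = {"y1": {"y2"}, "y2": {"y1", "y3"}, "y3": {"y2", "y4"}, "y4": {"y3"}}
homology = {("x1", "y1"): 3.0, ("x2", "y2"): 3.0, ("x3", "y3"): 3.0}

def toy_gain(alignment, pair):
    a, b = pair
    node = homology.get((a, b), -5.0)
    conserved = sum(1.0 for u, v in alignment if u in g1[a] and v in g2[b])
    return node + conserved

extended = extend_seed([("x1", "y1")], g1, g2, toy_gain)
```

The extension stops at the third pair because adding the nonhomologous pair (x4, y4) would lower the score despite its conserved edge.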

Multiple alignment

Græmlin performs multiple alignment using an analog of the progressive alignment technique commonly used in sequence alignment. Using a phylogenetic tree, it successively aligns the closest pair of networks, constructing several new networks from the resulting alignments. Græmlin places each new network at the parent of the pair of networks that it just aligned. The constructed networks contain nodes that are no longer proteins but equivalence classes, but all scoring and alignment methodologies readily generalize to such networks. Græmlin continues this process until the only remaining networks are at the root of the phylogenetic tree.

To enable comparisons of unaligned parts of a network to more distant species as it traverses the phylogenetic tree, rather than construct a network only from the high-scoring alignments, Græmlin also maintains two additional networks composed of the unaligned nodes from the two original networks. For example, in Figure 3D, Græmlin constructs three networks from the original two that it aligns; as a result, in Figure 3E, the parent of these two networks contains one network for each possible subset of its children. The end result is that after completion of the entire multiple alignment, Græmlin produces multiple alignments of all possible subsets of species. It avoids an exponential running time in practice because after each pairwise alignment, the networks it constructs have small overlaps. The total number of nodes in all networks therefore does not increase significantly.
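The bookkeeping of the progressive scheme can be sketched on a toy example. The pairwise aligner below is a deliberately trivial stand-in that matches nodes by a family tag embedded in the protein name and assumes at most one node per family in each network; the names, the tree encoding, and the rule that every node ends up in exactly one constructed network are all simplifying assumptions.

```python
def family(node):
    """Family tag of an equivalence class whose members are 'family@species'."""
    return next(iter(node)).split("@")[0]

def align(net1, net2):
    """Stub pairwise aligner: merge nodes of the same family.

    Returns (aligned nodes, unaligned rest of net1, unaligned rest of net2).
    """
    by_family = {family(n): n for n in net2}
    aligned, rest1 = [], []
    for node in net1:
        partner = by_family.pop(family(node), None)
        if partner is None:
            rest1.append(node)
        else:
            aligned.append(node | partner)
    return aligned, rest1, list(by_family.values())

def progressive(tree, networks):
    """Return {species subset: network} for the subtree rooted at `tree`.

    tree is a species name (leaf) or a pair (left subtree, right subtree).
    """
    if isinstance(tree, str):
        return {frozenset([tree]): [frozenset([p]) for p in networks[tree]]}
    left, right = (progressive(child, networks) for child in tree)
    result = {}
    for s_left, net_left in left.items():
        for s_right in list(right):
            aligned, net_left, right[s_right] = align(net_left, right[s_right])
            result[s_left | s_right] = aligned
        result[s_left] = net_left  # nodes aligned to nothing on the right
    result.update(right)           # nodes aligned to nothing on the left
    return result

networks = {
    "eco": ["f1@eco", "f2@eco", "f3@eco"],
    "vch": ["f1@vch", "f2@vch"],
    "ccr": ["f1@ccr", "f3@ccr", "f4@ccr"],
}
result = progressive((("eco", "vch"), "ccr"), networks)
all_three = result[frozenset(("eco", "vch", "ccr"))]
total_proteins = sum(len(node) for net in result.values() for node in net)
```

With three species the root holds one network for each of the seven nonempty species subsets, yet the total number of proteins across them stays at eight, which illustrates why the total network size does not grow.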

Results

We measured the performance of Græmlin by assessing its ability to align known biologically functional modules. We compared it to two alignment algorithms, NetworkBLAST (Sharan et al. 2005b) and MaWISh (Koyuturk et al. 2005); because the focus of MetaPathwayHunter (Pinter et al. 2005) is different from general network alignment, we did not include it in our tests. We tested these methods on a set of 10 microbial protein interaction networks constructed via the SRINI algorithm (Srinivasan et al. 2006), which generates weighted interaction networks by integrating a set of functional predictors, such as coexpression, co-inheritance, coevolution, and colocation, and computing the interaction probability for each pair of proteins. Details on the methodology for constructing these networks are included in the Supplemental material.

We assessed the sensitivity of each method by counting the number of KEGG pathways that it aligned between two species (Kanehisa and Goto 2000). We identified a KEGG pathway as “hit” if a method aligned at least three proteins in the pathway to their counterparts in the other species. We defined the “coverage” of a pathway to be the fraction of proteins correctly aligned within the pathway. Changing the definition of a hit pathway to require two or four, instead of three, proteins did not affect the relative performance of the aligners.
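The hit and coverage definitions can be written down directly. In this sketch an alignment is reduced to a set of aligned protein pairs, and the pathway is given as a map from each pathway protein in one species to its counterpart in the other; both representations and all identifiers are hypothetical.

```python
def pathway_coverage(alignment, counterparts):
    """Return (number, fraction) of pathway proteins aligned correctly.

    alignment    : set of (protein in species 1, protein in species 2)
    counterparts : {pathway protein in species 1: its counterpart in species 2}
    """
    correct = sum(1 for p, q in counterparts.items() if (p, q) in alignment)
    return correct, correct / len(counterparts)

def is_hit(alignment, counterparts, min_proteins=3):
    """A pathway is hit if at least `min_proteins` are aligned correctly."""
    return pathway_coverage(alignment, counterparts)[0] >= min_proteins

pathway = {"e1": "v1", "e2": "v2", "e3": "v3", "e4": "v4", "e5": "v5"}
alignment = {("e1", "v1"), ("e2", "v2"), ("e3", "v3"), ("e4", "v9")}
n_correct, coverage = pathway_coverage(alignment, pathway)
hit = is_hit(alignment, pathway)
```

The `min_proteins` parameter corresponds to the threshold of three proteins, which the text reports can be changed to two or four without affecting the relative performance of the aligners.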

We did not use all KEGG pathways for these comparisons, as SRINI does not accurately recapitulate each one. We therefore curated the set of KEGG pathways by ignoring all that did not have a connected component of size at least three in each of the assessed networks. To avoid biasing results toward a specific algorithm, we did not further curate the set by examining the conservation of the pathways. We refer to each pathway in the curated set as an “alignable” KEGG pathway.
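The curation rule can be sketched as a connected-component check on the subgraph induced by the pathway proteins in each network. The data layout and function names are assumptions made for illustration.

```python
def largest_component(graph, nodes):
    """Size of the largest connected component induced by `nodes`.

    graph: {node: set of neighbours} for one thresholded network.
    """
    nodes = set(nodes) & set(graph)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in graph[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def is_alignable(pathway_by_species, networks, min_size=3):
    """True if the pathway has a component of at least min_size everywhere."""
    return all(
        largest_component(networks[s], pathway_by_species[s]) >= min_size
        for s in networks
    )

networks = {
    "sp1": {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()},
    "sp2": {"p": {"q"}, "q": {"p"}, "r": set(), "s": set()},
}
pathway_by_species = {"sp1": ["a", "b", "c", "d"], "sp2": ["p", "q", "r"]}
alignable = is_alignable(pathway_by_species, networks)
```

The toy pathway is connected in the first species but fragmented in the second, so it would be excluded from the curated set.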

As one measure of specificity, we computed the number of “enriched” alignments. To calculate enrichment, we first assigned to each protein all of its annotations from level eight or deeper in the GO hierarchy (Ashburner et al. 2000); given an alignment, we then discarded unannotated proteins and calculated its enrichment using the GO Term-Finder (Boyle et al. 2004). We considered an alignment to be enriched if the P-value of its enrichment was <0.01.

As a second measure of specificity, we counted the fraction of nodes that have KEGG orthologs but were aligned to any nodes other than their KEGG orthologs. Both this measure and calculations of enrichment are imperfect measures of specificity, but they work as rough guides to ensure that an aligner is not completely sacrificing specificity to increase sensitivity.

We also assessed multiple alignment algorithms using these metrics. When evaluating the sensitivity metric, we identified a KEGG pathway as hit if a method aligned at least three proteins in each species to their correct counterparts in all other species. We measured specificity by computing enrichments and counting misaligned nodes exactly as in the case of pairwise alignments.

To our knowledge, the only other quantitative tests of alignment quality measured the accuracy of predictions of interactions and protein function in eukaryotic networks (Kelley et al. 2003; Sharan et al. 2005b). The first such test is relevant mainly in networks with a high number of false positives in one particular species; this is not the case in the microbial networks on which we tested, as three functional predictors used for their construction incorporate some measure of cross-species conservation. As the second test overlaps considerably with our tests measuring enrichment, we do not present results beyond those measuring our notions of sensitivity and specificity.

One issue with the networks constructed by SRINI is that they are complete; that is, SRINI assigns an interaction probability to every pair of proteins. Because these networks are intractably large for any existing algorithm, we thresholded them by removing low-weight edges before running our tests. We generated two sets of networks: one with an edge threshold of P ≥ 0.25 and another with a threshold of P ≥ 0.5. Table 1 lists the species on which we ran tests, in addition to statistics on the network sizes and presence of KEGG pathways in the networks. For comparison purposes, the table also shows the same statistics for the eukaryotic networks that previous studies on alignment have analyzed (Xenarios et al. 2002; FlyBase Consortium 2003; Christie et al. 2004; Harris et al. 2004). Table 2 lists, for each subset of species on which we tested, the number of alignable KEGG pathways present in all species in the subset.
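Thresholding a complete weighted network amounts to a single filter over its edges, as in this small sketch (the edge representation is an assumption):

```python
def threshold_network(edges, cutoff):
    """Keep only edges whose interaction probability is at least `cutoff`.

    edges: {(protein1, protein2): interaction probability}
    """
    return {pair: p for pair, p in edges.items() if p >= cutoff}

complete = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "c"): 0.2}
dense = threshold_network(complete, 0.25)   # lower threshold, more edges
sparse = threshold_network(complete, 0.5)   # higher threshold, fewer edges
```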

Table 1.
Network statistics
Table 2.
KEGG pathway conservation statistics

We did not test on the eukaryotic networks because our sensitivity metric is inapplicable on them; as Table 1 shows, they recapitulate essentially no KEGG pathways. While in principle one could define other sensitivity metrics, the greater quality of the SRINI networks provides a much simpler and more straightforward test scenario. In addition, the SRINI networks are much larger than the eukaryotic networks and consequently provide a better test of the scalability of an algorithm.

For all tests and all alignment algorithms, we considered alignments with P-values <0.01 as high-scoring; for each test case, we calculated P-values by sampling from a large number of runs on random data sets. We constructed the random data sets using techniques similar to those used in previous approaches by redistributing the edges of a real network while maintaining the original node degree distribution (Kelley et al. 2003; Koyuturk et al. 2005; Pinter et al. 2005; Sharan et al. 2005a). Unless noted otherwise, we ran all aligners with their default parameters. We performed all tests on a 2.8 GHz Intel Xeon processor with 2 GB of RAM running the Linux operating system.
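One common way to redistribute edges while preserving every node's degree is the double-edge swap, sketched below together with an empirical P-value. Both are illustrative assumptions: the text does not state which randomization procedure or P-value estimator was used, and this sketch ignores edge weights.

```python
import random
from collections import Counter

def rewire(edges, n_swaps, rng):
    """Randomise an undirected simple graph by double-edge swaps.

    Each swap replaces edges (a, b) and (c, d) with (a, d) and (c, b),
    which leaves the degree of every node unchanged.
    """
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges

def degrees(edges):
    return Counter(node for edge in edges for node in edge)

def empirical_p(observed, null_scores):
    """Fraction of random-run scores at least as high as the observed one."""
    higher = sum(1 for s in null_scores if s >= observed)
    return (1 + higher) / (1 + len(null_scores))

ring = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "a")]
shuffled = rewire(ring, 10, random.Random(0))
p_value = empirical_p(5.0, [1.0, 2.0, 3.0, 6.0])
```

The add-one correction in `empirical_p` keeps the estimate from being exactly zero when no random run reaches the observed score.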

Network-to-network alignment

The goal of complete network-to-network alignment is to find conserved subcomponents of networks, and the results often suggest potential functional modules present in more than one species. Our first set of tests assessed performance of each algorithm under this application; this is the focus of both MaWISh, which searches for conserved heavy subgraphs, and NetworkBLAST, which searches for conserved protein complexes and pathways.

With all three methods, we aligned the networks of Escherichia coli K12 and Caulobacter crescentus; E. coli and Mycobacterium tuberculosis H37Rv; E. coli and Vibrio cholerae; and E. coli and Streptomyces coelicolor. Owing to its excessive running time on networks with the lower edge threshold, we report results for NetworkBLAST only on networks with a threshold of 0.5. In addition, MaWISh cannot perform alignments on S. coelicolor because for each input species it requires COG data (Tatusov et al. 1997), which is not available for S. coelicolor. Figure 4 summarizes sensitivity for three of the test cases on networks with edge thresholds of 0.25 and 0.5, and Table 3 shows more detailed results on networks thresholded at 0.5. Complete results are included in the Supplemental material.

Figure 4.
Sensitivity comparison of methods. For three pairwise alignments of E. coli, shown are the number of KEGGs hit by each aligner. For Græmlin and MaWISh, this graph includes results on networks with edge thresholds of both 0.25 and 0.5. For NetworkBLAST, ...
Table 3.
Results on pairwise alignment of complete networks thresholded at 0.5

These results show that there is a legitimate reason for using networks with a lower edge threshold, since the sensitivities of both MaWISh and Græmlin drop dramatically on networks with a higher threshold without a corresponding increase in specificity. Consequently, the ability of both methods to scale to data sets of this size is important.

When searching for highly connected components, or multiprotein complexes, Græmlin is significantly more sensitive than the other two methods, both with respect to the number of KEGGs hit and with respect to the average coverage of a KEGG, without sacrificing specificity. It also aligns significantly more nodes overall than the other two methods without misaligning a higher number of proteins.

When searching for pathways, Græmlin is more sensitive than NetworkBLAST, although neither pathway search is as sensitive as the searches for multiprotein complexes. This is predominantly because a pathway alignment must be much larger than an alignment of multiprotein complexes to be statistically significant. For example, if an alignable KEGG pathway contains a clique of four proteins, it will score highly as both a multiprotein complex and a pathway. However, because four-node cliques are much less likely to occur in unrelated networks than four-node pathways, the multiprotein complex alignment will be more statistically significant. As most alignable KEGGs appear as cliques in the SRINI microbial networks, searches for highly connected components are consequently more sensitive than pathway searches with respect to the metric of hitting KEGG pathways. However, past studies have shown that pathway searches do have uses beyond identifying conserved modules (Kelley et al. 2003; Sharan et al. 2005b).

With respect to running time, only MaWISh and Græmlin can efficiently search the large networks on which we tested. While MaWISh is the faster of the two methods, the running time of Græmlin is comparable.

Of all the test cases, Græmlin and NetworkBLAST take the longest to run on E. coli versus S. coelicolor, primarily because of the size of the S. coelicolor network and the large number of homologs between these species. In this case, Græmlin can sacrifice sensitivity for speed by adjusting the parameters it uses for d-clusters. Figure 5A demonstrates the impact of T on running time and sensitivity. Running with its default parameters (d = 4, T = 7) on networks thresholded at 0.25, it finds 25 KEGG pathways in 1224 sec, but a slight increase in T yields 21 KEGGs in only 339 sec.

Figure 5.
Running-time performance of Græmlin. (A) The speed sensitivity trade-off. Each point represents a run of Græmlin with d = 4 and different values of T. For each set of parameters, the x-axis plots the running time, and the y-axis plots ...

Multiple network alignment

We also performed complete three-way alignments of (E. coli, C. crescentus, V. cholerae) and (E. coli, Campylobacter jejuni, Helicobacter pylori 26695) using Græmlin and NetworkBLAST. Table 4 shows the results of these tests.

Table 4.
Results on multiple alignment of complete networks

On the networks thresholded at 0.5, NetworkBLAST hits slightly more KEGGs than Græmlin; however, Græmlin covers a much higher fraction of each KEGG and also misaligns fewer nodes. In addition, Græmlin is orders of magnitude faster than NetworkBLAST; on one of our test cases, the latter did not complete after running for more than 2 mo. Because Græmlin scales effectively to large network sizes, it can efficiently multiply align networks with a low edge threshold. This is important because the networks with a low edge threshold contain many more conserved KEGGs than the high-thresholded networks, as evinced by the dramatically increased sensitivity of Græmlin on this data set.

To further stress the scalability of Græmlin with respect to the number of species in a multiple alignment, Figure 5B shows the running times of Græmlin as it includes more species in the alignment. The roughly linear relation of running time to the number of species demonstrates the benefit of the progressive alignment technique.

Query-to-network alignment

Query-to-network alignment is a network analog of the BLAST algorithm; the goal is to search a large database of alignments for matches to an input query that is typically a hypothetical or known functional module. Both MaWISh and NetworkBLAST can perform query-to-network alignment by treating the query as a full network. On the other hand, Græmlin supports fast alignment of many queries to the same database by building an index as a one-time expense and maintaining it in memory for many successive queries.
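The build-once, query-many pattern described above can be sketched as follows. This is a toy illustration only: the SeedIndex class, the homology-family labels, and the protein names are hypothetical, and the structure indexed here (proteins grouped by family) is an assumption made for the example rather than the index Græmlin actually builds.

```python
from collections import defaultdict


class SeedIndex:
    """Toy index mapping a homology-family label to database proteins (hypothetical)."""

    def __init__(self, database_proteins):
        # database_proteins: dict of protein name -> family label.
        # This loop is the one-time indexing cost.
        self.by_family = defaultdict(set)
        for protein, family in database_proteins.items():
            self.by_family[family].add(protein)

    def candidates(self, query_proteins):
        """For each query protein, list the database proteins in the same family."""
        return {
            q: sorted(self.by_family.get(family, ()))
            for q, family in query_proteins.items()
        }


# Hypothetical database network, indexed once and kept in memory.
database = {"dbA1": "famA", "dbA2": "famA", "dbB1": "famB", "dbC1": "famC"}
index = SeedIndex(database)

# Many successive queries reuse the same index without rebuilding it.
queries = [
    {"q1": "famA", "q2": "famB"},
    {"q3": "famC", "q4": "famD"},
]
for query in queries:
    print(index.candidates(query))
```

The point of the design is that the cost of the constructor is paid once, so the per-query cost depends on the query size rather than on a fresh pass over the whole database.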

Our final set of tests assessed the performance of each method on query-to-network search; we searched E. coli against C. crescentus, C. crescentus against E. coli, E. coli against M. tuberculosis, and M. tuberculosis against E. coli. Table 5 shows the results of these tests; more detailed results are included in the Supplemental material.

Table 5.
Results on alignment of a query network to a database thresholded at 0.5

For this test, sensitivity and specificity results are similar to those in the case of complete network alignment. One major difference is the relative running times of Græmlin and MaWISh; they are comparable when the database is C. crescentus, the smallest network, but Græmlin is much faster on the other networks. This is because Græmlin can amortize its indexing step over all queries, which demonstrates its strength as a database search tool.

In this test case, Græmlin performs comparably with either the Pathway ESM or the Complex ESM. The Module ESM does not offer dramatic improvements over the other two ESMs, but it does give slightly higher KEGG coverage and misaligns slightly fewer nodes. This is because most KEGGs that are alignable are highly connected, making the Complex ESM close to optimal under our metrics.

Biological applications

We used Græmlin to perform a 10-way alignment of E. coli, Salmonella typhimurium, V. cholerae, C. crescentus, C. jejuni, H. pylori, Synechocystis, S. coelicolor, M. tuberculosis, and S. pneumoniae. This generated roughly 2000 significant multiple alignments, each containing all or a subset of the 10 species; complete results are available in the Supplemental material. Because the analysis of these alignments is a research direction in its own right, we selected interesting alignments manually. Our focus was predominantly on results that illustrate the utility of the various features of Græmlin.

Functional annotation

Network alignment can be used to assign roles to proteins of unknown function in two ways. First, “annotation transfer” assigns to a protein of unknown function the annotation of a protein to which it is aligned. This procedure is similar to the traditional method of annotation transfer using only sequence alignment, but network alignment strengthens the hypothesis by revealing conserved interactions as well as conserved sequence. A second annotation method is unique to network alignment: If a protein of unknown function appears as part of an alignment together with a “landmark” protein of known function, we can use “landmark extension” to label the protein with a similar annotation. More highly connected and highly conserved alignments strengthen the hypothesis that the unknown protein shares function with the landmark protein.
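The two annotation strategies described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the protein identifiers, and the toy alignment are hypothetical, and the sketch ignores the alignment scores and connectivity that would weight such inferences in practice.

```python
def annotation_transfer(aligned_pairs, known):
    """Give an unannotated protein the annotation of the protein it is aligned to."""
    inferred = {}
    for a, b in aligned_pairs:
        if a in known and b not in known:
            inferred[b] = known[a]
        elif b in known and a not in known:
            inferred[a] = known[b]
    return inferred


def landmark_extension(alignment_members, known, landmark):
    """Label unannotated members of an alignment with the landmark's annotation."""
    if landmark not in alignment_members or landmark not in known:
        return {}
    return {p: known[landmark] for p in alignment_members if p not in known}


# Hypothetical toy data: one annotated landmark protein in species A.
known = {"spA_landmark": "DNA replication"}

# Annotation transfer: the landmark is aligned to an unannotated protein in species B.
pairs = [("spA_landmark", "spB_unknown1")]
print(annotation_transfer(pairs, known))

# Landmark extension: unannotated proteins occur in the same alignment as the landmark.
members = ["spA_landmark", "spA_unknown2", "spA_unknown3"]
print(landmark_extension(members, known, "spA_landmark"))
```

Both functions return hypotheses rather than confirmed annotations; as the text notes, more highly connected and more highly conserved alignments make such hypotheses stronger.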

Figure 6 shows an example of functional annotation through both pairwise and multiple network alignment. The pairwise alignment between E. coli and C. crescentus (Fig. 6A) shows a conserved DNA replication module. This includes the components of the primosome (dnaB, dnaA, gyrA, gyrB), the subunits of topoisomerase IV (parE, parC), and the β subunit of DNA polymerase III (dnaN). These protein families are all known to be involved in DNA replication.

Figure 6.
Two alignments of proteins involved in DNA replication. (A) A pairwise alignment between E. coli and C. crescentus includes several proteins involved in cell division as well as a conserved thiophene and furan oxidation protein. (B) A multiple alignment ...

Two aspects of this alignment are of interest. First, we see that the recF repair protein is linked to DNA replication in both organisms. Although this is not the primary annotation of the protein, a link to DNA replication was, in fact, found fairly recently (Kogoma 1997). Second, we observe the presence of the glucose-inhibited division proteins (gidA, gidB) and the protein trmE. Transcription of gidA affects DNA replication, both gidA and trmE are known to be involved in tRNA modification, and trmE has been implicated in cell cycle control (Gollop and March 1991). Taking these known interactions together with the alignment, we can hypothesize that both the gid proteins and trmE are likely to be involved in the cell-cycle-regulated control of DNA replication.

The multiple alignment diagram in Figure 6B extends the pairwise alignment to a multiple alignment of E. coli, S. typhimurium, V. cholerae, C. crescentus, C. jejuni, H. pylori, M. tuberculosis, S. pneumoniae, and Synechocystis. While some proteins from the pairwise alignment are absent, the core remains the same. The presence of the trmE protein in all nine species provides a compelling argument in favor of its role in DNA replication. This multiple alignment also offers another opportunity for landmark extension; the 60-kDa inner membrane protein yidC is present in all nine species and is highly connected to the other proteins in the alignment. Although yidC is known to be involved in protein secretion, the multiple alignment indicates that it is also likely to be linked to DNA replication.

Figure 7 shows an example of a 10-way multiple alignment relevant to bacterial cell division and cell envelope biogenesis. The alignment includes ftsZ, ftsW, and ftsI, well-known proteins involved in cell division, along with several other proteins from the mur and mra families known to be involved in peptidoglycan biogenesis. Many of these proteins are in contiguous operons in some species (Hara et al. 1997) but are scattered over the genome in species such as C. jejuni and H. pylori, rendering bioinformatics analysis difficult. This alignment, however, implicates them in cell division by association with the landmark proteins ftsZ, ftsW, and ftsI. In doing so, it uses information on the operon of one species (E. coli) to predict functional associations in the other species of the alignment.

Figure 7.
An alignment including proteins involved in cell division. This alignment implicates several proteins in bacterial cell division; it includes all species listed in Table 1.

Module identification

While support for functional annotation of proteins is currently the primary application of network alignment, the availability of numerous interaction networks may provide a resource for the study of functional modules. For example, Figure 8 shows that in E. coli, S. typhimurium, V. cholerae, C. jejuni, H. pylori, and C. crescentus, several proteins from the exb/tol family of biopolymer transporters are predicted to interact with a set of proteins involved in DNA recombination and integration. While the cooperation of these proteins is somewhat weak in any given species, the sum total of interactions in six distinct species suggests that DNA itself is the biopolymer transported through the tol channels before integrating into the chromosome. This alignment therefore may represent a part of a conserved module determining whether a cell is naturally competent for transformation (Dubnau 1999); this hypothesis is strengthened by studies showing that the insertional disruption of exbB in Pseudomonas stutzeri can reduce transformation efficiency to one-fifth of its previous level (Graupner and Wackernagel 2001). While in P. stutzeri the investigators used the fact that the exb genes were immediately downstream of two competence-related proteins, in species such as C. jejuni and H. pylori this chromosomal contiguity is not evident. Network alignment nevertheless identifies this module on the basis of conserved interactions.

Figure 8.
An alignment of a hypothetical functional module. In this alignment, proteins involved in biopolymer transport interact with proteins involved in DNA recombination. The sum total of these interactions in six species suggests that the proteins may be a ...

Discussion

Interaction networks will soon constitute a vast store of data, as exemplified by the upcoming availability of hundreds of microbial interaction networks (Srinivasan et al. 2006). In light of this, network alignment is rapidly becoming an important analytical tool: Its goal is to map proteins of one interaction network to those of another and identify shared subnetworks that may constitute conserved biological modules. As with biosequence comparison, the principle that evolutionary conservation implies function can serve to increase the signal-to-noise ratio in analyses of interaction networks. With that help, biologists may be able to transfer functional annotations across species, extend annotations of known modules and landmark proteins to their strongly conserved neighboring proteins in a network, or identify novel modules by detecting unusual conserved subnetworks.

As our test results show, Græmlin is a promising method for network alignment. It scales efficiently to large inputs, particularly when searching databases, and is the first method capable of performing multiple alignments of an arbitrary number of networks. In addition, Græmlin uses a novel, flexible scoring scheme that can incorporate biologically trained parameters, and introduces the Module ESM framework, which offers the potential to search for subnetworks of arbitrary structure. In contrast, existing methods are limited to searching for multiprotein complexes, which are represented as fully connected graphs, or pathways, which are represented as ordered lists of proteins. As biological networks become less noisy and more complete, the Module ESM framework will allow fine-grained searches and analyses, and it will also offer the potential to refine models of known biological modules by quantifying the level of conservation of individual parts and interactions.

Our analyses of four conserved subnetworks accentuate several applications of network alignment, one of which is the analysis of proteins that lack functional annotation. Network alignment can do this by using conserved interactions and sequence to align an unknown protein to one with known function in another species. Alternatively, even if a protein has no homolog of known function, its occurrence as part of an alignment near well-known “landmark” proteins permits inferences about its function (Srinivasan et al. 2006).

As networks improve in quality and completeness, attention will focus on the functional annotation of modules in addition to proteins. Network alignment will play a key role by discovering groups of proteins that interact in more than one species, and it will thus offer additional evidence that such proteins work together to perform a common cellular function. As more networks become available, query-to-database network alignment will fulfill a similar role for modules as does BLAST for proteins (Altschul et al. 1997): By assembling a database of modules of known function, one may be able to annotate hypothetical modules that align to a module of known function in the database.

Although multiple network alignment is still in its infancy, it offers the potential to study modules in the context of functional evolution. Græmlin is a first step in the development of tools that will permit such studies, as it is capable of aligning many networks simultaneously and uses an evolutionarily based scoring scheme. Further algorithmic development will undoubtedly lead to data-motivated population genetic models for network evolution (McAdams et al. 2004; Koyuturk et al. 2005), where conserved interactions and conserved proteins will play the role of conserved residues. It is possible that even a SCOP-like hierarchy (Andreeva et al. 2004) for module families is on the horizon.

Although there is an extensive literature (Conte et al. 2004) on the topic of finding conserved graph topologies, the problems addressed by such algorithms are in general quite different from network alignment. For example, the evolutionary restriction on meaningful network alignments strongly constrains matches between graphs, as only homologous proteins from different species are aligned, whereas in the kind of graph matching treated by image processing algorithms (Conte et al. 2004), for example, nodes are tacitly assumed to be indistinguishable and edges represent indications of connectivity rather than beliefs about interaction. Another difference lies in the quality of the networks; probabilistic protein interaction networks are undirected graphs characterized by a low graph diameter (Barabasi and Oltvai 2004) and a high degree of topological uncertainty. As an extreme example of noisy graph structure, interaction networks based primarily on yeast two-hybrid data may not even be alignable, as several studies have questioned this assay's reliability (Bloom and Adami 2003; Drummond et al. 2005; Deeds et al. 2006). As networks increase in quality, however, ideas from general graph comparison techniques will be more relevant to network alignment.

With the impressive recent advances in sequencing, high-throughput techniques for gathering biological data, and computational methodologies for integrating such information into networks of protein interactions, comparisons of networks should become an increasingly important methodology for the molecular biologist. As our results show, Græmlin is a general and systematic methodology for comparing an arbitrary number of large networks. Many important challenges remain; for instance, the ability to reason about directed edges and align different types of interactions, such as physical contact and gene regulation, will allow more detailed analyses of biochemical pathways and regulatory cascades. On a more practical note, the ability to automatically identify interesting alignments for further study will be an important research topic unto itself.

Acknowledgments

J.F. was supported in part by a Stanford Graduate Fellowship; A.N. was supported in part by NLM training grant LM-07033 and NIH grant 5-T15-LM007033. J.F., A.N., and B.S.S. were funded by NSF grant EF-0312459, NIH grant UO1-HG003162, the NSF CAREER Award, and the Alfred P. Sloan Fellowship. B.S.S. was supported in part by a Department of Defense National Defense Science and Engineering Graduate Fellowship through the Army Research Office, and B.S.S. and H.H.M. were supported by NIH grant 1 R24 GM073011-01 and DOE Office of Science grant DEFG02-01ER63219. We thank Andreas Sundquist for helpful comments on the manuscript.

Footnotes

Supplemental material is available online at www.genome.org. Græmlin is available at http://graemlin.stanford.edu, and source code is available under the GNU Public License.

Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5235706. Freely available online through the Genome Research Open Access option.

References

  • Alexandersson M., Cawley S., Pachter L. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 2003;13:496–502.
  • Altschul S.F., Carroll R.J., Lipman D.J. Weights for data related by a tree. J. Mol. Biol. 1989;207:647–653.
  • Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410.
  • Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., Lipman D.J. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–3402.
  • Andreeva A., Howorth D., Brenner S.E., Hubbard T.J.P., Chothia C., Murzin A.G. SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Res. 2004;32:D226–D229.
  • Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., et al. Gene Ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 2000;25:25–29.
  • Bafna V., Huson D.H. The conserved exon method for gene finding. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2000;8:3–12.
  • Barabasi A.-L., Oltvai Z.N. Network biology: Understanding the cell's functional organization. Nat. Rev. Genet. 2004;5:101–113.
  • Batzoglou S. The many faces of sequence alignment. Brief. Bioinform. 2005;6:6–22.
  • Batzoglou S., Pachter L., Mesirov J.P., Berger B., Lander E.S. Human and mouse gene structure: Comparative analysis and application to exon prediction. Genome Res. 2000;10:950–958.
  • Bejerano G., Pheasant M., Makunin I., Stephen S., Kent W.J., Mattick J.S., Haussler D. Ultraconserved elements in the human genome. Science. 2004;304:1321–1325.
  • Blanchette M., Kent W.J., Riemer C., Elnitski L., Smit A.F.A., Roskin K.M., Baertsch R., Rosenbloom K., Clawson H., Green E.D., et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715.
  • Bloom J.D., Adami C. Apparent dependence of protein evolutionary rate on number of interactions is linked to biases in protein–protein interactions data sets. BMC Evol. Biol. 2003;3
  • Boyle E.I., Weng S., Gollub J., Jin H., Botstein D., Cherry J.M., Sherlock G. GO::TermFinder—Open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics. 2004;20:3710–3715.
  • Bray N., Pachter L. MAVID: Constrained ancestral alignment of multiple sequences. Genome Res. 2004;14:693–699.
  • Brudno M., Do C.B., Cooper G.M., Kim M.F., Davydov E., Green E.D., Sidow A., Batzoglou S. LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731.
  • Chiaromonte F., Yap V.B., Miller W. Scoring pairwise genomic sequence alignments. Pac. Symp. Biocomput. 2002:115–126.
  • Christie K.R., Weng S., Balakrishnan R., Costanzo M.C., Dolinski K., Dwight S.S., Engel S.R., Feierbach B., Fisk D.G., Hirschman J.E., et al. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res. 2004;32:D311–D314.
  • Conte D., Foggia P., Sansone C., Vento M. Thirty years of graph matching in pattern recognition. IJPRAI. 2004;18:265–298.
  • Cooper G.M., Brudno M., Stone E.A., Dubchak I., Batzoglou S., Sidow A. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res. 2004;14:539–548.
  • Cooper G.M., Stone E.A., Asimenos G., Green E.D., Batzoglou S., Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005;15:901–913.
  • Dandekar T., Schuster S., Snel B., Huynen M., Bork P. Pathway alignment: Application to the comparative analysis of glycolytic enzymes. Biochem. J. 1999;343:115–124.
  • Deeds E.J., Ashenberg O., Shakhnovich E.I. From The Cover: A simple physical model for scaling in protein–protein interaction networks. Proc. Natl. Acad. Sci. 2006;103:311–316.
  • Drummond D.A., Bloom J.D., Adami C., Wilke C.O., Arnold F.H. Why highly expressed proteins evolve slowly. Proc. Natl. Acad. Sci. 2005;102:14338–14343.
  • Dubnau D. DNA uptake in bacteria. Annu. Rev. Microbiol. 1999;53:217–244.
  • Duda R.O., Hart P.E., Stork D.G. Pattern classification. Wiley-Interscience; New York: 2000.
  • Durbin R., Eddy S., Krogh A., Mitchison G. Biological sequence analysis. Cambridge University Press; UK: 1998.
  • Edgar R.C. MUSCLE: A multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 2004;5
  • Eisen M.B., Spellman P.T., Brown P.O., Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 1998;95:14863–14868.
  • Feng D.F., Doolittle R.F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 1987;25:351–360.
  • FlyBase Consortium. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 2003;31:172–175.
  • Forst C.V., Schulten K. Phylogenetic analysis of metabolic pathways. J. Mol. Evol. 2001;52:471–489.
  • Fromont-Racine M., Rain J.C., Legrain P. Toward a functional analysis of the yeast genome through exhaustive two-hybrid screens. Nat. Genet. 1997;16:277–282.
  • Giot L., Bader J.S., Brouwer C., Chaudhuri A., Kuang B., Li Y., Hao Y.L., Ooi C.E., Godwin B., Vitols E., et al. A protein interaction map of Drosophila melanogaster. Science. 2003;302:1727–1736.
  • Gollop N., March P.E. A GTP-binding protein (Era) has an essential role in growth rate and cell cycle control in Escherichia coli. J. Bacteriol. 1991;173:2265–2270.
  • Graupner S., Wackernagel W. Identification and characterization of novel competence genes comA and exbB involved in natural genetic transformation of Pseudomonas stutzeri. Res. Microbiol. 2001;152:451–460.
  • Hara H., Yasuda S., Horiuchi K., Park J.T. A promoter for the first nine genes of the Escherichia coli mra cluster of cell division and cell envelope biosynthesis genes, including ftsI and ftsW. J. Bacteriol. 1997;179:5802–5811.
  • Harris T.W., Chen N., Cunningham F., Tello-Ruiz M., Antoshechkin I., Bastiani C., Bieri T., Blasiar D., Bradnam K., Chan J., et al. WormBase: A multi-species resource for nematode biology and genomics. Nucleic Acids Res. 2004;32:D411–D417.
  • Hartwell L.H., Hopfield J.J., Leibler S., Murray A.W. From molecular to modular cell biology. Nature. 1999;402:47–52.
  • Henikoff S., Henikoff J.G. Performance evaluation of amino acid substitution matrices. Proteins. 1993;17:49–61.
  • Hillier L.W., Miller W., Birney E., Warren W., Hardison R.C., Ponting C.P., Bork P., Burt D.W., Groenen M.A.M., Delany M.E., et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004;432:695–716.
  • Kanehisa M., Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
  • Kaneko T., Tanaka A., Sato S., Kotani H., Sazuka T., Miyajima N., Sugiura M., Tabata S. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb region from map positions 64% to 92% of the genome. DNA Res. 1995;2:153–166.
  • Kelley B.P., Sharan R., Karp R.M., Sittler T., Root D.E., Stockwell B.R., Ideker T. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. 2003;100:11394–11399.
  • Kogoma T. Is RecF a DNA replication protein? Proc. Natl. Acad. Sci. 1997;94:3483–3484.
  • Korf I., Flicek P., Duan D., Brent M.R. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001;17:140–148.
  • Koyuturk M., Grama A., Szpankowski W. Pairwise local alignment of protein interaction networks guided by models of evolution. Lecture Notes in Bioinformatics. 2005;3500:48–65.
  • Lee I., Date S.V., Adai A.T., Marcotte E.M. A probabilistic functional network of yeast genes. Science. 2004;306:1555–1558.
  • Li S., Armstrong C.M., Bertin N., Ge H., Milstein S., Boxem M., Vidalain P.-O., Han J.-D.J., Chesneau A., Hao T., et al. A map of the interactome network of the metazoan C. elegans. Science. 2004;303:540–543.
  • Lu L.J., Xia Y., Paccanaro A., Yu H., Gerstein M. Assessing the limits of genomic data integration for predicting protein networks. Genome Res. 2005;15:945–953.
  • Ma B., Tromp J., Li M. PatternHunter: Faster and more sensitive homology search. Bioinformatics. 2002;18:440–445.
  • Marchler-Bauer A., Anderson J.B., Cherukuri P.F., DeWeese-Scott C., Geer L.Y., Gwadz M., He S., Hurwitz D.I., Jackson J.D., Ke Z., et al. CDD: A Conserved Domain Database for protein classification. Nucleic Acids Res. 2005;33:D192–D196.
  • Matthews L.R., Vaglio P., Reboul J., Ge H., Davis B.P., Garrels J., Vincent S., Vidal M., Vaglio P., Reboul J., Ge H., Davis B.P., Garrels J., Vincent S., Vidal M., Reboul J., Ge H., Davis B.P., Garrels J., Vincent S., Vidal M., Ge H., Davis B.P., Garrels J., Vincent S., Vidal M., Davis B.P., Garrels J., Vincent S., Vidal M., Garrels J., Vincent S., Vidal M., Vincent S., Vidal M., Vidal M. Identification of potential interaction networks using sequence-based searches for conserved protein–protein interactions or “interologs. Genome Res. 2001;11:2120–2126. [PMC free article] [PubMed]
  • McAdams H.H., Srinivasan B., Arkin A.P., Srinivasan B., Arkin A.P., Arkin A.P. The evolution of genetic regulatory systems in bacteria. Nat. Rev. Genet. 2004;5:169–178. [PubMed]
  • Milo R., Shen-Orr S., Itzkovitz S., Kashtan N., Chklovskii D., Alon U., Shen-Orr S., Itzkovitz S., Kashtan N., Chklovskii D., Alon U., Itzkovitz S., Kashtan N., Chklovskii D., Alon U., Kashtan N., Chklovskii D., Alon U., Chklovskii D., Alon U., Alon U. Network motifs: Simple building blocks of complex networks. Science. 2002;298:824–827. [PubMed]
  • Ogata H., Fujibuchi W., Goto S., Kanehisa M., Fujibuchi W., Goto S., Kanehisa M., Goto S., Kanehisa M., Kanehisa M. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res. 2000;28:4021–4028. [PMC free article] [PubMed]
  • Pellegrini M., Marcotte E.M., Thompson M.J., Eisenberg D., Yeates T.O., Marcotte E.M., Thompson M.J., Eisenberg D., Yeates T.O., Thompson M.J., Eisenberg D., Yeates T.O., Eisenberg D., Yeates T.O., Yeates T.O. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 1999;96:4285–4288. [PMC free article] [PubMed]
  • Pinter R.Y., Rokhlenko O., Yeger-Lotem E., Ziv-Ukelson M., Rokhlenko O., Yeger-Lotem E., Ziv-Ukelson M., Yeger-Lotem E., Ziv-Ukelson M., Ziv-Ukelson M. Alignment of metabolic pathways. Bioinformatics. 2005;21:3401–3408. [PubMed]
  • Sharan R., Ideker T., Kelley B., Shamir R., Karp R.M., Ideker T., Kelley B., Shamir R., Karp R.M., Kelley B., Shamir R., Karp R.M., Shamir R., Karp R.M., Karp R.M. Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J. Comput. Biol. 2005a;12:835–846. [PubMed]
  • Sharan R., Suthram S., Kelley R.M., Kuhn T., McCuine S., Uetz P., Sittler T., Karp R.M., Ideker T., Suthram S., Kelley R.M., Kuhn T., McCuine S., Uetz P., Sittler T., Karp R.M., Ideker T., Kelley R.M., Kuhn T., McCuine S., Uetz P., Sittler T., Karp R.M., Ideker T., Kuhn T., McCuine S., Uetz P., Sittler T., Karp R.M., Ideker T., McCuine S., Uetz P., Sittler T., Karp R.M., Ideker T., Uetz P., Sittler T., Karp R.M., Ideker T., Sittler T., Karp R.M., Ideker T., Karp R.M., Ideker T., Ideker T. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. 2005b;102:1974–1979. [PMC free article] [PubMed]
  • Siepel A., Bejerano G., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., Bejerano G., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., Pedersen J.S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., Hinrichs A.S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., Hou M., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., Rosenbloom K., Clawson H., Spieth J., Hillier L.W., Richards S., Clawson H., Spieth J., Hillier L.W., Richards S., Spieth J., Hillier L.W., Richards S., Hillier L.W., Richards S., Richards S., et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050. [PMC free article] [PubMed]
  • Srinivasan B.S., Novak A., Flannick J., Batzoglou S., McAdams H.H., Novak A., Flannick J., Batzoglou S., McAdams H.H., Flannick J., Batzoglou S., McAdams H.H., Batzoglou S., McAdams H.H., McAdams H.H. Proceedings of the 10th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2006) 2006. Integrated protein interaction networks for 11 microbes. (in press)
  • Stuart J.M., Segal E., Koller D., Kim S.K., Segal E., Koller D., Kim S.K., Koller D., Kim S.K., Kim S.K. A gene-coexpression network for global discovery of conserved genetic modules. Science. 2003;302:249–255. [PubMed]
  • Sun Y., Buhler J., Buhler J. Designing multiple simultaneous seeds for DNA similarity search. J. Comput. Biol. 2005;12:847–861. [PubMed]
  • Tatusov R.L., Koonin E.V., Lipman D.J., Koonin E.V., Lipman D.J., Lipman D.J. A genomic perspective on protein families. Science. 1997;278:631–637. [PubMed]
  • Uetz P., Giot L., Cagney G., Mansfield T.A., Judson R.S., Knight J.R., Lockshon D., Narayan V., Srinivasan M., Pochart P., Giot L., Cagney G., Mansfield T.A., Judson R.S., Knight J.R., Lockshon D., Narayan V., Srinivasan M., Pochart P., Cagney G., Mansfield T.A., Judson R.S., Knight J.R., Lockshon D., Narayan V., Srinivasan M., Pochart P., Mansfield T.A., Judson R.S., Knight J.R., Lockshon D., Narayan V., Srinivasan M., Pochart P., Judson R.S., Knight J.R., Lockshon D., Narayan V., Srinivasan M., Pochart P., Knight J.R., Lockshon D., Narayan V., Srinivasan M., Pochart P., Lockshon D., Narayan V., Srinivasan M., Pochart P., Narayan V., Srinivasan M., Pochart P., Srinivasan M., Pochart P., Pochart P., et al. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae . Nature. 2000;403:623–627. [PubMed]
  • Waterston R.H., Lindblad-Toh K., Birney E., Rogers J., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., Lindblad-Toh K., Birney E., Rogers J., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., Birney E., Rogers J., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., Rogers J., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., Abril J.F., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., Agarwal P., Agarwala R., Ainscough R., Alexandersson M., An P., Agarwala R., Ainscough R., Alexandersson M., An P., Ainscough R., Alexandersson M., An P., Alexandersson M., An P., An P., et al. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420:520–562. [PubMed]
  • Xenarios I., Salwinski L., Duan X.J., Higney P., Kim S.-M., Eisenberg D., Salwinski L., Duan X.J., Higney P., Kim S.-M., Eisenberg D., Duan X.J., Higney P., Kim S.-M., Eisenberg D., Higney P., Kim S.-M., Eisenberg D., Kim S.-M., Eisenberg D., Eisenberg D. DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002;30:303–305. [PMC free article] [PubMed]

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press