Genome Res. 2004 Jan; 14(1): 109–115.
PMCID: PMC314287

A Cross-Genomic Approach for Systematic Mapping of Phenotypic Traits to Genes


We present a computational method for de novo identification of gene function using only the cross-organismal distribution of phenotypic traits. Our approach assumes that proteins necessary for a set of phenotypic traits are preferentially conserved among organisms that share those traits. The method combines organism-to-phenotype associations with phylogenetic profiles to identify proteins that have high propensities for the query phenotype; it does not require functional annotations for any proteins. We first present the statistical foundations of this approach and then apply it to a range of phenotypes to assess how its performance depends on the frequency and specificity of the phenotype. Our analysis shows that statistically significant associations are possible as long as the phenotype is neither extremely rare nor extremely common; results on the flagella, pili, thermophily, and respiratory tract tropism phenotypes suggest that reliable associations can be inferred when the phenotype does not arise from many alternate mechanisms.

The increasing number of fully sequenced genomes has made it possible to infer protein function using comparative genome techniques. Most current computational methods assign function to proteins by matching them to other proteins with known function (for review, see Bork et al. 1998); this matching has traditionally relied on sequence homology (Altschul et al. 1990), but nonhomology-based methods have also been introduced recently. The Clusters of Orthologous Groups (COGs) database (http://www.ncbi.nlm.nih.gov/COG/) is a homology-based method that establishes COGs as groups of homologs that are found in at least three major phylogenetic lineages, and enables transfer of functional information from one ortholog to the entire set of proteins within a COG (Tatusov et al. 1997). Phylogenetic profiles (Gaasterland and Ragan 1998; Pellegrini et al. 1999), gene clusters (Overbeek et al. 1999), and gene fusion analysis (Enright et al. 1999; Marcotte et al. 1999; Snel et al. 2000) are methods that can group together proteins that do not necessarily share sequence homology. Phylogenetic profiles describe the presence or absence of proteins in different genomes, and proteins with similar phylogenetic profiles are thought to share similar functions (Pellegrini et al. 1999). Gene cluster analysis (Overbeek et al. 1999; Tamames et al. 2001) infers functional relationships between genes from conservation of chromosomal proximity. Gene fusion analysis (Enright et al. 1999; Marcotte et al. 1999; Snel et al. 2000) identifies proteins that either belong to a protein complex or catalyze consecutive steps in a pathway by looking for corresponding genes that are separate in one organism, but are fused into one sequence in another. For a comparison of these nonhomology techniques see Huynen et al. (2000).

This study introduces an alternative method that infers protein function without requiring any prior functional annotation of any proteins. Instead, the method uses organism-level phenotype annotations and phylogenetic profiles to identify proteins with high propensities for a given phenotype. The method has broad applicability, as there are many well-characterized phenotypes, and phylogenetic profiles can be directly computed from sequenced genomes. A recent work (Levesque et al. 2003) described a related but different approach for predicting protein function on the basis of phenotypic traits, and applied it to identify flagellar proteins; that approach uses various set-theoretic algorithms and phylogenetic information in the form of orthologous gene sets obtained from the COGs database. Another recent work (Martin et al. 2003) used clusters of phylogenetic profiles to identify proteins that differentiate Gram-positive from Gram-negative bacterial genomes.

We first present the statistical foundations of our approach. Then, we apply our method to the flagella phenotype to show that it works better than earlier approaches that use phylogenetic profiles (Pellegrini et al. 1999; Levesque et al. 2003). In addition, we apply our approach to three phenotypes that have not been tried previously (pili, thermophily, and respiratory tract tropism), and make novel predictions. Our analyses show that reliable associations can be inferred when the phenotype is unlikely to arise from many alternate mechanisms. In contrast to previous approaches, our method can eliminate annotations that are not statistically significant; additionally, our theoretical analysis shows that phenotypes that are either extremely rare or extremely common do not permit annotations of gene function. These features are critical for general application of the approach to a wide range of phenotypes.


Each protein in a reference organism with the phenotype of interest is analyzed to determine whether it is preferentially conserved among organisms exhibiting the phenotype. For each protein, a BLAST search (Altschul et al. 1990) against the nonredundant (nr) database (http://www.ncbi.nih.gov/BLAST/blast_databases.html) reveals possible homologs. A genome is considered to contain a homolog when one of its proteins aligns to the query protein sequence with an e-value below 1.0e-10 and the alignment covers at least 2/3 of the length of the query sequence (the latter requirement screens out strong alignments that cover only short motifs).
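The homolog criterion above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the `Hit` record and the function names are hypothetical, and the input is assumed to be BLAST hits that have already been parsed into genome, e-value, and alignment-length fields. Only the two thresholds are taken from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1.0e-10  # e-value threshold quoted in the text


@dataclass(frozen=True)
class Hit:
    """One alignment of the query protein to a protein of some genome (hypothetical record)."""
    genome: str
    e_value: float
    alignment_length: int


def is_homolog(hit: Hit, query_length: int) -> bool:
    """Apply both criteria: e-value below the cutoff and alignment >= 2/3 of the query."""
    # Integer arithmetic avoids rounding problems with the 2/3 coverage test.
    covers_query = 3 * hit.alignment_length >= 2 * query_length
    return hit.e_value < E_VALUE_CUTOFF and covers_query


def phylogenetic_profile(hits, query_length, genomes):
    """Return a 0/1 vector over `genomes` marking where the query has a homolog."""
    present = {h.genome for h in hits if is_homolog(h, query_length)}
    return [1 if g in present else 0 for g in genomes]
```

The resulting 0/1 vector is the phylogenetic profile of the query protein; the number of ones is the count written ni in the next section.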

Once homologs in all genomes are identified, proteins are matched to the phenotype of interest as follows. The extent to which a protein i is associated with a given phenotype f is quantified by a propensity score Φf(i):

$$\Phi_f(i) = \frac{t_{i,f} / T_f}{n_i / N} \tag{1}$$

in which Tf is the number of genomes that exhibit phenotype f, N is the total number of genomes, ti,f is the number of genomes that both exhibit phenotype f and contain homologs to gene i, and ni is the total number of genomes that contain homologs to gene i.
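As a small worked example, the propensity score can be computed directly from the four counts just defined. The sketch below assumes the score is the fraction of phenotype genomes that contain the gene divided by the fraction of all genomes that contain it, which is consistent with the maximum flagellar propensity of 2.15 reported later for 40 flagellar genomes out of 86. The function name and argument names are illustrative only.

```python
def propensity(t_if: int, n_i: int, T_f: int, N: int) -> float:
    """Propensity of gene i for phenotype f from genome counts.

    t_if: genomes that exhibit f and contain a homolog of gene i
    n_i:  genomes that contain a homolog of gene i
    T_f:  genomes that exhibit f
    N:    total number of genomes
    Assumed form: (t_if / T_f) / (n_i / N).
    """
    if n_i <= 0 or T_f <= 0 or N <= 0:
        raise ValueError("n_i, T_f and N must be positive")
    if t_if > min(n_i, T_f):
        raise ValueError("t_if cannot exceed n_i or T_f")
    return (t_if / T_f) / (n_i / N)


# A gene found in all 40 flagellar genomes and nowhere else (N = 86).
flagella_only = propensity(t_if=40, n_i=40, T_f=40, N=86)

# A gene spread evenly across flagellar and nonflagellar genomes.
unrelated = propensity(t_if=20, n_i=43, T_f=40, N=86)
```

Under this assumption a score near 1 means the gene is no more common among phenotype genomes than among genomes in general, and larger scores indicate enrichment.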

The hypergeometric distribution is then used to screen out statistically insignificant protein-phenotype associations. For a given gene i, if its homologs are found in a total of ni genomes, then

$$\Pr(t \mid n_i, T_f, N) = \frac{\binom{T_f}{t}\binom{N - T_f}{n_i - t}}{\binom{N}{n_i}} \tag{2}$$

gives the probability that by random chance alone the gene is found in t genomes exhibiting phenotype f. The probability that a gene is found in at least ti,f genomes with phenotype f by random chance alone is $\sum_{t = t_{i,f}}^{\min(n_i, T_f)} \Pr(t \mid n_i, T_f, N)$. Finally, using the conservative Bonferroni correction (Miller Jr. 1991) to account for multiple testing, the probability that some gene i is found in at least ti,f genomes with phenotype f among a set of X genes is given by

$$P_f(i) = X \sum_{t = t_{i,f}}^{\min(n_i, T_f)} \Pr(t \mid n_i, T_f, N) \tag{3}$$

in which X is the number of genes in the organism whose genes we are annotating. These Pf(i) values are used for eliminating protein-phenotype associations that are not statistically significant.
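The significance screen described above can be illustrated with exact integer arithmetic from the Python standard library. This is a sketch under stated assumptions: the hypergeometric upper tail is summed from ti,f up to min(ni, Tf), and the Bonferroni correction is taken to be a plain multiplication by X without capping at 1 (an assumption, suggested by the fact that corrected values above 1 are discussed later in the paper). Function names are hypothetical.

```python
from math import comb


def hypergeom_tail(t_if: int, n_i: int, T_f: int, N: int) -> float:
    """P(at least t_if of the n_i genomes holding the gene exhibit f) by chance."""
    if not (0 <= T_f <= N and 0 <= n_i <= N):
        raise ValueError("T_f and n_i must lie between 0 and N")
    upper = min(n_i, T_f)
    # comb returns 0 when the lower index exceeds the upper one, so
    # impossible values of t contribute nothing to the sum.
    favorable = sum(
        comb(T_f, t) * comb(N - T_f, n_i - t)
        for t in range(max(t_if, 0), upper + 1)
    )
    return favorable / comb(N, n_i)


def corrected_p_value(t_if: int, n_i: int, T_f: int, N: int, X: int) -> float:
    """Bonferroni-corrected value: X times the hypergeometric tail (not capped at 1)."""
    return X * hypergeom_tail(t_if, n_i, T_f, N)
```

Associations whose corrected value exceeds a chosen significance level would then be discarded; the level itself is a choice left to the analyst.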

Theoretical Limitations

The number of organisms exhibiting a phenotype limits the maximum propensity score. Φf(i) is maximized when gene i is found only in the target genomes (i.e., when ti,f = ni). Therefore, the maximum propensity $\Phi_f^{\max}$ for phenotype f is:

$$\Phi_f^{\max} = \frac{N}{T_f} \tag{4}$$

The number of organisms exhibiting a given phenotype also limits the statistical significance of the results. In particular, Pf(i) is minimized (most significant) when ti,f = ni = Tf, and the minimum value $P_f^{\min}$ attainable for a phenotype f is:

$$P_f^{\min} = \frac{X}{\binom{N}{T_f}} \tag{5}$$

Equations 4 and 5 describe a trade-off between statistical significance and propensity when choosing the query phenotype. For a given number of sequenced genomes N, a smaller Tf (i.e., a rarer phenotype) will allow for higher propensity scores but at lower statistical significance limits. Intuitively, a large $P_f^{\min}$ indicates that phenotype f is too rare or too common, and a small $\Phi_f^{\max}$ indicates that the phenotype is too common. With 86 genomes and 4000 genes, $P_f^{\min}$ is <4.0e-07 when 7 < Tf < 79 (see Fig. 1).
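The two limits can be tabulated for every possible phenotype frequency, which is essentially what Figure 1 plots. The sketch below assumes the maximum propensity is N / Tf (a gene found only in phenotype genomes) and the minimum corrected P value is X divided by the binomial coefficient C(N, Tf) (a gene found in exactly the phenotype genomes). Function names are illustrative.

```python
from math import comb


def max_propensity(T_f: int, N: int) -> float:
    """Upper limit on the propensity score for a phenotype seen in T_f of N genomes."""
    return N / T_f


def min_p_value(T_f: int, N: int, X: int) -> float:
    """Lower limit on the Bonferroni-corrected P value when testing X genes."""
    return X / comb(N, T_f)


N, X = 86, 4000  # values used in the text

# Phenotype frequencies whose best attainable P value is below 4.0e-07.
significant_range = [T for T in range(1, N) if min_p_value(T, N, X) < 4.0e-7]

# The flagella phenotype (40 of 86 genomes) as a single example.
flagella_limits = (max_propensity(40, N), min_p_value(40, N, X))
```

With N = 86 and X = 4000 this yields the frequencies 8 through 78, matching the range 7 < Tf < 79 quoted above, and a maximum flagellar propensity of 2.15.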

Figure 1
Relationship between maximum propensity Φ and minimum estimated $P_f^{\min}$ as a function of the number of organisms exhibiting phenotype f, given that there are N = 86 total genomes and that we are testing X = 4000 genes.


We apply our method to identify proteins associated with the flagella, pili, thermophily, and respiratory tract tropism phenotypes using 86 sequenced genomes (13 archaebacteria and 73 eubacteria) annotated for these phenotypes (see online Supplemental Material available at www.genome.org for the list of organisms and their phenotype annotations). The phenotype annotations were obtained by reading through all matching PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db-PubMed) abstracts, supplemented by an exhaustive search of relevant online research Web pages; however, it is possible that a few phenotype annotations are missing from our data set. The frequency of each of these phenotypes does not preclude statistically significant associations (see Table 1), and thus our approach is applicable.

Table 1.
Maximum Propensities and Minimum Pf Values for Flagella, Pili, Thermophily, and Respiratory Tract Tropism Phenotypes

Flagellar Proteins

The many annotated flagellar proteins in Escherichia coli allow us to assess the performance of our method. Additionally, previous work on this phenotype allows us to benchmark and compare our method with other approaches. Table 2 shows the 60 most statistically significant Escherichia coli genes with flagellar propensity scores >1.9 (90% of the maximum flagellar propensity, 2.15). This list includes 24 known flagellar genes, one putative motility gene (mbhA, b0230), and five nonflagellar genes known to be involved in chemotaxis.

Table 2.
The 60 Most Statistically Significant Escherichia coli Genes With Flagellar Propensity Scores Greater Than 1.9 (90% of Max Propensity)

The list in Table 2 contains 12 additional known flagellar genes that were not identified using the original phylogenetic profile approach described in Pellegrini et al. (1999). This list also includes all of the genes already identified in that approach except for fliQ, which has a propensity score of 2.15, but is not included in Table 2 because its Pflagella value is not significant. Note that the original phylogenetic profile approach does not use phenotype information and instead transfers functional annotations between proteins with similar profiles.

We also compared our method to the Similarity Measure method (Levesque et al. 2003). To make direct comparisons, we applied this method to the 86 genomes considered here. At a similarity threshold of 0.65, the least restrictive cutoff used in Levesque et al. (2003), their method identified 12 known flagellar genes, all of which are among the 24 identified by our approach.

Both the Similarity Measure and our method have similar performance when applied to the COGs data with the 21 genomes considered in Levesque et al. (2003); each identifies 29 known flagellar genes, among 34 top scoring genes for the Similarity Measure algorithm and 31 top scoring genes for our method.

More rigorously, the Receiver Operating Characteristic (ROC) curves in Figure 2 compare the sensitivity versus specificity trade-offs when all three approaches are applied to the 86 genomes considered here. These curves show that our approach consistently produces fewer false positives at each level of sensitivity. It is important to note that the false-positive rates in Figure 2 are upper bounds, because we cannot assume that all flagellar proteins have been annotated (i.e., some of the putative false positives may be flagellar proteins). Figure 2 also shows that propensity scores can be used to improve performance independently of the estimated P values. At high specificity, the ROC curves improve (move closer to the upper left corner) as we increase the propensity cutoffs from 1 to 1.8. Larger propensity cutoffs increase the number of false negatives, and eventually, at cutoffs ≥2.0, the flagella ROC curves begin to worsen.
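One plausible way to trace such a curve is sketched below. It is an assumption, not a description of how Figure 2 was produced: the propensity cutoff is held fixed, the P value threshold is swept, and each threshold yields one (false-positive rate, sensitivity) point measured against the known annotations. The tuple layout and the function name are hypothetical.

```python
def roc_points(genes, propensity_cutoff, p_thresholds):
    """Return (false-positive rate, sensitivity) pairs, one per P value threshold.

    genes: iterable of (propensity, corrected_p_value, is_known_positive) tuples.
    A gene is called positive when its propensity is at least the cutoff and
    its corrected P value is at most the threshold.
    """
    genes = list(genes)
    positives = sum(1 for _, _, known in genes if known)
    negatives = len(genes) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one known positive and one negative")

    points = []
    for threshold in sorted(p_thresholds):
        called = [
            known
            for phi, p, known in genes
            if phi >= propensity_cutoff and p <= threshold
        ]
        true_pos = sum(1 for known in called if known)
        false_pos = len(called) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points
```

Because unannotated flagellar genes are counted as negatives here, the false-positive rates from such a sketch are upper bounds, as noted in the text.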

Figure 2
Receiver Operating Characteristic (ROC) curves comparing our approach with the approaches of Levesque et al. (2003) and Pellegrini et al. (1999) on the same flagella data set. Each ROC curve for our approach is obtained by keeping all genes with propensity ...

Proteins Associated With Pili

Pili are another structural feature of some bacteria for which some of the component proteins are known. Table 3 shows the 40 most statistically significant Pseudomonas aeruginosa proteins with propensity scores >4.5 for organisms that have pili (Sauer et al. 2000). Five of the seven known proteins in this list are known fimbrial biogenesis proteins (pilA, pilN, pilO, pilP, and pilQ); their corresponding Bonferroni corrected Pf values are <0.109, with three of these five having Pf values <0.05.

Table 3.
The 40 Most Statistically Significant Pseudomonas aeruginosa Genes With Pili Propensity Scores Greater Than 4.5

Proteins Associated With Thermophily

Thermoanaerobacter tengcongensis is an anaerobic thermophilic eubacterium whose genome was sequenced recently (Bao et al. 2002). How thermophiles have adapted to survive at high temperatures is not fully understood. Radiation sensitivity studies indicate that thermophiles repair DNA efficiently, but sequencing results suggest that many of their DNA repair genes are still unrecognized because they are too different from those of well-studied organisms (Grogan 1998). Here, we use our method to uncover the 40 most statistically significant T. tengcongensis genes with thermophily propensity scores >3.0 (Table 4). This list includes three DNA repair genes, one of which is reverse gyrase. Reverse gyrase is the only known topoisomerase that induces positive supercoiling in DNA, and hence improves DNA stability at high temperatures (Forterre et al. 2000). This list also includes nine components of ferredoxin oxidoreductase. Anaerobic metabolism involving ferredoxin oxidoreductase appears to be unique to hyperthermophiles (Kelly and Adams 1994), and oxidoreductases related to hydrogen evolution have been shown recently to be crucial in the central metabolism of hyperthermophiles, replacing dehydrogenases in many key steps of metabolism (Borges et al. 1996). In addition, the recent isolation from hydrothermal vents of a microbial strain that uses Fe(III) as the electron acceptor and can grow at 121°C suggests that Fe(III) reduction may be an important process for growth in hydrothermal environments (Kashefi and Lovley 2003). Altogether, Table 4 identifies at least 21 genes that may be associated with thermophily: three DNA repair genes, nine ferredoxin oxidoreductase genes, and nine additional hypothetical genes that currently have unknown function.

Table 4.
The 40 Most Statistically Significant T. tengcongensis Proteins With Thermophily Propensity Scores Greater Than 3.0

Proteins Associated With Respiratory Tract Tropism

We identified 14 bacteria with respiratory tract tropism, and used this list to compute respiratory tract tropism propensity scores for Streptococcus pneumoniae genes. In this case, there are no genes with statistically significant propensities for the respiratory tract tropism phenotype – none of the top propensity scores have Pf values <2. Perhaps this is because the phenotype description is too general, as bacterial tropism is known to involve a wide variety of mechanisms that include immune evasion, metabolic adaptation, and physical attachment and invasion. The lack of statistically significant associations indicates that respiratory tropism is difficult to study as a single phenotype, at least using our method.


We have described an approach that combines organism-to-phenotype associations along with phylogenetic profiles to identify proteins with high propensities for a given phenotype; such an approach can be used to annotate proteins with phenotype information. We validated this approach by demonstrating its ability to identify known flagellar and pili proteins, and then applied it to the identification of proteins associated with thermophily.

Phenotype annotations are usually more general than traditional protein functional annotations; typically, several proteins spanning multiple functional complexes and pathways contribute to a given phenotype, and the same phenotype can be accomplished in more than one way. Correspondingly, we have found that it is insufficient to simply search for proteins that are conserved in a majority of the organisms exhibiting the query phenotype. For example, none of the identified flagellar proteins are conserved in all 40 flagellar genomes, and most of them are conserved in 20 or fewer flagellar genomes. By using propensity scores, our approach is able to match proteins to phenotype without requiring that the proteins be conserved in a majority of the organisms with that phenotype.

Proteins with the same propensity scores can have very different phylogenetic profiles, and therefore, it is unlikely that a single representative protein can be used to match and identify the set of proteins responsible for a phenotype. Figure 3 shows the average Hamming distance between phylogenetic profiles of E. coli proteins at each flagellar propensity level. The average Hamming distance between the phylogenetic profiles of the proteins with highest flagellar propensity scores is 4.0, whereas proteins with lower propensity scores can have Hamming distances >30. In addition, Figure 4 depicts the hierarchical clustering of the top proteins associated with flagella (see footnote 5) and thermophily (as given in Tables 2 and 4), and shows that the phylogenetic profiles of the top proteins can vary considerably. Hence, even if it is possible to identify a representative protein for a given phenotype (e.g., as in Pellegrini et al. 1999), it is not possible to find all relevant proteins by simply searching for other proteins with similar phylogenetic profiles. Our approach is robust against these large distances between phylogenetic profiles, because it uses propensity scores as opposed to raw phylogenetic profiles.
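The distance measure used in this comparison is simple to state in code. The following sketch computes the Hamming distance between two 0/1 phylogenetic profiles and the average over all pairs in a group of profiles; grouping proteins by propensity level is left to the caller, and the function names are illustrative.

```python
from itertools import combinations


def hamming(profile_a, profile_b) -> int:
    """Number of genomes in which exactly one of the two proteins has a homolog."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def average_pairwise_hamming(profiles) -> float:
    """Mean Hamming distance over all unordered pairs of profiles."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("need at least two profiles")
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Toy example over four genomes: pairwise distances are 1, 4, and 3.
toy_profiles = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
toy_average = average_pairwise_hamming(toy_profiles)
```

Two proteins found in disjoint subsets of the phenotype genomes can share the same propensity score while being far apart under this distance, which is the situation the paragraph above describes.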

Figure 3
Average phylogenetic distances between E. coli proteins at each flagellar propensity level.
Figure 4
Hierarchical clustering (average-linkage) of the top proteins associated with flagella and thermophily (see Tables 2 and 4), on the basis of their phylogenetic profiles. Genomes are on the x-axis, and genes are on the y-axis. ...

An artifact of previous phylogenetic comparison approaches is that distances between phylogenetic profiles are sensitive to the size of the set of background genomes. For example, arbitrarily expanding the set of background genomes usually increases the distances between phylogenetic profiles. In our approach, this scaling relationship is automatically captured by propensity scores, and expanding the set of background genomes will, in general, increase the statistical significance (i.e., lower Pf values) of the top proteins. Follow-up work along these lines should address evolutionary distances between species; it is not obvious how to handle statistical significance in an analytical way, and nonparametric approaches may be more promising in this regard.

These initial results are encouraging, and provide a statistical framework for the general application of the approach to a large class of well-characterized phenotypes. This process might begin by looping through organism phenotype annotations and computing their $\Phi_f^{\max}$ and $P_f^{\min}$ scores in order to filter out phenotypes that are too common or too rare, and then match the remaining phenotypes to individual proteins by checking each protein's propensity for that phenotype. With the rapidly increasing pace of whole-genome sequencing, and the commensurate accumulation of novel genes, approaches such as ours can efficiently generate high-yield hypotheses for experimental validation of gene function. In this regard, whole-organism characterization of phenotypic traits may become a central activity in the post-genomic approach to understanding biological networks.
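The first step of the process outlined above, screening phenotypes before any protein is examined, might look like the following sketch. It assumes the maximum propensity is N / Tf and the minimum corrected P value is X / C(N, Tf); the two threshold parameters, the phenotype counts other than flagella (40) and respiratory tract tropism (14), and all names are hypothetical choices made for illustration.

```python
from math import comb


def screen_phenotypes(phenotype_counts, N, X, min_phi_max, max_p_min):
    """Keep phenotypes that are neither too common nor too rare.

    phenotype_counts: mapping from phenotype name to T_f, the number of
    genomes exhibiting it. Returns a mapping from each retained phenotype
    to its (maximum propensity, minimum corrected P value) pair.
    """
    kept = {}
    for name, T_f in phenotype_counts.items():
        if T_f <= 0 or T_f >= N:
            continue  # phenotype absent or universal: nothing to learn
        phi_max = N / T_f
        p_min = X / comb(N, T_f)
        if phi_max >= min_phi_max and p_min <= max_p_min:
            kept[name] = (phi_max, p_min)
    return kept


# Illustrative counts: the last two phenotypes are invented for the example.
counts = {"flagella": 40, "respiratory tract tropism": 14,
          "very rare trait": 2, "near-universal trait": 85}
retained = screen_phenotypes(counts, N=86, X=4000,
                             min_phi_max=1.5, max_p_min=0.05)
```

Passing this screen only means that significant associations are possible in principle; as the respiratory tract tropism results show, a phenotype can pass and still yield no significant proteins.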


We thank the anonymous referees for many helpful suggestions. M.S. is supported in part by NSF PECASE award MCB-0093399 and DARPA grant MDA972-00-1-0031. S.T. is supported in part by NSF CAREER award MCB-0133750 and DARPA grant N66001-02-1-8929.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.


Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1586704.


[Supplemental material available online at www.genome.org.]

Footnote 5: In Figure 4, it is interesting to note that the organisms that exhibit flagella, yet have few homologs to the top 60 E. coli proteins in Table 2, are archaea.


  • Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
  • Bao, Q., Tian, Y., Li, W., Xu, Z., Xuan, Z., Hu, S., Dong, W., Yang, J., Chen, Y., Xue, Y., et al. 2002. A complete sequence of the T. tengcongensis genome. Genome Res. 12: 689-700.
  • Borges, K.M., Brummet, S.R., Bogert, A., Davis, M.C., Hujer, K.M., Domke, S.T., Szasz, J., Ravell, J., DiRuggiero, J., Fuller, C., et al. 1996. A survey of the genome of the hyperthermophilic archaeon, Pyrococcus furiosus. Genome Sci. Technol. 1: 37-46.
  • Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., and Yuan, Y. 1998. Predicting function: From genes to genomes and back. J. Mol. Biol. 283: 707-725.
  • Enright, A.J., Iliopoulos, I., and Kyrpides, N.C. 1999. Protein interaction maps for complete genomes based on gene fusion events. Nature 402: 86-90.
  • Forterre, P., de la Tour, C.B., Philippe, H., and Duguet, M. 2000. Reverse gyrase from hyperthermophiles: Probable transfer of a thermoadaptation trait from archaea to bacteria. Trends Genet. 16: 152-154.
  • Gaasterland, T. and Ragan, M. 1998. Constructing multigenome views of whole microbial genomes. Microbiol. Comp. Genomics 3: 177-192.
  • Grogan, D.W. 1998. Hyperthermophiles and the problem of DNA instability. Mol. Microbiol. 28: 1043-1050.
  • Huynen, M., Snel, B., Lathe III, W., and Bork, P. 2000. Predicting protein function by genomic context: Quantitative evaluation and qualitative inferences. Genome Res. 10: 1204-1210.
  • Kashefi, K. and Lovley, D.R. 2003. Extending the upper temperature limit for life. Science 301: 934.
  • Kelly, R.M. and Adams, M.W.W. 1994. Metabolism in hyperthermophilic microorganisms. Antonie van Leeuwenhoek 66: 247-270.
  • Levesque, M., Shasha, D., Kim, W., Surette, M.G., and Benfey, P.N. 2003. Trait-to-gene: A computational method for predicting the function of uncharacterized genes. Curr. Biol. 13: 129-133.
  • Marcotte, E.M., Pellegrini, M., Ng, H.-L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999. Detecting protein function and protein–protein interactions from genome sequences. Science 285: 751-753.
  • Martín, M.J., Herrero, J., Mateos, A., and Dopazo, J. 2003. Comparing bacterial genomes through conservation profiles. Genome Res. 13: 991-998.
  • Miller Jr., R.G. 1991. Simultaneous statistical inference. In Springer series in statistics. Springer-Verlag, New York.
  • Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G., and Maltsev, N. 1999. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. 96: 2896-2901.
  • Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., and Yeates, T.O. 1999. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 96: 4285-4288.
  • Sauer, F., Barnhart, M., Choudhury, D., Knight, S., Waksman, G., and Hultgren, S. 2000. Chaperone-assisted pilus assembly and bacterial attachment. Curr. Opin. Struc. Biol. 10: 548-556.
  • Snel, B., Bork, P., and Huynen, M. 2000. Genome evolution: Gene fusion versus gene fission. Trends Genet. 16: 9-11.
  • Tamames, J., González-Moreno, M., Mingorance, J., Valencia, A., and Vicente, M. 2001. Bringing gene order into bacterial shape. Trends Genet. 17: 124-126.
  • Tatusov, R.L., Koonin, E.V., and Lipman, D.J. 1997. A genomic perspective on protein families. Science 278: 631-637.

Articles from Genome Research are provided here courtesy of Cold Spring Harbor Laboratory Press