NCBI News



Investigator Profile


Comparative Genomics: 
From Sequence to Evolution to Function

The recent explosion in genome sequencing has led to a rapid enrichment of the protein databases, in terms of both the number and the variety of protein sequences. The functions of the majority of these proteins remain unknown. By classifying proteins according to their degree of sequence similarity, which generally reflects evolutionary (homologous) relationships, computational biologists are able to predict the three-dimensional structure and a likely function for many proteins, and to determine their evolutionary origin.


COGs: A Tool for Whole Genome Comparative Analyses

The database of Clusters of Orthologous Groups of proteins (COGs), developed by NCBI investigators Tatusov, Koonin, and Lipman, is designed to classify proteins from completely sequenced genomes on the basis of orthologous relationships. The first version of this database, released in 1997, contained proteins from seven genomes and consisted of 720 COGs. Today, the database includes proteins from 34 complete genomes and consists of 2,885 COGs. A companion program, COGNITOR, was developed to fit new proteins into COGs. COGNITOR may also be used to annotate newly sequenced genomes, and it allows researchers to predict the functions of individual proteins or protein sets.

Over a period of years, NCBI investigators developed and refined computational approaches that allowed them to detect previously unnoticed but potentially important protein sequence similarities. Using this strategy to search various protein databases, researchers are able to compare the genomes of different organisms and identify conserved protein families and key protein pathways that are modified, or absent, in an organism. Comparative genome analyses also provide fundamental insights into the organization and evolution of highly diverged species and are instrumental in identifying other biological features that may confer a distinct evolutionary advantage to an organism.

Galperin and Koonin addressed the issue of detecting targets for anti-bacterial drugs using a comparative-genomic approach. For this purpose, one needs to identify genes that are likely to be essential for the survival of bacterial pathogens, but are absent in the host. One method for predicting essential genes, sometimes called genome subtraction, capitalizes on the COGs database, which includes conserved protein families represented in at least three phylogenetically distant organisms. The ability to query the database and retrieve a list of all COGs with a particular phylogenetic pattern allows researchers to identify genes that are present in the genomes of all or most pathogenic bacteria, but absent in their eukaryotic host, thereby delineating potential drug targets.

Experimental approaches may then be used to validate the essentiality of the selected genes for bacterial survival and characterize their cellular functions. Using the COGs database as a tool to search for drug targets in microbial genomes demonstrates the potential of comparative genomics in accelerating the drug discovery process.
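The phylogenetic-pattern query behind genome subtraction can be illustrated with a minimal sketch. This is not the COGs database interface; the function name, the `min_fraction` parameter, the COG identifiers, and the genome names are all hypothetical and serve only to show the filtering logic: keep families found in all or most pathogens and drop any family that also occurs in the host.

```python
# Toy illustration of "genome subtraction" on phylogenetic patterns.
# Every identifier and genome name here is invented for the example.

def select_candidate_targets(cog_patterns, pathogens, host, min_fraction=1.0):
    """Return COG ids present in at least min_fraction of the pathogens
    and absent from the host genome."""
    candidates = []
    for cog_id, genomes in cog_patterns.items():
        if host in genomes:
            continue  # a counterpart in the host rules the family out
        hits = sum(1 for p in pathogens if p in genomes)
        if hits >= min_fraction * len(pathogens):
            candidates.append(cog_id)
    return sorted(candidates)

# Each COG maps to the set of genomes in which it is represented.
cog_patterns = {
    "COG_A": {"pathogen1", "pathogen2", "pathogen3", "host"},
    "COG_B": {"pathogen1", "pathogen2", "pathogen3"},
    "COG_C": {"pathogen1", "pathogen3"},
    "COG_D": {"host"},
}
pathogens = ["pathogen1", "pathogen2", "pathogen3"]

strict = select_candidate_targets(cog_patterns, pathogens, "host")
relaxed = select_candidate_targets(cog_patterns, pathogens, "host", min_fraction=0.6)
print(strict)   # ['COG_B']
print(relaxed)  # ['COG_B', 'COG_C']
```

Relaxing `min_fraction` corresponds to accepting families present in most, rather than all, pathogens. Candidates selected this way would still require the experimental validation described above.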


Predicting New Components of Known Molecular Complexes and Pathways

In another study, Aravind, Koonin, and colleagues compared 4,344 protein sequences from fission yeast with all available eukaryotic protein sequences. They identified protein sequences that were common to both fission yeast and non-fungal eukaryotes, but that were missing or significantly different in baker’s yeast. These two species of yeast are evolutionarily close enough that direct counterparts among their genes are readily detectable, but distant enough to show substantial gene differences. Analysis of the combined data showed that, since its divergence from a common ancestor with fission yeast, baker’s yeast had lost about 300 genes, and approximately 300 additional genes had diverged significantly. The most notable feature of the set of genes lost in baker’s yeast was the co-elimination of functionally connected groups of proteins, such as proteins involved in post-transcriptional gene silencing. By examining patterns of coordinated gene loss, in combination with a careful analysis of conserved domains, researchers can reconstruct functional interactions among proteins and predict previously unknown pathways.
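The idea of detecting coordinated gene loss can be sketched with simple set operations. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the family identifiers, the functional annotations, and both function names are hypothetical, and a real analysis would start from sequence comparisons rather than from ready-made sets.

```python
from collections import Counter

def lost_families(reference, other_eukaryotes, query):
    """Families shared by the reference yeast and other eukaryotes
    but missing from the query yeast."""
    return (reference & other_eukaryotes) - query

def co_eliminated_groups(lost, functional_group, min_lost=2):
    """Count lost families per functional group and keep groups in which
    at least min_lost members disappeared together."""
    counts = Counter(functional_group[f] for f in lost if f in functional_group)
    return {group: n for group, n in counts.items() if n >= min_lost}

# Invented toy data: protein families detected in each proteome.
fission_yeast = {"fam1", "fam2", "fam3", "fam4", "fam5"}
non_fungal = {"fam1", "fam2", "fam3", "fam5"}
bakers_yeast = {"fam4", "fam5"}

# Invented functional annotations for the families.
functional_group = {
    "fam1": "gene silencing",
    "fam2": "gene silencing",
    "fam3": "transport",
    "fam5": "translation",
}

lost = lost_families(fission_yeast, non_fungal, bakers_yeast)
print(sorted(lost))                                  # ['fam1', 'fam2', 'fam3']
print(co_eliminated_groups(lost, functional_group))  # {'gene silencing': 2}
```

A functional group that loses several members at once is a candidate for a pathway that was eliminated as a unit, and an uncharacterized family lost alongside it is a candidate new component of that pathway.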


Selecting Target Proteins for Structure Determination

The determination of a protein’s three-dimensional structure is key to unlocking biologic function. At this time, it is still not feasible technically to determine the structures of all of the proteins encoded in the human genome, or even in a typical smaller prokaryotic genome. However, considerable information relating to a protein’s structure may be gleaned from studying its sequence. This is because there exists a limited number of distinct protein building blocks, or folds. Proteins with similar sequences tend to have similar folds and hence, similar structures. This suggests that for each sequence, researchers should be able to identify a homologous protein with a known structure that may serve as a model for the structural characterization of other proteins.

To explore this concept, Wolf and Koonin constructed a protein-fold recognition procedure based on a method for iterative searching of sequence databases. Using this approach, they determined that the distribution of the most common protein folds is similar in bacteria and archaea, but distinct in eukaryotes. This result demonstrated that the method can detect subtle relationships between proteins from various phylogenetic lineages that were previously detectable only by structure-structure comparisons. Based on these results, the investigators concluded that the method was a sensitive and reliable procedure for selecting potential targets, that is, a representative set of protein folds that would allow researchers to predict, with confidence and in reasonable detail, the structures of the remaining proteins encoded in an organism’s genome.

The next step was to determine the number of structures needed to obtain characterized representatives for nearly all folds. Wolf, Grishin, and Koonin devised a mathematical model that described the distributions generated by randomly sampling from the universal population of protein folds and families. They used this model to estimate the number of folds and families in the protein universe and in complete genomes. The total number of folds in globular, water-soluble proteins was estimated at approximately 1,000, with structural information available for about one-third of these folds. The number of protein families that show significant sequence conservation was estimated to be between 4,000 and 7,000, with structures available for about 20 percent of these. To cover all folds, one needs to structurally characterize approximately 85 percent of the protein families, as many folds contain only one or two families. Yet the current number of structurally characterized protein families is only between 15 and 25 percent of the required number. These data emphasize the need to select targets for protein structure determination carefully, so as to maximize the chance of obtaining structures from new folds.
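The family-coverage figures quoted above can be checked with a few lines of arithmetic. The sketch uses only the rounded percentages given in the text (20 percent of families with structures, 85 percent needed); the function name is hypothetical, and the calculation is a consistency check, not the authors' sampling model.

```python
def family_coverage(total_families, frac_with_structure=0.20, frac_needed=0.85):
    """Return (characterized, required, characterized / required) for a
    given estimate of the total number of protein families."""
    characterized = total_families * frac_with_structure
    required = total_families * frac_needed
    return characterized, required, characterized / required

# Lower and upper estimates of the number of conserved protein families.
for total in (4000, 7000):
    characterized, required, ratio = family_coverage(total)
    print(f"{total} families: about {characterized:.0f} characterized, "
          f"{required:.0f} required, {ratio:.1%} of the required number")
```

Because both quantities scale with the total, the ratio is the same for either estimate: 20 / 85, or roughly 24 percent, which falls within the 15 to 25 percent range stated above.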


Eukaryotic Genomes

Wolf, Kondrashov, and Koonin used comparative genomics to further our understanding of the origins of introns, the sequences that interrupt eukaryotic genes and constitute the most important feature distinguishing eukaryotic genes from prokaryotic ones. They compared the protein-coding sequences of the roundworm, a multicellular eukaryote, against a complete, non-redundant protein database. The results demonstrated that a large number of the eukaryotic proteins showed significantly greater similarity to bacterial homologs than to archaeal ones, and that some proteins even had a greater resemblance to their bacterial counterparts than to those from other eukaryotes. In addition, approximately 1,300 “ancient” genes were identified, that is, genes that were more or less conserved in both archaea and bacteria. Next, they estimated and compared the average intron density in roundworm “ancient” and “bacterial” genes, because it has been hypothesized that the protein-coding genes of the last universal common ancestor contained introns. If this were true, then the genes of ancient and bacterial origin should differ in their intron densities, because genes acquired from bacteria had only a limited time to accrue introns. Yet the data did not show a statistically significant difference in intron density between these two gene categories, lending credence to a second theory, which postulates that introns invaded genes after the emergence of eukaryotes.
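The comparison of intron densities between two gene categories can be illustrated with a generic two-sample permutation test. The text does not say which statistical test the authors used, so the permutation test here is a stand-in chosen for simplicity. The function name is hypothetical and the density values are invented toy numbers, not data from the study.

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose absolute difference
    in means is at least as large as the observed one."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Invented intron densities (introns per kilobase of coding sequence).
ancient_genes = [3.1, 2.8, 3.4, 2.9, 3.3, 3.0, 2.7, 3.2]
bacterial_genes = [3.0, 3.2, 2.9, 3.1, 2.8, 3.3, 3.0, 2.9]

p = permutation_p_value(ancient_genes, bacterial_genes)
print(f"p-value: {p:.3f}")
```

A large p-value means the observed difference in mean intron density is no greater than what random relabeling of the genes typically produces, which is the kind of outcome the paragraph above describes as showing no statistically significant difference.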

These brief research highlights demonstrate the impact that molecular analysis of genomic data, combined with modern computational and theoretical approaches, can have on furthering our understanding of the evolutionary, fundamental and practical problems facing biomedical researchers today. These studies also show the present utility and future potential of complete genome comparisons in identifying gene products produced by a particular organism and in predicting their structure and function. Using this approach, one can also identify a gene that is common to all organisms within the three domains of life, as well as a gene that is unique to a particular domain, thereby gaining meaningful insights into the organization and evolution of biological systems. CB




NCBI News | Spring 2000 NCBI News | Spring 2001