

Comparative Genomics: From Sequence to Evolution to Function
The recent explosion in genome sequencing has led to a rapid enrichment of
the protein databases, both in the number and in the variety of protein
sequences. The functions of the majority of these proteins remain unknown.
By classifying proteins according to their degree of sequence similarity,
which generally reflects evolutionary (homologous) relationships, computational
biologists are able to predict the three-dimensional structure and a likely
function for many proteins, and determine their evolutionary origin.
COGs: A Tool for Whole Genome Comparative Analyses
The database of Clusters of Orthologous Groups of proteins (COGs), developed
by NCBI investigators Tatusov, Koonin, and Lipman, is designed
to classify proteins from completely sequenced genomes on the basis of
orthologous relationships. The first version of this database, released
in 1997, contained proteins from seven genomes grouped into 720
COGs. Today, the database includes proteins from 34 complete genomes
and consists of 2,885 COGs. A companion program, COGNITOR, was
developed to fit new proteins into COGs. COGNITOR may also be used to
annotate newly sequenced genomes and allows researchers to predict the
function(s) of individual proteins or protein sets.
Over a period of years, NCBI investigators developed and refined computational
approaches that allowed them to detect previously unnoticed but potentially
important protein sequence similarities. Using this strategy to search
various protein databases, researchers are able to compare the genomes
of different organisms and identify conserved protein families and key
protein pathways that are modified, or absent, in an organism. Comparative
genome analyses also provide fundamental insights into the organization
and evolution of highly diverged species and are instrumental in identifying
other biological features that may confer a distinct evolutionary advantage
to an organism.
Galperin and Koonin addressed the issue of detecting targets
for anti-bacterial drugs using a comparative-genomic approach. For this
purpose, one needs to identify genes that are likely to be essential for
the survival of bacterial pathogens, but are absent in the host. One method
for predicting essential genes, sometimes called genome subtraction, capitalizes
on the COGs database, which includes conserved protein families represented
in at least three phylogenetically distant organisms. The ability to query
the database and retrieve a list of all COGs with a particular phylogenetic
pattern allows researchers to identify genes that are present in the genomes
of all or most pathogenic bacteria but absent in their eukaryotic host,
thereby delineating potential drug targets.
Experimental approaches may then be used to validate the essentiality
of the selected genes for bacterial survival and characterize their cellular
functions. Using the COGs database as a tool to search for drug targets
in microbial genomes demonstrates the potential of comparative genomics
in accelerating the drug discovery process.
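The phylogenetic-pattern query behind genome subtraction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the presence table, the family and genome names, and the function `select_candidate_cogs` are hypothetical and are not part of the COGs database or its interface.

```python
from typing import Dict, List, Set


def select_candidate_cogs(
    cog_presence: Dict[str, Set[str]],
    pathogens: Set[str],
    hosts: Set[str],
    min_pathogen_fraction: float = 1.0,
) -> List[str]:
    """Return family ids present in enough pathogens and in no host genome."""
    if not pathogens:
        raise ValueError("at least one pathogen genome is required")
    candidates = []
    for cog_id in sorted(cog_presence):
        genomes = cog_presence[cog_id]
        # Subtraction step: discard any family with a host representative.
        if genomes & hosts:
            continue
        # Keep the family if enough pathogen genomes encode a member.
        fraction = len(genomes & pathogens) / len(pathogens)
        if fraction >= min_pathogen_fraction:
            candidates.append(cog_id)
    return candidates


# Made-up presence table: family id -> genomes that encode a member.
toy_presence = {
    "family_1": {"pathogen_a", "pathogen_b", "pathogen_c"},
    "family_2": {"pathogen_a", "pathogen_b", "pathogen_c", "host"},
    "family_3": {"pathogen_a", "pathogen_b"},
    "family_4": {"host"},
}
toy_pathogens = {"pathogen_a", "pathogen_b", "pathogen_c"}
toy_hosts = {"host"}

# "All pathogens" versus "most pathogens" (here, at least 60 percent).
strict = select_candidate_cogs(toy_presence, toy_pathogens, toy_hosts)
relaxed = select_candidate_cogs(toy_presence, toy_pathogens, toy_hosts, 0.6)
print(strict)
print(relaxed)
```

The `min_pathogen_fraction` parameter stands in for the "all or most" wording: a value of 1.0 requires every pathogen genome to carry the family, while a lower value tolerates occasional absences. Candidates produced this way would still need the experimental validation described above.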
Predicting New Components of Known Molecular Complexes and Pathways
In another study, Aravind, Koonin and colleagues compared 4,344
protein sequences from fission yeast with all available eukaryotic protein
sequences. They identified protein sequences that were common to both
fission yeast and non-fungal eukaryotes, but that were missing or significantly
different in baker's yeast. These two species of yeast are evolutionarily
close enough that direct counterparts among their genes are readily
detectable, but distant enough to show substantial gene differences.
Analysis of the combined data showed that, since its divergence from a common
ancestor with fission yeast, baker's yeast had lost about 300 genes
and approximately 300 additional genes had diverged significantly. The
most notable feature of the set of genes lost in baker's yeast was
the co-elimination of functionally connected groups of proteins, such
as proteins involved in post-transcriptional gene silencing.
By examining patterns of coordinated gene loss, in combination with a
careful analysis of conserved domains, researchers can reconstruct functional
interactions between and among proteins and predict previously unknown
pathways.
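The idea of reading function from coordinated gene loss can be illustrated with simple set operations. The sketch below is not the procedure used in the study, which relied on sequence comparison and domain analysis. The gene identifiers, the annotation table, and the function `coordinated_losses` are hypothetical, and orthology between genomes is assumed to be already resolved into shared gene names.

```python
from collections import defaultdict
from typing import Dict, List, Set


def coordinated_losses(
    retained: Set[str],
    reference: Set[str],
    focal: Set[str],
    functional_group: Dict[str, str],
    min_group_size: int = 2,
) -> Dict[str, List[str]]:
    """Group genes missing from the focal genome by annotated function."""
    # Genes shared by the two comparison sets but absent from the focal genome.
    lost = (retained & reference) - focal
    groups = defaultdict(list)
    for gene in sorted(lost):
        groups[functional_group.get(gene, "unannotated")].append(gene)
    # Report only annotated groups in which several members were lost together.
    return {
        name: genes
        for name, genes in groups.items()
        if name != "unannotated" and len(genes) >= min_group_size
    }


# Made-up gene sets standing in for the three groups of genomes.
fission_yeast = {"g1", "g2", "g3", "g4", "g5"}
other_eukaryotes = {"g1", "g2", "g3", "g5", "g6"}
bakers_yeast = {"g3", "g6"}
annotation = {"g1": "silencing", "g2": "silencing", "g5": "transport"}

result = coordinated_losses(
    fission_yeast, other_eukaryotes, bakers_yeast, annotation
)
print(result)
```

An unannotated gene that is lost alongside a known functional group would be a natural candidate for a new component of that pathway, which is the kind of inference the paragraph above describes.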
Selecting Target Proteins for Structure Determination
The determination of a protein's three-dimensional structure is key
to unlocking its biological function. At this time, it is still not technically
feasible to determine the structures of all of the proteins encoded in the human
genome, or even in a typical smaller prokaryotic genome. However, considerable
information relating to a protein's structure may be gleaned from
studying its sequence. This is because there is a limited number of
distinct protein building blocks, or folds. Proteins with similar sequences
tend to have similar folds and hence, similar structures. This suggests
that for each sequence, researchers should be able to identify a homologous
protein with a known structure that may serve as a model for the structural
characterization of other proteins.
To explore this concept, Wolf and Koonin constructed a protein-fold
recognition procedure based on a method for iterative searching of sequence
databases. Using this approach, they determined that the distribution
of the most common protein folds is similar in bacteria and archaea, but
distinct in eukaryotes, demonstrating the ability of this method to detect
subtle relationships between proteins from various phylogenetic lineages
that were previously only detectable by structure-structure comparisons.
Based on these results, the investigators considered this method a sensitive
and reliable procedure for selecting potential targets, that is, a representative
set of protein folds that would allow researchers to predict the structures
of the rest of the proteins encoded in an organism's genome with
confidence and in reasonable detail.
The next step was to determine the number of structures needed to obtain
characterized representatives for nearly all folds. Wolf, Grishin,
and Koonin devised a mathematical model that described the distributions
generated by randomly sampling from the universal population of protein
folds and families. They used this model to estimate the number of
folds and families in the protein universe and in complete genomes. The
total number of folds in globular, water-soluble proteins was estimated
at approximately 1,000, with structural information available for about
one-third of these folds. The number of protein families that show
significant sequence conservation was estimated to be between 4,000 and
7,000, with structures available for about 20 percent of these. To cover
all folds, one needs to structurally characterize approximately 85 percent
of the protein families, as many folds contain only one or two families.
Yet, the current number of structurally characterized protein families
is only between 15 and 25 percent of the required number. These data emphasize
the need to carefully select targets for protein structure determination
so as to maximize the chance of obtaining structures from new folds.
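The figures quoted above can be checked with a short calculation. The sketch below uses only the numbers given in the text (4,000 to 7,000 families, about 20 percent with known structures, about 85 percent required to cover all folds). The function name `structural_coverage` is introduced here for illustration.

```python
def structural_coverage(
    total_families: int,
    solved_fraction: float = 0.20,
    required_fraction: float = 0.85,
) -> dict:
    """Compare families with known structures against the number required."""
    solved = total_families * solved_fraction
    required = total_families * required_fraction
    return {
        "solved": solved,
        "required": required,
        "remaining": required - solved,
        # Share of the required families that already have a structure.
        "progress": solved / required,
    }


# Lower and upper estimates of the number of families from the text.
for total in (4000, 7000):
    r = structural_coverage(total)
    print(
        f"{total} families: about {r['solved']:.0f} characterized, "
        f"about {r['required']:.0f} required, progress {r['progress']:.1%}"
    )
```

The ratio is independent of the total number of families: 0.20 divided by 0.85 is roughly 24 percent, which falls within the 15 to 25 percent range stated above.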
Eukaryotic Genomes
Wolf, Kondrashov, and Koonin used comparative genomics to
further our understanding of the origins of introns, the sequences that
interrupt eukaryotic genes and constitute the most important feature that
distinguishes eukaryotic genes from prokaryotic ones. They compared the
protein-coding sequences of the roundworm, a multicellular eukaryote,
against a complete, non-redundant protein database. Results demonstrated
that a large number of the eukaryotic proteins showed significantly greater
similarity to bacterial homologs than to archaeal ones and that some proteins
even had a greater resemblance to their bacterial counterparts than to
those from other eukaryotes. In addition, approximately 1,300 "ancient"
genes were identified, that is, genes that were more or less conserved in both
archaea and bacteria. Next, they estimated and compared the average intron
density in roundworm "ancient" and "bacterial" genes,
as it has been hypothesized that the protein-coding genes of the last
universal common ancestor contained introns. If this were true, then the
genes of ancient and bacterial origin should differ in their intron densities
because genes acquired from bacteria had only a limited time to accrue
introns. Yet the data did not show a statistically significant difference
in intron density between these two gene categories, lending credence
to a second theory that postulates that introns invaded genes after the
emergence of eukaryotes.
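A comparison of intron densities between two gene categories can be framed as a two-sample test. The source does not say which statistical test was used, so the sketch below swaps in a simple permutation test on the difference of means. The density values are made up for illustration, and the function `permutation_p_value` is hypothetical.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    group_a: Sequence[float],
    group_b: Sequence[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign category labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up intron densities (introns per kilobase) for two gene categories.
ancient_genes = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.1]
bacterial_genes = [5.0, 4.9, 5.2, 5.1, 4.8, 5.3, 5.0, 4.9]

p_value = permutation_p_value(ancient_genes, bacterial_genes)
print(f"p = {p_value:.3f}")
```

A large p-value here would mean that the observed difference is no bigger than what random relabeling of the genes produces. That is the pattern the study reports for real intron densities.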
These brief research highlights demonstrate the impact that molecular
analysis of genomic data, combined with modern computational
and theoretical approaches, can have on furthering our understanding
of the evolutionary, fundamental and practical problems facing
biomedical researchers today. These studies also show the present
utility and future potential of complete genome comparisons
in identifying gene products produced by a particular organism
and in predicting their structure and function. Using this approach,
one can also identify a gene that is common to all organisms
within the three domains of life, as well as a gene that is
unique to a particular domain, thereby gaining meaningful insights
into the organization and evolution of biological systems. CB