NCBI News
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Winter 2001



In this issue...

COG Database

Plant Genomes

LinkOut

Investigator Profile: Stephen Altschul

GenBank News

Expanded Bookshelf

BLAST Enhancements


Recent Publications

Masthead



COG Database Grows to 3300 Protein Clusters and 44 Complete Genomes


Based on research conducted by NCBI’s comparative genomics group, the database of Clusters of Orthologous Groups of proteins (COGs) represents a phylogenetic classification of proteins encoded in complete genomes. The COGs are derived from an “all-against-all” sequence comparison of the encoded proteins. Each COG consists of individual proteins or groups of paralogs from at least three lineages and is therefore considered to correspond to an ancient conserved domain. The database is designed to support research on genome evolution as well as functional annotation of genomes.

At its inception in 1997, the database included 720 clusters from 7 genomes. It now includes more than 3300 COGs from 44 genomes of bacteria, archaea, and the yeast Saccharomyces cerevisiae, representing 30 major phylogenetic lineages. In addition, proteins from two eukaryotic genomes, C. elegans and D. melanogaster, have been assigned to individual COGs. The COG home page lists the organisms included, the number of proteins encoded by each genome, and the portion of those proteins that are included in COGs.

Three general kinds of information can be obtained using the COG database. For functional studies, the COGs have been classified into 18 broad functional categories, including one for uncharacterized COGs. Phylogenetic patterns show the presence or absence of proteins from a given organism in a specific COG and, when used systematically, can identify whether a particular metabolic pathway exists in an organism. Multiple alignments of COG members can be used to identify conserved sequence residues and analyze evolutionary relationships between member proteins.

Individual COG reports contain information on the number of proteins in the cluster, their inferred function, a function code from a list of 18 general categories, the phylogenetic pattern for the COG, the unique COG number, and a link to proteins from C. elegans and D. melanogaster assigned to the COG. If available, the pathway or functional system is also indicated as a functional sub-category. Clicking on the floppy disk icon will generate a FASTA-formatted file of protein sequences for all COG members.
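Once a FASTA-formatted file of COG members has been saved, it can be read with a few lines of code. The following is a minimal sketch in Python, assuming the standard FASTA layout (a header line starting with ">" followed by one or more sequence lines). The function name `read_fasta`, the gene identifiers, and the sequences are invented for illustration and are not taken from the COG database.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may wrap, so append to the current record.
            records[header] += line
    return records


# Hypothetical two-member example; identifiers and sequences are invented.
example = """>geneA hypothetical protein
MKTAYIAK
QRQISFVK
>geneB hypothetical protein
MSLLTEVE
""".splitlines()

members = read_fasta(example)
for name, seq in members.items():
    print(name, len(seq))
```

For a file on disk, the same function can be called as `read_fasta(open(path))`, where `path` is wherever the downloaded file was saved.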

The COG report also generates a table giving the gene names corresponding to cluster members from each organism. Each gene name is linked to a display of the BLAST output for its encoded protein, which includes both graphical and textual sequence alignments between the COG member and other protein database sequences. A Genomic Context link shows the organization of the genomes of the organisms represented in a COG, centered on the genes coding for the orthologous proteins that comprise the cluster. Finally, a dendrogram, constructed from multiple sequence alignments, displays sequence similarity relationships between the COG members.

A Phylogenetic Patterns search tool finds COGs that are shared by any set of organisms. Organisms may be included or excluded from the group using an input table. For closely related organisms belonging to a single clade, pre-computed tables show shared and unique COGs.
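The include/exclude logic of such a search can be sketched as a set operation. The Python example below is only an illustration of the idea, not the implementation behind the NCBI tool. The COG identifiers, the organism codes, and the function name `search_patterns` are all hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List


def search_patterns(
    cogs: Dict[str, FrozenSet[str]],
    include: Iterable[str],
    exclude: Iterable[str] = (),
) -> List[str]:
    """Return IDs of COGs that contain every included organism and no excluded one."""
    inc, exc = set(include), set(exclude)
    return sorted(
        cog_id
        for cog_id, organisms in cogs.items()
        if inc <= organisms and not (exc & organisms)
    )


# Hypothetical data: each COG maps to the set of organisms represented in it.
cogs = {
    "COG_A": frozenset({"Eco", "Bsu", "Mja", "Sce"}),
    "COG_B": frozenset({"Eco", "Bsu", "Hin"}),
    "COG_C": frozenset({"Mja", "Sce", "Afu"}),
}

print(search_patterns(cogs, include=["Eco", "Bsu"], exclude=["Sce"]))
print(search_patterns(cogs, include=["Mja"]))
```

Representing each phylogenetic pattern as a set of organisms keeps the query a simple subset test plus an empty-intersection test.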

The COGnitor program is a companion tool that assigns new proteins to pre-existing COGs. COGnitor takes a protein sequence as input for sequence comparison, and suggests inclusion in a COG if there are “best hits” to proteins from at least three lineages. The output shows the COG to which the query protein is predicted to belong, a color-coded BLAST graphic delineating the regions of similarity, and the sequence alignments.
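The three-lineage rule can be illustrated with a short sketch. The Python code below assumes that the sequence comparison has already been run and that, for each lineage, the COG of the query's best-hit protein is known. How COGnitor computes the hits or resolves competing candidates is not described here, so the tie handling, the function name `suggest_cog`, and the lineage and COG labels are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Optional


def suggest_cog(
    best_hit_cog_by_lineage: Dict[str, Optional[str]],
    min_lineages: int = 3,
) -> Optional[str]:
    """Suggest a COG if best hits from at least `min_lineages` lineages fall in it.

    Keys are lineage names; values are the COG of the best-hit protein in that
    lineage, or None if the best hit belongs to no COG.
    """
    counts = Counter(
        cog for cog in best_hit_cog_by_lineage.values() if cog is not None
    )
    if not counts:
        return None
    cog_id, n_lineages = counts.most_common(1)[0]
    return cog_id if n_lineages >= min_lineages else None


# Hypothetical best hits for two query proteins.
supported = {"lineage1": "COG_A", "lineage2": "COG_A",
             "lineage3": "COG_A", "lineage4": "COG_B"}
unsupported = {"lineage1": "COG_A", "lineage2": "COG_A", "lineage3": "COG_B"}

print(suggest_cog(supported))    # COG_A has best hits from three lineages
print(suggest_cog(unsupported))  # no COG reaches three lineages
```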

Other useful resources include:

List of COGs, which displays all COGs in the database.

Distribution histograms that show how many COGs contain proteins from a specific number of clades or species.

Phylogenetic patterns table, which organizes the patterns into sets based on the presence or absence of organisms belonging to Archaea, Eukarya or Bacteria.

Co-occurrences table, which shows the number of COGs shared by a particular pair of species or unique to one member.

Functional categories page, summarizing the functions that have been defined, the number of COGs assigned to each category, the number of proteins or domains assigned to each category, and the number of pathways and functional systems associated with each category.
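The co-occurrences table and the distribution histograms listed above are both simple counts over the phylogenetic patterns. The Python sketch below shows one way to compute them from a mapping of COGs to the species they contain. The data, the species codes, and the function names are hypothetical and do not reflect how the NCBI pages are generated.

```python
from collections import Counter
from typing import Dict, FrozenSet, Tuple


def co_occurrence(
    cogs: Dict[str, FrozenSet[str]], a: str, b: str
) -> Tuple[int, int, int]:
    """Count COGs shared by species a and b, found only in a, and found only in b."""
    shared = only_a = only_b = 0
    for species in cogs.values():
        in_a, in_b = a in species, b in species
        if in_a and in_b:
            shared += 1
        elif in_a:
            only_a += 1
        elif in_b:
            only_b += 1
    return shared, only_a, only_b


def species_histogram(cogs: Dict[str, FrozenSet[str]]) -> Dict[int, int]:
    """Map a species count to the number of COGs containing that many species."""
    return dict(Counter(len(species) for species in cogs.values()))


# Hypothetical data: each COG maps to the set of species represented in it.
cogs = {
    "COG_A": frozenset({"Eco", "Bsu", "Mja", "Sce"}),
    "COG_B": frozenset({"Eco", "Bsu", "Hin"}),
    "COG_C": frozenset({"Mja", "Sce", "Afu"}),
}

print(co_occurrence(cogs, "Eco", "Mja"))
print(species_histogram(cogs))
```

Here "only in a" means unique to one member of the pair being compared, not unique across the whole database.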

The COGs are also integrated with the Genome division of Entrez. From the COG pages, proteins are linked to the Genome view and Neighbor view. From Entrez Genome, proteins are linked to their respective COGs, and COG data is included in several display options. For example, in the map display of circular genomes, the radial lines corresponding to genes are color-coded according to the functional categories used in the COG system.

The COG service is located at www.ncbi.nlm.nih.gov/COG/. The data is also available by FTP at ftp://ncbi.nlm.nih.gov/pub/COG. —VP




NCBI News | Spring 2000