NCBI News




HomoloGene: An Entrez Database with a New Look

HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes. The genomes represented in the recent Build 36 of HomoloGene include H. sapiens, M. musculus, R. norvegicus, D. melanogaster, A. gambiae, C. elegans, S. pombe, S. cerevisiae, N. crassa, M. grisea, A. thaliana, and P. falciparum.

NCBI has adopted a new HomoloGene build procedure that is guided by the taxonomic tree. It relies on conserved gene order and measures of DNA similarity among closely related species, and uses protein similarity for more distantly related organisms. The new computational procedure greatly increases the reliability of the computed homologous gene sets, and the resulting HomoloGene entries now include paralogs in addition to orthologs. For more details or to search the database, see the HomoloGene home page.

New Search Strategies Supported

Because HomoloGene is now an Entrez database, it can be queried using an assortment of fielded terms combined with Boolean operators. Among the fields unique to HomoloGene is the “Ancestor” field, which refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry. Using the “Ancestor” field, it is possible to limit a search to genes conserved in one of 9 ancestral groups: Sordariomycetes (147,550 entries), Eukaryota (2,759 entries), Fungi/Metazoa (33,154 entries), Bilateria (33,213 entries), Coelomata (33,316 entries), Mammalia (9,172 entries), Ascomycota (1,083 entries), Insecta (1,689 entries), and Rodentia (1,587 entries).
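
As a rough illustration of combining a fielded term with a Boolean operator, the Python sketch below assembles a query string in the general Entrez `term[Field]` style. The helper names `fielded` and `combine` and the free-text term "kinase" are made up for this example; only the "Ancestor" field and the group name Mammalia come from the text above.

```python
def fielded(term: str, field: str) -> str:
    """Format one fielded term in the Entrez style, e.g. Mammalia[Ancestor]."""
    return f"{term}[{field}]"


def combine(operator: str, *terms: str) -> str:
    """Join terms with a single Boolean operator (AND, OR, or NOT)."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator!r}")
    return f" {op} ".join(terms)


# Hypothetical search: entries conserved since the mammalian ancestor
# that also match the arbitrary free-text term "kinase".
query = combine("AND", fielded("Mammalia", "Ancestor"), "kinase")
print(query)
```

The resulting string is intended to be pasted into the HomoloGene search box; the sketch does not contact any NCBI service.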

New Views of the Data

HomoloGene reports include homology and phenotype information drawn from Online Mendelian Inheritance in Man (OMIM), Mouse Genome Informatics (MGI), Zebrafish Information Network (ZFIN), Saccharomyces Genome Database (SGD), Clusters of Orthologous Groups (COG), and FlyBase. A “Pairwise Scores” display gives a table of pairwise statistics for members of a HomoloGene group, including percent amino acid and nucleotide identities, the Jukes-Cantor genetic distance parameter (D), the ratio of non-synonymous to synonymous substitutions (Ka/Ks) for predicted proteins, and the ratio of radical to conservative changes in the transcript (Knr/Knc).
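
To give a feel for two of these statistics, the Python sketch below computes percent identity and the standard Jukes-Cantor distance, d = -(3/4) ln(1 - (4/3) p), for a pair of aligned nucleotide sequences, where p is the fraction of differing sites. This is the textbook formula, offered on the assumption that it corresponds to the D value in the display; the exact calculation used by HomoloGene is not described here. The function names and the two short sequences are invented for illustration.

```python
import math


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of identical positions in two gap-free, equal-length sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor distance for a proportion p of differing nucleotide sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Invented toy alignment: one mismatch in ten positions.
first = "ACGTACGTAC"
second = "ACGTACGTAT"
identity = percent_identity(first, second)
distance = jukes_cantor_distance(1.0 - identity / 100.0)
print(f"identity = {identity:.1f}%, Jukes-Cantor d = {distance:.4f}")
```

The distance is always at least as large as the observed fraction of differences, because the correction accounts for multiple substitutions at the same site.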

—DW

 
New HomoloGene FTP File Formats
HomoloGene data are available by FTP. The data for each build are contained in two files: "homologene.data" and "homologene.xml.gz". Follow the "FTP site" link in the sidebar on the HomoloGene home page to download the files.
 
homologene.data
 
  homologene.data is a tab-delimited file containing, from left to right:
• HomoloGene group ID
• Taxonomy ID
• Gene ID
• Gene symbol
• GenInfo identifier (gi) of the protein product of the gene
• Accession number of the protein product of the gene
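
The six columns listed above can be read with a few lines of Python. The sketch below is one possible approach, assuming the file has no header row and no quoting; the record class, the function name, and the two sample rows (gene IDs, symbols, gi numbers, and accessions) are invented placeholders rather than real HomoloGene data.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class HomoloGeneRecord(NamedTuple):
    """One line of homologene.data, columns in left-to-right order."""
    group_id: str
    tax_id: str
    gene_id: str
    gene_symbol: str
    protein_gi: str
    protein_accession: str


def read_homologene(handle: TextIO) -> Iterator[HomoloGeneRecord]:
    """Yield one record per non-empty line of a tab-delimited file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 6:
            raise ValueError(f"expected 6 columns, got {len(row)}: {row!r}")
        yield HomoloGeneRecord(*row)


# Placeholder rows standing in for a real download.
sample = (
    "1\t9606\t111\tGENEA\t1000001\tNP_000001.1\n"
    "1\t10090\t222\tGenea\t1000002\tNP_000002.1\n"
)
records = list(read_homologene(io.StringIO(sample)))

# Collect gene symbols by HomoloGene group.
symbols_by_group = {}
for record in records:
    symbols_by_group.setdefault(record.group_id, []).append(record.gene_symbol)
print(symbols_by_group)
```

For a downloaded file, the same function could be called as `read_homologene(open("homologene.data", newline=""))`.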
 
 
homologene.xml.gz
 
  homologene.xml.gz is a compressed file that contains a complete XML version of the HomoloGene build and includes the information available on the public webpage. The HomoloGene XML DTD is available in the archive "homologene.dtd.tar" at the top level of the FTP site.
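
Because the XML file is compressed and may be large, it can be convenient to stream it rather than load it whole. The Python sketch below decompresses on the fly and tallies element names, which is one way to get oriented before consulting the DTD. The element names in the stand-in document (`Build`, `Entry`, `Gene`) are invented for the example and are not taken from the HomoloGene DTD.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter
from typing import BinaryIO


def count_elements(compressed: BinaryIO) -> Counter:
    """Stream a gzip-compressed XML document and count each element name."""
    counts = Counter()
    with gzip.open(compressed, "rb") as xml_stream:
        for _event, element in ET.iterparse(xml_stream, events=("end",)):
            counts[element.tag] += 1
            element.clear()  # drop finished subtrees to limit memory use
    return counts


# Invented stand-in for homologene.xml.gz, compressed in memory.
fake_xml = b"<Build><Entry><Gene/><Gene/></Entry><Entry><Gene/></Entry></Build>"
tag_counts = count_elements(io.BytesIO(gzip.compress(fake_xml)))
print(tag_counts)
```

With a real download, the function could be given a file opened in binary mode, for example `count_elements(open("homologene.xml.gz", "rb"))`.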
 
The old HomoloGene FTP files of the formats used in "hmlg.ftp" and "hmlg.trip.ftp" will be discontinued after a transition period. During the transition, a new set of codes, reflecting the new build procedure, will be used in these files to indicate the nature of the evidence for homology: b - reciprocal best, B - reciprocal best in a self-consistent triplet, m - similarity between sequences that do not give reciprocal best hits.
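
Scripts that read the transitional files may want to translate the three evidence codes into their descriptions. The Python sketch below is a minimal lookup table built from the definitions above; the names `EVIDENCE_CODES` and `describe_evidence` are made up for the example, and the layout of the files themselves is not assumed.

```python
# Evidence codes as described for the transitional hmlg.ftp files.
# The codes are case-sensitive: "b" and "B" mean different things.
EVIDENCE_CODES = {
    "b": "reciprocal best",
    "B": "reciprocal best in a self-consistent triplet",
    "m": "similarity between sequences that do not give reciprocal best hits",
}


def describe_evidence(code: str) -> str:
    """Return the description for one evidence code, or raise ValueError."""
    try:
        return EVIDENCE_CODES[code]
    except KeyError:
        raise ValueError(f"unknown evidence code: {code!r}") from None


for code in ("b", "B", "m"):
    print(code, "-", describe_evidence(code))
```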
 


NCBI News | Fall/Winter 2002 NCBI News: Spring 2003