PubChem: An Entrez Database of Small Molecules

The NCBI has released three new Entrez databases that link small organic molecules to bioactivity assays, PubMed abstracts, and protein sequences and structures. The new databases constitute the PubChem project at NCBI, a part of the NIH Roadmap Initiative. They are PubChem Substance, PubChem Compound, and PubChem Bioassay.

PubChem Substance currently contains over 800,000 chemical samples imported from 14 public sources, including ChemIDplus, the Developmental Therapeutics Program at NCI, KEGG, NCBI MMDB, and the NIST Chemistry WebBook. Chemical entities in PubChem Substance records that have known structures are validated, converted to a standardized form, and imported into PubChem Compound. This standardization allows NCBI to compute chemical parameters and similarity relationships between compounds. The compounds are grouped into levels of chemical similarity from most general to most specific: same bonding connectivity and any tautomer; same bonding connectivity; same stereochemistry; same isotopes; and same stereochemistry and isotopes. These groups are provided as Entrez links that allow similar compounds to be retrieved quickly. PubChem Compound also indexes these chemicals using 34 fields, many of which represent computed chemical properties such as the number of chiral centers, the number of hydrogen bond donors/acceptors, molecular formula and weight, total formal charge, and octanol-water partition coefficients (XlogP).

The third database, PubChem Bioassay, currently includes 173 bioactivity studies from the Developmental Therapeutics Program at NCI, and each of these studies is linked to records in PubChem Substance. The PubChem Bioassay interface allows users to view substances that meet certain activity and/or chemical criteria, and the matching records can either be viewed in PubChem Substance or downloaded in several formats.

As part of the Entrez system, the three PubChem databases are linked to several related Entrez databases, including PubMed, Protein, and Structure. PubMed links are derived either from citations provided by submitters or from matches between substance names and the MeSH medical thesaurus; these links often provide extensive information about the biological activity of a substance. The Protein and Structure links reveal proteins known to interact with a compound and protein structures that contain the compound as a bound ligand. The reverse links also provide new functionality: ligands within structures can now be identified instantly by the link to PubChem Compound, as can chemicals described in PubMed abstracts.

Consider Gleevec, a potent tyrosine kinase inhibitor used to treat leukemia. In PubChem Substance, the query "gleevec" retrieves one record for Imatinib mesylate from ChemIDplus. Clicking on the SID (substance ID) number or the thumbnail structure loads a Substance Summary showing a view of the structure, other information including chemical properties and synonyms, and links to PubChem Substance, PubChem Compound, PubMed, and records of identical compounds. This record contains both Imatinib mesylate and methanesulfonic acid; a link to identical compounds leads to substances that also contain the acid. In this case, one additional substance is found that was not retrieved by the query "gleevec", showing how similarity neighboring can overcome differing nomenclatures.

As part of the standardizing process, substances that have multiple components give rise to several records in PubChem Compound to allow more powerful searching for similar compounds. In the present case, if the Compound Displayed pulldown menu is changed from Standardized to Component1, a different Compound record is shown that contains Imatinib mesylate without the acid, and this compound is linked to seven identical compounds, including itself (Figure 1). Clicking the link to the right of Same Connectivity loads these identical compounds into PubChem Compound. Choosing Protein Structure from the Display pulldown menu and clicking Display then reveals three crystal structures of tyrosine kinase domains containing bound Gleevec. Only one of these structures would have been found by the text query "gleevec" in Entrez Structure, illustrating the advantage of the precomputed chemical similarities provided by PubChem Compound.

Figure 1: Substance Summary page for Compound ID 1451114 in PubChem Compound, corresponding to Imatinib mesylate (Gleevec). The structure displayed is "Component1" of PubChem Substance ID 700313, the originally submitted substance that contains both Imatinib mesylate and methanesulfonic acid. The standardized version of the submitted substance, containing the acid, is indexed as Compound ID 1451113 and is viewed by choosing "Standardized" from the pulldown menu.
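
The step from a compound record to its linked protein structures can also be approached programmatically through the Entrez Programming Utilities. The following is a minimal sketch rather than a documented NCBI recipe for this walkthrough: it only builds an ELink request URL from PubChem Compound to Entrez Structure and does not contact the server. The function name `build_elink_url` is hypothetical, the database names `pccompound` and `structure` are assumed to be the E-utilities identifiers for these two databases, and the Compound ID from Figure 1 may no longer point to the same record.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ELink service (assumed to be current).
ELINK_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(dbfrom: str, db: str, uid: int) -> str:
    """Return an ELink URL asking for records in `db` linked to `uid` in `dbfrom`.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = {"dbfrom": dbfrom, "db": db, "id": str(uid)}
    return f"{ELINK_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # Compound ID 1451114 is the one cited in the Figure 1 caption.
    print(build_elink_url("pccompound", "structure", 1451114))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return an XML document listing whatever linked Structure records exist for that compound.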

PubChem Bioassay allows one to search for bioactivity. For instance, the query "leukemia AND lc50[tid description]" in PubChem Bioassay retrieves eight growth inhibition assays with measured LC50 values in various leukemia cell lines. Links are then provided to PubChem Substance and PubChem Compound for these chemicals so that they may be further explored.
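
Queries like this one combine plain terms and field-qualified terms with Boolean operators. As a rough sketch, assuming the E-utilities ESearch service and the database name `pcassay` for PubChem Bioassay, the same query string could be assembled and URL-encoded as shown here. The helper names `fielded` and `build_esearch_url` are hypothetical, and whether the `[tid description]` field tag is still accepted by the server is not verified.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service (assumed to be current).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term: str, field: str) -> str:
    """Attach an Entrez field tag to a term, e.g. lc50[tid description]."""
    return f"{term}[{field}]"


def build_esearch_url(db: str, *terms: str) -> str:
    """Join the terms with AND and return an ESearch URL for database `db`.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    query = " AND ".join(terms)
    return f"{ESEARCH_BASE}?{urlencode({'db': db, 'term': query})}"


if __name__ == "__main__":
    # Reproduces the query "leukemia AND lc50[tid description]" from the text.
    print(build_esearch_url("pcassay", "leukemia", fielded("lc50", "tid description")))
```

Keeping the field tag in its own helper makes it easy to swap in a different field or add further terms without hand-editing the query string.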

Access PubChem at:



NCBI News | Fall/Winter 2002 NCBI News: Spring 2003