Volker D Haehnke at 11:00

PubChem Atom Environments & Standardization
PubChem is an open repository for molecular structures, their properties, and their biological activities. The number of deposited structures has increased steadily since its creation in 2004. Today, it contains more than 116 million deposited substances (PubChem Substance) and 45 million unique small molecules (PubChem Compound). Because the content deposited in Substance originates from many sources, it is very diverse. Non-identical structures deposited in Substance can therefore describe the same molecule, owing to diverging drawing standards for chemical structures, deficiencies of the underlying chemistry models, and natural effects that increase the diversity of valid structure representations. Structure standardization protocols are needed to account for these effects, so that equivalent structures can be recognized and their associated information consolidated in PubChem Compound.

A new approach to the standardization of chemical structures is currently being developed at NCBI. It is based on the concept of spherical atom environments, which can be used to apply transformations from bad (incorrect or unfavorable) to good configurations of the atoms and bonds they contain. This seminar will outline the approach, describe its parameterization (which is based on the current PubChem standardization protocols), exemplify augmentations to this initial set of atom environment transformations, and show how atom environments can serve as efficient filters to weed out unrealistic structures that the current standardization protocols do not reject. Finally, remaining challenges and the next steps in the development of the approach will be discussed.
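To make the idea of a spherical atom environment concrete, the following is a minimal Python sketch, not the NCBI implementation. It assumes a molecule is given as a list of element symbols plus a list of bonded atom-index pairs, collects all atoms within a given number of bonds of a central atom, and uses a deliberately simplified signature (element and topological distance only; bond orders, charges, and connectivity within the sphere are ignored) to flag environments that do not occur in a reference set. All function names here are hypothetical.

```python
from collections import deque


def atom_environment(atoms, bonds, center, radius):
    """Map each atom index within `radius` bonds of `center` to its distance."""
    # Build an adjacency list from the (assumed) list of index pairs.
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Breadth-first search that stops expanding at the sphere boundary.
    dist = {center: 0}
    queue = deque([center])
    while queue:
        current = queue.popleft()
        if dist[current] == radius:
            continue
        for nxt in neighbors[current]:
            if nxt not in dist:
                dist[nxt] = dist[current] + 1
                queue.append(nxt)
    return dist


def environment_signature(atoms, bonds, center, radius):
    """Simplified, hashable description of one atom environment."""
    dist = atom_environment(atoms, bonds, center, radius)
    shell = sorted((d, atoms[i]) for i, d in dist.items() if i != center)
    return (atoms[center], tuple(shell))


def unseen_environments(atoms, bonds, radius, known):
    """Indices of atoms whose environment signature is not in `known`."""
    return [
        i
        for i in range(len(atoms))
        if environment_signature(atoms, bonds, i, radius) not in known
    ]


# Reference structure: heavy atoms of acetic acid (C-C(=O)-O).
ref_atoms = ["C", "C", "O", "O"]
ref_bonds = [(0, 1), (1, 2), (1, 3)]
known = {
    environment_signature(ref_atoms, ref_bonds, i, 1)
    for i in range(len(ref_atoms))
}

# Query structure: a carbon bonded to five other carbons (unrealistic).
query_atoms = ["C"] * 6
query_bonds = [(0, i) for i in range(1, 6)]
flagged = unseen_environments(query_atoms, query_bonds, 1, known)
```

In this toy setting only the central carbon of the query is flagged, because its radius-1 environment never occurs in the reference set. A realistic filter would need a much larger reference set and a signature that also encodes bond orders and formal charges.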

This presentation combines results from an analysis of the current PubChem standardization protocols, the conclusions drawn in the process, and a survey of atom types and atom environments in PubChem.

