Nucleic Acids Res. Jan 2010; 38(Database issue): D320–D325.
Published online Nov 11, 2009. doi:  10.1093/nar/gkp1013
PMCID: PMC2808862

Protein Geometry Database: a flexible engine to explore backbone conformations and their relationships to covalent geometry

Abstract

The backbone bond lengths, bond angles, and planarity of a protein are influenced by the backbone conformation (φ), but no tool exists to explore these relationships, leaving this area as a reservoir of untapped information about protein structure and function. The Protein Geometry Database (PGD) enables biologists to easily and flexibly query information about the conformation alone, the backbone geometry alone, and the relationships between them. The capabilities the PGD provides are valuable for assessing the uniqueness of observed conformational or geometric features in protein structure as well as discovering novel features and principles of protein structure. The PGD server is available at http://pgd.science.oregonstate.edu/ and the data and code underlying it are freely available to use and extend.

INTRODUCTION

With the explosion in the number of atomic-resolution protein structures in the past decade, the possibility to determine accurate details of protein geometry from proteins themselves rather than from small-molecule peptides has become a reality. The importance of bond angles, bond lengths, and peptide planarity in validating structures as well as discovering real and functionally important deviations from standard geometry has become increasingly apparent (1–7). To our knowledge, no database has existed until now to search peptide geometry either on a large scale to discover trends or on an individual basis to explore unusual features. Unusual features that are significant often pass unrecognized even by the structural biologists who solved the structure (Figure 1).

Figure 1.
An active-site peptide geometry feature discovered using the PGD. (A) Shown is the peptide bond between residues His306 and Asn307 in the 0.90 Å resolution structure of Cu-nitrite reductase [PDB code 2bw4 (21)]. 2FoFc electron density ...

The protein backbone conformation is defined primarily by the dihedral angles φ and Ψ together with whether the peptide bond is in the trans (ω near 180°) or cis (ω near 0°) conformation. The protein backbone geometry, on the other hand, is defined by the bond angles and lengths and deviations of the peptide from planarity. It has been shown that the average values for backbone geometry vary in a conformation-dependent manner, reflecting an intimate relationship between conformation and geometry (7). The prevalent misconception that bond angles and lengths are static has been caused in part by the lack of any straightforward way to examine their dependence on local conformation. The Protein Geometry Database (PGD) is a unique resource that now makes it possible for biologists to explore peptide geometry, peptide conformation, and the ties between them. Other databases exist to allow searching conformation alone [e.g. SPASM (8), Fragment Finder (9), Protein Segment Finder (10), Conformational Angles Database (11), PDBeMotif (12)], but even in this arena, the PGD offers a unique combination of convenience and flexibility.
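As an illustration of how the torsion angles named above are obtained from atomic coordinates, the following is a minimal sketch of a signed dihedral-angle calculation. It uses the standard atom quartets (φ: C(i−1), N, Cα, C; Ψ: N, Cα, C, N(i+1); ω: Cα, C, N(i+1), Cα(i+1)); the function name `dihedral` and the plain-tuple coordinate format are assumptions made for this example and are not taken from the PGD code.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points given as (x, y, z).

    For phi pass C(i-1), N, CA, C; for psi pass N, CA, C, N(i+1);
    for omega pass CA, C, N(i+1), CA(i+1).
    """
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```

With this convention a planar cis arrangement gives 0° and a planar trans arrangement gives ±180°, matching the ω values quoted in the text.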

IMPLEMENTATION

The PGD contains derived data for a complete, representative data set of protein structures that are relevant to discovering reliable instances of conformations and peptide geometries. This allows users to set thresholds specific to their queries at search time without being unduly limited by the cutoffs chosen during database creation. To ensure that the PGD data are representative of conformational and geometric space rather than being biased by multiple highly similar structures, the PGD contains data derived from a nonredundant set of proteins. As is common, the nonredundancy is defined by the maximum allowed sequence identity between any pair of proteins in the data set. Two thresholds of 25 and 90% are available in the PGD. The nonredundant set is taken from PISCES (13). Because different resolution ranges are suitable for different queries, the PGD maximizes structural data by using all data included in the PISCES data sets, corresponding to crystal structures determined at 3.0 Å resolution or better with no cutoff for the crystallographic R-factor. Although the lower-resolution structures in this sample do have lower accuracy, users can easily exclude them using search parameters.

The PGD contains data on per-chain and per-residue levels. For each chain, stored parameters include the PDB code, the chain ID, the sequence-identity threshold, the resolution, and the crystallographic R-factor. The sequence-identity threshold, resolution, and R-factor are all useful as parameters for defining the independence and quality of the data searched. For each residue, stored parameters include a mapping back to the chain and protein, the residue number, the torsion angles φ, Ψ, ω and χ1, the improper dihedral ζ [describing the chirality of the Cα (14)], all seven backbone bond angles, all five backbone bond lengths, the DSSP-defined (15) secondary-structure type, and three B-factors: the mainchain average, the sidechain average, and the Cγ atom. The B-factors are useful as cutoffs to exclude residues with poorly defined conformation and geometry [e.g. (7,16)].
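The per-chain and per-residue records described above can be pictured with the following sketch. It uses plain Python dataclasses rather than the actual PGD schema; every class, field, and function name here is hypothetical, and the bond angles and lengths are kept in generic dictionaries because their individual names are not listed in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Chain:
    pdb_code: str
    chain_id: str
    threshold: int      # sequence-identity threshold in percent (25 or 90)
    resolution: float   # in angstroms
    r_factor: float     # crystallographic R-factor


@dataclass
class Residue:
    chain: Chain                 # mapping back to the chain and protein
    number: int
    ss: str                      # DSSP secondary-structure code
    b_main: float                # mainchain average B-factor
    phi: Optional[float] = None  # torsions may be undefined at chain ends
    psi: Optional[float] = None
    omega: Optional[float] = None
    chi1: Optional[float] = None
    zeta: Optional[float] = None
    b_side: Optional[float] = None
    b_gamma: Optional[float] = None
    angles: Dict[str, float] = field(default_factory=dict)   # seven bond angles
    lengths: Dict[str, float] = field(default_factory=dict)  # five bond lengths


def well_defined(residues: List[Residue], max_b_main: float) -> List[Residue]:
    """Keep residues whose mainchain B-factor is below the given cutoff."""
    return [r for r in residues if r.b_main < max_b_main]
```

Keeping the quality indicators (resolution, R-factor, B-factors) alongside the geometry is what allows cutoffs to be chosen at search time rather than when the database is built.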

The PGD uses the Python-based Django framework for both populating and searching a MySQL database. Using Django allows us to follow the DRY principle (‘Don't Repeat Yourself’) by only having one description of the database format. This reduces the difficulty of changes, increases the clarity of code, and avoids potential conflicts between multiple descriptions. A single change can transform the database schema for all applications that use it. The database is populated by interfacing with a tool written with BioPython (17) to calculate PDB-derived information. The tool, Splicer, splices derived data from the PDB files together into all possible consecutive segments from 1 to 10 residues long. This approach speeds searching because segments do not need to be constructed during every search.
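The idea of precomputing every consecutive segment can be sketched as below. This is not the Splicer code; `consecutive_segments` is a hypothetical helper, and it assumes the input is a single unbroken run of residues (handling of chain breaks is left out).

```python
def consecutive_segments(residues, max_length=10):
    """Yield every run of 1..max_length consecutive items from `residues`."""
    n = len(residues)
    for length in range(1, max_length + 1):
        for start in range(n - length + 1):
            yield tuple(residues[start:start + length])


# A 12-residue stretch yields 12 + 11 + ... + 3 = 75 segments.
segments = list(consecutive_segments(list(range(12))))
print(len(segments))
```

Storing the segments up front trades disk space for query speed: a search for an N-residue motif becomes a lookup over rows that already exist.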

A single run of Splicer to populate the PGD can take ~16 h on a current single-processor compute node, so we constructed a new Python framework for distributed, parallel data processing called Pydra <http://pydra-project.osuosl.org/>. Using Pydra, a parallel Splicer run across 20 CPUs on four nodes takes ~1 h, providing the nearly linear speedup expected for this type of coarse-grained parallelism.
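For reference, the quoted run times correspond to roughly a 16-fold speedup, or about 80% parallel efficiency on 20 CPUs. The small calculation below uses the approximate figures from the text; the function name is made up for this example.

```python
def speedup_and_efficiency(serial_hours, parallel_hours, n_cpus):
    """Return (speedup, parallel efficiency) for a parallel run."""
    speedup = serial_hours / parallel_hours
    return speedup, speedup / n_cpus


speedup, efficiency = speedup_and_efficiency(16.0, 1.0, 20)
print(f"speedup ~{speedup:.0f}x, parallel efficiency ~{efficiency:.0%}")
```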

The current version of the PGD contains ~3.8 million residues from nearly 16 000 protein chains, with all amino acids and secondary-structure types being well-represented (Figure 2). The PGD content will be updated on a quarterly basis or better.

Figure 2.
Extent and diversity of the database. The residue population of the PGD is shown as a function of resolution (A), amino-acid composition (B), and secondary-structure type (C). The population as a function of resolution is cumulative. At 1.0 Å ...

SEARCHING AND ANALYZING RESULTS

The search page

The PGD has a professionally designed, user-friendly yet flexible graphical interface for mining protein conformational and geometric space. Upon proceeding beyond the introductory entry page, users encounter the search page (Figure 3). On this page, users define all parts of their queries. A help window pops up for each section when entering data. Each of the criteria can be defined positively (e.g. Gly and Pro) or negatively (e.g. all but Gly, Pro).

Figure 3.
Excerpt from a representative query. The query form defines a search for three-residue motifs that do not include Gly, Pro, or prePro residues at position i at 1.5 Å resolution or better. For residue composition, red highlights indicate excluded ...

At the top of the page is the length of the motif to be searched (from 1 to 10 residues), followed by protein-chain properties and residue properties. The protein-chain properties are the length of the motif, the resolution range for selecting crystal structures, the sequence-identity threshold, and specific PDB codes to search (defaults to the full PGD). Changing the motif length will cause the corresponding set of residue properties to appear.

The residue properties are grouped into five sections: composition, conformation, mobility, angles, and lengths. The composition section allows users to indicate any grouping of specific amino-acid types to search (i.e. with no limitation to predefined categories such as hydrophobic or acidic). The conformation section allows users to restrict searches to specific classes of DSSP-defined secondary structure, defined as follows: ‘H’: α-helix; ‘G’: 3₁₀ helix; ‘E’: β-strand; ‘T’: hydrogen-bonded turn; ‘S’: non-hydrogen-bonded turn; ‘I’: π-helix; and ‘B’: β-bridge. The long names are used on the search page, and the short names are used when space is limited (e.g. on the statistics page). The conformation section additionally offers options for more fine-grained conformational searches using ranges of φ, Ψ, and the peptide planarity, ω. The mobility, angles, and lengths sections are collapsed by default to simplify the search page for first-time users and for those who only want to perform conformational searches; clicking the titles will expand them (other sections can be expanded or hidden in the same manner). The mobility section allows searching ranges of the three B-factors of the mainchain, sidechain and Cγ-atom (Bm, Bsc and Bγ, respectively). The angles are defined by three atoms and proceed in order from N- to C-terminus of the residue, with ‘−1’ indicating an atom from the previous residue and ‘+1’ indicating an atom from the next residue. The lengths are defined by two atoms and are otherwise named and searched identically to angles.

To allow for additional flexibility and convenience in searches, we made two enhancements beyond what is typically allowed in similar databases. First, we created a query syntax for ranges that allows multiple ranges to be specified (using commas), which enables searches wrapping around circular angles in either direction (search ranges must always be specified as negative to positive from left to right). This is quite useful for searches of conformations (the β region extends beyond Ψ = +180°/−180°) or peptide planarity (which peaks at +180°/−180°). To make it difficult for users to create an invalid search, we also provide on-the-fly validation that highlights valid syntax in green and invalid syntax in red. Second, we created a special exclusion feature for selections (a green plus sign indicates when selections are included, and clicking it reverses the search to exclusion and displays a red minus sign) that allows users to easily exclude a small number of selections instead of tediously selecting almost all of them. This is useful for common cases like excluding Gly or Pro from a search.
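A minimal sketch of how such a comma-separated range syntax could be parsed and applied is given below. It is not the PGD parser: the function names are hypothetical, and the use of a hyphen between the two bounds and the treatment of both bounds as inclusive are assumptions made for this example.

```python
import re

# One range: two optionally negative numbers separated by a hyphen.
_RANGE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*-\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_ranges(query):
    """Parse e.g. '-180--90,90-180' into [(-180.0, -90.0), (90.0, 180.0)]."""
    # Accept typographic minus signs and en dashes as plain hyphens.
    query = query.replace("\u2212", "-").replace("\u2013", "-")
    ranges = []
    for part in query.split(","):
        match = _RANGE.match(part)
        if match is None:
            raise ValueError(f"invalid range: {part!r}")
        low, high = float(match.group(1)), float(match.group(2))
        if low > high:
            raise ValueError(f"range must run from low to high: {part!r}")
        ranges.append((low, high))
    return ranges


def matches(value, ranges):
    """True if value falls inside any of the (inclusive) ranges."""
    return any(low <= value <= high for low, high in ranges)


trans = parse_ranges("-180--90,90-180")
print(matches(178.0, trans), matches(2.0, trans))
```

Because an angle range that crosses ±180° cannot be written as a single low-to-high interval, splitting it into two intervals joined by a comma is what lets a search cover, for example, all trans peptides.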

Once a search is fully defined, clicking the ‘Submit’ button passes the query to the PGD, which immediately indicates that a search is in progress and displays results on the initial output page when the search is complete.

The initial output page

Immediately following a search, the total number of results is reported in the upper left-hand corner and the numbers of results are displayed as a function of φ and Ψ on an interactive Ramachandran plot (Figure 4). The plot is colored by observation density within 10° × 10° bins. To maximize the visual contrast, coloring uses a logarithmic scale derived from the plotted values. Moving the mouse cursor over any bin produces a JavaScript popup indicating the φ and Ψ ranges and the observation count. The Ramachandran plot is not limited to displaying the number of observations but instead can show any of the PGD residue attributes, using colors (like a contour plot) or even on the X and Y axes (replacing φ/Ψ). Attributes from any position of the search (i−4 to i + 5) can be plotted by changing the ‘residue’ parameter. Additionally, plots can be zoomed by changing the minima and maxima, and bin sizes can be modified. Further flexibility is available in the colors/dimensions section, hidden by default. To re-plot after changing any parameter, click the ‘Re-Plot’ button. Plots or the summary data used to create them (bin definitions, observation counts, averages and standard deviations for each attribute) are downloadable using the ‘Save Plot Image’ or ‘Save Plot Data’ buttons, respectively.
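The binning and logarithmic coloring described above can be sketched as follows. The exact scaling used by the PGD is not specified here, so normalizing log(count + 1) to the densest bin is an assumption, as are the function names.

```python
import math
from collections import Counter


def bin_counts(phi_psi, bin_size=10):
    """Count (phi, psi) observations in square bins covering [-180, 180)."""
    n_bins = 360 // bin_size
    counts = Counter()
    for phi, psi in phi_psi:
        # The modulo folds +180 onto the bin that starts at -180.
        i = int((phi + 180) // bin_size) % n_bins
        j = int((psi + 180) // bin_size) % n_bins
        counts[(i, j)] += 1
    return counts


def log_scale(counts):
    """Map each bin count to (0, 1] on a logarithmic scale (assumed form)."""
    top = math.log(max(counts.values()) + 1)
    return {key: math.log(n + 1) / top for key, n in counts.items()}


def bin_lower_edges(key, bin_size=10):
    """Lower phi and psi edges, in degrees, of a bin index pair."""
    return key[0] * bin_size - 180, key[1] * bin_size - 180
```

A logarithmic scale keeps sparsely populated bins visible next to the heavily populated helical and strand regions, which would otherwise dominate a linear color scale.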

Figure 4.
Excerpt from a representative output. The Ramachandran plot shows results of a search for three-residue motifs that do not include Gly, Pro or prePro residues at position i at 1.5 Å resolution or better with other settings left at their defaults. ...

Additional tools and analysis

A feature of the PGD is that it enables analysis beyond its built-in capabilities by allowing users to download a complete set of search results. Clicking ‘Data Dump’ will prompt the download of a plain-text dump of the raw results for each matching motif in tab-separated value format, ideal for importing into other applications.
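A downloaded dump can be read with Python's standard `csv` module, as in the sketch below. The column names in the sample text are invented for illustration and do not reflect the actual header of a PGD dump; a real file should be inspected first.

```python
import csv
import io


def load_dump(handle):
    """Read a tab-separated dump with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def column_as_floats(rows, name):
    """Collect one column as floats, skipping blank or non-numeric cells."""
    values = []
    for row in rows:
        try:
            values.append(float(row[name]))
        except (TypeError, ValueError):
            continue
    return values


# Invented sample standing in for a downloaded file (hypothetical columns).
sample = "phi\tpsi\tomega\n-63.0\t-43.0\t178.5\n-120.0\t130.0\t\n"
rows = load_dump(io.StringIO(sample))
print(column_as_floats(rows, "omega"))
```

For a file on disk, `load_dump(open(path, newline=""))` would be used in place of the `io.StringIO` wrapper.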

In addition to the summary data provided by the Ramachandran plot, the individual motifs found by the search are viewable online by clicking ‘Browse Results’ at the top of the page. Highlighting of each column and row under the mouse cursor eases comparison within a residue or attribute. To reduce load time and maximize responsiveness, pagination splits up the potentially large result sets.

The ‘Statistics’ link at the top of the page leads to a page of summary statistics about residue i, including a breakdown of observations by amino-acid type, secondary-structure types, and the average backbone covalent geometry. Scrolling a mouse cursor over the covalent-geometry values produces a pop-up window that displays the standard deviations and ranges. Automatic highlighting of the column and row under the mouse cursor eases comparisons within residue types and attributes.

EXAMPLES

The PGD enables searches of conformational and geometric space in a powerful, flexible manner that allows for a wide variety of uses, from understanding large-scale patterns in protein structure to analyzing the significance and/or rarity of a feature in an individual structure. Here, we describe three examples from papers published by our group using the PGD that illustrate its two primary aspects of conformational and geometric searching as well as the connection between them.

Conformational searching

The first example (18) is a large-scale analysis of protein conformation that asked a simple question: What linear groups with repeating φ,Ψ pairs exist in proteins? To answer this question, we used the PGD to search for well-defined (Bm < 25 Å²) three-residue segments from structures solved at 1.2 Å resolution or better. At this resolution, the atomic positions and thus the torsion angles have high accuracy, so if linear groups are truly tightly grouped, they should be observed as such. To ensure a maximally representative result, we chose the 25% sequence-identity threshold and included all amino-acid types but only trans peptides (all three ω values limited to ‘−180–−90,90–180’). To identify linear groups in specific regions, we required all three residues to be in the same 20° × 20° box and systematically searched all such boxes using a 10° sliding window. We found that only three true clusters of linear groups exist in proteins: the right-handed α-/3₁₀-helix, the β-strand (with no substantive difference between parallel and antiparallel), and the PII helix (occupied by many nonproline residues, despite the misconception that only polyproline populates it). The 2.2₇ ribbon, π-helix, and left-handed α-/3₁₀-helical conformations only occur for isolated residues and rare short segments.
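The sliding-window search over 20° × 20° boxes can be sketched as below. This is an illustration rather than the published procedure: the function names are hypothetical, the segments are assumed to be lists of (φ, Ψ) pairs already retrieved from the PGD, and letting boxes wrap across ±180° is an assumption.

```python
def box_origins(step=10):
    """Lower (phi, psi) corners of every box position, stepping by `step`."""
    starts = range(-180, 180, step)
    return [(phi0, psi0) for phi0 in starts for psi0 in starts]


def in_box(angle, low, box):
    """True if angle lies in [low, low + box), wrapping across +/-180."""
    return (angle - low) % 360 < box


def segment_in_box(segment, phi0, psi0, box=20):
    """True if every residue of the segment falls in the same box."""
    return all(in_box(phi, phi0, box) and in_box(psi, psi0, box)
               for phi, psi in segment)


def count_linear_groups(segments, box=20, step=10):
    """Count, for each box position, the segments lying entirely inside it."""
    counts = {}
    for phi0, psi0 in box_origins(step):
        n = sum(1 for seg in segments if segment_in_box(seg, phi0, psi0, box))
        if n:
            counts[(phi0, psi0)] = n
    return counts
```

Because the step is half the box width, neighboring boxes overlap, so a tight cluster that straddles one box boundary is still captured whole by an adjacent box.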

Geometric searching

The second example [see figure 4 of (19)] is a small-scale analysis investigating the commonality of a specific five-residue geometric motif. In glutathione reductase, a key active-site loop bridges two cysteines forming a redox-active disulfide bond. This loop has five consecutive residues with nonplanar peptide bonds, and intriguingly, they are all bent in the same direction with a summed deviation of 55° across the pentapeptide. We suspected this highly strained loop was involved in the enzyme's function. To find out how common such an ω-deviation was in proteins, we searched the PGD for all trans pentapeptides for which each residue was at least 5° away from planarity (ω delimiter: ‘−175–−90,90–175’), using cutoffs of 1.2 Å resolution and 90% sequence identity. By downloading a data dump and creating a histogram of the net deviations, we discovered that the strained active-site loop of glutathione reductase was not just unique within its own structure but also nearly unique among all proteins, with only two other examples in the PGD.
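The post-processing step, summing the ω deviations over each pentapeptide and histogramming the result, might look like the sketch below. The sign convention (deviation measured from 180°, so that ω = −170° counts as +10°), the 5° bin width, and all function names are assumptions for this example rather than details taken from the original analysis.

```python
import math
from collections import Counter


def omega_deviation(omega):
    """Signed deviation in degrees of a trans omega from planarity (180)."""
    d = (omega - 180.0) % 360.0
    return d - 360.0 if d > 180.0 else d


def net_deviation(omegas):
    """Sum of signed deviations; same-direction bends add up, others cancel."""
    return sum(omega_deviation(w) for w in omegas)


def histogram(values, width=5.0):
    """Count values per bin, keyed by each bin's lower edge."""
    return Counter(math.floor(v / width) * width for v in values)


# Two made-up pentapeptides: one bent consistently, one bent alternately.
examples = [[-170, -169, -168, -170, -166], [170, -170, 172, -171, 174]]
print([net_deviation(p) for p in examples])
```

Using a signed sum is what separates a loop whose peptides all bend the same way from one whose individual deviations are just as large but cancel out.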

Probing conformation/covalent geometry relationships

The third example (20) is a broad analysis of the variations in covalent geometry as a function of the backbone conformation. To perform this analysis, we searched the PGD for three-residue segments of trans peptides (ω values limited to ‘−180–−90,90–180’) with well-defined (Bm < 25 Å²) backbones from structures solved at 1.0 Å resolution or better, using a 90% sequence-identity threshold. We split the searches into five classes to define the differing behaviors of each class: Ile/Val, Gly, Pro, residues preceding Pro, and the remaining 16 residues. After downloading a data dump, we imported the data into Matlab for further in-depth analysis of the geometry trends. The results of this analysis were captured as a conformation-dependent geometry library that was used to show how accounting for these systematic relationships could improve the accuracy of homology modeling and crystallographic refinement.

CONCLUSIONS

As the examples illustrate, the ability to explore peptide geometry and conformation and their interrelationship can provide important insights into protein structure and function. The PGD is the only database to connect peptide geometry and conformation. Its highly flexible yet intuitive search interface will allow users to characterize principles of protein structure and to answer questions about details of protein structure that are often missed or ignored.

FUNDING

National Institutes of Health (R01-GM083136 to P.A.K.); National Science Foundation (MCB-9982727 to P.A.K.). Funding for open access charge: National Institutes of Health.

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

The authors would like to thank their beta testers, who made suggestions for improving the PGD. They would also like to thank all members of the development team at the Open Source Lab at Oregon State University who assisted with coding. They also thank Nan Wang, Peter Gross and Mike Marr for work on early versions of the PGD.

REFERENCES

1. Lawson CL. An atomic view of the L-tryptophan binding site of trp repressor. Nat. Struct. Biol. 1996;3:986–987.
2. Dobson RCJ, Griffin MDW, Devenish SRA, Pearce FG, Hutton CA, Gerrard JA, Jameson GB, Perugini MA. Conserved main-chain peptide distortions: a proposed role for Ile203 in catalysis by dihydrodipicolinate synthase. Protein Sci. 2008;17:2080–2090.
3. Merritt EA, Kuhn P, Sarfaty S, Erbe JL, Holmes RK, Hol WG. The 1.25 Å resolution refinement of the cholera toxin B-pentamer: evidence of peptide backbone strain at the receptor-binding site. J. Mol. Biol. 1998;282:1043–1059.
4. Davis IW, Leaver-Fay A, Chen VB, Block JN, Kapral GJ, Wang X, Murray LW, Arendall WB, Snoeyink J, Richardson JS, et al. MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Res. 2007;35:W375–W383.
5. Esposito L, Vitagliano L, Zagari A, Mazzarella L. Experimental evidence for the correlation of bond distances in peptide groups detected in ultrahigh-resolution protein structures. Protein Eng. 2000;13:825–828.
6. Laidig KE, Cameron LM. What happens to formamide during C—N bond rotation? Atomic and molecular energetics and molecular reactivity as a function of internal rotation. Can. J. Chem. 1993;71:872–879.
7. Karplus PA. Experimentally observed conformation-dependent geometry and hidden strain in proteins. Protein Sci. 1996;5:1406–1420.
8. Kleywegt GJ. Recognition of spatial motifs in protein structures. J. Mol. Biol. 1999;285:1887–1897.
9. Ananthalakshmi P, Kumar CK, Jeyasimhan M, Sumathi K, Sekar K. Fragment Finder: a web-based software to identify similar three-dimensional structural motif. Nucleic Acids Res. 2005;33:W85–W88.
10. Samson AO, Levitt M. Protein segment finder: an online search engine for segment motifs in the PDB. Nucleic Acids Res. 2009;37:D224–D228.
11. Sheik SS, Ananthalakshmi P, Bhargavi GR, Sekar K. CADB: Conformation Angles DataBase of proteins. Nucleic Acids Res. 2003;31:448–451.
12. Golovin A, Henrick K. MSDmotif: exploring protein sites and motifs. BMC Bioinformatics. 2008;9:312.
13. Wang G, Dunbrack RL. PISCES: a protein sequence culling server. Bioinformatics. 2003;19:1589–1591.
14. Laskowski RA, MacArthur MW, Moss DS, Thornton JM. PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Crystallogr. 1993;26:283–291.
15. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637.
16. Lovell SC, Davis IW, Arendall WB, III, de Bakker PIW, Word JM, Prisant MG, Richardson JS, Richardson DC. Structure validation by Cα geometry: φ, ψ and Cβ deviation. Proteins: Struct. Func. Genet. 2003;50:437–450.
17. Cock PJA, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423.
18. Hollingsworth SA, Berkholz DS, Karplus PA. On the occurrence of linear groups in proteins. Protein Sci. 2009;18:1321–1325.
19. Berkholz DS, Faber HR, Savvides SN, Karplus PA. Catalytic cycle of human glutathione reductase near 1 Å resolution. J. Mol. Biol. 2008;382:371–384.
20. Berkholz DS, Shapovalov MV, Dunbrack RL, Karplus PA. Conformation dependence of backbone geometry in proteins. Structure. 2009;17:1316–1325.
21. Antonyuk SV, Strange RW, Sawers G, Eady RR, Hasnain SS. Atomic resolution structures of resting-state, substrate- and product-complexed Cu-nitrite reductase provide insight into catalytic mechanism. Proc. Natl Acad. Sci. USA. 2005;102:12041–12046.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press