# PROGRESS IN THE PREDICTION OF pK_{a} VALUES IN PROTEINS

Ernest L Mehler^{2}, Nathan Baker^{3}, Antonio Baptista^{4}, Yong Huang^{5}, Francesca Milletti^{6}, Jens Erik Nielsen^{7}, Damien Farrell^{8}, Tommy Carstensen^{8}, Mats H. M. Olsson^{9}, Jana K. Shen^{10}, Jim Warwicker^{11}, Sarah Williams^{12}, and J. Michael Word^{13}

^{1}Department of Physics, Clemson University, Clemson, USA

^{2}Physiology and Biophysics, Weill Medical College of Cornell University, USA

^{3}Pacific Northwest National Laboratory, USA

^{4}Instituto de Tecnologia Química e Biológica, Portugal

^{5}Dept. of Biochemistry and Molecular Biophysics, Washington University in St. Louis, USA

^{6}University Studi Perugia, Italy

^{7}University College Dublin, Dublin, Ireland

^{8}School of Biomolecular and Biomedical Science, Ireland

^{9}Department of Chemistry, University of Copenhagen, Denmark

^{10}Department of Chemistry and Biochemistry, University of Oklahoma, USA

^{11}Faculty of Life Sciences, University of Manchester, UK

^{12}Chemistry & Biochemistry, University of California at San Diego, USA

^{13}OpenEye Scientific Software, Inc., USA

^{}Corresponding author.

## Abstract

The p*K*_{a}-cooperative aims to provide a forum for experimental and theoretical researchers interested in protein p*K*_{a} values and protein electrostatics in general. The first round of the p*K*_{a}-cooperative, which challenged computational labs to carry out blind predictions against p*K*_{a}s experimentally determined in the laboratory of Bertrand Garcia-Moreno, was completed and results discussed at the Telluride meeting (July 6–10, 2009). This paper serves as an introduction to the reports submitted by the blind prediction participants that will be published in a special issue of *PROTEINS: Structure, Function and Bioinformatics*. Here we briefly outline existing approaches for p*K*_{a} calculations, emphasizing methods that were used by the participants in calculating the blind p*K*_{a} values in the first round of the cooperative. We then point out some of the difficulties encountered by the participating groups in making their blind predictions, and finally try to provide some insights for future developments aimed at improving the accuracy of p*K*_{a} calculations.

**Keywords:** pKa, protein electrostatics, pH dependent properties of proteins, predicting pKa values in proteins

## STATEMENT OF PURPOSE OF THE p*K*_{a}-COOPERATIVE

Computational and experimental study of acid-base equilibria in proteins has reached a point where further progress in increasing the reliability of predicting pK_{a}’s will require the development of new approaches that better describe the underlying physics regulating the system’s structure and dynamics as well as any pH-dependent phenomena ^{1}. Such improvements may be based on entirely novel algorithms or on combining the strongest components of existing approaches. To carry out the latter, an initial step will be the detailed analysis of the strengths and weaknesses of existing approaches. Toward that end participants in a workshop on protein electrostatics, organized by Marilyn Gunner and Bertrand Garcia-Moreno, concluded that it was timely to assess the different methods for calculating p*K*_{a}, how they would fare on some difficult cases, and subsequently how these approaches could be improved. It was decided that the best framework for accomplishing this goal was to establish a (preliminary) cooperative that would be a repository of data and act as a channel for bringing together researchers who are active in developing and applying methods for calculating acid/base dissociation constants in proteins. The first meeting of the p*K*_{a}-cooperative was held at Telluride, July 6–10, 2009. This paper is a summary of that meeting.

To provide a focus for the meeting, research groups involved in p*K*_{a} calculations were asked to make blind predictions using the extensive structural and experimental results on staphylococcal nuclease (SNase) provided by the Garcia-Moreno group. This group had determined structures and measured various p*K*_{a} values of wild type SNase and a large number of mutants ^{2–10}. The results of the blind predictions were discussed at the meeting, and thanks to the willingness of all contributors to discuss their results in an open forum, the meeting was successful in identifying a number of issues relevant to improving the accuracy of p*K*_{a} prediction. The open discussion allowed the group to avoid the pitfall, fatal for this type of exercise, of degenerating into a competition with “winners” and “losers”. Avoiding such a trap is essential if the entire community is to profit from comparing the different methods and gain insight into how to incorporate improvements. Blind predictions are useful because they test a given method objectively: the results cannot be “improved” by further refinement of the parameters. Thus blind predictions provide a measure of the state of development of a particular approach and give clues as to where improvements are needed. This paper serves as an introduction to the special issue of *PROTEINS: Structure, Function and Bioinformatics* that will report the results from the individual groups that participated in the blind prediction exercise.

In the next section, we give a brief overview of methods used in p*K*_{a} calculations, concentrating on the methods used by the participants of the meeting. This is followed by a section based on the experiences of the blind prediction contributors, whom we asked to write a short description of their calculations without including any results. We were particularly interested in problems and difficulties that were encountered during the calculations. Finally, in a concluding section, we briefly consider future directions and speculate (“predict”) on how to develop methods that are both accurate and not too computationally demanding. The ultimate goal is not only to predict p*K*_{a}s but also to reveal the underlying physics regulating the ionization.

## OVERVIEW OF METHODS FOR CALCULATING p*K*_{a}s IN PROTEINS

### Introduction

The calculation of the p*K*_{a} of titratable groups in proteins had its beginning in the work of Tanford and Kirkwood based on the Poisson-Boltzmann equation (PBE) ^{11}. This early work provided methods for studying acid-base equilibria in proteins even before the 3-dimensional structure of any protein was known. With the development of x-ray crystallography as a powerful tool for the accurate determination of protein structure and the introduction of computers, it became possible to calculate the p*K*_{a}s of titratable groups in proteins at ever-increasing levels of detail and complexity. In particular, with the significant increase in computing power over the last decade, there has been a rapid development of novel methods for calculating p*K*_{a}s that, in principle, are able to give an accounting of the underlying physics that controls the acid-base equilibrium of the titrating systems in a protein. At the time that the initial use of the PBE as a tool for calculating p*K*_{a} was being explored, physical chemists turned to the evaluation of dissociation constants in bifunctional acids and bases. Their approach was to express the electrostatic free energy of interaction of the bifunctional groups by Δw = q_{1}q_{2}/(D_{e}R), where R is the distance between the charges q_{1} and q_{2}, and D_{e} is an effective screening term ^{12}. As with the PBE, the so-called screened Coulomb potential (SCP) has been the starting point of many modern methods for calculating p*K*_{a} in proteins.
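As a minimal sketch of the screened Coulomb expression Δw = q_{1}q_{2}/(D_{e}R), the following Python function evaluates the interaction energy of two point charges. The unit conversion factor, the function names, and the use of a constant screening value are assumptions made for illustration only; the text does not prescribe a functional form for D_{e}, so the screening is passed in as a callable.

```python
from typing import Callable

# Assumed unit convention: charges in units of e, distances in Angstrom,
# energies in kcal/mol (332.06371 converts e^2/Angstrom to kcal/mol).
COULOMB_KCAL = 332.06371


def constant_screening(value: float) -> Callable[[float], float]:
    """Return a screening function D_e(R) that ignores the distance."""
    return lambda r: value


def screened_coulomb(q1: float, q2: float, r: float,
                     screening: Callable[[float], float]) -> float:
    """Return the screened Coulomb energy q1*q2 / (D_e(R) * R) in kcal/mol."""
    if r <= 0.0:
        raise ValueError("distance must be positive")
    return COULOMB_KCAL * q1 * q2 / (screening(r) * r)


if __name__ == "__main__":
    # Hypothetical ion pair 10 Angstrom apart, screened as in bulk water.
    energy = screened_coulomb(1.0, -1.0, 10.0, constant_screening(80.0))
    print(f"screened interaction: {energy:.3f} kcal/mol")
```

A distance-dependent screening can be tested by passing any other callable in place of `constant_screening`.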

Unfortunately, the reliability of calculated p*K*_{a}s has not kept pace with the development of new and more sophisticated methods for modeling titratable systems: errors of two or more pH units in calculated p*K*_{a} values are not unusual. In particular, errors of over 1 p*K*_{a} unit are most likely in predicted values for titratable residues where the measured p*K*_{a} indicates a large shift from the reference value. Such errors are particularly troublesome for cases where residue p*K*_{a} values shift into the physiological pH range. Errors in calculated p*K*_{a} values for highly-perturbed residues are a serious issue because many studies report p*K*_{a} calculations on a subset of the titratable residues in one or a few proteins and, if the results are satisfactory, conclude that the method works. Experience suggests, however, that the reliability of a given method can only be assessed after applying the approach to many proteins of different structural characteristics ^{13}. A mitigating factor in some cases is that absolute accuracy in the p*K*_{a} value is not essential for rationalizing pH dependent processes in biological macromolecules, where the protonation state of key titratable residues at physiological pH, or changes in p*K*_{a} with structural transitions is often sufficient to develop useful insights into the physical mechanism of a biological process.

The development of experimental methods to determine p*K*_{a} values has also seen rapid progress and the introduction of NMR techniques ^{14–17} has made p*K*_{a} measurements accurate and fairly routine in globular proteins. Thus a large (and still growing) body of data is now available that can be used to test the computational approaches. Some experimentalists have developed and made available systematic data sets of values consisting of wild type and mutant proteins that can be used to carefully probe the computational methods to identify the sources of disagreement between calculated and experimental results. Such probing will hopefully lead to improvement of the computational methods.

Most methods for predicting p*K*_{a} values in proteins are based on estimating the additional free energy terms that appear when the protonatable moiety is transferred from solvent into the protein, which formally can be expressed as:

$$\mathrm{p}K_{a} = \mathrm{p}K_{a}^{\mathrm{ref}} + \Delta \mathrm{p}K_{a}$$

The first term on the right hand side provides a reference value representing the p*K*_{a} of the residue in the solvent (typically termed the null model), while the second term comprises all the new interactions that arise from removing the residue from the pure solvent (desolvation) and embedding it in the protein (resolvation), which itself is immersed in the solvent. Theoretical and experimental evidence indicates that the most important class of interactions that determine Δp*K*_{a} are electrostatic in origin. Therefore, to be able to predict p*K*_{a} values reliably, a reasonably accurate description of the electrostatics and other relevant energy terms in the protein and surrounding environment is required. Minimally, the description of the electrostatics must comprise a term that describes the Coulomb interactions between the charges that model the protein structure and a term that describes the interaction of the charges with the solvent, often termed the “self” or “transfer” energy. The importance of the latter term was pointed out long ago by Warshel ^{18}. It is noted that recent simulation results suggest that inclusion of other components of the intermolecular potential, e.g., hydrophobic effects, may also improve the predictions ^{19}.
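The link between the free energy terms and the p*K*_{a} shift can be illustrated with the standard thermodynamic relation Δp*K*_{a} = ΔΔG/(RT ln 10), where ΔΔG is the deprotonation free energy in the protein minus that in the solvent. The sketch below assumes this sign convention, energies in kcal/mol, and an illustrative reference value; none of the names or numbers are taken from the participating methods.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def delta_pka(ddg_deprot: float, temperature: float = 298.15) -> float:
    """Convert a change in deprotonation free energy (kcal/mol) to a pKa shift.

    ddg_deprot is assumed to be G_deprot(protein) - G_deprot(solvent), so a
    positive value (deprotonation harder in the protein) raises the pKa.
    """
    return ddg_deprot / (R_KCAL * temperature * math.log(10.0))


def protein_pka(pka_ref: float, ddg_deprot: float,
                temperature: float = 298.15) -> float:
    """Reference (null-model) pKa plus the environment-induced shift."""
    return pka_ref + delta_pka(ddg_deprot, temperature)


if __name__ == "__main__":
    # Illustrative numbers only: a carboxylic acid with a reference pKa of 4.0
    # whose deprotonation costs 2.7 kcal/mol more in the protein than in water.
    print(f"predicted pKa: {protein_pka(4.0, 2.7):.2f}")
```

At room temperature one p*K*_{a} unit corresponds to roughly 1.36 kcal/mol, which is why errors of a few kcal/mol in the computed electrostatics translate into errors of two or more p*K*_{a} units.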

The calculation of the electrostatic effects can be based on a microscopic or macroscopic framework. Truly macroscopic models express the system by a continuum description and assume that the required quantities can be calculated directly from the macroscopic electrostatics equations; *i.e.*, the PBE. Microscopic models calculate all interactions at the atomic level of detail, and thermodynamic properties are obtained by statistical averaging. There is broad agreement that ultimately it is most desirable to use the microscopic framework because of its greater theoretical content. However, many microscopic methods tend to be computationally expensive and therefore, in most cases, macroscopic continuum approaches have been used because of these computational limitations. Fortunately, this issue is gradually being resolved by the availability of ever increasing computing power and more efficient methods for simulation and sampling. As a result, there has been an increasing interest in developing microscopic methods.

In recent years, a new class of methods has been developed that is based on the large data set of measured protein p*K*_{a} values that is now available. These new methods are purely empirical in concept and use the protein structure to account for the different types of interactions; *e.g.*, H-bonds or charge-charge interactions, and assign each such interaction an energetic weight that is optimized using the large data base of experimentally determined p*K*_{a} values or titration curves. The advantage of these methods is their speed, but their disadvantage is that they are not physics-based and thus provide less physical insight into the determinants of shifted p*K*_{a} values. The results of these methods seem quite reasonable provided one is within the radius of convergence defined by the data set used in the parameterization, but extrapolation is likely to be less satisfactory until the data base is extended. This implies, e.g., that the effects of mutation on the pKa of a particular group may be in error even though the wild type (WT) pKa is correctly predicted. In contrast, even though methods based on the electrostatic equations often require empirical parameters to yield reasonable results, it may still be possible to rationalize the underlying physics that leads to the shifted pKa.

Below, we briefly review p*K*_{a} prediction methods starting with macroscopic approaches followed by microscopic approaches and finally empirical methods (see for example Ref. ^{20}). It is noted that this review is not meant to be exhaustive, but primarily concentrates on the methods that were of interest and discussed at the 2009 Telluride meeting.

### Macroscopic methods

Although most physics based methods for calculating p*K*_{a}s are based on either “macroscopic” or “microscopic” models, some formulations are mixed, juxtaposing macroscopic and microscopic quantities. A typical example is using molecular dynamics (MD) with an implicit solvent description such as Generalized Born (GB). Resolving this juxtaposing of mutually inconsistent quantities in a physically reasonable way may be part of the difficulty experienced in formulating reliable methods for calculating pH dependent quantities.

#### (a) PB equation based methods

The earliest methods for calculating p*K*_{a} values represented the protein by an impenetrable sphere because the resulting PBE could be solved analytically. The most influential of these methods was developed by Tanford and Kirkwood (TK) ^{11} and Tanford and Roxby^{21}, based on a model where the protein was represented by an impenetrable sphere of radius *b* with embedded titratable points and a low dielectric constant, and an exterior region with a high dielectric constant. The TK method was introduced before any protein structures had been solved, but as soon as coordinates became available the TK method was modified to account for the solvent accessibility of the titratable group, since it was argued that charges near the protein surface would experience additional damping due to the polar solvent ^{22,23}. Subsequently, many other modifications have been proposed ^{24}; however, the advent of large scale computing has since allowed the use of numerical methods that can solve the PBE directly for proteins of any shape.

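Proteins and other biological macromolecules are irregularly shaped multi-atomic objects existing in water in the presence of mobile ions. The electrostatic potential φ(**r**) in such a system can be calculated using the PBE, i.e.,

$$\nabla\cdot\left[\varepsilon(\mathbf{r})\,\nabla\phi(\mathbf{r})\right] - \varepsilon(\mathbf{r})\,\kappa^{2}(\mathbf{r})\,\frac{k_{B}T}{e}\,\sinh\!\left(\frac{e\,\phi(\mathbf{r})}{k_{B}T}\right) = -4\pi\rho(\mathbf{r})$$

where *ε*(**r**) is the dielectric permittivity, ρ(**r**) is the permanent charge density, κ is the Debye-Hückel parameter, *e* is the elementary charge, *k*_{B} is the Boltzmann constant and *T* is the temperature.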

For irregularly shaped objects the PBE does not have analytical solutions, so the electrostatic component of the solvation energy and the corresponding ion screening must, in practice, be calculated numerically, for which several approaches are available. The most frequently used numerical methods of solving the PBE can be grouped into two distinct categories: methods implemented on volume-filling grids (including finite difference, finite volume, and finite element methods) and boundary element (BE) methods, where the solution is expressed in terms of distributions over the molecular surface. Commonly used PB solvers include (1) DelPhi, developed in the Honig lab ^{25–27}; (2) APBS, developed by Baker and coworkers ^{28,29}, with several new additions made in the McCammon lab ^{30–32}; (3) CHARMM ^{33}, a molecular mechanics and simulation program that includes a finite difference (FD) based PB solver developed by Roux and co-workers ^{34}; (4) ZAP, developed by Nicholls and co-workers ^{35}; (5) MEAD, developed by Bashford ^{36}; (6) the AFMPB solver, developed by Lu and co-workers ^{30}; and (7) MIBPB, developed by Wei and co-workers ^{37}.
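To make the grid-based idea concrete, the following sketch solves the linearized PBE in one dimension, φ'' = κ²φ, by finite differences and compares the result with the analytical Debye-Hückel decay φ_{0}exp(−κx). This is a toy problem chosen for illustration, not the algorithm of any of the solvers listed above; the function names, the uniform dielectric, and the Dirichlet boundary values are all assumptions.

```python
import math
from typing import List, Tuple


def debye_huckel_profile(phi0: float, kappa: float, x: float) -> float:
    """Analytical solution of phi'' = kappa^2 * phi that decays from phi0 at x = 0."""
    return phi0 * math.exp(-kappa * x)


def solve_linear_pb_1d(phi0: float, kappa: float, length: float,
                       n: int) -> Tuple[List[float], List[float]]:
    """Finite-difference solution on n interior grid points of [0, length].

    Discretization: -phi[i-1] + (2 + (kappa*h)^2) * phi[i] - phi[i+1] = 0,
    with the potential fixed to phi0 at x = 0 and to the analytical value at
    x = length. The tridiagonal system is solved with the Thomas algorithm.
    """
    if n < 1:
        raise ValueError("need at least one interior grid point")
    h = length / (n + 1)
    diag = 2.0 + (kappa * h) ** 2
    rhs = [0.0] * n
    rhs[0] += phi0
    rhs[-1] += debye_huckel_profile(phi0, kappa, length)

    # Forward elimination (sub- and super-diagonal entries are both -1).
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = -1.0 / diag
    d_prime[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag + c_prime[i - 1]
        c_prime[i] = -1.0 / denom
        d_prime[i] = (rhs[i] + d_prime[i - 1]) / denom

    # Back substitution.
    phi = [0.0] * n
    phi[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        phi[i] = d_prime[i] - c_prime[i] * phi[i + 1]

    grid = [h * (i + 1) for i in range(n)]
    return grid, phi


if __name__ == "__main__":
    # Illustrative parameters: Debye length of 10 Angstrom, 0.1 Angstrom spacing.
    grid, phi = solve_linear_pb_1d(1.0, 0.1, 50.0, 499)
    worst = max(abs(p - debye_huckel_profile(1.0, 0.1, x))
                for x, p in zip(grid, phi))
    print(f"largest deviation from the analytical profile: {worst:.2e}")
```

Production solvers work in three dimensions with a position-dependent dielectric and, for the full PBE, a nonlinear ionic term, but the structure of the problem (a sparse linear system assembled on a grid) is the same.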

Bashford and Karplus pioneered the field of PB-based methods for predicting p*K*_{a}s of ionizable groups. They developed a macroscopic electrostatic continuum model using detailed structural information to treat self-energies and interactions arising from permanent partial charges and titratable charges ^{38} and solved the PBE using finite difference methods. Testing the approach on lysozyme resulted in the observation that the p*K*_{a} values are very sensitive to the details of the local protein conformation, and that side-chain mobility is likely to be important in determining the observed p*K*_{a} shifts. It is also of note that the accuracy of the p*K*_{a} values already hinted at the issues that would develop around the definition of the dielectric constant.

The PB-based approach was also used by McCammon and co-workers ^{39,40} to predict p*K*_{a} values using 3D structures of the corresponding proteins/small molecules. Wade and co-workers showed that the optimization of parameters such as partial charges could significantly improve the p*K*_{a} predictions ^{41}. The Baker and Nielsen groups collaborated successfully to develop a set of tools for p*K*_{a} calculations ^{42}. Honig and co-workers further improved the FDPB method for calculating p*K*_{a}s ^{43}. The novelty of their technique with respect to previous work was the specific incorporation within the numerical protocol of both the neutral and charged forms of each ionizable group. The multiple-site titration algorithm ^{44} developed by Gilson and co-workers addressed the necessity of computing p*K*_{a}s of proteins having a large number of titratable sites, which results in an exponentially growing number of possible charged or uncharged states. Based on the results in Ref. ^{45}, a pragmatic approach was taken by Antosiewicz and co-workers to account for conformational flexibility through the use of a high dielectric constant of 20 for the protein interior ^{46–48}. This procedure seemed to improve overall results, but left several important titration sites in serious error. Baptista and co-workers^{49} investigated the use of two distinct protein dielectric constants for computing the individual (site) and the pairwise (site-site) terms of the ionization free energies, but they found no overall improvement over the use of a single value of 20, even for buried or shifted sites. Karshikoff further explored the use of the dielectric constant to mimic protein flexibility ^{50} by assigning different local dielectric constants per residue type with a combination of the FDPB and Tanford-Roxby iterative procedures. In addition, Baptista and coworkers proposed the methodology of computing p*K*_{a}s with alternative hydrogen positions ^{51}.
The method of Warwicker and co-workers ^{52} estimated the conformational relaxation in a pH-titration with a mean-field assessment of maximal side chain solvent accessibility. Another FDPB-based method was introduced by Nielsen and co-workers, which adds an explicit step to optimize the hydrogen-bond network. It was shown that this approach delivers better results than methods that do not optimize the hydrogen-bond network ^{53,54}.
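The exponential growth in the number of protonation states mentioned above can be seen in a brute-force treatment of the multiple-site titration problem. The sketch below enumerates all 2^N states of N interacting sites and Boltzmann-averages the protonation of each site. It is an illustrative toy under stated assumptions, not the algorithm of Ref. ^{44}: energies are expressed in p*K* units (that is, divided by RT ln 10), each site has a hypothetical intrinsic p*K*_{a}, and `w[i][j]` is an assumed interaction energy between unit charges on sites i and j.

```python
from itertools import product
from typing import List, Sequence


def protonation_fractions(ph: float,
                          pk_int: Sequence[float],
                          deprot_charge: Sequence[int],
                          w: Sequence[Sequence[float]]) -> List[float]:
    """Average protonation of each site by exact enumeration of 2^N states.

    State energy in pK units:
        g(x) = sum_i x_i * (pH - pK_int[i]) + sum_{i<j} w[i][j] * q_i * q_j
    where x_i is 1 if site i is protonated and q_i = deprot_charge[i] + x_i
    (-1 or 0 for an acid, 0 or +1 for a base). The weight of a state is 10^(-g).
    """
    n = len(pk_int)
    partition = 0.0
    occupancy = [0.0] * n
    for state in product((0, 1), repeat=n):
        charges = [deprot_charge[i] + state[i] for i in range(n)]
        g = sum(state[i] * (ph - pk_int[i]) for i in range(n))
        g += sum(w[i][j] * charges[i] * charges[j]
                 for i in range(n) for j in range(i + 1, n))
        weight = 10.0 ** (-g)
        partition += weight
        for i in range(n):
            occupancy[i] += state[i] * weight
    return [value / partition for value in occupancy]


if __name__ == "__main__":
    # Two hypothetical acids with intrinsic pKa 4.0 that repel each other by
    # one pK unit when both are charged.
    coupling = [[0.0, 1.0], [1.0, 0.0]]
    for ph in (3.0, 4.0, 5.0, 6.0):
        fractions = protonation_fractions(ph, [4.0, 4.0], [-1, -1], coupling)
        print(ph, [round(f, 3) for f in fractions])
```

Exact enumeration is practical only for a handful of sites, which is why approximate schemes such as Monte Carlo sampling or the Tanford-Roxby iterative procedure are used for whole proteins.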

#### (b) The PBE and conformational flexibility

It became evident that protein conformational flexibility should be explicitly taken into consideration within the same protocol that calculates p*K*_{a}s. Bashford and co-workers introduced polar proton conformational flexibility into the p*K*_{a} protocol ^{55} by generating an ensemble of conformers where the positions of polar protons were systematically varied. This information was then used to explicitly calculate intrinsic p*K*_{a} values and electrostatic interactions between titrating sites. The method was applied to the Asp, Glu, and Tyr residues of hen lysozyme. Different protocols for hydrogen atom placement were used and their effect tested against experimental p*K*_{a} values. It was determined that multi-conformational calculations significantly improved the agreement with experiment. The subsequent Monte-Carlo based method of Beroza and Case ^{56} included side chain flexibility in continuum electrostatic calculations of protein titration. Knapp and co-workers ^{57} demonstrated that the geometry and the hydrogen bonding are very important in treating p*K*_{a}s of residues involved in salt bridges. Harbury and co-workers recently developed a rotamer repacking method called FDPB-MF that exhaustively samples side chain conformational space and rigorously calculates multibody protein-solvent interactions ^{58}. Their method achieved high accuracy on a small subset of acidic residues in turkey ovomucoid third domain, hen lysozyme, Bacillus circulans xylanase, and human and Escherichia coli thioredoxins, with root mean square deviations of 0.3 pH units ^{58}. Recently, Warwicker and coworkers developed the FD/DH method ^{52}, which is an automated combination of Finite Difference Poisson-Boltzmann (FDPB) ^{59,60} and Debye-Hückel (DH) methods.
This is based on the well-known finding that Δp*K*_{a}s for water accessible groups are generally dominated by water dielectric, and can be handled in a simple DH model with relative dielectric of 78.4, whereas solvent exclusion can lead to larger Δp*K*_{a}s, handled better by FDPB with separate water and protein dielectrics ^{61}. The code statistically averages p*K*_{a}s over multiple conformers and multiple FDPB calculations. In the FD/DH method, a short-cut approximation avoids multi-conformation sampling, with DH interactions only being sampled where assessment of maximal solvent accessible surface area (SASA) for an ionisable group is greater than a fixed fraction ^{52}. This assessment is made with a mean-field sampling of side chain rotamer packing on a fixed backbone. ^{62,63}

One of the most commonly used methods for incorporating conformational flexibility into p*K*_{a} calculations combines FDPB electrostatic calculations with explicit sampling of side chain, hydrogen and ligand positions. This approach, developed by Gunner and co-workers, is known as the Multi-Conformation Continuum Electrostatics (MCCE) method^{64–67}. In MCCE the protein side chain motions are simulated explicitly while the dielectric effect of solvent and bulk protein material is modeled by continuum electrostatics. MCCE can be used to: (1) study the protein structural responses to changes in charge; (2) study the changes in charge state of ionizable residues due to structural changes in the protein; (3) study the structural and ionization changes caused by changes in solution pH; (4) find the location and stoichiometry of proton transfers coupled to electron transfer; (5) make side chain rotamer packing predictions as a function of pH. Recently Alexov and co-workers developed a hybrid p*K*_{a} method that uses distinctly different ensembles of structures to represent the conformational ensembles of the ionized and neutral forms of the titratable residue of interest. These ensembles were generated either with MD simulations or *ab-initio* structure predictions. The structures were then subjected to MCCE calculations and the p*K*_{a}s were predicted by averaging the corresponding titration curves.
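The final step described above, predicting a p*K*_{a} by averaging titration curves computed for different structures, can be sketched as follows. The equal weighting of curves, the use of the half-protonation point as the p*K*_{a}, and all function names are assumptions made for illustration; the published protocol may weight or fit the curves differently.

```python
from typing import List, Sequence


def hh_fraction(ph: float, pka: float) -> float:
    """Henderson-Hasselbalch protonated fraction, used here to fake input curves."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def average_curves(curves: Sequence[Sequence[float]]) -> List[float]:
    """Pointwise, equally weighted mean of titration curves on a shared pH grid."""
    if not curves:
        raise ValueError("need at least one curve")
    n_points = len(curves[0])
    if any(len(curve) != n_points for curve in curves):
        raise ValueError("all curves must share the same pH grid")
    return [sum(curve[k] for curve in curves) / len(curves)
            for k in range(n_points)]


def pka_from_curve(ph_values: Sequence[float],
                   fractions: Sequence[float]) -> float:
    """pH at which the protonated fraction crosses 0.5 (linear interpolation)."""
    for k in range(len(ph_values) - 1):
        lo, hi = fractions[k], fractions[k + 1]
        if lo != hi and (lo - 0.5) * (hi - 0.5) <= 0.0:
            t = (lo - 0.5) / (lo - hi)
            return ph_values[k] + t * (ph_values[k + 1] - ph_values[k])
    raise ValueError("curve does not cross half protonation on this pH grid")


if __name__ == "__main__":
    ph_grid = [0.5 * k for k in range(29)]  # pH 0 to 14 in steps of 0.5
    # Two hypothetical conformers whose individual curves have pKa 4 and 6.
    curves = [[hh_fraction(ph, pka) for ph in ph_grid] for pka in (4.0, 6.0)]
    mean_curve = average_curves(curves)
    print(f"pKa of the averaged curve: {pka_from_curve(ph_grid, mean_curve):.2f}")
```

Note that the averaged curve is generally broader than a single Henderson-Hasselbalch curve, so the midpoint alone does not capture all of the information it contains.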

#### (c) Generalized Born

As an alternative to the PBE, a computationally faster approach based on Born’s theory of ionic solvation was developed. This approach is based on an early extension of the Born formula, proposed by Hoijtink ^{68} to allow the Born approach to be applied to systems with a distribution of N point charges, and was expressed in the form

$$\Delta G_{\mathrm{pol}} = -\frac{1}{2}\left(1-\frac{1}{\varepsilon}\right)\sum_{i=1}^{N}\sum_{j=1}^{N}\frac{q_{i}q_{j}}{(1-\delta_{ij})\,r_{ij}+\delta_{ij}R_{i}}$$

where *q*_{i} is the net charge (not necessarily integral) on particle *i*, *r*_{ij} is the separation between *q*_{i} and *q*_{j}, *R*_{i} is the Born radius for atom *i*, ε is the dielectric constant of the solvent, and *δ*_{ij} is the Kronecker delta. This equation and similar forms that allow the original Born approach to be extended to multi-particle systems are referred to as the generalized Born (GB) equations. One such approach was proposed by Still and coworkers ^{69} for calculating solvation energies of organic molecules; a quantum chemical based approach was developed in the lab of Truhlar ^{70,71}. Still’s method is based on an empirically determined functional form to calculate the polarization free energy.

The proposed function was parameterized to account for both electrostatic damping and solvation. The success of the method in calculating solvation energies of small organic molecules prompted several workers to adapt it to calculating electrostatic effects in biomolecules. The further development of this theory is summarized in several reviews and research articles ^{72,73}, and a number of alternative models are now available: HCT ^{71}, ACE ^{74}, AGBNP ^{75,76}, GBMV ^{77,78}, GBSW ^{79} and ALPB ^{80–82}.
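For orientation, the sketch below evaluates a GB polarization energy using the interpolation function that is widely associated with the approach of Still and coworkers, f_{GB} = [r_{ij}² + R_{i}R_{j}exp(−r_{ij}²/4R_{i}R_{j})]^{1/2}. This functional form is quoted from general knowledge rather than from the text above, and the Born radii are taken as given inputs rather than computed from the molecular geometry, so the example should be read as an assumption-laden illustration. Units are assumed to be e, Å and kcal/mol.

```python
import math
from typing import Sequence, Tuple

COULOMB_KCAL = 332.06371  # converts e^2/Angstrom to kcal/mol

Point = Tuple[float, float, float]


def f_gb(r: float, radius_i: float, radius_j: float) -> float:
    """Interpolation between the Born limit (r = 0) and the Coulomb limit (large r)."""
    product = radius_i * radius_j
    return math.sqrt(r * r + product * math.exp(-r * r / (4.0 * product)))


def gb_polarization_energy(coords: Sequence[Point],
                           charges: Sequence[float],
                           born_radii: Sequence[float],
                           eps_solvent: float = 80.0) -> float:
    """Return -(1/2) * (1 - 1/eps) * sum_ij q_i q_j / f_GB(r_ij) in kcal/mol.

    The double sum includes i == j, where f_GB reduces to the Born radius.
    """
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(n):
            r = math.dist(coords[i], coords[j])
            total += charges[i] * charges[j] / f_gb(r, born_radii[i], born_radii[j])
    return -0.5 * COULOMB_KCAL * (1.0 - 1.0 / eps_solvent) * total


if __name__ == "__main__":
    # Hypothetical ion pair with arbitrary Born radii of 2 Angstrom.
    pair = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
    energy = gb_polarization_energy(pair, [1.0, -1.0], [2.0, 2.0])
    print(f"GB polarization energy: {energy:.2f} kcal/mol")
```

For a single ion the expression reduces to the Born formula, which provides a simple sanity check on any implementation.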

### Microscopic methods

The advantage of microscopic theory is that, in principle, no empirical parameters are needed, so that the underlying physics can be revealed. A second major advantage is that physical quantities defined at the macroscopic level, e.g., the permittivity, do not appear in microscopic formulations since the relative permittivity in a fully explicit, atomistic description is one. The major disadvantage of microscopic approaches is that they are computationally intensive, thus simplifications have to be made that can compromise the theoretical content of the method.

An important early approach in this direction was made by Warshel ^{83,84} who expressed the protein-solvent system in terms of charges and dipoles in the protein and point dipoles on a three dimensional grid for the solvent. Warshel’s approach is based on the dielectric theory of polar solvation developed by Lorentz, Debye, Sack, and Onsager (LDSO) (see for example Ref. ^{85}), which, however, maintained the microscopic treatment of the entire system. Unfortunately even Warshel’s approximations were still too compute-intensive so that further simplifications had to be introduced leading to a semi-microscopic approach that finally forced the reintroduction of a permittivity like quantity in the formulation. Nevertheless, Warshel recognized that the particular form or value of the permittivity depended on the physics of the system and should not be treated as an arbitrary parameter ^{49,86}.

The most fundamental approaches for describing electrostatic, as well as all other physical, interactions are quantum mechanical (QM) methods, which solve the Schrödinger equation (SE) at some level of approximation. For macromolecular systems like proteins, solving the SE for the entire system is neither possible nor desirable. The required computing power is not available and, more fundamentally, at separations where the overlap repulsion has become vanishingly small only electrostatic interactions are non-negligible and need to be included in the calculation. Because of these issues, most methods follow a suggestion made by Warshel and Levitt ^{87} to divide the system into regions where only the region of detailed interest is described by QM and the more distal parts of the system are described classically. Several such approaches are described below.

### Quantum mechanics/molecular mechanics (QM/MM) based methods

A computational methodology for protein p*K*_{a} predictions, based on *ab-initio* quantum mechanical treatment of part of the protein and linear Poisson-Boltzmann equation treatment of the bulk solvent, has recently been developed by Jensen and coworkers ^{88}. This method was applied to predict and interpret the p*K*_{a} values of the five carboxyl residues (Asp7, Glu10, Glu19, Asp27, and Glu43) in the serine protease inhibitor turkey ovomucoid third domain and it was found to give quite promising results. Another approach described the development and application of a computational method for the prediction and rationalization of p*K*_{a} values of ionizable residues in proteins, based on *ab-initio* QM and the effective fragment potential (EFP) method ^{89}. In this approach the quantum region is surrounded by fragments for which the (static) potentials have been pre-determined using *ab-initio* QM. An attractive feature of this approach is that it requires no empirical parameters^{89}. It was shown that hydrogen bonds, rather than long-range charge-charge interactions, primarily determined the p*K*_{a} values. Cui and coworkers also applied a QM/MM potential function in microscopic p*K*_{a} simulations^{90}, developing the QM/MM-GSBP^{91} (Generalized Solvent Boundary Potential) based thermodynamic integration (TI) approach for p*K*_{a} predictions. The system set-up is identical to a recently published study ^{92} of V66E and V66D mutants, which has a 22 Å fully flexible inner GSBP region; several simulations have also been carried out with the simpler stochastic boundary condition with a large (34 Å) water sphere. To encourage structural response in the environment, the interaction between the QM titratable group and the MM environment is scaled by a constant α (>1) in the overcharging windows.
Two schemes were explored: *(a)* random walks between each TI window with a specific λ value and several overcharging windows with the same λ but different α values were realized with a Wang-Landau scheme; and *(b)* random walks were realized between all TI windows and the overcharging windows; only the overcharging windows with λ=1 were included. It is clear that, while these methods show great promise, at the present stage of development further effort will be required before they can be used routinely on large sets of cases.
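The thermodynamic integration step referred to above rests on the identity ΔG = ∫_{0}^{1} ⟨∂U/∂λ⟩_{λ} dλ, which in practice is evaluated from a finite number of λ windows. The following sketch applies the trapezoidal rule to window averages that are assumed to have been obtained elsewhere; the numbers are invented for illustration and the function name is hypothetical. Overcharging windows and the random walks between windows are not modeled.

```python
from typing import Sequence


def thermodynamic_integration(lambdas: Sequence[float],
                              mean_du_dlambda: Sequence[float]) -> float:
    """Trapezoidal estimate of the integral of <dU/dlambda> over the lambda windows."""
    if len(lambdas) != len(mean_du_dlambda) or len(lambdas) < 2:
        raise ValueError("need matching lists with at least two windows")
    total = 0.0
    for k in range(len(lambdas) - 1):
        width = lambdas[k + 1] - lambdas[k]
        if width <= 0.0:
            raise ValueError("lambda values must be strictly increasing")
        total += 0.5 * width * (mean_du_dlambda[k] + mean_du_dlambda[k + 1])
    return total


if __name__ == "__main__":
    # Invented window averages (kcal/mol) for a five-window calculation.
    windows = [0.0, 0.25, 0.5, 0.75, 1.0]
    averages = [10.0, 9.0, 8.0, 7.0, 6.0]
    print(f"free energy change: {thermodynamic_integration(windows, averages):.2f} kcal/mol")
```

A p*K*_{a} estimate then follows from the difference between the free energy obtained in the protein and that obtained for a reference compound in solution, divided by RT ln 10.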

### Molecular Dynamics (MD) based methods

In parallel to QM/MM approaches methods utilizing MD simulation have recently been proposed at various levels of approximation. These are combined with free energy perturbations (FEP) to calculate the change in free energy accompanying protonation or deprotonation. An interesting new approach carries out the simulations at constant pH allowing a first principle description of acid-base equilibria in proteins. Computational limitations require that in most applications some level of approximation is still required, which usually is achieved by using a continuum solvent approximation.

Alternative backbone conformations can be sampled within standard molecular dynamics protocols^{93,94}. These approaches calculate the p*K*_{a} as a thermodynamic average from conformations in the trajectory or from an average structure. Another approach for predicting p*K*_{a}s, combining MD and the Generalized Born (GB) model, was recently reported ^{95}. This implementation of the Molecular-Mechanics Generalized-Born Surface-Accessibility (MM-GBSA) approach was tested on a panel of nine proteins, including 69 individual comparisons with experiment. An issue with these calculations is that values of ε>1 were used within the context of all-atom microscopic simulations, where the permittivity should be unity, which is physically problematic. It was shown that the inclusion of non-electrostatic terms that are part of the MM-GBSA free energy expression improved prediction accuracy. A similar observation was previously made ^{64} by the authors of the MCCE method concerning the inclusion of van der Waals energy in p*K*_{a} calculations. Another approach to conformational averaging adopts a linear response approximation using conformations from both the ionized and neutral forms of the residue of interest. This approach was pioneered by Warshel within the context of the PDLD model ^{83} and has recently been extended to PB-based models ^{96,97}. Recently Warshel proposed a so-called overcharging approach that favors the conformational changes occurring in the MD simulations by overcharging the titratable group of interest ^{98}.

A method for p*K*_{a} prediction ^{99} was recently reported that uses continuous constant pH molecular dynamics (CPHMD) simulations ^{100,101}, which employ λ dynamics to propagate conformational and protonation states simultaneously (for a review see ^{102}). The method calculates solvation effects using the GB model, accounts for ion screening through an approximate Debye-Hückel function, and applies a replica-exchange protocol for enhanced sampling in both conformational and protonation space. By allowing microscopic coupling between protonation equilibria and conformational dynamics, the CPHMD method offers p*K*_{a} predictions at a first-principles level, thereby eliminating the need for the effective protein dielectric constant and the high-resolution structure typically required by macroscopic approaches. Another strength of the method is that it can be applied to study pH-dependent conformational phenomena ^{99,102}. The CPHMD method was benchmarked on 10 proteins, targeting anomalously large p*K*_{a} shifts for the carboxylate and histidine side chains. The p*K*_{a}s of buried ionizable groups were somewhat less well reproduced than those of surface groups^{99}. Since the July 2009 Telluride meeting, Shen and coworkers have extended the CPHMD method to explicit-solvent simulations using a hybrid scheme in which protonation states are propagated using the GB model but conformational dynamics is driven in explicit solvent ^{103}. This modified method may yield improved accuracy in the description of protein conformational dynamics while maintaining the efficiency of sampling protonation states.

An alternative constant pH approach has been developed using discrete protonation states and GB electrostatics ^{104}. In this method, Mongan *et al.* use GB-solvated MD, with periodic Monte Carlo sampling of discrete protonation states using the same GB electrostatics, to account for the important coupling of conformational dynamics and protonation state. At each MC step, a titratable residue and a new protonation state are chosen at random, and the total transition energy is used in the Metropolis criterion to decide whether the new protonation state is accepted. More recently, in an attempt to overcome the commonly reported convergence issues associated with constant pH MD methods, this approach has been coupled with accelerated MD.^{105} Using this coupled method (CpHaMD) ^{106}, improvement has been observed in the p*K*_{a} predictions of titratable residues of the extensively studied Hen Egg White Lysozyme (HEWL) system, relative to the earlier approach (above).
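
The Metropolis decision described above can be illustrated with a minimal sketch. It is a toy model of a single two-state site, not the published implementation: the function names, the `delta_elec_kt` environment term, and the absence of any MD between MC steps are all assumptions made for illustration. With no environment term, the sampled deprotonated fraction should approach the Henderson-Hasselbalch value.

```python
import math
import random

LN10 = math.log(10.0)

def transition_energy_kt(currently_protonated, ph, pka_ref, delta_elec_kt=0.0):
    """Free energy change (in units of kT) for flipping the protonation state.

    delta_elec_kt is a hypothetical environment contribution to the
    deprotonation free energy; it is zero for an isolated model compound.
    """
    deprotonation = LN10 * (pka_ref - ph) + delta_elec_kt
    return deprotonation if currently_protonated else -deprotonation

def metropolis_accept(delta_g_kt, rng):
    """Standard Metropolis test: always accept downhill moves."""
    return delta_g_kt <= 0.0 or rng.random() < math.exp(-delta_g_kt)

def sample_deprotonated_fraction(ph, pka_ref, n_steps=50000, seed=0):
    """Run MC over the protonation state only and return the deprotonated fraction."""
    rng = random.Random(seed)
    protonated = True
    n_deprotonated = 0
    for _ in range(n_steps):
        dg = transition_energy_kt(protonated, ph, pka_ref)
        if metropolis_accept(dg, rng):
            protonated = not protonated
        if not protonated:
            n_deprotonated += 1
    return n_deprotonated / n_steps
```

In a real constant pH simulation the transition energy would also depend on the instantaneous conformation, which is why few accepted transitions for a buried site translate directly into slow convergence.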

Baptista and co-workers have proposed two different constant-pH MD methods ^{107,108} that explore the complementarity of MM/MD methods (which sample conformations at a fixed protonation state) and PB models (which sample protonation states at fixed conformation). The first method, termed *implicit titration* ^{107}, uses fractional protonation states periodically updated from PB calculations performed along the MD simulation. The method is based on a potential of mean force ensuring sampling from the proper semi-grand canonical ensemble, together with a mean field approximation. The second method, termed *stochastic titration* ^{108}, uses discrete (nonfractional) protonation states which are similarly obtained from periodic PB and MC calculations. This method adopts a coupling between the MM/MD and PB/MC algorithms that generates a Markov chain sampling from the semi-grand canonical ensemble, allowing also for the use of explicit solvent in the MM/MD segments by means of an approximation; the treatment of protonatable groups with hydrogen isomerism^{109} and of redox groups (by specifying the solution reduction potential) ^{110} was later included. The stochastic titration method successfully reproduced the helix-coil transition of polylysine^{111} and predicted the acidic p*K*_{a} values of hen egg white lysozyme in reasonable agreement with experiment^{109}.

### Continuum methods from the microscopic description

Unlike macroscopic methods, where the applicability to microscopic systems has to be assumed, continuum solvent models can be rigorously derived from the microscopic description (for a review, see Ref. ^{85}). Because the method is derived from microscopic electrostatics, an internal dielectric constant does not appear. Instead, statistical averaging of the electrostatic equations defines a “virtual” fluid that penetrates all of space and is described by a sigmoidal, distance-dependent screening function that modulates both the electrostatic interactions and the self-energy. It provides an alternative approach for calculating p*K*_{a}s that was first developed by Mehler ^{85}. In this approach, a variational method is used to assign the titration charge to the atoms of the titrating moiety in an optimal and self-consistent way. In a later modification, a quantitative description of the hydrophobicity of the local environment was introduced that provides a mechanism to empirically modulate the electrostatic equations based on the properties of the local environment and the degree of solvent accessibility (the method contains 5 empirical parameters). A similar approach has been reported^{112} that uses the electrostatic equations derived from LDS theory, but these authors introduced empirically determined screening functions based on the region in the protein where the ionizable group is located.
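
To make the idea of a sigmoidal, distance-dependent screening function concrete, the sketch below uses a simple logistic form that rises from 1 at contact to the bulk solvent value at large separation. The functional form is one plausible choice and the parameter values (`eps_solvent`, `lam`) are arbitrary illustrative assumptions; neither is taken from the cited work, and the self-energy term is not treated here.

```python
import math

COULOMB_KCAL = 332.06  # kcal*Angstrom/(mol*e^2)

def sigmoidal_screening(r, eps_solvent=78.4, lam=0.005):
    """Illustrative screening D(r): D(0) = 1 and D(r) -> eps_solvent as r grows."""
    k = (eps_solvent - 1.0) / 2.0
    growth = math.exp(-lam * (eps_solvent + 1.0) * r)
    return (eps_solvent + 1.0) / (1.0 + k * growth) - 1.0

def screened_coulomb(q1, q2, r):
    """Pair interaction (kcal/mol) with distance-dependent screening; r in Angstrom."""
    return COULOMB_KCAL * q1 * q2 / (sigmoidal_screening(r) * r)
```

Because the screening depends only on distance, no dielectric boundary or internal dielectric constant has to be specified; any environment dependence must enter through additional terms such as the hydrophobicity correction mentioned above.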

### Empirical methods

In contrast to the methods described above, which are based on the macroscopic or microscopic electrostatic equations, the methods described here are based on an empirical functional form with parameters optimized against a large database of measured p*K*_{a} values. For example, one study utilized a genetic algorithm to design an empirical equation that took into account the long-range charge-charge interactions and the interactions of a given carboxylic acid group with its local environment in the protein ^{113}. Another approach was taken by Spassov and co-workers ^{114}, in which a three-term empirical function describing charge-charge interactions was optimized over experimentally determined titration curves. Another method ^{115} defines an empirical equation that predicts p*K*_{a}s based on the electrostatic potential, hydrogen bonds, and accessible surface area.

A very fast empirical method (PROPKA) was recently developed by Jensen and coworkers ^{116,117}. It uses the 3D structure of the protein to estimate the desolvation effects and intra-protein interactions from the positions and chemical nature of the groups proximate to the p*K*_{a} sites. PROPKA was tested on 233 carboxyl, 12 cysteine, 45 histidine, and 24 lysine p*K*_{a} values in various proteins, resulting in a root-mean-square deviation of less than one pH unit. PROPKA has become the most widely used empirical program for p*K*_{a} predictions.

Recently, a new method for protein p*K*_{a} calculations, MoKaBio ^{118}, was developed by Milletti. It is based on a statistical method trained on experimental p*K*_{a} values of 434 unique residues. Each residue in the training set is described by a fingerprint that encodes the chemical environment within a sphere with a radius of 6 Å from the site of ionization. This fingerprint contains information on the physicochemical properties of the neighboring atoms (charge, hydrophobicity, etc.) and their distance from the site of ionization. The prediction requires the following steps: *(a)* generation of a fingerprint for each ionizable site of a protein; *(b)* calculation of a similarity index (SI) between each fingerprint of the protein and all the fingerprints in the training set; *(c)* p*K*_{a} prediction by using the experimental p*K*_{a} values of the top ten most similar ionizable sites in the training set, weighted according to the SI. Leave-one-out cross-validation of this method on the training set of 434 p*K*_{a} values was carried out. In the development phase of this method it was observed that it was difficult to predict p*K*_{a} shifts originating from long-range interactions. This motivated the authors of MoKaBio to choose a fingerprint similarity approach rather than other machine learning approaches such as Partial Least Squares, which are based on the calculation of the contribution of individual groups to the p*K*_{a} shift of a residue.
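
Steps *(b)* and *(c)* amount to a similarity-weighted nearest-neighbour estimate, which can be sketched as follows. The fingerprint encoding (plain non-negative numeric vectors) and the cosine similarity used as the SI are assumptions for illustration only; the actual MoKaBio fingerprint and similarity definitions are not reproduced here.

```python
import math

def similarity_index(fp_a, fp_b):
    """Hypothetical SI: cosine similarity of two non-negative fingerprints."""
    dot = sum(a * b for a, b in zip(fp_a, fp_b))
    norm_a = math.sqrt(sum(a * a for a in fp_a))
    norm_b = math.sqrt(sum(b * b for b in fp_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return max(0.0, dot / (norm_a * norm_b))

def predict_pka(query_fp, training_set, top_n=10):
    """SI-weighted average of the experimental pKa values of the top_n neighbours.

    training_set is a list of (fingerprint, experimental_pka) pairs.
    """
    scored = [(similarity_index(query_fp, fp), pka) for fp, pka in training_set]
    scored.sort(key=lambda item: item[0], reverse=True)
    best = scored[:top_n]
    total_weight = sum(si for si, _ in best)
    if total_weight <= 0.0:
        raise ValueError("no training fingerprint resembles the query")
    return sum(si * pka for si, pka in best) / total_weight
```

A scheme of this kind can only return values bracketed by the p*K*_{a}s of its neighbours, so a site whose environment is poorly represented in the training set is pulled toward whatever weakly similar cases exist.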

## INSIGHTS AND DIFFICULTIES ENCOUNTERED BY p*K*_{a}-COOPERATIVE PARTICIPANTS

### The experimental dataset

The experimental p*K*_{a} values used for the blind prediction came from WT and mutant proteins characterized, in part crystallographically, by the Garcia-Moreno group. The p*K*_{a} values of the mutant proteins were determined by the Garcia-Moreno group ^{3,5,6,119} by performing equilibrium denaturation measurements at different pH values and/or relevant NMR experiments. Mutants were designed to position a single ionizable group in the core of SNAse in order to measure the effect of desolvating the ionizable group and the plausible compensation from newly formed favorable interactions. This yielded highly perturbed p*K*_{a} values for a large number of residues at different positions in the sequence ^{2–6}, which provided a unique dataset for the blind predictions. At the time of the blind prediction exercise, 90 of the mutant p*K*_{a} values had not been released and could therefore be used for a true blind prediction exercise (the p*K*_{a} values were known only to the Garcia-Moreno and Nielsen labs at the time of submission).

It is important to stress that only a single p*K*_{a} value (that of the inserted residue) was available for each mutant protein. Furthermore, for 77 of the 90 mutant proteins only modeled structures (provided by Emil Alexov) were available. In the blind prediction, each group was free to construct its own models of the mutant proteins, and the predictions submitted thus represented an exercise in both modeling and p*K*_{a} prediction. Additionally, the experimental data set is exceptional in that it contains a very large fraction of highly shifted p*K*_{a} values (the average shifts from the solution p*K*_{a} value for Asp and Glu are 2.8 and 2.3 units, respectively). Finally, it should be mentioned that upon learning of their performance on the full set of p*K*_{a} values, the participants in the Telluride meeting decided to receive experimental information for only 1/3 of the full set of p*K*_{a} values. The remaining 2/3 of the p*K*_{a} values were withheld for additional blind predictions until May 2010, which led to improvement in the performance of some methods.

### Calculations utilizing rigid heavy atom positions

The Baker/Nielsen group made predictions utilizing two protocols: PDB2PKA and WHAT IF. It was found that PDB2PKA performed particularly poorly on lysines, presumably because there was very little data on these residues in the calibration and training set. In contrast, WHAT IF yielded a high RMSD for histidines in WT SNAse. Other than these observations, no general trend was found in the results. However, the investigators concluded that use of a different dielectric constant would improve the accuracy for some sites, while for others it appeared that one would need to explicitly sample different conformations to improve accuracy. The latter point is particularly important for the cases where only a modeled protein structure was available for the prediction, since success in the blind predictions depends crucially on correctly calculating the highly structure-sensitive desolvation energy.

The Warwicker group used a protein dielectric constant of 10 for generating predictions with the FD/DH method. The motivation for using a high protein dielectric constant comes from the observation that even where crystal structures are available, they may well represent non-ionised forms of the charge mutant, which may undergo structural change upon ionisation. Such structural changes can be mimicked with high relative dielectrics in the range 10–12, rather than the commonly used values of 2–4 ^{2}. It was suggested that ionisation may introduce local conformational change, although clearly not unfolding in most cases, and predicting such conformational change is of interest. In the absence of reliable algorithms for predicting such conformational alteration, and bearing in mind that continuum models are intended to give rapid estimates, it may be reasonable to follow the published lead (ε_{p}=10) in a study focused on predicting the p*K*_{a} of an introduced buried charge.
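
The sensitivity of a buried-charge p*K*_{a} to the assumed protein dielectric can be illustrated with the textbook Born model, in which a spherical ion is transferred from water into an infinite medium of dielectric ε_{p}. This is a deliberately crude sketch and not the FD/DH method: the ion radius is a hypothetical value, and because compensating interactions and the finite depth of burial are ignored, only the trend with ε_{p} is meaningful, not the absolute numbers.

```python
import math

COULOMB_KCAL = 332.06      # kcal*Angstrom/(mol*e^2)
GAS_CONSTANT = 0.0019872   # kcal/(mol*K)

def born_transfer_energy(radius, eps_protein, eps_water=80.0, charge=1.0):
    """Born penalty (kcal/mol) for moving an ion of given radius (Angstrom)
    from water into a uniform medium of dielectric eps_protein."""
    prefactor = 0.5 * COULOMB_KCAL * charge ** 2 / radius
    return prefactor * (1.0 / eps_protein - 1.0 / eps_water)

def born_pka_shift(radius, eps_protein, is_acid, temperature=298.15, eps_water=80.0):
    """pKa shift implied by the Born penalty alone.

    Destabilizing the charged form raises the pKa of an acid (Asp, Glu)
    and lowers the pKa of a base (Lys, Arg).
    """
    dg = born_transfer_energy(radius, eps_protein, eps_water)
    shift = dg / (math.log(10.0) * GAS_CONSTANT * temperature)
    return shift if is_acid else -shift
```

Comparing `born_pka_shift(2.0, 4.0, True)` with `born_pka_shift(2.0, 10.0, True)` shows the desolvation-driven shift dropping by more than half, which is the sense in which a higher ε_{p} stands in for the unmodeled relaxation of the protein around the new charge.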

### Calculations using rigid heavy atoms and a Gaussian model (ZAP)

Mike Word used OpenEye’s ZAP PB solver to make p*K*_{a}-cooperative predictions. Although ZAP implements a discrete dielectric boundary model, its more usual mode, and the mode applied here, is that of a continuous dielectric function derived from an atom-centered Gaussian basis. This function interpolates the dielectric between the interior of the molecule and the solvent such that the predicted solvation of small molecules (<500 Da) is within 0.5 kcal/mol of that derived from the discrete, molecular surface model of DelPhi using the same internal dielectric. There is both a practical and a physical basis for this model. It is much more stable numerically, allowing estimation of solvation at an accuracy equivalent to that of the discrete model at about twice the grid spacing. Although it is tempting to see this model as an interpolation between the DelPhi molecular surface model and a zero-probe “van der Waals” surface model, it is actually trained to reproduce the former, i.e. to exclude water from internal spaces. However, an interpolation of a kind is seen when the model is applied to larger molecules, such as proteins. As observed by Nicholls and Grant ^{35}, calculated quantities such as binding energies or site-site interactions are commensurate with those from a discrete model with an internal dielectric roughly twice as large. This can be rationalized by the concept of a “wetter” protein surface than the discrete model provides, and it likely accounts for the correspondence between the ZAP approach and methods using a higher internal dielectric. However, there is a physical difference between the two approaches in that the underlying molecular dielectric in ZAP is still set to that from electronic polarization (ε_{p}=2). The higher effective dielectric occurs because the Gaussian-based function allows water more ingress to the protein, essentially sampling solvated states that might occur from small atomic displacements. In this way, the ZAP model accounts for more than electronic polarization via the shape of the dielectric function and not by raising the intrinsic, internal dielectric. Not surprisingly, such an approach resulted in very good predictions, similar to predictions made with a standard molecular surface representation and a dielectric constant of 10 for the protein.

### Calculations using an ensemble of backbone structures

The Alexov group applied two approaches to generate the predictions. Both were inspired by the understanding that ionization of a buried, non-paired group could induce significant conformational change. Their motivation stems from the same observation as made by Warwicker, namely that the X-ray structures of the mutants (if available) were most probably obtained under conditions where the group of interest is not ionized (depending on the pH of the crystallography experiment). The representative structure (or ensemble of structures) with the group of interest was generated either with MD simulations or *ab-initio*. With MD-generated structures, the most difficult residues to predict were found to be Lys residues with the side chain pointing directly into the hydrophobic core of the protein. The MD simulations, even with up to 2 ns of simulation time, were not successful in generating a conformational change leading to at least partial exposure of the ionized Lys side chain. On the other hand, the *ab-initio* approach failed for cases where the plausible structural changes were not localized within a particular structural segment.

### Explicit modeling of conformational changes through MD simulations

For the purpose of the blind predictions, Williams and co-workers utilized the constant pH MD (CpHMD) method of J. Mongan *et al*.^{104} For many of the predictions, the calculated and experimentally determined p*K*_{a} results were comparable, with good representations of titration curves. However, some cases were in greater error, and the blind study highlighted some areas of the method which could be improved.

The calculation of protein p*K*_{a} values as part of a blind study was found to be more challenging than retrospective calculation. For systems where the experimental p*K*_{a} values are available, it is considerably easier to perform CpHMD simulations, since simulation length (and hence convergence) and other method parameters can be judged against the known values. Williams and coworkers found that convergence, an issue previously highlighted for constant pH MD methods, made accurate blind p*K*_{a} prediction difficult for some residues in this study. For some of the calculations, convergence of the p*K*_{a} value was incorrectly indicated, or the value was shown to vary across multiple simulations. For some residues, especially those buried within the protein, strong interactions between neighboring residues persist for much of the simulation time, resulting in a low number of transitions between protonation states and, as a consequence, slow convergence. Therefore, simulations must generate long trajectories and start from multiple random seeds to help ensure that the p*K*_{a} obtained is reproducible and well converged. However, this process proved to be computationally expensive to carry out in a rigorous manner, especially for the numerous systems given as part of the blind study.
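
One simple way to check reproducibility across random seeds is to convert the deprotonated fraction from each independent run into a p*K*_{a} estimate and inspect the spread. The sketch below assumes a single non-interacting site obeying the Henderson-Hasselbalch relation at one simulated pH; the function names and the 0.5-unit tolerance are illustrative assumptions, not criteria used by the participants.

```python
import math
import statistics

def pka_from_fraction(ph, frac_deprotonated):
    """Henderson-Hasselbalch estimate of the pKa from one run at a given pH."""
    if not 0.0 < frac_deprotonated < 1.0:
        # No transitions were sampled, so this run carries no pKa information.
        raise ValueError("fraction must lie strictly between 0 and 1")
    return ph - math.log10(frac_deprotonated / (1.0 - frac_deprotonated))

def summarize_runs(ph, fractions, tolerance=0.5):
    """Return (mean pKa, sample standard deviation, reproducible?) over runs."""
    estimates = [pka_from_fraction(ph, f) for f in fractions]
    mean = statistics.mean(estimates)
    spread = statistics.stdev(estimates) if len(estimates) > 1 else float("inf")
    return mean, spread, spread <= tolerance
```

A run that never leaves its starting protonation state yields a fraction of exactly 0 or 1 and is rejected outright, which mirrors the situation described above for buried residues with very few transitions.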

Since the July 2009 Telluride meeting, Williams *et al.* have adapted the CpHMD method in an effort to improve conformational sampling and thus the convergence of p*K*_{a} values over simulations ^{106}. The CpHMD method has been coupled with the adapted Accelerated Molecular Dynamics (aMD) enhanced sampling method of de Oliveira *et al.* (described in reference ^{105}). This combined method (CpHaMD) employs aMD between the MC steps in place of the conventional MD used in the original CpHMD method. CpHaMD has been reported to improve the p*K*_{a} predictions for the well-known problematic residues of the commonly used HEWL benchmark system, and it will be further tested using the systems provided for the blind study. In addition to an increase in conformational sampling, part of the success of the method depends on the solvent model used, so any improvements made in this area would also increase the accuracy of the CpHMD method.

Shen and co-workers identified several areas for improvement. In CPHMD simulations with the GBSW implicit-solvent model ^{79}, underestimation of effective Born radii is the main reason for inaccuracies in the calculation of desolvation and interaction energies. The effective Born radii for buried atoms are too small because the overlapping region between van der Waals spheres that is inaccessible to water is not accounted for in the volume integration used to calculate the effective Born radii^{99}. As a result, the solvation energies for buried atoms are overestimated, while the Coulomb interactions between buried sites are dampened too much. For a buried ionizable side chain, the low dielectric environment favors the neutral state, while attractive electrostatic interactions with nearby groups stabilize the charged state. Underestimation of effective Born radii leads to smaller magnitudes of the p*K*_{a} shifts due to desolvation and due to attractive electrostatic interactions. However, because of the opposite signs, these two errors cancel each other, resulting in smaller errors in the predicted p*K*_{a}s for most interior groups, although it is not possible to predict this cancellation *a priori*.
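
Both effects of an underestimated Born radius can be seen in the widely used Still form of the generalized Born expression. The sketch below is a generic GB illustration and not the GBSW implementation; the radii, charges, and dielectric values in the usage note are hypothetical.

```python
import math

COULOMB_KCAL = 332.06  # kcal*Angstrom/(mol*e^2)

def f_gb(r, born_i, born_j):
    """Still's interpolation between the Born (r = 0) and Coulomb (large r) limits."""
    product = born_i * born_j
    return math.sqrt(r * r + product * math.exp(-r * r / (4.0 * product)))

def self_solvation(charge, born_radius, eps_in=1.0, eps_out=80.0):
    """GB self (solvation) energy of one charge, in kcal/mol."""
    scale = 1.0 / eps_in - 1.0 / eps_out
    return -0.5 * COULOMB_KCAL * scale * charge ** 2 / born_radius

def screened_pair(q_i, q_j, r, born_i, born_j, eps_in=1.0, eps_out=80.0):
    """Direct Coulomb term plus the GB cross term for one pair, in kcal/mol."""
    coulomb = COULOMB_KCAL * q_i * q_j / (eps_in * r)
    scale = 1.0 / eps_in - 1.0 / eps_out
    cross = -COULOMB_KCAL * scale * q_i * q_j / f_gb(r, born_i, born_j)
    return coulomb + cross
```

For a unit charge, shrinking the effective radius from 6 Å to 2 Å makes `self_solvation` three times more negative (too little desolvation), while `screened_pair(1.0, -1.0, 4.0, 2.0, 2.0)` is far weaker than the same pair with radii of 6 Å (too much dampening), so the two errors push the computed p*K*_{a} shift in opposite directions.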

Baptista and co-workers used their stochastic titration method to run constant-pH MD simulations of just one of the mutants in the dataset, given that the method is computationally quite demanding. However, because of parameter issues for Arg and Lys residues, the runs had to be discontinued. This was the first time that Arg residues were considered as titratable in this method, illustrating how an unusual dataset can help identify methodological issues.

Cui and co-workers reported encouraging findings for V66D, but also observed a number of limitations for their computational protocol for other cases. Analysis of the results indicated that the problem largely comes from the fact that in the exchange between λ-windows biased configurations are sampled in the low-λ windows. For example, the side chain of Asp66 becomes trapped in the solvent-exposed rotameric state even in the low-λ windows after exchanging with the high-λ (and overcharging) windows; this significantly underestimates the free energy derivatives in the low-λ windows, which leads to underestimated p*K*_{a} values. Therefore, it appears that the most serious challenges for sampling are for the intermediate λ windows. In this regard, the new GE-overcharging scheme is expected to be effective, especially, as discussed above, with its integration with ITS.

### Continuum methods from the microscopic description

The Mehler group's participation in the p*K*_{a}-cooperative resulted in a number of interesting cases. For example, the coordinate file for I72E contains two coordinate sets (A and B) for E72, which are sufficiently different to effect sizable changes in the local environment of E72. With the A coordinates, E72 is embedded in a weakly hydrophobic microenvironment, while the B coordinate set defines a strongly hydrophobic local environment. As a result, the p*K*_{a} value calculated from the A coordinates shifts upwards, but not enough, while the B coordinates shift the p*K*_{a} up too much. The relatively large change in local hydrophobicity is due to the difference in solvent-exposed surface area. Although this difference is not large, its effect on the local hydrophobicity is large because of the very strong hydrophilic character of water. Therefore, a relatively small change in solvent-exposed surface area has a large effect on the local environment, leading to large changes in p*K*_{a} values. It would be of interest to carry out MD simulations on these two systems to determine whether both structures converge to the same final p*K*_{a} value.

### Empirical models

Milletti used the MoKaBio program^{118}, which calculates p*K*_{a} values by using the p*K*_{a}s of ionizable groups that have an environment similar to that of the residue of interest, and found that predicting p*K*_{a} shifts caused by an environment not encoded in the training set is challenging. It was demonstrated that MoKaBio predictions were very successful for cases with a high similarity index, but because most of the mutations are introduced in hydrophobic local environments, many of them did not find high enough similarity in the training set to allow successful predictions. Moreover, similarity is probably not the only determinant affecting p*K*_{a} prediction.

The Jensen lab used PROPKA on the Telluride data set and found that their results were of the same quality as those of other groups. Like many other groups, they found most of the difficulties to be due to the significant structural rearrangement that can be expected upon embedding a charge in a more or less hydrophobic local environment buried in the protein, e.g. in the mutants V39E and F34E. Another problem was related to predicting a reasonable averaged structure for the mutants where an X-ray structure was not available. Since PROPKA in its most common guise is an average-structure approach, it relies on being able to include structural reorganization through its parameterized effective potentials. As expected, PROPKA was found to have problems with predicted geometries, especially for mutations where the size of the mutant residue is significantly different from that of the WT residue, e.g. G20K and A90E. These two types of mutations may also destabilize the protein and make it more prone to partial unfolding, water penetration, and large structural changes to accommodate the new residue in predominantly its ionized form. Thus, the data set provided a good indication of how well the implicit structural reorganization works. PROPKA has been parameterized against p*K*_{a} values for which the desolvation and electrostatic contributions are more or less in balance, which is not the case in hydrophobic local environments; accordingly, the blind predictions showed that the desolvation model had been over-simplified.

## FUTURE DIRECTIONS AND IMPROVEMENTS

### PB methods

A major problem emerging from the Telluride meeting is the way the models address the molecular reorganization in response to ionization or deionization of the titratable residue. Most of the PB methods utilize either a rigid protein structure or allow for side-chain and hydrogen flexibility only. In this way, the corresponding model addresses the reorganization in a particularly crude way, generally representing the protein as a uniform dielectric medium, and the best results were obtained using ε_{p}=8–10, although some large shifts are poorly reproduced. However, the response of a protein to a charge modification in its interior is certainly inhomogeneous. Different regions respond differently, both structurally and dielectrically, as was demonstrated in the case of the reaction center protein^{66}. The discussion led by Nathan Baker pointed out another, frequently overlooked problem, namely that there are many sets of parameters representing the radii and partial charges, and the results may depend on the choice of force field parameters (see for example ^{120}). Another issue is the representation of the dielectric boundary between the protein and the water phase, which is treated as either a sharp or a smooth boundary. Using a smooth boundary allows the high dielectric of water to permeate the protein interior to some extent, and thus effectively reduces the desolvation cost. In terms of the resulting dielectric map, such an approach is related to the reduced probe radius (zero probe radius) proposed by Zhou to determine the molecular surface (see for example ^{120–122}).

### MD-based methods and methods utilizing alternative backbone structures

The need to choose a dielectric constant that best substitutes for conformational changes should essentially vanish when all conformational reorganization is explicitly taken into account. Two approaches have emerged: *(a)* making predictions using alternative backbone structures taken either from alternative PDB files or generated *in silico* by some means, then using these alternative structures in independent, standard PB p*K*_{a} calculations and applying an averaging scheme to calculate the p*K*_{a}, as done by a number of researchers in the past; *(b)* generating the alternative backbone conformations using the same procedure (MD-based or FEP) that calculates the p*K*_{a}s. The second approach is much more physically sound.

The advantage of the first approach is that it generates representative structures for the charged and uncharged forms of the titratable group, and the results do not depend on the conformational path. Only the final structures are needed, so they can be generated *ab-initio* or taken from PDB files (if any) of structures crystallized under different conditions (pH, for example). Specifically, the *ab-initio* partial structural remodeling (the hybrid-p*K*_{a} method used in Alexov’s lab) has the advantage of quickly generating alternative backbone structures without being sensitive to large potential barriers separating alternative conformations. On the downside, such approaches need to make approximations to arrive at the final p*K*_{a} predictions.

The explicit approaches (constant pH MD-based or FEP) are physically more sound and make fewer assumptions. The MD-based methods search conformation space with periodic sampling of protonation states using MC simulation. The main differences between these methods lie in their choice of solvent model and protocol for updating the protonation states. However, convergence can be a problem for MD-based methods. Some structural relaxations may require simulations longer than several ns, or may simply be inaccessible with standard MD simulations. Enhanced sampling techniques such as replica exchange ^{99} and accelerated MD ^{106} have been employed to overcome such limitations. In addition to the sampling issue, some of the constant pH MD methods employ implicit solvation, which may limit the accuracy of p*K*_{a} predictions due to deficiencies of the solvent model in calculating electrostatic energies and sampling conformational states. Improvements to the solvent models and/or incorporation of explicit-solvent sampling would surely increase the accuracy of these methods.

### Continuum methods from the microscopic description

Unlike many p*K*_{a} programs, the MM-SCP approach of Mehler and co-workers allows the user to adjust several parameters. These include some control over the iterative process to help ensure rapid convergence. Another parameter allows damping of the electrostatic interactions below an input threshold distance. The purpose of this parameter is to partially account for cases where interatomic distances are too small. Both the threshold distance and the damping factor can be adjusted. In their use of the program, they have found that with some experience the appropriate values of these control parameters can be estimated. Nevertheless, default values have been provided for all adjustable input parameters. A recent analysis of the method using a database derived from 59 proteins has shown that the calculation of p*K*_{a} values of histidine is the most problematic, with the largest percentage of residues in error by > 1 p*K*_{a} unit.

### Empirical methods

The empirical methods are fast, and it was found that they typically do not make large errors. This makes them ideal for quick and large-scale p*K*_{a} calculations to get an overview of up-shifted or interesting p*K*_{a} values that might be of biological importance, e.g. the two catalytic residues in lysozyme. They are unlikely to predict large p*K*_{a} shifts and thus perform very well on a dataset composed of slightly perturbed p*K*_{a}s, but they will probably not pinpoint the value of “difficult” residues (large p*K*_{a} shifts) in an extreme environment. In practice, this also means that they are less sensitive, since they have been parameterized against predominantly near-surface residues. The most straightforward way to improve their performance in this context is to enlarge the training dataset with a diverse set of residues that includes significantly shifted p*K*_{a} values. In the case of PROPKA, participation in the p*K*_{a}-cooperative has already initiated such efforts and has already resulted in a better description of the energy terms. The biggest obstacle at this point is shared with most methods discussed here, namely how to deal with large structural reorganization (partial unfolding and water penetration). Even though it is easy to conceive approaches to include this, e.g. with MD, MC, or rotamer sampling, doing so would come at the expense of the strengths of these methods: computational speed and usability. The future of empirical p*K*_{a} predictors probably lies in practical use within the much larger domain of non-extreme residues and as a screening tool for more advanced methods. In the case of MoKaBio, more representative cases with known p*K*_{a}s will be included, which should result in a better similarity index (SI) and thus in more reliable predictions.

## CONCLUSIONS

The p*K*_{a}-cooperative inspired 12 groups to make blind predictions for 77 experimentally determined p*K*_{a}s. Thanks to the efforts of the Garcia-Moreno group ^{2–8,119,123–127}, a large benchmark of experimental p*K*_{a} values and, in some cases, experimentally determined X-ray structures paved the way for broad-range blind testing of a variety of methods built on different physical platforms. The most striking result of this blind test was that nobody performed significantly better than the rest of the participants. Each method had successful and unsuccessful predictions, indicating that all methods had problems with their underlying physics, with different problems in different methods. Much of the meeting was dedicated to discussing the reasons for this failure, and several potential reasons were pointed out, as outlined above. Overall, the meeting was a great opportunity to discuss the problems of the methods frankly, which is invariably more enlightening and more productive than discussing achievements.

The presentations and discussions on calculating p*K*_{a}s in proteins showed a steady, albeit somewhat slower than desired, improvement in accuracy. It therefore does not seem unreasonable to expect further progress during the next few years as methods are refined and new algorithms are proposed. If this progress is to make an impact on the biophysics community, and subsequently on the larger community of biologists, it will be necessary to become cognizant of how acid/base equilibria affect biological systems. In particular, because p*K*_{a}s are logarithmic quantities, a shift of one p*K*_{a} unit implies a ten-fold change in concentration, and given the tight control of pH in most biological systems, it is clear that the change in proton concentration implied by a shift of one p*K*_{a} unit will not be tolerated by most body compartments. Thus it seems that the initial goal to strive for is to be able to predict p*K*_{a}s with errors < 1. Unfortunately, this means that our favorite indicator, RMSD, is of little use, since an RMSD of 0.3 does not guarantee that all p*K*_{a}s of a system are predicted within one p*K*_{a} unit of their actual value (at least on the average). Fortunately, there are many cases involving biological systems where the p*K*_{a} values do not have to be known to high accuracy. Instead, what is required to rationalize a biological process is knowledge of the protonation state under a given set of experimental conditions, as has been shown in a recent publication. ^{128}
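The two quantitative points in this paragraph can be illustrated with a minimal sketch. It assumes the standard Henderson-Hasselbalch relation for a single, non-interacting titratable site, and it uses a purely hypothetical set of prediction errors (19 exact predictions and one that is off by 1.3 units). The function names and the numbers are illustrative assumptions and are not taken from any participant's method or from the blind-prediction results.

```python
import math


def fraction_protonated(pka: float, ph: float) -> float:
    """Fraction of a single, non-interacting site that is protonated at a
    given pH, from the Henderson-Hasselbalch relation (an idealization)."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def rmsd(errors: list[float]) -> float:
    """Root-mean-square deviation of predicted-minus-experimental errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def max_abs_error(errors: list[float]) -> float:
    """Largest absolute error in the set."""
    return max(abs(e) for e in errors)


# Hypothetical errors (in pKa units): 19 perfect predictions, one outlier.
errors = [0.0] * 19 + [1.3]

print(f"RMSD          = {rmsd(errors):.3f}")
print(f"max |error|   = {max_abs_error(errors):.3f}")

# Effect of a one-unit pKa error on the predicted protonation state at pH 7.
for pka in (6.0, 7.0, 8.0):
    print(f"pKa {pka:.1f}: fraction protonated at pH 7 = "
          f"{fraction_protonated(pka, 7.0):.3f}")
```

In this made-up example the RMSD stays below 0.3 even though one site is mispredicted by more than one unit, which is why a per-site maximum error (or the fraction of sites with error > 1) is a more direct check on the stated goal. The loop also shows that, for a site titrating near the working pH, a one-unit error changes the predicted population by roughly an order of magnitude, whereas for a site whose p*K*_{a} is far from the working pH the assigned protonation state is much less sensitive to the error.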

## Acknowledgments

EA and ELM thank the consortium for helpful discussions, and the blind-prediction participants for their thoughtful and open contributions. Some parts of the sections “OVERVIEW OF METHODS FOR CALCULATING p*K*_{a}s IN PROTEINS” and “INSIGHTS AND DIFFICULTIES ENCOUNTERED BY p*K*_{a}-COOPERATIVE PARTICIPANTS” are based on their contributions. Any incorrect statements made in these sections (or any other) are entirely the responsibility of EA and ELM. Finally, the support of grants NIGMS R01GM093937 and NIH R03LM009748 (EA) and R01 DA015170 (ELM) is gratefully acknowledged.

## References
