By varying the availability of protons, i.e. the acidity of the media, the balance of the equilibrium can be shifted. This provides a measure of the ease of proton disassociation of a site in a compound, the disassociation (or ionization) constant pKa, defined by the equation:
Alternatively, the pKa of a site can be thought of the pH at which the protonated and deprotonated fractions are equal. If the pH is higher than the pKa, the site is mostly deprotonated, and if the pH is lower than the pKa, the site is mostly protonated.
To take an example, consider the drug aspirin (acetyl salicylic acid). The structure of aspirin as given in the Merck Index, WDI or Medchem is shown on the left, CC(=O)Oc1ccccc1C(=O)O. However, the pKa of the carboxylic acid group in aspirin is 3.49, which means that under nearly all physiological conditions, this group is almost entirely deprotonated. Hence, it is the physical properties of the molecule on the right, CC(=O)Oc1ccccc1C(=O)[O-], that is of most relevant to computational chemists.
Tissue Fluid | pH |
Aqueous humor | 7.21 |
Blood (arterial) | 7.40 |
Blood (venous) | 7.39 |
Blood (maternal umbilical) | 7.25 |
Cerebrospinal fluid | 7.35 |
Duodenum | 5.5 |
Intenstine (microsurface) | 5.3 |
Lacrimal fluid (tears) | 7.4 |
Milk (breast) | 7.0 |
Muscle (skeletal) | 6.0 |
Nasal secretions | 6.0 |
Saliva | 6.4 |
Stomach | 1.0-3.5 |
Sweat | 5.4 |
Urine (female) | 5.8 |
Urine (male) | 5.7 |
The fact that only the physiological ionization state is required greatly simplifies and increases the applicability of pKa prediction. There is little to be gained from knowing the pKa of an ionizable group if it is significantly greater than or less than pH 7.3. Similarly, a pKa that is close to 7.3, implies that a significant fraction of the molecule may be either protonated or deprotonated, and analysis should be applied to both forms.
Consider a docking calculation of the following molecule NSC577 from NCI95. It is unlikely that there are any accurate experimental values or Hammett equations of the effects of the aromatic ring systems on the ionizable iodine, but it is clear that the site is deprotonated (by about 16 log units!).
The currently implemented algorithm to determine physiological ionization is built upon a pKa predictor. The algorithm can start from either the input or neutral form of the molecule. The method then iteratively either protonates the most basic site or deprotonates the most acidic site until there are no remaining "acid" or "base" sites for the specified pH, or until a loop is encoutered. The pKa's of all sites must be recalculated on each iteration, so that pKa reflects the ionization state of the rest of the molecule. One pleasant feature of this approach is that it can determine whether a neutral molecule is predominantly zwitterionic.
The table below contains a number of atom types and associated values used in an atom type based predictor. The pKa is given for the both the conjugate acid and base. For convenience, the SMARTS-like constraint [!H0] is ommitted from all entries in the "Acid" column. The rows are grouped by element, and the first matching type is used. In the table below, no more than two shells of an atoms neighbour need be inspected to determine an approximate pKa.
Conjugate Acid | Value | Conjugate Base | Model Compound |
F | 3.18 | [F-] | HF |
Cl | -6.50 | [Cl-] | HCl |
Br | -8.50 | [Br-] | HBr |
I | -9.00 | [I-] | HI |
c | 43.00 | [c-] | c1ccccc1 |
C#N | 9.30 | [C-]#N | C#N |
C | 50.00 | [C-] | CH4 |
[nH] | 16.50 | [n-] | [nH]1cccc1 |
N=N=* | -99.99 | [N-]=N=* | ??? |
NC=O | 22.00 | [N-]C=O | |
NS(=O)=O | 10.10 | [N-]S(=O)=O | ??? |
N | 32.50 | [N-] | NH3 |
??? | -3.80 | [nH] | [nH]1cccc1 |
[nH+] | 5.23 | n | n1ccccc1 |
[NH+]#* | -12.00 | N#* | RC#N |
[NH+]=C(N)N | 14.4 | N=C(N)N | NC(=N)N |
[NH+]=C(N)a | 11.6 | N=C(N)a | N=C(N)c1ccccc1 |
[NH+]=CN | 12.4 | N=CN | CC(=N)N |
[NH+]=* | -99.99 | N=* | ??? |
[NH+]a | 4.69 | Na | Nc1ccccc1 |
[NH+]C=O | 4.74 | NC=O | ??? |
[NH+]C=N | -99.99 | NC=N | ??? |
[NH+]S(=O)=O | -99.99 | NS(=O)=O | ??? |
[NH+] | ~10.5 | N | CN(C)C |
[NH2+] | ~11.1 | CNC | |
[NH3+] | 10.6 | N | CN |
[NH4+] | 9.25 | N | NH4+ |
[OH2] | 15.70 | [OH-] | H2O |
Oc | 10.00 | [O-]a | Oc1ccccc1 |
OC(=O)[O-] | 10.33 | [O-]C(=O)[O-] | CO(OH)2 |
OC(=O)a | 4.20 | [O-]C(=O)a | OC(=O)c1ccccc1 |
OC=O | 4.80 | [O-]C=O | OC(=O)C |
OC | 15.50 | [O-]C | CH3OH |
ON(=O)=O | -1.40 | [O-]N(=O)=O | HNO3 |
O[N+]=O | -12.00 | [O-][N+]=O | CN(=O)=O |
ONC=O | 9.40 | [O-]NC=O | ONC(=O)C |
ON=* | 12.34 | [O-]N=* | ON=CC |
ON(*)* | 5.2 | [O-]N(*)* | ON(C)C |
ON | 5.96 | [O-]N | ONC |
OP([O-])([O-])=O | 12.50 | [O-]P([O-])([O-])=O | H3PO4 |
OP([O-])=O | 6.70 | [O-]P([O-])=O | H3PO4 |
OP=O | 2.00 | [O-]P=O | H3PO4 |
OP[O-] | 99.99 | [O-]P[O-] | ??? |
OPa | 2.10 | [O-]Pa | OP(O)c1ccccc1 |
OP | 3.08 | [O-]P | CP(O)O |
OS(=O)(=O)[O-] | 2.0 | [O-]S(=O)(=O)[O-] | H2SO4 |
OS(=O)(=O) | -99.99 | [O-]S(=O)(=O) | H2SO4 |
OS(=O)[O-] | 7.20 | [O-]S(=O)[O-] | H2SO3 |
OS(=O) | 1.80 | [O-]S(=O) | H2SO3 |
O | 99.99 | [O-] | ??? |
[OH+] | -1.74 | O | H3O+ |
P | 29.00 | [P-] | PH3 |
[P+] | -13.00 | P | PH4+ |
S*=O | 3.52 | [S-]*=O | CH3C(=O)S |
Sa | 6.52 | [S-]a | Sc1ccccc1 |
[SH2] | 7.00 | [SH-] | SH2 |
S | 12.00 | [S-] | CH3SH |
[SH+] | -7.00 | S | CH3SH2+ |
The very generic nature of the atom types in the above table allow this approach to be applied rapidly to large datasets of diverse compounds.
This leads to the Hammett Equation for pKa:
Where Sigma is a constant assigned to a particular substituent, Rho is a constant assigned to the particular acid or base functional group and pKa0 is the pKa of the unsubstituted acid or base. For benzoic acids, pKa0 is 4.20 and Rho is defined to be 1.0.
In the original formulation, two constants need to assigned to each substituent, Sigmameta for meta-substitutions and Sigmapara for para-substitutions. This was soon extended to aliphatic systems by Taft, by introducing Sigmastar (also written Sigma*). Currently over 40 forms of sigma constant have been defined, but many of these corrolate extremely well with each other.
As a worked example, consider the pKa prediction of the compound shown below, 4-chloro-3,5-dimethylphenol.
The Hammett equation for phenol has pKa0 = 9.92 and Rho = 2.23. The Sigmameta for -CH3 is -0.06 and the Sigmapara is 0.24. Hence the predicted value of the pKa is 9.92 - 2.23*(0.24-0.06-0.06) = 9.70. This compares extremely well with the experimental value of 9.71.
Currently, a database of over 170 Hammett/Taft equations has been encoded as SMILES/SMARTS (including chirality) from the compilation given in Perrin and Serjeant's "silver book". Similarly, a database of sigma constants for several thousand functional groups has been compiled by Corwin Hansch and Al Leo, a early TDT version of which has been placed in Daylight "contrib" by John Bradshaw.
A major benefit of Hammet/Taft equations is their ability to handle special cases.
One common approach, is to extend the set of known substituents with sigma transmission equations. For example, Sigmastar of -CH2-R can be estimated as 0.41 * the Sigmastar for R. Similarly, -CH=CH- has transmission coefficient 0.51 and -C6H4- has coefficient 0.30. Similar schemes include the Exner-Fiedler method for aliphatic cycles, and the Dewar-Grisdale method for polyaromatic systems. However, this approach cannot be used when a terminal group in unparameterized.
A second approach is to use molecular orbital methods to estimate sigma values when there is no experimental data. Using the strong corrolation between sigmameta and sigmapara and charges calculated with MOPAC's AM1 hamiltonian, Peter Ertl of Novartis has been able to calculate sigma constants for over 80,000 organic functional groups. Fortunately, as described in my MUG97 talk, "Efficient Protein and Nucleic Acid Perception", there exist efficient algorithms for pattern matching large numbers of functional groups in constant time.
If the neutral state of the protein is taken as the reference state, its electrostatic energy in a given ionization state is:
where xi is 1 when the group i is ionized, and 0 when it is neutral; is +1 for bases, and 1 for acids; pKaintrinsic,i is the intrinsic pKa of group i; is the absolute value of the interaction energy of groups i and j; M is the number of ionizable groups in the protein; R is the gas constant; and T is absolute temperature.
The intrinsic pKa of group i is given by:
where is the difference between the free energy change of ionization for group i in the otherwise unionized protein and in a model compound of pKamodel,i. The intrinsic pKa thus represents the pKa of the group in the protein with all other titratable amino acids in the neutral state. Given values for and for all groups, the above equation provides the basis for the computation of pH-dependent properties of protein-ligand complexes.
Traditionally, this technique has only been used for determining the protonation states of amino acid sidechains in proetins, where the values of pKamodel,i are simply looked up in a small table. However, small molecule pKa prediction methods such as those described above could easily be incorporated, not only allowing the protein the affect the ionization state of the ligand but potentially the ligand affecting the ionization state of the protein.
The results show an impressive agreement using a very simple atom type based method. The R2 value is 0.800, and a standard deviation of 0.95. Far more accurate figures on this benchmark can be achieved using Hammett & Taft equations, where ACD Labs' ACD/pKa obtains an R2 value is 0.978, and a standard deviation of 0.39.
The following table shows the input and output structures of the physiological state algorithm above on the ten ligands used in Metaphorics' DockIt crunch.
PDB Code | Before | After |
1HYT (-2) | ||
1RBP (0) | ||
1HPV (0) | ||
1SWD (-1) | ||
1ADD (0) | ||
2CBR (-1) | ||
3CLA (0) | ||
1RDS (-1) | ||
4HVP (+2) | ||
1ELA (+1) |