Protein Engineering, Vol. 13, No. 2, 89-98,
February 2000
© 2000 Oxford University Press
Analysis and prediction of carbohydrate binding sites
1 Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London, WC1E 6BT and Department of Crystallography, Birkbeck College, Malet Street, London, WC1 7HX, UK
| Abstract |
|---|
An analysis of the characteristic properties of sugar binding sites was performed on a set of 19 sugar binding proteins. For each site six parameters were evaluated: solvation potential, residue propensity, hydrophobicity, planarity, protrusion and relative accessible surface area. Three of the parameters were found to distinguish the observed sugar binding sites from the other surface patches. These parameters were then used to calculate the probability for a surface patch to be a carbohydrate binding site. The prediction was optimized on a set of 19 non-homologous carbohydrate binding structures and a test prediction was carried out on a set of 40 protein–carbohydrate complexes. The overall accuracy of prediction achieved was 65%. Results were in general better for carbohydrate-binding enzymes, with a success rate of 87%, than for the lectins.
Keywords: prediction/protein–sugar complex/protein–sugar interactions/surface patch
| Introduction |
|---|
The recognition of the importance of carbohydrate binding proteins in biology has been steadily increasing over the past two decades. Carbohydrates are the main source of energy in living cells and have to be synthesized, transported and degraded. Moreover, protein–carbohydrate interactions are involved in cell recognition and adhesion. A great variety of proteins, with very different functions and topologies, are involved in carbohydrate recognition, including enzymes, periplasmic receptors, antibodies and lectins. A number of reviews on protein–carbohydrate interactions have been published (Quiocho, 1989
| Materials and methods |
|---|
Datasets
A non-homologous dataset of protein–carbohydrate complexes was selected from the PDB release of January 1995 for the analysis of protein–carbohydrate interactions. The criteria used to assign homology are as follows: proteins showing over 30% sequence identity are assigned to the same homologous family. From each family obtained, a representative complex is chosen so that it contains, when available, a native protein, represents a naturally occurring complex and has the highest resolution. The selected entries are then compared using the structural alignment program SSAP (Taylor and Orengo, 1989). The calculated SSAP scores give the degree of structural similarity of each pair, 100 indicating structural identity and zero being the lowest similarity value. Proteins whose SSAP score is greater than 80 are considered related, hence some sequence families are combined. From each of the final non-homologous families of proteins, a representative is again chosen as described above. This set of proteins, referred to as dataset I, was employed in the optimization of the prediction parameters.

Two additional sets of protein–carbohydrate complexes were used to test the performance of the prediction algorithm. They were both selected from the PDB updated to March 1997 and were named Test set I and Test set II. The first, smaller dataset comprised all new sugar binding proteins which show no homology to any protein of dataset I nor to each other. Test set II contained all other protein–carbohydrate complexes which are homologous to structures either in the original dataset (dataset I) or Test set I.
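The two-stage selection described above (sequence families first, then merging by structural similarity) can be illustrated with a minimal sketch. It assumes single-linkage grouping and reduces the representative criteria to "best resolution"; the function names, the toy identities and the toy SSAP scores are invented for illustration and are not taken from the paper.

```python
from itertools import combinations


def single_linkage(items, related):
    """Group items into families: two items share a family when a chain of
    pairwise `related` links connects them (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if related(a, b):
            parent[find(a)] = find(b)

    families = {}
    for x in items:
        families.setdefault(find(x), []).append(x)
    return list(families.values())


def pick_representative(family, resolution):
    """Simplified stand-in for the selection criteria: keep the entry
    with the best (numerically lowest) resolution."""
    return min(family, key=lambda entry: resolution[entry])


def non_homologous_set(resolution, seq_identity, ssap_score,
                       identity_cutoff=30.0, ssap_cutoff=80.0):
    # Stage 1: families by sequence identity above the cutoff.
    seq_families = single_linkage(
        list(resolution), lambda a, b: seq_identity(a, b) > identity_cutoff)
    reps = [pick_representative(f, resolution) for f in seq_families]
    # Stage 2: merge representatives whose SSAP score exceeds the cutoff.
    merged = single_linkage(
        reps, lambda a, b: ssap_score(a, b) > ssap_cutoff)
    return sorted(pick_representative(f, resolution) for f in merged)


# Toy data, invented for illustration only.
resolution = {"A": 1.8, "B": 2.5, "C": 2.0, "D": 1.5}
identity = {frozenset("AB"): 45.0}
ssap = {frozenset("AC"): 85.0}

representatives = non_homologous_set(
    resolution,
    lambda a, b: identity.get(frozenset((a, b)), 10.0),
    lambda a, b: ssap.get(frozenset((a, b)), 50.0),
)
print(representatives)
```

In this toy case A and B fall into one sequence family, the representative A is then merged with C on structural grounds, and D remains on its own.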
Patch analysis
A surface patch on a protein is defined as N neighbouring solvent accessible amino acid residues surrounding a central exposed residue (Jones and Thornton, 1997a), where N is the number of residues comprising the actual sugar binding site on the protein. The neighbouring residues are determined by their Cα positions. An amino acid is defined as a surface residue if more than 1% of its accessible surface area is exposed to the solvent. An observed carbohydrate binding site patch is formed by those surface residues whose accessible surface area decreases by more than 1 Å² after the binding of the ligand. All possible overlapping surface patches have been determined for each structure (i.e. one patch for each surface amino acid residue) with the program PATCH (Jones and Thornton, 1997a). For each patch, six parameters are calculated: solvation potential, residue sugar interface propensity, hydrophobicity, planarity, protrusion index and relative accessible surface area (ASA) (Jones and Thornton, 1997a).
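As a rough illustration of the patch definition, the sketch below builds one patch per surface residue from Cα coordinates. It is not the PATCH program: it assumes that a patch of size N consists of the central residue plus its N-1 nearest surface neighbours by Cα distance, and the residue labels, coordinates and accessibilities are made up.

```python
import math


def build_patch(center, ca_coords, surface, n):
    """Central residue plus its nearest surface neighbours (by C-alpha
    distance), giving n residues in total."""
    origin = ca_coords[center]
    neighbours = sorted(
        (res for res in surface if res != center),
        key=lambda res: math.dist(origin, ca_coords[res]),
    )
    return [center] + neighbours[: n - 1]


def all_patches(ca_coords, rel_asa, n, min_rel_asa=1.0):
    """One patch per surface residue; a residue counts as surface when its
    relative accessibility exceeds min_rel_asa (in percent)."""
    surface = [res for res, asa in rel_asa.items() if asa > min_rel_asa]
    return {c: build_patch(c, ca_coords, surface, n) for c in surface}


# Toy coordinates (Å) and relative accessibilities (%), for illustration only.
ca_coords = {
    "r1": (0.0, 0.0, 0.0),
    "r2": (1.0, 0.0, 0.0),
    "r3": (2.0, 0.0, 0.0),
    "r4": (3.0, 0.0, 0.0),
    "r5": (10.0, 0.0, 0.0),
}
rel_asa = {"r1": 20.0, "r2": 0.5, "r3": 30.0, "r4": 40.0, "r5": 50.0}

patches = all_patches(ca_coords, rel_asa, n=3)
print(patches["r3"])
```

Residue r2 is buried (0.5% accessible), so it neither seeds a patch nor appears in one.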
Solvation potential
The solvation potential is a knowledge-based measure of the propensity of an amino acid residue to have a certain degree of solvation in the protein (Jones et al., 1992). These potentials, evaluated at points across the full accessibility range from 0 to 100%, were derived from a large dataset of non-homologous proteins. The solvation potential of a given residue in a protein depends on the relative surface exposure of that residue. The solvation potential of a surface residue is given by the difference between the solvation potential value associated with its exposed ASA and the solvation potential corresponding to a residue of the same type with zero ASA:

$$\Delta SP_i = SP_i(\mathrm{ASA}_i) - SP_i(0) \qquad (1)$$
The solvation potential for the complete 'patch' is the mean solvation potential of the amino acid residues comprising the patch. The more positive the solvation potential, the higher the propensity for burial.
Interface propensity
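This parameter was adapted to the sugar binding analysis by using the propensity of each amino acid residue to be in the interface with the sugar. The propensity quotient P(i) of a surface residue of type i is defined as the ratio between its frequency at the interface and the 'average' frequency of any amino acid at the interface, i.e. a residue favours the interface region if it is found there more frequently than average, that is

$$P(i) = \frac{n_{\mathrm{int}}(i) / n_{\mathrm{surf}}(i)}{N_{\mathrm{int}} / N_{\mathrm{surf}}} \qquad (2)$$

where $n_{\mathrm{int}}(i)$ and $n_{\mathrm{surf}}(i)$ are the numbers of residues of type i at the interface and on the surface, and $N_{\mathrm{int}}$ and $N_{\mathrm{surf}}$ are the corresponding totals over all residue types.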
Values above 1 indicate a propensity for being in contact with a sugar ligand. For an easier evaluation of the results the above equation was linearized by taking its natural logarithm. In this case, positive values indicate a propensity to be involved in carbohydrate binding, while negative values indicate a dislike of the interface. A list of the sugar interface propensities for the 20 common amino acid residues is given in Table I.
The propensity data were determined so that, in each case, the protein analysed was excluded from the set used in the residue propensity calculations (jack-knife procedure). This ensures that the structure tested does not contribute to the residue propensities used in the analysis. The patch interface propensity is the mean propensity value over the N residues forming the patch. The higher the propensity value, the greater the preference of the residues in the patch for a sugar interface.
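The jack-knife calculation can be sketched as follows. The sketch assumes that the propensity of residue type i is its interface-to-surface count ratio divided by the same ratio over all residue types, which is one reading of the definition in the text; the protein names and residue lists are invented. Residue types never seen at an interface are simply left out of the table, since their logarithm is undefined.

```python
import math
from collections import Counter


def log_propensities(proteins, exclude):
    """Natural-log interface propensities computed without `exclude`.

    proteins maps a name to (surface_residue_types, interface_residue_types).
    """
    surf, inter = Counter(), Counter()
    for name, (surface, interface) in proteins.items():
        if name == exclude:
            continue  # jack-knife: the tested protein contributes nothing
        surf.update(surface)
        inter.update(interface)
    n_surf, n_int = sum(surf.values()), sum(inter.values())
    return {
        aa: math.log((inter[aa] / surf[aa]) / (n_int / n_surf))
        for aa in surf
        if inter[aa] > 0
    }


def patch_propensity(residue_types, table):
    """Mean log propensity over the residues forming a patch."""
    return sum(table[aa] for aa in residue_types) / len(residue_types)


# Toy data, invented for illustration only.
proteins = {
    "p1": (["W", "W", "K", "K"], ["W", "W"]),
    "p2": (["W", "K", "K", "K"], ["W", "K"]),
    "p3": (["K", "K", "K", "K"], ["K"]),
}
table = log_propensities(proteins, exclude="p3")
print(table)
print(patch_propensity(["W", "K"], table))
```

With p3 left out, W is found at the interface twice as often as the average residue (log propensity ln 2), whereas K is under-represented and receives a negative value.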
Hydrophobicity
The patch hydrophobicity is the mean hydrophobicity associated with a given surface patch. The hydrophobicity scale used in the calculation is the Fauchère and Pliska scale (Fauchère and Pliska, 1983). The larger the hydrophobicity parameter, the higher the hydrophobicity of the patch residues.
Planarity
The planarity of a surface patch is calculated as the root mean square deviation (r.m.s.d.) of all patch atoms from the best fit plane through the patch. The higher the r.m.s.d. value, the less planar the patch.
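A minimal sketch of the planarity measure, assuming an orthogonal least-squares plane through the atom centroid: the sum of squared distances to that plane equals the smallest eigenvalue of the 3×3 scatter matrix of the centred coordinates, which is obtained here with the standard closed-form solution for symmetric 3×3 matrices. The function name and test coordinates are illustrative only.

```python
import math


def planarity_rmsd(coords):
    """R.m.s.d. of points from their orthogonal least-squares plane."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    d = [[p[k] - centroid[k] for k in range(3)] for p in coords]
    # Symmetric scatter matrix: sum of outer products of centred coordinates.
    a = [[sum(v[i] * v[j] for v in d) for j in range(3)] for i in range(3)]
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:
        # Already diagonal: the eigenvalues are the diagonal entries.
        smallest = min(a[0][0], a[1][1], a[2][2])
    else:
        q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
        p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
        p = math.sqrt(p2 / 6.0)
        b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
             for i in range(3)]
        det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
                 - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
                 + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
        r = max(-1.0, min(1.0, det_b / 2.0))  # guard against rounding
        phi = math.acos(r) / 3.0
        smallest = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    # Sum of squared plane distances equals the smallest eigenvalue.
    return math.sqrt(max(smallest, 0.0) / n)


# Four atoms lying exactly in the tilted plane z = x.
flat = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 1, 1)]
# Eight atoms at the corners of a 4 x 4 x 2 box: each is 1 Å from the mid-plane.
box = [(x, y, z) for x in (0, 4) for y in (0, 4) for z in (-1, 1)]
print(planarity_rmsd(flat), planarity_rmsd(box))
```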
Protrusion
The protrusion index gives an indication of the patch protrusion from the surface of a protein. Residue protrusion indexes are calculated by fitting an equimomental ellipsoid to the protein and calculating the relative location of each residue in a series of concentric shells (Thornton et al., 1986). The patch protrusion index is the mean index over the patch residues. The higher the value, the more protruding the patch.
Accessibility
The patch accessibility is the average relative ASA value over the patch residues. The higher the ASA parameter, the more accessible the patch.

For each parameter, the score of the observed binding site on a given protein is calculated, together with those of all other surface patches of that same protein. The values are divided into 10 equal intervals and ranked on a scale of 1 to 10, with 1 containing the highest scores and 10 the lowest. This procedure is repeated for each protein–sugar complex in the dataset.
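The ranking step can be sketched as below, assuming that "10 equal intervals" means equal-width bins spanning the observed range of a parameter on one protein. How values that fall exactly on a bin boundary are assigned is not stated in the text, so the choice made here is arbitrary.

```python
def rank_bin(value, all_values, n_bins=10):
    """Rank from 1 (highest-scoring interval) to n_bins (lowest)."""
    lo, hi = min(all_values), max(all_values)
    if hi == lo:
        return 1  # degenerate case: every patch has the same score
    width = (hi - lo) / n_bins
    index_from_top = int((hi - value) / width)
    # The minimum value would land in a bin of its own; fold it into the last.
    return min(index_from_top, n_bins - 1) + 1


# Toy scores for the patches of one protein, for illustration only.
scores = list(range(0, 101))
print(rank_bin(100, scores), rank_bin(55, scores), rank_bin(0, scores))
```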
Patch prediction
The prediction algorithm employed is based on a comparison of the parameters described above, obtained for the calculated patches on a protein. Each of the six parameter ranges is normalized to a scale of 0 to 100, and is used to calculate the probability Pj of each patch to be a sugar binding site. The probability Pj, or combined score, is calculated from the individual scores as follows
[Equation 3]
The number of parameters to be included in the calculation can be chosen according to the results obtained in the patch analysis previously performed. It is also possible to choose how a certain individual score should contribute to the combined score, i.e. whether a high value of a parameter should be considered favourable or unfavourable in the prediction. For example, if the characteristics required are high residue propensity, low accessibility and low protrusion index, eqn 3 becomes

[Equation 4]
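The exact form of the combined score is not reproduced in this text, so the following is only one plausible reading: each selected parameter is min-max normalized to 0-100 across the patches of a protein, scores for which a low value is favourable are inverted as 100 minus the score, and the combined score is the plain mean. The function names, the equal weighting and the toy values are all assumptions.

```python
def normalise(values):
    """Min-max scale raw parameter values to the range 0-100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [50.0] * len(values)  # arbitrary choice for a flat parameter
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def combined_scores(params, favourable_high):
    """Mean of the oriented, normalized scores for each patch.

    params: parameter name -> list of raw values, one per patch.
    favourable_high: parameter name -> True if high values are favourable.
    Only the parameters named in favourable_high are used.
    """
    n_patches = len(next(iter(params.values())))
    totals = [0.0] * n_patches
    for name, high in favourable_high.items():
        for j, score in enumerate(normalise(params[name])):
            totals[j] += score if high else 100.0 - score
    return [t / len(favourable_high) for t in totals]


# Three toy patches scored with the settings mentioned in the text:
# high residue propensity, low accessibility, low protrusion index.
params = {
    "propensity": [-1.0, 1.0, 0.0],
    "asa": [30.0, 10.0, 20.0],
    "protrusion": [5.0, 1.0, 3.0],
}
settings = {"propensity": True, "asa": False, "protrusion": False}
scores = combined_scores(params, settings)
best_patch = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_patch)
```

Passing only `{"propensity": True}` reproduces the propensity-only variant of the prediction under the same assumptions.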
The shape and size of the carbohydrate binding sites varies considerably across the dataset. Consequently, no individual patch will correspond exactly to the real binding site. The percent overlap P1 (Jones and Thornton, 1997b
![]() |
![]() |
| Results |
|---|
Dataset
The dataset of 19 non-homologous carbohydrate binding proteins (Table II) comprises nine enzyme structures, seven lectin structures, one Fab fragment and two periplasmic carbohydrate binding proteins. Some of the complexes contain more than one carbohydrate binding site. All non-identical sites of binding on a single structure were considered separately. When more than one identical binding site was present, the site containing the ligand with the lowest average B factor was chosen. Cyclomaltodextrin glycosyltransferase (1cxg) and glycogen phosphorylase (6gpb) are two examples of enzymes containing carbohydrate ligands bound to sites other than the active site. In our classification, the catalytic sites of these two structures were included in the enzymes class, the other sugar binding sites in the lectin class. The 19 dataset structures present a total of 25 sites of binding.
Characterization of a sugar binding site
Dataset I was screened to determine which, if any, of the six parameters (solvation potential, propensity, hydrophobicity, planarity, protrusion and relative accessible surface area) best differentiate the carbohydrate binding site regions on a protein. The six parameters were computed for every calculated patch and for the observed sugar interface of each dataset protein. The size of the calculated patches was, in each case, equal to the size of the observed interface and is reported in Table II. Although a monosaccharide can be in contact with up to 19 protein residues, several oligosaccharides interact with only 8–9 amino acids. Carbohydrate interface sizes range from a minimum of seven residues in the case of maltose bound to cyclodextrin glycosyltransferase (1cxg), to a maximum of 35 for soybean β-amylase with bound maltotetraose (1byb). Enzymes tend to bury a large portion of the bound sugar while lectins leave the ligand more exposed to the solvent, usually interacting with the end monosaccharide units of an oligosaccharide.

When dealing with multimeric structures, the calculation of parameters and patches was performed on the whole protein, so that the interface regions between different subunits were excluded from the calculation. Only those patches containing residues from the subunits of interest were subsequently used. For example, the homopentamer cholera toxin (1chb) has one binding site per monomer. The program PATCH is run on the complete protein but only the patches containing residues from one chain (H) are kept.
The total number of calculated patches for each structure is given in Table II and the parameter distributions for each protein are summarized in histogram plots (one plot for each parameter). An example is presented in Figure 2 for soybean β-amylase with bound maltotetraose.
In this example, the real interface scores amongst the patches having high solvation potential, highest propensity value, high hydrophobicity, low protrusion index, low relative accessible surface area and average planarity. An overview of the binding site scoring is given in Figure 3.
The parameter that seems to discriminate best the sugar binding sites on a protein surface is the residue interface propensity. Forty percent of all sites are ranked in the highest propensity bin (Figure 3b
Prediction
Dataset I
From the above analysis it was concluded that a surface patch has a high probability of being a carbohydrate binding site on an enzyme or periplasmic sugar binding protein if it has high average residue propensity, low protrusion index and low relative ASA. In lectin structures the best patches will have a high residue propensity score and a high protrusion index. These parameters have also been used for the Fab fragment. The prediction was additionally run with the residue propensity scores alone, to determine the extent to which the amino acid composition of the binding site region distinguishes a binding site from the remaining protein surface. Different patch sizes were explored: 12, 15 and 20 residues for enzymes; 12 and 15 residues for lectins. The size of the patch chosen does not have a significant effect on the results, being in any case very small compared with the total size of the protein [Table II (c) and (d)]. The best choice for the two classes of proteins was 15 residues for enzymes and 12 residues for lectins.

The scoring criteria for the prediction algorithm are given in the Materials and methods section. At least 70% of the maximum overlap possible between a calculated patch and the observed binding site has to be achieved for a successful prediction (Rel ≥ 70). The results for the 19 dataset I complexes are reported in Table III. The results for the enzyme-type binding sites were very good. Ten of the 11 binding sites were correctly predicted. The only poor prediction is the active site of glycogen phosphorylase (6gpb) for which, although the top three patches do show some overlap with the observed site, this is below the threshold set for a correct prediction. It should be noted that the patch with the highest overlap ranks 12 out of 663 patches.

The analysis of the lectin binding sites is more complex since almost all structures are multimeric proteins or contain more than one carbohydrate binding site. When more than one binding site is present, care must be taken in interpreting the results. The top three patches can refer to any of the binding sites, particularly if they are all of the same type and have been predicted with the same parameters. An incorrect prediction can arise if the top three patches overlap with a binding site other than the one chosen. To overcome this problem, the prediction results of all distinct binding sites of a single protein were combined. The top three calculated patches that do not show an overlap with another site of binding are taken as the best scoring patches for a binding site. The ranking of the patch with the highest overlap is also shifted accordingly, by removing all patches that overlap with one of the other binding sites.

Only six of the 14 lectin binding sites were successfully predicted. In three examples (1slt, 2aai and 1cxg4) the top three patches showed an insufficient overlap with the observed interface. Rerunning the prediction with residue propensity only results in a correct prediction in all three cases. In the remaining five unsuccessful predictions (2cwgE, 6gpb2, 1hgh1, 1hgh2 and 1mfb), the observed binding site patch was completely missed. Wheat germ agglutinin (2cwg) has four binding sites (Wright, 1990), only two of which are occupied by a bound oligosaccharide in this entry. The low affinity binding site (containing ligand E) was completely missed. A closer analysis revealed that all calculated patches scoring better than those overlapping with binding site E are either overlapping with binding site D (containing ligand D) or located in the areas containing the two unoccupied binding sites.

The Fab fragment (1mfb) carbohydrate binding site was another incorrectly predicted structure. The lack of success in this case can be explained by looking at the results of the patch analysis for this complex in Table II. The ranking of the observed binding site patch protrusion index was 5, meaning that this parameter is not discriminating for this protein. A prediction was therefore run using the residue interface propensity score only. The results improved considerably. The relative overlap of the top three patches was now 44, 67 and 67% respectively and the patch with maximum P1 ranks 9 out of 297 possibilities.

The site of glycogen storage in glycogen phosphorylase (6gpb2) was also not located in the prediction. The highest scoring patch was in position 58 out of 664, and only 2% of the patches have a relative overlap greater than 70%. Both binding sites of influenza virus haemagglutinin (1hgh1 and 1hgh2), a membrane bound protein, were missed in the prediction. The third best patch shows an overlap with the site of low affinity binding. A visual inspection of the results showed that most of the top patches in this case are located in the area facing the viral membrane.

Of particular interest were the results obtained for cyclomaltodextrin glycosyltransferase (1cxg), which has four sugar binding sites, three of which have lectin characteristics, the other having enzyme characteristics. The program PATCH was run with two different sets of parameters on this protein, giving a successful prediction in both cases. Use of the enzyme parameters (high residue propensity, low protrusion index and low ASA) located the catalytic site, while when the lectin parameters (high residue propensity and high protrusion index) were used, the top patches all overlapped with the other three binding sites. Using only the residue propensity score in the prediction and a patch size of 15, the top four patches overlapped with the active site (patch 4 having 100% relative overlap), while the first patch overlapping with one of the maltose binding sites was in position 11.
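The success criterion used in this section (relative overlap of at least 70% among the top three patches) can be sketched as follows. The overlap definitions are assumptions, because the original equations are not reproduced in this text: percent overlap is taken as the shared residues divided by the size of the observed site, and relative overlap as the shared residues divided by the largest number that could possibly be shared, i.e. the smaller of the two sizes. A prediction is counted as successful here when at least one of the top-ranked patches reaches the threshold.

```python
def percent_overlap(patch, site):
    """Shared residues as a percentage of the observed binding site."""
    patch, site = set(patch), set(site)
    return 100.0 * len(patch & site) / len(site)


def relative_overlap(patch, site):
    """Shared residues as a percentage of the maximum possible overlap."""
    patch, site = set(patch), set(site)
    return 100.0 * len(patch & site) / min(len(patch), len(site))


def is_successful(ranked_patches, site, top=3, threshold=70.0):
    """True if any of the `top` best-scoring patches reaches the threshold."""
    return any(relative_overlap(p, site) >= threshold
               for p in ranked_patches[:top])


# Toy example: a 9-residue site and 12-residue patches (residue numbers invented).
site = set(range(1, 10))
patch_a = set(range(4, 16))  # shares 6 of 9 residues
patch_b = set(range(6, 18))  # shares 4 of 9 residues
patch_c = set(range(3, 15))  # shares 7 of 9 residues

print(round(relative_overlap(patch_a, site)), round(relative_overlap(patch_b, site)))
print(is_successful([patch_b, patch_a, patch_c], site))
```

Under these assumptions, sharing 4 or 6 residues of a 9-residue site gives relative overlaps of 44% and 67%, both below the 70% threshold.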
Test sets
Two additional sets of protein–carbohydrate complexes were used to test the performance of this prediction algorithm. The first, smaller set comprised all new sugar binding protein structures which are not homologous to any protein in the original dataset I used to refine the method and parameters. The results are summarized in Table IV.
Test set II comprises new structures which are homologous to proteins in either the original set or test set I. The prediction on this set of structures was therefore run with the jack-knife procedure, i.e. the protein homologous to the tested protein was excluded from the calculation of the residue propensity. The results for enzymes and periplasmic carbohydrate binding proteins of test set II were very good: 89% (17/19) of the binding sites were successfully predicted. The outcome of the prediction for lectins and immunoglobulins was less satisfactory: only 29% (4/14) of the binding sites were correctly predicted. It should be noted that this test was only performed to evaluate the algorithm, since if binding data are available for a homologue, these should be used to predict the binding site.
| Discussion |
|---|
The analysis of sugar binding sites in proteins revealed several characteristic features which allowed a remarkably successful prediction, given the simplicity of the method. As in other recognition studies [e.g. for adenylate (Moodie et al., 1996
| Acknowledgments |
|---|
The work of C.Taroni was funded by a BBSRC studentship and sponsored by Pfizer Inc. We would like to thank Dr J.Overington for useful discussions and Prof. G.E.Schulz for allowing the use of his laboratory facilities.
| Notes |
|---|
3 To whom correspondence should be addressed
| References |
|---|
Barondes,S.H., Cooper,D.N.W., Gitt,M.A. and Leffler,H. (1994) J. Biol. Chem., 269, 20807–20810.
Bundle,D.R. and Young,N.M. (1992) Curr. Opin. Struct. Biol., 2, 666.
Davies,G. and Henrissat,B. (1995) Structure, 3, 853–859.
Drickamer,K. (1995) Nature Struct. Biol., 2, 437–439.
Drickamer,K. (1997) Structure, 5, 465–468.
Fauchère,J. and Pliska,V. (1983) Eur. J. Med. Chem., 18, 369–375.
Henrissat,B. and Davies,G. (1997) Curr. Opin. Struct. Biol., 7, 637–644.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) Nature, 358, 86–89.
Jones,S. and Thornton,J.M. (1997a) J. Mol. Biol., 272, 121–132.
Jones,S. and Thornton,J.M. (1997b) J. Mol. Biol., 272, 133–143.
Laskowski,R.A., Luscombe,N.M., Swindells,M.B. and Thornton,J.M. (1996) Protein Sci., 5, 2438–2452.
Meyer,J.E.W. and Schulz,G.E. (1997) Protein Sci., 6, 1084–1091.
Moodie,S.L., Mitchell,J.O. and Thornton,J.M. (1996) J. Mol. Biol., 263, 486–500.
Quiocho,F.A. (1989) Pure Appl. Chem., 61, 1293–1306.
Sharon,N. and Lis,H. (1990) Chem. Brit., 26, 679–682.
Sharon,N. (1993) Trends Biochem. Sci., 18, 221–226.
Spurlino,J.C., Rodseth,L.E. and Quiocho,F.A. (1992) J. Mol. Biol., 226, 15–22.
Taroni,C. (1998) Computational Analysis of Protein–Carbohydrate Interactions. PhD thesis, University College London.
Taylor,G. (1996) Curr. Opin. Struct. Biol., 6, 830–837.
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208, 1–22.
Thornton,J.M., Edwards,M.S., Taylor,W.R. and Barlow,D.J. (1986) EMBO J., 5, 409–413.
Toone,E.J. (1994) Curr. Opin. Struct. Biol., 4, 719.
Vyas,N.K. (1991) Curr. Opin. Struct. Biol., 1, 732–740.
Vyas,N.K., Vyas,M.N. and Quiocho,F.A. (1991) J. Biol. Chem., 266, 5226–5237.
Weis,W.I. (1997) Curr. Opin. Struct. Biol., 7, 624–630.
Wright,C.S. (1990) J. Mol. Biol., 215, 635.
Received May 4, 1999; revised November 1, 1999; accepted November 16, 1999.